CSC/ECE 517 Spring 2024 - E2412. Testing for hamer.rb

This page describes the work done for the Spring 2024 Program 3 (First OSS project) E2412: Testing for hamer.rb.

== Project Overview ==

=== Problem Statement ===
The practice of using student feedback on assignments as a grading tool is gaining traction among university professors and courses. This approach not only saves instructors and teaching assistants considerable time but also fosters a deeper understanding of assignments among students as they evaluate their peers' work. However, there is a concern that some students may not take their reviewing responsibilities seriously, potentially skewing the grading process by assigning extreme scores such as 100 or 0 arbitrarily. To address this issue, the Hamer algorithm was developed to assess the credibility and accuracy of reviewers. It generates reputation weights for each reviewer, which instructors can use to gauge their reliability or incorporate into grading calculations. Our goal here is to test this Hamer algorithm.

=== Objectives ===
* Develop code testing scenarios to validate the Hamer algorithm and ensure the accuracy of its output values.
* Verify the correctness of the reputation web server's Hamer values by accessing the URL http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms (a request sketch is shown below).
* Reimplement the algorithm if discrepancies arise in the reputation web server's Hamer values.
* Validate the accuracy of the newly implemented Hamer algorithm.
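The sketch below shows one way this endpoint can be exercised by hand. It is only a sketch: it assumes the service accepts a JSON POST of per-submission reviewer scores (the same shape as the input object built in Objective 1 below), and the payload shown is a placeholder rather than real assignment data.

<pre>
require "net/http"
require "json"
require "uri"

# Placeholder payload: scores keyed by submission, then by reviewer
payload = {
  "submission1": { "reviewer1": 5, "reviewer2": 4 },
  "submission2": { "reviewer1": 3, "reviewer2": 4 }
}.to_json

uri = URI("http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms")
request = Net::HTTP::Post.new(uri)
request.content_type = "application/json"
request.body = payload

response = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(request) }
puts response.body   # the "Hamer" entry should hold one value per reviewer
</pre>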
=== Files Involved ===

* reimplemented algorithm: /app/controllers/reputation_web_service_controller.rb
* test file: /spec/controllers/reputation_mock_web_server_hamer.rb


=== Mentor ===

* Muhammet Mustafa Olmez (molmez@ncsu.edu)
=== Team Members ===

* Neha Vijay Patil (npatil2@ncsu.edu)
* Prachit Mhalgi (psmhalgi@ncsu.edu)
* Sahil Santosh Sawant (ssawant2@ncsu.edu)


== Hamer Algorithm ==


The grading algorithm described in the paper is designed to reward reviewers who participate effectively by allocating a portion of the assignment mark to the review, with the review mark reflecting the quality of the grading. Here is an explanation of the algorithm:

1. Review Allocation: Each reviewer is assigned a number of essays to grade. The paper suggests assigning at least five essays, with ten being ideal. Assuming each review takes 20 minutes, ten reviews can be completed in about three and a half hours.

2. Grading Process:
* Once the reviewing is complete, grades are generated for each essay and weights are assigned to each reviewer.
* The essay grades are computed by averaging the individual grades from all the reviewers assigned to that essay.
* Initially, all reviewers are given equal weight in the averaging process.
* The algorithm assumes that some reviewers will perform better than others. It measures this by comparing the grades assigned by each reviewer to the averaged grades. The larger the difference between the assigned and averaged grades, the more out of step the reviewer is considered with the consensus view of the class.
* The algorithm adjusts the weighting of the reviewers based on this difference. Reviewers who are closer to the consensus view are given higher weights, while those who deviate significantly are given lower weights.
 
3. Iterative Process:
* The calculation of grades and weights is an iterative process. Each time the grades are calculated, the weights need to be updated, and each change in the weights affects the grades.
* Convergence occurs quickly, typically requiring four to six iterations before a solution (a "fix-point") is reached.
 
4. Weight Adjustment:
* The weights assigned to reviewers are adjusted based on the difference between the assigned and averaged grades. Reviewers with larger discrepancies have their weights reduced in inverse proportion to this difference.
* To prevent excessively large weights, a logarithmic dampening function is applied: weights may rise to twice the class average, after which further increases are awarded only sparingly.
 
5. Properties:
* The algorithm aims to identify and diminish the impact of "rogue" reviewers who may inject random or arbitrary grades into the peer assessment process.
* By adjusting reviewer weights based on their grading accuracy, the algorithm aims to improve the reliability of the grading process in the presence of such rogue reviewers.
 
Overall, the algorithm seeks to balance the contributions of different reviewers based on the accuracy of their grading, ultimately aiming to produce reliable grades for each essay in a peer assessment scenario.
 
== Hamer value calculation ==
[[File:Step1.PNG|400px]]
<br>
[[File:Step4.PNG|400px]]
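To make these steps concrete, the following minimal Ruby sketch walks through the calculation for a small, made-up review matrix (the scores and the three-reviewer setup are purely illustrative, not Expertiza data). It follows the single-pass form of the steps used elsewhere on this page: per-submission averages, each reviewer's squared deviation (delta R), the relative weight, and logarithmic dampening.

<pre>
# Illustrative only: reviews[i][j] is the grade reviewer i gave to submission j
reviews = [
  [10, 3, 7, 6],     # a credible reviewer
  [10, 10, 10, 10],  # gives the maximum score to every submission
  [5, 5, 5, 5]       # gives the median score to every submission
]
num_reviewers   = reviews.length
num_submissions = reviews.first.length

# Step 1: average grade for each submission (all reviewers weighted equally on the first pass)
averages = (0...num_submissions).map do |j|
  reviews.sum { |marks| marks[j] }.to_f / num_reviewers
end

# Step 2: delta R, each reviewer's mean squared deviation from the submission averages
delta_r = reviews.map do |marks|
  marks.each_with_index.sum { |grade, j| (grade - averages[j])**2 } / num_submissions.to_f
end

# Step 3: weight prime, the class-average deviation divided by the reviewer's own deviation
average_delta_r = delta_r.sum / num_reviewers
weight_prime = delta_r.map { |d| average_delta_r / d }

# Step 4: logarithmic dampening so that weights above 2 grow only slowly
weights = weight_prime.map { |w| w <= 2 ? w : 2 + Math.log(w - 1) }

puts weights.map { |w| w.round(2) }.inspect
</pre>

In a full implementation, the grades and weights would then be recomputed iteratively (using the new weights in the averaging step) until they converge, as described in point 3 above.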
== Objective 1: Develop code testing scenarios ==
We set up 10 reviewers scoring 4 submissions (with some reviews intentionally left incomplete) to cover the following test scenarios:
* 3 cases where reviewers give credible scores (passing1, passing2, passing3)
* a case where the reviewer gives the maximum score (10) to every submission (should be flagged)
* a case where the reviewer gives the minimum score to every submission (should be flagged)
* a case where the reviewer gives the median score (5) to every submission (should be flagged)
* a case where the reviewer gives the same score to every submission (should be flagged)
* cases where the reviewer leaves some reviews incomplete (incomplete_review, max_incomplete, min_incomplete)

=== Object Creation ===
Below is the input object for the tests that cover all of the above scenarios:

<pre>
INPUTS_new = {
    "submission1": {
    "maxtoall": 10,
    "mintoall": 1,
    "mediantoall": 5,
    "incomplete_review": 4,
    "max_incomplete": 10,
    "sametoall":3,
    "passing1": 10,
    "passing2": 10,
    "passing3": 9
    },
      "submission2": {
    "maxtoall": 10,
    "mintoall": 1,
    "mediantoall": 5,
    "incomplete_review": 2,
    "max_incomplete": 10,
    "min_incomplete": 1,
    "sametoall":3,
    "passing1": 3,
    "passing2": 2,
    "passing3": 4
    },
      "submission3": {
    "maxtoall": 10,
    "mintoall": 1,
    "mediantoall": 5,
    "sametoall":3,
    "passing1": 7,
    "passing2": 4,
    "passing3": 5
    },
      "submission4": {
    "maxtoall": 10,
    "mintoall": 1,
    "mediantoall": 5,
    "max_incomplete": 10,
    "min_incomplete": 1,
    "sametoall":3,
    "passing1": 6,
    "passing2": 4,
    "passing3": 5
    }
}.to_json
</pre>


=== Expected Hamer Values ===

<pre>
EXPECTED = {
    "Hamer": {
        "maxtoall": 2.65,
        "mintoall": 2.41,
        "mediantoall": 1.03,
        "incomplete_review": 2.31,
        "max_incomplete": 2.57,
        "min_incomplete": 2.48,
        "sametoall": 1.58,
        "passing1": 2.17,
        "passing2": 1.73,
        "passing3": 1.23
    }
}.to_json
</pre>
== Objective 2: Reimplement the algorithm if discrepancies arise in the reputation web server's Hamer values ==

As established above, the values returned by the reputation web server do not match the expected values, so we concluded that the PeerLogic web service is implemented incorrectly. In this phase, we implemented the algorithm in Ruby as a function in the controller file /app/controllers/reputation_web_service_controller.rb.

=== Changes made in implementation ===
* Coded the algorithm in Ruby in a controller.
* Included a way for the algorithm to handle nil values.

=== Code Snippet ===
<pre>
# Method: calculate_reputation_score
# This method calculates the reputation scores for each reviewer based on the provided review data.
# It first calculates the average weighted grades and the delta R values,
# then the weight prime values based on the delta R values,
# and finally the reputation weights for each reviewer using the weight prime values.
#
# Params
#   - reviews: a 2D array of review scores, one row per reviewer (nil marks a missing review)
#
# Returns
#   An array of reputation scores, one score per reviewer, indicating their reputation in the system.
def calculate_reputation_score(reviews)
  # If the scores arrive as a JSON string, parse them first:
  # reviews = JSON.parse(input_json)

  # Initialize arrays to store intermediate values
  grades = []
  delta_r = []
  weight_prime = []
  weight = []

  # Calculate Average Weighted Grades per Reviewer
  reviews.each do |reviewer_marks|
    # Skip nil values when calculating the sum
    reviewer_marks_without_nil = reviewer_marks.compact
    assignment_grade_average = reviewer_marks_without_nil.sum.to_f / reviewer_marks_without_nil.length
    grades << assignment_grade_average
  end

  # Calculate delta R
  reviews.each do |reviewer_marks|
    reviewer_delta_r = 0
    # Skip nil values when calculating the sum
    reviewer_marks_without_nil = reviewer_marks.compact
    reviewer_marks_without_nil.each_with_index do |grade, student_index|
      reviewer_delta_r += (grade - grades[student_index]) ** 2
    end
    delta_r << reviewer_delta_r / reviewer_marks_without_nil.length
  end

  # Calculate weight prime
  average_delta_r = delta_r.sum / delta_r.length.to_f

  delta_r.each do |reviewer_delta_r|
    weight_prime << average_delta_r / reviewer_delta_r
  end

  # Calculate reputation weight
  weight_prime.each do |reviewer_weight_prime|
    if reviewer_weight_prime <= 2
      weight << reviewer_weight_prime.round(2)
    else
      weight << (2 + Math.log(reviewer_weight_prime - 1)).round(2)
    end
  end

  # Return the reputation weights
  weight
end
</pre>
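A hypothetical usage sketch is shown below (it is not part of the controller code): the method can be called, for example from the Rails console, with a small made-up matrix in which each row holds one reviewer's scores and nil marks a review that was never submitted.

<pre>
# Hypothetical usage only; the scores below are made up.
reviews = [
  [10, 10, 10],  # gives the maximum score to everything
  [5, 5, 5],     # gives the median score to everything
  [9, 3, nil]    # a credible reviewer with one missing review
]

weights = ReputationWebServiceController.new.calculate_reputation_score(reviews)
puts weights.inspect   # one reputation weight per row of reviews
</pre>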
== Objective 3: Validate the accuracy of the newly implemented Hamer algorithm ==

We test the newly implemented Hamer algorithm function with our scenarios and verify whether the results match the expected values.

=== Test Code Snippet ===

<pre>
describe ReputationWebServiceController do
  it "should calculate correct Hamer calculation" do
    weights = ReputationWebServiceController.new.calculate_reputation_score(reviews)
    keys = ["maxtoall", "mintoall", "mediantoall", "incomplete_review", "sametoall", "passing1", "passing2", "passing3"]
    rounded_weights = weights.map { |w| w.round(1) }
    result_hash = keys.zip(rounded_weights).to_h
    expect(result_hash).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end
</pre>
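The spec above assumes that reviews and EXPECTED are already defined. One possible definition of reviews, derived from the INPUTS_new object in Objective 1, is sketched below (this is an assumption, not code from the repository); each row follows the order of the keys array, and nil marks a submission the reviewer skipped.

<pre>
# Assumed setup, not taken from the repository.
# One row per reviewer, in the same order as the keys array in the spec above;
# the values come from INPUTS_new, with nil where a reviewer skipped a submission.
reviews = [
  [10, 10, 10, 10],  # maxtoall
  [1, 1, 1, 1],      # mintoall
  [5, 5, 5, 5],      # mediantoall
  [4, 2, nil, nil],  # incomplete_review
  [3, 3, 3, 3],      # sametoall
  [10, 3, 7, 6],     # passing1
  [10, 2, 4, 4],     # passing2
  [9, 4, 5, 5]       # passing3
]
</pre>

Note that the values in EXPECTED are given to two decimal places while the spec rounds the computed weights to one, so one of the two will need to be adjusted before the comparison can pass.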


=== Results ===
[[File:Results hamer new.jpeg]]

=== Conclusion ===
The observed results indicate a tendency towards lower values, primarily because of our decision to include nil values and treat them as zeros in the analysis. This treatment skews the scores towards lower values and can affect the accuracy of the findings. To address this issue and improve the robustness of the analysis, it is advisable to explore alternatives such as substituting median or random values instead of treating nil values as zeros. At the same time, incomplete reviews that contain nil values in the input dataset must be handled carefully, as they can significantly influence the overall integrity and reliability of the results and conclusions.
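One way to experiment with the median-based alternative mentioned above is to impute each reviewer's missing (nil) scores with the median of the scores they did submit before the matrix reaches the algorithm. The sketch below only illustrates that idea; it is not implemented in Expertiza.

<pre>
# Sketch: replace each reviewer's nil scores with the median of the scores
# they actually submitted, before passing the matrix to the algorithm.
def impute_missing_with_median(reviews)
  reviews.map do |marks|
    submitted = marks.compact.sort
    median = if submitted.empty?
               nil
             elsif submitted.length.odd?
               submitted[submitted.length / 2]
             else
               (submitted[submitted.length / 2 - 1] + submitted[submitted.length / 2]) / 2.0
             end
    marks.map { |score| score.nil? ? median : score }
  end
end

imputed = impute_missing_with_median([[4, 2, nil, nil], [10, 3, 7, 6]])
# => [[4, 2, 3.0, 3.0], [10, 3, 7, 6]]
</pre>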


== Conclusion ==
In this project, we aimed to test the accuracy of the Hamer algorithm used for assessing the credibility of reviewers in a peer assessment system. We began by developing testing scenarios to validate the Hamer algorithm and ensure the accuracy of its output values. These scenarios covered a variety of review situations, including cases where reviewers provided extreme scores.

It was established that the original reputation web server was implemented incorrectly.

As a result, we proceeded to reimplement the Hamer algorithm in Ruby, incorporating adjustments to handle nil values appropriately. Subsequently, we validated the accuracy of the newly implemented algorithm using the same testing scenarios. While the results initially showed a skew towards lower values due to our treatment of nil values, we acknowledge the need for further refinement to handle these cases more effectively.

In conclusion, this project highlights the importance of rigorous testing and implementation adjustments in ensuring the reliability of algorithms used in peer assessment systems. Moving forward, we recommend further refinements and validations to enhance the accuracy and robustness of the Hamer algorithm.


== Links ==
Link to Expertiza repository: [https://github.com/expertiza/expertiza here]

Link to the forked repository: [https://github.com/Prachit99/expertiza/tree/main here]

Link to pull request: [https://github.com/expertiza/expertiza/pull/2778 here]

Link to GitHub project page: [https://github.com/users/Prachit99/projects/1 here]

Link to testing video: [https://drive.google.com/file/d/1gZ5iDgqMW3COOT9Uw-_yhJlwJLZDxz-s/view?usp=sharing here]

== References ==

# Expertiza on GitHub: https://github.com/expertiza/expertiza
# The live Expertiza website: http://expertiza.ncsu.edu/
# Pluggable reputation systems for peer review: A web-service approach. https://doi.org/10.1109/FIE.2015.7344292