Expertiza_Wiki - User contributions [en]

CSC/ECE 517 Spring 2022 - E2232: Revision planning tool

2022-04-26T03:41:06Z

Jlin36:

==Project Goal==
The primary objective of this project is to create a tool that lets students plan revisions to their projects after the original submission, once they have received constructive feedback from their peers or instructors. The revision planning tool gives students a way to learn from the mistakes in their submissions and to improve the quality of their work before the due date. We will achieve this by completing the existing implementation of revision planning, following the project plan below.

==Project Plan==
The plan is to merge the revision planning code into the current beta branch.

E2152's task was to develop a revision planning tool and merge it into the beta branch of Expertiza. The team completed the tool but built it on an older version of the beta branch, so it could not be merged into the current beta. Our goal is to reimplement their changes so that they merge into the current beta branch. We will first merge the modifications to the following files into the current beta and resolve the conflicts.

===Files to be merged===
* app/controllers/revision_plan_questionnaires_controller.rb
* app/models/team.rb
* app/controllers/grades_controller.rb
* app/helpers/grades_helper.rb
* app/models/assignment_participant.rb
* app/models/response_map.rb
* app/views/grades/_participant_charts.html.erb
* app/views/grades/view_team.html.erb
* app/views/student_task/view.html.erb
* app/controllers/response_controller.rb
* app/views/response/response.html.erb
* config/routes.rb
* db/schema.rb
* spec/models/response_spec.rb
* spec/models/review_response_map_spec.rb
* spec/features/assignment_creation_general_tab_spec.rb
* app/models/revision_plan_team_map.rb
* spec/controllers/advice_controller_spec.rb
* app/models/response.rb

The following diagram shows how the revision planning tool is used:

[[File:E2232_diagram.png]]

==Current Project Implementation==

The implementation of this now fits within the framework created by E2161 (Fall 2021).

What it does: In the first round of Expertiza reviews, reviewers give authors guidance on how to improve their work. In the second round, reviewers rate how well the authors have followed those suggestions. Authors can now attach a plan of work to their reviews to indicate how they intend to address the criticism given in the peer reviews. This is handled by the advice and response controller classes, which create the objects defined in the advice and response model classes, save them to the database, and display them in the HTML views. We also ensure that only the appropriate users can act on a response: actions are permitted only for users with instructor-level privileges (teaching assistants, admins, and instructors) and for the student who has been assigned the review (seen here).

<pre>
def action_allowed?
  questionnaire = Questionnaire.find(params[:id])
  # The owner of the questionnaire may always act on it
  return true if user_logged_in? && questionnaire.owner?(session[:user].id)

  # Everyone else needs TA privileges or higher
  current_user_has_ta_privileges?
end
</pre>
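To make the permission rule concrete, here is a minimal standalone sketch of the same check outside Rails. `SketchUser`, `SketchQuestionnaire`, `sketch_action_allowed?`, and the `ta_privileges` flag are hypothetical stand-ins invented for illustration, not Expertiza's actual classes; only the order of the checks follows the method above.

```ruby
# Hypothetical stand-ins for illustration only; Expertiza's real models differ.
SketchUser = Struct.new(:id, :ta_privileges)

SketchQuestionnaire = Struct.new(:owner_id) do
  # True when the given user id owns this questionnaire.
  def owner?(user_id)
    owner_id == user_id
  end
end

# Same order of checks as action_allowed?: ownership first, then privileges.
def sketch_action_allowed?(user, questionnaire)
  return false if user.nil?                     # nobody is logged in
  return true if questionnaire.owner?(user.id)  # the owner may always act

  user.ta_privileges == true                    # TA, instructor, or admin
end

questionnaire = SketchQuestionnaire.new(7)
owner   = SketchUser.new(7, false)
ta      = SketchUser.new(8, true)
student = SketchUser.new(9, false)

puts sketch_action_allowed?(owner, questionnaire)
puts sketch_action_allowed?(ta, questionnaire)
puts sketch_action_allowed?(student, questionnaire)
```

In this sketch the owner and the TA are allowed, while a student who does not own the questionnaire is refused.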

Helper modules such as summary_helper.rb are used to extract values from existing objects. For example, get_sentences splits the comments of a review answer into separate array entries, one per sentence.

<pre>
def get_sentences(answer)
  return nil if answer.nil?

  # Split the comment text at sentence-ending punctuation
  sentences = answer.comments.split(/[.?!]/)
  # Trim surrounding whitespace from each sentence in place
  sentences.each(&:strip!)

  sentences
end
</pre>
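As a usage illustration, the sketch below applies the same splitting idea to a plain Ruby object. `SketchAnswer` and `split_into_sentences` are hypothetical names, the sketch assumes that sentences end only at `.`, `?`, or `!`, and dropping empty fragments is an addition that the helper above does not make.

```ruby
# Hypothetical stand-in for an answer record; only the comments field is modelled.
SketchAnswer = Struct.new(:comments)

# Splits an answer's comments into trimmed, non-empty sentences.
# Assumption: sentences end only at '.', '?' or '!'.
def split_into_sentences(answer)
  return nil if answer.nil?

  answer.comments.split(/[.?!]/).map(&:strip).reject(&:empty?)
end

answer = SketchAnswer.new('Good structure. Could you add tests? Nice work!')
p split_into_sentences(answer)
# => ["Good structure", "Could you add tests", "Nice work"]
```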

===Rationale===
The general workflow will be maintained from the previous iterations working on this project. The workflow used by past semesters is as follows.

[[File:E2152_Rationale.png|410px|center|Image from previous write up]]

===Previous implementation===
This project was last attempted in Fall 2021 (E2152). However, related code from E2161 has since been merged, so this semester's implementation may need to differ from E2152's approach.
* [https://github.com/expertiza/expertiza/pull/2131 Primary Pull Request for E2152]
* [https://github.com/expertiza/expertiza/pull/2152 Second Pull Request for Revision Planning]
* [https://expertiza.csc.ncsu.edu/index.php/CSC/ECE_517_Fall_2021_-_E2152._Revision_planning_tool Previous write up for E2152]

====Current Flow====
The current flow is inherited from previous iterations. The following content and images are taken from those earlier write-ups.

Before the round 2 submission, you can view your work but not the revision plan.
[[File:211129-2.png|700px|thumb|center]]
If there is a round 2 submission and the "Revision Planning" has not been completed, "Your work" is grayed out.
[[File:211129-5.png|700px|thumb|center]]
After editing the "Revision Planning", we can submit our work.
[[File:211129-6.png|700px|thumb|center]]
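The gating rule described in this flow can be summarised as a small predicate. This is only a sketch under stated assumptions: `your_work_link_enabled?` and its keyword arguments are hypothetical and do not correspond to an actual Expertiza method.

```ruby
# Hypothetical helper; the name and parameters are assumptions for illustration.
# Returns true when the "Your work" link should be clickable.
def your_work_link_enabled?(round:, revision_plan_submitted:)
  # Before round 2, the work is always accessible.
  return true if round < 2

  # From round 2 on, the revision plan must be filled in first.
  revision_plan_submitted
end

puts your_work_link_enabled?(round: 1, revision_plan_submitted: false)
puts your_work_link_enabled?(round: 2, revision_plan_submitted: false)
puts your_work_link_enabled?(round: 2, revision_plan_submitted: true)
```

Under these assumptions, only the round 2 case without a revision plan leaves the link disabled.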

=====Current User Interface=====
The current user interface was put in place by the previous iterations. The following interface image is from those iterations.

[[File:after1.png|700px|thumb|center|Reviews cannot be done during the submission phase]]

====Design Changes====
Because our changes are limited to implementation details, the UML design of the project remains the same as in the previous implementation.

[[File:E2152_Design.png|1000px|center]]

==Test Plan==
===Merge existing RSpec tests for revision planning into current beta===
We will first merge the existing RSpec tests from E2152 into the current beta, then run them and make sure they pass. We may also add more comments to the RSpec tests and will observe the coverage of the individual tests.
Existing RSpec tests to be merged:
* Controllers
** rspec spec/controllers/advice_controller_spec.rb
** rspec spec/controllers/grades_controller_spec.rb
** rspec spec/controllers/questionnaires_controller_spec.rb
** rspec spec/controllers/questions_controller_spec.rb
** rspec spec/controllers/student_teams_controller_spec.rb
** rspec spec/controllers/response_controller_spec.rb
** rspec spec/controllers/revision_plan_questionnaires_controller_spec.rb
** rspec spec/factories/revision_plan_factory.rb
* Models
** spec/models/response_spec.rb
** spec/models/review_response_map_spec.rb
* Helpers
** rspec spec/helpers/grades_helper.rb

===Develop New RSpec Tests===
The RSpec tests cover both controllers and models, and further tests will be added to increase coverage. To do this, we will test the flows associated with different user types. Currently the only passing tests relate to student flows, so tests that exercise instructor flows may be added. These may include editing reviews once they are created and ensuring that an instructor is able to make edits.

The following tests check that action_allowed? permits privileged roles (super-admin and instructor) and refuses a student who is not entitled to the action:
<pre>
describe '#action_allowed?' do
  let(:questionnaire) { build(:questionnaire, id: 1) }
  context 'when the role of current user is Super-Admin' do
    # Checking for Super-Admin
    it 'allows certain action' do
      controller.params = { id: '1' }
      stub_current_user(super_admin, super_admin.role.name, super_admin.role)
      expect(controller.send(:action_allowed?)).to be_truthy
    end
  end
  context 'when the role of current user is Instructor' do
    # Checking for Instructor
    it 'allows certain action' do
      controller.params = { id: '1' }
      stub_current_user(instructor1, instructor1.role.name, instructor1.role)
      expect(controller.send(:action_allowed?)).to be_truthy
    end
  end
  context 'when the role of current user is Student' do
    # Checking for Student
    it 'refuses certain action' do
      controller.params = { id: '1' }
      stub_current_user(student1, student1.role.name, student1.role)
      expect(controller.send(:action_allowed?)).to be_falsey
    end
  end
end
</pre>

The same permission checks are tested in the response controllers and the questionnaire controllers. Model objects are tested as well, for example by checking that a response is displayed correctly as HTML (below) and that fields are returned correctly.
<pre>
context 'when prefix is not nil, which means view_score page in instructor end' do
  it 'returns corresponding html code' do
    allow(response).to receive(:questionnaire_by_answer).with(answer).and_return(questionnaire)
    allow(questionnaire).to receive(:max_question_score).and_return(5)
    allow(questionnaire).to receive(:id).and_return(1)
    allow(assignment).to receive(:id).and_return(1)
    allow(question).to receive(:view_completed_question).with(1, answer, 5, nil, nil).and_return('Question HTML code')
    expect(response.display_as_html('Instructor end', 0)).to eq('<h4>Review 0</h4>Reviewer: no one (no name)   '\
      "<a href=\"#\" name= \"review_Instructor end_1Link\" onClick=\"toggleElement('review_Instructor end_1','review');return false;\">"\
      "hide review</a> <h5>Review Responses</h5><table id=\"review_Instructor end_1\" class=\"table table-bordered\">"\
      "<tr class=\"warning\"><td>Question HTML code</td></tr></table><h5>Additional Comment</h5>"\
      "<table id=\"review_Instructor end_1\" class=\"table table-bordered\"><tr><td></td></tr></table>")
  end
end
</pre>

RSpec tests are also needed to validate saves in the controller classes and to verify model functionality in response objects.

===Manual Testing===

* Instructor
** Can the "Review rubric varied by topic" option be enabled?
** Can different roles be chosen for each questionnaire?
** Can an assignment with revision planning enabled be created?
** Can an assignment with 2 rounds of review be set up?

* Assignment participant
** Can the revision-planning rubric be edited?
** Are participants allowed to create or edit a revision plan once the reviews for round 1 (or any later round) have finished?
** Is revision plan editing disabled when the assignment is in the review stage?
** Do participants see a summary of scores for the revision plan after the review deadline has passed?

* Assignment reviewer
** Does the rubric page show the topic-specific rubric?
** Does the rubric page show the revision plan rubric?

==Team Information==
* Lawrence O'Brien (lpobrien)
* Joshua Lin (jlin36)
* Weiqi Sun (wsun23)
* Wyatt Plaga (wgplaga)
* '''Mentor:''' Nicholas Himes (nnhimes)

==Links==
* Pull request: https://github.com/expertiza/expertiza/pull/2395
* Github repo: https://github.com/wsun23/expertiza/tree/E2232
* VCL: http://152.7.99.215:8080/
* Screencast:

CSC/ECE 517 Spring 2022 - E2232: Revision planning tool

2022-04-26T03:27:37Z

Jlin36: /* Current Flow */

==Project Goal==
The primary objective for this project is to create a tool that can be used for the revision of projects at a time after their original submission upon the delivery of constructive feedback from their peers or instructors. The revision planning tool is an important device that will be used to give students the ability to learn from the mistakes of their submissions, and improve the quality of their work prior to the due date. This will be done by completing the existing implementation for revision planning using the following project plan.

==Project Plan==
To merge code for revision planning into current beta.

E2152's work was to develop a revision planning tool and merge it into the beta branch of Expertiza. They completed their task but developed based on a previous beta branch version and thus could not be merged into the current beta. Our goal is to reimplement their change so that it merges into the current beta branch. We will first merge the modification in the following files to the current beta and solve the conflicts.

===Files to be merged===
* app/controllers/revision_plan_questionnaires_controller.rb
* app/models/team.rb
* app/controllers/grades_controller.rb
* app/helpers/grades_helper.rb
* app/models/assignment_participant.rb
* app/models/response_map.rb
* app/views/grades/_participant_charts.html.erb
* app/views/grades/view_team.html.erb
* app/views/student_task/view.html.erb
* app/controllers/response_controller.rb
* app/views/response/response.html.erb
* config/routes.rb
* db/schema.rb
* spec/models/response_spec.rb
* spec/models/review_response_map_spec.rb
* spec/features/assignment_creation_general_tab_spec.rb
* app/models/revision_plan_team_map.rb
* spec/controllers/advice_controller_spec.rb
* app/models/response.rb

The following diagram shows how the revision planning tool is used:

[[File:E2232_diagram.png]]

==Current Project Implementation==

The implementation of this now fits within the framework created by E2161 (Fall 2021).

What it does: In the first round of Expertiza reviews, we ask reviewers to give authors some guidance on how to improve their work. Then in the second round, reviewers rate how well authors have followed their suggestions. Authors can now attach a plan of work to their reviews to indicate how they intend to address the criticism given in the peer reviews. This is done through the advice and response controller classes, which create the objects defined in the advice and response model classes, save them to the database, and display them in the HTML views. We also ensure that these actions can only be performed by the appropriate users: those with instructor-level privileges (teaching assistants, admins, and instructors) and the student who has been assigned the review, as seen here.

<pre>
def action_allowed?
  questionnaire = Questionnaire.find(params[:id])
  return true if user_logged_in? && questionnaire.owner?(session[:user].id)

  current_user_has_ta_privileges?
end
</pre>
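The check above can be illustrated outside Rails with a minimal, self-contained sketch. Everything in it is a hypothetical stand-in rather than Expertiza code: <code>SketchUser</code>, <code>SketchQuestionnaire</code>, <code>sketch_action_allowed?</code>, and the <code>TA_OR_ABOVE</code> role list are assumptions made for illustration only.

```ruby
# Hypothetical stand-ins for the Expertiza user and questionnaire objects.
SketchUser = Struct.new(:id, :role)
SketchQuestionnaire = Struct.new(:owner_id) do
  # True when the given user id owns this questionnaire.
  def owner?(user_id)
    owner_id == user_id
  end
end

# Roles assumed to carry at least TA privileges (illustrative list only).
TA_OR_ABOVE = ['Teaching Assistant', 'Instructor', 'Administrator', 'Super-Administrator'].freeze

# Same shape as action_allowed?: the owner passes first, then privileged roles.
def sketch_action_allowed?(user, questionnaire)
  return false if user.nil? # nobody is logged in
  return true if questionnaire.owner?(user.id)

  TA_OR_ABOVE.include?(user.role)
end
```

With a questionnaire owned by user 7, a student with id 7 is allowed, a student with id 8 is refused, and an instructor is allowed regardless of ownership.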

Helper modules such as summary_helper.rb are used to extract values from existing objects. For example, get_sentences splits the comments of a review answer into an array of separate sentences.

<pre>
def get_sentences(answer)
  return nil if answer.nil?

  # Split the comments on '.', ',', '?' and '!', then trim each piece in place.
  sentences = answer.comments.split(/[.,?!]/)
  sentences.each(&:strip!)

  sentences
end
</pre>
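To make the splitting behavior concrete, here is a small standalone sketch. <code>SketchAnswer</code> and <code>split_sentences</code> are hypothetical names used only for this illustration; the sketch assumes the answer object exposes a <code>comments</code> string, as in the helper above.

```ruby
# Stand-in for an Answer record; only the comments field matters here.
SketchAnswer = Struct.new(:comments)

# Splits comments on '.', ',', '?' and '!' and trims surrounding whitespace.
def split_sentences(answer)
  return nil if answer.nil?

  answer.comments.split(/[.,?!]/).map(&:strip)
end

answer = SketchAnswer.new('Good structure. Add tests, please! Why no docs?')
p split_sentences(answer)
# => ["Good structure", "Add tests", "please", "Why no docs"]
```

Because the comma is part of the character class, a clause such as "Add tests, please" is split into two entries. The sketch uses <code>map(&:strip)</code> to build a new array instead of trimming in place, which is a stylistic choice and not necessarily what the helper does.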

===Rationale===
The general workflow will be maintained from the previous iterations working on this project. The workflow used by past semesters is as follows.

[[File:E2152_Rationale.png|410px|center|Image from previous write up]]

===Previous implementation===
This project was last done in Fall 2021 (E2152). However, related merged code from E2161 (link above) means the implementation this semester may need to be changed from how E2152 did it.
* [https://github.com/expertiza/expertiza/pull/2131 Primary Pull Request for E2152]
* [https://github.com/expertiza/expertiza/pull/2152 Second Pull Request for Revision Planning]
* [https://expertiza.csc.ncsu.edu/index.php/CSC/ECE_517_Fall_2021_-_E2152._Revision_planning_tool Previous write up for E2152]

====Current Flow====
Current flow is dictated by previous iterations. The following content and images are created using those previous write ups.

Prior to the round 2 submission, you could view your work, but not the revision plan.
[[File:211129-2.png|700px|thumb|center]]
If there is a round 2 submission, and we did not do the "Revision Planning", then "Your work" becomes gray.
[[File:211129-5.png|700px|thumb|center]]
After editing the "Revision Planning", we can submit our work.
[[File:211129-6.png|700px|thumb|center]]
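The flow above amounts to a simple gate on the "Your work" link. The following is only a sketch of that rule under stated assumptions: <code>your_work_enabled?</code> and its keyword arguments are hypothetical names, and the real views may decide this differently.

```ruby
# Hypothetical helper: should the "Your work" link be enabled?
# Before round 2 the link is always available; from round 2 onward it is
# grayed out until the team has filled in its revision plan.
def your_work_enabled?(current_round:, revision_plan_submitted:)
  return true if current_round < 2

  revision_plan_submitted
end
```

For example, <code>your_work_enabled?(current_round: 2, revision_plan_submitted: false)</code> is false, which corresponds to the grayed-out state in the second screenshot.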

=====Current User Interface=====
The current user interface was put in place by the previous iterations; the following interface image is from those iterations.

[[File:after1.png|700px|thumb|center|Reviews cannot be done during the submission phase]]

====Design Changes====
Because the changes to the current implementation are limited to implementation details, the UML design of the project remains the same as in the previous implementation.

[[File:E2152_Design.png|1000px|center]]

==Test Plan==
===Merge existing RSpec tests for revision planning into current beta===
We will first merge the existing RSpec tests of E2152 into the current beta, then run these tests and make sure they pass. More comments can be added to the RSpec tests as well, and the coverage of the individual tests should be observed.

Existing RSpec tests to be merged:
* Controllers
** spec/controllers/grades_controller_spec.rb
** spec/controllers/questionnaires_controller_spec.rb
** spec/controllers/questions_controller_spec.rb
** spec/controllers/student_teams_controller_spec.rb
** spec/controllers/response_controller_spec.rb
** spec/controllers/revision_plan_questionnaires_controller_spec.rb
** spec/factories/revision_plan_factory.rb
* Models
** spec/models/response_spec.rb
** spec/models/review_response_map_spec.rb
* Helpers
** spec/helpers/grades_helper.rb

===Develop New RSpec Tests===
The RSpec tests are written to test both controllers and models. RSpec tests will be added to increase coverage by exercising the flows associated with different user types. Currently the only passing tests are related to student flows, so tests may be added that cover instructor flows. These may include editing the reviews once they are created and ensuring that an instructor has the ability to make edits.

The following tests check that action_allowed? permits users with Super-Admin and Instructor roles and refuses a student:
<pre>
describe '#action_allowed?' do
  let(:questionnaire) { build(:questionnaire, id: 1) }
  context 'when the role of current user is Super-Admin' do
    # Checking for Super-Admin
    it 'allows certain action' do
      controller.params = { id: '1' }
      stub_current_user(super_admin, super_admin.role.name, super_admin.role)
      expect(controller.send(:action_allowed?)).to be_truthy
    end
  end
  context 'when the role of current user is Instructor' do
    # Checking for Instructor
    it 'allows certain action' do
      controller.params = { id: '1' }
      stub_current_user(instructor1, instructor1.role.name, instructor1.role)
      expect(controller.send(:action_allowed?)).to be_truthy
    end
  end
  context 'when the role of current user is Student' do
    # Checking for Student
    it 'refuses certain action' do
      controller.params = { id: '1' }
      stub_current_user(student1, student1.role.name, student1.role)
      expect(controller.send(:action_allowed?)).to be_falsey
    end
  end
end
</pre>

The same permission checks are also tested in the response and questionnaire controllers. Model objects are tested as well, for example by checking that a response is displayed as the expected HTML (below) and that fields are returned correctly.
<pre>
context 'when prefix is not nil, which means view_score page in instructor end' do
it 'returns corresponding html code' do
allow(response).to receive(:questionnaire_by_answer).with(answer).and_return(questionnaire)
allow(questionnaire).to receive(:max_question_score).and_return(5)
allow(questionnaire).to receive(:id).and_return(1)
allow(assignment).to receive(:id).and_return(1)
allow(question).to receive(:view_completed_question).with(1, answer, 5, nil, nil).and_return('Question HTML code')
expect(response.display_as_html('Instructor end', 0)).to eq('<h4>Review 0</h4>Reviewer: no one (no name)   '\
"<a href=\"#\" name= \"review_Instructor end_1Link\" onClick=\"toggleElement('review_Instructor end_1','review');return false;\">"\
"hide review</a> <h5>Review Responses</h5><table id=\"review_Instructor end_1\" class=\"table table-bordered\">"\
"<tr class=\"warning\"><td>Question HTML code</td></tr></table><h5>Additional Comment</h5>"\
"<table id=\"review_Instructor end_1\" class=\"table table-bordered\"><tr><td></td></tr></table>")
end
end
</pre>

RSpec tests are also needed to validate saves in the controller classes and to verify model functionality in response objects.

===Manual Testing===

* Instructor
** Can Review rubric varied by topic be enabled?
** Can different roles be chosen for each questionnaire?
** Can an assignment with revision planning enabled be created?
** Can an assignment with 2 rounds of review be set up?

* Assignment participant
** Can the revision-planning rubric be edited?
** Are participants allowed to create or edit a revision plan once the round 1 (or any later round) reviews have finished?
** Is revision plan editing disabled when the assignment is in the review stage?
** Are participants shown a summary of scores for the revision plan after the review deadline has expired?

* Assignment reviewer
** Does the rubric page show the topic-specific rubric?
** Does the rubric page show the revision plan rubric?

==Team Information==
* Lawrence O'Brien (lpobrien)
* Joshua Lin (jlin36)
* Weiqi Sun (wsun23)
* Wyatt Plaga (wgplaga)
* '''Mentor:''' Nicholas Himes (nnhimes)

==Links==
* Pull request: https://github.com/expertiza/expertiza/pull/2395
* Github repo: https://github.com/wsun23/expertiza/tree/E2232
* VCL: http://152.7.99.215:8080/

CSC/ECE 517 Spring 2022 - E2232: Revision planning tool

2022-04-26T03:25:47Z

Jlin36:

==Project Goal==
The primary objective for this project is to create a tool that can be used for the revision of projects at a time after their original submission upon the delivery of constructive feedback from their peers or instructors. The revision planning tool is an important device that will be used to give students the ability to learn from the mistakes of their submissions, and improve the quality of their work prior to the due date. This will be done by completing the existing implementation for revision planning using the following project plan.

==Project Plan==
To merge code for revision planning into current beta.

E2152's work was to develop a revision planning tool and merge it into the beta branch of Expertiza. They completed their task but developed based on a previous beta branch version and thus could not be merged into the current beta. Our goal is to reimplement their change so that it merges into the current beta branch. We will first merge the modification in the following files to the current beta and solve the conflicts.

===Files to be merged===
* app/controllers/revision_plan_questionnaires_controller.rb
* app/models/team.rb
* app/controllers/grades_controller.rb
* app/helpers/grades_helper.rb
* app/models/assignment_participant.rb
* app/models/response_map.rb
* app/views/grades/_participant_charts.html.erb
* app/views/grades/view_team.html.erb
* app/views/student_task/view.html.erb
* app/controllers/response_controller.rb
* app/views/response/response.html.erb
* config/routes.rb
* db/schema.rb
* spec/models/response_spec.rb
* spec/models/review_response_map_spec.rb
* spec/features/assignment_creation_general_tab_spec.rb
* app/models/revision_plan_team_map.rb
* spec/controllers/advice_controller_spec.rb
* app/models/response.rb

The following diagram shows how the revision planning tool is used:

[[File:E2232_diagram.png]]

==Current Project Implementation==

The implementation of this now fits within the framework created by E2161 (Fall 2021).

What it does: In the first round of Expertiza reviews, we ask reviewers to give authors some guidance on how to improve their work. Then in the second round, reviewers rate how well authors have followed their suggestions. Authors can now attach a plan of work to their reviews to indicate how they intend to address the criticism given in the peer reviews. This is done through the advice and response controller classes, which create the objects defined in the advice and response model classes, save them to the database, and display them in the HTML views. We also ensure that these actions can only be performed by the appropriate users: those with instructor-level privileges (teaching assistants, admins, and instructors) and the student who has been assigned the review, as seen here.

<pre>
def action_allowed?
  questionnaire = Questionnaire.find(params[:id])
  return true if user_logged_in? && questionnaire.owner?(session[:user].id)

  current_user_has_ta_privileges?
end
</pre>

Helper modules such as summary_helper.rb are used to extract values from existing objects. For example, get_sentences splits the comments of a review answer into an array of separate sentences.

<pre>
def get_sentences(answer)
  return nil if answer.nil?

  # Split the comments on '.', ',', '?' and '!', then trim each piece in place.
  sentences = answer.comments.split(/[.,?!]/)
  sentences.each(&:strip!)

  sentences
end
</pre>

===Rationale===
The general workflow will be maintained from the previous iterations working on this project. The workflow used by past semesters is as follows.

[[File:E2152_Rationale.png|410px|center|Image from previous write up]]

===Previous implementation===
This project was last done in Fall 2021 (E2152). However, related merged code from E2161 (link above) means the implementation this semester may need to be changed from how E2152 did it.
* [https://github.com/expertiza/expertiza/pull/2131 Primary Pull Request for E2152]
* [https://github.com/expertiza/expertiza/pull/2152 Second Pull Request for Revision Planning]
* [https://expertiza.csc.ncsu.edu/index.php/CSC/ECE_517_Fall_2021_-_E2152._Revision_planning_tool Previous write up for E2152]

====Current Flow====
Current flow is dictated by previous iterations. The following content and images are created using those previous write ups.

Prior to the round 2 submission, you can view your work, but not the revision plan.
[[File:211129-2.png|700px|thumb|center]]
If there is a round 2 submission and the "Revision Planning" has not been completed, then the "Your work" link is grayed out.
[[File:211129-5.png|700px|thumb|center]]
After editing the "Revision Planning", we can submit our work.
[[File:211129-6.png|700px|thumb|center]]

=====Current User Interface=====
The current user interface was put in place by the previous iterations; the following interface image is from those iterations.

[[File:after1.png|700px|thumb|center|Reviews cannot be done during the submission phase]]

====Design Changes====
Because the changes to the current implementation are limited to implementation details, the UML design of the project remains the same as in the previous implementation.

[[File:E2152_Design.png|1000px|center]]

==Test Plan==
===Merge existing RSpec tests for revision planning into current beta===
We will first merge the existing RSpec tests of E2152 into the current beta, then run these tests and make sure they pass. More comments can be added to the RSpec tests as well, and the coverage of the individual tests should be observed.

Existing RSpec tests to be merged:
* Controllers
** spec/controllers/grades_controller_spec.rb
** spec/controllers/questionnaires_controller_spec.rb
** spec/controllers/questions_controller_spec.rb
** spec/controllers/student_teams_controller_spec.rb
** spec/controllers/response_controller_spec.rb
** spec/controllers/revision_plan_questionnaires_controller_spec.rb
** spec/factories/revision_plan_factory.rb
* Models
** spec/models/response_spec.rb
** spec/models/review_response_map_spec.rb
* Helpers
** spec/helpers/grades_helper.rb

===Develop New RSpec Tests===
The RSpec tests are written to test both controllers and models. RSpec tests will be added to increase coverage by exercising the flows associated with different user types. Currently the only passing tests are related to student flows, so tests may be added that cover instructor flows. These may include editing the reviews once they are created and ensuring that an instructor has the ability to make edits.

The following tests check that action_allowed? permits users with Super-Admin and Instructor roles and refuses a student:
<pre>
describe '#action_allowed?' do
  let(:questionnaire) { build(:questionnaire, id: 1) }
  context 'when the role of current user is Super-Admin' do
    # Checking for Super-Admin
    it 'allows certain action' do
      controller.params = { id: '1' }
      stub_current_user(super_admin, super_admin.role.name, super_admin.role)
      expect(controller.send(:action_allowed?)).to be_truthy
    end
  end
  context 'when the role of current user is Instructor' do
    # Checking for Instructor
    it 'allows certain action' do
      controller.params = { id: '1' }
      stub_current_user(instructor1, instructor1.role.name, instructor1.role)
      expect(controller.send(:action_allowed?)).to be_truthy
    end
  end
  context 'when the role of current user is Student' do
    # Checking for Student
    it 'refuses certain action' do
      controller.params = { id: '1' }
      stub_current_user(student1, student1.role.name, student1.role)
      expect(controller.send(:action_allowed?)).to be_falsey
    end
  end
end
</pre>

The same permission checks are also tested in the response and questionnaire controllers. Model objects are tested as well, for example by checking that a response is displayed as the expected HTML (below) and that fields are returned correctly.
<pre>
context 'when prefix is not nil, which means view_score page in instructor end' do
it 'returns corresponding html code' do
allow(response).to receive(:questionnaire_by_answer).with(answer).and_return(questionnaire)
allow(questionnaire).to receive(:max_question_score).and_return(5)
allow(questionnaire).to receive(:id).and_return(1)
allow(assignment).to receive(:id).and_return(1)
allow(question).to receive(:view_completed_question).with(1, answer, 5, nil, nil).and_return('Question HTML code')
expect(response.display_as_html('Instructor end', 0)).to eq('<h4>Review 0</h4>Reviewer: no one (no name)   '\
"<a href=\"#\" name= \"review_Instructor end_1Link\" onClick=\"toggleElement('review_Instructor end_1','review');return false;\">"\
"hide review</a> <h5>Review Responses</h5><table id=\"review_Instructor end_1\" class=\"table table-bordered\">"\
"<tr class=\"warning\"><td>Question HTML code</td></tr></table><h5>Additional Comment</h5>"\
"<table id=\"review_Instructor end_1\" class=\"table table-bordered\"><tr><td></td></tr></table>")
end
end
</pre>

RSpec tests are also needed to validate saves in the controller classes and to verify model functionality in response objects.

===Manual Testing===

* Instructor
** Can Review rubric varied by topic be enabled?
** Can different roles be chosen for each questionnaire?
** Can an assignment with revision planning enabled be created?
** Can an assignment with 2 rounds of review be set up?

* Assignment participant
** Can the revision-planning rubric be edited?
** Are participants allowed to create or edit a revision plan once the round 1 (or any later round) reviews have finished?
** Is revision plan editing disabled when the assignment is in the review stage?
** Are participants shown a summary of scores for the revision plan after the review deadline has expired?

* Assignment reviewer
** Does the rubric page show the topic-specific rubric?
** Does the rubric page show the revision plan rubric?

==Team Information==
* Lawrence O'Brien (lpobrien)
* Joshua Lin (jlin36)
* Weiqi Sun (wsun23)
* Wyatt Plaga (wgplaga)
* '''Mentor:''' Nicholas Himes (nnhimes)

==Links==
* Pull request: https://github.com/expertiza/expertiza/pull/2395
* Github repo: https://github.com/wsun23/expertiza/tree/E2232
* VCL: http://152.7.99.215:8080/

CSC/ECE 517 Spring 2022 - E2232: Revision planning tool

2022-04-26T03:20:53Z

Jlin36: /* Project Plan */

==Project Goal==
The primary objective for this project is to create a tool that can be used for the revision of projects at a time after their original submission upon the delivery of constructive feedback from their peers or instructors. The revision planning tool is an important device that will be used to give students the ability to learn from the mistakes of their submissions, and improve the quality of their work prior to the due date. This will be done by completing the existing implementation for revision planning using the following project plan.

==Project Plan==
To merge code for revision planning into current beta.

The functionality of E2152 works well but it was developed based on the previous beta and cannot be merged into the current beta. We will first merge the modification in the following files to the current beta and solve the conflicts.

===Files to be merged===
* app/controllers/revision_plan_questionnaires_controller.rb
* app/models/team.rb
* app/controllers/grades_controller.rb
* app/helpers/grades_helper.rb
* app/models/assignment_participant.rb
* app/models/response_map.rb
* app/views/grades/_participant_charts.html.erb
* app/views/grades/view_team.html.erb
* app/views/student_task/view.html.erb
* app/controllers/response_controller.rb
* app/views/response/response.html.erb
* config/routes.rb
* db/schema.rb
* spec/models/response_spec.rb
* spec/models/review_response_map_spec.rb
* spec/features/assignment_creation_general_tab_spec.rb
* app/models/revision_plan_team_map.rb
* spec/controllers/advice_controller_spec.rb
* app/models/response.rb

Merge code for revision planning with code for role-based reviewing and topic-specific rubrics.

The functionality of E2261 works well and has been merged into the current beta. Once the revision planning tool is merged with topic-specific rubrics, the peer review process works as follows:
* In the first round of review, the rubric is designed by the instructor and varies by topic.
* In the second round of review, the rubric has two parts: part 1 is designed by the instructor and varies by topic; part 2 is designed by the team based on the comments from the first round of review.
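The two-part rubric for the second round can be pictured as a simple concatenation. This is only a sketch under stated assumptions: <code>SketchQuestion</code> and <code>rubric_for_round</code> are hypothetical names and do not correspond to Expertiza classes or methods.

```ruby
# Hypothetical question record: its text and who authored it.
SketchQuestion = Struct.new(:text, :source)

# Round 1 uses only the instructor's topic-specific questions.
# Round 2 and later append the team's revision plan questions as part 2.
def rubric_for_round(round, topic_questions, revision_plan_questions)
  return topic_questions if round < 2

  topic_questions + revision_plan_questions
end

topic = [SketchQuestion.new('Is the design clearly explained?', :instructor)]
plan = [SketchQuestion.new('Did the team add the missing tests?', :team)]
puts rubric_for_round(2, topic, plan).map(&:text)
```

Keeping the two parts as separate lists until display time leaves the instructor's rubric untouched while each team contributes its own questions.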

[[File:E2232_diagram.png]]

==Current Project Implementation==

The implementation of this now fits within the framework created by E2161 (Fall 2021).

What it does: In the first round of Expertiza reviews, we ask reviewers to give authors some guidance on how to improve their work. Then in the second round, reviewers rate how well authors have followed their suggestions. Authors can now attach a plan of work to their reviews to indicate how they intend to address the criticism given in the peer reviews. This is done through the advice and response controller classes, which create the objects defined in the advice and response model classes, save them to the database, and display them in the HTML views. We also ensure that these actions can only be performed by the appropriate users: those with instructor-level privileges (teaching assistants, admins, and instructors) and the student who has been assigned the review, as seen here.

<pre>
def action_allowed?
  questionnaire = Questionnaire.find(params[:id])
  return true if user_logged_in? && questionnaire.owner?(session[:user].id)

  current_user_has_ta_privileges?
end
</pre>

Helper modules such as summary_helper.rb are used to extract values from existing objects. For example, get_sentences splits the comments of a review answer into an array of separate sentences.

<pre>
def get_sentences(answer)
  return nil if answer.nil?

  # Split the comments on '.', ',', '?' and '!', then trim each piece in place.
  sentences = answer.comments.split(/[.,?!]/)
  sentences.each(&:strip!)

  sentences
end
</pre>

===Rationale===
The general workflow will be maintained from the previous iterations working on this project. The workflow used by past semesters is as follows.

[[File:E2152_Rationale.png|410px|center|Image from previous write up]]

===Previous implementation===
This project was last done in Fall 2021 (E2152). However, related merged code from E2161 (link above) means the implementation this semester may need to be changed from how E2152 did it.
* [https://github.com/expertiza/expertiza/pull/2131 Primary Pull Request for E2152]
* [https://github.com/expertiza/expertiza/pull/2152 Second Pull Request for Revision Planning]
* [https://expertiza.csc.ncsu.edu/index.php/CSC/ECE_517_Fall_2021_-_E2152._Revision_planning_tool Previous write up for E2152]

====Current Flow====
Current flow is dictated by previous iterations. The following content and images are created using those previous write ups.

Prior to the round 2 submission, you can view your work, but not the revision plan.
[[File:211129-2.png|700px|thumb|center]]
If there is a round 2 submission and the "Revision Planning" has not been completed, then the "Your work" link is grayed out.
[[File:211129-5.png|700px|thumb|center]]
After editing the "Revision Planning", we can submit our work.
[[File:211129-6.png|700px|thumb|center]]

=====Current User Interface=====
The current user interface was put in place by the previous iterations; the following interface image is from those iterations.

[[File:after1.png|700px|thumb|center|Reviews cannot be done during the submission phase]]

====Design Changes====
Because the changes to the current implementation are limited to implementation details, the UML design of the project remains the same as in the previous implementation.

[[File:E2152_Design.png|1000px|center]]

==Test Plan==
===Merge existing RSpec tests for revision planning into current beta===
We will first merge the existing RSpec tests of E2152 into the current beta, then run these tests and make sure they pass. More comments can be added to the RSpec tests as well, and the coverage of the individual tests should be observed.

Existing RSpec tests to be merged:
* Controllers
** spec/controllers/grades_controller_spec.rb
** spec/controllers/questionnaires_controller_spec.rb
** spec/controllers/questions_controller_spec.rb
** spec/controllers/student_teams_controller_spec.rb
** spec/controllers/response_controller_spec.rb
** spec/controllers/revision_plan_questionnaires_controller_spec.rb
** spec/factories/revision_plan_factory.rb
* Models
** spec/models/response_spec.rb
** spec/models/review_response_map_spec.rb
* Helpers
** spec/helpers/grades_helper.rb

===Develop New RSpec Tests===
The RSpec tests are written to test both controllers and models. RSpec tests will be added to increase coverage by exercising the flows associated with different user types. Currently the only passing tests are related to student flows, so tests may be added that cover instructor flows. These may include editing the reviews once they are created and ensuring that an instructor has the ability to make edits.

The following tests check that action_allowed? permits users with Super-Admin and Instructor roles and refuses a student:
<pre>
describe '#action_allowed?' do
  let(:questionnaire) { build(:questionnaire, id: 1) }
  context 'when the role of current user is Super-Admin' do
    # Checking for Super-Admin
    it 'allows certain action' do
      controller.params = { id: '1' }
      stub_current_user(super_admin, super_admin.role.name, super_admin.role)
      expect(controller.send(:action_allowed?)).to be_truthy
    end
  end
  context 'when the role of current user is Instructor' do
    # Checking for Instructor
    it 'allows certain action' do
      controller.params = { id: '1' }
      stub_current_user(instructor1, instructor1.role.name, instructor1.role)
      expect(controller.send(:action_allowed?)).to be_truthy
    end
  end
  context 'when the role of current user is Student' do
    # Checking for Student
    it 'refuses certain action' do
      controller.params = { id: '1' }
      stub_current_user(student1, student1.role.name, student1.role)
      expect(controller.send(:action_allowed?)).to be_falsey
    end
  end
end
</pre>

The same permission checks are also tested in the response and questionnaire controllers. Model objects are tested as well, for example by checking that a response is displayed as the expected HTML (below) and that fields are returned correctly.
<pre>
context 'when prefix is not nil, which means view_score page in instructor end' do
it 'returns corresponding html code' do
allow(response).to receive(:questionnaire_by_answer).with(answer).and_return(questionnaire)
allow(questionnaire).to receive(:max_question_score).and_return(5)
allow(questionnaire).to receive(:id).and_return(1)
allow(assignment).to receive(:id).and_return(1)
allow(question).to receive(:view_completed_question).with(1, answer, 5, nil, nil).and_return('Question HTML code')
expect(response.display_as_html('Instructor end', 0)).to eq('<h4>Review 0</h4>Reviewer: no one (no name)   '\
"<a href=\"#\" name= \"review_Instructor end_1Link\" onClick=\"toggleElement('review_Instructor end_1','review');return false;\">"\
"hide review</a> <h5>Review Responses</h5><table id=\"review_Instructor end_1\" class=\"table table-bordered\">"\
"<tr class=\"warning\"><td>Question HTML code</td></tr></table><h5>Additional Comment</h5>"\
"<table id=\"review_Instructor end_1\" class=\"table table-bordered\"><tr><td></td></tr></table>")
end
end
</pre>

RSpec tests are also needed to validate saves in the controller classes and to verify model functionality in response objects.

===Manual Testing===

* Instructor
** Can Review rubric varied by topic be enabled?
** Can different roles be chosen for each questionnaire?
** Can an assignment with revision planning enabled be created?
** Can an assignment with 2 rounds of review be set up?

* Assignment participant
** Can the revision-planning rubric be edited?
** Are participants allowed to create or edit a revision plan once the round 1 (or any later round) reviews have finished?
** Is revision plan editing disabled when the assignment is in the review stage?
** Are participants shown a summary of scores for the revision plan after the review deadline has expired?

* Assignment reviewer
** Does the rubric page show the topic-specific rubric?
** Does the rubric page show the revision plan rubric?

==Team Information==
* Lawrence O'Brien (lpobrien)
* Joshua Lin (jlin36)
* Weiqi Sun (wsun23)
* Wyatt Plaga (wgplaga)
* '''Mentor:''' Nicholas Himes (nnhimes)

==Links==
* Pull request: https://github.com/expertiza/expertiza/pull/2395
* Github repo: https://github.com/wsun23/expertiza/tree/E2232
* VCL: http://152.7.99.215:8080/

CSC/ECE 517 Spring 2022 - E2232: Revision planning tool

2022-04-26T03:17:11Z

Jlin36: /* Files to be merged */

==Project Goal==
The primary objective for this project is to create a tool that can be used for the revision of projects at a time after their original submission upon the delivery of constructive feedback from their peers or instructors. The revision planning tool is an important device that will be used to give students the ability to learn from the mistakes of their submissions, and improve the quality of their work prior to the due date. This will be done by completing the existing implementation for revision planning using the following project plan.

==Project Plan==
Merge code for revision planning into current beta

The functionality of E2152 works well but it was developed based on the previous beta and cannot be merged into the current beta. We will first merge the modification in the following files to the current beta and solve the conflicts.

===Files to be merged===
* app/controllers/revision_plan_questionnaires_controller.rb
* app/models/team.rb
* app/controllers/grades_controller.rb
* app/helpers/grades_helper.rb
* app/models/assignment_participant.rb
* app/models/response_map.rb
* app/views/grades/_participant_charts.html.erb
* app/views/grades/view_team.html.erb
* app/views/student_task/view.html.erb
* app/controllers/response_controller.rb
* app/views/response/response.html.erb
* config/routes.rb
* db/schema.rb
* spec/models/response_spec.rb
* spec/models/review_response_map_spec.rb
* spec/features/assignment_creation_general_tab_spec.rb
* app/models/revision_plan_team_map.rb
* spec/controllers/advice_controller_spec.rb
* app/models/response.rb

Merge code for revision planning with code for role based reviewing and topic specific rubrics

The functionality of E2261 works well and has been merged into the current beta. Merging the revision planning tool with topic-specific rubrics changes the peer review process as follows:
* In the first round of review, the rubric is designed by the instructor and varies by topic.
* In the second round of review, the rubric has two parts: part 1 is designed by the instructor and varies by topic; part 2 is designed by the team based on the comments from the first round of review.

[[File:E2232_diagram.png]]

==Current Project Implementation==

The implementation now fits within the framework created by E2161 (Fall 2021).

What it does: In the first round of Expertiza reviews, we ask reviewers to give authors some guidance on how to improve their work. Then in the second round, reviewers rate how well authors have followed their suggestions. Authors are now able to attach a plan of work to their reviews to indicate how they intend to act on the criticism given in the peer reviews. This is done with the controller classes for advice and response, which create the objects defined in the advice and response model classes respectively, save them to the database, and display them in the HTML views. We also ensure that responses can only be altered by the appropriate users: actions are permitted for those with instructor privileges (teaching assistants, admins, and instructors) as well as for the student who has been assigned the review (seen here).

<pre>
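def action_allowed?
  questionnaire = Questionnaire.find(params[:id])
  # The logged-in owner of the questionnaire may always act on it.
  return true if user_logged_in? && questionnaire.owner?(session[:user].id)

  # Everyone else needs TA privileges or higher.
  current_user_has_ta_privileges?
end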
</pre>
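
To make the rule easier to see outside of Rails, here is a minimal sketch of the same decision in plain Ruby. The <code>User</code> and <code>Questionnaire</code> structs, the role names, and the <code>TA_OR_HIGHER</code> list are assumptions made for illustration only; the real controller relies on Expertiza's session and authorization helpers.

```ruby
# Hypothetical, simplified stand-ins for the Expertiza models.
User = Struct.new(:id, :role)
Questionnaire = Struct.new(:owner_id) do
  def owner?(user_id)
    owner_id == user_id
  end
end

# Assumed set of roles that count as "TA privileges or higher".
TA_OR_HIGHER = ['Teaching Assistant', 'Instructor', 'Administrator', 'Super-Administrator'].freeze

def action_allowed?(user, questionnaire)
  return false if user.nil?                    # nobody is logged in
  return true if questionnaire.owner?(user.id) # the owner may always act

  TA_OR_HIGHER.include?(user.role)             # otherwise require TA or above
end

plan = Questionnaire.new(7)
p action_allowed?(User.new(7, 'Student'), plan)    # owner of the questionnaire
p action_allowed?(User.new(8, 'Student'), plan)    # a different student
p action_allowed?(User.new(2, 'Instructor'), plan) # instructor privileges
```

In this sketch the owning student and the instructor are allowed, while a student who does not own the questionnaire is refused.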

Helper methods, such as those in summary_helper.rb, are used to extract values from existing objects. For example, the method below breaks the comments of a review answer into sentences and returns them as separate array entries.

<pre>
def get_sentences(answer)
  return nil if answer.nil?

  # Split the comment on '.', ',', '?' and '!' and trim each piece in place.
  sentences = answer.comments.split(/[.,?!]/)
  sentences.each { |sentence| sentence.strip! }

  sentences
end
</pre>
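
As a rough illustration of what the helper above returns, the sketch below re-creates the same split-and-strip logic on a plain string. The <code>Answer</code> struct is a hypothetical stand-in for the Expertiza model (only <code>comments</code> is assumed). Note that the character class also contains a comma, so clauses are separated as well as sentences.

```ruby
# Hypothetical stand-in for the Answer model; only #comments is assumed.
Answer = Struct.new(:comments)

def get_sentences(answer)
  return nil if answer.nil?

  # Split on '.', ',', '?' and '!' and trim the surrounding whitespace.
  answer.comments.split(/[.,?!]/).map(&:strip)
end

sentences = get_sentences(Answer.new('Good structure. Needs more tests, maybe? Fix the typos!'))
p sentences # => ["Good structure", "Needs more tests", "maybe", "Fix the typos"]
```

The trailing "!" does not produce an empty final entry, because Ruby's <code>String#split</code> drops trailing empty strings by default.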

===Rationale===
The general workflow will be maintained from the previous iterations working on this project. The workflow used by past semesters is as follows.

[[File:E2152_Rationale.png|410px|center|Image from previous write up]]

===Previous implementation===
This project was last done in Fall 2021 (E2152). However, related merged code from E2161 (link above) means the implementation this semester may need to be changed from how E2152 did it.
* [https://github.com/expertiza/expertiza/pull/2131 Primary Pull Request for E2152]
* [https://github.com/expertiza/expertiza/pull/2152 Second Pull Request for Revision Planning]
* [https://expertiza.csc.ncsu.edu/index.php/CSC/ECE_517_Fall_2021_-_E2152._Revision_planning_tool Previous write up for E2152]

====Current Flow====
The current flow is dictated by previous iterations. The following content and images are taken from those previous write-ups.

Prior to the round 2 submission, you can view your work, but not the revision plan.
[[File:211129-2.png|700px|thumb|center]]
If there is a round 2 submission and the "Revision Planning" step has not been completed, the "Your work" link is grayed out.
[[File:211129-5.png|700px|thumb|center]]
After editing the "Revision Planning", we can submit our work.
[[File:211129-6.png|700px|thumb|center]]
[[File:211129-6.png|700px|thumb|center]]

=====Current User Interface=====
The current user interface was put in place by the previous iterations. The following interface image is from those iterations.

[[File:after1.png|700px|thumb|center|Reviews cannot be done during the submission phase]]

====Design Changes====
Because the changes to the current implementation are limited to implementation details, the UML design of the project remains the same as in the previous implementation.

[[File:E2152_Design.png|1000px|center]]

==Test Plan==
===Merge existing RSpec tests for revision planning into current beta===
We will first merge the existing RSpec tests of E2152 into the current beta, then run these tests and make them pass. More comments can also be added to the RSpec tests, and the coverage of the individual tests should be observed.

Existing RSpec tests to be merged:
* Controllers
** spec/controllers/grades_controller_spec.rb
** spec/controllers/questionnaires_controller_spec.rb
** spec/controllers/questions_controller_spec.rb
** spec/controllers/student_teams_controller_spec.rb
** spec/controllers/response_controller_spec.rb
** spec/controllers/revision_plan_questionnaires_controller_spec.rb
** spec/factories/revision_plan_factory.rb
* Models
** spec/models/response_spec.rb
** spec/models/review_response_map_spec.rb
* Helpers
** spec/helpers/grades_helper.rb

===Develop New RSpec Tests===
The RSpec tests cover both controllers and models, and more will be added to increase coverage. To do this, we will test the flows associated with different user types. Currently, the only passing tests relate to student flows, so tests may be added for instructor flows. These may include editing reviews once they are created and ensuring that an instructor has the ability to make edits.

Tests check which roles are allowed to perform actions. In the example below, a super-admin and an instructor are permitted, while a student is refused:
<pre>
describe '#action_allowed?' do
  let(:questionnaire) { build(:questionnaire, id: 1) }
  context 'when the role of current user is Super-Admin' do
    # Checking for Super-Admin
    it 'allows certain action' do
      controller.params = { id: '1' }
      stub_current_user(super_admin, super_admin.role.name, super_admin.role)
      expect(controller.send(:action_allowed?)).to be_truthy
    end
  end
  context 'when the role of current user is Instructor' do
    # Checking for Instructor
    it 'allows certain action' do
      controller.params = { id: '1' }
      stub_current_user(instructor1, instructor1.role.name, instructor1.role)
      expect(controller.send(:action_allowed?)).to be_truthy
    end
  end
  context 'when the role of current user is Student' do
    # Checking for Student
    it 'refuses certain action' do
      controller.params = { id: '1' }
      stub_current_user(student1, student1.role.name, student1.role)
      expect(controller.send(:action_allowed?)).to be_falsey
    end
  end
end
</pre>

The same checks are made in the response and questionnaire controllers. Model objects are also tested, for example rendering a response as HTML (below) and ensuring that fields are returned correctly.
<pre>
context 'when prefix is not nil, which means view_score page in instructor end' do
it 'returns corresponding html code' do
allow(response).to receive(:questionnaire_by_answer).with(answer).and_return(questionnaire)
allow(questionnaire).to receive(:max_question_score).and_return(5)
allow(questionnaire).to receive(:id).and_return(1)
allow(assignment).to receive(:id).and_return(1)
allow(question).to receive(:view_completed_question).with(1, answer, 5, nil, nil).and_return('Question HTML code')
expect(response.display_as_html('Instructor end', 0)).to eq('<h4>Review 0</h4>Reviewer: no one (no name)   '\
"<a href=\"#\" name= \"review_Instructor end_1Link\" onClick=\"toggleElement('review_Instructor end_1','review');return false;\">"\
"hide review</a> <h5>Review Responses</h5><table id=\"review_Instructor end_1\" class=\"table table-bordered\">"\
"<tr class=\"warning\"><td>Question HTML code</td></tr></table><h5>Additional Comment</h5>"\
"<table id=\"review_Instructor end_1\" class=\"table table-bordered\"><tr><td></td></tr></table>")
end
end
</pre>

RSpec tests are also needed to validate saves in controller classes and to verify model functionality in response objects.

===Manual Testing===

* Instructor
** Can Review rubric varied by topic be enabled?
** Can different roles be chosen for each questionnaire?
** Can an assignment with revision planning enabled be created?
** Can an assignment with 2 rounds of review be set up?

* Assignment participant
** Can the revision-planning rubric be edited?
** Are participants allowed to create/edit a revision plan once round 1 (or any later round) of reviews has finished?
** Is revision plan editing disabled when the assignment is in the review stage?
** Are participants shown a summary of scores for the revision plan after the review deadline has expired?

* Assignment reviewer
** Does the rubric page show the topic-specific rubric?
** Does the rubric page show the revision plan rubric?

==Team Information==
* Lawrence O'Brien (lpobrien)
* Joshua Lin (jlin36)
* Weiqi Sun (wsun23)
* Wyatt Plaga (wgplaga)
* '''Mentor:''' Nicholas Himes (nnhimes)

==Links==
* Pull request: https://github.com/expertiza/expertiza/pull/2395
* GitHub repo: https://github.com/wsun23/expertiza/tree/E2232
* VCL: http://152.7.99.215:8080/

CSC/ECE 517 Spring 2022 - E2232: Revision planning tool

2022-04-25T22:35:43Z

Jlin36:

==Project Goal==
The primary objective of this project is to create a tool that students can use to revise their projects after the original submission, once they have received constructive feedback from their peers or instructors. The revision planning tool gives students the ability to learn from the mistakes in their submissions and to improve the quality of their work before the due date. We will do this by completing the existing implementation of revision planning according to the following project plan.

==Project Plan==
Merge code for revision planning into current beta

The functionality of E2152 works well, but it was developed against a previous version of beta and cannot be merged into the current beta. We will first merge the modifications in the following files into the current beta and resolve the conflicts.

===Files to be merged===
* app/controllers/revision_plan_questionnaires_controller.rb
* app/models/team.rb
* app/controllers/grades_controller.rb
* app/helpers/grades_helper.rb
* app/models/assignment_participant.rb
* app/models/response_map.rb
* app/views/grades/_participant_charts.html.erb
* app/views/grades/view_team.html.erb
* app/views/student_task/view.html.erb
* app/controllers/response_controller.rb
* app/views/response/response.html.erb
* config/routes.rb
* db/schema.rb
* spec/models/response_spec.rb
* spec/models/review_response_map_spec.rb
* spec/features/assignment_creation_general_tab_spec.rb
* app/models/revision_plan_team_map.rb

Merge code for revision planning with code for role based reviewing and topic specific rubrics

The functionality of E2261 works well and has been merged into the current beta. Merging the revision planning tool with topic-specific rubrics changes the peer review process as follows:
* In the first round of review, the rubric is designed by the instructor and varies by topic.
* In the second round of review, the rubric has two parts: part 1 is designed by the instructor and varies by topic; part 2 is designed by the team based on the comments from the first round of review.

[[File:E2232_diagram.png]]

==Current Project Implementation==

The implementation of this needs to fit within the framework created by E2161 (Fall 2021).

What it does: In the first round of Expertiza reviews, we ask reviewers to give authors some guidance on how to improve their work. Then in the second round, reviewers rate how well authors have followed their suggestions. We could carry the interaction one step further if we asked authors to make up a revision plan based on the first-round reviews. That is, authors would say what they were planning to do to improve their work. Then second-round reviewers would assess how well they did it. In essence, this means that authors would be adding criteria to the second-round rubric that applied only to their submission. We are interested in having this implemented and used in a class so that we can study its effect.

===Rationale===
The general workflow will be maintained from the previous iterations working on this project. The workflow used by past semesters is as follows.

[[File:E2152_Rationale.png|410px|center|Image from previous write up]]

===Previous implementation===
This project was last done in Fall 2021 (E2152). However, related merged code from E2161 (link above) means the implementation this semester may need to be changed from how E2152 did it.
* [https://github.com/expertiza/expertiza/pull/2131 Primary Pull Request for E2152]
* [https://github.com/expertiza/expertiza/pull/2152 Second Pull Request for Revision Planning]
* [https://expertiza.csc.ncsu.edu/index.php/CSC/ECE_517_Fall_2021_-_E2152._Revision_planning_tool Previous write up for E2152]

====Current Flow====
The current flow is dictated by previous iterations. The following content and images are taken from those previous write-ups.

Prior to the round 2 submission, you can view your work, but not the revision plan.
[[File:211129-2.png|700px|thumb|center]]
If there is a round 2 submission and the "Revision Planning" step has not been completed, the "Your work" link is grayed out.
[[File:211129-5.png|700px|thumb|center]]
After editing the "Revision Planning", we can submit our work.
[[File:211129-6.png|700px|thumb|center]]

=====Current User Interface=====
The current user interface was put in place by the previous iterations. The following interface image is from those iterations.

[[File:after1.png|700px|thumb|center|Reviews cannot be done during the submission phase]]

===Implementation to be completed===
There are actually two E2152 pull requests in Expertiza right now. The PR we saw in the demo has fewer recent commits than the other, and the PR we did not see has fewer files changed. The team started with last year's project, so the beta they started with was the beta from last year. The changes since then will show up as merge conflicts if this project is merged.

The functionality of this project seems to work well and would be a valuable addition to Expertiza, but it cannot be merged in its current state. There are many artifacts in their PR from an old version of beta. This is because the team merged the previous teams' code into current beta, but did not remove the differences unrelated to their project. The team knew of these problems before the demo, but did not fix them.

Because the existing functionality already covers the intended requirements fairly well, our work is to make the RSpec tests that currently fail the build pass. This involves supporting different flows depending on the type of user completing the review, as shown below. HTML changes are also needed for the tests to pass: the display_as_html, done_by_staff_participant and participant_scores methods must return the correct HTML values. Currently, display_as_html raises an unknown error that causes a mistake in the HTML delivered. done_by_staff_participant is not present in the code and therefore produces a method-not-found error. participant_scores returns an error that results from an incorrect calculation of total_scores. Each of these bugs needs further investigation to find the root cause and establish a solution.

Pull request E2131 is failing rspec tests in ReviewMappingHelper due to a currently unknown OpenSSL error. More work will be needed to determine the specific cause and nature of this error.

More comments can be made in rspec tests as well. It is unclear what the coverage is of the individual tests. The file revision_plan_team_map_test.rb has nothing substantial in it.

[[File:Failing_rspec.PNG|1000px|center]]

====Design Changes====
Because the changes to the current implementation are limited to implementation details, the UML design of the project remains the same as in the previous implementation.

[[File:E2152_Design.png|1000px|center]]

==Test Plan==
===Merge existing RSpec tests for revision planning into current beta===
We will first merge the existing RSpec tests of E2152 into the current beta, then run these tests and make them pass. More comments can also be added to the RSpec tests, and the coverage of the individual tests should be observed.

Existing RSpec tests to be merged:
* Controllers
** spec/controllers/grades_controller_spec.rb
** spec/controllers/questionnaires_controller_spec.rb
** spec/controllers/questions_controller_spec.rb
** spec/controllers/student_teams_controller_spec.rb
** spec/controllers/response_controller_spec.rb
** spec/controllers/revision_plan_questionnaires_controller_spec.rb
** spec/factories/revision_plan_factory.rb
* Models
** spec/models/response_spec.rb
** spec/models/review_response_map_spec.rb
* Helpers
** spec/helpers/grades_helper.rb

===Develop New RSpec Tests===
The RSpec tests cover both controllers and models, and more will be added to increase coverage. To do this, we will test the flows associated with different user types. Currently, the only passing tests relate to student flows, so tests may be added for instructor flows. These may include editing reviews once they are created and ensuring that an instructor has the ability to make edits.

===Manual Testing===

* Instructor
** Can Review rubric varied by topic be enabled?
** Can different roles be chosen for each questionnaire?
** Can an assignment with revision planning enabled be created?
** Can an assignment with 2 rounds of review be set up?

* Assignment participant
** Can the revision-planning rubric be edited?
** Are participants allowed to create/edit a revision plan once round 1 (or any later round) of reviews has finished?
** Is revision plan editing disabled when the assignment is in the review stage?
** Are participants shown a summary of scores for the revision plan after the review deadline has expired?

* Assignment reviewer
** Does the rubric page show the topic-specific rubric?
** Does the rubric page show the revision plan rubric?

==Implementation By Team==
Halfway through the project, new changes were merged into Expertiza's beta branch. With these changes in place, our code, which was based on the old version of the beta branch, started failing tests and causing errors. We fixed these failures, which mainly occurred in advice_controller.rb. Some failures were also carried over from the previous team's implementation of this project; most of those occurred in response.rb.

==Team Information==
* Lawrence O'Brien (lpobrien)
* Joshua Lin (jlin36)
* Weiqi Sun (wsun23)
* Wyatt Plaga (wgplaga)
* '''Mentor:''' Nicholas Himes (nnhimes)

CSC/ECE 517 Spring 2022 - E2232: Revision planning tool

2022-04-05T22:37:49Z

Jlin36:

Project Goal:

Project Plan:

Other known information:

Stuff:

What it does: In the first round of Expertiza reviews, we ask reviewers to give authors some guidance on how to improve their work. Then in the second round, reviewers rate how well authors have followed their suggestions. We could carry the interaction one step further if we asked authors to make up a revision plan based on the first-round reviews. That is, authors would say what they were planning to do to improve their work. Then second-round reviewers would assess how well they did it. In essence, this means that authors would be adding criteria to the second-round rubric that applied only to their submission. We are interested in having this implemented and used in a class so that we can study its effect.

The implementation of this needs to fit within the framework created by E2161 (Fall 2021).

Previous implementation - E2152: This project was last done in Fall 2021. However, related merged code from E2161 (link above) means the implementation this semester may need to be changed from how E2152 did it.
https://github.com/expertiza/expertiza/pull/2152 (primary PR?)
https://github.com/expertiza/expertiza/pull/2131
https://expertiza.csc.ncsu.edu/index.php/CSC/ECE_517_Fall_2021_-_E2152._Revision_planning_tool

Our comments on the implementation:
There are actually two E2152 pull requests in Expertiza right now. The PR we saw in the demo has fewer recent commits than the other, and the PR we did not see has fewer files changed. The team started with last year's project, so the beta they started with was the beta from last year. The changes since then will show up as merge conflicts if this project is merged.

The functionality of this project seems to work well and would be a valuable addition to Expertiza, but it cannot be merged in its current state. There are many artifacts in their PR from an old version of beta. This is because the team merged the previous teams' code into current beta, but did not remove the differences unrelated to their project. The team knew of these problems before the demo, but did not fix them.

More comments can be made in rspec tests as well. It is unclear what the coverage is of the individual tests. The file revision_plan_team_map_test.rb has nothing substantial in it.

CSC/ECE 517 Spring 2022 - E2232: Revision planning tool

2022-04-05T22:37:38Z

Jlin36:

Project Goal:

Project Plan:

Other known information:

Stuff:
What it does: In the first round of Expertiza reviews, we ask reviewers to give authors some guidance on how to improve their work. Then in the second round, reviewers rate how well authors have followed their suggestions. We could carry the interaction one step further if we asked authors to make up a revision plan based on the first-round reviews. That is, authors would say what they were planning to do to improve their work. Then second-round reviewers would assess how well they did it. In essence, this means that authors would be adding criteria to the second-round rubric that applied only to their submission. We are interested in having this implemented and used in a class so that we can study its effect.

The implementation of this needs to fit within the framework created by E2161 (Fall 2021).

Previous implementation - E2152: This project was last done in Fall 2021. However, related merged code from E2161 (link above) means the implementation this semester may need to be changed from how E2152 did it.
https://github.com/expertiza/expertiza/pull/2152 (primary PR?)
https://github.com/expertiza/expertiza/pull/2131
https://expertiza.csc.ncsu.edu/index.php/CSC/ECE_517_Fall_2021_-_E2152._Revision_planning_tool

Our comments on the implementation:
There are actually two E2152 pull requests in Expertiza right now. The PR we saw in the demo has fewer recent commits than the other, and the PR we did not see has fewer files changed. The team started with last year's project, so the beta they started with was the beta from last year. The changes since then will show up as merge conflicts if this project is merged.

The functionality of this project seems to work well and would be a valuable addition to Expertiza, but it cannot be merged in its current state. There are many artifacts in their PR from an old version of beta. This is because the team merged the previous teams' code into current beta, but did not remove the differences unrelated to their project. The team knew of these problems before the demo, but did not fix them.

More comments can be made in rspec tests as well. It is unclear what the coverage is of the individual tests. The file revision_plan_team_map_test.rb has nothing substantial in it.

CSC/ECE 517 Spring 2022 - E2232: Revision planning tool

2022-04-05T21:17:16Z

Jlin36:

Project Goal:

Project Plan:

Other known information:

CSC/ECE 517 Spring 2022 - E2232: Revision planning tool

2022-04-05T21:17:04Z

Jlin36:

Project Goal:
Project Plan:
Other known information:

CSC/ECE 517 Spring 2022 - E2232: Revision planning tool

2022-04-05T21:12:58Z

Jlin36: Created page with "Testing"

Testing

CSC/ECE 517 Spring 2022 - E2212: Testing for hamer.rb

2022-03-28T02:35:57Z

Jlin36: /* Coverage */

This page describes the changes made for the Spring 2022 OSS Project E2212: Testing for hamer.rb
== Project Overview ==

=== Introduction ===

Using students' reviews of an assignment as a basis for grading has become more popular among professors and courses in universities. Not only does this method free the professor and TAs from days of work, it also allows students to learn more about an assignment by grading others' work. Unfortunately, many students may not take reviewing seriously and may simply give 100 or 0 to other students. Since such reviews may skew a student's grade, a system that assesses the correctness and credibility of a reviewer is necessary for student reviews to be accurate. The Hamer algorithm was created for this purpose and returns a reputation weight for each reviewer. The instructor can then use the reputation weight either to judge the reliability of a reviewer or to compute a grade for a reviewer.

=== System Design ===

The Hamer algorithm takes a set of assignment grades from each reviewer and, optionally, an existing reputation weight for each reviewer, and computes a new reputation weight for each reviewer. These weights indicate the accuracy and reliability of each reviewer. For example, a reviewer with a reputation weight of 3.0 is more accurate and reliable in their reviews than a reviewer with a reputation weight of 0.5. The following is an example from the paper [1] that describes the Hamer algorithm:
[[File:Hamer Algorithm Inputs Outputs.png|500px]] 

This algorithm is currently deployed on the following web server: http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms. To use the algorithm, a POST request is sent to the Peerlogic URL with a JSON body consisting of each assignment, its grades, and the reviewer who gave each grade. A JSON response is then sent back with the corresponding Hamer reputation weights. The following is an example of a POST request to the Peerlogic URL that returns Hamer reputation values:
[[File:Reputation web server hamer.png|1000px]] 

=== Objectives ===

* Calculate reputation scores based on paper "Pluggable reputation systems for peer review: A web-service approach"
* Assert the accuracy of the reputation web server's hamer values through the URL http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms.
* Create a mock web server to return the correct hamer values if the reputation web server's hamer algorithm returns incorrect values.

=== Files Involved ===

*reputation_web_server_hamer.rb
*reputation_mock_web_server_hamer.rb

=== Mentor ===

Ed Gehringer, efg@ncsu.edu

=== Team Members ===

* Joshua Lin (jlin36@ncsu.edu)
* Muhammet Mustafa Olmez (molmez@ncsu.edu)
* Soumyadeep Chatterjee (schatte5@ncsu.edu)

== Algorithms ==

Reputation systems may take various factors into account:
* Does a reviewer assign scores that are similar to scores assigned by the instructor (on work that they both grade)?
* Does a reviewer assign scores that match those assigned by other reviewers?
* Does the reviewer assign different scores to different work?
* How competent has the reviewer been on other work done for the class?

Two algorithms are used: the Hamer-peer algorithm has the lowest maximum absolute bias, and the Lauw-peer algorithm has the lowest overall bias. This indicates, from the instructor’s perspective, that if there are further assignments of this kind, expert grading may not be necessary. It is observed in the article (https://ieeexplore.ieee.org/abstract/document/7344292) that the overall bias is a little bit higher, but the max. absolute bias is very high (more than 20). This indicates that for future similar courses, the instructor can trust most students’ peer grading, but should be aware that the students may give inflated grades. Therefore spot-checking is necessary. However, overall bias is quite low, as the students gave grades at least 16 points lower than expert grades. This may be because either more training is needed or the review rubric is inadequate. The results also suggest that for future courses of this kind, the instructor cannot trust the students' grades; expert grades are still necessary.
The main difference between the Hamer-peer and the Lauw-peer algorithm is that the Lauw-peer algorithm keeps track of the reviewer's leniency (“bias”), which can be either positive or negative. A positive leniency indicates the reviewer tends to give higher scores than average. Additionally, the range for Hamer’s algorithm is (0,∞) while for Lauw’s algorithm it is [0,1].

=== Hamer Algorithm ===
[[File:Step1.PNG|400px]]
 
[[File:Step2.PNG|400px]]
 
[[File:Step3.PNG|400px]]
 
[[File:Step4.PNG|400px]]

We implemented the steps of this algorithm for our analytical validation, found in the section below.
== Test Plan - Initial Phase ==
In the initial phase, we were tasked with testing the reputation_web_service_controller. The work done by a previous project team was affected by the web service (Peerlogic) not being available at that time. This time, we were able to access the Peerlogic server at a late stage, so our plan at that point was to perform a series of unit tests to determine whether the web service was communicating correctly with Expertiza.

=== Initial Testing Plan, Object Creation ===
1. Since our focus in this phase was to conduct exploratory testing of the system, we wrote some conventional tests to examine Peerlogic functionality. At this stage, we realized that Peerlogic would only accept and respond with JSON data.

2. Therefore, a natural next step was to prepare a series of input data that simulated a general input scenario for the system, comprising the following:
::a. Each reviewer has assigned scores to 3 reviewees (fellow students).
::b. There are a total of 3 reviewers, who have all graded each other in some fashion for 5 assignments.
::c. Convert this scenario to JSON.
::d. Write code to send this to Peerlogic in a POST request, and receive a response.
::e. Parse this response to obtain the output values of the Hamer algorithm, as calculated by Peerlogic.
::f. Compare this output against the values that we calculated ourselves based on the research paper for the Hamer algorithm.
'''''The code for the last step is shown below'''''
<pre>
import math

# Parameters: reviews list
# reviews list - a list of each reviewer's grades for each assignment
# Example:
# reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]
# Corresponding reviewer and grade for each assignment table
# Essay        Reviewer1  Reviewer2  Reviewer3
# Assignment1  5          5          4
# Assignment2  4          3          3
# Assignment3  4          4          4
# Assignment4  3          4          3
# Assignment5  2          2          2

# Reviewer's grades given to each assignment 2D array
# Each index of reviews is a reviewer. Each index in reviews[i] is a review grade
reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]

# Number of reviewers
numReviewers = len(reviews)
# Number of assignments
numAssig = len(reviews[0])
# Initial empty grades for each assignment array
grades = []
# Initial empty delta R array
deltaR = []
# Weight prime
weightPrime = []
# Reviewer's reputation weight
weight = []

# Calculating the average grade of each assignment (all reviewers weighted equally)
for numAssigIndex in range(numAssig):
    assignmentGradeAverage = 0
    for numReviewerIndex in range(numReviewers):
        assignmentGradeAverage += reviews[numReviewerIndex][numAssigIndex]
    grades.append(assignmentGradeAverage/numReviewers)
print("Average Grades:")
print(grades)

# Calculating delta R (mean squared deviation of each reviewer from the averages)
for numReviewerIndex in range(numReviewers):
    reviewerDeltaR = 0
    assignmentAverageGradeIndex = 0
    for reviewGrade in reviews[numReviewerIndex]:
        reviewerDeltaR += ((reviewGrade - grades[assignmentAverageGradeIndex]) ** 2)
        assignmentAverageGradeIndex += 1
    reviewerDeltaR /= numAssig
    deltaR.append(reviewerDeltaR)
print("deltaR:")
print(deltaR)

# Calculating the average delta R over all reviewers
averageDeltaR = 0
for reviewerDeltaR in deltaR:
    averageDeltaR += reviewerDeltaR
averageDeltaR /= numReviewers
print("averageDeltaR:")
print(averageDeltaR)

# Calculating weight prime
for reviewerDeltaR in deltaR:
    weightPrime.append(averageDeltaR/reviewerDeltaR)
print("weightPrime:")
print(weightPrime)

# Calculating reputation weight (weights above 2 are damped logarithmically)
for reviewerWeightPrime in weightPrime:
    if reviewerWeightPrime <= 2:
        weight.append(reviewerWeightPrime)
    else:
        weight.append(2 + math.log(reviewerWeightPrime - 1))
print("reputation per reviewer:")
i = 1
for reviewerWeight in weight:
    print("Reputation of Reviewer ", i)
    print(round(reviewerWeight,1))
    i += 1
</pre>
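
Step c of the plan above (converting the scenario to JSON) can be sketched as follows. This is only an illustration: the function name build_payload and the key labels "submissionN" and "stuN" are our assumptions, modelled on the request format used later on this page, and are not real Expertiza identifiers.

```python
import json


def build_payload(reviews):
    """Turn reviews[reviewer][assignment] into {submission: {reviewer: grade}}."""
    payload = {}
    for reviewer_index, reviewer_grades in enumerate(reviews, start=1):
        for assignment_index, grade in enumerate(reviewer_grades, start=1):
            # "submissionN" and "stuN" are hypothetical labels, not real IDs
            submission_key = f"submission{assignment_index}"
            payload.setdefault(submission_key, {})[f"stu{reviewer_index}"] = grade
    return payload


# Same sample data as in the script above
reviews = [[5, 4, 4, 3, 2], [5, 3, 4, 4, 2], [4, 3, 4, 3, 2]]
payload = build_payload(reviews)
body = json.dumps(payload)  # JSON text that could be used as a request body
print(body)
```

Grouping the grades by submission rather than by reviewer mirrors the shape of the request body used in the later test snippets.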

=== Initial Testing Conclusions ===
The results that are '''''actually''''' received from Peerlogic are presented below:
 
[[File:json-ss.jpeg|600px]]

The results from our Python re-creation of the Hamer algorithm are as follows: 
[[File:Reputation web server hamer2.png|250px]] 

As you can see, the two sets of results do NOT match.
'''Therefore, our first conclusion is that the PeerLogic Webservice is implemented incorrectly.'''
This has been documented in the Conclusion section as the first point.
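
A comparison of this kind could be automated along the following lines. This is a minimal sketch: the helper name mismatches, the tolerance, and the "actual" values are hypothetical and are not the values Peerlogic returned; only the expected values are taken from the test snippets on this page.

```python
import math


def mismatches(expected, actual, tolerance=0.05):
    """Return {reviewer_id: (expected, actual)} for every value that differs."""
    diffs = {}
    for reviewer_id, expected_value in expected.items():
        actual_value = actual.get(reviewer_id)
        # A missing reviewer counts as a mismatch
        if actual_value is None or not math.isclose(
            expected_value, actual_value, abs_tol=tolerance
        ):
            diffs[reviewer_id] = (expected_value, actual_value)
    return diffs


# Expected reputation weights, as used in the test snippets on this page
expected = {"9996": 0.6, "9997": 3.6, "9998": 1.1, "9999": 1.1}
# Hypothetical service output, made up purely to exercise the helper
hypothetical_actual = {"9996": 0.6, "9997": 2.0, "9998": 1.1}

report = mismatches(expected, hypothetical_actual)
print(report)
```

An empty report would mean that the service agrees with the hand calculation within the chosen tolerance.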
== Changes to Project Scope ==

=== Test Plan - Second Phase, Object Creation ===
We followed the testing approach '''recommended by Dr. Gehringer''', our project mentor:
in testing this service, we used an external program to send requests to a simulated service and inspected the returned data.
We chose this approach because the program under test was unfortunately not running and could not be inspected directly.
Finally, we also added a test case that ''mocks'' a web service and asserts the output, done in 2 ways:
::a. In the first code snippet, we send JSON to a web service that '''returns the correct Hamer output''', as Peerlogic should when fixed.
::b. In the second snippet below, we use an RSpec mock.

The test below sends real JSON to both Peerlogic (http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms) and the mock.

When we ran it against both, the current web service proved incorrect: the values it returned do not match the expected values, as can be seen in the picture.
This finding is the result the project was meant to establish.
 

Test Code Snippet:

<pre>
require "net/http"
require "json"

# The following contains 4 reviewers who have scored 4 submissions.
# INPUTS maps each submission to the grade it received from each reviewer.
# Input visualization (rows are submissions, columns are reviewers):
# Reviewer->       stu9999  stu9998  stu9997  stu9996
# submission9999   10       10       9        5
# submission9998   3        2        4        5
# submission9997   7        4        5        5
# submission9996   6        4        5        5

INPUTS = {
  "submission9999": {
    "stu9999": 10,
    "stu9998": 10,
    "stu9997": 9,
    "stu9996": 5
  },
  "submission9998": {
    "stu9999": 3,
    "stu9998": 2,
    "stu9997": 4,
    "stu9996": 5
  },
  "submission9997": {
    "stu9999": 7,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  },
  "submission9996": {
    "stu9999": 6,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  }
}.to_json

# Values that would be returned by a correct Hamer implementation
EXPECTED = {
  "Hamer": {
    "9996": 0.6,
    "9997": 3.6,
    "9998": 1.1,
    "9999": 1.1
  }
}.to_json

# Sends the API request to Peerlogic
describe "Expertiza" do
  it "should return the correct Hamer calculation" do
    uri = URI('http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms')
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port) do |http|
      http.request(req)
    end

    # This assertion fails, as expected, because Peerlogic returns incorrect values
    expect(JSON.parse(response.body)["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

# Sends the API request to the mock Hamer/Peerlogic server
describe "Expertiza Web Service" do
  it "should return the correct Hamer calculation" do
    uri = URI('https://4dfaead4-a747-4be4-8683-3b10d1d2e0c0.mock.pstmn.io/reputation_web_service/default')
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port, :use_ssl => uri.scheme == 'https') do |http|
      http.request(req)
    end
    expect(JSON.parse("#{response.body}}")["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

</pre>

 
=== Second Phase Output ===
 
[[File:Response_json_expected.jpeg]]

In addition, this plan enables us to test the current functionality by treating this system as a black box, and is able to provide conclusions on
the accuracy of the implementation as a whole.

Therefore, in the section below, we have provided code that showcases this plan in action. The values returned by the algorithm are to be inspected both by code and by hand.

<pre>
require "net/http"
require "json"
require "webmock/rspec" # gem install webmock -v 2.2.0

WebMock.disable_net_connect!(allow_localhost: true)

# Setting up test objects
INPUTS = {
  "submission9999": {
    "stu9999": 10,
    "stu9998": 10,
    "stu9997": 9,
    "stu9996": 5
  },
  "submission9998": {
    "stu9999": 3,
    "stu9998": 2,
    "stu9997": 4,
    "stu9996": 5
  },
  "submission9997": {
    "stu9999": 7,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  },
  "submission9996": {
    "stu9999": 6,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  }
}.to_json

# Result expectations are identical here, in order to maintain uniformity
EXPECTED = {
  "Hamer": {
    "9996": 0.6,
    "9997": 3.6,
    "9998": 1.1,
    "9999": 1.1
  }
}.to_json

# Tests against a stubbed Peerlogic endpoint
describe "Expertiza" do
  before(:each) do
    # Any POST to the Peerlogic host returns the expected Hamer values
    stub_request(:post, /peerlogic.csc.ncsu.edu/).
      to_return(status: 200, body: EXPECTED, headers: {})
  end

  it "should return the correct Hamer calculation" do
    uri = URI('http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms')
    # INPUTS is already JSON, to ensure server compatibility
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port, :use_ssl => uri.scheme == 'https') do |http|
      http.request(req)
    end
    # Parse the JSON response body to access values for the algorithm of choice, which is Hamer
    expect(JSON.parse(response.body)["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end
# Our assertion passing shows that this mock works
</pre>
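
For readers who prefer Python, the same stubbing idea can be sketched with the standard library. This is not part of our test suite: the function names fetch_hamer and fetch_with_stubbed_service are hypothetical, and unittest.mock stands in for WebMock. The response shape, with a top-level "Hamer" key, follows the RSpec example above.

```python
import json
import urllib.request
from unittest import mock

URL = "http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms"
EXPECTED = {"Hamer": {"9996": 0.6, "9997": 3.6, "9998": 1.1, "9999": 1.1}}


def fetch_hamer(inputs):
    """POST the grades as JSON and return the "Hamer" part of the response."""
    request = urllib.request.Request(
        URL,
        data=json.dumps(inputs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["Hamer"]


def fetch_with_stubbed_service(inputs):
    """Call fetch_hamer with urlopen replaced by a stub that returns EXPECTED."""
    fake_response = mock.MagicMock()
    fake_response.read.return_value = json.dumps(EXPECTED).encode("utf-8")
    # urlopen is used as a context manager, so the stub must support "with"
    fake_response.__enter__.return_value = fake_response
    with mock.patch("urllib.request.urlopen", return_value=fake_response) as stub:
        result = fetch_hamer(inputs)
    return result, stub.call_count


result, calls = fetch_with_stubbed_service({"submission9999": {"stu9999": 10}})
print(result, calls)
```

As with the WebMock version, no network traffic is generated, so the check only confirms that the client code parses a correct response properly.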

== Edge Cases & Scenarios ==
''We present these scenarios as possible test cases for an accurately working Peerlogic webservice.''
::1. Reviewer gives all max scores
::2. Reviewer gives all min scores
::3. Reviewer completes no review
:::Alternative scenario: reviewer gives max scores even if no inputs

These have not been implemented as there is no point in testing a system further when positive flows do not work.
However, the code in the Initial Phase Section can be used to analytically calculate correct responses for future assertions.
We have provided outputs to these scenarios below:
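
As a starting point for future teams, the first two scenarios could be exercised with a function version of the single-pass calculation from the Initial Phase section. This is a minimal sketch under stated assumptions: the name hamer_weights, the 1 to 5 score range, and the decision to raise an error when a reviewer's deviation is zero are our own choices and are not taken from the paper or from Peerlogic. The third scenario (no review completed) is not modelled here.

```python
import math


def hamer_weights(reviews):
    """Single-pass reputation weights, mirroring the Initial Phase script."""
    num_reviewers = len(reviews)
    num_assignments = len(reviews[0])
    # Unweighted average grade per assignment (every reviewer starts at weight 1)
    averages = [
        sum(reviewer[j] for reviewer in reviews) / num_reviewers
        for j in range(num_assignments)
    ]
    # Mean squared deviation of each reviewer from those averages
    deltas = [
        sum((reviewer[j] - averages[j]) ** 2 for j in range(num_assignments))
        / num_assignments
        for reviewer in reviews
    ]
    if any(delta == 0 for delta in deltas):
        # Assumption: report the case instead of dividing by zero
        raise ValueError("a reviewer matches the averages exactly; weight is undefined")
    average_delta = sum(deltas) / num_reviewers
    weights = []
    for delta in deltas:
        weight_prime = average_delta / delta
        if weight_prime <= 2:
            weights.append(weight_prime)
        else:
            weights.append(2 + math.log(weight_prime - 1))
    return weights


MAX_SCORE, MIN_SCORE = 5, 1  # hypothetical score range

# Scenario 1: the first reviewer gives the maximum score to everything
print(hamer_weights([[MAX_SCORE] * 3, [4, 3, 2], [4, 3, 2]]))
# Scenario 2: the first reviewer gives the minimum score to everything
print(hamer_weights([[MIN_SCORE] * 3, [4, 3, 2], [4, 3, 2]]))
```

In both runs the reviewer who ignores the differences between submissions ends up with a lower weight than the two reviewers who agree with each other. When all reviewers give identical grades, every deviation is zero and the formula divides by zero, which is why the sketch raises an error for that case.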

== Coverage ==

We believe that test coverage can be adequately measured once our edge cases are implemented for a working Peerlogic and the assertions pass. 
At this moment, test coverage is not a relevant statistic, as neither the positive and negative flows nor the edge cases function correctly.

== Conclusion ==

As a team, we worked out the algorithms and their applications and wrote some test scenarios. However, we did not have the chance to work with the web service through Expertiza, since it does not work due to module errors: we encountered an "undefined method strip" error in the Reputation Web Service Controller. Although it sometimes works on the Expertiza team's side, we were not able to see the web service working. We therefore created some test scenarios and wrote Python code to simulate the algorithm.
 
::1. In the code segment written to simulate the hamer.rb algorithm as described in "A Method of Automatic Grade Calibration in Peer Assessment" by John Hamer, Kenneth T.K. Ma, and Hugh H.F. Kwong (https://crpit.scem.westernsydney.edu.au/confpapers/CRPITV42Hamer.pdf), we take a list of reviewers and their grades for each assignment reviewed to compute the associated reputation weight. Since the algorithm described in the paper does not specify an initial weight for first-time reviewers, we coded it so that first-time reviewers have an initial weight of 1. In addition, this code does not yet take existing reputation weights into account for reviewers who already have them; this will be added soon. We followed the algorithm in the paper to the letter, but even then the output values given in its example did not match what we computed by hand and by code. In this situation, either we missed something or the algorithm has been changed. When we tested against Peerlogic and the mock, the current web service proved incorrect, since the returned values do not match the expected values, as can be seen in the picture. This may be the finding we were supposed to reach in this project.

::2. In addition, we found that reputation_web_service_controller.rb is currently broken and needs refactoring. While the client side of the reputation web service page runs, any attempt to submit grades to the reputation web server side results in an error.

::3. We provided scenarios for future teams to implement once Peerlogic is running correctly.

::4. We mocked an accurate webservice and showed what the expected JSON should be like.

==GitHub Links==
Link to Expertiza repository: [https://github.com/expertiza/expertiza here]

Link to the forked repository: [https://github.com/joshlin5/expertiza here]

Link to pull request: [https://github.com/expertiza/expertiza/pull/2357 here]

Link to Github Project page: [https://github.com/joshlin5/expertiza/projects/2 here]

Link to Testing Video: [https://youtu.be/VyeGGpxymXk here]

== References ==

1. Expertiza on GitHub (https://github.com/expertiza/expertiza) 
2. The live Expertiza website (http://expertiza.ncsu.edu/) 
3. Pluggable reputation systems for peer review: A web-service approach (https://doi.org/10.1109/FIE.2015.7344292)

CSC/ECE 517 Spring 2022 - E2212: Testing for hamer.rb

2022-03-28T02:34:57Z

Jlin36:

This page describes the changes made for the Spring 2022 OSS Project E2212: Testing for hamer.rb
== Project Overview ==

=== Introduction ===

Using students' reviews of an assignment to arrive at a more accurate grade has become increasingly popular among professors and courses in universities. Not only does this method free the professor and TAs from days of work, it also allows students to learn more about an assignment by grading others' work. Unfortunately, many students may not take reviewing others' work seriously and may simply give 100 or 0 to other students. Since such reviews may skew a student's grade, a system that assesses the correctness and credibility of a reviewer is necessary for student reviews to be accurate. The Hamer algorithm was created for this purpose and returns a reputation weight associated with each reviewer. The instructor can then use the reputation weight either to assess the reliability of a reviewer or to compute a grade for a reviewer.

=== System Design ===

The hamer algorithm takes in a set of grades for assignments by reviewer and also any reputation weights (optional) associated with each reviewer to compute a reputation weight value for each reviewer. These reputation weight values indicate the accuracy and reliability of each reviewer. For example, a reviewer with a reputation weight of 3.0 is more accurate and reliable in their reviews compared to a 0.5 reputation weight of another reviewer. The following is an example from the paper [1] that describes the hamer algorithm: 
[[File:Hamer Algorithm Inputs Outputs.png|500px]] 

This algorithm is currently deployed to the following web server: http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms. To use the algorithm, a POST request is sent to the Peerlogic URL with a JSON body consisting of each assignment, its grades, and the reviewer who gave each grade. Once the request is sent, a JSON response is returned with the corresponding Hamer reputation weight values. The following is an example of a POST request to the Peerlogic URL that returns Hamer reputation values: 
[[File:Reputation web server hamer.png|1000px]] 
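
The request and response handling described above can be sketched in Python as follows. This is an illustration only: the helper names build_request and parse_hamer are hypothetical, the request is built but deliberately not sent, and the sample response text is made up in the shape our tests expect (a top-level "Hamer" key that maps reviewer IDs to weights).

```python
import json
import urllib.request

PEERLOGIC_URL = "http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms"


def build_request(grades_by_submission):
    """Build (but do not send) a JSON POST request for the reputation service."""
    body = json.dumps(grades_by_submission).encode("utf-8")
    return urllib.request.Request(
        PEERLOGIC_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_hamer(response_text):
    """Extract the per-reviewer Hamer weights from a JSON response body."""
    return json.loads(response_text)["Hamer"]


request = build_request({"submission9999": {"stu9999": 10, "stu9998": 10}})
print(request.get_method(), request.full_url)

# Made-up response text, used only to show the parsing step
sample_response = '{"Hamer": {"9998": 1.1, "9999": 1.1}}'
print(parse_hamer(sample_response))
```

To actually contact the server, the request object would be passed to urllib.request.urlopen and the returned body handed to parse_hamer.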

=== Objectives ===

* Calculate reputation scores based on paper "Pluggable reputation systems for peer review: A web-service approach"
* Assert the accuracy of the reputation web server's hamer values through the URL http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms.
* Create a mock web server to return the correct hamer values if the reputation web server's hamer algorithm returns incorrect values.

=== Files Involved ===

*reputation_web_server_hamer.rb
*reputation_mock_web_server_hamer.rb

=== Mentor ===

Ed Gehringer, efg@ncsu.edu

=== Team Members ===

* Joshua Lin (jlin36@ncsu.edu)
* Muhammet Mustafa Olmez (molmez@ncsu.edu)
* Soumyadeep Chatterjee (schatte5@ncsu.edu)

== Algorithms ==

Reputation systems may take various factors into account:
* Does a reviewer assign scores that are similar to scores assigned by the instructor (on work that they both grade)?
* Does a reviewer assign scores that match those assigned by other reviewers?
* Does the reviewer assign different scores to different work?
* How competent has the reviewer been on other work done for the class?

Two algorithms are used: the Hamer-peer algorithm has the lowest maximum absolute bias, and the Lauw-peer algorithm has the lowest overall bias. This indicates that, from the instructor’s perspective, expert grading may not be necessary if there are further assignments of this kind. It is observed in the article (https://ieeexplore.ieee.org/abstract/document/7344292) that the overall bias is a little bit higher, but the maximum absolute bias is very high (more than 20). This indicates that for future similar courses, the instructor can trust most students’ peer grading, but should be aware that the students may give inflated grades. Therefore spot-checking is necessary. However, overall bias is quite low, as the students gave grades at least 16 points lower than expert grades. This may be because either more training is needed or the review rubric is inadequate. The results also suggest that for future courses of this kind, the instructor cannot trust the students' grades; expert grades are still necessary.
The main difference between the Hamer-peer and the Lauw-peer algorithm is that the Lauw-peer algorithm keeps track of the reviewer's leniency (“bias”), which can be either positive or negative. A positive leniency indicates the reviewer tends to give higher scores than average. Additionally, the range for Hamer’s algorithm is (0,∞) while for Lauw’s algorithm it is [0,1].

=== Hamer Algorithm ===
[[File:Step1.PNG|400px]]
 
[[File:Step2.PNG|400px]]
 
[[File:Step3.PNG|400px]]
 
[[File:Step4.PNG|400px]]

We implemented the steps of this algorithm for our analytical validation, found in the section below.
== Test Plan - Initial Phase ==
In the initial phase, we were tasked with testing the reputation_web_service_controller. The work done by a previous project team was impacted by the web-service (peerlogic) not being available at that time.
This time, we were able to access the Peerlogic server at a late stage - therefore, our plan at this moment involved performing a series of unit tests to determine that the web-service was communicating
correctly with Expertiza.

=== Initial Testing Plan, Object Creation ===
1. Since our focus in this phase was to conduct exploratory testing of the system, we wrote some conventional tests to examine Peerlogic functionality. At this stage, we realized that Peerlogic
would only accept and respond with JSON data.

2. Therefore, a natural next step was to prepare a series of input data that simulated a general input scenario for the system, comprising:
::a. Each reviewer has assigned scores to 3 reviewees (fellow students)
::b. There are a total of 3 reviewers, who have all graded each other in some fashion for 5 assignments
::c. Convert this scenario to JSON
::d. Write code to POST this to Peerlogic, and receive a response
::e. Parse this response to obtain the output values of the Hamer algorithm, as calculated by Peerlogic.
::f. Compare this output against the expected values that we calculated from the research paper on the Hamer algorithm
'''''The code for the last step is shown below'''''
<pre>
import math

# Input: reviews, a 2D list of the grades each reviewer gave to each assignment.
# Each index of reviews is a reviewer. Each index in reviews[i] is a review grade.
# Example:
# reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]
# The same data as a table of assignments by reviewer:
# Essay        Reviewer1  Reviewer2  Reviewer3
# Assignment1  5          5          4
# Assignment2  4          3          3
# Assignment3  4          4          4
# Assignment4  3          4          3
# Assignment5  2          2          2

reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]

# Number of reviewers
numReviewers = len(reviews)
# Number of assignments
numAssig = len(reviews[0])
# Average grade of each assignment (initially empty)
grades = []
# Delta R of each reviewer (initially empty)
deltaR = []
# Weight prime of each reviewer
weightPrime = []
# Reputation weight of each reviewer
weight = []

# Calculating the average grade of each assignment across all reviewers
for numAssigIndex in range(numAssig):
    assignmentGradeAverage = 0
    for numReviewerIndex in range(numReviewers):
        assignmentGradeAverage += reviews[numReviewerIndex][numAssigIndex]
    grades.append(assignmentGradeAverage/numReviewers)
print("Average Grades:")
print(grades)

# Calculating delta R: each reviewer's mean squared deviation from the average grades
for numReviewerIndex in range(numReviewers):
    reviewerDeltaR = 0
    assignmentAverageGradeIndex = 0
    for reviewGrade in reviews[numReviewerIndex]:
        reviewerDeltaR += ((reviewGrade - grades[assignmentAverageGradeIndex]) ** 2)
        assignmentAverageGradeIndex += 1
    reviewerDeltaR /= numAssig
    deltaR.append(reviewerDeltaR)
print("deltaR:")
print(deltaR)

# Calculating the average delta R over all reviewers
averageDeltaR = 0
for reviewerDeltaR in deltaR:
    averageDeltaR += reviewerDeltaR
averageDeltaR /= numReviewers
print("averageDeltaR:")
print(averageDeltaR)

# Calculating weight prime (raises ZeroDivisionError if a reviewer's delta R is 0)
for reviewerDeltaR in deltaR:
    weightPrime.append(averageDeltaR/reviewerDeltaR)
print("weightPrime:")
print(weightPrime)

# Calculating reputation weight: weight primes above 2 are damped logarithmically
for reviewerWeightPrime in weightPrime:
    if reviewerWeightPrime <= 2:
        weight.append(reviewerWeightPrime)
    else:
        weight.append(2 + math.log(reviewerWeightPrime - 1))
print("reputation per reviewer:")
i = 1
for reviewerWeight in weight:
    print("Reputation of Reviewer ", i)
    print(round(reviewerWeight,1))
    i += 1
</pre>

=== Initial Testing Conclusions ===
The results that are '''''actually''''' received from Peerlogic are presented below:
 
[[File:json-ss.jpeg|600px]]

The results from our Python re-creation of the Hamer algorithm are as follows: 
[[File:Reputation web server hamer2.png|250px]] 

As you can see, the two sets of results do NOT match.
'''Therefore, our first conclusion is that the PeerLogic Webservice is implemented incorrectly.'''
This has been documented in the Conclusion section as the first point.
== Changes to Project Scope ==

=== Test Plan - Second Phase, Object Creation ===
We followed the testing approach '''recommended by Dr. Gehringer''', our project mentor:
in testing this service, we used an external program to send requests to a simulated service and inspected the returned data.
We chose this approach because the program under test was unfortunately not running and could not be inspected directly.
Finally, we also added a test case that ''mocks'' a web service and asserts the output, done in 2 ways:
::a. In the first code snippet, we send JSON to a web service that '''returns the correct Hamer output''', as Peerlogic should when fixed.
::b. In the second snippet below, we use an RSpec mock.

The test below sends real JSON to both Peerlogic (http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms) and the mock.

When we ran it against both, the current web service proved incorrect: the values it returned do not match the expected values, as can be seen in the picture.
This finding is the result the project was meant to establish.
 

Test Code Snippet:

<pre>
require "net/http"
require "json"

# The following contains 4 reviewers who have scored 4 submissions.
# INPUTS maps each submission to the grade it received from each reviewer.
# Input visualization (rows are submissions, columns are reviewers):
# Reviewer->       stu9999  stu9998  stu9997  stu9996
# submission9999   10       10       9        5
# submission9998   3        2        4        5
# submission9997   7        4        5        5
# submission9996   6        4        5        5

INPUTS = {
  "submission9999": {
    "stu9999": 10,
    "stu9998": 10,
    "stu9997": 9,
    "stu9996": 5
  },
  "submission9998": {
    "stu9999": 3,
    "stu9998": 2,
    "stu9997": 4,
    "stu9996": 5
  },
  "submission9997": {
    "stu9999": 7,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  },
  "submission9996": {
    "stu9999": 6,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  }
}.to_json

# Values that would be returned by a correct Hamer implementation
EXPECTED = {
  "Hamer": {
    "9996": 0.6,
    "9997": 3.6,
    "9998": 1.1,
    "9999": 1.1
  }
}.to_json

# Sends the API request to Peerlogic
describe "Expertiza" do
  it "should return the correct Hamer calculation" do
    uri = URI('http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms')
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port) do |http|
      http.request(req)
    end

    # This assertion fails, as expected, because Peerlogic returns incorrect values
    expect(JSON.parse(response.body)["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

# Sends the API request to the mock Hamer/Peerlogic server
describe "Expertiza Web Service" do
  it "should return the correct Hamer calculation" do
    uri = URI('https://4dfaead4-a747-4be4-8683-3b10d1d2e0c0.mock.pstmn.io/reputation_web_service/default')
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port, :use_ssl => uri.scheme == 'https') do |http|
      http.request(req)
    end
    expect(JSON.parse("#{response.body}}")["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

</pre>

 
=== Second Phase Output ===
 
[[File:Response_json_expected.jpeg]]

In addition, this plan enables us to test the current functionality by treating this system as a black box, and is able to provide conclusions on
the accuracy of the implementation as a whole.

Therefore, in the section below, we have provided code that showcases this plan in action. The values returned by the algorithm are to be inspected both by code and by hand.

<pre>
require "net/http"
require "json"
require "webmock/rspec" # gem install webmock -v 2.2.0

WebMock.disable_net_connect!(allow_localhost: true)

# Setting up test objects
INPUTS = {
  "submission9999": {
    "stu9999": 10,
    "stu9998": 10,
    "stu9997": 9,
    "stu9996": 5
  },
  "submission9998": {
    "stu9999": 3,
    "stu9998": 2,
    "stu9997": 4,
    "stu9996": 5
  },
  "submission9997": {
    "stu9999": 7,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  },
  "submission9996": {
    "stu9999": 6,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  }
}.to_json

# Result expectations are identical here, in order to maintain uniformity
EXPECTED = {
  "Hamer": {
    "9996": 0.6,
    "9997": 3.6,
    "9998": 1.1,
    "9999": 1.1
  }
}.to_json

# Tests against a stubbed Peerlogic endpoint
describe "Expertiza" do
  before(:each) do
    # Any POST to the Peerlogic host returns the expected Hamer values
    stub_request(:post, /peerlogic.csc.ncsu.edu/).
      to_return(status: 200, body: EXPECTED, headers: {})
  end

  it "should return the correct Hamer calculation" do
    uri = URI('http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms')
    # INPUTS is already JSON, to ensure server compatibility
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port, :use_ssl => uri.scheme == 'https') do |http|
      http.request(req)
    end
    # Parse the JSON response body to access values for the algorithm of choice, which is Hamer
    expect(JSON.parse(response.body)["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end
# Our assertion passing shows that this mock works
</pre>

== Edge Cases & Scenarios ==
''We present these scenarios as possible test cases for an accurately working Peerlogic webservice.''
::1. Reviewer gives all max scores
::2. Reviewer gives all min scores
::3. Reviewer completes no review
:::Alternative scenario: reviewer gives max scores even if no inputs

These have not been implemented as there is no point in testing a system further when positive flows do not work.
However, the code in the Initial Phase Section can be used to analytically calculate correct responses for future assertions.
We have provided outputs to these scenarios below:

== Coverage ==

== Conclusion ==

As a team, we worked out the algorithms and their applications and wrote some test scenarios. However, we did not have the chance to work with the web service through Expertiza, since it does not work due to module errors: we encountered an "undefined method strip" error in the Reputation Web Service Controller. Although it sometimes works on the Expertiza team's side, we were not able to see the web service working. We therefore created some test scenarios and wrote Python code to simulate the algorithm.
 
::1. In the code segment written to simulate the hamer.rb algorithm as described in "A Method of Automatic Grade Calibration in Peer Assessment" by John Hamer, Kenneth T.K. Ma, and Hugh H.F. Kwong (https://crpit.scem.westernsydney.edu.au/confpapers/CRPITV42Hamer.pdf), we take a list of reviewers and their grades for each assignment reviewed to compute the associated reputation weight. Since the algorithm described in the paper does not specify an initial weight for first-time reviewers, we coded it so that first-time reviewers have an initial weight of 1. In addition, this code does not yet take existing reputation weights into account for reviewers who already have them; this will be added soon. We followed the algorithm in the paper to the letter, but even then the output values given in its example did not match what we computed by hand and by code. In this situation, either we missed something or the algorithm has been changed. When we tested against Peerlogic and the mock, the current web service proved incorrect, since the returned values do not match the expected values, as can be seen in the picture. This may be the finding we were supposed to reach in this project.

::2. In addition, we found that reputation_web_service_controller.rb is currently broken and needs refactoring. While the client side of the reputation web service page runs, any attempt to submit grades to the reputation web server side results in an error.

::3. We provided scenarios for future teams to implement once Peerlogic is running correctly.

::4. We mocked an accurate webservice and showed what the expected JSON should be like.

==GitHub Links==
Link to Expertiza repository: [https://github.com/expertiza/expertiza here]

Link to the forked repository: [https://github.com/joshlin5/expertiza here]

Link to pull request: [https://github.com/expertiza/expertiza/pull/2357 here]

Link to Github Project page: [https://github.com/joshlin5/expertiza/projects/2 here]

Link to Testing Video: [https://youtu.be/VyeGGpxymXk here]

== References ==

1. Expertiza on GitHub (https://github.com/expertiza/expertiza) 
2. The live Expertiza website (http://expertiza.ncsu.edu/) 
3. Pluggable reputation systems for peer review: A web-service approach (https://doi.org/10.1109/FIE.2015.7344292)

CSC/ECE 517 Spring 2022 - E2212: Testing for hamer.rb

2022-03-28T02:20:44Z

Jlin36:

This page describes the changes made for the Spring 2022 OSS Project E2212: Testing for hamer.rb
== Project Overview ==

=== Introduction ===

Using students' reviews of an assignment to arrive at a more accurate grade has become increasingly popular among professors and courses in universities. Not only does this method free the professor and TAs from days of work, it also allows students to learn more about an assignment by grading others' work. Unfortunately, many students may not take reviewing others' work seriously and may simply give 100 or 0 to other students. Since such reviews may skew a student's grade, a system that assesses the correctness and credibility of a reviewer is necessary for student reviews to be accurate. The Hamer algorithm was created for this purpose and returns a reputation weight associated with each reviewer. The instructor can then use the reputation weight either to assess the reliability of a reviewer or to compute a grade for a reviewer.

=== System Design ===

The hamer algorithm takes in a set of grades for assignments by reviewer and also any reputation weights (optional) associated with each reviewer to compute a reputation weight value for each reviewer. These reputation weight values indicate the accuracy and reliability of each reviewer. For example, a reviewer with a reputation weight of 3.0 is more accurate and reliable in their reviews compared to a 0.5 reputation weight of another reviewer. The following is an example from the paper [1] that describes the hamer algorithm: 
[[File:Hamer Algorithm Inputs Outputs.png|500px]] 

This algorithm is currently deployed to the following web server: http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms. To use the algorithm, a POST request is sent to the Peerlogic URL with a JSON body consisting of each assignment, its grades, and the reviewer who gave each grade. Once the request is sent, a JSON response is returned with the corresponding Hamer reputation weight values. The following is an example of a POST request to the Peerlogic URL that returns Hamer reputation values: 
[[File:Reputation web server hamer.png|1000px]] 

=== Objectives ===

* Calculate reputation scores based on paper "Pluggable reputation systems for peer review: A web-service approach"
* Assert the accuracy of the reputation web server's hamer values through the URL http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms.
* Create a mock web server to return the correct hamer values if the reputation web server's hamer algorithm returns incorrect values.

=== Files Involved ===

*reputation_web_server_hamer.rb
*reputation_mock_web_server_hamer.rb

=== Mentor ===

Ed Gehringer, efg@ncsu.edu

=== Team Members ===

* Joshua Lin (jlin36@ncsu.edu)
* Muhammet Mustafa Olmez (molmez@ncsu.edu)
* Soumyadeep Chatterjee (schatte5@ncsu.edu)

== Algorithms ==

Reputation systems may take various factors into account:
* Does a reviewer assign scores that are similar to scores assigned by the instructor (on work that they both grade)?
* Does a reviewer assign scores that match those assigned by other reviewers?
* Does the reviewer assign different scores to different work?
* How competent has the reviewer been on other work done for the class?

Two algorithms are used: the Hamer-peer algorithm has the lowest maximum absolute bias, and the Lauw-peer algorithm has the lowest overall bias. This indicates that, from the instructor’s perspective, expert grading may not be necessary if there are further assignments of this kind. It is observed in the article (https://ieeexplore.ieee.org/abstract/document/7344292) that the overall bias is a little bit higher, but the maximum absolute bias is very high (more than 20). This indicates that for future similar courses, the instructor can trust most students’ peer grading, but should be aware that the students may give inflated grades. Therefore spot-checking is necessary. However, overall bias is quite low, as the students gave grades at least 16 points lower than expert grades. This may be because either more training is needed or the review rubric is inadequate. The results also suggest that for future courses of this kind, the instructor cannot trust the students' grades; expert grades are still necessary.
The main difference between the Hamer-peer and the Lauw-peer algorithm is that the Lauw-peer algorithm keeps track of the reviewer's leniency (“bias”), which can be either positive or negative. A positive leniency indicates the reviewer tends to give higher scores than average. Additionally, the range for Hamer’s algorithm is (0,∞) while for Lauw’s algorithm it is [0,1].

=== Hamer Algorithm ===
[[File:Step1.PNG|400px]]
 
[[File:Step2.PNG|400px]]
 
[[File:Step3.PNG|400px]]
 
[[File:Step4.PNG|400px]]

We implemented the steps of this algorithm for our analytical validation, found in the section below.
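
As a hedged summary, the steps can be written out the way the Python script in the Test Plan section of this page implements them. This transcribes that script rather than the step images above, and the symbols are notation introduced here for illustration only: R reviewers, A assignments, and x_{r,a} for the grade that reviewer r gave to assignment a.

```latex
\begin{aligned}
g_a &= \frac{1}{R}\sum_{r=1}^{R} x_{r,a} \\
\delta_r &= \frac{1}{A}\sum_{a=1}^{A}\left(x_{r,a} - g_a\right)^2 \\
\bar{\delta} &= \frac{1}{R}\sum_{r=1}^{R}\delta_r \\
w'_r &= \frac{\bar{\delta}}{\delta_r} \\
w_r &= \begin{cases} w'_r & \text{if } w'_r \le 2 \\ 2 + \ln\left(w'_r - 1\right) & \text{otherwise} \end{cases}
\end{aligned}
```

In that script every reviewer starts with the same weight, so g_a is a plain average, and w'_r is undefined for a reviewer whose delta is exactly 0.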
== Test Plan - Initial Phase ==
In the initial phase, we were tasked with testing the reputation_web_service_controller. The previous project team's work was hampered because the web service (Peerlogic) was not available at that time.
This time we were able to access the Peerlogic server, although only at a late stage. Our plan at that point was therefore to run a series of unit tests to determine whether the web service was communicating
correctly with Expertiza.

=== Initial Testing Plan ===
1. Since our focus in this phase was to conduct exploratory testing of the system, we wrote some conventional tests to examine Peerlogic functionality. At this stage, we realized that Peerlogic
would only accept and respond with JSON data.

2. Therefore, a natural next step was to prepare input data that simulated a general input scenario for the system, as follows:
::a. Each reviewer has assigned scores to 3 reviewees (fellow students)
::b. There are a total of 3 reviewers, who have all graded each other in some fashion for 5 assignments
::c. Convert this scenario to JSON
::d. Write code to POST this to Peerlogic, and receive a response
::e. Parse this response to obtain the output values of the Hamer algorithm, as calculated by Peerlogic
::f. Compare this output against the values that we calculated ourselves from the research paper for the Hamer algorithm
'''''The code for the last step is shown below'''''
<pre>
import math

# Parameters: reviews list
# reviews list - a list of each reviewer's grades for each assignment
# Example:
# reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]
# Corresponding reviewer and grade for each assignment table
# Essay Reviewer1 Reviewer2 Reviewer3
# Assignment1 5 5 4
# Assignment2 4 3 3
# Assignment3 4 4 4
# Assignment4 3 4 3
# Assignment5 2 2 2

# Reviewer's grades given to each assignment 2D array
# Each index of reviews is a reviewer. Each index in reviews[i] is a review grade
reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]

# Number of reviewers
numReviewers = len(reviews)
# Number of assignments
numAssig = len(reviews[0])
# Initial empty grades for each assignment array
grades = []
# Initial empty delta R array
deltaR = []
# Weight prime
weightPrime = []
# Reviewer's reputation weight
weight= []

# Calculating the average grade for each assignment (all reviewers weighted equally)
for numAssigIndex in range(numAssig):
    assignmentGradeAverage = 0
    for numReviewerIndex in range(numReviewers):
        assignmentGradeAverage += reviews[numReviewerIndex][numAssigIndex]
    grades.append(assignmentGradeAverage/numReviewers)
print("Average Grades:")
print(grades)

# Calculating delta R: each reviewer's mean squared deviation from the average grades
for numReviewerIndex in range(numReviewers):
    reviewerDeltaR = 0
    assignmentAverageGradeIndex = 0
    for reviewGrade in reviews[numReviewerIndex]:
        reviewerDeltaR += ((reviewGrade - grades[assignmentAverageGradeIndex]) ** 2)
        assignmentAverageGradeIndex += 1
    reviewerDeltaR /= numAssig
    deltaR.append(reviewerDeltaR)
print("deltaR:")
print(deltaR)

# Calculating the average delta R across all reviewers
averageDeltaR = 0
for reviewerDeltaR in deltaR:
    averageDeltaR += reviewerDeltaR
averageDeltaR /= numReviewers
print("averageDeltaR:")
print(averageDeltaR)

# Calculating weight prime (raises ZeroDivisionError if a reviewer's delta R is exactly 0)
for reviewerDeltaR in deltaR:
    weightPrime.append(averageDeltaR/reviewerDeltaR)
print("weightPrime:")
print(weightPrime)

# Calculating reputation weight: weight prime is kept up to 2 and damped logarithmically above 2
for reviewerWeightPrime in weightPrime:
    if reviewerWeightPrime <= 2:
        weight.append(reviewerWeightPrime)
    else:
        weight.append(2 + math.log(reviewerWeightPrime - 1))
print("reputation per reviewer:")
i = 1
for reviewerWeight in weight:
    print("Reputation of Reviewer ", i)
    print(round(reviewerWeight,1))
    i += 1
</pre>
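
Step c of the plan (converting the scenario to JSON) could be sketched as follows. This is a minimal illustration and not the team's code: the function name reviews_to_payload and the descending ID numbering that starts at 9999 are assumptions, chosen only to resemble the submissionNNNN and stuNNNN keys used in the RSpec snippets on this page.

```python
import json


def reviews_to_payload(reviews, first_id=9999):
    """Turn reviews[reviewer][assignment] into {"submissionN": {"stuM": grade}}.

    The key naming and the first_id default are illustrative assumptions.
    """
    payload = {}
    num_assignments = len(reviews[0])
    for a in range(num_assignments):
        submission_key = f"submission{first_id - a}"
        payload[submission_key] = {
            f"stu{first_id - r}": reviews[r][a] for r in range(len(reviews))
        }
    return payload


# Same matrix as the script above: one row per reviewer, one column per assignment.
reviews = [[5, 4, 4, 3, 2], [5, 3, 4, 4, 2], [4, 3, 4, 3, 2]]
payload = reviews_to_payload(reviews)
print(json.dumps(payload, indent=2))
```

The resulting string is what would be placed in the request body in step d.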

=== Initial Testing Conclusions ===
The results that are '''''actually''''' received from Peerlogic are presented below:
 
[[File:json-ss.jpeg|600px]]

The results from our Python re-creation of the Hamer algorithm are as follows:
[[File:Reputation web server hamer2.png|250px]] 

As you can see, the results returned by Peerlogic do NOT match the expected results.
'''Therefore, our first conclusion is that the PeerLogic Webservice is implemented incorrectly.'''
This has been documented in the Conclusion section as the first point.
== Changes to Project Scope ==

=== Test Plan - Second Phase ===
We followed the testing approach '''recommended by Dr. Gehringer''', our project mentor:
In testing this service, we used an external program to send requests to a simulated service, and inspected the returned data.
We chose this approach because the program under test was unfortunately not running and could not be inspected directly.
Finally, we also added a test case that ''mocks'' a web service and asserts the output, done in 2 ways:
::a. In the first code snippet, we send JSON to a web service that '''returns the correct Hamer output''', as Peerlogic should when fixed.
::b. In the form of an RSpec mock, in the second snippet below.

The test below sends real JSON to both Peerlogic and the mock. The Peerlogic endpoint is http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms

Testing against both Peerlogic and the mock shows that the current web service is not correct, since the values it returns do not match the expected values, as can be seen in the picture.
This is the result we were expected to establish in this project.
 
Proof of working:
 
[[File:Response_json_expected.jpeg]]

Test Code Snippet:

<pre>
require "net/http"
require "json"
# The following contains 4 reviewers who have scored 4 submissions
# Input visualization: grade given by each reviewer (column) to each submission (row)
# Submission      stu9999  stu9998  stu9997  stu9996
# submission9999       10       10        9        5
# submission9998        3        2        4        5
# submission9997        7        4        5        5
# submission9996        6        4        5        5

INPUTS = {
  "submission9999": {
    "stu9999": 10,
    "stu9998": 10,
    "stu9997": 9,
    "stu9996": 5
  },
  "submission9998": {
    "stu9999": 3,
    "stu9998": 2,
    "stu9997": 4,
    "stu9996": 5
  },
  "submission9997": {
    "stu9999": 7,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  },
  "submission9996": {
    "stu9999": 6,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  }
}.to_json

# Values that would be returned by a correct Hamer implementation
EXPECTED = {
  "Hamer": {
    "9996": 0.6,
    "9997": 3.6,
    "9998": 1.1,
    "9999": 1.1
  }
}.to_json

# Sends API request to Peerlogic
describe "Expertiza" do
  it "should return the correct Hamer calculation" do
    uri = URI('http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms')
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port) do |http|
      http.request(req)
    end

    # Assertion fails, as expected
    expect(JSON.parse(response.body)["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

# Sends API request to the mock Hamer/Peerlogic server
describe "Expertiza Web Service" do
  it "should return the correct Hamer calculation" do
    uri = URI('https://4dfaead4-a747-4be4-8683-3b10d1d2e0c0.mock.pstmn.io/reputation_web_service/default')
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port, :use_ssl => uri.scheme == 'https') do |http|
      http.request(req)
    end
    expect(JSON.parse("#{response.body}}")["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

</pre>

In addition, this plan enables us to test the current functionality by treating this system as a black box, and is able to provide conclusions on
the accuracy of the implementation as a whole.

Therefore, in the section below, we have provided code that showcases this plan in action. The values returned by the algorithm are to be inspected both by code and by hand.

<pre>
require "net/http"
require "json"
require "webmock/rspec" # gem install webmock -v 2.2.0

WebMock.disable_net_connect!(allow_localhost: true)
# Setting up test objects
INPUTS = {
  "submission9999": {
    "stu9999": 10,
    "stu9998": 10,
    "stu9997": 9,
    "stu9996": 5
  },
  "submission9998": {
    "stu9999": 3,
    "stu9998": 2,
    "stu9997": 4,
    "stu9996": 5
  },
  "submission9997": {
    "stu9999": 7,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  },
  "submission9996": {
    "stu9999": 6,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  }
}.to_json
# Result expectations are identical here, in order to maintain uniformity
EXPECTED = {
  "Hamer": {
    "9996": 0.6,
    "9997": 3.6,
    "9998": 1.1,
    "9999": 1.1
  }
}.to_json
# Tests Peerlogic (the request is intercepted by the stub, so no network call is made)
describe "Expertiza" do
  before(:each) do
    stub_request(:post, /peerlogic.csc.ncsu.edu/).
      to_return(status: 200, body: EXPECTED, headers: {})
  end
  it "should return the correct Hamer calculation" do
    uri = URI('http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms')
    # JSON conversion to ensure server compatibility
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port, :use_ssl => uri.scheme == 'https') do |http|
      http.request(req)
    end
    # Parse the JSON response body to access values for the algorithm of choice, which is Hamer
    expect(JSON.parse(response.body)["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end
# Our assertion proves this mock works
</pre>

== Edge Cases & Scenarios ==
''We present these scenarios as possible test cases for an accurately working Peerlogic webservice.''
1) Reviewer gives all max scores 
2) Reviewer gives all min scores 
3) Reviewer completes no review 
alternative scenario - reviewer gives max scores even if no inputs

These have not been implemented as there is no point in testing a system further when positive flows do not work.
However, the code in the Initial Phase Section can be used to analytically calculate correct responses for future assertions.
We have provided outputs to these scenarios below:
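
As a sketch only (it is not the team's recorded output), scenarios 1 and 2 can be evaluated by wrapping the steps of the Initial Phase script in a function. The function name hamer_weights, the two honest reviewer rows, the 1 to 5 grade scale, and the choice to report an infinite weight when a reviewer's delta is exactly 0 are all assumptions made for illustration. Scenario 3 (no review submitted) is not covered, because the script assumes a full grade matrix.

```python
import math


def hamer_weights(reviews):
    """One pass of the weighting steps from the Initial Phase script.

    reviews[r][a] is the grade reviewer r gave to assignment a.
    """
    num_reviewers = len(reviews)
    num_assignments = len(reviews[0])
    # Plain average per assignment (every reviewer starts with equal weight).
    grades = [sum(row[a] for row in reviews) / num_reviewers
              for a in range(num_assignments)]
    # Mean squared deviation of each reviewer from those averages.
    delta_r = [sum((row[a] - grades[a]) ** 2 for a in range(num_assignments)) / num_assignments
               for row in reviews]
    average_delta = sum(delta_r) / num_reviewers
    weights = []
    for delta in delta_r:
        if delta == 0:
            # Assumption: avoid dividing by zero for a reviewer who matches every average.
            weights.append(math.inf)
            continue
        w_prime = average_delta / delta
        weights.append(w_prime if w_prime <= 2 else 2 + math.log(w_prime - 1))
    return weights


honest = [[5, 4, 4, 3, 2], [5, 3, 4, 4, 2]]
all_max = hamer_weights(honest + [[5, 5, 5, 5, 5]])  # scenario 1
all_min = hamer_weights(honest + [[1, 1, 1, 1, 1]])  # scenario 2
print([round(w, 2) for w in all_max])
print([round(w, 2) for w in all_min])
```

In both runs the third reviewer deviates most from the averages and therefore receives the lowest weight of the three.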

== Conclusion ==

As a team, we worked out the algorithms and their application and wrote some test scenarios. However, we did not get the chance to work on the web service itself, since it does not run because of module errors: what we encountered was an "undefined method strip" error in the Reputation Web Service Controller. Although it sometimes works on the Expertiza team's side, we were never able to see the web service working. We therefore created some test scenarios and wrote Python code to simulate the algorithm.
 
In the code written to simulate the hamer.rb algorithm as described in "A Method of Automatic Grade Calibration in Peer Assessment" by John Hamer, Kenneth T.K. Ma, and Hugh H.F. Kwong (https://crpit.scem.westernsydney.edu.au/confpapers/CRPITV42Hamer.pdf), we take a list of reviewers and their grades for each assignment reviewed and compute the associated reputation weights. Since the algorithm described in the paper does not specify an initial weight for first-time reviewers, we gave first-time reviewers an initial weight of 1. The code does not yet take existing reputation weights into account for reviewers who already have them; this will be added soon. We followed the algorithm in the paper to the letter, but even then the output values given in the paper's example did not match what we computed by hand and by code. Either we missed something or the algorithm has been changed. Testing against both Peerlogic and the mock shows that the current web service is not correct, since the returned values do not match the expected values, as can be seen in the picture. This may be the result we were expected to reach in this project.

In addition, we found that reputation_web_service_controller.rb is currently broken and needs refactoring. While the client side of the reputation web service page runs, any attempt to submit grades to the reputation web server results in an error.

==GitHub Links==
Link to Expertiza repository: [https://github.com/expertiza/expertiza here]

Link to the forked repository: [https://github.com/joshlin5/expertiza here]

Link to pull request: [https://github.com/expertiza/expertiza/pull/2357 here]

Link to Github Project page: [https://github.com/joshlin5/expertiza/projects/2 here]

Link to Testing Video: [https://youtu.be/VyeGGpxymXk here]

== References ==

1. Expertiza on GitHub (https://github.com/expertiza/expertiza) 
2. The live Expertiza website (http://expertiza.ncsu.edu/) 
3. Pluggable reputation systems for peer review: A web-service approach (https://doi.org/10.1109/FIE.2015.7344292)

CSC/ECE 517 Spring 2022 - E2212: Testing for hamer.rb

2022-03-28T01:59:56Z

Jlin36:

This page describes the changes made for the Spring 2022 OSS Project E2212: Testing for hamer.rb
== Project Overview ==

=== Introduction ===

Using students' reviews of an assignment to produce a more accurate grade has become increasingly popular among professors and courses in universities. Not only does this method free the professor and TAs from days of work, it also allows students to learn more about an assignment by grading others' work. Unfortunately, many students may not take reviewing others' work seriously and may simply give 100 or 0 to other students. Since such reviews may skew a student's grade, a system that assesses the correctness and credibility of a reviewer is necessary for student reviews to be accurate. The Hamer algorithm was created for this purpose and returns a reputation weight for each reviewer. The instructor can then use the reputation weight either to assess the reliability of a reviewer or to compute a grade for the reviewer.

=== System Design ===

The Hamer algorithm takes a set of grades given to assignments by each reviewer, together with any existing reputation weights for those reviewers (optional), and computes a reputation weight for each reviewer. These reputation weights indicate the accuracy and reliability of each reviewer. For example, a reviewer with a reputation weight of 3.0 is more accurate and reliable in their reviews than a reviewer with a reputation weight of 0.5. The following is an example from the paper [1] that describes the Hamer algorithm:
[[File:Hamer Algorithm Inputs Outputs.png|500px]] 

This algorithm is currently deployed on the following web server: http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms. To use the algorithm, a POST request is sent to the Peerlogic URL with a JSON body listing each assignment, the grades it received, and the reviewer who gave each grade. A JSON response is then sent back with the corresponding Hamer reputation weights. The following is an example of a POST request to the Peerlogic URL that returns Hamer reputation values:
[[File:Reputation web server hamer.png|1000px]] 
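
For readers who cannot view the screenshot, the request could be assembled roughly as below. This is a hedged sketch and not part of the project code: the helper name build_hamer_request is hypothetical, the body shape is copied from the RSpec snippets later on this page, and the request is only constructed here, not sent.

```python
import json
import urllib.request

PEERLOGIC_URL = "http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms"


def build_hamer_request(grades_by_submission, url=PEERLOGIC_URL):
    """Return an unsent POST request that carries the grades as a JSON body."""
    body = json.dumps(grades_by_submission).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_hamer_request({
    "submission9999": {"stu9999": 10, "stu9998": 10, "stu9997": 9, "stu9996": 5},
})
print(request.get_method(), request.full_url)
```

Sending is left to the caller, for example with urllib.request.urlopen(request), after which the response body would be parsed as JSON and the "Hamer" key read.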

=== Objectives ===

* Calculate reputation scores based on paper "Pluggable reputation systems for peer review: A web-service approach"
* Assert the accuracy of the reputation web server's hamer values through the URL http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms.
* Create a mock web server to return the correct hamer values if the reputation web server's hamer algorithm returns incorrect values.

=== Files Involved ===

*reputation_web_server_hamer.rb
*reputation_mock_web_server_hamer.rb

=== Mentor ===

Ed Gehringer, efg@ncsu.edu

=== Team Members ===

* Joshua Lin (jlin36@ncsu.edu)
* Muhammet Mustafa Olmez (molmez@ncsu.edu)
* Soumyadeep Chatterjee (schatte5@ncsu.edu)

== Algorithms ==

Reputation systems may take various factors into account:
* Does a reviewer assign scores that are similar to scores assigned by the instructor (on work that they both grade)?
* Does a reviewer assign scores that match those assigned by other reviewers?
* Does the reviewer assign different scores to different work?
* How competent has the reviewer been on other work done for the class?

Two algorithms are used: the Hamer-peer algorithm has the lowest maximum absolute bias, and the Lauw-peer algorithm has the lowest overall bias. From the instructor’s perspective, this indicates that expert grading may not be necessary for further assignments of this kind. It is observed in the article (https://ieeexplore.ieee.org/abstract/document/7344292) that the overall bias is a little higher, but the maximum absolute bias is very high (more than 20). This indicates that for future similar courses, the instructor can trust most students’ peer grading, but should be aware that students may give inflated grades. Therefore, spot-checking is necessary. However, the overall bias is quite low, as the students gave grades at least 16 points lower than the expert grades. This may be because more training is needed or because the review rubric is inadequate. The results also suggest that for future courses of this kind, the instructor cannot trust the students' grades; expert grades are still necessary.
The main difference between the Hamer-peer and the Lauw-peer algorithm is that the Lauw-peer algorithm keeps track of the reviewer's leniency (“bias”), which can be either positive or negative. A positive leniency indicates the reviewer tends to give higher scores than average. Additionally, the range for Hamer’s algorithm is (0,∞) while for Lauw’s algorithm it is [0,1].

=== Hamer Algorithm ===
[[File:Step1.PNG|400px]]
 
[[File:Step2.PNG|400px]]
 
[[File:Step3.PNG|400px]]
 
[[File:Step4.PNG|400px]]

We implemented the steps of this algorithm for our analytical validation, found in the section below.
== Test Plan - Initial Phase ==
In the initial phase, we were tasked with testing the reputation_web_service_controller. The previous project team's work was hampered because the web service (Peerlogic) was not available at that time.
This time we were able to access the Peerlogic server, although only at a late stage. Our plan at that point was therefore to run a series of unit tests to determine whether the web service was communicating
correctly with Expertiza.

=== Initial Testing Plan ===
1. Since our focus in this phase was to conduct exploratory testing of the system, we wrote some conventional tests to examine Peerlogic functionality. At this stage, we realized that Peerlogic
would only accept and respond with JSON data.

2. Therefore, a natural next step was to prepare input data that simulated a general input scenario for the system, as follows:
::a. Each reviewer has assigned scores to 3 reviewees (fellow students)
::b. There are a total of 3 reviewers, who have all graded each other in some fashion for 5 assignments
::c. Convert this scenario to JSON
::d. Write code to POST this to Peerlogic, and receive a response
::e. Parse this response to obtain the output values of the Hamer algorithm, as calculated by Peerlogic
::f. Compare this output against the values that we calculated ourselves from the research paper for the Hamer algorithm
'''''The code for the last step is shown below'''''
<pre>
import math

# Parameters: reviews list
# reviews list - a list of each reviewer's grades for each assignment
# Example:
# reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]
# Corresponding reviewer and grade for each assignment table
# Essay Reviewer1 Reviewer2 Reviewer3
# Assignment1 5 5 4
# Assignment2 4 3 3
# Assignment3 4 4 4
# Assignment4 3 4 3
# Assignment5 2 2 2

# Reviewer's grades given to each assignment 2D array
# Each index of reviews is a reviewer. Each index in reviews[i] is a review grade
reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]

# Number of reviewers
numReviewers = len(reviews)
# Number of assignments
numAssig = len(reviews[0])
# Initial empty grades for each assignment array
grades = []
# Initial empty delta R array
deltaR = []
# Weight prime
weightPrime = []
# Reviewer's reputation weight
weight= []

# Calculating the average grade for each assignment (all reviewers weighted equally)
for numAssigIndex in range(numAssig):
    assignmentGradeAverage = 0
    for numReviewerIndex in range(numReviewers):
        assignmentGradeAverage += reviews[numReviewerIndex][numAssigIndex]
    grades.append(assignmentGradeAverage/numReviewers)
print("Average Grades:")
print(grades)

# Calculating delta R: each reviewer's mean squared deviation from the average grades
for numReviewerIndex in range(numReviewers):
    reviewerDeltaR = 0
    assignmentAverageGradeIndex = 0
    for reviewGrade in reviews[numReviewerIndex]:
        reviewerDeltaR += ((reviewGrade - grades[assignmentAverageGradeIndex]) ** 2)
        assignmentAverageGradeIndex += 1
    reviewerDeltaR /= numAssig
    deltaR.append(reviewerDeltaR)
print("deltaR:")
print(deltaR)

# Calculating the average delta R across all reviewers
averageDeltaR = 0
for reviewerDeltaR in deltaR:
    averageDeltaR += reviewerDeltaR
averageDeltaR /= numReviewers
print("averageDeltaR:")
print(averageDeltaR)

# Calculating weight prime (raises ZeroDivisionError if a reviewer's delta R is exactly 0)
for reviewerDeltaR in deltaR:
    weightPrime.append(averageDeltaR/reviewerDeltaR)
print("weightPrime:")
print(weightPrime)

# Calculating reputation weight: weight prime is kept up to 2 and damped logarithmically above 2
for reviewerWeightPrime in weightPrime:
    if reviewerWeightPrime <= 2:
        weight.append(reviewerWeightPrime)
    else:
        weight.append(2 + math.log(reviewerWeightPrime - 1))
print("reputation per reviewer:")
i = 1
for reviewerWeight in weight:
    print("Reputation of Reviewer ", i)
    print(round(reviewerWeight,1))
    i += 1
</pre>

=== Initial Testing Conclusions ===
The results that are '''''actually''''' received from Peerlogic are presented below:
 
[[File:json-ss.jpeg|600px]]

The results from our Python re-creation of the Hamer algorithm are as follows:
[[File:Reputation web server hamer2.png|250px]] 

As you can see, the results returned by Peerlogic do NOT match the expected results.
'''Therefore, our first conclusion is that the PeerLogic Webservice is implemented incorrectly.'''
This has been documented in the Conclusion section as the first point.
== Changes to Project Scope ==

=== Test Plan - Second Phase ===
We followed the testing approach '''recommended by Dr. Gehringer''', our project mentor:
In testing this service, we used an external program to send requests to a simulated service, and inspected the returned data.
We chose this approach because the program under test was unfortunately not running and could not be inspected directly.
Finally, we also added a test case that ''mocks'' a web service and asserts the output, done in 2 ways:
::a. In the first code snippet, we send JSON to a web service that '''returns the correct Hamer output''', as Peerlogic should when fixed.
::b. In the form of an RSpec mock, in the second snippet below.

The test below sends real JSON to both Peerlogic and the mock. The Peerlogic endpoint is http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms

Testing against both Peerlogic and the mock shows that the current web service is not correct, since the values it returns do not match the expected values, as can be seen in the picture.
This is the result we were expected to establish in this project.
 
Proof of working:
 
[[File:Response_json_expected.jpeg]]

Test Code Snippet:

<pre>
require "net/http"
require "json"
# The following contains 4 reviewers who have scored 4 submissions
# Input visualization: grade given by each reviewer (column) to each submission (row)
# Submission      stu9999  stu9998  stu9997  stu9996
# submission9999       10       10        9        5
# submission9998        3        2        4        5
# submission9997        7        4        5        5
# submission9996        6        4        5        5

INPUTS = {
  "submission9999": {
    "stu9999": 10,
    "stu9998": 10,
    "stu9997": 9,
    "stu9996": 5
  },
  "submission9998": {
    "stu9999": 3,
    "stu9998": 2,
    "stu9997": 4,
    "stu9996": 5
  },
  "submission9997": {
    "stu9999": 7,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  },
  "submission9996": {
    "stu9999": 6,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  }
}.to_json

# Values that would be returned by a correct Hamer implementation
EXPECTED = {
  "Hamer": {
    "9996": 0.6,
    "9997": 3.6,
    "9998": 1.1,
    "9999": 1.1
  }
}.to_json

# Sends API request to Peerlogic
describe "Expertiza" do
  it "should return the correct Hamer calculation" do
    uri = URI('http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms')
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port) do |http|
      http.request(req)
    end

    # Assertion fails, as expected
    expect(JSON.parse(response.body)["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

# Sends API request to the mock Hamer/Peerlogic server
describe "Expertiza Web Service" do
  it "should return the correct Hamer calculation" do
    uri = URI('https://4dfaead4-a747-4be4-8683-3b10d1d2e0c0.mock.pstmn.io/reputation_web_service/default')
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port, :use_ssl => uri.scheme == 'https') do |http|
      http.request(req)
    end
    expect(JSON.parse("#{response.body}}")["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

</pre>

In addition, this plan enables us to test the current functionality by treating this system as a black box, and is able to provide conclusions on
the accuracy of the implementation as a whole.

Therefore, in the section below, we have provided code that showcases this plan in action. The values returned by the algorithm are to be inspected both by code and by hand.

<pre>
require "net/http"
require "json"
require "webmock/rspec" # gem install webmock -v 2.2.0

WebMock.disable_net_connect!(allow_localhost: true)

INPUTS = {
  "submission9999": {
    "stu9999": 10,
    "stu9998": 10,
    "stu9997": 9,
    "stu9996": 5
  },
  "submission9998": {
    "stu9999": 3,
    "stu9998": 2,
    "stu9997": 4,
    "stu9996": 5
  },
  "submission9997": {
    "stu9999": 7,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  },
  "submission9996": {
    "stu9999": 6,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  }
}.to_json

EXPECTED = {
  "Hamer": {
    "9996": 0.6,
    "9997": 3.6,
    "9998": 1.1,
    "9999": 1.1
  }
}.to_json

# The request is intercepted by the stub, so no network call is made
describe "Expertiza" do
  before(:each) do
    stub_request(:post, /peerlogic.csc.ncsu.edu/).
      to_return(status: 200, body: EXPECTED, headers: {})
  end
  it "should return the correct Hamer calculation" do
    uri = URI('http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms')

    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port, :use_ssl => uri.scheme == 'https') do |http|
      http.request(req)
    end

    expect(JSON.parse(response.body)["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

</pre>
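
When the returned values are inspected by code, a per-reviewer report can be easier to read than a single failed equality. The following is a minimal sketch of such a check, not part of the project: the helper name hamer_mismatches, the tolerance of 0.05, and the sample response are all assumptions, with the sample values made up purely to show the report format. The expected values are the ones used in the specs above.

```python
import math

# Expected Hamer values as listed in the RSpec snippets above.
EXPECTED_HAMER = {"9996": 0.6, "9997": 3.6, "9998": 1.1, "9999": 1.1}


def hamer_mismatches(actual, expected=EXPECTED_HAMER, tol=0.05):
    """Return {reviewer: (actual, expected)} for every value outside tol."""
    mismatches = {}
    for reviewer, want in expected.items():
        got = actual.get(reviewer)
        if got is None or not math.isclose(got, want, abs_tol=tol):
            mismatches[reviewer] = (got, want)
    return mismatches


# Hypothetical response body, not real Peerlogic output.
sample_response = {"Hamer": {"9996": 0.6, "9997": 2.0, "9998": 1.1, "9999": 1.1}}
report = hamer_mismatches(sample_response["Hamer"])
print(report)
```

An empty report means every reviewer's value is within the tolerance.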

== Edge Cases & Scenarios ==
''We present these scenarios as possible test cases for an accurately working Peerlogic webservice.''
1) Reviewer gives all max scores 
2) Reviewer gives all min scores 
3) Reviewer completes no review 
alternative scenario - reviewer gives max scores even if no inputs

These have not been implemented as there is no point in testing a system further when positive flows do not work.
However, the code in the Initial Phase Section can be used to analytically calculate correct responses for future assertions.
We have provided outputs to these scenarios below:

== Conclusion ==

As a team, we worked out the algorithms and their application and wrote some test scenarios. However, we did not get the chance to work on the web service itself, since it does not run because of module errors: what we encountered was an "undefined method strip" error in the Reputation Web Service Controller. Although it sometimes works on the Expertiza team's side, we were never able to see the web service working. We therefore created some test scenarios and wrote Python code to simulate the algorithm.
 
In the code written to simulate the hamer.rb algorithm as described in "A Method of Automatic Grade Calibration in Peer Assessment" by John Hamer, Kenneth T.K. Ma, and Hugh H.F. Kwong (https://crpit.scem.westernsydney.edu.au/confpapers/CRPITV42Hamer.pdf), we take a list of reviewers and their grades for each assignment reviewed and compute the associated reputation weights. Since the algorithm described in the paper does not specify an initial weight for first-time reviewers, we gave first-time reviewers an initial weight of 1. The code does not yet take existing reputation weights into account for reviewers who already have them; this will be added soon. We followed the algorithm in the paper to the letter, but even then the output values given in the paper's example did not match what we computed by hand and by code. Either we missed something or the algorithm has been changed. Testing against both Peerlogic and the mock shows that the current web service is not correct, since the returned values do not match the expected values, as can be seen in the picture. This may be the result we were expected to reach in this project.

In addition, we found that reputation_web_service_controller.rb is currently broken and needs refactoring. While the client side of the reputation web service page runs, any attempt to submit grades to the reputation web server results in an error.

==GitHub Links==
Link to Expertiza repository: [https://github.com/expertiza/expertiza here]

Link to the forked repository: [https://github.com/joshlin5/expertiza here]

Link to pull request: [https://github.com/expertiza/expertiza/pull/2355/checks here]

Link to Github Project page: [https://github.com/joshlin5/expertiza/projects/2 here]

Link to Testing Video: [https://youtu.be/VyeGGpxymXk here]

== References ==

1. Expertiza on GitHub (https://github.com/expertiza/expertiza) 
2. The live Expertiza website (http://expertiza.ncsu.edu/) 
3. Pluggable reputation systems for peer review: A web-service approach (https://doi.org/10.1109/FIE.2015.7344292)

CSC/ECE 517 Spring 2022 - E2212: Testing for hamer.rb

2022-03-28T01:50:18Z

Jlin36:

This page describes the changes made for the Spring 2022 OSS Project E2212: Testing for hamer.rb
== Project Overview ==

=== Introduction ===

Using students' reviews of an assignment to produce a more accurate grade has become increasingly popular among professors and courses in universities. Not only does this method free the professor and TAs from days of work, it also allows students to learn more about an assignment by grading others' work. Unfortunately, many students may not take reviewing others' work seriously and may simply give 100 or 0 to other students. Since such reviews may skew a student's grade, a system that assesses the correctness and credibility of a reviewer is necessary for student reviews to be accurate. The Hamer algorithm was created for this purpose and returns a reputation weight for each reviewer. The instructor can then use the reputation weight either to assess the reliability of a reviewer or to compute a grade for the reviewer.

=== System Design ===

The Hamer algorithm takes a set of grades given to assignments by each reviewer, together with any existing reputation weights for those reviewers (optional), and computes a reputation weight for each reviewer. These reputation weights indicate the accuracy and reliability of each reviewer. For example, a reviewer with a reputation weight of 3.0 is more accurate and reliable in their reviews than a reviewer with a reputation weight of 0.5. The following is an example from the paper [1] that describes the Hamer algorithm:
[[File:Hamer Algorithm Inputs Outputs.png|500px]] 

This algorithm is currently deployed on the following web server: http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms. To use the algorithm, a POST request is sent to the Peerlogic URL with a JSON body listing each assignment, the grades it received, and the reviewer who gave each grade. A JSON response is then sent back with the corresponding Hamer reputation weights. The following is an example of a POST request to the Peerlogic URL that returns Hamer reputation values:
[[File:Reputation web server hamer.png|1000px]] 

=== Objectives ===

* Calculate reputation scores based on paper "Pluggable reputation systems for peer review: A web-service approach"
* Assert the accuracy of the reputation web server's hamer values through the URL http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms.
* Create a mock web server to return the correct hamer values if the reputation web server's hamer algorithm returns incorrect values.

=== Files Involved ===

*reputation_web_server_hamer.rb
*reputation_mock_web_server_hamer.rb

=== Mentor ===

Ed Gehringer, efg@ncsu.edu

=== Team Members ===

* Joshua Lin (jlin36@ncsu.edu)
* Muhammet Mustafa Olmez (molmez@ncsu.edu)
* Soumyadeep Chatterjee (schatte5@ncsu.edu)

== Algorithms ==

Reputation systems may take various factors into account:
* Does a reviewer assign scores that are similar to scores assigned by the instructor (on work that they both grade)?
* Does a reviewer assign scores that match those assigned by other reviewers?
* Does the reviewer assign different scores to different work?
* How competent has the reviewer been on other work done for the class?

Two algorithms are used: the Hamer-peer algorithm has the lowest maximum absolute bias, and the Lauw-peer algorithm has the lowest overall bias. This indicates that, from the instructor's perspective, expert grading may not be necessary for further assignments of this kind. It is observed in the article (https://ieeexplore.ieee.org/abstract/document/7344292) that the overall bias is a little bit higher, but the maximum absolute bias is very high (more than 20). This indicates that for future similar courses, the instructor can trust most students' peer grading, but should be aware that the students may give inflated grades. Therefore, spot-checking is necessary. However, overall bias is quite low, as the students gave grades at least 16 points lower than expert grades. This may be because either more training is needed or the review rubric is inadequate. The results also suggest that for future courses of this kind, the instructor cannot trust the students' grades; expert grades are still necessary.
The main difference between the Hamer-peer and Lauw-peer algorithms is that the Lauw-peer algorithm keeps track of the reviewer's leniency ("bias"), which can be either positive or negative. A positive leniency indicates that the reviewer tends to give higher scores than average. Additionally, the range for Hamer's algorithm is (0,∞), while for Lauw's algorithm it is [0,1].

=== Hamer Algorithm ===
[[File:Step1.PNG|400px]]
 
[[File:Step2.PNG|400px]]
 
[[File:Step3.PNG|400px]]
 
[[File:Step4.PNG|400px]]

We implemented the steps of this algorithm for our analytical validation, found in the section below.
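
For reference, the computation performed by the Python script in the next section can be written out as follows. This is a transcription of that script, not of the pictured steps themselves; the symbols are our own notation, with <code>g_ij</code> the grade reviewer <code>i</code> gives assignment <code>j</code>, <code>m</code> reviewers and <code>n</code> assignments. As in the script, prior reputation weights are not used, so the first step is a plain mean.

```latex
\begin{align}
G_j &= \frac{1}{m} \sum_{i=1}^{m} g_{ij}
  && \text{average grade of assignment } j \\
\delta_i &= \frac{1}{n} \sum_{j=1}^{n} \left( g_{ij} - G_j \right)^2
  && \text{mean squared deviation of reviewer } i \\
w'_i &= \frac{\bar{\delta}}{\delta_i},
  \qquad \bar{\delta} = \frac{1}{m} \sum_{i=1}^{m} \delta_i \\
w_i &= \begin{cases}
  w'_i & \text{if } w'_i \le 2 \\
  2 + \ln\left( w'_i - 1 \right) & \text{if } w'_i > 2
\end{cases}
\end{align}
```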
== Test Plan - Initial Phase ==
In the initial phase, we were tasked with testing the reputation_web_service_controller. The work of a previous project team had been hampered because the web service (Peerlogic) was unavailable at the time.
This time, we gained access to the Peerlogic server at a late stage. Our plan at that point was therefore to run a series of unit tests to determine whether the web service was communicating correctly with Expertiza.

=== Initial Testing Plan ===
1. Since our focus in this phase was to conduct exploratory testing of the system, we wrote some conventional tests to examine Peerlogic functionality. At this stage, we realized that Peerlogic
would only accept and respond with JSON data.

2. Therefore, a natural next step was to prepare input data that simulated a general input scenario for the system, comprising:
::a. Each reviewer has assigned scores to 3 reviewees (fellow students)
::b. There are a total of 3 reviewers, who have all graded each other in some fashion for 5 assignments
::c. Convert this scenario to JSON
::d. Write code to POST this to Peerlogic, and receive a response
::e. Parse this response to obtain the output values of the Hamer algorithm, as calculated by Peerlogic
::f. Compare this output against the values that we calculated ourselves based on the research paper for the Hamer algorithm
'''''The code for the last step is shown below'''''
<pre>
import math

# Parameters: reviews list
# reviews list - a list of each reviewer's grades for each assignment
# Example:
# reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]
# Corresponding reviewer and grade for each assignment table
# Essay Reviewer1 Reviewer2 Reviewer3
# Assignment1 5 5 4
# Assignment2 4 3 3
# Assignment3 4 4 4
# Assignment4 3 4 3
# Assignment5 2 2 2

# Reviewers' grades given to each assignment, as a 2D array
# Each index of reviews is a reviewer. Each index in reviews[i] is a review grade
reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]

# Number of reviewers
numReviewers = len(reviews)
# Number of assignments
numAssig = len(reviews[0])
# Initial empty grades for each assignment array
grades = []
# Initial empty delta R array
deltaR = []
# Weight prime
weightPrime = []
# Reviewer's reputation weight
weight= []

# Calculating the (unweighted) average grade per assignment
for numAssigIndex in range(numAssig):
    assignmentGradeAverage = 0
    for numReviewerIndex in range(numReviewers):
        assignmentGradeAverage += reviews[numReviewerIndex][numAssigIndex]
    grades.append(assignmentGradeAverage/numReviewers)
print("Average Grades:")
print(grades)

# Calculating delta R (mean squared deviation from the average, per reviewer)
for numReviewerIndex in range(numReviewers):
    reviewerDeltaR = 0
    assignmentAverageGradeIndex = 0
    for reviewGrade in reviews[numReviewerIndex]:
        reviewerDeltaR += ((reviewGrade - grades[assignmentAverageGradeIndex]) ** 2)
        assignmentAverageGradeIndex += 1
    reviewerDeltaR /= numAssig
    deltaR.append(reviewerDeltaR)
print("deltaR:")
print(deltaR)

# Calculating the average delta R over all reviewers
averageDeltaR = 0
for reviewerDeltaR in deltaR:
    averageDeltaR += reviewerDeltaR
averageDeltaR /= numReviewers
print("averageDeltaR:")
print(averageDeltaR)

# Calculating weight prime
for reviewerDeltaR in deltaR:
    weightPrime.append(averageDeltaR/reviewerDeltaR)
print("weightPrime:")
print(weightPrime)

# Calculating reputation weight (weights above 2 are dampened logarithmically)
for reviewerWeightPrime in weightPrime:
    if reviewerWeightPrime <= 2:
        weight.append(reviewerWeightPrime)
    else:
        weight.append(2 + math.log(reviewerWeightPrime - 1))
print("reputation per reviewer:")
i = 1
for reviewerWeight in weight:
    print("Reputation of Reviewer ", i)
    print(round(reviewerWeight,1))
    i += 1
</pre>
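
Step c of the plan above (converting the scenario to JSON) could be sketched as follows. The function name <code>reviews_to_payload</code> and the <code>stuN</code>/<code>submissionN</code> key numbering are hypothetical; the nesting (submission, then reviewer, then grade) is assumed from the test inputs used later on this page.

```python
import json

def reviews_to_payload(reviews, first_id=1):
    """Map reviews[reviewer][assignment] to {submission: {reviewer: grade}}."""
    payload = {}
    for reviewer_index, reviewer_grades in enumerate(reviews):
        reviewer_key = f"stu{first_id + reviewer_index}"
        for assignment_index, grade in enumerate(reviewer_grades):
            submission_key = f"submission{first_id + assignment_index}"
            payload.setdefault(submission_key, {})[reviewer_key] = grade
    return payload

# Same matrix as in the script above: one row per reviewer.
reviews = [[5, 4, 4, 3, 2], [5, 3, 4, 4, 2], [4, 3, 4, 3, 2]]
payload = reviews_to_payload(reviews)
print(json.dumps(payload, indent=2))
```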

=== Initial Testing Conclusions ===
The results that are '''''actually''''' received from Peerlogic are presented below:
 
[[File:json-ss.jpeg|600px]]

The results from our Python re-creation of the Hamer algorithm are as follows:
[[File:Reputation web server hamer2.png|250px]] 

As shown, the values returned by Peerlogic do NOT match the expected results.
'''Therefore, our first conclusion is that the PeerLogic Webservice is implemented incorrectly.'''
This has been documented in the Conclusion section as the first point.
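
A comparison of this kind can also be automated. The sketch below reports every reviewer whose returned value differs from the expected one by more than a tolerance; the function name, the tolerance, and the sample numbers are hypothetical and are not results obtained from Peerlogic.

```python
def find_mismatches(expected, actual, tolerance=0.05):
    """Return {reviewer: (expected, actual)} where values differ or are missing."""
    mismatches = {}
    for reviewer, expected_value in expected.items():
        actual_value = actual.get(reviewer)
        if actual_value is None or abs(actual_value - expected_value) > tolerance:
            mismatches[reviewer] = (expected_value, actual_value)
    return mismatches

# Hypothetical values, for illustration only.
expected = {"r1": 1.0, "r2": 2.5}
actual = {"r1": 1.0, "r2": 0.5}
print(find_mismatches(expected, actual))
```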
== Changes to Project Scope ==

=== Test Plan - Second Phase ===
We followed the testing thought process recommended by Dr. Gehringer:
In testing this service, we used an external program to send requests to a simulated service, and inspected the returned data.
This decision was reached because the program under test was unfortunately not running and could not be inspected in an ideal manner.
Additionally, this method was '''recommended by Dr. Gehringer''', our project mentor.
Finally, we also added a test case that ''mocks'' a webservice and asserts the output, done in 2 ways:
::a. In the first code snippet, we send JSON to a webservice that '''returns the correct Hamer output''', as Peerlogic should when fixed.
::b. In the form of an RSpec mock, in the second snippet below.

The test below sends real JSON to both peerlogic and mock. http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms

Our tests against both Peerlogic and the mock show that the current web service is not correct, since the values it returns do not match the expected values, as can be seen in the picture.
This is the result we were expected to reach in this project.
 
Proof of working:
 
[[File:Response_json_expected.jpeg]]

Test Code Snippet:

<pre>
require "net/http"
require "json"
# The following contains 4 reviewers who have each scored 4 submissions
# Input visualization:
# Grade given by each reviewer to each submission
# Reviewer->      stu9999 stu9998 stu9997 stu9996
# submission9999  10      10      9       5
# submission9998  3       2       4       5
# submission9997  7       4       5       5
# submission9996  6       4       5       5

INPUTS = {
"submission9999": {
"stu9999": 10,
"stu9998": 10,
"stu9997": 9,
"stu9996": 5
},
"submission9998": {
"stu9999": 3,
"stu9998": 2,
"stu9997": 4,
"stu9996": 5
},
"submission9997": {
"stu9999": 7,
"stu9998": 4,
"stu9997": 5,
"stu9996": 5
},
"submission9996": {
"stu9999": 6,
"stu9998": 4,
"stu9997": 5,
"stu9996": 5
}
}.to_json
#
#Values that would be returned by a correct Hamer implementation
EXPECTED = {
"Hamer": {
"9996": 0.6,
"9997": 3.6,
"9998": 1.1,
"9999": 1.1
}
}.to_json

#sends API request to Peerlogic
describe "Expertiza" do
it "should return the correct Hamer calculation" do
uri = URI('http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms')
req = Net::HTTP::Post.new(uri)
req.content_type = 'application/json'
req.body = INPUTS

response = Net::HTTP.start(uri.hostname, uri.port) do |http|
http.request(req)
end

#assertion fails, as expected
expect(JSON.parse(response.body)["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
end
end

#sends API request to Mock Hamer/Peerlogic Server
describe "Expertiza Web Service" do
it "should return the correct Hamer calculation" do
uri = URI('https://4dfaead4-a747-4be4-8683-3b10d1d2e0c0.mock.pstmn.io/reputation_web_service/default')
req = Net::HTTP::Post.new(uri)
req.content_type = 'application/json'
req.body = INPUTS

response = Net::HTTP.start(uri.hostname, uri.port, :use_ssl => uri.scheme == 'https') do |http|
http.request(req)
end
expect(JSON.parse("#{response.body}}")["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
end
end

</pre>

In addition, this plan enables us to test the current functionality by treating this system as a black box, and is able to provide conclusions on
the accuracy of the implementation as a whole.

Therefore, in the section below, we have provided code that showcases this plan in action. The values returned by the algorithm are to be inspected both by code and by hand.

<pre>
require "net/http"
require "json"
require "webmock/rspec" # gem install webmock -v 2.2.0

WebMock.disable_net_connect!(allow_localhost: true)

INPUTS = {
"submission9999": {
"stu9999": 10,
"stu9998": 10,
"stu9997": 9,
"stu9996": 5
},
"submission9998": {
"stu9999": 3,
"stu9998": 2,
"stu9997": 4,
"stu9996": 5
},
"submission9997": {
"stu9999": 7,
"stu9998": 4,
"stu9997": 5,
"stu9996": 5
},
"submission9996": {
"stu9999": 6,
"stu9998": 4,
"stu9997": 5,
"stu9996": 5
}
}.to_json

EXPECTED = {
"Hamer": {
"9996": 0.6,
"9997": 3.6,
"9998": 1.1,
"9999": 1.1
}
}.to_json

describe "Expertiza" do
before(:each) do
stub_request(:post, /peerlogic.csc.ncsu.edu/).
to_return(status: 200, body: EXPECTED, headers: {})
end
it "should return the correct Hamer calculation" do
uri = URI('http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms')

req = Net::HTTP::Post.new(uri)
req.content_type = 'application/json'
req.body = INPUTS

response = Net::HTTP.start(uri.hostname, uri.port, :use_ssl => uri.scheme == 'https') do |http|
http.request(req)
end

expect(JSON.parse(response.body)["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
end
end

</pre>
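
The same stubbing idea can be sketched in Python with the standard library, which may help when checking values alongside the analytical script. This is an illustrative parallel, not part of the Expertiza test suite: <code>fetch_hamer</code> and <code>fake_urlopen</code> are hypothetical names, and the canned body reuses the expected Hamer values from the RSpec tests above.

```python
import io
import json
import urllib.request
from unittest import mock

# Expected Hamer values copied from the RSpec tests above.
EXPECTED = {"Hamer": {"9996": 0.6, "9997": 3.6, "9998": 1.1, "9999": 1.1}}
URL = "http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms"

def fetch_hamer(payload, url=URL):
    """POST the payload as JSON and return the "Hamer" part of the reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))["Hamer"]

def fake_urlopen(request, *args, **kwargs):
    """Stand-in for urlopen: returns the canned body without any network I/O."""
    return io.BytesIO(json.dumps(EXPECTED).encode("utf-8"))

# Hypothetical one-submission payload; the stub ignores its contents.
payload = {"submission9999": {"stu9999": 10, "stu9998": 10}}
with mock.patch("urllib.request.urlopen", side_effect=fake_urlopen):
    hamer_values = fetch_hamer(payload)
print(hamer_values)
```

As with the WebMock version, this only verifies the client-side handling of a correct response; it says nothing about what the real service returns.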

== Edge Cases & Scenarios ==
''We present these scenarios as possible test cases for an accurately working Peerlogic webservice.''
1) Reviewer gives all max scores 
2) Reviewer gives all min scores 
3) Reviewer completes no review 
alternative scenario - reviewer gives max scores even if no inputs
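
As a minimal sketch, inputs for these scenarios could be generated by appending one extra reviewer to the review matrix used in the Initial Phase script. The 1 to 5 grade scale, the helper names, and the use of <code>None</code> for a missing review are all assumptions made for illustration.

```python
# Assumed grade scale; the page does not state one.
MIN_GRADE, MAX_GRADE = 1, 5

def add_constant_reviewer(reviews, grade):
    """Return a copy of reviews plus one reviewer who gives `grade` everywhere."""
    num_assignments = len(reviews[0])
    return [list(row) for row in reviews] + [[grade] * num_assignments]

def add_absent_reviewer(reviews):
    """Return a copy of reviews plus one reviewer with no grades (None = missing)."""
    num_assignments = len(reviews[0])
    return [list(row) for row in reviews] + [[None] * num_assignments]

# Review matrix from the Initial Phase script: one row per reviewer.
base = [[5, 4, 4, 3, 2], [5, 3, 4, 4, 2], [4, 3, 4, 3, 2]]
scenarios = {
    "all_max": add_constant_reviewer(base, MAX_GRADE),
    "all_min": add_constant_reviewer(base, MIN_GRADE),
    "no_review": add_absent_reviewer(base),
}
for name, matrix in scenarios.items():
    print(name, matrix[-1])
```

The Initial Phase script would need to skip <code>None</code> entries before it could process the no-review scenario.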

These have not been implemented as there is no point in testing a system further when positive flows do not work.
However, the code in the Initial Phase Section can be used to analytically calculate correct responses for future assertions.
We have provided outputs to these scenarios below:

== Conclusion ==

As a team, we worked out the algorithms and their applications and wrote some test scenarios. However, we did not get a chance to work on the web service, since it does not work due to module errors: we encountered an "undefined method strip" error in the Reputation Web Service Controller. Although it sometimes works on the Expertiza team's side, we were not able to see the web service working. We therefore created test scenarios and wrote Python code to simulate the algorithm.
 
The code segment written to simulate the hamer.rb algorithm follows "A Method of Automatic Grade Calibration in Peer Assessment" by John Hamer, Kenneth T.K. Ma, and Hugh H.F. Kwong (https://crpit.scem.westernsydney.edu.au/confpapers/CRPITV42Hamer.pdf). It takes a list of reviewers and their grades for each assignment reviewed and computes the associated reputation weights. Since the algorithm described in the paper does not specify an initial weight for first-time reviewers, we coded it so that first-time reviewers start with a weight of 1. The code does not yet take existing reputation weights into account for reviewers who already have them; this will be added soon. We followed the algorithm in the paper to the letter, but even then the output values given in the paper's example did not match what we computed by hand and by code. Either we missed something or the algorithm has been changed. Our tests against Peerlogic and the mock show that the current web service is not correct, since the returned values do not match the expected values, as can be seen in the picture. This may be the result we were expected to reach in this project.

==GitHub Links==
Link to Expertiza repository: [https://github.com/expertiza/expertiza here]

Link to the forked repository: [https://github.com/joshlin5/expertiza here]

Link to pull request: [https://github.com/expertiza/expertiza/pull/2355/checks here]

Link to Github Project page: [https://github.com/joshlin5/expertiza/projects/2 here]

Link to Testing Video: [https://youtu.be/VyeGGpxymXk here]

== References ==

1. Expertiza on GitHub (https://github.com/expertiza/expertiza) 
2. The live Expertiza website (http://expertiza.ncsu.edu/) 
3. Pluggable reputation systems for peer review: A web-service approach (https://doi.org/10.1109/FIE.2015.7344292)

CSC/ECE 517 Spring 2022 - E2212: Testing for hamer.rb

2022-03-28T01:38:36Z

Jlin36:

This page describes the changes made for the Spring 2022 OSS Project E2212: Testing for hamer.rb
== Project Overview ==

=== Introduction ===

Using students' peer reviews to help determine assignment grades has become increasingly popular in university courses. This method not only frees the professor and TAs from days of grading, but also lets students learn more about an assignment by grading others' work. Unfortunately, some students do not take reviewing seriously and may simply give 100 or 0 to other students. Since such reviews can skew a student's grade, a system that assesses the correctness and credibility of each reviewer is necessary for peer-review grades to be accurate. The Hamer algorithm was created for this purpose and returns a reputation weight for each reviewer. The instructor can then use the reputation weight either to assess a reviewer's reliability or to compute a grade for the reviewer.

=== System Design ===

The Hamer algorithm takes a set of assignment grades given by each reviewer, optionally together with any existing reputation weights for those reviewers, and computes a reputation weight for each reviewer. These weights indicate the accuracy and reliability of each reviewer. For example, a reviewer with a reputation weight of 3.0 is more accurate and reliable than a reviewer with a weight of 0.5. The following example is from the paper [1] that describes the Hamer algorithm:
[[File:Hamer Algorithm Inputs Outputs.png|500px]] 

This algorithm is currently deployed on the following web server: http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms. To use the algorithm, a POST request is sent to the Peerlogic URL with a JSON body listing each assignment, its grades, and the reviewer who gave each grade. The server replies with a JSON response containing the corresponding Hamer reputation weights. The following is an example of a POST request to the Peerlogic URL that returns Hamer reputation values:
[[File:Reputation web server hamer.png|1000px]] 

=== Objectives ===

* Calculate reputation scores based on paper "Pluggable reputation systems for peer review: A web-service approach"
* Assert the accuracy of the reputation web server's hamer values through the URL http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms.
* Create a mock web server to return the correct hamer values if the reputation web server's hamer algorithm returns incorrect values.

=== Files Involved ===

*reputation_web_server_hamer.rb
*reputation_mock_web_server_hamer.rb

=== Mentor ===

Ed Gehringer, efg@ncsu.edu

=== Team Members ===

* Joshua Lin (jlin36@ncsu.edu)
* Muhammet Mustafa Olmez (molmez@ncsu.edu)
* Soumyadeep Chatterjee (schatte5@ncsu.edu)

== Algorithms ==

Reputation systems may take various factors into account:
* Does a reviewer assign scores that are similar to scores assigned by the instructor (on work that they both grade)?
* Does a reviewer assign scores that match those assigned by other reviewers?
* Does the reviewer assign different scores to different work?
* How competent has the reviewer been on other work done for the class?

Two algorithms are used: the Hamer-peer algorithm has the lowest maximum absolute bias, and the Lauw-peer algorithm has the lowest overall bias. This indicates that, from the instructor's perspective, expert grading may not be necessary for further assignments of this kind. It is observed in the article (https://ieeexplore.ieee.org/abstract/document/7344292) that the overall bias is a little bit higher, but the maximum absolute bias is very high (more than 20). This indicates that for future similar courses, the instructor can trust most students' peer grading, but should be aware that the students may give inflated grades. Therefore, spot-checking is necessary. However, overall bias is quite low, as the students gave grades at least 16 points lower than expert grades. This may be because either more training is needed or the review rubric is inadequate. The results also suggest that for future courses of this kind, the instructor cannot trust the students' grades; expert grades are still necessary.
The main difference between the Hamer-peer and Lauw-peer algorithms is that the Lauw-peer algorithm keeps track of the reviewer's leniency ("bias"), which can be either positive or negative. A positive leniency indicates that the reviewer tends to give higher scores than average. Additionally, the range for Hamer's algorithm is (0,∞), while for Lauw's algorithm it is [0,1].

=== Hamer Algorithm ===
[[File:Step1.PNG|400px]]
 
[[File:Step2.PNG|400px]]
 
[[File:Step3.PNG|400px]]
 
[[File:Step4.PNG|400px]]

We implemented the steps of this algorithm for our analytical validation, found in the section below.
== Test Plan - Initial Phase ==
In the initial phase, we were tasked with testing the reputation_web_service_controller. The work of a previous project team had been hampered because the web service (Peerlogic) was unavailable at the time.
This time, we gained access to the Peerlogic server at a late stage. Our plan at that point was therefore to run a series of unit tests to determine whether the web service was communicating correctly with Expertiza.

=== Initial Testing Plan ===
1. Since our focus in this phase was to conduct exploratory testing of the system, we wrote some conventional tests to examine Peerlogic functionality. At this stage, we realized that Peerlogic
would only accept and respond with JSON data.

2. Therefore, a natural next step was to prepare input data that simulated a general input scenario for the system, comprising:
::a. Each reviewer has assigned scores to 3 reviewees (fellow students)
::b. There are a total of 3 reviewers, who have all graded each other in some fashion for 5 assignments
::c. Convert this scenario to JSON
::d. Write code to POST this to Peerlogic, and receive a response
::e. Parse this response to obtain the output values of the Hamer algorithm, as calculated by Peerlogic
::f. Compare this output against the values that we calculated ourselves based on the research paper for the Hamer algorithm
'''''The code for the last step is shown below'''''
<pre>
import math

# Parameters: reviews list
# reviews list - a list of each reviewer's grades for each assignment
# Example:
# reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]
# Corresponding reviewer and grade for each assignment table
# Essay Reviewer1 Reviewer2 Reviewer3
# Assignment1 5 5 4
# Assignment2 4 3 3
# Assignment3 4 4 4
# Assignment4 3 4 3
# Assignment5 2 2 2

# Reviewers' grades given to each assignment, as a 2D array
# Each index of reviews is a reviewer. Each index in reviews[i] is a review grade
reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]

# Number of reviewers
numReviewers = len(reviews)
# Number of assignments
numAssig = len(reviews[0])
# Initial empty grades for each assignment array
grades = []
# Initial empty delta R array
deltaR = []
# Weight prime
weightPrime = []
# Reviewer's reputation weight
weight= []

# Calculating the (unweighted) average grade per assignment
for numAssigIndex in range(numAssig):
    assignmentGradeAverage = 0
    for numReviewerIndex in range(numReviewers):
        assignmentGradeAverage += reviews[numReviewerIndex][numAssigIndex]
    grades.append(assignmentGradeAverage/numReviewers)
print("Average Grades:")
print(grades)

# Calculating delta R (mean squared deviation from the average, per reviewer)
for numReviewerIndex in range(numReviewers):
    reviewerDeltaR = 0
    assignmentAverageGradeIndex = 0
    for reviewGrade in reviews[numReviewerIndex]:
        reviewerDeltaR += ((reviewGrade - grades[assignmentAverageGradeIndex]) ** 2)
        assignmentAverageGradeIndex += 1
    reviewerDeltaR /= numAssig
    deltaR.append(reviewerDeltaR)
print("deltaR:")
print(deltaR)

# Calculating the average delta R over all reviewers
averageDeltaR = 0
for reviewerDeltaR in deltaR:
    averageDeltaR += reviewerDeltaR
averageDeltaR /= numReviewers
print("averageDeltaR:")
print(averageDeltaR)

# Calculating weight prime
for reviewerDeltaR in deltaR:
    weightPrime.append(averageDeltaR/reviewerDeltaR)
print("weightPrime:")
print(weightPrime)

# Calculating reputation weight (weights above 2 are dampened logarithmically)
for reviewerWeightPrime in weightPrime:
    if reviewerWeightPrime <= 2:
        weight.append(reviewerWeightPrime)
    else:
        weight.append(2 + math.log(reviewerWeightPrime - 1))
print("reputation per reviewer:")
i = 1
for reviewerWeight in weight:
    print("Reputation of Reviewer ", i)
    print(round(reviewerWeight,1))
    i += 1
</pre>

=== Initial Testing Conclusions ===
The results that are '''''actually''''' received from Peerlogic are presented below:
 
[[File:json-ss.jpeg|600px]]

The results from our Python re-creation of the Hamer algorithm are as follows:
[[File:Reputation web server hamer2.png|250px]] 

As shown, the values returned by Peerlogic do NOT match the expected results.
'''Therefore, our first conclusion is that the PeerLogic Webservice is implemented incorrectly.'''
This has been documented in the Conclusion section as the first point.
== Changes to Project Scope ==

=== Test Plan - Second Phase ===
We followed the testing thought process recommended by Dr. Gehringer:
In testing this service, we used an external program to send requests to a simulated service, and inspected the returned data.
This decision was reached because the program under test was unfortunately not running and could not be inspected in an ideal manner.
Additionally, this method was '''recommended by Dr. Gehringer''', our project mentor.
Finally, we also added a test case that ''mocks'' a webservice and asserts the output, done in 2 ways:
::a. In the first code snippet, we send JSON to a webservice that '''returns the correct Hamer output''', as Peerlogic should when fixed.
::b. In the form of an RSpec mock, in the second snippet below.

The test below sends real JSON to both peerlogic and mock. http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms

Our tests against both Peerlogic and the mock show that the current web service is not correct, since the values it returns do not match the expected values, as can be seen in the picture.
This is the result we were expected to reach in this project.
 
Proof of working:
 
[[File:Response_json_expected.jpeg]]

Test Code Snippet:

<pre>
require "net/http"
require "json"
# The following contains 4 reviewers who have each scored 4 submissions
# Input visualization:
# Grade given by each reviewer to each submission
# Reviewer->      stu9999 stu9998 stu9997 stu9996
# submission9999  10      10      9       5
# submission9998  3       2       4       5
# submission9997  7       4       5       5
# submission9996  6       4       5       5

INPUTS = {
"submission9999": {
"stu9999": 10,
"stu9998": 10,
"stu9997": 9,
"stu9996": 5
},
"submission9998": {
"stu9999": 3,
"stu9998": 2,
"stu9997": 4,
"stu9996": 5
},
"submission9997": {
"stu9999": 7,
"stu9998": 4,
"stu9997": 5,
"stu9996": 5
},
"submission9996": {
"stu9999": 6,
"stu9998": 4,
"stu9997": 5,
"stu9996": 5
}
}.to_json
#
#Values that would be returned by a correct Hamer implementation
EXPECTED = {
"Hamer": {
"9996": 0.6,
"9997": 3.6,
"9998": 1.1,
"9999": 1.1
}
}.to_json

#sends API request to Peerlogic
describe "Expertiza" do
it "should return the correct Hamer calculation" do
uri = URI('http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms')
req = Net::HTTP::Post.new(uri)
req.content_type = 'application/json'
req.body = INPUTS

response = Net::HTTP.start(uri.hostname, uri.port) do |http|
http.request(req)
end

#assertion fails, as expected
expect(JSON.parse(response.body)["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
end
end

#sends API request to Mock Hamer/Peerlogic Server
describe "Expertiza Web Service" do
it "should return the correct Hamer calculation" do
uri = URI('https://4dfaead4-a747-4be4-8683-3b10d1d2e0c0.mock.pstmn.io/reputation_web_service/default')
req = Net::HTTP::Post.new(uri)
req.content_type = 'application/json'
req.body = INPUTS

response = Net::HTTP.start(uri.hostname, uri.port, :use_ssl => uri.scheme == 'https') do |http|
http.request(req)
end
expect(JSON.parse("#{response.body}}")["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
end
end

</pre>

In addition, this plan enables us to test the current functionality by treating this system as a black box, and is able to provide conclusions on
the accuracy of the implementation as a whole.

Therefore, in the section below, we have provided code that showcases this plan in action. The values returned by the algorithm are to be inspected both by code and by hand.

<pre>
require "net/http"
require "json"
require "webmock/rspec" # gem install webmock -v 2.2.0

WebMock.disable_net_connect!(allow_localhost: true)

INPUTS = {
"submission9999": {
"stu9999": 10,
"stu9998": 10,
"stu9997": 9,
"stu9996": 5
},
"submission9998": {
"stu9999": 3,
"stu9998": 2,
"stu9997": 4,
"stu9996": 5
},
"submission9997": {
"stu9999": 7,
"stu9998": 4,
"stu9997": 5,
"stu9996": 5
},
"submission9996": {
"stu9999": 6,
"stu9998": 4,
"stu9997": 5,
"stu9996": 5
}
}.to_json

EXPECTED = {
"Hamer": {
"9996": 0.6,
"9997": 3.6,
"9998": 1.1,
"9999": 1.1
}
}.to_json

describe "Expertiza" do
before(:each) do
stub_request(:post, /peerlogic.csc.ncsu.edu/).
to_return(status: 200, body: EXPECTED, headers: {})
end
it "should return the correct Hamer calculation" do
uri = URI('http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms')

req = Net::HTTP::Post.new(uri)
req.content_type = 'application/json'
req.body = INPUTS

response = Net::HTTP.start(uri.hostname, uri.port, :use_ssl => uri.scheme == 'https') do |http|
http.request(req)
end

expect(JSON.parse(response.body)["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
end
end

</pre>

== Edge Cases & Scenarios ==
''We present these scenarios as possible test cases for an accurately working Peerlogic webservice.''
1) Reviewer gives all max scores 
2) Reviewer gives all min scores 
3) Reviewer completes no review 
alternative scenario - reviewer gives max scores even if no inputs

These have not been implemented as there is no point in testing a system further when positive flows do not work.
However, the code in the Initial Phase Section can be used to analytically calculate correct responses for future assertions.
We have provided outputs to these scenarios below:

== Conclusion ==

As a team, we worked out the algorithms and their applications and wrote some test scenarios. However, we did not get a chance to work on the web service, since it does not work due to module errors: we encountered an "undefined method strip" error in the Reputation Web Service Controller. Although it sometimes works on the Expertiza team's side, we were not able to see the web service working. We therefore created test scenarios and wrote Python code to simulate the algorithm.
 
The code segment written to simulate the hamer.rb algorithm follows "A Method of Automatic Grade Calibration in Peer Assessment" by John Hamer, Kenneth T.K. Ma, and Hugh H.F. Kwong (https://crpit.scem.westernsydney.edu.au/confpapers/CRPITV42Hamer.pdf). It takes a list of reviewers and their grades for each assignment reviewed and computes the associated reputation weights. Since the algorithm described in the paper does not specify an initial weight for first-time reviewers, we coded it so that first-time reviewers start with a weight of 1. The code does not yet take existing reputation weights into account for reviewers who already have them; this will be added soon. We followed the algorithm in the paper to the letter, but even then the output values given in the paper's example did not match what we computed by hand and by code. Either we missed something or the algorithm has been changed. Our tests against Peerlogic and the mock show that the current web service is not correct, since the returned values do not match the expected values, as can be seen in the picture. This may be the result we were expected to reach in this project.

==GitHub Links==
Link to Expertiza repository: [https://github.com/expertiza/expertiza here]

Link to the forked repository: [https://github.com/joshlin5/expertiza here]

Link to pull request: [https://github.com/expertiza/expertiza/pull/2355/checks here]

Link to Github Project page: [https://github.com/joshlin5/expertiza/projects/2 here]

== References ==

1. Expertiza on GitHub (https://github.com/expertiza/expertiza) 
2. The live Expertiza website (http://expertiza.ncsu.edu/) 
3. Pluggable reputation systems for peer review: A web-service approach (https://doi.org/10.1109/FIE.2015.7344292)

CSC/ECE 517 Spring 2022 - E2212: Testing for hamer.rb

2022-03-28T01:24:55Z

Jlin36:

This page describes the changes made for the Spring 2022 OSS Project E2212: Testing for hamer.rb
== Project Overview ==

=== Introduction ===

Using students' peer reviews to help determine assignment grades has become increasingly popular in university courses. This method not only frees the professor and TAs from days of grading, but also lets students learn more about an assignment by grading others' work. Unfortunately, some students do not take reviewing seriously and may simply give 100 or 0 to other students. Since such reviews can skew a student's grade, a system that assesses the correctness and credibility of each reviewer is necessary for peer-review grades to be accurate. The Hamer algorithm was created for this purpose and returns a reputation weight for each reviewer. The instructor can then use the reputation weight either to assess a reviewer's reliability or to compute a grade for the reviewer.

=== System Design ===

The Hamer algorithm takes a set of assignment grades given by each reviewer, optionally together with any existing reputation weights for those reviewers, and computes a reputation weight for each reviewer. These weights indicate the accuracy and reliability of each reviewer. For example, a reviewer with a reputation weight of 3.0 is more accurate and reliable than a reviewer with a weight of 0.5. The following example is from the paper [1] that describes the Hamer algorithm:
[[File:Hamer Algorithm Inputs Outputs.png|500px]] 

This algorithm is currently deployed on the following web server: http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms. To use the algorithm, a POST request is sent to the Peerlogic URL with a JSON body listing each assignment, the grades it received, and the reviewer who gave each grade. The server then sends back a JSON response with the corresponding Hamer reputation weights. The following is an example of a POST request to the Peerlogic URL and the Hamer reputation values it returns: 
[[File:Reputation web server hamer.png|1000px]] 
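The request body described above can be sketched in Python as follows. This is only an illustration: the helper name <code>to_peerlogic_payload</code> and the 1-based IDs are assumptions, and the <code>submission…</code>/<code>stu…</code> key naming simply mirrors the test inputs used in the second-phase tests on this page. Nothing is sent over the network here; the snippet only builds the JSON string.

```python
import json

def to_peerlogic_payload(reviews):
    """Build a {submission: {reviewer: grade}} dict.

    reviews[r][a] is the grade reviewer r gave to assignment a.
    The key names and 1-based IDs are illustrative assumptions.
    """
    payload = {}
    for a in range(len(reviews[0])):
        payload[f"submission{a + 1}"] = {
            f"stu{r + 1}": reviews[r][a] for r in range(len(reviews))
        }
    return payload

# Three reviewers, five assignments.
reviews = [[5, 4, 4, 3, 2], [5, 3, 4, 4, 2], [4, 3, 4, 3, 2]]
body = json.dumps(to_peerlogic_payload(reviews))
print(body)
```

The resulting string is what would go into the body of the POST request, with the content type set to <code>application/json</code>.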

=== Objectives ===

* Calculate reputation scores based on paper "Pluggable reputation systems for peer review: A web-service approach"
* Assert the accuracy of the reputation web server's hamer values through the URL http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms.
* Create a mock web server to return the correct hamer values if the reputation web server's hamer algorithm returns incorrect values.

=== Files Involved ===

*reputation_web_server_hamer.rb
*reputation_mock_web_server_hamer.rb

=== Mentor ===

Ed Gehringer, efg@ncsu.edu

=== Team Members ===

* Joshua Lin (jlin36@ncsu.edu)
* Muhammet Mustafa Olmez (molmez@ncsu.edu)
* Soumyadeep Chatterjee (schatte5@ncsu.edu)

== Algorithms ==

Reputation systems may take various factors into account:
* Does a reviewer assign scores that are similar to scores assigned by the instructor (on work that they both grade)?
* Does a reviewer assign scores that match those assigned by other reviewers?
* Does the reviewer assign different scores to different work?
* How competent has the reviewer been on other work done for the class?

Two algorithms are used: the Hamer-peer algorithm has the lowest maximum absolute bias, and the Lauw-peer algorithm has the lowest overall bias. This indicates that, from the instructor’s perspective, expert grading may not be necessary for further assignments of this kind. It is observed in the article (https://ieeexplore.ieee.org/abstract/document/7344292) that the overall bias is a little bit higher, but the max. absolute bias is very high (more than 20). This indicates that for future similar courses, the instructor can trust most students’ peer grading, but should be aware that the students may give inflated grades. Therefore spot-checking is necessary. However, overall bias is quite low, as the students gave grades at least 16 points lower than expert grades. This may be because either more training is needed, or the review rubric is inadequate. The results also suggest that for future courses of this kind, the instructor cannot trust the students' grades; expert grades are still necessary.
The main difference between the Hamer-peer and the Lauw-peer algorithm is that the Lauw-peer algorithm keeps track of the reviewer's leniency (“bias”), which can be either positive or negative. A positive leniency indicates the reviewer tends to give higher scores than average. Additionally, the range for Hamer’s algorithm is (0,∞) while for Lauw’s algorithm it is [0,1].

=== Hamer Algorithm ===
[[File:Step1.PNG|400px]]
 
[[File:Step2.PNG|400px]]
 
[[File:Step3.PNG|400px]]
 
[[File:Step4.PNG|400px]]

We implemented the steps of this algorithm for our analytical validation, found in the section below.
== Test Plan - Initial Phase ==
In the initial phase, we were tasked with testing the reputation_web_service_controller. The work done by a previous project team was hampered because the web service (Peerlogic) was not available at that time.
This time, we were able to access the Peerlogic server, although only at a late stage; our plan at that point was therefore to run a series of unit tests to determine whether the web service was communicating correctly with Expertiza.

=== Initial Testing Plan ===
1. Since our focus in this phase was to conduct exploratory testing of the system, we wrote some conventional tests to examine Peerlogic functionality. At this stage, we realized that Peerlogic
would only accept and respond with JSON data.

2. Therefore, a natural next step was to prepare input data that simulated a general input scenario for the system, comprising:
::a. Each reviewer has assigned scores to 3 reviewees (fellow students)
::b. There are a total of 3 reviewers, who have all graded each other in some fashion for 5 assignments
::c. Convert this scenario to JSON
::d. Write code to POST this to Peerlogic, and receive a response
::e. Parse the response to obtain the output values of the Hamer algorithm, as calculated by Peerlogic
::f. Compare this output against the values that we calculated ourselves based on the research paper for the Hamer algorithm
'''''The code for the last step is shown below'''''
<pre>
import math

# Input: reviews, a 2D list of the grades each reviewer gave to each assignment.
# Each index of reviews is a reviewer; each index in reviews[i] is a review grade.
# Example:
# reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]
# Corresponding table of grades per assignment and reviewer:
# Essay        Reviewer1  Reviewer2  Reviewer3
# Assignment1  5          5          4
# Assignment2  4          3          3
# Assignment3  4          4          4
# Assignment4  3          4          3
# Assignment5  2          2          2

# Reviewers' grades given to each assignment (2D list)
reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]

# Number of reviewers
numReviewers = len(reviews)
# Number of assignments
numAssig = len(reviews[0])
# Average grade for each assignment (initially empty)
grades = []
# Delta R for each reviewer (initially empty)
deltaR = []
# Weight prime for each reviewer
weightPrime = []
# Reviewer's reputation weight
weight = []

# Calculating the average grade of each assignment across all reviewers
for numAssigIndex in range(numAssig):
    assignmentGradeAverage = 0
    for numReviewerIndex in range(numReviewers):
        assignmentGradeAverage += reviews[numReviewerIndex][numAssigIndex]
    grades.append(assignmentGradeAverage/numReviewers)
print("Average Grades:")
print(grades)

# Calculating delta R: each reviewer's mean squared deviation from the averages
for numReviewerIndex in range(numReviewers):
    reviewerDeltaR = 0
    assignmentAverageGradeIndex = 0
    for reviewGrade in reviews[numReviewerIndex]:
        reviewerDeltaR += ((reviewGrade - grades[assignmentAverageGradeIndex]) ** 2)
        assignmentAverageGradeIndex += 1
    reviewerDeltaR /= numAssig
    deltaR.append(reviewerDeltaR)
print("deltaR:")
print(deltaR)

# Calculating the average delta R over all reviewers
averageDeltaR = 0
for reviewerDeltaR in deltaR:
    averageDeltaR += reviewerDeltaR
averageDeltaR /= numReviewers
print("averageDeltaR:")
print(averageDeltaR)

# Calculating weight prime (fails with a division by zero if a reviewer's delta R is 0)
for reviewerDeltaR in deltaR:
    weightPrime.append(averageDeltaR/reviewerDeltaR)
print("weightPrime:")
print(weightPrime)

# Calculating reputation weight: weights above 2 are damped logarithmically
for reviewerWeightPrime in weightPrime:
    if reviewerWeightPrime <= 2:
        weight.append(reviewerWeightPrime)
    else:
        weight.append(2 + math.log(reviewerWeightPrime - 1))
print("reputation per reviewer:")
i = 1
for reviewerWeight in weight:
    print("Reputation of Reviewer", i)
    print(round(reviewerWeight,1))
    i += 1
</pre>

Output
<pre>
Reputation of Reviewer 1
1.0
Reputation of Reviewer 2
1.0
Reputation of Reviewer 3
1.0
</pre>
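
All three reviewers receive 1.0 because each of them deviates from the per-assignment averages by the same amount, so every delta R equals the average delta R. The sketch below wraps the same single-pass calculation in a function (the name <code>hamer_weights</code> and the second data set are illustrative assumptions, not part of the project code) and adds a case where one reviewer disagrees strongly with the other three, so that the logarithmic branch for weights above 2 is exercised.

```python
import math

def hamer_weights(reviews):
    """Single-pass reputation weights, mirroring the script above.

    reviews[r][a] is the grade reviewer r gave to assignment a.
    """
    num_reviewers = len(reviews)
    num_assignments = len(reviews[0])
    # Average grade of each assignment across all reviewers.
    averages = [sum(r[a] for r in reviews) / num_reviewers
                for a in range(num_assignments)]
    # Delta R: mean squared deviation of each reviewer from the averages.
    delta_r = [sum((g - avg) ** 2 for g, avg in zip(r, averages)) / num_assignments
               for r in reviews]
    mean_delta_r = sum(delta_r) / num_reviewers
    weights = []
    for d in delta_r:
        w_prime = mean_delta_r / d  # raises ZeroDivisionError if d == 0
        weights.append(w_prime if w_prime <= 2 else 2 + math.log(w_prime - 1))
    return weights

# Data set from the script above: every reviewer deviates equally.
agreeing = [[5, 4, 4, 3, 2], [5, 3, 4, 4, 2], [4, 3, 4, 3, 2]]
# Hypothetical data set: three identical reviewers and one outlier.
outlier = [[5, 4, 4, 3, 2], [5, 4, 4, 3, 2], [5, 4, 4, 3, 2], [1, 1, 5, 5, 5]]

print([round(w, 1) for w in hamer_weights(agreeing)])
print([round(w, 1) for w in hamer_weights(outlier)])
```

In the second data set the three agreeing reviewers have a weight prime of 3, which is damped to 2 + ln(2), while the outlier's weight falls to one third.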

=== Initial Testing Conclusions ===
The results that are '''''actually''''' received from Peerlogic are presented below:
 
[[File:json-ss.jpeg|600px]]

The results from our Python re-creation of the Hamer algorithm are as follows: 
[[File:Reputation web server hamer2.png|250px]] 

As can be seen, the results returned by Peerlogic do NOT match the expected results.
'''Therefore, our first conclusion is that the PeerLogic Webservice is implemented incorrectly.'''
This has been documented in the Conclusion section as the first point.
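
A comparison of this kind can be made explicit with a small helper, sketched here in Python. The function name <code>mismatches</code>, the tolerance, and in particular the <code>returned</code> values are assumptions for illustration only; the real Peerlogic output is the one shown in the screenshot above.

```python
import math

def mismatches(expected, actual, tol=0.05):
    """Return {reviewer: (expected, actual)} for values that are missing
    or differ by more than tol."""
    diffs = {}
    for reviewer, want in expected.items():
        got = actual.get(reviewer)
        if got is None or not math.isclose(want, got, abs_tol=tol):
            diffs[reviewer] = (want, got)
    return diffs

# Values computed by the Python script above.
expected = {"1": 1.0, "2": 1.0, "3": 1.0}
# Hypothetical placeholder values, NOT the real Peerlogic response.
returned = {"1": 0.8, "2": 1.0, "3": 1.4}

print(mismatches(expected, returned))
```

An empty dictionary would mean that the service agrees with the hand calculation within the chosen tolerance.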
== Changes to Project Scope ==

=== Test Plan - Second Phase ===
We followed the testing approach recommended by Dr. Gehringer, our project mentor:
In testing this service, we used an external program to send requests to a simulated service, and inspected the returned data.
We chose this approach because the program under test was unfortunately not running and could not be inspected directly.
Finally, we also added a test case that ''mocks'' a web service and asserts on the output, done in 2 ways:
::a. In the first code snippet, we send JSON to a web service that '''returns the correct Hamer output''', as Peerlogic should when fixed.
::b. In the form of an RSpec mock, in the second snippet below.

The test below sends real JSON to both Peerlogic (http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms) and the mock service.

When we tested against Peerlogic and the mock, we found that the current web service is not correct, since the returned values do not match the expected values, as can be seen in the picture.
This is what we are supposed to reach in this project.
 
Proof of working:
 
[[File:Response_json_expected.jpeg]]

Test Code Snippet:

<pre>
require "net/http"
require "json"

# Grades per submission, keyed by the reviewer who gave them
INPUTS = {
  "submission9999": {
    "stu9999": 10,
    "stu9998": 10,
    "stu9997": 9,
    "stu9996": 5
  },
  "submission9998": {
    "stu9999": 3,
    "stu9998": 2,
    "stu9997": 4,
    "stu9996": 5
  },
  "submission9997": {
    "stu9999": 7,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  },
  "submission9996": {
    "stu9999": 6,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  }
}.to_json

# Hamer reputation values expected for the inputs above
EXPECTED = {
  "Hamer": {
    "9996": 0.6,
    "9997": 3.6,
    "9998": 1.1,
    "9999": 1.1
  }
}.to_json

# Sends the inputs to the real Peerlogic service
describe "Expertiza" do
  it "should return the correct Hamer calculation" do
    uri = URI('http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms')
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port) do |http|
      http.request(req)
    end

    expect(JSON.parse(response.body)["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

# Sends the same inputs to the mock web service
describe "Expertiza Web Service" do
  it "should return the correct Hamer calculation" do
    uri = URI('https://4dfaead4-a747-4be4-8683-3b10d1d2e0c0.mock.pstmn.io/reputation_web_service/default')
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port, :use_ssl => uri.scheme == 'https') do |http|
      http.request(req)
    end
    # Note: a literal "}" is appended to the response body before it is parsed
    expect(JSON.parse("#{response.body}}")["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

</pre>

In addition, this plan enables us to test the current functionality by treating this system as a black box, and is able to provide conclusions on
the accuracy of the implementation as a whole.

Therefore, in the section below, we have provided code that showcases this plan in action. The values returned by the algorithm are to be inspected both by code and by hand.

<pre>
require "net/http"
require "json"
require "webmock/rspec" # gem install webmock -v 2.2.0

WebMock.disable_net_connect!(allow_localhost: true)

# Grades per submission, keyed by the reviewer who gave them
INPUTS = {
  "submission9999": {
    "stu9999": 10,
    "stu9998": 10,
    "stu9997": 9,
    "stu9996": 5
  },
  "submission9998": {
    "stu9999": 3,
    "stu9998": 2,
    "stu9997": 4,
    "stu9996": 5
  },
  "submission9997": {
    "stu9999": 7,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  },
  "submission9996": {
    "stu9999": 6,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  }
}.to_json

# Hamer reputation values expected for the inputs above
EXPECTED = {
  "Hamer": {
    "9996": 0.6,
    "9997": 3.6,
    "9998": 1.1,
    "9999": 1.1
  }
}.to_json

describe "Expertiza" do
  before(:each) do
    # Stub every POST to the Peerlogic host so that it returns the expected values
    stub_request(:post, /peerlogic.csc.ncsu.edu/).
      to_return(status: 200, body: EXPECTED, headers: {})
  end
  it "should return the correct Hamer calculation" do
    uri = URI('http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms')

    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port, :use_ssl => uri.scheme == 'https') do |http|
      http.request(req)
    end

    expect(JSON.parse(response.body)["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

</pre>
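
The same stubbing idea can be sketched in Python with the standard <code>unittest.mock</code> module, which may be convenient alongside the Python script from the initial phase. This is a minimal sketch, not part of the project code: the function names <code>fetch_hamer</code> and <code>fetch_hamer_with_stub</code> are assumptions, and the stub simply returns the expected values used in the RSpec tests above, so no network traffic takes place.

```python
import json
import urllib.request
from unittest import mock

PEERLOGIC_URL = "http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms"
EXPECTED = {"Hamer": {"9996": 0.6, "9997": 3.6, "9998": 1.1, "9999": 1.1}}

def fetch_hamer(inputs):
    """POST the inputs as JSON and return the "Hamer" part of the reply."""
    req = urllib.request.Request(
        PEERLOGIC_URL,
        data=json.dumps(inputs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))["Hamer"]

def fetch_hamer_with_stub(inputs):
    """Call fetch_hamer with urlopen replaced by a stub that returns EXPECTED."""
    fake = mock.MagicMock()
    fake.read.return_value = json.dumps(EXPECTED).encode("utf-8")
    fake.__enter__.return_value = fake  # support the "with" statement
    with mock.patch("urllib.request.urlopen", return_value=fake):
        return fetch_hamer(inputs)

inputs = {"submission9999": {"stu9999": 10, "stu9998": 10, "stu9997": 9, "stu9996": 5}}
print(fetch_hamer_with_stub(inputs))
```

As with the RSpec mock, a test of this kind only checks the client-side plumbing; it says nothing about whether the real service computes the right values.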

== Edge Cases & Scenarios ==
''We present these scenarios as possible test cases for an accurately working Peerlogic webservice.''
1) Reviewer gives all max scores 
2) Reviewer gives all min scores 
3) Reviewer completes no review 
alternative scenario - reviewer gives max scores even if no inputs

These have not been implemented as there is no point in testing a system further when positive flows do not work.
However, the code in the Initial Phase Section can be used to analytically calculate correct responses for future assertions.
We have provided outputs to these scenarios below:
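
As a rough illustration of how the first two scenarios could be computed analytically, the sketch below reuses the single-pass calculation from the Initial Phase script. These are not Peerlogic outputs: the function name <code>hamer_weights</code>, the 1 to 5 grading scale, and the two baseline reviewers are assumptions made for this example, and the "no review" scenario is not covered. A reviewer whose grades equal the averages exactly would cause a division by zero in this sketch.

```python
import math

def hamer_weights(reviews):
    """Single-pass reputation weights; reviews[r][a] is reviewer r's grade
    for assignment a."""
    num_reviewers = len(reviews)
    num_assignments = len(reviews[0])
    averages = [sum(r[a] for r in reviews) / num_reviewers
                for a in range(num_assignments)]
    delta_r = [sum((g - avg) ** 2 for g, avg in zip(r, averages)) / num_assignments
               for r in reviews]
    mean_delta_r = sum(delta_r) / num_reviewers
    weights = []
    for d in delta_r:
        w_prime = mean_delta_r / d  # raises ZeroDivisionError if d == 0
        weights.append(w_prime if w_prime <= 2 else 2 + math.log(w_prime - 1))
    return weights

MAX_SCORE, MIN_SCORE = 5, 1  # assumed grading scale, not stated in the source
base = [[5, 4, 4, 3, 2], [5, 3, 4, 4, 2]]  # two ordinary reviewers

all_max = hamer_weights(base + [[MAX_SCORE] * 5])  # scenario 1
all_min = hamer_weights(base + [[MIN_SCORE] * 5])  # scenario 2

print("all max:", [round(w, 2) for w in all_max])
print("all min:", [round(w, 2) for w in all_min])
```

In both cases the third reviewer, who gives the same score to every assignment, ends up with the lowest weight of the three.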

== Conclusion ==

As a team, we worked out the algorithms and their applications and wrote some test scenarios. However, we did not have a chance to work with the web service inside Expertiza, since it does not run because of module errors: what we got was an "undefined method strip" error in the Reputation Web Service Controller. Although it sometimes works on the Expertiza team's side, we were not able to see the web service working. We therefore created some test scenarios and wrote Python code to simulate the algorithm.
 
In the code segment written to simulate the hamer.rb algorithm as described in "A Method of Automatic Grade Calibration in Peer Assessment" by John Hamer, Kenneth T.K. Ma, and Hugh H.F. Kwong (https://crpit.scem.westernsydney.edu.au/confpapers/CRPITV42Hamer.pdf), we take a list of reviewers and their grades for each assignment reviewed and compute the associated reputation weights. Since the algorithm described in the paper does not specify an initial weight for first-time reviewers, we coded it so that first-time reviewers have an initial weight of 1. In addition, this code does not yet take existing weights into account for reviewers who already have reputation weights; this will be added soon. Also, we followed the algorithm described in the paper to the letter, but even then the output values given in the paper's example did not match what we computed by hand and by code. In this situation, either we missed something completely or the algorithm has been changed. When we tested against Peerlogic and the mock, we found that the current web service is not correct, since the returned values do not match the expected values, as can be seen in the picture. This can be what we are supposed to reach in this project.

==GitHub Links==
Link to Expertiza repository: [https://github.com/expertiza/expertiza here]

Link to the forked repository: [https://github.com/joshlin5/expertiza here]

Link to pull request: [https://github.com/expertiza/expertiza/pull/2355/checks here]

== References ==

1. Expertiza on GitHub (https://github.com/expertiza/expertiza) 
2. The live Expertiza website (http://expertiza.ncsu.edu/) 
3. Pluggable reputation systems for peer review: A web-service approach (https://doi.org/10.1109/FIE.2015.7344292)

CSC/ECE 517 Spring 2022 - E2212: Testing for hamer.rb

2022-03-28T01:23:58Z

Jlin36:

This page describes the changes made for the Spring 2022 OSS Project E2212: Testing for hamer.rb
== Project Overview ==

=== Introduction ===

Using students' reviews of an assignment to arrive at a more accurate grade has become more popular among professors and courses in universities. Not only does this method free the professor and TAs from days of work, it also allows students to learn more about an assignment by grading others' work. Unfortunately, many students may not take reviewing others' work seriously and may simply give 100 or 0 to other students. Since such reviews may skew a student's grade, a system that assesses the correctness and credibility of each reviewer is necessary for student reviews to be accurate. The Hamer algorithm was created for this purpose and returns a reputation weight associated with each reviewer. The instructor can then use the reputation weight either to assess the reliability of a reviewer or to compute a grade for that reviewer.

=== System Design ===

The Hamer algorithm takes the grades that each reviewer gave to each assignment, plus (optionally) any existing reputation weight for each reviewer, and computes a reputation weight for each reviewer. These reputation weights indicate the accuracy and reliability of each reviewer. For example, a reviewer with a reputation weight of 3.0 is considered more accurate and reliable than a reviewer with a reputation weight of 0.5. The following is an example from the paper [1] that describes the Hamer algorithm: 
[[File:Hamer Algorithm Inputs Outputs.png|500px]] 

This algorithm is currently deployed on the following web server: http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms. To use the algorithm, a POST request is sent to the Peerlogic URL with a JSON body listing each assignment, the grades it received, and the reviewer who gave each grade. The server then sends back a JSON response with the corresponding Hamer reputation weights. The following is an example of a POST request to the Peerlogic URL and the Hamer reputation values it returns: 
[[File:Reputation web server hamer.png|1000px]] 

=== Objectives ===

* Calculate reputation scores based on paper "Pluggable reputation systems for peer review: A web-service approach"
* Assert the accuracy of the reputation web server's hamer values through the URL http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms.
* Create a mock web server to return the correct hamer values if the reputation web server's hamer algorithm returns incorrect values.

=== Files Involved ===

*reputation_web_server_hamer.rb
*reputation_mock_web_server_hamer.rb

=== Mentor ===

Ed Gehringer, efg@ncsu.edu

=== Team Members ===

* Joshua Lin (jlin36@ncsu.edu)
* Muhammet Mustafa Olmez (molmez@ncsu.edu)
* Soumyadeep Chatterjee (schatte5@ncsu.edu)

== Algorithms ==

Reputation systems may take various factors into account:
* Does a reviewer assign scores that are similar to scores assigned by the instructor (on work that they both grade)?
* Does a reviewer assign scores that match those assigned by other reviewers?
* Does the reviewer assign different scores to different work?
* How competent has the reviewer been on other work done for the class?

Two algorithms are used: the Hamer-peer algorithm has the lowest maximum absolute bias, and the Lauw-peer algorithm has the lowest overall bias. This indicates that, from the instructor’s perspective, expert grading may not be necessary for further assignments of this kind. It is observed in the article (https://ieeexplore.ieee.org/abstract/document/7344292) that the overall bias is a little bit higher, but the max. absolute bias is very high (more than 20). This indicates that for future similar courses, the instructor can trust most students’ peer grading, but should be aware that the students may give inflated grades. Therefore spot-checking is necessary. However, overall bias is quite low, as the students gave grades at least 16 points lower than expert grades. This may be because either more training is needed, or the review rubric is inadequate. The results also suggest that for future courses of this kind, the instructor cannot trust the students' grades; expert grades are still necessary.
The main difference between the Hamer-peer and the Lauw-peer algorithm is that the Lauw-peer algorithm keeps track of the reviewer's leniency (“bias”), which can be either positive or negative. A positive leniency indicates the reviewer tends to give higher scores than average. Additionally, the range for Hamer’s algorithm is (0,∞) while for Lauw’s algorithm it is [0,1].

=== Hamer Algorithm ===
[[File:Step1.PNG|400px]]
 
[[File:Step2.PNG|400px]]
 
[[File:Step3.PNG|400px]]
 
[[File:Step4.PNG|400px]]

We implemented the steps of this algorithm for our analytical validation, found in the section below.
== Test Plan - Initial Phase ==
In the initial phase, we were tasked with testing the reputation_web_service_controller. The work done by a previous project team was hampered because the web service (Peerlogic) was not available at that time.
This time, we were able to access the Peerlogic server, although only at a late stage; our plan at that point was therefore to run a series of unit tests to determine whether the web service was communicating correctly with Expertiza.

=== Initial Testing Plan ===
1. Since our focus in this phase was to conduct exploratory testing of the system, we wrote some conventional tests to examine Peerlogic functionality. At this stage, we realized that Peerlogic
would only accept and respond with JSON data.

2. Therefore, a natural next step was to prepare input data that simulated a general input scenario for the system, comprising:
::a. Each reviewer has assigned scores to 3 reviewees (fellow students)
::b. There are a total of 3 reviewers, who have all graded each other in some fashion for 5 assignments
::c. Convert this scenario to JSON
::d. Write code to POST this to Peerlogic, and receive a response
::e. Parse the response to obtain the output values of the Hamer algorithm, as calculated by Peerlogic
::f. Compare this output against the values that we calculated ourselves based on the research paper for the Hamer algorithm
'''''The code for the last step is shown below'''''
<pre>
import math

# Input: reviews, a 2D list of the grades each reviewer gave to each assignment.
# Each index of reviews is a reviewer; each index in reviews[i] is a review grade.
# Example:
# reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]
# Corresponding table of grades per assignment and reviewer:
# Essay        Reviewer1  Reviewer2  Reviewer3
# Assignment1  5          5          4
# Assignment2  4          3          3
# Assignment3  4          4          4
# Assignment4  3          4          3
# Assignment5  2          2          2

# Reviewers' grades given to each assignment (2D list)
reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]

# Number of reviewers
numReviewers = len(reviews)
# Number of assignments
numAssig = len(reviews[0])
# Average grade for each assignment (initially empty)
grades = []
# Delta R for each reviewer (initially empty)
deltaR = []
# Weight prime for each reviewer
weightPrime = []
# Reviewer's reputation weight
weight = []

# Calculating the average grade of each assignment across all reviewers
for numAssigIndex in range(numAssig):
    assignmentGradeAverage = 0
    for numReviewerIndex in range(numReviewers):
        assignmentGradeAverage += reviews[numReviewerIndex][numAssigIndex]
    grades.append(assignmentGradeAverage/numReviewers)
print("Average Grades:")
print(grades)

# Calculating delta R: each reviewer's mean squared deviation from the averages
for numReviewerIndex in range(numReviewers):
    reviewerDeltaR = 0
    assignmentAverageGradeIndex = 0
    for reviewGrade in reviews[numReviewerIndex]:
        reviewerDeltaR += ((reviewGrade - grades[assignmentAverageGradeIndex]) ** 2)
        assignmentAverageGradeIndex += 1
    reviewerDeltaR /= numAssig
    deltaR.append(reviewerDeltaR)
print("deltaR:")
print(deltaR)

# Calculating the average delta R over all reviewers
averageDeltaR = 0
for reviewerDeltaR in deltaR:
    averageDeltaR += reviewerDeltaR
averageDeltaR /= numReviewers
print("averageDeltaR:")
print(averageDeltaR)

# Calculating weight prime (fails with a division by zero if a reviewer's delta R is 0)
for reviewerDeltaR in deltaR:
    weightPrime.append(averageDeltaR/reviewerDeltaR)
print("weightPrime:")
print(weightPrime)

# Calculating reputation weight: weights above 2 are damped logarithmically
for reviewerWeightPrime in weightPrime:
    if reviewerWeightPrime <= 2:
        weight.append(reviewerWeightPrime)
    else:
        weight.append(2 + math.log(reviewerWeightPrime - 1))
print("reputation per reviewer:")
i = 1
for reviewerWeight in weight:
    print("Reputation of Reviewer", i)
    print(round(reviewerWeight,1))
    i += 1
</pre>

Output
<pre>
Reputation of Reviewer 1
1.0
Reputation of Reviewer 2
1.0
Reputation of Reviewer 3
1.0
</pre>

=== Initial Testing Conclusions ===
The results that are '''''actually''''' received from Peerlogic are presented below:
 
[[File:json-ss.jpeg|600px]]

The results from our Python re-creation of the Hamer algorithm are as follows: 
[[File:Reputation web server hamer2.png]] 

As can be seen, the results returned by Peerlogic do NOT match the expected results.
'''Therefore, our first conclusion is that the PeerLogic Webservice is implemented incorrectly.'''
This has been documented in the Conclusion section as the first point.
== Changes to Project Scope ==

=== Test Plan - Second Phase ===
We followed the testing approach recommended by Dr. Gehringer, our project mentor:
In testing this service, we used an external program to send requests to a simulated service, and inspected the returned data.
We chose this approach because the program under test was unfortunately not running and could not be inspected directly.
Finally, we also added a test case that ''mocks'' a web service and asserts on the output, done in 2 ways:
::a. In the first code snippet, we send JSON to a web service that '''returns the correct Hamer output''', as Peerlogic should when fixed.
::b. In the form of an RSpec mock, in the second snippet below.

The test below sends real JSON to both Peerlogic (http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms) and the mock service.

When we tested against Peerlogic and the mock, we found that the current web service is not correct, since the returned values do not match the expected values, as can be seen in the picture.
This is what we are supposed to reach in this project.
 
Proof of working:
 
[[File:Response_json_expected.jpeg]]

Test Code Snippet:

<pre>
require "net/http"
require "json"

# Grades per submission, keyed by the reviewer who gave them
INPUTS = {
  "submission9999": {
    "stu9999": 10,
    "stu9998": 10,
    "stu9997": 9,
    "stu9996": 5
  },
  "submission9998": {
    "stu9999": 3,
    "stu9998": 2,
    "stu9997": 4,
    "stu9996": 5
  },
  "submission9997": {
    "stu9999": 7,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  },
  "submission9996": {
    "stu9999": 6,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  }
}.to_json

# Hamer reputation values expected for the inputs above
EXPECTED = {
  "Hamer": {
    "9996": 0.6,
    "9997": 3.6,
    "9998": 1.1,
    "9999": 1.1
  }
}.to_json

# Sends the inputs to the real Peerlogic service
describe "Expertiza" do
  it "should return the correct Hamer calculation" do
    uri = URI('http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms')
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port) do |http|
      http.request(req)
    end

    expect(JSON.parse(response.body)["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

# Sends the same inputs to the mock web service
describe "Expertiza Web Service" do
  it "should return the correct Hamer calculation" do
    uri = URI('https://4dfaead4-a747-4be4-8683-3b10d1d2e0c0.mock.pstmn.io/reputation_web_service/default')
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port, :use_ssl => uri.scheme == 'https') do |http|
      http.request(req)
    end
    # Note: a literal "}" is appended to the response body before it is parsed
    expect(JSON.parse("#{response.body}}")["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

</pre>

In addition, this plan enables us to test the current functionality by treating this system as a black box, and is able to provide conclusions on
the accuracy of the implementation as a whole.

Therefore, in the section below, we have provided code that showcases this plan in action. The values returned by the algorithm are to be inspected both by code and by hand.

<pre>
require "net/http"
require "json"
require "webmock/rspec" # gem install webmock -v 2.2.0

WebMock.disable_net_connect!(allow_localhost: true)

# Grades per submission, keyed by the reviewer who gave them
INPUTS = {
  "submission9999": {
    "stu9999": 10,
    "stu9998": 10,
    "stu9997": 9,
    "stu9996": 5
  },
  "submission9998": {
    "stu9999": 3,
    "stu9998": 2,
    "stu9997": 4,
    "stu9996": 5
  },
  "submission9997": {
    "stu9999": 7,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  },
  "submission9996": {
    "stu9999": 6,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  }
}.to_json

# Hamer reputation values expected for the inputs above
EXPECTED = {
  "Hamer": {
    "9996": 0.6,
    "9997": 3.6,
    "9998": 1.1,
    "9999": 1.1
  }
}.to_json

describe "Expertiza" do
  before(:each) do
    # Stub every POST to the Peerlogic host so that it returns the expected values
    stub_request(:post, /peerlogic.csc.ncsu.edu/).
      to_return(status: 200, body: EXPECTED, headers: {})
  end
  it "should return the correct Hamer calculation" do
    uri = URI('http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms')

    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port, :use_ssl => uri.scheme == 'https') do |http|
      http.request(req)
    end

    expect(JSON.parse(response.body)["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

</pre>

== Edge Cases & Scenarios ==
''We present these scenarios as possible test cases for an accurately working Peerlogic webservice.''
1) Reviewer gives all max scores 
2) Reviewer gives all min scores 
3) Reviewer completes no review 
alternative scenario - reviewer gives max scores even if no inputs

These have not been implemented as there is no point in testing a system further when positive flows do not work.
However, the code in the Initial Phase Section can be used to analytically calculate correct responses for future assertions.
We have provided outputs to these scenarios below:

== Conclusion ==

As a team, we worked out the algorithms and their applications and wrote some test scenarios. However, we did not have a chance to work with the web service inside Expertiza, since it does not run because of module errors: what we got was an "undefined method strip" error in the Reputation Web Service Controller. Although it sometimes works on the Expertiza team's side, we were not able to see the web service working. We therefore created some test scenarios and wrote Python code to simulate the algorithm.
 
In the code segment written to simulate the hamer.rb algorithm as described in "A Method of Automatic Grade Calibration in Peer Assessment" by John Hamer, Kenneth T.K. Ma, and Hugh H.F. Kwong (https://crpit.scem.westernsydney.edu.au/confpapers/CRPITV42Hamer.pdf), we take a list of reviewers and their grades for each assignment reviewed and compute the associated reputation weights. Since the algorithm described in the paper does not specify an initial weight for first-time reviewers, we coded it so that first-time reviewers have an initial weight of 1. In addition, this code does not yet take existing weights into account for reviewers who already have reputation weights; this will be added soon. Also, we followed the algorithm described in the paper to the letter, but even then the output values given in the paper's example did not match what we computed by hand and by code. In this situation, either we missed something completely or the algorithm has been changed. When we tested against Peerlogic and the mock, we found that the current web service is not correct, since the returned values do not match the expected values, as can be seen in the picture. This can be what we are supposed to reach in this project.

==GitHub Links==
Link to Expertiza repository: [https://github.com/expertiza/expertiza here]

Link to the forked repository: [https://github.com/joshlin5/expertiza here]

Link to pull request: [https://github.com/expertiza/expertiza/pull/2355/checks here]

== References ==

1. Expertiza on GitHub (https://github.com/expertiza/expertiza) 
2. The live Expertiza website (http://expertiza.ncsu.edu/) 
3. Pluggable reputation systems for peer review: A web-service approach (https://doi.org/10.1109/FIE.2015.7344292)

File:Reputation web server hamer2.png

2022-03-28T01:21:46Z

Jlin36:

CSC/ECE 517 Spring 2022 - E2212: Testing for hamer.rb

2022-03-28T01:21:34Z

Jlin36:

This page describes the changes made for the Spring 2022 OSS Project E2212: Testing for hamer.rb
== Project Overview ==

=== Introduction ===

Using students' reviews of an assignment to arrive at a more accurate grade has become more popular among professors and courses in universities. Not only does this method free the professor and TAs from days of work, it also allows students to learn more about an assignment by grading others' work. Unfortunately, many students may not take reviewing others' work seriously and may simply give 100 or 0 to other students. Since such reviews may skew a student's grade, a system that assesses the correctness and credibility of each reviewer is necessary for student reviews to be accurate. The Hamer algorithm was created for this purpose and returns a reputation weight associated with each reviewer. The instructor can then use the reputation weight either to assess the reliability of a reviewer or to compute a grade for that reviewer.

=== System Design ===

The Hamer algorithm takes the grades that each reviewer gave to each assignment, plus (optionally) any existing reputation weight for each reviewer, and computes a reputation weight for each reviewer. These reputation weights indicate the accuracy and reliability of each reviewer. For example, a reviewer with a reputation weight of 3.0 is considered more accurate and reliable than a reviewer with a reputation weight of 0.5. The following is an example from the paper [1] that describes the Hamer algorithm: 
[[File:Hamer Algorithm Inputs Outputs.png|500px]] 

This algorithm is currently deployed to the following web server: http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms. To use the algorithm, a POST request is sent to the Peerlogic URL with a JSON body listing each assignment, the grades it received, and the reviewer who gave each grade. The server then returns a JSON response containing the corresponding Hamer reputation weights. The following is an example of a POST request to the Peerlogic URL and the Hamer reputation values it returns: 
[[File:Reputation web server hamer.png|1000px]] 
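As a rough illustration of the request format described above, the following Python sketch builds (but does not send) such a POST request. The function name build_reputation_request and the sample payload are illustrative assumptions; the payload shape is borrowed from the test inputs in the Test Plan section, and the exact body that Peerlogic expects may differ.

```python
import json
import urllib.request

PEERLOGIC_URL = "http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms"


def build_reputation_request(grades, url=PEERLOGIC_URL):
    """Build a JSON POST request; `grades` maps submission -> {reviewer: grade}."""
    body = json.dumps(grades).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical sample: two submissions, each graded by two reviewers.
sample = {
    "submission9999": {"stu9999": 10, "stu9998": 10},
    "submission9998": {"stu9999": 3, "stu9998": 2},
}
request = build_reputation_request(sample)
print(request.get_method(), request.full_url)
```

Sending the request would then be a matter of passing it to urllib.request.urlopen and parsing the JSON reply.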

=== Objectives ===

* Calculate reputation scores based on the paper "Pluggable reputation systems for peer review: A web-service approach"
* Assert the accuracy of the reputation web server's hamer values through the URL http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms.
* Create a mock web server to return the correct hamer values if the reputation web server's hamer algorithm returns incorrect values.

=== Files Involved ===

*reputation_web_server_hamer.rb
*reputation_mock_web_server_hamer.rb

=== Mentor ===

Ed Gehringer, efg@ncsu.edu

=== Team Members ===

* Joshua Lin (jlin36@ncsu.edu)
* Muhammet Mustafa Olmez (molmez@ncsu.edu)
* Soumyadeep Chatterjee (schatte5@ncsu.edu)

== Algorithms ==

Reputation systems may take various factors into account:
* Does a reviewer assign scores that are similar to scores assigned by the instructor (on work that they both grade)?
* Does a reviewer assign scores that match those assigned by other reviewers?
* Does the reviewer assign different scores to different work?
* How competent has the reviewer been on other work done for the class?

Two algorithms are used: the Hamer-peer algorithm, which has the lowest maximum absolute bias, and the Lauw-peer algorithm, which has the lowest overall bias. This indicates, from the instructor’s perspective, that if there are further assignments of this kind, expert grading may not be necessary. It is observed in the article (https://ieeexplore.ieee.org/abstract/document/7344292) that the overall bias is a little bit higher, but the maximum absolute bias is very high (more than 20). This indicates that for future similar courses, the instructor can trust most students’ peer grading, but should be aware that the students may give inflated grades. Therefore, spot-checking is necessary. However, overall bias is quite low, as the students gave grades at least 16 points lower than expert grades. This may be because either more training is needed or the review rubric is inadequate. The results also suggest that for future courses of this kind, the instructor cannot trust the students' grades; expert grades are still necessary.
The main difference between the Hamer-peer and the Lauw-peer algorithm is that the Lauw-peer algorithm keeps track of the reviewer's leniency (“bias”), which can be either positive or negative. A positive leniency indicates the reviewer tends to give higher scores than average. Additionally, the range for Hamer’s algorithm is (0,∞) while for Lauw’s algorithm it is [0,1].

=== Hamer Algorithm ===
[[File:Step1.PNG|400px]]
 
[[File:Step2.PNG|400px]]
 
[[File:Step3.PNG|400px]]
 
[[File:Step4.PNG|400px]]

We implemented the steps of this algorithm for our analytical validation, found in the section below.
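
Read off the Python script in the Test Plan section, the calculation we implemented can be summarized as follows. This is a sketch of what that script computes, with symbols chosen here for illustration (<math>g_{r,a}</math> is the grade reviewer <math>r</math> gave assignment <math>a</math>, <math>R</math> is the number of reviewers, <math>A</math> the number of assignments); the step images above remain the authoritative statement of the algorithm.

<math>\bar{g}_a = \frac{1}{R}\sum_{r=1}^{R} g_{r,a}</math>

<math>\delta_r = \frac{1}{A}\sum_{a=1}^{A}\left(g_{r,a} - \bar{g}_a\right)^2, \qquad \bar{\delta} = \frac{1}{R}\sum_{r=1}^{R}\delta_r</math>

<math>w'_r = \frac{\bar{\delta}}{\delta_r}</math>

<math>w_r = \begin{cases} w'_r & \text{if } w'_r \le 2 \\ 2 + \ln\left(w'_r - 1\right) & \text{otherwise} \end{cases}</math>

Note that <math>w'_r</math> is undefined when <math>\delta_r = 0</math>, that is, when a reviewer agrees exactly with every average grade.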
== Test Plan - Initial Phase ==
In the initial phase, we were tasked with testing the reputation_web_service_controller. The work done by a previous project team was hampered because the web service (Peerlogic) was not available at that time.
This time, we were able to access the Peerlogic server, though only at a late stage. Our plan at that point was therefore to run a series of unit tests to determine whether the web service was communicating correctly with Expertiza.

=== Initial Testing Plan ===
1. Since our focus in this phase was exploratory testing of the system, we wrote some conventional tests to examine Peerlogic functionality. At this stage, we realized that Peerlogic would only accept and respond with JSON data.

2. Therefore, a natural next step was to prepare input data that simulated a general input scenario for the system, comprising:
::a. Each reviewer has assigned scores to 3 reviewees (fellow students)
::b. There are a total of 3 reviewers, who have all graded each other in some fashion for 5 assignments
::c. Convert this scenario to JSON
::d. Write code to POST this to Peerlogic, and receive a response
::e. Parse through this response to obtain the output values of the Hamer Algorithm, as calculated by Peerlogic.
::f. Compare this output against the expected values that we calculated ourselves from the research paper on the Hamer algorithm
'''''The code for the last step is shown below'''''
<pre>
import math

# Parameters: reviews list
# reviews list - a list of each reviewer's grades for each assignment
# Example:
# reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]
# Corresponding reviewer and grade for each assignment table
# Essay       Reviewer1 Reviewer2 Reviewer3
# Assignment1 5         5         4
# Assignment2 4         3         3
# Assignment3 4         4         4
# Assignment4 3         4         3
# Assignment5 2         2         2

# Reviewer's grades given to each assignment, as a 2D array
# Each index of reviews is a reviewer. Each index in reviews[i] is a review grade
reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]

# Number of reviewers
numReviewers = len(reviews)
# Number of assignments
numAssig = len(reviews[0])
# Initial empty grades for each assignment array
grades = []
# Initial empty delta R array
deltaR = []
# Weight prime
weightPrime = []
# Reviewer's reputation weight
weight = []

# Calculating the (unweighted) average grade of each assignment
for numAssigIndex in range(numAssig):
    assignmentGradeAverage = 0
    for numReviewerIndex in range(numReviewers):
        assignmentGradeAverage += reviews[numReviewerIndex][numAssigIndex]
    grades.append(assignmentGradeAverage/numReviewers)
print("Average Grades:")
print(grades)

# Calculating delta R: each reviewer's mean squared difference from the average grades
for numReviewerIndex in range(numReviewers):
    reviewerDeltaR = 0
    assignmentAverageGradeIndex = 0
    for reviewGrade in reviews[numReviewerIndex]:
        reviewerDeltaR += ((reviewGrade - grades[assignmentAverageGradeIndex]) ** 2)
        assignmentAverageGradeIndex += 1
    reviewerDeltaR /= numAssig
    deltaR.append(reviewerDeltaR)
print("deltaR:")
print(deltaR)

# Calculating the average delta R over all reviewers
averageDeltaR = 0
for reviewerDeltaR in deltaR:
    averageDeltaR += reviewerDeltaR
averageDeltaR /= numReviewers
print("averageDeltaR:")
print(averageDeltaR)

# Calculating weight prime
# (raises ZeroDivisionError if a reviewer matches every average grade exactly)
for reviewerDeltaR in deltaR:
    weightPrime.append(averageDeltaR/reviewerDeltaR)
print("weightPrime:")
print(weightPrime)

# Calculating reputation weight
for reviewerWeightPrime in weightPrime:
    if reviewerWeightPrime <= 2:
        weight.append(reviewerWeightPrime)
    else:
        weight.append(2 + math.log(reviewerWeightPrime - 1))
print("reputation per reviewer:")
i = 1
for reviewerWeight in weight:
    print("Reputation of Reviewer", i)
    print(round(reviewerWeight,1))
    i += 1
</pre>

Output
<pre>
Reputation of Reviewer 1
1.0
Reputation of Reviewer 2
1.0
Reputation of Reviewer 3
1.0
</pre>

=== Initial Testing Conclusions ===
The results that are '''''actually''''' received from Peerlogic are presented below:
 
[[File:json-ss.jpeg|600px]]

The results from our Python re-creation of the Hamer algorithm are as follows: 
[[File:reputation_web_server_hamer2.png]] 

As you can see, the values returned by Peerlogic do NOT match the expected results.
'''Therefore, our first conclusion is that the PeerLogic Webservice is implemented incorrectly.'''
This has been documented in the Conclusion section as the first point.
== Changes to Project Scope ==

=== Test Plan - Second Phase ===
We followed the testing approach '''recommended by Dr. Gehringer''', our project mentor.
To test the service, we used an external program to send requests to a simulated service and inspected the returned data.
We chose this approach because the program under test was unfortunately not running and could not be inspected in an ideal manner.
Finally, we also added a test case that ''mocks'' a web service and asserts the output, done in 2 ways:
::a. In the first code snippet, we send JSON to a webservice that '''returns the correct Hamer output''', as Peerlogic should when fixed.
::b. In the form of an RSpec mock, in the second snippet below.

The test below sends real JSON to both Peerlogic (http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms) and the mock.

Testing against both Peerlogic and the mock showed that the current web service is not correct: the values it returns do not match the expected values, as can be seen in the picture.
This is what we are supposed to reach in this project.
 
Proof of working:
 
[[File:Response_json_expected.jpeg]]

Test Code Snippet:

<pre>
require "net/http"
require "json"

# Review grades sent to the service: submission => { reviewer => grade }
INPUTS = {
  "submission9999": {
    "stu9999": 10,
    "stu9998": 10,
    "stu9997": 9,
    "stu9996": 5
  },
  "submission9998": {
    "stu9999": 3,
    "stu9998": 2,
    "stu9997": 4,
    "stu9996": 5
  },
  "submission9997": {
    "stu9999": 7,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  },
  "submission9996": {
    "stu9999": 6,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  }
}.to_json

# Hamer reputation values expected for the inputs above
EXPECTED = {
  "Hamer": {
    "9996": 0.6,
    "9997": 3.6,
    "9998": 1.1,
    "9999": 1.1
  }
}.to_json

# Sends the inputs to the live Peerlogic web service
describe "Expertiza" do
  it "should return the correct Hamer calculation" do
    uri = URI('http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms')
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port) do |http|
      http.request(req)
    end

    expect(JSON.parse(response.body)["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

# Sends the same inputs to the mock web service
describe "Expertiza Web Service" do
  it "should return the correct Hamer calculation" do
    uri = URI('https://4dfaead4-a747-4be4-8683-3b10d1d2e0c0.mock.pstmn.io/reputation_web_service/default')
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port, :use_ssl => uri.scheme == 'https') do |http|
      http.request(req)
    end
    # NOTE: a closing "}" is appended to the response body before parsing, as in the original test
    expect(JSON.parse("#{response.body}}")["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

</pre>

In addition, this plan lets us test the current functionality by treating the system as a black box, and it allows us to draw conclusions about the accuracy of the implementation as a whole.

The code above shows this plan in action. The values returned by the algorithm are to be inspected both by code and by hand.
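
The same black-box idea can also be tried without any external service. The following Python sketch starts a throwaway local HTTP server that always answers with the expected Hamer values and then posts grades to it. The names MockReputationHandler, fetch_hamer and run_against_mock are hypothetical, and the local server merely stands in for the hosted mock used in the test above; it is not part of Expertiza or Peerlogic.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Expected values taken from the test snippet on this page.
EXPECTED = {"Hamer": {"9996": 0.6, "9997": 3.6, "9998": 1.1, "9999": 1.1}}


class MockReputationHandler(BaseHTTPRequestHandler):
    """Answers every POST with the expected Hamer values."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # consume the request body; the mock ignores it
        body = json.dumps(EXPECTED).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet


def fetch_hamer(url, grades):
    """POST the grades as JSON and return the "Hamer" part of the reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(grades).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Bypass any configured proxy so the request reaches the local server.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(req, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))["Hamer"]


def run_against_mock(grades):
    """Start the mock on a free local port, query it once, then shut it down."""
    server = HTTPServer(("127.0.0.1", 0), MockReputationHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = "http://127.0.0.1:%d/reputation" % server.server_address[1]
        return fetch_hamer(url, grades)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_against_mock({"submission9999": {"stu9999": 10, "stu9996": 5}}))
```

Pointing fetch_hamer at the Peerlogic URL instead of the local mock would give the live comparison, assuming the service accepts this body shape.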

== Edge Cases & Scenarios ==
''We present these scenarios as possible test cases for an accurately working Peerlogic webservice.''
* Reviewer gives all max scores
* Reviewer gives all min scores
* Reviewer completes no review
** Alternative scenario: reviewer gives max scores even if there are no inputs

These have not been implemented as there is no point in testing a system further when positive flows do not work.
However, the code in the Initial Phase Section can be used to analytically calculate correct responses for future assertions.
We have provided outputs to these scenarios below:
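
A minimal sketch of how expected values for such scenarios might be worked out analytically is given here. It wraps the single-pass calculation from the Initial Phase script in a function; the name hamer_weights and the choice to return None when a weight is undefined are assumptions made for illustration, and the sketch makes no claim about how Peerlogic itself handles these cases.

```python
import math


def hamer_weights(reviews):
    """Single-pass reputation weights, mirroring the Initial Phase script.

    reviews[r][a] is the grade reviewer r gave to assignment a.
    Returns one weight per reviewer, or None where the weight is undefined
    (the reviewer matches every average grade exactly, so delta R is 0).
    """
    num_reviewers = len(reviews)
    num_assignments = len(reviews[0])
    # Unweighted average grade of each assignment
    averages = [
        sum(reviewer[a] for reviewer in reviews) / num_reviewers
        for a in range(num_assignments)
    ]
    # Mean squared difference between each reviewer's grades and the averages
    deltas = [
        sum((reviewer[a] - averages[a]) ** 2 for a in range(num_assignments))
        / num_assignments
        for reviewer in reviews
    ]
    mean_delta = sum(deltas) / num_reviewers
    weights = []
    for delta in deltas:
        if delta == 0:
            weights.append(None)  # assumption: flag instead of dividing by zero
            continue
        w_prime = mean_delta / delta
        weights.append(w_prime if w_prime <= 2 else 2 + math.log(w_prime - 1))
    return weights


# The example from the Initial Phase script: every reviewer ends up at 1.0
print([round(w, 1) for w in hamer_weights([[5, 4, 4, 3, 2], [5, 3, 4, 4, 2], [4, 3, 4, 3, 2]])])

# Everyone gives the same (maximum) score: no weight can be computed
print(hamer_weights([[5, 5], [5, 5]]))
```

The second call shows why the all-max and all-min scenarios deserve their own tests: when every reviewer agrees exactly, the original script would divide by zero.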

== Conclusion ==

As a team, we worked through the algorithms and their application and wrote some test scenarios. However, we did not have a chance to work with the web service inside Expertiza, since it did not work due to module errors: we encountered an "undefined method strip" error in the Reputation Web Service Controller. Although it sometimes worked on the Expertiza team's side, we were not able to see the web service working ourselves. We therefore created some test scenarios and wrote Python code to simulate the algorithm.
 
In the code segment written to simulate the hamer.rb algorithm as described in "A Method of Automatic Grade Calibration in Peer Assessment" by John Hamer, Kenneth T.K. Ma, and Hugh H.F. Kwong (https://crpit.scem.westernsydney.edu.au/confpapers/CRPITV42Hamer.pdf), we take a list of reviewers and their grades for each assignment reviewed and compute the associated reputation weights. Since the algorithm described in the paper does not specify an initial weight for first-time reviewers, we coded it so that first-time reviewers have an initial weight of 1. In addition, this code does not yet take into account existing reputation weights for reviewers who already have them; this will be added soon. Also, we followed the algorithm in the paper to the letter, but even then the output values given in the paper's example did not match what we computed by hand and by code. In this situation, either we missed something or the algorithm has been changed. Testing against both Peerlogic and the mock showed that the current web service is not correct, since the returned values do not match the expected values, as can be seen in the picture. This may be the result we were supposed to reach in this project.

==GitHub Links==
Link to Expertiza repository: [https://github.com/expertiza/expertiza here]

Link to the forked repository: [https://github.com/joshlin5/expertiza here]

Link to pull request: [https://github.com/expertiza/expertiza/pull/2355/checks here]

== References ==

1. Expertiza on GitHub (https://github.com/expertiza/expertiza) 
2. The live Expertiza website (http://expertiza.ncsu.edu/) 
3. Pluggable reputation systems for peer review: A web-service approach (https://doi.org/10.1109/FIE.2015.7344292)

CSC/ECE 517 Spring 2022 - E2212: Testing for hamer.rb

2022-03-28T01:20:50Z

Jlin36:


CSC/ECE 517 Spring 2022 - E2212: Testing for hamer.rb

2022-03-28T01:19:38Z

Jlin36:

This page describes the changes made for the Spring 2022 OSS Project E2212: Testing for hamer.rb
== Project Overview ==

=== Introduction ===

Using students' reviews of an assignment to arrive at a more accurate grade has become more popular among professors and courses in universities. Not only does this method free the professor and TAs from days of work, but it also allows students to learn more about an assignment by grading others' work. Unfortunately, many students may not take reviewing others' work seriously and may simply give 100 or 0 to other students. Since such reviews may skew a student's grade, a system that assesses the correctness and credibility of each reviewer is necessary for student reviews to be accurate. The Hamer algorithm was created for this purpose and returns a reputation weight associated with each reviewer. The instructor can then use the reputation weight either to assess the reliability of a reviewer or to compute a grade for a reviewer.

=== System Design ===

The Hamer algorithm takes in the grades that each reviewer gave to each assignment and, optionally, any reputation weights already associated with the reviewers, and computes a reputation weight for each reviewer. These reputation weights indicate the accuracy and reliability of each reviewer. For example, a reviewer with a reputation weight of 3.0 is considered more accurate and reliable than a reviewer with a reputation weight of 0.5. The following is an example from the paper [1] that describes the Hamer algorithm: 
[[File:Hamer Algorithm Inputs Outputs.png|500px]] 

This algorithm is currently deployed to the following web server: http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms. To use the algorithm, a POST request is sent to the Peerlogic URL with a JSON body listing each assignment, the grades it received, and the reviewer who gave each grade. The server then returns a JSON response containing the corresponding Hamer reputation weights. The following is an example of a POST request to the Peerlogic URL and the Hamer reputation values it returns: 
[[File:Reputation web server hamer.png|1000px]] 

=== Objectives ===

* Calculate reputation scores based on the paper "Pluggable reputation systems for peer review: A web-service approach"
* Assert the accuracy of the reputation web server's hamer values through the URL http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms.
* Create a mock web server to return the correct hamer values if the reputation web server's hamer algorithm returns incorrect values.

=== Files Involved ===

*reputation_web_server_hamer.rb
*reputation_mock_web_server_hamer.rb

=== Mentor ===

Ed Gehringer, efg@ncsu.edu

=== Team Members ===

* Joshua Lin (jlin36@ncsu.edu)
* Muhammet Mustafa Olmez (molmez@ncsu.edu)
* Soumyadeep Chatterjee (schatte5@ncsu.edu)

== Algorithms ==

Reputation systems may take various factors into account:
* Does a reviewer assign scores that are similar to scores assigned by the instructor (on work that they both grade)?
* Does a reviewer assign scores that match those assigned by other reviewers?
* Does the reviewer assign different scores to different work?
* How competent has the reviewer been on other work done for the class?

Two algorithms are used: the Hamer-peer algorithm, which has the lowest maximum absolute bias, and the Lauw-peer algorithm, which has the lowest overall bias. This indicates, from the instructor’s perspective, that if there are further assignments of this kind, expert grading may not be necessary. It is observed in the article (https://ieeexplore.ieee.org/abstract/document/7344292) that the overall bias is a little bit higher, but the maximum absolute bias is very high (more than 20). This indicates that for future similar courses, the instructor can trust most students’ peer grading, but should be aware that the students may give inflated grades. Therefore, spot-checking is necessary. However, overall bias is quite low, as the students gave grades at least 16 points lower than expert grades. This may be because either more training is needed or the review rubric is inadequate. The results also suggest that for future courses of this kind, the instructor cannot trust the students' grades; expert grades are still necessary.
The main difference between the Hamer-peer and the Lauw-peer algorithm is that the Lauw-peer algorithm keeps track of the reviewer's leniency (“bias”), which can be either positive or negative. A positive leniency indicates the reviewer tends to give higher scores than average. Additionally, the range for Hamer’s algorithm is (0,∞) while for Lauw’s algorithm it is [0,1].

=== Hamer Algorithm ===
[[File:Step1.PNG|400px]]
 
[[File:Step2.PNG|400px]]
 
[[File:Step3.PNG|400px]]
 
[[File:Step4.PNG|400px]]

We implemented the steps of this algorithm for our analytical validation, found in the section below.
== Test Plan - Initial Phase ==
In the initial phase, we were tasked with testing the reputation_web_service_controller. The work done by a previous project team was impacted by the web-service (peerlogic) not being available at that time.
This time, we were able to access the Peerlogic server at a late stage - therefore, our plan at this moment involved performing a series of unit tests to determine that the web-service was communicating
correctly with Expertiza.

=== Initial Testing Plan ===
1. Since our focus in this phase was to conduct exploratory testing of the system, we wrote some conventional tests to examine Peerlogic functionality. At this stage, we realized that Peerlogic
would only accept and respond with JSON data.

2. Therefore, a natural next step was to prepare a series of input data that simulated a general input scenario for the system, comprising:
::a. Each reviewer has assigned scores to 3 reviewees (fellow students)
::b. There are a total of 3 reviewers, who have all graded each other in some fashion for 5 assignments
::c. Convert this scenario to JSON
::d. Write code to POST this to Peerlogic, and receive a response
::e. Parse this response to obtain the output values of the Hamer Algorithm, as calculated by Peerlogic.
::f. Compare this output against the expected values that we calculated based on the research paper for the Hamer Algorithm
'''''The code for the last step is shown below'''''
<pre>
import math

# Parameters: reviews list
# reviews list - a list of each reviewer's grades for each assignment
# Example:
# reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]
# Corresponding reviewer and grade for each assignment table
# Essay Reviewer1 Reviewer2 Reviewer3
# Assignment1 5 5 4
# Assignment2 4 3 3
# Assignment3 4 4 4
# Assignment4 3 4 3
# Assignment5 2 2 2

# Reviewers' grades given to each assignment, as a 2D array
# Each index of reviews is a reviewer. Each index in reviews[i] is a review grade
reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]

# Number of reviewers
numReviewers = len(reviews)
# Number of assignments
numAssig = len(reviews[0])
# Initial empty grades for each assignment array
grades = []
# Initial empty delta R array
deltaR = []
# Weight prime
weightPrime = []
# Reviewer's reputation weight
weight= []

# Calculating the average grade for each assignment
for numAssigIndex in range(numAssig):
    assignmentGradeAverage = 0
    for numReviewerIndex in range(numReviewers):
        assignmentGradeAverage += reviews[numReviewerIndex][numAssigIndex]
    grades.append(assignmentGradeAverage/numReviewers)
print("Average Grades:")
print(grades)

# Calculating delta R (mean squared difference from the average grades)
for numReviewerIndex in range(numReviewers):
    reviewerDeltaR = 0
    assignmentAverageGradeIndex = 0
    for reviewGrade in reviews[numReviewerIndex]:
        reviewerDeltaR += ((reviewGrade - grades[assignmentAverageGradeIndex]) ** 2)
        assignmentAverageGradeIndex += 1
    reviewerDeltaR /= numAssig
    deltaR.append(reviewerDeltaR)
print("deltaR:")
print(deltaR)

# Calculating the average delta R over all reviewers
averageDeltaR = 0
for reviewerDeltaR in deltaR:
    averageDeltaR += reviewerDeltaR
averageDeltaR /= numReviewers
print("averageDeltaR:")
print(averageDeltaR)

# Calculating weight prime
for reviewerDeltaR in deltaR:
    weightPrime.append(averageDeltaR/reviewerDeltaR)
print("weightPrime:")
print(weightPrime)

# Calculating reputation weight
for reviewerWeightPrime in weightPrime:
    if reviewerWeightPrime <= 2:
        weight.append(reviewerWeightPrime)
    else:
        weight.append(2 + math.log(reviewerWeightPrime - 1))
print("reputation per reviewer:")
i = 1
for reviewerWeight in weight:
    print("Reputation of Reviewer", i)
    print(round(reviewerWeight,1))
    i += 1
</pre>

Output
<pre>
Reputation of Reviewer 1
1.0
Reputation of Reviewer 2
1.0
Reputation of Reviewer 3
1.0
</pre>
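Step c of the plan above (converting the scenario to JSON) could be sketched as follows. This is only an illustration: the function name <code>reviews_to_payload</code> and the ID numbering are hypothetical, and the <code>submission…</code>/<code>stu…</code> key naming is assumed from the input used in the team's test snippet rather than from any Peerlogic documentation.

```python
import json


def reviews_to_payload(reviews, first_submission_id=1, first_reviewer_id=1):
    """Turn a reviewer-by-assignment grade matrix into a nested dict.

    reviews[i][j] is the grade reviewer i gave to assignment j.
    The key names and the starting IDs are assumptions for illustration.
    """
    payload = {}
    for a in range(len(reviews[0])):
        submission_key = f"submission{first_submission_id + a}"
        payload[submission_key] = {
            f"stu{first_reviewer_id + r}": reviews[r][a]
            for r in range(len(reviews))
        }
    return payload


reviews = [[5, 4, 4, 3, 2], [5, 3, 4, 4, 2], [4, 3, 4, 3, 2]]
body = json.dumps(reviews_to_payload(reviews))
print(body)
```

The resulting string can be used as the request body; only the grade matrix from the analytical code is reused here.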

=== Initial Testing Conclusions ===
The results that are '''''actually''''' received from Peerlogic are presented below:
 
[[File:json-ss.jpeg|600px]]

The results from our Python re-creation of the Hamer Algorithm are as follows:
[[File:reputation_web_server_hamer.png|500px]]

As you can see, they do NOT match with expected results.
'''Therefore, our first conclusion is that the PeerLogic Webservice is implemented incorrectly.'''
This has been documented in the Conclusion section as the first point.
== Changes to Project Scope ==

=== Test Plan - Second Phase ===
We followed the testing thought process recommended by Dr. Gehringer:
In testing this service, we used an external program to send requests to a simulated service, and inspected the returned data.
This decision was reached because the program under test was unfortunately not running, and could not be inspected in an ideal manner.
Additionally, this method was '''recommended by Dr. Gehringer''', our project mentor.
Finally, we also added a test case that ''mocks'' a webservice and asserts the output, done in 2 ways:
::a. In the first code snippet, we send JSON to a webservice that '''returns the correct Hamer output''', as Peerlogic should when fixed.
::b. In the form of an RSpec mock, in the second snippet below.

The test below sends real JSON to both Peerlogic (http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms) and the mock.

Having tested against both Peerlogic and the mock, we found that the current web service is not correct: the returned values do not match the expected values, as can be seen in the picture.
This is what we are supposed to reach in this project.
 
Proof of working:
 
[[File:Response_json_expected.jpeg]]

Test Code Snippet:

<pre>
require "net/http"
require "json"

INPUTS = {
  "submission9999": {
    "stu9999": 10,
    "stu9998": 10,
    "stu9997": 9,
    "stu9996": 5
  },
  "submission9998": {
    "stu9999": 3,
    "stu9998": 2,
    "stu9997": 4,
    "stu9996": 5
  },
  "submission9997": {
    "stu9999": 7,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  },
  "submission9996": {
    "stu9999": 6,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  }
}.to_json

EXPECTED = {
  "Hamer": {
    "9996": 0.6,
    "9997": 3.6,
    "9998": 1.1,
    "9999": 1.1
  }
}.to_json

describe "Expertiza" do
  it "should return the correct Hamer calculation" do
    uri = URI('http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms')
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port) do |http|
      http.request(req)
    end

    expect(JSON.parse(response.body)["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

describe "Expertiza Web Service" do
  it "should return the correct Hamer calculation" do
    uri = URI('https://4dfaead4-a747-4be4-8683-3b10d1d2e0c0.mock.pstmn.io/reputation_web_service/default')
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port, :use_ssl => uri.scheme == 'https') do |http|
      http.request(req)
    end
    expect(JSON.parse("#{response.body}}")["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

</pre>

In addition, this plan enables us to test the current functionality by treating this system as a black box, and is able to provide conclusions on
the accuracy of the implementation as a whole.

Therefore, in the section below, we have provided code that showcases this plan in action. The values returned by the algorithm are to be inspected both by code and by hand.
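For the inspection "by code", one possible approach is to compare returned reputation values against the expected ones with a small tolerance, instead of requiring exact equality of floating-point numbers. The sketch below is not part of the team's test suite; the helper name <code>hamer_mismatches</code> and the tolerance of 0.05 are assumptions chosen for illustration, and the expected values are the ones listed in the test snippet above.

```python
import math


def hamer_mismatches(actual, expected, tol=0.05):
    """Return {reviewer: (actual_value, expected_value)} for every disagreement.

    A reviewer missing from `actual` is reported with None as the actual value.
    The tolerance is an illustrative assumption, not a value from the paper.
    """
    mismatches = {}
    for reviewer, want in expected.items():
        got = actual.get(reviewer)
        if got is None or not math.isclose(got, want, abs_tol=tol):
            mismatches[reviewer] = (got, want)
    return mismatches


expected = {"9996": 0.6, "9997": 3.6, "9998": 1.1, "9999": 1.1}

# A made-up response used only to exercise the helper
sample_response = {"9996": 0.6, "9997": 3.0, "9998": 1.1}
print(hamer_mismatches(sample_response, expected))
```

Reporting the mismatching reviewers, rather than a single pass/fail flag, makes it easier to follow up by hand on exactly the values that differ.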

== Edge Cases & Scenarios ==
''We present these scenarios as possible test cases for an accurately working Peerlogic webservice.''
1) Reviewer gives all max scores 
2) Reviewer gives all min scores 
3) Reviewer completes no review 
alternative scenario - reviewer gives max scores even if no inputs

These have not been implemented as there is no point in testing a system further when positive flows do not work.
However, the code in the Initial Phase Section can be used to analytically calculate correct responses for future assertions.
We have provided outputs to these scenarios below:

== Conclusion ==

We as a team worked out the algorithms and their applications and wrote some test scenarios. However, we did not have a chance to work on the web service, since it does not work due to module errors: what we got was an "undefined method strip" error in the Reputation Web Service Controller. Although it sometimes works on the Expertiza team's side, we were not able to see the web service working. We created some test scenarios and wrote Python code to simulate the algorithm.
 
In the code segment written to simulate the hamer.rb algorithm as described in "A Method of Automatic Grade Calibration in Peer Assessment" by John Hamer, Kenneth T.K. Ma, and Hugh H.F. Kwong (https://crpit.scem.westernsydney.edu.au/confpapers/CRPITV42Hamer.pdf), we take a list of reviewers and their grades for each assignment reviewed and compute the associated reputation weight. Since the algorithm described in the paper does not specify an initial weight for first-time reviewers, we coded it so that first-time reviewers have an initial weight of 1. In addition, this code does not yet take existing weights into account for reviewers who already have reputation weights; this will be added soon. We followed the algorithm in the paper to the letter, but even then the output values given in its example did not match what we computed by hand and by code. Either we missed something or the algorithm has been changed. Having tested against both Peerlogic and the mock, we found that the current web service is not correct, since the returned values do not match the expected values, as can be seen in the picture. This may be the conclusion we were expected to reach in this project.

==GitHub Links==
Link to Expertiza repository: [https://github.com/expertiza/expertiza here]

Link to the forked repository: [https://github.com/joshlin5/expertiza here]

Link to pull request: [https://github.com/expertiza/expertiza/pull/2355/checks here]

== References ==

1. Expertiza on GitHub (https://github.com/expertiza/expertiza) 
2. The live Expertiza website (http://expertiza.ncsu.edu/) 
3. Pluggable reputation systems for peer review: A web-service approach (https://doi.org/10.1109/FIE.2015.7344292)

CSC/ECE 517 Spring 2022 - E2212: Testing for hamer.rb

2022-03-28T01:12:11Z

Jlin36:

This page describes the changes made for the Spring 2022 OSS Project E2212: Testing for hamer.rb
== Project Overview ==

=== Introduction ===

Using students' reviews of an assignment to arrive at a more accurate grade has become more popular among professors and courses in universities. Not only does this method free the professor and TAs from days of work, it also allows students to learn more about an assignment by grading others' work. Unfortunately, many students may not take reviewing others' work seriously and may simply give 100 or 0 to other students. Since such reviews may skew a student's grade, a system to assess the correctness and credibility of a reviewer is necessary for student reviews to be accurate. The Hamer algorithm was created for such purposes and returns a reputation weight associated with each reviewer. The instructor can then use the reputation weight either to assess the reliability of a reviewer or to compute a grade for a reviewer.

=== System Design ===

The hamer algorithm takes in a set of grades for assignments by reviewer and also any reputation weights (optional) associated with each reviewer to compute a reputation weight value for each reviewer. These reputation weight values indicate the accuracy and reliability of each reviewer. For example, a reviewer with a reputation weight of 3.0 is more accurate and reliable in their reviews compared to a 0.5 reputation weight of another reviewer. The following is an example from the paper [1] that describes the hamer algorithm: 
[[File:Hamer Algorithm Inputs Outputs.png|500px]] 

This algorithm is currently deployed to the following web server: http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms. To use the algorithm, a POST request is sent to the Peerlogic URL with a JSON body consisting of each assignment, its grades, and the reviewer who gave each grade. Once the request is sent, a JSON response is returned with the corresponding Hamer reputation weight values. The following is an example of a POST request to the Peerlogic URL that returns Hamer reputation values:
[[File:Reputation web server hamer.png|1000px]] 
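The request/response exchange described above might be prepared along these lines. This is a minimal sketch under stated assumptions: the helper names <code>build_request</code> and <code>parse_hamer</code> are hypothetical, the top-level <code>"Hamer"</code> key in the response is assumed from the expected output used in this page's tests, and the request is only constructed here, not sent, so the sketch makes no claim about what the server returns.

```python
import json
import urllib.request

PEERLOGIC_URL = (
    "http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms"
)


def build_request(grades, url=PEERLOGIC_URL):
    """Build (but do not send) a JSON POST request carrying the grades."""
    body = json.dumps(grades).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_hamer(response_text):
    """Extract the Hamer values, assuming a top-level "Hamer" key."""
    return json.loads(response_text)["Hamer"]


grades = {"submission9999": {"stu9999": 10, "stu9998": 10}}
req = build_request(grades)
print(req.get_method(), req.full_url)
```

Sending the request would be a matter of passing <code>req</code> to <code>urllib.request.urlopen</code> and handing the decoded body to <code>parse_hamer</code>.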

=== Objectives ===

* Calculate reputation scores based on paper "Pluggable reputation systems for peer review: A web-service approach"
* Assert the accuracy of the reputation web server's hamer values through the URL http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms.
* Create a mock web server to return the correct hamer values if the reputation web server's hamer algorithm returns incorrect values.

=== Files Involved ===

*reputation_web_server_hamer.rb
*reputation_mock_web_server_hamer.rb

=== Mentor ===

Ed Gehringer, efg@ncsu.edu

=== Team Members ===

* Joshua Lin (jlin36@ncsu.edu)
* Muhammet Mustafa Olmez (molmez@ncsu.edu)
* Soumyadeep Chatterjee (schatte5@ncsu.edu)

== Algorithms ==

Reputation systems may take various factors into account:
* Does a reviewer assign scores that are similar to scores assigned by the instructor (on work that they both grade)?
* Does a reviewer assign scores that match those assigned by other reviewers?
* Does the reviewer assign different scores to different work?
* How competent has the reviewer been on other work done for the class?

Two algorithms are used: the Hamer-peer algorithm has the lowest maximum absolute bias, and the Lauw-peer algorithm has the lowest overall bias. From the instructor's perspective, this indicates that if there are further assignments of this kind, expert grading may not be necessary. It is observed in the article (https://ieeexplore.ieee.org/abstract/document/7344292) that the overall bias is a little bit higher, but the maximum absolute bias is very high (more than 20). This indicates that for future similar courses, the instructor can trust most students' peer grading but should be aware that the students may give inflated grades; spot-checking is therefore necessary. However, overall bias is quite low, as the students gave grades at least 16 points lower than expert grades. This may be because more training is needed or because the review rubric is inadequate. The results also suggest that for future courses of this kind, the instructor cannot trust the students' grades; expert grades are still necessary.
The main difference between the Hamer-peer and the Lauw-peer algorithm is that the Lauw-peer algorithm keeps track of the reviewer's leniency (“bias”), which can be either positive or negative. A positive leniency indicates the reviewer tends to give higher scores than average. Additionally, the range for Hamer’s algorithm is (0,∞) while for Lauw’s algorithm it is [0,1].

=== Hamer Algorithm ===
[[File:Step1.PNG|400px]]
 
[[File:Step2.PNG|400px]]
 
[[File:Step3.PNG|400px]]
 
[[File:Step4.PNG|400px]]

We implemented the steps of this algorithm for our analytical validation, found in the section below.
== Test Plan - Initial Phase ==
In the initial phase, we were tasked with testing the reputation_web_service_controller. The work done by a previous project team was impacted by the web-service (peerlogic) not being available at that time.
This time, we were able to access the Peerlogic server at a late stage - therefore, our plan at this moment involved performing a series of unit tests to determine that the web-service was communicating
correctly with Expertiza.

=== Initial Testing Plan ===
1. Since our focus in this phase was to conduct exploratory testing of the system, we wrote some conventional tests to examine Peerlogic functionality. At this stage, we realized that Peerlogic
would only accept and respond with JSON data.

2. Therefore, a natural next step was to prepare a series of input data that simulated a general input scenario for the system, comprising:
::a. Each reviewer has assigned scores to 3 reviewees (fellow students)
::b. There are a total of 3 reviewers, who have all graded each other in some fashion for 5 assignments
::c. Convert this scenario to JSON
::d. Write code to POST this to Peerlogic, and receive a response
::e. Parse this response to obtain the output values of the Hamer Algorithm, as calculated by Peerlogic.
::f. Compare this output against the expected values that we calculated based on the research paper for the Hamer Algorithm
'''''The code for the last step is shown below'''''
<pre>
import math

# Parameters: reviews list
# reviews list - a list of each reviewer's grades for each assignment
# Example:
# reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]
# Corresponding reviewer and grade for each assignment table
# Essay Reviewer1 Reviewer2 Reviewer3
# Assignment1 5 5 4
# Assignment2 4 3 3
# Assignment3 4 4 4
# Assignment4 3 4 3
# Assignment5 2 2 2

# Reviewers' grades given to each assignment, as a 2D array
# Each index of reviews is a reviewer. Each index in reviews[i] is a review grade
reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]

# Number of reviewers
numReviewers = len(reviews)
# Number of assignments
numAssig = len(reviews[0])
# Initial empty grades for each assignment array
grades = []
# Initial empty delta R array
deltaR = []
# Weight prime
weightPrime = []
# Reviewer's reputation weight
weight= []

# Calculating the average grade for each assignment
for numAssigIndex in range(numAssig):
    assignmentGradeAverage = 0
    for numReviewerIndex in range(numReviewers):
        assignmentGradeAverage += reviews[numReviewerIndex][numAssigIndex]
    grades.append(assignmentGradeAverage/numReviewers)
print("Average Grades:")
print(grades)

# Calculating delta R (mean squared difference from the average grades)
for numReviewerIndex in range(numReviewers):
    reviewerDeltaR = 0
    assignmentAverageGradeIndex = 0
    for reviewGrade in reviews[numReviewerIndex]:
        reviewerDeltaR += ((reviewGrade - grades[assignmentAverageGradeIndex]) ** 2)
        assignmentAverageGradeIndex += 1
    reviewerDeltaR /= numAssig
    deltaR.append(reviewerDeltaR)
print("deltaR:")
print(deltaR)

# Calculating the average delta R over all reviewers
averageDeltaR = 0
for reviewerDeltaR in deltaR:
    averageDeltaR += reviewerDeltaR
averageDeltaR /= numReviewers
print("averageDeltaR:")
print(averageDeltaR)

# Calculating weight prime
for reviewerDeltaR in deltaR:
    weightPrime.append(averageDeltaR/reviewerDeltaR)
print("weightPrime:")
print(weightPrime)

# Calculating reputation weight
for reviewerWeightPrime in weightPrime:
    if reviewerWeightPrime <= 2:
        weight.append(reviewerWeightPrime)
    else:
        weight.append(2 + math.log(reviewerWeightPrime - 1))
print("reputation per reviewer:")
i = 1
for reviewerWeight in weight:
    print("Reputation of Reviewer", i)
    print(round(reviewerWeight,1))
    i += 1
</pre>

Output
<pre>
Reputation of Reviewer 1
1.0
Reputation of Reviewer 2
1.0
Reputation of Reviewer 3
1.0
</pre>

=== Initial Testing Conclusions ===
The results that are '''''actually''''' received from Peerlogic are presented below:
 
[[File:json-ss.jpeg|600px]]

As you can see, they do NOT match with expected results.
'''Therefore, our first conclusion is that the PeerLogic Webservice is implemented incorrectly.'''
This has been documented in the Conclusion section as the first point.
== Changes to Project Scope ==

=== Second Phase ===
We followed the testing thought process recommended by Dr. Gehringer:
In testing this service, we used an external program to send requests to a simulated service, and inspected the returned data.
This decision was reached because the program under test was unfortunately not running, and could not be inspected in an ideal manner.

The test below sends real JSON to both Peerlogic (http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms) and the mock.

Having tested against both Peerlogic and the mock, we found that the current web service is not correct: the returned values do not match the expected values, as can be seen in the picture.
This is what we are supposed to reach in this project.
 
Proof of working:
 
[[File:Response_json_expected.jpeg]]

Test Code Snippet:

<pre>
require "net/http"
require "json"

INPUTS = {
  "submission9999": {
    "stu9999": 10,
    "stu9998": 10,
    "stu9997": 9,
    "stu9996": 5
  },
  "submission9998": {
    "stu9999": 3,
    "stu9998": 2,
    "stu9997": 4,
    "stu9996": 5
  },
  "submission9997": {
    "stu9999": 7,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  },
  "submission9996": {
    "stu9999": 6,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  }
}.to_json

EXPECTED = {
  "Hamer": {
    "9996": 0.6,
    "9997": 3.6,
    "9998": 1.1,
    "9999": 1.1
  }
}.to_json

describe "Expertiza" do
  it "should return the correct Hamer calculation" do
    uri = URI('http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms')
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port) do |http|
      http.request(req)
    end

    expect(JSON.parse(response.body)["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

describe "Expertiza Web Service" do
  it "should return the correct Hamer calculation" do
    uri = URI('https://4dfaead4-a747-4be4-8683-3b10d1d2e0c0.mock.pstmn.io/reputation_web_service/default')
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port, :use_ssl => uri.scheme == 'https') do |http|
      http.request(req)
    end
    expect(JSON.parse("#{response.body}}")["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

</pre>

In addition, this plan enables us to test the current functionality by treating this system as a black box, and is able to provide conclusions on
the accuracy of the implementation as a whole.

Therefore, in the section below, we have provided code that showcases this plan in action. The values returned by the algorithm are to be inspected both by code and by hand.

== Edge Cases & Scenarios ==
''We present these scenarios as possible test cases for an accurately working Peerlogic webservice.''
1) Reviewer gives all max scores 
2) Reviewer gives all min scores 
3) Reviewer completes no review 
alternative scenario - reviewer gives max scores even if no inputs

These have not been implemented as there is no point in testing a system further when positive flows do not work.
However, the code in the Initial Phase Section can be used to analytically calculate correct responses for future assertions.
We have provided outputs to these scenarios below:
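As a starting point for such future assertions, the three scenarios could be encoded and run through a function form of the single-pass calculation from the Initial Phase section. The sketch below rests on several assumptions that are not taken from the paper or from Peerlogic: a score range of 1 to 5, a reviewer who completes no review being simply absent from the input, a weight left undefined (<code>None</code>) when a reviewer matches every average exactly, and the hypothetical names <code>single_pass_weights</code> and <code>build_scenarios</code>. It checks only relative ordering and claims no specific reputation values.

```python
import math

# Baseline grades reused from the Initial Phase example
BASELINE = {
    "reviewer1": [5, 4, 4, 3, 2],
    "reviewer2": [5, 3, 4, 4, 2],
    "reviewer3": [4, 3, 4, 3, 2],
}
MIN_SCORE, MAX_SCORE = 1, 5  # assumed rubric range


def single_pass_weights(reviews):
    """One unweighted pass: reviews maps reviewer -> list of grades."""
    reviewers = list(reviews)
    count = len(reviews[reviewers[0]])
    averages = [sum(reviews[r][j] for r in reviewers) / len(reviewers)
                for j in range(count)]
    delta = {r: sum((reviews[r][j] - averages[j]) ** 2
                    for j in range(count)) / count
             for r in reviewers}
    mean_delta = sum(delta.values()) / len(delta)
    weights = {}
    for r in reviewers:
        if delta[r] == 0:
            weights[r] = None  # assumption: undefined rather than infinite
            continue
        raw = mean_delta / delta[r]
        weights[r] = raw if raw <= 2 else 2 + math.log(raw - 1)
    return weights


def build_scenarios():
    """Encode the three edge cases as inputs; 'edge' is the extra reviewer."""
    return {
        "all_max": {**BASELINE, "edge": [MAX_SCORE] * 5},
        "all_min": {**BASELINE, "edge": [MIN_SCORE] * 5},
        "no_review": dict(BASELINE),  # the edge reviewer submits nothing
    }


for name, scenario in build_scenarios().items():
    print(name, single_pass_weights(scenario))
```

With these inputs, the reviewer who gives only maximum or only minimum scores ends up with a lower weight than every baseline reviewer, which is the kind of relative check a future test could assert before exact values are known.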

== Conclusion ==

We as a team worked out the algorithms and their applications and wrote some test scenarios. However, we did not have a chance to work on the web service, since it does not work due to module errors: what we got was an "undefined method strip" error in the Reputation Web Service Controller. Although it sometimes works on the Expertiza team's side, we were not able to see the web service working. We created some test scenarios and wrote Python code to simulate the algorithm.
 
In the code segment written to simulate the hamer.rb algorithm as described in "A Method of Automatic Grade Calibration in Peer Assessment" by John Hamer, Kenneth T.K. Ma, and Hugh H.F. Kwong (https://crpit.scem.westernsydney.edu.au/confpapers/CRPITV42Hamer.pdf), we take a list of reviewers and their grades for each assignment reviewed and compute the associated reputation weight. Since the algorithm described in the paper does not specify an initial weight for first-time reviewers, we coded it so that first-time reviewers have an initial weight of 1. In addition, this code does not yet take existing weights into account for reviewers who already have reputation weights; this will be added soon. We followed the algorithm in the paper to the letter, but even then the output values given in its example did not match what we computed by hand and by code. Either we missed something or the algorithm has been changed. Having tested against both Peerlogic and the mock, we found that the current web service is not correct, since the returned values do not match the expected values, as can be seen in the picture. This may be the conclusion we were expected to reach in this project.

==GitHub Links==
Link to Expertiza repository: [https://github.com/expertiza/expertiza here]

Link to the forked repository: [https://github.com/joshlin5/expertiza here]

Link to pull request: [https://github.com/expertiza/expertiza/pull/2355/checks here]

== References ==

1. Expertiza on GitHub (https://github.com/expertiza/expertiza) 
2. The live Expertiza website (http://expertiza.ncsu.edu/) 
3. Pluggable reputation systems for peer review: A web-service approach (https://doi.org/10.1109/FIE.2015.7344292)

File:Reputation web server hamer.png

2022-03-28T01:10:55Z

Jlin36:

CSC/ECE 517 Spring 2022 - E2212: Testing for hamer.rb

2022-03-28T01:10:34Z

Jlin36:

This page describes the changes made for the Spring 2022 OSS Project E2212: Testing for hamer.rb
== Project Overview ==

=== Introduction ===

Using students' reviews of an assignment to arrive at a more accurate grade has become more popular among professors and courses in universities. Not only does this method free the professor and TAs from days of work, it also allows students to learn more about an assignment by grading others' work. Unfortunately, many students may not take reviewing others' work seriously and may simply give 100 or 0 to other students. Since such reviews may skew a student's grade, a system to assess the correctness and credibility of a reviewer is necessary for student reviews to be accurate. The Hamer algorithm was created for such purposes and returns a reputation weight associated with each reviewer. The instructor can then use the reputation weight either to assess the reliability of a reviewer or to compute a grade for a reviewer.

=== System Design ===

The hamer algorithm takes in a set of grades for assignments by reviewer and also any reputation weights (optional) associated with each reviewer to compute a reputation weight value for each reviewer. These reputation weight values indicate the accuracy and reliability of each reviewer. For example, a reviewer with a reputation weight of 3.0 is more accurate and reliable in their reviews compared to a 0.5 reputation weight of another reviewer. The following is an example from the paper [1] that describes the hamer algorithm:

[[File:Hamer Algorithm Inputs Outputs.png|500px]]

This algorithm is currently deployed to the following web server: http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms. To use the algorithm, a POST request is sent to the Peerlogic URL with a JSON body consisting of each assignment, its grades, and the reviewer who gave each grade. Once the request is sent, a JSON response is returned with the corresponding Hamer reputation weight values. The following is an example of a POST request to the Peerlogic URL that returns Hamer reputation values:
[[File:Reputation web server hamer.png|thumb|]] 

=== Objectives ===

* Calculate reputation scores based on paper "Pluggable reputation systems for peer review: A web-service approach"
* Assert the accuracy of the reputation web server's hamer values through the URL http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms.
* Create a mock web server to return the correct hamer values if the reputation web server's hamer algorithm returns incorrect values.

=== Files Involved ===

*reputation_web_server_hamer.rb
*reputation_mock_web_server_hamer.rb

=== Mentor ===

Ed Gehringer, efg@ncsu.edu

=== Team Members ===

* Joshua Lin (jlin36@ncsu.edu)
* Muhammet Mustafa Olmez (molmez@ncsu.edu)
* Soumyadeep Chatterjee (schatte5@ncsu.edu)

== Algorithms ==

Reputation systems may take various factors into account:
* Does a reviewer assign scores that are similar to scores assigned by the instructor (on work that they both grade)?
* Does a reviewer assign scores that match those assigned by other reviewers?
* Does the reviewer assign different scores to different work?
* How competent has the reviewer been on other work done for the class?

Two algorithms are used: the Hamer-peer algorithm has the lowest maximum absolute bias, and the Lauw-peer algorithm has the lowest overall bias. From the instructor's perspective, this indicates that if there are further assignments of this kind, expert grading may not be necessary. It is observed in the article (https://ieeexplore.ieee.org/abstract/document/7344292) that the overall bias is a little bit higher, but the maximum absolute bias is very high (more than 20). This indicates that for future similar courses, the instructor can trust most students' peer grading but should be aware that the students may give inflated grades; spot-checking is therefore necessary. However, overall bias is quite low, as the students gave grades at least 16 points lower than expert grades. This may be because more training is needed or because the review rubric is inadequate. The results also suggest that for future courses of this kind, the instructor cannot trust the students' grades; expert grades are still necessary.
The main difference between the Hamer-peer and the Lauw-peer algorithm is that the Lauw-peer algorithm keeps track of the reviewer's leniency (“bias”), which can be either positive or negative. A positive leniency indicates the reviewer tends to give higher scores than average. Additionally, the range for Hamer’s algorithm is (0,∞) while for Lauw’s algorithm it is [0,1].

=== Hamer Algorithm ===
[[File:Step1.PNG|400px]]
 
[[File:Step2.PNG|400px]]
 
[[File:Step3.PNG|400px]]
 
[[File:Step4.PNG|400px]]

We implemented the steps of this algorithm for our analytical validation, found in the section below.
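For readers who cannot view the step images, the single pass performed by the analytical code in the next section can be written out as below. This is transcribed from that code rather than from the images, so it should be read as a description of our implementation, not as an authoritative statement of the published algorithm. The symbols are notation introduced here: <math>g_{ij}</math> is the grade reviewer <math>i</math> gave to assignment <math>j</math>, <math>R</math> is the number of reviewers, and <math>A</math> is the number of assignments.

```latex
\begin{align*}
\bar{g}_j &= \frac{1}{R}\sum_{i=1}^{R} g_{ij}
  && \text{(average grade of assignment } j\text{)} \\
\delta_i &= \frac{1}{A}\sum_{j=1}^{A} \left(g_{ij} - \bar{g}_j\right)^2
  && \text{(mean squared deviation of reviewer } i\text{)} \\
\bar{\delta} &= \frac{1}{R}\sum_{i=1}^{R} \delta_i,
  \qquad w'_i = \frac{\bar{\delta}}{\delta_i}
  && \text{(raw weight)} \\
w_i &= \begin{cases}
  w'_i & \text{if } w'_i \le 2 \\
  2 + \ln\left(w'_i - 1\right) & \text{otherwise}
\end{cases}
  && \text{(reputation weight)}
\end{align*}
```

The logarithm in the last step damps very large raw weights, so a reviewer who agrees closely with the averages cannot dominate without bound as quickly as the raw ratio would allow.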
== Test Plan - Initial Phase ==
In the initial phase, we were tasked with testing the reputation_web_service_controller. The work done by a previous project team was impacted by the web-service (peerlogic) not being available at that time.
This time, we were able to access the Peerlogic server at a late stage - therefore, our plan at this moment involved performing a series of unit tests to determine that the web-service was communicating
correctly with Expertiza.

=== Initial Testing Plan ===
1. Since our focus in this phase was to conduct exploratory testing of the system, we wrote some conventional tests to examine Peerlogic functionality. At this stage, we realized that Peerlogic
would only accept and respond with JSON data.

2. Therefore, a natural next step was to prepare a series of input data that simulated a general input scenario for the system, comprising:
::a. Each reviewer has assigned scores to 3 reviewees (fellow students)
::b. There are a total of 3 reviewers, who have all graded each other in some fashion for 5 assignments
::c. Convert this scenario to JSON
::d. Write code to POST this to Peerlogic, and receive a response
::e. Parse this response to obtain the output values of the Hamer Algorithm, as calculated by Peerlogic.
::f. Compare this output against the expected values that we calculated based on the research paper for the Hamer Algorithm
'''''The code for the last step is shown below'''''
<pre>
import math

# Parameters: reviews list
# reviews list - a list of each reviewer's grades for each assignment
# Example:
# reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]
# Corresponding reviewer and grade for each assignment table
# Essay        Reviewer1  Reviewer2  Reviewer3
# Assignment1  5          5          4
# Assignment2  4          3          3
# Assignment3  4          4          4
# Assignment4  3          4          3
# Assignment5  2          2          2

# Reviewer's grades given to each assignment, as a 2D array
# Each index of reviews is a reviewer. Each index in reviews[i] is a review grade
reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]

# Number of reviewers
numReviewers = len(reviews)
# Number of assignments
numAssig = len(reviews[0])
# Initial empty grades for each assignment array
grades = []
# Initial empty delta R array
deltaR = []
# Weight prime
weightPrime = []
# Reviewer's reputation weight
weight = []

# Calculating the average grade of each assignment across all reviewers
# (every reviewer starts with a weight of 1, so this is a plain average)
for numAssigIndex in range(numAssig):
    assignmentGradeAverage = 0
    for numReviewerIndex in range(numReviewers):
        assignmentGradeAverage += reviews[numReviewerIndex][numAssigIndex]
    grades.append(assignmentGradeAverage/numReviewers)
print("Average Grades:")
print(grades)

# Calculating delta R: each reviewer's mean squared difference from the average grades
for numReviewerIndex in range(numReviewers):
    reviewerDeltaR = 0
    assignmentAverageGradeIndex = 0
    for reviewGrade in reviews[numReviewerIndex]:
        reviewerDeltaR += ((reviewGrade - grades[assignmentAverageGradeIndex]) ** 2)
        assignmentAverageGradeIndex += 1
    reviewerDeltaR /= numAssig
    deltaR.append(reviewerDeltaR)
print("deltaR:")
print(deltaR)

# Calculating the average delta R over all reviewers
averageDeltaR = 0
for reviewerDeltaR in deltaR:
    averageDeltaR += reviewerDeltaR
averageDeltaR /= numReviewers
print("averageDeltaR:")
print(averageDeltaR)

# Calculating weight prime
# Note: a reviewer whose grades all equal the averages has a delta R of 0,
# which this script does not handle (division by zero)
for reviewerDeltaR in deltaR:
    weightPrime.append(averageDeltaR/reviewerDeltaR)
print("weightPrime:")
print(weightPrime)

# Calculating reputation weight
for reviewerWeightPrime in weightPrime:
    if reviewerWeightPrime <= 2:
        weight.append(reviewerWeightPrime)
    else:
        weight.append(2 + math.log(reviewerWeightPrime - 1))
print("reputation per reviewer:")
i = 1
for reviewerWeight in weight:
    print("Reputation of Reviewer ", i)
    print(round(reviewerWeight,1))
    i += 1
</pre>

Output
<pre>
Reputation of Reviewer 1
1.0
Reputation of Reviewer 2
1.0
Reputation of Reviewer 3
1.0
</pre>
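
Steps c and d of the plan (converting the scenario to JSON before sending it) can be sketched as follows. This is a minimal illustration, not the code we ran: the function name <code>build_payload</code> and the <code>first_id</code> parameter are hypothetical, and the "submissionN" / "stuN" key format is assumed from the test inputs used later on this page.

```python
import json

def build_payload(reviews, first_id=9999):
    """Turn reviews[reviewer][assignment] into {submission: {reviewer: grade}}.

    The key names ("submission9999", "stu9999", ...) are an assumption
    modelled on the test inputs elsewhere on this page.
    """
    payload = {}
    num_assignments = len(reviews[0])
    for a in range(num_assignments):
        submission_key = f"submission{first_id - a}"
        # One entry per reviewer who graded this assignment
        payload[submission_key] = {
            f"stu{first_id - r}": row[a] for r, row in enumerate(reviews)
        }
    return payload

# Same grades as in the script above: one row per reviewer
reviews = [[5, 4, 4, 3, 2], [5, 3, 4, 4, 2], [4, 3, 4, 3, 2]]
payload = build_payload(reviews)
body = json.dumps(payload)
print(body)
```

The resulting string is what would go into the body of the request to Peerlogic.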

=== Initial Testing Conclusions ===
The results that are '''''actually''''' received from Peerlogic are presented below:
 
[[File:json-ss.jpeg|600px]]

As you can see, they do NOT match with expected results.
'''Therefore, our first conclusion is that the PeerLogic Webservice is implemented incorrectly.'''
This has been documented in the Conclusion section as the first point.
== Changes to Project Scope ==

=== Second Phase ===
We followed the testing approach recommended by Dr. Gehringer:
to test this service, we used an external program to send requests to a simulated service and inspected the returned data.
We chose this approach because the program under test was unfortunately not running and could not be inspected in an ideal manner.

The test below sends real JSON to both Peerlogic and the mock service. The Peerlogic endpoint is http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms

Our tests against Peerlogic and the mock show that the current web service is not correct, since the returned values do not match the expected values, as can be seen in the picture.
This is the result we were expected to reach in this project.
 
Proof of working:
 
[[File:Response_json_expected.jpeg]]

Test Code Snippet:

<pre>
require "net/http"
require "json"

# Grades keyed by submission, then by the reviewer who gave each grade
INPUTS = {
  "submission9999": {
    "stu9999": 10,
    "stu9998": 10,
    "stu9997": 9,
    "stu9996": 5
  },
  "submission9998": {
    "stu9999": 3,
    "stu9998": 2,
    "stu9997": 4,
    "stu9996": 5
  },
  "submission9997": {
    "stu9999": 7,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  },
  "submission9996": {
    "stu9999": 6,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  }
}.to_json

# Expected Hamer reputation weight for each reviewer
EXPECTED = {
  "Hamer": {
    "9996": 0.6,
    "9997": 3.6,
    "9998": 1.1,
    "9999": 1.1
  }
}.to_json

# Test against the Peerlogic web service
describe "Expertiza" do
  it "should return the correct Hamer calculation" do
    uri = URI('http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms')
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port) do |http|
      http.request(req)
    end

    expect(JSON.parse(response.body)["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

# Test against the mock web service
describe "Expertiza Web Service" do
  it "should return the correct Hamer calculation" do
    uri = URI('https://4dfaead4-a747-4be4-8683-3b10d1d2e0c0.mock.pstmn.io/reputation_web_service/default')
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port, :use_ssl => uri.scheme == 'https') do |http|
      http.request(req)
    end
    # A closing brace is appended to the mock's response body before parsing
    expect(JSON.parse("#{response.body}}")["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

</pre>

In addition, this plan enables us to test the current functionality by treating this system as a black box, and is able to provide conclusions on
the accuracy of the implementation as a whole.

Therefore, in the section below, we have provided code that showcases this plan in action. The values returned by the algorithm are to be inspected both by code and by hand.

== Scenarios ==
1) Reviewer gives all max scores 
2) Reviewer gives all min scores 
3) Reviewer completes no review 
alternative scenario - reviewer gives max scores even if no inputs

We present these scenarios as possible test cases for an accurately working Peerlogic webservice.
These have not been implemented as there is no point in testing a system further when positive flows do not work.
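
Although these scenarios were not implemented, their inputs could be generated along the following lines. This is only a sketch under stated assumptions: the helper <code>apply_scenario</code>, its mode names, and the 0 to 10 score range are all hypothetical and are not part of Expertiza or Peerlogic.

```python
import copy

def apply_scenario(scores, reviewer, mode, min_score=0, max_score=10):
    """Return a copy of scores with one reviewer's grades changed per scenario.

    scores maps submission -> {reviewer: grade}.
    mode is "all_max", "all_min", or "no_review" (hypothetical names).
    The default score range 0..10 is an assumption.
    """
    if mode not in ("all_max", "all_min", "no_review"):
        raise ValueError(f"unknown scenario: {mode}")
    result = copy.deepcopy(scores)  # leave the base data untouched
    for grades in result.values():
        if reviewer not in grades:
            continue
        if mode == "all_max":
            grades[reviewer] = max_score
        elif mode == "all_min":
            grades[reviewer] = min_score
        else:  # "no_review": the reviewer submitted nothing
            del grades[reviewer]
    return result

base = {
    "submission9999": {"stu9999": 10, "stu9998": 10, "stu9997": 9},
    "submission9998": {"stu9999": 3, "stu9998": 2, "stu9997": 4},
}
all_max = apply_scenario(base, "stu9997", "all_max")
all_min = apply_scenario(base, "stu9997", "all_min")
no_review = apply_scenario(base, "stu9997", "no_review")
print(all_max)
print(all_min)
print(no_review)
```

Each generated dictionary could then be serialized to JSON and sent in place of the regular inputs once the positive flow works.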
== Conclusion ==

As a team, we worked out the algorithms and their applications and wrote some test scenarios. However, we did not have a chance to work with the web service, since it does not run because of module errors: we encountered an undefined method "strip" error in the Reputation Web Service Controller. Although it sometimes works on the Expertiza team's side, we were not able to see the web service working. We created some test scenarios and wrote Python code to simulate the algorithm.
 
The code segment written to simulate the hamer.rb algorithm, as described in "A Method of Automatic Grade Calibration in Peer Assessment" by John Hamer, Kenneth T.K. Ma, and Hugh H.F. Kwong (https://crpit.scem.westernsydney.edu.au/confpapers/CRPITV42Hamer.pdf), takes a list of reviewers and their grades for each assignment reviewed and computes the associated reputation weight. Since the algorithm described in the paper does not specify an initial weight for first-time reviewers, we coded it so that first-time reviewers have an initial weight of 1. The code does not yet account for reviewers who already have reputation weights; this will be added soon. We followed the algorithm in the paper to the letter, but even then the output values given in the paper's example did not match what we computed by hand and by code. Either we missed something or the algorithm has been changed. Our tests against Peerlogic and the mock show that the current web service is not correct, since the returned values do not match the expected values, as can be seen in the picture. This may be the result we were expected to reach in this project.

==GitHub Links==
Link to Expertiza repository: [https://github.com/expertiza/expertiza here]

Link to the forked repository: [https://github.com/joshlin5/expertiza here]

Link to pull request: [https://github.com/expertiza/expertiza/pull/2355/checks here]

== References ==

1. Expertiza on GitHub (https://github.com/expertiza/expertiza) 
2. The live Expertiza website (http://expertiza.ncsu.edu/) 
3. Pluggable reputation systems for peer review: A web-service approach (https://doi.org/10.1109/FIE.2015.7344292)

CSC/ECE 517 Spring 2022 - E2212: Testing for hamer.rb

2022-03-28T01:09:16Z

Jlin36:

This page describes the changes made for the Spring 2022 OSS Project E2212: Testing for hamer.rb
== Project Overview ==

=== Introduction ===

Using students' reviews of an assignment as a more accurate grade has become more popular among professors and courses in universities. Not only does this method free the professor and TAs from days of work, but it also allows students to learn more about an assignment by grading others' work. Unfortunately, many students may not take reviewing others' work seriously and may simply give 100 or 0 to other students. Since such reviews may skew a student's grade, a system to assess the correctness and credibility of a reviewer is necessary for student reviews to be accurate. The Hamer algorithm was created for this purpose and returns a reputation weight associated with each reviewer. The instructor can then use the reputation weight either to assess the reliability of a reviewer or to compute a grade for a reviewer.

=== System Design ===

The hamer algorithm takes in a set of grades for assignments by reviewer and also any reputation weights (optional) associated with each reviewer to compute a reputation weight value for each reviewer. These reputation weight values indicate the accuracy and reliability of each reviewer. For example, a reviewer with a reputation weight of 3.0 is more accurate and reliable in their reviews compared to a 0.5 reputation weight of another reviewer. The following is an example from the paper [1] that describes the hamer algorithm:
[[File:Hamer Algorithm Inputs Outputs.png|1000px]]

This algorithm is currently deployed to the following web server: http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms. To use the algorithm, a post request is sent to the peerlogic URL with a json body consisting of each assignment and the grades, and the reviewer who gave each grade. Once the request is sent, a json response is sent back with the corresponding hamer reputation weight values. The following is an example of a post request to the peerlogic URL to get back hamer reputation values:
[[File:Reputation web server hamer.png|thumb|]]
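
The request described above can be sketched in Python as follows. This is an illustrative sketch only: the helper name <code>build_request</code> is hypothetical, the example scores are made up for the illustration, and the actual network call is left commented out because we cannot guarantee that the server is reachable.

```python
import json
import urllib.request

PEERLOGIC_URL = "http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms"

def build_request(scores, url=PEERLOGIC_URL):
    """Build (but do not send) a POST request carrying the grades as JSON."""
    body = json.dumps(scores).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Illustrative grades: submission -> {reviewer: grade}
scores = {
    "submission9999": {"stu9999": 10, "stu9998": 10},
    "submission9998": {"stu9999": 3, "stu9998": 2},
}
request = build_request(scores)
print(request.get_method(), request.full_url)

# Sending the request and parsing the JSON reply would look like this:
# with urllib.request.urlopen(request, timeout=10) as response:
#     reply = json.loads(response.read())
```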

=== Objectives ===

* Calculate reputation scores based on paper "Pluggable reputation systems for peer review: A web-service approach"
* Assert the accuracy of the reputation web server's hamer values through the URL http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms.
* Create a mock web server to return the correct hamer values if the reputation web server's hamer algorithm returns incorrect values.

=== Files Involved ===

*reputation_web_server_hamer.rb
*reputation_mock_web_server_hamer.rb

=== Mentor ===

Ed Gehringer, efg@ncsu.edu

=== Team Members ===

* Joshua Lin (jlin36@ncsu.edu)
* Muhammet Mustafa Olmez (molmez@ncsu.edu)
* Soumyadeep Chatterjee (schatte5@ncsu.edu)

== Algorithms ==

Reputation systems may take various factors into account:
* Does a reviewer assign scores that are similar to scores assigned by the instructor (on work that they both grade)?
* Does a reviewer assign scores that match those assigned by other reviewers?
* Does the reviewer assign different scores to different work?
* How competent has the reviewer been on other work done for the class?

Two algorithms are used: the Hamer-peer algorithm has the lowest maximum absolute bias, and the Lauw-peer algorithm has the lowest overall bias. This indicates, from the instructor’s perspective, that if there are further assignments of this kind, expert grading may not be necessary. It is observed in the article (https://ieeexplore.ieee.org/abstract/document/7344292) that the overall bias is a little bit higher, but the max. absolute bias is very high (more than 20). This indicates that for future similar courses, the instructor can trust most students’ peer grading, but should be aware that the students may give inflated grades. Therefore spot-checking is necessary. However, overall bias is quite low, as the students gave grades at least 16 points lower than expert grades. This may be because either more training is needed, or the review rubric is inadequate. The results also suggest that for future courses of this kind, the instructor cannot trust the students' grades; expert grades are still necessary.
The main difference between the Hamer-peer and the Lauw-peer algorithm is that the Lauw-peer algorithm keeps track of the reviewer's leniency (“bias”), which can be either positive or negative. A positive leniency indicates the reviewer tends to give higher scores than average. Additionally, the range for Hamer’s algorithm is (0,∞) while for Lauw’s algorithm it is [0,1].
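
To illustrate what the sign of a leniency value means, the sketch below computes each reviewer's mean signed deviation from the per-assignment average. This is not Lauw's iterative algorithm, only a simplified illustration of positive versus negative leniency; the function name <code>mean_signed_deviation</code> is hypothetical and the grades are example data.

```python
def mean_signed_deviation(reviews):
    """For each reviewer, average (grade - assignment average) over assignments.

    reviews[r][a] is reviewer r's grade for assignment a.
    A positive result means the reviewer tends to grade above the average.
    This is a simplified illustration, NOT the Lauw-peer algorithm.
    """
    num_reviewers = len(reviews)
    num_assignments = len(reviews[0])
    averages = [
        sum(row[a] for row in reviews) / num_reviewers
        for a in range(num_assignments)
    ]
    return [
        sum(row[a] - averages[a] for a in range(num_assignments)) / num_assignments
        for row in reviews
    ]

# Example data: one row per reviewer, one column per assignment
reviews = [[5, 4, 4, 3, 2], [5, 3, 4, 4, 2], [4, 3, 4, 3, 2]]
leniency = mean_signed_deviation(reviews)
for index, value in enumerate(leniency, start=1):
    print("Reviewer", index, round(value, 3))
```

With this example data, the third reviewer grades below the average overall, so the value for that reviewer is negative, while the first two are positive.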

=== Hamer Algorithm ===
[[File:Step1.PNG|400px]]
 
[[File:Step2.PNG|400px]]
 
[[File:Step3.PNG|400px]]
 
[[File:Step4.PNG|400px]]

We implemented the steps of this algorithm for our analytical validation, found in the section below.
== Test Plan - Initial Phase ==
In the initial phase, we were tasked with testing the reputation_web_service_controller. The work of a previous project team had been hampered because the web service (Peerlogic) was unavailable at that time.
This time, we gained access to the Peerlogic server at a late stage, so our plan at that point was to run a series of unit tests to determine whether the web service was communicating
correctly with Expertiza.

=== Initial Testing Plan ===
1. Since our focus in this phase was to conduct exploratory testing of the system, we wrote some conventional tests to examine Peerlogic functionality. At this stage, we realized that Peerlogic
would only accept and respond with JSON data.

2. Therefore, a natural next step was to prepare input data simulating a general input scenario for the system, as follows:
::a. Each reviewer has assigned scores to 3 reviewees (fellow students)
::b. There are a total of 3 reviewers, who have all graded each other in some fashion for 5 assignments
::c. Convert this scenario to JSON
::d. Write code to POST this to Peerlogic, and receive a response
::e. Parse this response to obtain the output values of the Hamer algorithm, as calculated by Peerlogic
::f. Compare this output against the expected values that we calculated based on the research paper for the Hamer algorithm
'''''The code for the last step is shown below'''''
<pre>
import math

# Parameters: reviews list
# reviews list - a list of each reviewer's grades for each assignment
# Example:
# reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]
# Corresponding reviewer and grade for each assignment table
# Essay        Reviewer1  Reviewer2  Reviewer3
# Assignment1  5          5          4
# Assignment2  4          3          3
# Assignment3  4          4          4
# Assignment4  3          4          3
# Assignment5  2          2          2

# Reviewer's grades given to each assignment, as a 2D array
# Each index of reviews is a reviewer. Each index in reviews[i] is a review grade
reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]

# Number of reviewers
numReviewers = len(reviews)
# Number of assignments
numAssig = len(reviews[0])
# Initial empty grades for each assignment array
grades = []
# Initial empty delta R array
deltaR = []
# Weight prime
weightPrime = []
# Reviewer's reputation weight
weight = []

# Calculating the average grade of each assignment across all reviewers
# (every reviewer starts with a weight of 1, so this is a plain average)
for numAssigIndex in range(numAssig):
    assignmentGradeAverage = 0
    for numReviewerIndex in range(numReviewers):
        assignmentGradeAverage += reviews[numReviewerIndex][numAssigIndex]
    grades.append(assignmentGradeAverage/numReviewers)
print("Average Grades:")
print(grades)

# Calculating delta R: each reviewer's mean squared difference from the average grades
for numReviewerIndex in range(numReviewers):
    reviewerDeltaR = 0
    assignmentAverageGradeIndex = 0
    for reviewGrade in reviews[numReviewerIndex]:
        reviewerDeltaR += ((reviewGrade - grades[assignmentAverageGradeIndex]) ** 2)
        assignmentAverageGradeIndex += 1
    reviewerDeltaR /= numAssig
    deltaR.append(reviewerDeltaR)
print("deltaR:")
print(deltaR)

# Calculating the average delta R over all reviewers
averageDeltaR = 0
for reviewerDeltaR in deltaR:
    averageDeltaR += reviewerDeltaR
averageDeltaR /= numReviewers
print("averageDeltaR:")
print(averageDeltaR)

# Calculating weight prime
# Note: a reviewer whose grades all equal the averages has a delta R of 0,
# which this script does not handle (division by zero)
for reviewerDeltaR in deltaR:
    weightPrime.append(averageDeltaR/reviewerDeltaR)
print("weightPrime:")
print(weightPrime)

# Calculating reputation weight
for reviewerWeightPrime in weightPrime:
    if reviewerWeightPrime <= 2:
        weight.append(reviewerWeightPrime)
    else:
        weight.append(2 + math.log(reviewerWeightPrime - 1))
print("reputation per reviewer:")
i = 1
for reviewerWeight in weight:
    print("Reputation of Reviewer ", i)
    print(round(reviewerWeight,1))
    i += 1
</pre>

Output
<pre>
Reputation of Reviewer 1
1.0
Reputation of Reviewer 2
1.0
Reputation of Reviewer 3
1.0
</pre>

=== Initial Testing Conclusions ===
The results that are '''''actually''''' received from Peerlogic are presented below:
 
[[File:json-ss.jpeg|600px]]

As you can see, they do NOT match with expected results.
'''Therefore, our first conclusion is that the PeerLogic Webservice is implemented incorrectly.'''
This has been documented in the Conclusion section as the first point.
== Changes to Project Scope ==

=== Second Phase ===
We followed the testing approach recommended by Dr. Gehringer:
to test this service, we used an external program to send requests to a simulated service and inspected the returned data.
We chose this approach because the program under test was unfortunately not running and could not be inspected in an ideal manner.

The test below sends real JSON to both Peerlogic and the mock service. The Peerlogic endpoint is http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms

Our tests against Peerlogic and the mock show that the current web service is not correct, since the returned values do not match the expected values, as can be seen in the picture.
This is the result we were expected to reach in this project.
 
Proof of working:
 
[[File:Response_json_expected.jpeg]]

Test Code Snippet:

<pre>
require "net/http"
require "json"

INPUTS = {
"submission9999": {
"stu9999": 10,
"stu9998": 10,
"stu9997": 9,
"stu9996": 5
},
"submission9998": {
"stu9999": 3,
"stu9998": 2,
"stu9997": 4,
"stu9996": 5
},
"submission9997": {
"stu9999": 7,
"stu9998": 4,
"stu9997": 5,
"stu9996": 5
},
"submission9996": {
"stu9999": 6,
"stu9998": 4,
"stu9997": 5,
"stu9996": 5
}
}.to_json

EXPECTED = {
"Hamer": {
"9996": 0.6,
"9997": 3.6,
"9998": 1.1,
"9999": 1.1
}
}.to_json

describe "Expertiza" do
it "should return the correct Hamer calculation" do
uri = URI('http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms')
req = Net::HTTP::Post.new(uri)
req.content_type = 'application/json'
req.body = INPUTS

response = Net::HTTP.start(uri.hostname, uri.port) do |http|
http.request(req)
end

expect(JSON.parse(response.body)["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
end
end

describe "Expertiza Web Service" do
it "should return the correct Hamer calculation" do
uri = URI('https://4dfaead4-a747-4be4-8683-3b10d1d2e0c0.mock.pstmn.io/reputation_web_service/default')
req = Net::HTTP::Post.new(uri)
req.content_type = 'application/json'
req.body = INPUTS

response = Net::HTTP.start(uri.hostname, uri.port, :use_ssl => uri.scheme == 'https') do |http|
http.request(req)
end
expect(JSON.parse("#{response.body}}")["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
end
end

</pre>

In addition, this plan enables us to test the current functionality by treating this system as a black box, and is able to provide conclusions on
the accuracy of the implementation as a whole.

Therefore, in the section below, we have provided code that showcases this plan in action. The values returned by the algorithm are to be inspected both by code and by hand.

== Scenarios ==
1) Reviewer gives all max scores 
2) Reviewer gives all min scores 
3) Reviewer completes no review 
alternative scenario - reviewer gives max scores even if no inputs

We present these scenarios as possible test cases for an accurately working Peerlogic webservice.
These have not been implemented as there is no point in testing a system further when positive flows do not work.
== Conclusion ==

As a team, we worked out the algorithms and their applications and wrote some test scenarios. However, we did not have a chance to work with the web service, since it does not run because of module errors: we encountered an undefined method "strip" error in the Reputation Web Service Controller. Although it sometimes works on the Expertiza team's side, we were not able to see the web service working. We created some test scenarios and wrote Python code to simulate the algorithm.
 
The code segment written to simulate the hamer.rb algorithm, as described in "A Method of Automatic Grade Calibration in Peer Assessment" by John Hamer, Kenneth T.K. Ma, and Hugh H.F. Kwong (https://crpit.scem.westernsydney.edu.au/confpapers/CRPITV42Hamer.pdf), takes a list of reviewers and their grades for each assignment reviewed and computes the associated reputation weight. Since the algorithm described in the paper does not specify an initial weight for first-time reviewers, we coded it so that first-time reviewers have an initial weight of 1. The code does not yet account for reviewers who already have reputation weights; this will be added soon. We followed the algorithm in the paper to the letter, but even then the output values given in the paper's example did not match what we computed by hand and by code. Either we missed something or the algorithm has been changed. Our tests against Peerlogic and the mock show that the current web service is not correct, since the returned values do not match the expected values, as can be seen in the picture. This may be the result we were expected to reach in this project.

==GitHub Links==
Link to Expertiza repository: [https://github.com/expertiza/expertiza here]

Link to the forked repository: [https://github.com/joshlin5/expertiza here]

Link to pull request: [https://github.com/expertiza/expertiza/pull/2355/checks here]

== References ==

1. Expertiza on GitHub (https://github.com/expertiza/expertiza) 
2. The live Expertiza website (http://expertiza.ncsu.edu/) 
3. Pluggable reputation systems for peer review: A web-service approach (https://doi.org/10.1109/FIE.2015.7344292)

CSC/ECE 517 Spring 2022 - E2212: Testing for hamer.rb

2022-03-28T01:08:48Z

Jlin36:

This page describes the changes made for the Spring 2022 OSS Project E2212: Testing for hamer.rb
== Project Overview ==

=== Introduction ===

Using students' reviews of an assignment as a more accurate grade has become more popular among professors and courses in universities. Not only does this method free the professor and TAs from days of work, but it also allows students to learn more about an assignment by grading others' work. Unfortunately, many students may not take reviewing others' work seriously and may simply give 100 or 0 to other students. Since such reviews may skew a student's grade, a system to assess the correctness and credibility of a reviewer is necessary for student reviews to be accurate. The Hamer algorithm was created for this purpose and returns a reputation weight associated with each reviewer. The instructor can then use the reputation weight either to assess the reliability of a reviewer or to compute a grade for a reviewer.

=== System Design ===

The hamer algorithm takes in a set of grades for assignments by reviewer and also any reputation weights (optional) associated with each reviewer to compute a reputation weight value for each reviewer. These reputation weight values indicate the accuracy and reliability of each reviewer. For example, a reviewer with a reputation weight of 3.0 is more accurate and reliable in their reviews compared to a 0.5 reputation weight of another reviewer. The following is an example from the paper [1] that describes the hamer algorithm:
[[File:Hamer Algorithm Inputs Outputs.png]]

This algorithm is currently deployed to the following web server: http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms. To use the algorithm, a post request is sent to the peerlogic URL with a json body consisting of each assignment and the grades, and the reviewer who gave each grade. Once the request is sent, a json response is sent back with the corresponding hamer reputation weight values. The following is an example of a post request to the peerlogic URL to get back hamer reputation values:
[[File:Reputation web server hamer.png|thumb|]]

=== Objectives ===

* Calculate reputation scores based on paper "Pluggable reputation systems for peer review: A web-service approach"
* Assert the accuracy of the reputation web server's hamer values through the URL http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms.
* Create a mock web server to return the correct hamer values if the reputation web server's hamer algorithm returns incorrect values.

=== Files Involved ===

*reputation_web_server_hamer.rb
*reputation_mock_web_server_hamer.rb

=== Mentor ===

Ed Gehringer, efg@ncsu.edu

=== Team Members ===

* Joshua Lin (jlin36@ncsu.edu)
* Muhammet Mustafa Olmez (molmez@ncsu.edu)
* Soumyadeep Chatterjee (schatte5@ncsu.edu)

== Algorithms ==

Reputation systems may take various factors into account:
* Does a reviewer assign scores that are similar to scores assigned by the instructor (on work that they both grade)?
* Does a reviewer assign scores that match those assigned by other reviewers?
* Does the reviewer assign different scores to different work?
* How competent has the reviewer been on other work done for the class?

Two algorithms are used: the Hamer-peer algorithm has the lowest maximum absolute bias, and the Lauw-peer algorithm has the lowest overall bias. This indicates, from the instructor’s perspective, that if there are further assignments of this kind, expert grading may not be necessary. It is observed in the article (https://ieeexplore.ieee.org/abstract/document/7344292) that the overall bias is a little bit higher, but the max. absolute bias is very high (more than 20). This indicates that for future similar courses, the instructor can trust most students’ peer grading, but should be aware that the students may give inflated grades. Therefore spot-checking is necessary. However, overall bias is quite low, as the students gave grades at least 16 points lower than expert grades. This may be because either more training is needed, or the review rubric is inadequate. The results also suggest that for future courses of this kind, the instructor cannot trust the students' grades; expert grades are still necessary.
The main difference between the Hamer-peer and the Lauw-peer algorithm is that the Lauw-peer algorithm keeps track of the reviewer's leniency (“bias”), which can be either positive or negative. A positive leniency indicates the reviewer tends to give higher scores than average. Additionally, the range for Hamer’s algorithm is (0,∞) while for Lauw’s algorithm it is [0,1].

=== Hamer Algorithm ===
[[File:Step1.PNG|400px]]
 
[[File:Step2.PNG|400px]]
 
[[File:Step3.PNG|400px]]
 
[[File:Step4.PNG|400px]]

We implemented the steps of this algorithm for our analytical validation, found in the section below.
== Test Plan - Initial Phase ==
In the initial phase, we were tasked with testing the reputation_web_service_controller. The work of a previous project team had been hampered because the web service (Peerlogic) was unavailable at that time.
This time, we gained access to the Peerlogic server at a late stage, so our plan at that point was to run a series of unit tests to determine whether the web service was communicating
correctly with Expertiza.

=== Initial Testing Plan ===
1. Since our focus in this phase was to conduct exploratory testing of the system, we wrote some conventional tests to examine Peerlogic functionality. At this stage, we realized that Peerlogic
would only accept and respond with JSON data.

2. Therefore, a natural next step was to prepare input data simulating a general input scenario for the system, as follows:
::a. Each reviewer has assigned scores to 3 reviewees (fellow students)
::b. There are a total of 3 reviewers, who have all graded each other in some fashion for 5 assignments
::c. Convert this scenario to JSON
::d. Write code to POST this to Peerlogic, and receive a response
::e. Parse this response to obtain the output values of the Hamer algorithm, as calculated by Peerlogic
::f. Compare this output against the expected values that we calculated based on the research paper for the Hamer algorithm
'''''The code for the last step is shown below'''''
<pre>
import math

# Parameters: reviews list
# reviews list - a list of each reviewer's grades for each assignment
# Example:
# reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]
# Corresponding reviewer and grade for each assignment table
# Essay        Reviewer1  Reviewer2  Reviewer3
# Assignment1  5          5          4
# Assignment2  4          3          3
# Assignment3  4          4          4
# Assignment4  3          4          3
# Assignment5  2          2          2

# Reviewer's grades given to each assignment, as a 2D array
# Each index of reviews is a reviewer. Each index in reviews[i] is a review grade
reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]

# Number of reviewers
numReviewers = len(reviews)
# Number of assignments
numAssig = len(reviews[0])
# Initial empty grades for each assignment array
grades = []
# Initial empty delta R array
deltaR = []
# Weight prime
weightPrime = []
# Reviewer's reputation weight
weight = []

# Calculating the average grade of each assignment across all reviewers
# (every reviewer starts with a weight of 1, so this is a plain average)
for numAssigIndex in range(numAssig):
    assignmentGradeAverage = 0
    for numReviewerIndex in range(numReviewers):
        assignmentGradeAverage += reviews[numReviewerIndex][numAssigIndex]
    grades.append(assignmentGradeAverage/numReviewers)
print("Average Grades:")
print(grades)

# Calculating delta R: each reviewer's mean squared difference from the average grades
for numReviewerIndex in range(numReviewers):
    reviewerDeltaR = 0
    assignmentAverageGradeIndex = 0
    for reviewGrade in reviews[numReviewerIndex]:
        reviewerDeltaR += ((reviewGrade - grades[assignmentAverageGradeIndex]) ** 2)
        assignmentAverageGradeIndex += 1
    reviewerDeltaR /= numAssig
    deltaR.append(reviewerDeltaR)
print("deltaR:")
print(deltaR)

# Calculating the average delta R over all reviewers
averageDeltaR = 0
for reviewerDeltaR in deltaR:
    averageDeltaR += reviewerDeltaR
averageDeltaR /= numReviewers
print("averageDeltaR:")
print(averageDeltaR)

# Calculating weight prime
# Note: a reviewer whose grades all equal the averages has a delta R of 0,
# which this script does not handle (division by zero)
for reviewerDeltaR in deltaR:
    weightPrime.append(averageDeltaR/reviewerDeltaR)
print("weightPrime:")
print(weightPrime)

# Calculating reputation weight
for reviewerWeightPrime in weightPrime:
    if reviewerWeightPrime <= 2:
        weight.append(reviewerWeightPrime)
    else:
        weight.append(2 + math.log(reviewerWeightPrime - 1))
print("reputation per reviewer:")
i = 1
for reviewerWeight in weight:
    print("Reputation of Reviewer ", i)
    print(round(reviewerWeight,1))
    i += 1
</pre>

Output
<pre>
Reputation of Reviewer 1
1.0
Reputation of Reviewer 2
1.0
Reputation of Reviewer 3
1.0
</pre>

=== Initial Testing Conclusions ===
The results that are '''''actually''''' received from Peerlogic are presented below:
 
[[File:json-ss.jpeg|600px]]

As you can see, they do NOT match with expected results.
'''Therefore, our first conclusion is that the PeerLogic Webservice is implemented incorrectly.'''
This has been documented in the Conclusion section as the first point.
== Changes to Project Scope ==

=== Second Phase ===
We followed the testing approach recommended by Dr. Gehringer:
to test this service, we used an external program to send requests to a simulated service and inspected the returned data.
We chose this approach because the program under test was unfortunately not running and could not be inspected in an ideal manner.

The test below sends real JSON to both Peerlogic and the mock service. The Peerlogic endpoint is http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms

Our tests against Peerlogic and the mock show that the current web service is not correct, since the returned values do not match the expected values, as can be seen in the picture.
This is the result we were expected to reach in this project.
 
Proof of working:
 
[[File:Response_json_expected.jpeg]]

Test Code Snippet:

<pre>
require "net/http"
require "json"

INPUTS = {
  "submission9999": { "stu9999": 10, "stu9998": 10, "stu9997": 9, "stu9996": 5 },
  "submission9998": { "stu9999": 3, "stu9998": 2, "stu9997": 4, "stu9996": 5 },
  "submission9997": { "stu9999": 7, "stu9998": 4, "stu9997": 5, "stu9996": 5 },
  "submission9996": { "stu9999": 6, "stu9998": 4, "stu9997": 5, "stu9996": 5 }
}.to_json

EXPECTED = {
  "Hamer": { "9996": 0.6, "9997": 3.6, "9998": 1.1, "9999": 1.1 }
}.to_json

describe "Expertiza" do
  it "should return the correct Hamer calculation" do
    uri = URI('http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms')
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port) do |http|
      http.request(req)
    end

    expect(JSON.parse(response.body)["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

describe "Expertiza Web Service" do
  it "should return the correct Hamer calculation" do
    uri = URI('https://4dfaead4-a747-4be4-8683-3b10d1d2e0c0.mock.pstmn.io/reputation_web_service/default')
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port, :use_ssl => uri.scheme == 'https') do |http|
      http.request(req)
    end
    expect(JSON.parse("#{response.body}}")["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

</pre>

In addition, this plan enables us to test the current functionality by treating the system as a black box, and to draw conclusions about the accuracy of the implementation as a whole.

The code presented on this page showcases this plan in action. The values returned by the algorithm are to be inspected both by code and by hand.
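
One way to do the inspection by code is to compare the returned values with the expected ones within a small tolerance. The sketch below is only an illustration: the helper name <code>find_mismatches</code>, the tolerance of 0.05, and the "actual" response are all assumptions made up for the example, while the expected values are the ones used in the test snippet on this page.

```python
def find_mismatches(expected, actual, tolerance=0.05):
    """Return {reviewer_id: (expected, actual)} for every value that disagrees."""
    mismatches = {}
    for reviewer_id, expected_value in expected.items():
        actual_value = actual.get(reviewer_id)
        if actual_value is None or abs(actual_value - expected_value) > tolerance:
            mismatches[reviewer_id] = (expected_value, actual_value)
    return mismatches


# Expected Hamer values taken from the test snippet on this page.
expected = {"9996": 0.6, "9997": 3.6, "9998": 1.1, "9999": 1.1}
# Hypothetical service response, invented purely for illustration.
actual = {"9996": 0.6, "9997": 2.0, "9998": 1.1, "9999": 1.1}

print(find_mismatches(expected, actual))
```

Reporting every disagreeing reviewer at once, rather than failing on the first one, makes it easier to check the same values by hand afterwards.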

== Scenarios ==
# Reviewer gives all max scores
# Reviewer gives all min scores
# Reviewer completes no review
#* Alternative scenario: the reviewer gives max scores even if there are no inputs

We present these scenarios as possible test cases for an accurately working Peerlogic webservice.
These have not been implemented as there is no point in testing a system further when positive flows do not work.
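
For reference, the first two scenarios could be prepared as inputs to the one-pass simulation shown earlier on this page, roughly as sketched below. This is an assumption-laden illustration, not a test of Peerlogic: the function name <code>hamer_weights</code>, the 1 to 5 score range, the two "other" reviewers, and the handling of a zero deviation are all our own choices. The third scenario (no review) is not covered.

```python
import math


def hamer_weights(reviews):
    """One pass of the simulation: reviews[r][a] is reviewer r's grade for assignment a."""
    num_reviewers = len(reviews)
    num_assignments = len(reviews[0])
    averages = [sum(r[a] for r in reviews) / num_reviewers for a in range(num_assignments)]
    deltas = [
        sum((r[a] - averages[a]) ** 2 for a in range(num_assignments)) / num_assignments
        for r in reviews
    ]
    average_delta = sum(deltas) / num_reviewers
    weights = []
    for delta in deltas:
        # Assumption: a reviewer who matches every average exactly gets an unbounded weight.
        weight_prime = average_delta / delta if delta > 0 else math.inf
        weights.append(weight_prime if weight_prime <= 2 else 2 + math.log(weight_prime - 1))
    return weights


MAX_SCORE, MIN_SCORE = 5, 1  # assumed score range
others = [[5, 3, 4, 4, 2], [4, 3, 4, 3, 2]]  # hypothetical fellow reviewers

all_max = hamer_weights([[MAX_SCORE] * 5] + others)
all_min = hamer_weights([[MIN_SCORE] * 5] + others)
print("all max:", [round(w, 2) for w in all_max])
print("all min:", [round(w, 2) for w in all_min])
```

With these inputs, the reviewer who gives uniform scores ends up with the lowest weight of the three in both scenarios, which is the behaviour a working service would be expected to show.
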
== Conclusion ==

As a team, we studied the algorithms and their applications and wrote some test scenarios. However, we did not have a chance to work with the web service through Expertiza, since it failed with module errors: we encountered an "undefined method strip" error in the Reputation Web Service Controller. Although the service sometimes works on the Expertiza team's side, we were not able to see it working ourselves. We therefore created test scenarios and wrote Python code to simulate the algorithm.

In the code segment written to simulate the hamer.rb algorithm as described in "A Method of Automatic Grade Calibration in Peer Assessment" by John Hamer, Kenneth T.K. Ma, and Hugh H.F. Kwong (https://crpit.scem.westernsydney.edu.au/confpapers/CRPITV42Hamer.pdf), we take a list of reviewers and their grades for each assignment reviewed and compute the associated reputation weights. Since the algorithm described in the paper does not specify an initial weight for first-time reviewers, we coded it so that first-time reviewers have an initial weight of 1. The code does not yet take into account existing reputation weights for reviewers who already have them; this will be added soon. We followed the algorithm in the paper to the letter, but even then the output values given in the paper's example did not match what we computed by hand and by code. Either we missed something or the algorithm has been changed. Testing against both Peerlogic and the mock showed that the current web service is not correct, since the returned values do not match the expected values, as can be seen in the picture. This may be the result we were supposed to reach in this project.

==GitHub Links==
Link to Expertiza repository: [https://github.com/expertiza/expertiza here]

Link to the forked repository: [https://github.com/joshlin5/expertiza here]

Link to pull request: [https://github.com/expertiza/expertiza/pull/2355/checks here]

== References ==

1. Expertiza on GitHub (https://github.com/expertiza/expertiza) 
2. The live Expertiza website (http://expertiza.ncsu.edu/) 
3. Pluggable reputation systems for peer review: A web-service approach (https://doi.org/10.1109/FIE.2015.7344292)

File:Hamer Algorithm Inputs Outputs.png

2022-03-28T01:07:06Z

Jlin36:

CSC/ECE 517 Spring 2022 - E2212: Testing for hamer.rb

2022-03-28T01:05:58Z

Jlin36:

This page describes the changes made for the Spring 2022 OSS Project E2212: Testing for hamer.rb
== Project Overview ==

=== Introduction ===

Using students' reviews of an assignment to arrive at a more accurate grade has become increasingly popular among university professors and courses. Not only does this method free the professor and TAs from days of work, it also allows students to learn more about an assignment by grading others' work. Unfortunately, many students may not take reviewing others' work seriously and may simply give 100 or 0 to other students. Since such reviews may skew a student's grade, a system that assesses the correctness and credibility of a reviewer is necessary for student reviews to be accurate. The Hamer algorithm was created for this purpose and returns a reputation weight associated with each reviewer. The instructor can then use the reputation weight either to assess the reliability of a reviewer or to compute a grade for a reviewer.

=== System Design ===

The Hamer algorithm takes the set of grades that each reviewer gave to assignments and, optionally, any reputation weights already associated with the reviewers, and computes a reputation weight for each reviewer. These reputation weights indicate the accuracy and reliability of each reviewer. For example, a reviewer with a reputation weight of 3.0 is more accurate and reliable in their reviews than a reviewer with a reputation weight of 0.5. The following is an example from the paper [1] that describes the Hamer algorithm:
[[File:Hamer_Algorithm_Inputs_Outputs.png]]

This algorithm is currently deployed to the following web server: http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms. To use the algorithm, a POST request is sent to the Peerlogic URL with a JSON body consisting of each assignment, the grades it received, and the reviewer who gave each grade. Once the request is sent, a JSON response is returned with the corresponding Hamer reputation weights. The following is an example of a POST request to the Peerlogic URL that returns Hamer reputation values:
[[File:Reputation web server hamer.png|thumb|]]
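
The request described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the helper name <code>build_request</code> and the sample grades are hypothetical, the key naming mirrors the test snippet elsewhere on this page, and the request is only built, not sent, because the service is not always reachable.

```python
import json
import urllib.request

# URL taken from the text above.
PEERLOGIC_URL = "http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms"


def build_request(grades, url=PEERLOGIC_URL):
    """Build (but do not send) a JSON POST request for the given grades."""
    body = json.dumps(grades).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical grades: submission -> {reviewer: grade}.
grades = {
    "submission9999": {"stu9999": 10, "stu9998": 10},
    "submission9998": {"stu9999": 3, "stu9998": 2},
}
request = build_request(grades)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# Sending is left commented out because the service is not always reachable:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(json.loads(response.read()))
```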

=== Objectives ===

* Calculate reputation scores based on paper "Pluggable reputation systems for peer review: A web-service approach"
* Assert the accuracy of the reputation web server's hamer values through the URL http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms.
* Create a mock web server to return the correct hamer values if the reputation web server's hamer algorithm returns incorrect values.

=== Files Involved ===

*reputation_web_server_hamer.rb
*reputation_mock_web_server_hamer.rb

=== Mentor ===

Ed Gehringer, efg@ncsu.edu

=== Team Members ===

* Joshua Lin (jlin36@ncsu.edu)
* Muhammet Mustafa Olmez (molmez@ncsu.edu)
* Soumyadeep Chatterjee (schatte5@ncsu.edu)

== Algorithms ==

Reputation systems may take various factors into account:
* Does a reviewer assign scores that are similar to scores assigned by the instructor (on work that they both grade)?
* Does a reviewer assign scores that match those assigned by other reviewers?
* Does the reviewer assign different scores to different work?
* How competent has the reviewer been on other work done for the class?

Two algorithms are used: the Hamer-peer algorithm has the lowest maximum absolute bias, and the Lauw-peer algorithm has the lowest overall bias. From the instructor’s perspective, this indicates that expert grading may not be necessary if there are further assignments of this kind. The article (https://ieeexplore.ieee.org/abstract/document/7344292) observes that the overall bias is a little higher, but the maximum absolute bias is very high (more than 20). This indicates that for future similar courses, the instructor can trust most students’ peer grading, but should be aware that the students may give inflated grades; spot-checking is therefore necessary. However, overall bias is quite low, as the students gave grades at least 16 points lower than expert grades. This may be because either more training is needed or the review rubric is inadequate. The results also suggest that for future courses of this kind, the instructor cannot trust the students' grades; expert grades are still necessary.
The main difference between the Hamer-peer and the Lauw-peer algorithm is that the Lauw-peer algorithm keeps track of the reviewer's leniency (“bias”), which can be either positive or negative. A positive leniency indicates that the reviewer tends to give higher scores than average. Additionally, the range for Hamer’s algorithm is (0,∞) while for Lauw’s algorithm it is [0,1].

=== Hamer Algorithm ===
[[File:Step1.PNG|400px]]
 
[[File:Step2.PNG|400px]]
 
[[File:Step3.PNG|400px]]
 
[[File:Step4.PNG|400px]]

We implemented the steps of this algorithm for our analytical validation, found in the section below.
== Test Plan - Initial Phase ==
In the initial phase, we were tasked with testing the reputation_web_service_controller. The work done by a previous project team was hampered by the web service (Peerlogic) not being available at that time.
This time, we were able to access the Peerlogic server at a late stage. Our plan at that point therefore involved performing a series of unit tests to determine whether the web service was communicating correctly with Expertiza.

=== Initial Testing Plan ===
1. Since our focus in this phase was exploratory testing of the system, we wrote some conventional tests to examine Peerlogic functionality. At this stage, we realized that Peerlogic would only accept and respond with JSON data.

2. Therefore, a natural next step was to prepare input data that simulated a general input scenario for the system, comprising the following:
::a. Each reviewer has assigned scores to 3 reviewees (fellow students)
::b. There are a total of 3 reviewers, who have all graded each other in some fashion for 5 assignments
::c. Convert this scenario to JSON
::d. Write code to POST this to Peerlogic, and receive a response
::e. Parse this response to obtain the output values of the Hamer algorithm, as calculated by Peerlogic
::f. Compare this output against the expected values that we calculated based on the research paper for the Hamer algorithm
'''''The code for the last step is shown below'''''
<pre>
import math

# Parameters: reviews list
# reviews list - a list of each reviewer's grades for each assignment
# Example:
# reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]
# Corresponding reviewer and grade for each assignment table
# Essay        Reviewer1  Reviewer2  Reviewer3
# Assignment1  5          5          4
# Assignment2  4          3          3
# Assignment3  4          4          4
# Assignment4  3          4          3
# Assignment5  2          2          2

# Reviewer's grades given to each assignment, as a 2D array
# Each index of reviews is a reviewer. Each index in reviews[i] is a review grade
reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]

# Number of reviewers
numReviewers = len(reviews)
# Number of assignments
numAssig = len(reviews[0])
# Initial empty grades for each assignment array
grades = []
# Initial empty delta R array
deltaR = []
# Weight prime
weightPrime = []
# Reviewer's reputation weight
weight = []

# Calculating the average grade for each assignment
for numAssigIndex in range(numAssig):
    assignmentGradeAverage = 0
    for numReviewerIndex in range(numReviewers):
        assignmentGradeAverage += reviews[numReviewerIndex][numAssigIndex]
    grades.append(assignmentGradeAverage/numReviewers)
print("Average Grades:")
print(grades)

# Calculating delta R
for numReviewerIndex in range(numReviewers):
    reviewerDeltaR = 0
    assignmentAverageGradeIndex = 0
    for reviewGrade in reviews[numReviewerIndex]:
        reviewerDeltaR += ((reviewGrade - grades[assignmentAverageGradeIndex]) ** 2)
        assignmentAverageGradeIndex += 1
    reviewerDeltaR /= numAssig
    deltaR.append(reviewerDeltaR)
print("deltaR:")
print(deltaR)

# Calculating the average delta R across reviewers
averageDeltaR = 0
for reviewerDeltaR in deltaR:
    averageDeltaR += reviewerDeltaR
averageDeltaR /= numReviewers
print("averageDeltaR:")
print(averageDeltaR)

# Calculating weight prime
for reviewerDeltaR in deltaR:
    weightPrime.append(averageDeltaR/reviewerDeltaR)
print("weightPrime:")
print(weightPrime)

# Calculating reputation weight
for reviewerWeightPrime in weightPrime:
    if reviewerWeightPrime <= 2:
        weight.append(reviewerWeightPrime)
    else:
        weight.append(2 + math.log(reviewerWeightPrime - 1))
print("reputation per reviewer:")
i = 1
for reviewerWeight in weight:
    print("Reputation of Reviewer", i)
    print(round(reviewerWeight,1))
    i += 1
</pre>

Output
<pre>
Reputation of Reviewer 1
1.0
Reputation of Reviewer 2
1.0
Reputation of Reviewer 3
1.0
</pre>
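
All three reviewers receive 1.0 because, for this particular input, each reviewer's mean squared deviation from the per-assignment averages comes to the same value, 2/15. The following sketch repeats the first steps of the simulation with exact fractions to make that visible. It is an illustrative check only and uses Python's standard <code>fractions</code> module; the variable names are ours.

```python
from fractions import Fraction

reviews = [[5, 4, 4, 3, 2], [5, 3, 4, 4, 2], [4, 3, 4, 3, 2]]
num_reviewers = len(reviews)
num_assignments = len(reviews[0])

# Average grade per assignment, kept exact.
averages = [Fraction(sum(r[a] for r in reviews), num_reviewers) for a in range(num_assignments)]

# Mean squared deviation of each reviewer from those averages.
deltas = [
    sum((r[a] - averages[a]) ** 2 for a in range(num_assignments)) / num_assignments
    for r in reviews
]
average_delta = sum(deltas) / num_reviewers

print("averages:", [str(x) for x in averages])
print("deltas:", [str(d) for d in deltas])
print("provisional weights:", [str(average_delta / d) for d in deltas])
```

Since every provisional weight is exactly 1, which is below the threshold of 2, the dampening step leaves it unchanged. This sample therefore cannot distinguish between reviewers, and a more varied input is needed to exercise the algorithm.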

=== Initial Testing Conclusions ===
The results '''''actually''''' received from Peerlogic are presented below:

[[File:json-ss.jpeg|600px]]

As the screenshot shows, they do NOT match the expected results.
'''Therefore, our first conclusion is that the PeerLogic web service is implemented incorrectly.'''
This is documented as the first point in the Conclusion section.
== Changes to Project Scope ==

=== Second Phase ===
We followed the testing approach recommended by Dr. Gehringer:
to test this service, we used an external program to send requests to a simulated service and inspected the returned data.
We took this route because the program under test was unfortunately not running and could not be inspected directly.

The test below sends real JSON to both Peerlogic (http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms) and the mock service.

Testing against both Peerlogic and the mock showed that the current web service is not correct: the values it returns do not match the expected values, as can be seen in the picture.
This is the result we were supposed to reach in this project.
 
Proof of working:
 
[[File:Response_json_expected.jpeg]]

Test Code Snippet:

<pre>
require "net/http"
require "json"

INPUTS = {
  "submission9999": { "stu9999": 10, "stu9998": 10, "stu9997": 9, "stu9996": 5 },
  "submission9998": { "stu9999": 3, "stu9998": 2, "stu9997": 4, "stu9996": 5 },
  "submission9997": { "stu9999": 7, "stu9998": 4, "stu9997": 5, "stu9996": 5 },
  "submission9996": { "stu9999": 6, "stu9998": 4, "stu9997": 5, "stu9996": 5 }
}.to_json

EXPECTED = {
  "Hamer": { "9996": 0.6, "9997": 3.6, "9998": 1.1, "9999": 1.1 }
}.to_json

describe "Expertiza" do
  it "should return the correct Hamer calculation" do
    uri = URI('http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms')
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port) do |http|
      http.request(req)
    end

    expect(JSON.parse(response.body)["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

describe "Expertiza Web Service" do
  it "should return the correct Hamer calculation" do
    uri = URI('https://4dfaead4-a747-4be4-8683-3b10d1d2e0c0.mock.pstmn.io/reputation_web_service/default')
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port, :use_ssl => uri.scheme == 'https') do |http|
      http.request(req)
    end
    expect(JSON.parse("#{response.body}}")["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

</pre>

In addition, this plan enables us to test the current functionality by treating the system as a black box, and to draw conclusions about the accuracy of the implementation as a whole.

The code presented on this page showcases this plan in action. The values returned by the algorithm are to be inspected both by code and by hand.

== Scenarios ==
# Reviewer gives all max scores
# Reviewer gives all min scores
# Reviewer completes no review
#* Alternative scenario: the reviewer gives max scores even if there are no inputs

We present these scenarios as possible test cases for an accurately working Peerlogic webservice.
These have not been implemented as there is no point in testing a system further when positive flows do not work.
== Conclusion ==

As a team, we studied the algorithms and their applications and wrote some test scenarios. However, we did not have a chance to work with the web service through Expertiza, since it failed with module errors: we encountered an "undefined method strip" error in the Reputation Web Service Controller. Although the service sometimes works on the Expertiza team's side, we were not able to see it working ourselves. We therefore created test scenarios and wrote Python code to simulate the algorithm.

In the code segment written to simulate the hamer.rb algorithm as described in "A Method of Automatic Grade Calibration in Peer Assessment" by John Hamer, Kenneth T.K. Ma, and Hugh H.F. Kwong (https://crpit.scem.westernsydney.edu.au/confpapers/CRPITV42Hamer.pdf), we take a list of reviewers and their grades for each assignment reviewed and compute the associated reputation weights. Since the algorithm described in the paper does not specify an initial weight for first-time reviewers, we coded it so that first-time reviewers have an initial weight of 1. The code does not yet take into account existing reputation weights for reviewers who already have them; this will be added soon. We followed the algorithm in the paper to the letter, but even then the output values given in the paper's example did not match what we computed by hand and by code. Either we missed something or the algorithm has been changed. Testing against both Peerlogic and the mock showed that the current web service is not correct, since the returned values do not match the expected values, as can be seen in the picture. This may be the result we were supposed to reach in this project.

==GitHub Links==
Link to Expertiza repository: [https://github.com/expertiza/expertiza here]

Link to the forked repository: [https://github.com/joshlin5/expertiza here]

Link to pull request: [https://github.com/expertiza/expertiza/pull/2355/checks here]

== References ==

1. Expertiza on GitHub (https://github.com/expertiza/expertiza) 
2. The live Expertiza website (http://expertiza.ncsu.edu/) 
3. Pluggable reputation systems for peer review: A web-service approach (https://doi.org/10.1109/FIE.2015.7344292)

CSC/ECE 517 Spring 2022 - E2212: Testing for hamer.rb

2022-03-28T00:36:40Z

Jlin36:

This page describes the changes made for the Spring 2022 OSS Project E2212: Testing for hamer.rb
== Project Overview ==

=== Introduction ===
Using students' reviews of an assignment to arrive at a more accurate grade has become increasingly popular among university professors and courses. Not only does this method free the professor and TAs from days of work, it also allows students to learn more about an assignment by grading others' work. Unfortunately, many students may not take reviewing others' work seriously and may simply give 100 or 0 to other students. Since such reviews may skew a student's grade, a system that assesses the correctness and credibility of a reviewer is necessary for student reviews to be accurate. The Hamer algorithm was created for this purpose and returns a reputation weight associated with each reviewer. The instructor can then use the reputation weight either to assess the reliability of a reviewer or to compute a grade for a reviewer.

=== System Design ===
The Hamer algorithm takes the set of grades that each reviewer gave to assignments and, optionally, any reputation weights already associated with the reviewers, and computes a reputation weight for each reviewer. These reputation weights indicate the accuracy and reliability of each reviewer. For example, a reviewer with a reputation weight of 3.0 is more accurate and reliable in their reviews than a reviewer with a reputation weight of 0.5. The following is an example from the paper [1] that describes the Hamer algorithm:
[[File:Hamer_Algorithm_Inputs_Outputs.png]]

This algorithm is currently deployed to the following web server: http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms. To use the algorithm, a post request is sent to the URL with

== Description ==
hamer.rb was the file that implemented one of the “reputation systems” that can be used to determine the reliability of peer reviewers. However, this file is no longer current, having been replaced by a web service in 2015. Therefore, we will be trying to describe and test this web service in the following sections.

=== Mentor ===

Ed Gehringer, efg@ncsu.edu

=== Team Members ===

* Joshua Lin (jlin36@ncsu.edu)
* Muhammet Mustafa Olmez (molmez@ncsu.edu)
* Soumyadeep Chatterjee (schatte5@ncsu.edu)

== Reputation System ==
Online peer-review systems are now in common use in higher education. They free the instructor and course staff from having to provide personally all the feedback that students receive on their work. However, if we want to assure that all students receive competent feedback, or even use peer-assigned grades, we need a way to judge which peer reviewers are most credible. The solution is to use a reputation system. The reputation system is meant to provide objective value to student-assigned peer-review scores. Students select from a list of tasks to be performed and then prepare their work and submit it to a peer-review system. The work is then reviewed by other students who offer comments/graded feedback to help the submitters improve their work. During the peer-review period it is important to determine which reviews are more accurate and show higher quality. Reputation is one way to achieve this goal; it is a quantitative measurement used to judge which peer reviewers are more reliable. Peer reviewers can use Expertiza to score an author. If Expertiza shows confidence ratings for grades based upon the reviewers' reputation, then authors can more easily determine the legitimacy of the peer-assigned score. In addition, the teaching staff can examine the quality of each peer review based on reputation values and, potentially, crowd-source a significant portion of the grading function. Currently the reputation system is implemented in Expertiza through a web service.
The service does not work all the time: although the Expertiza staff can sometimes run the system, we could not reach the service or obtain values from it, even though we tried on our own local computers and on VCL as well. Nevertheless, we have implemented some test scenarios based on the algorithms used in the web service.

== Algorithms ==
Reputation systems may take various factors into account:
* Does a reviewer assign scores that are similar to scores assigned by the instructor (on work that they both grade)?
* Does a reviewer assign scores that match those assigned by other reviewers?
* Does the reviewer assign different scores to different work?
* How competent has the reviewer been on other work done for the class?

Two algorithms are used: the Hamer-peer algorithm has the lowest maximum absolute bias, and the Lauw-peer algorithm has the lowest overall bias. From the instructor’s perspective, this indicates that expert grading may not be necessary if there are further assignments of this kind. The article (https://ieeexplore.ieee.org/abstract/document/7344292) observes that the overall bias is a little higher, but the maximum absolute bias is very high (more than 20). This indicates that for future similar courses, the instructor can trust most students’ peer grading, but should be aware that the students may give inflated grades; spot-checking is therefore necessary. However, overall bias is quite low, as the students gave grades at least 16 points lower than expert grades. This may be because either more training is needed or the review rubric is inadequate. The results also suggest that for future courses of this kind, the instructor cannot trust the students' grades; expert grades are still necessary.
The main difference between the Hamer-peer and the Lauw-peer algorithm is that the Lauw-peer algorithm keeps track of the reviewer's leniency (“bias”), which can be either positive or negative. A positive leniency indicates that the reviewer tends to give higher scores than average. Additionally, the range for Hamer’s algorithm is (0,∞) while for Lauw’s algorithm it is [0,1].
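
To make the idea of a signed leniency concrete, the sketch below computes, for each reviewer, the mean signed difference between their grades and the per-assignment average. This is a deliberately simplified stand-in and is not the formula from Lauw's algorithm; the function name <code>signed_leniency</code> is our own, and the grades are the sample input used in the simulation on this page.

```python
def signed_leniency(reviews):
    """Mean signed deviation of each reviewer from the per-assignment average."""
    num_reviewers = len(reviews)
    num_assignments = len(reviews[0])
    averages = [sum(r[a] for r in reviews) / num_reviewers for a in range(num_assignments)]
    return [
        sum(r[a] - averages[a] for a in range(num_assignments)) / num_assignments
        for r in reviews
    ]


reviews = [[5, 4, 4, 3, 2], [5, 3, 4, 4, 2], [4, 3, 4, 3, 2]]
for index, value in enumerate(signed_leniency(reviews), start=1):
    label = "lenient" if value > 0 else "strict" if value < 0 else "neutral"
    print(f"Reviewer {index}: {value:+.2f} ({label})")
```

A positive value means the reviewer tends to grade above the group average and a negative value means below it. The squared deviations used in the Hamer simulation discard this sign, which is why they cannot separate lenient reviewers from strict ones.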

== Test Plan ==
=== Initial Phase ===
In the initial phase, we were tasked with testing the reputation_web_service_controller. The work done by a previous project team was hampered by the web service (Peerlogic) not being available at that time.
This time, we were able to access the Peerlogic server at a late stage. Our plan at that point therefore involved performing a series of unit tests to determine whether the web service was communicating correctly with Expertiza.

=== Initial Testing Outcomes ===
1. Since our focus in this phase was exploratory testing of the system, we wrote some conventional tests to examine Peerlogic functionality. At this stage, we realized that Peerlogic would only accept and respond with JSON data.

2. Therefore, a natural next step was to prepare input data that simulated a general input scenario for the system, comprising the following:
::a. Each reviewer has assigned scores to 4 reviewees (fellow students)
::b. There are a total of 4 reviewers, who have all graded each other in some fashion
::c. Convert this scenario to JSON
::d. Write code to POST this to Peerlogic, and receive a response
::e. Parse this response to obtain the output values of the Hamer algorithm, as calculated by Peerlogic

'''''The code for this section is shown below'''''
<pre>
import math

# Parameters: reviews list
# reviews list - a list of each reviewer's grades for each assignment
# Example:
# reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]
# Corresponding reviewer and grade for each assignment table
# Essay        Reviewer1  Reviewer2  Reviewer3
# Assignment1  5          5          4
# Assignment2  4          3          3
# Assignment3  4          4          4
# Assignment4  3          4          3
# Assignment5  2          2          2

# Reviewer's grades given to each assignment, as a 2D array
# Each index of reviews is a reviewer. Each index in reviews[i] is a review grade
reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]

# Number of reviewers
numReviewers = len(reviews)
# Number of assignments
numAssig = len(reviews[0])
# Initial empty grades for each assignment array
grades = []
# Initial empty delta R array
deltaR = []
# Weight prime
weightPrime = []
# Reviewer's reputation weight
weight = []

# Calculating the average grade for each assignment
for numAssigIndex in range(numAssig):
    assignmentGradeAverage = 0
    for numReviewerIndex in range(numReviewers):
        assignmentGradeAverage += reviews[numReviewerIndex][numAssigIndex]
    grades.append(assignmentGradeAverage/numReviewers)
print("Average Grades:")
print(grades)

# Calculating delta R
for numReviewerIndex in range(numReviewers):
    reviewerDeltaR = 0
    assignmentAverageGradeIndex = 0
    for reviewGrade in reviews[numReviewerIndex]:
        reviewerDeltaR += ((reviewGrade - grades[assignmentAverageGradeIndex]) ** 2)
        assignmentAverageGradeIndex += 1
    reviewerDeltaR /= numAssig
    deltaR.append(reviewerDeltaR)
print("deltaR:")
print(deltaR)

# Calculating the average delta R across reviewers
averageDeltaR = 0
for reviewerDeltaR in deltaR:
    averageDeltaR += reviewerDeltaR
averageDeltaR /= numReviewers
print("averageDeltaR:")
print(averageDeltaR)

# Calculating weight prime
for reviewerDeltaR in deltaR:
    weightPrime.append(averageDeltaR/reviewerDeltaR)
print("weightPrime:")
print(weightPrime)

# Calculating reputation weight
for reviewerWeightPrime in weightPrime:
    if reviewerWeightPrime <= 2:
        weight.append(reviewerWeightPrime)
    else:
        weight.append(2 + math.log(reviewerWeightPrime - 1))
print("reputation per reviewer:")
i = 1
for reviewerWeight in weight:
    print("Reputation of Reviewer", i)
    print(round(reviewerWeight,1))
    i += 1
</pre>

Output
<pre>
Reputation of Reviewer 1
1.0
Reputation of Reviewer 2
1.0
Reputation of Reviewer 3
1.0
</pre>

=== Second Phase ===
We followed the testing approach recommended by Dr. Gehringer:
to test this service, we used an external program to send requests to a simulated service and inspected the returned data.
We took this route because the program under test was unfortunately not running and could not be inspected directly.

The test below sends real JSON to both Peerlogic (http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms) and the mock service.

Testing against both Peerlogic and the mock showed that the current web service is not correct: the values it returns do not match the expected values, as can be seen in the picture.
This is the result we were supposed to reach in this project.
 
Proof of working:
 
[[File:Response_json_expected.jpeg]]

Test Code Snippet:

<pre>
require "net/http"
require "json"

INPUTS = {
  "submission9999": { "stu9999": 10, "stu9998": 10, "stu9997": 9, "stu9996": 5 },
  "submission9998": { "stu9999": 3, "stu9998": 2, "stu9997": 4, "stu9996": 5 },
  "submission9997": { "stu9999": 7, "stu9998": 4, "stu9997": 5, "stu9996": 5 },
  "submission9996": { "stu9999": 6, "stu9998": 4, "stu9997": 5, "stu9996": 5 }
}.to_json

EXPECTED = {
  "Hamer": { "9996": 0.6, "9997": 3.6, "9998": 1.1, "9999": 1.1 }
}.to_json

describe "Expertiza" do
  it "should return the correct Hamer calculation" do
    uri = URI('http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms')
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port) do |http|
      http.request(req)
    end

    expect(JSON.parse(response.body)["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

describe "Expertiza Web Service" do
  it "should return the correct Hamer calculation" do
    uri = URI('https://4dfaead4-a747-4be4-8683-3b10d1d2e0c0.mock.pstmn.io/reputation_web_service/default')
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port, :use_ssl => uri.scheme == 'https') do |http|
      http.request(req)
    end
    expect(JSON.parse("#{response.body}}")["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

</pre>

In addition, this plan enables us to test the current functionality by treating the system as a black box, and to draw conclusions about the accuracy of the implementation as a whole.

Therefore, in the section below, we have provided code that showcases this plan in action. The values returned by the algorithm are to be inspected both by code and by hand.

== Simulation Code Segment to Test Web Service ==

<pre>
import math

# Parameters: reviews list
# reviews list - a list of each reviewer's grades for each assignment
# Example:
# reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]
# Corresponding reviewer and grade for each assignment table
# Essay        Reviewer1  Reviewer2  Reviewer3
# Assignment1  5          5          4
# Assignment2  4          3          3
# Assignment3  4          4          4
# Assignment4  3          4          3
# Assignment5  2          2          2

# Reviewer's grades given to each assignment, as a 2D array
# Each index of reviews is a reviewer. Each index in reviews[i] is a review grade
reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]

# Number of reviewers
numReviewers = len(reviews)
# Number of assignments
numAssig = len(reviews[0])
# Initial empty grades for each assignment array
grades = []
# Initial empty delta R array
deltaR = []
# Weight prime
weightPrime = []
# Reviewer's reputation weight
weight = []

# Calculating the average grade for each assignment
for numAssigIndex in range(numAssig):
    assignmentGradeAverage = 0
    for numReviewerIndex in range(numReviewers):
        assignmentGradeAverage += reviews[numReviewerIndex][numAssigIndex]
    grades.append(assignmentGradeAverage/numReviewers)
print("Average Grades:")
print(grades)

# Calculating delta R
for numReviewerIndex in range(numReviewers):
    reviewerDeltaR = 0
    assignmentAverageGradeIndex = 0
    for reviewGrade in reviews[numReviewerIndex]:
        reviewerDeltaR += ((reviewGrade - grades[assignmentAverageGradeIndex]) ** 2)
        assignmentAverageGradeIndex += 1
    reviewerDeltaR /= numAssig
    deltaR.append(reviewerDeltaR)
print("deltaR:")
print(deltaR)

# Calculating the average delta R across reviewers
averageDeltaR = 0
for reviewerDeltaR in deltaR:
    averageDeltaR += reviewerDeltaR
averageDeltaR /= numReviewers
print("averageDeltaR:")
print(averageDeltaR)

# Calculating weight prime
for reviewerDeltaR in deltaR:
    weightPrime.append(averageDeltaR/reviewerDeltaR)
print("weightPrime:")
print(weightPrime)

# Calculating reputation weight
for reviewerWeightPrime in weightPrime:
    if reviewerWeightPrime <= 2:
        weight.append(reviewerWeightPrime)
    else:
        weight.append(2 + math.log(reviewerWeightPrime - 1))
print("reputation per reviewer:")
i = 1
for reviewerWeight in weight:
    print("Reputation of Reviewer", i)
    print(round(reviewerWeight,1))
    i += 1
</pre>

Output
<pre>
Reputation of Reviewer 1
1.0
Reputation of Reviewer 2
1.0
Reputation of Reviewer 3
1.0
</pre>

== Scenarios ==
* Reviewer gives all max scores
* Reviewer gives all min scores
* Reviewer completes no review
* Alternative scenario: reviewer gives max scores even if there are no inputs
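
The first two scenarios can be tried against the simulation logic. The following is a minimal sketch, assuming the single-pass calculation from the simulation code on this page; the function name hamer_weights, the 0 to 5 grading scale, and the sample grades are illustrative assumptions and are not taken from Expertiza or Peerlogic.

```python
import math

def hamer_weights(reviews):
    """Single-pass reputation weights; reviews[i][j] is reviewer i's grade for assignment j."""
    num_reviewers = len(reviews)
    num_assignments = len(reviews[0])
    # Unweighted average grade per assignment
    grades = [sum(r[j] for r in reviews) / num_reviewers for j in range(num_assignments)]
    # Mean squared deviation of each reviewer from the averages
    delta = [sum((r[j] - grades[j]) ** 2 for j in range(num_assignments)) / num_assignments
             for r in reviews]
    avg_delta = sum(delta) / num_reviewers
    weights = []
    for d in delta:
        w_prime = avg_delta / d  # assumes d > 0, i.e. no reviewer matches the averages exactly
        weights.append(w_prime if w_prime <= 2 else 2 + math.log(w_prime - 1))
    return weights

# Hypothetical data: three careful reviewers plus one who always gives the same score
careful = [[5, 3, 4, 2], [5, 3, 3, 2], [4, 3, 4, 2]]
all_max = hamer_weights(careful + [[5, 5, 5, 5]])
all_min = hamer_weights(careful + [[0, 0, 0, 0]])

print([round(w, 2) for w in all_max])
print([round(w, 2) for w in all_min])
```

In both runs the constant-score reviewer (the last entry) ends up with the lowest weight, which is the behaviour these scenarios are meant to check. The "no review" scenario is not covered here because the simulation has no representation for missing grades.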

== Conclusion ==

As a team, we worked out the algorithms and how they are applied, and wrote some test scenarios. However, we did not have a chance to work with the web service, since it does not work due to module errors: what we got was an undefined method strip error in the Reputation Web Service Controller. Although it sometimes works on the Expertiza team's side, we were not able to see the web service working. We therefore created some test scenarios and wrote Python code to simulate the algorithm.
 
The code segment simulates the hamer.rb algorithm as described in "A Method of Automatic Grade Calibration in Peer Assessment" by John Hamer, Kenneth T.K. Ma, and Hugh H.F. Kwong (https://crpit.scem.westernsydney.edu.au/confpapers/CRPITV42Hamer.pdf). It takes a list of reviewers and their grades for each assignment reviewed and computes the associated reputation weights. Since the algorithm described in the paper does not specify an initial weight for first-time reviewers, we coded it so that first-time reviewers start with a weight of 1. The code does not yet take existing weights into account for reviewers who already have reputation weights; this will be added soon. We followed the algorithm in the paper to the letter, but even then the output values given in the paper's example did not match what we computed by hand and by code. Either we missed something or the algorithm has been changed. When we tested against Peerlogic and the mock, the current web service did not appear to be correct, since the returned values do not match the expected values, as can be seen in the picture. This may be the conclusion we were expected to reach in this project.

==GitHub Links==
Link to Expertiza repository: [https://github.com/expertiza/expertiza here]

Link to the forked repository: [https://github.com/joshlin5/expertiza here]

Link to pull request: [https://github.com/expertiza/expertiza/pull/2355/checks here]

== References ==

1. Expertiza on GitHub (https://github.com/expertiza/expertiza) 
2. The live Expertiza website (http://expertiza.ncsu.edu/) 
3. Pluggable reputation systems for peer review: A web-service approach (https://doi.org/10.1109/FIE.2015.7344292)

CSC/ECE 517 Spring 2022 - E2212: Testing for hamer.rb

2022-03-28T00:26:13Z

Jlin36:

This page describes the changes made for the Spring 2022 OSS Project E2212: Testing for hamer.rb
== Project Overview ==

=== Introduction ===
Using students' reviews of an assignment as the basis for a more accurate grade has become more popular among professors and courses in universities. Not only does this method free the professor and TAs from days of work, but it also allows students to learn more about an assignment through grading others' work. Unfortunately, many students may not take reviewing others' work seriously and may simply give 100 or 0 to other students. Since such reviews may skew a student's grade, a system to assess the correctness and credibility of a reviewer is necessary for student reviews to be accurate. The Hamer algorithm was created for this purpose and returns a reputation weight associated with each reviewer. The instructor can then use the reputation weight either to assess the reliability of a reviewer or to compute a grade for a reviewer.

=== System Design ===
The Hamer algorithm takes in a set of grades for assignments by reviewer and, optionally, any reputation weights already associated with each reviewer, and computes a reputation weight value for each reviewer. These reputation weight values indicate the accuracy and reliability of each reviewer. For example, a reviewer with a reputation weight of 3.0 is more accurate and reliable in their reviews than a reviewer with a reputation weight of 0.5.

[[File:Hamer_Algorithm_Inputs_Outputs.png]]
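
One way an instructor might use these weights is to compute a reputation-weighted average of the peer scores for a submission. The following is a rough sketch of that idea only; the function name weighted_grade and the sample scores are hypothetical, and this is not necessarily how Expertiza or the web service combines scores.

```python
def weighted_grade(scores, weights):
    """Average of peer scores, each weighted by its reviewer's reputation."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one weight must be non-zero")
    return sum(s * w for s, w in zip(scores, weights)) / total_weight

# Hypothetical scores from three reviewers with reputations 3.0, 0.5 and 1.0
scores = [90, 40, 80]
weights = [3.0, 0.5, 1.0]
print(round(weighted_grade(scores, weights), 2))   # weighted by reputation
print(round(sum(scores) / len(scores), 2))         # plain average, for comparison
```

The score from the reviewer with weight 0.5 pulls the weighted result down far less than it pulls down the plain average.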

==About Expertiza==
Expertiza is a multi-purpose web application built using Ruby on Rails for students and instructors. Instructors enrolled in Expertiza can create and customize classes, teams, assignments, quizzes, and much more. Students can form teams, attempt quizzes, and complete assignments. Expertiza also allows students to provide peer reviews, enabling them to work together to improve others' learning experiences. It is an open-source application and its GitHub repository is [https://github.com/expertiza/expertiza Expertiza].

== Description ==
hamer.rb was the file that implemented one of the “reputation systems” that can be used to determine the reliability of peer reviewers. However, this file is no longer current, having been replaced by a web service in 2015. Therefore, we will be trying to describe and test this web service in the following sections.

=== Mentor ===

Ed Gehringer, efg@ncsu.edu

=== Team Members ===

* Joshua Lin (jlin36@ncsu.edu)
* Muhammet Mustafa Olmez (molmez@ncsu.edu)
* Soumyadeep Chatterjee (schatte5@ncsu.edu)

== Reputation System ==
Online peer-review systems are now in common use in higher education. They free the instructor and course staff from having to personally provide all the feedback that students receive on their work. However, if we want to ensure that all students receive competent feedback, or even use peer-assigned grades, we need a way to judge which peer reviewers are most credible. The solution is to use a reputation system. The reputation system is meant to provide objective value to student-assigned peer review scores. Students select from a list of tasks to be performed and then prepare their work and submit it to a peer-review system. The work is then reviewed by other students who offer comments/graded feedback to help the submitters improve their work. During the peer review period it is important to determine which reviews are more accurate and show higher quality. Reputation is one way to achieve this goal; it is a quantitative measure used to judge which peer reviewers are more reliable. Peer reviewers can use Expertiza to score an author. If Expertiza shows confidence ratings for grades based upon the reviewers' reputation, then authors can more easily determine the legitimacy of the peer-assigned score. In addition, the teaching staff can examine the quality of each peer review based on reputation values and, potentially, crowd-source a significant portion of the grading function. Currently the reputation system is implemented in Expertiza through a web service.
The service does not work all the time. Although the Expertiza team can sometimes run the system, we could not reach the service or obtain values from it, even though we tried on our own local computers and on VCL as well. Nevertheless, we have implemented some test scenarios based on the algorithms used in the web service.

== Algorithms ==
Reputation systems may take various factors into account:
* Does a reviewer assign scores that are similar to scores assigned by the instructor (on work that they both grade)?
* Does a reviewer assign scores that match those assigned by other reviewers?
* Does the reviewer assign different scores to different work?
* How competent has the reviewer been on other work done for the class?

Two algorithms are used: the Hamer-peer algorithm has the lowest maximum absolute bias and the Lauw-peer algorithm has the lowest overall bias. This indicates, from the instructor’s perspective, that if there are further assignments of this kind, expert grading may not be necessary. It is observed in the article (https://ieeexplore.ieee.org/abstract/document/7344292) that the overall bias is a little bit higher, but the max. absolute bias is very high (more than 20). This indicates that for future similar courses, the instructor can trust most students’ peer grading, but should be aware that the students may give inflated grades. Therefore spot-checking is necessary. However, overall bias is quite low, as the students gave grades at least 16 points lower than expert grades. This may be because either more training is needed, or the review rubric is inadequate. The results also suggest that for future courses of this kind, the instructor cannot trust the students' grades; expert grades are still necessary.
The main difference between the Hamer-peer and the Lauw-peer algorithm is that the Lauw-peer algorithm keeps track of the reviewer's leniency (“bias”), which can be either positive or negative. A positive leniency indicates the reviewer tends to give higher scores than average. Additionally, the range for Hamer’s algorithm is (0,∞) while for Lauw’s algorithm it is [0,1].
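
To illustrate why the Hamer weights have no fixed upper bound, the sketch below isolates the final step used in the simulation code on this page: a raw ratio at or below 2 is kept as is, and larger ratios are compressed logarithmically. The function name damp is ours, and whether the web service applies exactly this step is an assumption.

```python
import math

def damp(w_prime):
    """Final step of the simulation: keep ratios up to 2, compress larger ones with a log."""
    if w_prime <= 2:
        return w_prime
    return 2 + math.log(w_prime - 1)

for w_prime in (0.5, 1.0, 2.0, 5.0, 100.0):
    print(w_prime, round(damp(w_prime), 3))
```

The function is continuous at 2 (since log(1) = 0) and keeps growing beyond it, only slowly, so a single very consistent reviewer cannot dominate but is still ranked above the rest.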

== Test Plan ==
=== Initial Phase ===
In the initial phase, we were tasked with testing the reputation_web_service_controller. The work done by a previous project team was impacted by the web-service (peerlogic) not being available at that time.
This time, we were able to access the Peerlogic server at a late stage - therefore, our plan at this moment involved performing a series of unit tests to determine that the web-service was communicating
correctly with Expertiza.

=== Initial Testing Outcomes ===
1. Since our focus in this phase was to conduct exploratory testing of the system, we wrote some conventional tests to examine Peerlogic functionality. At this stage, we realized that Peerlogic
would only accept and respond with JSON data.
2. Therefore, a natural next step was to prepare a series of input data that simulated a general input scenario for the system, comprising:
 a. Each reviewer has assigned scores to 4 reviewees (fellow students)
 b. There are a total of 4 reviewers, who have all graded each other in some fashion
 c. Convert this scenario to JSON
 d. Using an API testing software, PUT this to Peerlogic, and receive a response
 e. Parse through this response to obtain the output values of the Hamer Algorithm, as calculated by Peerlogic
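
Steps c and e can be sketched without touching the network. The example below is a minimal illustration, assuming the request and response shapes used by the test code on this page (submissions mapping to per-student scores, and a top-level "Hamer" object in the reply); the helper names build_payload and parse_hamer and the canned reply string are hypothetical.

```python
import json

def build_payload(scores_by_submission):
    """Serialize {submission: {reviewer: score}} into the JSON body sent to the service."""
    return json.dumps(scores_by_submission)

def parse_hamer(response_body):
    """Extract the reviewer -> reputation mapping from a JSON reply."""
    return json.loads(response_body)["Hamer"]

# Four reviewers scoring four submissions (values from the test input on this page)
scores = {
    "submission9999": {"stu9999": 10, "stu9998": 10, "stu9997": 9, "stu9996": 5},
    "submission9998": {"stu9999": 3, "stu9998": 2, "stu9997": 4, "stu9996": 5},
    "submission9997": {"stu9999": 7, "stu9998": 4, "stu9997": 5, "stu9996": 5},
    "submission9996": {"stu9999": 6, "stu9998": 4, "stu9997": 5, "stu9996": 5},
}
payload = build_payload(scores)

# Canned reply standing in for a real response; it is not actual service output
canned_reply = '{"Hamer": {"9996": 0.6, "9997": 3.6, "9998": 1.1, "9999": 1.1}}'
reputations = parse_hamer(canned_reply)
print(len(payload) > 0, reputations["9997"])
```

Keeping serialization and parsing in small functions like these makes it possible to check the data handling separately from the availability of the service.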

=== Second Phase ===
We followed the testing thought process recommended by Dr. Gehringer:
In testing this service, we used an external program to send requests to a simulated service and inspected the returned data.
This decision was reached because the program under test was unfortunately not running and could not be inspected directly.

The test below sends real JSON to both Peerlogic (http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms) and a mock service.

When we tested against Peerlogic and the mock, the current web service did not appear to be correct, since the returned values do not match the expected values, as can be seen in the picture.
This may be the conclusion we were expected to reach in this project.
 
Proof of working:
 
[[File:Response_json_expected.jpeg]]

Test Code Snippet:

<pre>
require "net/http"
require "json"

INPUTS = {
  "submission9999": {
    "stu9999": 10,
    "stu9998": 10,
    "stu9997": 9,
    "stu9996": 5
  },
  "submission9998": {
    "stu9999": 3,
    "stu9998": 2,
    "stu9997": 4,
    "stu9996": 5
  },
  "submission9997": {
    "stu9999": 7,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  },
  "submission9996": {
    "stu9999": 6,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  }
}.to_json

EXPECTED = {
  "Hamer": {
    "9996": 0.6,
    "9997": 3.6,
    "9998": 1.1,
    "9999": 1.1
  }
}.to_json

describe "Expertiza" do
  it "should return the correct Hamer calculation" do
    uri = URI('http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms')
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port) do |http|
      http.request(req)
    end

    expect(JSON.parse(response.body)["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

describe "Expertiza Web Service" do
  it "should return the correct Hamer calculation" do
    uri = URI('https://4dfaead4-a747-4be4-8683-3b10d1d2e0c0.mock.pstmn.io/reputation_web_service/default')
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port, :use_ssl => uri.scheme == 'https') do |http|
      http.request(req)
    end
    expect(JSON.parse("#{response.body}}")["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

</pre>

In addition, this plan lets us test the current functionality by treating the system as a black box, and it lets us draw conclusions about the accuracy of the implementation as a whole.

Therefore, in the section below, we have provided code that showcases this plan in action. The values returned by the algorithm are to be inspected both by code and by hand.

== Simulation Code Segment to Test Web Service ==

<pre>
import math

# Parameters: reviews list
# reviews list - a list of each reviewer's grades for each assignment
# Example:
# reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]
# Corresponding reviewer and grade for each assignment table
# Essay Reviewer1 Reviewer2 Reviewer3
# Assignment1 5 5 4
# Assignment2 4 3 3
# Assignment3 4 4 4
# Assignment4 3 4 3
# Assignment5 2 2 2

# Reviewer's grades given to each assignment (2D array)
# Each index of reviews is a reviewer. Each index in reviews[i] is a review grade
reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]

# Number of reviewers
numReviewers = len(reviews)
# Number of assignments
numAssig = len(reviews[0])
# Initial empty grades for each assignment array
grades = []
# Initial empty delta R array
deltaR = []
# Weight prime
weightPrime = []
# Reviewer's reputation weight
weight= []

# Calculating the average grade for each assignment across all reviewers
for numAssigIndex in range(numAssig):
    assignmentGradeAverage = 0
    for numReviewerIndex in range(numReviewers):
        assignmentGradeAverage += reviews[numReviewerIndex][numAssigIndex]
    grades.append(assignmentGradeAverage/numReviewers)
print("Average Grades:")
print(grades)

# Calculating delta R: each reviewer's mean squared deviation from the average grades
for numReviewerIndex in range(numReviewers):
    reviewerDeltaR = 0
    assignmentAverageGradeIndex = 0
    for reviewGrade in reviews[numReviewerIndex]:
        reviewerDeltaR += ((reviewGrade - grades[assignmentAverageGradeIndex]) ** 2)
        assignmentAverageGradeIndex += 1
    reviewerDeltaR /= numAssig
    deltaR.append(reviewerDeltaR)
print("deltaR:")
print(deltaR)

# Calculating the average delta R across all reviewers
averageDeltaR = 0
for reviewerDeltaR in deltaR:
    averageDeltaR += reviewerDeltaR
averageDeltaR /= numReviewers
print("averageDeltaR:")
print(averageDeltaR)

# Calculating weight prime
# Note: this divides by zero if a reviewer matches the average grades exactly
for reviewerDeltaR in deltaR:
    weightPrime.append(averageDeltaR/reviewerDeltaR)
print("weightPrime:")
print(weightPrime)

# Calculating reputation weight
for reviewerWeightPrime in weightPrime:
    if reviewerWeightPrime <= 2:
        weight.append(reviewerWeightPrime)
    else:
        weight.append(2 + math.log(reviewerWeightPrime - 1))
print("reputation per reviewer:")
i = 1
for reviewerWeight in weight:
    print("Reputation of Reviewer", i)
    print(round(reviewerWeight,1))
    i += 1
</pre>

Output
<pre>
Reputation of Reviewer 1
1.0
Reputation of Reviewer 2
1.0
Reputation of Reviewer 3
1.0
</pre>
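
The simulation above starts every reviewer from the same weight. A variant that also accepts existing reputation weights might look like the sketch below. It assumes the prior weights enter only through a weighted average of the grades, which is our reading of the approach rather than a confirmed description of hamer.rb or the web service; the function name hamer_pass and the sample priors are hypothetical.

```python
import math

def hamer_pass(reviews, prior_weights=None):
    """One pass of the weight calculation, optionally seeded with prior reputation weights."""
    num_reviewers = len(reviews)
    num_assignments = len(reviews[0])
    if prior_weights is None:
        # First-time reviewers start with a weight of 1, as in the simulation
        prior_weights = [1.0] * num_reviewers
    total = sum(prior_weights)
    # Weighted average grade per assignment
    grades = [
        sum(w * r[j] for w, r in zip(prior_weights, reviews)) / total
        for j in range(num_assignments)
    ]
    # Mean squared deviation of each reviewer from the weighted averages
    delta = [
        sum((r[j] - grades[j]) ** 2 for j in range(num_assignments)) / num_assignments
        for r in reviews
    ]
    avg_delta = sum(delta) / num_reviewers
    weights = []
    for d in delta:
        w_prime = avg_delta / d  # assumes d > 0
        weights.append(w_prime if w_prime <= 2 else 2 + math.log(w_prime - 1))
    return weights

reviews = [[5, 4, 4, 3, 2], [5, 3, 4, 4, 2], [4, 3, 4, 3, 2]]
print([round(w, 2) for w in hamer_pass(reviews)])                   # uniform priors
print([round(w, 2) for w in hamer_pass(reviews, [2.0, 1.0, 1.0])])  # reviewer 1 trusted more
```

With uniform priors the result matches the output above (all weights 1.0). Seeding reviewer 1 with a higher prior moves the averages toward that reviewer's grades and raises their weight relative to the others.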

== Scenarios ==
* Reviewer gives all max scores
* Reviewer gives all min scores
* Reviewer completes no review
* Alternative scenario: reviewer gives max scores even if there are no inputs

== Conclusion ==

As a team, we worked out the algorithms and how they are applied, and wrote some test scenarios. However, we did not have a chance to work with the web service, since it does not work due to module errors: what we got was an undefined method strip error in the Reputation Web Service Controller. Although it sometimes works on the Expertiza team's side, we were not able to see the web service working. We therefore created some test scenarios and wrote Python code to simulate the algorithm.
 
The code segment simulates the hamer.rb algorithm as described in "A Method of Automatic Grade Calibration in Peer Assessment" by John Hamer, Kenneth T.K. Ma, and Hugh H.F. Kwong (https://crpit.scem.westernsydney.edu.au/confpapers/CRPITV42Hamer.pdf). It takes a list of reviewers and their grades for each assignment reviewed and computes the associated reputation weights. Since the algorithm described in the paper does not specify an initial weight for first-time reviewers, we coded it so that first-time reviewers start with a weight of 1. The code does not yet take existing weights into account for reviewers who already have reputation weights; this will be added soon. We followed the algorithm in the paper to the letter, but even then the output values given in the paper's example did not match what we computed by hand and by code. Either we missed something or the algorithm has been changed. When we tested against Peerlogic and the mock, the current web service did not appear to be correct, since the returned values do not match the expected values, as can be seen in the picture. This may be the conclusion we were expected to reach in this project.

==GitHub Links==
Link to Expertiza repository: [https://github.com/expertiza/expertiza here]

Link to the forked repository: [https://github.com/joshlin5/expertiza here]

Link to pull request: [https://github.com/expertiza/expertiza/pull/2355/checks here]

== References ==

1. Expertiza on GitHub (https://github.com/expertiza/expertiza) 
2. The live Expertiza website (http://expertiza.ncsu.edu/) 
3. Pluggable reputation systems for peer review: A web-service approach (https://doi.org/10.1109/FIE.2015.7344292)

CSC/ECE 517 Spring 2022 - E2212: Testing for hamer.rb

2022-03-28T00:20:38Z

Jlin36:

This page describes the changes made for the Spring 2022 OSS Project E2212: Testing for hamer.rb
== Project Overview ==

=== Introduction ===
Using students' reviews of an assignment as the basis for a more accurate grade has become more popular among professors and courses in universities. Not only does this method free the professor and TAs from days of work, but it also allows students to learn more about an assignment through grading others' work. Unfortunately, many students may not take reviewing others' work seriously and may simply give 100 or 0 to other students. Since such reviews may skew a student's grade, a system to assess the correctness and credibility of a reviewer is necessary for student reviews to be accurate. The Hamer algorithm takes in a set of grades for assignments by reviewer and, optionally, any reputation weights already associated with each reviewer, and computes a reputation weight value for each reviewer. These reputation weight values indicate the accuracy and reliability of each reviewer. For example, a reviewer with a reputation weight of 3.0 is more accurate and reliable in their reviews than a reviewer with a reputation weight of 0.5.

==About Expertiza==
Expertiza is a multi-purpose web application built using Ruby on Rails for students and instructors. Instructors enrolled in Expertiza can create and customize classes, teams, assignments, quizzes, and much more. Students can form teams, attempt quizzes, and complete assignments. Expertiza also allows students to provide peer reviews, enabling them to work together to improve others' learning experiences. It is an open-source application and its GitHub repository is [https://github.com/expertiza/expertiza Expertiza].

== Description ==
hamer.rb was the file that implemented one of the “reputation systems” that can be used to determine the reliability of peer reviewers. However, this file is no longer current, having been replaced by a web service in 2015. Therefore, we will be trying to describe and test this web service in the following sections.

=== Mentor ===

Ed Gehringer, efg@ncsu.edu

=== Team Members ===

* Joshua Lin (jlin36@ncsu.edu)
* Muhammet Mustafa Olmez (molmez@ncsu.edu)
* Soumyadeep Chatterjee (schatte5@ncsu.edu)

== Reputation System ==
Online peer-review systems are now in common use in higher education. They free the instructor and course staff from having to personally provide all the feedback that students receive on their work. However, if we want to ensure that all students receive competent feedback, or even use peer-assigned grades, we need a way to judge which peer reviewers are most credible. The solution is to use a reputation system. The reputation system is meant to provide objective value to student-assigned peer review scores. Students select from a list of tasks to be performed and then prepare their work and submit it to a peer-review system. The work is then reviewed by other students who offer comments/graded feedback to help the submitters improve their work. During the peer review period it is important to determine which reviews are more accurate and show higher quality. Reputation is one way to achieve this goal; it is a quantitative measure used to judge which peer reviewers are more reliable. Peer reviewers can use Expertiza to score an author. If Expertiza shows confidence ratings for grades based upon the reviewers' reputation, then authors can more easily determine the legitimacy of the peer-assigned score. In addition, the teaching staff can examine the quality of each peer review based on reputation values and, potentially, crowd-source a significant portion of the grading function. Currently the reputation system is implemented in Expertiza through a web service.
The service does not work all the time. Although the Expertiza team can sometimes run the system, we could not reach the service or obtain values from it, even though we tried on our own local computers and on VCL as well. Nevertheless, we have implemented some test scenarios based on the algorithms used in the web service.

== Algorithms ==
Reputation systems may take various factors into account:
* Does a reviewer assign scores that are similar to scores assigned by the instructor (on work that they both grade)?
* Does a reviewer assign scores that match those assigned by other reviewers?
* Does the reviewer assign different scores to different work?
* How competent has the reviewer been on other work done for the class?

Two algorithms are used: the Hamer-peer algorithm has the lowest maximum absolute bias and the Lauw-peer algorithm has the lowest overall bias. This indicates, from the instructor’s perspective, that if there are further assignments of this kind, expert grading may not be necessary. It is observed in the article (https://ieeexplore.ieee.org/abstract/document/7344292) that the overall bias is a little bit higher, but the max. absolute bias is very high (more than 20). This indicates that for future similar courses, the instructor can trust most students’ peer grading, but should be aware that the students may give inflated grades. Therefore spot-checking is necessary. However, overall bias is quite low, as the students gave grades at least 16 points lower than expert grades. This may be because either more training is needed, or the review rubric is inadequate. The results also suggest that for future courses of this kind, the instructor cannot trust the students' grades; expert grades are still necessary.
The main difference between the Hamer-peer and the Lauw-peer algorithm is that the Lauw-peer algorithm keeps track of the reviewer's leniency (“bias”), which can be either positive or negative. A positive leniency indicates the reviewer tends to give higher scores than average. Additionally, the range for Hamer’s algorithm is (0,∞) while for Lauw’s algorithm it is [0,1].

== Test Plan ==
=== Initial Phase ===
We followed the testing thought process recommended by Dr. Gehringer:
In testing this service, we used an external program to send requests to a simulated service and inspected the returned data.
This decision was reached because the program under test was unfortunately not running and could not be inspected directly.

The test below sends real JSON to both Peerlogic (http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms) and a mock service.

When we tested against Peerlogic and the mock, the current web service did not appear to be correct, since the returned values do not match the expected values, as can be seen in the picture.
This may be the conclusion we were expected to reach in this project.
 
Proof of working:
 
[[File:Response_json_expected.jpeg]]

Test Code Snippet:

<pre>
require "net/http"
require "json"

INPUTS = {
  "submission9999": {
    "stu9999": 10,
    "stu9998": 10,
    "stu9997": 9,
    "stu9996": 5
  },
  "submission9998": {
    "stu9999": 3,
    "stu9998": 2,
    "stu9997": 4,
    "stu9996": 5
  },
  "submission9997": {
    "stu9999": 7,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  },
  "submission9996": {
    "stu9999": 6,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  }
}.to_json

EXPECTED = {
  "Hamer": {
    "9996": 0.6,
    "9997": 3.6,
    "9998": 1.1,
    "9999": 1.1
  }
}.to_json

describe "Expertiza" do
  it "should return the correct Hamer calculation" do
    uri = URI('http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms')
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port) do |http|
      http.request(req)
    end

    expect(JSON.parse(response.body)["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

describe "Expertiza Web Service" do
  it "should return the correct Hamer calculation" do
    uri = URI('https://4dfaead4-a747-4be4-8683-3b10d1d2e0c0.mock.pstmn.io/reputation_web_service/default')
    req = Net::HTTP::Post.new(uri)
    req.content_type = 'application/json'
    req.body = INPUTS

    response = Net::HTTP.start(uri.hostname, uri.port, :use_ssl => uri.scheme == 'https') do |http|
      http.request(req)
    end
    expect(JSON.parse("#{response.body}}")["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

</pre>

In addition, this plan lets us test the current functionality by treating the system as a black box, and it lets us draw conclusions about the accuracy of the implementation as a whole.

Therefore, in the section below, we have provided code that showcases this plan in action. The values returned by the algorithm are to be inspected both by code and by hand.

== Simulation Code Segment to Test Web Service ==

<pre>
import math

# Parameters: reviews list
# reviews list - a list of each reviewer's grades for each assignment
# Example:
# reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]
# Corresponding reviewer and grade for each assignment table
# Essay Reviewer1 Reviewer2 Reviewer3
# Assignment1 5 5 4
# Assignment2 4 3 3
# Assignment3 4 4 4
# Assignment4 3 4 3
# Assignment5 2 2 2

# Reviewer's grades given to each assignment (2D array)
# Each index of reviews is a reviewer. Each index in reviews[i] is a review grade
reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]

# Number of reviewers
numReviewers = len(reviews)
# Number of assignments
numAssig = len(reviews[0])
# Initial empty grades for each assignment array
grades = []
# Initial empty delta R array
deltaR = []
# Weight prime
weightPrime = []
# Reviewer's reputation weight
weight= []

# Calculating the average grade for each assignment across all reviewers
for numAssigIndex in range(numAssig):
    assignmentGradeAverage = 0
    for numReviewerIndex in range(numReviewers):
        assignmentGradeAverage += reviews[numReviewerIndex][numAssigIndex]
    grades.append(assignmentGradeAverage/numReviewers)
print("Average Grades:")
print(grades)

# Calculating delta R: each reviewer's mean squared deviation from the average grades
for numReviewerIndex in range(numReviewers):
    reviewerDeltaR = 0
    assignmentAverageGradeIndex = 0
    for reviewGrade in reviews[numReviewerIndex]:
        reviewerDeltaR += ((reviewGrade - grades[assignmentAverageGradeIndex]) ** 2)
        assignmentAverageGradeIndex += 1
    reviewerDeltaR /= numAssig
    deltaR.append(reviewerDeltaR)
print("deltaR:")
print(deltaR)

# Calculating the average delta R across all reviewers
averageDeltaR = 0
for reviewerDeltaR in deltaR:
    averageDeltaR += reviewerDeltaR
averageDeltaR /= numReviewers
print("averageDeltaR:")
print(averageDeltaR)

# Calculating weight prime
# Note: this divides by zero if a reviewer matches the average grades exactly
for reviewerDeltaR in deltaR:
    weightPrime.append(averageDeltaR/reviewerDeltaR)
print("weightPrime:")
print(weightPrime)

# Calculating reputation weight
for reviewerWeightPrime in weightPrime:
    if reviewerWeightPrime <= 2:
        weight.append(reviewerWeightPrime)
    else:
        weight.append(2 + math.log(reviewerWeightPrime - 1))
print("reputation per reviewer:")
i = 1
for reviewerWeight in weight:
    print("Reputation of Reviewer", i)
    print(round(reviewerWeight,1))
    i += 1
</pre>

Output
<pre>
Reputation of Reviewer 1
1.0
Reputation of Reviewer 2
1.0
Reputation of Reviewer 3
1.0
</pre>
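
To run the simulation on the same data as the JSON test input, the submission-keyed scores first have to be rearranged into one list of grades per reviewer. The sketch below shows one way to do that. It assumes the stu… keys identify reviewers and the submission… keys identify the reviewed work, and the function name to_review_matrix is hypothetical.

```python
def to_review_matrix(scores_by_submission):
    """Turn {submission: {reviewer: score}} into (reviewers, reviews[i][j]).

    Assumes every reviewer scored every submission; a missing score raises KeyError.
    """
    submissions = sorted(scores_by_submission)
    reviewers = sorted({r for scores in scores_by_submission.values() for r in scores})
    reviews = [
        [scores_by_submission[s][r] for s in submissions]
        for r in reviewers
    ]
    return reviewers, reviews

# Same values as the INPUTS hash in the test code on this page
inputs = {
    "submission9999": {"stu9999": 10, "stu9998": 10, "stu9997": 9, "stu9996": 5},
    "submission9998": {"stu9999": 3, "stu9998": 2, "stu9997": 4, "stu9996": 5},
    "submission9997": {"stu9999": 7, "stu9998": 4, "stu9997": 5, "stu9996": 5},
    "submission9996": {"stu9999": 6, "stu9998": 4, "stu9997": 5, "stu9996": 5},
}
reviewers, reviews = to_review_matrix(inputs)
for name, row in zip(reviewers, reviews):
    print(name, row)
```

The resulting reviews list has the same shape as the one hard-coded in the simulation, with rows ordered by reviewer id and columns ordered by submission id.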

== Scenarios ==
* Reviewer gives all max scores
* Reviewer gives all min scores
* Reviewer completes no review
* Alternative scenario: reviewer gives max scores even if there are no inputs

== Conclusion ==

As a team, we worked out the algorithms and how they are applied, and wrote some test scenarios. However, we did not have a chance to work with the web service, since it does not work due to module errors: what we got was an undefined method strip error in the Reputation Web Service Controller. Although it sometimes works on the Expertiza team's side, we were not able to see the web service working. We therefore created some test scenarios and wrote Python code to simulate the algorithm.
 
The code segment simulates the hamer.rb algorithm as described in "A Method of Automatic Grade Calibration in Peer Assessment" by John Hamer, Kenneth T.K. Ma, and Hugh H.F. Kwong (https://crpit.scem.westernsydney.edu.au/confpapers/CRPITV42Hamer.pdf). It takes a list of reviewers and their grades for each assignment reviewed and computes the associated reputation weights. Since the algorithm described in the paper does not specify an initial weight for first-time reviewers, we coded it so that first-time reviewers start with a weight of 1. The code does not yet take existing weights into account for reviewers who already have reputation weights; this will be added soon. We followed the algorithm in the paper to the letter, but even then the output values given in the paper's example did not match what we computed by hand and by code. Either we missed something or the algorithm has been changed. When we tested against Peerlogic and the mock, the current web service did not appear to be correct, since the returned values do not match the expected values, as can be seen in the picture. This may be the conclusion we were expected to reach in this project.

==GitHub Links==
Link to Expertiza repository: [https://github.com/expertiza/expertiza here]

Link to the forked repository: [https://github.com/joshlin5/expertiza here]

Link to pull request: [https://github.com/expertiza/expertiza/pull/2355/checks here]

== References ==

1. Expertiza on GitHub (https://github.com/expertiza/expertiza) 
2. The live Expertiza website (http://expertiza.ncsu.edu/) 
3. Pluggable reputation systems for peer review: A web-service approach (https://doi.org/10.1109/FIE.2015.7344292)

CSC/ECE 517 Spring 2022 - E2212: Testing for hamer.rb

2022-03-27T23:00:38Z

Jlin36:

This page describes the changes made for the Spring 2022 OSS Project E2212: Testing for hamer.rb
== Project Overview ==

=== Introduction ===
Using students' reviews of an assignment as the basis for a more accurate grade has become more popular among professors and courses in universities. Not only does this method free the professor and TAs from days of work, but it also allows students to learn even more about an assignment through grading others' work. Unfortunately, many students may not take reviewing others' work seriously and may simply give 100 or 0 to other students. Since such reviews may skew a student's grade, a system to assess the correctness and credibility of a reviewer is necessary for this reviewing system to work.

==About Expertiza==
Expertiza is a multi-purpose web application built using Ruby on Rails for students and instructors. Instructors enrolled in Expertiza can create and customize classes, teams, assignments, quizzes, and much more. Students can form teams, attempt quizzes, and complete assignments. Expertiza also lets students provide peer reviews, enabling them to work together to improve each other's learning experience. It is an open-source application, and its GitHub repository is [https://github.com/expertiza/expertiza Expertiza].

== Description ==
hamer.rb was the file that implemented one of the “reputation systems” that can be used to determine the reliability of peer reviewers. However, this file is no longer current, having been replaced by a web service in 2015. Therefore, we will be trying to describe and test this web service in the following sections.

=== Mentor ===

Ed Gehringer, efg@ncsu.edu

=== Team Members ===

* Joshua Lin (jlin36@ncsu.edu)
* Muhammet Mustafa Olmez (molmez@ncsu.edu)
* Soumyadeep Chatterjee (schatte5@ncsu.edu)

== Reputation System ==
Online peer-review systems are now in common use in higher education. They free the instructor and course staff from having to provide personally all the feedback that students receive on their work. However, if we want to assure that all students receive competent feedback, or even use peer-assigned grades, we need a way to judge which peer reviewers are most credible. The solution is to use a reputation system. The reputation system is meant to provide objective value to student-assigned peer review scores. Students select from a list of tasks to be performed and then prepare their work and submit it to a peer-review system. The work is then reviewed by other students who offer comments and graded feedback to help the submitters improve their work. During the peer review period it is important to determine which reviews are more accurate and show higher quality. Reputation is one way to achieve this goal; it is a quantitative measure used to judge which peer reviewers are more reliable. Peer reviewers can use Expertiza to score an author. If Expertiza shows confidence ratings for grades based upon the reviewer's reputation, then authors can more easily determine the legitimacy of the peer-assigned score. In addition, the teaching staff can examine the quality of each peer review based on reputation values and, potentially, crowd-source a significant portion of the grading function. Currently the reputation system is implemented in Expertiza through a web service.
The service does not work all the time. Although the Expertiza team can sometimes run the system, we could not reach the service or obtain values from it, even though we tried on our own local computers and on the VCL. Nevertheless, we have implemented some test scenarios based on the algorithms used in the web service.

== Algorithms ==
Reputation systems may take various factors into account:
* Does a reviewer assign scores that are similar to scores assigned by the instructor (on work that they both grade)?
* Does a reviewer assign scores that match those assigned by other reviewers?
* Does the reviewer assign different scores to different work?
* How competent has the reviewer been on other work done for the class?

Two algorithms are used: the Hamer-peer algorithm has the lowest maximum absolute bias, and the Lauw-peer algorithm has the lowest overall bias. From the instructor's perspective, this indicates that if there are further assignments of this kind, expert grading may not be necessary. It is observed in the article (https://ieeexplore.ieee.org/abstract/document/7344292) that the overall bias is a little bit higher, but the maximum absolute bias is very high (more than 20). This indicates that for future similar courses, the instructor can trust most students' peer grading, but should be aware that the students may give inflated grades. Therefore, spot-checking is necessary. However, overall bias is quite low, as the students gave grades at least 16 points lower than expert grades. This may be because either more training is needed or the review rubric is inadequate. The results also suggest that for future courses of this kind, the instructor cannot trust the students' grades; expert grades are still necessary.
The main difference between the Hamer-peer and the Lauw-peer algorithm is that the Lauw-peer algorithm keeps track of the reviewer's leniency ("bias"), which can be either positive or negative. A positive leniency indicates that the reviewer tends to give higher scores than average. Additionally, the range for Hamer's algorithm is (0,∞), while for Lauw's algorithm it is [0,1].
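
To make the idea of leniency concrete, the sketch below computes a simple mean signed deviation per reviewer, using the scores from the test snippet on this page. This is only an illustration of "tends to score above or below average"; it is not Lauw's actual formula, and the helper name <code>mean_signed_deviation</code> is hypothetical.

```python
# Scores copied from the INPUTS hash in the test snippet:
# submission -> reviewer -> score.
scores = {
    "submission9999": {"stu9999": 10, "stu9998": 10, "stu9997": 9, "stu9996": 5},
    "submission9998": {"stu9999": 3, "stu9998": 2, "stu9997": 4, "stu9996": 5},
    "submission9997": {"stu9999": 7, "stu9998": 4, "stu9997": 5, "stu9996": 5},
    "submission9996": {"stu9999": 6, "stu9998": 4, "stu9997": 5, "stu9996": 5},
}


def mean_signed_deviation(scores):
    """Hypothetical helper: average of (score - submission mean) per reviewer.

    This is an illustrative stand-in for leniency, not Lauw's definition.
    """
    totals, counts = {}, {}
    for reviews in scores.values():
        submission_mean = sum(reviews.values()) / len(reviews)
        for reviewer, score in reviews.items():
            totals[reviewer] = totals.get(reviewer, 0.0) + (score - submission_mean)
            counts[reviewer] = counts.get(reviewer, 0) + 1
    return {reviewer: totals[reviewer] / counts[reviewer] for reviewer in totals}


leniency = mean_signed_deviation(scores)
for reviewer in sorted(leniency):
    # Positive: scores above the submission average; negative: below it
    print(reviewer, leniency[reviewer])
```

On this data stu9999 comes out positive (scores above the average) and stu9998 comes out negative, which is the sign convention described above.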

== Test Plan ==
We followed the testing approach recommended by Dr. Gehringer:
In testing this service, we used an external program to send requests to a simulated service and inspected the returned data.
We made this decision because the program under test was unfortunately not running and could not be inspected in an ideal manner.

The test below sends real JSON to both peerlogic (http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms) and the mock.

When we tested against peerlogic and the mock, the current web service did not return correct results: the returned values do not match the expected values shown in the picture.
The expected values are what we are supposed to reach in this project.
 
Proof of working:
 
[[File:Response_json_expected.jpeg]]

Test Code Snippet:

<pre>
require "net/http"
require "json"

# Review scores sent to the service: submission => { reviewer => score }
INPUTS = {
  "submission9999": {
    "stu9999": 10,
    "stu9998": 10,
    "stu9997": 9,
    "stu9996": 5
  },
  "submission9998": {
    "stu9999": 3,
    "stu9998": 2,
    "stu9997": 4,
    "stu9996": 5
  },
  "submission9997": {
    "stu9999": 7,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  },
  "submission9996": {
    "stu9999": 6,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  }
}.to_json

# Reputation values the Hamer algorithm is expected to return per reviewer
EXPECTED = {
  "Hamer": {
    "9996": 0.6,
    "9997": 3.6,
    "9998": 1.1,
    "9999": 1.1
  }
}.to_json

# Test against the peerlogic web service
describe "Expertiza" do
  it "should return the correct Hamer calculation" do
    uri = URI('http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms')

    response = Net::HTTP.post(uri, INPUTS, 'Content-Type' => 'application/json')

    expect(JSON.parse(response.body)["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

# Test against the Postman mock server
describe "Expertiza Web Service" do
  it "should return the correct Hamer calculation" do
    uri = URI('https://4dfaead4-a747-4be4-8683-3b10d1d2e0c0.mock.pstmn.io/reputation_web_service/default')

    response = Net::HTTP.post(uri, INPUTS, 'Content-Type' => 'application/json')
    # NOTE: a closing brace is appended to the body before parsing, as in the original test;
    # this only parses if the mock response is missing its final brace.
    expect(JSON.parse("#{response.body}}")["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

</pre>

In addition, this plan enables us to test the current functionality by treating this system as a black box, and is able to provide conclusions on
the accuracy of the implementation as a whole.

Therefore, in the section below, we have provided code that showcases this plan in action. The values returned by the algorithm are to be inspected both by code and by hand.
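
Because reputation values are floating-point numbers, a black-box check may be easier to read if it compares with a tolerance and reports which reviewers differ. The sketch below is one way to do that, not part of the project's test suite: the function name, the tolerance of 0.05, and the <code>made_up_actual</code> values are all assumptions for illustration. Only the expected values come from the test snippet above.

```python
def reputation_mismatches(expected, actual, tolerance=0.05):
    """Return {reviewer: (expected, actual)} for values that differ or are missing.

    Hypothetical helper; the tolerance is an assumed value, not one from the project.
    """
    mismatches = {}
    for reviewer, want in expected.items():
        got = actual.get(reviewer)
        if got is None or abs(got - want) > tolerance:
            mismatches[reviewer] = (want, got)
    return mismatches


# Expected Hamer values from the test snippet
expected = {"9996": 0.6, "9997": 3.6, "9998": 1.1, "9999": 1.1}
# Made-up response for illustration only; NOT real output from the web service
made_up_actual = {"9996": 0.62, "9997": 3.1, "9998": 1.1}

mismatches = reputation_mismatches(expected, made_up_actual)
for reviewer in sorted(mismatches):
    print(reviewer, mismatches[reviewer])
```

With the made-up response, reviewer 9997 is reported because its value is off by more than the tolerance, and 9999 is reported because it is missing.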

== Simulation Code Segment to Test Web Service ==

<pre>
import math

# Input: reviews, a 2D list with one row per reviewer.
# reviews[i][j] is the grade reviewer i gave to assignment j.
# Example:
# reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]
# Corresponding table of grades per assignment and reviewer:
# Essay        Reviewer1  Reviewer2  Reviewer3
# Assignment1  5          5          4
# Assignment2  4          3          3
# Assignment3  4          4          4
# Assignment4  3          4          3
# Assignment5  2          2          2

# Reviewers' grades for each assignment as a 2D array
reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]

# Number of reviewers
numReviewers = len(reviews)
# Number of assignments
numAssig = len(reviews[0])
# Average grade for each assignment
grades = []
# Delta R (mean squared deviation) for each reviewer
deltaR = []
# Weight prime for each reviewer
weightPrime = []
# Reputation weight for each reviewer
weight = []

# Calculating the average grade for each assignment
for numAssigIndex in range(numAssig):
    assignmentGradeAverage = 0
    for numReviewerIndex in range(numReviewers):
        assignmentGradeAverage += reviews[numReviewerIndex][numAssigIndex]
    grades.append(assignmentGradeAverage/numReviewers)
print("Average Grades:")
print(grades)

# Calculating delta R: mean squared difference between a reviewer's grades and the averages
for numReviewerIndex in range(numReviewers):
    reviewerDeltaR = 0
    assignmentAverageGradeIndex = 0
    for reviewGrade in reviews[numReviewerIndex]:
        reviewerDeltaR += ((reviewGrade - grades[assignmentAverageGradeIndex]) ** 2)
        assignmentAverageGradeIndex += 1
    reviewerDeltaR /= numAssig
    deltaR.append(reviewerDeltaR)
print("deltaR:")
print(deltaR)

# Calculating the average delta R over all reviewers
averageDeltaR = 0
for reviewerDeltaR in deltaR:
    averageDeltaR += reviewerDeltaR
averageDeltaR /= numReviewers
print("averageDeltaR:")
print(averageDeltaR)

# Calculating weight prime
# Note: a reviewer whose grades equal the averages exactly has deltaR == 0,
# which raises ZeroDivisionError here.
for reviewerDeltaR in deltaR:
    weightPrime.append(averageDeltaR/reviewerDeltaR)
print("weightPrime:")
print(weightPrime)

# Calculating reputation weight (large weight primes are dampened with a logarithm)
for reviewerWeightPrime in weightPrime:
    if reviewerWeightPrime <= 2:
        weight.append(reviewerWeightPrime)
    else:
        weight.append(2 + math.log(reviewerWeightPrime - 1))
print("reputation per reviewer:")
i = 1
for reviewerWeight in weight:
    print("Reputation of Reviewer", i)
    print(round(reviewerWeight,1))
    i += 1
</pre>

Output (reputation lines only):
<pre>
Reputation of Reviewer 1
1.0
Reputation of Reviewer 2
1.0
Reputation of Reviewer 3
1.0
</pre>
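
The simulation above makes a single pass with every reviewer at weight 1. The sketch below shows one way the calculation might be repeated with weighted averages until the weights stop changing. The iteration limit, the convergence tolerance, and the small floor on delta (used to avoid division by zero) are assumptions of this sketch, not values taken from the paper or the web service.

```python
import math


def hamer_weights(reviews, max_iterations=100, tolerance=1e-6):
    """Iterative sketch: reviews[r][a] is the grade reviewer r gave to assignment a."""
    num_reviewers = len(reviews)
    num_assignments = len(reviews[0])
    weights = [1.0] * num_reviewers  # assumption: every reviewer starts at weight 1

    for _ in range(max_iterations):
        total_weight = sum(weights)
        # Weighted average grade for each assignment
        grades = [
            sum(weights[r] * reviews[r][a] for r in range(num_reviewers)) / total_weight
            for a in range(num_assignments)
        ]
        # Mean squared difference between each reviewer's grades and the weighted grades
        delta = [
            sum((reviews[r][a] - grades[a]) ** 2 for a in range(num_assignments)) / num_assignments
            for r in range(num_reviewers)
        ]
        mean_delta = sum(delta) / num_reviewers
        if mean_delta == 0:
            return [1.0] * num_reviewers  # all reviewers agree exactly

        new_weights = []
        for d in delta:
            # Assumption: floor delta at a tiny value to avoid division by zero
            weight_prime = mean_delta / max(d, 1e-12)
            if weight_prime <= 2:
                new_weights.append(weight_prime)
            else:
                new_weights.append(2 + math.log(weight_prime - 1))

        converged = max(abs(n - o) for n, o in zip(new_weights, weights)) < tolerance
        weights = new_weights
        if converged:
            break
    return weights


# Same data as the simulation above
agree = hamer_weights([[5, 4, 4, 3, 2], [5, 3, 4, 4, 2], [4, 3, 4, 3, 2]])
print([round(w, 1) for w in agree])

# Hypothetical data: two identical reviewers and one who gives 1 to everything
outlier = hamer_weights([[5, 4, 4, 3, 2], [5, 4, 4, 3, 2], [1, 1, 1, 1, 1]])
print([round(w, 2) for w in outlier])
```

On the simulation's data the three reviewers deviate from the averages by the same amount, so the weights stay at 1.0. In the second, made-up data set the reviewer who disagrees with the other two ends up with the lowest weight.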

== Scenarios ==
* Reviewer gives all max scores
* Reviewer gives all min scores
* Reviewer completes no review
* Alternative scenario: reviewer gives max scores even if there are no inputs
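
The first two scenarios can be tried out against the single-pass calculation from the simulation. The sketch below is illustrative only: the reviewer names, the 1 to 5 scale, and the grades are made up, and the function <code>one_pass_weights</code> is a hypothetical repackaging of the simulation, not the web service's code. The "no review" scenario is left out because the source does not say how the service treats a reviewer with no scores.

```python
import math


def one_pass_weights(reviews):
    """reviews maps reviewer name -> list of grades, one per assignment (same order)."""
    names = list(reviews)
    num_assignments = len(reviews[names[0]])
    # Unweighted average grade per assignment
    averages = [
        sum(reviews[name][a] for name in names) / len(names)
        for a in range(num_assignments)
    ]
    # Mean squared deviation of each reviewer from the averages
    delta = {
        name: sum((g - avg) ** 2 for g, avg in zip(reviews[name], averages)) / num_assignments
        for name in names
    }
    mean_delta = sum(delta.values()) / len(names)
    if mean_delta == 0:
        return {name: 1.0 for name in names}  # everyone agrees exactly

    weights = {}
    for name in names:
        # Assumption: floor delta at a tiny value to avoid division by zero
        weight_prime = mean_delta / max(delta[name], 1e-12)
        weights[name] = weight_prime if weight_prime <= 2 else 2 + math.log(weight_prime - 1)
    return weights


# Made-up grades on an assumed 1-5 scale
scenario = {
    "honest1": [5, 4, 4, 3, 2],
    "honest2": [5, 3, 4, 4, 2],
    "all_max": [5, 5, 5, 5, 5],  # scenario: reviewer gives all max scores
    "all_min": [1, 1, 1, 1, 1],  # scenario: reviewer gives all min scores
}
weights = one_pass_weights(scenario)
for name in scenario:
    print(name, round(weights[name], 2))
```

With this data the two constant-score reviewers receive weights below 1, while the two reviewers who vary their grades receive weights above 1.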

== Conclusion ==

As a team, we figured out the algorithms and applications and wrote some test scenarios. However, we did not have a chance to work with the web service, since it does not work due to module errors. The error we encountered was an undefined method strip in the Reputation Web Service Controller. Although the service sometimes works on the Expertiza team's side, we were not able to see it working. We created some test scenarios and wrote Python code to simulate the algorithm.
 
The code segment that simulates the hamer.rb algorithm follows "A Method of Automatic Grade Calibration in Peer Assessment" by John Hamer, Kenneth T.K. Ma, and Hugh H.F. Kwong (https://crpit.scem.westernsydney.edu.au/confpapers/CRPITV42Hamer.pdf). It takes a list of reviewers and the grades each gave to the assignments they reviewed, and computes each reviewer's reputation weight. Because the paper does not specify an initial weight for first-time reviewers, we gave first-time reviewers an initial weight of 1. The code does not yet use existing reputation weights for reviewers who already have them; this will be added soon. We followed the algorithm in the paper to the letter, but the example output values given there still did not match what we computed by hand and by code. Either we missed something or the algorithm has been changed. When we tested against peerlogic and the mock, the current web service did not return correct results: the returned values do not match the expected values, as can be seen in the picture. Those expected values may be what we are supposed to reach in this project.

==GitHub Links==
Link to Expertiza repository: [https://github.com/expertiza/expertiza here]

Link to the forked repository: [https://github.com/joshlin5/expertiza here]

Link to pull request: [https://github.com/expertiza/expertiza/pull/2355/checks here]

== References ==

1. Expertiza on GitHub (https://github.com/expertiza/expertiza) 
2. The live Expertiza website (http://expertiza.ncsu.edu/) 
3. Pluggable reputation systems for peer review: A web-service approach (https://doi.org/10.1109/FIE.2015.7344292)

CSC/ECE 517 Spring 2022 - E2212: Testing for hamer.rb

2022-03-27T22:52:23Z

Jlin36:

This page describes the changes made for the Spring 2022 OSS Project E2212: Testing for hamer.rb
== Project Overview ==

=== Introduction ===
<q>Online peer-review systems are now in common use in higher education. They free the instructor and course staff from having to provide personally all the feedback that students receive on their work. However, if we want to assure that all students receive competent feedback, or even use peer-assigned grades, we need a way to judge which peer reviewers are most credible. The solution is to use a reputation system.</q> The reputation system is meant to provide objective value to student assigned peer review scores. Students select from a list of tasks to be performed and then prepare their work and submit it to a peer-review system. The work is then reviewed by other students who offer comments/graded feedback to help the submitters improve their work.
During the peer review period it is important to determine which reviews are more accurate and show higher quality. Reputation is one way to achieve this goal; it is a quantitative measure used to judge which peer reviewers are more reliable.
Peer reviewers can use Expertiza to score an author. If Expertiza shows confidence ratings for grades based upon the reviewer's reputation, then authors can more easily determine the legitimacy of the peer-assigned score. In addition, the teaching staff can examine the quality of each peer review based on reputation values and, potentially, crowd-source a significant portion of the grading function.
Currently the reputation system is implemented in Expertiza through a web service, but no tests have been written for it. Thus our goal is to set up assignments and reviews that would produce specific reputation scores, and to test that the correct reputations are in fact being produced.

==About Expertiza==
Expertiza is a multi-purpose web application built using Ruby on Rails for students and instructors. Instructors enrolled in Expertiza can create and customize classes, teams, assignments, quizzes, and much more. Students can form teams, attempt quizzes, and complete assignments. Expertiza also lets students provide peer reviews, enabling them to work together to improve each other's learning experience. It is an open-source application, and its GitHub repository is [https://github.com/expertiza/expertiza Expertiza].

== Description ==
hamer.rb was the file that implemented one of the “reputation systems” that can be used to determine the reliability of peer reviewers. However, this file is no longer current, having been replaced by a web service in 2015. Therefore, we will be trying to describe and test this web service in the following sections.

=== Mentor ===

Ed Gehringer, efg@ncsu.edu

=== Team Members ===

* Joshua Lin (jlin36@ncsu.edu)
* Muhammet Mustafa Olmez (molmez@ncsu.edu)
* Soumyadeep Chatterjee (schatte5@ncsu.edu)

== Reputation System ==
Online peer-review systems are now in common use in higher education. They free the instructor and course staff from having to provide personally all the feedback that students receive on their work. However, if we want to assure that all students receive competent feedback, or even use peer-assigned grades, we need a way to judge which peer reviewers are most credible. The solution is to use a reputation system. The reputation system is meant to provide objective value to student-assigned peer review scores. Students select from a list of tasks to be performed and then prepare their work and submit it to a peer-review system. The work is then reviewed by other students who offer comments and graded feedback to help the submitters improve their work. During the peer review period it is important to determine which reviews are more accurate and show higher quality. Reputation is one way to achieve this goal; it is a quantitative measure used to judge which peer reviewers are more reliable. Peer reviewers can use Expertiza to score an author. If Expertiza shows confidence ratings for grades based upon the reviewer's reputation, then authors can more easily determine the legitimacy of the peer-assigned score. In addition, the teaching staff can examine the quality of each peer review based on reputation values and, potentially, crowd-source a significant portion of the grading function. Currently the reputation system is implemented in Expertiza through a web service.
The service does not work all the time. Although the Expertiza team can sometimes run the system, we could not reach the service or obtain values from it, even though we tried on our own local computers and on the VCL. Nevertheless, we have implemented some test scenarios based on the algorithms used in the web service.

== Algorithms ==
Reputation systems may take various factors into account:
* Does a reviewer assign scores that are similar to scores assigned by the instructor (on work that they both grade)?
* Does a reviewer assign scores that match those assigned by other reviewers?
* Does the reviewer assign different scores to different work?
* How competent has the reviewer been on other work done for the class?

Two algorithms are used: the Hamer-peer algorithm has the lowest maximum absolute bias, and the Lauw-peer algorithm has the lowest overall bias. From the instructor's perspective, this indicates that if there are further assignments of this kind, expert grading may not be necessary. It is observed in the article (https://ieeexplore.ieee.org/abstract/document/7344292) that the overall bias is a little bit higher, but the maximum absolute bias is very high (more than 20). This indicates that for future similar courses, the instructor can trust most students' peer grading, but should be aware that the students may give inflated grades. Therefore, spot-checking is necessary. However, overall bias is quite low, as the students gave grades at least 16 points lower than expert grades. This may be because either more training is needed or the review rubric is inadequate. The results also suggest that for future courses of this kind, the instructor cannot trust the students' grades; expert grades are still necessary.
The main difference between the Hamer-peer and the Lauw-peer algorithm is that the Lauw-peer algorithm keeps track of the reviewer's leniency ("bias"), which can be either positive or negative. A positive leniency indicates that the reviewer tends to give higher scores than average. Additionally, the range for Hamer's algorithm is (0,∞), while for Lauw's algorithm it is [0,1].

== Test Plan ==
We followed the testing approach recommended by Dr. Gehringer:
In testing this service, we used an external program to send requests to a simulated service and inspected the returned data.
We made this decision because the program under test was unfortunately not running and could not be inspected in an ideal manner.

The test below sends real JSON to both peerlogic (http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms) and the mock.

When we tested against peerlogic and the mock, the current web service did not return correct results: the returned values do not match the expected values shown in the picture.
The expected values are what we are supposed to reach in this project.
 
Proof of working:
 
[[File:Response_json_expected.jpeg]]

Test Code Snippet:

<pre>
require "net/http"
require "json"

# Review scores sent to the service: submission => { reviewer => score }
INPUTS = {
  "submission9999": {
    "stu9999": 10,
    "stu9998": 10,
    "stu9997": 9,
    "stu9996": 5
  },
  "submission9998": {
    "stu9999": 3,
    "stu9998": 2,
    "stu9997": 4,
    "stu9996": 5
  },
  "submission9997": {
    "stu9999": 7,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  },
  "submission9996": {
    "stu9999": 6,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  }
}.to_json

# Reputation values the Hamer algorithm is expected to return per reviewer
EXPECTED = {
  "Hamer": {
    "9996": 0.6,
    "9997": 3.6,
    "9998": 1.1,
    "9999": 1.1
  }
}.to_json

# Test against the peerlogic web service
describe "Expertiza" do
  it "should return the correct Hamer calculation" do
    uri = URI('http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms')

    response = Net::HTTP.post(uri, INPUTS, 'Content-Type' => 'application/json')

    expect(JSON.parse(response.body)["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

# Test against the Postman mock server
describe "Expertiza Web Service" do
  it "should return the correct Hamer calculation" do
    uri = URI('https://4dfaead4-a747-4be4-8683-3b10d1d2e0c0.mock.pstmn.io/reputation_web_service/default')

    response = Net::HTTP.post(uri, INPUTS, 'Content-Type' => 'application/json')
    # NOTE: a closing brace is appended to the body before parsing, as in the original test;
    # this only parses if the mock response is missing its final brace.
    expect(JSON.parse("#{response.body}}")["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

</pre>

In addition, this plan enables us to test the current functionality by treating this system as a black box, and is able to provide conclusions on
the accuracy of the implementation as a whole.

Therefore, in the section below, we have provided code that showcases this plan in action. The values returned by the algorithm are to be inspected both by code and by hand.

== Simulation Code Segment to Test Web Service ==

<pre>
import math

# Input: reviews, a 2D list with one row per reviewer.
# reviews[i][j] is the grade reviewer i gave to assignment j.
# Example:
# reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]
# Corresponding table of grades per assignment and reviewer:
# Essay        Reviewer1  Reviewer2  Reviewer3
# Assignment1  5          5          4
# Assignment2  4          3          3
# Assignment3  4          4          4
# Assignment4  3          4          3
# Assignment5  2          2          2

# Reviewers' grades for each assignment as a 2D array
reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]

# Number of reviewers
numReviewers = len(reviews)
# Number of assignments
numAssig = len(reviews[0])
# Average grade for each assignment
grades = []
# Delta R (mean squared deviation) for each reviewer
deltaR = []
# Weight prime for each reviewer
weightPrime = []
# Reputation weight for each reviewer
weight = []

# Calculating the average grade for each assignment
for numAssigIndex in range(numAssig):
    assignmentGradeAverage = 0
    for numReviewerIndex in range(numReviewers):
        assignmentGradeAverage += reviews[numReviewerIndex][numAssigIndex]
    grades.append(assignmentGradeAverage/numReviewers)
print("Average Grades:")
print(grades)

# Calculating delta R: mean squared difference between a reviewer's grades and the averages
for numReviewerIndex in range(numReviewers):
    reviewerDeltaR = 0
    assignmentAverageGradeIndex = 0
    for reviewGrade in reviews[numReviewerIndex]:
        reviewerDeltaR += ((reviewGrade - grades[assignmentAverageGradeIndex]) ** 2)
        assignmentAverageGradeIndex += 1
    reviewerDeltaR /= numAssig
    deltaR.append(reviewerDeltaR)
print("deltaR:")
print(deltaR)

# Calculating the average delta R over all reviewers
averageDeltaR = 0
for reviewerDeltaR in deltaR:
    averageDeltaR += reviewerDeltaR
averageDeltaR /= numReviewers
print("averageDeltaR:")
print(averageDeltaR)

# Calculating weight prime
# Note: a reviewer whose grades equal the averages exactly has deltaR == 0,
# which raises ZeroDivisionError here.
for reviewerDeltaR in deltaR:
    weightPrime.append(averageDeltaR/reviewerDeltaR)
print("weightPrime:")
print(weightPrime)

# Calculating reputation weight (large weight primes are dampened with a logarithm)
for reviewerWeightPrime in weightPrime:
    if reviewerWeightPrime <= 2:
        weight.append(reviewerWeightPrime)
    else:
        weight.append(2 + math.log(reviewerWeightPrime - 1))
print("reputation per reviewer:")
i = 1
for reviewerWeight in weight:
    print("Reputation of Reviewer", i)
    print(round(reviewerWeight,1))
    i += 1
</pre>

Output (reputation lines only):
<pre>
Reputation of Reviewer 1
1.0
Reputation of Reviewer 2
1.0
Reputation of Reviewer 3
1.0
</pre>

== Scenarios ==
* Reviewer gives all max scores
* Reviewer gives all min scores
* Reviewer completes no review
* Alternative scenario: reviewer gives max scores even if there are no inputs

== Conclusion ==

As a team, we figured out the algorithms and applications and wrote some test scenarios. However, we did not have a chance to work with the web service, since it does not work due to module errors. The error we encountered was an undefined method strip in the Reputation Web Service Controller. Although the service sometimes works on the Expertiza team's side, we were not able to see it working. We created some test scenarios and wrote Python code to simulate the algorithm.
 
The code segment that simulates the hamer.rb algorithm follows "A Method of Automatic Grade Calibration in Peer Assessment" by John Hamer, Kenneth T.K. Ma, and Hugh H.F. Kwong (https://crpit.scem.westernsydney.edu.au/confpapers/CRPITV42Hamer.pdf). It takes a list of reviewers and the grades each gave to the assignments they reviewed, and computes each reviewer's reputation weight. Because the paper does not specify an initial weight for first-time reviewers, we gave first-time reviewers an initial weight of 1. The code does not yet use existing reputation weights for reviewers who already have them; this will be added soon. We followed the algorithm in the paper to the letter, but the example output values given there still did not match what we computed by hand and by code. Either we missed something or the algorithm has been changed. When we tested against peerlogic and the mock, the current web service did not return correct results: the returned values do not match the expected values, as can be seen in the picture. Those expected values may be what we are supposed to reach in this project.

==GitHub Links==
Link to Expertiza repository: [https://github.com/expertiza/expertiza here]

Link to the forked repository: [https://github.com/joshlin5/expertiza here]

Link to pull request: [https://github.com/expertiza/expertiza/pull/2355/checks here]

== References ==

1. Expertiza on GitHub (https://github.com/expertiza/expertiza) 
2. The live Expertiza website (http://expertiza.ncsu.edu/) 
3. Pluggable reputation systems for peer review: A web-service approach (https://doi.org/10.1109/FIE.2015.7344292)

CSC/ECE 517 Spring 2022 - E2212: Testing for hamer.rb

2022-03-27T22:51:24Z

Jlin36:

This page describes the changes made for the Spring 2022 OSS Project E2212: Testing for hamer.rb
== Project Overview ==

=== Introduction ===

==About Expertiza==
Expertiza is a multi-purpose web application built using Ruby on Rails for students and instructors. Instructors enrolled in Expertiza can create and customize classes, teams, assignments, quizzes, and much more. Students can form teams, attempt quizzes, and complete assignments. Expertiza also lets students provide peer reviews, enabling them to work together to improve each other's learning experience. It is an open-source application, and its GitHub repository is [https://github.com/expertiza/expertiza Expertiza].

== Description ==
hamer.rb was the file that implemented one of the “reputation systems” that can be used to determine the reliability of peer reviewers. However, this file is no longer current, having been replaced by a web service in 2015. Therefore, we will be trying to describe and test this web service in the following sections.

=== Mentor ===

Ed Gehringer, efg@ncsu.edu

=== Team Members ===

* Joshua Lin (jlin36@ncsu.edu)
* Muhammet Mustafa Olmez (molmez@ncsu.edu)
* Soumyadeep Chatterjee (schatte5@ncsu.edu)

== Reputation System ==
Online peer-review systems are now in common use in higher education. They free the instructor and course staff from having to provide personally all the feedback that students receive on their work. However, if we want to assure that all students receive competent feedback, or even use peer-assigned grades, we need a way to judge which peer reviewers are most credible. The solution is to use a reputation system. The reputation system is meant to provide objective value to student-assigned peer review scores. Students select from a list of tasks to be performed and then prepare their work and submit it to a peer-review system. The work is then reviewed by other students who offer comments and graded feedback to help the submitters improve their work. During the peer review period it is important to determine which reviews are more accurate and show higher quality. Reputation is one way to achieve this goal; it is a quantitative measure used to judge which peer reviewers are more reliable. Peer reviewers can use Expertiza to score an author. If Expertiza shows confidence ratings for grades based upon the reviewer's reputation, then authors can more easily determine the legitimacy of the peer-assigned score. In addition, the teaching staff can examine the quality of each peer review based on reputation values and, potentially, crowd-source a significant portion of the grading function. Currently the reputation system is implemented in Expertiza through a web service.
The service does not work all the time. Although the Expertiza team can sometimes run the system, we could not reach the service or obtain values from it, even though we tried on our own local computers and on the VCL. Nevertheless, we have implemented some test scenarios based on the algorithms used in the web service.

== Algorithms ==
Reputation systems may take various factors into account:
* Does a reviewer assign scores that are similar to scores assigned by the instructor (on work that they both grade)?
* Does a reviewer assign scores that match those assigned by other reviewers?
* Does the reviewer assign different scores to different work?
* How competent has the reviewer been on other work done for the class?

Two algorithms are used: the Hamer-peer algorithm has the lowest maximum absolute bias, and the Lauw-peer algorithm has the lowest overall bias. From the instructor's perspective, this indicates that if there are further assignments of this kind, expert grading may not be necessary. It is observed in the article (https://ieeexplore.ieee.org/abstract/document/7344292) that the overall bias is a little bit higher, but the maximum absolute bias is very high (more than 20). This indicates that for future similar courses, the instructor can trust most students' peer grading, but should be aware that the students may give inflated grades. Therefore, spot-checking is necessary. However, overall bias is quite low, as the students gave grades at least 16 points lower than expert grades. This may be because either more training is needed or the review rubric is inadequate. The results also suggest that for future courses of this kind, the instructor cannot trust the students' grades; expert grades are still necessary.
The main difference between the Hamer-peer and the Lauw-peer algorithm is that the Lauw-peer algorithm keeps track of the reviewer's leniency ("bias"), which can be either positive or negative. A positive leniency indicates that the reviewer tends to give higher scores than average. Additionally, the range for Hamer's algorithm is (0,∞), while for Lauw's algorithm it is [0,1].

== Test Plan ==
We followed the testing approach recommended by Dr. Gehringer:
In testing this service, we used an external program to send requests to a simulated service and inspected the returned data.
We made this decision because the program under test was unfortunately not running and could not be inspected in an ideal manner.

The test below sends real JSON to both peerlogic (http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms) and the mock.

When we tested against peerlogic and the mock, the current web service did not return correct results: the returned values do not match the expected values shown in the picture.
The expected values are what we are supposed to reach in this project.
 
Proof of working:
 
[[File:Response_json_expected.jpeg]]

Test Code Snippet:

<pre>
require "net/http"
require "json"

# Review scores sent to the service: submission => { reviewer => score }
INPUTS = {
  "submission9999": {
    "stu9999": 10,
    "stu9998": 10,
    "stu9997": 9,
    "stu9996": 5
  },
  "submission9998": {
    "stu9999": 3,
    "stu9998": 2,
    "stu9997": 4,
    "stu9996": 5
  },
  "submission9997": {
    "stu9999": 7,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  },
  "submission9996": {
    "stu9999": 6,
    "stu9998": 4,
    "stu9997": 5,
    "stu9996": 5
  }
}.to_json

# Reputation values the Hamer algorithm is expected to return per reviewer
EXPECTED = {
  "Hamer": {
    "9996": 0.6,
    "9997": 3.6,
    "9998": 1.1,
    "9999": 1.1
  }
}.to_json

# Test against the peerlogic web service
describe "Expertiza" do
  it "should return the correct Hamer calculation" do
    uri = URI('http://peerlogic.csc.ncsu.edu/reputation/calculations/reputation_algorithms')

    response = Net::HTTP.post(uri, INPUTS, 'Content-Type' => 'application/json')

    expect(JSON.parse(response.body)["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

# Test against the Postman mock server
describe "Expertiza Web Service" do
  it "should return the correct Hamer calculation" do
    uri = URI('https://4dfaead4-a747-4be4-8683-3b10d1d2e0c0.mock.pstmn.io/reputation_web_service/default')

    response = Net::HTTP.post(uri, INPUTS, 'Content-Type' => 'application/json')
    # NOTE: a closing brace is appended to the body before parsing, as in the original test;
    # this only parses if the mock response is missing its final brace.
    expect(JSON.parse("#{response.body}}")["Hamer"]).to eq(JSON.parse(EXPECTED)["Hamer"])
  end
end

</pre>
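
The comparison these tests perform can also be sketched in Python so that a failing run reports which reviewers' values differ. This is a minimal sketch, not part of the project code: the helper name <code>compare_reputations</code>, the tolerance, and the sample response body are assumptions made for illustration, and the sketch assumes the service returns numeric values under the "Hamer" key.

```python
import json
import math


def compare_reputations(expected, actual, tolerance=0.05):
    """Return {reviewer_id: (expected, actual)} for every value that differs or is missing."""
    mismatches = {}
    for reviewer_id in sorted(set(expected) | set(actual)):
        exp = expected.get(reviewer_id)
        act = actual.get(reviewer_id)
        # A missing value on either side counts as a mismatch
        if exp is None or act is None or not math.isclose(exp, act, abs_tol=tolerance):
            mismatches[reviewer_id] = (exp, act)
    return mismatches


# Expected Hamer values taken from the test above
EXPECTED = {"9996": 0.6, "9997": 3.6, "9998": 1.1, "9999": 1.1}

# Hypothetical response body, invented for illustration; not real service output
sample_body = '{"Hamer": {"9996": 0.6, "9997": 2.0, "9998": 1.1}}'
actual = json.loads(sample_body)["Hamer"]

print(compare_reputations(EXPECTED, actual))
# prints {'9997': (3.6, 2.0), '9999': (1.1, None)}
```

An empty result means every returned value is within the tolerance of its expected value.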

In addition, this plan lets us test the current functionality by treating the system as a black box, and lets us draw conclusions about the accuracy of the implementation as a whole.

Therefore, in the section below, we have provided code that showcases this plan in action. The values returned by the algorithm are to be inspected both by code and by hand.

== Simulation Code Segment to Test Web Service ==

<pre>
import math

# Parameters: reviews list
# reviews list - a list of each reviewer's grades for each assignment
# Example:
# reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]
# Corresponding reviewer and grade for each assignment table
# Essay        Reviewer1  Reviewer2  Reviewer3
# Assignment1  5          5          4
# Assignment2  4          3          3
# Assignment3  4          4          4
# Assignment4  3          4          3
# Assignment5  2          2          2

# Reviewers' grades given to each assignment, as a 2D array
# Each index of reviews is a reviewer. Each index in reviews[i] is a review grade
reviews = [[5,4,4,3,2],[5,3,4,4,2],[4,3,4,3,2]]

# Number of reviewers
numReviewers = len(reviews)
# Number of assignments
numAssig = len(reviews[0])
# Initial empty grades for each assignment array
grades = []
# Initial empty delta R array
deltaR = []
# Weight prime
weightPrime = []
# Reviewer's reputation weight
weight = []

# Calculating the (unweighted) average grade of each assignment
for numAssigIndex in range(numAssig):
    assignmentGradeAverage = 0
    for numReviewerIndex in range(numReviewers):
        assignmentGradeAverage += reviews[numReviewerIndex][numAssigIndex]
    grades.append(assignmentGradeAverage/numReviewers)
print("Average Grades:")
print(grades)

# Calculating delta R: each reviewer's mean squared deviation from the average grades
for numReviewerIndex in range(numReviewers):
    reviewerDeltaR = 0
    assignmentAverageGradeIndex = 0
    for reviewGrade in reviews[numReviewerIndex]:
        reviewerDeltaR += ((reviewGrade - grades[assignmentAverageGradeIndex]) ** 2)
        assignmentAverageGradeIndex += 1
    reviewerDeltaR /= numAssig
    deltaR.append(reviewerDeltaR)
print("deltaR:")
print(deltaR)

# Calculating the average delta R over all reviewers
averageDeltaR = 0
for reviewerDeltaR in deltaR:
    averageDeltaR += reviewerDeltaR
averageDeltaR /= numReviewers
print("averageDeltaR:")
print(averageDeltaR)

# Calculating weight prime
# Note: raises ZeroDivisionError if a reviewer's deltaR is 0 (grades equal to the averages)
for reviewerDeltaR in deltaR:
    weightPrime.append(averageDeltaR/reviewerDeltaR)
print("weightPrime:")
print(weightPrime)

# Calculating reputation weight
for reviewerWeightPrime in weightPrime:
    if reviewerWeightPrime <= 2:
        weight.append(reviewerWeightPrime)
    else:
        weight.append(2 + math.log(reviewerWeightPrime - 1))
print("reputation per reviewer:")
i = 1
for reviewerWeight in weight:
    print("Reputation of Reviewer", i)
    print(round(reviewerWeight,1))
    i += 1
</pre>

Output (final lines only; the intermediate values printed before them are omitted):
<pre>
Reputation of Reviewer 1
1.0
Reputation of Reviewer 2
1.0
Reputation of Reviewer 3
1.0
</pre>

== Scenarios ==
# Reviewer gives all max scores
# Reviewer gives all min scores
# Reviewer completes no review

Alternative scenario: reviewer gives max scores even if there are no inputs.
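
The first two scenarios can be run through the same steps as the simulation on this page. The sketch below is a minimal illustration under stated assumptions: the function name <code>simulate_reputation</code>, the rubric bounds of 1 and 5, and the two "honest" reviewers are hypothetical, and the sketch assumes every reviewer graded every assignment, so the "no review" scenario is not covered.

```python
import math


def simulate_reputation(reviews):
    """Reputation weight per reviewer, mirroring the steps of the simulation on this page.

    reviews[i][j] is the grade reviewer i gave assignment j (complete matrix assumed).
    """
    num_reviewers = len(reviews)
    num_assignments = len(reviews[0])
    # Unweighted average grade of each assignment
    grades = [
        sum(review[j] for review in reviews) / num_reviewers
        for j in range(num_assignments)
    ]
    # Mean squared deviation of each reviewer from the average grades
    delta_r = [
        sum((review[j] - grades[j]) ** 2 for j in range(num_assignments)) / num_assignments
        for review in reviews
    ]
    if any(d == 0 for d in delta_r):
        # Assumption of this sketch: the zero-deviation case is not handled
        raise ValueError("a reviewer matches the average grades exactly")
    average_delta_r = sum(delta_r) / num_reviewers
    weights = []
    for d in delta_r:
        weight_prime = average_delta_r / d
        if weight_prime <= 2:
            weights.append(weight_prime)
        else:
            weights.append(2 + math.log(weight_prime - 1))
    return weights


MIN_SCORE, MAX_SCORE = 1, 5  # assumed rubric bounds
honest = [[5, 4, 4, 3, 2], [5, 3, 4, 4, 2]]  # hypothetical honest reviewers

# Scenario 1: a third reviewer gives all max scores
all_max = simulate_reputation(honest + [[MAX_SCORE] * 5])
# Scenario 2: a third reviewer gives all min scores
all_min = simulate_reputation(honest + [[MIN_SCORE] * 5])
print("All max scores:", [round(w, 2) for w in all_max])
print("All min scores:", [round(w, 2) for w in all_min])
```

With these sample grades, the reviewer who always gives the extreme score ends up with the lowest weight in both scenarios.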

== Conclusion ==

As a team, we worked out the algorithms and their applications and wrote some test scenarios. However, we did not have a chance to work on the web service, since it does not work due to module errors: we ran into an "undefined method strip" error in the Reputation Web Service Controller. Although it sometimes works on the Expertiza team's side, we were not able to see the web service working. We created some test scenarios and wrote Python code to simulate the algorithm.
 
In the code segment written to simulate the hamer.rb algorithm as described in "A Method of Automatic Grade Calibration in Peer Assessment" by John Hamer, Kenneth T.K. Ma, and Hugh H.F. Kwong (https://crpit.scem.westernsydney.edu.au/confpapers/CRPITV42Hamer.pdf), we take a list of reviewers and their grades for each assignment reviewed and compute the associated reputation weights. Since the algorithm described in the paper does not specify an original weight for first-time reviewers, we coded it so that first-time reviewers have an original weight of 1. In addition, this code does not yet take into account existing weights for reviewers who already have reputation weights; this will be added soon. We followed the algorithm in the paper to the letter, but even then the output values given in its example did not match what we computed by hand and by code. Either we missed something or the algorithm has been changed.

When we tested against both PeerLogic and the mock, the current web service did not return correct results, since the returned values do not match the expected values, as can be seen in the picture. Returning the expected values may be what we are supposed to reach in this project.
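
One possible way to add existing reputation weights to the simulation is to replace the plain average grade with a weighted average, keeping the default weight of 1 for first-time reviewers. This is a minimal sketch of that idea and an assumption on our part, not the method confirmed by the paper or by hamer.rb; the function name <code>weighted_grades</code> and the prior weight of 2.0 are hypothetical.

```python
DEFAULT_WEIGHT = 1.0  # weight assumed for first-time reviewers


def weighted_grades(reviews, prior_weights):
    """Weighted average grade per assignment.

    reviews[i][j] is the grade reviewer i gave assignment j.
    prior_weights[i] is reviewer i's existing reputation weight, or None
    for a first-time reviewer (who then gets DEFAULT_WEIGHT).
    """
    weights = [DEFAULT_WEIGHT if w is None else w for w in prior_weights]
    total_weight = sum(weights)
    num_assignments = len(reviews[0])
    return [
        sum(w * review[j] for w, review in zip(weights, reviews)) / total_weight
        for j in range(num_assignments)
    ]


reviews = [[5, 4, 4, 3, 2], [5, 3, 4, 4, 2], [4, 3, 4, 3, 2]]

# Hypothetical: reviewer 1 already has a weight of 2.0, the others are new
print(weighted_grades(reviews, [2.0, None, None]))
# prints [4.75, 3.5, 4.0, 3.25, 2.0]
```

When every reviewer is new, all weights are 1 and the result reduces to the plain average used in the simulation.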

==GitHub Links==
Link to Expertiza repository: [https://github.com/expertiza/expertiza here]

Link to the forked repository: [https://github.com/joshlin5/expertiza here]

Link to pull request: [https://github.com/expertiza/expertiza/pull/2355/checks here]

== References ==

# Expertiza on GitHub (https://github.com/expertiza/expertiza)
# The live Expertiza website (http://expertiza.ncsu.edu/)
# Pluggable reputation systems for peer review: A web-service approach (https://doi.org/10.1109/FIE.2015.7344292)