CSC/ECE 517 Fall 2013/oss E816 cyy

From Expertiza_Wiki
Jump to navigation Jump to search

Introduction to Refactoring plagiarism_check.rb and sentence_state.rb

Expertiza is a web application, which allows students to submit assignments and do peer review of each other's work<ref> Expertiza github</ref>. Expertiza also supports team projects and any document type of submission is acceptable<ref> Expertiza wiki</ref>. Expertiza has been deployed for years to help professors and students engaging in the learning process. Expertiza is an open source project, for each year, students in the course of CSC517-Object Oriented Programmning of North Carolina State University will contributes to this project along with teaching assistant and professor.

For this year, we are responsible for refactoring plagiarism_check.rb and sentence_state.rb of the Expertiza project. Expertiza is built using Ruby on Rails with MVC design pattern. plagiarism_check.rb and sentence_state.rb are parts of the automated_metareview functionality inside models. The responsibility of sentence_state.rb is to determine the state of each clause of a sentence, and the responsibility of plagiarism_check.rb is to determine whether the reviews are just copied from other sources.

Project description

Design

sentence_state.rb

To see the original code please go to this link: https://github.com/expertiza/expertiza/blob/master/app/models/automated_metareview/sentence_state.rb

Design Smells

The original code had several design smells, mostly deeply-nested if-else statements and duplicated code. Another design smell was that SentenceState had too many responsibilities. It had to first parse the sentence into separate sentence clauses and then separate the sentence clauses into tokens before iterating through the tokens to determine the state of the sentence. The worst problem was the a deeply nested if-else statement which determined the next state of the sentence clause based on the previous state and the next sentence token. Instead of the SentenceState class being responsible for all of these relationships, it is better to have subclasses of SentenceState which each know the relationship between there own state and any sentence token type.

Refactor Steps

The first step in refactoring is to get the tests to pass. This required some debugging to find that some constants were defined in two different files, and one of the definitions was incomplete. After updating this, all 18 original tests in sentence_state_test.rb passed.

The first place to refactor was the longest method in SentenceState, the method sentence_state(str_with_pos_tags). There were three for loops, each with deeply nested if-else statements inside of them. To make this method more readable, I extracted three for or if-else statements into their own method to clean up the code. This changed the code from:

def sentence_state(str_with_pos_tags)
   state = POSITIVE
   #checking single tokens for negated words
   st = str_with_pos_tags.split(" ")
   count = st.length
   tokens = Array.new
   tagged_tokens = Array.new
   i = 0
   interim_noun_verb  = false #0 indicates no interim nouns or verbs
       
   #fetching all the tokens
   for k in (0..st.length-1)
     ps = st[k]
     #setting the tagged string
     tagged_tokens[i] = ps
     if(ps.include?("/"))
       ps = ps[0..ps.index("/")-1] 
     end
     #removing punctuations 
     if(ps.include?("."))
       tokens[i] = ps[0..ps.index(".")-1]
     elsif(ps.include?(","))
       tokens[i] = ps.gsub(",", "")
     elsif(ps.include?("!"))
       tokens[i] = ps.gsub("!", "")
     elsif(ps.include?(";"))
       tokens[i] = ps.gsub(";", "")
     else
       tokens[i] = ps
       i+=1
     end     
   end#end of the for loop
   
   #iterating through the tokens to determine state
   prev_negative_word =""
   for j  in (0..i-1)
     #checking type of the word
     #checking for negated words
     if(is_negative_word(tokens[j]) == NEGATED)  
       returned_type = NEGATIVE_WORD
     #checking for a negative descriptor (indirect indicators of negation)
     elsif(is_negative_descriptor(tokens[j]) == NEGATED)
       returned_type = NEGATIVE_DESCRIPTOR
     #2-gram phrases of negative phrases
     elsif(j+1 < count && !tokens[j].nil? && !tokens[j+1].nil? && 
       is_negative_phrase(tokens[j]+" "+tokens[j+1]) == NEGATED)
       returned_type = NEGATIVE_PHRASE
       j = j+1      
     #if suggestion word is found
     elsif(is_suggestive(tokens[j]) == SUGGESTIVE)
       returned_type = SUGGESTIVE
     #2-gram phrases suggestion phrases
     elsif(j+1 < count && !tokens[j].nil? && !tokens[j+1].nil? &&
        is_suggestive_phrase(tokens[j]+" "+tokens[j+1]) == SUGGESTIVE)
       returned_type = SUGGESTIVE
       j = j+1
     #else set to positive
     else
       returned_type = POSITIVE
     end
     
     #----------------------------------------------------------------------
     #comparing 'returnedType' with the existing STATE of the sentence clause
     #after returnedType is identified, check its state and compare it to the existing state
     #if present state is negative and an interim non-negative or non-suggestive word was found, set the flag to true
     if((state == NEGATIVE_WORD or state == NEGATIVE_DESCRIPTOR or state == NEGATIVE_PHRASE) and returned_type == POSITIVE)
       if(interim_noun_verb == false and (tagged_tokens[j].include?("NN") or tagged_tokens[j].include?("PR") or tagged_tokens[j].include?("VB") or tagged_tokens[j].include?("MD")))
         interim_noun_verb = true
       end
     end 
     
     if(state == POSITIVE and returned_type != POSITIVE)
       state = returned_type
     #when state is a negative word
     elsif(state == NEGATIVE_WORD) #previous state
       if(returned_type == NEGATIVE_WORD)
         #these words embellish the negation, so only if the previous word was not one of them you make it positive
         if(prev_negative_word.casecmp("NO") != 0 and prev_negative_word.casecmp("NEVER") != 0 and prev_negative_word.casecmp("NONE") != 0)
           state = POSITIVE #e.g: "not had no work..", "doesn't have no work..", "its not that it doesn't bother me..."
         else
           state = NEGATIVE_WORD #e.g: "no it doesn't help", "no there is no use for ..."
         end  
         interim_noun_verb = false #resetting         
       elsif(returned_type == NEGATIVE_DESCRIPTOR or returned_type == NEGATIVE_PHRASE)
         state = POSITIVE #e.g.: "not bad", "not taken from", "I don't want nothing", "no code duplication"// ["It couldn't be more confusing.."- anomaly we dont handle this for now!]
         interim_noun_verb = false #resetting
       elsif(returned_type == SUGGESTIVE)
         #e.g. " it is not too useful as people could...", what about this one?
         if(interim_noun_verb == true) #there are some words in between
           state = NEGATIVE_WORD
         else
           state = SUGGESTIVE #e.g.:"I do not(-) suggest(S) ..."
         end
         interim_noun_verb = false #resetting
       end
     #when state is a negative descriptor
     elsif(state == NEGATIVE_DESCRIPTOR)
       if(returned_type == NEGATIVE_WORD)
         if(interim_noun_verb == true)#there are some words in between
           state = NEGATIVE_WORD #e.g: "hard(-) to understand none(-) of the comments"
         else
           state = POSITIVE #e.g."He hardly not...."
         end
         interim_noun_verb = false #resetting
       elsif(returned_type == NEGATIVE_DESCRIPTOR)
         if(interim_noun_verb == true)#there are some words in between
           state = NEGATIVE_DESCRIPTOR #e.g:"there is barely any code duplication"
         else 
           state = POSITIVE #e.g."It is hardly confusing..", but what about "it is a little confusing.."
         end
         interim_noun_verb = false #resetting
       elsif(returned_type == NEGATIVE_PHRASE)
         if(interim_noun_verb == true)#there are some words in between
           state = NEGATIVE_PHRASE #e.g:"there is barely any code duplication"
         else 
           state = POSITIVE #e.g.:"it is hard and appears to be taken from"
         end
         interim_noun_verb = false #resetting
       elsif(returned_type == SUGGESTIVE)
         state = SUGGESTIVE #e.g.:"I hardly(-) suggested(S) ..."
         interim_noun_verb = false #resetting
       end
     #when state is a negative phrase
     elsif(state == NEGATIVE_PHRASE)
       if(returned_type == NEGATIVE_WORD)
         if(interim_noun_verb == true)#there are some words in between
           state = NEGATIVE_WORD #e.g."It is too short the text and doesn't"
         else
           state = POSITIVE #e.g."It is too short not to contain.."
         end
         interim_noun_verb = false #resetting
       elsif(returned_type == NEGATIVE_DESCRIPTOR)
         state = NEGATIVE_DESCRIPTOR #e.g."It is too short barely covering..."
         interim_noun_verb = false #resetting
       elsif(returned_type == NEGATIVE_PHRASE)
         state = NEGATIVE_PHRASE #e.g.:"it is too short, taken from ..."
         interim_noun_verb = false #resetting
       elsif(returned_type == SUGGESTIVE)
         state = SUGGESTIVE #e.g.:"I too short and I suggest ..."
         interim_noun_verb = false #resetting
       end
     #when state is suggestive
     elsif(state == SUGGESTIVE) #e.g.:"I might(S) not(-) suggest(S) ..."
       if(returned_type == NEGATIVE_DESCRIPTOR)
         state = NEGATIVE_DESCRIPTOR
       elsif(returned_type == NEGATIVE_PHRASE)
         state = NEGATIVE_PHRASE
       end
       #e.g.:"I suggest you don't.." -> suggestive
       interim_noun_verb = false #resetting
     end
     
     #setting the prevNegativeWord
     if(tokens[j].casecmp("NO") == 0 or tokens[j].casecmp("NEVER") == 0 or tokens[j].casecmp("NONE") == 0)
       prev_negative_word = tokens[j]
     end  
         
   end #end of for loop
   
   if(state == NEGATIVE_DESCRIPTOR or state == NEGATIVE_WORD or state == NEGATIVE_PHRASE)
     state = NEGATED
   end
   
   return state
end

to:

def sentence_state(str_with_pos_tags)
   state = POSITIVE
   interim_noun_verb  = false #0 indicates no interim nouns or verbs
       
   tagged_tokens, tokens = parse_sentence_tokens(str_with_pos_tags)
   
   #iterating through the tokens to determine state
   prev_negative_word =""
   for j  in (0..tokens.length-1)
     #checking type of the word
     #checking for negated words
     current_token_type = get_token_type(tokens[j..tokens.length-1)
     
     #comparing 'returnedType' with the existing STATE of the sentence clause
     #after returnedType is identified, check its state and compare it to the existing state
     #if present state is negative and an interim non-negative or non-suggestive word was found, set the flag to true
     if((state == NEGATIVE_WORD or state == NEGATIVE_DESCRIPTOR or state == NEGATIVE_PHRASE) and returned_type == POSITIVE)
       if(interim_noun_verb == false and (tagged_tokens[j].include?("NN") or tagged_tokens[j].include?("PR") or tagged_tokens[j].include?("VB") or tagged_tokens[j].include?("MD")))
         interim_noun_verb = true
       end
     end 
     
     state = next_state(state, current_token_type, prev_negative_word, interim_noun_verb)
     
     #setting the prevNegativeWord
     if(tokens[j].casecmp("NO") == 0 or tokens[j].casecmp("NEVER") == 0 or tokens[j].casecmp("NONE") == 0)
       prev_negative_word = tokens[j]
     end  
         
   end #end of for loop
   
   if(state == NEGATIVE_DESCRIPTOR or state == NEGATIVE_WORD or state == NEGATIVE_PHRASE)
     state = NEGATED
   end
   
   return state
end

plagiarism_check.rb

To see the original code please go to this link.


Main Responsibility

The main responsibility of Plagiarism_Check is to determine whether the reviews are just copied from other sources.


Basically, there are four kinds of plagiarism need to be check :

1. whether the review is copied from the submissions of the assignment

2. whether the review is copied from the review questions

3. whether the review is copied from other reviews

4. whether the review is copied from the Internet or other sources, this may be detected through google search


For example, in the test file: expertiza/test/unit/automated_metareview/plagiarism_check_test.rb,

The 1st test shows:

test "check for plagiarism true match" do
   review_text = ["The sweet potatoes in the vegetable bin are green with mold. These sweet potatoes in the vegetable bin are fresh."]
   subm_text = ["The sweet potatoes in the vegetable bin are green with mold. These sweet potatoes in the vegetable bin are fresh."]
  
   instance = PlagiarismChecker.new
   assert_equal(true, instance.check_for_plagiarism(review_text, subm_text))
end

The check_for_plagiarism method compares the review text with submission text. In this case, the review text does not quote the words as well as sentences properly and the reviewer just copies what the author says, which cause a plagiarism.

Design Ideas

From above point of view, the refactoring needs to be done with 4 fundamental methods and each method only does one thing correctly. So as the initial file Plagiarism_check.rb indicates, the compare_reviews_with_questions_responses method has roughly 2 functions : compare reviews with review questions as well as compare reviews with others’ responses, which makes us confused. As the refactoring goes, we need to split the two functions up, and make sure such bad smells disappear.

The first thing to do is based on the above statement, we need to define 4 methods with different functions.

They are: compare_reviews_with_submissions,

compare_reviews_with_questions,

compare_reviews_with_responses, and

compare_reviews_with_google_search, each method has its specific functions.


As showed above, we have to split the method compare_reviews_with_questions _responses up to 2 methods:

def compare_reviews_with_questions(auto_metareview, map_id)
…
end
def compare_reviews_with_responses(auto_metareview, map_id)
…
end

Refactor Steps

Next we need to extract the same part from the long method and make the part a individual method which can be called in class. For example in the method compare_reviews_with_questions and compare_reviews_with_responses they have the common parts: to check whether the reviews are copied fully from the responses/questions,

if(count_copies > 0) #resetting review_array only when plagiarism was found
      auto_metareview.review_array = rev_array
   end
   
   if(count_copies > 0 and count_copies == scores.length)
     return ALL_RESPONSES_PLAGIARISED #plagiarism, with all other metrics 0
   elsif(count_copies > 0)
     return SOME_RESPONSES_PLAGIARISED #plagiarism, while evaluating other metrics
end

To avoid such things to happen, we extract this part and let it be a method to check the state of plagiarism :

def check_plagiarism_state(auto_metareview, count_copies, rev_array, scores)
  if count_copies > 0 #resetting review_array only when plagiarism was found
    auto_metareview.review_array = rev_array
    if count_copies == scores.length
      return ALL_RESPONSES_PLAGIARISED #plagiarism, with all other metrics 0
    else
      return SOME_RESPONSES_PLAGIARISED #plagiarism, while evaluating other metrics
    end
  end
end

Next thing to do is extract the long loop or if-else sentence to a individual method in order to make the initial method too long or confused for others.

Take the 1st method compare_reviews_with_submissions as example, we noticed that the this part:

if(array[rev_len] == " ") #skipping empty
         rev_len+=1
         next
       end
       
       #generating the sentence segment you'd like to compare
rev_phrase = array[rev_len]

can be extracted and made a new method skip_empty_array, since these lines focus on the function of generating the array without backspaces to make comparisons. Once we extract the method, the initial code of the compare_reviews_with_submissions changed:

expertiza/app/models/automated_metareview/plagiarism_check.rb

...
review_text.each do |review_arr| #iterating through the review's sentences
   review = review_arr.to_s
   subm_text.each do |subm_arr|
     #iterating though the submission's sentences
     submission = subm_arr.to_s
     rev_len = 0
     #review's tokens, taking 'n' at a time
     array = review.split(" ")
     while(rev_len < array.length) do
       rev_len, rev_phrase = skip_empty_array(array, rev_len)
     ...
def skip_empty_array(array, rev_len)
 if (array[rev_len] == " ") #skipping empty
   rev_len+=1
end
 #generating the sentence segment you'd like to compare
 rev_phrase = array[rev_len]
 return rev_len, rev_phrase
end

Please see code after refactoring in detail on this page.

All the tests have been passed without failures since refactoring.

Test Our Code

Link to VCL

The purpose of running the VCL server is to let you make sure that expertiza is still working properly using our refactored code. The first VCL link is seeded with the expertiza-scrubbed.sql file which includes questionnaires and courses and assignments so that it is easy to verify that reviews work. You only need to make users and then have them do reviews on one another. The second link is only using the test.sql file but you can still verify that the functionality of expertiza works. If neither of these links work, please do not do your review in a hurry, shoot us an email, we will fix it as soon as possible. (yhuang25@ncsu.edu, ysun6@ncsu.edu, grimes.caroline@gmail.com). Thank you so much!

1. http://152.46.20.30:3000/ Username: admin, password:password

2. http://vclv99-129.hpc.ncsu.edu:3000 Username: admin, password: admin

Git Forked Repository URL

https://github.com/shanfangshuiyuan/expertiza <ref> Expertiza fork</ref>

Steps to Setup Project

1. Clone the git repository shown above.

2. Use ruby 1.9.3

3. Setup mysql and start server

4. Command line: bundle install

5. Download from http://dev.mysql.com/get/Downloads/Connector-C/mysql-connector-c-noinstall-6.0.2-win32.zip/from/pick and copy all files from the lib folder from the download into <Ruby193>\bin

6. Change /config/database.yml according your mysql root password and mysql port.

7. Command line: db:create:all

8. Command line: mysql -u root -p <YOUR_PASSWORD> pg_development < expertiza-scrubbed_2013_07_10.sql

9. Command line: rake db:migrate

10. Command line: rails server

Test Our Code

1. Set up the project following the steps above

2. Command line: db:test:prepare

3. Run plagiarism_check_test.rb and sentence_state_test.rb, they are under /test/unit/automated_metareview. After refactoring, all tests passed without error.

4. Review the refactored files: sentence_state.rb and plagiarism_check.rb are under /app/models/automated_metareview. Other changed files are shown below.

Files Changed

1. text_preprocessing.rb

2. plagiarism_check.rb

3. sentence_state.rb

4. tagged_sentence.rb

5. constants.rb

6. negations.rb

7. plagiarism_check_test.rb

Future work

References

<references/>