CSC/ECE 517 Fall 2013/oss E815 saa

From Expertiza_Wiki
Revision as of 23:09, 30 October 2013 by Anish (talk | contribs)

Refactoring and Testing of wordnet_based_similarity.rb


Introduction to WordNet

Expertiza is an application that aims at 'reusable learning objects through peer review'. It allows students to submit their work and review others' work. Our OSS project aims at refactoring and testing wordnet_based_similarity.rb. The file determines how similar two concepts are, based on the WordNet database, and is used in the NLP analysis of reviews. WordNet is a widely used resource for measuring similarity. It is a network of nouns, verbs, adjectives and adverbs that are grouped into synsets (sets of synonymous words) and linked by lexical relations. Compared with Wikipedia, WordNet covers fewer domains and lacks real-world knowledge, but it is faster to query and requires no additional pre-processing. It also allows comparisons across different word forms.

WordNet relations-based semantic metric

In order to identify similarity, a relations-based metric is used. The relatedness between two terms v and w, written match(v, w), is one of the match types listed below. Each type of match is given a weight based on its importance, so matches are assigned values in the range 0–6. A value of 6 is assigned to the best match (an exact match) and a value of 0 to a distinct match (a non-match). The possible matches are:

  • If v and w are exactly the same. This match is given a weight of 6.
  • If v and w are synonymous. This match is given a weight of 5.
  • If v is a hypernym of w (i.e., v is more generic than the token w) or vice versa, or v is a hyponym of w (i.e., v is a more specific form of w) or vice versa. This match is given a weight of 4.
  • If v is a meronym of w (i.e., v is a sub-part of w) or vice versa, or v is a holonym of w (i.e., v contains w as a sub-part) or vice versa. For example, “leg” is a meronym of the token “body” and “body” is a holonym of the token “leg”. This type of match is given a weight of 3.
  • If v and w have common parents (excluding generic parents), the normalized distance between the two tokens' synsets is calculated. (1 - distance) gives the similarity between the two tokens, which is then scaled to a value between 0 and 6. If the scaled value is greater than 0, a common-parents match exists between v and w and is given a weight of 2.
  • If the definitions of v and w overlap. This follows the Lesk approach, which determines the sense of a word by matching its definition against the definition of a word whose sense is already known; a high degree of overlap indicates that the two tokens share the same sense. Overlapping definitions thus serve as a metric for context-based similarity across tokens. Overlaps across the usage examples of the words are also checked. If two tokens have overlapping definitions or examples, the match is given a weight of 1.
  • If v and w are distinct tokens or phrases. This distinct match is given a weight of 0.
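The ordering of the match types above can be sketched in Ruby as a series of checks that returns the first (highest-weight) match found. This is only an illustration under stated assumptions: TOY_LEXICON, GENERIC_PARENTS, STOP_WORDS and all method names are hypothetical stand-ins, not the code in wordnet_based_similarity.rb, and the common-parents check is simplified to "shares a non-generic hypernym" instead of the normalized-distance calculation.

```ruby
require 'set'

# Weights as listed above; the key names are illustrative.
MATCH_WEIGHTS = {
  exact: 6, synonym: 5, hypernym_hyponym: 4, meronym_holonym: 3,
  common_parent: 2, overlap: 1, distinct: 0
}.freeze

# Hypothetical toy data standing in for WordNet lookups.
TOY_LEXICON = {
  'leg'  => { synonyms: ['limb'], hypernyms: ['body part'], meronyms: [],
              holonyms: ['body'], gloss: 'a limb used for walking' },
  'arm'  => { synonyms: [], hypernyms: ['body part'], meronyms: [],
              holonyms: ['body'], gloss: 'a limb used for reaching' },
  'body' => { synonyms: ['physique'], hypernyms: ['entity'], meronyms: %w[leg arm],
              holonyms: [], gloss: 'the physical structure of a person' },
  'cane' => { synonyms: [], hypernyms: ['stick'], meronyms: [],
              holonyms: [], gloss: 'a stick used for walking' }
}.freeze

GENERIC_PARENTS = Set['entity', 'object'].freeze          # assumed "generic" parents
STOP_WORDS = Set['a', 'an', 'the', 'of', 'for', 'used'].freeze

# Words related to `word` through `relation`; empty for unknown words.
def related(word, relation)
  TOY_LEXICON.fetch(word, {}).fetch(relation, [])
end

# True if either token lists the other under any of the given relations.
def linked?(v, w, *relations)
  relations.any? { |r| related(v, r).include?(w) || related(w, r).include?(v) }
end

# Content words of a token's definition, with stop words removed.
def gloss_words(word)
  TOY_LEXICON.fetch(word, {}).fetch(:gloss, '').downcase.split.to_set - STOP_WORDS
end

# Checks the match types from strongest to weakest and returns the first hit.
def match_type(v, w)
  return :exact if v == w
  return :synonym if linked?(v, w, :synonyms)
  return :hypernym_hyponym if linked?(v, w, :hypernyms)
  return :meronym_holonym if linked?(v, w, :meronyms, :holonyms)

  shared = related(v, :hypernyms).to_set & related(w, :hypernyms).to_set
  return :common_parent unless (shared - GENERIC_PARENTS).empty?
  return :overlap unless (gloss_words(v) & gloss_words(w)).empty?

  :distinct
end

def match_weight(v, w)
  MATCH_WEIGHTS.fetch(match_type(v, w))
end
```

With this toy data, match_weight('leg', 'body') finds the meronym/holonym relation and match_weight('leg', 'cane') falls through to the definition overlap on "walking". Checking the strongest relation first means a pair that satisfies several relations is scored by its best match.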


POS

A token is compared with another token using the combination of the token and its part-of-speech (POS) information. This ensures that the right types of words are compared, since the POS of each token is taken into consideration. The appropriate set of synonyms, hypernyms, etc. for a token is extracted with the help of its POS information, so the generated synonyms differ in meaning depending on the token's POS. Hence this metric determines relatedness using the POS values of the compared tokens.
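As a rough illustration of why the POS matters, the lookup can be keyed on the pair of token and POS, so the same token yields different synonyms as a noun and as a verb. The data and method names below are hypothetical, and requiring both tokens to have the same POS is an assumption of this sketch rather than something taken from the file.

```ruby
# Hypothetical toy data: synonyms keyed by [token, part of speech].
POS_SYNONYMS = {
  ['review', :noun] => ['critique', 'assessment'],
  ['review', :verb] => ['examine', 'reassess']
}.freeze

# Synonyms for a token under a specific POS; empty if the pair is unknown.
def synonyms_for(token, pos)
  POS_SYNONYMS.fetch([token.downcase, pos], [])
end

# Assumption: tokens are only treated as synonymous when their POS agrees.
def pos_synonym_match?(token_a, pos_a, token_b, pos_b)
  return false unless pos_a == pos_b

  synonyms_for(token_a, pos_a).include?(token_b.downcase) ||
    synonyms_for(token_b, pos_b).include?(token_a.downcase)
end
```

Here pos_synonym_match?('review', :verb, 'examine', :verb) succeeds, while the same tokens tagged as nouns do not match, because "examine" is not among the noun synonyms of "review".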

Refactoring wordnet_based_similarity.rb

The methods in this file are long and complex, and contain a lot of duplicated code. Refactoring removes this duplication so that the code follows the DRY principle. Since these methods are important, they need to be written transparently and tested thoroughly.

Case 1:

In this case we removed redundancy by replacing code that was repeated four times with calls to a single function.

Before Refactoring

Arrays for synonyms, hypernyms, hyponyms and antonyms were declared, and then a separate block of code checked each of these conditions and filled the corresponding array.

4func_calls

The four blocks are given below. Apart from the parameter that changes, they perform the same function.

1) Synonyms

synonyms_func

2) Hypernyms

hypernyms_func


3) Hyponyms

hyponyms_func

4) Antonyms

antonyms_func

After Refactoring

We implemented the code below, which eliminates the repeated occurrences of the same code.

get_all_nyms_func

In the above function ‘get_all_nyms’, the work previously done separately for each of the arrays (syn_arr, hyper_arr, hypo_arr, anto_arr) is done locally within the function: it checks against a particular parameter and returns the value to the respective array (one of the four).

Note that the call for each of the arrays must be within the ‘for’ loop, which is in turn within an ‘if’ condition. The function calls are as follows:

nyms_calls

The ‘opr’ variable in the ‘get_all_nyms’ method holds one of the values "&", "@", "~" or "!", depending on which operation is being executed. This refactoring brought 64 lines of code down to 15 lines.
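The idea behind this refactoring, one function parameterized by the operator symbol instead of four near-identical blocks, can be sketched as follows. This is not the actual ‘get_all_nyms’ from the file: the Pointer struct, the collect_nyms name and its signature are hypothetical, and pairing "&" with the synonym array is an assumption based on the order in which the arrays and symbols are listed above.

```ruby
# Operator symbols as listed above; pairing "&" with synonyms is an assumption.
NYM_OPERATORS = { synonyms: '&', hypernyms: '@', hyponyms: '~', antonyms: '!' }.freeze

# Hypothetical record: a related word plus the symbol of the relation linking it.
Pointer = Struct.new(:symbol, :word)

# Returns the words linked through relations whose symbol equals opr.
# One parameterized method replaces four copies that differed only in the symbol.
def collect_nyms(pointers, opr)
  pointers.select { |ptr| ptr.symbol == opr }.map(&:word).uniq
end

# Toy relations for the word "hot" (illustrative data only).
pointers = [
  Pointer.new('&', 'warm'),
  Pointer.new('@', 'temperature'),
  Pointer.new('!', 'cold'),
  Pointer.new('&', 'warm') # duplicates are dropped by uniq
]

# Ruby hashes keep insertion order, so the results line up with the four arrays.
syn_arr, hyper_arr, hypo_arr, anto_arr =
  NYM_OPERATORS.values.map { |opr| collect_nyms(pointers, opr) }
```

Each array is filled by the same method call with a different ‘opr’ value, which is where the reduction in line count comes from.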

Case 2:

In this case we avoided writing the same code twice by moving it into a function.

Before Refactoring

Previously, the same code occurred twice, as shown below:

redundant1

and

redundant2

After Refactoring

After refactoring this code we defined a function as follows:

check_sub_rev_func

And the function call is as follows (the same function call also appears at line 138 of the file):

check_sub_rev_call

This refactor reduced the code duplication.

Case 3:

Unlike the previous two cases, Case 3 is an instance where, although the code looks redundant, refactoring it might result in less efficient code than its present state. The code currently looks as follows:

stokRev

and

stokSub

Although the two pieces of code look quite similar, refactoring them into a single method is not advisable. The first set of statements is inside a for loop (loop variable ‘i’), and the second set is inside a nested for loop (loop variable ‘j’) within that loop. Each piece of code uses its loop iterator, and the variables set in these pieces of code are accessed for other functionality before the current iteration completes. Therefore, even if the two were refactored into a single function, that function would have to be called once per iteration, which results in inefficiency.