CSC/ECE 517 Fall 2013/oss E815 saa: Difference between revisions
Revision as of 21:28, 30 October 2013
Refactoring and Testing of wordnet_based_similarity.rb
Introduction to WordNet
Expertiza is an application that aims at 'reusable learning objects through peer review'. It allows students to submit their work and review others' work. Our OSS project aims at refactoring and testing wordnet_based_similarity.rb. The file determines how similar concepts are, based on the WordNet database, and is used in NLP analysis of reviews. WordNet is a widely used resource for measuring similarity. It is a network of nouns, verbs, adjectives and adverbs, which are grouped into synsets (sets of synonymous words) and linked by lexical relations. Compared to Wikipedia, WordNet covers fewer domains and lacks real-world knowledge, but it is faster to query and involves no additional pre-processing. It also allows comparisons across different word forms.
WordNet relations-based semantic metric
In order to identify similarity, a relations-based metric is used. The relatedness between two terms v and w, known as match(v, w), is one of the types listed below. Each type of match is given a weight based on its importance, so matches are assigned values in the range 0–6. A value of 6 is assigned for the best match (an exact match) and a value of 0 for a distinct match or non-match. The possible matches are:
- If v and w are exactly the same. This match is given a weight value of 6.
- If v and w are synonymous. This match is given a weight of 5.
- If v is a hypernym of w (i.e., v is more generic than the token w) or vice versa, or v is a hyponym of w (i.e., v is a more specific form of w) or vice versa. This match is given a weight of 4.
- If v is a meronym of w (i.e., v is a sub-part of w) or vice versa, or v is a holonym of w (i.e., v contains w as a sub-part) or vice versa. For example, “leg” is a meronym of the token “body” and “body” is the holonym of the term “leg”. This type of match is given a weight of 3.
- If v and w have common parents (excluding generic parents), the normalized distance between the two tokens' synsets is calculated. (1 - distance) gives the similarity between the two tokens, which is then scaled to a value between 0 and 6. If the scaled value is greater than 0, a common-parents match exists between v and w and is given a weight of 2.
- The Lesk approach identifies the match between the definition of a word whose sense is to be determined and the definition of a word whose sense is already known. A high degree of overlap indicates that the two tokens share the same sense. Context-based similarity across tokens is determined by using overlapping definitions as a metric. Overlaps across examples of the words are also considered. If two tokens have overlapping definitions or examples, the match is given a weight of 1.
- If v and w contain distinct tokens or phrases. This distinct match is given a weight of 0.
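The weighting scheme in the list above can be sketched in Ruby as a cascade that returns the weight of the strongest match found. This is only an illustrative sketch, not the code in wordnet_based_similarity.rb: the constant names, the `TOY_LEXICON` data and the `match_weight` method are all hypothetical, the lexicon is a hand-written stand-in for real WordNet lookups, and the common-parents check is simplified to "shares a direct hypernym" instead of computing a normalized distance.

```ruby
# Weights from the list above (constant names are illustrative assumptions).
EXACT = 6
SYNONYM = 5
HYPERNYM = 4
MERONYM = 3
COMMON_PARENT = 2
OVERLAP = 1
NOMATCH = 0

# Hypothetical toy data standing in for WordNet; not the real database.
TOY_LEXICON = {
  "car"      => { synonyms: ["automobile"], hypernyms: ["vehicle"],
                  gloss: "a motor vehicle with four wheels" },
  "leg"      => { hypernyms: ["limb"], holonyms: ["body"] },
  "arm"      => { hypernyms: ["limb"], holonyms: ["body"] },
  "review"   => { gloss: "a critical evaluation of a work" },
  "critique" => { gloss: "a detailed evaluation of something" }
}.freeze

STOPWORDS = %w[a an the of].freeze

def entry(token)
  TOY_LEXICON.fetch(token, {})
end

# True if w is listed under v's relation, or v under w's ("or vice versa").
def related?(v, w, relation)
  Array(entry(v)[relation]).include?(w) || Array(entry(w)[relation]).include?(v)
end

# Simplification: a shared direct hypernym counts as a common parent.
def common_parent?(v, w)
  (Array(entry(v)[:hypernyms]) & Array(entry(w)[:hypernyms])).any?
end

def gloss_words(token)
  entry(token)[:gloss].to_s.downcase.scan(/[a-z]+/) - STOPWORDS
end

# Lesk-style check reduced to "the definitions share a content word".
def gloss_overlap?(v, w)
  (gloss_words(v) & gloss_words(w)).any?
end

# Checks the match types from strongest to weakest and returns the first hit.
def match_weight(v, w)
  return EXACT if v == w
  return SYNONYM if related?(v, w, :synonyms)
  return HYPERNYM if related?(v, w, :hypernyms)   # covers hyponyms as the inverse
  return MERONYM if related?(v, w, :holonyms)     # covers meronyms as the inverse
  return COMMON_PARENT if common_parent?(v, w)
  return OVERLAP if gloss_overlap?(v, w)
  NOMATCH
end
```

With this toy data, `match_weight("leg", "body")` falls through the exact, synonym and hypernym checks and is classified as a meronym/holonym match, while `match_weight("arm", "leg")` is classified as a common-parents match because both tokens list "limb" as a hypernym.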
Exp
A combination of a token and its POS (part-of-speech) information is used for comparison with another token. Such a match ensures that the right types of words are compared, since the POS of each token is also taken into consideration. The appropriate set of synonyms, hypernyms, etc. is extracted for a particular token with the help of its POS information.
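One way to picture the role of POS information is a lookup keyed on the (token, POS) pair, so that "book" as a noun and "book" as a verb yield different synonym sets. The sketch below is an assumption-laden illustration: the `TOY_SYNONYMS_BY_POS` table and both method names are hypothetical, the data is hand-written and does not come from WordNet, and the rule that two tokens must share a POS before being compared is an assumption about how the comparison is restricted.

```ruby
# Hypothetical POS-keyed synonym table; real data would come from WordNet.
TOY_SYNONYMS_BY_POS = {
  ["book", :noun] => ["volume"],
  ["book", :verb] => ["reserve"]
}.freeze

# Returns the synonyms recorded for this token under the given POS only.
def synonyms_for(token, pos)
  TOY_SYNONYMS_BY_POS.fetch([token.downcase, pos], [])
end

# Assumption: tokens are compared only when their POS tags agree.
def synonym_match?(v, v_pos, w, w_pos)
  return false unless v_pos == w_pos
  synonyms_for(v, v_pos).include?(w.downcase) ||
    synonyms_for(w, w_pos).include?(v.downcase)
end
```

Here `synonym_match?("book", :verb, "reserve", :verb)` succeeds, whereas the same pair tagged as nouns does not, because the noun sense of "book" has a different synonym set.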