CSC/ECE 517 Fall 2009/wiki3 2 clone: Difference between revisions
(→Moss) |
|||
(58 intermediate revisions by 2 users not shown) | |||
Line 3: | Line 3: | ||
== Introduction == | == Introduction == | ||
"Software clones are segments of code that are similar according to some definition of similarity" —Ira Baxter, 2002[1]. As per this definition, two code snippets may be similar based on text, syntactic structure or semantics or if they follow same pattern. Two code fragments are similar if their program text is similar. Code snippets may not be semantically equivalent. Such code snippets are also termed as redundant because if one changes then the other also needs to be changed. However, some clones cannot be replaced by another for example, two code snippets may be identical at the textual level but they refer to different variables declared in different places with the same name. Example of such clones is shown below: | "Software clones are segments of code that are similar according to some definition of similarity" —Ira Baxter, 2002[1]. As per this definition, two code snippets may be similar based on text, [http://en.wikipedia.org/wiki/Syntax syntactic structure] or [http://en.wikipedia.org/wiki/Semantics semantics] or if they follow same pattern. Two code fragments are similar if their program text is similar. [http://en.wikipedia.org/wiki/Snippet_(programming) Code snippets] may not be semantically equivalent. Such code snippets are also termed as redundant because if one changes then the other also needs to be changed. However, some clones cannot be replaced by another for example, two code snippets may be identical at the textual level but they refer to different variables declared in different places with the same name. Example of such clones is shown below: | ||
Example of such clones is shown below: | Example of such clones is shown below: | ||
Line 27: | Line 27: | ||
In the above code snippet, count in methodA refers to global variable whereas in methodB it refers to local count variable. | In the above code snippet, count in methodA refers to global variable whereas in methodB it refers to local count variable. | ||
== | = Clone Detection Tools = | ||
=== <b>Duploc</b> === | |||
This language independent tool helps in clone detection. It offers clickable matrix display that helps developers to locate the source code snippet that is cloned. This tool includes an [http://www.cc.gatech.edu/gvu/softviz/infoviz/information_mural.html information mural algorithm], and it helps the tool to show a matrix of 100,000 lines per side in its entirety on a 600x800 screen[2]. The performance this tool is good for systems of sizes below 1 MLOC. And for [http://en.wikipedia.org/wiki/Software_maintenance maintenance projects] those seldom change over time, duplication of code can be identified over night and can be interpreted on next day. | |||
=== Dup === | |||
This tool identifies cloned code in large softwares. It was developed with an intention that as software progresses in later stages of development mainly in maintenance phase, where new enhancements are added to the software or bugs are fixed, developers intend to use copy & paste existing code[4]. This makes software unmaintainable. The tool allows user to find exact clones or clones with exceptions of some set of variable names and constants. It helps in identifying code snippets that should be written as procedures. It also helps in Parameterized Matching. Consider below code snippet from [http://en.wikipedia.org/wiki/B-tree B-Tree] program implementation: | |||
1: fstream fp; | 1: fstream fp; | ||
Line 57: | Line 58: | ||
Lines 4-6 and 9-11 will be considered in parameterized matching, where rootOffset = 0, fstream::beg = fstream::end, test_node = pNode and Dup will produce that match in the report. This will be considered under parameterized matching because three lines are identical except for those three variables. | Lines 4-6 and 9-11 will be considered in parameterized matching, where rootOffset = 0, fstream::beg = fstream::end, test_node = pNode and Dup will produce that match in the report. This will be considered under parameterized matching because three lines are identical except for those three variables. | ||
<br> | === Moss === | ||
<br> | |||
<br> | Moss stands for "Measure of Software Similarity" and it is mainly used in academic institution to detect [http://en.wikipedia.org/wiki/Plagiarism_detection software plagiarism] in the programs submitted as part of homeworks. It currently can be used for following languages: [http://en.wikipedia.org/wiki/C_(programming_language) C], [http://en.wikipedia.org/wiki/Reference_(C%2B%2B) C++], [http://en.wikipedia.org/wiki/Java_(programming_language) Java], [http://en.wikipedia.org/wiki/C_Sharp_(programming_language) C#], [http://en.wikipedia.org/wiki/Pascal_(programming_language) Pascal], [http://en.wikipedia.org/wiki/MATLAB Matlab], [http://en.wikipedia.org/wiki/JavaScript_syntax Javascript] etc.[8] | ||
<br> | |||
Moss currently uses "winnowing" algorithm to detect plagiarism. Winnowing algorithm[9] is a two step algorihtm. In order to explain this algorithm, it is important to understand the term "document fingerprint ". Consider below example: | |||
{|class="wikitable" style="text-align:center; width:400px; height:200px" border="1" | |||
|- | |||
! Example: Fingerprinting some sample text. | |||
|- | |||
|A small can can, a small can can <br/><i>(i) sample text</i> | |||
|- | |||
|asmallcancanasmallcancan <br/><i>(ii) comma and space removed from sample text</i> | |||
|- | |||
|asmal small mallc allca llcanc lcanca canca <br/> ancan ncana canas anasm nasma <br/>asmal small mallc allca llcan lcanc canca ancan <br/> <i>(iii) The sequence of 5-grams derived from the text </i> | |||
|- | |||
|23 45 67 88 11 22 33 12 14 15 11 19 10 17 56 78 56 89 09 17 <br/> <i>(iv) A hypothetical hashed sequences of the 5-grams </i> | |||
|- | |||
|45 22 15 11 19 <br/> <i>(v) The sequence of hashes selected randomly</i> | |||
|} | |||
k-gram is a contiguous substring of length k. Document is divided into k-grams(as shown in iii), k can be specified by user. Each gram is hashed using some [http://en.wikipedia.org/wiki/Hash_function hashing function]. Practically some subset of hashes are selected from each document (as in v). Fingerprint also contains positional information, like document name and line numbers for word in the document. | |||
Now in the first step of winnowing algorithm, an index mapping fingerprints are built to locate all the documents. In the second step, some subset of selected fingerprints are searched in the index. | |||
At the end of second step we get list of matched fingerprints for a document d1. This matched fingerprints for d1 may contain fingerprints from various other documents d2, d3, ... Then pairs (d1,d2 ), (d1,d3) corresponding to matched fingerprints in the documents are formed with the number of matches and is sorted by number of matches in descending order. The one with highest number of matches is reported to the user. | |||
= Clone detection and manipulation tools = | |||
Software maintenance is by far the costliest stage in the lifecycle of a software. Clones increase the size of the source code as well as duplicate errors in the two [http://en.wikipedia.org/wiki/Codebase codebases]. This forces us to use tools that can not only detect the clones but also edit code to get rid of the problem of maintenance. | |||
Before editing code to remove clones, a number of factors have to be considered like the percentage of cloning when compared to the entire project, Percentage change after removal of the clone, it's affect on [http://en.wikipedia.org/wiki/Readability readability] and more. If found favourable, only then must editing be done to removal or alter clones. | |||
=== CloneDR === | |||
Clone doctor is a tool that both detects and removes clones from [http://en.wikipedia.org/wiki/Application_software application software]. CloneDR detects clones using [http://library.readscheme.org/page8.html compiler technology] and not through simple string matching. It includes settings to discard white spaces, line breaks, modified variable names and case changes. | |||
CodeDR also has the ability to filter and automatically remove clones. This is in addition to the interactive clone removal editing it already provides. CloneDR intelligently replaces clones with [http://en.wikipedia.org/wiki/Subroutine subroutine] calls, [http://en.wikipedia.org/wiki/Directive_(programming) directives] or [http://en.wikipedia.org/wiki/Declaration_(computer_science) declarations]. | |||
[[Image:CloneDR.GIF|thumb|center|550px]] | |||
As shown in the figure, duplicate statements are factored into a #define macro. This macro is referred to whenever the sorting needs to be done | |||
= Comparison With Refactoring = | |||
Clone detection, as mentioned before, is a technique for searching duplicate patterns of code. Refactoring on the other hand, is concerned with modifying internal structure while keeping the external behavior the same. Currently, these two tools are loosely coupled. A clone detection tool usually strives to detect duplicate patterns while a refactoring tool behaves as a separate entity / tool that does refactoring only and no detection. | |||
The two tools in a way complement each other. Refining code using clone detection and refactoring can be a two step process. In the first stage, clones are identified based on patterns. The detected patterns can then be handed over to a [http://www.aivosto.com/vbtips/refactoring.html refactoring tool] that can, based on various parameters, decide whether to keep a single copy or modify it to retain better readability. | |||
= Conclusion = | |||
Clone detection and clone manipulation tools are tools that need high precision. Reducing false positives and false negatives during clone detection is an important factor to be considered. CLone detection is not a simple straight-forward string matching approach. The detection algorithms can run at different levels and can match different clone patterns. Clone manipulation can be both automatic and interactive. There is a lot of research on building clone detection algorithms and patterns and identifying ways to remove clones. | |||
=See Also= | |||
1. [http://www.ccfinder.net/doc/10.2/en/whats.html CCFinderX]<br/> | |||
2. [http://www.usenix.org/publications/library/proceedings/sf94/manber.1.html sif]<br/> | |||
3. [http://portal.acm.org/citation.cfm?id=1094903 SDD]<br/> | |||
4. [http://ews.uiuc.edu/~cchen37/papers/kdd06_gplag.pdf GPLAG] <br/> | |||
5. [http://clonedigger.sourceforge.net/ CloneDigger]<br/> | |||
6. [http://blue-edge.bg/simscan/simscan_help_r1.htm SimScan]<br/> | |||
7. [http://www.codeplex.com/CloneDetectiveVS Clone Detective]<br/> | |||
8. [https://www.ipd.uni-karlsruhe.de/jplag/ JPlag]<br/> | |||
=References= | |||
1. http://drops.dagstuhl.de/opus/volltexte/2007/962/pdf/06301.KoschkeRainer.962.pdf <br/> | 1. http://drops.dagstuhl.de/opus/volltexte/2007/962/pdf/06301.KoschkeRainer.962.pdf <br/> | ||
2. http://scg.unibe.ch/archive/papers/Duca99bCodeDuplication.pdf<br/> | 2. http://scg.unibe.ch/archive/papers/Duca99bCodeDuplication.pdf<br/> | ||
3. Clone Detection and Refactoring by Robert Tairas | 3. Clone Detection and Refactoring by Robert Tairas <br/> | ||
4. http://eprints.kfupm.edu.sa/20225/1/20225.pdf | 4. http://eprints.kfupm.edu.sa/20225/1/20225.pdf <br/> | ||
5. http://www.semdesigns.com/Products/Clone/ <br/> | |||
6. http://sel.ist.osaka-u.ac.jp/cdtools/index-e.html <br/> | |||
7. http://en.wikipedia.org/wiki/Code_refactoring <br/> | |||
8. http://theory.stanford.edu/~aiken/moss/ <br/> | |||
9. http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf <br/> |
Latest revision as of 03:53, 19 November 2009
Clone Detection and Clone Manipulation
Introduction
"Software clones are segments of code that are similar according to some definition of similarity" —Ira Baxter, 2002[1]. As per this definition, two code snippets may be similar based on text, syntactic structure or semantics or if they follow same pattern. Two code fragments are similar if their program text is similar. Code snippets may not be semantically equivalent. Such code snippets are also termed as redundant because if one changes then the other also needs to be changed. However, some clones cannot be replaced by another for example, two code snippets may be identical at the textual level but they refer to different variables declared in different places with the same name. Example of such clones is shown below:
Example of such clones is shown below:
int count = 0; int methodA(string str) { for (int i=0;i<str.length;i++) { count = count + 1; } } int methodB(string str) { int count=0; for (int i=0;i<str.length;i++) { count = count + 1; } }
In the above code snippet, count in methodA refers to global variable whereas in methodB it refers to local count variable.
Clone Detection Tools
Duploc
This language independent tool helps in clone detection. It offers clickable matrix display that helps developers to locate the source code snippet that is cloned. This tool includes an information mural algorithm, and it helps the tool to show a matrix of 100,000 lines per side in its entirety on a 600x800 screen[2]. The performance this tool is good for systems of sizes below 1 MLOC. And for maintenance projects those seldom change over time, duplication of code can be identified over night and can be interpreted on next day.
Dup
This tool identifies cloned code in large softwares. It was developed with an intention that as software progresses in later stages of development mainly in maintenance phase, where new enhancements are added to the software or bugs are fixed, developers intend to use copy & paste existing code[4]. This makes software unmaintainable. The tool allows user to find exact clones or clones with exceptions of some set of variable names and constants. It helps in identifying code snippets that should be written as procedures. It also helps in Parameterized Matching. Consider below code snippet from B-Tree program implementation:
1: fstream fp; 2: long rightOffset; 3: fp.open(s,fstream::in | fstream::out | fstream::binary); 4: fp.seekp(rootOffset,fstream::beg); 5: fp.write((char*)&test_node,sizeof(test_node)); 6: rightOffset = fp.tellp(); 7: fp.write((char*)&rNode,sizeof(rNode)); 8: btree_node pNode = CreateParentNode(temp[BTREE_ORDER/2],rootOffset,rightOffset); 9: fp.seekp(0,fstream::end); 10: fp.write((char*)&pNode,sizeof(pNode)); 11: rootOffset = fp.tellp(); 12: fp.clear(); 13: fp.close();
Lines 4-6 and 9-11 will be considered in parameterized matching, where rootOffset = 0, fstream::beg = fstream::end, test_node = pNode and Dup will produce that match in the report. This will be considered under parameterized matching because three lines are identical except for those three variables.
Moss
Moss stands for "Measure of Software Similarity" and it is mainly used in academic institution to detect software plagiarism in the programs submitted as part of homeworks. It currently can be used for following languages: C, C++, Java, C#, Pascal, Matlab, Javascript etc.[8]
Moss currently uses "winnowing" algorithm to detect plagiarism. Winnowing algorithm[9] is a two step algorihtm. In order to explain this algorithm, it is important to understand the term "document fingerprint ". Consider below example:
Example: Fingerprinting some sample text. |
---|
A small can can, a small can can (i) sample text |
asmallcancanasmallcancan (ii) comma and space removed from sample text |
asmal small mallc allca llcanc lcanca canca ancan ncana canas anasm nasma asmal small mallc allca llcan lcanc canca ancan (iii) The sequence of 5-grams derived from the text |
23 45 67 88 11 22 33 12 14 15 11 19 10 17 56 78 56 89 09 17 (iv) A hypothetical hashed sequences of the 5-grams |
45 22 15 11 19 (v) The sequence of hashes selected randomly |
k-gram is a contiguous substring of length k. Document is divided into k-grams(as shown in iii), k can be specified by user. Each gram is hashed using some hashing function. Practically some subset of hashes are selected from each document (as in v). Fingerprint also contains positional information, like document name and line numbers for word in the document.
Now in the first step of winnowing algorithm, an index mapping fingerprints are built to locate all the documents. In the second step, some subset of selected fingerprints are searched in the index.
At the end of second step we get list of matched fingerprints for a document d1. This matched fingerprints for d1 may contain fingerprints from various other documents d2, d3, ... Then pairs (d1,d2 ), (d1,d3) corresponding to matched fingerprints in the documents are formed with the number of matches and is sorted by number of matches in descending order. The one with highest number of matches is reported to the user.
Clone detection and manipulation tools
Software maintenance is by far the costliest stage in the lifecycle of a software. Clones increase the size of the source code as well as duplicate errors in the two codebases. This forces us to use tools that can not only detect the clones but also edit code to get rid of the problem of maintenance.
Before editing code to remove clones, a number of factors have to be considered like the percentage of cloning when compared to the entire project, Percentage change after removal of the clone, it's affect on readability and more. If found favourable, only then must editing be done to removal or alter clones.
CloneDR
Clone doctor is a tool that both detects and removes clones from application software. CloneDR detects clones using compiler technology and not through simple string matching. It includes settings to discard white spaces, line breaks, modified variable names and case changes.
CodeDR also has the ability to filter and automatically remove clones. This is in addition to the interactive clone removal editing it already provides. CloneDR intelligently replaces clones with subroutine calls, directives or declarations.
As shown in the figure, duplicate statements are factored into a #define macro. This macro is referred to whenever the sorting needs to be done
Comparison With Refactoring
Clone detection, as mentioned before, is a technique for searching duplicate patterns of code. Refactoring on the other hand, is concerned with modifying internal structure while keeping the external behavior the same. Currently, these two tools are loosely coupled. A clone detection tool usually strives to detect duplicate patterns while a refactoring tool behaves as a separate entity / tool that does refactoring only and no detection.
The two tools in a way complement each other. Refining code using clone detection and refactoring can be a two step process. In the first stage, clones are identified based on patterns. The detected patterns can then be handed over to a refactoring tool that can, based on various parameters, decide whether to keep a single copy or modify it to retain better readability.
Conclusion
Clone detection and clone manipulation tools are tools that need high precision. Reducing false positives and false negatives during clone detection is an important factor to be considered. CLone detection is not a simple straight-forward string matching approach. The detection algorithms can run at different levels and can match different clone patterns. Clone manipulation can be both automatic and interactive. There is a lot of research on building clone detection algorithms and patterns and identifying ways to remove clones.
See Also
1. CCFinderX
2. sif
3. SDD
4. GPLAG
5. CloneDigger
6. SimScan
7. Clone Detective
8. JPlag
References
1. http://drops.dagstuhl.de/opus/volltexte/2007/962/pdf/06301.KoschkeRainer.962.pdf
2. http://scg.unibe.ch/archive/papers/Duca99bCodeDuplication.pdf
3. Clone Detection and Refactoring by Robert Tairas
4. http://eprints.kfupm.edu.sa/20225/1/20225.pdf
5. http://www.semdesigns.com/Products/Clone/
6. http://sel.ist.osaka-u.ac.jp/cdtools/index-e.html
7. http://en.wikipedia.org/wiki/Code_refactoring
8. http://theory.stanford.edu/~aiken/moss/
9. http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf