CSC/ECE 517 Fall 2009/wiki3 2 pp: Difference between revisions
Line 27: | Line 27: | ||
==<span class="apple-style-span"><span style="mso-bidi-font-size: 12.0pt; line-height: 115%"><font color="windowtext">JPlag</font></span></span>== | ==<span class="apple-style-span"><span style="mso-bidi-font-size: 12.0pt; line-height: 115%"><font color="windowtext">JPlag</font></span></span>== | ||
JPlag [Pre00] uses tokenised substring matching to determine similarity in source code. Its specific purpose is to detect plagiarism within academic institutions. Firstly the source code is translated into tokens (this requires a language dependent process). JPlag aims to tokenise in such way that the "essence" of a program is captured and so can be effective for catching copied functionality. Once converted the tokenised strings are compared to detect the percentage of matching tokens which is used as a similarity value. JPlag is an online service freely available to academia. | |||
==<span class="apple-style-span"><span style="mso-bidi-font-size: 12.0pt; line-height: 115%"><font color="windowtext">MOSS</font></span></span>== | ==<span class="apple-style-span"><span style="mso-bidi-font-size: 12.0pt; line-height: 115%"><font color="windowtext">MOSS</font></span></span>== |
Revision as of 22:09, 15 November 2009
Clone detection and clone manipulation
The DRY principle says that a particular code fragment should not be repeated more than once in a program. But it happens. And when it does, it is good to be able to find the multiple code "clones" because Software cloning complicates the maintenance process by giving the maintainers unnecessary code to examine. As per Burd, it seems that when presented with the challenge of adding new functionality the natural instinct of a programmer is to copy, paste and modify the existing code to meet the new requirements and thus creating a software clone. While the basis behind such an approach is uncertain, one possible reason is due to time restrictions on maintainers to complete the maintenance change. Duccase points out that “making a code fragment is simpler and faster than writing from scratch” and that if a programmer’s pay is related to the amount of code they produce then the proliferation of software clones will continue.
Once a clone is created it is effectively lost within the source code and so both clones must therefore be maintained as separate units despite their similarities. Komondoor states that if errors are identified within one clone then it is likely that modifications may be necessary to the other counter-part clones. Detection is therefore required if any of the clones are to be re-identified to assist the maintenance process. So if clones can be detected then the similarities can be exploited and replaced during preventative maintenance with a new single code unit this will eliminate the problems identified above.
There are a good number of clone detection tools available both commercially and within academia. Within these tools several different approaches to software clone detection have been implemented, including string analysis, program slicing, metric analysis and abstract tree comparisons. This page will survey the a set of clone detection tools and compare them.
Clone Detection Technique
This article will be primarily focusing on below five established detection tools; JPlag, MOSS, Covet, CCFinder and CloneDr. JPlag and MOSS are web-based academic tools for detecting plagiarism in student's source code. CloneDr and CCFinder are stand alone tools looking at code duplication in general.
Figure 1 summarizes the clone detection tools. The languages supported by the analysis process are highlighted, as is the analysis approach. The column labeled domain highlights the main purpose of the tools for either clone detection or for plagiarism detection.
CCFinder
CCFinder focuses on analyzing large-scale systems with a limited amount of language dependence. It transforms the source code into tokens. CCFinder aims to identify "portions of interest (but syntactically not exactly identical structures)". After the string is tokenised a token-by-token matching algorithms is performed. CCFinder also provides a dotplotting visualisation tool that allows visual recognition of matches within large amounts of code.
CloneDr
CloneDr analyses software at the syntactic level to produce abstract syntax tree (AST) representations. A series of algorithms are then applied to the tree to detect clones. The first algorithm searches for sub-tree matches within the ASTs. Then a “sequence detection” algorithm attempts to detect “variable size sequences of sub-tree clones”. A third algorithm uses combinations of previously detected clones and looks for “more complex near-miss clones”. The final clone set includes the clones detected in the second and third algorithms. CloneDr can automatically replace cloned code by producing a functionally equivalent subroutine or macro.
Covet
Covet uses a number of the metrics as defined by Mayrand. These metrics were selected by taking known clones and identifying which of the Datrix metrics best highlighted the known clone set. Covet does not apply the same scale of clone likelihood classification as Mayrand. Rather within Covet this is simplified; there is no scale of clone, functions are either classed as clones or distinct. The tool is still in the prototype stages and is not capable of processing industrial sized programs.
JPlag
JPlag [Pre00] uses tokenised substring matching to determine similarity in source code. Its specific purpose is to detect plagiarism within academic institutions. Firstly the source code is translated into tokens (this requires a language dependent process). JPlag aims to tokenise in such way that the "essence" of a program is captured and so can be effective for catching copied functionality. Once converted the tokenised strings are compared to detect the percentage of matching tokens which is used as a similarity value. JPlag is an online service freely available to academia.
MOSS
References
[1] Evaluating clone detection tools for use during preventative maintenance
[2] Investigating the maintenance implications of the replication of code
[3] A Language Independent Approach for Detecting Duplicated Code
[4] Using Slicing to Identify Duplication in Source Code
[5] Experiment on the automatic detection of function clones in a software system using metrics
[6] Clone detection using abstract syntax trees
[7] Maintenance Support Tools for JAVA Programs: CCFinder and JAAT