Software Plagiarism Detection Using Abstract Syntax Tree and Graph-based Data Mining
Abstract
This study uses a graph-based data mining technique to detect software plagiarism. We hypothesize that repetitive patterns in the abstract syntax tree (AST) representation of one program will match those of another program only if both were written by the same author. A graph-based data mining technique was used to analyze the ASTs and extract the patterns. The patterns returned by the data miner were then compared using a graph matching algorithm, which provided the measure of similarity. We evaluated the system on artificial test sets and on actual student assignments. The experiments identified plagiarism in both the artificial and the real-world data, and these findings show that the system is feasible. The system can be applied to any programming language whose compiler builds an abstract syntax tree, since the AST can easily be extracted from the compiler. An advantage of this system over other plagiarism detectors is that it can handle partial source code plagiarism, which other detectors currently do not. Disadvantages of our approach include slow speed, caused by the graph-based data mining system used, and dependence on compilers to provide the AST. In addition, if a source file cannot be compiled, the compiler will not produce a complete AST, and the results will be inaccurate.
Collections
- OSU Theses [15752]