A robust system for detecting code similarities and potential plagiarism in programming assignments.
This plagiarism detection system identifies similar code submissions using a multi-stage pipeline:
- Parsing & Tokenization: Transforms code files into normalized token sequences, filtering common elements.
- Similarity Detection: Uses the Rabin-Karp algorithm to find matching token sequences between submissions.
- Graph Construction: Creates a weighted graph where nodes are submissions and edges represent similarity scores.
- Clustering: Applies BFS/DFS algorithms to group similar submissions into clusters.
- Representative Selection: Uses a greedy algorithm with merge sort to select representative submissions from each cluster.
- Metadata Management: Stores and retrieves submission metadata efficiently using a B+ Tree structure.
- Real-time Updates: Processes new submissions and updates clusters incrementally.
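To make the stages concrete, here is a self-contained toy version of the pipeline. It is a sketch, not the actual `src/` code: a regex tokenizer stands in for `parser.py`, Jaccard similarity over token 3-grams stands in for the Rabin-Karp matching, and a BFS pass groups submissions whose pairwise score clears the threshold.

```python
import re
from itertools import combinations


def tokenize(code):
    """Turn source code into a token list: strip comments, keep words/punctuation."""
    code = re.sub(r"#.*", "", code)
    return re.findall(r"\w+|[^\w\s]", code)


def kgrams(tokens, k=3):
    """All contiguous k-token windows, as a set of tuples."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Set overlap in [0, 1]; 0.0 for two empty sets."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster(submissions, threshold=0.5):
    """Group a {filename: source} dict into clusters of similar files."""
    grams = {name: kgrams(tokenize(src)) for name, src in submissions.items()}
    # Similarity graph: an edge for every pair scoring above the threshold.
    adj = {name: set() for name in submissions}
    for x, y in combinations(submissions, 2):
        if jaccard(grams[x], grams[y]) >= threshold:
            adj[x].add(y)
            adj[y].add(x)
    # BFS from each unvisited node collects one connected component (cluster).
    seen, clusters = set(), []
    for start in submissions:
        if start in seen:
            continue
        component, queue = set(), [start]
        while queue:
            node = queue.pop(0)
            if node not in component:
                component.add(node)
                queue.extend(adj[node] - component)
        seen |= component
        clusters.append(component)
    return clusters
```

Calling `cluster(...)` on a `{filename: source}` dict returns a list of filename sets, one per cluster; isolated files come back as singleton sets.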
Project structure:

PlagiarismDetector/
│
├── src/ # Source code
│ ├── __init__.py # Makes the directory a package
│ ├── bplus_tree.py # B+ Tree implementation for metadata storage
│ ├── clustering.py # BFS/DFS implementation for finding clusters
│ ├── greedy.py # Greedy algorithms for representative selection
│ ├── main.py # Main orchestration of the detection process
│ ├── parser.py # Code parsing and tokenization
│ ├── rabin_karp.py # Rabin-Karp string matching algorithm
│ └── similarity.py # Similarity score calculation and graph building
│
├── tests/ # Test cases
│ ├── __init__.py # Makes the directory a package
│ └── test_cases.py # Test scripts
│
├── inputs/ # Directory for input files
│ ├── file_a.py to file_y.py # Sample submissions
│ ├── file_1.py to file_50.py # Scalability test submissions
│ ├── file_z.py # Real-time update test file
│ └── metadata.json # Metadata for all submissions
│
└── outputs/ # Directory for output files
├── test_basic.txt # Results of basic functionality test
├── test_constraint.txt # Results of constraint & edge case test
├── test_integration.txt # Results of algorithm integration test
├── test_realtime.txt # Results of real-time update test
└── results.txt # Default output file
Requirements: Python 3.7 or higher.

Usage:

- Basic usage: `python -m src.main inputs 0.7` runs the detector on all files in the `inputs` directory with a similarity threshold of 0.7.
- Run test cases: `python -m tests.test_cases` runs all test cases and writes output files to the `outputs` directory.
- Command-line arguments:
  - First argument: input directory (default: "inputs")
  - Second argument: similarity threshold (default: 0.5)
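For illustration only, the documented defaults suggest an entry point shaped roughly like the sketch below; the actual `src/main.py` may parse its arguments differently.

```python
import sys


def parse_args(argv):
    """Read the two documented positional arguments with their defaults."""
    input_dir = argv[1] if len(argv) > 1 else "inputs"    # default directory
    threshold = float(argv[2]) if len(argv) > 2 else 0.5  # default threshold
    return input_dir, threshold


if __name__ == "__main__":
    directory, threshold = parse_args(sys.argv)
    print(f"Scanning {directory!r} with threshold {threshold}")
```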
The detector outputs a text file with the following sections:
- Cluster information:
  - List of files in each cluster
  - Representative file(s) for each cluster
  - Similarity matrix showing similarity scores between files
  - Metadata for representative files
- Summary statistics:
  - Total submissions processed
  - Number of clusters found
  - Number of suspicious clusters (>1 submission)
  - Number of isolated submissions
  - Largest cluster size
  - Parameters used (threshold, k-gram size, clustering method)
  - Execution time
The test suite covers five scenarios:

- Basic Functionality Test: Tests core detection with simple files (A, B, C, D).
  - Expected outcome: Files A, B, C form a cluster with high similarity; D is separate.
- Constraint & Edge Case Test: Tests handling of edge cases:
  - Files E, F, G: Tests handling of comments, whitespace, and reordered code.
  - File H: JavaScript file (.js extension); should be processed but remain isolated.
  - File I: Empty file; should be processed but remain isolated.
  - 50+ files: Tests B+ Tree efficiency with a large number of submissions.
  - Expected outcome: E and F cluster; G may cluster with E/F; H and I are handled cleanly and remain separate.
- Algorithm Integration Test: Tests the complete system with a larger dataset:
  - 75+ files including multiple similarity groups.
  - Expected outcome: J, K, L cluster; M, N cluster; W, X, Y cluster; most short files remain separate.
- Real-Time Update Test: Tests incremental processing of new submissions.
  - Expected outcome: New file Z clusters with similar files A, B, C.
- Threshold Sensitivity Test: Tests the system with multiple similarity thresholds (see the sweep sketch after this list).
  - Expected outcome: Different thresholds produce different clustering results, demonstrating threshold sensitivity.
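The sensitivity sweep can be pictured as follows; `difflib.SequenceMatcher` and the three inline toy files are stand-ins for the project's similarity score and real submissions, not the actual `tests/test_cases.py` harness.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Toy stand-ins for real submissions.
subs = {
    "a.py": "def add(a, b):\n    return a + b\n",
    "b.py": "def add(x, y):\n    return x + y\n",
    "c.py": "print('hello world')\n",
}

# Re-run the pairwise comparison at each threshold and report which
# pairs would become edges in the similarity graph.
for threshold in (0.3, 0.5, 0.7, 0.9):
    edges = [
        (x, y) for x, y in combinations(subs, 2)
        if SequenceMatcher(None, subs[x], subs[y]).ratio() >= threshold
    ]
    print(f"threshold={threshold}: {edges}")
```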
The system handles several important edge cases:
- Different File Types: Processes both Python (.py) and JavaScript (.js) files, treating them as distinct.
- Empty Files: Properly processes empty files, marking them with a special token.
- Short Files: Uses token-based similarity for files too short for k-gram comparison.
- Code Comments: Strips comments to focus on code structure and logic.
- Variable Renaming: Normalizes identifiers to detect similarity despite variable name changes.
- Whitespace Changes: Normalizes whitespace to detect similarity despite formatting changes.
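A minimal sketch of the normalization ideas above, using the standard-library tokenizer: comments are dropped, identifiers collapse to a placeholder so renamed variables still match, and whitespace disappears in tokenization. The exact token classes kept here are an assumption, not the real `parser.py` behavior.

```python
import io
import keyword
import tokenize


def normalized_tokens(source):
    """Token stream with comments dropped and identifiers collapsed to 'ID'."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            continue  # comments never affect the similarity score
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")  # renamed variables normalize to the same token
        elif tok.type in (tokenize.NAME, tokenize.NUMBER,
                          tokenize.STRING, tokenize.OP):
            out.append(tok.string)  # keywords, literals, and operators survive
    return out


# Renaming variables leaves the normalized streams identical:
print(normalized_tokens("total = price * qty  # compute cost")
      == normalized_tokens("amount = cost * n"))  # True
```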
Key algorithms:

- Rabin-Karp: Uses a rolling hash to efficiently find matching token sequences (sketched after this list).
- Graph Algorithms: BFS/DFS find connected components in the similarity graph.
- Greedy Selection: Chooses representatives based on highest average similarity within clusters.
- B+ Tree: Enables efficient storage and retrieval of metadata.
- Merge Sort: Used to rank submissions by similarity scores.
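The rolling hash at the heart of Rabin-Karp can be sketched as follows. The base, modulus, and window size are illustrative choices rather than the values in `src/rabin_karp.py`, and because Python salts `hash` per process, the token-to-integer step is only stable within a single run.

```python
BASE, MOD = 257, 1_000_003


def rolling_hashes(tokens, k):
    """Hash every k-token window: O(k) for the first, O(1) for each roll."""
    ids = [hash(t) % MOD for t in tokens]  # map tokens to integers
    if len(ids) < k:
        return []
    power = pow(BASE, k - 1, MOD)          # weight of the outgoing token
    h = 0
    for x in ids[:k]:                      # hash the first window
        h = (h * BASE + x) % MOD
    hashes = [h]
    for i in range(k, len(ids)):           # roll: drop ids[i - k], add ids[i]
        h = ((h - ids[i - k] * power) * BASE + ids[i]) % MOD
        hashes.append(h)
    return hashes


def shared_windows(a, b, k=3):
    """Hash values present in both token lists: candidate shared passages."""
    # Colliding hashes are only candidates; a full implementation compares
    # the underlying tokens to rule out false positives.
    return set(rolling_hashes(a, k)) & set(rolling_hashes(b, k))
```

Rolling the hash forward keeps each pairwise comparison linear in the number of tokens, which is what makes the all-pairs similarity stage tractable.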
The system efficiently processes 75+ submissions in seconds, with customizable similarity thresholds to balance precision and recall when detecting potential plagiarism.
Additional features:

- Multi-language Support: Handles both Python and JavaScript files with language-specific tokenization.
- Adaptive Similarity: Adjusts similarity calculation based on file type and content.
- Real-time Updates: Processes new submissions incrementally without recomputing everything.
- Detailed Statistics: Provides comprehensive analysis of clustering results.
- Threshold Testing: Supports testing with multiple thresholds to find optimal settings.
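Incremental processing can be sketched as scoring the new submission against existing ones only (n comparisons instead of recomputing all pairs) and merging whichever clusters it touches. `jaccard` here is a stand-in for the project's scorer, and none of the names below are the actual `src/` API.

```python
def jaccard(a, b):
    """Stand-in pairwise scorer over k-gram sets."""
    return len(a & b) / len(a | b) if a | b else 0.0


def add_submission(name, grams, all_grams, clusters, threshold=0.5):
    """Score one new file against existing ones; merge every cluster it touches."""
    touched = [c for c in clusters
               if any(jaccard(grams, all_grams[m]) >= threshold for m in c)]
    all_grams[name] = grams
    merged = {name}.union(*touched) if touched else {name}
    clusters[:] = [c for c in clusters if c not in touched]
    clusters.append(merged)
    return merged
```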