`mosster`: Git-friendly Plagiarism Detection

mosster is a re-implementation of existing plagiriasm detection tools like

Why do this? For the full answer, see below. But briefly: we needed something that would integrate with Git and check multiple submissions with large amounts of self-similarity. We also were fairly sure that implementing mosster in Go would open up some interesting optimization opportunities and performance improvements.

Terminology

Note that mosster normalizes, fingerprints, compares, and filters entire commits but does so function-by-function. But it can be convenient in examples to talk about these operations on a per-file basis.

foo.c is normalized: foo.c is converted to a compact representation that preserves identifying structural features but removes names and other non-identifying features. This conversion is language-specific.
foo.c is fingerprinted: foo.c is converted to a set of fingerprints that are used to compare it with other files.
foo.c is filtered by bar.c: Fingerprints from bar.c are removed from fingerprint set for bar.c.
foo.c is compared with bar.c: The fingerprints for foo.c and bar.c are used to determine whether the files share incriminating features. This is done to make the fingerprint for foo.c more interesting or unique, possibly by removing common code that was provided to all students.

Features

mosster must support the following features:

Git integration

mosster integrates with Git to identify objects to check. mosster checks Git commits. For each commit, it identifies all of the blobs that are included in the commit. mosster then fingerprints all of the blobs that have not yet been fingerprinted, and combines all fingerprints for commit blobs into a fingerprint for that commit.

Iterative checking

Many existing plagiarism detection tools check all submissions in a single pass. This is both inefficient and unnecessary when iteratively checking new submissions. mosster should allow new submissions to be checked without repeatedly performing computations on existing files that have not changed.

mosster maintains a library of commits and blobs that each commit points to. Each time a commit is added to the library, mosster both fingerprints it and checks it against all other submissions. Even commits that do not need to be checked for plagiarism can be the other side of a match against a commit that does need to be checked.

Aggressive filtering

Because we are checking all commits, mosster anticipates that a great deal of any particular commit will be duplicate content from previous commits. Duplicated content can take many forms:

Entire files: commits may point to blobs that have already been checked. Note that, happily, renaming files doesn’t change the Git file blob itself. So simple renames that Git can identify don’t generate more content to check. This filtering should handle these cases:
- Modifications to one file that leave many others unchanged. Only the modified file is checked.
- Commits that move files but do not change their contents. Files are only checked if the contents change.
Entire functions: even if a file has changed, many of the functions inside the file may not have changed. This is particularly true when students commit often but commits contain small numbers of changes. This feature of Git usage is one of the motivation to do function-level hashing and fingerprinting. In all of the following cases, only modified functions are checked:
- Modifications isolated to part of a single file, like one function.
- Modifications that move or rename functions but do not change function contents.
- Purely cosmetic changes to files or functions, including modifying comments, altering whitespace, and changing variable or function names.
Matches against provided code: students may have been provided code by the course staff as a starting point. Both whole files and functions will be checked against staff-provided code in addition to each students previous submissions. Any matches with staff-provided code can be ignored.

Design

Overview

Normalization: all content from a commit is normalized to remove features that should not be included in the fingerprint. These could include whitespace, certain punctuation, function and variable names, or brace placement. One way of doing this is generating and normalizing an abstract syntax tree (AST) and then serializing it in a compact standardized binary format.
Fingerprinting: normalized content is then fingerprinted to extract a set of meaningful features. One way of doing this is by hashing n-grams from the normalized output.
Filtering: the set of fingerprints can then be filtered to remove fingerprints from other commits. This makes the fingerprint set for that commit more unique and interesting, and smaller which makes later comparisons faster. Filtering can be done using staff-provided code, previous commits by the same user, or very common fingerprints that probably result from over-constrained problems.
Comparison: filtered fingerprint sets are then used to compare multiple commits. Commits are similar if they share many distinctive fingerprints. This argues for weighting individual fingerprints based on their overall commonality, while still having an additive metric.

Library

mosster adds submissions to and checks them against a library. A library should contain all submissions that are being checked against each other for plagiarism. Staff-provided code or other code that students are allowed to use can also be added to the library to perform pre-checking filtering.

Many of the mosster configuration parameters are also configured on a per-library basis. All library content is preprocessed and fingerprinted using the same parameters. Additional content can be added iteratively to an existing library, but library configuration parameters that affect fingerprinting cannot be changed once the library is created. Reconfiguring the fingerprinting process is equivalent to creating an entire new library containing all of the files from the existing library.

The mosster library is read-mostly, so an embedded key-value store like Bolt is probably sufficient for library persistence.

Inputs

mosster should be run from within an existing Git repository. It takes one or more commits to check. mosster requires the following information for each commit:

The library to use. This is probably a path to an embedded database file. All commits are added to this library and all comparisons are done within this library.
One or more users that own the commit. Users should be identifier by valid email addresses.
One or more source code languages for each file in each commit. These are used to determine the language parser to use. (Eventually we could incorporate something like Linguist to automate this. Although it looks like it needs training.)

Identifying commits

mosster relies on inputs to identify commits. This allows the user to label each commit in a way sensitive to their own knowledge of how student groups have formed and dissolved.

mosster has no way to identify who authored repository content. So all users listed when a commit is added assume ownership and access to all of the files identified by a particular commit. This includes files that students may have created before they began working together, since mosster assumes that partners or larger groups share complete access to each others code.

Here is an example of a group coming together, then disolving, from the perspective of mosster.

Alice and Bob begin working separately:

Alice	Bob
Adds and commits `foo.c`	Adds and commits `bar.c`
`$ mosster add . --owners=Alice`	`$ mosster add . --owners=Bob`

At this point Alice owns foo.c and Bob owns bar.c:

foo.c:
  - Alice
bar.c:
  - Bob

Both files are fingerprinted and compared to each other.

Alice	Bob
Adds `bar.c` (from Bob) and `new.c` (developed jointly)	Shares `bar.c`
Commits `foo.c`, `bar.c`, and `new.c`
`$ mosster add . --owners=Alice,Bob`

At this point Alice and Bob share ownership of foo.c, bar.c, and new.c.

foo.c:
  - Alice
  - Bob
bar.c:
  - Bob
  - Alice
new.c:
  - Alice
  - Bob

new.c is fingerprinted and added to the library, but not compared against foo.c or bar.c due to overlapping ownership. Ownership for foo.c and bar.c is adjusted, but the files do not need to be fingerprinted again.

Note that Bob has assumed ownership and access to foo.c despite the fact that it was originally committed and created by Alice. mosster assumes he has had access to that file and could have saved a copy.

Alice and Bob stop working together

Alice	Bob
Adds `goo.c`	Adds `baz.c`
Commits `goo.c` (new) and `bar.c` (from Bob)	Commits `baz.c` (new) and `foo.c` (from Alice)
`$ mosster add . --owners=Alice`	`$ mosster add . --owners=Bob`

At this point Alice and Bob share ownership of foo.c, bar.c, and new.c, but retain individual ownership of their new files:

foo.c:
  - Alice
  - Bob
bar.c:
  - Bob
  - Alice
new.c:
  - Alice
  - Bob
goo.c:
  - Alice
baz.c
  - Bob

goo.c and baz.c are fingerprinted, added to the library, and compared with each other. Note that they are not compared against foo.c, bar.c, and new.c due to overlapping ownership.

Library Format

Examples below are shown in YAML format.

Commit to user mapping

c408d43a9619778f6d23b9b4d0e4572b3b021c440bab249315c39709c75c412e:
  - me@me.com
63d621e805b0a7a8466f8b62b0ad60b60511f66a6a38ec1d9fae3b7969217e24:
  - me@me.com
  - you@you.com

Written: as commits are added to the library, user entries are created or added to this table.
Used:
- New commits will not be checked if they are included in staff commits for a particular repository.
- Commit-level similarity is one aspect of how users are compared.

Given that Git includes a timestamp in the commit ID, even commits of the identical content by two different users in different repositories will not produce identical commits. However, identical commits can occur when a repository is forked from another repository. Students may be dumb enough to fork their repository from another student or group, in which case plagiarism is really obvious. But this also allows us to avoid checking commits that came from a staff repository, or identical commits from group repositories that have diverged slightly.

Why `mosster`?

Given that tools like MOSS, Algae, and SIM exist, what is mosster for? mosster tries to address some of the problems or limitations of existing tools that we encountered where checking ops-class.org assignments. Specifically:

After collecting several years of large OS/161 assignments, we were uploading enough code to MOSS that it was crashing before it completed checking our submission. Problems with MOSS were not resolved in a timely manner—at least not timely enough for someone with end-of-semester grading deadlines to worry about. Despite several requests, Professor Aiken was never willing to provide us with a site license. (Although I know that other universities run MOSS locally.)
Unlike MOSS, mosster is open source. From a transparency perspective, it seems appropriate to check submissions using a tool that itself can be checked.
mosster borrows many ideas from Algae. We like Algae and have benefitted greatly from conversations with Jonathan Pierce. However, we need to extend Algae in several ways and relax some of its assumptions. For example, Algae assumes that students submit once in single files, whereas we need to multiple submissions each comprising a complete source tree. We also need the ability to ignore large amounts of staff-provided code and self-similarity between multiple submissions by the same student.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
.gitignore		.gitignore
README.adoc		README.adoc
Vagrantfile		Vagrantfile

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.gitignore

.gitignore

README.adoc

README.adoc

Vagrantfile

Vagrantfile

Repository files navigation

`mosster`: Git-friendly Plagiarism Detection

Terminology

Features

Git integration

Iterative checking

Aggressive filtering

Design

Overview

Library

Inputs

Identifying commits

Alice and Bob begin working separately:

Alice and Bob stop working together

Library Format

Commit to user mapping

Why `mosster`?

About

Releases

Packages

gchallen/mosster

Folders and files

Latest commit

History

Repository files navigation

mosster: Git-friendly Plagiarism Detection

Terminology

Features

Git integration

Iterative checking

Aggressive filtering

Design

Overview

Library

Inputs

Identifying commits

Sharing scenario

Alice and Bob begin working separately:

Alice and Bob start working together sharing Alice’s repository.

Alice and Bob stop working together

Library Format

Commit to user mapping

Why mosster?

About

Resources

Stars

Watchers

Forks

`mosster`: Git-friendly Plagiarism Detection

Why `mosster`?