SERG-Delft/atoms-of-confusion-detector

Atoms of Confusion Detector

This tool detects atoms of confusion in Java source code, as described in the paper by Langhout and Aniche.

Usage

Usage: tool [OPTIONS] COMMAND [ARGS]...

  Analyze Java source code for the presence of atoms of confusion

Options:
  -d, --disabled TEXT  Space separated list of disabled atoms
  -v, -V, --verbose    Print the results of its analysis on the console
  -l, -L, --log        Save the progress of the analysis to a log file
  -h, --help           Show this message and exit

Commands:
  files  Analyze the provided files for atoms of confusion
  pr     Analyze the provided github pull request for atoms of confusion

Running the detector on files

Usage: tool files [OPTIONS] FILES...

  Analyze the provided files for atoms of confusion

Options:
  -r, -R, --recursive  Recursively search any input directory for Java files
  -h, --help           Show this message and exit

Arguments:
  FILES  Space separated list of files/directories to analyze

For example:

# run the detector on File1.java, File2.java and all of the files in dir
tool files -r File1.java File2.java ./dir/

Running the detector on a GitHub pull request

Usage: tool pr [OPTIONS] URL

  Analyze the provided github pull request for atoms of confusion

Options:
  -dl, -DL, --download  Download all of the affected files in the pull request
                        both before and after the merge
  -t, --token TEXT      Github API key you can obtain one at
                        https://github.com/settings/tokens
  -h, --help            Show this message and exit

Arguments:
  URL  The github pr URL

For example:

# analyze pull request 1926 of the mockito project
tool pr https://github.com/mockito/mockito/pull/1926
# you can provide a token; this raises the GitHub API limit from 60 to 5000 requests per hour
tool pr --token <token> <url>
# passing the -dl flag downloads the analyzed files both before and after the merge, which makes it easier to find the detected atoms manually
tool pr -dl <url>

Implementation

In this section you can find information about the implementation of the different parts of the tool. For more details, feel free to also check the documentation of the classes and the methods in the source code.

Input

The tool is a command-line application whose arguments are parsed with the Clikt library. You can run it on local files or pass it a GitHub pull request, in which case the code is analyzed both before and after the merge. All of the CLI logic is implemented in the file Cli.kt.

Running the detector on files

When running the detector on files, the InputParser class is responsible for retrieving the individual files provided by the user and parsing them. Next, the detector is run and the results are reported to the user.

Running the detector on pull requests

To run the tool on a pull request, the GitHub API is used to find the commit SHAs of the code before and after the PR is applied. Next, the .diff file for the PR is downloaded and parsed to obtain the affected filenames before and after the merge, as well as the ranges of line numbers that were added or deleted. Lastly, the before and after files are downloaded and the detector is run on them, producing two sets of atoms. Each atom in the before set whose line number was deleted is marked as "removed" in the PR. Likewise, each atom in the after set whose line number was added is marked as "added". All other atoms in the after set are marked as remaining.
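The marking rule above can be sketched in a few lines of Java. This is only an illustration under simplifying assumptions: it handles a single file, and the names PrDeltaSketch, Atom, Delta and classify are hypothetical and not taken from the tool's source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the removed/added/remaining rule for one file. */
final class PrDeltaSketch {

    /** An atom occurrence: atom name and the line it was found on (illustrative type). */
    record Atom(String name, int line) {}

    /** The three groups described in the text. */
    record Delta(List<Atom> removed, List<Atom> added, List<Atom> remaining) {}

    /**
     * @param before       atoms found in the file before the merge
     * @param after        atoms found in the file after the merge
     * @param removedLines line numbers (in the before file) deleted by the diff
     * @param addedLines   line numbers (in the after file) added by the diff
     */
    static Delta classify(List<Atom> before, List<Atom> after,
                          Set<Integer> removedLines, Set<Integer> addedLines) {
        List<Atom> removed = new ArrayList<>();
        List<Atom> added = new ArrayList<>();
        List<Atom> remaining = new ArrayList<>();

        // An atom on a deleted line of the before file was removed by the PR.
        for (Atom atom : before) {
            if (removedLines.contains(atom.line())) {
                removed.add(atom);
            }
        }
        // An atom on an added line of the after file was added by the PR;
        // every other atom in the after file remains.
        for (Atom atom : after) {
            if (addedLines.contains(atom.line())) {
                added.add(atom);
            } else {
                remaining.add(atom);
            }
        }
        return new Delta(removed, added, remaining);
    }
}
```

Records require Java 16 or later. A version covering a whole pull request would presumably keep one pair of line sets per affected file.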

Analysis

Here you can find high-level descriptions of the different parts of the analysis pipeline of the tool.

Parsing

To parse the code, the tool uses a parser generated with ANTLR v4. The grammar, together with the generated parser and lexer, can be found under src/main/java.

Detecting Atoms of Confusion

Atom detection relies heavily on the listener infrastructure provided by ANTLR. On top of it we implemented the AtomsListener class, which can be found in the parsing package of the code base. This listener traverses the parse tree generated by the parser and, during the traversal, passes certain nodes of the tree to different Detectors to check for atoms.

Detectors are the classes responsible for actually analysing a part of the source code for atoms. In general, each detector corresponds to one specific atom. All Detectors can be found in the parsing.detectors package. Each detector is annotated with the Visit annotation, which specifies the parse tree nodes on which the detector should be called. The detectors are then registered with the AtomsListener, which uses the annotation to know when to call a specific detector.
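As a rough illustration of this registration mechanism, the following Java sketch dispatches nodes to detectors based on a runtime annotation. It is not the tool's code: the Visit annotation here is a stand-in with an assumed shape, the node classes replace ANTLR's generated context classes, and Listener, Detector and ExprDetector are hypothetical names.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class VisitDispatchSketch {

    /** Stand-in for the tool's Visit annotation (assumed shape): lists the node types to visit. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface Visit {
        Class<?>[] value();
    }

    /** Illustrative node types; a real listener would receive ANTLR parse tree contexts. */
    static final class IfNode {}
    static final class ExprNode {}

    interface Detector {
        void detect(Object node, List<String> findings);
    }

    /** A hypothetical detector that only wants to see expression nodes. */
    @Visit(ExprNode.class)
    static final class ExprDetector implements Detector {
        @Override
        public void detect(Object node, List<String> findings) {
            findings.add("checked " + node.getClass().getSimpleName());
        }
    }

    /** Plays the role of the listener: routes each node to the detectors registered for its type. */
    static final class Listener {
        private final Map<Class<?>, List<Detector>> byNodeType = new HashMap<>();

        void register(Detector detector) {
            Visit visit = detector.getClass().getAnnotation(Visit.class);
            if (visit == null) {
                throw new IllegalArgumentException("Detector lacks a @Visit annotation");
            }
            for (Class<?> nodeType : visit.value()) {
                byNodeType.computeIfAbsent(nodeType, k -> new ArrayList<>()).add(detector);
            }
        }

        /** Called during traversal whenever a node is entered. */
        void enter(Object node, List<String> findings) {
            for (Detector detector : byNodeType.getOrDefault(node.getClass(), List.of())) {
                detector.detect(node, findings);
            }
        }
    }
}
```

Reading the annotation once at registration time keeps the per-node cost during traversal to a single map lookup.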

Scoping

Detecting some of the atoms requires identifier and symbol resolution. For that reason the tool also keeps scoping information about the code being analysed. To implement this we extended the symtab library provided by the ANTLR team. The classes we added to extend the library's functionality can be found in the parsing.symtab package. The logic related to scoping is implemented by the AtomsListener.
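To give an idea of what scoping information involves, here is a minimal Java sketch of a scope stack that resolves an identifier to its declared type. It does not use the symtab library and is not how the tool is implemented; ScopeStackSketch and its methods are hypothetical names chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical scope stack: each scope maps identifier names to declared types. */
final class ScopeStackSketch {
    private final Deque<Map<String, String>> scopes = new ArrayDeque<>();

    /** Called when the traversal enters a block, method or class body. */
    void enterScope() {
        scopes.push(new HashMap<>());
    }

    /** Called when the traversal leaves that construct again. */
    void exitScope() {
        scopes.pop();
    }

    /** Records a declaration in the innermost scope. */
    void define(String name, String type) {
        Map<String, String> current = scopes.peek();
        if (current == null) {
            throw new IllegalStateException("No scope is open");
        }
        current.put(name, type);
    }

    /** Looks a name up from the innermost scope outwards, so inner declarations shadow outer ones. */
    Optional<String> resolve(String name) {
        // ArrayDeque iterates from the most recently pushed element to the oldest.
        for (Map<String, String> scope : scopes) {
            String type = scope.get(name);
            if (type != null) {
                return Optional.of(type);
            }
        }
        return Optional.empty();
    }
}
```

A detector could then ask for the declared type of a variable it encounters, which is the kind of question that cannot be answered from a single parse tree node alone.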

Output

In this section you can find information on how the tool internally represents the results of the analysis, as well as how it outputs them.

The Confusion Graph

The confusion graph is a specialised data structure developed for this tool that allows the atoms found to be stored quickly and queried efficiently. It has two types of nodes: Atom nodes, each representing a type of atom, and Source nodes, each representing an input file. Nodes are connected by edges that record where the atom appears. For example, if the Type Conversion atom occurs in Hello.java on lines 10, 32 and 50, the graph connects the Type Conversion atom node to the Hello.java source node with an edge containing the set {10, 32, 50}. Atom nodes can only be connected to Source nodes and vice versa; the code enforces this constraint and throws an exception if you try to connect two nodes of the same type. Finally, because the graph is implemented with hash maps and duplicates some information, most operations have O(1) complexity, which keeps the tool fast even when analysing large sets of files. The code of the graph can be found in the output.graph package.
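The idea can be sketched in Java with two hash maps that index the same edges from both sides. This is a simplified illustration, not the code in output.graph: nodes are plain strings, the class and method names are hypothetical, and the atom/source separation is enforced by the method signatures rather than by exceptions.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical bipartite graph between atom names and source file names. */
final class ConfusionGraphSketch {
    // The same edge (a set of line numbers) is reachable from both indexes.
    // This duplication is what makes lookups from either side a hash map access.
    private final Map<String, Map<String, Set<Integer>>> byAtom = new HashMap<>();
    private final Map<String, Map<String, Set<Integer>>> byFile = new HashMap<>();

    /** Records that the given atom occurs in the given file on the given line. */
    void addOccurrence(String atom, String file, int line) {
        Set<Integer> lines = byAtom
                .computeIfAbsent(atom, k -> new HashMap<>())
                .computeIfAbsent(file, k -> new HashSet<>());
        lines.add(line);
        // Share the very same set object in the second index.
        byFile.computeIfAbsent(file, k -> new HashMap<>()).put(atom, lines);
    }

    /** Lines on which the atom occurs in the file; empty if there is no such edge. */
    Set<Integer> linesOf(String atom, String file) {
        return Collections.unmodifiableSet(
                byAtom.getOrDefault(atom, Map.of()).getOrDefault(file, Set.of()));
    }

    /** All atoms found in the given file. */
    Set<String> atomsIn(String file) {
        return Collections.unmodifiableSet(byFile.getOrDefault(file, Map.of()).keySet());
    }

    /** All files in which the given atom was found. */
    Set<String> filesWith(String atom) {
        return Collections.unmodifiableSet(byAtom.getOrDefault(atom, Map.of()).keySet());
    }
}
```

With the example from the text, three calls to addOccurrence("Type Conversion", "Hello.java", ...) for lines 10, 32 and 50 produce a single edge holding the set {10, 32, 50}.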

Pull Request Deltas

The tool also supports reporting how the atoms change in a pull request. This is implemented in two steps. First, the diff file associated with the pull request is parsed to find out which lines have been removed and added. This is implemented by the DiffParser class in the github package. Second, this information is compared with the graphs generated by analysing the "from" and "to" branches of the pull request to determine which atoms have been removed, which have been added and which remain. The logic for this is implemented in the PRDelta class in the same package.
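The first step, extracting removed and added line numbers from a diff, could look roughly like the Java sketch below. It assumes a standard unified diff covering a single file and is not the actual DiffParser; the names DiffLinesSketch, ChangedLines and parse are hypothetical.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: collect removed and added line numbers from a single-file unified diff. */
final class DiffLinesSketch {

    /** Removed lines are numbered in the old file, added lines in the new file. */
    record ChangedLines(Set<Integer> removed, Set<Integer> added) {}

    // Hunk header, e.g. "@@ -12,7 +12,9 @@"; a missing length means 1.
    private static final Pattern HUNK =
            Pattern.compile("^@@ -(\\d+)(?:,(\\d+))? \\+(\\d+)(?:,(\\d+))? @@.*");

    static ChangedLines parse(List<String> diffLines) {
        Set<Integer> removed = new TreeSet<>();
        Set<Integer> added = new TreeSet<>();
        int oldLine = 0, newLine = 0;   // current position in the old and new file
        int oldLeft = 0, newLeft = 0;   // lines still expected in the current hunk

        for (String line : diffLines) {
            if (oldLeft == 0 && newLeft == 0) {
                // Outside a hunk: only hunk headers matter, file headers are skipped.
                Matcher m = HUNK.matcher(line);
                if (m.matches()) {
                    oldLine = Integer.parseInt(m.group(1));
                    oldLeft = m.group(2) == null ? 1 : Integer.parseInt(m.group(2));
                    newLine = Integer.parseInt(m.group(3));
                    newLeft = m.group(4) == null ? 1 : Integer.parseInt(m.group(4));
                }
                continue;
            }
            if (line.startsWith("-")) {
                removed.add(oldLine++);
                oldLeft--;
            } else if (line.startsWith("+")) {
                added.add(newLine++);
                newLeft--;
            } else if (line.startsWith("\\")) {
                // "\ No newline at end of file" does not correspond to a source line.
            } else {
                // Context line: present in both versions of the file.
                oldLine++;
                newLine++;
                oldLeft--;
                newLeft--;
            }
        }
        return new ChangedLines(removed, added);
    }
}
```

Counting down the hunk lengths from the header, rather than checking for "---" and "+++" prefixes, avoids mistaking a removed source line that itself starts with "--" for a file header.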

Writing the output

Finally, the output is written to CSV files. We used the kotlin-csv library to implement the CsvWriter class, which provides methods for writing both the confusion graph and the PRDelta to CSV files. The code for this class can be found in the output.writers package.
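For readers unfamiliar with the format, the following Java sketch shows the kind of row serialisation a CSV writer performs. It is a plain-Java stand-in that does not use kotlin-csv, and the column layout in the usage note (atom, file, line) is an assumption made for illustration, not the tool's documented output format.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical minimal CSV serialiser following the usual quoting rules. */
final class CsvSketch {

    /** Quotes a field if it contains a comma, a quote or a line break; inner quotes are doubled. */
    static String escape(String field) {
        boolean needsQuoting = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuoting) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /** Serialises rows to CSV text, one line per row, each line terminated by a newline. */
    static String toCsv(List<List<String>> rows) {
        return rows.stream()
                .map(row -> row.stream()
                        .map(CsvSketch::escape)
                        .collect(Collectors.joining(",")))
                .collect(Collectors.joining("\n", "", "\n"));
    }
}
```

For instance, passing a header row of "atom", "file", "line" followed by the row "Type Conversion", "Hello.java", "10" yields two comma-separated lines of text that can be written to a file.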