Search Criteria Constraints

Defexts edited this page Jun 13, 2019 · 5 revisions

Defexts contains real-world bugs from public GitHub projects. With our available processing power, it was infeasible to test every commit across every project for its suitability within Defexts. This page describes the search criteria and constraints we used to exclude commits unlikely to yield a bug fix.

Repair-related Keywords

Since the goal of Defexts is to include bugs and their patches, we constrained the search space by looking for program-repair-related keywords in the commit title or commit description. These keywords are as follows (all case insensitive): error, fix, issue, problem, remove, repair, solve

Commits which did not include any of these keywords in the commit title or commit description were excluded from Defexts.
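
As an illustration, the keyword filter could be sketched as follows. This is only a sketch: the function name `has_repair_keyword` is hypothetical, and it assumes plain substring matching (so "fixed" matches "fix"), which this page does not specify.

```python
# Keywords listed above; matching is case insensitive.
REPAIR_KEYWORDS = ("error", "fix", "issue", "problem", "remove", "repair", "solve")


def has_repair_keyword(title: str, description: str = "") -> bool:
    """Return True if the commit title or description contains any keyword.

    Assumption: substring matching, so "fixed" and "resolves" also count.
    """
    text = f"{title}\n{description}".lower()
    return any(keyword in text for keyword in REPAIR_KEYWORDS)


if __name__ == "__main__":
    print(has_repair_keyword("Fix NPE in parser"))
    print(has_repair_keyword("Bump version", "Release 1.2"))
```

A stricter whole-word match would exclude commits such as "prefix handling", at the cost of also missing "fixed" or "fixes".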

Build System Constraints

All entries in Defexts currently employ the Maven build system or Gradle build system. We use these build systems simply because of our familiarity and their popularity among other developers. As needed, Defexts may include projects which support other build systems, e.g. sbt for Scala projects.

Projects which use neither Gradle nor Maven are excluded.
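
This page does not say how the build system of a project is detected. One possible sketch, assuming detection by the conventional build files in the project root (`pom.xml` for Maven, `build.gradle` or `build.gradle.kts` for Gradle), is shown below. The function names are hypothetical.

```python
# Conventional build files; detection by file name is an assumption.
MAVEN_MARKERS = {"pom.xml"}
GRADLE_MARKERS = {"build.gradle", "build.gradle.kts"}


def detect_build_systems(root_filenames) -> set:
    """Return the supported build systems suggested by the root-level files."""
    names = set(root_filenames)
    found = set()
    if names & MAVEN_MARKERS:
        found.add("maven")
    if names & GRADLE_MARKERS:
        found.add("gradle")
    return found


def uses_supported_build_system(root_filenames) -> bool:
    """A project is kept only if it uses Maven or Gradle (or both)."""
    return bool(detect_build_systems(root_filenames))


if __name__ == "__main__":
    print(detect_build_systems(["pom.xml", "README.md"]))
    print(uses_supported_build_system(["build.sbt"]))
```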

Language-specific Files

As part of the search process, we only include commits which modify JVM-specific files. We filter on this criterion to ensure any potential bugs are rooted in some JVM language, as opposed to build system errors, environment misconfigurations, or other similar problems. The current Defexts version allows modifications of files in the four most prominent JVM languages: Java, Scala, Kotlin, and Groovy. Defexts excludes commits which modify any file not ending in one of the following file extensions: .kt, .kts, .groovy, .java, .scala. These correspond to the following JVM languages:

  • Scala - .scala
  • Java - .java
  • Kotlin - .kt, .kts
  • Groovy - .groovy

Though a project may be functionally composed of one language (e.g. Kotlin), its tests may be written in another language (e.g. Scala or Groovy). Thus, the inclusion of files from the aforementioned JVM languages allows us to capture commits which modify files from different languages, so-called hybrid projects.
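
The extension filter can be sketched as below. The names are hypothetical, and the sketch assumes a case-sensitive suffix match and that a commit with no modified files is rejected, neither of which is stated on this page.

```python
# Mapping from file extension to JVM language, as listed above.
LANGUAGE_BY_EXTENSION = {
    ".scala": "Scala",
    ".java": "Java",
    ".kt": "Kotlin",
    ".kts": "Kotlin",
    ".groovy": "Groovy",
}


def language_of(path: str):
    """Return the JVM language for a path, or None if the extension is not allowed."""
    for extension, language in LANGUAGE_BY_EXTENSION.items():
        if path.endswith(extension):
            return language
    return None


def modifies_only_jvm_files(paths) -> bool:
    """True only if every modified file ends in an allowed extension."""
    paths = list(paths)
    return bool(paths) and all(language_of(p) is not None for p in paths)


def languages_in_commit(paths) -> set:
    """Languages touched by a commit; more than one suggests a hybrid project."""
    return {language_of(p) for p in paths} - {None}


if __name__ == "__main__":
    commit = ["src/main/kotlin/App.kt", "src/test/groovy/AppSpec.groovy"]
    print(modifies_only_jvm_files(commit), languages_in_commit(commit))
```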

Constraints for Source Files and Test Files

Source File Constraints

Defexts enforces the following constraints on source files:

  1. Modified files must have src/main/ in their path (only files matching this criterion are considered source files)
  2. Source file filenames must end in one of the file extensions specified in "Language-specific Files"
  3. Source files must have at most 6 line modifications, as computed by diff fileA fileB
  4. Accepted commits must modify exactly 1 source file

  • Constraint (1) follows our observations on the typical location of source files in supported build systems. This constraint helps ensure modified files affect system functionality.
  • Constraint (2) helps ensure Defexts includes bugs / patches directly pertaining to system functionality.
  • Constraint (3) attempts to exclude commits which make other changes beyond patching a bug, such as commits which fix some bug and also refactor some code. We acknowledge that only a small portion of bugs can be patched in 6 or fewer lines, and this constraint thus prevents complex bugs from inclusion in Defexts. As our processing methodology and processing power improve, we aim to loosen or even eliminate this constraint.
  • Similar to Constraint (3), Constraint (4) attempts to exclude commits which perform changes other than fixing some bug. Even more than Constraint (3), this constraint artificially limits the types of bugs possible within Defexts. Without this constraint, too much human effort and time would be needed to isolate bug patch changes from all other modifications for each potential Defexts entry. As our processing methodology improves, we similarly aim to fully eliminate this constraint.
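
The four source file constraints can be combined into a single check, sketched below. All names are hypothetical. The sketch uses Python's `difflib` in place of the `diff` command and assumes that a "line modification" is any added or removed line, so a changed line counts twice; the exact counting rule is not given on this page.

```python
import difflib

JVM_EXTENSIONS = (".java", ".scala", ".kt", ".kts", ".groovy")
MAX_LINE_MODIFICATIONS = 6  # Constraint (3)


def is_source_file(path: str) -> bool:
    """Constraints (1) and (2): under src/main/ with an allowed extension."""
    return "src/main/" in path and path.endswith(JVM_EXTENSIONS)


def count_line_modifications(before: str, after: str) -> int:
    """Count added plus removed lines (assumed meaning of 'line modifications')."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    return sum(1 for line in diff if line.startswith(("- ", "+ ")))


def satisfies_source_constraints(changes: dict) -> bool:
    """changes maps each modified path to a (before_text, after_text) pair."""
    source_paths = [path for path in changes if is_source_file(path)]
    if len(source_paths) != 1:  # Constraint (4): exactly one source file
        return False
    before, after = changes[source_paths[0]]
    return count_line_modifications(before, after) <= MAX_LINE_MODIFICATIONS


if __name__ == "__main__":
    changes = {
        "core/src/main/java/Parser.java": ("int a = 1;\nreturn a;\n", "int a = 2;\nreturn a;\n"),
    }
    print(satisfies_source_constraints(changes))
```

Under a different counting rule (for example, one modification per changed line), only `count_line_modifications` would need to change.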

Test Files Constraints

Defexts enforces the following constraints on test files:

  1. Modified files must have src/test/ in their path (only files matching this criterion are considered test files)
  2. Test file filenames must end in one of the file extensions specified in "Language-specific Files"

Note that we impose no limit on the number of test files modified or on the number of modifications within each test file.

  • Constraint (1) follows our observations on the typical location of test files in supported build systems. This constraint helps ensure modified files affect system test functionality.
  • Constraint (2) helps ensure Defexts includes bugs / patches directly affecting system test functionality.
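
The test file rules mirror the source file rules without the size limits. A possible sketch that sorts the modified files of a commit into source, test, and other files is given below. The names are hypothetical, and checking src/main/ before src/test/ for a path that contains both is an assumption.

```python
JVM_EXTENSIONS = (".java", ".scala", ".kt", ".kts", ".groovy")


def is_source_file(path: str) -> bool:
    """Source file: under src/main/ with an allowed extension."""
    return "src/main/" in path and path.endswith(JVM_EXTENSIONS)


def is_test_file(path: str) -> bool:
    """Test file: under src/test/ with an allowed extension."""
    return "src/test/" in path and path.endswith(JVM_EXTENSIONS)


def partition_commit_files(paths) -> dict:
    """Group modified paths; no limit is placed on the number of test files."""
    groups = {"source": [], "test": [], "other": []}
    for path in paths:
        if is_source_file(path):
            groups["source"].append(path)
        elif is_test_file(path):
            groups["test"].append(path)
        else:
            groups["other"].append(path)
    return groups


if __name__ == "__main__":
    print(partition_commit_files([
        "core/src/main/kotlin/App.kt",
        "core/src/test/groovy/AppSpec.groovy",
        "core/src/test/kotlin/AppTest.kt",
    ]))
```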

Source File Constraints vs Test File Constraints

Ideally, we would enforce no limits on the number of modified source files or the number of modifications within each source file. We intend for Defexts to contain commits which isolate bug patches from any other form of modification, to reduce the effort needed by Defexts end users. Unfortunately, we found manually isolating every commit infeasible and have yet to determine a method to automate this process, hence Source Constraint 3 and Source Constraint 4. Though some filtered commits are not fully isolated, they are small in number and are human processable.

Test file modifications, by contrast, can either help reveal otherwise undetected bugs in projects or have no impact on existing test functionality (e.g. refactoring). Thus, we enforce fewer constraints on modified test files.