"szz_find_bug_introducers-0.1.jar" stalls for a long time Kafka 2.1.1 #28

Open
djaekim opened this issue Nov 6, 2019 · 10 comments

@djaekim

djaekim commented Nov 6, 2019

First of all, I really appreciate your work on making the SZZ algorithm public. This is truly helpful for researchers and practitioners.

Secondly, I am not using Docker, and I am a Windows user.

Questions:
[1] I noticed that annotation.json is created quickly; however, the command line still shows "trying to find potential bug introducing commit" and stalls for a very long time. If, as the documentation says, "annotation.json" has the same information as "fix_and_introducers_pairs.json" but shows details about the bug-introducing file rather than the commit, I do not understand why it stalls for so long to get the commits.

As soon as I run the program, the following happens:
A.
image

B. Inside each one I already have:
image

C. However, it stalls for a very long time here.
image

[2] How can I get the file that introduced the bug, rather than the commit? Can I traverse annotation.json and look for filePath?

image

Thank you!
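
A minimal sketch of the traversal asked about in [2], in Python. The layout of the annotations file is not described in this thread, so the code assumes nothing about it beyond it being valid JSON that may contain a `filePath` key somewhere. It walks the whole structure and collects every string stored under that key. The helper names (`find_key`, `unique_file_paths`, `file_paths_from`) are made up for illustration and are not part of the tool.

```python
import json
from typing import Any, Iterator


def find_key(node: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key` anywhere in a nested JSON structure."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending: the key may also appear deeper inside the value.
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def unique_file_paths(annotations: Any) -> list[str]:
    """Return the sorted, de-duplicated string values found under 'filePath'."""
    return sorted({p for p in find_key(annotations, "filePath") if isinstance(p, str)})


def file_paths_from(path: str) -> list[str]:
    """Load a JSON file (assumed: the annotations output) and list its file paths."""
    with open(path, encoding="utf-8") as fh:
        return unique_file_paths(json.load(fh))
```

Calling `file_paths_from("results/annotations.json")` would list the distinct paths, assuming the key really is spelled `filePath` in the output.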

@djaekim
Author

djaekim commented Nov 6, 2019

  • This happened after I used -Xmx10G
  • It took 4 hours to show this

image

@milocyning

It waits for a very long time...

[main] INFO Main - Checking available processors...
[main] INFO Main - Found 8 processes!
[Thread-0] INFO parser.GitParserThread - Started process...
[Thread-3] INFO parser.GitParserThread - Started process...
[Thread-4] INFO parser.GitParserThread - Started process...
[Thread-5] INFO parser.GitParserThread - Started process...
[Thread-6] INFO parser.GitParserThread - Started process...
[Thread-7] INFO parser.GitParserThread - Started process...
[Thread-8] INFO parser.GitParserThread - Started process...
[Thread-9] INFO parser.GitParserThread - Started process...
[Thread-7] INFO parser.GitParserThread - Found 1941 number of commits.
[Thread-7] INFO parser.GitParserThread - Checking each commits diff...
[Thread-7] INFO parser.GitParserThread - Parsing difflines for all found commits.
[Thread-8] INFO parser.GitParserThread - Found 1839 number of commits.
[Thread-8] INFO parser.GitParserThread - Checking each commits diff...
[Thread-8] INFO parser.GitParserThread - Parsing difflines for all found commits.
[Thread-3] INFO parser.GitParserThread - Found 1949 number of commits.
[Thread-3] INFO parser.GitParserThread - Checking each commits diff...
[Thread-3] INFO parser.GitParserThread - Parsing difflines for all found commits.
[Thread-6] INFO parser.GitParserThread - Found 1977 number of commits.
[Thread-6] INFO parser.GitParserThread - Checking each commits diff...
[Thread-6] INFO parser.GitParserThread - Parsing difflines for all found commits.
[Thread-4] INFO parser.GitParserThread - Found 2015 number of commits.
[Thread-4] INFO parser.GitParserThread - Checking each commits diff...
[Thread-4] INFO parser.GitParserThread - Parsing difflines for all found commits.
[Thread-5] INFO parser.GitParserThread - Found 2047 number of commits.
[Thread-5] INFO parser.GitParserThread - Checking each commits diff...
[Thread-5] INFO parser.GitParserThread - Parsing difflines for all found commits.
[Thread-0] INFO parser.GitParserThread - Found 1973 number of commits.
[Thread-0] INFO parser.GitParserThread - Checking each commits diff...
[Thread-0] INFO parser.GitParserThread - Parsing difflines for all found commits.
[Thread-9] INFO parser.GitParserThread - Found 1926 number of commits.
[Thread-9] INFO parser.GitParserThread - Checking each commits diff...
[Thread-9] INFO parser.GitParserThread - Parsing difflines for all found commits.
[Thread-3] INFO parser.GitParserThread - Saving parsed commits to file
[Thread-8] INFO parser.GitParserThread - Saving parsed commits to file
[Thread-9] INFO parser.GitParserThread - Saving parsed commits to file
[Thread-0] INFO parser.GitParserThread - Saving parsed commits to file
[Thread-6] INFO parser.GitParserThread - Saving parsed commits to file
[Thread-7] INFO parser.GitParserThread - Saving parsed commits to file
[Thread-4] INFO parser.GitParserThread - Saving parsed commits to file
[Thread-5] INFO parser.GitParserThread - Saving parsed commits to file
[Thread-3] INFO parser.GitParserThread - Building line mapping graph.
[Thread-8] INFO parser.GitParserThread - Building line mapping graph.
[Thread-9] INFO parser.GitParserThread - Building line mapping graph.
[Thread-7] INFO parser.GitParserThread - Building line mapping graph.
[Thread-6] INFO parser.GitParserThread - Building line mapping graph.
[Thread-4] INFO parser.GitParserThread - Building line mapping graph.
[Thread-5] INFO parser.GitParserThread - Building line mapping graph.
[Thread-0] INFO parser.GitParserThread - Building line mapping graph.

@djaekim
Author

djaekim commented Dec 15, 2019

Hi, so you didn't get any exception like the one I showed?


@milocyning

@djaekim It ran for 24 hours.

$ ./run_pairs.sh
=== mahout ===
[main] INFO Main - Checking available processors...
[main] INFO Main - Using 1 cpus!
[Thread-0] INFO parser.GitParserThread - Started process...
[Thread-0] INFO parser.GitParserThread - Found 513 number of commits.
[Thread-0] INFO parser.GitParserThread - Checking each commits diff...
[Thread-0] INFO parser.GitParserThread - Parsing difflines for all found commits.
[Thread-0] INFO parser.GitParserThread - Saving parsed commits to file
[Thread-0] INFO parser.GitParserThread - Building line mapping graph.
[Thread-0] INFO parser.GitParserThread - Saving results to file
[Thread-0] INFO parser.GitParserThread - Trying to find potential bug introducing commits...
./run_pairs.sh: line 8: 15973 Killed               java -Xmx16g -jar ${JAR_PATH}/szz_find_bug_introducers-0.1.jar -i ${ISSUE_LIST_PATH}/mahout.json -r ${REPOS_PATH}/mahout/ -c 1
=== kafka ===
[main] INFO Main - Checking available processors...
[main] INFO Main - Using 1 cpus!
[Thread-0] INFO parser.GitParserThread - Started process...
[Thread-0] INFO parser.GitParserThread - Found 2129 number of commits.
[Thread-0] INFO parser.GitParserThread - Checking each commits diff...
[Thread-0] INFO parser.GitParserThread - Parsing difflines for all found commits.
[Thread-0] INFO parser.GitParserThread - Saving parsed commits to file
[Thread-0] INFO parser.GitParserThread - Building line mapping graph.
[Thread-0] INFO parser.GitParserThread - Saving results to file
[Thread-0] INFO parser.GitParserThread - Trying to find potential bug introducing commits...
Exception in thread "Thread-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.util.LinkedList.linkLast(LinkedList.java:142)
	at java.util.LinkedList.add(LinkedList.java:338)
	at util.RevisionCombinationGenerator.generateRevIssuePairs(RevisionCombinationGenerator.java:166)
	at heuristics.SimpleBugIntroducerFinder.findBugIntroducingCommits(SimpleBugIntroducerFinder.java:212)
	at parser.GitParserThread.run(GitParserThread.java:105)
java.io.FileNotFoundException: results/result0/fix_and_introducers_pairs.json (No such file or directory)
	at java.io.FileInputStream.open0(Native Method)
	at java.io.FileInputStream.open(FileInputStream.java:195)
	at java.io.FileInputStream.<init>(FileInputStream.java:138)
	at java.io.FileInputStream.<init>(FileInputStream.java:93)
	at java.io.FileReader.<init>(FileReader.java:58)
	at diff.SimplePartition.mergeFiles(SimplePartition.java:145)
	at Main.main(Main.java:65)
=== flink ===
[main] INFO Main - Checking available processors...
[main] INFO Main - Using 1 cpus!
[Thread-0] INFO parser.GitParserThread - Started process...
[Thread-0] INFO parser.GitParserThread - Found 3076 number of commits.
[Thread-0] INFO parser.GitParserThread - Checking each commits diff...
[Thread-0] INFO parser.GitParserThread - Parsing difflines for all found commits.
[Thread-0] INFO parser.GitParserThread - Saving parsed commits to file
[Thread-0] INFO parser.GitParserThread - Building line mapping graph.
[Thread-0] INFO parser.GitParserThread - Saving results to file
[Thread-0] INFO parser.GitParserThread - Trying to find potential bug introducing commits...
Exception in thread "Thread-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.util.LinkedList.linkLast(LinkedList.java:142)
	at java.util.LinkedList.add(LinkedList.java:338)
	at util.RevisionCombinationGenerator.generateRevIssuePairs(RevisionCombinationGenerator.java:166)
	at heuristics.SimpleBugIntroducerFinder.findBugIntroducingCommits(SimpleBugIntroducerFinder.java:212)
	at parser.GitParserThread.run(GitParserThread.java:105)
java.io.FileNotFoundException: results/result0/fix_and_introducers_pairs.json (No such file or directory)
	at java.io.FileInputStream.open0(Native Method)
	at java.io.FileInputStream.open(FileInputStream.java:195)
	at java.io.FileInputStream.<init>(FileInputStream.java:138)
	at java.io.FileInputStream.<init>(FileInputStream.java:93)
	at java.io.FileReader.<init>(FileReader.java:58)
	at diff.SimplePartition.mergeFiles(SimplePartition.java:145)
	at Main.main(Main.java:65)
=== storm ===
[main] INFO Main - Checking available processors...
[main] INFO Main - Using 1 cpus!
[Thread-0] INFO parser.GitParserThread - Started process...
[Thread-0] INFO parser.GitParserThread - Found 956 number of commits.
[Thread-0] INFO parser.GitParserThread - Checking each commits diff...
[Thread-0] INFO parser.GitParserThread - Parsing difflines for all found commits.
[Thread-0] INFO parser.GitParserThread - Saving parsed commits to file
[Thread-0] INFO parser.GitParserThread - Building line mapping graph.
[Thread-0] INFO parser.GitParserThread - Saving results to file
[Thread-0] INFO parser.GitParserThread - Trying to find potential bug introducing commits...

@milocyning

I get an empty [] result in results/fix_and_introducers_pairs.json.

$ tree
.
├── flink
│   ├── issues
│   │   └── fix_and_introducers_pairs_0.json
│   └── results
│       ├── annotations.json
│       ├── commits.json
│       ├── fix_and_introducers_pairs.json
│       └── result0
│           ├── annotations.json
│           └── commits.json
├── kafka
│   ├── issues
│   │   └── fix_and_introducers_pairs_0.json
│   └── results
│       ├── annotations.json
│       ├── commits.json
│       ├── fix_and_introducers_pairs.json
│       └── result0
│           ├── annotations.json
│           └── commits.json
├── mahout
│   ├── issues
│   │   └── fix_and_introducers_pairs_0.json
│   └── results
│       └── result0
│           ├── annotations.json
│           └── commits.json
├── run_pairs.sh
├── storm
│   ├── issues
│   │   └── fix_and_introducers_pairs_0.json
│   └── results
│       └── result0
│           ├── annotations.json
│           └── commits.json
└── szz_find_bug_introducers-0.1.jar

@wogscpar
Owner

Hello,

Nice to see that you find the project interesting and want to use it. And great that you've found some projects where this algorithm needs to get its hands dirty. Just to confirm, I've run the algorithm on the apache/mahout project and got the same result as you.

So I've tried a couple of different ways to solve it. For this particular project (apache/mahout) I tried splitting the issues to run on several instances of the algorithm, in order to:

  1. make the graphs smaller, since they can be rather big, especially if the project you want to analyze has big commits (which could indicate that such a commit was a bug introducer, since it could have had a huge impact on the code base).
  2. make it faster by splitting the work across several machines.

For the mahout project there were around 350 issues or so, right? Running the algorithm on the second half of the issues gave a result in about 30 minutes. However, the first half took longer, and I suspect that one commit is huge (maybe one or several huge files were deleted?).

Anyway, until I've optimized the step that checks for partial fixes, one workaround is to split issues.json into several files and let the algorithm run on them separately. And as said, to speed things up, distribute the runs over several computers so they run in parallel. After they're done, merge the results and you have the same as you would get from a single run.
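
The split-and-merge workaround above can be sketched in Python as follows. This is only an illustration under two assumptions that the thread does not confirm: that issues.json is a JSON object keyed by issue, and that each fix_and_introducers_pairs.json is a JSON array of pairs of strings (the empty `[]` result shown earlier at least suggests an array). The function names and the `issues_<n>.json` output naming are hypothetical.

```python
import json
from pathlib import Path


def split_issues(issues: dict, parts: int) -> list[dict]:
    """Split a dict of issues into `parts` roughly equal dicts, preserving order."""
    if parts < 1:
        raise ValueError("parts must be >= 1")
    items = list(issues.items())
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(dict(items[start:end]))
        start = end
    return chunks


def merge_pairs(pair_lists: list[list]) -> list:
    """Concatenate pair lists, dropping exact duplicates and keeping first-seen order."""
    seen, merged = set(), []
    for pairs in pair_lists:
        for pair in pairs:
            key = tuple(pair)  # assumes each pair holds hashable values such as hashes
            if key not in seen:
                seen.add(key)
                merged.append(list(pair))
    return merged


def write_chunks(issues_path: str, out_dir: str, parts: int) -> list[Path]:
    """Read one issues file and write issues_0.json, issues_1.json, ... into out_dir."""
    issues = json.loads(Path(issues_path).read_text(encoding="utf-8"))
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, chunk in enumerate(split_issues(issues, parts)):
        path = out / f"issues_{i}.json"
        path.write_text(json.dumps(chunk, indent=2), encoding="utf-8")
        paths.append(path)
    return paths
```

Each chunk file would then be passed to its own run with `-i`, and the resulting pair files loaded and handed to `merge_pairs`. Whether de-duplicating is appropriate depends on how the pairs are consumed afterwards.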

Another way, at the cost of evaluating fewer commits, is to lower the depth for the graphs, like java -jar szz_find_bug_introducers-0.1.jar -d 1 -r -i . It defaults to 3, which makes it go two commits back in history for each line.

The machine for these runs was a 13-inch MacBook Pro with an Intel i5 and 16 GB RAM, and I used the configuration java -Xmx6G ....

@milocyning

@wogscpar Thank you for your prompt reply. Now most projects can be analyzed normally, except storm. By the way, what impact will -d 1 have?

$ java -Xmx16g -jar ${JAR_PATH}/szz_find_bug_introducers-0.1.jar -d 1 -i ${ISSUE_LIST_PATH}/storm.json -r ${REPOS_PATH}/storm/

=== storm ===
[main] INFO Main - Checking available processors...
[main] INFO Main - Found 8 processes!
[Thread-0] INFO parser.GitParserThread - Started process...

...

[Thread-9] INFO parser.GitParserThread - Trying to find potential bug introducing commits...
[Thread-9] INFO parser.GitParserThread - Saving found bug introducing commits...
Exception in thread "Thread-5" java.lang.OutOfMemoryError: GC overhead limit exceeded
	at org.eclipse.jgit.diff.SimilarityIndex.<init>(SimilarityIndex.java:134)
	at org.eclipse.jgit.diff.SimilarityRenameDetector.hash(SimilarityRenameDetector.java:371)
	at org.eclipse.jgit.diff.SimilarityRenameDetector.buildMatrix(SimilarityRenameDetector.java:285)
	at org.eclipse.jgit.diff.SimilarityRenameDetector.compute(SimilarityRenameDetector.java:138)
	at org.eclipse.jgit.diff.RenameDetector.findContentRenames(RenameDetector.java:508)
	at org.eclipse.jgit.diff.RenameDetector.compute(RenameDetector.java:392)
	at org.eclipse.jgit.diff.RenameDetector.compute(RenameDetector.java:362)
	at org.eclipse.jgit.diff.RenameDetector.compute(RenameDetector.java:339)
	at org.eclipse.jgit.diff.RenameDetector.compute(RenameDetector.java:323)
	at org.eclipse.jgit.blame.BlameGenerator.findRename(BlameGenerator.java:1028)
	at org.eclipse.jgit.blame.BlameGenerator.processOne(BlameGenerator.java:627)
	at org.eclipse.jgit.blame.BlameGenerator.next(BlameGenerator.java:511)
	at org.eclipse.jgit.blame.BlameResult.computeAll(BlameResult.java:249)
	at org.eclipse.jgit.blame.BlameGenerator.computeBlameResult(BlameGenerator.java:465)
	at org.eclipse.jgit.api.BlameCommand.call(BlameCommand.java:238)
	at parser.GitParser.traceFileChanges(GitParser.java:200)
	at parser.GitParser.buildLineMappingGraph(GitParser.java:274)
	at parser.GitParser.annotateCommits(GitParser.java:316)
	at parser.GitParserThread.run(GitParserThread.java:98)
Exception in thread "Thread-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
	at org.eclipse.jgit.util.FileUtils.lastModified(FileUtils.java:683)
	at org.eclipse.jgit.util.FS.lastModified(FS.java:319)
	at org.eclipse.jgit.internal.storage.file.FileSnapshot.isModified(FileSnapshot.java:165)
	at org.eclipse.jgit.internal.storage.file.ObjectDirectory.searchPacksAgain(ObjectDirectory.java:776)
	at org.eclipse.jgit.internal.storage.file.ObjectDirectory.resolve(ObjectDirectory.java:395)
	at org.eclipse.jgit.internal.storage.file.ObjectDirectory.resolve(ObjectDirectory.java:373)
	at org.eclipse.jgit.internal.storage.file.WindowCursor.resolve(WindowCursor.java:150)
	at org.eclipse.jgit.lib.ObjectReader.abbreviate(ObjectReader.java:134)
	at org.eclipse.jgit.diff.DiffFormatter.format(DiffFormatter.java:709)
	at org.eclipse.jgit.diff.DiffFormatter.formatIndexLine(DiffFormatter.java:1174)
	at org.eclipse.jgit.diff.DiffFormatter.formatHeader(DiffFormatter.java:1155)
	at org.eclipse.jgit.diff.DiffFormatter.createFormatResult(DiffFormatter.java:969)
	at org.eclipse.jgit.diff.DiffFormatter.toFileHeader(DiffFormatter.java:951)
	at util.CommitUtil.getDiffEditList(CommitUtil.java:151)
	at util.CommitUtil.diffFile(CommitUtil.java:168)
	at util.CommitUtil.getCommitDiffingLines(CommitUtil.java:198)
	at parser.GitParser.traceFileChanges(GitParser.java:251)
	at parser.GitParser.buildLineMappingGraph(GitParser.java:274)
	at parser.GitParser.annotateCommits(GitParser.java:316)
	at parser.GitParserThread.run(GitParserThread.java:98)
Exception in thread "Thread-4" java.lang.OutOfMemoryError: GC overhead limit exceeded
java.io.FileNotFoundException: results/result0/annotations.json (No such file or directory)
	at java.io.FileInputStream.open0(Native Method)
	at java.io.FileInputStream.open(FileInputStream.java:195)
	at java.io.FileInputStream.<init>(FileInputStream.java:138)
	at java.io.FileInputStream.<init>(FileInputStream.java:93)
	at java.io.FileReader.<init>(FileReader.java:58)
	at diff.SimplePartition.mergeFiles(SimplePartition.java:142)
	at Main.main(Main.java:65)
java.io.FileNotFoundException: results/result2/fix_and_introducers_pairs.json (No such file or directory)
	at java.io.FileInputStream.open0(Native Method)
	at java.io.FileInputStream.open(FileInputStream.java:195)
	at java.io.FileInputStream.<init>(FileInputStream.java:138)
	at java.io.FileInputStream.<init>(FileInputStream.java:93)
	at java.io.FileReader.<init>(FileReader.java:58)
	at diff.SimplePartition.mergeFiles(SimplePartition.java:145)
	at Main.main(Main.java:65)
java.io.FileNotFoundException: results/result3/annotations.json (No such file or directory)
	at java.io.FileInputStream.open0(Native Method)
	at java.io.FileInputStream.open(FileInputStream.java:195)
	at java.io.FileInputStream.<init>(FileInputStream.java:138)
	at java.io.FileInputStream.<init>(FileInputStream.java:93)
	at java.io.FileReader.<init>(FileReader.java:58)
	at diff.SimplePartition.mergeFiles(SimplePartition.java:142)
	at Main.main(Main.java:65)

@wogscpar
Owner

wogscpar commented Jan 5, 2020

I've run this on apache/hadoop and found that it includes quite a lot of code changes, which makes commits.json huge. I've implemented an option to not include the rows of code in the result and just save the line numbers (since the code is only essential if you want to try out the experimental distanceintroducerfinder). In the simplefinder, only the lines are useful.

As for the -d 1 option, it makes the algorithm look only one commit back in history. If you increase it, it will try to go through as many commits as you specify with the -d option.

I also implemented an optimization that makes the revisioncombinationgenerator produce combinations on the fly in a loop instead of generating a full list of them first.
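
To illustrate the general idea behind that last change (not the actual Java code, which is not shown in this thread), here is a small Python sketch contrasting an eager list of pairs with pairs produced lazily in a loop. What exactly the generator combines is an assumption here: the example simply pairs up revision ids, and all function names are hypothetical.

```python
from itertools import combinations
from typing import Callable, Iterable, Iterator, Tuple

Pair = Tuple[str, str]


def pairs_as_list(revisions: list[str]) -> list[Pair]:
    """Eager version: materializes every pair, so memory grows quadratically."""
    return list(combinations(revisions, 2))


def pairs_on_the_fly(revisions: list[str]) -> Iterator[Pair]:
    """Lazy version: yields one pair at a time from nested loops."""
    for i in range(len(revisions)):
        for j in range(i + 1, len(revisions)):
            yield revisions[i], revisions[j]


def count_matching(pairs: Iterable[Pair], predicate: Callable[[Pair], bool]) -> int:
    """Consume pairs one by one; only a single pair is held at any moment."""
    return sum(1 for pair in pairs if predicate(pair))
```

Both versions visit the same n(n-1)/2 pairs in the same order, so the running time is unchanged. Only the peak memory differs, which fits the observation that out-of-memory errors and slowness are separate problems.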

@qingdujun

I've implemented an option to not include the rows of code in the result and just save the linenumbering

Looking forward to the update.

@wogscpar
Owner

There's now an update where it's possible to omit the rows of code. As stated in the commit, it will remove the OutOfMemoryError (which means the FileNotFoundException will not occur either), but it will not solve the issue of the algorithm being slow.
