Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue #344: add multi-thread support to CheckstyleReportsParser #521

Closed
wants to merge 1 commit into from

Conversation

pbludov
Copy link
Member

@pbludov pbludov commented Oct 31, 2020

Issue #344

@rnveach @romani @strkkk
Here is my attempt to speed up the diff report generation

  • CheckstyleRecord now implements Comparable
  • The method specificEquals now Comparable::compareTo as we need the order of records
  • The list of records to process is pre-sorted (it has to be already sorted, so it does not take much time)
  • The method produceDiff now is O(N) as both lists are sorted
  • Difference generation is threaded to utilize more CPU and speed up the process
  • PositionOrderComparator removed
  • produceDiff now is a generic method and extracted to DiffUtils class
  • CliPaths renamed to CliOptions
  • New option for threading mode: SINGLE or MULTI

TODO:

  • Do some regression for regression testing
  • Add an option to disable multithreading to easily diagnose any problems with threading that may occur in the future

@pbludov
Copy link
Member Author

pbludov commented Oct 31, 2020

Running diff on openjdk14 with modules

<module name="Indentation"/>
<module name="MagicNumber"/>
<module name="MissingCtor"/>
<module name="RequireThis"/>
<module name="ReturnCount">
  <property name="max" value="1"/>
  <property name="maxForVoid" value="0"/>
</module>
<module name="SimplifyBooleanExpression"/>
<module name="SimplifyBooleanReturn"/>
<module name="StringLiteralEquality"/>
<module name="VariableDeclarationUsageDistance"/>
<module name="AvoidStarImport"/>
<module name="AvoidStaticImport"/>
<module name="MissingJavadocMethod">
  <property name="allowMissingPropertyJavadoc" value="true"/>
</module>
<module name="MissingJavadocPackage"/>
<module name="MissingJavadocType"/>
<module name="JavaNCSS"/>
<module name="NPathComplexity"/>

produces checkstyle-result.xml with size 140M

Diff generation timings (best result from 10 runs)

Master:

real    0m13,796s
user    0m16,818s
sys     0m0,251s

Patch:

real    0m4,518s
user    0m8,289s
sys     0m0,409s

@romani
Copy link
Member

romani commented Oct 31, 2020

Heh, from 13 second to 4 seconds, in worse case.
If we merge this , it will be only to please maintainers to put MT in code.

@@ -76,6 +77,7 @@ public Statistics getStatistics() {
public void addRecords(List<CheckstyleRecord> newRecords,
String filename) {
if (!newRecords.isEmpty()) {
newRecords.sort(new PositionOrderComparator());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we are using PositionOrderComparator for sorting lines within a file?
This comparator uses only line and column, as a result there may be false positives for

line 100, col 100, messageA
line 100, col 100, messageB

versus

line 100, col 100, messageB
line 100, col 100, messageA

I'm going to use same specificCompare method here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We did not review this code thoroughly, it is just a code that works

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Well, now i wonder why CheckstyleRecord does not implement Comparable but instead exposes a method used in sorting? This should be reviewed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just fix it if you think it weird

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@pbludov pbludov marked this pull request as draft October 31, 2020 19:35
@pbludov pbludov marked this pull request as ready for review October 31, 2020 21:02
@rnveach
Copy link
Member

rnveach commented Nov 1, 2020

If we merge this , it will be only to please maintainers to put MT in code.

It really depends on the differences. When I created the issue, I had a 1 hour run and MT reduced it to 20 minutes.

* Severity level for a {@link CheckstyleRecord}.
* Used to compare two records in predefined order: "info", "warning", "error", all other.
*/
public enum SeverityLevel {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Diff report tool has no dependency on Checkstyle itself. This enum is to provide an order other then alphabetical to record severity. May be this enum is redundant and confusing. I'm going to review it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. @rnveach please review

statistics.incrementUniqueMessageCount(rec.getIndex());
}
}
records.entrySet().parallelStream()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parallelizing this loop gives 5% more speed for large diffs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

upd: not in this PR. The class Statistics is not thread-safe.

@pbludov pbludov force-pushed the issue-344-threading branch 3 times, most recently from fd8f6ba to b65b85a Compare November 4, 2020 20:45
@pbludov
Copy link
Member Author

pbludov commented Nov 4, 2020

The diff report is available at https://pbludov.github.io/

The first report is built with patch-diff-report-tool from master
The second report is from this branch in single-threaded mode
The third report is from this branch in multi-threaded mode

They are all equal.

@pbludov pbludov requested a review from rnveach November 4, 2020 21:21
@rnveach
Copy link
Member

rnveach commented Nov 5, 2020

To be updated as jobs run:

Files used: http://rveach.no-ip.org/checkstyle/regression/reports/294/ (Warning XML is 169 megs) (Patch and master XMLs are 100% identical)
Command:

$ time java -jar ~/opensource/contribution/patch-diff-report-tool/target/patch-diff-report-tool-0.1-SNAPSHOT-jar-with-dependencies.jar --baseReport /var/www/html/reports/294/openjdk-pull.xml --patchReport /var/www/html/reports/294/openjdk-master.xml --output /tmp/test1 --baseConfig /var/www/html/reports/294/config.xml --patchConfig /var/www/html/reports/294/config.xml --refFiles /home/ricky/regression_repositories/openjdk10

master run:

patch-diff-report-tool execution started.
XML parsing is started.
Creation of configuration report is started.
Creation of diff html site is started.
Creation of the result site succeed.
patch-diff-report-tool execution finished.

real	60m10.596s
user	60m22.767s
sys	0m3.359s

this PR's run:

patch-diff-report-tool execution started.
XML parsing is started.
Creation of configuration report is started.
Creation of diff html site is started.
Creation of the result site succeed.
patch-diff-report-tool execution finished.

real	0m12.387s
user	0m23.788s
sys	0m1.658s

my PR's run with 6 processors:

patch-diff-report-tool execution started.
XML parsing is started.
Creation of configuration report is started.
Creation of diff html site is started.
Creation of the result site succeed.
patch-diff-report-tool execution finished.

real	21m45.254s
user	128m24.969s
sys	0m29.677s

@rnveach
Copy link
Member

rnveach commented Nov 5, 2020

@pbludov I do like your changes more. The thing is your changes aren't multi-threading. It is just pure optimization and so it all still runs single threaded. The quick run is just from 2 exact same files, I would like to see what is the time when everything is different. I assume it would then result in O(N^2) time since we have to go through each list twice, plus adding a sorting step which may result in longer times the more random the data gets.

I am fine if we want to keep this but still look into some way to multi-thread this.

@rnveach rnveach assigned pbludov and unassigned rnveach Nov 14, 2020
@rnveach
Copy link
Member

rnveach commented Nov 21, 2020

@pbludov @romani Any thoughts?

I will try to find time to create a fully random files to test.

@pbludov
Copy link
Member Author

pbludov commented Nov 21, 2020

@pbludov @romani Any thoughts?

I also noticed that the only 1.2 cores are loaded (on my PC). The main code is still single-threaded and has many ways for parallelization. But we can't go any further until we have a reliable method of regression testing.

@pbludov
Copy link
Member Author

pbludov commented Dec 27, 2020

@rnveach This takes much more time than I expected. Let's merge the optimization part and do multithreading later.

#533

@pbludov pbludov marked this pull request as draft December 27, 2020 11:04
@pbludov pbludov closed this Jan 30, 2021
@pbludov pbludov deleted the issue-344-threading branch January 30, 2021 16:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants