Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue #344: add multi-thread support to CheckstyleReportsParser #514

Closed
wants to merge 1 commit into from

Conversation

rnveach
Copy link
Member

@rnveach rnveach commented Oct 23, 2020

Issue #344

I feel I have been sitting on this long enough. I have not run into any issues in all the times I have used it.

@rnveach rnveach requested a review from romani October 23, 2020 01:21
@romani
Copy link
Member

romani commented Oct 24, 2020

@rnveach , please write a summary of benefit to have MT mode. Is it only for extremely large reports ? Or it is benefitial on real life examples of use that we have currently in our PRs as regression report.

@rnveach
Copy link
Member Author

rnveach commented Oct 24, 2020

The benefit of MT is that the process will now split any work that was solely done in a single thread to be done simultaneously into separate threads, allowing the execution to be split among as many available threads open to the system for however big the reports and differences get.

Parsing of the 2 XML files, base and patch as represented by O(B + P) time (basically O(2n) time), is now split into 2 different threads allowing parsing time decreased to the smallest of either O(B) and O(P), which basically amounts to now O(n) time.

Finding differences on the violations in the XML files is where the most benefit is seen. Difference finding, which is comparing 2 lists TWICE and amounts to O(2 * B * P) time (basically an O(n^2) complexity), is now split into as small as 100 record chunks to each thread that is available on the system. This haves a dramatic effect on time proportional to how many threads are on the system, theoretically amounting to O(2 * B * P / T) time, where T is number of threads given to the process. The more threads available and the more records to process, the more time is saved with multi-threading. I think with all the savings, this may reduce the complexity to basically O(n log n) time. ( https://stackoverflow.com/questions/1592649/examples-of-algorithms-which-has-o1-on-log-n-and-olog-n-complexities )

Is it only for extremely large reports ?

No. Anything with over 100 violations in any XML file with more than 1 thread available will see improvements. The bigger the violations and the threads, the more improvements seen.

is benefitial on real life examples of use that we have currently in our PRs as regression report.

Mine was a real life example as presented in issue. The more violations a produced in the final XML, the more savings will be generated. Pitest regression ( https://github.com/checkstyle/contribution/tree/master/checkstyle-tester#checkstyle-pitest-regression ) multiplies the number of violations seen by 1 check and makes the number of violations bigger than normal. Indentation also produces possibly many violations just per file and will the be the single check to show the most improvement.

Remember improvements are seen based on the more violations in the XML file. This search routine is O(n^2) complexity.

@romani
Copy link
Member

romani commented Oct 24, 2020

I did not mean general ideas of why MT mode is better.

I meant, what is reduction of execution time for CI, for example for checkstyle/checkstyle#8913 .

@rnveach
Copy link
Member Author

rnveach commented Oct 24, 2020

I meant, what is reduction of execution time for CI

That is explained as I go into O(n) time complexity. I explain how the routines changed will now behave. The O notation is used to describe how many comparisons a routine does. The less the time complexity, the less comparisons done, the faster the routine generally is.

O(1) is less time than O(n)
O(n) is less time than O(A * n) where A is any numeric value
O(n) is less time than O(n log n)
O(n log n) is less time than O(n^2)
etc...

https://stackoverflow.com/questions/1592649/examples-of-algorithms-which-has-o1-on-log-n-and-olog-n-complexities which I linked to shows examples of code for each complexity.

Main issue provides some numbers an actual run but I don't have the file to reproduce and provide more details from it, I can only explain the time reduction in terms of comparisons as there are many variables.

for example for checkstyle/checkstyle#8913 .

Can you post to exact comment because I am not seeing anything.

@romani
Copy link
Member

romani commented Oct 24, 2020

I understand notation.
But if optimize process that takes 10 seconds to be 5 seconds, in Os it awesome but in real life it not that much , especially if whole other process takes 40 minutes.

I curious how much quicker generation of report be after merge if this update. So I provided link to PR with configs.
I need to maintainability price of code in comparison with benefits of time reduction to get report ready.

final ExecutorService executor =
Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
final List<Future<List<CheckstyleRecord>>> futures = new ArrayList<>();
final int size = list1.size();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks as hardcoded

    private static List<CheckstyleRecord> produceDiff(
            List<CheckstyleRecord> list1, List<CheckstyleRecord> list2) {
        return Stream.concat(
                StreamSupport.stream(list1.spliterator(), SPLIT_SIZE <= list1.size())
                    .filter(rec -> !isInList(list2, rec)),
                StreamSupport.stream(list2.spliterator(), SPLIT_SIZE <= list2.size())
                    .filter(rec -> !isInList(list1, rec))
            )
            .collect(Collectors.toList());
    }

Copy link
Member Author

@rnveach rnveach Oct 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this support multi-threading? The whole issue is we need this multi-threaded to reduce processing time looping through the 2 lists. If everything is processed in 1 thread then there is no time saving.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StreamSupport.stream(list1.spliterator(), SPLIT_SIZE <= list1.size()) is basically list1.stream() if the size is less then SPLIT_SIZE. Otherwise it is list1.parallelStream()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not familar with streams but I assume parallel is multi-threaded. I will compare and see how they stack against each other.

@pbludov
Copy link
Member

pbludov commented Oct 30, 2020

@rnveach what about such optimization:

        List<CheckstyleRecord> shortest;
        List<CheckstyleRecord> longest;
        if (list1.size() < list2.size()) {
            shortest = list1;
            longest = list2;
        }
        else {
            shortest = list2;
            longest = list1;
        }

        // O(log(N))
        Set<CheckstyleRecord> recordSet =
                StreamSupport.stream(shortest.spliterator(), SPLIT_SIZE <= shortest.size()) // Do we need threading here?
            .collect(Collectors.toCollection(ConcurrentSkipListSet::new));

        // O(M * log(N)), M > N
        List<CheckstyleRecord> diff =
                StreamSupport.stream(longest.spliterator(), SPLIT_SIZE <= longest.size())
            .filter(rec -> !recordSet.remove(rec)) // Remove from both collections if matches
            .collect(Collectors.toCollection(ArrayList::new));

        diff.addAll(recordSet);
        return diff;

Could you run your benchmark on this code?

@rnveach
Copy link
Member Author

rnveach commented Oct 30, 2020

what about such optimization:

I can add it to the list to benchmark but I don't understand the code to know what it is doing. Is this a replacement for produceDiffEx or produceDiff?

// Do we need threading here?

Imagine both lists are of infinite size, it doesn't matter who is smaller. This is how I read N in this complexities since it isn't a direct number. It only matters what the final complexity is and if it can be reduced below O(N^2) or O(N log N).

@pbludov
Copy link
Member

pbludov commented Oct 30, 2020

Imagine both lists are of infinite size, it doesn't matter who is smaller. This is how I read N in this complexities since it isn't a direct number. It only matters what the final complexity is and if it can be reduced below O(N^2) or O(N log N).

In this case, may be we should keep these lists sorted? It is easy to compare two sorted lists. Or even store them in a sorted collection, like

    private Map<String, ConcurrentSkipListSet<CheckstyleRecord>> records

?

@rnveach
Copy link
Member Author

rnveach commented Oct 30, 2020

In this case, may be we should keep these lists sorted?

It looks like we sorted the results after the diff.

Collections.sort(diff, new PositionOrderComparator());

If we are going to sort beforehand, wouldn't it be better to sort the records as they are added to the list?

@romani
Copy link
Member

romani commented Oct 31, 2020

@rnveach , please share comparison of executions, you might use Indentation Check to make sure that reports are huge in diff.

@pbludov
Copy link
Member

pbludov commented Oct 31, 2020

Please take a look to my PR #521

@rnveach rnveach closed this Nov 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants