
Parallel.js is slow due to unnecessary per-job overhead #189

Open
benbucksch opened this issue May 14, 2020 · 2 comments

benbucksch (Contributor) commented May 14, 2020

There seems to be a significant overhead per job, not just per worker.

My task is to run the Levenshtein algorithm, comparing a given string with about 200,000 target strings and finding the best match. This is ideal for parallelism: it runs the exact same algorithm thousands of times, the algorithm is small, and each call needs little data and no context. Without parallelism, it takes long enough to be noticeable for the end user.
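For context, the per-job function is tiny. Below is a minimal sketch of the classic dynamic-programming Levenshtein distance; my actual code uses a library function, so this is only an illustration of the kind of small, context-free work each job does:

```javascript
// Minimal single-row dynamic-programming Levenshtein distance (illustrative).
function levenshtein(a, b) {
  // dp[j] holds the edit distance between the processed prefix of a
  // and b.slice(0, j).
  const dp = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let prevDiag = dp[0]; // value of dp[i-1][j-1] in the full matrix
    dp[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = dp[j];
      dp[j] = Math.min(
        dp[j] + 1,     // deletion
        dp[j - 1] + 1, // insertion
        prevDiag + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
      prevDiag = tmp;
    }
  }
  return dp[b.length];
}

console.log(levenshtein("kitten", "sitting")); // 3
```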

So, I've used paralleljs to run it on multiple CPU cores at the same time. However, I find that using paralleljs actually makes it slower than the synchronous version.

Environment:

  • node 13.2
  • Ubuntu Linux 20.04
  • CPU Ryzen 7 1700X, 8 cores with SMT, i.e. 16 virtual cores

Terminology:

  • worker: Maps to an OS-level thread or process and runs on a separate CPU core, in parallel with the other workers.
  • job: Running the function once for one of the data values.

I understand the purpose of Parallel.js to be creating the workers and distributing the jobs to them efficiently, with a comfortable, easy-to-use API.

Test results

Synchronous

No parallel.js used; all evaluations run in a single JS thread

664 ms with 88959 (longer) values
358 ms with 89548 (shorter) values
93 ms with 31552 values
10 ms with 2994 values
0 ms with 79 values
1 ms with 137 values
1 ms with 823 values
0 ms with 360 values

This is too slow, and it is what we want to improve using parallel.js.

Using paralleljs with default options

Same data, with maxWorkers = 16 (= number of SMT virtual CPU cores)

2067 ms with 88959 values
2036 ms with 89548 values
907 ms with 31552 values
332 ms with 2994 values
272 ms with 79 values
287 ms with 137 values
287 ms with 823 values
280 ms with 360 values

It's 300 times slower for small amounts of data. This might be the overhead of creating worker threads.

It's even 3-6 times as slow for large amounts of data. This is surprising, because it cannot be the overhead of creating the worker threads: they should be limited to 16 and re-used for additional data. I would therefore expect at worst the time of the synchronous execution, plus the ~280 ms overhead of creating the workers, plus a small overhead for copying the data. That would be around 700 ms for the second test, but it actually takes about 2000 ms, almost 3 times as long.

Interestingly, the first test (longer strings) and the second test (shorter strings) take about the same time with parallel.js, while they differ by almost 2x in the synchronous execution. So the slowdown isn't due to pure data copy time. (See also the note on the last test below.) This shows that the amount of data has less influence than other factors.

This indicates that something in parallel.js might be very inefficient. There is significant overhead per job: e.g. workers are not re-used but re-created for each job, or there is some other large per-job overhead.

with maxWorkers = 8

(= number of physical CPU cores without SMT)

2002 ms with 88959 (longer) values
1782 ms with 89548 (shorter) values
778 ms with 31552 values
198 ms with 2994 values
156 ms with 79 values
166 ms with 137 values
173 ms with 823 values
153 ms with 360 values

Split up array

Starting only as many jobs as CPU cores, by manually splitting up the value array before running Parallel

593ms with 88959 values
526ms with 89548 values
357ms with 31552 values
304ms with 2994 values
285ms with 79 values
285ms with 137 values
290ms with 823 values
284ms with 360 values

Here, I first manually split the data array into one sub-array per CPU core, giving a nested array with 16 elements, each an array with thousands of values. This feeds one large array to each worker, one per CPU core. Then, within (!) the worker, I loop synchronously over the second-level array and its thousands of values.

This tests whether parallel.js distributes the work inefficiently, e.g. by creating too many workers, or something with a similar effect. That appears to be the case.

I would expect Parallel.js to create one worker per maxWorkers slot, then loop synchronously over the data and run the function on each value. If it did, it should reach speeds very similar to these. Given that Parallel.js is about 4 times slower, there is clearly something to improve.

Split up array, with 8 workers

578ms with 88959 values
457ms with 89548 values
228ms with 31552 values
167ms with 2994 values
142ms with 79 values
147ms with 137 values
155ms with 823 values
149ms with 360 values

This comes closest to the synchronous version.

1 Worker, 1 Job

This is the same as the last 2 cases, just with cores = 1. This means there is no parallelism and the entire data array is passed into a single worker. This is effectively the same as the synchronous version without any parallel.js, except that it runs in a single worker created by Parallel.js.

1178ms with 88959 values
838ms with 89548 values
316ms with 31552 values
83ms with 2994 values
48ms with 79 values
45ms with 137 values
60ms with 823 values
52ms with 360 values

This should be slightly slower than the synchronous version, due to the overhead of creating the worker, and a lot slower than the parallel runs, due to only 1 CPU core being used. However, it's a lot faster than the non-split parallel version, which should not be the case.

This also shows that the data copy, which still has to happen here, is a lesser factor: this run is much faster than the non-split version, so the data copy is not the primary reason for the slowness.

Conclusion

Using Parallel.js in this case is dramatically slower than the synchronous version, by 1 or 2 orders of magnitude.

Inherent thread creation overhead

There is a certain overhead of using Parallel.js, which is very significant for small inputs: hundreds of times slower, i.e. 2 orders of magnitude slower. That is due to the fundamental cost of creating threads or processes.

This could be worked around by the caller, by checking the size of the input and avoiding Parallel.js in these cases. The factors to consider are: the amount of input data, the execution time per function call, and the cost of thread creation.
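As a sketch, such a caller-side check could look like the following; the threshold value and the `runParallel` callback are hypothetical, and the right cutoff depends on the measured worker startup cost (~280 ms in my tests) versus the per-call time:

```javascript
// Hypothetical helper: only parallelize when the input is large enough
// that worker startup pays for itself. The threshold is a placeholder to tune.
const PARALLEL_THRESHOLD = 10000;

function mapMaybeParallel(values, fn, runParallel) {
  if (values.length < PARALLEL_THRESHOLD) {
    // Small input: a synchronous loop beats the worker creation overhead.
    return Promise.resolve(values.map(fn));
  }
  // Large input: hand off to the parallel implementation (e.g. Parallel.js).
  return runParallel(values, fn);
}
```

Both branches return a promise, so the caller does not need to know which path was taken.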

Data copy overhead

There is also an overhead for copying the data. However, the results show that it is not the primary reason for the slowdown.

Needless overhead

There is a huge overhead per job, not just per worker.

The results suggest that Parallel.js creates more workers than maxWorkers, or has other significant overheads per job with a similar effect.

There should be no overhead per job, only per worker and for the data transfer.

It would be good to look into this.

benbucksch (Author) commented:
Test code

This is a reduced test case. It's not the exact code I ran, but similar to what I tested with.

Synchronous

import leven from 'didyoumean2';
let startTime = new Date();
let input = "What is going on";
let matches = ["I don't know", ... ]; // thousands of strings with lengths around 10 to 40 chars.
let results = matches.map(match => leven(input, match));
results = results.filter(result => result.score <= 0.1).sort((a, b) => a.score - b.score);
console.log((new Date() - startTime) + "ms with", matches.length, "values");

Using paralleljs

import Parallel from 'paralleljs';
import leven from 'didyoumean2';
let startTime = new Date();
let input = "What is going on";
let matches = ["I don't know", ... ]; // thousands of strings with lengths around 10 to 40 chars.
let parallel = new Parallel(matches, { env: { input } })
let results = await parallel
  .require(leven)
  .map(match => leven(global.env.input, match));
results = results.filter(result => result.score <= 0.1).sort((a, b) => a.score - b.score);
console.log((new Date() - startTime) + "ms with", matches.length, "values");

Split-up the array

import Parallel from 'paralleljs';
import leven from 'didyoumean2';
let startTime = new Date();
let input = "What is going on";
let matches = ["I don't know", ... ]; // thousands of strings with lengths around 10 to 40 chars.
const cores = 16;
let lengthPerCore = Math.ceil(matches.length / cores);
let worksplit = [];
for (let i = 0; i < cores; i++) {
  worksplit.push(matches.slice(i * lengthPerCore, i * lengthPerCore + lengthPerCore));
}
let parallel = new Parallel(worksplit, { env: { input }, maxWorkers: cores });
let results = await parallel
  .require(leven)
  .map(valuesSplit =>
    valuesSplit.map(match => leven(global.env.input, match)));
results = results.flat(1)
results = results.filter(result => result.score <= 0.1).sort((a, b) => a.score - b.score);
console.log((new Date() - startTime) + "ms with", matches.length, "values");

benbucksch changed the title from "Parallel.js is slow" to "Parallel.js is slow due to unnecessary per-job overhead" on May 14, 2020

benbucksch commented May 14, 2020

Cause

Looking at the Parallel.js code, it does seem to try to re-use workers in parallel.map(). It's possible that bugs prevent the reuse of workers, but I'll presume that this is not the case.

The message passing between the controlling process and the spawned worker processes might be slow. I can see in the code that the data is passed to the worker value by value, and returned the same way, one job at a time. That's about 180,000 messages for the first test case. It is quite likely this constant back and forth between the processes that makes it so slow; it would match and explain the numbers I measured.

Fix

The fix is relatively simple: pass all values into a worker at once, i.e. do what I did with the split-up code above. That gives a 3x to 4x speedup for workloads with 100,000 calls.

Edge case

There might be special cases with few jobs where the running time varies dramatically between jobs. For such cases, the current one-by-one design is better. So, you could switch based on the number of jobs/values, e.g. at 100, where the back-and-forth overhead becomes significant and differences between job execution times likely level out.
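A sketch of such a switch; the cutoff of 100 is the suggested value from above, and `chunkFor` is a hypothetical helper that splits the input the same way my split-up test code does:

```javascript
// Hypothetical scheduling choice: few jobs -> dispatch one by one (better
// load balancing for uneven jobs); many jobs -> one chunk per worker.
const ONE_BY_ONE_CUTOFF = 100;

function chunkFor(values, workers) {
  if (values.length <= ONE_BY_ONE_CUTOFF) {
    // One job per message, as Parallel.js does today.
    return values.map(v => [v]);
  }
  // Pre-split into at most `workers` chunks to amortize messaging cost.
  const per = Math.ceil(values.length / workers);
  const chunks = [];
  for (let i = 0; i < values.length; i += per) {
    chunks.push(values.slice(i, i + per));
  }
  return chunks;
}
```

Each chunk then becomes a single message to a worker, which loops over it synchronously.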
