Support shared backing store #99

Open
jcharum opened this issue Jul 27, 2020 · 0 comments
Labels
enhancement New feature or request

Comments

Contributor

jcharum commented Jul 27, 2020

Bigslice workers currently store their task outputs locally. Other workers then read these outputs, when needed, over direct connections between machines.

When machines are especially flaky, e.g. under high spot market contention in EC2, progress on a computation can grind to a halt: machines are lost frequently enough that a large portion of time is spent recomputing lost results.

Workers could instead write to a more durable shared backing store. If workers are lost, their results would remain available. This would allow computations to always make forward progress at the cost of extra (read: slow) data transfer.

There is already a nod to an implementation in the code, but there's work to be done to plumb it through.

Amazon FSx for Lustre may be a good option, as it's basically designed for this sort of use case:

> The open source Lustre file system is designed for applications that require fast storage – where you want your storage to keep up with your compute. Lustre was built to quickly and cost effectively process the fastest-growing data sets in the world, and it’s the most widely used file system for the 500 fastest computers in the world. It provides sub-millisecond latencies, up to hundreds of gigabytes per second of throughput, and millions of IOPS.

We could also implement something like an asynchronous copy to a shared backing store: readers would prefer worker-to-worker transfer first, falling back to the shared backing store if the machine that produced the output is no longer available.
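
To make the fallback idea concrete, here is a minimal sketch in Go. Every name in it (`Source`, `Worker`, `SharedStore`, `Commit`, `ReadWithFallback`, `ErrUnavailable`) is hypothetical and not part of Bigslice's API, and the in-memory `SharedStore` only stands in for a durable store such as FSx for Lustre. It assumes a task output fits in a byte slice, which a real implementation would likely replace with streaming reads.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrUnavailable (hypothetical) reports that a source cannot serve an output.
var ErrUnavailable = errors.New("task output unavailable")

// Source (hypothetical) is anything that can serve a task's output.
type Source interface {
	Read(task string) ([]byte, error)
}

// Worker keeps task outputs locally, as workers do today.
type Worker struct {
	mu      sync.Mutex
	alive   bool
	outputs map[string][]byte
}

// SharedStore is an in-memory stand-in for a durable shared backing store.
type SharedStore struct {
	mu      sync.Mutex
	outputs map[string][]byte
}

func NewWorker() *Worker {
	return &Worker{alive: true, outputs: map[string][]byte{}}
}

func NewSharedStore() *SharedStore {
	return &SharedStore{outputs: map[string][]byte{}}
}

// Read serves an output from local storage if the worker is still alive.
func (w *Worker) Read(task string) ([]byte, error) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if !w.alive {
		return nil, fmt.Errorf("worker lost: %w", ErrUnavailable)
	}
	data, ok := w.outputs[task]
	if !ok {
		return nil, fmt.Errorf("worker has no output for %q: %w", task, ErrUnavailable)
	}
	return data, nil
}

// Kill simulates losing the machine, e.g. a reclaimed spot instance.
func (w *Worker) Kill() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.alive = false
}

func (s *SharedStore) Write(task string, data []byte) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.outputs[task] = data
}

func (s *SharedStore) Read(task string) ([]byte, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	data, ok := s.outputs[task]
	if !ok {
		return nil, fmt.Errorf("store has no output for %q: %w", task, ErrUnavailable)
	}
	return data, nil
}

// Commit stores the output locally, then copies it to the shared store in
// the background. The returned channel is closed once the copy has finished.
func (w *Worker) Commit(task string, data []byte, store *SharedStore) <-chan struct{} {
	w.mu.Lock()
	w.outputs[task] = data
	w.mu.Unlock()
	done := make(chan struct{})
	go func() {
		defer close(done)
		store.Write(task, data)
	}()
	return done
}

// ReadWithFallback tries each source in order and returns the first
// successful read, or the last error if every source fails.
func ReadWithFallback(task string, sources ...Source) ([]byte, error) {
	err := ErrUnavailable
	for _, src := range sources {
		data, rerr := src.Read(task)
		if rerr == nil {
			return data, nil
		}
		err = rerr
	}
	return nil, err
}
```

A reader would call `ReadWithFallback(task, worker, store)`, so the fast worker-to-worker path is used whenever the producing machine is still up and the slower shared store is only touched after a loss. Because the copy is asynchronous, an output whose worker dies before the copy completes would still need to be recomputed; whether that window matters in practice is one of the things worth benchmarking.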

It would be good to benchmark various approaches.

jcharum added the enhancement label Jul 27, 2020