parallelize graph-building #22

rkurchin · 2020-10-19T16:25:58Z

It's a one-time cost for a given dataset, but it's quite slow (~1k/hr) for larger sets, and should be easy to batch out in some reasonably automated way...

thazhemadam · 2021-03-14T19:52:50Z

@rkurchin Do you have something specific in mind, or would something like getting this loop to run on different threads work well enough?

rkurchin · 2021-03-17T14:47:00Z

yeah something like that would be great! I didn't have anything more specific than that in mind, perhaps just adding an argument to specify number of threads (and/or perhaps a flag of whether to parallelize at all and it'll figure out how many there are) to the batch build function...
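For concreteness, here is a minimal sketch of what such an argument could look like. Everything in it is an assumption for illustration: `build_one` is a placeholder for the per-structure graph builder and `build_graphs_batch_sketch` is a hypothetical name, not the package's actual batch build function. Since Julia's thread count is fixed when the session starts (e.g. `julia -t 4` or `JULIA_NUM_THREADS`), the keyword here only controls how many tasks the work is split into and defaults to `Threads.nthreads()`.

```julia
# Hypothetical placeholder for the real per-structure graph builder.
build_one(path::AbstractString) = length(path)

# Hypothetical batch function: `ntasks` caps how many tasks the work is split into.
# With ntasks <= 1 (or no input) it falls back to a plain serial map.
function build_graphs_batch_sketch(paths::Vector{String};
                                   ntasks::Integer = Threads.nthreads())
    (ntasks <= 1 || isempty(paths)) && return map(build_one, paths)

    # Split the inputs into at most `ntasks` contiguous chunks.
    chunk_len = max(1, cld(length(paths), ntasks))
    tasks = [Threads.@spawn map(build_one, chunk)
             for chunk in Iterators.partition(paths, chunk_len)]

    # Fetch in chunk order so the output order matches the input order.
    return reduce(vcat, fetch.(tasks))
end

graphs = build_graphs_batch_sketch(["a.cif", "bb.cif", "ccc.cif"]; ntasks = 2)
```

Passing `ntasks = 1` gives the "don't parallelize at all" behavior without needing a separate flag.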

thazhemadam · 2021-03-19T11:45:31Z

Hmm. I'll try to see how we can implement at least that much (for now).
If successful, I'll also try benchmarking the "before" and "after" (using something like BenchmarkTools.jl) and post those results as well.

thazhemadam · 2021-03-21T12:33:32Z

Running that loop on different threads directly (using Threads.@threads or even Threads.@spawn) causes a segmentation fault. This seems to be caused by this line.
(Some more about PyCall and thread safety can be found here.)

This could mean that we might have to try implementing some kind of synchronous locking mechanism that ensures only one thread can call pyimport_conda() and aseio.read() at a time, and see how that works out. But seeing how many pyimport_conda() calls get made for each build_graph() call, I'm not sure how much of a performance gain we'll get by parallelizing this. If anything, I feel this may just end up making the code unnecessarily complicated and unreadable.
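A rough sketch of the locking idea, under stated assumptions: `read_structure` is a hypothetical stand-in for the Python-backed calls (pyimport_conda() / aseio.read()), and `build_graph_locked` is an illustrative name, not the real build_graph(). A lock like this only serializes the guarded calls. It has not been tested against PyCall here and may not be enough on its own to make calling Python from other threads safe.

```julia
# One global lock guarding every call into the (stand-in) Python layer.
const PY_LOCK = ReentrantLock()

# Hypothetical stand-in for the Python-backed read (e.g. aseio.read).
read_structure(path::AbstractString) = uppercase(path)

function build_graph_locked(path::AbstractString)
    # Only one task at a time may execute the guarded call.
    atoms = lock(PY_LOCK) do
        read_structure(path)
    end
    # Placeholder for the pure-Julia part, which can run concurrently.
    return length(atoms)
end

paths = ["a.cif", "b.cif", "c.cif"]
results = Vector{Int}(undef, length(paths))
Threads.@threads for i in eachindex(paths)
    results[i] = build_graph_locked(paths[i])
end
```

Any speedup would come only from the work done outside the lock, which is the concern raised above about how much of each build_graph() call is spent in Python.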

rkurchin · 2021-03-22T14:54:40Z

Have you tried using Distributed and just a pmap? I've had some success with that when parallelizing this stuff "manually" so maybe something about the way that handles stuff internally plays nicer with PyCall? I'm far from an expert on the internals there, but worth a shot.
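For reference, a minimal sketch of the Distributed + pmap pattern meant here. The worker count and the `build_one` placeholder are assumptions for illustration; in the real case each worker process would have to load PyCall and import the Python modules itself (inside `@everywhere`), which this sketch leaves out.

```julia
using Distributed

# Start two local worker processes (the count is an arbitrary choice).
addprocs(2)

# Define the per-structure function on every process.
# Hypothetical placeholder for the real graph builder.
@everywhere build_one(path::AbstractString) = length(path)

paths = ["a.cif", "bb.cif", "ccc.cif"]

# pmap hands items to the workers and returns results in input order.
graphs = pmap(build_one, paths)

# Shut the workers down once the batch is finished.
rmprocs(workers())
```

Because each worker is a separate process with its own memory, nothing is shared between them, unlike in the Threads-based approach.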

thazhemadam · 2021-03-25T08:29:48Z

Yes I tried that, and it still results in a segmentation fault. FWIW, I tried two variants of the function called in pmap - one called only pyimport_conda(), and the other called only aseio.read() (with aseio declared as a global variable).

rkurchin added the enhancement (New feature or request) label Oct 19, 2020

rkurchin self-assigned this Oct 19, 2020

thazhemadam self-assigned this Mar 19, 2021

rkurchin added this to To do, eventually in ChemistryFeaturization Jun 7, 2021
