
General speedup #78

Closed
rwest opened this issue Jun 11, 2012 · 15 comments
Labels: abandoned, stale, Status: Stale (out-of-date and may no longer be applicable), Topic: Performance

Comments

@rwest (Member) commented Jun 11, 2012

This "issue" is to collect thoughts and efforts to speed up RMG-Py.

@rwest (Member, Author) commented Jun 11, 2012

I just ran the examples/rmg/methylformate case under the profiler. It ran for ~17 hours on my iMac before crashing (with issue #77).
Profiling data is at https://gist.github.com/2910440 but in summary,
Profiling graph
Running time: 17 hours (61,955 sec)
68% (41,975 sec) model updateUnimolecularReactionNetworks (pressure dependent stuff)
23% (14,360 sec) solver.base.ReactionSystem.simulate()

Looking into the first of these:

Function                                                           called...
                                                                       ncalls  tottime  cumtime
model.py:1250(updateUnimolecularReactionNetworks)                  ->      11    0.000    0.002  __init__.py:1584(warning)
                                                                          225    0.001    0.056  __init__.py:1594(info)
                                                                           44    0.000    0.000  __init__.py:1602(debug)
                                                                           65    0.004    0.004  pdep.py:231(merge)
                                                                      3289804   40.901 1361.483  pdep.py:352(update)
                                                                    214354576   41.754   41.754  {isinstance}
                                                                    43678774188 2486.316 2486.316  {len}
                                                                          280    0.002    0.002  {method 'format' of 'str' objects}
                                                                         1172    0.011    0.011  {method 'getEquilibriumConstant' of 'rmgpy.reaction.Reaction' objects}
                                                                         2344    0.165    0.165  {method 'getRateCoefficient' of 'rmgpy.reaction.Reaction' objects}
                                                                          282    0.000    0.000  {method 'insert' of 'list' objects}
                                                                          933    1.681    1.681  {method 'remove' of 'list' objects}
                                                                          160    0.000    0.000  {sum}

The actual updating is done in pdep.py:352(update), which accounts for only 2% of total time (1,361 seconds). The rest is presumably the loops that remove reverse reactions from the core (len, for example, is called 43,678,774,188 times!). This seems like an area worth optimizing.
I wonder if there's an entirely different way to do this, or else if we can cythonize it.
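For reference, per-callee tables like the one above can be reproduced with the stdlib cProfile/pstats modules. The function below is a toy stand-in, not RMG code:

```python
import cProfile
import io
import pstats

def update_networks(items):
    """Toy stand-in for updateUnimolecularReactionNetworks."""
    total = 0
    for item in items:
        # len() in a hot loop is exactly the kind of call that racks up
        # billions of hits in the profile above
        total += len(item)
    return total

profiler = cProfile.Profile()
profiler.enable()
update_networks([[1, 2], [3]] * 1000)
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_callees("update_networks")  # per-callee ncalls/tottime/cumtime table
print(stream.getvalue())
```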

@jwallen (Contributor) commented Jun 12, 2012

I've tried to make updateUnimolecularReactionNetworks() go faster by reducing both the merging of partial pdep networks and the removal of reverse pdep reactions to O(N) instead of O(N^2). It's hard to tell which of these was causing it to be slow - maybe both? - so I tried to address each of them. Either way I've removed many of the calls to len() so hopefully it won't be called 43 billion times any more. I've tested the changes on the 1,3-hexadiene model, and the generated model doesn't appear to have changed.

See my new pdep branch if you want to try it.
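The flavour of that change, replacing repeated list.remove() scans with a single set-based pass, can be sketched as follows (illustrative only; the real code operates on pdep networks and reactions):

```python
def remove_reactions_quadratic(core, to_remove):
    # O(N*M): each 'in' test and each remove() rescans the list
    for rxn in to_remove:
        if rxn in core:
            core.remove(rxn)
    return core

def remove_reactions_linear(core, to_remove):
    # O(N + M): build a set once, then filter in a single pass
    doomed = set(to_remove)
    return [rxn for rxn in core if rxn not in doomed]

core = list(range(8))
assert remove_reactions_linear(list(core), [2, 5]) == \
       remove_reactions_quadratic(list(core), [2, 5]) == [0, 1, 3, 4, 6, 7]
```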

@rwest (Member, Author) commented Jun 12, 2012

Excellent. Look forward to trying it.

@rwest (Member, Author) commented Jun 14, 2012

I ran it again on your pdep branch.
Both runs crashed in exactly the same way (issue #77 on network 41,809), suggesting the models are the same, but whereas the first took
47,278,589,233 function calls (47,165,643,959 primitive calls) in 61,955.711 seconds   (17h) 
the new one took
3,411,174,244 function calls (3,298,217,272 primitive calls) in 20,996.773 seconds (6h).
That's a 3-fold speed up overall! :-)

I suggest we merge that branch to master and look for some more fruit to pick!

New profile data is here
New profile

rwest added a commit that referenced this issue Jun 14, 2012
This gives a three-fold speed up in the methylformate test case!
See issue #78 for details.
@jwallen (Contributor) commented Jun 14, 2012

Excellent. Looks like "pdep is slow" may no longer be true. Not bad for ~2 hours of work.

The most obvious thing in the current profile is the solver. I'm starting to consider making this a pure Fortran code (very much like dasslAUTO in RMG-Java) and wrapping it with f2py or similar. The thinking is that any Python overhead becomes magnified a lot in the ODE solve because of all the function calls required. An ancillary benefit is that we could build DASSL right into the solver and thereby eliminate the separate PyDAS requirement. One downside is that this would be a step away from using something like Cantera, and could make it harder to add multiple reactor types in the future if that is ever something we want. I don't think the current approach would scale well to multiple reactor types anyway.

There's probably more low-hanging fruit in the model enlargement code, but since all the boxes are currently blue it would be hard to see with this example until we make the solver faster. Parallelization is probably not too far away, but since we could in principle do that to RMG-Java as well, it probably makes sense to do a bit more single-process optimization first.

@rwest (Member, Author) commented Jun 14, 2012

Perhaps now is a good time to reconsider using Cantera, as we did in early versions of RMG-Py? I think Ray has done a lot of work on it recently; perhaps ask him what he thinks.

Parallelizing the ODE solves should be relatively simple, as each reaction system is independent. Getting a generally robust kinetic model requires many reaction systems, so this would be a decent help.
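Since the reaction systems are independent, the solves could in principle be distributed with the stdlib multiprocessing module. A minimal sketch; simulate() here is a toy explicit-Euler stand-in, not the real ReactionSystem.simulate():

```python
from multiprocessing import Pool

def simulate(system):
    """Toy stand-in: integrate dC/dt = -k*C with explicit Euler."""
    rate, t_end, dt = system
    conc, t = 1.0, 0.0
    while t < t_end:
        conc -= rate * conc * dt
        t += dt
    return conc

if __name__ == "__main__":
    # one independent "reaction system" per tuple: (rate, t_end, dt)
    systems = [(0.1, 1.0, 1e-4), (0.2, 1.0, 1e-4), (0.3, 1.0, 1e-4)]
    with Pool() as pool:
        results = pool.map(simulate, systems)  # one solve per worker process
    print(results)
```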

@jwallen (Contributor) commented Aug 9, 2012

I recently ran a Python vs. Java speed comparison for the graph isomorphism algorithm. To do this I ported the current isomorphism algorithm from Python/Cython to Java. The speed test involves 3000 randomly-generated graphs of the same size, with no semantic information. The timing results (fastest of 4 runs in each case):

12-vertex graphs:
    Java:              9000000 graph isomorphisms in 6.3 s
    Cython, .pyx file: 9000000 graph isomorphisms in 7.1 s
    Cython, .py file:  9000000 graph isomorphisms in 10.6 s
    Python:            9000000 graph isomorphisms in 402 s
24-vertex graphs:
    Java:              9000000 graph isomorphisms in 7.9 s
    Cython, .pyx file: 9000000 graph isomorphisms in 12.1 s
    Cython, .py file:  9000000 graph isomorphisms in 16.6 s
    Python:            9000000 graph isomorphisms in 695 s

You can immediately see why I was originally excited about Cython, as the speed gain it affords is tremendous (~50x!). Interestingly, there appears to be a noticeable time penalty of ~25% in maintaining a pure Python implementation.
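The "fastest of 4 runs" methodology is what the stdlib timeit module's repeat() is for (toy payload here; the benchmark above timed 9,000,000 isomorphism checks):

```python
import timeit

def payload():
    # stand-in workload for one batch of isomorphism checks
    return sum(i * i for i in range(1000))

# repeat() returns one total time per run; taking the minimum, as in
# "fastest of 4 runs", suppresses scheduler and warm-up noise
times = timeit.repeat(payload, repeat=4, number=100)
print(f"best of 4: {min(times):.4f} s for 100 calls")
```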

Unfortunately, the Java implementation is consistently ~2x faster than the Cython implementation. Note that I designed the new graph isomorphism algorithm to minimize the most expensive calls into the Python/C API, which means that it is basically tailored to making the Cython version as fast as possible. However, when you look at the generated C code from Cython, there are still a lot of places with significant overhead. For example, the code for iterating over the vertices

for vertex in vertices:
   pass

expands to (in rough pseudocode)

if vertices is None
    raise Exception
end if
get a reference to vertices and increment its reference counter
set i = 0
do while true
    set s to size of list
    if i >= s
        break
    end if
    get item at index i in list and increment its reference counter
    if item is None or item is not a Vertex
        raise Exception
    end if
    set vertex = item
    # Here's where you would use vertex for something...
    decrement the reference counter of vertex
end do
decrement the reference counter of vertices

In general we would want all of these extra checks, but in specific cases such as this we know they are not necessary. For instance, the above could be simplified greatly by assuming that the list does not change within the for loop, that all items are valid vertices, and that no list items get deleted within the loop. Cython cannot make these assumptions for us, unfortunately, so we would probably have to hand-write the C (or maybe use typed memoryviews?) to make up the difference between Cython and Java. I don't think this would need to be done much, just in the absolutely critical bits of code.

A summary of some thoughts:

  • Cython code is almost as easy to write as Python, and almost as fast as Java and C/C++. It's way easier than custom-writing C/C++.
  • We should at least consider using .pyx files and eliminating pure Python support. Even in everyday use I think that the Cython code is so much faster that running in pure Python mode is rare, especially now that a lot of that code is relatively stable. Furthermore, we need .pyx files for efficient array indexing (e.g. in the pressure dependence code) anyway.
  • To mitigate compilation time, we should split the Cython code into more, smaller modules (as is commonly done in compiled languages).
  • Most importantly, in the RMG-Java vs. RMG-Py battle royale (two implementations enter, one implementation leaves!), I don't think speed should be the deciding issue, because I think we can still make up the difference between Python/Cython and Java.

P.S. For the record, I am running Python 2.7.3, Cython 0.16, and Java 1.7 on a 64-bit machine running Arch Linux.

@rwest (Member, Author) commented Apr 22, 2013

Latest profiling data for a ~8 hour run on methylformate in the neighbourhood of aeb076a.
18% of time spent checking duplicate reactions when writing chemkin output!
(29% running simulations)
Full data here
RMG profile dot

It reached:

The model core has 152 species and 7407 reactions
The model edge has 34644 species and 73022 reactions

Updating RMG execution statistics...:

Execution time (DD:HH:MM:SS): 00:07:38:46
Memory used: 3468.67 MB
PBS mem: 3848352kb
PBS vmem: 4547860kb
PBS cput: 07:27:24
PBS walltime: 07:37:56

@rwest (Member, Author) commented Apr 22, 2013

Hopefully 44f742b speeds up the chemkin duplicate checking.

I can't see how running the simulations can be trivially parallelized without multiplying the memory requirements by the number of processors (currently each simulation needs the entire core and edge), and as we're currently constrained almost as much by memory as by CPU, that's a problem. I have some ideas, but they'll take a little more effort.

The Chemkin and HTML outputting is still a little slow, but we're already talking single-digit percentage improvements.

rwest added a commit that referenced this issue May 2, 2013
Previously, every time we wrote a chemkin file (twice per iteration) we checked
every reaction in it to see if it was a duplicate, in a method that scaled
as N^2 and ended up taking almost a third of a 7-hour RMG-Py run (see issue #78).

This commit changes it so that reactions being added to the core in an enlarge step
are checked for duplicate status at that time, and the entire set is checked after
adding a seed mechanism or reaction library. This way we should be sure to catch
all the duplicates, but only need to check new reactions once at each iteration, 
instead of *all* reactions twice per iteration.

The saveChemkinFile method can now be called without checking for duplicates.
Hopefully this speeds things up.

Unless called with the checkForDuplicates=False option, saveChemkinFile
still does the check, so other utilities (e.g. for merging reaction sets)
should not be messed up (nor sped up) by this.
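The incremental scheme this commit describes can be sketched as below; the key function is hypothetical (the real check compares species, kinetics types, and so on):

```python
from collections import defaultdict

def reaction_key(rxn):
    # Hypothetical duplicate key: same reactant and product sets
    reactants, products = rxn
    return (frozenset(reactants), frozenset(products))

def mark_new_duplicates(index, new_reactions):
    """Check only newly added reactions against the index: O(new), not O(all^2)."""
    flagged = []
    for rxn in new_reactions:
        key = reaction_key(rxn)
        if index[key]:            # an earlier reaction shares this key
            flagged.append(rxn)
        index[key].append(rxn)
    return flagged

index = defaultdict(list)
assert mark_new_duplicates(index, [(("A",), ("B",)), (("C",), ("D",))]) == []
assert mark_new_duplicates(index, [(("A",), ("B",))]) == [(("A",), ("B",))]
```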
@rwest (Member, Author) commented May 3, 2013

Now able to fill up my iMac's memory within 2.5 hours.
A methyl-butanoate job with PDep and 4 reaction systems.
rmg profile dot

Now only 5% saving things (Try pip install MarkupSafe to speed up the Jinja2 rendering!).

26% reaction simulations
22% PDep networks
18% thermo data estimation (without QM)
15% generating reactions.

@rwest (Member, Author) commented Aug 14, 2013

NB. profiling tutorial for cython http://docs.cython.org/src/tutorial/profiling_tutorial.html

@connie (Member) commented Aug 16, 2013

Job for Cineole finally completed. This job took 3 days (after going from a restart, so we're cheating a bit with the saved QM thermo files.)

Final model enlargement:
The model core has 218 species and 1522 reactions
The model edge has 12135 species and 807649 reactions

file

It's a bit surprising that we're spending quite a bit of time (20%) saving the pickle file. This is troubling: can we make it faster?
We are also spending 36% copying molecule objects.

@nyee (Contributor) commented Aug 17, 2013

Do we know how fast Java would be for the Cineole job? Based on my experience with JP10, I would expect less than a day, maybe even less than 6 hours. It might be good to run it and see.

@rwest (Member, Author) commented Aug 18, 2013

If you can't make things faster, you can try doing them less often. Saving the restart file is a good example of this type of "optimization". It may be possible to cythonize some of the pickle helper functions, and/or to save fewer derived properties and recalculate more when unpickling (which we do less often).

I had a look at Molecule.copy(deep=True) the other day. It seemed pretty well cythonized, so there's not a lot of low-hanging fruit there. We may be able to squeeze a little out, or try to think of ways to avoid needing it at all.
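The "save fewer derived properties, recalculate on unpickling" idea maps onto the pickle protocol's __getstate__/__setstate__ hooks. A minimal sketch with a toy Molecule class (not the RMG one):

```python
import pickle

class Molecule:
    def __init__(self, atoms):
        self.atoms = atoms
        self._fingerprint = self.compute_fingerprint()  # derived property

    def compute_fingerprint(self):
        # stand-in for an expensive derived property
        return hash(tuple(sorted(self.atoms)))

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["_fingerprint"]  # keep derived data out of the restart file
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._fingerprint = self.compute_fingerprint()  # rebuild on load

m = Molecule(["C", "H", "H", "H", "H"])
m2 = pickle.loads(pickle.dumps(m))
assert m2.atoms == m.atoms and m2._fingerprint == m._fingerprint
```

This trades a smaller, faster pickle for a little extra work at load time, which fits here because we unpickle far less often than we save.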

@pierrelb added the 'Status: Stale' label Aug 28, 2015
@github-actions (bot) commented Jun 22, 2023
This issue is being automatically marked as stale because it has not received any interaction in the last 90 days. Please leave a comment if this is still a relevant issue, otherwise it will automatically be closed in 30 days.

@github-actions bot added the 'stale' label Jun 22, 2023
@JacksonBurns added the 'abandoned' label Jun 22, 2023
6 participants