
General speedup #78

Closed
rwest opened this issue Jun 11, 2012 · 15 comments
Labels: abandoned, stale, Status: Stale (out-of-date and may no longer be applicable), Topic: Performance

Comments

@rwest (Member) commented Jun 11, 2012

This "issue" is to collect thoughts and efforts to speed up RMG-Py.

@rwest (Member, Author) commented Jun 11, 2012

I just ran the examples/rmg/methylformate case under the profiler. It ran for ~17 hours on my iMac before crashing (with issue #77).
Profiling data is at https://gist.github.com/2910440 but in summary,
Profiling graph
Running time: 17 hours (61,955 sec)
68% (41,975 sec) model updateUnimolecularReactionNetworks (pressure dependent stuff)
23% (14,360 sec) solver.base.ReactionSystem.simulate()

Looking into the first of these:

Function                                                           called...
                                                                       ncalls  tottime  cumtime
model.py:1250(updateUnimolecularReactionNetworks)                  ->      11    0.000    0.002  __init__.py:1584(warning)
                                                                          225    0.001    0.056  __init__.py:1594(info)
                                                                           44    0.000    0.000  __init__.py:1602(debug)
                                                                           65    0.004    0.004  pdep.py:231(merge)
                                                                      3289804   40.901 1361.483  pdep.py:352(update)
                                                                    214354576   41.754   41.754  {isinstance}
                                                                    43678774188 2486.316 2486.316  {len}
                                                                          280    0.002    0.002  {method 'format' of 'str' objects}
                                                                         1172    0.011    0.011  {method 'getEquilibriumConstant' of 'rmgpy.reaction.Reaction' objects}
                                                                         2344    0.165    0.165  {method 'getRateCoefficient' of 'rmgpy.reaction.Reaction' objects}
                                                                          282    0.000    0.000  {method 'insert' of 'list' objects}
                                                                          933    1.681    1.681  {method 'remove' of 'list' objects}
                                                                          160    0.000    0.000  {sum}

The actual updating is done in pdep.py:352(update), which accounts for only 2% of total time (1,361 seconds). The rest is presumably the loops that remove reverse reactions from the core (len, for example, is called 43,678,774,188 times!). This seems like an area worth optimizing.
I wonder if there's an entirely different way to do this, or else if we can cythonize it.
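For reference, per-callee tables like the one above can be reproduced with the stdlib cProfile/pstats modules. The function below is a toy stand-in, not RMG code:

```python
import cProfile
import io
import pstats

def update_networks(items):
    """Toy stand-in for updateUnimolecularReactionNetworks."""
    total = 0
    for item in items:
        # len() in a hot loop is exactly the kind of call that racks up
        # billions of hits in the profile above
        total += len(item)
    return total

profiler = cProfile.Profile()
profiler.enable()
update_networks([[1, 2], [3]] * 1000)
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_callees("update_networks")  # per-callee ncalls/tottime/cumtime table
print(stream.getvalue())
```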

@jwallen (Contributor) commented Jun 12, 2012

I've tried to make updateUnimolecularReactionNetworks() go faster by reducing both the merging of partial pdep networks and the removal of reverse pdep reactions to O(N) instead of O(N^2). It's hard to tell which of these was causing it to be slow - maybe both? - so I tried to address each of them. Either way I've removed many of the calls to len() so hopefully it won't be called 43 billion times any more. I've tested the changes on the 1,3-hexadiene model, and the generated model doesn't appear to have changed.

See my new pdep branch if you want to try it.
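The flavour of that change, replacing repeated list.remove() scans with a single set-based pass, can be sketched as follows (illustrative only; the real code operates on pdep networks and reactions):

```python
def remove_reactions_quadratic(core, to_remove):
    # O(N*M): each 'in' test and each remove() rescans the list
    for rxn in to_remove:
        if rxn in core:
            core.remove(rxn)
    return core

def remove_reactions_linear(core, to_remove):
    # O(N + M): build a set once, then filter in a single pass
    doomed = set(to_remove)
    return [rxn for rxn in core if rxn not in doomed]

core = list(range(8))
assert remove_reactions_linear(list(core), [2, 5]) == \
       remove_reactions_quadratic(list(core), [2, 5]) == [0, 1, 3, 4, 6, 7]
```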

@rwest (Member, Author) commented Jun 12, 2012

Excellent. Look forward to trying it.

@rwest (Member, Author) commented Jun 14, 2012

I ran it again on your pdep branch.
Both runs crashed in exactly the same way (issue #77 on network 41,809), suggesting the models are the same, but whereas the first took
47,278,589,233 function calls (47,165,643,959 primitive calls) in 61,955.711 seconds   (17h) 
the new one took
3,411,174,244 function calls (3,298,217,272 primitive calls) in 20,996.773 seconds (6h).
That's a 3-fold speed up overall! :-)

I suggest we merge that branch to master and look for some more fruit to pick!

New profile data is here
New profile

rwest added a commit that referenced this issue Jun 14, 2012
This gives a three-fold speed up in the methylformate test case!
See issue #78 for details.
@jwallen (Contributor) commented Jun 14, 2012

Excellent. Looks like "pdep is slow" may no longer be true. Not bad for ~2 hours of work.

The most obvious thing in the current profile is the solver. I'm starting to consider making this a pure Fortran code (very much like dasslAUTO in RMG-Java) and wrapping it with f2py or similar. The thinking is that any Python overhead becomes magnified a lot in the ODE solve because of all the function calls required. An ancillary benefit is that we could build DASSL right into the solver and thereby eliminate the separate PyDAS requirement. One downside is that this would be a step away from using something like Cantera, and could make it harder to add multiple reactor types in the future if that is ever something we want. I don't think the current approach would scale well to multiple reactor types anyway.

There's probably more low-hanging fruit in the model enlargement code, but since all the boxes are currently blue it would be hard to see with this example until we make the solver faster. Parallelization is probably not too far away, but since we could in principle do that to RMG-Java as well, it probably makes sense to do a bit more single-process optimization first.

@rwest (Member, Author) commented Jun 14, 2012

Perhaps now is a good time to reconsider using Cantera, as we did in early versions of RMG-Py? I think Ray has done a lot of work on it recently; perhaps ask him what he thinks.

Parallelizing the ODE solves should be relatively simple, as each reaction system is independent. Getting a generally robust kinetic model requires many reaction systems, so this would be a decent help.
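Since the reaction systems are independent, the solves could in principle be distributed with the stdlib multiprocessing module. A minimal sketch; simulate() here is a toy explicit-Euler stand-in, not the real ReactionSystem.simulate():

```python
from multiprocessing import Pool

def simulate(system):
    """Toy stand-in: integrate dC/dt = -k*C with explicit Euler."""
    rate, t_end, dt = system
    conc, t = 1.0, 0.0
    while t < t_end:
        conc -= rate * conc * dt
        t += dt
    return conc

if __name__ == "__main__":
    # one independent "reaction system" per tuple: (rate, t_end, dt)
    systems = [(0.1, 1.0, 1e-4), (0.2, 1.0, 1e-4), (0.3, 1.0, 1e-4)]
    with Pool() as pool:
        results = pool.map(simulate, systems)  # one solve per worker process
    print(results)
```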

@jwallen (Contributor) commented Aug 9, 2012

I recently ran a Python vs. Java speed comparison for the graph isomorphism algorithm. To do this I ported the current isomorphism algorithm from Python/Cython to Java. The speed test involves 3000 randomly-generated graphs of the same size, with no semantic information. The timing results (fastest of 4 runs in each case):

12-vertex graphs:
    Java:              9000000 graph isomorphisms in 6.3 s
    Cython, .pyx file: 9000000 graph isomorphisms in 7.1 s
    Cython, .py file:  9000000 graph isomorphisms in 10.6 s
    Python:            9000000 graph isomorphisms in 402 s
24-vertex graphs:
    Java:              9000000 graph isomorphisms in 7.9 s
    Cython, .pyx file: 9000000 graph isomorphisms in 12.1 s
    Cython, .py file:  9000000 graph isomorphisms in 16.6 s
    Python:            9000000 graph isomorphisms in 695 s

You can immediately see why I was originally excited about Cython, as the speed gain it affords is tremendous (~50x!). Interestingly, there appears to be a noticeable time penalty of ~25% in maintaining a pure Python implementation.
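The "fastest of 4 runs" methodology is what the stdlib timeit module's repeat() is for (toy payload here; the benchmark above timed 9,000,000 isomorphism checks):

```python
import timeit

def payload():
    # stand-in workload for one batch of isomorphism checks
    return sum(i * i for i in range(1000))

# repeat() returns one total time per run; taking the minimum, as in
# "fastest of 4 runs", suppresses scheduler and warm-up noise
times = timeit.repeat(payload, repeat=4, number=100)
print(f"best of 4: {min(times):.4f} s for 100 calls")
```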

Unfortunately, the Java implementation is consistently ~2x faster than the Cython implementation. Note that I designed the new graph isomorphism algorithm to minimize the most expensive calls into the Python/C API, which means that it is basically tailored to making the Cython version as fast as possible. However, when you look at the generated C code from Cython, there are still a lot of places with significant overhead. For example, the code for iterating over the vertices

for vertex in vertices:
   pass

expands to (in rough pseudocode)

if vertices is None
    raise Exception
end if
get a reference to vertices and increment its reference counter
set i = 0
do while true
    set s to size of list
    if i >= s
        break
    end if
    get item at index i in list and increment its reference counter
    if item is None or item is not a Vertex
        raise Exception
    end if
    set vertex = item
    # Here's where you would use vertex for something...
    decrement the reference counter of vertex
end do
decrement the reference counter of vertices

In general we would want all of these extra checks, but in specific cases such as this we know they are not necessary. For instance, the above could be simplified greatly by assuming that the list does not change within the for loop, that all items are valid vertices, and that no list items get deleted within the loop. Cython cannot make these assumptions for us, unfortunately, so we would probably have to hand-write the C (or maybe use typed memoryviews?) to make up the difference between Cython and Java. I don't think this would need to be done much, just in the absolutely critical bits of code.

A summary of some thoughts:

  • Cython code is almost as easy to write as Python, and almost as fast as Java and C/C++. It's way easier than custom-writing C/C++.
  • We should at least consider using .pyx files and eliminating pure Python support. Even in everyday use I think that the Cython code is so much faster that running in pure Python mode is rare, especially now that a lot of that code is relatively stable. Furthermore, we need .pyx files for efficient array indexing (e.g. in the pressure dependence code) anyway.
  • To mitigate compilation time, we should split the Cython code into more, smaller modules (as is commonly done in compiled languages).
  • Most importantly, in the RMG-Java vs. RMG-Py battle royale (two implementations enter, one implementation leaves!), I don't think speed should be the deciding issue, because I think we can still make up the difference between Python/Cython and Java.

P.S. For the record, I am running Python 2.7.3, Cython 0.16, and Java 1.7 on a 64-bit machine running Arch Linux.

@rwest (Member, Author) commented Apr 22, 2013

Latest profiling data for a ~8 hour run on methylformate in the neighbourhood of aeb076a.
18% of time spent checking duplicate reactions when writing chemkin output!
(29% running simulations)
Full data here
RMG profile dot

It reached:

The model core has 152 species and 7407 reactions
The model edge has 34644 species and 73022 reactions

Updating RMG execution statistics...:

Execution time (DD:HH:MM:SS): 00:07:38:46
Memory used: 3468.67 MB
PBS mem: 3848352kb
PBS vmem: 4547860kb
PBS cput: 07:27:24
PBS walltime: 07:37:56

@rwest (Member, Author) commented Apr 22, 2013

Hopefully 44f742b speeds up the chemkin duplicate checking.

I can't see how running the simulations can be trivially parallelized without multiplying the memory requirements by the number of processors (currently each simulation needs the entire core and edge), and as we're currently constrained almost as much by memory as by CPU, that's a problem. I have some ideas, but they'll take a little more effort.

The Chemkin and HTML outputting is still a little slow, but we're already talking single-digit percentage improvements.

rwest added a commit that referenced this issue May 2, 2013
Previously, every time we wrote a chemkin file (twice per iteration) we checked
every reaction in it to see if it was a duplicate, in a method that scaled
as N^2 and ended up taking almost a third of a 7-hour RMG-Py run (see issue #78).

This commit changes it so that reactions being added to the core in an enlarge step
are checked for duplicate status at that time, and the entire set is checked after
adding a seed mechanism or reaction library. This way we should be sure to catch
all the duplicates, but only need to check new reactions once at each iteration, 
instead of *all* reactions twice per iteration.

The saveChemkinFile method can now be called without checking for duplicates.
Hopefully this speeds things up.

Unless called with the checkForDuplicates=False option, saveChemkinFile
still does the check, so other utilities (e.g. for merging reaction sets)
should not be messed up (nor sped up) by this.
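The incremental scheme this commit describes can be sketched as below; the key function is hypothetical (the real check compares species, kinetics types, and so on):

```python
from collections import defaultdict

def reaction_key(rxn):
    # Hypothetical duplicate key: same reactant and product sets
    reactants, products = rxn
    return (frozenset(reactants), frozenset(products))

def mark_new_duplicates(index, new_reactions):
    """Check only newly added reactions against the index: O(new), not O(all^2)."""
    flagged = []
    for rxn in new_reactions:
        key = reaction_key(rxn)
        if index[key]:            # an earlier reaction shares this key
            flagged.append(rxn)
        index[key].append(rxn)
    return flagged

index = defaultdict(list)
assert mark_new_duplicates(index, [(("A",), ("B",)), (("C",), ("D",))]) == []
assert mark_new_duplicates(index, [(("A",), ("B",))]) == [(("A",), ("B",))]
```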
@rwest (Member, Author) commented May 3, 2013

Now able to fill up my iMac's memory within 2.5 hours.
A methyl-butanoate job with PDep and 4 reaction systems.
rmg profile dot

Now only 5% saving things (Try pip install MarkupSafe to speed up the Jinja2 rendering!).

26% reaction simulations
22% PDep networks
18% thermo data estimation (without QM)
15% generating reactions.

@rwest (Member, Author) commented Aug 14, 2013

NB. profiling tutorial for cython http://docs.cython.org/src/tutorial/profiling_tutorial.html

@connie (Member) commented Aug 16, 2013

Job for Cineole finally completed. This job took 3 days (after going from a restart, so we're cheating a bit with the saved QM thermo files.)

Final model enlargement:
The model core has 218 species and 1522 reactions
The model edge has 12135 species and 807649 reactions

file

It's a bit surprising that we're spending quite a bit of time (20%) saving the pickle file. This is troubling: can we make it faster?
We are also spending 36% copying molecule objects.

@nyee (Contributor) commented Aug 17, 2013

Do we know how fast Java would be for the Cineole job? Based on my experience with JP10, I would expect less than a day, maybe even less than 6 hours. It might be good to run it and see.

@rwest (Member, Author) commented Aug 18, 2013

If you can't make things faster, you can try doing them less often. Saving the restart file is a good example of this type of "optimization". It may be possible to cythonize some of the pickle helper functions, and/or to save fewer derived properties and recalculate more when unpickling (which we do less often).

I had a look at Molecule.copy(deep=True) the other day. It seemed pretty well cythonized, so there's not a lot of low-hanging fruit there. We may be able to squeeze a little out, or try to think of ways to avoid needing it at all.
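The "save fewer derived properties, recalculate on unpickling" idea maps onto the pickle protocol's __getstate__/__setstate__ hooks. A minimal sketch with a toy Molecule class (not the RMG one):

```python
import pickle

class Molecule:
    def __init__(self, atoms):
        self.atoms = atoms
        self._fingerprint = self.compute_fingerprint()  # derived property

    def compute_fingerprint(self):
        # stand-in for an expensive derived property
        return hash(tuple(sorted(self.atoms)))

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["_fingerprint"]  # keep derived data out of the restart file
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._fingerprint = self.compute_fingerprint()  # rebuild on load

m = Molecule(["C", "H", "H", "H", "H"])
m2 = pickle.loads(pickle.dumps(m))
assert m2.atoms == m.atoms and m2._fingerprint == m._fingerprint
```

This trades a smaller, faster pickle for a little extra work at load time, which fits here because we unpickle far less often than we save.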

@pierrelb added the 'Status: Stale' label Aug 28, 2015
@github-actions (bot) commented Jun 22, 2023
This issue is being automatically marked as stale because it has not received any interaction in the last 90 days. Please leave a comment if this is still a relevant issue, otherwise it will automatically be closed in 30 days.

@github-actions bot added the 'stale' label Jun 22, 2023
@JacksonBurns added the 'abandoned' label Jun 22, 2023
6 participants