Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

yaml writer slows down execution #2499

Open
ChrisBNEU opened this issue Jul 10, 2023 · 10 comments · May be fixed by #2508
Open

yaml writer slows down execution #2499

ChrisBNEU opened this issue Jul 10, 2023 · 10 comments · May be fixed by #2508
Assignees
Labels
bug bug which will never be closed by the actions bot

Comments

@ChrisBNEU
Copy link
Contributor

Bug Description

It appears that the yaml writer uses a lot of overhead during the execution of an rmg run. I ran a modified ethane pyrolysis input file with cProfile. The only thing I changed in rmg was I did not write a yaml file for rms every execution. I did that by commenting out this line in main.py:

self.attach(RMSWriter(self.output_directory))

the difference in execution times were:

  • with RMSWriter listener: 15 min 15 sec
  • without RMSWriter listener: 7min 35 sec

I attached the profiles and the rmg logs below. It is possible this becomes insignificant at longer execution times, I have not tried it. The Profiler graph shows it executing every time the model is enlarged, so it is also possible it gets worse with larger mechanisms.

How To Reproduce

Comment out the line specified above, then run the profiler on the input file I attached below (python-jl rmg.py -p input.py)

Installation Information

Describe your installation method and system information.

  • macOS 10.15.7
  • Installed from source via conda
  • RMG version information:
    • RMG-Py: 3.0.0-1437-gef83a1c0e
    • RMG-database: 3.1.0-616-g75fabcb6c

Additional Context

input file used:
input.py.zip

profiles:
profile_with_listener.pdf
profile_without_listener.pdf

rmg logs:
with_listener.log
without_listener.log

@JacksonBurns
Copy link
Contributor

Thanks for the thorough report @ChrisBNEU. I think a short term solution would be to add a way to optionally disable this writer from the input, but long term would be to use multiprocessing to spawn a task that runs the yaml writing parallel to the simulation itself. We could also look at a different output format that writes faster, like JSON.

@rwest
Copy link
Member

rwest commented Jul 10, 2023

I found this an interesting read. (One of those stack overflow posts where the best answer is in fact the one with the fewest upvotes). "How is it that json serialization is so much faster than yaml serialization in Python?"
I'm sure we could make it much faster with a little effort. Especially as we're outputting the same few things every time and could hard-code the templates. Worth looking into as we switch to Cantera yaml output (#2078)

@mjohnson541
Copy link
Contributor

I think at least a significant part of this is due to the fact that we're generating smiles twice for every species when generating the yaml: https://github.com/ReactionMechanismGenerator/RMG-Py/blob/ef83a1c0e33854aaf749484755cc02a943c1ab01/rmgpy/yml.py#L101C5-L101C5 and https://github.com/ReactionMechanismGenerator/RMG-Py/blob/ef83a1c0e33854aaf749484755cc02a943c1ab01/rmgpy/yml.py#L115C58-L115C58.

I think if we use Molecule.smiles instead this will speed up yaml generation quite significantly.

@ChrisBNEU ChrisBNEU linked a pull request Jul 26, 2023 that will close this issue
@ChrisBNEU ChrisBNEU self-assigned this Jul 26, 2023
@ChrisBNEU ChrisBNEU linked a pull request Jul 26, 2023 that will close this issue
@ChrisBNEU
Copy link
Contributor Author

I linked a pr for this, but it is the temporary fix we mentioned. I just made an input file arg. Should it be linked to this issue or do we want to leave it open/move it to a discussion after it's merged?

JacksonBurns added a commit to ChrisBNEU/RMG-Py that referenced this issue Jul 26, 2023
This should speed up execution time by accessing the property each time instead of regenerating it, see : ReactionMechanismGenerator#2499 (comment)
@JacksonBurns
Copy link
Contributor

I linked a pr for this, but it is the temporary fix we mentioned. I just made an input file arg. Should it be linked to this issue or do we want to leave it open/move it to a discussion after it's merged?

I think with a191fbd we can allow #2508 to close this (assuming that the patch works)

ChrisBNEU pushed a commit to ChrisBNEU/RMG-Py that referenced this issue Jul 27, 2023
This should speed up execution time by accessing the property each time instead of regenerating it, see : ReactionMechanismGenerator#2499 (comment)
@ChrisBNEU
Copy link
Contributor Author

I profiled the same run, with the updated code in yml.py that Jackson added. I think the use of molecule.smiles change helps, the % of the overhead time used by write_yml went from 46% to 30%. Unfortunately, it still appears that the yaml write takes a lot longer than writing 1 file at the end. I included the profiles and logs if people are curious. I think with #2508 we are covered, if people want it to run quicker they can just disable the yaml write for now, and people can refer back to this issue if they want to make the yaml write faster.

Profile_with_yaml_each_iter.pdf
profile_yaml_at_end.pdf

log_with_yaml_each_iter.log
log_yaml_at_end.log

@mjohnson541
Copy link
Contributor

Okay, I took a little bit closer look at the logs. The edge in this ethane pyrolysis system is incredibly small (since ethane is a small molecule) with edge core ratios of only ~3 for species and ~1.5 for reactions. I think this significantly reduces the reaction generation costs which is putting them on the same order as the simulation and the yaml file generation as this stage in mechanism generation.

I'm still surprised to see yaml still costs us so much to generate, but my suspicion is that this won't matter in performance critical runs as simulation costs and reaction generation costs will be much higher in those cases while the yaml generation will scale linearly with core size. I wonder if there's a clever way instead of generating an entirely new yaml just copy the last yaml file and "append" only the new species/reactions.

@JacksonBurns
Copy link
Contributor

We can cache the results of obj_to_dict which should speed things up - going to make another quick commit.

JacksonBurns added a commit to ChrisBNEU/RMG-Py that referenced this issue Jul 27, 2023
after calling `obj_to_dict` with a species once, the result is saved in a lookup dictionary so that subsequent calls are 'instant'. This costs memory but for an expensive function like this that inherently ends up calling on the same input a lot, it should be worth it.

related comment: ReactionMechanismGenerator#2499 (comment)
ChrisBNEU pushed a commit to ChrisBNEU/RMG-Py that referenced this issue Aug 16, 2023
This should speed up execution time by accessing the property each time instead of regenerating it, see : ReactionMechanismGenerator#2499 (comment)
ChrisBNEU pushed a commit to ChrisBNEU/RMG-Py that referenced this issue Aug 16, 2023
This should speed up execution time by accessing the property each time instead of regenerating it, see : ReactionMechanismGenerator#2499 (comment)
@github-actions
Copy link

This issue is being automatically marked as stale because it has not received any interaction in the last 90 days. Please leave a comment if this is still a relevant issue, otherwise it will automatically be closed in 30 days.

@github-actions github-actions bot added the stale stale issue/PR as determined by actions bot label Oct 26, 2023
@JacksonBurns JacksonBurns added bug bug which will never be closed by the actions bot and removed stale stale issue/PR as determined by actions bot labels Oct 26, 2023
@JacksonBurns
Copy link
Contributor

Given the significance of the slowdown and the open PR which (nearly) fixes it, I'm going to label this as a bug.

ChrisBNEU pushed a commit to ChrisBNEU/RMG-Py that referenced this issue Nov 6, 2023
This should speed up execution time by accessing the property each time instead of regenerating it, see : ReactionMechanismGenerator#2499 (comment)
ChrisBNEU pushed a commit to ChrisBNEU/RMG-Py that referenced this issue Mar 21, 2024
This should speed up execution time by accessing the property each time instead of regenerating it, see : ReactionMechanismGenerator#2499 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug bug which will never be closed by the actions bot
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants