
gmxapi.commandline_operation seems to have memory leaks #274

Open
wehs7661 opened this issue Apr 27, 2023 · 3 comments

wehs7661 commented Apr 27, 2023

To test the assumption that gmxapi.commandline_operation has memory leaks, I performed the following two tests:

  • Test A: Use os.system to run the GROMACS grompp command 20000 times to generate 20000 tpr files. (See Test_A.py below.)
  • Test B: Use gmxapi.commandline_operation to run GROMACS grompp commands to generate 20000 tpr files. (See Test_B.py below.)

Each of the 20000 iterations in each test was timed. The executions of both Test_A.py and Test_B.py were memory-profiled with the mprof run command (e.g. mprof run python Test_A.py) provided by memory-profiler, which sampled the memory usage every 0.1 seconds. Below I plotted the wall time per tpr generation against the number of grompp commands executed (left) and the memory usage as a function of time (right).
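
For reference, the left panel can be reproduced from the saved timings roughly as follows (a sketch, assuming matplotlib; t_list.npy is the file written by the scripts in the next comment):

import numpy as np
import matplotlib.pyplot as plt

t_list = np.load('t_list.npy')                  # per-iteration wall times saved by the test script
plt.plot(np.arange(len(t_list)) + 1, t_list)
plt.xlabel('Number of grompp commands executed')
plt.ylabel('Wall time per tpr generation (s)')
plt.savefig('wall_time.png', dpi=300)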

[Figure compare_c_d: wall time per tpr generation vs. number of grompp commands executed (left) and memory usage vs. time (right) for Test A and Test B]

It can be seen from the figure above that memory usage increased as a function of time when gmxapi.commandline_operation was used. On the other hand, in Test A, where os.system was used, the memory usage remained roughly constant, which to my understanding is because os.system ran each GROMACS grompp command as a separate process and the allocated memory was released once the command finished. That is, gmxapi.commandline_operation seems to have memory leaks.
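
One way to confirm that the growth is in the parent Python process itself (rather than in the grompp child processes) would be to sample the resident set size inside the loop, e.g. with psutil (not part of the tests above; just a sketch):

import os
import psutil

proc = psutil.Process(os.getpid())           # the Python interpreter running the test
rss_list = []
for i in range(20000):
    # ... run grompp via os.system or gmxapi.commandline_operation ...
    rss_list.append(proc.memory_info().rss)  # resident set size in bytes after each iteration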

I'll paste Test_A.py and Test_B.py in the next comment.

wehs7661 (Author) commented:

Here is the content of Test_A.py:

import os
import time
import numpy as np

if __name__ == "__main__":
    t_list = []
    for i in range(20000):
        t1 = time.time()
        # Run grompp as a separate process via os.system
        os.system(f'mpirun -np 1 gmx grompp -f expanded.mdp -c anthracene.gro -p anthracene.top -o sys_EEX_{i}.tpr')
        t2 = time.time()
        t_list.append(t2 - t1)

    np.save('t_list.npy', t_list)

And here is the content of Test_B.py:

import os
import time
import numpy as np
import gmxapi as gmx
from ensemble_md.utils import utils

if __name__ == "__main__":
    t_list = []
    for i in range(20000):
        # A new commandline_operation node is created in every iteration
        grompp = gmx.commandline_operation(
            "gmx",
            arguments=["grompp"],
            input_files={
                '-f': '../expanded.mdp',
                '-c': '../anthracene.gro',
                '-p': '../anthracene.top',
            },
            output_files={
                '-o': f'../sys_EE_{i}.tpr',
                '-po': '../mdout.mdp',
            })

        t1 = time.time()
        grompp.run()
        t2 = time.time()
        t_list.append(t2 - t1)

        utils.gmx_output(grompp)

    np.save('t_list.npy', t_list)

Please let me know if my interpretation of the figure is correct and if further information is needed to help troubleshoot. Thanks a lot!

wehs7661 (Author) commented Apr 27, 2023

Here are some additional but less relevant interpretations of the figure:

  • The wall time per grompp call with gmxapi was much shorter than with os.system. (See the left panel of the figure.) I assumed that this is because os.system had a much larger overhead than gmxapi.commandline_operation. (While the execution of Test B seems fast here, in my EEXE simulation, where the number of iterations was really large, the execution could become very slow.)
  • There are spikes in the wall time for both tests. (See the left panel of the figure.) I assumed that these peaks were caused by cache memory, so I added the following lines to the loop in both Test_A.py and Test_B.py to clear the cache every 50 grompp commands:
    if (i + 1) % 50 == 0:
        os.system('sync; echo 3 > /proc/sys/vm/drop_caches')
    
    As a result (shown below), the spikes in Test A were gone, but the ones in Test B still existed, so the spikes in Test B were not entirely due to caching (while the ones in Test A were). Interestingly, for Test B the height of the spikes seemed to increase linearly in the left panel, so hopefully, resolving the issue mentioned in the first comment could also resolve the spikes in Test B. @eirrgang I'm wondering if you have some insights regarding this.
    [Figure compare_c_d_clear: the same comparison with the cache cleared every 50 grompp commands]

eirrgang (Collaborator) commented:

Thanks! This is very interesting.

This is a sort of scaling situation that has certainly not been deeply explored with gmxapi and gmxapi.commandline_operation.

For the wall time, I'll probably have to reproduce and profile pretty broadly. It could be any combination of things, ranging from memory (re)allocation to filesystem interactions, as well as unsophisticated O(N) logic in gmxapi.
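
For instance, a quick first pass could be to profile a shorter run with cProfile (a sketch; run_test_b below is a hypothetical wrapper around the Test B loop):

import cProfile
import pstats

# Profile a shorter run (e.g. 1000 iterations) and look for call counts or
# cumulative times that grow with the number of workflow items.
cProfile.run('run_test_b(n_iterations=1000)', 'profile.out')
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(20)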

For the memory, I can say right away that there are some places in the python module that don't deallocate or reuse memory when they could. In particular, gmxapi.commandline_operation relies on several dynamically defined functions and types that don't get cached/reused and may not get deleted until the interpreter shuts down. I'll definitely want to make sure that scalems.executable does not have the same slop, and you've pointed out some things to test.
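
Not gmxapi's actual internals, but a toy illustration of the kind of accumulation described, where dynamically created types are held by a module-level registry and never released:

_registry = []                         # stand-in for a module-level cache that only grows

def make_operation(i):
    # A brand-new class object is created on every call and kept alive by the
    # registry, so its memory is not reclaimed until the interpreter exits.
    cls = type(f'Operation_{i}', (object,), {'index': i})
    _registry.append(cls)
    return cls()

for i in range(20000):
    make_operation(i)                  # memory footprint grows with the loop count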

Overall, I had planned that the workflow management part of both gmxapi and scalems would keep a hash in memory of all tasks and results that the script is aware of, but if adding to the mapping gets problematically slow or memory intensive as the number of workflow items exceeds 10^5 or 10^6, we may need a plan for a lighter weight bookkeeping method. Additionally, both scalems and gmxapi will need to migrate to a hierarchical approach for artifacts and task directories. HPC admins will be livid if we routinely cause directories with tens of thousands of items!
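
As a rough sketch of the hierarchical layout idea (an illustration, not a committed design), task directories could be sharded by a prefix of the task hash so that no single directory ends up with tens of thousands of entries:

import hashlib
from pathlib import Path

def task_directory(root: str, task_id: str) -> Path:
    # Shard on the first two hex characters of the hash:
    # <root>/ab/abcdef.../ instead of one flat directory with ~10^5 entries.
    digest = hashlib.sha256(task_id.encode()).hexdigest()
    path = Path(root) / digest[:2] / digest
    path.mkdir(parents=True, exist_ok=True)
    return path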
