MemoryError after exceeding OpenMP/OpenBLAS thread limit #99

Open
lingfeiwang opened this issue Sep 21, 2021 · 3 comments

Comments

@lingfeiwang

lingfeiwang commented Sep 21, 2021

Hello. I use threadpoolctl 2.2.0, which works well most of the time. However, after the OpenMP or OpenBLAS thread limit was exceeded, threadpoolctl stopped working. It does not recover even after the processes that exceeded the limit have been killed, nor after waiting for quite some time. The full error message from a simple example is shown below. Is there any way to reset threadpoolctl so that it works again without rebooting the computer?

Python 3.9.5 (default, Jun  4 2021, 12:28:51) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.24.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from threadpoolctl import threadpool_limits
   ...: with threadpool_limits(limits=1):
   ...:     a=1
   ...: 
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-1-2121fc2c928d> in <module>
      1 from threadpoolctl import threadpool_limits
----> 2 with threadpool_limits(limits=1):
      3     a=1
      4 
~/.local/lib/python3.9/site-packages/threadpoolctl.py in __init__(self, limits, user_api)
    169             self._check_params(limits, user_api)
    170 
--> 171         self._original_info = self._set_threadpool_limits()
    172 
    173     def __enter__(self):

~/.local/lib/python3.9/site-packages/threadpoolctl.py in _set_threadpool_limits(self)
    266             return None
    267 
--> 268         modules = _ThreadpoolInfo(prefixes=self._prefixes,
    269                                   user_api=self._user_api)
    270         for module in modules:

~/.local/lib/python3.9/site-packages/threadpoolctl.py in __init__(self, user_api, prefixes, modules)
    338 
    339             self.modules = []
--> 340             self._load_modules()
    341             self._warn_if_incompatible_openmp()
    342         else:

~/.local/lib/python3.9/site-packages/threadpoolctl.py in _load_modules(self)
    373             self._find_modules_with_enum_process_module_ex()
    374         else:
--> 375             self._find_modules_with_dl_iterate_phdr()
    376 
    377     def _find_modules_with_dl_iterate_phdr(self):

~/.local/lib/python3.9/site-packages/threadpoolctl.py in _find_modules_with_dl_iterate_phdr(self)
    404             ctypes.c_int,  # Return type
    405             ctypes.POINTER(_dl_phdr_info), ctypes.c_size_t, ctypes.c_char_p)
--> 406         c_match_module_callback = c_func_signature(match_module_callback)
    407 
    408         data = ctypes.c_char_p(b"")

MemoryError: 
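
The last frame of the traceback fails while wrapping a Python function in a ctypes callback (line 406 of threadpoolctl.py). One way to narrow this down might be to check whether creating any ctypes callback fails in the same interpreter, independently of threadpoolctl. The sketch below is only that: the helper name `can_create_ctypes_callback` is hypothetical, and the signature uses `ctypes.c_void_p` in place of threadpoolctl's `_dl_phdr_info` pointer type as a simplification.

```python
import ctypes


def can_create_ctypes_callback():
    """Return True if a ctypes callback can be created in this process.

    Hypothetical diagnostic helper, not part of threadpoolctl.
    """
    # Same shape as the signature in the traceback, with c_void_p standing in
    # for ctypes.POINTER(_dl_phdr_info) (a simplification).
    c_func_signature = ctypes.CFUNCTYPE(
        ctypes.c_int,  # Return type
        ctypes.c_void_p, ctypes.c_size_t, ctypes.c_char_p)

    def dummy_callback(info, size, data):
        return 0

    try:
        # This is the step that raised MemoryError in the report.
        c_func_signature(dummy_callback)
    except MemoryError:
        return False
    return True


if __name__ == "__main__":
    print("ctypes callback creation works:", can_create_ctypes_callback())
```

If this returns `False` in the broken state, the problem would seem to lie in ctypes callback creation in general rather than in threadpoolctl itself; that is an inference to verify, not a confirmed diagnosis.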
@jeremiedbb
Collaborator

Hi @lingfeiwang, I'm not sure I understand how you triggered that. Could you describe in more detail the steps that led to this broken state?

@lingfeiwang
Author

I did not expect this to happen at all, so I recorded neither the steps to reproduce the error nor the error log from OpenMP or OpenBLAS. Briefly, I ran a computation in too many parallel processes, each of which used OpenMP or OpenBLAS, possibly through numpy/scipy. Together they exceeded some limit, maybe one set by the kernel, and the corresponding error lines were reported. I then killed those processes and everything seemed to recover, except threadpoolctl, as I discovered later.
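
For anyone who hits this state, recording the system's process and thread limits at that moment might help a later diagnosis. The sketch below assumes a Unix-like system and assumes that `RLIMIT_NPROC` and `/proc/sys/kernel/threads-max` are the relevant limits, which this report does not confirm; the function name `read_thread_limits` is hypothetical.

```python
import os
import resource


def read_thread_limits():
    """Collect limits that may cap the number of threads (assumed relevant)."""
    # Per-user limit on processes/threads; resource.RLIM_INFINITY means no limit.
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    limits = {"rlimit_nproc_soft": soft, "rlimit_nproc_hard": hard}

    # System-wide thread limit, exposed on Linux only.
    path = "/proc/sys/kernel/threads-max"
    if os.path.exists(path):
        with open(path) as f:
            limits["kernel_threads_max"] = int(f.read().strip())
    return limits


if __name__ == "__main__":
    for name, value in read_thread_limits().items():
        print(name, value)
```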

I understand this is not very informative, but trying to reproduce it on a shared computing server would be disruptive. I don't know how rare this error is, but I guess computing servers all over the world are constantly pushed to their limits. For me, a reboot solved the issue, but someone else might follow up on this thread with more details another day.

@ogrisel
Contributor

ogrisel commented Oct 8, 2021

Thanks for the feedback. It might indeed be a bug in the Linux kernel or in the OpenMP runtime, relying on an incorrectly updated stateful attribute of the system. If that ever happens again, it would be interesting to start a post-mortem pdb session to inspect the values used to build the match_module_callback signature. I do not understand how a MemoryError can possibly be raised on this line...
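
A post-mortem session along those lines could be started roughly as follows. This is a minimal sketch: `run_with_post_mortem` and `limit_threads` are hypothetical names, and the `debugger` parameter exists only so the helper can be exercised without an interactive prompt.

```python
import pdb
import sys


def run_with_post_mortem(func, debugger=pdb.post_mortem):
    """Call func(); on MemoryError, open a post-mortem debugger, then re-raise."""
    try:
        return func()
    except MemoryError:
        # Traceback of the exception currently being handled.
        tb = sys.exc_info()[2]
        debugger(tb)
        raise


def limit_threads():
    # Same statement as in the original report; imported lazily so this
    # sketch can be loaded without threadpoolctl installed.
    from threadpoolctl import threadpool_limits
    with threadpool_limits(limits=1):
        pass


# In the broken state, one would run:
# run_with_post_mortem(limit_threads)
```

At the pdb prompt, `up` and `down` move between the frames shown in the traceback and `p <name>` prints a local variable. In IPython, running `%debug` right after the exception opens a similar session.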
