ReproZip hangs on fit_transform() method from sklearn.decomposition and phate #388

Open
milech opened this issue Jan 13, 2023 · 15 comments
Labels
C-tracer (C) Component: The C part of the tracer codebase (_pytracer extension) T-bug Type: A fix for unwanted behavior, NOT a new feature or a simple enhancement

Comments

@milech

milech commented Jan 13, 2023

When I run reprozip trace on an experiment that uses dimensionality reduction algorithms such as PCA or t-SNE from scikit-learn, or PHATE from the phate library, it hangs while executing the fit_transform() method.

Versions of libraries:
scikit-learn 1.2.0
phate 1.0.10
pandas 1.3.5

System: Ubuntu 18.04.6 LTS

Sample code to reproduce the issue:

from pathlib import Path

import pandas as pd
from sklearn.decomposition import PCA


def main():
    # Placeholder: replace with the path to a CSV file.
    path = Path("path to a csv file")
    X = pd.read_csv(path)

    print("Creating PCA object")
    pca = PCA(random_state=123, n_components=3)
    print("PCA object created")

    print("fitting PCA transform")
    X_pca = pca.fit_transform(X)  # It hangs here. The print call below is never reached.
    print("PCA transform fitted")


if __name__ == "__main__":
    main()
@remram44
Member

remram44 commented Jan 13, 2023

I can't reproduce this; I ran it successfully in an Ubuntu 18.04 VM.

reprozip 1.1 (from pip)
Python 3.8.0 (package python3.8)
numpy 1.24.1
phate 1.0.10
scikit-learn 1.2.0
scipy 1.10.0

I used the Olivetti faces dataset.

Can you try running with increased verbosity? reprozip -v -v trace python ...

@milech
Author

milech commented Jan 16, 2023

Thank you for the quick reply. I've run it with -v -v. I was not able to open the trace.sqlite3-journal, so I'm attaching the trace log as screenshots of the places where it failed. The Python version was 3.9.15.

[Screenshots of the trace log: trace_1 through trace_8]

@remram44
Member

Unfortunately there doesn't seem to be anything wrong in that log, it looks like PCA is running. I assume you have waited long enough and it never completes? ReproZip shouldn't slow down that process anyway.

Unless I can reproduce this locally I am not sure I'll be able to fix it, sorry.

@milech
Author

milech commented Jan 17, 2023

When I run it "normally" (without ReproZip), this single iteration of fit_transform() takes 0.03 s. With ReproZip tracing I waited for 30 min and nothing happened. However, processor usage stays between 4% and 6% during that time, so it looks like something is going on. I will try to run it on a different machine and with Python 3.8, and will share the findings.

@milech
Author

milech commented Jan 19, 2023

I've managed to run reprozip trace after downgrading Python from version 3.9.15 to 3.8.0. It doesn't hang on fit_transform() anymore. However, a single iteration of fit_transform() still takes much longer (25 seconds) when run through reprozip trace than in a "normal" run (160 milliseconds). Could you please check whether the execution time on your side is comparable between python file_name.py and reprozip trace python file_name.py, using the PCA code I shared and the Olivetti faces dataset?

@remram44
Member

Ubuntu 18.04 does not have Python 3.9, so I am not sure how to reproduce your setup. Did you compile Python from source?

@milech
Author

milech commented Jan 19, 2023

I just meant reproducing it with Python 3.8.0, as you did before, but this time comparing how long python file_name.py takes versus reprozip trace python file_name.py. In our environment, reprozip trace python file_name.py takes about 155 times longer than python file_name.py. We have several thousand iterations to run, so tracing would make the process extremely long. If ReproZip tracing doesn't slow this down in your environment with the Python version you use, then the cause must be something in the environment settings on our side (conda, maybe?).

P.S.
I created a conda environment with Python 3.9 installed inside it.

@remram44
Member

How long are those commands taking respectively?

There should be no overhead for the computing process, only for figuring out dependencies to write the config.yml at the end (and a little bit per system call).

@milech
Author

milech commented Jan 19, 2023

For a single iteration of fit_transform():
python file_name.py - about 1 second
reprozip trace python file_name.py - about 27 seconds

For 5 iterations of fit_transform():
python file_name.py - about 5 seconds
reprozip trace python file_name.py - about 129 seconds
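
A comparison like the one above can be scripted so that both runs are timed the same way. The following is a minimal sketch, not something taken from ReproZip: `time_command` and `slowdown` are hypothetical helper names, and the command in the `__main__` block is a stand-in that should be replaced with `["python", "file_name.py"]` and `["reprozip", "trace", "python", "file_name.py"]`.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run a command to completion and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    return time.perf_counter() - start


def slowdown(plain_seconds, traced_seconds):
    """Ratio of the traced run time to the plain run time."""
    return traced_seconds / plain_seconds


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; swap in the real commands.
    elapsed = time_command([sys.executable, "-c", "pass"])
    print(f"stand-in command: {elapsed:.3f} s")

    # Ratios for the timings reported above (1 and 5 iterations).
    print(slowdown(1, 27))
    print(slowdown(5, 129))
```

With the figures reported above, the ratio stays roughly constant as the iteration count grows, which suggests the overhead scales with the work done inside fit_transform() rather than being a fixed startup cost.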

@remram44
Member

I can reproduce the slowness if I increase the number of iterations.

It seems that scikit-learn uses threads (15 for me) that call sched_yield() regularly, incurring the tracing overhead each time. What's weird is that running under strace does not take as long as under reprozip. I am not sure where the difference comes from.
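
One way to check how many native threads the numerical libraries start on a given machine is the threadpoolctl package, which appears in the package list later in this thread. This is only a sketch under that assumption: `summarize_pools` is a hypothetical helper, and the output says nothing about which of those threads call sched_yield().

```python
def summarize_pools(pools):
    """Return sorted (internal_api, num_threads) pairs from threadpool_info()-style dicts."""
    return sorted((p["internal_api"], p["num_threads"]) for p in pools)


def main():
    try:
        # Third-party packages; both appear in the package list later in this thread.
        import numpy  # noqa: F401  (importing NumPy loads its BLAS library)
        from threadpoolctl import threadpool_info
    except ImportError:
        print("numpy and threadpoolctl are needed for this check")
        return
    # threadpool_info() describes each native thread pool loaded in this process.
    for api, threads in summarize_pools(threadpool_info()):
        print(f"{api}: {threads} threads")


if __name__ == "__main__":
    main()
```

Importing scikit-learn before calling `threadpool_info()` would also surface its OpenMP runtime, if one is loaded.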

@remram44 remram44 added T-bug Type: A fix for unwanted behavior, NOT a new feature or a simple enhancement C-tracer (C) Component: The C part of the tracer codebase (_pytracer extension) labels Jan 20, 2023
@remram44
Member

I'm stumped. It looks like a bug in scikit-learn, to be honest. If I slow down reprozip further, making each sched_yield() call take longer, the Python code yields even more and never completes.

In any case, you can trace your program with a low number of iterations and then change it back to the proper number before packing. The reprozip tracer is not used during reproduction so it shouldn't be a problem. Sorry I can't help further!
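
To make that switch easy, the iteration count can be taken from the command line instead of being hard-coded. The sketch below is illustrative only: `--iterations` is a hypothetical flag name, the loop body is a placeholder for the real fit_transform() call, and it assumes the command line recorded in config.yml can be edited before packing.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Run fit_transform() repeatedly")
    # --iterations is a hypothetical flag name chosen for this sketch.
    parser.add_argument("--iterations", type=int, default=1000)
    return parser.parse_args(argv)


def run(iterations):
    """Placeholder loop; returns how many iterations completed."""
    completed = 0
    for _ in range(iterations):
        # pca.fit_transform(X) would go here in the real experiment.
        completed += 1
    return completed


if __name__ == "__main__":
    args = parse_args()
    print(f"completed {run(args.iterations)} iterations")
```

Tracing would then use something like `reprozip trace python file_name.py --iterations 2`, with the argument raised back to the full count before packing.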

@milech
Author

milech commented Jan 20, 2023

Thank you! That tip with tracing the program with a low number of iterations should do the trick!

@milech
Author

milech commented Feb 6, 2023

Update:
The trick of setting a lower number of iterations for tracing and then changing it back before packing doesn't necessarily work when several dimensionality reduction algorithms are called (e.g. PCA, t-SNE, UMAP). Due to some unspecified thread-dependent weirdness inside scikit-learn, either PCA hangs, or UMAP, or t-SNE. But we found a temporary workaround that makes all three run. The program stopped hanging after setting:

export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
export OMP_NUM_THREADS=1

Actually, it was the last of these exports (OMP_NUM_THREADS) that eventually made things run, so either only that one is needed or all four are. It runs slower because threading is switched off inside the methods called from scikit-learn, but at least it doesn't hang. Multiprocessing outside scikit-learn works fine, so it is still possible to run the algorithms in parallel.
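
The same four variables can also be set from a small Python launcher instead of the shell, which keeps the workaround next to the experiment code. This is a sketch under assumptions: `single_thread_env` and `trace_single_threaded` are hypothetical helper names, and it presumes that launching the trace with these variables in the environment has the same effect as the exports above.

```python
import os
import subprocess

# The four variables from the workaround above.
THREAD_VARS = (
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "NUMEXPR_NUM_THREADS",
    "OMP_NUM_THREADS",
)


def single_thread_env(base=None):
    """Copy an environment and pin the four thread-count variables to 1."""
    env = dict(os.environ if base is None else base)
    for name in THREAD_VARS:
        env[name] = "1"
    return env


def trace_single_threaded(script):
    """Run `reprozip trace python <script>` with the pinned environment."""
    return subprocess.run(
        ["reprozip", "trace", "python", script],
        env=single_thread_env(),
        check=True,
    )


if __name__ == "__main__":
    env = single_thread_env()
    print({name: env[name] for name in THREAD_VARS})
```

Calling `trace_single_threaded("file_name.py")` would start the trace; it is not called here so that the sketch runs without reprozip installed.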

@remram44
Member

remram44 commented Feb 6, 2023

Thanks @milech, that should help narrow it down. Hopefully I can find where the issue is. ReproZip should not interfere with OpenMP like this.

@milech
Author

milech commented Feb 6, 2023

One more thing that I forgot to mention. Setting those environment variables made it work only in this configuration:

Package Version
------- -------
certifi 2022.12.7
charset-normalizer 2.1.1
contourpy 1.0.6
cycler 0.11.0
decorator 5.1.1
Deprecated 1.2.13
distro 1.8.0
fonttools 4.38.0
future 0.18.2
graphtools 1.5.3
idna 3.4
joblib 1.2.0
kiwisolver 1.4.4
llvmlite 0.39.1
matplotlib 3.6.2
numba 0.56.4
numpy 1.23.5
packaging 22.0
pandas 1.3.5
patsy 0.5.3
phate 1.0.10
Pillow 9.4.0
pip 22.3.1
PyGSP 0.5.1
pynndescent 0.5.8
pyparsing 3.0.9
python-dateutil 2.8.2
pytz 2022.7
PyYAML 6.0
reprozip 1.1
requests 2.28.1
rpaths 1.0.0
s-gd2 1.8.1
scikit-learn 1.2.0
scipy 1.10.0
scprep 1.2.1
seaborn 0.12.2
setuptools 65.5.0
six 1.16.0
sklearn 0.0.post1
statsmodels 0.13.5
tasklogger 1.2.0
threadpoolctl 3.1.0
tqdm 4.64.1
umap-learn 0.5.3
urllib3 1.26.13
usagestats 1.0.1
wheel 0.37.1
wrapt 1.14.1

python==3.9.15
