Predicting on large DF runs into infinite loop #272

Open
janosh opened this issue Nov 22, 2019 · 4 comments
Labels
major bug, v1.1 (Issues and enhancements for upcoming minor release v1.1)

Comments

@janosh
Member

janosh commented Nov 22, 2019

I've been trying to work around what might be a bug in (auto-)matminer. Making predictions for a large dataframe (around 80000 rows) never finishes. I think the culprit might be guessing oxidation states, as that seems to take a long time. Its run time also increases rapidly from one prediction to the next when slicing the dataframe into chunks and predicting on each chunk individually.
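The chunk-by-chunk observation above can be checked with a small timing helper. This is only a minimal sketch: `time_chunks` and its `predict` argument are hypothetical names, not part of automatminer or matminer, and `predict` stands in for whatever call makes predictions on one chunk.

```python
import time


def time_chunks(data, chunk_size, predict):
    """Run `predict` on consecutive chunks of `data` and time each call.

    `data` may be a list or a pandas DataFrame. `predict` is a hypothetical
    stand-in for the per-chunk prediction call.
    """
    # A DataFrame is sliced by position through `.iloc`; a list slices directly.
    rows = getattr(data, "iloc", data)
    timings = []
    for start in range(0, len(data), chunk_size):
        chunk = rows[start:start + chunk_size]
        t0 = time.perf_counter()
        predict(chunk)
        elapsed = time.perf_counter() - t0
        # Record (start row, chunk length, seconds) so growth is easy to spot.
        timings.append((start, len(chunk), elapsed))
    return timings
```

If the per-chunk time keeps growing even though every chunk has the same size, that would suggest the slowdown comes from accumulated state rather than from the data in any one chunk.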

@ardunn I couldn't create a minimal example with dummy data that reproduces this issue but maybe you can try to run this script and see if you experience the same issue.

@janosh
Member Author

janosh commented Nov 22, 2019

Turns out that if I only use an ElementProperty featurizer (which generates the only features that are retained anyway), the problem disappears.

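import automatminer as amm
from matminer.featurizers.composition import ElementProperty

# Use only the magpie ElementProperty featurizer; no structure featurizers.
featurizers = {
    "composition": [ElementProperty.from_preset("magpie")],
    "structure": [],
}
# Start from the default preset and override only the autofeaturizer,
# with oxidation-state guessing turned off.
pipe_config = {
    **amm.get_preset_config(),
    "autofeaturizer": amm.AutoFeaturizer(
        featurizers=featurizers,
        guess_oxistates=False,
    ),
}

pipe = amm.MatPipe(**pipe_config)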

@ardunn added the "major bug" and "v1.1" (Issues and enhancements for upcoming minor release v1.1) labels on Nov 23, 2019
@ardunn
Contributor

ardunn commented Nov 23, 2019

Hey @janosh thanks for the bug report. I've been aware of this problem for some time and am actually currently running some tests to try and pinpoint it.

I actually think this is a bug with matminer and job parallelization with multiprocessing. For example, if you try just using StructureToOxidStructure etc. from matminer, I'd wager you'd see the same issues.

What I think is happening behind the scenes is when n_jobs is high (relative to the compute ability of whatever machine you are running it on), the expensive chunks are delegated very few compute cycles by the CPU and/or are not allocated sufficient memory. I don't think there is any infinite loop happening (AFAIK) but the CPU is not allowing a highly parallelized process to run efficiently.

Some tests to try

Does running the bare featurizers (without automatminer) still have this problem? My guess is yes.

If so, does setting n_jobs for an individual featurizer change the halting behavior whatsoever? My guess is that if you set n_jobs=1 the job will go very slowly but eventually finish, and if you turn n_jobs very high you increase the probability it halts indefinitely.
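The second test could be organized as a small sweep over `n_jobs` values. The helper below is a rough sketch under stated assumptions: `sweep_n_jobs` and `run` are hypothetical names, and `run(n_jobs)` stands in for configuring a bare featurizer with that many jobs and featurizing the dataframe once.

```python
import time


def sweep_n_jobs(run, n_jobs_values):
    """Call `run(n_jobs)` for each value and record the wall-clock time.

    `run` is a hypothetical callable supplied by the caller; it should set
    the number of jobs on a featurizer and perform one featurization.
    """
    results = {}
    for n_jobs in n_jobs_values:
        t0 = time.perf_counter()
        run(n_jobs)
        # Seconds taken for this n_jobs setting.
        results[n_jobs] = time.perf_counter() - t0
    return results
```

With matminer, `run` would presumably call the featurizer's `set_n_jobs(n_jobs)` and then `featurize_dataframe(...)` on a fixed subset of the data. The sketch cannot interrupt a run that never returns, so start with `n_jobs=1` on a small subset and scale up from there.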

@ardunn
Contributor

ardunn commented Nov 9, 2020

@ardunn
Contributor

ardunn commented Nov 9, 2020
