Flaky unit tests: joblib parallel ValueError #167

Closed
jeremyliweishih opened this issue Oct 30, 2019 · 12 comments

Labels
bug Issues tracking problems with existing features.

@jeremyliweishih
Contributor

Why are our results for 3.6 and 3.7 consistent but not 3.5?

@jeremyliweishih
Contributor Author

This issue may be related to dict insertion order: dicts don't preserve insertion order on Python 3.5, but do on 3.6+.
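
A minimal sketch of what that could look like (hypothetical example, not code from evalml): dict iteration order is arbitrary on 3.5 but follows insertion order on 3.6+, so anything that iterates a dict of parameters or pipelines could behave differently from run to run on 3.5.

    # Hypothetical illustration: on Python 3.5, plain dicts do not preserve
    # insertion order, and string-key order can change between interpreter runs
    # because of hash randomization. On 3.6+ insertion order is kept, which
    # would fit the consistent 3.6/3.7 results.
    params = {"n_estimators": 10, "max_depth": 3, "n_jobs": -1}
    print(list(params))  # 3.6+: ['n_estimators', 'max_depth', 'n_jobs']; 3.5: arbitrary

    # collections.OrderedDict gives the same order on every version:
    from collections import OrderedDict
    params = OrderedDict([("n_estimators", 10), ("max_depth", 3), ("n_jobs", -1)])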

@dsherry
Contributor

dsherry commented Jan 6, 2020

We should write a summary here of what appears to be inconsistent

@jeremyliweishih
Contributor Author

I can't seem to find the other ticket, but this issue might also be related to (or fix) the CircleCI inconsistency with parallelization in 3.5.

Errors that could pop up can be found here.

@dsherry dsherry changed the title 3.5 Inconsistency Unit tests failing on Python 3.5 Jan 8, 2020
@dsherry
Contributor

dsherry commented Jan 8, 2020

Awesome, thanks.

What I see in the logs is that some of our unit tests are failing on python 3.5 but still passing on 3.6 and 3.7.

This issue may be related to dict insertion order: dicts don't preserve insertion order on Python 3.5, but do on 3.6+.

That would make sense.

Next questions/tasks:

  • Decide: how important is it that we support python 3.5 in evalml? (@kmax12: thoughts?)
  • Reproduce this locally, debug and verify the root cause

@dsherry dsherry added the bug Issues tracking problems with existing features. label Jan 8, 2020
@dsherry
Contributor

dsherry commented Jan 9, 2020

I started looking into this.

Summary: able to repro unreliably/occasionally. Still not sure of the root cause.

From sifting through the CircleCI results, it looks like this happens a small percentage of the time. Maybe 10-20%.

Note: I filed #311 to track some warning messages I saw in the unit tests. May be related, unsure.

Stack trace

evalml/tests/automl_tests/test_auto_regression_search.py:83:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
evalml/automl/auto_base.py:165: in search
    self._do_iteration(X, y, pbar, raise_errors)
evalml/automl/auto_base.py:261: in _do_iteration
    raise e
evalml/automl/auto_base.py:258: in _do_iteration
    score, other_scores = pipeline.score(X_test, y_test, other_objectives=self.additional_objectives)
evalml/pipelines/pipeline_base.py:257: in score
    y_predicted = self.predict(X)
evalml/pipelines/pipeline_base.py:205: in predict
    return self.estimator.predict(X_t)
evalml/pipelines/components/estimators/estimator.py:17: in predict
    return self._component_obj.predict(X)
test_python/lib/python3.5/site-packages/sklearn/ensemble/_forest.py:782: in predict
    for e in self.estimators_)
test_python/lib/python3.5/site-packages/joblib/parallel.py:1004: in __call__
    if self.dispatch_one_batch(iterator):
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
...
>               islice = list(itertools.islice(iterator, big_batch_size))
E               ValueError: Stop argument for islice() must be None or an integer: 0 <= x <= sys.maxsize.

test_python/lib/python3.5/site-packages/joblib/parallel.py:808: ValueError

Some possibilities

  • Bug with the py3.5 version of sklearn's random forest, which uses joblib.Parallel internally. I've noticed the stack traces seem to always mention the line test_python/lib/python3.5/site-packages/sklearn/ensemble/_forest.py:782 (which, if I got the right version, is here). The sklearn version used by CircleCI for py3.5 is 0.22.1.
  • Perhaps the Docker container is interfering with joblib somehow (e.g. how it detects available CPUs). The final frame in the stack trace is in joblib code.

Stuff I tried
I grabbed this failed "linux python 3.5 unit tests" job and used CircleCI's "rerun with SSH" (super handy). Once inside I activated the test_python venv and ran the pytest cmd triggered by make circleci-test.

I tried running some of the unit tests which had failed individually, with no luck. It was only when I ran all of them at once that I was able to repro some failures. But the tests which failed changed a bit each time and seemed unpredictable.

Next steps

  • Make sure this bug is worth the effort: are we going to continue to support python 3.5?
  • What evalml estimator and dataset is this failure occurring for? Can we repro this by calling the estimator directly? What about with other data?
  • Can we repro this on either mac or windows?
  • Continue to try to reliably repro. Perhaps write a similar unit test which wraps sklearn's random forest directly instead of calling evalml (see the sketch below).
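
A rough repro sketch for that last bullet (my own guess at the shape, not code from the repo; dataset and sizes are made up): hammer sklearn's random forest directly with n_jobs=-1, which is roughly what the failing evalml tests end up doing through pipeline.score/predict.

    # Hypothetical repro: the failing frame is joblib.Parallel inside
    # sklearn/ensemble/_forest.py's predict(), so exercise that path directly.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.RandomState(0)
    X = rng.rand(200, 10)
    y = rng.rand(200)

    for i in range(50):
        est = RandomForestRegressor(n_estimators=10, n_jobs=-1, random_state=i)
        est.fit(X, y)
        est.predict(X)  # the ValueError surfaced from joblib inside predict()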

Not the cause

  • I found an issue online with the same ValueError which said: "If you move to a 64-bit build of Python, sys.maxsize will jump from 2^31 - 1 to 2^63 - 1." This had me wondering if the test was using 32-bit python. I verified we are using 64-bit python (see the one-liner after this list), so that's not it.
  • I noticed we're using the -n flag on pytest. Perhaps this issue is exposing a bug in the way pytest spins up parallel workers -- wait, never mind, because this test failed before the -n flag was added by Jeremy on his branch.
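
For reference, the 64-bit check is just something like this (one way to do it, not necessarily the exact command that was run):

    # sys.maxsize is 2^63 - 1 on a 64-bit build and 2^31 - 1 on a 32-bit build.
    import sys
    print(sys.maxsize == 2**63 - 1)  # True on 64-bit python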

@dsherry dsherry changed the title Unit tests failing on Python 3.5 Linux python 3.5 unit tests are failing randomly Jan 9, 2020
@kmax12
Contributor

kmax12 commented Jan 9, 2020

in terms of the necessity to support 3.5...

currently 5-10% of our featuretools downloads come from python 3.5. i also looked at a few other ml-related libraries:

  • scikit-learn: ~10%
  • pandas: 10-15%
  • xgboost: 20-30%
  • numpy: 10%

so, my thought would be that yes, we should try to support it since there are people using it. if maintaining it is slowing us down drastically, we could revisit that.

check out whatever package you want here: https://pypistats.org/packages/pandas

@dsherry
Contributor

dsherry commented Jan 31, 2020

Just saw another instance of this failure on my PR, here. It hasn't magically gone away :) We should dig into this soon.

@dsherry
Contributor

dsherry commented Mar 3, 2020

We're removing support for python 3.5 in #435.

But note @angela97lin mentioned she's seen this failure on python 3.6 💩 Updating issue name to correspond.

RE comment in #435, I wonder if this issue has something to do with our use of OrderedDict... probably not, just adding to the list of possibilities.

@angela97lin do you have any info / links / repro with the 3.6 failure you saw? Was it local or on circleci?

@dsherry dsherry changed the title Linux python 3.5 unit tests are failing randomly Linux python 3.5/3.6 unit tests are failing randomly with joblib parallel ValueError Mar 3, 2020
@dsherry dsherry changed the title Linux python 3.5/3.6 unit tests are failing randomly with joblib parallel ValueError Flaky linux python 3.5/3.6 unit tests: joblib parallel ValueError Mar 3, 2020
@dsherry dsherry changed the title Flaky linux python 3.5/3.6 unit tests: joblib parallel ValueError Flaky linux python 3.6 unit tests: joblib parallel ValueError Mar 3, 2020
@dsherry dsherry changed the title Flaky linux python 3.6 unit tests: joblib parallel ValueError Flaky unit tests: joblib parallel ValueError Mar 3, 2020
@angela97lin
Contributor

Sure! I've only run into it via my random_state PR for python 3.6, so I've been trying to debug. Here's that PR: #431

From Slack thread:
My guess is that it has something to do with n_jobs=-1, since in the stack trace we get ValueError: Stop argument for islice() must be None or an integer: 0 <= x <= sys.maxsize. It's likely that when n_jobs=-1, the stop argument passed in becomes negative and triggers this exception. The error goes away when n_jobs is a positive integer.
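
To make that guess concrete (illustration only, simplified; this is not joblib's actual code): islice raises exactly this error whenever its stop argument is negative, so any code path that derives the batch stop from a bad effective worker count would fail the same way.

    # A negative stop reproduces the exact ValueError from the stack trace.
    import itertools

    tasks = iter(range(10))
    batch_stop = -1  # hypothetically what a bad effective n_jobs could produce
    list(itertools.islice(tasks, batch_stop))
    # ValueError: Stop argument for islice() must be None or an integer: 0 <= x <= sys.maxsize.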

Here’s a run where I ran into this: https://app.circleci.com/jobs/github/FeatureLabs/evalml/12366

It seems to only happen on CircleCI, so I wonder if that has anything to do with the issue?

@dsherry
Contributor

dsherry commented Mar 4, 2020

Status
We're able to repro this on python 3.6 on @angela97lin's PR #441. Circleci failure is here. We can't repro by running individual tests; we have to run them all.

We were previously seeing this failure only on python 3.5. Now Angela tweaking the random_state causes this to fail on python 3.6 only. This makes me think there's a race condition which has to do with the ordering of calls to the random number generator. It's quite helpful that it appears to be failing consistently on python 3.6 on Angela's random_state branch.

Next steps

  • Dylan check numpy/sklearn package versions on 3.6 vs 3.7, and use that to check their changelogs
  • Dylan try to get another reproducer, off master
  • Dylan try rerunning all tests in circleci via ssh, see if that fails
  • Dylan try messing with docker config in circleci job, potential fix (see the diagnostic sketch after this list)
    Could add something like the following (docker doc, circleci doc) to the unit_tests circleci config to limit the test job to 8 cpus:
    docker:
      - command: ['--cpuset', '0-7']
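
The diagnostic sketch mentioned above (my own suggestion, not something that has been run yet): print what joblib resolves n_jobs=-1 to inside the CircleCI container, since the --cpuset idea is really about whether the detected CPU count matches what the container actually allows.

    # Hypothetical diagnostic to run inside the CircleCI container via SSH.
    import joblib

    print(joblib.cpu_count())           # CPUs joblib detects in the container
    print(joblib.effective_n_jobs(-1))  # workers Parallel(n_jobs=-1) would use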

@dsherry
Contributor

dsherry commented Mar 10, 2020

We should reevaluate whether this is still an issue now that @christopherbunn merged #407.

@dsherry
Contributor

dsherry commented Mar 30, 2020

I haven't seen this issue since #407 was merged. Closing.

@dsherry dsherry closed this as completed Mar 30, 2020