Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arm64 CI setup with TravisCI #17996

Merged
merged 60 commits into from Jul 31, 2020
Merged

Arm64 CI setup with TravisCI #17996

merged 60 commits into from Jul 31, 2020

Conversation

rth
Copy link
Member

@rth rth commented Jul 26, 2020

Experiment with Arm64 CI setup using TravisCI, since we already have it activated. It's a similar setup to that used by numpy.

Closes #13073

@rth rth changed the title WIP Arm64 CI setup with TravisCI Arm64 CI setup with TravisCI Jul 26, 2020
@ogrisel
Copy link
Member

ogrisel commented Jul 26, 2020

Maybe the test failure is caused by a corruption of the /home/travis/scikit_learn_data/20news_home/20news-bydate.tar.gz file caused by concurrent download in tests run in // because of the use of pytest-xdist?

It would be worth making the dataset concurrent-safe by downloading the files to temporary filenames and then renaming at the end to ensure atomic operations.

@ogrisel
Copy link
Member

ogrisel commented Jul 26, 2020

I will work on making the download safe in another PR.

@ogrisel
Copy link
Member

ogrisel commented Jul 26, 2020

Apparently there is no easy way to make the downloaders truly concurrent safe. So maybe prefetching the datasets sequentially to a cache folder is better.

@rth
Copy link
Member Author

rth commented Jul 26, 2020

Overall this looks promising, the ARM CI here runs in 28min including 10min for the tests suite on 32 (I imagine shared) CPUs. The build time should decrease once ccache starts to work.

We could probably run it for each commit on master or in a cron job, not sure.

Now there are two categories of failing tests,

  • those that make fetch_* calls. That's actually a non issue. By default the test suite should not make network calls, and they were happening because SKLEARN_SKIP_NETWORK_TESTS was set to 0 in the cron jobs. We don't need to download these datasets in arm64, same as we don't do it in default azure jobs. Now we still get some failure because some of the recently modified tests are not correctly handling SKLEARN_SKIP_NETWORK_TESTS as far as I can tell. Possibly somehow related to MNT refactor tests for datasets #17599 (cc @glemaitre ) So that would need to be investigated a bit.
  • GradientBoostingClassifier doctest ([0.23.1] doctest GradientBoostingClassifier failes on arm(rhel) processors #17797) should now be skipped in conftest.py but if I copy it to the test directory to account for it with pytest --pyargs sklearn I get import errors for this conftest.py. Not sure what's happening there.

I won't be able to look into this in the next few days, so if anyone wants to investigate don't hesitate to push to this PR.

@ogrisel
Copy link
Member

ogrisel commented Jul 27, 2020

I would go for cronjob + a flag in the commit message to make it possible to trigger the tests manually in PRs prior to merging ARM related fixes.

@ogrisel
Copy link
Member

ogrisel commented Jul 30, 2020

The ICC build failure is unrelated and probably transient:

[...]
Reading package lists... Done
E: Failed to fetch https://apt.repos.intel.com/oneapi/dists/all/main/binary-i386/Packages.gz  Hash Sum mismatch
E: Failed to fetch https://apt.repos.intel.com/oneapi/dists/all/main/binary-all/Packages.gz  
E: Some index files failed to download. They have been ignored, or old ones used instead.

@ogrisel
Copy link
Member

ogrisel commented Jul 30, 2020

@jeremiedbb any idea why we fetch binary-i386/Packages.gz on a x66_64 CI machine for the icc-build?

@ogrisel ogrisel marked this pull request as ready for review July 30, 2020 15:16
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM to me on my end. Ready for final review.

@ogrisel
Copy link
Member

ogrisel commented Jul 30, 2020

FYI @rth, I believe the scikit-learn/scikit-learn folder is automatically added to the PYTHONPATH by travis hence the fact that the toplevel conftest.py is already picked up by pytest without the need to copy it to the test folder.

@rth
Copy link
Member Author

rth commented Jul 30, 2020

Thanks for your perseverance in finding this issue @ogrisel ! I also have no idea why travis_terminate is needed either.

I think there are failing tests in other TravisCI builds, that's expected since the scipy-dev build fails on master. We can probably address that in a separate PR.

Copy link
Member Author

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ogrisel! LGTM.

As mentioned above, I would say the other CI failure in Travis CI are unrelated to this PR?

item.add_marker(marker)
# Known failure on with GradientBoostingClassifier on ARM64
elif (item.name.endswith('GradientBoostingClassifier')
and platform.machine() == 'aarch64'):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still confused on the "aarch64" vs "ARM64" since the latter can also happen apparently https://github.com/python/cpython/blob/0c4f0f3b29d84063700217dcf90ad6860ed71c70/Lib/test/test_regrtest.py#L662

Anyway the the issue where this doctest failure was originally reported #17797 was also aarch64 so it's probably fine to merge as is

@jeremiedbb
Copy link
Member

any idea why we fetch binary-i386/Packages.gz on a x66_64 CI machine for the icc-build?

No but it's just packages available in the apt repository, not the package we're installing, right ?
we add intel apt repo with sudo add-apt-repository "deb https://apt.repos.intel.com/oneapi all main". I guess the 'all main' is a bit overkill but should not be a problem. tbh I just copy/pasted the instructions from intel (https://software.intel.com/content/www/us/en/develop/articles/oneapi-repo-instructions.html#aptpkg)

@jeremiedbb
Copy link
Member

/opt/intel/inteloneapi/setvars.sh: No such file or directory
hum...

@ogrisel
Copy link
Member

ogrisel commented Jul 30, 2020

Indeed I restarted the icc-build to check if the i386 hash mismatch could be reproduced and now it fails with:

[...]
Setting up intel-oneapi-tbb-devel-2021.1-beta08 (2021.1-560.beta08) ...
Setting up intel-oneapi-tbb-devel (2021.1-560.beta08) ...
Setting up intel-oneapi-icc-2021.1-beta08 (2021.1-863.beta08) ...
Setting up intel-oneapi-icc (2021.1-863.beta08) ...
build_tools/travis/install.sh: line 92: /opt/intel/inteloneapi/setvars.sh: No such file or directory

They are probably in the process of updating the apt repo.

@ogrisel
Copy link
Member

ogrisel commented Jul 30, 2020

The last arm64 has a failed doctest:

______________________________ [doctest] sgd.rst _______________________________
[gw4] linux -- Python 3.7.8 /home/travis/miniconda/envs/testenv/bin/python3.7
124 parameters if an example violates the margin constraint, which makes
125 training very efficient and may result in sparser models (i.e. with more zero
126 coefficents), even when L2 penalty is used.
127 
128 Using ``loss="log"`` or ``loss="modified_huber"`` enables the
129 ``predict_proba`` method, which gives a vector of probability estimates
130 :math:`P(y|x)` per sample :math:`x`::
131 
132     >>> clf = SGDClassifier(loss="log", max_iter=5).fit(X, y)
133     >>> clf.predict_proba([[1., 1.]])
Expected:
    array([[0.00..., 0.99...]])
Got:
    array([[4.97248476e-07, 9.99999503e-01]])

There is not random_state. This failure is probably random but very rare and not related to arm64. Let me try to restart this build to confirm its randomness.

@jeremiedbb
Copy link
Member

But array([[4.97248476e-07, 9.99999503e-01]]) is close to array([[0.00..., 0.99...]]). It seems to fail just because it's written in scientific notation.

@ogrisel
Copy link
Member

ogrisel commented Jul 30, 2020

But array([[4.97248476e-07, 9.99999503e-01]]) is close to array([[0.00..., 0.99...]]). It seems to fail just because it's written in scientific notation.

But most of the time it passes. It's probably random variations at some lower decimals that trigger the switch between the 2 notations.

@glemaitre
Copy link
Member

LGTM. I would just add the new tag there: https://scikit-learn.org/stable/developers/contributing.html#continuous-integration-ci
such that it is mentioned in the documentation.

@glemaitre
Copy link
Member

I am even thinking that in the near future we will probably have to run this CI all the time if manufacturer like Apple is selling personal computers with these processors

@ogrisel ogrisel merged commit 90dc61f into scikit-learn:master Jul 31, 2020
@ogrisel
Copy link
Member

ogrisel commented Jul 31, 2020

Merged. Thanks all!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RFC/CI: use Shippable for testing on ARM v8 architecture
5 participants