FIX, ENH Possible fix for segfault and oversubscription of RAM in oblique splitters #267

Open · wants to merge 27 commits into main
10 changes: 6 additions & 4 deletions .circleci/config.yml
Original file line number Diff line number Diff line change
@@ -81,12 +81,14 @@ jobs:
- run:
name: setup Python venv
command: |
pip install -r build_requirements.txt
pip install -r doc_requirements.txt
pip install --upgrade pip spin
spin setup-submodule
pip install .[build]
pip install .[doc]
- run:
name: build scikit-tree
command: |
python spin build -j2
spin build -j2

- save_cache:
key: deps_ccache-{{ .Branch }}
@@ -127,7 +129,7 @@ jobs:
echo "After..."
ls build-install/**
export PYTHONPATH=$PWD/build-install/usr/lib/python3.8/site-packages
python ./spin docs
./spin docs

- store_artifacts:
path: doc/_build/html
21 changes: 11 additions & 10 deletions .github/workflows/build_wheels.yml
@@ -60,9 +60,10 @@ jobs:

- name: Setup submodule
run: |
python -m pip install -r build_requirements.txt
python -m pip install --upgrade pip
python -m pip install spin
python spin setup-submodule
python -m spin setup-submodule
python -m pip install .[build]

- name: win_amd64 - install rtools
run: |
@@ -79,7 +80,7 @@
# configuration step to create a build directory. The subsequent wheel
# build then needs to use that directory. This can be done with pip
# using a command like:
# python -m pip wheel --config-settings builddir=build .
# pip wheel --config-settings builddir=build .
# if: >-
# ( ! contains(matrix.os[2], 'arm64' ) )
env:
@@ -114,10 +115,10 @@ jobs:

- name: Install cibuildwheel dependencies
run: |
python -m pip install cibuildwheel
python -m pip install -r build_requirements.txt
python -m pip install spin
python spin setup-submodule
pip install cibuildwheel
pip install spin
spin setup-submodule
pip install .[build]

- name: Build wheels
uses: pypa/cibuildwheel@v2.16.5
@@ -146,9 +147,9 @@ jobs:

- name: Build source distribution
run: |
python -m pip install -r build_requirements.txt
python -m pip install spin
python spin setup-submodule
pip install spin build
spin setup-submodule
pip install .[build]
python -m build

- name: Store artifacts
43 changes: 24 additions & 19 deletions .github/workflows/main.yml
@@ -46,28 +46,32 @@ jobs:
python-version: ${{ matrix.python-version }}
architecture: "x64"
cache: "pip"
cache-dependency-path: "requirements.txt"
cache-dependency-path: "pyproject.toml"

- name: show-gcc
- name: Install packages for Ubuntu
if: ${{ matrix.os == 'ubuntu-22.04'}}
run: |
gcc --version
sudo apt-get update
sudo apt-get install -y libopenblas-dev libatlas-base-dev liblapack-dev gfortran libgmp-dev libmpfr-dev libsuitesparse-dev ccache libmpc-dev
sudo apt-get install -y gcc
sudo apt-get update

- name: Install Ccache for MacOSX
if: ${{ matrix.os == 'macos-latest'}}
run: |
brew install ccache
brew install gcc

- name: Install packages for Ubuntu
if: ${{ matrix.os == 'ubuntu-22.04'}}
- name: show-gcc
run: |
sudo apt-get update
sudo apt-get install -y libopenblas-dev libatlas-base-dev liblapack-dev gfortran libgmp-dev libmpfr-dev libsuitesparse-dev ccache libmpc-dev
gcc --version

- name: Install Python packages
run: |
python -m pip install -r build_requirements.txt
python -m pip install spin
python -m pip install -r test_requirements.txt
python -m spin setup-submodule
python -m pip install .[build]
python -m pip install .[test]

- name: Prepare compiler cache
id: prep-ccache
@@ -153,7 +157,7 @@ jobs:
python-version: ${{ matrix.python-version }}
architecture: "x64"
cache: "pip"
cache-dependency-path: "requirements.txt"
cache-dependency-path: "pyproject.toml"

- name: show-gcc
run: |
@@ -172,9 +176,10 @@

- name: Install Python packages
run: |
python -m pip install -r build_requirements.txt
python -m pip install spin
python -m pip install -r test_requirements.txt
python -m spin setup-submodule
python -m pip install .[build]
python -m pip install .[test]

- name: Prepare compiler cache
id: prep-ccache
@@ -232,7 +237,7 @@ jobs:
- name: Upload coverage stats to codecov
uses: codecov/codecov-action@v4
with:
# python spin goes into the INSTALLED path in order to run pytest
# spin goes into the INSTALLED path in order to run pytest
files: ./coverage.xml
fail_ci_if_error: true
verbose: true
@@ -258,7 +263,7 @@ jobs:
# python-version: "3.10"
# architecture: "x64"
# cache: "pip"
# cache-dependency-path: "requirements.txt"
# cache-dependency-path: "pyproject.toml"

# - name: install-rtools
# run: |
@@ -287,8 +292,8 @@ jobs:

# - name: pip-packages
# run: |
# pip install -r build_requirements.txt
# pip install -r test_requirements.txt
# pip install .[build]
# pip install .[test]
# pip install spin
# pip install numpy==1.22.4

@@ -319,8 +324,8 @@ jobs:
# run: |
# echo "SCIPY_USE_PROPACK=1" >> $env:GITHUB_ENV
# echo "FORCE_SUBMODULE=True" >> $env:GITHUB_ENV
# python spin setup_submodule --forcesubmodule
# python spin build -j 2
# spin setup_submodule --forcesubmodule
# spin build -j 2
# # Necessary because GitHub Actions checks out the repo to D:\ while OpenBLAS
# # got installed to C:\ higher up. The copying with `--win-cp-openblas` fails
# # when things are split over drives.
@@ -329,4 +334,4 @@ jobs:

# - name: test
# run: |
# python spin test
# spin test
4 changes: 2 additions & 2 deletions .github/workflows/release.yml
@@ -32,8 +32,8 @@ jobs:
architecture: "x64"
- name: Install dependencies
run: |
python -m pip install --progress-bar off --upgrade pip
python -m pip install --progress-bar off build twine
pip install --progress-bar off --upgrade pip
pip install --progress-bar off build twine
- name: Prepare environment
run: |
echo "RELEASE_VERSION=${GITHUB_REF#refs/tags/v}" >> $GITHUB_ENV
5 changes: 4 additions & 1 deletion .github/workflows/style.yml
@@ -41,7 +41,10 @@ jobs:
sudo apt-get install -y libopenblas-dev libatlas-base-dev liblapack-dev

- name: Install dependencies
run: pip install -r style_requirements.txt
run: |
pip install --upgrade pip spin
spin setup-submodule
pip install .[style]

# check formatting of the code style
- name: Check code formatting
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -93,7 +93,7 @@ When you're ready to contribute code to address an open issue, please follow the

Once your virtual environment is activated, you can install your local clone in "editable mode" with

pip install -r build_requirements.txt
pip install .[build]
pip install -e .

The "editable mode" comes from the `-e` argument to `pip`, and essentially just creates a symbolic link from the site-packages directory of your virtual environment to the source code in your local clone. That way any changes you make will be immediately reflected in your virtual environment.
10 changes: 5 additions & 5 deletions DEVELOPING.md
@@ -39,14 +39,14 @@ Assuming these steps have worked properly and you have read and followed any nec

If you are developing locally, you will need the build dependencies to compile the Cython / C++ code:

pip install -r build_requirements.txt
pip install .[build]

Other requirements can be installed as such:

pip install -r requirements.txt
pip install -r style_requirements.txt
pip install -r test_requirements.txt
pip install -r doc_requirements.txt
pip install .
pip install .[style]
pip install .[test]
pip install .[doc]

# Building the project from source

2 changes: 1 addition & 1 deletion README.md
@@ -52,7 +52,7 @@ Building locally with Meson (For developers)
Make sure you have the necessary packages installed

# install build dependencies
pip install -r build_requirements.txt
pip install .[build]

# you may need these optional dependencies to build scikit-learn locally
conda install -c conda-forge joblib threadpoolctl pytest compilers llvm-openmp
Binary file added clf.joblib
Binary file not shown.
2 changes: 1 addition & 1 deletion doc/install.rst
@@ -41,7 +41,7 @@ Then run installation of build packages

.. code-block:: bash

pip install -r build_requirements.txt
pip install .[build]
pip install spin

# use spin CLI to run Meson build locally
16 changes: 8 additions & 8 deletions pyproject.toml
@@ -1,16 +1,16 @@
[build-system]
build-backend = "mesonpy"
requires = [
"meson-python>=0.13.0",
"meson-python>=0.15.0",
'ninja',
# `wheel` is needed for non-isolated builds, given that `meson-python`
# doesn't list it as a runtime requirement (at least in 0.10.0)
# See https://github.com/FFY00/meson-python/blob/main/pyproject.toml#L4
"wheel",
"setuptools<=65.5",
"packaging",
"Cython>=0.29.24",
"scikit-learn>=1.4",
"Cython>=3.0.8",
"scikit-learn>=1.4.1",
"scipy>=1.5.0",
"numpy>=1.25; python_version>='3.9'"
]
@@ -49,9 +49,9 @@ include = [
{path = "meson.build", format = "sdist"}
]
dependencies = [
'numpy',
'numpy>=1.25.0',
'scipy>=1.5.0',
'scikit-learn>=1.3.1'
'scikit-learn>=1.4.1'
]

[project.optional-dependencies]
@@ -68,10 +68,10 @@ build = [
'meson-python',
'spin',
'doit',
'scikit-learn>=1.3.1',
'Cython>=0.29.36',
'scikit-learn>=1.4.1',
'Cython>=3.0.8',
'ninja',
'numpy',
'numpy>=1.25.0',
'rich-click',
'pydevtool'
]
2 changes: 1 addition & 1 deletion sktree/_lib/sklearn_fork
Submodule sklearn_fork updated 68 files
+1 −1 build_tools/azure/debian_atlas_32bit_lock.txt
+21 −21 build_tools/azure/pylatest_conda_forge_mkl_linux-64_conda.lock
+12 −12 build_tools/azure/pylatest_conda_forge_mkl_osx-64_conda.lock
+4 −4 build_tools/azure/pylatest_conda_mkl_no_openmp_osx-64_conda.lock
+10 −10 build_tools/azure/pylatest_pip_openblas_pandas_linux-64_conda.lock
+13 −9 build_tools/azure/pymin_conda_defaults_openblas_linux-64_conda.lock
+13 −13 build_tools/azure/pymin_conda_forge_mkl_win-64_conda.lock
+17 −17 build_tools/azure/pymin_conda_forge_openblas_ubuntu_2204_linux-64_conda.lock
+4 −4 build_tools/azure/pypy3_linux-64_conda.lock
+1 −1 build_tools/azure/ubuntu_atlas_lock.txt
+33 −33 build_tools/circle/doc_linux-64_conda.lock
+16 −16 build_tools/circle/doc_min_dependencies_linux-64_conda.lock
+4 −4 build_tools/cirrus/pymin_conda_forge_linux-aarch64_conda.lock
+1 −0 doc/glossary.rst
+2 −0 doc/metadata_routing.rst
+626 −550 doc/modules/clustering.rst
+2 −2 doc/modules/linear_model.rst
+10 −0 doc/whats_new/v1.4.rst
+45 −0 doc/whats_new/v1.5.rst
+4 −4 sklearn/cluster/_agglomerative.py
+8 −0 sklearn/cluster/_hdbscan/hdbscan.py
+3 −1 sklearn/cluster/_optics.py
+2 −2 sklearn/cluster/_spectral.py
+21 −0 sklearn/cluster/tests/test_optics.py
+15 −5 sklearn/compose/_target.py
+9 −1 sklearn/compose/tests/test_target.py
+2 −2 sklearn/datasets/_arff_parser.py
+34 −2 sklearn/datasets/_base.py
+27 −3 sklearn/datasets/_california_housing.py
+19 −2 sklearn/datasets/_covtype.py
+28 −3 sklearn/datasets/_kddcup99.py
+48 −5 sklearn/datasets/_lfw.py
+19 −2 sklearn/datasets/_olivetti_faces.py
+22 −3 sklearn/datasets/_rcv1.py
+14 −0 sklearn/datasets/_samples_generator.py
+31 −5 sklearn/datasets/_species_distributions.py
+43 −4 sklearn/datasets/_twenty_newsgroups.py
+28 −0 sklearn/datasets/tests/test_base.py
+10 −1 sklearn/decomposition/_dict_learning.py
+1 −1 sklearn/ensemble/_bagging.py
+1 −1 sklearn/ensemble/_iforest.py
+1 −1 sklearn/linear_model/_huber.py
+130 −27 sklearn/linear_model/_ransac.py
+4 −4 sklearn/linear_model/tests/test_ransac.py
+157 −118 sklearn/manifold/_locally_linear.py
+37 −9 sklearn/manifold/_spectral_embedding.py
+1 −1 sklearn/manifold/_t_sne.py
+1 −1 sklearn/metrics/pairwise.py
+1 −1 sklearn/model_selection/_split.py
+2 −2 sklearn/model_selection/_validation.py
+115 −54 sklearn/multioutput.py
+4 −4 sklearn/neighbors/_graph.py
+5 −1 sklearn/neighbors/_nca.py
+16 −4 sklearn/neighbors/tests/test_nca.py
+2 −2 sklearn/preprocessing/_label.py
+17 −7 sklearn/tests/metadata_routing_common.py
+5 −3 sklearn/tests/test_metadata_routing.py
+118 −20 sklearn/tests/test_metaestimators_metadata_routing.py
+62 −33 sklearn/tests/test_multioutput.py
+7 −1 sklearn/tests/test_public_functions.py
+9 −287 sklearn/utils/__init__.py
+175 −0 sklearn/utils/_chunking.py
+115 −0 sklearn/utils/_mask.py
+2 −2 sklearn/utils/_metadata_requests.py
+73 −0 sklearn/utils/tests/test_chunking.py
+5 −0 sklearn/utils/tests/test_param_validation.py
+0 −69 sklearn/utils/tests/test_utils.py
+7 −2 sklearn/utils/validation.py
14 changes: 9 additions & 5 deletions sktree/tree/_oblique_splitter.pxd
@@ -31,8 +31,12 @@ cdef struct ObliqueSplitRecord:
float64_t impurity_left # Impurity of the left split.
float64_t impurity_right # Impurity of the right split.

vector[float32_t]* proj_vec_weights # weights of the vector (max_features,)
vector[intp_t]* proj_vec_indices # indices of the features (max_features,)
# XXX: By storing the values, they will have to be copied into the split record
# which may be inefficient.
# Inside the tree, they will also again be copied into a Node, which will
# in total incur two copies, whereas we ideally only need one copy.
vector[float32_t] proj_vec_weights # weights of the vector (max_features,)
vector[intp_t] proj_vec_indices # indices of the features (max_features,)
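The trade-off flagged in the XXX comment above — storing the vectors by value means every record-to-record assignment copies their contents, while the `&` reference parameters elsewhere in this diff avoid a copy at the call site — can be sketched in plain C++. This is an illustrative sketch with hypothetical names, not code from this PR:

```cpp
#include <vector>

// Record holding its projection weights by value, mirroring the PR's
// change from vector[float32_t]* to vector[float32_t] in the struct.
struct SplitRecord {
    std::vector<float> proj_vec_weights;
};

// Reference parameter, as in the .pxd's `vector[float32_t]&`:
// the caller's vector is written in place, with no copy on entry.
void fill_weights(std::vector<float>& weights) {
    weights.assign({0.5f, -0.5f});
}

// Demonstrates the cost of value storage: assigning one record to
// another deep-copies the vector, so the two records end up with
// distinct heap buffers.
bool record_copy_duplicates_storage() {
    SplitRecord best;
    fill_weights(best.proj_vec_weights);  // in-place write, no copy
    SplitRecord current = best;           // value semantics: deep copy
    return current.proj_vec_weights.data() != best.proj_vec_weights.data();
}
```

This is the "two copies" the comment worries about: one copy into the split record, and later another into the tree's Node, whereas pointer storage (the pre-PR layout) paid neither but risked the dangling-pointer/segfault behavior this PR fixes.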


cdef class BaseObliqueSplitter(Splitter):
@@ -43,7 +47,7 @@ cdef class BaseObliqueSplitter(Splitter):
cdef vector[vector[float32_t]] proj_mat_weights # nonzero weights of sparse proj_mat matrix
cdef vector[vector[intp_t]] proj_mat_indices # nonzero indices of sparse proj_mat matrix

# TODO: assumes all oblique splitters only work with dense data
# TODO: assumes all oblique splitters only work with dense data; make work with sparse arrays too
cdef const float32_t[:, :] X

# feature weights across (n_dims,)
@@ -72,8 +76,8 @@ cdef class BaseObliqueSplitter(Splitter):
intp_t end,
const intp_t[:] samples,
float32_t[:] feature_values,
vector[float32_t]* proj_vec_weights, # weights of the vector (max_features,)
vector[intp_t]* proj_vec_indices # indices of the features (max_features,)
vector[float32_t]& proj_vec_weights, # weights of the vector (max_features,)
vector[intp_t]& proj_vec_indices # indices of the features (max_features,)
) noexcept nogil

cdef int node_split(
Expand Down