Medium Article on Tree-Boosting for Spatial Data is hanging on Mac #84

Open
IanQS opened this issue Jan 25, 2023 · 7 comments
Labels
bug Something isn't working

Comments


IanQS commented Jan 25, 2023

While trying to run the code for Tree-Boosting for Spatial Data, the code seems to hang. The code below is more or less lifted from the article.

My machine

  • Python 3.11
  • conda 4.12.0
  • macOS Ventura 13.1

Steps to replicate

Top of the script

import numpy as np
np.random.seed(1)
# Simulate Gaussian process: training and test data (the latter on a grid for visualization)
sigma2_1 = 0.35  # marginal variance of GP
rho = 0.1  # range parameter
sigma2 = 0.1  # error variance
n = 200  # number of training samples
nx = 50 # test data: number of grid points on each axis
# training locations (exclude upper right rectangle)
coords = np.column_stack((np.random.uniform(size=1)/2, np.random.uniform(size=1)/2))
while coords.shape[0] < n:
    coord_i = np.random.uniform(size=2)
    if not (coord_i[0] >= 0.6 and coord_i[1] >= 0.6):
        coords = np.vstack((coords,coord_i))
# test locations (rectangular grid)
s_1 = np.ones(nx * nx)
s_2 = np.ones(nx * nx)
for i in range(nx):
    for j in range(nx):
        s_1[j * nx + i] = (i + 1) / nx
        s_2[i * nx + j] = (i + 1) / nx
coords_test = np.column_stack((s_1, s_2))
n_all = nx**2 + n # total number of data points 
coords_all = np.vstack((coords_test,coords))
D = np.zeros((n_all, n_all))  # distance matrix
for i in range(0, n_all):
    for j in range(i + 1, n_all):
        D[i, j] = np.linalg.norm(coords_all[i, :] - coords_all[j, :])
        D[j, i] = D[i, j]
Sigma = sigma2_1 * np.exp(-D / rho) + np.diag(np.zeros(n_all) + 1e-10)
C = np.linalg.cholesky(Sigma)
b_all = C.dot(np.random.normal(size=n_all))
b_train = b_all[(nx*nx):n_all] # training data GP
# Mean function. Use two predictor variables of which only one has an effect for easy visualization
def f1d(x):
    return np.sin(3*np.pi*x) + (1 + 3 * np.maximum(np.zeros(len(x)),x-0.5)/(x-0.5)) - 3
X = np.random.rand(n, 2)
F_X_train = f1d(X[:, 0]) # mean
xi_train = np.sqrt(sigma2) * np.random.normal(size=n)  # simulate error term
y = F_X_train + b_train + xi_train  # observed data
# test data
x = np.linspace(0,1,nx**2)
x[x==0.5] = 0.5 + 1e-10
X_test = np.column_stack((x,np.zeros(nx**2)))
F_X_test = f1d(X_test[:, 0])
b_test = b_all[0:(nx**2)]
xi_test = np.sqrt(sigma2) * np.random.normal(size=(nx**2))
y_test = F_X_test + b_test + xi_test
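(An aside for anyone re-running this: the nested-loop distance matrix above is quadratic-time pure Python and dominates the setup cost for larger n. A hedged sketch of an equivalent vectorized NumPy construction, with the loop version reproduced only for comparison; the variable names here mirror the script but the coordinates are freshly simulated:)

```python
import numpy as np

rng = np.random.default_rng(1)
coords_all = rng.uniform(size=(300, 2))  # stand-in for the script's coords_all

# Vectorized pairwise Euclidean distances via broadcasting
diff = coords_all[:, None, :] - coords_all[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))

# Nested-loop construction from the script above, for comparison
n_all = coords_all.shape[0]
D_loop = np.zeros((n_all, n_all))
for i in range(n_all):
    for j in range(i + 1, n_all):
        D_loop[i, j] = np.linalg.norm(coords_all[i] - coords_all[j])
        D_loop[j, i] = D_loop[i, j]

assert np.allclose(D, D_loop)  # identical matrices
```
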

Bottom of the script

Modeling

import gpboost as gpb
gp_model = gpb.GPModel(gp_coords=coords, cov_function="exponential")
data_train = gpb.Dataset(X, y)
params = { 'objective': 'regression_l2', 'learning_rate': 0.01,
            'max_depth': 3, 'min_data_in_leaf': 10, 
            'num_leaves': 2**10, 'verbose': 1}
# Training
bst = gpb.train(params=params, train_set=data_train,
                gp_model=gp_model, num_boost_round=247)
gp_model.summary() # Estimated covariance parameters
# Make predictions: latent variables and response variable
pred = bst.predict(data=X_test, gp_coords_pred=coords_test,  
                   predict_var=True, pred_latent=True)
# pred['fixed_effect']: predictions from the tree-ensemble.
# pred['random_effect_mean']: predicted means of the gp_model.
# pred['random_effect_cov']: predicted (co-)variances  of the gp_model
pred_resp = bst.predict(data=X_test, gp_coords_pred=coords_test, 
                        predict_var=False, pred_latent=False)
y_pred = pred_resp['response_mean'] # predicted response mean
# Calculate mean square error
np.mean((y_pred-y_test)**2)

It has been running for about 5 minutes now and is still going...
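(For future readers hitting the same hang: instead of waiting indefinitely, the suspect call can be wrapped in a watchdog that terminates it after a deadline, which makes "hangs" vs. "is just slow" easy to distinguish. A minimal stdlib sketch; `run_with_timeout` and `placeholder_step` are hypothetical names, and the placeholder stands in for the real `gpb.GPModel(...)` / `gpb.train(...)` call:)

```python
import multiprocessing as mp

def _worker(q, fn, args):
    # Runs in the child process and ships the result back
    q.put(fn(*args))

def run_with_timeout(fn, args=(), timeout=60):
    """Run fn(*args) in a child process; raise TimeoutError if it does not finish."""
    q = mp.Queue()
    proc = mp.Process(target=_worker, args=(q, fn, args))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()
        proc.join()
        raise TimeoutError(f"call did not finish within {timeout}s")
    return q.get()

def placeholder_step():
    # Stands in for e.g. gpb.GPModel(gp_coords=coords, cov_function="exponential")
    return "done"

if __name__ == "__main__":
    print(run_with_timeout(placeholder_step, timeout=30))
```
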


IanQS commented Jan 25, 2023

Works just fine on my Linux machine


fabsig commented Jan 25, 2023

Glad to hear that it runs on Linux. I hope that it also runs on the Mac (computational performance obviously depends on the machine you are using). Otherwise, it is hard to tell from a distance what might have gone wrong. Do these examples run on your Mac?

@IanQS IanQS changed the title Medium Article on Tree-Boosting for Spatial Data is hanging on Medium Article on Tree-Boosting for Spatial Data is hanging on Mac Jan 25, 2023

IanQS commented Jan 25, 2023

Nope, it hangs :( It stalls on the instantiation line, gp_model = gpb.GPModel(group_data=group, likelihood=likelihood)


fabsig commented Jan 26, 2023

That's not good. Unfortunately, I cannot reproduce this on my Apple silicon machine, where it works without any problems. I might investigate this issue sometime in the future. For the time being, the only thing I can recommend is trying to install from source: https://github.com/fabsig/GPBoost/tree/master/python-package#installation-from-source


fabsig commented Jan 26, 2023

FWIW, several Python packages seem to have problems on M1 macs; see, e.g., microsoft/LightGBM#4843


IanQS commented Jan 27, 2023

Ahh, gotcha! I'm running into

Exception: Please install CMake and all required dependencies first
The full version of error log was saved into /Users/ianquah/GPBoost_compilation.log

when doing an installation from source from GitHub, at the python setup.py install step.


When doing an installation from source from PyPI, it installs just fine (pip install --no-binary :all: gpboost), but it hangs again.

@fabsig fabsig added the bug Something isn't working label Feb 23, 2023
@StephenRogers1

I found this problematic as well and reproduced the stalling on the instantiation line gp_model = gpb.GPModel(group_data=group, likelihood=likelihood) on an Apple M2 machine (Python 3.9, conda 23.7.2, macOS 13.0).

The error seems to be caused by a conflict between conda-forge package installations and the pip-installed gpboost. That is, packages that (I think) share dependencies with gpboost should be installed using pip.

Steps to reproduce error

brew install miniforge
conda create -n env_conda -c conda-forge python=3.9
conda activate env_conda
pip install gpboost -U
conda install lightgbm

(Note: conda install scikit-learn also produces the error.)

Then running gp_model = gpb.GPModel(group_data=group, likelihood=likelihood) will hang.

Steps to fix error

Since the shared dependencies apparently need to be installed by pip, either making sure not to conda install any of them or using a Python virtualenv fixes this:

  1. Using miniforge
... (as above)
pip install lightgbm
  2. Using virtualenv
python3 -m pip install --user virtualenv
python3 -m venv env
source env/bin/activate
python3 -m pip install gpboost

Doing either and then running gp_model = gpb.GPModel(group_data=group, likelihood=likelihood) will work

If you followed brew install miniforge, use conda list to make sure all shared packages (i.e. scikit-learn, lightgbm, ...) are installed through pip; this should fix the hanging issue.
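(A hedged sketch for auditing this from Python rather than eyeballing conda list: pip writes an INSTALLER file into each package's dist-info metadata, so reading it shows which packages pip itself installed; conda-installed packages typically record something else or nothing. installer_of is a hypothetical helper name:)

```python
from importlib import metadata

def installer_of(pkg):
    """Return the package's recorded installer (e.g. 'pip'), or None if not installed."""
    try:
        text = metadata.distribution(pkg).read_text("INSTALLER")
    except metadata.PackageNotFoundError:
        return None
    # Some distributions ship no INSTALLER file at all
    return text.strip() if text else "unknown"

for pkg in ("scikit-learn", "lightgbm", "gpboost"):
    print(pkg, "->", installer_of(pkg))
```
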
