
h2o4gpu: Genetic algorithm along with Random Forest Regression produces error: terminate called after throwing an instance of 'thrust::system::system_error' what(): parallel_for failed: out of memory #789

Open
Geerthy11 opened this issue Jul 24, 2019 · 1 comment
@Geerthy11

I am working on feature selection using a Genetic Algorithm (GA) with a Random Forest regression model (h2o4gpu.RandomForestRegressor); a rough sketch of the setup is included after the specifications below. The number of estimators is 100, and the rest of the parameters are left at their defaults. The fitness function for the GA is the RF model's MAE. My dataset is 1.51 MB with dimensions 4000*44. However, whenever I run the program, I get errors of the following kind after a certain number of iterations (say 30-40):

terminate called after throwing an instance of 'thrust::system::system_error'
what(): parallel_for failed: out of memory
Aborted (core dumped)

terminate called after throwing an instance of 'dmlc::Error'
what(): [08:58:38] /workspace/include/xgboost/./../../src/common/common.h:41: /workspace/src/tree/../common/device_helpers.cuh: 422: out of memory
Stack trace:
[bt] (0) /conda/envs/rapids/xgboost/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x24) [0x7f3f0b07fcb4]
[bt] (1) /conda/envs/rapids/xgboost/libxgboost.so(+0x3267e2) [0x7f3f0b2a57e2]
[bt] (2) /conda/envs/rapids/xgboost/libxgboost.so(xgboost::tree::DeviceShard<xgboost::detail::GradientPairInternal >::EvaluateSplits(std::vector<int, std::allocator >, xgboost::RegTree const&, unsigned long)+0x1041) [0x7f3f0b2b48b1]
[bt] (3) /conda/envs/rapids/xgboost/libxgboost.so(xgboost::tree::DeviceShard<xgboost::detail::GradientPairInternal >::UpdateTree(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, xgboost::RegTree*, dh::AllReducer*)+0x131e) [0x7f3f0b2c7dfe]
[bt] (4) /conda/envs/rapids/xgboost/libxgboost.so(+0x34a201) [0x7f3f0b2c9201]
[bt] (5) /conda/envs/rapids/bin/../lib/libgomp.so.1(GOMP_parallel+0x42) [0x7f3f1c5bee92]
[bt] (6) /conda/envs/rapids/xgboost/libxgboost.so(xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal >::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, std::vector<xgboost::RegTree*, std::allocatorxgboost::RegTree* > const&)+0x918) [0x7f3f0b2bae98]
[bt] (7) /conda/envs/rapids/xgboost/libxgboost.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_deletexgboost::RegTree >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_deletexgboost::RegTree > > >)+0xa81) [0x7f3f0b105791]
[bt] (8) /conda/envs/rapids/xgboost/libxgboost.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::ObjFunction)+0xd65) [0x7f3f0b106c95]

Aborted (core dumped)

The following are the specifications:
Ubuntu 16.04.6 LTS
Python 3.6.8
CUDA 10.2 / cuDNN 7.4.1
GPU model: Quadro GV100
NVIDIA Docker version: 18.09.6
RAM: 125 GB
H2O4GPU installed using the pip wheel for CUDA 10.0 (https://s3.amazonaws.com/h2o-release/h2o4gpu/releases/stable/ai/h2o/h2o4gpu/0.3-cuda10/h2o4gpu-0.3.2-cp36-cp36m-linux_x86_64.whl)

Kindly provide your suggestions on this issue.
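
For reference, the loop looks roughly like the minimal sketch below. It is not my exact code: the real GA operators, data loading, and train/test handling are omitted, a random feature mask stands in for the GA individuals, and it only assumes the scikit-learn-style h2o4gpu.RandomForestRegressor API with MAE as the fitness value.

```python
import numpy as np
from h2o4gpu import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((4000, 44))   # stands in for the real 4000 x 44 dataset
y = rng.random(4000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def fitness(mask):
    """Fitness of one GA individual: MAE of an RF trained on the masked feature subset."""
    cols = np.flatnonzero(mask)
    model = RandomForestRegressor(n_estimators=100)  # remaining parameters left at defaults
    model.fit(X_tr[:, cols], y_tr)
    return mean_absolute_error(y_te, model.predict(X_te[:, cols]))

# Stand-in for the GA: evaluate one random feature mask per iteration.
# The out-of-memory error above appears after roughly 30-40 such evaluations.
for iteration in range(40):
    mask = rng.integers(0, 2, size=X.shape[1]).astype(bool)
    mask[0] = True               # keep at least one feature selected
    print(iteration, fitness(mask))
```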

sh1ng self-assigned this Sep 9, 2019
@sh1ng
Contributor

sh1ng commented Sep 9, 2019

Could you provide a code snippet to reproduce it?
