Ray problem: ray.tune.error.TuneError: ('Trials did not complete', [NeuroCard_44b5b_00000]) #4

Open
Doris404 opened this issue Oct 10, 2021 · 14 comments

@Doris404

I tried to fix this with the help of Google, but it was no use. Stack Overflow does not have a correct answer (I tried changing package versions, which did not succeed), so I am posting here hoping to get some advice.

More details about my problem:
Linux 5.4.0-84-generic #94-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux
cuda version 11.4.20210728
The packages:

Name Version Build Channel

_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 1_llvm conda-forge
_pytorch_select 0.2 gpu_0
_tflow_select 2.3.0 mkl
absl-py 0.9.0 py37hc8dfbb8_1 conda-forge
aiohttp 3.7.4.post0 py37h5e8e339_0 conda-forge
aiohttp-cors 0.7.0 pypi_0 pypi
aioredis 2.0.0 pypi_0 pypi
argh 0.26.2 pypi_0 pypi
arrow-cpp 0.11.1 py37h0e61e49_1004 conda-forge
astor 0.8.1 pyh9f0ad1d_0 conda-forge
async-timeout 3.0.1 py_1000 conda-forge
attrs 21.2.0 pyhd8ed1ab_0 conda-forge
beautifulsoup4 4.10.0 pypi_0 pypi
blas 1.0 mkl conda-forge
blessings 1.7 pypi_0 pypi
blinker 1.4 py_1 conda-forge
boost-cpp 1.68.0 h11c811c_1000 conda-forge
brotlipy 0.7.0 py37h5e8e339_1001 conda-forge
bzip2 1.0.8 h7f98852_4 conda-forge
c-ares 1.17.2 h7f98852_0 conda-forge
ca-certificates 2021.10.8 ha878542_0 conda-forge
cached-property 1.5.2 hd8ed1ab_1 conda-forge
cached_property 1.5.2 pyha770c72_1 conda-forge
cachetools 4.2.2 pypi_0 pypi
certifi 2021.5.30 pypi_0 pypi
cffi 1.14.6 py37hc58025e_0 conda-forge
chardet 4.0.0 py37h89c1867_1 conda-forge
charset-normalizer 2.0.6 pypi_0 pypi
click 8.0.1 py37h89c1867_0 conda-forge
cloudpickle 2.0.0 pypi_0 pypi
colorama 0.4.4 pypi_0 pypi
colorful 0.5.4 pypi_0 pypi
configparser 3.8.1 pypi_0 pypi
cryptography 3.4.7 py37h5d9358c_0 conda-forge
cudatoolkit 10.1.243 h036e899_9 conda-forge
cudnn 7.6.5.32 hc0a50b0_1 conda-forge
dataclasses 0.8 pyhc8e2a94_3 conda-forge
decorator 5.1.0 pyhd8ed1ab_0 conda-forge
docker-pycreds 0.4.0 pypi_0 pypi
filelock 3.0.12 pypi_0 pypi
funcsigs 1.0.2 pypi_0 pypi
gast 0.2.2 py_0 conda-forge
gitdb 4.0.7 pypi_0 pypi
gitpython 1.0.0 pypi_0 pypi
glog 0.3.1 pypi_0 pypi
google 3.0.0 pypi_0 pypi
google-api-core 1.31.3 pypi_0 pypi
google-auth 1.35.0 pyh6c4a22f_0 conda-forge
google-auth-oauthlib 0.4.6 pyhd8ed1ab_0 conda-forge
google-pasta 0.2.0 pyh8c360ce_0 conda-forge
googleapis-common-protos 1.53.0 pypi_0 pypi
gpustat 0.4.1 pypi_0 pypi
gql 0.3.0 py_0 conda-forge
graphql-core 1.1 pypi_0 pypi
grpcio 1.40.0 pypi_0 pypi
h5py 3.3.0 nompi_py37ha3df211_100 conda-forge
hdf5 1.10.6 nompi_h3c11f04_101 conda-forge
icu 58.2 hf484d3e_1000 conda-forge
idna 3.2 pypi_0 pypi
importlib-metadata 4.8.1 py37h89c1867_0 conda-forge
iniconfig 1.1.1 pypi_0 pypi
jsonschema 3.2.0 pypi_0 pypi
keras-applications 1.0.8 py_1 conda-forge
keras-preprocessing 1.1.2 pyhd8ed1ab_0 conda-forge
krb5 1.16.4 h2fd8d38_0 conda-forge
ld_impl_linux-64 2.36.1 hea4e1c9_2 conda-forge
libblas 3.9.0 8_mkl conda-forge
libcblas 3.9.0 8_mkl conda-forge
libedit 3.1.20191231 he28a2e2_2 conda-forge
libffi 3.3 h58526e2_2 conda-forge
libgcc-ng 11.2.0 h1d223b6_9 conda-forge
libgfortran-ng 7.5.0 h14aa051_19 conda-forge
libgfortran4 7.5.0 h14aa051_19 conda-forge
liblapack 3.9.0 8_mkl conda-forge
libpq 11.5 hd9ab2ff_2 conda-forge
libprotobuf 3.6.1 hdbcaa40_1001 conda-forge
libstdcxx-ng 11.2.0 he4da1e4_9 conda-forge
libzlib 1.2.11 h36c2ea0_1013 conda-forge
llvm-openmp 12.0.1 h4bd325d_1 conda-forge
mako 1.1.3 pyh9f0ad1d_0 conda-forge
markdown 3.3.4 pyhd8ed1ab_0 conda-forge
markupsafe 2.0.1 py37h5e8e339_0 conda-forge
mkl 2020.4 h726a3e6_304 conda-forge
mkl-service 2.3.0 py37h8f50634_2 conda-forge
msgpack 1.0.2 pypi_0 pypi
multidict 5.1.0 pypi_0 pypi
ncurses 6.2 h58526e2_4 conda-forge
networkx 2.4 py_1 conda-forge
ninja 1.10.2 h4bd325d_1 conda-forge
numpy 1.18.4 py37h8960a57_0 conda-forge
nvidia-ml-py3 7.352.0 pypi_0 pypi
oauthlib 3.1.1 pyhd8ed1ab_0 conda-forge
opencensus 0.7.13 pypi_0 pypi
opencensus-context 0.1.2 pypi_0 pypi
openssl 1.1.1l h7f98852_0 conda-forge
opt_einsum 3.3.0 pyhd8ed1ab_1 conda-forge
packaging 21.0 pypi_0 pypi
pandas 1.0.5 py37h0da4684_0 conda-forge
parquet-cpp 1.5.1 3 conda-forge
pathtools 0.1.2 pypi_0 pypi
pillow 8.3.2 pypi_0 pypi
pip 21.2.4 pyhd8ed1ab_0 conda-forge
pipdeptree 2.1.0 pypi_0 pypi
pluggy 1.0.0 pypi_0 pypi
prometheus-client 0.11.0 pypi_0 pypi
promise 2.3 py37h89c1867_4 conda-forge
protobuf 3.17.3 pypi_0 pypi
psutil 5.0.0 pypi_0 pypi
psycopg2 2.8.4 py37h1ba5d50_0
py 1.10.0 pypi_0 pypi
py-spy 0.3.9 pypi_0 pypi
py4j 0.10.7 py_1 conda-forge
pyarrow 0.11.1 py37hbbcf98d_1002 conda-forge
pyasn1 0.4.8 py_0 conda-forge
pyasn1-modules 0.2.8 pypi_0 pypi
pycparser 2.20 pyh9f0ad1d_2 conda-forge
pyjwt 2.2.0 pyhd8ed1ab_0 conda-forge
pyopenssl 21.0.0 pyhd8ed1ab_0 conda-forge
pyparsing 2.4.7 pypi_0 pypi
pyrsistent 0.18.0 pypi_0 pypi
pysocks 1.7.1 py37h89c1867_3 conda-forge
pyspark 2.4.3 py_0 conda-forge
pytest 6.2.5 pypi_0 pypi
python 3.7.7 hcff3b4d_5
python-dateutil 2.8.2 pyhd8ed1ab_0 conda-forge
python-gflags 3.1.2 pypi_0 pypi
python_abi 3.7 2_cp37m conda-forge
pytorch 1.4.0 cuda101py37h02f0884_0
pytz 2021.3 pyhd8ed1ab_0 conda-forge
pyu2f 0.1.5 pyhd8ed1ab_0 conda-forge
pyyaml 3.10 pypi_0 pypi
ray 0.8.6 pypi_0 pypi
readline 8.1 h46c0cb4_0 conda-forge
redis 3.4.1 pypi_0 pypi
requests 2.26.0 pypi_0 pypi
requests-oauthlib 1.3.0 pyh9f0ad1d_0 conda-forge
rsa 4.7.2 pyh44b312d_0 conda-forge
rx 3.2.0 pyhd8ed1ab_0 conda-forge
scipy 1.4.1 py37ha3d9a3c_3 conda-forge
sentry-sdk 0.4.0 pypi_0 pypi
setuptools 58.2.0 py37h89c1867_0 conda-forge
shortuuid 0.5.0 pypi_0 pypi
six 1.16.0 pyh6c4a22f_0 conda-forge
smmap 4.0.0 pypi_0 pypi
soupsieve 2.2.1 pypi_0 pypi
sqlite 3.36.0 h9cd32fc_2 conda-forge
subprocess32 3.5.4 pypi_0 pypi
tabulate 0.8.7 pypi_0 pypi
tensorboard 1.15.0 py37_0 conda-forge
tensorboard-data-server 0.6.0 py37hf1a17b8_0 conda-forge
tensorboard-plugin-wit 1.8.0 pyh44b312d_0 conda-forge
tensorboardx 2.4 pypi_0 pypi
tensorflow 1.15.0 mkl_py37h28c19af_0
tensorflow-base 1.15.0 mkl_py37he1670d9_0
tensorflow-estimator 1.15.1 pyh2649769_0
termcolor 1.1.0 py_2 conda-forge
thrift-cpp 0.12.0 h0a07b25_1002 conda-forge
tk 8.6.11 h27826a3_1 conda-forge
toml 0.10.2 pypi_0 pypi
torchvision 0.5.0 pypi_0 pypi
typing-extensions 3.10.0.2 hd8ed1ab_0 conda-forge
typing_extensions 3.10.0.2 pyha770c72_0 conda-forge
urllib3 1.26.7 pyhd8ed1ab_0 conda-forge
wandb 0.8.36 pypi_0 pypi
watchdog 0.8.3 pypi_0 pypi
werkzeug 0.16.1 py_0 conda-forge
wheel 0.37.0 pyhd8ed1ab_1 conda-forge
wrapt 1.13.1 py37h5e8e339_0 conda-forge
xz 5.2.5 h516909a_1 conda-forge
yapf 0.27.0 py_0 conda-forge
yarl 1.6.3 pypi_0 pypi
zipp 3.6.0 pyhd8ed1ab_0 conda-forge
zlib 1.2.11 h36c2ea0_1013 conda-forge
The error.txt I get:
Failure # 1 (occurred at 2021-10-09_11-15-28)
Traceback (most recent call last):
File "/home/liujw/miniconda3/envs/neurocard/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 471, in _process_trial
result = self.trial_executor.fetch_result(trial)
File "/home/liujw/miniconda3/envs/neurocard/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 430, in fetch_result
result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
File "/home/liujw/miniconda3/envs/neurocard/lib/python3.7/site-packages/ray/worker.py", line 1538, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::NeuroCard.train() (pid=3821066, ip=10.77.110.215)
File "python/ray/_raylet.pyx", line 439, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 474, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 478, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 479, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
File "/home/liujw/miniconda3/envs/neurocard/lib/python3.7/site-packages/ray/tune/trainable.py", line 245, in __init__
self.setup(copy.deepcopy(self.config))
File "/home/liujw/miniconda3/envs/neurocard/lib/python3.7/site-packages/ray/tune/trainable.py", line 769, in setup
self._setup(config)
File "run.py", line 508, in _setup
loaded_tables)
File "run.py", line 683, in MakeSamplerDatasetLoader
load_samples=self._load_samples)
File "/home/liujw/deepice/code/ce/neurocard-master/neurocard/common.py", line 789, in __init__
self._init_sampler()
File "/home/liujw/deepice/code/ce/neurocard-master/neurocard/factorized_sampler.py", line 283, in _init_sampler
self.add_full_join_fanouts)
File "/home/liujw/deepice/code/ce/neurocard-master/neurocard/factorized_sampler.py", line 190, in __init__
prepare_utils.prepare(join_spec)
File "/home/liujw/deepice/code/ce/neurocard-master/neurocard/factorized_sampler_lib/prepare_utils.py", line 261, in prepare
print(table, ray.get(jkg))
ray.exceptions.RayTaskError: ray::factorized_sampler_lib.prepare_utils.get_join_key_groups() (pid=3821015, ip=10.77.110.215)
File "python/ray/_raylet.pyx", line 479, in ray._raylet.execute_task
File "/home/liujw/deepice/code/ce/neurocard-master/neurocard/factorized_sampler_lib/prepare_utils.py", line 158, in get_join_key_groups
jct = ray.get(jcts[table])
ray.exceptions.RayTaskError: ray::factorized_sampler_lib.prepare_utils.get_first_jct() (pid=3821012, ip=10.77.110.215)
File "python/ray/_raylet.pyx", line 442, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 464, in ray._raylet.execute_task
ray.exceptions.RayWorkerError: The worker died unexpectedly while executing this task.

@concretevitamin
Member

@Doris404 Is there any additional info in these files?

cat /tmp/ray/session_<timestamp>/logs/worker-<hash>.*

You may also try the README here to rebuild the rustlib to see if that's the issue.
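
A session can leave well over a hundred worker log files, so it may help to filter them before reading. Below is a minimal sketch that lists only the `worker-*` files containing a crash marker. The function name `find_crashed_logs` and the marker list are assumptions made for illustration; Ray does not define them.

```python
from pathlib import Path

# Substrings that often mark a native crash in a log file. This list is an
# assumption for illustration; extend it to match what your logs contain.
CRASH_MARKERS = ("SIGSEGV", "*** Aborted at", "Check failed")


def find_crashed_logs(log_dir, markers=CRASH_MARKERS):
    """Return {file name: [matching lines]} for worker logs with crash markers."""
    hits = {}
    for path in sorted(Path(log_dir).glob("worker-*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log has odd bytes in it.
        text = path.read_text(errors="replace")
        matched = [line for line in text.splitlines()
                   if any(marker in line for marker in markers)]
        if matched:
            hits[path.name] = matched
    return hits
```

Call `find_crashed_logs(...)` with the session's `logs` directory (the same directory used in the `cat` command above) and then read only the files it returns.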

@Doris404
Author

@concretevitamin As for additional info in these files: I ran the bash script seelogs.sh and saved its output as runseelogs. Both are shown below:

seelogs.sh

cat worker-017cb5734d3e86412b9d5ef764d643043c02595b.err
cat worker-017cb5734d3e86412b9d5ef764d643043c02595b.out
cat worker-0216abc5c5194b60123b82dc6b95b7f9c99295ed.err
cat worker-0216abc5c5194b60123b82dc6b95b7f9c99295ed.out
cat worker-024c4fcf56bb0deae114970bd4122e6dd985fb36.err
cat worker-024c4fcf56bb0deae114970bd4122e6dd985fb36.out
cat worker-0727fe8e152f44f4c9b0999490acaedd2b724bba.err
cat worker-0727fe8e152f44f4c9b0999490acaedd2b724bba.out
cat worker-08c821e78b2c36ead28a531f5149c279630f651b.err
cat worker-08c821e78b2c36ead28a531f5149c279630f651b.out
cat worker-0cb47181397d4273b87854291335ef785bd06352.err
cat worker-0cb47181397d4273b87854291335ef785bd06352.out
cat worker-0dc0003534c956aed6369a242ebea7e8adfffb4c.err
cat worker-0dc0003534c956aed6369a242ebea7e8adfffb4c.out
cat worker-13a371ea8e333e49ce100b3aaa2c26e224cd81d7.err
cat worker-13a371ea8e333e49ce100b3aaa2c26e224cd81d7.out
cat worker-14d35afa48bce62a8334d4a61f6037e6d3a7bb74.err
cat worker-14d35afa48bce62a8334d4a61f6037e6d3a7bb74.out
cat worker-17b8a2bf1aaf58b8c87df17ea927a3887a8165f3.err
cat worker-17b8a2bf1aaf58b8c87df17ea927a3887a8165f3.out
cat worker-1918e1d886a78211357fb6caf078f56fc93a8161.err
cat worker-1918e1d886a78211357fb6caf078f56fc93a8161.out
cat worker-1a2c9c284db306363124f2dd55b9b775757c282c.err
cat worker-1a2c9c284db306363124f2dd55b9b775757c282c.out
cat worker-1ae61428dc1316e4840fdb1049150541932338be.err
cat worker-1ae61428dc1316e4840fdb1049150541932338be.out
cat worker-21cd66e011f4b8f1bbf3c0130338339b976acec8.err
cat worker-21cd66e011f4b8f1bbf3c0130338339b976acec8.out
cat worker-25fd269dc662d7e14ed5370b4a19f552624fbabf.err
cat worker-25fd269dc662d7e14ed5370b4a19f552624fbabf.out
cat worker-330f9ba9ea670062c40731b950f175dc35ed09fd.err
cat worker-330f9ba9ea670062c40731b950f175dc35ed09fd.out
cat worker-39d7f90bdc8238912efce03adf647893879fe85c.err
cat worker-39d7f90bdc8238912efce03adf647893879fe85c.out
cat worker-3a1159bc923874384ad107cee901edede632c0f8.err
cat worker-3a1159bc923874384ad107cee901edede632c0f8.out
cat worker-3ae34ba9b6bcfd366f9e7a27b9bfd42c605b50cc.err
cat worker-3ae34ba9b6bcfd366f9e7a27b9bfd42c605b50cc.out
cat worker-3e0353c141dd095426f4a9cedd0f375c94cd1251.err
cat worker-3e0353c141dd095426f4a9cedd0f375c94cd1251.out
cat worker-406eabc817c1a65534be67b775a5152d860162a4.err
cat worker-406eabc817c1a65534be67b775a5152d860162a4.out
cat worker-47d8033cf46008024c2f64205c36905ecee6b6d2.err
cat worker-47d8033cf46008024c2f64205c36905ecee6b6d2.out
cat worker-4cb92ea32449af46b111d981940c67786ccfe2b0.err
cat worker-4cb92ea32449af46b111d981940c67786ccfe2b0.out
cat worker-4d41763dd090f7d4c13d82e2a40c965ebc725601.err
cat worker-4d41763dd090f7d4c13d82e2a40c965ebc725601.out
cat worker-4e93c383ff5ddf3a7f2fc2cc2e8031c9c20cccc2.err
cat worker-4e93c383ff5ddf3a7f2fc2cc2e8031c9c20cccc2.out
cat worker-4ebdbe793a040d1fd1cef7f63c7f3b80e4c27fbb-0100.err
cat worker-4ebdbe793a040d1fd1cef7f63c7f3b80e4c27fbb-0100.out
cat worker-4ebdbe793a040d1fd1cef7f63c7f3b80e4c27fbb.err
cat worker-4ebdbe793a040d1fd1cef7f63c7f3b80e4c27fbb.out
cat worker-52da1fd761640880b3df93f8492f87dafd583bed.err
cat worker-52da1fd761640880b3df93f8492f87dafd583bed.out
cat worker-530d963a4fddc4c935a800e5eaa8f229f688cda8.err
cat worker-530d963a4fddc4c935a800e5eaa8f229f688cda8.out
cat worker-56ef3ab21a65034ed63d8f692f99e5040abcbaba.err
cat worker-56ef3ab21a65034ed63d8f692f99e5040abcbaba.out
cat worker-5c9df2067c226879e37bc8bb26b2990156e86c54.err
cat worker-5c9df2067c226879e37bc8bb26b2990156e86c54.out
cat worker-5f6f38534da04203da303512a6ea3ff09b136005.err
cat worker-5f6f38534da04203da303512a6ea3ff09b136005.out
cat worker-6b15ac1a779d34578850c11518836cc704a721a4.err
cat worker-6b15ac1a779d34578850c11518836cc704a721a4.out
cat worker-6bd5ad49c545b20b21ac5b1f08736d10b7c92da6.err
cat worker-6bd5ad49c545b20b21ac5b1f08736d10b7c92da6.out
cat worker-7041f912671bdc6f1615fcc18cd8e936d2f1abb4.err
cat worker-7041f912671bdc6f1615fcc18cd8e936d2f1abb4.out
cat worker-7105b7a78112d1f4c6c42754b4fef324f104ac37.err
cat worker-7105b7a78112d1f4c6c42754b4fef324f104ac37.out
cat worker-73bb256ba085fc55fa792ddc6c04a2cde89d8377.err
cat worker-73bb256ba085fc55fa792ddc6c04a2cde89d8377.out
cat worker-765463ac5ce664a0b84b12823f5c2ea33b5e1ede.err
cat worker-765463ac5ce664a0b84b12823f5c2ea33b5e1ede.out
cat worker-7a265f83923c987f42f7aa92bc8ab3c0db80eb60.err
cat worker-7a265f83923c987f42f7aa92bc8ab3c0db80eb60.out
cat worker-7ae3949ca714a904230aa8994a3c942f694fd7da.err
cat worker-7ae3949ca714a904230aa8994a3c942f694fd7da.out
cat worker-7b55165b111bcd694cf497dd0a5112e7761cd006.err
cat worker-7b55165b111bcd694cf497dd0a5112e7761cd006.out
cat worker-7e0370376c0fbef48f22ac499daef61f6c185c1a.err
cat worker-7e0370376c0fbef48f22ac499daef61f6c185c1a.out
cat worker-7ef53ea888e82f4c317b890102de8684c5316a29.err
cat worker-7ef53ea888e82f4c317b890102de8684c5316a29.out
cat worker-827bf3141f4effe18b3dd397d2f09ab979f9595c.err
cat worker-827bf3141f4effe18b3dd397d2f09ab979f9595c.out
cat worker-82861b89f3102a3ad5e5ce0f6665ce9776ee4ccb.err
cat worker-82861b89f3102a3ad5e5ce0f6665ce9776ee4ccb.out
cat worker-8528be8b2f16dac520548431a5f7b4b4a2af0a0f.err
cat worker-8528be8b2f16dac520548431a5f7b4b4a2af0a0f.out
cat worker-85660fd96be5df65c528496f70ccd7cdbc1554b0.err
cat worker-85660fd96be5df65c528496f70ccd7cdbc1554b0.out
cat worker-8ab59a67ed3e9624a01ece730b8e41381a014add.err
cat worker-8ab59a67ed3e9624a01ece730b8e41381a014add.out
cat worker-910b1d93c50f7dce552196ad7258e32ad5ab3e73.err
cat worker-910b1d93c50f7dce552196ad7258e32ad5ab3e73.out
cat worker-9912341fcf75ce4df093088a7e2a6af660b464fd.err
cat worker-9912341fcf75ce4df093088a7e2a6af660b464fd.out
cat worker-9baf4fcd347f2a9cf3c46d96f9485effdd155cdf.err
cat worker-9baf4fcd347f2a9cf3c46d96f9485effdd155cdf.out
cat worker-a0f5f5702a5865e4533fcffc579ed1b941c5f733.err
cat worker-a0f5f5702a5865e4533fcffc579ed1b941c5f733.out
cat worker-a6a2b6c55658d71c92f8387e837078931448578e.err
cat worker-a6a2b6c55658d71c92f8387e837078931448578e.out
cat worker-a6f8df020b9e0b8fc5a68f840f0bd25c31894fb4.err
cat worker-a6f8df020b9e0b8fc5a68f840f0bd25c31894fb4.out
cat worker-aa0917bc5c1273f210ee448d283184cf3ac7eda8.err
cat worker-aa0917bc5c1273f210ee448d283184cf3ac7eda8.out
cat worker-b572acc29082dd0bc2396a6fc5ea1e3ef3b72161.err
cat worker-b572acc29082dd0bc2396a6fc5ea1e3ef3b72161.out
cat worker-b67dde277d3c11099ac3aa33d846e4556638291c.err
cat worker-b67dde277d3c11099ac3aa33d846e4556638291c.out
cat worker-b683fe9a04935e5fe478b94f3faa922a44c471af.err
cat worker-b683fe9a04935e5fe478b94f3faa922a44c471af.out
cat worker-b769c2862108c953641e90a5533a767286d6e3ef.err
cat worker-b769c2862108c953641e90a5533a767286d6e3ef.out
cat worker-b925e3eedefd710c30e866038f1a20a087974fcb.err
cat worker-b925e3eedefd710c30e866038f1a20a087974fcb.out
cat worker-c57a5df310c3f9d8445e7a03052058e90c67e275.err
cat worker-c57a5df310c3f9d8445e7a03052058e90c67e275.out
cat worker-c5ef4e65a4ed0730f3b835d53976fcccf46217b2.err
cat worker-c5ef4e65a4ed0730f3b835d53976fcccf46217b2.out
cat worker-c8841e828f3ee383efb10bb0652c9823750352f4.err
cat worker-c8841e828f3ee383efb10bb0652c9823750352f4.out
cat worker-cb77666172979bf4e9e2fce6651e6f2c805b74f8.err
cat worker-cb77666172979bf4e9e2fce6651e6f2c805b74f8.out
cat worker-cf5d483ccd97f9a4efd18f2429f8d9ff3e4896de.err
cat worker-cf5d483ccd97f9a4efd18f2429f8d9ff3e4896de.out
cat worker-d1ef4a67485ef740ba99bc7f4454f4df52123fe3.err
cat worker-d1ef4a67485ef740ba99bc7f4454f4df52123fe3.out
cat worker-d7ade90c8b1dc3e0ed8a0d8d19d30971961a8b0a.err
cat worker-d7ade90c8b1dc3e0ed8a0d8d19d30971961a8b0a.out
cat worker-d80378791e3641e990a163682be02c98722b39c2.err
cat worker-d80378791e3641e990a163682be02c98722b39c2.out
cat worker-de658e84e3280288766fd67b0afbe5a40397f22a.err
cat worker-de658e84e3280288766fd67b0afbe5a40397f22a.out
cat worker-e25cc27447984c64f343f9d4354091c4b8d056c4.err
cat worker-e25cc27447984c64f343f9d4354091c4b8d056c4.out
cat worker-e3785a863b9f471ccbcf37c49678dea2eed2a4a6.err
cat worker-e3785a863b9f471ccbcf37c49678dea2eed2a4a6.out
cat worker-e637f257603bef20375eecd07ba55fb5b2b0834d.err
cat worker-e637f257603bef20375eecd07ba55fb5b2b0834d.out
cat worker-e7ad6d0602fbabdff0a4c86f0cf48c348b5127d1.err
cat worker-e7ad6d0602fbabdff0a4c86f0cf48c348b5127d1.out
cat worker-ea23ac181e9dd8f182a1dc191d881b6079b57e45.err
cat worker-ea23ac181e9dd8f182a1dc191d881b6079b57e45.out
cat worker-ea41e7d1f1beafafe0a2d782d822165505c09ec1.err
cat worker-ea41e7d1f1beafafe0a2d782d822165505c09ec1.out
cat worker-ea90cb69f7e1967ff1eeb4c25b0b10d13999f5f3.err
cat worker-ea90cb69f7e1967ff1eeb4c25b0b10d13999f5f3.out
cat worker-eefab545de60dd79b4918a7aaf8214d9422e7f88.err
cat worker-eefab545de60dd79b4918a7aaf8214d9422e7f88.out
cat worker-f1ddc43002669eedf53ea40558bc949c3c1e7b2e.err
cat worker-f1ddc43002669eedf53ea40558bc949c3c1e7b2e.out
cat worker-f371acc4e46b4b295431c7748ee3db44fc98b0a2.err
cat worker-f371acc4e46b4b295431c7748ee3db44fc98b0a2.out
cat worker-f72a41922ada55153579408f8d5eb39a03419f0f.err
cat worker-f72a41922ada55153579408f8d5eb39a03419f0f.out
cat worker-faa4483294313dfee4727fd0d2a11f8481254f40.err
cat worker-faa4483294313dfee4727fd0d2a11f8481254f40.out

runseelogs

Ray worker pid: 4061730
Ray worker pid: 4061730
Ray worker pid: 4061917
Ray worker pid: 4061917
Ray worker pid: 4061756
Ray worker pid: 4061756
Ray worker pid: 4061898
Ray worker pid: 4061898
Ray worker pid: 4061886
Ray worker pid: 4061886
Ray worker pid: 4061755
Ray worker pid: 4061755
Ray worker pid: 4061763
Ray worker pid: 4061763
Ray worker pid: 4061740
Ray worker pid: 4061740
Ray worker pid: 4061741
Ray worker pid: 4061741
Ray worker pid: 4061832
Ray worker pid: 4061832
Ray worker pid: 4061827
Ray worker pid: 4061827
Ray worker pid: 4061818
Ray worker pid: 4061818
Ray worker pid: 4061808
Ray worker pid: 4061808
Ray worker pid: 4061760
Ray worker pid: 4061760
Ray worker pid: 4061930
Ray worker pid: 4061930
Ray worker pid: 4061738
Ray worker pid: 4061738
Ray worker pid: 4061867
Ray worker pid: 4061867
Ray worker pid: 4061754
Ray worker pid: 4061754
Ray worker pid: 4061747
Ray worker pid: 4061747
Ray worker pid: 4061789
Ray worker pid: 4061789
Ray worker pid: 4061759
Ray worker pid: 4061759
Ray worker pid: 4061737
Ray worker pid: 4061737
Ray worker pid: 4061718
Ray worker pid: 4061718
Ray worker pid: 4061736
Ray worker pid: 4061736
Ray worker pid: 4061716
Ray worker pid: 4061716
Ray worker pid: 4061926
wandb: W&B is a tool that helps track and visualize machine learning experiments
wandb: No credentials found.  Run "wandb login" to visualize your metrics
wandb: Tracking run with wandb version 0.8.36
wandb: Wandb version 0.12.4 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Run data is saved locally in wandb/run-20211012_045917-r4hynqbc

2021-10-12 04:59:32,363	ERROR worker.py:666 -- Calling ray.init() again after it has already been called.
I1012 04:59:36.770085 4061926 factorized_sampler.py:142] DataTableActor of `cast_info` is ready.
I1012 04:59:39.437844 4061926 factorized_sampler.py:142] DataTableActor of `movie_companies` is ready.
I1012 04:59:43.439739 4061926 factorized_sampler.py:142] DataTableActor of `movie_info` is ready.
I1012 04:59:45.324186 4061926 factorized_sampler.py:142] DataTableActor of `movie_keyword` is ready.
I1012 04:59:48.940280 4061926 factorized_sampler.py:142] DataTableActor of `title` is ready.
I1012 04:59:50.276457 4061926 factorized_sampler.py:142] DataTableActor of `movie_info_idx` is ready.
I1012 04:59:50.276656 4061926 data_utils.py:28] Loading cached join count table of `cast_info` from ./cache/job-light-a2be9f04/cast_info.jct
*** Aborted at 1634014790 (unix time) try "date -d @1634014790" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGSEGV (@0x28) received by PID 4061926 (TID 0x7ff44703d740) from PID 40; stack trace: ***
    @     0x7ff4473ae3c0 (unknown)
    @     0x7ff4473a4fc4 __GI___pthread_mutex_lock
    @     0x7fe43906a068 google::protobuf::internal::OnShutdownRun()
    @     0x7fe43907225f google::protobuf::internal::InitProtobufDefaults()
    @     0x7fe4390724a1 google::protobuf::internal::InitSCCImpl()
    @     0x7fe438ff4fbe protobuf_orc_5fproto_2eproto::InitDefaults()
    @     0x7fe438ff52aa protobuf_orc_5fproto_2eproto::AddDescriptorsImpl()
    @     0x7ff4473ab47f __pthread_once_slow
    @     0x7fe438ff580d protobuf_orc_5fproto_2eproto::AddDescriptors()
    @     0x7ff4473dbb8a (unknown)
    @     0x7ff4473dbc91 (unknown)
    @     0x7ff44730a915 _dl_catch_exception
    @     0x7ff4473e00bf (unknown)
    @     0x7ff44730a8b8 _dl_catch_exception
    @     0x7ff4473df5fa (unknown)
    @     0x7ff4471a234c (unknown)
    @     0x7ff44730a8b8 _dl_catch_exception
    @     0x7ff44730a983 _dl_catch_error
    @     0x7ff4471a2b59 (unknown)
    @     0x7ff4471a23da dlopen
    @     0x56288edb876d _PyImport_FindSharedFuncptr
    @     0x56288eddcc20 _PyImport_LoadDynamicModuleWithSpec
    @     0x56288eddce79 _imp_create_dynamic
    @     0x56288ece4b62 _PyMethodDef_RawFastCallDict
    @     0x56288ece4c81 _PyCFunction_FastCallDict
    @     0x56288ed802ed _PyEval_EvalFrameDefault
    @     0x56288ecc32b9 _PyEval_EvalCodeWithName
    @     0x56288ed13497 _PyFunction_FastCallKeywords
    @     0x56288ed7f229 _PyEval_EvalFrameDefault
    @     0x56288ed1320b _PyFunction_FastCallKeywords
    @     0x56288ed7ae70 _PyEval_EvalFrameDefault
    @     0x56288ed1320b _PyFunction_FastCallKeywords
wandb: Program ended successfully.
wandb: You can sync this run to the cloud by running: 
wandb: wandb sync wandb/run-20211012_045917-r4hynqbc
Ray worker pid: 4061926
NeuroCard config:
{'__cpu': 1,
 '__gpu': 1,
 '__run': 'test-job-light',
 '_load_samples': None,
 '_save_samples': None,
 'asserts': {'fact_psample_8000_median': 4,
             'fact_psample_8000_p99': 50,
             'train_bits': 80},
 'bs': 2048,
 'checkpoint_every_epoch': False,
 'checkpoint_to_load': None,
 'compute_test_loss': True,
 'constant_lr': None,
 'custom_lr_lambda': None,
 'cwd': '/home/liujw/deepice/code/ce/neurocard-master/neurocard',
 'dataset': 'imdb',
 'direct_io': True,
 'disable_learnable_unk': False,
 'dropout': 1,
 'embed_size': 32,
 'embs_tied': True,
 'epochs': 1,
 'epochs_per_iteration': 1,
 'eval_join_sampling': None,
 'eval_psamples': [8000],
 'factorize': True,
 'factorize_blacklist': None,
 'factorize_fanouts': False,
 'fc_hiddens': 128,
 'fixed_dropout_ratio': False,
 'force_query_cols': None,
 'grouped_dropout': True,
 'input_encoding': 'embed',
 'input_no_emb_if_leq': False,
 'join_clauses': None,
 'join_how': 'outer',
 'join_keys': {'cast_info': ['movie_id'],
               'movie_companies': ['movie_id'],
               'movie_info': ['movie_id'],
               'movie_info_idx': ['movie_id'],
               'movie_keyword': ['movie_id'],
               'title': ['id']},
 'join_name': 'job-light',
 'join_root': 'title',
 'join_tables': ['cast_info',
                 'movie_companies',
                 'movie_info',
                 'movie_keyword',
                 'title',
                 'movie_info_idx'],
 'label_smoothing': 0,
 'layers': 4,
 'loader_workers': 4,
 'lr_scheduler': 'OneCycleLR-0.28',
 'max_steps': 500,
 'num_dmol': 0,
 'num_eval_queries_at_checkpoint_load': 2000,
 'num_eval_queries_at_end': 70,
 'num_eval_queries_per_iteration': 70,
 'num_orderings': 1,
 'optimizer': 'adam',
 'order': None,
 'order_content_only': True,
 'order_indicators_at_front': False,
 'order_seed': None,
 'output_encoding': 'embed',
 'per_row_dropout': False,
 'queries_csv': './queries/job-light.csv',
 'query_filters': [5, 12],
 'residual': True,
 'resmade_drop_prob': 0.1,
 'sampler': 'factorized_sampler',
 'sampler_batch_size': 4096,
 'save_checkpoint_at_end': False,
 'seed': 0,
 'special_order_seed': 0,
 'special_orders': 0,
 'table_dropout': True,
 'transformer_args': {},
 'use_cols': 'simple',
 'use_data_parallel': False,
 'use_transformer': False,
 'warmups': 0.05,
 'word_size_bits': 11}
Training on Join(['cast_info', 'movie_companies', 'movie_info', 'movie_keyword', 'title', 'movie_info_idx'])
Loading cast_info
Loaded parsed Table from ./datasets/job/cast_info.movie_id-role_id.table
cast_info([Column(movie_id, distribution_size=2331601), Column(role_id, distribution_size=11)])
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36244344 entries, 0 to 36244343
Data columns (total 2 columns):
 #   Column    Dtype
---  ------    -----
 0   movie_id  int64
 1   role_id   int64
dtypes: int64(2)
memory usage: 553.0 MB
Loading movie_companies
Loaded parsed Table from ./datasets/job/movie_companies.company_id-company_type_id-movie_id.table
movie_companies([Column(company_id, distribution_size=234997), Column(company_type_id, distribution_size=2), Column(movie_id, distribution_size=1087236)])
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2609129 entries, 0 to 2609128
Data columns (total 3 columns):
 #   Column           Dtype
---  ------           -----
 0   company_id       int64
 1   company_type_id  int64
 2   movie_id         int64
dtypes: int64(3)
memory usage: 59.7 MB
Loading movie_info
Loaded parsed Table from ./datasets/job/movie_info.movie_id-info_type_id.table
movie_info([Column(movie_id, distribution_size=2468825), Column(info_type_id, distribution_size=71)])
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14835720 entries, 0 to 14835719
Data columns (total 2 columns):
 #   Column        Dtype
---  ------        -----
 0   movie_id      int64
 1   info_type_id  int64
dtypes: int64(2)
memory usage: 226.4 MB
Loading movie_keyword
Loaded parsed Table from ./datasets/job/movie_keyword.movie_id-keyword_id.table
movie_keyword([Column(movie_id, distribution_size=476794), Column(keyword_id, distribution_size=134170)])
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4523930 entries, 0 to 4523929
Data columns (total 2 columns):
 #   Column      Dtype
---  ------      -----
 0   movie_id    int64
 1   keyword_id  int64
dtypes: int64(2)
memory usage: 69.0 MB
Loading title
Loaded parsed Table from ./datasets/job/title.id-kind_id-production_year.table
title([Column(id, distribution_size=2528312), Column(kind_id, distribution_size=7), Column(production_year, distribution_size=133)])
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2528312 entries, 0 to 2528311
Data columns (total 3 columns):
 #   Column           Dtype  
---  ------           -----  
 0   id               int64  
 1   kind_id          int64  
 2   production_year  float64
dtypes: float64(1), int64(2)
memory usage: 57.9 MB
Loading movie_info_idx
Loaded parsed Table from ./datasets/job/movie_info_idx.info_type_id-movie_id.table
movie_info_idx([Column(info_type_id, distribution_size=5), Column(movie_id, distribution_size=459925)])
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1380035 entries, 0 to 1380034
Data columns (total 2 columns):
 #   Column        Non-Null Count    Dtype
---  ------        --------------    -----
 0   info_type_id  1380035 non-null  int64
 1   movie_id      1380035 non-null  int64
dtypes: int64(2)
memory usage: 21.1 MB
Full outer join specified, inserting np.nan to all column domains
Ray worker pid: 4061926
Ray worker pid: 4061926
Ray worker pid: 4061757
Ray worker pid: 4061757
Ray worker pid: 4061911
Ray worker pid: 4061911
Ray worker pid: 4061729
Ray worker pid: 4061729
Ray worker pid: 4061795
Ray worker pid: 4061795
Ray worker pid: 4061937
Ray worker pid: 4061937
Ray worker pid: 4061864
Ray worker pid: 4061864
Ray worker pid: 4061748
Ray worker pid: 4061748
Ray worker pid: 4061745
Ray worker pid: 4061745
Ray worker pid: 4061727
Ray worker pid: 4061727
Ray worker pid: 4061897
Ray worker pid: 4061897
Ray worker pid: 4061743
Ray worker pid: 4061743
Ray worker pid: 4061934
Ray worker pid: 4061934
Ray worker pid: 4061761
Ray worker pid: 4061761
Ray worker pid: 4061719
Ray worker pid: 4061719
Ray worker pid: 4061830
Ray worker pid: 4061830
Ray worker pid: 4061723
Ray worker pid: 4061723
Ray worker pid: 4061770
Ray worker pid: 4061770
Ray worker pid: 4061825
Ray worker pid: 4061825
Ray worker pid: 4061921
Ray worker pid: 4061921
Ray worker pid: 4061835
Ray worker pid: 4061835
Ray worker pid: 4061732
Ray worker pid: 4061732
Ray worker pid: 4061746
Ray worker pid: 4061746
Ray worker pid: 4061924
Ray worker pid: 4061924
Ray worker pid: 4061721
Ray worker pid: 4061721
Ray worker pid: 4061742
Ray worker pid: 4061742
Ray worker pid: 4061822
Ray worker pid: 4061822
Ray worker pid: 4061734
Ray worker pid: 4061734
Ray worker pid: 4061733
Ray worker pid: 4061733
Ray worker pid: 4061735
Ray worker pid: 4061735
Ray worker pid: 4061753
Ray worker pid: 4061753
Ray worker pid: 4061762
Ray worker pid: 4061762
Ray worker pid: 4061758
Ray worker pid: 4061758
Ray worker pid: 4061722
Ray worker pid: 4061722
Ray worker pid: 4061807
Ray worker pid: 4061807
Ray worker pid: 4061739
Ray worker pid: 4061739
Ray worker pid: 4061749
Ray worker pid: 4061749
Ray worker pid: 4061892
Ray worker pid: 4061892
Ray worker pid: 4061774
Ray worker pid: 4061774
Ray worker pid: 4061731
Ray worker pid: 4061731
Ray worker pid: 4061744
Ray worker pid: 4061744
Ray worker pid: 4061725
Ray worker pid: 4061725
Ray worker pid: 4061750
Ray worker pid: 4061750
Ray worker pid: 4061720
Ray worker pid: 4061720
Ray worker pid: 4061797
Ray worker pid: 4061797
Ray worker pid: 4061717
Ray worker pid: 4061717
Ray worker pid: 4061764
Ray worker pid: 4061764
Ray worker pid: 4061728
Ray worker pid: 4061728
Ray worker pid: 4061912
Ray worker pid: 4061912
Ray worker pid: 4061851
Ray worker pid: 4061851
Ray worker pid: 4061752
Ray worker pid: 4061752
Ray worker pid: 4061920
Ray worker pid: 4061920
Ray worker pid: 4061765
Ray worker pid: 4061765
Ray worker pid: 4061804
Ray worker pid: 4061804
Ray worker pid: 4061767
Ray worker pid: 4061767
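
The SIGSEGV trace above passes through `dlopen` and `_PyImport_LoadDynamicModuleWithSpec`, which suggests the crash happens while a native extension module is being imported. One way to narrow this down is to run candidate imports in a fresh interpreter and see how that process exits. The sketch below assumes a POSIX system; the helper name `run_isolated` is hypothetical, and which module to try is a guess (the `protobuf_orc_5fproto_2eproto` frames hint at a library that bundles ORC protobuf definitions, possibly `pyarrow`, but the log does not confirm this).

```python
import signal
import subprocess
import sys


def run_isolated(code, timeout=120):
    """Run `code` in a fresh interpreter and describe how that process exited."""
    proc = subprocess.run([sys.executable, "-c", code],
                          stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL,
                          timeout=timeout)
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child was killed by a signal.
        return "killed by " + signal.Signals(-proc.returncode).name
    return "exit code {}".format(proc.returncode)
```

For example, comparing `run_isolated("import pyarrow")` with `run_isolated("import torch; import pyarrow")` would show whether the import order matters. These module names are only candidates taken from the package list, not a confirmed diagnosis.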

As for the rustlib issue: my rustc version is 1.55.0 and cargo is 1.55.0. It seems that I have no rustlib, and running bash build.sh fails. The error report is as follows:

error: failed to parse manifest at `Cargo.toml`

Caused by:
  can't find library `rustlib`, rename file to `src/lib.rs` or specify lib.path

I'm new to pre-packaged libraries and cannot find useful information about this on the internet. Where can I get rustlib? Could you give me a clue that would help me solve this problem? 😃

@franklsf95

The rustlib source code is in neurocard/neurocard/factorized_sampler_lib/pyext-rustlib/. Are you using the Nightly build of Rust? (See instructions here https://github.com/neurocard/neurocard/tree/master/neurocard/factorized_sampler_lib/pyext-rustlib)
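The Cargo error quoted above ("rename file to `src/lib.rs` or specify lib.path") usually means Cargo could not find the library source where it expects it, for example because the build ran from a different directory. A minimal sketch of a check for that, assuming the crate uses Cargo's default layout (`Cargo.toml` next to `src/lib.rs`); the helper name `check_crate_layout` is hypothetical, and if the manifest sets `lib.path` explicitly the second check does not apply:

```python
from pathlib import Path


def check_crate_layout(crate_dir):
    """Return a list of problems with a library crate's default Cargo layout."""
    crate = Path(crate_dir)
    problems = []
    # Cargo reads the manifest from the directory it is invoked in.
    if not (crate / "Cargo.toml").is_file():
        problems.append(f"no Cargo.toml in {crate}")
    # Default location of a library target (assumption: lib.path is not set).
    if not (crate / "src" / "lib.rs").is_file():
        problems.append(f"no src/lib.rs in {crate}")
    return problems


if __name__ == "__main__":
    # Directory named in this comment; adjust to where your checkout lives.
    crate_dir = "neurocard/neurocard/factorized_sampler_lib/pyext-rustlib"
    for problem in check_crate_layout(crate_dir):
        print("problem:", problem)
```

If this prints nothing, the layout matches Cargo's default and the cause is more likely the toolchain.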

I'm not sure what the cause could be here. Maybe you ran out of memory? How big is your machine RAM?
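A quick way to answer the RAM question, sketched with the standard library only. It assumes a Linux host, where `os.sysconf` exposes `SC_PAGE_SIZE` and `SC_PHYS_PAGES`; the function name `total_ram_gib` is just for illustration:

```python
import os


def total_ram_gib():
    """Total physical memory in GiB, read via POSIX sysconf (Linux)."""
    page_size = os.sysconf("SC_PAGE_SIZE")   # bytes per memory page
    num_pages = os.sysconf("SC_PHYS_PAGES")  # number of physical pages
    return page_size * num_pages / 2**30


if __name__ == "__main__":
    print(f"Total RAM: {total_ram_gib():.1f} GiB")
```

This reports installed memory, not what is free while the trials run.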

@Doris404
Author

As for the memory: that is not the case.
As for Rust: it is difficult for me to install a Nightly build of Rust on my server (it cannot find the host). Maybe there are other ways to set up the same environment for running NeuroCard. For example, could you build a Docker image that contains the environment?

@franklsf95

@Doris404 Installing Nightly Rust should be easy and does not require building Rust (https://rust-lang.github.io/rustup/concepts/channels.html). I can try to make a Docker image in the future, but probably not any time soon.
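With rustup, the usual route is `rustup toolchain install nightly` followed by `rustup default nightly`. To confirm which channel is active afterwards, here is a small sketch that inspects the output of `rustc --version`. The helper `is_nightly` is hypothetical and relies on the assumption that nightly builds carry a `-nightly` suffix in the version field, as in the `rustc 1.58.0-nightly` string reported later in this thread:

```python
import shutil
import subprocess


def is_nightly(version_line):
    """True if a `rustc --version` line names a nightly toolchain."""
    parts = version_line.split()
    # Expected shape: "rustc <version> (<hash> <date>)".
    return len(parts) >= 2 and parts[0] == "rustc" and "-nightly" in parts[1]


if __name__ == "__main__":
    if shutil.which("rustc") is None:
        print("rustc not found on PATH")
    else:
        result = subprocess.run(
            ["rustc", "--version"], capture_output=True, text=True, check=True
        )
        line = result.stdout.strip()
        print(line, "->", "nightly" if is_nightly(line) else "not nightly")
```

A stable release such as the 1.55.0 mentioned earlier would be reported as "not nightly".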

@concretevitamin
Member

@Doris404 what OS are you on? Hacked together a non-optimized, basic Dockerfile - can you try it out?

# Example usage (call from project root dir):
#   docker build -t neurocard-test .
#   docker run -it --rm --runtime=nvidia neurocard-test

FROM pytorch/pytorch:1.4-cuda10.1-cudnn7-runtime

RUN apt -y update --fix-missing && \
    apt -y install tree wget vim less build-essential python-setuptools python-dev && \
    rm -rf /var/lib/apt/lists/*

RUN pip install --upgrade pip

# Install NeuroCard dependencies.
RUN pip install \
      numpy==1.18.4 \
      pandas==1.0.5 \
      absl-py==0.9.0 \
      glog==0.3.1 \
      networkx==2.4 \
      ray[tune]==0.8.7 \
      tabulate==0.8.7 \
      scipy==1.4.1 \
      yapf==0.27.0 \
      mako==1.1.3 \
      pyspark==2.4.3 \
      wandb==0.8.36 \
      psycopg2 \
      pyarrow

WORKDIR /app
COPY . .
RUN cd neurocard && bash scripts/download_imdb.sh
CMD cd neurocard && bash

I've tested this on x86_64-unknown-linux-gnu.

Note that this doesn't rebuild the rust lib. To see if rust lib is the issue, you can set this option to fair_sampler.

@Doris404
Author

I managed to install Rust on my Mac. Now I am stuck on the build step, which reports "feature has been removed". I searched for the problem online, and it seems Rust removed some features in an update. The Rust version on my Mac is rustc 1.58.0-nightly, which I guess may differ from the required one. Which Rust version does the environment require?
[Screenshot: 2021-10-24, 10:31:30 AM]

@concretevitamin
Member

Applying this patch should make it compile. Tested with cargo 1.57.0-nightly (7fbbf4e8f 2021-10-19). It bumps up dependencies' versions and fixes some "unsafe" compilation errors.

BTW, we do not recommend running NeuroCard experiments on Mac or non-GPU machines.

@Doris404
Author

Doris404 commented Nov 3, 2021

This time I moved the environment to an x86 machine with a GPU and installed rustlib.so. When I run python run.py --run test-job-light, it still fails. The machine has 60 GB of memory.

[Screenshot: 2021-11-03, 6:06:08 PM]

[Screenshot: 2021-11-03, 6:03:34 PM]

@concretevitamin
Member

concretevitamin commented Nov 3, 2021 via email

@Doris404
Author

Doris404 commented Nov 4, 2021

I built the environment according to environment.yml. The versions are the same.

@Doris404
Author

My fault, I fixed the problem at last. Thanks a lot!

@yuting-weng

Hello, could you tell us how you solved the problem?

@WeChat098

Hello, may I ask how you solved this problem?
