Ray problem: ray.tune.error.TuneError: ('Trials did not complete', [NeuroCard_44b5b_00000]) #4
Comments
@concretevitamin For the additional info in these files, I ran the script seelogs.sh and got the output runseelogs. Both are attached: seelogs.sh
runseelogs
As for the rustlib issue: my rustc version is 1.55.0 and cargo is 1.55.0. It seems that I have no rustlib and when I run
I'm new to pre-packaged libraries and cannot find useful information about this on the internet. Where can I get the |
The rustlib source code is in I'm not sure what the cause could be here. Maybe you ran out of memory? How big is your machine RAM? |
As for memory: that is not the cause. |
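For anyone who wants to rule memory in or out, here is a minimal sketch for reading total and available RAM on Linux. It assumes `/proc/meminfo` is present; `parse_meminfo` and `memory_gib` are hypothetical helper names and are not part of NeuroCard or Ray.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Most entries are reported in kB; the unit suffix is dropped here.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def memory_gib(path="/proc/meminfo"):
    """Return (total, available) memory in GiB. Linux only (assumption)."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    kib_per_gib = 1024 ** 2
    return info["MemTotal"] / kib_per_gib, info["MemAvailable"] / kib_per_gib
```

Calling `memory_gib()` just before launching `run.py`, and again while the job is running, gives a rough idea of whether available memory is collapsing before the worker dies.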
@Doris404 Installing Nightly Rust should be easy and does not require building Rust (https://rust-lang.github.io/rustup/concepts/channels.html). I can try to make a Docker image in the future, but probably not any time soon. |
@Doris404 what OS are you on? Hacked together a non-optimized, basic Dockerfile - can you try it out?
I've tested this on x86_64-unknown-linux-gnu. Note that this doesn't rebuild the rust lib. To see if rust lib is the issue, you can set this option to |
Applying this patch should make it compile. Tested with BTW, we do not recommend running NeuroCard experiments on Mac or non-GPU machines. |
Check that wandb version matches the one specified in the environment yaml.
…On Wed, Nov 3, 2021 at 03:07 Doris404 ***@***.***> wrote:
This time I moved the environment to an x86 machine with a GPU and with
rustlib.so installed. When I run python run.py --run test-job-light, it still
fails. The machine has 60 GB of memory.
[image: Screenshot 2021-11-03 6.06.08 PM]
<https://user-images.githubusercontent.com/37341760/140041491-f59de11f-9678-4fd0-af2a-8c0fb96356a3.png>
[image: Screenshot 2021-11-03 6.03.34 PM]
<https://user-images.githubusercontent.com/37341760/140041102-33139d71-30a0-43cb-9317-d78d1ba33d6d.png>
|
I built the environment from environment.yml, so the version is the same. |
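One way to double-check that installed versions really match the pins (as suggested above for wandb) is to compare them programmatically. This is a rough sketch under stated assumptions: `parse_pins` and `find_mismatches` are hypothetical helpers, the parser is a simplified line-based reader rather than a real YAML parser, and the lookup only sees Python distributions, so conda packages that are not Python packages (for example cudatoolkit) would show up as missing.

```python
import re

try:  # Python 3.8+
    from importlib import metadata as importlib_metadata
except ImportError:  # Python 3.7 needs the importlib_metadata backport
    import importlib_metadata


def parse_pins(yaml_text):
    """Collect 'name=version' and 'name==version' pins from environment.yml text.

    Simplified line-based parse (assumption): entries without a version,
    such as channel names or the 'pip:' key, are ignored.
    """
    pins = {}
    for line in yaml_text.splitlines():
        m = re.match(r"\s*-\s*([A-Za-z0-9_.\-]+)\s*={1,2}\s*([^\s=]+)", line)
        if m:
            pins[m.group(1).lower()] = m.group(2)
    return pins


def find_mismatches(pins, get_version=importlib_metadata.version):
    """Return {name: (pinned, installed)} for packages whose version differs.

    'installed' is None when the package cannot be found.
    """
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = get_version(name)
        except importlib_metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

Run inside the activated environment, `find_mismatches(parse_pins(open("environment.yml").read()))` returns an empty dict when every pinned Python package matches.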
My fault, I fixed the problem at last. Thanks a lot! |
Hello, could you tell me how you solved the problem? |
Hello, may I ask how you solved this problem? |
I tried to fix it with the help of Google, but that was no use. Stack Overflow does not have a correct answer (I tried changing package versions, which did not succeed), so I am writing in the hope of getting some advice.
More details about my problem:
Linux 5.4.0-84-generic #94-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux
cuda version 11.4.20210728
The packages:
Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 1_llvm conda-forge
_pytorch_select 0.2 gpu_0
_tflow_select 2.3.0 mkl
absl-py 0.9.0 py37hc8dfbb8_1 conda-forge
aiohttp 3.7.4.post0 py37h5e8e339_0 conda-forge
aiohttp-cors 0.7.0 pypi_0 pypi
aioredis 2.0.0 pypi_0 pypi
argh 0.26.2 pypi_0 pypi
arrow-cpp 0.11.1 py37h0e61e49_1004 conda-forge
astor 0.8.1 pyh9f0ad1d_0 conda-forge
async-timeout 3.0.1 py_1000 conda-forge
attrs 21.2.0 pyhd8ed1ab_0 conda-forge
beautifulsoup4 4.10.0 pypi_0 pypi
blas 1.0 mkl conda-forge
blessings 1.7 pypi_0 pypi
blinker 1.4 py_1 conda-forge
boost-cpp 1.68.0 h11c811c_1000 conda-forge
brotlipy 0.7.0 py37h5e8e339_1001 conda-forge
bzip2 1.0.8 h7f98852_4 conda-forge
c-ares 1.17.2 h7f98852_0 conda-forge
ca-certificates 2021.10.8 ha878542_0 conda-forge
cached-property 1.5.2 hd8ed1ab_1 conda-forge
cached_property 1.5.2 pyha770c72_1 conda-forge
cachetools 4.2.2 pypi_0 pypi
certifi 2021.5.30 pypi_0 pypi
cffi 1.14.6 py37hc58025e_0 conda-forge
chardet 4.0.0 py37h89c1867_1 conda-forge
charset-normalizer 2.0.6 pypi_0 pypi
click 8.0.1 py37h89c1867_0 conda-forge
cloudpickle 2.0.0 pypi_0 pypi
colorama 0.4.4 pypi_0 pypi
colorful 0.5.4 pypi_0 pypi
configparser 3.8.1 pypi_0 pypi
cryptography 3.4.7 py37h5d9358c_0 conda-forge
cudatoolkit 10.1.243 h036e899_9 conda-forge
cudnn 7.6.5.32 hc0a50b0_1 conda-forge
dataclasses 0.8 pyhc8e2a94_3 conda-forge
decorator 5.1.0 pyhd8ed1ab_0 conda-forge
docker-pycreds 0.4.0 pypi_0 pypi
filelock 3.0.12 pypi_0 pypi
funcsigs 1.0.2 pypi_0 pypi
gast 0.2.2 py_0 conda-forge
gitdb 4.0.7 pypi_0 pypi
gitpython 1.0.0 pypi_0 pypi
glog 0.3.1 pypi_0 pypi
google 3.0.0 pypi_0 pypi
google-api-core 1.31.3 pypi_0 pypi
google-auth 1.35.0 pyh6c4a22f_0 conda-forge
google-auth-oauthlib 0.4.6 pyhd8ed1ab_0 conda-forge
google-pasta 0.2.0 pyh8c360ce_0 conda-forge
googleapis-common-protos 1.53.0 pypi_0 pypi
gpustat 0.4.1 pypi_0 pypi
gql 0.3.0 py_0 conda-forge
graphql-core 1.1 pypi_0 pypi
grpcio 1.40.0 pypi_0 pypi
h5py 3.3.0 nompi_py37ha3df211_100 conda-forge
hdf5 1.10.6 nompi_h3c11f04_101 conda-forge
icu 58.2 hf484d3e_1000 conda-forge
idna 3.2 pypi_0 pypi
importlib-metadata 4.8.1 py37h89c1867_0 conda-forge
iniconfig 1.1.1 pypi_0 pypi
jsonschema 3.2.0 pypi_0 pypi
keras-applications 1.0.8 py_1 conda-forge
keras-preprocessing 1.1.2 pyhd8ed1ab_0 conda-forge
krb5 1.16.4 h2fd8d38_0 conda-forge
ld_impl_linux-64 2.36.1 hea4e1c9_2 conda-forge
libblas 3.9.0 8_mkl conda-forge
libcblas 3.9.0 8_mkl conda-forge
libedit 3.1.20191231 he28a2e2_2 conda-forge
libffi 3.3 h58526e2_2 conda-forge
libgcc-ng 11.2.0 h1d223b6_9 conda-forge
libgfortran-ng 7.5.0 h14aa051_19 conda-forge
libgfortran4 7.5.0 h14aa051_19 conda-forge
liblapack 3.9.0 8_mkl conda-forge
libpq 11.5 hd9ab2ff_2 conda-forge
libprotobuf 3.6.1 hdbcaa40_1001 conda-forge
libstdcxx-ng 11.2.0 he4da1e4_9 conda-forge
libzlib 1.2.11 h36c2ea0_1013 conda-forge
llvm-openmp 12.0.1 h4bd325d_1 conda-forge
mako 1.1.3 pyh9f0ad1d_0 conda-forge
markdown 3.3.4 pyhd8ed1ab_0 conda-forge
markupsafe 2.0.1 py37h5e8e339_0 conda-forge
mkl 2020.4 h726a3e6_304 conda-forge
mkl-service 2.3.0 py37h8f50634_2 conda-forge
msgpack 1.0.2 pypi_0 pypi
multidict 5.1.0 pypi_0 pypi
ncurses 6.2 h58526e2_4 conda-forge
networkx 2.4 py_1 conda-forge
ninja 1.10.2 h4bd325d_1 conda-forge
numpy 1.18.4 py37h8960a57_0 conda-forge
nvidia-ml-py3 7.352.0 pypi_0 pypi
oauthlib 3.1.1 pyhd8ed1ab_0 conda-forge
opencensus 0.7.13 pypi_0 pypi
opencensus-context 0.1.2 pypi_0 pypi
openssl 1.1.1l h7f98852_0 conda-forge
opt_einsum 3.3.0 pyhd8ed1ab_1 conda-forge
packaging 21.0 pypi_0 pypi
pandas 1.0.5 py37h0da4684_0 conda-forge
parquet-cpp 1.5.1 3 conda-forge
pathtools 0.1.2 pypi_0 pypi
pillow 8.3.2 pypi_0 pypi
pip 21.2.4 pyhd8ed1ab_0 conda-forge
pipdeptree 2.1.0 pypi_0 pypi
pluggy 1.0.0 pypi_0 pypi
prometheus-client 0.11.0 pypi_0 pypi
promise 2.3 py37h89c1867_4 conda-forge
protobuf 3.17.3 pypi_0 pypi
psutil 5.0.0 pypi_0 pypi
psycopg2 2.8.4 py37h1ba5d50_0
py 1.10.0 pypi_0 pypi
py-spy 0.3.9 pypi_0 pypi
py4j 0.10.7 py_1 conda-forge
pyarrow 0.11.1 py37hbbcf98d_1002 conda-forge
pyasn1 0.4.8 py_0 conda-forge
pyasn1-modules 0.2.8 pypi_0 pypi
pycparser 2.20 pyh9f0ad1d_2 conda-forge
pyjwt 2.2.0 pyhd8ed1ab_0 conda-forge
pyopenssl 21.0.0 pyhd8ed1ab_0 conda-forge
pyparsing 2.4.7 pypi_0 pypi
pyrsistent 0.18.0 pypi_0 pypi
pysocks 1.7.1 py37h89c1867_3 conda-forge
pyspark 2.4.3 py_0 conda-forge
pytest 6.2.5 pypi_0 pypi
python 3.7.7 hcff3b4d_5
python-dateutil 2.8.2 pyhd8ed1ab_0 conda-forge
python-gflags 3.1.2 pypi_0 pypi
python_abi 3.7 2_cp37m conda-forge
pytorch 1.4.0 cuda101py37h02f0884_0
pytz 2021.3 pyhd8ed1ab_0 conda-forge
pyu2f 0.1.5 pyhd8ed1ab_0 conda-forge
pyyaml 3.10 pypi_0 pypi
ray 0.8.6 pypi_0 pypi
readline 8.1 h46c0cb4_0 conda-forge
redis 3.4.1 pypi_0 pypi
requests 2.26.0 pypi_0 pypi
requests-oauthlib 1.3.0 pyh9f0ad1d_0 conda-forge
rsa 4.7.2 pyh44b312d_0 conda-forge
rx 3.2.0 pyhd8ed1ab_0 conda-forge
scipy 1.4.1 py37ha3d9a3c_3 conda-forge
sentry-sdk 0.4.0 pypi_0 pypi
setuptools 58.2.0 py37h89c1867_0 conda-forge
shortuuid 0.5.0 pypi_0 pypi
six 1.16.0 pyh6c4a22f_0 conda-forge
smmap 4.0.0 pypi_0 pypi
soupsieve 2.2.1 pypi_0 pypi
sqlite 3.36.0 h9cd32fc_2 conda-forge
subprocess32 3.5.4 pypi_0 pypi
tabulate 0.8.7 pypi_0 pypi
tensorboard 1.15.0 py37_0 conda-forge
tensorboard-data-server 0.6.0 py37hf1a17b8_0 conda-forge
tensorboard-plugin-wit 1.8.0 pyh44b312d_0 conda-forge
tensorboardx 2.4 pypi_0 pypi
tensorflow 1.15.0 mkl_py37h28c19af_0
tensorflow-base 1.15.0 mkl_py37he1670d9_0
tensorflow-estimator 1.15.1 pyh2649769_0
termcolor 1.1.0 py_2 conda-forge
thrift-cpp 0.12.0 h0a07b25_1002 conda-forge
tk 8.6.11 h27826a3_1 conda-forge
toml 0.10.2 pypi_0 pypi
torchvision 0.5.0 pypi_0 pypi
typing-extensions 3.10.0.2 hd8ed1ab_0 conda-forge
typing_extensions 3.10.0.2 pyha770c72_0 conda-forge
urllib3 1.26.7 pyhd8ed1ab_0 conda-forge
wandb 0.8.36 pypi_0 pypi
watchdog 0.8.3 pypi_0 pypi
werkzeug 0.16.1 py_0 conda-forge
wheel 0.37.0 pyhd8ed1ab_1 conda-forge
wrapt 1.13.1 py37h5e8e339_0 conda-forge
xz 5.2.5 h516909a_1 conda-forge
yapf 0.27.0 py_0 conda-forge
yarl 1.6.3 pypi_0 pypi
zipp 3.6.0 pyhd8ed1ab_0 conda-forge
zlib 1.2.11 h36c2ea0_1013 conda-forge
The error.txt I get:
Failure # 1 (occurred at 2021-10-09_11-15-28)
Traceback (most recent call last):
File "/home/liujw/miniconda3/envs/neurocard/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 471, in _process_trial
result = self.trial_executor.fetch_result(trial)
File "/home/liujw/miniconda3/envs/neurocard/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 430, in fetch_result
result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
File "/home/liujw/miniconda3/envs/neurocard/lib/python3.7/site-packages/ray/worker.py", line 1538, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::NeuroCard.train() (pid=3821066, ip=10.77.110.215)
File "python/ray/_raylet.pyx", line 439, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 474, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 478, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 479, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
File "/home/liujw/miniconda3/envs/neurocard/lib/python3.7/site-packages/ray/tune/trainable.py", line 245, in __init__
self.setup(copy.deepcopy(self.config))
File "/home/liujw/miniconda3/envs/neurocard/lib/python3.7/site-packages/ray/tune/trainable.py", line 769, in setup
self._setup(config)
File "run.py", line 508, in _setup
loaded_tables)
File "run.py", line 683, in MakeSamplerDatasetLoader
load_samples=self._load_samples)
File "/home/liujw/deepice/code/ce/neurocard-master/neurocard/common.py", line 789, in __init__
self._init_sampler()
File "/home/liujw/deepice/code/ce/neurocard-master/neurocard/factorized_sampler.py", line 283, in _init_sampler
self.add_full_join_fanouts)
File "/home/liujw/deepice/code/ce/neurocard-master/neurocard/factorized_sampler.py", line 190, in __init__
prepare_utils.prepare(join_spec)
File "/home/liujw/deepice/code/ce/neurocard-master/neurocard/factorized_sampler_lib/prepare_utils.py", line 261, in prepare
print(table, ray.get(jkg))
ray.exceptions.RayTaskError: ray::factorized_sampler_lib.prepare_utils.get_join_key_groups() (pid=3821015, ip=10.77.110.215)
File "python/ray/_raylet.pyx", line 479, in ray._raylet.execute_task
File "/home/liujw/deepice/code/ce/neurocard-master/neurocard/factorized_sampler_lib/prepare_utils.py", line 158, in get_join_key_groups
jct = ray.get(jcts[table])
ray.exceptions.RayTaskError: ray::factorized_sampler_lib.prepare_utils.get_first_jct() (pid=3821012, ip=10.77.110.215)
File "python/ray/_raylet.pyx", line 442, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 464, in ray._raylet.execute_task
ray.exceptions.RayWorkerError: The worker died unexpectedly while executing this task.
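The trace above nests several Ray errors, and the last one (the worker dying inside `get_first_jct`) is the most specific. As a small aid for reading an error.txt like this, the sketch below pulls out just the `ray.exceptions.*` lines in order and strips terminal color codes. `ray_error_chain` is a hypothetical helper, not a Ray API, and it assumes the file has the shape shown here.

```python
import re

# Matches color codes such as "\x1b[36m"; the ESC byte is optional because
# it is sometimes lost when the log is copied (assumption).
ANSI_ESCAPE = re.compile(r"\x1b?\[\d+(?:;\d+)*m")


def ray_error_chain(error_text):
    """Return the 'ray.exceptions.*' lines in order, outermost error first."""
    chain = []
    for line in error_text.splitlines():
        clean = ANSI_ESCAPE.sub("", line).strip()
        if clean.startswith("ray.exceptions."):
            # Collapse runs of whitespace left behind by the removed codes.
            chain.append(" ".join(clean.split()))
    return chain
```

The final element of the returned list, `ray_error_chain(text)[-1]`, is the innermost failure and is usually the best starting point when searching for a cause.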