Training process #14

Open
zhangyahu1 opened this issue Aug 15, 2021 · 8 comments

Comments

@zhangyahu1

zhangyahu1 commented Aug 15, 2021

No description provided.

zhangyahu1 changed the title from "Traning process" to "Training process" on Aug 15, 2021
@zhangyahu1
Author

I also tried to run:
pip install "neuralnet-pytorch[gin] @ git+git://github.com/justanhduc/neuralnet-pytorch.git@6bda19fdc57f176cb82f58d287602f4ccf4cfc23" --global-option="--cuda-ext"

It fails with the following error:

ERROR: Command errored out with exit status 1:
command: /home/yzhang4/anaconda3/envs/graphxx/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-xt7qo28s/neuralnet-pytorch_fee85099c948491680c3702bc4723f5f/setup.py'"'"'; __file__='"'"'/tmp/pip-install-xt7qo28s/neuralnet-pytorch_fee85099c948491680c3702bc4723f5f/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cuda-ext install --record /tmp/pip-record-jzi_zomv/install-record.txt --single-version-externally-managed --compile --install-headers /home/yzhang4/anaconda3/envs/graphxx/include/python3.6m/neuralnet-pytorch
cwd: /tmp/pip-install-xt7qo28s/neuralnet-pytorch_fee85099c948491680c3702bc4723f5f/
Complete output (114 lines):
running install
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.6
creating build/lib.linux-x86_64-3.6/neuralnet_pytorch
copying neuralnet_pytorch/__init__.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch
copying neuralnet_pytorch/_version.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch
copying neuralnet_pytorch/version.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch
copying neuralnet_pytorch/metrics.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch
copying neuralnet_pytorch/monitor.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch
creating build/lib.linux-x86_64-3.6/neuralnet_pytorch/extensions
copying neuralnet_pytorch/extensions/__init__.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/extensions
copying neuralnet_pytorch/extensions/bpd.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/extensions
copying neuralnet_pytorch/extensions/dist_chamfer.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/extensions
copying neuralnet_pytorch/extensions/dist_emd.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/extensions
copying neuralnet_pytorch/extensions/pc2vox.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/extensions
creating build/lib.linux-x86_64-3.6/neuralnet_pytorch/gin_nnt
copying neuralnet_pytorch/gin_nnt/__init__.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/gin_nnt
copying neuralnet_pytorch/gin_nnt/external_configurables.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/gin_nnt
creating build/lib.linux-x86_64-3.6/neuralnet_pytorch/layers
copying neuralnet_pytorch/layers/__init__.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/layers
copying neuralnet_pytorch/layers/adain.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/layers
copying neuralnet_pytorch/layers/aggregation.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/layers
copying neuralnet_pytorch/layers/blocks.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/layers
copying neuralnet_pytorch/layers/points.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/layers
copying neuralnet_pytorch/layers/resizing.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/layers
copying neuralnet_pytorch/layers/abstract.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/layers
copying neuralnet_pytorch/layers/convolution.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/layers
copying neuralnet_pytorch/layers/normalization.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/layers
creating build/lib.linux-x86_64-3.6/neuralnet_pytorch/optim
copying neuralnet_pytorch/optim/__init__.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/optim
copying neuralnet_pytorch/optim/adabound.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/optim
copying neuralnet_pytorch/optim/lookahead.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/optim
copying neuralnet_pytorch/optim/nadam.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/optim
creating build/lib.linux-x86_64-3.6/neuralnet_pytorch/utils
copying neuralnet_pytorch/utils/__init__.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/utils
copying neuralnet_pytorch/utils/activation_utils.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/utils
copying neuralnet_pytorch/utils/cv_utils.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/utils
copying neuralnet_pytorch/utils/misc_utils.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/utils
copying neuralnet_pytorch/utils/tensor_utils.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/utils
creating build/lib.linux-x86_64-3.6/neuralnet_pytorch/zoo
copying neuralnet_pytorch/zoo/__init__.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/zoo
copying neuralnet_pytorch/zoo/resnet.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/zoo
copying neuralnet_pytorch/zoo/vgg.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/zoo
creating build/lib.linux-x86_64-3.6/neuralnet_pytorch/optim/lr_scheduler
copying neuralnet_pytorch/optim/lr_scheduler/__init__.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/optim/lr_scheduler
copying neuralnet_pytorch/optim/lr_scheduler/inverse_lr.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/optim/lr_scheduler
copying neuralnet_pytorch/optim/lr_scheduler/warm_restart.py -> build/lib.linux-x86_64-3.6/neuralnet_pytorch/optim/lr_scheduler
UPDATING build/lib.linux-x86_64-3.6/neuralnet_pytorch/_version.py
set build/lib.linux-x86_64-3.6/neuralnet_pytorch/_version.py to '1.0.0+fancy.144.g6bda19f'
running build_ext
building 'neuralnet_pytorch.ext' extension
creating build/temp.linux-x86_64-3.6
creating build/temp.linux-x86_64-3.6/neuralnet_pytorch
creating build/temp.linux-x86_64-3.6/neuralnet_pytorch/extensions
creating build/temp.linux-x86_64-3.6/neuralnet_pytorch/extensions/csrc
gcc -pthread -B /home/yzhang4/anaconda3/envs/graphxx/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -Ineuralnet_pytorch/extensions/include -I/home/yzhang4/anaconda3/envs/graphxx/lib/python3.6/site-packages/torch/include -I/home/yzhang4/anaconda3/envs/graphxx/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/home/yzhang4/anaconda3/envs/graphxx/lib/python3.6/site-packages/torch/include/TH -I/home/yzhang4/anaconda3/envs/graphxx/lib/python3.6/site-packages/torch/include/THC -I/usr/local/cuda-10.0/include -I/home/yzhang4/anaconda3/envs/graphxx/include/python3.6m -c neuralnet_pytorch/extensions/csrc/bindings.cpp -o build/temp.linux-x86_64-3.6/neuralnet_pytorch/extensions/csrc/bindings.o -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=ext -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from neuralnet_pytorch/extensions/include/bpd.h:2:0,
from neuralnet_pytorch/extensions/csrc/bindings.cpp:1:
/home/yzhang4/anaconda3/envs/graphxx/lib/python3.6/site-packages/torch/include/torch/csrc/api/include/torch/torch.h:7:2: warning: #warning "Including torch/torch.h for C++ extensions is deprecated. Please include torch/extension.h" [-Wcpp]
#warning
^~~~~~~
gcc -pthread -B /home/yzhang4/anaconda3/envs/graphxx/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -Ineuralnet_pytorch/extensions/include -I/home/yzhang4/anaconda3/envs/graphxx/lib/python3.6/site-packages/torch/include -I/home/yzhang4/anaconda3/envs/graphxx/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/home/yzhang4/anaconda3/envs/graphxx/lib/python3.6/site-packages/torch/include/TH -I/home/yzhang4/anaconda3/envs/graphxx/lib/python3.6/site-packages/torch/include/THC -I/usr/local/cuda-10.0/include -I/home/yzhang4/anaconda3/envs/graphxx/include/python3.6m -c neuralnet_pytorch/extensions/csrc/chamfer_cuda.cpp -o build/temp.linux-x86_64-3.6/neuralnet_pytorch/extensions/csrc/chamfer_cuda.o -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=ext -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from neuralnet_pytorch/extensions/include/chamfer_cuda.h:2:0,
from neuralnet_pytorch/extensions/csrc/chamfer_cuda.cpp:1:
/home/yzhang4/anaconda3/envs/graphxx/lib/python3.6/site-packages/torch/include/torch/csrc/api/include/torch/torch.h:7:2: warning: #warning "Including torch/torch.h for C++ extensions is deprecated. Please include torch/extension.h" [-Wcpp]
#warning
^~~~~~~
In file included from neuralnet_pytorch/extensions/csrc/chamfer_cuda.cpp:3:0:
neuralnet_pytorch/extensions/csrc/chamfer_cuda.cpp: In function ‘std::vector<at::Tensor> chamfer_forward(at::Tensor, at::Tensor)’:
neuralnet_pytorch/extensions/include/utils.h:6:3: error: ‘TORCH_CHECK’ was not declared in this scope
TORCH_CHECK(x.type().is_cuda(), #x " must be a CUDA tensor")
^
neuralnet_pytorch/extensions/include/utils.h:10:3: note: in expansion of macro ‘CHECK_CUDA’
CHECK_CUDA(x);
^~~~~~~~~~
neuralnet_pytorch/extensions/csrc/chamfer_cuda.cpp:16:3: note: in expansion of macro ‘CHECK_INPUT’
CHECK_INPUT(xyz1);
^~~~~~~~~~~
neuralnet_pytorch/extensions/include/utils.h:6:3: note: suggested alternative: ‘AT_CHECK’
TORCH_CHECK(x.type().is_cuda(), #x " must be a CUDA tensor")
^
neuralnet_pytorch/extensions/include/utils.h:10:3: note: in expansion of macro ‘CHECK_CUDA’
CHECK_CUDA(x);
^~~~~~~~~~
neuralnet_pytorch/extensions/csrc/chamfer_cuda.cpp:16:3: note: in expansion of macro ‘CHECK_INPUT’
CHECK_INPUT(xyz1);
neuralnet_pytorch/extensions/include/utils.h:10:3: note: in expansion of macro ‘CHECK_CUDA’
CHECK_CUDA(x);
^~~~~~~~~~
neuralnet_pytorch/extensions/csrc/chamfer_cuda.cpp:28:3: note: in expansion of macro ‘CHECK_INPUT’
CHECK_INPUT(xyz1);
^~~~~~~~~~~
error: command 'gcc' failed with exit status 1
----------------------------------------
Rolling back uninstall of neuralnet-pytorch
Moving to /home/yzhang4/anaconda3/envs/graphxx/lib/python3.6/site-packages/neuralnet_pytorch-1.0.0+fancy.166.gcbb0c5a-py3.6.egg-info
from /home/yzhang4/anaconda3/envs/graphxx/lib/python3.6/site-packages/~euralnet_pytorch-1.0.0+fancy.166.gcbb0c5a-py3.6.egg-info
Moving to /home/yzhang4/anaconda3/envs/graphxx/lib/python3.6/site-packages/neuralnet_pytorch/
from /home/yzhang4/anaconda3/envs/graphxx/lib/python3.6/site-packages/~euralnet_pytorch
ERROR: Command errored out with exit status 1: /home/yzhang4/anaconda3/envs/graphxx/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-xt7qo28s/neuralnet-pytorch_fee85099c948491680c3702bc4723f5f/setup.py'"'"'; __file__='"'"'/tmp/pip-install-xt7qo28s/neuralnet-pytorch_fee85099c948491680c3702bc4723f5f/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cuda-ext install --record /tmp/pip-record-jzi_zomv/install-record.txt --single-version-externally-managed --compile --install-headers /home/yzhang4/anaconda3/envs/graphxx/include/python3.6m/neuralnet-pytorch Check the logs for full command output.

@zhangyahu1
Author

I am now running the code without GPUs. After several epochs of training, I get this error:

Traceback (most recent call last):
File "train.py", line 93, in <module>
train_valid()
File "/home/yzhang4/anaconda3/envs/graphx/lib/python3.6/site-packages/gin/config.py", line 1032, in wrapper
utils.augment_exception_message_and_reraise(e, err_str)
File "/home/yzhang4/anaconda3/envs/graphx/lib/python3.6/site-packages/gin/utils.py", line 48, in augment_exception_message_and_reraise
six.raise_from(proxy.with_traceback(exception.__traceback__), None)
File "<string>", line 3, in raise_from
File "/home/yzhang4/anaconda3/envs/graphx/lib/python3.6/site-packages/gin/config.py", line 1009, in wrapper
return fn(*new_args, **new_kwargs)
File "train.py", line 87, in train_valid
valid_freq=val_freq, reduce='mean')
File "/home/yzhang4/anaconda3/envs/graphx/lib/python3.6/site-packages/neuralnet_pytorch/monitor.py", line 932, in run_training
raise ValueError('NaN or Inf encountered. Training failed!')
ValueError: NaN or Inf encountered. Training failed!

I would appreciate any advice on how to solve this problem.
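
The traceback shows where the monitor in neuralnet_pytorch gives up, but not which quantity became non-finite first. A minimal sketch of one way to narrow that down, assuming scalar values (for example `loss.item()`) can be logged every iteration. The helper `find_non_finite` and the sample `history` are hypothetical and not part of neuralnet-pytorch:

```python
import math


def find_non_finite(step, values):
    """Return the names of logged values that are NaN or Inf at this step."""
    bad = [name for name, v in values.items() if not math.isfinite(v)]
    if bad:
        print(f"step {step}: non-finite values in {bad}")
    return bad


# Hypothetical logged values; in a real loop these would come from
# something like loss.item() and a gradient-norm computation.
history = [
    {"loss": 0.92, "grad_norm": 3.1},
    {"loss": 0.75, "grad_norm": 48.0},
    {"loss": float("nan"), "grad_norm": float("inf")},
]

# Index of the first step where any logged value is NaN or Inf.
first_bad_step = next(
    (i for i, vals in enumerate(history) if find_non_finite(i, vals)), None
)
```

Knowing the first bad step makes it easier to inspect the batch and learning rate at that point. PyTorch also provides `torch.autograd.set_detect_anomaly(True)`, which can point at the backward operation that produced a NaN, at some cost in speed.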

@justanhduc
Owner

Hi @zhangyahu1. Could you please give me more details about your conda and PyTorch environments? The build error comes from TORCH_CHECK, which is not available in early versions of PyTorch.
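
A small sketch of how those environment details could be collected in one go. The function name `describe_environment` is hypothetical; the `torch` attributes it reads (`torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()`) are standard, and the import is guarded so the script still runs where PyTorch is missing:

```python
import platform
import sys


def describe_environment():
    """Collect interpreter and PyTorch details useful in a bug report."""
    info = {
        "python": platform.python_version(),
        "executable": sys.executable,
    }
    try:
        import torch
    except ImportError:
        info["torch"] = None  # PyTorch is not installed in this environment
        return info
    info["torch"] = torch.__version__
    # CUDA version PyTorch was built with (None for CPU-only builds).
    info["torch_cuda"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    return info


env = describe_environment()
for key, value in env.items():
    print(f"{key}: {value}")
```

PyTorch also ships `python -m torch.utils.collect_env`, which prints a more complete report.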

@zhangyahu1
Author

zhangyahu1 commented Aug 20, 2021

Hi @justanhduc, my environment is:

pytorch 1.5.1
torchvision 0.6.1
cudatoolkit 10.1
python 3.6

The code runs now, but I get the following error:

raise ValueError('NaN or Inf encountered. Training failed!')
ValueError: NaN or Inf encountered. Training failed!

@justanhduc
Owner

Hi @zhangyahu1. Are you able to run on GPU now? I think I used PyTorch 1.7 for this code. Could you please try again with that version?
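
A sketch of how a script could check the installed version against the 1.7 mentioned above before training. The helpers `version_tuple` and `meets_minimum` are hypothetical, and treating 1.7 as a hard minimum is an assumption rather than a documented requirement of this repository:

```python
import re


def version_tuple(version):
    """Turn '1.7.0+cu101' into (1, 7, 0), ignoring local build suffixes."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def meets_minimum(version, minimum=(1, 7, 0)):
    """Compare numerically, so that '1.10.0' counts as newer than '1.7.0'."""
    return version_tuple(version) >= minimum


print(meets_minimum("1.5.1"))        # the version reported earlier in this thread
print(meets_minimum("1.7.0+cu101"))  # example string with a CUDA build suffix
```

In a real script the argument would be `torch.__version__`. Comparing integer tuples instead of raw strings avoids ordering "1.10" before "1.7".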

@zhangyahu1
Author

Thanks! @justanhduc
I will try running the code with PyTorch 1.7.

@zhangyahu1
Author

zhangyahu1 commented Sep 1, 2021

Hi @justanhduc
It works now with PyTorch 1.7. However, the code only runs well on a subset of the data: when I use all of it, it raises 'CUDA out of memory' within one epoch. I wonder whether memory is not being released during training.
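
One way to test that suspicion is to log allocated CUDA memory every few iterations and see whether it keeps climbing. A minimal sketch, assuming a standard PyTorch install: `torch.cuda.memory_allocated()` and `torch.cuda.max_memory_allocated()` are real PyTorch calls, while `cuda_memory_mib`, `is_growing` and the sample readings are hypothetical:

```python
def cuda_memory_mib():
    """Return allocated and peak CUDA memory in MiB, or None without CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    mib = 1024 ** 2
    return {
        "allocated": torch.cuda.memory_allocated() / mib,
        "peak": torch.cuda.max_memory_allocated() / mib,
    }


def is_growing(samples, tolerance=1.0):
    """True if each reading exceeds the previous one by more than `tolerance` MiB."""
    return len(samples) > 1 and all(
        later - earlier > tolerance
        for earlier, later in zip(samples, samples[1:])
    )


# Hypothetical readings of "allocated", taken once per logging interval.
print(is_growing([812.0, 815.5, 819.2, 823.0]))  # steadily increasing
print(is_growing([812.0, 812.1, 811.9, 812.0]))  # flat within tolerance
```

If the readings grow steadily, a common cause in PyTorch code generally (not confirmed for this repository) is accumulating the loss tensor itself instead of `loss.item()`, which keeps every iteration's autograd graph alive.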

@zhangyahu1
Author

zhangyahu1 commented Sep 1, 2021

It seems I was using the wrong version of neuralnet-pytorch. I downloaded the right version and ran: python setup.py install --cuda-ext. However, I now get the following error when I run the code:

Traceback (most recent call last):
File "train.py", line 16, in <module>
import neuralnet_pytorch.gin_nnt as gin
File "/home/yzhang4/anaconda3/envs/graphx/lib/python3.6/site-packages/neuralnet_pytorch-1.0.0+unknown-py3.6-linux-x86_64.egg/neuralnet_pytorch/__init__.py", line 38, in <module>
import neuralnet_pytorch.ext as ext
ImportError: /home/yzhang4/anaconda3/envs/graphx/lib/python3.6/site-packages/neuralnet_pytorch-1.0.0+unknown-py3.6-linux-x86_64.egg/neuralnet_pytorch/ext.cpython-36m-x86_64-linux-gnu.so: undefined symbol: PyThread_tss_create

I would appreciate any suggestions.
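
`PyThread_tss_create` belongs to the thread-specific storage C API that was added in Python 3.7 (PEP 539), so a Python 3.6 interpreter does not export it. One possible explanation, not confirmed in this thread, is that the extension was compiled against headers from a newer Python than the one importing it. A small sketch for inspecting which headers and ABI tag the current interpreter reports; `build_environment` is a hypothetical helper built on the standard `sys` and `sysconfig` modules:

```python
import sys
import sysconfig


def build_environment():
    """Report the interpreter details that a compiled extension must match."""
    return {
        "python": "{}.{}.{}".format(*sys.version_info[:3]),
        # ABI tag used in extension file names, e.g. cpython-36m-x86_64-linux-gnu
        # (may be None on some platforms).
        "soabi": sysconfig.get_config_var("SOABI"),
        # Directory of the Python.h this interpreter expects extensions to use.
        "include": sysconfig.get_paths()["include"],
        # The PyThread_tss_* functions only exist from Python 3.7 onwards.
        "has_tss_api": sys.version_info >= (3, 7),
    }


build_info = build_environment()
for key, value in build_info.items():
    print(f"{key}: {value}")
```

Comparing the printed `include` directory with the `-I` paths in the compiler command shows whether the build picked up headers from a different Python installation.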


2 participants