Need help with instructions to reproduce experiments #227

Open
jiwidi opened this issue Dec 25, 2020 · 10 comments

Comments

@jiwidi

jiwidi commented Dec 25, 2020

Hi Hiro!

First, thank you for the repo. I've been following it for a while and have seen you implement a large number of DL architectures.

So far I had only been watching the repo from time to time, but now I would like to see if I can reproduce some results and eventually use it with custom datasets. I tried to reproduce the LibriSpeech experiment without success and need some help with it.

I went ahead and followed the installation instructions:

# Set path to CUDA, NCCL
CUDAROOT=/usr/local/cuda
NCCL_ROOT=/usr/local/nccl

export CPATH=$NCCL_ROOT/include:$CPATH
export LD_LIBRARY_PATH=$NCCL_ROOT/lib/:$CUDAROOT/lib64:$LD_LIBRARY_PATH
export LIBRARY_PATH=$NCCL_ROOT/lib/:$LIBRARY_PATH
export CUDA_HOME=$CUDAROOT
export CUDA_PATH=$CUDAROOT
export CPATH=$CUDA_PATH/include:$CPATH  # for warp-rnnt

# Install miniconda, python libraries, and other tools
cd tools
make 

Kaldi complained about a few libraries, but after installing them manually the make command ran successfully. After this, a conda environment was created under my path: /mnt/kingston/github/neural_sp/tools/miniconda. I activated it with conda activate /mnt/kingston/github/neural_sp/tools/miniconda and proceeded to run

cd examples/librispeech/s5/
sh run.sh

But got the following output:

============================================================================
                                LibriSpeech                               
============================================================================
run.sh: 14: ./path.sh: source: not found
run.sh: 34: utils/parse_options.sh: Syntax error: Bad for loop variable

Have I missed an important part of the installation process? Do you have a more detailed list of steps I should follow in order to reproduce? Any help would be very much appreciated, thanks.

@hirofumi0810
Owner

@jiwidi I'll fix the Makefile. Please retry it after the next PR.

@jiwidi
Author

jiwidi commented Dec 30, 2020

@hirofumi0810 Hi again,

So I tried to run the same steps as in the original post, and now I'm stuck at the warp-rnnt make step. My output is:

git clone https://github.com/HawkAaron/warp-transducer.git /mnt/kingston/github/neural_sp/tools/neural_sp/warp-transducer
Cloning into '/mnt/kingston/github/neural_sp/tools/neural_sp/warp-transducer'...
remote: Enumerating objects: 11, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 905 (delta 1), reused 5 (delta 1), pack-reused 894
Receiving objects: 100% (905/905), 248.13 KiB | 622.00 KiB/s, done.
Resolving deltas: 100% (462/462), done.
# Note: Requires gcc>=5.0 to build extensions with pytorch>=1.0
if . /mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/bin/activate && python -c 'import torch as t;assert t.__version__[0] == "1"' &> /dev/null; then \
        . /mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/bin/activate && python -c "from distutils.version import LooseVersion as V;assert V('10.2.0') >= V('5.0'), 'Requires gcc>=5.0'"; \
fi
. /mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/bin/activate; cd /mnt/kingston/github/neural_sp/tools/neural_sp/warp-transducer && mkdir build && cd build && cmake .. && make; true
-- The C compiler identification is GNU 10.2.0
-- The CXX compiler identification is GNU 10.2.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found CUDA: /usr/local/cuda (found version "11.1") 
-- cuda found TRUE
-- Building shared library with GPU support
-- Configuring done
-- Generating done
-- Build files have been written to: /mnt/kingston/github/neural_sp/tools/neural_sp/warp-transducer/build
make[1]: Entering directory '/mnt/kingston/github/neural_sp/tools/neural_sp/warp-transducer/build'
make[2]: Entering directory '/mnt/kingston/github/neural_sp/tools/neural_sp/warp-transducer/build'
make[3]: Entering directory '/mnt/kingston/github/neural_sp/tools/neural_sp/warp-transducer/build'
[  7%] Building NVCC (Device) object CMakeFiles/warprnnt.dir/src/warprnnt_generated_rnnt_entrypoint.cu.o
nvcc fatal   : Unsupported gpu architecture 'compute_30'
CMake Error at warprnnt_generated_rnnt_entrypoint.cu.o.cmake:220 (message):
  Error generating
  /mnt/kingston/github/neural_sp/tools/neural_sp/warp-transducer/build/CMakeFiles/warprnnt.dir/src/./warprnnt_generated_rnnt_entrypoint.cu.o


make[3]: *** [CMakeFiles/warprnnt.dir/build.make:65: CMakeFiles/warprnnt.dir/src/warprnnt_generated_rnnt_entrypoint.cu.o] Error 1
make[3]: Leaving directory '/mnt/kingston/github/neural_sp/tools/neural_sp/warp-transducer/build'
make[2]: *** [CMakeFiles/Makefile2:191: CMakeFiles/warprnnt.dir/all] Error 2
make[2]: Leaving directory '/mnt/kingston/github/neural_sp/tools/neural_sp/warp-transducer/build'
make[1]: *** [Makefile:130: all] Error 2
make[1]: Leaving directory '/mnt/kingston/github/neural_sp/tools/neural_sp/warp-transducer/build'
. /mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/bin/activate; cd /mnt/kingston/github/neural_sp/tools/neural_sp/warp-transducer/pytorch_binding && python setup.py install
Could not find libwarprnnt.so in ../build.
Build warp-rnnt and set WARP_RNNT_PATH to the location of libwarprnnt.so (default is '../build')
make: *** [Makefile:93: warp-transducer.done] Error 1

It seems the error is:

nvcc fatal   : Unsupported gpu architecture 'compute_30'

I have an RTX 3090 from the latest NVIDIA generation; do you know if this repo has been updated to compile for it? Also, since I want to test the LAS and Transformer architectures on the LibriSpeech recipe, I think I won't need the transducer, right? Is there any way to skip this step?

Thanks

@jiwidi
Author

jiwidi commented Dec 30, 2020

I found this PR on the repo with support for compute_30, HawkAaron/warp-transducer#76; I will give it a try and come back.

EDIT:
Managed to compile it with the branch at https://github.com/ncilfone/warp-transducer/tree/3691b3fa5483e911645738a7894c48fe1f116c9b.
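
(Roughly, such a manual build mirrors what the repo's Makefile does in the log above; a sketch, assuming the tools/miniconda environment is active and the same clone path:)

git clone https://github.com/ncilfone/warp-transducer.git tools/neural_sp/warp-transducer
cd tools/neural_sp/warp-transducer
git checkout 3691b3fa5483e911645738a7894c48fe1f116c9b
mkdir build && cd build && cmake .. && make
cd ../pytorch_binding && python setup.py install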

Also, I discovered I couldn't run the run.sh script with sh run.sh, since it gives the same error:

============================================================================
                                LibriSpeech                               
============================================================================
run.sh: 14: ./path.sh: source: not found
run.sh: 34: utils/parse_options.sh: Syntax error: Bad for loop variable

It has to be run with ./run.sh --gpu 1. This downloads all the data and does some preprocessing, but then it stops during data prep; the script just stops with no error.
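
(Presumably the sh failure is because /bin/sh is dash on Ubuntu, which has no source builtin and no bash-style for loops; invoking bash explicitly should also work:)

bash run.sh --gpu 1
# or rely on the script's own shebang:
./run.sh --gpu 1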

It fails on data_prep.sh:

    for part in dev-clean test-clean dev-other test-other train-clean-100 train-clean-360 train-other-500; do
        # use underscore-separated names in data directories.
        local/data_prep.sh ${data_download_path}/LibriSpeech/${part} ${data}/$(echo ${part} | sed s/-/_/g) || exit 1;
    done

Specifically on utils/validate_data_dir.sh --no-feats $dst || exit 1;

But it doesn't give any specific output or complaint. The full run.sh output:

============================================================================
                                LibriSpeech                               
============================================================================
============================================================================
                       Data Preparation (stage:0)                          
============================================================================
local/download_and_untar.sh: data part dev-clean was already successfully extracted, nothing to do.
local/download_and_untar.sh: data part test-clean was already successfully extracted, nothing to do.
local/download_and_untar.sh: data part dev-other was already successfully extracted, nothing to do.
local/download_and_untar.sh: data part test-other was already successfully extracted, nothing to do.
local/download_and_untar.sh: data part train-clean-100 was already successfully extracted, nothing to do.
local/download_and_untar.sh: data part train-clean-360 was already successfully extracted, nothing to do.
local/download_and_untar.sh: data part train-other-500 was already successfully extracted, nothing to do.
Downloading file '3-gram.arpa.gz' into '/mnt/kingston/asr-datasets/neural-sp//local/lm'...
'3-gram.arpa.gz' already exists and appears to be complete
Downloading file '3-gram.pruned.1e-7.arpa.gz' into '/mnt/kingston/asr-datasets/neural-sp//local/lm'...
'3-gram.pruned.1e-7.arpa.gz' already exists and appears to be complete
Downloading file '3-gram.pruned.3e-7.arpa.gz' into '/mnt/kingston/asr-datasets/neural-sp//local/lm'...
'3-gram.pruned.3e-7.arpa.gz' already exists and appears to be complete
Downloading file '4-gram.arpa.gz' into '/mnt/kingston/asr-datasets/neural-sp//local/lm'...
'4-gram.arpa.gz' already exists and appears to be complete
Downloading file 'g2p-model-5' into '/mnt/kingston/asr-datasets/neural-sp//local/lm'...
'g2p-model-5' already exists and appears to be complete
Downloading file 'librispeech-lm-corpus.tgz' into '/mnt/kingston/asr-datasets/neural-sp//local/lm'...
'librispeech-lm-corpus.tgz' already exists and appears to be complete
Downloading file 'librispeech-vocab.txt' into '/mnt/kingston/asr-datasets/neural-sp//local/lm'...
'librispeech-vocab.txt' already exists and appears to be complete
Downloading file 'librispeech-lexicon.txt' into '/mnt/kingston/asr-datasets/neural-sp//local/lm'...
'librispeech-lexicon.txt' already exists and appears to be complete
utils/data/get_utt2dur.sh: segments file does not exist so getting durations from wave files
utils/data/get_utt2dur.sh: could not get utterance lengths from sphere-file headers, using wav-to-duration
utils/data/get_utt2dur.sh: computed /mnt/kingston/asr-datasets/neural-sp//dev_clean/utt2dur
Usage: utils/validate_data_dir.sh [--no-feats] [--no-text] [--non-print] [--no-wav] [--no-spk-sort] <data-dir>
The --no-xxx options mean that the script does not require 
xxx.scp to be present, but it will check it if it is present.
--no-spk-sort means that the script does not require the utt2spk to be 
sorted by the speaker-id in addition to being sorted by utterance-id.
--non-print ignore the presence of non-printable characters.
By default, utt2spk is expected to be sorted by both, which can be 
achieved by making the speaker-id prefixes of the utterance-ids
e.g.: utils/validate_data_dir.sh data/train
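
(To dig further, one could run the validation by hand on the directory named in the log above; a sketch, with the path taken from the get_utt2dur line:)

cd examples/librispeech/s5
utils/validate_data_dir.sh --no-feats /mnt/kingston/asr-datasets/neural-sp//dev_clean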

@jiwidi
Author

jiwidi commented Jan 3, 2021

@hirofumi0810 I managed to get past the last problem by skipping the data validation step (assuming all processing went right), and now I'm stuck at the LM training, which fails due to a cuDNN error. I think it's related to my CUDA installation / RTX 3090 and this code; this has already happened to me with other frameworks. I ran pytest at the neural_sp root and all 501 tests passed, so I don't know how to debug it.

Running:

../../../neural_sp/bin/lm/train.py \
    --corpus librispeech \
    --config conf/lm/rnnlm.yaml \
    --n_gpus 1 \
    --cudnn_benchmark true \
    --train_set /n/work2/inaguma/corpus/librispeech/dataset_lm/train_100_vocab100_wpbpe10000_external.tsv \
    --dev_set /n/work2/inaguma/corpus/librispeech/dataset_lm/dev_clean_100_vocab100_wpbpe10000.tsv \
    --eval_sets /n/work2/inaguma/corpus/librispeech/dataset_lm/dev_other_100_vocab100_wpbpe10000.tsv \
                /n/work2/inaguma/corpus/librispeech/dataset_lm/test_clean_100_vocab100_wpbpe10000.tsv \
                /n/work2/inaguma/corpus/librispeech/dataset_lm/test_other_100_vocab100_wpbpe10000.tsv \
    --unit wp \
    --dict /n/work2/inaguma/corpus/librispeech/dict/train_100_wpbpe10000.txt \
    --wp_model /n/work2/inaguma/corpus/librispeech/dict/train_100_bpe10000.model \
    --model_save_dir /n/work2/inaguma/results/librispeech/lm \
    --stdout true --resume

Generates this error:

2021-01-03 20:36:39,060 neural_sp.models.base line:108 INFO: torch.backends.cudnn.enabled: True
Traceback (most recent call last):
  File "../../../neural_sp/bin/lm/train.py", line 347, in <module>
    save_path = pr.runcall(main)
  File "/mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/lib/python3.7/cProfile.py", line 121, in runcall
    return func(*args, **kw)
  File "../../../neural_sp/bin/lm/train.py", line 178, in main
    model.cuda()
  File "/mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 260, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 117, in _apply
    self.flatten_parameters()
  File "/mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 113, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
[1]    172006 bus error (core dumped)  ../../../neural_sp/bin/lm/train.py --corpus librispeech --config  --n_gpus 1 

Have you encountered this error before? Any tips to solve it or debug it?
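
(For debugging, a minimal standalone reproduction of the cuDNN RNN path this trace goes through, assuming the tools/miniconda environment is active:)

python -c "import torch; m = torch.nn.LSTM(4, 8).cuda(); x = torch.randn(5, 2, 4).cuda(); print(m(x)[0].shape)"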

@hirofumi0810
Owner

@jiwidi Setting --benchmark false in run.sh will fix this.
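
(A sketch of the invocation, assuming run.sh exposes the flag through utils/parse_options.sh:)

./run.sh --gpu 1 --benchmark false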

@jiwidi
Author

jiwidi commented Jan 6, 2021

@hirofumi0810 Hi! Thanks for the help.

I tried that, and now it fails at another step. It does start the first minibatch, though.

  0%|                                                                                                                           | 0/982390016 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/kingston/github/neural_sp/examples/librispeech/s5/../../../neural_sp/bin/lm/train.py", line 353, in <module>
    save_path = pr.runcall(main)
  File "/mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/lib/python3.7/cProfile.py", line 121, in runcall
    return func(*args, **kw)
  File "/mnt/kingston/github/neural_sp/examples/librispeech/s5/../../../neural_sp/bin/lm/train.py", line 227, in main
    loss, hidden, observation = model(ys_train, state=hidden)
  File "/mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 141, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/kingston/github/neural_sp/neural_sp/models/lm/lm_base.py", line 55, in forward
    loss, state, observation = self._forward(ys, state)
  File "/mnt/kingston/github/neural_sp/neural_sp/models/lm/lm_base.py", line 63, in _forward
    logits, out, new_state = self.decode(ys_in, state=state, mems=state)
  File "/mnt/kingston/github/neural_sp/neural_sp/models/lm/rnnlm.py", line 220, in decode
    ys_emb = self.glu(ys_emb)
  File "/mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/kingston/github/neural_sp/neural_sp/models/modules/glu.py", line 26, in forward
    return F.glu(self.fc(xs), dim=-1)
  File "/mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 67, in forward
    return F.linear(input, self.weight, self.bias)
  File "/mnt/kingston/github/neural_sp/tools/neural_sp/miniconda/lib/python3.7/site-packages/torch/nn/functional.py", line 1354, in linear
    output = input.matmul(weight.t())
RuntimeError: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch_1544176307774/work/aten/src/THC/THCBlas.cu:258

Do you know of anyone who has successfully run this code on RTX 3000 series cards?
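
(The conda-bld timestamp in that trace, pytorch_1544176307774, looks like a December 2018 PyTorch 1.0 build, which predates CUDA 11 and the sm_86 support needed by the RTX 3090. A quick check of what the installed build reports, assuming the tools/miniconda environment is active:)

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.get_device_capability(0))"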

@hirofumi0810
Owner

@jiwidi Are you able to train ASR models in stage-4? (by skipping stage-3)
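
(If run.sh follows the usual Kaldi-style stage convention through utils/parse_options.sh, something along these lines should jump straight to ASR training; the flag name is an assumption:)

./run.sh --gpu 1 --stage 4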

@jiwidi
Author

jiwidi commented Jan 27, 2021

@hirofumi0810 Hi

Sorry, I've been out the last few weeks; this one is a busy week for me, but I will try it on the weekend. Thanks.

@agarwalchaitanya

agarwalchaitanya commented Feb 3, 2021

Facing the same error during installation: nvcc fatal : Unsupported gpu architecture 'compute_30'

@Neukiru

Neukiru commented Jan 1, 2022

@jiwidi

Hi, my colleague and I have run the model with the AISHELL-2 recipe on an RTX 3090. We had the same compute_30 problem and resolved it by commenting out one or two lines in the relevant CMake file, roughly as sketched below.
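
(Roughly, the fix amounts to disabling the compute_30 gencode flag in warp-transducer's CMakeLists.txt before building; a hypothetical one-liner, since the exact line content may differ between forks:)

cd tools/neural_sp/warp-transducer
sed -i 's/.*arch=compute_30.*/# &/' CMakeLists.txt   # comment out the Kepler arch flag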
