
KeyError: 'source_2_path' when training ConvTasNet using enh_single mode #694

Open
lingzhic opened this issue Mar 19, 2024 · 3 comments
Labels: question (Further information is requested)

lingzhic commented Mar 19, 2024

I am training ConvTasNet on the LibriMix train-100 dataset. Training works fine in sep_noisy mode, but in enh_single mode it fails with the following error:

Results from the following experiment will be stored in exp/train_convtasnet_3rd_causal
Stage 2: Training
/O/asteroid/asteroid/models/conv_tasnet.py:89: UserWarning: In causal configuration cumulative layer normalization (cgLN)or channel-wise layer normalization (chanLN)  must be used. Changing cLN to cLN
  warnings.warn(
/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/lightning_fabric/plugins/environments/slurm.py:204: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python train.py --exp_dir exp/train_convtasnet_3rd_causal - ...
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(limit_train_batches=1.0)` was configured so 100% of the batches per epoch will be used..
You are using a CUDA device ('NVIDIA GeForce RTX 4090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[W CUDAAllocatorConfig.h:30] Warning: expandable_segments not supported on this platform (function operator())
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name      | Type           | Params
---------------------------------------------
0 | model     | ConvTasNet     | 5.1 M 
1 | loss_func | PITLossWrapper | 0     
---------------------------------------------
5.1 M     Trainable params
0         Non-trainable params
5.1 M     Total params
20.202    Total estimated model params size (MB)
{'data': {'n_src': 2,
          'sample_rate': 8000,
          'segment': 3,
          'task': 'enh_single',
          'train_dir': 'data/wav8k/min/train-100',
          'valid_dir': 'data/wav8k/min/dev'},
 'filterbank': {'kernel_size': 16, 'n_filters': 512, 'stride': 8},
 'main_args': {'exp_dir': 'exp/train_convtasnet_3rd_causal', 'help': None},
 'masknet': {'bn_chan': 128,
             'hid_chan': 512,
             'mask_act': 'relu',
             'n_blocks': 8,
             'n_repeats': 3,
             'skip_chan': 128},
 'optim': {'lr': 0.001, 'optimizer': 'adam', 'weight_decay': 0.0},
 'positional arguments': {},
 'training': {'batch_size': 14,
              'early_stop': True,
              'epochs': 200,
              'half_lr': True,
              'num_workers': 4}}
Drop 0 utterances from 13900 (shorter than 3 seconds)
Drop 0 utterances from 13900 (shorter than 3 seconds)
Sanity Checking: |          | 0/? [00:00<?, ?it/s]Traceback (most recent call last):
  File "O/asteroid/egs/librimix/ConvTasNet/train.py", line 146, in <module>
    main(arg_dic)
  File "O/asteroid/egs/librimix/ConvTasNet/train.py", line 112, in main
    trainer.fit(system)
  File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 987, in _run
    results = self._run_stage()
  File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1031, in _run_stage
    self._run_sanity_check()
  File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1060, in _run_sanity_check
    val_loop.run()
  File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 128, in run
    batch, batch_idx, dataloader_idx = next(data_fetcher)
  File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pytorch_lightning/loops/fetchers.py", line 133, in __next__
    batch = super().__next__()
  File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pytorch_lightning/loops/fetchers.py", line 60, in __next__
    batch = next(self.iterator)
  File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pytorch_lightning/utilities/combined_loader.py", line 341, in __next__
    out = next(self._iterator)
  File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pytorch_lightning/utilities/combined_loader.py", line 142, in __next__
    out = next(self.iterators[0])
  File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
  File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/torch/_utils.py", line 722, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'source_2_path'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "O/asteroid/asteroid/data/librimix_dataset.py", line 106, in __getitem__
    source_path = row[f"source_{i + 1}_path"]
  File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pandas/core/series.py", line 1112, in __getitem__
    return self._get_value(key)
  File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pandas/core/series.py", line 1228, in _get_value
    loc = self.index.get_loc(label)
  File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
    raise KeyError(key) from err
KeyError: 'source_2_path'

And here is my run.sh file:

#!/bin/bash

# Exit on error
set -e
set -o pipefail

# If you haven't generated LibriMix start from stage 0
# Main storage directory. You'll need disk space to store LibriSpeech, WHAM noises
# and LibriMix. This is about 500 Gb
storage_dir=O/asteroid/datasets

# After running the recipe a first time, you can run it from stage 3 directly to train new models.

# Path to the python you'll use for the experiment. Defaults to the current python
# You can run ./utils/prepare_python_env.sh to create a suitable python environment, paste the output here.
python_path=python

# Example usage
# ./run.sh --stage 3 --tag my_tag --task sep_noisy --id 0,1

# General
stage=0  # Controls from which stage to start
tag=""  # Controls the directory name associated to the experiment
# You can ask for several GPUs using id (passed to CUDA_VISIBLE_DEVICES)
id=$CUDA_VISIBLE_DEVICES
out_dir=librimix # Controls the directory name associated to the evaluation results inside the experiment directory

# Network config
n_blocks=8		# Number of conv blocks in each repeat
n_repeats=3		# Number of repeats in the Conv-TasNet
mask_act=relu
# Training config
epochs=200
batch_size=14
num_workers=4
half_lr=yes
early_stop=yes
# Optim config
optimizer=adam
lr=0.001
weight_decay=0.
# Data config
sample_rate=8000
mode=min		# LibriMix mode: min (mixture truncated to shortest source) or max (padded to longest)
n_src=2			# Number of speech sources in the mixture
segment=3
task=enh_single 	# one of 'enh_single', 'enh_both', 'sep_clean', 'sep_noisy'

eval_use_gpu=1
# Need to --compute_wer 1 --eval_mode max to be sure the user knows all the metrics
# are for the all mode.
compute_wer=0
eval_mode=

. utils/parse_options.sh


sr_string=$(($sample_rate/1000))
suffix=wav${sr_string}k/$mode

if [ -z "$eval_mode" ]; then
  eval_mode=$mode
fi

train_dir=data/$suffix/train-100
valid_dir=data/$suffix/dev
test_dir=data/wav${sr_string}k/$eval_mode/test

if [[ $stage -le  0 ]]; then
	echo "Stage 0: Generating Librimix dataset"
	if [ -z "$storage_dir" ]; then
		echo "Need to fill in the storage_dir variable in run.sh to run stage 0. Exiting"
		exit 1
	fi
  . local/generate_librimix.sh --storage_dir $storage_dir --n_src $n_src
fi

if [[ $stage -le  1 ]]; then
	echo "Stage 1: Generating csv files including wav path and duration"
  . local/prepare_data.sh --storage_dir $storage_dir --n_src $n_src
fi

# Generate a random ID for the run if no tag is specified
uuid=$($python_path -c 'import uuid, sys; print(str(uuid.uuid4())[:8])')
if [[ -z ${tag} ]]; then
	tag=${uuid}
fi

expdir=exp/train_convtasnet_${tag}
mkdir -p $expdir && echo $uuid >> $expdir/run_uuid.txt
echo "Results from the following experiment will be stored in $expdir"


if [[ $stage -le 2 ]]; then
  echo "Stage 2: Training"
  mkdir -p logs
  CUDA_VISIBLE_DEVICES=$id $python_path train.py --exp_dir $expdir \
		--n_blocks $n_blocks \
		--n_repeats $n_repeats \
		--mask_act $mask_act \
		--epochs $epochs \
		--batch_size $batch_size \
		--num_workers $num_workers \
		--half_lr $half_lr \
		--early_stop $early_stop \
		--optimizer $optimizer \
		--lr $lr \
		--weight_decay $weight_decay \
		--train_dir $train_dir \
		--valid_dir $valid_dir \
		--sample_rate $sample_rate \
		--n_src $n_src \
		--task $task \
		--segment $segment | tee logs/train_${tag}.log
	cp logs/train_${tag}.log $expdir/train.log

	# Get ready to publish
	mkdir -p $expdir/publish_dir
	echo "librimix/ConvTasNet" > $expdir/publish_dir/recipe_name.txt
fi


if [[ $stage -le 3 ]]; then
	echo "Stage 3 : Evaluation"

	if [[ $compute_wer -eq 1 ]]; then
	  if [[ $eval_mode != "max" ]]; then
	    echo "Cannot compute WER without max mode. Start again with --stage 2 --compute_wer 1 --eval_mode max"
	    exit 1
	  fi

    # Install espnet if not installed
    if ! python -c "import espnet" &> /dev/null; then
        echo 'This recipe requires espnet. Installing requirements.'
        $python_path -m pip install espnet_model_zoo
        $python_path -m pip install jiwer
        $python_path -m pip install tabulate
    fi
  fi

  $python_path eval.py \
    --exp_dir $expdir \
    --test_dir $test_dir \
  	--out_dir $out_dir \
  	--use_gpu $eval_use_gpu \
  	--compute_wer $compute_wer \
  	--task $task | tee logs/eval_${tag}.log

	cp logs/eval_${tag}.log $expdir/eval.log
fi

Could you please suggest whether there is an issue in my run.sh configuration?

Thanks,
Colin

MunbongChoi commented

I got the same error when I ran DPRNNTasNet on the LibriMix dataset. Did you ever solve it?

lingzhic commented Apr 1, 2024

> I got the same error when I ran DPRNNTasNet on the LibriMix dataset. Did you ever solve it?

It seems to be an issue with the dataset object. Check line 106 of asteroid/data/librimix_dataset.py and you will see why it happens (see the sketch below).
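For reference, the loop the traceback points to looks roughly like the sketch below. This is a paraphrase reconstructed from the traceback above, not the exact library source, and the function and variable names are illustrative. The enh_single metadata CSV appears to contain only a source_1_path column, so iterating up to n_src=2 asks for source_2_path and raises the KeyError.

import pandas as pd
import soundfile as sf

# Rough paraphrase of LibriMix.__getitem__ around librimix_dataset.py:106,
# reconstructed from the traceback above; names are illustrative only.
def load_sources(csv_path: str, idx: int, n_src: int):
    row = pd.read_csv(csv_path).iloc[idx]  # one row of the task's metadata CSV
    sources = []
    for i in range(n_src):
        # For enh_single the CSV appears to have only a 'source_1_path' column,
        # so with n_src=2 the i == 1 iteration raises KeyError: 'source_2_path'.
        source_path = row[f"source_{i + 1}_path"]
        sources.append(sf.read(source_path)[0])
    return sources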

MunbongChoi commented

I've analyzed your configuration to resolve that issue and found that n_src and task don't match: enh_single has only one target source, but n_src is set to 2. I'd recommend either changing the task or lowering n_src to 1 (see the example below).
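Concretely, the fix is to keep n_src consistent with the task, e.g. n_src=1 for enh_single (in run.sh, n_src=1 together with task=enh_single). As a minimal sketch, assuming the LibriMix dataset class exposed in asteroid.data and its usual keyword arguments, the training set the recipe builds would then look roughly like this:

from asteroid.data import LibriMix

# Minimal sketch (keyword arguments assumed): build the training set for
# single-source enhancement with n_src=1 so that only 'source_1_path' is
# read from the metadata CSV.
train_set = LibriMix(
    csv_dir="data/wav8k/min/train-100",  # same directory as train_dir in run.sh
    task="enh_single",
    sample_rate=8000,
    n_src=1,        # must match the single-source task
    segment=3,
)
mix, sources = train_set[0]  # sources now holds a single target channel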
