Rave 2.3.1 doesn't work on my Ubuntu #273

Open
Federico8691 opened this issue Dec 19, 2023 · 45 comments

Comments

@Federico8691

Dear everyone,
I am sorry to bother everyone again, but I am not a computer scientist and I really do not know where to turn to solve this kind of problem.
Since I updated RAVE to its latest version (2.3.1), it no longer works.
Here is a screenshot of the errors I get when I try to run training on an already preprocessed dataset.
Any help would be welcome, and thanks for your time.

best,

Federico
linux

@Federico8691
Author

This is what happens when I try to start my training now.
image
image

Any solution?

thanks

Federico

@Federico8691
Author

Now I am not even able to preprocess the audio files.
This is my user path:
image
image

Sorry for pressing you over all of this, but I am working with this tool and now I am stuck.

Many thanks,

Federico

@caillonantoine
Collaborator

caillonantoine commented Dec 20, 2023 via email

@Federico8691
Author

Hi Antoine,
thanks for your kind reply.
Maybe I could move back to v1. I made many models with that version. How should I proceed? I have no idea how to do it.
Many thanks!

Federico

@Federico8691
Author

On the website it states:

The original implementation of the RAVE model can be restored using

git checkout v1

What does it mean?

thanks

Federico

@caillonantoine
Collaborator

caillonantoine commented Dec 20, 2023 via email

@Federico8691
Author

Hi Antoine,

Just moved back to 2.1.1.
I am still getting the same error message.
image

@Federico8691
Author

I do not understand. Everything seems ok. It was working until this morning.

@Federico8691
Author

It looks like it is trying to access that directory but nothing is in there. I am using the same files I was using in my last training session.
Here is my path:
image

So it makes no sense to me.
Any possible way out?

@Federico8691
Author

What I do not understand is the "no such file or directory" message: the directory is there, right in front of my eyes. Maybe it is a problem with the installation of RAVE inside miniconda3, a directory mismatch?

@Federico8691
Author

When I abort the process, I get a long stream of error lines; here is what is printed at the end.
image

@Federico8691
Author

Is there any way to get my hands on v1 on my Linux machine? It was working so well. My best models were made with v1.

@Federico8691
Author

Any help or suggestions?

@caillonantoine
Collaborator

caillonantoine commented Dec 20, 2023 via email

@caillonantoine
Collaborator

caillonantoine commented Dec 20, 2023 via email

@Federico8691
Author

Using this in the terminal?

git clone https://github.com/acids-ircam/RAVE
cd RAVE
pip install -r requirements.txt

and then ..

You can now use python cli_helper.py to start a new training !

thanks!

@vidalfer

I am running this on Google Colab with the command below and using acids-rave==2.1.1:

!rave train --config "v2" --db_path '/content/dataset' --name "testRave" --val_every 2500

It worked for me after I addressed an issue with the second --config flag, which is for the regularization method. By removing this flag, the training started normally.

@Federico8691
Author

I am running this on Google Colab with the command below and using acids-rave==2.1.1:

!rave train --config "v2" --db_path '/content/dataset' --name "testRave" --val_every 2500

It worked for me after I addressed an issue with the second --config flag, which is for the regularization method. By removing this flag, the training started normally.

Hi,
I am running all of this on Linux with an Nvidia 3090, but for some reason unknown to me I cannot get it to work again.
Version 2.1.1 was fine until I installed the 2.3.1 update, a very, very bad idea.
Using Colab is practically impossible; it takes ages to train a model :-D

@domkirke
Collaborator

You clearly have a problem with your data paths; the RAVE version has no influence here. Double-check the path of your input folder (for instance by dragging and dropping the folder into the console to retrieve its path).

To be precise: since 2.3.1, RAVE no longer pins torch to 1.13, in order to be compatible with newer devices. I only advised you to update your libraries :)
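
For readers who want to rule out a path problem before rerunning preprocessing, here is a minimal Python sketch. It is not part of RAVE: the helper name `inspect_input_folder` and the list of audio extensions are assumptions made for illustration. It resolves a folder name against the current working directory, which is where a relative `--input_path` is looked up, and counts the audio files found there.

```python
from pathlib import Path

# Assumed list of extensions, for illustration only.
AUDIO_SUFFIXES = {".wav", ".aif", ".aiff", ".flac", ".mp3", ".ogg"}


def inspect_input_folder(folder):
    """Resolve `folder` to an absolute path and count the audio files below it."""
    path = Path(folder).expanduser().resolve()
    if not path.is_dir():
        return {"resolved": str(path), "exists": False, "audio_files": 0}
    audio = [
        p for p in path.rglob("*")
        if p.is_file() and p.suffix.lower() in AUDIO_SUFFIXES
    ]
    return {"resolved": str(path), "exists": True, "audio_files": len(audio)}


if __name__ == "__main__":
    # Folder name taken from this thread; run this from the same directory
    # in which you launch the rave commands.
    print(inspect_input_folder("Blippo_dataset"))
```

If `exists` is False, the `resolved` entry shows the absolute path that was actually looked up, which makes a mismatch with the intended folder easy to spot.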

@Federico8691
Author

Axel wrote:

You clearly have a problem with your data paths; the RAVE version has no influence here. Double-check the path of your input folder (for instance by dragging and dropping the folder into the console to retrieve its path).

This is exactly what I do as you can see from this image:

image

I have the miniconda3 folder (where RAVE is installed) and two additional folders: Blippo_dataset, which contains the audio files, and myDataset, which holds the preprocessed data, plus the runs folder.
So to preprocess, I run the command with these two paths:

rave preprocess --input_path 'Blippo_dataset..' --output_path 'myDataset'

Then when I run it I get this:

image

To be precise: since 2.3.1, RAVE no longer pins torch to 1.13, in order to be compatible with newer devices. I only advised you to update your libraries :)

I followed your suggestion as you wrote it.

Maybe the problem is where RAVE is installed?
From which directory should the command pip install acids-RAVE be run?
Does RAVE need to be in the miniconda3 folder? Every time I install it, it ends up there.

Thanks for your kind help

@Federico8691
Author

So, dear friends,
I spent all night trying to get RAVE working on my Ubuntu machine, with no success.
It is probably a problem with my miniconda installation or some directory conflict; I have no idea. I tried both 2.1.1 and the most recent version, 2.3.1.
image
I followed all your instructions step by step. I know RAVE itself is fine, but I find myself in a very difficult position.
I am a professor at Saint Louis College of Music and I need tools for my work. I have been an Ircam Forum subscriber since 1996, and I have been a strong supporter and endorser of RAVE since the very beginning (as recently as last week, when I gave a full presentation of its potential at the Institute of Sonology in Den Haag). Without any support from you I cannot take my commitment further. That would be a pity given RAVE's potential, so the only thing I can do is offer to pay for online support from anyone willing to help.
This is my last effort, because I really cannot spend days and nights wandering without any reference material or tutorials that would let the end user become autonomous.

Looking forward to an answer.

best to you,

Federico

@domkirke
Collaborator

Federico,
as written in the FAQ (https://github.com/acids-ircam/RAVE?tab=readme-ov-file#frequently-asked-question-faq), and as I have pointed out in other issues, this problem occurs because your sounds are not long enough for a standard RAVE training configuration. The classic preprocessing pipeline requires audio files of at least 2 * n_signal samples (n_signal defaults to 131072), hence about 5 seconds.

Please refer to the FAQ for the answers and let me know.
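
To make the length requirement concrete, the arithmetic can be sketched in a few lines of Python. The sampling rates of 44100 Hz and 48000 Hz are assumptions for illustration (the thread does not say which one the dataset uses), and `min_duration_seconds` is a hypothetical helper, not a RAVE function.

```python
N_SIGNAL = 131072  # default value of n_signal quoted above


def min_duration_seconds(sampling_rate, n_signal=N_SIGNAL):
    """Duration in seconds of a file holding exactly 2 * n_signal samples."""
    return 2 * n_signal / sampling_rate


# Assumed sampling rates; replace with the rate of your own dataset.
for sr in (44100, 48000):
    print(f"{sr} Hz: {min_duration_seconds(sr):.2f} s")
```

Passing a smaller `n_signal` lowers the minimum duration proportionally.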

PS: Regarding Python environments, I think it would not be a waste of time (especially if you use Python on a daily basis) to read up on basic environment handling with online resources (https://realpython.com/python-virtual-environments-a-primer/). Environments can be painful even for advanced users, so I would strongly advise you to get familiar with the basic commands if you really need them.
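
On the question of where RAVE ends up: pip installs into the environment of the Python interpreter it belongs to, regardless of the directory the command is run from. Below is a minimal sketch to check this from Python, assuming the package name `acids-rave` used with pip in this thread; `describe_environment` is a hypothetical helper.

```python
import sys
from importlib import metadata


def describe_environment(package="acids-rave"):
    """Report the running interpreter and the version of `package` it can see."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None  # the package is not installed in this environment
    return {
        "python": sys.executable,   # interpreter currently running
        "environment": sys.prefix,  # root folder of the active environment
        "package": package,
        "version": version,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If `version` comes back as None, the interpreter you are running is not the one the package was installed for, which usually means a different environment is active.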

@Federico8691
Author

Hi Axel,

thanks for your kind answer.
This is strange, because I was training on the same dataset (6 hours of material) with 2.1.1 with no problem. The preprocessing went very well and so did the training. So this is new to me.
I will try to make adjustments this evening when I am back from the university.
Could we arrange a Google Meet for tomorrow afternoon around 5 pm? We can discuss arrangements in private then. It would be of great help. Many thanks!

@domkirke
Collaborator

There was actually a bug in 2.1 that prevented random cropping during training, which led to overfitting problems. This bug has been fixed.
The FAQ (which you have read, I imagine) explains how to solve the problem:
https://github.com/acids-ircam/RAVE?tab=readme-ov-file#frequently-asked-question-faq
Try a sample size of 65536; this restores the RAVE v1 behaviour.

@chrizzlemadizzle

Getting the same error message (working on Colab) with v2.3.0 (installed with !/content/miniconda/bin/pip install acids-rave==2.3), using --config v2 --config default. All steps (preprocessing, training and exporting) worked fine with the same dataset three days ago.

/content /content/drive/MyDrive/AI/RAVE/vivaZweiTraining/2023-12-17
dataset length: 0:19:13.195828: : 195it [00:10, 19.29it/s]
/content/miniconda/lib/python3.9/site-packages/torch/nn/init.py:412: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
/content/miniconda/lib/python3.9/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
Traceback (most recent call last):
File "/content/miniconda/bin/rave", line 8, in <module>
sys.exit(main())
File "/content/miniconda/lib/python3.9/site-packages/scripts/main_cli.py", line 30, in main
app.run(train.main)
File "/content/miniconda/lib/python3.9/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/content/miniconda/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/content/miniconda/lib/python3.9/site-packages/scripts/train.py", line 159, in main
model = rave.RAVE(n_channels=FLAGS.channels)
File "/content/miniconda/lib/python3.9/site-packages/gin/config.py", line 1605, in gin_wrapper
utils.augment_exception_message_and_reraise(e, err_str)
File "/content/miniconda/lib/python3.9/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
raise proxy.with_traceback(exception.__traceback__) from None
File "/content/miniconda/lib/python3.9/site-packages/gin/config.py", line 1582, in gin_wrapper
return fn(*new_args, **new_kwargs)
File "/content/miniconda/lib/python3.9/site-packages/rave/model.py", line 188, in __init__
self.decoder = decoder(n_channels=n_channels)
File "/content/miniconda/lib/python3.9/site-packages/gin/config.py", line 1605, in gin_wrapper
utils.augment_exception_message_and_reraise(e, err_str)
File "/content/miniconda/lib/python3.9/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
raise proxy.with_traceback(exception.__traceback__) from None
File "/content/miniconda/lib/python3.9/site-packages/gin/config.py", line 1582, in gin_wrapper
return fn(*new_args, **new_kwargs)
File "/content/miniconda/lib/python3.9/site-packages/gin/config.py", line 516, in meta_call_wrapper
return cls_meta.__call__(new_cls, *args, **kwargs)
File "/content/miniconda/lib/python3.9/site-packages/rave/blocks.py", line 675, in __init__
waveform_module = normalization(
File "/content/miniconda/lib/python3.9/site-packages/gin/config.py", line 1605, in gin_wrapper
utils.augment_exception_message_and_reraise(e, err_str)
File "/content/miniconda/lib/python3.9/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
raise proxy.with_traceback(exception.__traceback__) from None
File "/content/miniconda/lib/python3.9/site-packages/gin/config.py", line 1582, in gin_wrapper
return fn(*new_args, **new_kwargs)
File "/content/miniconda/lib/python3.9/site-packages/rave/blocks.py", line 20, in normalization
return weight_norm(module)
File "/content/miniconda/lib/python3.9/site-packages/torch/nn/utils/weight_norm.py", line 132, in weight_norm
WeightNorm.apply(module, name, dim)
File "/content/miniconda/lib/python3.9/site-packages/torch/nn/utils/weight_norm.py", line 50, in apply
module.register_parameter(name + '_g', Parameter(norm_except_dim(weight, 2, dim).data))
RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1] because the unspecified dimension size -1 can be any value and is ambiguous
In call to configurable 'normalization' (<function normalization at 0x7df9d8429e50>)
In call to configurable 'GeneratorV2' (<class 'rave.blocks.GeneratorV2'>)
In call to configurable 'RAVE' (<class 'rave.model.RAVE'>)

@domkirke
Collaborator

Are you resuming a checkpoint? I cannot reproduce this bug from a fresh training

@augustross3

augustross3 commented Dec 21, 2023

Also getting this same error when using 2.3.1 on a fresh training.

Install and preprocess worked fine, then training produced:

I1221 20:56:26.757378 140430877622912 resource_reader.py:50] system_path_file_exists:v3.gin
E1221 20:56:26.757585 140430877622912 resource_reader.py:55] Path not found: v3.gin
I1221 20:56:26.757645 140430877622912 resource_reader.py:50] system_path_file_exists:/opt/conda/lib/python3.10/site-packages/rave/v3.gin
E1221 20:56:26.757689 140430877622912 resource_reader.py:55] Path not found: /opt/conda/lib/python3.10/site-packages/rave/v3.gin
I1221 20:56:26.757906 140430877622912 resource_reader.py:50] system_path_file_exists:configs/v2.gin
E1221 20:56:26.758018 140430877622912 resource_reader.py:55] Path not found: configs/v2.gin
I1221 20:56:26.758452 140430877622912 resource_reader.py:50] system_path_file_exists:configs/v1.gin
E1221 20:56:26.758562 140430877622912 resource_reader.py:55] Path not found: configs/v1.gin
I1221 20:56:26.774069 140430877622912 resource_reader.py:50] system_path_file_exists:configs/adain.gin
E1221 20:56:26.774199 140430877622912 resource_reader.py:55] Path not found: configs/adain.gin
I1221 20:56:26.775029 140430877622912 resource_reader.py:50] system_path_file_exists:configs/snake.gin
E1221 20:56:26.775145 140430877622912 resource_reader.py:55] Path not found: configs/snake.gin
I1221 20:56:26.779302 140430877622912 resource_reader.py:50] system_path_file_exists:configs/descript_discriminator.gin
E1221 20:56:26.779419 140430877622912 resource_reader.py:55] Path not found: configs/descript_discriminator.gin
/opt/conda/lib/python3.10/site-packages/torch/nn/init.py:412: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
/opt/conda/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
Traceback (most recent call last):
File "/opt/conda/bin/rave", line 8, in <module>
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/scripts/main_cli.py", line 30, in main
app.run(train.main)
File "/opt/conda/lib/python3.10/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/opt/conda/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/opt/conda/lib/python3.10/site-packages/scripts/train.py", line 159, in main
model = rave.RAVE(n_channels=FLAGS.channels)
File "/opt/conda/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
utils.augment_exception_message_and_reraise(e, err_str)
File "/opt/conda/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
raise proxy.with_traceback(exception.__traceback__) from None
File "/opt/conda/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
return fn(*new_args, **new_kwargs)
File "/opt/conda/lib/python3.10/site-packages/rave/model.py", line 188, in __init__
self.decoder = decoder(n_channels=n_channels)
File "/opt/conda/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
utils.augment_exception_message_and_reraise(e, err_str)
File "/opt/conda/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
raise proxy.with_traceback(exception.__traceback__) from None
File "/opt/conda/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
return fn(*new_args, **new_kwargs)
File "/opt/conda/lib/python3.10/site-packages/gin/config.py", line 516, in meta_call_wrapper
return cls_meta.__call__(new_cls, *args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/rave/blocks.py", line 675, in __init__
waveform_module = normalization(
File "/opt/conda/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
utils.augment_exception_message_and_reraise(e, err_str)
File "/opt/conda/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
raise proxy.with_traceback(exception.__traceback__) from None
File "/opt/conda/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
return fn(*new_args, **new_kwargs)
File "/opt/conda/lib/python3.10/site-packages/rave/blocks.py", line 20, in normalization
return weight_norm(module)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py", line 132, in weight_norm
WeightNorm.apply(module, name, dim)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py", line 50, in apply
module.register_parameter(name + '_g', Parameter(norm_except_dim(weight, 2, dim).data))
RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1] because the unspecified dimension size -1 can be any value and is ambiguous
In call to configurable 'normalization' (<function normalization at 0x7fb7e8931b40>)
In call to configurable 'GeneratorV2' (<class 'rave.blocks.GeneratorV2'>)
In call to configurable 'RAVE' (<class 'rave.model.RAVE'>)

@domkirke
Collaborator

Can you please give

  • your system and torch / torchaudio versions
  • the full preprocess command
  • the full train command
    and, if a folder is created, the .gin file inside it that summarizes the architecture of the training

I tried with v2 and v3, and on GitHub Actions these configurations pass the tests. There must be something wrong in your config or in your database.

@augustross3

Hi domkirke

My system is an RTX 3090 24GB/AMD EPYC 7551P/32GB ram/Torch = 2.1.2/Torchaudio = 2.1.2
Operating System: Linux 5.15.0-83-generic #92~20.04.1-Ubuntu SMP Mon Aug 21 14:00:49 UTC 2023

Steps from a completely clean system install

  1. pip install acids-rave
    Successful
  2. conda install ffmpeg
    Successful
  3. Grabbed audio for training
  4. rave preprocess --input_path audio/ --output_path dataset/
    dataset length: 0:57:15.810249: : 579it [00:03, 148.08it/s]
    Successful
  5. rave train --config v3 --db_path dataset/ --out_path model/ --name brute --val_every 2500
    (the exact same error as posted above)
    Fails

You mentioned "if a folder is created, the .gin inside summarizing the architecture of the training". I'm not sure where this folder would be located. If you're referring to the dataset folder from step 4, the resulting folder did not have a .gin file inside it.

If you need any more information, please let me know. Thanks in advance.

@domkirke
Collaborator

OK, got it: it is a problem with the default number of audio channels. In the meantime, please add --channels X (X being your number of channels: 1 for mono, 2 for stereo, etc.). I will add that to the README.md and fix it in the next version. Does it work? Thanks!
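
If you are unsure which value to pass, the channel count of WAV files can be read with Python's standard `wave` module. This is a rough sketch under assumptions: it only covers plain PCM `.wav` files (other formats would need an audio library), and `count_channels` is a hypothetical helper, not a RAVE command.

```python
import wave
from collections import Counter
from pathlib import Path


def count_channels(folder):
    """Map channel count -> number of readable PCM .wav files under `folder`."""
    counts = Counter()
    for path in sorted(Path(folder).rglob("*.wav")):
        try:
            with wave.open(str(path), "rb") as wav:
                counts[wav.getnchannels()] += 1
        except (wave.Error, EOFError):
            # The stdlib reader only handles plain, non-empty PCM WAV files.
            counts["unreadable"] += 1
    return counts
```

Calling `count_channels("audio/")` on the input folder returns something like a tally of mono (1) and stereo (2) files; a mixed result suggests checking the dataset before choosing X.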

@augustross3

Unfortunately still getting an error when adding this flag. It's a different error this time though. Maybe I need to remake the dataset with this channel flag?

rave train --config v3 --db_path dataset/ --out_path model/ --name brute --channels 2 --val_every 2500

Traceback (most recent call last):
File "/opt/conda/bin/rave", line 8, in <module>
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/scripts/main_cli.py", line 30, in main
app.run(train.main)
File "/opt/conda/lib/python3.10/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/opt/conda/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/opt/conda/lib/python3.10/site-packages/scripts/train.py", line 139, in main
n_channels = rave.dataset.get_training_channels(FLAGS.db_path, FLAGS.channels)
File "/opt/conda/lib/python3.10/site-packages/rave/dataset.py", line 167, in get_training_channels
raise RuntimeError('[Error] Requested number of channels is %s, but dataset has %s channels')%(FLAGS.channels, dataset_channels)
NameError: name 'FLAGS' is not defined
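
The NameError above hides the intended message. Judging only from the traceback, the `%` formatting is applied outside the `RuntimeError(...)` call and refers to a `FLAGS` name that is not defined in that module, so Python fails while building the message. The sketch below shows the same kind of check with the formatting moved inside the call; `check_channels` is a hypothetical stand-in and not RAVE's actual code.

```python
def check_channels(requested, dataset_channels):
    """Hypothetical stand-in for the channel check seen in the traceback."""
    if requested != dataset_channels:
        # Format the message inside the call, using only local names.
        raise RuntimeError(
            "[Error] Requested number of channels is %s, but dataset has %s channels"
            % (requested, dataset_channels)
        )
    return requested
```

With this pattern, asking for 2 channels on a mono dataset would surface the readable "Requested number of channels is 2, but dataset has 1 channels" message instead of a NameError.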

@domkirke
Collaborator

There is a bug in the error message, which I have also corrected, but yes, it seems that you preprocessed your dataset as mono and then asked for stereo. Add --channels 2 to your preprocess step as well as to the train step.
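
One way to keep the two steps consistent is to build both commands from a single channel value. The sketch below only assembles the command lines from flags that appear in this thread (`--input_path`, `--output_path`, `--db_path`, `--config`, `--name`, `--channels`); it does not run them, and `build_commands` is a hypothetical helper.

```python
import shlex


def build_commands(input_path, db_path, name, channels, config="v2"):
    """Build preprocess and train argument lists that share one --channels value."""
    preprocess = [
        "rave", "preprocess",
        "--input_path", input_path,
        "--output_path", db_path,
        "--channels", str(channels),
    ]
    train = [
        "rave", "train",
        "--config", config,
        "--db_path", db_path,
        "--name", name,
        "--channels", str(channels),
    ]
    return preprocess, train


# Paths and names taken from the commands posted earlier in this thread.
pre, train = build_commands("audio/", "dataset/", "brute", channels=2, config="v3")
print(shlex.join(pre))
print(shlex.join(train))
```

The printed lines can be pasted into a terminal, or the lists passed to `subprocess.run` if you prefer to launch the steps from a script.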

@augustross3

Got it. I am not sure if this is an issue with my system, but when I try to preprocess the dataset with the --channels 2 flag I get a threading error that seemingly freezes the process. I tried again using the original command (which I assume defaults to mono) and it worked. Is there a flag to limit the number of threads, or something like that? I appreciate your help with this.

rave preprocess --input_path audio/ --output_path dataset/ --channels 2
dataset length: 0:07:55.544671: : 73it [00:03, 48.32it/s]Exception in thread Thread-5 (accepter):
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/opt/conda/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.10/multiprocessing/managers.py", line 194, in accepter
t.start()
File "/opt/conda/lib/python3.10/threading.py", line 935, in start
_start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
dataset length: 0:30:54.624218: : 301it [00:20, 190.21it/s]

  • Gets stuck here

BUT

rave preprocess --input_path audio/ --output_path dataset/
dataset length: 0:57:15.810249: : 579it [00:03, 151.52it/s]

  • Successful

@domkirke
Collaborator

Works fine in my configuration, on both Mac and Linux. What does ulimit -u return on your computer?
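
The same limit can be read from Python on Linux with the standard `resource` module; the soft value of `RLIMIT_NPROC` is what `ulimit -u` reports. This is only a diagnostic sketch: on cloud machines and containers, other limits (for example a cgroup limit on the number of processes) may apply even when this value is unlimited, which is an assumption worth checking rather than a confirmed cause here. `process_limits` is a hypothetical helper.

```python
import resource
import threading


def process_limits():
    """Return the (soft, hard) RLIMIT_NPROC values for the current user."""
    def readable(value):
        return "unlimited" if value == resource.RLIM_INFINITY else value

    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return readable(soft), readable(hard)


if __name__ == "__main__":
    print("max user processes (soft, hard):", process_limits())
    print("threads alive in this interpreter:", threading.active_count())
```

A small soft limit would be consistent with a "can't start new thread" error; an unlimited one points to a limit imposed elsewhere.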

@augustross3

augustross3 commented Dec 22, 2023

ulimit -u
unlimited

Hmm. Not sure why this would be the case. Think I may try a different machine. Have been using a cloud machine for training.

Edit 2: Redid steps on a new machine. Same issue

!rave preprocess --input_path audio/ --output_path dataset/ --channels 2

0it [00:00, ?it/s]Exception in thread Thread-5 (accepter):
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/opt/conda/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.10/multiprocessing/managers.py", line 194, in accepter
t.start()
File "/opt/conda/lib/python3.10/threading.py", line 935, in start
_start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread

@Federico8691
Author

Hi everyone,

This is the same error I was getting before the mess.

Same content; you can double-check it here:
the first block of lines reproduces the exact problem I had before.

image


So we must find a way to solve this.

@chrizzlemadizzle

chrizzlemadizzle commented Dec 22, 2023

Are you resuming a checkpoint? I cannot reproduce this bug from a fresh training

Yes, I was. Still getting the same error.

However, if I start a fresh training with v2.3.0 using the same dataset and these configurations, I can get it running (resuming works, too).

!/content/miniconda/bin/rave preprocess --input_path $dataset --output_path $preprocessed_dataset --channels 2
!/content/miniconda/bin/rave train --config v2 --config default --db_path $preprocessed_dataset --name $name --val_every 100 --channels 2

However, when trying to export with

!/content/miniconda/bin/rave export --run $model_dir --streaming --channels 2 --fidelity 0.999

I receive the following error:

INFO:root:library loading
INFO:root:DEBUG
I1222 09:36:18.636941 136991208879168 export.py:495] building rave
I1222 09:36:18.654077 136991208879168 resource_reader.py:50] system_path_file_exists:/content/drive/MyDrive/AI/RAVE/vivaZweiTraining/2023-12-22-testing/config.gin
E1222 09:36:18.654393 136991208879168 resource_reader.py:55] Path not found: /content/drive/MyDrive/AI/RAVE/vivaZweiTraining/2023-12-22-testing/config.gin
Traceback (most recent call last):
File "/content/miniconda/bin/rave", line 8, in <module>
sys.exit(main())
File "/content/miniconda/lib/python3.9/site-packages/scripts/main_cli.py", line 38, in main
app.run(export.main)
File "/content/miniconda/lib/python3.9/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/content/miniconda/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/content/miniconda/lib/python3.9/site-packages/scripts/export.py", line 500, in main
gin.parse_config_file(config_file)
File "/content/miniconda/lib/python3.9/site-packages/gin/config.py", line 2457, in parse_config_file
raise IOError(err_str.format(config_file, prefixes))
OSError: Unable to open file: /content/drive/MyDrive/AI/RAVE/vivaZweiTraining/2023-12-22-testing/config.gin. Searched config paths: [''].

Moreover, using v2.3.1 with these configurations seems to work fine, too, for training and resuming training.

!/content/miniconda/bin/rave preprocess --input_path $dataset --output_path $preprocessed_dataset --channels 2
!/content/miniconda/bin/rave train --config v3 --config wasserstein --db_path $preprocessed_dataset --name $name --val_every 100 --channels 2

But when trying to export the model with

!/content/miniconda/bin/rave export --run $model_dir --streaming --channels 2 --fidelity 0.999

I get the error below, and I do not know the reason for it. Did I maybe not train long enough?

INFO:root:library loading
INFO:root:DEBUG
I1222 09:22:53.708134 137753289249856 export.py:495] building rave
/content/miniconda/lib/python3.9/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
/content/miniconda/lib/python3.9/site-packages/torchaudio/transforms/_transforms.py:94: UserWarning: return_complex argument is now deprecated and is not effective.torchaudio.transforms.Spectrogram(power=None) always returns a tensor with complex dtype. Please remove the argument in the function call.
warnings.warn(
I1222 09:22:54.934781 137753289249856 export.py:505] model found : /content/drive/MyDrive/AI/RAVE/vivaZweiTraining/2023-12-18-testing/runs/viva_35fc26584b/version_1/checkpoints/epoch-epoch=0023.ckpt
Traceback (most recent call last):
File "/content/miniconda/bin/rave", line 8, in <module>
sys.exit(main())
File "/content/miniconda/lib/python3.9/site-packages/scripts/main_cli.py", line 38, in main
app.run(export.main)
File "/content/miniconda/lib/python3.9/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/content/miniconda/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/content/miniconda/lib/python3.9/site-packages/scripts/export.py", line 513, in main
pretrained.load_state_dict(
File "/content/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for RAVE:
size mismatch for encoder.encoder.net.0.weight_v: copying a param with shape torch.Size([96, 32, 7]) from checkpoint, the shape in current model is torch.Size([96, 16, 7]).
size mismatch for decoder.net.32.weight_g: copying a param with shape torch.Size([64, 1, 1]) from checkpoint, the shape in current model is torch.Size([32, 1, 1]).
size mismatch for decoder.net.32.weight_v: copying a param with shape torch.Size([64, 96, 7]) from checkpoint, the shape in current model is torch.Size([32, 96, 7]).
size mismatch for discriminator.discriminators.0.convs.0.0.weight_v: copying a param with shape torch.Size([32, 2, 5, 1]) from checkpoint, the shape in current model is torch.Size([32, 1, 5, 1]).
size mismatch for discriminator.discriminators.1.convs.0.0.weight_v: copying a param with shape torch.Size([32, 2, 5, 1]) from checkpoint, the shape in current model is torch.Size([32, 1, 5, 1]).
size mismatch for discriminator.discriminators.2.convs.0.0.weight_v: copying a param with shape torch.Size([32, 2, 5, 1]) from checkpoint, the shape in current model is torch.Size([32, 1, 5, 1]).
size mismatch for discriminator.discriminators.3.convs.0.0.weight_v: copying a param with shape torch.Size([32, 2, 5, 1]) from checkpoint, the shape in current model is torch.Size([32, 1, 5, 1]).
size mismatch for discriminator.discriminators.4.convs.0.0.weight_v: copying a param with shape torch.Size([32, 2, 5, 1]) from checkpoint, the shape in current model is torch.Size([32, 1, 5, 1]).
size mismatch for discriminator.discriminators.5.band_convs.0.0.0.weight_v: copying a param with shape torch.Size([32, 4, 3, 9]) from checkpoint, the shape in current model is torch.Size([32, 2, 3, 9]).
size mismatch for discriminator.discriminators.5.band_convs.1.0.0.weight_v: copying a param with shape torch.Size([32, 4, 3, 9]) from checkpoint, the shape in current model is torch.Size([32, 2, 3, 9]).
size mismatch for discriminator.discriminators.5.band_convs.2.0.0.weight_v: copying a param with shape torch.Size([32, 4, 3, 9]) from checkpoint, the shape in current model is torch.Size([32, 2, 3, 9]).
size mismatch for discriminator.discriminators.5.band_convs.3.0.0.weight_v: copying a param with shape torch.Size([32, 4, 3, 9]) from checkpoint, the shape in current model is torch.Size([32, 2, 3, 9]).
size mismatch for discriminator.discriminators.5.band_convs.4.0.0.weight_v: copying a param with shape torch.Size([32, 4, 3, 9]) from checkpoint, the shape in current model is torch.Size([32, 2, 3, 9]).
size mismatch for discriminator.discriminators.6.band_convs.0.0.0.weight_v: copying a param with shape torch.Size([32, 4, 3, 9]) from checkpoint, the shape in current model is torch.Size([32, 2, 3, 9]).
size mismatch for discriminator.discriminators.6.band_convs.1.0.0.weight_v: copying a param with shape torch.Size([32, 4, 3, 9]) from checkpoint, the shape in current model is torch.Size([32, 2, 3, 9]).
size mismatch for discriminator.discriminators.6.band_convs.2.0.0.weight_v: copying a param with shape torch.Size([32, 4, 3, 9]) from checkpoint, the shape in current model is torch.Size([32, 2, 3, 9]).
size mismatch for discriminator.discriminators.6.band_convs.3.0.0.weight_v: copying a param with shape torch.Size([32, 4, 3, 9]) from checkpoint, the shape in current model is torch.Size([32, 2, 3, 9]).
size mismatch for discriminator.discriminators.6.band_convs.4.0.0.weight_v: copying a param with shape torch.Size([32, 4, 3, 9]) from checkpoint, the shape in current model is torch.Size([32, 2, 3, 9]).
size mismatch for discriminator.discriminators.7.band_convs.0.0.0.weight_v: copying a param with shape torch.Size([32, 4, 3, 9]) from checkpoint, the shape in current model is torch.Size([32, 2, 3, 9]).
size mismatch for discriminator.discriminators.7.band_convs.1.0.0.weight_v: copying a param with shape torch.Size([32, 4, 3, 9]) from checkpoint, the shape in current model is torch.Size([32, 2, 3, 9]).
size mismatch for discriminator.discriminators.7.band_convs.2.0.0.weight_v: copying a param with shape torch.Size([32, 4, 3, 9]) from checkpoint, the shape in current model is torch.Size([32, 2, 3, 9]).
size mismatch for discriminator.discriminators.7.band_convs.3.0.0.weight_v: copying a param with shape torch.Size([32, 4, 3, 9]) from checkpoint, the shape in current model is torch.Size([32, 2, 3, 9]).
size mismatch for discriminator.discriminators.7.band_convs.4.0.0.weight_v: copying a param with shape torch.Size([32, 4, 3, 9]) from checkpoint, the shape in current model is torch.Size([32, 2, 3, 9]).

@domkirke
Collaborator

@Federico8691 Can you add --channels 1 to the train command, as indicated in the message just before yours?

@domkirke
Collaborator

@chrizzlemadizzle v3.2.0 had trouble with the Wasserstein and spherical configs, which led to global fixes and v3.2.1. So, if I understand correctly: you are not able to export a Wasserstein model trained with v3.2.1? Could you open a separate issue, as this is not the topic of the current one? Thanks :-)

@augustross3

augustross3 commented Dec 22, 2023

ulimit -u unlimited

Hmm. Not sure why this would be the case. Think I may try a different machine. Have been using a cloud machine for training.

Edit 2: Redid steps on a new machine. Same issue

!rave preprocess --input_path audio/ --output_path dataset/ --channels 2

0it [00:00, ?it/s]Exception in thread Thread-5 (accepter):
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/opt/conda/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.10/multiprocessing/managers.py", line 194, in accepter
t.start()
File "/opt/conda/lib/python3.10/threading.py", line 935, in start
_start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread

I wanted to post a quick update. Using a clean install on the exact same machines, I rolled back to v2.3 and repeated the exact steps (i.e. using --channels 2 and the process described previously), and RAVE worked flawlessly and is now training. This leads me to believe something may have changed in 2.3.1 with the threading.

EDIT: In attempting to export the model from this version/setup I'm getting a "size mismatch" error

rave export --run model/brute_caeb149cc6 --streaming --channels 2 --output modelfinal/

root@C.7946089:/workspace$ rave export --run model/brute_caeb149cc6 --streaming --channels 2 --output modelfinal/
INFO:root:library loading
INFO:root:DEBUG
I1222 11:00:21.188149 139990510551680 export.py:495] building rave
/opt/conda/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
/opt/conda/lib/python3.10/site-packages/torchaudio/transforms/_transforms.py:94: UserWarning: return_complex argument is now deprecated and is not effective.torchaudio.transforms.Spectrogram(power=None) always returns a tensor with complex dtype. Please remove the argument in the function call.
warnings.warn(
I1222 11:00:21.866852 139990510551680 export.py:505] model found : model/brute_caeb149cc6/version_0/checkpoints/best.ckpt
Traceback (most recent call last):
File "/opt/conda/bin/rave", line 8, in <module>
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/scripts/main_cli.py", line 38, in main
app.run(export.main)
File "/opt/conda/lib/python3.10/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/opt/conda/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/opt/conda/lib/python3.10/site-packages/scripts/export.py", line 513, in main
pretrained.load_state_dict(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for RAVE:
size mismatch for encoder.encoder.net.0.weight_v: copying a param with shape torch.Size([96, 32, 7]) from checkpoint, the shape in current model is torch.Size([96, 16, 7]).
size mismatch for decoder.net.32.weight_g: copying a param with shape torch.Size([64, 1, 1]) from checkpoint, the shape in current model is torch.Size([32, 1, 1]).
size mismatch for decoder.net.32.weight_v: copying a param with shape torch.Size([64, 96, 7]) from checkpoint, the shape in current model is torch.Size([32, 96, 7]).
size mismatch for discriminator.discriminators.0.convs.0.0.weight_v: copying a param with shape torch.Size([32, 2, 5, 1]) from checkpoint, the shape in current model is torch.Size([32, 1, 5, 1]).
size mismatch for discriminator.discriminators.1.convs.0.0.weight_v: copying a param with shape torch.Size([32, 2, 5, 1]) from checkpoint, the shape in current model is torch.Size([32, 1, 5, 1]).
size mismatch for discriminator.discriminators.2.convs.0.0.weight_v: copying a param with shape torch.Size([32, 2, 5, 1]) from checkpoint, the shape in current model is torch.Size([32, 1, 5, 1]).
size mismatch for discriminator.discriminators.3.convs.0.0.weight_v: copying a param with shape torch.Size([32, 2, 5, 1]) from checkpoint, the shape in current model is torch.Size([32, 1, 5, 1]).
size mismatch for discriminator.discriminators.4.convs.0.0.weight_v: copying a param with shape torch.Size([32, 2, 5, 1]) from checkpoint, the shape in current model is torch.Size([32, 1, 5, 1]).
size mismatch for discriminator.discriminators.5.band_convs.0.0.0.weight_v: copying a param with shape torch.Size([32, 4, 3, 9]) from checkpoint, the shape in current model is torch.Size([32, 2, 3, 9]).
size mismatch for discriminator.discriminators.5.band_convs.1.0.0.weight_v: copying a param with shape torch.Size([32, 4, 3, 9]) from checkpoint, the shape in current model is torch.Size([32, 2, 3, 9]).
size mismatch for discriminator.discriminators.5.band_convs.2.0.0.weight_v: copying a param with shape torch.Size([32, 4, 3, 9]) from checkpoint, the shape in current model is torch.Size([32, 2, 3, 9]).
size mismatch for discriminator.discriminators.5.band_convs.3.0.0.weight_v: copying a param with shape torch.Size([32, 4, 3, 9]) from checkpoint, the shape in current model is torch.Size([32, 2, 3, 9]).
size mismatch for discriminator.discriminators.5.band_convs.4.0.0.weight_v: copying a param with shape torch.Size([32, 4, 3, 9]) from checkpoint, the shape in current model is torch.Size([32, 2, 3, 9]).
size mismatch for discriminator.discriminators.6.band_convs.0.0.0.weight_v: copying a param with shape torch.Size([32, 4, 3, 9]) from checkpoint, the shape in current model is torch.Size([32, 2, 3, 9]).
size mismatch for discriminator.discriminators.6.band_convs.1.0.0.weight_v: copying a param with shape torch.Size([32, 4, 3, 9]) from checkpoint, the shape in current model is torch.Size([32, 2, 3, 9]).
size mismatch for discriminator.discriminators.6.band_convs.2.0.0.weight_v: copying a param with shape torch.Size([32, 4, 3, 9]) from checkpoint, the shape in current model is torch.Size([32, 2, 3, 9]).
size mismatch for discriminator.discriminators.6.band_convs.3.0.0.weight_v: copying a param with shape torch.Size([32, 4, 3, 9]) from checkpoint, the shape in current model is torch.Size([32, 2, 3, 9]).
size mismatch for discriminator.discriminators.6.band_convs.4.0.0.weight_v: copying a param with shape torch.Size([32, 4, 3, 9]) from checkpoint, the shape in current model is torch.Size([32, 2, 3, 9]).
size mismatch for discriminator.discriminators.7.band_convs.0.0.0.weight_v: copying a param with shape torch.Size([32, 4, 3, 9]) from checkpoint, the shape in current model is torch.Size([32, 2, 3, 9]).
size mismatch for discriminator.discriminators.7.band_convs.1.0.0.weight_v: copying a param with shape torch.Size([32, 4, 3, 9]) from checkpoint, the shape in current model is torch.Size([32, 2, 3, 9]).
size mismatch for discriminator.discriminators.7.band_convs.2.0.0.weight_v: copying a param with shape torch.Size([32, 4, 3, 9]) from checkpoint, the shape in current model is torch.Size([32, 2, 3, 9]).
size mismatch for discriminator.discriminators.7.band_convs.3.0.0.weight_v: copying a param with shape torch.Size([32, 4, 3, 9]) from checkpoint, the shape in current model is torch.Size([32, 2, 3, 9]).
size mismatch for discriminator.discriminators.7.band_convs.4.0.0.weight_v: copying a param with shape torch.Size([32, 4, 3, 9]) from checkpoint, the shape in current model is torch.Size([32, 2, 3, 9]).
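
For anyone reading a log like this, a small script can make the pattern easier to see. The sketch below uses only the Python standard library and parses the "size mismatch" lines exactly as they appear above. The helper names (`parse_mismatches`, `differing_ratios`) are made up for illustration and are not part of RAVE or PyTorch. In this log every differing dimension in the checkpoint is twice the one in the freshly built model, which would be consistent with a 2-channel checkpoint being loaded into a model built for 1 channel. That is only a reading of the numbers, not something confirmed in this thread.

```python
import re

# Matches one "size mismatch" entry from the load_state_dict error text above.
PATTERN = re.compile(
    r"size mismatch for (?P<name>\S+): copying a param with shape "
    r"torch\.Size\(\[(?P<ckpt>[\d, ]+)\]\) from checkpoint, "
    r"the shape in current model is torch\.Size\(\[(?P<model>[\d, ]+)\]\)"
)


def parse_mismatches(log_text):
    """Return (name, checkpoint_shape, model_shape) for every mismatch found."""
    results = []
    for match in PATTERN.finditer(log_text):
        ckpt = tuple(int(x) for x in match.group("ckpt").split(","))
        model = tuple(int(x) for x in match.group("model").split(","))
        results.append((match.group("name"), ckpt, model))
    return results


def differing_ratios(ckpt, model):
    """Ratios checkpoint/model, only for the dimensions that differ."""
    return [c / m for c, m in zip(ckpt, model) if c != m]


# Two lines copied from the log above, used as sample input.
SAMPLE = """
size mismatch for encoder.encoder.net.0.weight_v: copying a param with shape torch.Size([96, 32, 7]) from checkpoint, the shape in current model is torch.Size([96, 16, 7]).
size mismatch for discriminator.discriminators.0.convs.0.0.weight_v: copying a param with shape torch.Size([32, 2, 5, 1]) from checkpoint, the shape in current model is torch.Size([32, 1, 5, 1]).
"""

if __name__ == "__main__":
    for name, ckpt, model in parse_mismatches(SAMPLE):
        print(name, ckpt, model, differing_ratios(ckpt, model))
```

Pasting the full error text into `SAMPLE` (or reading it from a file) shows at a glance whether all mismatches share the same ratio.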

@Federico8691
Author

Off topic, Axel,
but does RAVE run on macOS on M2 machines?
Just curious.

@Federico8691
Author

Hi everyone,

Still nothing new. I followed all the steps Axel requested and I am still getting the same problem. I created a new directory and put in it an old batch of audio files that I used for training a while ago without problems.

Here is the outcome:

image

The funny thing is that RAVE creates these two files in the --output_path directory:

image

I tried with both 2.1.1 and 2.3.1.

I have no idea what is happening here.

Keep in mind that I wanted to present some of my research at March Forum 2024. (I will be in Paris).

So I am stuck.

Looking forward to your answer (I am working even if it is Christmas :-D)

@Federico8691
Author

Works fine in my configuration, both mac and linux. What gives ulimit -u on your computer?

I ran this check myself: running ulimit -u gives 256513.

What does it mean?
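
For context, `ulimit -u` prints the per-user limit on the number of processes (on Linux, threads count toward it as well), so 256513 is the maximum number this user may have running at once. The same limit can be read from Python with the standard-library `resource` module (Unix only). This is a minimal sketch, and the function names are just illustrative:

```python
import resource


def nproc_limits():
    """Return the (soft, hard) per-user process limit.

    The soft value is what `ulimit -u` prints by default in bash.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return soft, hard


def describe(value):
    """Render a limit the way the shell does: a number or 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    soft, hard = nproc_limits()
    print("soft limit:", describe(soft))
    print("hard limit:", describe(hard))
```

If the soft limit is very low, creating new threads can fail with an error like the "can't start new thread" one reported earlier in this thread. Whether that is the cause in a particular setup is not established here.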

@Federico8691
Author

Dear Axel,

I was able to fix the preprocessing step: I just updated the ffmpeg encoder.
That was all it took. Now I will try the training.

@Federico8691
Author

There was a missing development library called libsox-dev.
I installed the libsox-dev package and the training started.
Let's see how it goes.
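
Since both fixes came down to system packages (ffmpeg for preprocessing, libsox-dev for training), a quick pre-flight check may save time on a fresh machine. The sketch below uses only the Python standard library and the function name is hypothetical. Note that it only reports whether an `ffmpeg` executable is on the PATH and whether some libsox shared library is visible to the loader. It cannot tell whether the `-dev` package specifically is installed, nor whether the installed versions are the ones RAVE needs.

```python
import shutil
from ctypes.util import find_library


def check_audio_dependencies():
    """Map each dependency to where it was found, or None if it is missing."""
    return {
        # Path to the ffmpeg executable, or None if it is not on PATH.
        "ffmpeg": shutil.which("ffmpeg"),
        # Name of a libsox shared library known to the loader, or None.
        "libsox": find_library("sox"),
    }


if __name__ == "__main__":
    for name, location in check_audio_dependencies().items():
        status = location if location else "NOT FOUND"
        print(f"{name}: {status}")
```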
