Training a model with multiple GPUs #18

Open
zasr99 opened this issue Mar 22, 2024 · 9 comments

Comments

@zasr99

zasr99 commented Mar 22, 2024

Hello @ameraner,
After setting n_gpus=2 for model training, I hit the error 'len(layer_names) != len(filtered_layers)' when running prediction. The layer_names in the .hdf5 weight file contain only 8 layers, which differs significantly from the model structure. Do you know where the problem lies, or do I need to change parameters other than n_gpus when training on multiple GPUs?
Screenshot 2024-03-22 193406
Screenshot 2024-03-22 193447

@ameraner
Owner

ameraner commented Apr 2, 2024

Hi, I'm sorry, I've never seen this issue. Could it be that the multi-GPU handling, or the file opening, has changed in Keras in the meantime? What version of Keras are you using?
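
For reference when answering the version question, here is a minimal sketch that reports the installed TensorFlow and Keras versions. The helper name `report_versions` is hypothetical (it is not part of dsen2-cr), and the sketch only assumes that both packages expose a `__version__` attribute.

```python
import importlib


def report_versions(packages=("tensorflow", "keras")):
    """Return {package: version string, or None if it cannot be imported}."""
    versions = {}
    for name in packages:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None  # package is not installed in this environment
            continue
        # tensorflow and keras both expose __version__; use None if it is missing.
        versions[name] = getattr(module, "__version__", None)
    return versions
```

Calling `print(report_versions())` inside the training environment shows the exact pair of versions in use, which is the pair worth quoting in a report like this one.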

@zasr99
Author

zasr99 commented Apr 2, 2024

> Hi, I'm sorry, I've never seen this issue. Could it be that the multi-GPU handling, or the file opening, has changed in Keras in the meantime? What version of Keras are you using?

Thank you for your reply! I am using tensorflow-gpu=1.15 and Keras=2.3.1. With tensorflow-gpu=1.15 and Keras=2.2.4, multi-GPU training does not run at all. With Keras=2.3.1, multi-GPU training runs, but the problem I raised appears again.

Screenshot 2024-04-02 175504
As shown in the screenshot above, with tensorflow-gpu=1.15 and Keras=2.2.4, this error message appears when n_gpus>=2.

@ameraner
Owner

ameraner commented Apr 2, 2024

And from which line of the dsen2-cr code does this originate?

@zasr99
Author

zasr99 commented Apr 2, 2024

> And from which line of the dsen2-cr code does this originate?

Screenshot 2024-04-02 183114
This is the complete error message when using tensorflow-gpu=1.15 and Keras=2.2.4 for multi-GPU training.
After looking into it, this seems to be caused by a version mismatch between tensorflow-gpu and Keras.
So I switched to tensorflow-gpu=1.15 and Keras=2.3.1. With that combination, training completes normally, but using the trained model for prediction produces the error I reported at the beginning.

@ameraner
Owner

ameraner commented Apr 2, 2024

Googling the error, I found this issue: tensorflow/tensorflow#30728. Maybe you can try downgrading tensorflow and tensorflow-gpu to 1.13.1?

@zasr99
Author

zasr99 commented Apr 2, 2024

> Googling the error, I found this issue: tensorflow/tensorflow#30728. Maybe you can try downgrading tensorflow and tensorflow-gpu to 1.13.1?

I also tried tensorflow-gpu=1.13.1 with Keras=2.2.4, but the result is the same. I also tried running prediction on the same multiple GPUs used for training, but a different error message appeared.
Screenshot 2024-04-02 190247

(The problem I raised at the beginning occurs when predicting on a single GPU.)
In short, in all my attempts a model trained on multiple GPUs fails during prediction whether I use a single GPU or multiple GPUs, but the errors differ between the two cases.

@zasr99
Author

zasr99 commented Apr 2, 2024

> Googling the error, I found this issue: tensorflow/tensorflow#30728. Maybe you can try downgrading tensorflow and tensorflow-gpu to 1.13.1?

From my point of view, the problem lies in the model weights saved from multi-GPU training. The file contains only 8 layer_names: ['input_1', 'input_2', 'lambda_18', 'lambda_19', 'lambda_20', 'lambda_21', 'model_1', 'lambda_17'].
Screenshot 2024-04-02 191311

Normally there should also be layer_names for layers such as Conv2D in between.
Screenshot 2024-04-02 191752
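
The 8 names above can be checked programmatically. The sketch below is a heuristic, not a confirmed diagnosis: it assumes the training code wraps the network with `keras.utils.multi_gpu_model`, whose wrapper typically holds only input layers, Lambda slices, and one nested model such as `model_1`. The function names `looks_like_multi_gpu_wrapper` and `read_layer_names` are hypothetical helpers, not dsen2-cr or Keras API.

```python
import re

# A nested model usually appears at the top level under a name like "model_1".
_NESTED_MODEL = re.compile(r"^model_\d+$")


def looks_like_multi_gpu_wrapper(layer_names):
    """Heuristic: True if the top level holds one nested model and otherwise
    only input_* / lambda_* layers (no Conv2D or other ordinary layers)."""
    names = list(layer_names)
    nested = [n for n in names if _NESTED_MODEL.match(n)]
    wrappers = [n for n in names if n.startswith(("input_", "lambda_"))]
    return len(nested) == 1 and len(nested) + len(wrappers) == len(names)


def read_layer_names(weights_path):
    """Read the top-level layer names from a Keras HDF5 file (requires h5py)."""
    import h5py  # imported lazily so the heuristic above works without it

    with h5py.File(weights_path, "r") as f:
        # Full-model files keep weights under "model_weights";
        # weights-only files store them at the root.
        group = f["model_weights"] if "model_weights" in f else f
        return [n.decode("utf8") if isinstance(n, bytes) else n
                for n in group.attrs["layer_names"]]


# The layer names reported in this thread.
reported = ["input_1", "input_2", "lambda_18", "lambda_19",
            "lambda_20", "lambda_21", "model_1", "lambda_17"]
print(looks_like_multi_gpu_wrapper(reported))  # True
```

If the check returns True for a weight file, the Conv2D layers are most likely stored inside the nested `model_1` group rather than missing, which would explain why a single-GPU model with a flat layer list cannot match them by name.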

@ameraner
Owner

ameraner commented Apr 2, 2024

I'm sorry, unfortunately I don't know how to help further with this, as I no longer have the capacity to debug it. The issue with the layer names is indeed odd, particularly since model saving and loading are handled by native Keras functions...

@zasr99
Author

zasr99 commented Apr 2, 2024

> I'm sorry, unfortunately I don't know how to help further with this, as I no longer have the capacity to debug it. The issue with the layer names is indeed odd, particularly since model saving and loading are handled by native Keras functions...

It's okay. I'll look into it. Thank you very much for your patience!
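
One direction worth looking into, sketched under assumptions: if the checkpoint was written from the model returned by `keras.utils.multi_gpu_model`, the real network sits inside that wrapper as a nested model. The Keras documentation for `multi_gpu_model` recommends saving through the template model (the one passed to `multi_gpu_model`) rather than the returned wrapper. The helper `unwrap_template` below is hypothetical and uses duck typing only; the `FakeLayer` and `FakeModel` classes are stand-ins so the sketch runs without Keras.

```python
def unwrap_template(model):
    """Return the single nested model inside a multi-GPU wrapper, or `model` itself.

    Duck typing only: a nested Keras model is a layer that has its own
    `layers` attribute, while Input and Lambda layers do not.
    """
    nested = [layer for layer in model.layers if hasattr(layer, "layers")]
    if not nested:
        return model  # ordinary single-device model, nothing to unwrap
    if len(nested) > 1:
        raise ValueError("expected one nested model, found %d" % len(nested))
    return nested[0]


# Stand-in classes for illustration only; with Keras these would be real objects.
class FakeLayer:
    def __init__(self, name):
        self.name = name


class FakeModel(FakeLayer):
    def __init__(self, name, layers):
        super().__init__(name)
        self.layers = layers


template = FakeModel("model_1", [FakeLayer("conv2d_1"), FakeLayer("conv2d_2")])
parallel = FakeModel("model_2", [FakeLayer("input_1"), FakeLayer("lambda_18"),
                                 template, FakeLayer("lambda_17")])
print(unwrap_template(parallel).name)  # model_1
print(unwrap_template(template).name)  # model_1
```

With real Keras objects, the intended flow would be to rebuild the multi-GPU model exactly as in training, load the existing weights into it, call `unwrap_template` on it, and then save the weights of the returned model for single-GPU prediction. Whether this resolves the error in dsen2-cr has not been verified in this thread.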
