
Training own MDM takes too long #179

Open
CMY-CTO opened this issue Dec 6, 2023 · 2 comments

CMY-CTO commented Dec 6, 2023

Hi,

Regarding the claim in the paper of 'requiring only about three days of training on a single mid-range GPU', I have run into some issues and hope you can help me resolve them:

First of all, I didn't make any changes to the model or the weights, and I trained on my university's server (A100 GPU) following the steps and commands in the README.
[Screenshot: 2023-12-06 12:15:28]

The problem is that training takes too long. Specifically, Action2Motion takes several minutes per epoch on each of its two datasets, while Text2Motion and Unconstrained take about twenty seconds per epoch. In other words, I would need at least one to two weeks to train any one of the MDM variants.
BTW, the loss looks normal; screenshots of the training records are attached below.
[Screenshot: Action2Motion training record, 2023-12-06 12:28:27]
[Screenshot: Unconstrained training record, 2023-12-06 12:29:29]

I also attached my Hardware Configuration screenshots below.
[Screenshot: hardware configuration, 2023-12-06 12:23:40]

Since I didn't change any resource files, my only guess is that the problem lies in the default arg.json.
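
A minimal way to also rule out a silent CPU fallback (this is just a hypothetical diagnostic sketch, not part of the MDM code; the matrix size is arbitrary) is to confirm that PyTorch sees the A100 and actually executes work on it:

```python
# Hypothetical sanity check (not from the MDM repo): confirm PyTorch sees the
# GPU and that a small workload actually runs on it.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    x = torch.randn(2048, 2048, device="cuda")
    y = x @ x                      # quick matmul to exercise the GPU
    torch.cuda.synchronize()       # wait for the kernel to finish
    print("Result device:", y.device)
```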

Looking forward to your early reply~
Thank you!

GuyTevet (Owner) commented Dec 7, 2023

Something seems odd here. We tested the code on an NVIDIA GeForce RTX 2080 Ti, which is significantly weaker on paper than your A100, yet it used about 5GB of memory and ran at about 6.5 iterations/sec.
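
For a like-for-like comparison, a small timing wrapper along these lines could report iterations/sec and peak GPU memory (a sketch only: `step_fn` is a hypothetical stand-in for one training iteration and is not part of the released code):

```python
# Sketch of a throughput probe (assumes `step_fn` runs one training iteration;
# not part of the released MDM code). Reports it/s and peak GPU memory so the
# numbers can be compared against the ~6.5 it/s and ~5GB seen on the RTX 2080 Ti.
import time
import torch

def measure_throughput(step_fn, n_iters=100):
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_iters):
        step_fn()
    torch.cuda.synchronize()
    elapsed = time.time() - start
    print(f"{n_iters / elapsed:.2f} iterations/sec")
    print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```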

CMY-CTO (Author) commented Dec 19, 2023

Hi,

Thank you for the information!

BTW, could the main factor affecting the MDM training speed be the sharing policy of the university server?

[Screenshots: 2023-12-19 11:15:38 and 2023-12-19 11:15:53]

As the attached screenshots show, when I (i.e., PID=2904280) started training the MDM, the `Power Usage` increased by `71W` and the `GPU Memory Usage` increased by `12701MiB`. Does that look normal?

And the speed is still not as fast as expected: action2motion_humanact12 takes about 20 seconds per epoch, action2motion_uestc about 3 minutes per epoch, and unconstrained_humanact12 about 20 seconds per epoch. To be honest, this is a bit puzzling to me.
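
On a shared server, one common cause of this pattern is CPU or disk contention starving the GPU during data loading. A rough way to check (a sketch only; `dataloader` and `step_fn` are hypothetical stand-ins for the actual training-loop pieces) is to time the DataLoader waits separately from the GPU compute:

```python
# Rough check for a data-loading bottleneck (hypothetical sketch; `dataloader`
# and `step_fn` stand in for the real training-loop pieces). If data_time
# dominates, the shared server's CPU/disk is the limit, not the A100.
import time
import torch

def profile_one_epoch(dataloader, step_fn):
    data_time = compute_time = 0.0
    end = time.time()
    for batch in dataloader:
        data_time += time.time() - end       # time spent waiting on the DataLoader
        torch.cuda.synchronize()
        t0 = time.time()
        step_fn(batch)                       # one forward/backward step on the GPU
        torch.cuda.synchronize()
        compute_time += time.time() - t0
        end = time.time()
    print(f"data loading: {data_time:.1f}s | GPU compute: {compute_time:.1f}s")
```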

Looking forward to your early reply~
Thank you!
