Results in real-time videos #24

Open
ahsan3803 opened this issue Dec 11, 2020 · 12 comments
Comments

@ahsan3803

I tested some real-time videos from YouTube and noticed that the results are not as good as your paper shows. I compared MotioNet results with VideoPose after converting the VideoPose output to a BVH file.

You compared your approach with other approaches (here), and it clearly seems that MotioNet performs well compared to the others, but when I tested it on different videos I got shocking results. Check the attached GIF files for sample results.

part_1

part_2

part_3

part_4

Ignore the viewing angles of the results; I just set them to get a better view. You can still judge the quality from the current view of each GIF file.

MotioNet ensures a geometric skeleton and rotational parameters, which are the key things for driving any 3D character. VideoPose doesn't use either of these parameters, but its results are still more stable and better compared to MotioNet's.

So did you compare after applying some filters?

If you want to test these videos yourself, I can send them to you.

Note: Both approaches use pre-trained weights without any filtering or smoothing.

@Shimingyi
Owner

Hi @ahsan3803 ,

Thanks for your work! It clearly shows the difference between our network's output and others'. I'd like to explain a bit more, following your demonstration.

There are several reasons:

  1. VideoPose predicts 1 output frame from 243 input frames, but our network is fully-output, meaning 243 inputs generate 243 outputs. At test time we also don't apply any shifting to create overlapping input sequences (see the sketch after this list).
    You can imagine it this way: in their method, the result for the 121st frame is generated from frames 0-242, and the result for the 122nd frame from frames 1-243. That is essentially a strong temporal consistency at the input-data level compared to ours.

  2. We didn't use the same architecture in our network, because the main idea of this paper is using FK inside the network so we can directly generate rotations with a consistent skeleton. The architecture we used is simple, but it still gets comparable errors on the H3.6M dataset: 52mm (ours) vs. 48mm (VideoPose). I would recommend extending this FK idea with a stronger network architecture to reach better performance if it needs to be used in a real application.

  3. In our demonstration I didn't apply any filter, but I did apply different scaling factors to the global root position (it has to be defined manually in our method). So the improvement you see is that we can recover a global root position, which makes the result look nicer. Compared to VideoPose, our method can also produce expressive motion when given solid 2D input; your 2nd video also shows that our leg reaches a better height. But yes, sometimes it's not smooth. I am considering whether to release some basic filters for the output.

  4. But for the 3rd video and a part of the 4th, there may be something I overlooked. I will check it :)
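
To make the windowing difference in point 1 concrete, here is a minimal sketch of the two inference styles. The `model` callables are placeholders, not the actual VideoPose or MotioNet APIs; the sketch only illustrates why a 243-frame sliding window gives each output frame an almost identical input.

```python
import numpy as np

def sliding_window_infer(poses_2d, model, receptive_field=243):
    """VideoPose-style: each output frame is predicted from its own
    243-frame window, so consecutive windows overlap by 242 frames."""
    half = receptive_field // 2
    # poses_2d: (T, num_joints, 2); pad the ends so every frame has a full window
    padded = np.pad(poses_2d, ((half, half), (0, 0), (0, 0)), mode='edge')
    outputs = []
    for t in range(len(poses_2d)):
        window = padded[t:t + receptive_field]   # frames t-121 .. t+121 of the clip
        outputs.append(model(window))            # model returns only the centre frame
    return np.stack(outputs)

def full_sequence_infer(poses_2d, model):
    """MotioNet-style fully-output inference: N input frames are mapped to
    N output frames in one pass, with no overlapping windows at test time."""
    return model(poses_2d)
```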

And I have another question for you: after getting the positional output of VideoPose, how do you convert it to a BVH file? Using some IK solver?

Feel free to let me know your ideas. Thanks!

@ahsan3803
Author

ahsan3803 commented Dec 11, 2020

@Shimingyi Thanks for the explanation.
Regarding skeleton consistency I have another question: how do you calculate the absolute value of the bone lengths, since we can't get it even from the dataset GT?

For the 3rd and 4th videos I just set a better view when creating the videos, but I think you can still judge them from the current view.

After getting the skeleton information, I convert the pose data to joint angles, using Euler angles as the representation.
I read your paper and checked the code: you are using a quaternion representation with zyx order, whereas I follow the xzy order.

Quaternions are best practice, especially at the machine level, but I can still get good results using Euler angles. Maybe in the future I will switch to quaternions.
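
For reference, converting between the two representations is straightforward with SciPy; a minimal sketch (the angle values below are just placeholders):

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

# Per-joint Euler angles in degrees, xzy order (placeholder values)
euler_xzy = np.array([[10.0,  5.0, -3.0],
                      [ 0.0, 90.0,  0.0]])

# Convert to quaternions (x, y, z, w) ...
quats = R.from_euler('xzy', euler_xzy, degrees=True).as_quat()

# ... and back to the zyx Euler convention used in the MotioNet code
euler_zyx = R.from_quat(quats).as_euler('zyx', degrees=True)
```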

I also have a question about foot contact. I can see that the code uses a foot contact loss, but how do we get the foot contact predictions? I mean, the same way as the joint rotations, bones, etc. from forward_fk? Or do you only use it within the network to enhance the final results? And is there any foot contact GT in the dataset?

@Shimingyi
Owner

@ahsan3803
We don't predict the absolute value of the bone lengths; we just predict the proportions of the body skeleton. In our method, the GT bone lengths are scaled by their average length, so our network also generates this kind of 'proportion'. You can find it in this code.
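
A minimal sketch of that normalization (the parent-index convention below is illustrative, not the repo's actual skeleton definition):

```python
import numpy as np

def bone_proportions(joints_3d, parents):
    """Compute per-bone lengths from 3D joint positions and scale them by
    their mean, so only the proportions matter, not the absolute lengths."""
    # joints_3d: (num_joints, 3); parents[j] is the parent index of joint j (-1 for the root)
    lengths = np.array([np.linalg.norm(joints_3d[j] - joints_3d[p])
                        for j, p in enumerate(parents) if p >= 0])
    return lengths / lengths.mean()
```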

Euler angles are OK if you don't use any learning-based optimization method. But I am also cautious about your transformation from positions to rotations. Do you use a ground-truth skeleton, or do something like averaging, or rigging? And how do you convert the positions to rotations?

Regarding foot contact, we have two losses here. First, we extract the foot contact GT from the dataset and then predict it as part of the E_q output, so we can apply a reconstruction loss on the foot contact. In addition, we constrain the velocity to be 0 when the foot is predicted as 'in contact'. Related code: code.
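
A minimal PyTorch sketch of those two terms; the tensor names and shapes are assumptions for illustration, not the repo's exact loss code:

```python
import torch
import torch.nn.functional as F

def foot_contact_losses(pred_contact, gt_contact, foot_pos):
    # pred_contact, gt_contact: (T, num_feet) contact probabilities / labels in [0, 1]
    # foot_pos: (T, num_feet, 3) foot joint positions produced by FK

    # 1) reconstruction loss on the predicted contact labels
    contact_loss = F.binary_cross_entropy(pred_contact, gt_contact)

    # 2) velocity loss: foot velocity should be ~0 wherever contact is predicted
    velocity = foot_pos[1:] - foot_pos[:-1]            # (T-1, num_feet, 3)
    weight = pred_contact[1:].unsqueeze(-1)            # gate the penalty by contact
    velocity_loss = (weight * velocity.pow(2)).mean()

    return contact_loss, velocity_loss
```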

@JinchengWang

This might be only tangentially related, but I noticed in the future work section of the paper you said

... there is no notion of contact constraints in the system, thus the resulting motion may suffer from foot sliding. This could potentially be solved by predicting the foot contact conditions and using a loss function term that evaluates the positional constraints as done by Lee et al. [2018b].

I feel like that's similar to what you are doing in the paper?

@Shimingyi
Owner

Hi @JinchengWang, the ideas are similar, but they are not the same.

You can imagine what causes foot sliding when you look at a foot's motion. One important factor is the foot moving: we constrain the velocity of the foot to zero so it is more stable. But another important factor is the height between the ground and the foot joint; only when we can guarantee both velocity = 0 and height = 0 can we say it's stable.

But in our system we don't reconstruct the environment, so the network doesn't account for any physical constraints with the ground. At this SIGGRAPH Asia there is a paper called PhysCap that handles this well; it's worth a read.

@JinchengWang

@Shimingyi I looked at the paper "Interactive Character Animation by Learning Multi-Objective Control" by Lee et al. It seems they addressed this problem by using a foot contact loss, which is almost identical to the L_FC term in your paper, instead of by "enforcing" this constraint, as in PhysCap?

While it is true that there is no visible foot sliding in their end result, the way you wrote it seems to imply they are using a fundamentally different method. (Which, I think, is not the case, assuming I understand both papers correctly.)

BTW, I'm a great fan of the PhysCap paper as well! In hindsight, there is a glaring lack of cooperation between the pose estimation and humanoid robot control communities...

@ahsan3803
Author

ahsan3803 commented Jan 5, 2021

Hi @Shimingyi

For BVH conversion I use this tool for VideoPose.

Some other members are also asking about foot contacts, and in your work foot contacts and the global root position are key factors for any motion capture system.
I checked the foot contacts from MotioNet and it seems that the foot contact values vary too much, even within the same action. To discuss this with you, I arranged some frames and their corresponding foot contact values. Kindly check the attached images with the frame numbers.

Just take these 2D frames as examples.
The image below shows the 2D detections from OpenPose, to check the feet.
combine

The image below shows the foot contact labels for the example above.
combine_value

We can see that the foot contact values vary. Furthermore, I want to ask what the criterion is for setting foot contacts. I understand that if the foot is less than 20mm above the ground (as mentioned in paper section 3.3), then the foot is considered to be in contact with the ground. What if the foot is more than 20mm above the ground (it may be 50mm, 100mm, even more than 200mm, etc.)? In that situation, are the foot contact values from MotioNet always constant, or do they change according to the height above the ground? In other words, what are the highest and lowest values when the foot is or is not in contact? I'm asking because in the results above the values vary too much, and it seems that MotioNet is not using binary flags for foot contacts.

My other question is about the global root position: in the paper you use it as a loss, and there is a method for it in the Animation file.

So when generating the BVH here, you didn't make use of the foot contacts or the global position. Will this functionality be supported in the next commit, or is there no plan to release it?

Looking forward to your response.

@Shimingyi
Owner

Hi @ahsan3803 ,

Thanks so much for your useful feedback! These experiments can help others understand the system better. Let me explain more in this thread:

  1. The reason for the variation in the test video: the input of this network is a rooted 2D pose, where 'rooted' means the root joint of every input 2D pose is placed at (0, 0). For this test video, that means the network is not aware that the girl is jumping; it just sees a sequence of a centered girl stretching her legs, so an ambiguity arises when predicting the foot contact label in this case. We use this representation in our system because in the H3.6M dataset almost all motions are standard walking, running and sitting, so there are not many conflicts. I recommend observing the foot contact label values on the H3.6M test data: they won't be binary numbers, but they will be close to 1 or 0 because of our loss function here.

  2. We want these two losses to describe the relationship between motion and contact. When you are putting your right foot down, we expect the following contact signal to be 'in contact', and then the network generates a local motion in which the velocity of the right foot has been constrained. That is the motivation for these two losses.

  3. We use a relative value to define the threshold for the contact label. You can find it in this code. We don't assume the ground height is always 0; we select the lowest 1/5 of the foot heights to estimate a ground height and then apply the threshold (20mm) relative to this estimated ground (see the sketch after this list).

  4. For now we have only released the code for using the foot contact label as a loss in the network; we don't provide any post-processing tools. But actually, once you have the contact label, you can run IK to optimize the motion so it is more stable. I am considering whether to open-source that part, because it's very engineering-heavy, which is not my intention for this repo... But I can do it after finishing the other smoothing updates and bug fixes. Before that, you can follow deep-motion-editing to understand it better :)
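
As referenced in point 3, a minimal sketch of that relative thresholding (function and variable names are illustrative, not the repo's actual code):

```python
import numpy as np

def contact_labels(foot_heights, threshold=0.02):
    """Estimate the ground height from the lowest 1/5 of the observed foot
    heights, then label frames within `threshold` (20mm) of that estimated
    ground as 'in contact'."""
    # foot_heights: (T,) vertical position of one foot joint over time, in metres
    lowest = np.sort(foot_heights)[: max(1, len(foot_heights) // 5)]
    ground = lowest.mean()
    return (foot_heights < ground + threshold).astype(np.float32)
```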

Best,
Mingyi

@ahsan3803
Author

@Shimingyi Thanks so much for the explanation.

  1. I got your point that the H3.6M dataset has no such complex motions or actions. All the actions like walking, smoking, discussion, taking photos, running and sitting, etc. (up to 17 scenarios) are not as complex as in wild videos. The most important point I just noted about the H3.6M dataset is that in these actions both feet never leave the ground at the same time (as in a jump), which we can call a weak point of relying on the H3.6M dataset and then using the model on real-world videos. Is it right that with the current pre-trained weights MotioNet cannot guarantee, for jump-like videos, that both feet will leave the ground? What if we add more data from real videos to train it?

  2. Here one question arises about velocity: the velocity will be zero, as in the paper (section 3.3), if the foot is in contact, but sometimes the foot is in contact while the velocity at ground level is not zero. I mean the foot moves continuously along the ground, like sliding, so the height is zero but the velocity is not. How would MotioNet deal with that? I'm asking in case we add some real-world datasets to train MotioNet. Yes, you already mentioned PhysCap, which can handle this better, but it's not open source.

  3. I got this point.

  4. Yeah, it's very engineering-heavy, and as you know, MotioNet (open source) is currently the only model that can provide these two important functionalities for motion capture. So we should wait for this wonderful step 😜. We are waiting for the next big commit.

@Shimingyi
Owner

  1. Yes, right now the pre-trained model cannot handle motions like jumping. A reliable reconstructed motion should consider two things: local joint motion and global movement, but current models all focus on the first one. If you need to improve the performance in these cases, I would suggest modelling the relationship between the ground and the foot in a better way, rather than using a contact model, which is too weak. We can discuss how to design it further if you are interested.

  2. MotioNet cannot handle sliding on the ground, because the motivation of the velocity loss is to avoid sliding. In the animation field people don't like it, so we designed this consistency in a learning manner. In some ways we simplify the problem, but we hope it can bring inspiration.

I will keep working in this field, so feel free to contact me if you need help or have any ideas to share!

@rutuja1409

So did you compare after applying some filters?

Hello @ahsan3803
I am having issues running the GitHub code. Can you let me know the steps, since you have successfully been able to run it?
Thank you.

@UsamaShaikh1

After getting the skeleton information, I convert the pose data to joint angles, using Euler angles as the representation.

Hi @ahsan3803, can you please guide me on how you calculated the joint angles from the pose data? I am lost there. I would really appreciate it.
