[request] Depth estimation documentation, training code and / or model weights #54

Open
patricklabatut opened this issue Apr 24, 2023 · 37 comments
Assignees
Labels
documentation Improvements or additions to documentation enhancement New feature or request

Comments

@patricklabatut
Contributor

patricklabatut commented Apr 24, 2023

Related issues:

@patricklabatut patricklabatut added documentation Improvements or additions to documentation enhancement New feature or request labels Apr 24, 2023
@patricklabatut patricklabatut self-assigned this Apr 24, 2023
@patricklabatut patricklabatut changed the title [request] Depth estimation training code and / or model weights [request] Depth estimation documentation, training code and / or model weights Apr 24, 2023
@kfzyqin

kfzyqin commented Apr 27, 2023

Can you give an expected timeframe on when depth estimation will be available?

@tnarek

tnarek commented Apr 27, 2023

I'd also be interested to hear.

@yuvfried

yuvfried commented Apr 30, 2023

I'd also be happy if you could share the semantic segmentation heads. The one that produces the results on the web demo.
Thx!

@hblanken

hblanken commented May 4, 2023

Would be excellent to obtain depth estimation output per image. Supportive of this enhancement!

@mirlansmind

segmentation head similar to the demo please

@patricklabatut patricklabatut pinned this issue May 5, 2023
@stofe95

stofe95 commented May 7, 2023

Also interested in acquiring depth info per image, really cool!

@jonathan-besuchet

Also very interested to have the depth estimation head model documentation (and model/weights if possible).

@shahabe

shahabe commented May 16, 2023

@patricklabatut Thank you so much for the main code.
Would you please update us on the timeline for delivering the depth-estimation code as well?
Please let us know if any help is needed.

@wuzihaoo

Could you please release the segmentation part?

@woctezuma

Could you please release the segmentation part?

@ttppss

ttppss commented May 24, 2023

Very interested and waiting for your release!

@imbinwang

Cool!

@bloodhunt3r

very interested in releasing the depth estimation head

@kootsZhin

Interested in depth estimation head as well (or any documentation on how to reproduce the results using provided models)

@ray8828

ray8828 commented May 29, 2023

Interested in the depth part also!

@JuliusJacobsohn

@patricklabatut could you maybe shed some light on the decision to not release the depth estimation parts immediately?
I'm not much into deep learning research, but if you trained and tested it, is it a lot of effort to just publish it? Or am I too naive?

@Ale-Burzio

@patricklabatut amazing work! any approximate timeline on if/when a trained depth estimation head could be released?

@leesunfreshing

I would love to learn the news about the depth

@kanishkanarch

I would also appreciate an example code for depth estimation. Can't do much with the model's output embeddings yet.
Thanks!

@Cindy0725

Very interested in the depth estimation code! I tried to add a linear head, but I don't know how to convert the (batch_size, num_of_tokens, feature_dim) tensor to (batch_size, 256, image_width, image_height) to get the paper's result on SUNRGBD.
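
One way the reshape asked about here could look is sketched below. This is only an illustration, not the authors' code: it assumes the patch tokens are in row-major order over the patch grid, uses a 1x1 convolution as the linear head, and upsamples the logits bilinearly to the image size. The function name `tokens_to_bin_logits` is hypothetical, and random tensors stand in for real DINOv2 tokens.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def tokens_to_bin_logits(tokens, image_hw, head, patch_size=14):
    """Map (B, N, C) patch tokens to (B, num_bins, H, W) per-pixel logits."""
    img_h, img_w = image_hw
    grid_h, grid_w = img_h // patch_size, img_w // patch_size
    b, n, c = tokens.shape
    assert n == grid_h * grid_w, "token count must match the patch grid"
    # (B, N, C) -> (B, C, grid_h, grid_w); assumes row-major token order.
    feats = tokens.transpose(1, 2).reshape(b, c, grid_h, grid_w)
    # A 1x1 convolution is a linear layer applied independently at each location.
    logits = head(feats)
    # Bring the coarse logits back to the full image resolution.
    return F.interpolate(logits, size=(img_h, img_w), mode="bilinear", align_corners=False)


# Random stand-in for ViT-B/14 patch tokens of a 56x56 image (4x4 patch grid).
tokens = torch.randn(1, 4 * 4, 768)
head = nn.Conv2d(768, 256, kernel_size=1)  # 256 depth bins
logits = tokens_to_bin_logits(tokens, (56, 56), head)
```

The output is laid out as (batch, bins, height, width), which is the order PyTorch's dense losses expect.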

@fumin

fumin commented Jun 13, 2023

Would appreciate greatly if your pre-trained depth estimator/optical flow model is released!
Can't wait to try it on my videos!

@patricklabatut
Contributor Author

Would appreciate greatly if your pre-trained depth estimator/optical flow model is released!

Thanks for your interest. Please note that we don't have an optical flow model (although one could leverage the provided backbones to train a matching head for this task).

@hblanken

It would be awesome if someone trained a depth estimation head on top of the provided backbone (dinov2_vitl14_pretrain.pth). Any thoughts on who/how, and an ETA?

@sfchen94

I would also like to request an estimated release date for the depth estimation pre-train head. Thank you.

@Jimlee079

Jimlee079 commented Jun 26, 2023

Two questions about the "DPT decoder" mentioned in the depth estimation part of 7.3 Dense Recognition Tasks. I searched the DPT source code: does the "DPT decoder" refer to its refinenet? If yes, I'm curious why you chose this decoder. Thank you!

@dariocazzani

@patricklabatut - any updates on the depth estimation code?
I am having a hard time reproducing with the same quality you show in the paper

@emojilearning

@patricklabatut - any updates on the depth estimation code? I am having a hard time reproducing with the same quality you show in the paper

Hah, I adapted the DPT head from its official repo to DINOv2. The accuracy is obviously lower than that in the paper.

@Cindy0725

@patricklabatut - any updates on the depth estimation code? I am having a hard time reproducing with the same quality you show in the paper

Hah, I adapted the DPT head from its official repo to DINOv2. The accuracy is obviously lower than that in the paper.

Hi, what RMSE did you get for depth estimation with the DPT decoder? Was it on NYUv2 or SUNRGBD? I am really interested in the results. Thank you very much! @emojilearning

@mbanani

mbanani commented Jul 2, 2023

Hi @patricklabatut, thanks for releasing the code and starting this issue to track progress on depth estimation.

I have tried to re-implement this but have not been successful (I was unable to achieve an RMSE below 0.52 for ViT-B/14). My re-implementation is based on the part of Sec. 7.4 quoted below. Many details are missing, which I filled in myself, but I cannot seem to reach the reported performance. I hope this helps others who also seem to be struggling to reproduce this number, and perhaps makes it easy for the authors to highlight the key difference that would help us reproduce the depth probe.

I am basing my experiments on this part, which describes the simplest setup, lin. 1, for ViT-B/14; it requires training a single linear layer on top of the frozen final layer's output:

lin. 1: we extract the last layer of the frozen transformer and concatenate the [CLS] token to each patch token. Then we bi-linearly upsample the tokens by a factor of 4 to increase the resolution. Finally we train a simple linear layer using a classification loss by dividing the depth prediction range in 256 uniformly distributed bins and use a linear normalization following Bhat et al. (2021).

Below I detail my attempt based on the details provided in the paper:

Image extraction: I simply assumed that you were training at a resolution similar to NYU's (480x640). I went down to 462x616, as both dimensions are multiples of 14 and the aspect ratio is preserved. Depending on the setup, we might have augmentations or not. When extracting dense features once and training a layer on them, there might be no augmentations. Alternatively, we can keep the backbone frozen and train with image augmentations. I tried both. For augmentations, I used ColorJitter, RandomResizedCrop, RandomRotation (<= 10 degrees), and RandomHorizontalFlip. With the exception of jitter, those augmentations were applied to both images and depth.
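
The 462x616 choice can be reproduced with a small helper. This is a sketch that assumes the goal is the largest size that keeps the exact aspect ratio and has both sides divisible by the patch size; the name `largest_patch_aligned_size` is made up for illustration.

```python
from math import gcd


def largest_patch_aligned_size(height, width, patch_size=14):
    """Largest (H, W) <= (height, width) with the same aspect ratio and
    both sides divisible by patch_size."""
    g = gcd(height, width)
    ratio_h, ratio_w = height // g, width // g  # reduced ratio, 3:4 for 480x640
    k = min(height // (ratio_h * patch_size), width // (ratio_w * patch_size))
    if k == 0:
        raise ValueError("no patch-aligned size keeps this exact aspect ratio")
    return k * ratio_h * patch_size, k * ratio_w * patch_size


size = largest_patch_aligned_size(480, 640)
grid = (size[0] // 14, size[1] // 14)
print(size, grid)  # (462, 616) (33, 44)
```

For aspect ratios that do not reduce to small integers, no exact solution may exist below the original size, in which case a slight resize that relaxes the ratio is needed instead.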

Feature extraction: The output tokens form a grid that is 14x smaller than the full image. You can get the patch tokens and the CLS token from the output of DINOv2 and then reshape them into the correct shape, as shown below. This results in an output of batch x 1536 x 33 x 44:

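import torch
import einops as E

# `image` is assumed to be a normalized float tensor of shape (batch, 3, H, W)
# on the GPU, with H and W multiples of 14 (e.g. 462x616).
vit = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").cuda()
ret = vit.forward_features(image)

patch_tok = ret["x_norm_patchtokens"]  # (batch, h * w, 768)
cls_tok = ret["x_norm_clstoken"]       # (batch, 768)

_, _, img_h, img_w = image.shape
patch_h, patch_w = img_h // 14, img_w // 14  # integer grid size, e.g. 33 x 44

# Reshape the patch tokens into a spatial grid.
patch_tok = E.rearrange(patch_tok, "b (h w) c -> b c h w", h=patch_h, w=patch_w)
# Broadcast the CLS token to every grid location and concatenate along channels.
cls_tok = cls_tok[:, :, None, None].repeat(1, 1, patch_h, patch_w)
output = torch.cat((patch_tok, cls_tok), dim=1)  # (batch, 1536, h, w)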

Depth estimation: The paper states that they bilinearly upsample the features by a factor of 4 and then apply a linear layer. This leaves a resolution discrepancy of 3.5x (14 / 4) relative to the input. I tackled this by simply upsampling again to match the depth resolution. The linear layer is a simple 1x1 convolution applied to the grid that maps the features to a 256-dimensional vector representing the probabilities of the depth bins. I then apply the AdaBins uniform-bin baseline, which assigns one depth value to each of the 256 bins. The inner product of those two vectors is the output depth. It is worth noting that both AdaBins and BinsFormer use adaptive bins for some minor performance gain; however, the performance difference caused by bin choice is much smaller than the gap between my results and those reported in the paper.
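
A head along these lines might look like the sketch below. It reflects the description in this comment rather than the paper's released implementation. Two choices here are assumptions: a softmax is used to turn the logits into bin probabilities (the paper mentions a "linear normalization", which may differ), and the bin centers are evenly spaced over a hypothetical 0.001-10 m range. The class name `UniformBinDepthHead` is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UniformBinDepthHead(nn.Module):
    """1x1-conv classifier over uniform depth bins, decoded as an expected depth."""

    def __init__(self, in_dim, num_bins=256, min_depth=1e-3, max_depth=10.0):
        super().__init__()
        self.classifier = nn.Conv2d(in_dim, num_bins, kernel_size=1)
        edges = torch.linspace(min_depth, max_depth, num_bins + 1)
        # One representative depth per bin: the midpoint of its two edges.
        self.register_buffer("centers", 0.5 * (edges[:-1] + edges[1:]))

    def forward(self, feats, out_hw):
        # Upsample the token grid by 4 before the linear layer, as in "lin. 1".
        feats = F.interpolate(feats, scale_factor=4, mode="bilinear", align_corners=False)
        # Assumption: softmax normalization of the bin scores.
        probs = self.classifier(feats).softmax(dim=1)
        # Inner product of bin probabilities and bin centers at every pixel.
        depth = (probs * self.centers.view(1, -1, 1, 1)).sum(dim=1, keepdim=True)
        # Second upsampling step to close the remaining 3.5x gap.
        return F.interpolate(depth, size=out_hw, mode="bilinear", align_corners=False)


# Random stand-in for concatenated patch + CLS features on a 3x4 grid (42x56 image).
feats = torch.randn(1, 1536, 3, 4)
depth_head = UniformBinDepthHead(in_dim=1536)
depth = depth_head(feats, out_hw=(42, 56))
```

Because the prediction is a convex combination of bin centers, it always stays inside the chosen depth range.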

Loss: This is where things get a bit confusing. The paper seems to suggest that they use BinsFormer with a uniform bin size and 256 bins, as noted above. This is typically trained with the scale-invariant depth loss: the model estimates depth and the loss is then applied to that estimate. Using a classification loss, while possible, seemed like an odd choice. In that case, one would discretize the depth into 256 bins (I used a range of 0-10 m) and then apply a cross-entropy loss. I tried both losses, and the scale-invariant loss does better.
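
The two losses compared here could be written roughly as follows. This is a sketch under assumptions: the scale-invariant loss uses the common log-space form with lambda = 0.85 and a scale of 10 (the values popularized by AdaBins), and pixels with no ground truth are excluded through a boolean `valid` mask. The function names are illustrative, not from any released code.

```python
import torch
import torch.nn.functional as F


def silog_loss(pred, target, valid, lam=0.85, alpha=10.0, eps=1e-6):
    """Scale-invariant log loss over valid pixels; pred/target are (B, H, W)."""
    g = torch.log(pred[valid] + eps) - torch.log(target[valid] + eps)
    return alpha * torch.sqrt((g ** 2).mean() - lam * g.mean() ** 2 + eps)


def bin_cross_entropy(logits, target, valid, min_depth=0.0, max_depth=10.0):
    """Cross-entropy on depth discretized into uniform bins; logits are (B, K, H, W)."""
    num_bins = logits.shape[1]
    idx = ((target - min_depth) / (max_depth - min_depth) * num_bins).long()
    idx = idx.clamp(0, num_bins - 1)
    # Invalid pixels get the label that F.cross_entropy is told to ignore.
    idx = idx.masked_fill(~valid, -100)
    return F.cross_entropy(logits, idx, ignore_index=-100)


target = torch.rand(2, 8, 8) * 9.0 + 0.5   # synthetic depths in [0.5, 9.5) m
valid = target > 0
perfect = silog_loss(target, target, valid)
uniform_ce = bin_cross_entropy(torch.zeros(2, 256, 8, 8), target, valid)
```

With uniform logits the cross-entropy equals log(256), which is a handy sanity check before training.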

Optimization: I used AdamW (default parameters) with a cosine schedule for learning rate decay. I split the training data randomly at the level of room types with a train-val split of 0.7:0.3. I trained for 20 epochs. Training for 100 epochs didn't seem to help much.
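
A setup like this could be sketched as below. Everything here is illustrative: the room names, the tiny stand-in head, the random features and the dummy loss are placeholders, and the schedule is assumed to decay per step to zero over the whole run. Splitting by room rather than by frame keeps near-duplicate frames of one room out of the validation set.

```python
import random

import torch
import torch.nn as nn


def split_by_group(samples, train_ratio=0.7, seed=0):
    """Split (group, item) pairs so that no group appears in both subsets."""
    groups = sorted({g for g, _ in samples})
    random.Random(seed).shuffle(groups)
    cut = round(len(groups) * train_ratio)
    train_groups = set(groups[:cut])
    train = [s for s in samples if s[0] in train_groups]
    val = [s for s in samples if s[0] not in train_groups]
    return train, val


# Hypothetical dataset index: 10 rooms with 5 frames each.
samples = [(f"room_{i % 10}", f"frame_{i}.png") for i in range(50)]
train, val = split_by_group(samples)

head = nn.Conv2d(8, 4, kernel_size=1)             # tiny stand-in for the linear head
epochs, steps_per_epoch = 20, 3                   # steps_per_epoch is a placeholder
optimizer = torch.optim.AdamW(head.parameters())  # defaults: lr=1e-3, weight_decay=1e-2
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs * steps_per_epoch
)

for _ in range(epochs * steps_per_epoch):
    feats = torch.randn(2, 8, 4, 4)               # random stand-in for frozen features
    loss = head(feats).pow(2).mean()              # dummy loss, only to drive the loop
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                              # cosine decay, stepped once per batch

final_lr = scheduler.get_last_lr()[0]
```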


As I noted, I have tried several different variants and none of them could achieve the performance reported in the paper. I would greatly appreciate any feedback from the authors with either their implementation or suggesting what might be different between the setup I described above and the setup used in the paper. Thank you!

@YirayWang

YirayWang commented Jul 4, 2023

Hi @mbanani, thanks for sharing the details of your research. I am also working on depth estimation with the DINOv2 backbone and obtained an unexpected result.
For the simplest setup, lin. 1, described in the paper, I first used the KITTI dataset. For data preprocessing, I just slightly resized the original RGB image to satisfy "height (or width) % 14 == 0", while the dense depth ground truth was resized using 'nearest' mode.
I totally agree with the Feature Extraction step you described.
For Depth estimation, I think the vision transformer backbone used in DINOv2 naturally provides spatially low-resolution features, but with more embedding dimensions. I was also unsure whether there are any operations to rescale the features to the original image size other than directly upsampling by 4 and then by 3.5. I tried a U-Net decoder structure (without concatenation in my case), successively upsampling by 2, 2, 2 and 1.75. Between consecutive upsampling blocks, a conv2d was used to extract features and change the embedding dimension. Finally, the linear head was trained as a regression task using the scale-invariant loss.
However, at inference, the estimated depth (for an image also selected from KITTI) was unexpected, especially for scenes where many cars are parked along the side of the road.

Above is my experience and opinion, thank you.
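
A decoder of the kind described in this comment might be sketched as follows. The channel widths, the 3x3 kernels, the ReLU activations and the softplus on the output (to keep the depth positive) are all assumptions made for illustration, as is the class name; only the 2, 2, 2, 1.75 upsampling schedule and the nearest-mode ground-truth resize come from the comment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProgressiveUpsampleDecoder(nn.Module):
    """Upsample by 2, 2, 2, 1.75 (14x in total) with a conv after each step."""

    def __init__(self, in_dim, dims=(256, 128, 64, 32), scales=(2, 2, 2, 1.75)):
        super().__init__()
        blocks = []
        for out_dim, scale in zip(dims, scales):
            blocks += [
                nn.Upsample(scale_factor=scale, mode="bilinear", align_corners=False),
                nn.Conv2d(in_dim, out_dim, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            ]
            in_dim = out_dim
        self.blocks = nn.Sequential(*blocks)
        self.head = nn.Conv2d(in_dim, 1, kernel_size=1)  # linear regression head

    def forward(self, feats):
        # Assumption: softplus keeps the regressed depth strictly positive.
        return F.softplus(self.head(self.blocks(feats)))


feats = torch.randn(1, 64, 4, 6)                  # random stand-in for a 4x6 token grid
decoder = ProgressiveUpsampleDecoder(in_dim=64)
pred = decoder(feats)                             # (1, 1, 56, 84) = 14x the grid

# Nearest-neighbour resize avoids blending depth values across object boundaries.
gt = torch.rand(1, 1, 20, 30) * 80.0              # synthetic ground-truth depth map
gt_resized = F.interpolate(gt, size=pred.shape[-2:], mode="nearest")
```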

@52THANOS

Would appreciate greatly if your pre-trained depth estimator/optical flow model is released!

Thanks for your interest. Please note that we don't have an optical flow model (although one could leverage the provided backbones to train a matching head for this task).

When could a trained depth estimation head be released?

@patricklabatut patricklabatut unpinned this issue Jul 24, 2023
@patricklabatut patricklabatut pinned this issue Jul 24, 2023
@FrankFeng-23

Would appreciate greatly if your pre-trained depth estimator/optical flow model is released!

Thanks for your interest. Please note that we don't have an optical flow model (although one could leverage the provided backbones to train a matching head for this task).

When could a trained depth estimation head be released?

Same question here. I would really appreciate it if a depth estimation head were made available.

@dodatw

dodatw commented Aug 7, 2023

same here.

@NielsRogge

NielsRogge commented Nov 13, 2023

Hi folks,

Just added support for DPT + DINOv2 in 🤗 Transformers: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/DPT/DPT_inference_notebook_(depth_estimation).ipynb.

We've extended the DPT model (which is one of the best depth estimation decoders) to now also leverage DINOv2 as backbone. It can be created as follows:

from transformers import Dinov2Config, DPTConfig, DPTForDepthEstimation

backbone_config = Dinov2Config.from_pretrained("facebook/dinov2-base", out_features=["stage1", "stage2", "stage3", "stage4"])
config = DPTConfig(backbone_config=backbone_config)

model = DPTForDepthEstimation(config)

Transferred all checkpoints to the hub: https://huggingface.co/models?pipeline_tag=depth-estimation&other=dinov2&sort=trending.

@palol

palol commented Nov 15, 2023

@NielsRogge thanks for the support!

Question ~ if I already have DINOv2 embeddings extracted, is there a way for me to run them through the depth estimation portion only?

@NielsRogge

Hi @palol, yes that's possible, you could do it as follows:

from transformers import DPTForDepthEstimation

model = DPTForDepthEstimation.from_pretrained("facebook/dpt-dinov2-small-kitti")

# note: we need to set a certain height and width (this is normally the height and width of the image passed to the model)
height = width = 518
patch_size = model.config.backbone_config.patch_size
patch_height = height // patch_size
patch_width = width // patch_size
hidden_states = model.neck(dino_features, patch_height, patch_width)
predicted_depth = model.head(hidden_states)

Note that dino_features here needs to be a list of 4 feature maps extracted from a DINOv2-small model (as we're loading facebook/dpt-dinov2-small-kitti from the hub), taken from the 4 stages that the small configuration uses (stages [3, 6, 9, 12]). This is because the DPT head uses feature maps/embeddings from 4 different layers of DINOv2.

@palol

palol commented Nov 15, 2023

@NielsRogge thanks for the solution. So this means that enough of the backbone has to be preserved to follow the "lin. 4" protocol. Do you have any support for the "lin. 1" protocol, that only uses the last layer of the frozen transformer?
