[request] Depth estimation documentation, training code and / or model weights #54

patricklabatut · 2023-04-24T22:41:05Z

Related issues:

kfzyqin · 2023-04-27T07:52:07Z

Can you give an expected timeframe on when depth estimation will be available?

tnarek · 2023-04-27T14:51:48Z

I'd also be interested to hear.

yuvfried · 2023-04-30T10:55:29Z

I'd also be happy if you could share the semantic segmentation heads, specifically the ones that produce the results in the web demo.
Thx!

hblanken · 2023-05-04T02:15:09Z

Would be excellent to obtain depth estimation output per image. Supportive of this enhancement!

mirlansmind · 2023-05-04T17:05:43Z

segmentation head similar to the demo please

stofe95 · 2023-05-07T13:30:12Z

Also interested in acquiring depth info per image, really cool!

jonathan-besuchet · 2023-05-11T13:29:03Z

Also very interested to have the depth estimation head model documentation (and model/weights if possible).

shahabe · 2023-05-16T01:21:10Z

@patricklabatut Thank you so much for the main code.
Could you please also update us on the timeline for delivering the depth estimation code?
Please let us know if any help is needed.

wuzihaoo · 2023-05-19T18:38:26Z

Could you please release the segmentation part?

woctezuma · 2023-05-19T20:22:24Z

Could you please release the segmentation part?

[request] Semantic segmentation documentation, training code and / or model weights #55

ttppss · 2023-05-24T14:18:31Z

Very interested and waiting for your release!

imbinwang · 2023-05-26T09:23:25Z

Cool!

bloodhunt3r · 2023-05-29T11:47:21Z

very interested in releasing the depth estimation head

kootsZhin · 2023-05-29T15:35:49Z

Interested in depth estimation head as well (or any documentation on how to reproduce the results using provided models)

ray8828 · 2023-05-29T20:23:53Z

Interested in the depth part also!

JuliusJacobsohn · 2023-05-31T10:49:23Z

@patricklabatut could you maybe shed some light on the decision to not release the depth estimation parts immediately?
I'm not much into deep learning research, but if you trained and tested it, is it a lot of effort to just publish it? Or am I too naive?

Ale-Burzio · 2023-06-06T09:13:04Z

@patricklabatut amazing work! any approximate timeline on if/when a trained depth estimation head could be released?

leesunfreshing · 2023-06-09T16:28:56Z

I would love to learn the news about the depth

kanishkanarch · 2023-06-10T15:42:23Z

I would also appreciate example code for depth estimation. I can't do much with the model's output embeddings yet.
Thanks!

Cindy0725 · 2023-06-13T09:17:58Z

Very interested in the depth estimation code! I tried to add a linear head, but I don't know how to convert the (batch_size, num_of_tokens, feature_dim) tensor to (batch_size, 256, image_width, image_height) to get the paper's result on SUNRGBD.

fumin · 2023-06-13T13:58:01Z

Would appreciate greatly if your pre-trained depth estimator/optical flow model is released!
Can't wait to try it on my videos!

patricklabatut · 2023-06-13T21:04:14Z

Would appreciate greatly if your pre-trained depth estimator/optical flow model is released!

Thanks for your interest. Please note that we don't have an optical flow model (although one could leverage the provided backbones to train a matching head for this task).

hblanken · 2023-06-18T03:21:50Z

It would be awesome if someone trained a depth estimation head on top of the provided backbone (dinov2_vitl14_pretrain.pth). Any thoughts on who/how, and an estimated ETA?

sfchen94 · 2023-06-21T15:53:05Z

I would also like to request an estimated release date for the depth estimation pre-train head. Thank you.

Jimlee079 · 2023-06-26T12:00:43Z

Two questions about the "DPT decoder" mentioned in Section 7.3, Dense Recognition Tasks (depth estimation part). I searched the DPT source code: does the "DPT decoder" refer to its refinenet? If yes, I'm curious why you chose this decoder. Thank you!

dariocazzani · 2023-06-28T00:01:19Z

@patricklabatut - any updates on the depth estimation code?
I am having a hard time reproducing with the same quality you show in the paper

emojilearning · 2023-06-28T09:15:10Z

@patricklabatut - any updates on the depth estimation code? I am having a hard time reproducing with the same quality you show in the paper

Hah, I adapted the DPT head from its official repo to DINOv2. The accuracy is obviously lower than that in the paper.

Cindy0725 · 2023-06-28T09:29:12Z

@patricklabatut - any updates on the depth estimation code? I am having a hard time reproducing with the same quality you show in the paper

Hah, I adapted the DPT head from its official repo to DINOv2. The accuracy is obviously lower than that in the paper.

Hi, what RMSE did you get for depth estimation with the DPT decoder? Was it on NYUv2 or SUNRGBD? I am really interested in the results. Thank you very much! @emojilearning

mbanani · 2023-07-02T23:21:57Z

Hi @patricklabatut, thanks for releasing the code and starting this issue to track progress on depth estimation.

I have tried to re-implement this but have not been successful (I was unable to achieve an RMSE below 0.52 for ViT-B/14). My re-implementation is based on the following quoted part from Sec. 7.4. There are many details missing that I filled in, but I cannot seem to get the performance reported. I hope that this can help others who also seem to be struggling with reproducing this number, and perhaps make it easier for the authors to highlight the key difference that would help us reproduce the depth probe.

I am basing my experiments on this part, which describes the simplest setup, lin. 1, for ViT-B/14. It requires training a single linear layer on top of the frozen final layer's output:

lin. 1: we extract the last layer of the frozen transformer and concatenate the [CLS] token to each patch token. Then we bi-linearly upsample the tokens by a factor of 4 to increase the resolution. Finally we train a simple linear layer using a classification loss by dividing the depth prediction range in 256 uniformly distributed bins and use a linear normalization following Bhat et al. (2021).

Below I detail my attempt based on the details provided in the paper:

Image extraction: I assumed that you trained at a resolution close to NYU's (480x640). I went down to 462x616, since both sides are multiples of 14 and the aspect ratio is preserved. Depending on the setup, there may or may not be augmentations. If dense features are extracted once and a layer is trained on top of them, there might be none. Alternatively, one can keep the backbone frozen and train with image augmentations. I tried both. For augmentations, I used ColorJitter, RandomResizedCrop, RandomRotation (<= 10 degrees), and RandomHorizontalFlip. With the exception of the jitter, these augmentations were applied to both images and depth.
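As a rough illustration of the resolution choice above, here is a minimal sketch that finds the largest size that keeps the original aspect ratio while making both sides divisible by the patch size. The helper name `largest_patch_aligned_size` is hypothetical and is not part of the DINOv2 code base.

```python
from math import gcd


def largest_patch_aligned_size(height, width, patch=14):
    """Largest (h, w) <= (height, width) with the same aspect ratio
    and both sides divisible by `patch` (hypothetical helper)."""
    g = gcd(height, width)
    ratio_h, ratio_w = height // g, width // g  # 480x640 reduces to 3:4
    # With a reduced ratio, both sides are divisible by `patch` only when
    # the common scale is itself a multiple of `patch`, i.e. k * patch.
    k = min(height // (ratio_h * patch), width // (ratio_w * patch))
    if k == 0:
        raise ValueError("image too small for this patch size and aspect ratio")
    return k * ratio_h * patch, k * ratio_w * patch


h, w = largest_patch_aligned_size(480, 640)
print(h, w, h // 14, w // 14)  # 462 616 33 44
```

For NYU-sized inputs this reproduces the 462x616 resolution and the 33x44 patch grid mentioned above.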

Feature extraction: The output tokens form a grid that is 14x smaller than the full image. You can get the patch tokens and the CLS token from the output of DINOv2 and then reshape them into the correct shape, as seen below. This results in an output of batch x 1536 x 33 x 44.

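import torch
import einops as E

vit = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").cuda()
# `image` is assumed to be a (batch, 3, H, W) CUDA tensor with H and W divisible by 14
ret = vit.forward_features(image)

patch_tok = ret["x_norm_patchtokens"]  # (batch, h * w, 768)
cls_tok = ret["x_norm_clstoken"]  # (batch, 768)

_, _, img_h, img_w = image.shape
# integer division: rearrange and repeat both need integer sizes
patch_h, patch_w = img_h // 14, img_w // 14

patch_tok = E.rearrange(patch_tok, "b (h w) c -> b c h w", h=patch_h, w=patch_w)
# copy the CLS token to every patch location
cls_tok = cls_tok[:, :, None, None].repeat(1, 1, patch_h, patch_w)
output = torch.cat((patch_tok, cls_tok), dim=1)  # (batch, 1536, patch_h, patch_w)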

Depth estimation: The paper states that they bilinearly upsample the features by a factor of 4 and then apply a linear layer. This leaves a resolution discrepancy of 3.5x. I tackled this by simply upsampling again to match the depth resolution. The linear layer is a simple 1x1 convolution applied to the grid that maps the features to a 256-dimensional vector representing the probabilities for each of the depth bins. I then apply the AdaBins uniform-bin baseline, which assigns a depth value to each of the 256 bins. The inner product of those two vectors is the output value. It is worth noting that both AdaBins and BinsFormer use adaptive bins for some minor performance gain; however, the difference in performance caused by bin choice is much smaller than the gap between my results and the paper's.
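To make the decoding step above concrete, here is a minimal pure-Python sketch for a single pixel. It shows the resolution bookkeeping and the inner product between bin probabilities and uniform bin centers. The 0-10 m range, the use of bin centers, and the `linear_norm` variant (ReLU plus a small `eps`, then division by the sum) are assumptions for illustration; the paper only says "linear normalization", so this may not match the authors' implementation.

```python
import math


def uniform_bin_centers(min_depth=0.0, max_depth=10.0, n_bins=256):
    """Centers of n_bins equal-width bins covering [min_depth, max_depth]."""
    width = (max_depth - min_depth) / n_bins
    return [min_depth + (i + 0.5) * width for i in range(n_bins)]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_norm(logits, eps=0.1):
    """Assumed reading of 'linear normalization': ReLU + eps, divided by the sum."""
    shifted = [max(x, 0.0) + eps for x in logits]
    total = sum(shifted)
    return [s / total for s in shifted]


def expected_depth(logits, centers, normalize=softmax):
    """Inner product of the per-bin probabilities and the per-bin depth values."""
    probs = normalize(logits)
    return sum(p * c for p, c in zip(probs, centers))


# Resolution bookkeeping for a 462x616 input with 14x14 patches.
grid = (462 // 14, 616 // 14)            # (33, 44)
upsampled = (grid[0] * 4, grid[1] * 4)   # (132, 176)
remaining = 462 / upsampled[0]           # 3.5

centers = uniform_bin_centers()
flat = [0.0] * 256  # equal logits put equal mass on every bin
print(grid, upsampled, remaining)
print(expected_depth(flat, centers))
```

With equal logits the prediction lands at the middle of the depth range, which is a quick sanity check for either normalization.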

Loss: This is where things get a bit confusing. The paper seems to suggest that they use BinsFormer with uniform bin size and 256 bins, as noted above. This is typically trained with the scale-invariant depth loss: the model estimates depth and the loss is then applied to that estimate. Using a classification loss, while possible, seemed like an odd choice. In that case, one would discretize the depth into 256 bins (I used a range of 0-10 m) and then apply a cross-entropy loss. I tried both losses, and the scale-invariant loss does better.
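The two loss options described above can be sketched as follows, in pure Python over flat lists of pixel values. The scale-invariant loss uses the form from Eigen et al. as scaled in AdaBins, with `lam=0.85` and `alpha=10.0`. Those constants and the 0-10 m bin range are assumptions here, not values confirmed by the DINOv2 authors.

```python
import math


def silog_loss(pred, target, lam=0.85, alpha=10.0):
    """Scale-invariant log loss over pixels with valid (positive) ground truth.
    Predictions are assumed to be strictly positive."""
    g = [math.log(p) - math.log(t) for p, t in zip(pred, target) if t > 0]
    n = len(g)
    mean_sq = sum(x * x for x in g) / n
    mean = sum(g) / n
    # Clamp at zero to guard against tiny negative values from rounding.
    return alpha * math.sqrt(max(mean_sq - lam * mean * mean, 0.0))


def depth_to_bin(depth, min_depth=0.0, max_depth=10.0, n_bins=256):
    """Index of the uniform bin containing `depth`, clipped to the valid range."""
    width = (max_depth - min_depth) / n_bins
    idx = int((depth - min_depth) / width)
    return min(max(idx, 0), n_bins - 1)


def cross_entropy(logits, target_bin):
    """Negative log-likelihood of the target bin under a softmax over the logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_bin]


target = [1.0, 2.5, 4.0, 8.0]
print(silog_loss(target, target))                 # a perfect prediction gives 0.0
print(silog_loss([2 * t for t in target], target, lam=1.0))
print(depth_to_bin(5.0), cross_entropy([0.0] * 256, depth_to_bin(5.0)))
```

With `lam=1.0` the loss ignores a global scale error entirely, which is what makes it scale-invariant; `lam=0.85` penalizes such an error only partially.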

Optimization: I used AdamW (default parameters) with a cosine schedule for learning rate decay. I split the training data randomly at the level of room types with a train-val split of 0.7:0.3. I trained for 20 epochs. Training for 100 epochs didn't seem to help much.
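For reference, the schedule and the split described above could look roughly like this. It is a sketch under assumptions: the cosine decay has no warmup and ends at `min_lr`, and the 0.7:0.3 ratio is applied to the number of room types rather than the number of images. Both function names and the room-type labels are hypothetical.

```python
import math
import random


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps (no warmup)."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def split_by_group(groups, train_fraction=0.7, seed=0):
    """Randomly assign whole groups (e.g. room types) to train or val,
    so that no group contributes images to both sides."""
    names = sorted(set(groups))
    random.Random(seed).shuffle(names)
    n_train = round(train_fraction * len(names))
    return set(names[:n_train]), set(names[n_train:])


room_types = [f"room_type_{i}" for i in range(10)]  # placeholder labels
train_rooms, val_rooms = split_by_group(room_types)
print(len(train_rooms), len(val_rooms))  # 7 3
print(cosine_lr(0, 100, 1e-3), cosine_lr(50, 100, 1e-3), cosine_lr(100, 100, 1e-3))
```

Splitting by group rather than by image keeps images of the same room type out of both sets at once.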

As I noted, I have tried several different variants and none of them could achieve the performance reported in the paper. I would greatly appreciate any feedback from the authors with either their implementation or suggesting what might be different between the setup I described above and the setup used in the paper. Thank you!

YirayWang · 2023-07-04T08:13:29Z

Hi @mbanani, thanks for sharing the research details. I am also working on depth estimation with the DINOv2 backbone and obtained an unexpected result.
For the simplest setup, lin. 1, as stated in the paper, I first used the KITTI dataset. For data preprocessing, I just slightly resized the original RGB image to satisfy "height (or width) % 14 == 0", while the dense depth ground truth was resized using 'nearest' mode.
I totally agree with the Feature Extraction step you described.
For depth estimation, I think the vision transformer backbone used in DINOv2 naturally provides spatially low-resolution features, but with more embedding dimensions. I was also unsure whether there are any operations to rescale the features to the original image size, instead of directly upsampling by 4 and then by 3.5. I tried a U-Net decoder structure (without concatenation in my case) that successively upsamples by 2, 2, 2 and 1.75. Between each pair of upsampling blocks, a conv2d was used to extract features and change the embedding dimension. Finally, the linear head was trained as a regression task using the scale-invariant loss.
However, at inference, the estimated depth (for an image also taken from KITTI) was not what I expected, especially for scenes with many cars parked along the side of the road.

Above is my experience and opinion, thank you

52THANOS · 2023-07-19T06:20:02Z

Would appreciate greatly if your pre-trained depth estimator/optical flow model is released!

Thanks for your interest. Please note that we don't have an optical flow model (although one could leverage the provided backbones to train a matching head for this task).

When could a trained depth estimation head be released?

FrankFeng-23 · 2023-07-30T12:14:39Z

Would appreciate greatly if your pre-trained depth estimator/optical flow model is released!

Thanks for your interest. Please note that we don't have an optical flow model (although one could leverage the provided backbones to train a matching head for this task).

When could a trained depth estimation head be released?

Same quest here. I would really appreciate it if a depth estimation head is available.

dodatw · 2023-08-07T07:17:28Z

same here.

NielsRogge · 2023-11-13T17:44:04Z

Hi folks,

Just added support for DPT + DINOv2 in 🤗 Transformers: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/DPT/DPT_inference_notebook_(depth_estimation).ipynb.

We've extended the DPT model (which is one of the best depth estimation decoders) to now also leverage DINOv2 as backbone. It can be created as follows:

from transformers import Dinov2Config, DPTConfig, DPTForDepthEstimation

backbone_config = Dinov2Config.from_pretrained("facebook/dinov2-base", out_features=["stage1", "stage2", "stage3", "stage4"])
config = DPTConfig(backbone_config=backbone_config)

model = DPTForDepthEstimation(config)

Transferred all checkpoints to the hub: https://huggingface.co/models?pipeline_tag=depth-estimation&other=dinov2&sort=trending.

palol · 2023-11-15T17:54:58Z

@NielsRogge thanks for the support!

Question ~ if I already have DINOv2 embeddings extracted, is there a way for me to run them through the depth estimation portion only?

NielsRogge · 2023-11-15T18:29:19Z

Hi @palol, yes that's possible, you could do it as follows:

from transformers import DPTForDepthEstimation

model = DPTForDepthEstimation.from_pretrained("facebook/dpt-dinov2-small-kitti")

# note: we need to set a certain height and width (this is normally the height and width of the image passed to the model)
height = width = 518
patch_size = model.config.backbone_config.patch_size
patch_height = height // patch_size
patch_width = width // patch_size
hidden_states = model.neck(dino_features, patch_height, patch_width)
predicted_depth = model.head(hidden_states)

Note that the dino_features here need to be a list of 4 feature maps extracted from a DINOv2-small model in this case (as we're loading facebook/dpt-dinov2-small-kitti from the hub), taken from the 4 stages that correspond to the small model (stages [3, 6, 9, 12]). This is because the DPT head uses feature maps/embeddings from 4 different layers of DINOv2.

palol · 2023-11-15T21:24:57Z

@NielsRogge thanks for the solution. So this means that enough of the backbone has to be preserved to follow the "lin. 4" protocol. Do you have any support for the "lin. 1" protocol, that only uses the last layer of the frozen transformer?

patricklabatut added documentation Improvements or additions to documentation enhancement New feature or request labels Apr 24, 2023

patricklabatut self-assigned this Apr 24, 2023

patricklabatut mentioned this issue Apr 24, 2023

How to finetune on downstream depth estimation task? #46

Closed

patricklabatut changed the title ~~[request] Depth estimation training code and / or model weights~~ [request] Depth estimation documentation, training code and / or model weights Apr 24, 2023

patricklabatut mentioned this issue Apr 24, 2023

Will the other heads be released (eg. depth estimation) #14

Closed

patricklabatut assigned patricklabatut and unassigned patricklabatut Apr 24, 2023

patricklabatut pinned this issue May 5, 2023

patricklabatut mentioned this issue May 17, 2023

How to implement depth estimation with the provided model #97

Closed

patricklabatut unpinned this issue Jul 24, 2023

patricklabatut pinned this issue Jul 24, 2023

NielsRogge mentioned this issue Aug 3, 2023

DINOv2 is now available in HF Transformers (with tutorial) #153

Open
