
Log

| Changes | Result | Ideas for next steps |
| --- | --- | --- |
| Run notebook https://www.kaggle.com/phoenix9032/center-resnet-starter | public LB 0.020, place ~660/800 | Seems like a low place -> test the more popular notebook by Ruslan |
| Run notebook https://www.kaggle.com/hocop1/centernet-baseline | public LB 0.031, place 550/800 | Better start, seems okay as a baseline -> implement the prediction pipeline yourself |
| Own implementation of the previous notebook (in particular data loading and inference pipeline); only the model weights by hocop1 / Ruslan are reused | public LB 0.006 | Something is still wrong, even though the predicted images look okay :/ -> compare code to the original notebook |
| Fixed issues in the reimplementation and integrated the optimize_xy function; prediction is now exactly equal when using the original weights | public LB 0.027 | Shouldn't it also score 0.031? -> try with my own weights and see what happens |
| Exactly the same, but using my own weights | public LB 0.034, place 502/810 | Surprisingly large difference -> next: evaluate the improvement idea collection and start gradually |
| Increased model input size from 1024x320 to 1536x512 (without modifying layers; maybe another convolution would be necessary because the effective window size decreases?). Added another upsample convolution to increase the output size to 384x128 (before: 128x40) | aborted training after epoch 4 | Mask loss dominates -> change the weights so that mask and regression loss are of similar magnitude |
| Changed the loss weights so that mask and regression loss are of the same order of magnitude, see https://www.kaggle.com/c/pku-autonomous-driving/discussion/115673 | public LB 0.062, place 109/820 | :) -> next: choose the next improvement idea |
| Major change: switched from the binary loss to a focal loss for the mask. Minor change: excluded five erroneous images from the training set | public LB 0 | :( The mask looks great, but the regression values are totally wrong |
| In training: changed the regression loss by extracting a binary mask from the heatmap mask; also changed the learning rate scheduler slightly. In prediction post-processing: disable the optimization if the optimized values don't make sense (20200112_focal_loss_v2 / model_6) | public LB 0.044 | My impression is that training is not yet finished; a better score could be achieved by training longer -> train the focal-loss model more and disable LR decay. Also, why is loss_regr so high? It was 0.13 before, now 0.51? Tried training longer, but then it overfits... -> focal loss not effective? |
| (In parallel to the training above) trained a new model with focal loss, image augmentation and usage of the provided masks | public LB 0.005 | Somehow very few car predictions, especially for nearby cars. Focal loss without image augmentation and without the mask is much better. Is something off with the augmentation after all? No, probably just too much augmentation, and the current usage of the mask is not effective -> rather concatenate the mask to the image! |
| Changes: (1) found an issue with the focal loss: labels had values > 1, which resulted in slightly odd loss terms (see the sketch below this table); (2) use the mask via concatenation instead of simply masking the camera image; (3) significantly reduced image augmentation | 0.055 (without augmentation) & 0.041 (with augmentation and shorter training) | Training was not complete, but results had to be handed in due to the deadline. regr_loss was 0.45 at the end, but I believe it would still have gone down (much) further... A pity, but a fun learning experience anyway. |
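Since the loss weighting and the focal loss come up repeatedly in the log above, here is a minimal sketch of what a CenterNet-style focal loss for the heatmap and a re-weighted combination with the regression loss could look like. The function names, the exponents `alpha`/`beta` and the factor `regr_weight` are illustrative assumptions, not the exact values used in the notebooks; note the clamping of the target heatmap to [0, 1], which addresses the labels-greater-than-1 bug from the last log entry.

```python
import torch

def focal_heatmap_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-6):
    """CenterNet-style focal loss for the center heatmap (sketch).

    pred:   predicted heatmap after sigmoid, shape (B, H, W)
    target: Gaussian-splatted ground-truth heatmap, shape (B, H, W)
    """
    # Guard against the bug from the log: target values > 1 break the
    # (1 - target)^beta term, so clamp the labels into [0, 1].
    target = target.clamp(0.0, 1.0)
    pred = pred.clamp(eps, 1.0 - eps)

    pos_mask = target.eq(1.0).float()   # exact object centers
    neg_mask = 1.0 - pos_mask           # everything else

    pos_loss = -((1.0 - pred) ** alpha) * torch.log(pred) * pos_mask
    neg_loss = -((1.0 - target) ** beta) * (pred ** alpha) * torch.log(1.0 - pred) * neg_mask

    num_pos = pos_mask.sum().clamp(min=1.0)
    return (pos_loss.sum() + neg_loss.sum()) / num_pos


def total_loss(pred_mask, gt_mask, pred_regr, gt_regr, regr_weight=1.0):
    """Combine the mask (heatmap) loss and the regression loss.

    regr_weight is a hypothetical knob; the idea from the log is simply to
    scale the two terms so they end up in the same order of magnitude.
    """
    mask_loss = focal_heatmap_loss(pred_mask, gt_mask)
    # L1 regression loss, evaluated only where an object center exists
    # (the "binary mask extracted from the heatmap mask" mentioned in the log).
    obj_mask = gt_mask.eq(1.0).float().unsqueeze(1)
    regr_loss = (torch.abs(pred_regr - gt_regr) * obj_mask).sum() / obj_mask.sum().clamp(min=1.0)
    return mask_loss + regr_weight * regr_loss
```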

Improvement idea collection

Effective window size

Lessons learned

  • Kaggle-specific
    • Simply download the predictions.csv output file from a notebook to evaluate its score, instead of rerunning the whole notebook and waiting 12 h.
    • Rather than starting the code from scratch, refactor the existing notebook code. Starting from scratch is a greater learning experience and (in my view) produces more structured code, but it is quite difficult and cumbersome to get every detail of the reimplementation right and thereby obtain a well-defined starting point.
  • More general
    • Buy a decent GPU. My workaround was to use either a free Google Colab GPU via SSH and PyCharm remote (possible through an ngrok "hack") or sometimes the free Kaggle GPU. However, both have disadvantages (see details below). Often I would start 1-2 trainings in the evening and find both of them aborted the next morning :/.
      • Usage of the Kaggle GPU is limited to 30 h/week and importing the code is tedious. Moreover, while a notebook is committing you cannot see any output and thus cannot detect e.g. a NaN loss.
      • Google Colab sessions have an official time limit of 12 h, but in reality training often stopped already after 4-5 h, or sometimes even earlier, to my surprise. Moreover, I was never able to acquire a GPU backend during the day, only in the evenings.
      • In the end, I invested in a Google Cloud GPU (P100). While it was much more productive than Colab, the connection was not 100% stable, the server was sometimes unavailable, and analyzing output images first requires a time-consuming download.
    • Work with a larger validation set (currently only 1%) so that it can be used for real evaluation (the online test set allows only two submissions a day).
    • Only change one thing at a time. Due to the limited GPU availability, I sometimes changed several things at once and could not assess the effect of each change. Resist the temptation...
    • Compare losses not only between epochs, but also between different models. If something is significantly different, question it.
    • The PyTorch DataLoader parameters num_workers>2 and pin_memory=True speed up training by a factor of ~3 (see the sketch after this list).
    • Research even more before implementing.
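A minimal sketch of the DataLoader settings mentioned above; the dummy dataset and batch size are placeholders, only num_workers and pin_memory are the point here.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class DummyDataset(Dataset):
    """Stand-in for the competition's dataset class (placeholder)."""
    def __len__(self):
        return 64
    def __getitem__(self, idx):
        # real code would load and preprocess a camera image plus its targets
        return torch.zeros(3, 512, 1536), torch.zeros(8, 128, 384)

train_loader = DataLoader(
    DummyDataset(),
    batch_size=4,
    shuffle=True,
    num_workers=4,     # >2 worker processes prepare batches in parallel with the GPU
    pin_memory=True,   # pinned host memory allows faster (async) transfers to the GPU
)
```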

Idea for more efficient GPU use

  • Use the local GPU and a tiny model (e.g. 1/8 of the channels) to test code locally
  • Use 2 GPUs for the actual training (so that I can train and test new things in parallel!):
  • github vs ssh
    • github
      • (+) quick to setup
      • (-) cannot debug or set breakpoints
    • ssh + pycharm remote
      • (-) need to run ssh, ngrok and click event. Potentially more unstable?
      • (+) code is copied instantaneously
  • -> Add additional parameters (to override) via argparse (see the sketch after this list):
    • location of data
    • flag_simplify_model (only for local running!)
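A minimal sketch of such an argparse setup; the argument names follow the list above, the defaults are made up.

```python
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Training configuration overrides")
    parser.add_argument("--data-dir", default="./data",
                        help="location of the competition data (differs between local and remote)")
    parser.add_argument("--simplify-model", action="store_true",
                        help="use a tiny model (e.g. 1/8 of the channels); only for local test runs")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(args.data_dir, args.simplify_model)
```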

How to use the (paid) Google Cloud service?

Overview

Connect to server via SSH