March 2020

tl;dr: Answers the question: which tasks are better trained together with others?

## Overall impression

The paper uses the dataset from Taskonomy (CVPR 2018 best paper), which studies task transferability. Task grouping instead studies multi-task learnability. The paper finds that the two correlate inversely.

The goal of multi-task learning is two-fold:

- achieve the best performance (leveraging the regularization effect of training with other tasks)
- reduce inference time

One of the key insights from the paper is:

> The inclusion of an additional task in a network can potentially improve the accuracy that can be achieved on the existing tasks, even though the performance of the added task might be poor.

## Key ideas

- Optimal grouping is better than a single multi-task network or multiple single-task networks.
  - For example, the best strategy found by this paper trains 2.5 networks: two full-sized networks with two tasks each, and a third half-sized network for the fifth task. However, the fifth task is still needed to regularize the two full-sized networks in order to reach optimal performance on the first four tasks.
- Given enough computational resources, training individual single-task networks is better, but some tasks still need other tasks to help with regularization.
- Task grouping (multi-task learnability) is inversely correlated with task transferability, so it is better to train dissimilar tasks together. --> This is somewhat counter-intuitive. The authors argue that dissimilar tasks provide more meaningful regularization.
- The paper proposes two methods to reduce the computational burden of the grouping search (a sketch of the second follows this list):
  - Early stopping approximation: the validation score at 0.2 epochs already correlates well with the final score. This saves roughly 20x in compute.
  - Higher-order approximation: train all single-task and two-task models and use them to approximate the performance of higher-order groupings. This reduces the number of trained networks from exponential in the number of tasks to quadratic.
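
A minimal sketch of the higher-order approximation as I read it: a larger group's per-task score is estimated from the two-task networks containing that task. The task names and scores below are made up for illustration only.

```python
# pair_scores[frozenset({a, b})][t]: validation score of task t when tasks a
# and b are trained together. Dummy numbers, purely illustrative.
pair_scores = {
    frozenset({"depth", "normals"}): {"depth": 0.82, "normals": 0.79},
    frozenset({"depth", "edges"}):   {"depth": 0.78, "edges": 0.74},
    frozenset({"normals", "edges"}): {"normals": 0.77, "edges": 0.76},
}

def estimate_group(group):
    """Estimate per-task scores of a >2-task group from two-task results."""
    est = {}
    for t in group:
        partners = [u for u in group if u != t]
        est[t] = sum(pair_scores[frozenset({t, u})][t] for u in partners) / len(partners)
    return est

# Only O(n^2) two-task networks need to be trained; candidate groupings of any
# size can then be scored without further training.
print(estimate_group({"depth", "normals", "edges"}))
```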

## Technical details

- Hard parameter sharing: the tasks share the same backbone/encoder.
- Soft parameter sharing: each task has its own copy of the same architecture, with an L2 distance penalty between corresponding weights, or with added peephole connections between corresponding layers. --> This does not improve inference time. See the sketch after this list.
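
A minimal PyTorch sketch (not the paper's exact architecture; module names and sizes are made up) contrasting the two sharing schemes:

```python
import torch
import torch.nn as nn

# Hard parameter sharing: one shared encoder, one lightweight head per task.
class HardSharing(nn.Module):
    def __init__(self, in_dim=128, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head_a = nn.Linear(hidden, 1)    # e.g. a regression task
        self.head_b = nn.Linear(hidden, 10)   # e.g. a 10-class task

    def forward(self, x):
        z = self.encoder(x)  # computed once, reused by both heads
        return self.head_a(z), self.head_b(z)

# Soft parameter sharing: each task keeps its own encoder of the same
# architecture; an L2 penalty pulls corresponding weights together.
enc_a, enc_b = nn.Linear(128, 64), nn.Linear(128, 64)

def soft_sharing_penalty(mod_a, mod_b, weight=1e-3):
    return weight * sum((pa - pb).pow(2).sum()
                        for pa, pb in zip(mod_a.parameters(), mod_b.parameters()))

x = torch.randn(4, 128)
loss = enc_a(x).sum() + enc_b(x).sum() + soft_sharing_penalty(enc_a, enc_b)
loss.backward()  # both encoders still run at inference, hence no latency gain
```

Only hard sharing delivers the inference-time savings; soft sharing merely couples the weights during training.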

## Notes

- Questions and notes on how to improve/revise the current work