Awesome-Multi-Task-Learning

Paper List for Multi-Task Learning, focusing on architectures and optimization for MTL (papers since 2016).

Contents:

Architectures for Multi-Task Learning

Manually Designed Architectures

  • [TaskPrompter] Ye, H., & Xu, D. TaskPrompter: Spatial-Channel Multi-Task Prompting for Dense Scene Understanding. ICLR, 2023.

    Notes: learns task-generic and task-specific representations as well as cross-task interactions simultaneously, without requiring the design of different types of network modules.

  • [Polyhistor] Liu, Y. C., Ma, C. Y., Tian, J., He, Z., & Kira, Z. Polyhistor: Parameter-Efficient Multi-Task Adaptation for Dense Vision Tasks. NeurIPS, 2022.

    Notes: propose Polyhistor and Polyhistor-Lite, consisting of Decomposed HyperNetworks and Layer-wise Scaling Kernels, to share information across different tasks with only a small number of trainable parameters.

  • [MulT] Bhattacharjee, D., Zhang, T., Süsstrunk, S., & Salzmann, M. MulT: An End-to-End Multitask Learning Transformer. CVPR, 2022.

    Notes: at the heart of their approach is a shared attention mechanism modeling the dependencies across the tasks.

    Q: How should the reference task be selected if the surface normal task is not among the tasks? Can the reference task be chosen automatically?

  • [Medusa] Spencer, J., Bowden, R., & Hadfield, S. Medusa: Universal Feature Learning via Attentional Multitasking. CVPR workshop, 2022.

    Notes: shared feature attention (spatial attention) masks relevant backbone features for each task, allowing it to learn a generic representation; a novel Multi-Scale Attention head allows the network to better combine per-task features from different scales when making the final prediction.

  • [CA-MTL] Pilault, J., & Pal, C. Conditionally Adaptive Multi-Task Learning: Improving Transfer Learning in NLP Using Fewer Parameters & Less Data. ICLR, 2021.

    Notes: consists of a new conditional attention mechanism and a set of task-conditioned modules (conditional alignment and conditional layer normalization).

  • [HyperGrid Transformer] Tay, Y., Zhao, Z., Bahri, D., Metzler, D., & Juan, D. C. HyperGrid Transformers: Towards A Single Model for Multiple Tasks. ICLR, 2021.

    Notes: leverages Grid-wise Decomposable Hyper Projections (HyperGrid), a hypernetwork-based projection layer for task conditioned weight generation.

    Q: Is the task embedding a learnable vector or a fixed vector?

  • [CTN] Popovic, N., Paudel, D. P., Probst, T., Sun, G., & Van Gool, L. Compositetasking: Understanding images by spatial composition of tasks. CVPR, 2021.

    Notes: design a convolutional neural network that performs multiple pixel-wise tasks; every image is fed in along with a spatially distributed composition of task requests, so that pixel-specific tasks are executed; the proposed method for CompositeTasking learns via task-specific batch normalization.

  • [ATRC] Brüggemann, D., Kanakis, M., Obukhov, A., Georgoulis, S., & Van Gool, L. Exploring relational context for multi-task dense prediction. ICCV, 2021.

    Notes: different source-target task pairs benefit from different context types; in order to automate the selection process, they sample the pool of all available contexts (i.e., global, local, T-label, S-label, none) for each task pair using differentiable NAS techniques.

  • [TSNs] Sun, G., Probst, T., Paudel, D. P., Popović, N., Kanakis, M., Patel, J., ... & Van Gool, L. Task Switching Network for Multi-task Learning. ICCV, 2021.

    Notes: share all parameters among all tasks and do not require any task-specific modules (parameters); multiple tasks are performed by switching between them, performing one task at a time.

  • [HGNN] Guo, P., Deng, C., Xu, L., Huang, X., & Zhang, Y. Deep multi-task augmented feature learning via hierarchical graph neural network. ECML PKDD, 2021.

    Notes: HGNN consists of two levels of graph neural networks; at the low level, an intra-task GNN is responsible for learning a powerful representation for each data point in a task by aggregating its neighbors, and from the learned representations a task embedding is generated for each task in a way similar to max pooling; at the high level, an inter-task GNN updates the task embeddings of all tasks using an attention mechanism to model task relations; the task embedding of a task is then used to augment the feature representations of the data points in that task.

  • [AFANet] Cui, C., Shen, Z., Huang, J., Chen, M., Xu, M., Wang, M., & Yin, Y. Adaptive feature aggregation in deep multi-task convolutional neural networks. IEEE Transactions on Circuits and Systems for Video Technology, 2021.

    An extension of Deep adaptive feature aggregation in multi-task convolutional neural networks (CIKM, 2020).

  • [PSD] Zhou, L., Cui, Z., Xu, C., Zhang, Z., Wang, C., Zhang, T., & Yang, J. Pattern-structure diffusion for multi-task learning. CVPR, 2020.

    Notes: the motivation is that pattern structures recur frequently within a task and also across tasks; the method mines and propagates task-specific and cross-task pattern structures in the task-level space in two ways, i.e., intra-task and inter-task PSD.

  • [RCM] Kanakis, M., Bruggemann, D., Saha, S., Georgoulis, S., Obukhov, A., & Gool, L. V. Reparameterizing convolutions for incremental multi-task learning without task interference. ECCV, 2020.

    Notes: decompose each convolution into a shared part that acts as a filter bank encoding common knowledge, and task-specific modulators that adapt this common knowledge uniquely for each task; the shared part is not trainable to explicitly avoid negative transfer.

  • [MTI-Net] Vandenhende, S., Georgoulis, S., & Gool, L. V. Mti-net: Multi-scale task interaction networks for multi-task learning. ECCV, 2020.

    Notes: they show that two tasks with high affinity at a certain scale are not guaranteed to retain this behaviour at other scales, and vice versa.

  • [AM-CNN] Lyu, K., Li, Y., & Zhang, Z. Attention-aware multi-task convolutional neural networks. TIP, 2020.

    Notes: automatically learns appropriate sharing through end-to-end training; the attention mechanism (SE block) is introduced into their architecture to suppress redundant contents contained in the representations; the shortcut connection is adopted to preserve useful information.

    Q1: Which part of their model shows "automatically learns appropriate sharing"? Is it the shared bottleneck layer in the AM-CNN module?

    Q2: Can this method be combined with spatial attention to further improve performance?

    Q3: They use single-task networks trained on the corresponding tasks to initialize the network, which seems to require a lot of training time.

  • [PLE] Tang, H., Liu, J., Zhao, M., & Gong, X. Progressive layered extraction (ple): A novel multi-task learning (mtl) model for personalized recommendations. RecSys, 2020.

    Notes: separates shared components and task-specific components explicitly and adopts a progressive routing mechanism to gradually extract and separate deeper semantic knowledge.

  • [AFA] Shen, Z., Cui, C., Huang, J., Zong, J., Chen, M., & Yin, Y. Deep adaptive feature aggregation in multi-task convolutional neural networks. CIKM, 2020.

    Notes: a dynamic aggregation mechanism is designed to allow each task to adaptively determine the degree to which the feature aggregation of different tasks is needed according to the feature dependencies; the AFA consists of two main ingredients: the Channel-wise Aggregation Module (CAM) and Spatial-wise Aggregation Module (SAM).

  • [NDDR-CNN, NDDR-CNN-Shortcut] Gao, Y., Ma, J., Zhao, M., Liu, W., & Yuille, A. L. Nddr-cnn: Layerwise feature fusing in multi-task cnns by neural discriminative dimensionality reduction. CVPR, 2019.

    Motivation: Why would we assume that the low- and mid-level features for different tasks in MTL should be identical, especially when the tasks are loosely related? If not, is it optimal to share the features until the last convolutional layer?

    Notes: they hypothesize that these features, obtained from multiple feature descriptors (i.e., different CNN levels from multiple tasks), contain additional discriminative information about the input data, which should be exploited in MTL towards better performance; they concatenate all the task-specific features with the same spatial resolution from different tasks along the feature channel dimension and conduct discriminative dimensionality reduction on the concatenated features (see the sketch below).
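
    Sketch: a minimal PyTorch sketch of the layer-wise fusion idea; the module name NDDRLayer, the feature shapes, and the plain 1x1 convolutions are illustrative simplifications (the paper additionally prescribes specific initialization and weight decay for these layers).

    ```python
    import torch
    import torch.nn as nn

    class NDDRLayer(nn.Module):
        """Fuse same-resolution features from several tasks: concatenate along
        channels, then apply a learned 1x1 conv per task as discriminative
        dimensionality reduction back to the original channel count."""
        def __init__(self, num_tasks: int, channels: int):
            super().__init__()
            self.reduce = nn.ModuleList(
                nn.Conv2d(num_tasks * channels, channels, kernel_size=1)
                for _ in range(num_tasks))

        def forward(self, task_feats):                    # list of [B, C, H, W]
            fused = torch.cat(task_feats, dim=1)          # [B, T*C, H, W]
            return [conv(fused) for conv in self.reduce]  # one output per task

    # usage: two tasks exchanging 64-channel feature maps at one layer
    layer = NDDRLayer(num_tasks=2, channels=64)
    f1, f2 = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
    out1, out2 = layer([f1, f2])
    ```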

  • [MTAN, DWA] Liu, S., Johns, E., & Davison, A. J. End-to-end multi-task learning with attention. CVPR, 2019.

    Notes: the architecture consists of a single shared network containing a global feature pool, together with a soft-attention module for each task; Dynamic Weight Average (DWA) adapts the task weighting over time by considering the rate of change of each task's loss (see the sketch below).

    Q: Can we replace the $K$ task-specific attention networks with one task-specific attention network to further reduce the parameters? This task-specific attention network can receive the task embeddings and the shared features as input.
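
    Sketch: the DWA weighting rule described above, in NumPy; the temperature value and the equal-weight handling of the first two epochs are illustrative choices.

    ```python
    import numpy as np

    def dwa_weights(loss_history, temperature=2.0):
        """Dynamic Weight Average: loss_history is a list of per-epoch loss
        vectors (one value per task). Weights sum to the number of tasks K."""
        K = len(loss_history[-1])
        if len(loss_history) < 2:
            return np.ones(K)                 # equal weights for the first epochs
        r = np.asarray(loss_history[-1]) / np.asarray(loss_history[-2])
        e = np.exp(r / temperature)           # slower-improving tasks get larger weights
        return K * e / e.sum()

    # usage: weights for the next epoch, given the last two epochs' task losses
    print(dwa_weights([[1.0, 2.0], [0.8, 1.9]]))
    ```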

  • [PAP] Zhang, Z., Cui, Z., Xu, C., Yan, Y., Sebe, N., & Yang, J. Pattern-affinitive propagation across depth, surface normal and semantic segmentation. CVPR, 2019.

    Notes: the motivation comes from the statistical observation that pattern-affinitive pairs recur frequently across different tasks as well as within a task; conduct two types of propagation, cross-task propagation and task-specific propagation, to adaptively diffuse those similar patterns.

    Q: What is the finally learned ? It can represent the task relationship to some extent.

  • [ASTMT] Maninis, K. K., Radosavovic, I., & Kokkinos, I. Attentive single-tasking of multiple tasks. CVPR, 2019.

    Notes: a network is trained on multiple tasks, but performs one task at a time; use data-dependent modulation signals (task-specific SE block) that enhance or suppress neuronal activity in a task-specific manner; use task-specific Residual Adapter blocks that latch on to a larger architecture in order to extract task-specific information which is fused with the representations extracted by a generic backbone; reduce task interference by forcing the task gradients to be statistically indistinguishable through adversarial training, ensuring that the common backbone architecture serving all tasks is not dominated by any of the task-specific gradients.

  • [MRAN] Zhao, J., Du, B., Sun, L., Zhuang, F., Lv, W., & Xiong, H. Multiple relational attention network for multi-task learning. KDD, 2019.

    Notes: consists of three attention-based relationship learning modules: task-task relationship, feature-feature interaction, and task-feature dependence.

  • [MT-DNN] Liu, X., He, P., Chen, W., & Gao, J. Multi-task deep neural networks for natural language understanding. ACL, 2019.

    Notes: a hard parameter sharing architecture for NLP (a generic sketch of hard parameter sharing is given below).
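
    Sketch: generic hard parameter sharing (one shared encoder, one head per task); the MLP encoder and the dimensions are placeholders, whereas MT-DNN itself shares a BERT encoder below task-specific output layers.

    ```python
    import torch
    import torch.nn as nn

    class HardSharingModel(nn.Module):
        """Hard parameter sharing: a single shared encoder feeding one
        task-specific head per task."""
        def __init__(self, in_dim, hidden_dim, task_out_dims):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
            self.heads = nn.ModuleList(nn.Linear(hidden_dim, d) for d in task_out_dims)

        def forward(self, x, task_id):
            return self.heads[task_id](self.encoder(x))

    # usage: two tasks with 3-way and 2-way outputs
    model = HardSharingModel(in_dim=128, hidden_dim=256, task_out_dims=[3, 2])
    logits = model(torch.randn(4, 128), task_id=0)
    ```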

  • [Sluice] Ruder, S., Bingel, J., Augenstein, I., & Søgaard, A. Latent multi-task architecture learning. AAAI, 2019.

    Notes: learns (a) which layers to share between deep recurrent neural networks, (b) which parts of those layers to share and with what strength, and (c) a mixture model of skip connections at the architecture’s outer layer.

  • [HMTL] Sanh, V., Wolf, T., & Ruder, S. A hierarchical multi-task approach for learning embeddings from semantic tasks. AAAI, 2019.

    Notes: The model is trained in a hierarchical fashion to introduce an inductive bias by supervising a set of low level tasks at the bottom layers of the model and more complex tasks at the top layers of the model.

  • [PS-MCNN] Cao, J., Li, Y., & Zhang, Z. Partially shared multi-task convolutional neural network with local constraint for face attribute learning. CVPR, 2018.

    Notes: the key idea of PS-MCNN lies in sharing a common network among all the groups to learn shared features, and constructing a group-specific network for each group, from the beginning of the architecture to its end, to learn task-specific features; four Task Specific Networks (TSNets) and one Shared Network (SNet) are connected by Partially Shared (PS) structures to learn better shared and task-specific representations.

  • [PAD-Net] Xu, D., Ouyang, W., Wang, X., & Sebe, N. Pad-net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. CVPR, 2018.

    Notes: produces a set of intermediate auxiliary tasks providing rich multi-modal data for learning the target tasks; design and investigate three different multi-modal distillation modules for deep multi-modal data fusion.

  • [MMoE] Ma, J., Zhao, Z., Yi, X., Chen, J., Hong, L., & Chi, E. H. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. KDD, 2018.

    Notes: adapt the Mixture-of-Experts (MoE) structure to multi-task learning by sharing the expert submodels across all tasks, while also having a gating network trained to optimize each task; the gating networks take the input features and output softmax gates assembling the experts with different weights, allowing different tasks to utilize experts differently.

    Q1: How is the number of experts determined? Is it the same as the number of tasks?

    Q2: What is the architecture of each expert? Is each expert the same as the shared bottom? If so, the number of parameters in MMoE would become very large.

    Q3: Why is the shared-bottom method still inferior to MMoE on two identical tasks (Figure 4(c))? Does this mean that even for two fully related tasks we still cannot share all the parameters in the bottom layers?
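
    Sketch: a simplified MMoE with shared experts, one softmax gate per task, and one tower per task; the expert/tower architectures and sizes here are illustrative, not the paper's configuration.

    ```python
    import torch
    import torch.nn as nn

    class MMoE(nn.Module):
        """Multi-gate Mixture-of-Experts: every task mixes the shared experts
        with its own softmax gate, then applies its own tower."""
        def __init__(self, in_dim, expert_dim, num_experts, num_tasks, out_dim):
            super().__init__()
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(in_dim, expert_dim), nn.ReLU())
                for _ in range(num_experts))
            self.gates = nn.ModuleList(nn.Linear(in_dim, num_experts) for _ in range(num_tasks))
            self.towers = nn.ModuleList(nn.Linear(expert_dim, out_dim) for _ in range(num_tasks))

        def forward(self, x):
            expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # [B, E, D]
            outputs = []
            for gate, tower in zip(self.gates, self.towers):
                w = torch.softmax(gate(x), dim=-1).unsqueeze(-1)           # [B, E, 1]
                outputs.append(tower((w * expert_out).sum(dim=1)))         # [B, out_dim]
            return outputs

    # usage: two tasks sharing four experts
    y1, y2 = MMoE(in_dim=32, expert_dim=64, num_experts=4, num_tasks=2, out_dim=1)(torch.randn(8, 32))
    ```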

  • [UberNet] Kokkinos, I. Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. CVPR, 2017.

    Notes: an image pyramid is formed by successive down-sampling operations, and each image is processed by a CNN with tied weights.

  • [JMT] Hashimoto, K., Xiong, C., Tsuruoka, Y., & Socher, R. A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks. EMNLP, 2017.

    Notes: successively growing its depth to solve increasingly complex tasks.

  • [multinet] Bilen, H., & Vedaldi, A. Integrated perception with recurrent multi-task neural networks. NeurIPS, 2016.

    Notes: not only are deep image features shared between tasks, but tasks can also interact in a recurrent manner by encoding the results of their analysis in a common shared representation of the data.

  • [Cross-stitch] Misra, I., Shrivastava, A., Gupta, A., & Hebert, M. Cross-stitch networks for multi-task learning. CVPR, 2016.

    Notes: automatically learn an optimal combination of shared and task-specific representations (see the cross-stitch unit sketch below).
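
    Sketch: a cross-stitch unit for two tasks, i.e., a learnable 2x2 linear combination of same-layer activations; the initialization values are an illustrative choice.

    ```python
    import torch
    import torch.nn as nn

    class CrossStitchUnit(nn.Module):
        """Learnable linear combination of two tasks' activations at one layer."""
        def __init__(self):
            super().__init__()
            # start mostly task-specific: 0.9 on the own-task path, 0.1 cross-task
            self.alpha = nn.Parameter(torch.tensor([[0.9, 0.1],
                                                    [0.1, 0.9]]))

        def forward(self, x_a, x_b):
            out_a = self.alpha[0, 0] * x_a + self.alpha[0, 1] * x_b
            out_b = self.alpha[1, 0] * x_a + self.alpha[1, 1] * x_b
            return out_a, out_b

    # usage: stitch two tasks' activations of identical shape
    unit = CrossStitchUnit()
    ya, yb = unit(torch.randn(1, 64, 16, 16), torch.randn(1, 64, 16, 16))
    ```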

  • [MNC] Dai, J., He, K., & Sun, J. Instance-aware semantic segmentation via multi-task network cascades. CVPR, 2016.

    Notes: decompose the instance-aware semantic segmentation task into three different and related sub-tasks; expect that each sub-task is simpler than the original instance segmentation task, and is more easily addressed by convolutional networks; their network cascades have three stages, each of which addresses one sub-task.

  • Søgaard, A., & Goldberg, Y. Deep multi-task learning with low level tasks supervised at lower layers. ACL, 2016.

    Notes: supervision for different tasks can happen at different layers; different layers are used for different tasks.

  • [FCNN] Li, X., Zhao, L., Wei, L., Yang, M. H., Wu, F., Zhuang, Y., ... & Wang, J. Deepsaliency: Multi-task deep neural network model for salient object detection. TIP, 2016.

    Notes: carry out the task of saliency detection in conjunction with the task of object class segmentation, the two sharing a 15-layer convolutional part; the two networks performing the two tasks share features, forming a tree-structured network architecture.

Learning Architectures

Optimization for Multi-Task Learning

Loss Balancing

  • [RW] Lin, B., Ye, F., Zhang, Y., & Tsang, I. W. Reasonable Effectiveness of Random Weighting: A Litmus Test for Multi-Task Learning. TMLR, 2022.

    Notes: propose the Random Weighting (RW) methods, including Random Loss Weighting (RLW) and Random Gradient Weighting (RGW), where an MTL model is trained with random loss/gradient weights sampled from a distribution (see the RLW sketch below).
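
    Sketch: Random Loss Weighting with weights drawn from a standard normal distribution and normalized with a softmax, one of the sampling schemes considered in the paper; the per-task loss values below are stand-ins.

    ```python
    import torch

    def random_loss_weights(num_tasks):
        """RLW: sample a fresh weight vector at every training step."""
        return torch.softmax(torch.randn(num_tasks), dim=-1)

    # usage inside a training step: combine per-task losses with sampled weights
    task_losses = torch.tensor([0.7, 1.3, 0.4], requires_grad=True)  # stand-in losses
    total_loss = (random_loss_weights(3) * task_losses).sum()
    total_loss.backward()
    ```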

  • [Auto-λ] Liu, S., James, S., Davison, A. J., & Johns, E. Auto-Lambda: Disentangling Dynamic Task Relationships. TMLR, 2022.

    Notes: explore dynamic task relationships parameterised by task-specific weightings; the design of Auto-λ borrows the concept of lookahead methods from the meta-learning literature, updating parameters at the current state of learning based on the observed effect of those parameters on a future state; formulates a bi-level optimisation problem.

    Finding: Task relationships are consistent. The learned weightings with both the NYUv2 and CityScapes datasets are nearly identical, given the same optimisation strategies, independent of the network architectures; Task relationships are asymmetric; Task relationships are dynamic.

    Q: Why "high weightings emerge when tasks are strongly related"? Can the loss weighting be used to represent the task relationships?

  • [MOML] Ye, F., Lin, B., Yue, Z., Guo, P., Xiao, Q., & Zhang, Y. Multi-objective meta learning. NeurIPS, 2021.

    Notes: MOML formulates the objective function of meta learning with multiple objectives as a Multi-Objective Bi-Level optimization Problem (MOBLP) where the upper-level subproblem is to solve several possibly conflicting objectives for the meta learner; formulate the optimization problem in MTL as an MOBLP based on the split of the entire training dataset and solve this problem based on the MOML framework.

  • [LSB] Lee, J. H., Lee, C., & Kim, C. S. Learning Multiple Pixelwise Tasks Based on Loss Scale Balancing. ICCV, 2021.

    Notes: dynamically adjusts the linear weights to learn all tasks effectively by balancing the loss scale (the product of the loss value and its weight) periodically.

    Q: Is the 3rd period useful? The 2nd period already ensures that all tasks are learned at a similar pace, which means the difficulties of all tasks are equal according to the measurement in Eq. (9).

  • [MultiNet++, GLS] Chennupati, S., Sistu, G., Yogamani, S., & A Rawashdeh, S. Multinet++: Multi-stream feature aggregation and geometric loss strategy for multi-task learning. CVPR workshop, 2019.

    Notes: they propose a multi-stream multi-task network to take advantage of feature representations from preceding frames in a video sequence for joint learning of segmentation, depth, and motion; they express the total loss of a multi-task learning problem as the geometric mean of the individual task losses, which they refer to as the Geometric Loss Strategy (GLS) (see the sketch below).
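
    Sketch: the Geometric Loss Strategy, i.e., the total loss as the geometric mean of the per-task losses, computed in log space for numerical stability.

    ```python
    import torch

    def geometric_loss(task_losses):
        """GLS: total loss = (L_1 * L_2 * ... * L_n) ** (1 / n)."""
        losses = torch.stack(task_losses)
        return torch.exp(torch.log(losses).mean())

    # usage with two stand-in task losses
    total = geometric_loss([torch.tensor(0.5, requires_grad=True),
                            torch.tensor(2.0, requires_grad=True)])
    total.backward()
    ```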

  • [Uncertainty] Kendall, A., Gal, Y., & Cipolla, R. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. CVPR, 2018.

    Notes: weighs multiple loss functions by considering the homoscedastic uncertainty of each task.

    Q1: Their results show that there is usually not a single optimal weighting for all tasks. So what is the optimal weighting? Is multi-task learning an ill-posed optimisation problem without a single higher-level goal?

    Q2: Why do the semantics and depth tasks outperform the semantics and instance tasks? Clearly the three tasks explored in this paper are complementary and useful for learning a rich representation of the scene, but can we quantify the relationships between tasks?

    Q3: Where is the optimal location for splitting the shared encoder network into separate decoders for each task? And what network depth is best for the shared multi-task representation?
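
    Sketch: homoscedastic-uncertainty weighting in the commonly used log-variance form, sum_i exp(-s_i) * L_i + s_i with learnable s_i; the paper derives slightly different constant factors for regression and classification likelihoods, so treat this as an approximation.

    ```python
    import torch
    import torch.nn as nn

    class UncertaintyWeighting(nn.Module):
        """Learnable log-variances s_i weigh the task losses; the + s_i term
        keeps the variances from growing unboundedly."""
        def __init__(self, num_tasks):
            super().__init__()
            self.log_vars = nn.Parameter(torch.zeros(num_tasks))

        def forward(self, task_losses):
            losses = torch.stack(task_losses)
            return (torch.exp(-self.log_vars) * losses + self.log_vars).sum()

    # usage with two stand-in task losses
    weighting = UncertaintyWeighting(num_tasks=2)
    total = weighting([torch.tensor(1.2, requires_grad=True),
                       torch.tensor(0.3, requires_grad=True)])
    total.backward()
    ```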

  • [Dynamic] Guo, M., Haque, A., Huang, D. A., Yeung, S., & Fei-Fei, L. Dynamic task prioritization for multitask learning. ECCV, 2018.

    Notes: automatically prioritize more difficult tasks by adaptively adjusting the mixing weight of each task’s loss objective.

  • [SPMTL] Li, C., Yan, J., Wei, F., Dong, W., Liu, Q., & Zha, H. Self-paced multi-task learning. AAAI, 2017.

    Notes: learn the multi-task model in a self-paced regime, from easy instances and tasks to hard instances and tasks.

  • [spMTL] Murugesan, K., & Carbonell, J. Self-paced multitask learning with shared knowledge. IJCAI, 2017.

    Notes: starts with an easier set of tasks and gradually introduces more difficult ones to build up the shared knowledge base.

  • [AW] He, K., Wang, Z., Fu, Y., Feng, R., Jiang, Y. G., & Xue, X. Adaptively weighted multi-task deep network for person attribute classification. MM, 2017.

    Notes: propose an adaptively weighted multi-task deep framework to jointly learn multiple person attributes, and a validation loss trend algorithm to automatically update the weights of the weighted loss layer; the weights change dynamically during training according to the generalization ability of each task learner, which is approximately measured on the validation set; the model of a task with lower generalization ability should be assigned a higher weight than the models of the other tasks.

  • [SMTL, OSMTL] Murugesan, K., Liu, H., Carbonell, J., & Yang, Y. Adaptive smoothed online multi-task learning. NeurIPS, 2016.

    Notes: efficiently learns multiple related tasks by estimating the task relationship matrix from the data; it could perhaps be formulated as an end-to-end training procedure.

Gradient Balancing

  • [FAMO] Liu, B., Feng, Y., Stone, P., & Liu, Q. FAMO: Fast Adaptive Multitask Optimization. NeurIPS, 2023.

    Notes: decreases task losses in a balanced way using $\mathcal{O}(1)$ space and time.

  • [MoCo] Fernando, H. D., Shen, H., Liu, M., Chaudhury, S., Murugesan, K., & Chen, T. Mitigating Gradient Bias in Multi-objective Learning: A Provably Convergent Approach. ICLR, 2023.

    Notes: develop a stochastic multi-objective gradient correction (MoCo) method for multi-objective optimization; it addresses the convergence issues of stochastic MGDA and provably converges to a Pareto stationary point under several stochastic MOO settings.

  • [Nash-MTL] Navon, A., Shamsian, A., Achituve, I., Maron, H., Kawaguchi, K., Chechik, G., & Fetaya, E. Multi-task learning as a bargaining game. ICML, 2022.

    Notes: frame the gradient combination step in MTL as a bargaining game and use the Nash bargaining solution to find the optimal update direction.

  • [Seq.Reptile] Lee, S., Lee, H., Lee, J., & Hwang, S. J. Sequential Reptile: Inter-Task Gradient Alignment for Multilingual Learning. ICLR, 2022.

    Notes: they want their model to maximally retain the knowledge of the pretrained model by finding a good trade-off between minimizing the downstream MTL loss and maximizing the cosine similarity between the task gradients; in order to consider gradient alignment across tasks as well, they propose to let the inner-learning trajectory consist of mini-batches randomly sampled from all tasks, which they call Sequential Reptile.

    Q: Where do they demonstrate that "it is crucial for those tasks to align gradients between them in order to maximize knowledge transfer while minimizing negative transfer"? In other words, is it helpful for model learning to increase the cosine similarity between the task gradients?

  • [RotoGrad] Javaloy, A., & Valera, I. RotoGrad: Gradient Homogenization in Multitask Learning. ICLR, 2022.

    Notes: jointly homogenize gradient magnitudes and directions; address the magnitude differences by re-weighting task gradients at each step of the learning, while encouraging learning those tasks that have converged the least thus far; address the conflicting directions by smoothly rotating the shared feature space differently for each task, seamlessly aligning gradients in the long run.

  • [CAGrad] Liu, B., Liu, X., Jin, X., Stone, P., & Liu, Q. Conflict-averse gradient descent for multi-task learning. NeurIPS, 2021.

    Notes: minimizes the average loss function, while leveraging the worst local improvement of individual tasks to regularize the algorithm trajectory.

    Q: Will the task with the largest loss dominate the update, since Eq. (2) is based on the absolute value of the loss?

  • [IMTL, IMTL-G, IMTL-L] Liu, L., Li, Y., Kuang, Z., Xue, J. H., Chen, Y., Yang, W., ... & Zhang, W. Towards Impartial Multi-task Learning. ICLR, 2021.

    Notes: IMTL-G(rad), learn the scaling factors such that the aggregated gradient of task-shared parameters has equal projections onto the raw gradients of individual tasks; IMTL-L(oss), automatically learn a loss weighting parameter for each task so that the weighted losses have comparable scales and the effect of different loss scales from various tasks can be canceled-out.

    Q1: What is the benefit of "the aggregated gradient of task-shared parameters has equal projections onto the raw gradients of individual tasks"? In other words, why do this?

    Q2: Can we use different learning rates for different tasks to achieve the same goal as IMTL-G (treating all tasks equally so that they progress at the same speed and none is left behind)?

  • [GradVac] Wang, Z., Tsvetkov, Y., Firat, O., & Cao, Y. Gradient vaccine: Investigating and improving multi-task optimization in massively multilingual models. ICLR, 2021.

    Notes: relaxes the assumption of PCGrad (that any two tasks must have the same gradient similarity objective of zero) by setting adaptive gradient similarity objectives; operates on both negative and positive gradient similarity.

    Q: Is it necessary to operate on the positive gradient similarity?

  • [MTAdam] Malkiel, I., & Wolf, L. MTAdam: Automatic Balancing of Multiple Training Loss Terms. EMNLP, 2021.

    Notes: generalize the Adam optimization algorithm to handle multiple loss terms, with the guiding principle that for every layer the gradient magnitudes of the terms should be balanced; MTAdam computes the derivative of each loss term separately, infers the first and second moments per parameter and loss term, and calculates a first moment for the per-layer magnitude of the gradients arising from each loss; this magnitude is used to continuously balance the gradients across all layers, in a manner that both varies from one layer to the next and changes dynamically over time.

  • [PCGrad] Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K., & Finn, C. Gradient surgery for multi-task learning. NeurIPS, 2020.

    Notes: projects a task’s gradient onto the normal plane of the gradient of any other task that has a conflicting gradient.

    Q1: Does Theorem 2 in their paper mean that PCGrad only works under those three conditions? What if the three conditions are not satisfied?

    Q2: Is the tragic triad of multi-task learning a major factor in making optimization for multi-task learning challenging? They do not seem to answer this question in their experiments.
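
    Sketch: the PCGrad projection over flattened per-task gradient vectors; collecting the per-task gradients from the model and writing the combined gradient back are omitted.

    ```python
    import random
    import torch

    def pcgrad(task_grads):
        """For each task gradient, project out the component that conflicts
        (negative dot product) with every other task's gradient, in random
        order, then sum the projected gradients."""
        projected = [g.clone() for g in task_grads]
        for i, g_i in enumerate(projected):
            others = [j for j in range(len(task_grads)) if j != i]
            random.shuffle(others)
            for j in others:
                g_j = task_grads[j]
                dot = torch.dot(g_i, g_j)
                if dot < 0:                              # conflicting gradient
                    g_i -= dot / g_j.norm() ** 2 * g_j   # remove the conflicting component
        return torch.stack(projected).sum(dim=0)

    # usage with two random stand-in gradients
    combined = pcgrad([torch.randn(10), torch.randn(10)])
    ```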

  • [GradDrop] Chen, Z., Ngiam, J., Huang, Y., Luong, T., Kretzschmar, H., Chai, Y., & Anguelov, D. Just pick a sign: Optimizing deep multitask models with gradient sign dropout. NeurIPS, 2020.

    Notes: select one sign (positive or negative) based on the distribution of gradient values, and mask out all gradient values of the opposite sign.

    Q: Regarding the formulation of the Gradient Positive Sign Purity, what if the gradients are not of the same order of magnitude? The largest gradient may dominate the sign.
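
    Sketch: a simplified per-parameter version of the GradDrop rule; the paper applies the operation to gradients at an activation layer and includes a leak parameter, both of which are omitted here.

    ```python
    import torch

    def grad_drop(task_grads):
        """Compute the positive-sign purity P = 0.5 * (1 + sum_i g_i / sum_i |g_i|)
        per entry, sample which sign to keep with probability P, and zero out
        gradient entries of the opposite sign before summing."""
        G = torch.stack(task_grads)                              # [T, P]
        purity = 0.5 * (1 + G.sum(0) / (G.abs().sum(0) + 1e-8))  # in [0, 1]
        keep_positive = torch.rand_like(purity) < purity         # sampled sign choice
        mask = torch.where(keep_positive, G > 0, G < 0)
        return (G * mask).sum(0)

    # usage with three random stand-in gradients
    combined = grad_drop([torch.randn(10), torch.randn(10), torch.randn(10)])
    ```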

  • [GradNorm] Chen, Z., Badrinarayanan, V., Lee, C. Y., & Rabinovich, A. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. ICML, 2018.

    Notes: automatically balances training in deep multitask models by dynamically tuning gradient magnitudes; establish a common scale for gradient magnitudes, and balance training rates of different tasks.

    Q: Is it necessary to bring the gradient magnitudes of different tasks to a common scale? Why?
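
    Sketch: the auxiliary GradNorm objective that pulls each task's weighted gradient norm toward the mean norm scaled by the task's relative inverse training rate; computing the gradient norms and training rates from the actual model is left out, and alpha is an illustrative setting.

    ```python
    import torch

    def gradnorm_aux_loss(task_weights, grad_norms, inv_train_rates, alpha=1.5):
        """task_weights: learnable per-task loss weights w_i.
        grad_norms: ||grad of L_i w.r.t. the shared layer|| for each task.
        inv_train_rates: relative inverse training rates r_i (slower tasks > 1)."""
        G = task_weights * grad_norms                            # weighted gradient norms
        target = (G.mean() * inv_train_rates ** alpha).detach()  # common scale, treated as constant
        return (G - target).abs().sum()

    # usage: the gradient of `aux` w.r.t. `w` is used to update the task weights
    w = torch.ones(3, requires_grad=True)
    aux = gradnorm_aux_loss(w, grad_norms=torch.tensor([1.0, 2.0, 0.5]),
                            inv_train_rates=torch.tensor([1.1, 0.9, 1.0]))
    aux.backward()
    ```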

  • [MGDA-UB] Sener, O., & Koltun, V. Multi-task learning as multi-objective optimization. NeurIPS, 2018.

    Notes: cast multi-task learning as multi-objective optimization; provide an upper bound for the multiple-gradient descent algorithm (MGDA) optimization objective and show that it can be computed via a single backward pass.

    Q: Why can the derivatives with respect to the representations be computed in a single backward pass, while the derivatives with respect to the shared parameters require one backward pass per task?
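
    Sketch: the closed-form min-norm solution for two tasks, which the multiple-gradient descent algorithm uses as a building block (the Frank-Wolfe-based solver for more than two tasks and the representation-gradient upper bound are omitted).

    ```python
    import torch

    def min_norm_direction_two_tasks(g1, g2):
        """Find gamma in [0, 1] minimizing ||gamma * g1 + (1 - gamma) * g2||^2
        and return the resulting common descent direction."""
        diff = g1 - g2
        gamma = torch.dot(g2 - g1, g2) / (diff.norm() ** 2 + 1e-12)
        gamma = gamma.clamp(0.0, 1.0)
        return gamma * g1 + (1.0 - gamma) * g2

    # usage with two random stand-in gradients
    direction = min_norm_direction_two_tasks(torch.randn(10), torch.randn(10))
    ```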

  • [Modulation] Zhao, X., Li, H., Shen, X., Liang, X., & Wu, Y. A modulation module for multi-task learning with applications in image retrieval. ECCV, 2018.

    Notes: propose a general modulation module, which can be inserted into any convolutional neural network architecture, to encourage the coupling and feature sharing of relevant tasks while disentangling the learning of irrelevant tasks with minor parameters addition; equipped with this module, gradient directions from different tasks can be enforced to be consistent for those shared parameters.

    Q1: Is it harmful to force the gradient directions of irrelevant tasks to be consistent on the shared parameters? Since the objective of this work is to make the gradient directions of different tasks consistent, would it be worse if the gradient directions of irrelevant tasks were forced to be consistent?

    Q2: Why does "Since the task-specific masks/projection matrices are learnable, we observe that the training process will naturally mitigate the destructive interference by reducing the average across-task gradient angles" hold? Which term in the loss function guides this phenomenon?

    Q3: They simply apply a channel-wise scaling vector in their method; what would happen if both channel attention and spatial attention were used?

Others

  • [SRDML] Bai, G., & Zhao, L. Saliency-Regularized Deep Multi-Task Learning. KDD, 2022.

    Notes: model the task relation as the similarity between tasks’ input gradients.

    Q1: Why can tasks’ input gradients be regarded as a measure of task relations?

    Q2: It seems the method can only be applied to classification tasks.

  • [TAG] Fifty, C., Amid, E., Zhao, Z., Yu, T., Anil, R., & Finn, C. Efficiently identifying task groupings for multi-task learning. NeurIPS, 2021.

    Notes: suggest a measure of inter-task affinity that can be used to systematically and efficiently determine task groupings for multi-task learning; inter-task affinity is measured by training all tasks together in a single multi-task network and quantifying the extent to which one task’s gradient update would affect another task’s loss.

    Finding: the relationships among tasks change throughout training as measured by inter-task affinity (perhaps because different data are seen at different training steps); how tasks should be trained together does not simply depend on the relationships among the tasks, but also on detailed aspects of the model and training.

    Q: "Approximating higher-order affinity scores for each network consisting of three or more tasks" is time-consuming, especially when the number of tasks is large.

  • [DSelect-k] Hazimeh, H., Zhao, Z., Chowdhery, A., Sathiamoorthy, M., Chen, Y., Mazumder, R., ... & Chi, E. Dselect-k: Differentiable selection in the mixture of experts with applications to multi-task learning. NeurIPS, 2021.

    Notes: a continuously differentiable and sparse gate for MoE, based on a novel binary encoding formulation.

  • [GTTN] Zhang, Y., Zhang, Y., & Wang, W. Multi-task learning via generalized tensor trace norm. KDD, 2021.

    Notes: the generalized tensor trace norm exploits all possible tensor flattenings and is defined as a convex combination of the matrix trace norms of all possible tensor flattenings.

  • [TRL] Strezoski, G., Noord, N. V., & Worring, M. Many task learning with task routing. ICCV, 2019.

    Notes: introduce a task-routing mechanism allowing tasks to have separate in-model data flows; apply a channel-wise task-specific binary mask over the convolutional activations, the masks are generated randomly and kept constant.

    Q: What if the randomly generated masks are not suitable? Would it be better to make the masks learnable parameters?
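
    Sketch: a task-routing layer that applies a fixed, randomly generated channel-wise binary mask per task to the activations; keep_ratio is an illustrative name for the sharing-ratio knob.

    ```python
    import torch
    import torch.nn as nn

    class TaskRoutingLayer(nn.Module):
        """Per-task channel masks are sampled once at construction time and kept
        constant (registered as a buffer, so they are never trained)."""
        def __init__(self, num_tasks, channels, keep_ratio=0.6):
            super().__init__()
            masks = (torch.rand(num_tasks, channels) < keep_ratio).float()
            self.register_buffer("masks", masks)

        def forward(self, x, task_id):                      # x: [B, C, H, W]
            return x * self.masks[task_id].view(1, -1, 1, 1)

    # usage: route 64-channel activations for task 3 out of 5 tasks
    layer = TaskRoutingLayer(num_tasks=5, channels=64)
    out = layer(torch.randn(2, 64, 16, 16), task_id=3)
    ```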

  • [RCMTL] Yao, Y., Cao, J., & Chen, H. Robust task grouping with representative tasks for clustered multi-task learning. KDD, 2019.

    Notes: select a subset of tasks that can represent the whole task set; all tasks can then be clustered into different groups based on these representative tasks.

    ⚠️ Very similar to Flexible clustered multi-task learning by learning representative tasks. TPAMI, 2016.

  • [AMTFL, Deep-AMTFL] Lee, H. B., Yang, E., & Hwang, S. J. Deep asymmetric multi-task feature learning. ICML, 2018.

    Notes: introduce an asymmetric autoencoder term that allows reliable predictors for the easy tasks to have high contribution to the feature learning while suppressing the influences of unreliable predictors for more difficult tasks; the reconstruction is done in the feature space and in a non-linear manner.

  • [MRN] Long, M., Cao, Z., Wang, J., & Yu, P. S. Learning multiple tasks with multilinear relationship networks. NeurIPS, 2017.

    Notes: mainly based on tensor normal distributions; jointly learning transferable features and multilinear relationships, alleviate the dilemma of negative-transfer in feature layers and under-transfer in classifier layer.

  • [DMTRL] Yang, Y., & Hospedales, T. M. Deep Multi-task Representation Learning: A Tensor Factorisation Approach. ICLR, 2017.

    Notes: generalise matrix factorisation-based multi-task ideas to tensor factorisation; Tucker Decomposition; Tensor Train (TT) Decomposition.

  • [TNRDMTL] Yang, Y., & Hospedales, T. M. Trace Norm Regularised Deep Multi-Task Learning. ICLR workshop, 2017.

    Notes: the parameters from all models are regularised by the tensor trace norm.

  • [CILICIA] Sarafianos, N., Giannakopoulos, T., Nikou, C., & Kakadiaris, I. A. Curriculum learning for multi-task classification of visual attributes. ICCV workshop, 2017.

    Notes: individual tasks are grouped based on their correlation so that two groups of strongly and weakly correlated tasks are formed; the two groups of tasks are learned in a curriculum learning setup by transferring the acquired knowledge from the strongly to the weakly correlated; the learning process within each group is performed in a multi-task classification setup.

  • [HC-MTL] Liu, A. A., Su, Y. T., Nie, W. Z., & Kankanhalli, M. Hierarchical clustering multi-task learning for joint human action grouping and recognition. TPAMI, 2017.

    Notes: formulate the objective function with two latent variables, model parameters and grouping information, for joint optimization; decompose it into two sub-tasks, multi-task learning and task relatedness discovery, and iteratively solve the two sub-tasks.

  • [HD-MTL] Fan, J., Zhao, T., Kuang, Z., Zheng, Y., Zhang, J., Yu, J., & Peng, J. HD-MTL: Hierarchical deep multi-task learning for large-scale visual recognition. TIP, 2017.

    Notes: multiple sets of deep features are extracted from the different layers; a visual tree is learned by assigning the visually-similar atomic object classes with similar learning complexities into the same group, it can provide a good environment for identifying the inter-related learning tasks automatically.

  • [AMTL] Lee, G., Yang, E., & Hwang, S. Asymmetric multi-task learning based on task relatedness and loss. ICML, 2016.

    Notes: avoids the negative transfer problem; allows for asymmetric information transfer between tasks.

    Assumption: each underlying model parameter is succinctly represented by the linear combination of other parameters.

  • [MITL] Lin, K., Xu, J., Baytas, I. M., Ji, S., & Zhou, J. Multi-task feature interaction learning. KDD, 2016.

    Notes: exploit task relatedness in the form of shared representations in both the original input space and the interaction space among features.

  • [FCMTL] Zhou, Q., & Zhao, Q. Flexible clustered multi-task learning by learning representative tasks. TPAMI, 2016.

    Notes: a subset of tasks (representative tasks) in multi-task learning can be used to represent the other tasks due to the similarity among multiple tasks; one task is allowed to be assigned to multiple clusters with different weights.


Please create an issue or contact 12032913@mail.sustech.edu.cn if you find we have missed any papers.
