Multitask models in DeepPavlov Dream

dimakarp1996 edited this page Jul 8, 2023 · 1 revision

Introduction

This page describes the multitask English model used in the DeepPavlov Dream dialogue system to solve several classification tasks at once.

Specifically, the model is used to solve the following tasks:

  • Emotion classification
  • Sentiment classification
  • Toxicity classification
  • Factoid classification
  • MIDAS intent classification
  • DeepPavlov topic classification
  • CoBot topic classification
  • CoBot DialogAct topic classification
  • Cobot DialogAct intent classification

The full list of classes for all these tasks can be found here. Such models save computational power by utilizing a single backbone model instead of many separate ones.

How does the model work?

The model uses plain linear layers on top of the distilbert-base-cased backbone. It is as simple as that.
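A minimal sketch of this design, using NumPy with a random stand-in for the backbone (in the real model the backbone is distilbert-base-cased producing a text representation; the task names and class counts below are illustrative, not the full nine-task list):

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 768  # hidden size of distilbert-base

def backbone(texts):
    # Hypothetical stand-in for the shared transformer backbone:
    # returns one HIDDEN-dimensional vector per input text.
    return rng.standard_normal((len(texts), HIDDEN))

# One plain linear head per task, all sharing the same backbone output.
TASKS = {"emotion": 7, "sentiment": 3, "toxicity": 8, "factoid": 2}
heads = {name: (rng.standard_normal((HIDDEN, n)) * 0.01, np.zeros(n))
         for name, n in TASKS.items()}

def predict(texts):
    h = backbone(texts)  # shared encoding, computed once for all tasks
    out = {}
    for name, (W, b) in heads.items():
        logits = h @ W + b
        # numerically stable softmax over each task's classes
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        out[name] = e / e.sum(axis=-1, keepdims=True)
    return out

probs = predict(["I love this movie!"])  # dict: task name -> class probabilities
```

The key point of the architecture is that the expensive backbone pass is shared: adding another task only adds one cheap linear layer.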

This architecture is also explained in the DeepPavlov manual. The model has been supported since DeepPavlov version 1.1.1.

Another manual explaining how to use this model is available here.

The model's performance is also examined in this article.

How can the model be called?

This model uses port 8087. The name of the annotator is combined_classification. The model can be called either as an annotator (at the url http://0.0.0.0:8087/model) or as a postannotator (at the url http://0.0.0.0:8087/batch_model).

Examples of external calls to this model can be found here.
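A sketch of such a call is shown below. The `"sentences"` key in the request body is an assumption based on common Dream annotator schemas; check the service's server code for the exact field name.

```python
import json

# Endpoints from the text above; the service listens on port 8087.
ANNOTATOR_URL = "http://0.0.0.0:8087/model"            # annotator mode
POSTANNOTATOR_URL = "http://0.0.0.0:8087/batch_model"  # postannotator mode

def build_payload(utterances):
    # Assumed request schema: a JSON body with a "sentences" key holding
    # the list of utterances to classify.
    return {"sentences": list(utterances)}

# With the service running, a call could look like this (requires `requests`):
# import requests
# response = requests.post(ANNOTATOR_URL, json=build_payload(["I love this movie!"]))
# print(response.json())  # per-task class probabilities

print(json.dumps(build_payload(["I love this movie!"])))
```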

What are the training data?

In DREAM, the new multitask “9 in 1” model was trained on the following single-label datasets:

| Task | Classes | Dataset | Training samples | Notes |
|---|---|---|---|---|
| Sentiment classification | 3 | DynaSent (r1+r2) | 94k | Previous multitask models used SST (8k samples), which led to overfitting. |
| Factoid classification | 2 | YAHOO | 3.6k | |
| Emotion classification | 7 | go_emotions | 42k | Emotions in this dataset were grouped into Ekman classes; only single-label samples were used. |
| MIDAS intent classification | 15 | MIDAS | ~9k | The head is single-label; only semantic classes were used, as in DREAM now. |
| DeepPavlov topic classification | 25 | DeepPavlov Topics | 1.8m | Class names for this classifier and for the CoBot replacement classifiers are different, so special functions were added for every such topic to support this difference. (Example) |
| Toxicity classification | 8 | Kaggle dataset | 170k | To make the classifier single-label, a non_toxic class was added with probability (1 − sum of the toxic probabilities); all probabilities were then normalized, and the class with the maximal probability was selected. |
| CoBot topics | 22 | private DREAM-2 dataset | 216k | The most frequent “garbage” class (Phatic) was excluded, and all multi-label examples were converted to the single-label format. These measures made the model less likely to overfit on “garbage” classes, improving its quality on real-world data. |
| CoBot dialogact topics | 11 | private DREAM-2 dataset | 127k | The most frequent “garbage” class (Other) was excluded, all multi-label examples were converted to the single-label format, and history support was removed. These measures made the model less likely to overfit on “garbage” classes, improving its quality on real-world data. |
| CoBot dialogact intents | 11 | private DREAM-2 dataset | 316k | All multi-label examples were converted to the single-label format, and history support was removed. These measures made the model less likely to overfit on “garbage” classes, improving its quality on real-world data. |
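The single-label conversion of the toxicity data can be sketched as follows (the input probabilities here are made up for illustration):

```python
import numpy as np

# Hypothetical multilabel probabilities for the 7 toxic classes of the
# Kaggle dataset (values are illustrative only).
toxic_probs = np.array([0.10, 0.05, 0.02, 0.01, 0.00, 0.00, 0.00])

# Add an 8th non_toxic class equal to (1 - sum of the toxic probabilities) ...
non_toxic = 1.0 - toxic_probs.sum()
full = np.append(toxic_probs, non_toxic)

# ... then normalize and select the class with the maximal probability.
full = full / full.sum()
label = int(full.argmax())  # here non_toxic (index 7) dominates with ~0.82
```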

We also tried architectural improvements: using task-specific tokens concatenated with the CLS token, or instead of it, for classification. This was not successful. However, increasing the batch size from 32 to 640 (for the distilled model) or 320 (for the ordinary model) did yield an improvement.

What are the metrics?

We measured the metrics of the multitask model for bert-base-uncased and distilbert-base-uncased. For the latter model, we explored using MIDAS with and without dialogue history, and settled on MIDAS without history to additionally cut down on computation time.

Training the model without history from scratch barely affected performance and paradoxically even yielded some improvement for MIDAS (setting 3 vs setting 2). It also allowed us to halve the maximum sequence length and cache only one model prediction, which decreased prediction time from 0.73 sec to 0.55 sec.

The settings compared below are:

1. Singletask, distilbert-base-uncased, batch 640 (MIDAS trained with history)
2. Multitask, distilbert-base-uncased, batch 640 (MIDAS trained with history)
3. Multitask, distilbert-base-uncased, batch 640, all tasks trained without history
4. Singletask, bert-base-uncased, batch 320 (MIDAS trained with history)
5. Multitask, bert-base-uncased, batch 320 (MIDAS trained with history)

| Task | Dataset modification | Train size | Setting 1 | Setting 2 | Setting 3 | Setting 4 | Setting 5 |
|---|---|---|---|---|---|---|---|
| Emotion classification (go_emotions) | converted to multi-class | 39.5k | 70.47/70.30 | 68.18/67.86 | 67.59/67.32 | 71.48/71.16 | 67.27/67.23 |
| Toxic classification (Kaggle) | converted to single-label | 1.62m | 94.53/93.64 | 93.84/93.5 | 93.86/93.41 | 94.54/93.15 | 93.94/93.4 |
| Sentiment classification (DynaSent, v1+v2) | no | 94k | 74.75/74.63 | 72.55/72.21 | 72.22/71.9 | 75.95/75.88 | 75.65/75.62 |
| Factoid classification (Yahoo) | no | 3.6k | 81.69/81.66 | 81.02/81.07 | 80.0/79.86 | 84.41/84.44 | 80.34/80.09 |
| MIDAS classification | only semantic classes | 7.1k | 80.53/79.81 (with history) | 72.73/71.56 (with history), 62.26/60.68 (without history) | 73.69/73.26 (without history) | 82.3/82.03 (with history) | 77.01/76.38 (with history) |
| DeepPavlov Topics classification | no | 1.8m | 87.48/87.43 | 86.98/86.9 | 87.01/87.05 | 88.09/88.1 | 87.43/87.47 |
| CoBot topics classification | converted to single-label, no history, removed widespread garbage class Phatic | 216k | 79.88/79.9 | 77.31/77.36 | 77.45/77.35 | 80.68/80.67 | 78.21/78.22 |
| CoBot dialogact topics classification | converted to single-label, no history, removed widespread garbage class Other | 127k | 76.81/76.71 | 76.92/76.79 | 76.8/76.7 | 77.02/76.97 | 76.86/76.74 |
| CoBot dialogact intents classification | converted to single-label, no history | 318k | 77.07/77.7 | 76.83/76.76 | 76.65/76.57 | 77.28/77.72 | 76.96/76.89 |
| Total (9in1) | | 4218k | 80.36/80.20 | 78.48/78.22 | 78.36/78.15 | 81.31/81.12 | 79.3/79.11 |
| GPU memory used, Mb | | | 2418×9 = 21762 | 2420 | 2420 | 3499×9 = 31491 | 3501 |
| Test inference time, sec (for the tests) | | | | 0.76 | 0.55 | | ~1.33 |

To achieve the best trade-off between memory use, inference time, and test metrics, the model from Setting 3 (multitask, distilbert-base-uncased, batch 640, trained without history) is used in DeepPavlov Dream.

How much memory does this model save?

If absolutely all models were treated as singletask, the 6 models being replaced by the current combined classifier in DREAM (emotion, toxicity, sentiment, and 3 CoBot models) would take ~3500 × 6 ≈ 21000 Mb of GPU memory. The MIDAS classifier would take ~3500 Mb of GPU memory, as it does in the current dev. We do not count the topic classifier, since it is unclear which kind of singletask topic classifier we would have used; a distilbert-like topic classifier would take ~2418 Mb of GPU memory. In that case, replacements of all singletask models would take ~27000 Mb of GPU memory in total. Compared to this setting, our multitask model gives ~91% GPU memory savings.
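The ~91% figure can be checked with simple arithmetic, using the approximate per-model numbers quoted above:

```python
# Back-of-the-envelope GPU memory estimate for the all-singletask scenario.
replaced = 3500 * 6   # emotion, toxicity, sentiment + 3 CoBot models, Mb
midas = 3500          # MIDAS classifier, Mb
topic = 2418          # hypothetical distilbert-like topic classifier, Mb
singletask_total = replaced + midas + topic  # ~27000 Mb

multitask = 2420      # combined classifier (Setting 3), Mb
saving = 1 - multitask / singletask_total
print(f"{saving:.0%}")  # prints "91%"
```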

This saving comes from replacing many BERTs with one BERT, and also from the transformer-agnostic design that made it easy to swap in distilbert-base-uncased instead of bert-base-uncased.

CPU memory use for the multitask model is 2909 Mb. If absolutely all models were treated as singletask, the estimated saving would be ~22308 Mb. Compared to this setting, our multitask model gives ~87% CPU memory savings.

Compared to the previous dev (where a multitask 6-in-1 bert-base model was already used), our multitask model gives ~75% GPU memory savings, ~57% CPU savings, and ~80-85% postannotation inference time savings. (The inference time saving comes from the fact that the current multitask model is much faster than MIDAS thanks to transformer-agnosticity, and we no longer need to run both.)