Multitask models in DeepPavlov Dream

dimakarp1996 edited this page Jul 8, 2023 · 1 revision

Introduction

This page describes the multitask English model used in the DeepPavlov Dream dialogue system to solve several classification tasks at once.

Specifically, the model is used to solve the following tasks:

  • Emotion classification
  • Sentiment classification
  • Toxicity classification
  • Factoid classification
  • MIDAS intent classification
  • DeepPavlov topic classification
  • CoBot topic classification
  • CoBot DialogAct topic classification
  • Cobot DialogAct intent classification

The full list of classes for all these tasks can be found here. Such models save computational power by utilizing a single backbone model instead of many separate ones.

How does the model work?

The model uses plain linear layers on top of the distilbert-base-cased backbone. It is as simple as that.
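A minimal sketch of this design, using NumPy with a random stand-in for the backbone (in the real model the backbone is distilbert-base-cased producing a text representation; the task names and class counts below are illustrative, not the full nine-task list):

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 768  # hidden size of distilbert-base

def backbone(texts):
    # Hypothetical stand-in for the shared transformer backbone:
    # returns one HIDDEN-dimensional vector per input text.
    return rng.standard_normal((len(texts), HIDDEN))

# One plain linear head per task, all sharing the same backbone output.
TASKS = {"emotion": 7, "sentiment": 3, "toxicity": 8, "factoid": 2}
heads = {name: (rng.standard_normal((HIDDEN, n)) * 0.01, np.zeros(n))
         for name, n in TASKS.items()}

def predict(texts):
    h = backbone(texts)  # shared encoding, computed once for all tasks
    out = {}
    for name, (W, b) in heads.items():
        logits = h @ W + b
        # numerically stable softmax over each task's classes
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        out[name] = e / e.sum(axis=-1, keepdims=True)
    return out

probs = predict(["I love this movie!"])  # dict: task name -> class probabilities
```

The key point of the architecture is that the expensive backbone pass is shared: adding another task only adds one cheap linear layer.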

This architecture is also explained in the DeepPavlov manual. The model has been supported since DeepPavlov version 1.1.1.

Another manual explaining how to use this model is available here.

The model's performance is also examined in this article.

How can the model be called?

This model uses port 8087. The name of the annotator is combined_classification. The model can be called either as an annotator (at the url http://0.0.0.0:8087/model) or as a postannotator (at the url http://0.0.0.0:8087/batch_model).

Examples of external calls to this model can be found here.
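A sketch of such a call is shown below. The `"sentences"` key in the request body is an assumption based on common Dream annotator schemas; check the service's server code for the exact field name.

```python
import json

# Endpoints from the text above; the service listens on port 8087.
ANNOTATOR_URL = "http://0.0.0.0:8087/model"            # annotator mode
POSTANNOTATOR_URL = "http://0.0.0.0:8087/batch_model"  # postannotator mode

def build_payload(utterances):
    # Assumed request schema: a JSON body with a "sentences" key holding
    # the list of utterances to classify.
    return {"sentences": list(utterances)}

# With the service running, a call could look like this (requires `requests`):
# import requests
# response = requests.post(ANNOTATOR_URL, json=build_payload(["I love this movie!"]))
# print(response.json())  # per-task class probabilities

print(json.dumps(build_payload(["I love this movie!"])))
```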

What are the training data?

In DREAM, the new multitask “9 in 1” model was trained on the following single-label datasets:

| Task | Classes | Dataset | Training samples | Notes |
|---|---|---|---|---|
| Sentiment classification | 3 | DynaSent (r1+r2) | 94k | Previous multitask models used SST (8k samples), which led to overfitting. |
| Factoid classification | 2 | YAHOO | 3.6k | |
| Emotion classification | 7 | go_emotions | 42k | Emotions in this dataset were grouped into Ekman classes; only single-label samples were used. |
| MIDAS intent classification | 15 | MIDAS | ~9k | The head is single-label; only semantic classes were used, as in DREAM now. |
| DeepPavlov topic classification | 25 | DeepPavlov Topics | 1.8m | Class names for this classifier and for the CoBot replacement classifiers are different, so special functions were added for every such topic to support this difference. (Example) |
| Toxicity classification | 8 | Kaggle dataset | 170k | To make the classifier single-label, a non_toxic class was added with probability (1 − sum of the toxic probabilities); all probabilities were then normalized, and the class with the maximal probability was selected. |
| CoBot topics | 22 | private DREAM-2 dataset | 216k | The most frequent “garbage” class (Phatic) was excluded, and all multi-label examples were converted to the single-label format. These measures made the model less likely to overfit on “garbage” classes, improving its quality on real-world data. |
| CoBot dialogact topics | 11 | private DREAM-2 dataset | 127k | The most frequent “garbage” class (Other) was excluded, all multi-label examples were converted to the single-label format, and history support was removed. These measures made the model less likely to overfit on “garbage” classes, improving its quality on real-world data. |
| CoBot dialogact intents | 11 | private DREAM-2 dataset | 316k | All multi-label examples were converted to the single-label format, and history support was removed. These measures made the model less likely to overfit on “garbage” classes, improving its quality on real-world data. |
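The single-label conversion of the toxicity data can be sketched as follows (the input probabilities here are made up for illustration):

```python
import numpy as np

# Hypothetical multilabel probabilities for the 7 toxic classes of the
# Kaggle dataset (values are illustrative only).
toxic_probs = np.array([0.10, 0.05, 0.02, 0.01, 0.00, 0.00, 0.00])

# Add an 8th non_toxic class equal to (1 - sum of the toxic probabilities) ...
non_toxic = 1.0 - toxic_probs.sum()
full = np.append(toxic_probs, non_toxic)

# ... then normalize and select the class with the maximal probability.
full = full / full.sum()
label = int(full.argmax())  # here non_toxic (index 7) dominates with ~0.82
```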

We also tried architectural improvements: using task-specific tokens concatenated with the CLS token, or instead of it, for classification. This was not successful. However, increasing the batch size from 32 to 640 (for the distilled model) or 320 (for the ordinary model) did yield an improvement.

What are the metrics?

We measured the metrics of the multitask model for bert-base-uncased and distilbert-base-uncased. For the latter model, we explored using MIDAS with and without dialogue history, and settled on MIDAS without history to additionally cut down on computation time.

Training the model without history from scratch barely affected performance and paradoxically even yielded some improvement for MIDAS (setting 3 vs setting 2). It also allowed us to halve the maximum sequence length and cache only one model prediction, which decreased prediction time from 0.73 sec to 0.55 sec.

The settings compared below are:

1. Singletask, distilbert-base-uncased, batch 640 (MIDAS trained with history)
2. Multitask, distilbert-base-uncased, batch 640 (MIDAS trained with history)
3. Multitask, distilbert-base-uncased, batch 640, all tasks trained without history
4. Singletask, bert-base-uncased, batch 320 (MIDAS trained with history)
5. Multitask, bert-base-uncased, batch 320 (MIDAS trained with history)

| Task | Dataset modification | Train size | Setting 1 | Setting 2 | Setting 3 | Setting 4 | Setting 5 |
|---|---|---|---|---|---|---|---|
| Emotion classification (go_emotions) | converted to multi-class | 39.5k | 70.47/70.30 | 68.18/67.86 | 67.59/67.32 | 71.48/71.16 | 67.27/67.23 |
| Toxic classification (Kaggle) | converted to single-label | 1.62m | 94.53/93.64 | 93.84/93.5 | 93.86/93.41 | 94.54/93.15 | 93.94/93.4 |
| Sentiment classification (DynaSent, v1+v2) | no | 94k | 74.75/74.63 | 72.55/72.21 | 72.22/71.9 | 75.95/75.88 | 75.65/75.62 |
| Factoid classification (Yahoo) | no | 3.6k | 81.69/81.66 | 81.02/81.07 | 80.0/79.86 | 84.41/84.44 | 80.34/80.09 |
| MIDAS classification | only semantic classes | 7.1k | 80.53/79.81 (with history) | 72.73/71.56 (with history), 62.26/60.68 (without history) | 73.69/73.26 (without history) | 82.3/82.03 (with history) | 77.01/76.38 (with history) |
| DeepPavlov Topics classification | no | 1.8m | 87.48/87.43 | 86.98/86.9 | 87.01/87.05 | 88.09/88.1 | 87.43/87.47 |
| CoBot topics classification | converted to single-label, no history, removed widespread garbage class Phatic | 216k | 79.88/79.9 | 77.31/77.36 | 77.45/77.35 | 80.68/80.67 | 78.21/78.22 |
| CoBot dialogact topics classification | converted to single-label, no history, removed widespread garbage class Other | 127k | 76.81/76.71 | 76.92/76.79 | 76.8/76.7 | 77.02/76.97 | 76.86/76.74 |
| CoBot dialogact intents classification | converted to single-label, no history | 318k | 77.07/77.7 | 76.83/76.76 | 76.65/76.57 | 77.28/77.72 | 76.96/76.89 |
| Total (9in1) | | 4218k | 80.36/80.20 | 78.48/78.22 | 78.36/78.15 | 81.31/81.12 | 79.3/79.11 |
| GPU memory used, Mb | | | 2418×9 = 21762 | 2420 | 2420 | 3499×9 = 31491 | 3501 |
| Test inference time, sec (for the tests) | | | | 0.76 | 0.55 | | ~1.33 |

To achieve the best trade-off between memory use, inference time, and test metrics, the model from Setting 3 (multitask, distilbert-base-uncased, batch 640, trained without history) is used in DeepPavlov Dream.

How much memory does this model save?

If absolutely all models were treated as singletask, the 6 models being replaced by the current combined classifier in DREAM (emotion, toxicity, sentiment, and 3 CoBot models) would take ~3500 × 6 ≈ 21000 Mb of GPU memory. The MIDAS classifier would take ~3500 Mb of GPU memory, as it does in the current dev. We do not count the topic classifier, since it is unclear which kind of singletask topic classifier we would have used; a distilbert-like topic classifier would take ~2418 Mb of GPU memory. In that case, replacements of all singletask models would take ~27000 Mb of GPU memory in total. Compared to this setting, our multitask model gives ~91% GPU memory savings.
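The ~91% figure can be checked with simple arithmetic, using the approximate per-model numbers quoted above:

```python
# Back-of-the-envelope GPU memory estimate for the all-singletask scenario.
replaced = 3500 * 6   # emotion, toxicity, sentiment + 3 CoBot models, Mb
midas = 3500          # MIDAS classifier, Mb
topic = 2418          # hypothetical distilbert-like topic classifier, Mb
singletask_total = replaced + midas + topic  # ~27000 Mb

multitask = 2420      # combined classifier (Setting 3), Mb
saving = 1 - multitask / singletask_total
print(f"{saving:.0%}")  # prints "91%"
```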

This saving comes from replacing many BERTs with one BERT, and also from the transformer-agnostic design that made it easy to swap in distilbert-base-uncased instead of bert-base-uncased.

CPU memory use for the multitask model is 2909 Mb. If absolutely all models were treated as singletask, the estimated saving would be ~22308 Mb. Compared to this setting, our multitask model gives ~87% CPU memory savings.

Compared to the previous dev (where a multitask 6-in-1 bert-base model was already used), our multitask model gives ~75% GPU memory savings, ~57% CPU savings, and ~80-85% postannotation inference time savings. (The inference time saving comes from the fact that the current multitask model is much faster than MIDAS thanks to transformer-agnosticity, and we no longer need to run both.)