
Finetuning

Introduction

A SyntaxDot model is trained using finetuning. Finetuning takes a pretrained (transformer-based) language model and refines its weights for syntax annotation.

To finetune a model, you need several ingredients:

  • A pretrained language model, including its vocabulary and model configuration.
  • Training and test data. The training data will be used to finetune the model's parameters. The test data is used to evaluate the model's performance during finetuning.
  • A SyntaxDot configuration file.

Getting a pretrained model

SyntaxDot requires pretrained models in the libtorch OutputArchive format, with the expected parameter names. Python scripts to convert some commonly-used model types to the format expected by SyntaxDot are provided in the ../scripts directory.
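
For example, converting a Hugging Face XLM-RoBERTa checkpoint could look roughly like the sketch below. The script name and arguments are hypothetical; check ../scripts for the actual scripts and their usage.

# Hypothetical invocation: the real script names and arguments are in ../scripts.
$ python scripts/convert-xlm-roberta.py xlm-roberta-base xlm-roberta-base.pt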

You could also use an existing pretrained model in the SyntaxDot format. You can download the required parts for the well-performing multilingual XLM-RoBERTa base model here:

SyntaxDot configuration file

The SyntaxDot configuration file contains the SyntaxDot-specific model settings in TOML format, such as the sequence labeling tasks and biaffine parsing. This is an example configuration file:

[input]
tokenizer = { xlm_roberta = { vocab = "xlm-roberta-base-sentencepiece.bpe.model" } }

[biaffine]
labels = "syntaxdot.biaffine"
head = { dims = 128, head_bias = true, dependent_bias = true }
relation = { dims = 128, head_bias = true, dependent_bias = true }

[labeler]
labels = "sticker.labels"
encoders = [
  { name = "pos-ud", encoder = { sequence = "upos" } },
  { name = "pos-extended", encoder = { sequence = "xpos" } },
  { name = "lemma", encoder = { lemma = "form" } },
  { name = "morph", encoder = { sequence = "feature_string" } },
]

[model]
pretrain_type = "xlm_roberta"
parameters = "epoch-74"
position_embeddings = "model"
pretrain_config = "xlm-roberta-base-config.json"

We will now discuss each section in more detail.

input

The input section describes processing of the input tokens. Currently, this section only configures the tokenizer that should be used. For example,

[input]
tokenizer = { xlm_roberta = { vocab = "xlm-roberta-base-sentencepiece.bpe.model" } }

This configuration uses the XLM-RoBERTa tokenization model, which is stored in the file xlm-roberta-base-sentencepiece.bpe.model. The supported tokenizers are:

  • albert: for ALBERT models.
  • bert: for BERT-based models.
  • xlm_roberta: for XLM-RoBERTa models.
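
For instance, a BERT-based model would use the bert tokenizer. This is only a sketch: the vocabulary file name below is a placeholder, and we assume the bert tokenizer takes the same vocab key as the xlm_roberta tokenizer above.

[input]
# Placeholder vocabulary file; the vocab key is assumed to mirror the xlm_roberta example.
tokenizer = { bert = { vocab = "bert-base-cased-vocab.txt" } }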

biaffine

The biaffine section configures a biaffine parser. If this section is absent, the model is a pure sequence labeling model, and cannot perform dependency parsing. Here is an example biaffine section:

[biaffine]
labels = "syntaxdot.biaffine"
head = { dims = 128, head_bias = true, dependent_bias = true }
relation = { dims = 128, head_bias = true, dependent_bias = true }

The labels option sets the file name to which the dependency labels should be saved. This is only done once, during model preparation (see below).

The head and relation options configure the hidden representations of tokens as used for head prediction and relation prediction. In both cases, dims configures the dimensionality of the hidden representations. head_bias and dependent_bias configure whether a bias should be added for a token's representation as a head or as a dependent. Enabling these biases generally doesn't negatively affect performance, so it is safest to simply enable them.
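
As a variation, you could give head prediction a larger hidden representation than relation prediction. The dimensionalities below are purely illustrative:

[biaffine]
labels = "syntaxdot.biaffine"
head = { dims = 256, head_bias = true, dependent_bias = true }
relation = { dims = 128, head_bias = true, dependent_bias = true }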

labeler

This section configures one or more sequence labelers. For example:

[labeler]
labels = "sticker.labels"
encoders = [
  { name = "pos-ud", encoder = { sequence = "upos" } },
  { name = "pos-extended", encoder = { sequence = "xpos" } },
  { name = "lemma", encoder = { lemma = "form" } },
  { name = "morph", encoder = { sequence = "feature_string" } },
]

The labels option specifies the file in which all labels for sequence annotation should be stored. encoders is a list that configures the different sequence annotation layers. Each layer should have a unique name and specify an encoder. We will now discuss the encoder types in more detail.

sequence

The sequence encoder is the simplest form of encoder: it simply takes values from a CoNLL-U layer. Basic sequence encoders have the following form:

{ name = "NAME", encoder = { sequence = "LAYER" } },

where LAYER is one of:

  • upos: universal part-of-speech tag
  • xpos: language-specific part-of-speech tag
  • feature_string: the features field as a string

Note that feature_string is invariant to the feature ordering, since features are internally sorted in lexicographical order.
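
For example, the following two feature strings describe the same features in a different order:

Case=Nom|Number=Sing
Number=Sing|Case=Nom

Both are encoded as the same label, since the features are sorted before encoding.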

It is also possible to create annotation layers based on specific features or miscellaneous features. For example, the following sequence encoder uses the TopoField feature of the miscellaneous features, setting the default value to null if the feature is absent:

{ name = "tf", encoder = { sequence = { misc = { feature = "TopoField", default = "null" } } } }

Similarly, we could create an annotation layer using the Number feature:

{ name = "tf", encoder = { sequence = { misc = { feature = "TopoField", default = "null" } } } }

lemma

The lemma encoder (transparently) encodes lemmas to edit trees. The configuration of this encoder is simply as follows:

{ name = "lemma", encoder = { lemma = "form" } },

The only configuration is the back-off strategy, which is used when the predicted edit tree cannot be applied. Here the back-off strategy is to use the form as the lemma. Another possible value is nothing, which will not update the lemma at all in such a case.
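
If you prefer to leave the lemma column untouched when the predicted edit tree cannot be applied, use the nothing back-off instead:

{ name = "lemma", encoder = { lemma = "nothing" } },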

model

The final configuration section of the SyntaxDot configuration specifies the model. For example:

[model]
pretrain_type = "xlm_roberta"
parameters = "epoch-74"
position_embeddings = "model"
pretrain_config = "xlm-roberta-base-config.json"

pretrain_type is the type of pretrained model that is used. This is one of:

  • albert
  • bert
  • squeeze_albert
  • squeeze_bert
  • xlm_roberta

The parameters option contains the file with the model parameters. This can be any value during finetuning, but after finetuning it should be set to the parameter file of the best epoch.

The position_embeddings option configures the type of position embeddings that should be used. For finetuning existing models, this should always be set to model, which will use the position embeddings of the model.

Finally, pretrain_config is the filename of the pretrained model configuration.

Finetuning

With the configuration set up, a model can be trained. The first step is to prepare the label files, which is done with the syntaxdot prepare subcommand:

$ syntaxdot prepare syntaxdot.conf train.conllu
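
After preparation, the label files named in the configuration (syntaxdot.biaffine and sticker.labels in the example above) should exist next to it, for example:

$ ls
sticker.labels  syntaxdot.biaffine  syntaxdot.conf  train.conllu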

Then, you can start the finetuning of the pretrained model. For example, to finetune a syntax model with XLM-RoBERTa base:

$ syntaxdot finetune \
	syntaxdot.conf \
	xlm-roberta-base.pt \
	train.conllu \
	test.conllu \
	--gpu 0 \
	--label-smoothing 0.03 \
	--maxlen 100 \
	--mixed-precision \
	--warmup 10000

SyntaxDot has more command-line options to configure finetuning, but this is the recommended set of options:

  • --gpu 0: enables GPU training using GPU 0.
  • --label-smoothing 0.03: enables label smoothing, distributing 0.03 of the 'probability mass' to incorrect labels. This is a form of regularization and generally results in better models.
  • --maxlen 100: excludes sentences of more than 100 pieces. This can be used in conjunction with the batch size option to control the amount of GPU memory needed during training (see the example after this list).
  • --mixed-precision: use mixed precision. This lowers GPU memory use and speeds up training considerably, but is only supported on NVIDIA GPUs with Tensor Cores (Volta/Turing GPUs or later generations).
  • --warmup 10000: use 10,000 steps of learning rate warmup. This prevents overly large weight updates early in training.
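
For instance, on a GPU with less memory you could lower --maxlen and set the batch size explicitly. The values below are illustrative, and the spelling of the batch size flag (--batch-size) is an assumption based on the option mentioned above:

$ syntaxdot finetune \
	syntaxdot.conf \
	xlm-roberta-base.pt \
	train.conllu \
	test.conllu \
	--gpu 0 \
	--batch-size 16 \
	--maxlen 75 \
	--mixed-precision \
	--warmup 10000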

After finetuning is done, SyntaxDot will report the best epoch. Don't forget to update the parameters option in your SyntaxDot configuration to use the parameters from the best epoch!
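
For instance, if SyntaxDot reports epoch 74 as the best epoch (the number here is only a placeholder for whatever your run reports), the model section would point at the corresponding parameter file:

[model]
pretrain_type = "xlm_roberta"
# Set to the parameter file of the best epoch reported by SyntaxDot.
parameters = "epoch-74"
position_embeddings = "model"
pretrain_config = "xlm-roberta-base-config.json"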