In this report we fine-tune a multilingual BERT model on the RuSentiment dataset for Russian-language sentiment analysis. We highlight the usage of StagedML primitives for constructing models and running experiments.
Resources and related work:
- RuSentiment project by Text-Machine Lab
- Dostoevsky project by Bureaucratic Labs
- Multilingual BERT by Google Research
import altair as alt
import pandas as pd
import numpy as np
from rusentiment_experiment import *
from stagedml.stages.all import *
We fine-tuned the following BERT models on the RuSentiment dataset:
- Multilingual BERT by Google Research, with different learning rates.
- Randomly initialized BERT without pre-training.
The multilingual BERT model is defined based on the all_multibert_finetune_rusentiment stage of StagedML. We change some of its default parameters.
def all_multibert_finetune_rusentiment1(m:Manager, lr:Optional[float]=None):
  lr_ = lr if lr is not None else learning_rates[0]
  def _nc(c):
    mklens(c).name.val='rusent-pretrained'
    mklens(c).train_batch_size.val=8
    mklens(c).train_epoches.val=5
    mklens(c).lr.val=lr_
  return redefine(all_multibert_finetune_rusentiment, new_config=_nc)(m) # end1
The dependency graph of this stage illustrates the entities we are going to calculate in this experiment:
with prepare_markdown_image('rusent_graph.png') as path:
  depgraph([all_multibert_finetune_rusentiment1], path)
Here,
- basebert-multi-cased is the stage which downloads the multilingual BERT checkpoint published by Google Research.
- fetchrusent stage downloads the RuSentiment dataset. Unfortunately, this dataset is no longer publicly available, so we are using our private copy.
- tfrecord-rusent converts the dataset to the TensorFlow Records format.
- rusent-pretrained stage creates the BERT model, loads the pretrained checkpoint and performs the fine-tuning on the RuSentiment TFRecords.
- Stages mentioned above are declared in the StagedML top-level collection.
We define the randomly initialized BERT by disabling the loading of the pretrained checkpoint in the model defined above:
def all_multibert_finetune_rusentiment0(m:Manager):
  def _nc(c):
    mklens(c).name.val='rusent-random'
    mklens(c).bert_ckpt_in.val=None
  return redefine(all_multibert_finetune_rusentiment1, new_config=_nc)(m) #end0
We will train models with a number of different learning rates:
print(learning_rates)
[2e-05, 5e-05, 0.0001]
rref0=realize(instantiate(all_multibert_finetune_rusentiment0))
rrefs=[realize(instantiate(all_multibert_finetune_rusentiment1,lr=lr)) \
for lr in learning_rates]
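As a side note, the resolved parameters of a realized model can be inspected directly with mklens. A minimal sketch, reusing the attribute paths that appear elsewhere in this report:

print(mklens(rrefs[0]).name.val)                   # stage name, 'rusent-pretrained'
print(mklens(rrefs[0]).lr.val)                     # learning rate of this realization
print(mklens(rrefs[0]).tfrecs.bert_vocab.syspath)  # path to the BERT vocabulary file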
For every model trained, we read its validation logs and plot the accuracy.
cols={'steps':[],'accuracy':[],'name':[],'lr':[]}
for rref in [rref0]+rrefs:
  es=tensorboard_tensors(rref,'valid','accuracy')
  assert len(es)>0
  cols['steps'].extend([e.step for e in es])
  cols['accuracy'].extend([te2float(e) for e in es])
  cols['name'].extend([mklens(rref).name.val for _ in es])
  cols['lr'].extend([mklens(rref).lr.val for _ in es])
altair_print(alt.Chart(DataFrame(cols)).mark_line().encode(
  x='steps', y='accuracy', color='lr:N', strokeDash='name'), f'accuracy.png')
t=BeautifulTable(max_width=1000)
t.set_style(BeautifulTable.STYLE_MARKDOWN)
t.width_exceed_policy=BeautifulTable.WEP_ELLIPSIS
t.column_headers=['Learning rate','Accuracy']
t.numeric_precision=6
for rref in rrefs:
  t.append_row([str(mklens(rref).lr.val),
                te2float(tensorboard_tensors(rref,'eval','eval_accuracy')[-1])])
print(t)
| Learning rate | Accuracy |
|---------------|----------|
| 2e-05         | 0.711957 |
| 5e-05         | 0.710938 |
| 0.0001        | 0.679348 |
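The smallest learning rate yields the best evaluation accuracy here. A short sketch, reusing the helpers from the listing above, which selects the best-performing realization programmatically rather than by eye:

accs=[te2float(tensorboard_tensors(r,'eval','eval_accuracy')[-1]) for r in rrefs]
best_rref=rrefs[int(np.argmax(accs))]
print(mklens(best_rref).lr.val)  # 2e-05, matching the table above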
We build the confusion matrix by (a) defining a simple model runner and (b) realizing a stage which uses this runner to calculate the matrix data. The model runner loads the model referenced by rref and processes a list of sentences provided by the user.
class Runner:
  def __init__(self, rref:RRef):
    self.rref=rref
    self.tokenizer=FullTokenizer(
      vocab_file=mklens(rref).tfrecs.bert_vocab.syspath,
      do_lower_case=mklens(rref).tfrecs.lower_case.val)
    self.max_seq_length=mklens(rref).max_seq_length.val
    self.model=BertClsModel(
      BuildArgs(rref2dref(rref), store_context(rref), None, {}, {}))
    bert_finetune_build(self.model)
    self.model.model.load_weights(mklens(rref).checkpoint_full.syspath)
  def eval(self, sentences:List[str], batch_size:int=10)->List[List[float]]:
    """ Tokenize the sentences, run the model in batches and return the
    per-class probabilities for every sentence. """
    @tf.function
    def _tf_step(inputs):
      return self.model.model_eval(inputs, training=False)
    def _step(inputs:List[Tuple[Any,Any,Any]]):
      return _tf_step((tf.constant([i[0] for i in inputs], dtype=tf.int64),
                       tf.constant([i[1] for i in inputs], dtype=tf.int64),
                       tf.constant([i[2] for i in inputs], dtype=tf.int64)))\
                       .numpy().tolist()
    buf=[]
    outs:List[List[float]]=[]
    for i,s in enumerate(sentences):
      ie=InputExample(guid=f"eval_{i:04d}", text_a=s, label='dummy')
      f=convert_single_example(
        10, ie, ['dummy'], self.max_seq_length, self.tokenizer)
      buf.append((f.input_ids, f.input_mask, f.segment_ids))
      if len(buf)>=batch_size:
        outs.extend(_step(buf))
        buf=[]
    if len(buf)>0:  # flush the last, possibly incomplete, batch
      outs.extend(_step(buf))
    return outs
# runner ends
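As a usage sketch (the runner is not part of the experiment pipeline itself), we may apply it to arbitrary sentences; it returns one list of per-class probabilities per sentence. The example sentences below are illustrative placeholders:

runner=Runner(rrefs[0])
for probs in runner.eval(["Отличный сервис, всем доволен!",
                          "Ужасно, больше не приду."]):
  print([f"{p:.3f}" for p in probs])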
The stage for calculating the confusion matrix data is defined as follows:
def bert_rusentiment_evaluation(m:Manager, stage:Stage)->DRef:
  def _realize(b:Build):
    build_setoutpaths(b,1)
    r=Runner(mklens(b).model.rref)
    df=read_csv(mklens(b).model.tfrecs.refdataset.output_tests.syspath)
    labels=sorted(df['label'].value_counts().keys())
    df['pred']=[labels[np.argmax(probs)] for probs in r.eval(list(df['text']))]
    confusion={l:{l2:0.0 for l2 in labels} for l in labels}
    for i,row in df.iterrows():
      confusion[row['label']][row['pred']]+=1.0
    confusion={l:{l2:i/sum(items.values()) for l2,i in items.items()} \
               for l,items in confusion.items()}
    with open(mklens(b).confusion_matrix.syspath,'w') as f:
      json_dump(confusion,f,indent=4)
    df.to_csv(mklens(b).prediction.syspath)
  return mkdrv(m, matcher=match_only(), realizer=build_wrapper(_realize),
    config=mkconfig({
      'name':'confusion_matrix',
      'model':stage(m),
      'confusion_matrix':[promise, 'confusion_matrix.json'],
      'prediction':[promise, 'prediction.csv'],
      'version':2,
    }))
# eval ends
We combine the confusion matrix stage with the best stage defined in the previous section. The final stage function is stored in the stage_cm variable:
stage_finetune=partial(all_multibert_finetune_rusentiment1, lr=min(learning_rates))
stage_cm=partial(bert_rusentiment_evaluation, stage=stage_finetune)
The dependency graph of the combined stage becomes:
with prepare_markdown_image('cm_graph.png') as path:
  depgraph([stage_cm], path)
Finally, we realize the confusion matrix stage and print the matrix:
rref=realize(instantiate(stage_cm))
data:dict={'label':[],'pred':[],'val':[]}
for l,items in readjson(mklens(rref).confusion_matrix.syspath).items():
  for l2,val in items.items():
    data['label'].append(l)
    data['pred'].append(l2)
    data['val'].append(f"{val:.2f}")
base=alt.Chart(DataFrame(data))
r=base.mark_rect().encode(
  y='label:O', x='pred:O',  # color='val:Q'
  color=alt.Color('val:Q',
    scale=alt.Scale(scheme='purpleblue'),
    legend=alt.Legend(direction='horizontal')))
t=base.mark_text().encode(y='label:O', x='pred:O', text='val:Q',
  color=alt.condition(
    alt.datum.val < 0.5,
    alt.value('black'),
    alt.value('white')))
altair_print((r+t).properties(width=300,height=300), 'cm.png')