RuSentiment

In this report, we fine-tune a multilingual BERT model on the RuSentiment dataset for Russian-language sentiment analysis. We highlight the usage of StagedML primitives for constructing models and running experiments.

Resources and related work:

import altair as alt
import pandas as pd
import numpy as np
from rusentiment_experiment import *
from stagedml.stages.all import *

BERT Fine-tuning

We fine-tuned the following BERT models on the RuSentiment dataset:

  1. Multilingual BERT by Google Research, fine-tuned with different learning rates.
  2. Randomly initialized BERT without pre-training.

The multilingual BERT model is defined based on the all_multibert_finetune_rusentiment stage of StagedML. We change some of its default parameters:

def all_multibert_finetune_rusentiment1(m:Manager, lr:Optional[float]=None):
  lr_ = lr if lr is not None else learning_rates[0]
  def _nc(c):
    mklens(c).name.val='rusent-pretrained'
    mklens(c).train_batch_size.val=8
    mklens(c).train_epoches.val=5
    mklens(c).lr.val=lr_
  return redefine(all_multibert_finetune_rusentiment, new_config=_nc)(m) # end1

The dependency graph of this stage illustrates the entities we are going to compute in this experiment:

with prepare_markdown_image('rusent_graph.png') as path:
  depgraph([all_multibert_finetune_rusentiment1], path)

Here,

  • basebert-multi-cased is a stage which downloads the multilingual BERT checkpoint published by Google Research.
  • fetchrusent is a stage which downloads the RuSentiment dataset. Unfortunately, this dataset is no longer publicly available, so we are using our private copy.
  • tfrecord-rusent converts the dataset to the TensorFlow Record (TFRecord) format.
  • rusent-pretrained is a stage which creates the BERT model, loads the pre-trained checkpoint and performs the fine-tuning on the RuSentiment TFRecords.
  • All the stages mentioned above are declared in the StagedML top-level collection.

We define the randomly-initialized BERT by disabling the checkpoint initialization in the default version of the above model:

def all_multibert_finetune_rusentiment0(m:Manager):
  def _nc(c):
    mklens(c).name.val='rusent-random'
    mklens(c).bert_ckpt_in.val=None
  return redefine(all_multibert_finetune_rusentiment1, new_config=_nc)(m) #end0
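
Before training anything, we can sanity-check that the override took effect. The following is a minimal inspection sketch, assuming that instantiate returns a closure object with a dref field, as in the pylightnix API used throughout this report:

# Inspection sketch (assumption: instantiate yields a closure whose
# .dref identifies the derivation; mklens resolves config fields).
clo=instantiate(all_multibert_finetune_rusentiment0)
print(mklens(clo.dref).name.val)          # expected: 'rusent-random'
print(mklens(clo.dref).bert_ckpt_in.val)  # expected: None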

We will train models with a number of different learning rates:

print(learning_rates)
[2e-05, 5e-05, 0.0001]
rref0=realize(instantiate(all_multibert_finetune_rusentiment0))
rrefs=[realize(instantiate(all_multibert_finetune_rusentiment1,lr=lr)) \
                           for lr in learning_rates]

For every model trained, we read its validation logs and plot the accuracy.

cols={'steps':[],'accuracy':[],'name':[],'lr':[]}
for rref in [rref0]+rrefs:
  es=tensorboard_tensors(rref,'valid','accuracy')
  assert len(es)>0
  cols['steps'].extend([e.step for e in es])
  cols['accuracy'].extend([te2float(e) for e in es])
  cols['name'].extend([mklens(rref).name.val for _ in es])
  cols['lr'].extend([mklens(rref).lr.val for _ in es])
altair_print(alt.Chart(DataFrame(cols)).mark_line().encode(
  x='steps', y='accuracy', color='lr:N', strokeDash='name'), f'accuracy.png')

t=BeautifulTable(max_width=1000)
t.set_style(BeautifulTable.STYLE_MARKDOWN)
t.width_exceed_policy=BeautifulTable.WEP_ELLIPSIS
t.column_headers=['Learning rate','Accuracy']
t.numeric_precision=6
for rref in rrefs:
  t.append_row([str(mklens(rref).lr.val),
                te2float(tensorboard_tensors(rref,'eval','eval_accuracy')[-1])])
print(t)
| Learning rate | Accuracy |
|---------------|----------|
| 2e-05         | 0.711957 |
| 5e-05         | 0.710938 |
| 0.0001        | 0.679348 |

Confusion matrix

We build the confusion matrix by (a) defining a simple model runner and (b) realizing a stage which uses this runner to calculate the matrix data. The model runner loads the model referenced by rref and processes a list of sentences defined by the user.

class Runner:
  def __init__(self, rref:RRef):
    self.rref=rref
    self.tokenizer=FullTokenizer(
      vocab_file=mklens(rref).tfrecs.bert_vocab.syspath,
      do_lower_case=mklens(rref).tfrecs.lower_case.val)
    self.max_seq_length=mklens(rref).max_seq_length.val
    self.model=BertClsModel(
      BuildArgs(rref2dref(rref), store_context(rref), None, {}, {}))
    bert_finetune_build(self.model)
    self.model.model.load_weights(mklens(rref).checkpoint_full.syspath)

  def eval(self, sentences:List[str], batch_size:int=10)->List[List[float]]:

    @tf.function
    def _tf_step(inputs):
      return self.model.model_eval(inputs, training=False)

    def _step(inputs:List[Tuple[Any,Any,Any]]):
      return _tf_step((tf.constant([i[0] for i in inputs], dtype=tf.int64),
                       tf.constant([i[1] for i in inputs], dtype=tf.int64),
                       tf.constant([i[2] for i in inputs], dtype=tf.int64)))\
             .numpy().tolist()

    buf=[]
    outs:List[List[float]]=[]
    for i,s in enumerate(sentences):
      ie=InputExample(guid=f"eval_{i:04d}", text_a=s, label='dummy')
      f=convert_single_example(
        10, ie, ['dummy'], self.max_seq_length, self.tokenizer)
      buf.append((f.input_ids, f.input_mask, f.segment_ids))
      if len(buf)>=batch_size:
        outs.extend(_step(buf))
        buf=[]

    if buf:  # flush the last, possibly incomplete, batch
      outs.extend(_step(buf))
    return outs

# runner ends
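
As a usage illustration, here is a hypothetical sketch (the sentences are arbitrary placeholders, and rref_best is assumed to point at one of the models realized above) showing how the runner classifies raw sentences:

# Usage sketch: build a Runner from the best fine-tuned model and
# obtain per-class probabilities for a couple of sentences.
rref_best=realize(instantiate(all_multibert_finetune_rusentiment1,
                              lr=min(learning_rates)))
r=Runner(rref_best)
probs=r.eval(['Отличный сервис!','Ужасный день.'])  # placeholder sentences
print(probs)  # one list of per-class probabilities per input sentence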

The stage which calculates the confusion matrix data is defined as follows. Note that the matrix is row-normalized: each cell holds the fraction of test sentences with a given true label that received a given predicted label.

def bert_rusentiment_evaluation(m:Manager, stage:Stage)->DRef:

  def _realize(b:Build):
    build_setoutpaths(b,1)
    r=Runner(mklens(b).model.rref)
    df=read_csv(mklens(b).model.tfrecs.refdataset.output_tests.syspath)
    labels=sorted(df['label'].value_counts().keys())
    df['pred']=[labels[np.argmax(probs)] for probs in r.eval(list(df['text']))]
    confusion={l:{l2:0.0 for l2 in labels} for l in labels}
    for i,row in df.iterrows():
      confusion[row['label']][row['pred']]+=1.0
    confusion={l:{l2:i/sum(items.values()) for l2,i in items.items()} \
               for l,items in confusion.items()}
    with open(mklens(b).confusion_matrix.syspath,'w') as f:
      json_dump(confusion,f,indent=4)
    df.to_csv(mklens(b).prediction.syspath)

  return mkdrv(m, matcher=match_only(), realizer=build_wrapper(_realize),
    config=mkconfig({
      'name':'confusion_matrix',
      'model':stage(m),
      'confusion_matrix':[promise, 'confusion_matrix.json'],
      'prediction':[promise, 'prediction.csv'],
      'version':2,
      }))

# eval ends

We combine the confusion matrix stage with the best stage defined in the previous section. The final stage function is stored in the stage_cm variable:

stage_finetune=partial(all_multibert_finetune_rusentiment1, lr=min(learning_rates))
stage_cm=partial(bert_rusentiment_evaluation, stage=stage_finetune)

The dependency graph of the combined stage becomes:

with prepare_markdown_image('cm_graph.png') as path:
  depgraph([stage_cm], path)

Finally, we realize the confusion matrix stage and print the matrix:

rref=realize(instantiate(stage_cm))
data:dict={'label':[],'pred':[],'val':[]}
for l,items in readjson(mklens(rref).confusion_matrix.syspath).items():
  for l2,val in items.items():
    data['label'].append(l)
    data['pred'].append(l2)
    data['val'].append(f"{val:.2f}")

base=alt.Chart(DataFrame(data))
r=base.mark_rect().encode(
  y='label:O', x='pred:O',
  color=alt.Color('val:Q',
        scale=alt.Scale(scheme='purpleblue'),
        legend=alt.Legend(direction='horizontal')
  ))
t=base.mark_text().encode(y='label:O',x='pred:O',text='val:Q',
    color=alt.condition(
        alt.datum.val < 0.5,
        alt.value('black'),
        alt.value('white')
    )
)
altair_print((r+t).properties(width=300,height=300), 'cm.png')
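
The evaluation stage also stores per-sentence predictions in prediction.csv. As a final sketch (using only the fields defined by the stage above), these can be loaded for quick error analysis:

# Load the predictions saved by bert_rusentiment_evaluation and show a
# few misclassified test sentences.
df=read_csv(mklens(rref).prediction.syspath)
print(df[df['label']!=df['pred']].head())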