NL2Bash Transformer

This document describes the results of training the TensorFlow official Transformer on the NL2Bash dataset introduced by Xi Victoria Lin et al. in their paper NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System. We use the StagedML library to generate this report.

The primary goal of this work is to demonstrate the features of the StagedML library, which was used to run the experiments and generate this report. The secondary goal is to evaluate the NL2Bash dataset on the stock Transformer model.

We use Pweave to render this report.

import numpy as np
import matplotlib.pyplot as plt

from shutil import copyfile
from itertools import islice
from pylightnix import (
    RRef, Path, realize, realizeMany, instantiate, redefine, mkconfig, promise,
    rref2dref, mksymlink, rref2path, mklens, match_best, match_some )
from stagedml.imports.sys import ( environ, join, makedirs )
from stagedml.stages.all import ( transformer_wmt, all_nl2bashsubtok,
    all_fetchnl2bash )
from analyze import ( read_tensorflow_log, vocab_size, model_size )

Contents

  1. Definitions
  2. Experiments
    • Baseline model: run the Transformer with default settings.
    • Unshuffled: an accidental experiment in which we pass an unshuffled dataset to the baseline model.
    • Bashtokens: pre-parse the training dataset and force the tokenizer to add a number of bash-specific subtokens, including command and flag names.
    • Baseline+vocab_sizes: vary the target vocabulary size of the baseline model.
    • Bashtoken+vocab_sizes: vary the target vocabulary size of the Bashtoken model.
    • Bashtoken+1punct: add bash-specific subtokens and suppress multi-character punctuation subtokens.
  3. Conclusions

Definitions

Model

In this work, we train a copy of the TensorFlow Official Transformer model. This model is intended for machine translation tasks; a typical use case is EN-DE translation. The model uses a shared vocabulary of subtokens for both the input and target languages. In this work we pretend that Bash is the target language of the model.

The model is defined with the TensorFlow Keras API and is located in transformer_wmt.py. In this file we define the following entities:

  • The TransformerBuild class, which stores the mutable state of the model.

  • A set of operations, which includes the build, train, evaluate and predict operations. Operations typically accept the mutable state and perform the named modification: build builds the Keras model, train trains it, and so on.

  • The transformer_wmt function, which wraps the above actions into a Pylightnix entity called a Stage. The stage function is the "entry point" of the whole module. It takes the following stage-level arguments:

    • m is a technical argument representing the Pylightnix dependency resolution context.
    • wmt:WmtSubtok is a reference to the upstream stage which provides access to the database and to the Subtokenizer.
    • num_instances:int=1 is the number of model instances to train. Setting this argument to a value greater than one results in training several independent instances of the model which share the same configuration.

    The stage function returns a Derivation reference handle which can be used in downstream stages. Typically, we pass stage functions to the realizeMany(instantiate(.)) functions of the Pylightnix API. The result of such a call is a list of Realization references which may be used to directly access the instance artifacts, as sketched below.
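
For illustration, a minimal sketch of realizing a stage and accessing its artifacts. Here baseline_transformer is the stage function defined in the next snippet, and reading a configuration value back through mklens follows the pattern used elsewhere in this report:

rrefs=realizeMany(instantiate(baseline_transformer))  # list of realization references
for rref in rrefs:
  print(rref2path(rref))               # filesystem path of this instance's artifacts
  print(mklens(rref).train_steps.val)  # read a configuration value of the realization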

transformer_wmt defines a rather generic version of the Transformer which has no NL2Bash-specific settings. We change some of its configuration parameters in top-level code snippets before running each experiment. The top-level code of the baseline model looks like the following:

def baseline_subtok(m):
  return all_nl2bashsubtok(m, shuffle=True,
                              with_bash_charset=False,
                              with_bash_subtokens=False)

def baseline_transformer(m):
  def _config(c):
    mklens(c).train_steps.val=6*5000
    mklens(c).params.beam_size.val=3 # As in Tellina paper
  return redefine(transformer_wmt,
                  new_config=_config,
                  new_matcher=match_some())(m, baseline_subtok(m), num_instances=5)

  • train_steps is the total number of batches used to train the model. One epoch is defined to contain 5000 steps by default, so 6*5000 steps correspond to 6 epochs.
  • The model uses a beam_size of 3.
  • The model uses a shared vocabulary; its final size is 5833 (see the sketch below for how such figures can be queried).
  • The baseline model has 47090688 trainable weights.
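
The figures above can be queried programmatically with the helpers imported from analyze. A minimal sketch: vocab_size is used this way later in this report, while the exact signature of model_size is an assumption here.

print(vocab_size(baseline_transformer))  # final size of the shared subtoken vocabulary
print(model_size(baseline_transformer))  # assumed to return the number of trainable weights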

Metrics

We use the BLEU metric to report model performance. The BLEU implementation is taken from the sources of the official Transformer model. This metric may differ from the version of BLEU used by the authors of the NL2Bash paper, so we cannot compare results directly.

We apply the metric to the evaluation subset of the NL2Bash dataset, which makes up 10% of the original dataset.
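
For reference, here is a toy single-sentence BLEU sketch (uncased, no smoothing). It is only an illustration of the metric, not the implementation used in this report, which comes from the official Transformer sources:

from collections import Counter
from math import exp, log

def toy_bleu(hypothesis:str, reference:str, max_n:int=4)->float:
  # Modified n-gram precisions, geometric mean and brevity penalty.
  hyp, ref = hypothesis.split(), reference.split()
  precisions=[]
  for n in range(1, max_n+1):
    hyp_ngrams=Counter(tuple(hyp[i:i+n]) for i in range(len(hyp)-n+1))
    ref_ngrams=Counter(tuple(ref[i:i+n]) for i in range(len(ref)-n+1))
    overlap=sum((hyp_ngrams & ref_ngrams).values())   # clipped n-gram matches
    precisions.append(overlap/max(sum(hyp_ngrams.values()),1))
  if min(precisions)==0:
    return 0.0
  geo_mean=exp(sum(log(p) for p in precisions)/max_n)
  brevity=1.0 if len(hyp)>=len(ref) else exp(1-len(ref)/max(len(hyp),1))
  return 100.0*brevity*geo_mean

print(toy_bleu('find . -name "*.php"', 'find . -name "*.php" -delete'))  # ~77.9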

Dataset

We print the first 5 input and target sentence pairs of the NL2Bash dataset.

rref=realize(instantiate(all_fetchnl2bash))
copyfile(mklens(rref).eval_input_combined.syspath, join(environ['REPORT_OUTPATH'],'eval_input.txt'))
copyfile(mklens(rref).eval_target_combined.syspath, join(environ['REPORT_OUTPATH'],'eval_target.txt'))

with open(mklens(rref).train_input_combined.syspath) as inp, \
     open(mklens(rref).train_target_combined.syspath) as tgt:
  for i, (iline, tline) in islice(enumerate(zip(inp,tgt)),5):
    print(f"#{i}\t[I] {iline.strip()}\n\t[T] {tline.strip()}")
#0	[I] Pass numbers 1 to 100000 as arguments to "/bin/true"
	[T] /bin/true $(seq 1 100000)
#1	[I] Replace "foo" with "bar" in all PHP files in the current directory tree
	[T] find . -name "*.php" -exec sed -i 's/foo/bar/g' {} \;
#2	[I] Search the entire file hierarchy for files ending in '.old' and delete them.
	[T] find / -name "*.old" -delete
#3	[I] Find all directories under /path/to/Dir and set their permission to 755
	[T] sudo find /path/to/Dir -type d -print0 | xargs -0 sudo chmod 755
#4	[I] run "tar -xzvf ..." as user $username
	[T] su $username -c tar xzvf ..

Experiments

Baseline transformer

We display the BLEU metric of the baseline model defined above. We train the model for 6 epochs.

plt.figure(1)
plt.xlabel("Epoches")
plt.title("BLEU-cased, Baseline transformer")

out=Path(join(environ['STAGEDML_ROOT'],'_experiments','nl2bash','baseline'))
makedirs(out, exist_ok=True)
summary_baseline_bleu=[]
for i,rref in enumerate(realizeMany(instantiate(baseline_transformer))):
  mksymlink(rref, out, f'run-{i}', withtime=False)
  baseline_bleu=read_tensorflow_log(join(rref2path(rref),'eval'), 'bleu_cased')
  plt.plot(range(len(baseline_bleu)), baseline_bleu, label=f'run-{i}', color='blue')
  summary_baseline_bleu.append((vocab_size(baseline_transformer),baseline_bleu[4]))

plt.legend(loc='upper left', frameon=True)
plt.grid(True)

(Figure: BLEU-cased, Baseline transformer, one curve per training run)

In subsequent experiments we plot the BLEU of the best instance of the baseline model for comparison.

rref=realize(instantiate(redefine(baseline_transformer,
                                  new_matcher=match_best('bleu.txt'))))
baseline_bleu=read_tensorflow_log(join(rref2path(rref),'eval'), 'bleu_cased')

Below we output predictions of the model.

Unshuffled dataset

We accidentally trained the model on the unshuffled dataset. As we can see, the unshuffled dataset reduces the model's performance significantly.

def unshuffled_subtok(m):
  return all_nl2bashsubtok(m, shuffle=False,
                              with_bash_charset=False,
                              with_bash_subtokens=False)

def unshuffled_transformer(m):
  def _config(c):
    mklens(c).train_steps.val=6*5000
    mklens(c).params.beam_size.val=3 # As in Tellina paper
  return redefine(transformer_wmt,_config)(m, unshuffled_subtok(m))
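
A sketch, using the same Pylightnix calls as the baseline snippets above, of how the unshuffled run can be realized and its BLEU read back for the comparison plot:

rref_unshuffled=realize(instantiate(unshuffled_transformer))
unshuffled_bleu=read_tensorflow_log(join(rref2path(rref_unshuffled),'eval'), 'bleu_cased')
plt.figure(2)
plt.xlabel("Epochs")
plt.title("BLEU-cased, unshuffled vs. baseline")
plt.plot(range(len(baseline_bleu)), baseline_bleu, label='baseline', color='blue')
plt.plot(range(len(unshuffled_bleu)), unshuffled_bleu, label='unshuffled', color='red')
plt.legend(loc='upper left', frameon=True)
plt.grid(True)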

(Figure: BLEU-cased, unshuffled dataset vs. the baseline)

Bash-specific tokens

Originally this experiment was intended to run the model with bash-specific tokens and different vocabulary sizes. Unfortunately, due to a misuse of the Subtokenizer API, we in fact measured the performance with the same target vocabulary every time. We make the correction in the next experiment; here we display just the effect of adding bash-specific tokens.

Adding the bash specifics includes:

  1. Changing the Master Character Set of the Subtokenizer by adding ['-','+',',','.'] to the default list of alphanumeric characters.
  2. Pre-parsing the training part of the bash dataset and generating a list of reserved subtokens (see the sketch after this list). The list includes:
    • The first word of every command; often these are command names.
    • All words starting with -; often these are flags of bash commands.
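
A rough, hypothetical sketch of step 2 (the actual pre-parsing happens inside all_nl2bashsubtok): collect the first words and flag-like words from the training targets fetched earlier:

ds=realize(instantiate(all_fetchnl2bash))
reserved=set()
with open(mklens(ds).train_target_combined.syspath) as tgt:
  for line in tgt:
    words=line.strip().split()
    if words:
      reserved.add(words[0])                               # likely a command name
    reserved.update(w for w in words if w.startswith('-')) # likely command flags

The stage configuration of this experiment is the following:
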
def run1(vsize:int):

  def mysubtok(m):
    def _config(d):
      d['target_vocab_size']=vsize  # Due to the API misuse, the result does not in fact depend on this parameter
      d['vocab_file'] = [promise, 'vocab.%d' % vsize]
      return mkconfig(d)
    return redefine(all_nl2bashsubtok, _config)(m,
                    shuffle=True, with_bash_charset=True, with_bash_subtokens=True)

  def mytransformer(m):
    def _config(c):
      c['train_steps']=5*5000
      c['params']['beam_size']=3 # As in Tellina paper
      return mkconfig(c)
    return redefine(transformer_wmt,_config)(m, mysubtok(m))

  return mysubtok, mytransformer

Results:

(Figure: BLEU-cased, bash-specific tokens vs. the baseline)

Changing the vocabulary size of the baseline model

We set the target size of the subtoken vocabulary to different values in the range [1000, 15000].

Model config:

def run(vsize:int):
  def mysubtok(m):
    def _config(d):
      d['target_vocab_size']=vsize
      d['vocab_file'] = [promise, 'vocab.%d' % vsize]
      d['train_data_min_count']=None
      d['file_byte_limit'] = 1e6 if vsize > 5000 else 1e5
      return mkconfig(d)
    return redefine(all_nl2bashsubtok,_config)(m,
      shuffle=True, with_bash_charset=False, with_bash_subtokens=False)

  def mytransformer(m):
    def _config(c):
      c['train_steps']=6*5000
      c['params']['beam_size']=3 # As in Tellina paper
      return mkconfig(c)
    return redefine(transformer_wmt,_config)(m, mysubtok(m))

  return mysubtok, mytransformer
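
A minimal sketch of how these stage functions can be realized; the vsize values below are hypothetical, and the exact sweep points are not listed in this report:

summary_vocab_bleu=[]
for vsize in [1000, 5000, 10000, 15000]:    # hypothetical sweep values
  mysubtok, mytransformer = run(vsize)
  rref_v=realize(instantiate(mytransformer))
  bleu=read_tensorflow_log(join(rref2path(rref_v),'eval'), 'bleu_cased')
  summary_vocab_bleu.append((vocab_size(mytransformer), bleu[-1]))  # BLEU of the final epoch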

Results:

(Figure: BLEU-cased for different vocabulary sizes, baseline model)

Changing the vocabulary size of the Bashtoken model

We set the target size of the subtoken vocabulary to different values in the range [1000, 15000].

Model config:

def run2(vsize:int):
  def mysubtok(m):
    def _config(d):
      d['target_vocab_size']=vsize
      d['vocab_file'] = [promise, 'vocab.%d' % vsize]
      d['train_data_min_count']=None
      d['file_byte_limit'] = 1e6 if vsize > 5000 else 1e5
      return mkconfig(d)
    return redefine(all_nl2bashsubtok,_config)(m,
      shuffle=True, with_bash_charset=True, with_bash_subtokens=True)

  def mytransformer(m):
    def _config(c):
      c['train_steps']=6*5000
      c['params']['beam_size']=3 # As in Tellina paper
      return mkconfig(c)
    return redefine(transformer_wmt,_config)(m, mysubtok(m))

  return mysubtok, mytransformer

Results:

(Figure: BLEU-cased for different vocabulary sizes, Bashtoken model)

Single-char punctuation tokens

We now attempt to force the tokenizer to produce single-character tokens for punctuation characters. This should result in no multi-character punctuation tokens like '; / appearing in the vocabulary.

def singlechar_subtok(m):
  vsize=10000
  def _config(c):
    mklens(c).target_vocab_size.val=vsize
    mklens(c).vocab_file.val = [promise, 'vocab.%d' % vsize]
    mklens(c).no_slave_multichar.val = True
    mklens(c).train_data_min_count.val=None
  return redefine(all_nl2bashsubtok,_config)(m)

def singlechar_transformer(m):
  def _config(c):
    mklens(c).train_steps.val=6*5000
    mklens(c).params.beam_size.val=3 # As in Tellina paper
  return redefine(transformer_wmt,
                  new_config=_config,
                  new_matcher=match_some())(m, singlechar_subtok(m), num_instances=5)

Results:

(Figure: BLEU-cased, single-char punctuation tokens vs. the baseline)

Conclusions

Below we plot the BLEU, as seen after 5 epochs, for the different vocabulary sizes used in the above experiments.

(Figure: BLEU after 5 epochs vs. vocabulary size, all experiments)

  • The BLEU metric of a model may differ significantly from run to run. We see more than 2 BLEU points of difference between runs of the same configuration.
  • We see the best performance when vocab_size is in the range 6000..8000.
  • The default vocabulary size of the Transformer model (5833) is probably good enough.
  • Shuffling the dataset is absolutely necessary.
  • Forcing the vocabulary to contain bash-specific subtokens may be a good decision.
  • Forbidding multi-character punctuation subtokens probably reduces the accuracy.