This is the source code repository for the ESEC/FSE paper Learning Type Annotation: Is Big Data Enough?

It is derived from the TensorFlow Model Garden.

  • May 2: Updated the repository with Hugging Face models at kevinjesse/typebert. The dataset has also been uploaded under the same tag, kevinjesse/typebert. Please see the huggingface/ and huggingface/ scripts to see how to download and use the model. I prefer to use the pytorch/pytorch:1.12.1-cuda11.3-cudnn8-devel Docker image.


We provide a Docker image with all dependencies for TypeBert installed. After downloading the image, you can link the directory containing the model weights through the following steps.

  1. Install Docker if you don't have it: Docker

  2. Pull the latest TypeBert image: docker pull typebert/typebert

  3. Run Docker with the code directory mounted. Model weights can also be placed in this folder and passed via the command line to the or. To run Docker with NVIDIA GPUs, please use CUDA 11 and cuDNN 8. We use TensorFlow 2.4. To start the container, use docker run, e.g.

     docker run --gpus all --rm --mount type=bind,src=/home/typebert,dst=/home/typebert --mount type=bind,src=/data2/typebert_data,dst=/data2/typebertv2_data -it typebert/typebert bash

     We choose to keep the TypeBert code on a data drive. Mounting the directory makes it visible to the Docker container. After cloning the code in the following step, make sure to add the model directory to the Python path:

export PYTHONPATH=$PYTHONPATH:/home/myhome/TypeBert/models/
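A quick way to confirm the export took effect is to check the interpreter's import roots from inside the container. The snippet below is a minimal sketch; the helper name `on_python_path` is hypothetical and not part of this repository, and the path is simply the one from the export line above.

```python
import os
import sys


def on_python_path(directory, path_entries=None):
    """Return True if `directory` is one of the interpreter's import roots."""
    entries = sys.path if path_entries is None else path_entries
    target = os.path.normpath(os.path.abspath(directory))
    return any(
        os.path.normpath(os.path.abspath(entry)) == target
        for entry in entries
        if entry  # skip the empty string, which stands for the current directory
    )


if __name__ == "__main__":
    # Path copied from the export line above; adjust it to your own checkout.
    models_dir = "/home/myhome/TypeBert/models/"
    print(models_dir, "on sys.path:", on_python_path(models_dir))
```

If this prints False, the export was most likely run in a different shell than the one launching Python.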


Clone the code from this repository with the git CLI. This can be done inside the Docker container or in a bind-mounted directory. git clone


We have uploaded compressed directories of the code datasets. Each data directory is made up of multiple smaller tf_record files. Each folder contains the same SentencePiece tokenizer model, trained on TypeScript. All SentencePiece-related files are prefixed with "ts_sp".
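After extracting a data directory, it can help to confirm that both the record shards and the tokenizer files are present. The sketch below sorts file names using only the conventions described above (the "ts_sp" prefix, and "tf_record" or ".tfrecord" in the name). The function name `split_data_dir` is hypothetical, and the exact set of files in each archive is not assumed.

```python
from pathlib import Path


def split_data_dir(data_dir):
    """Separate SentencePiece files ("ts_sp" prefix) from record shards."""
    tokenizer_files, record_files, other_files = [], [], []
    for path in sorted(Path(data_dir).iterdir()):
        if not path.is_file():
            continue  # ignore nested directories
        if path.name.startswith("ts_sp"):
            tokenizer_files.append(path.name)
        elif "tf_record" in path.name or path.suffix == ".tfrecord":
            record_files.append(path.name)
        else:
            other_files.append(path.name)
    return tokenizer_files, record_files, other_files
```

Calling `split_data_dir("/data2/typebertv2_data")` inside the container would return three lists of file names; an empty first or second list suggests the archive was extracted to a different folder than the one mounted.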


  • JavaScript Corpus: link
  1. First download the JavaScript corpus. Extract the .tgz file and add it to a folder that is visible to the container: tar -xvzf pretrain.tgz


  • TypeBert Type Data: link

Model Weights:

  • Pretrained Weights: link
  • Fine-tuning Weights on Type Dataset: link

Running evaluation on TypeBert data

In your Docker container, go to your TypeBert/type-bert directory. The commands in the sections below show roughly what evaluation and fine-tuning runs look like. The training batch size depends on how many GPUs you use; empirically, a batch size of 25 per GPU works.
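The 25-per-GPU rule of thumb translates into a batch size as in the sketch below. It assumes the value passed to --train_batch_size is the total across all GPUs, which the text above suggests but does not state outright; the helper name `global_batch_size` is hypothetical.

```python
def global_batch_size(num_gpus, per_gpu=25):
    """Total training batch size, assuming `per_gpu` examples on each GPU."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return num_gpus * per_gpu


if __name__ == "__main__":
    for gpus in (1, 2, 4):
        print(f"{gpus} GPU(s): --train_batch_size={global_batch_size(gpus)}")
```

Under this assumption, the --train_batch_size=100 used in the fine-tuning command corresponds to four GPUs.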

How to evaluate models

python --mode='predict' --input_meta_data_path=/data2/typebertv2_data/meta_data --train_data_path=/data2/typebertv2_data/train.tf_record --eval_data_path=/data2/typebertv2_data/test.tf_record --bert_config_file=/home/myhome/TypeBert/type-bert/bert_config.json --predict_checkpoint_path=/home/myhome/typebert_my_model/ckpt-272457 --train_batch_size=32 --eval_batch_size=32 --model_dir=/home/myhome/typebert_my_model/ --distribution_strategy=multi_worker_mirrored

This mode writes the test labels along with NumPy files of prediction indices and values. The test labels are the most useful for computing statistics such as top-1 accuracy. Top-5 accuracy, however, requires the probabilities for each class. We report the top 100 probabilities for each prediction; anything beyond that is not particularly useful for these metrics and makes the files intractably large.
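Top-k accuracy can be computed from the labels and the ranked class indices as sketched below. This is an illustration under stated assumptions, not the evaluation script used for the paper: it assumes each row of indices is ordered from most to least probable, the function name `top_k_accuracy` is hypothetical, and the small arrays are made-up toy data rather than real predictions.

```python
def top_k_accuracy(labels, top_indices, k):
    """Fraction of examples whose true label is among the first k predictions.

    `top_indices[i]` holds class ids for example i, most probable first
    (an assumption about the file layout, not confirmed by the source).
    """
    if len(labels) != len(top_indices):
        raise ValueError("labels and top_indices must have the same length")
    if len(labels) == 0:
        raise ValueError("no examples to score")
    hits = sum(
        1
        for label, ranked in zip(labels, top_indices)
        if label in list(ranked[:k])
    )
    return hits / len(labels)


if __name__ == "__main__":
    # Toy data: three examples, three ranked predictions each.
    labels = [3, 1, 7]
    top_indices = [[3, 5, 1], [2, 1, 0], [4, 6, 9]]
    for k in (1, 2, 3):
        print(f"top-{k} accuracy:", top_k_accuracy(labels, top_indices, k))
```

The function accepts plain lists as well as arrays loaded with numpy.load, since it only iterates over rows and slices them.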

How to Fine Tune

Fine-tuning refines the TypeBert model weights for the type inference task, which is framed as sequence tagging. It starts with a model pretrained on JavaScript and tunes it to TypeScript.

python --mode='train_and_eval' --input_meta_data_path=/data2/typebertv2_data/meta_data --train_data_path=/data2/typebertv2_data/train.tf_record --eval_data_path=/data2/typebertv2_data/test.tf_record --bert_config_file=/home/myhome/TypeBert/type-bert/bert_config.json --init_checkpoint=/data2/typebert_pretrain_weights/bert_model_step_100000.ckpt-10 --train_batch_size=100 --eval_batch_size=100 --steps_per_loop=1 --learning_rate=2e-5 --num_train_epochs=4 --model_dir=/home/myhome/typebert_my_model --distribution_strategy=multi_worker_mirrored

How to Pretrain From Scratch

Pretrain a BERT architecture for JavaScript. This uses the masked language modeling (MLM) and next sentence prediction (NSP) tasks to refine the model's token predictions. We have already done this costly step for you and uploaded the model weights.

python --input_files=/data2/myhome/pretrain/*.tfrecord --bert_config_file=/home/myhome/TypeBert/type-bert/bert_config.json --distribution_strategy=multi_worker_mirrored --model_dir=/home/myhome/my_typebert_pretrain/ --num_gpus=6 --train_batch_size=100

