tokenizers-neural-nets-2020-paper

This repository contains the notebooks, models, pickled datasets, and other supporting materials used during the research for the paper "Automated Source Code Generation and Auto-completion Using Deep Learning: Comparing and Discussing Current Language-Model-Related Approaches".

For more information about the purpose of each file and model, please refer to the paper.

📓 Notebooks

The notebooks related to data cleaning and preparation are (see the sketch after the list):

  • downloading_cleaning_codeSearchNetChallenge_dataset.ipynb
  • preparing_csv_files_from_codeSearchNetChallenge_dataset.ipynb
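As a rough illustration of what these notebooks do, the sketch below reads one CodeSearchNet Python split into a DataFrame and dumps the code column to CSV. The file path assumes the dataset's usual jsonl.gz layout, and both file names are illustrative rather than the notebooks' exact logic.

```python
import pandas as pd

# CodeSearchNet ships each split as gzipped JSON-lines files; the "code"
# column holds the raw function source used to train the language models.
df = pd.read_json(
    "python/final/jsonl/train/python_train_0.jsonl.gz",
    lines=True,
    compression="gzip",
)
df[["code"]].to_csv("python_train_code.csv", index=False)
```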

The notebooks related to the training of AWD-LSTM neural networks are (see the sketch after the list):

  • 20200811_word_model_training_lstm.ipynb
  • 20200811_unigram_model_training_lstm.ipynb
  • 20200811_bpe_model_training_lstm.ipynb
  • 20200811_char_model_training_lstm.ipynb
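A minimal sketch of this kind of setup, assuming fastai v1's text API and a CSV of prepared source code (the file and column names are illustrative, and the notebooks' actual hyperparameters may differ):

```python
from fastai.text import *

# Build a language-model data bunch from prepared code and train an
# AWD-LSTM language model for one cycle, from scratch in this sketch.
data_lm = TextLMDataBunch.from_csv(".", "python_train_code.csv", text_cols="code")
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=False)
learn.fit_one_cycle(1, 1e-2)
```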

The notebooks related to the training of AWD-QRNN neural networks are (see the sketch after the list):

  • 20200820_word_model_training_qrnn.ipynb
  • 20200820_unigram_model_training_qrnn.ipynb
  • 20200820_bpe_model_training_qrnn.ipynb
  • 20200820_char_model_training_qrnn.ipynb
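In fastai v1 the QRNN variant differs from the AWD-LSTM one only in the architecture config, so a sketch under the same assumptions as above could look like this:

```python
from fastai.text import *

# Same data as the AWD-LSTM sketch; only the architecture config changes.
# fastai v1 exposes a qrnn flag on the AWD-LSTM language-model config.
data_lm = TextLMDataBunch.from_csv(".", "python_train_code.csv", text_cols="code")
config = awd_lstm_lm_config.copy()
config["qrnn"] = True
learn = language_model_learner(data_lm, AWD_LSTM, config=config,
                               drop_mult=0.3, pretrained=False)
learn.fit_one_cycle(1, 1e-2)
```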

The notebooks related to the training of the Transformer models GPT-2, BERT and RoBERTa are:

  • 20200813_gpt2.ipynb
  • 20200817_bert.ipynb
  • 20200824_roberta.ipynb

📖 Models from tokenization

The models and vocab files produced by each tokenization process and used with the AWD-LSTM and AWD-QRNN neural nets are (see the sketch after the list):

  • word_model.model
  • word_model.vocab
  • unigram_model.model
  • unigram_model.vocab
  • bpe_model.model
  • bpe_model.vocab
  • char_clean_code_challenge_model.model
  • char_clean_code_challenge_model.vocab
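These .model/.vocab pairs look like SentencePiece artifacts; assuming that, a snippet of source code can be round-tripped through one of them like this (the choice of the unigram tokenizer is illustrative):

```python
import sentencepiece as spm

# Load the unigram tokenizer and encode/decode a snippet of Python source.
sp = spm.SentencePieceProcessor()
sp.Load("unigram_model.model")

pieces = sp.EncodeAsPieces("def add(a, b):\n    return a + b")
print(pieces)                   # subword pieces
print(sp.DecodePieces(pieces))  # reconstructed source
```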

The models and vocab files used for GPT-2, BERT, and RoBERTa are under the folders named (see the loading sketch after the list):

  • 20200813_gpt2-CodeSearchNet-fine-tuned
  • 20200817_fit_head_bert_model-CodeSearchNet
  • 20200817_fine_tuned_bert_model-CodeSearchNet
  • 20200813_roberta-CodeSearchNet_fit_head
  • 20200813_roberta-CodeSearchNet_fine_tuned
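Assuming these folders hold standard Transformers save_pretrained() output, a fine-tuned checkpoint can be reloaded and sampled roughly like this (GPT-2 shown; the BERT and RoBERTa folders would use a masked-LM head instead of a causal one):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the fine-tuned GPT-2 checkpoint and complete a Python prompt.
model_dir = "20200813_gpt2-CodeSearchNet-fine-tuned"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_length=40, do_sample=False)
print(tokenizer.decode(outputs[0]))
```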

There is also a Python script with tokenization utilities, tokenizer_utils.py, which is used in some of the notebooks.

🤖 Trained models

Every model trained with the fastai library is available in the Zenodo record 10.5281/zenodo.4293857; the name of each file matches the names used in the notebooks listed above. The models stored with the 🤗 HuggingFace Transformers library are in the folders 20200813_gpt2-CodeSearchNet-fine-tuned, 20200817_fit_head_bert_model-CodeSearchNet, 20200817_fine_tuned_bert_model-CodeSearchNet, 20200813_roberta-CodeSearchNet_fit_head, and 20200813_roberta-CodeSearchNet_fine_tuned.
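Assuming the Zenodo files are fastai exports (learner.export()), they can be reloaded for inference roughly like this; the file name below is illustrative, not one of the actual record's file names:

```python
from fastai.text import *

# Reload an exported fastai language model and generate a few tokens.
learn = load_learner(".", "bpe_model_lstm_export.pkl")
print(learn.predict("def add(a, b):", n_words=10))
```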

✅ Results

The CSV files with the results of the training processes are located under the results folder. The notebook that analyzes them is analyzing_results.ipynb.
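A quick way to inspect one of them, assuming nothing about the column layout (the file name is illustrative):

```python
import pandas as pd

# Peek at one of the training-result CSVs and list its columns.
results = pd.read_csv("results/training_results.csv")
print(results.head())
print(results.columns.tolist())
```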
