MODE-X

This repo provides the code for reproducing the experiments in On The Cross-Modal Transfer from Natural Language to Code through Adapter Modules. We explored the bimodality of adapter modules to facilitate cross-modal transfer from large pre-trained neural language models to other language modalities, i.e. source code. Specifically, we trained adapters on three programming languages (Python, Java and C/C++) for the pre-trained RoBERTa language model, and tested them on the downstream task of Code Clone Detection. We also tested the semantic and syntactic representation learned by the adapter modules using appropriate Cloze Style Testing for six programming languages.

Tasks and Datasets

Below, we elaborate on the task definition for each task and the newly introduced dataset for Code Clone Detection. The code and instructions to replicate our experiments for each task can be found in the corresponding directories.

  1. Language Adapter Pretraining (CodeNet, CodeSearchNet). Adapter modules are initialized for each layer of the transformer language model while the weights of the pretrained model are frozen. The combined model is then trained with a masked-language-modeling (MLM) objective over benchmark programming language datasets from Project CodeNet and CodeSearchNet; only the adapter weights are updated through standard backpropagation (see the sketch after this list).

  2. Cloze test (CT-all, CT-max/min). A model is tasked with predicting a masked token in code, formulated as a multi-choice classification problem. The two datasets are taken from CodeXGLUE: one with candidates drawn from the (filtered) vocabulary and the other with candidates limited to “max” and “min”.

  3. Clone detection (BigCloneBench, POJ-104, SCD-88). A model is tasked with measuring the semantic similarity between code snippets. Two existing datasets are included, and a new dataset for Python-specific code clone detection (SCD-88) is introduced. BigCloneBench is a binary classification task over code pairs, while the other two involve retrieving semantically similar code given a code snippet as the query.
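
To make step (1) concrete, below is a minimal sketch of the adapter pretraining setup using the adapter-transformers API; the adapter name, the toy code snippets, and the training hyper-parameters are placeholders, not the configuration used in the paper.

from transformers import (
    RobertaForMaskedLM,
    RobertaTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Add a fresh language adapter ("python_adapter" is a placeholder name),
# freeze the RoBERTa weights, and train only the adapter parameters
model.add_adapter("python_adapter")
model.train_adapter("python_adapter")
model.set_active_adapters("python_adapter")

# Toy corpus standing in for the CodeNet / CodeSearchNet code snippets
code_snippets = [
    "def add(a, b):\n    return a + b",
    "for i in range(10):\n    print(i)",
]
train_dataset = [tokenizer(s, truncation=True, max_length=128) for s in code_snippets]

# Standard MLM objective: 15% of the tokens are masked at random
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="adapter_mlm", num_train_epochs=1,
                           per_device_train_batch_size=2),
    data_collator=data_collator,
    train_dataset=train_dataset,
)
trainer.train()

# Save only the adapter weights for later reuse
model.save_adapter("adapter_mlm/python_adapter", "python_adapter")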

Dependencies

  • python 3.6 or 3.7
  • torch>=1.5.0
  • adapter-transformers>=4.8.2
  • scikit-learn

Build from Source

  1. Clone this repository.
    git clone https://github.com/fardfh-lab/NL-Code-Adapter.git
    cd NL-Code-Adapter

  2. Create a python virtual environment to run your experiments.
    python -m venv adapters
    source adapters/bin/activate

  3. Install the requirements given in requirements.txt.
    pip install --upgrade pip
    pip install -r requirements.txt

  4. Change working directory to run the desired experiment.
    cd ClozeTest

Quick Tour

We used the adapter-transformers framework to train the adapter modules. Here we provide an example of the basic setup to add our adapters to a pre-trained RoBERTa base. To load the adapters directly from the AdapterHub:

import torch
from transformers import RobertaModel, RobertaTokenizer

# Load pre-trained RoBERTa model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

# Load a language adapter; load_adapter returns the adapter's name
adapter_name = model.load_adapter("adapter_path_on_hub")
model.set_active_adapters(adapter_name)
model.to(device)

Alternatively, you can download the trained adapters to your local drive and load each adapter with model.load_adapter(path). Remember to activate the modules after loading them from the path, as shown in the sketch below.
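
For instance, a minimal sketch assuming the adapter was downloaded and unpacked to a hypothetical local directory ./adapters/python:

from transformers import RobertaModel

model = RobertaModel.from_pretrained("roberta-base")

# "./adapters/python" is a placeholder for your local download directory
adapter_name = model.load_adapter("./adapters/python")  # returns the adapter's name
model.set_active_adapters(adapter_name)                 # activate it before use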

Probing

The RoBERTa base model is not suitable for mask prediction (probing) and hence cloze testing. A more suitable choice is the RoBERTa model with an MLM head attached on top. Here we provide a simple example of how to prepare RoBERTa for probing tasks.

import torch
from transformers import RobertaForMaskedLM, RobertaTokenizer, pipeline

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Load language adapters; load_adapter returns the adapter's name
adapter_name = model.load_adapter("adapter_path_on_hub")
model.set_active_adapters(adapter_name)
model.to(device)

# Build a fill-mask pipeline on the same device as the model
CODE = "if (x is not None) <mask> (x>1)"
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer,
                     device=0 if torch.cuda.is_available() else -1)

outputs = fill_mask(CODE)
print(outputs)

Downstream Tasks

We use language adapters to adapt a pre-trained RoBERTa language model to source code. Additionally, we add task-specific adapters to the resultant language model for downstream tasks. The resulting framework is termed MODE-X. Here we provide an example of how to prepare MODE-X for training on a downstream task.

import torch
from transformers import RobertaModel, RobertaTokenizer, AdapterConfig
import transformers.adapters.composition as ac

# Load pre-trained RoBERTa model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

# Load the language adapter; load_adapter returns the adapter's name
lang_adapter = model.load_adapter("language_adapter_path_on_hub")

# Add task adapters
task = "adapter_name"
adapter_config = AdapterConfig.load(
    "pfeiffer",
    non_linearity="gelu",
    reduction_factor=16,
)
model.add_adapter(task, config=adapter_config)

# Activate both adapters: the language adapter first, then the task adapter on top
model.active_adapters = ac.Stack(lang_adapter, task)

# Set task adapter for training
model.train_adapter([task])

You can then add a custom task-specific head to get the final outputs from the model, similar to adding an MLP over RoBERTa for classification tasks.
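
For illustration, here is a sketch of such a head, continuing from the snippet above; the pooling strategy (first token) and the layer sizes are assumptions rather than the exact heads used in our experiments.

import torch.nn as nn

class ModeXClassifier(nn.Module):
    """Illustrative MLP head over the adapter-equipped RoBERTa encoder."""

    def __init__(self, encoder, hidden_size=768, num_labels=2):
        super().__init__()
        self.encoder = encoder  # RobertaModel with language + task adapters active
        self.head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Dropout(0.1),
            nn.Linear(hidden_size, num_labels),
        )

    def forward(self, input_ids, attention_mask=None):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = outputs.last_hidden_state[:, 0]  # representation of the first (<s>) token
        return self.head(cls)

classifier = ModeXClassifier(model)
batch = tokenizer(["def add(a, b): return a + b"], return_tensors="pt")
logits = classifier(**batch)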

Pretrained Adapters

We pre-trained adapter modules for RoBERTa on three programming languages (Python, Java and C/C++) spanning two benchmark datasets (Project CodeNet and CodeSearchNet). The pre-trained modules can be loaded directly from the AdapterHub or can be downloaded from our lab's website.

Cite

If you use this code or our pre-trained adapter modules, please cite our paper:

@inproceedings{goel2022cross,
  title={On the cross-modal transfer from natural language to code through adapter modules},
  author={Goel, Divyam and Grover, Ramansh and Fard, Fatemeh H},
  booktitle={Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension},
  pages={71--81},
  year={2022}
}
