XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence

Introduction

Recent advances in machine learning have benefited a number of code-related tasks, such as code translation, code summarization, and code synthesis. Open-source code repository websites like GitHub provide an enormous amount of source code data, which enables the training of large-scale code language models such as CodeBERT (Feng et al., 2020), PLBART (Ahmad et al., 2021a), TransCoder (Roziere et al., 2020), and CodeT5 (Wang et al., 2021). Although open-source code data is abundant, it has several shortcomings as training data for code-related models. Most notably, the bulk of it is unlabeled, whereas tasks like Code Translation, Code Summarization, and Code Synthesis require high-quality parallel data for model training.

We introduce XLCoST, a machine learning benchmark dataset that contains fine-grained parallel data in 7 commonly used programming languages (C++, Java, Python, C#, Javascript, PHP, C) and natural language (English). The data is parallel across all 7 languages at both the code snippet level and the program level: given a program in one language, the dataset contains the same program in up to 6 other programming languages. Each program is divided into several code snippets, and programs in all languages are aligned at the snippet level. Moreover, each snippet is accompanied by a comment, and the comment for a particular snippet is the same across all languages. Please find the full paper here.
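As a rough illustration of this alignment (the dictionary layout, snippet contents, and helper name below are hypothetical, not the repository's actual file format), snippet-level parallel data can be thought of like this: each language maps to an ordered list of snippets, and snippet i is parallel across all languages and shares one comment.

```python
# Hypothetical sketch of snippet-level alignment (NOT XLCoST's on-disk format):
# for one problem, each language maps to an ordered list of snippets, and the
# i-th snippets are mutually parallel and share the i-th comment.
problem = {
    "comments": ["Function to add two numbers", "Driver code"],
    "Python": ["def add(a, b):\n    return a + b", "print(add(2, 3))"],
    "Java": [
        "static int add(int a, int b) { return a + b; }",
        "public static void main(String[] a) { System.out.println(add(2, 3)); }",
    ],
}

def snippet_pairs(problem, src, tgt):
    """Return aligned (comment, source_snippet, target_snippet) triples."""
    return list(zip(problem["comments"], problem[src], problem[tgt]))

pairs = snippet_pairs(problem, "Python", "Java")  # 2 aligned triples
```

The same index-wise pairing extends to any of the up-to-7 languages available for a problem, which is what makes pairwise translation and cross-lingual search data cheap to derive.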

The figure below shows a schematic diagram of how the dataset is organised and the possible tasks that can be performed with it.

Tasks

We introduce the following 10 cross-lingual tasks. All tasks have pairwise data at both the snippet level and the program level in 7 programming languages: C++, Java, Python, C#, Javascript, PHP, and C. The tasks fall into two categories, generation and retrieval. The generation tasks are Code Translation, Code Summarization, and Code Synthesis; the retrieval tasks are NL (natural language) Code Search and XL (cross-lingual) Code Search. We use 3 state-of-the-art baselines for the generation tasks and 2 for the retrieval tasks.

| Category | Task | Subtask | Data (train/valid/test) | Description | Baselines |
| --- | --- | --- | --- | --- | --- |
| Generation | Code-to-Code | Snippet Translation | 872K/47K/83K | Translate code snippets across programming languages | CodeBERT (enc-dec), PLBART, CodeT5 |
| | | Program Translation | 106K/6K/11K | Translate programs across programming languages | |
| | Code-to-Text | Snippet Summarization | 446K/22K/41K | Generate a comment for a given code snippet | |
| | | Program Summarization | 50K/3K/5K | Generate a problem description for a given program | |
| | Text-to-Code | Snippet Synthesis | 446K/22K/41K | Generate a code snippet from a given comment | |
| | | Program Synthesis | 50K/3K/5K | Generate a program from a given problem description and comments | |
| Retrieval | NL Code Search | Comment-to-Snippet Search | 446K/22K/41K | Retrieve the code snippet for a given comment | RoBERTa, CodeBERT |
| | | Problem-to-Program Search | 50K/3K/5K | Retrieve the program for a given problem description | |
| | XL Code Search | Snippet-to-Snippet Search | 872K/47K/83K | Retrieve code snippets in other languages for a given snippet | |
| | | Program-to-Program Search | 106K/6K/11K | Retrieve programs in other languages for a given program | |

How to use this repository

Use the requirements.txt file to set up your environment.
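A minimal setup sketch, assuming Python 3 with the venv module available (the environment name is arbitrary):

```shell
# Create an isolated environment and install the pinned dependencies.
python3 -m venv xlcost-env
. xlcost-env/bin/activate
# Run from the repository root, where requirements.txt lives.
pip install -r requirements.txt
```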

Code for this repository has been adapted from CodeXGLUE and PLBART.

Instructions to run the generation tasks can be found here.

Instructions to run the code search tasks can be found here.

Data

The data can be downloaded here.

Data Description (Metadata)

Details about the data files and metadata can be found here.

Statistics

Some basic averaged statistics of the dataset are presented below ("#" means "number of"). Note that #comments/program is the same as #snippets/program. (JS is short for Javascript; desc for description.)

| | C++ | Java | C# | Python | JS | PHP | C | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| # tokens/snippet | 21.52 | 24.1 | 21.63 | 23.06 | 22.52 | 28.14 | 25.37 | 22.83 |
| # tokens/program | 204.97 | 227.09 | 188.54 | 215.29 | 184.63 | 163.51 | 197.95 | 201.96 |
| # tokens/comment | 8.25 | 8.14 | 7.97 | 8.23 | 7.96 | 8.45 | 9.67 | 8.15 |
| # tokens/desc | 10.68 | 10.67 | 10.75 | 10.7 | 10.87 | 9.91 | 8.19 | 10.66 |
| # snippets/program | 9.52 | 9.42 | 8.51 | 9.33 | 8.2 | 5.81 | 7.77 | 8.81 |
| # lines/snippet | 3.41 | 3.71 | 2.41 | 3.82 | 3.23 | 4 | 4.05 | 3.37 |
| # lines/program | 32.45 | 34.93 | 20.54 | 35.64 | 26.47 | 23.23 | 31.5 | 29.71 |
| total snippets | 106,397 | 103,703 | 92,446 | 100,032 | 81,511 | 20,639 | 4,363 | - |
| total programs | 11,198 | 11,028 | 10,622 | 10,735 | 9,951 | 3,553 | 574 | - |

The numbers of pairwise code-code samples in the training, validation, and test splits for each language pair are presented in the following table. The upper triangle shows the number of parallel code snippets, and the lower triangle shows the number of parallel programs; for example, the C++/Java training split contains 89,040 parallel snippets and 9,450 parallel programs. This data is used for the Code Translation and XL Code Search tasks. (JS is short for Javascript.)

| Code-Code Pairs | Split | C++ | Java | Python | C# | JS | PHP | C |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| C++ | train | | 89,040 | 80,100 | 85,662 | 69,507 | 17,811 | 3,386 |
| | val | | 4,419 | 3,913 | 4,408 | 3,808 | 923 | 352 |
| | test | | 8,059 | 7,228 | 7,922 | 6,965 | 1,647 | 222 |
| Java | train | 9,450 | | 77,759 | 87,065 | 69,341 | 17,853 | 2,996 |
| | val | 490 | | 3,938 | 4,437 | 3,826 | 929 | 353 |
| | test | 901 | | 7,259 | 8,011 | 7,005 | 1,672 | 238 |
| Python | train | 9,139 | 8,991 | | 75,843 | 67,219 | 17,616 | 2,478 |
| | val | 468 | 471 | | 3,922 | 3,750 | 923 | 311 |
| | test | 878 | 882 | | 7,215 | 6,861 | 1,655 | 203 |
| C# | train | 9,187 | 9,301 | 8,826 | | 68,093 | 17,873 | 2,958 |
| | val | 488 | 491 | 470 | | 3,826 | 928 | 352 |
| | test | 890 | 898 | 877 | | 6,961 | 1,668 | 238 |
| JS | train | 8,482 | 8,470 | 8,182 | 8,367 | | 17,117 | 1,875 |
| | val | 472 | 475 | 459 | 475 | | 921 | 309 |
| | test | 878 | 881 | 864 | 877 | | 1,617 | 200 |
| PHP | train | 3,056 | 3,68 | 3,003 | 3,071 | 2,971 | | 856 |
| | val | 157 | 158 | 153 | 158 | 157 | | 271 |
| | test | 303 | 307 | 304 | 307 | 302 | | 183 |
| C | train | 402 | 409 | 380 | 394 | 308 | 170 | |
| | val | 59 | 59 | 59 | 59 | 59 | 55 | |
| | test | 45 | 49 | 48 | 49 | 49 | 43 | |
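To make the triangular layout concrete, here is a minimal Python sketch (the helper name is ours; the counts are a small subset copied from the table above) of how an unordered language pair maps to its two training counts:

```python
# A few training-split counts copied from the table above.
# Upper triangle of the table = parallel code snippets;
# lower triangle = parallel programs. A language pair is unordered.
SNIPPET_TRAIN = {("C++", "Java"): 89_040, ("C++", "Python"): 80_100, ("Java", "Python"): 77_759}
PROGRAM_TRAIN = {("C++", "Java"): 9_450, ("C++", "Python"): 9_139, ("Java", "Python"): 8_991}

def train_pairs(lang_a, lang_b):
    """Return (snippet_pairs, program_pairs) for an unordered language pair."""
    # Normalize the key order; for the three languages included in this
    # subset, alphabetical order matches the table's row/column order.
    key = tuple(sorted((lang_a, lang_b)))
    return SNIPPET_TRAIN[key], PROGRAM_TRAIN[key]

snippets, programs = train_pairs("Java", "C++")  # same counts as ("C++", "Java")
```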

The numbers of pairwise code-text samples in each language are presented in the table below. "Snippet" means snippet-comment pairs, and "Program" means program-description (problem description) pairs. This data is used for the Code Summarization (Code-to-Text), Code Synthesis (Text-to-Code), and NL Code Search tasks.

| NL-Code Pairs | Split | C++ | Java | Python | C# | JS | PHP | C | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Snippet | train | 93,847 | 91,089 | 81,207 | 87,583 | 70,649 | 18,027 | 3,763 | 446,165 |
| | valid | 4,432 | 4,460 | 3,946 | 4,436 | 3,829 | 930 | 350 | 22,383 |
| | test | 8,118 | 8,154 | 7,293 | 8,013 | 7,033 | 1,682 | 250 | 40,543 |
| Program | train | 9,797 | 9,623 | 9,263 | 9,345 | 8,590 | 3,087 | 463 | 50,168 |
| | valid | 492 | 494 | 472 | 491 | 475 | 158 | 60 | 2,642 |
| | test | 909 | 911 | 887 | 899 | 886 | 308 | 51 | 4,851 |

With the release of this dataset, we hope to enable more research in the domain of deep learning for software engineering. We believe this dataset is a valuable asset for the research community and can benefit a number of code-related research problems.

Citation

If you use this dataset in your work, please consider citing us. The arXiv version of the paper can be found here.

@misc{zhu2022xlcost,
  title = {XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence},
  author = {Zhu, Ming and Jain, Aneesh and Suresh, Karthik and Ravindran, Roshan and Tipirneni, Sindhu and Reddy, Chandan K.},
  year = {2022},
  eprint = {2206.08474},
  archivePrefix = {arXiv},
  url = {https://arxiv.org/abs/2206.08474}
}