Code-to-Text Datasets

This directory contains the data and resources for the code-to-text experiments of Richardson and Kuhn (ACL 2017 and EMNLP 2017; see the citations below).

What's included

UPDATED 26.3.2018: under /other_data you will find polyglot_data, which was used for a forthcoming NAACL paper (see references below).

All of the current ACL data is included in data/. The EMNLP data is included in other_data/py27.

The data consists of textual descriptions of source code representations (mostly function signatures) across several natural and programming languages. The experiments in the papers above look at learning to translate these text descriptions to code representations, or more simply text -> code.

In each case, you will find the following files for each project (name stands for the project name):

Filename	Description
name.{e,f}	Training splits with extra data and pseudolex.
name_bow.{e,f}	Training splits without extra data
name_pseudo.{e,f}	Training splits with pseudo lexicon
name_valid.{e,f}	Validation split
name_test.{e,f}	Test split
rank_list.txt	Output representations tokenized
rank_list_orig.txt	Original output representations, without preprocessing (camel case, hyphens, uppercase, etc. preserved)
rank_list_class.txt	Abstract class sequences for output
rank_list_tree.txt	Syntax information about the output representations
descriptions.txt	Output symbols with associated words
extra_pairs.txt	The extra data used above, taken from API
pseudolex.txt	Output symbols mapped to themselves.
grammar.txt	A set of grammar rules for hiero decoding
hiero_rules.txt	Hierarchical phrase rules extracted from training
phrase_table.txt	Phrase rules extracted from training
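
As a minimal sketch of how the split files listed above might be loaded, the snippet below pairs up the lines of a .e file and a .f file. It assumes the two files are line-aligned (line i of one corresponds to line i of the other), which this README does not state explicitly; which side holds the text and which holds the code representation should be checked against the files themselves. The function name read_parallel and its parameters are hypothetical and are not part of this repository.

```python
from pathlib import Path
from typing import List, Tuple


def read_parallel(data_dir: str, name: str, split: str = "") -> List[Tuple[str, str]]:
    """Return (e_line, f_line) pairs for one split of one project.

    `split` is the filename suffix from the table, e.g. "" (training with
    extra data), "_bow", "_pseudo", "_valid" or "_test".
    ASSUMPTION: the .e and .f files are line-aligned.
    """
    base = Path(data_dir)
    e_path = base / f"{name}{split}.e"
    f_path = base / f"{name}{split}.f"

    with open(e_path, encoding="utf-8") as e_file:
        e_lines = [line.rstrip("\n") for line in e_file]
    with open(f_path, encoding="utf-8") as f_file:
        f_lines = [line.rstrip("\n") for line in f_file]

    # Fail loudly rather than silently pairing up misaligned lines.
    if len(e_lines) != len(f_lines):
        raise ValueError(
            f"{e_path} has {len(e_lines)} lines but {f_path} has {len(f_lines)}"
        )
    return list(zip(e_lines, f_lines))
```

For example, read_parallel("other_data/py27/nltk", "nltk", "_test") would read nltk_test.e and nltk_test.f, provided the project files follow the naming scheme in the table.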

Warning: The data is relatively noisy. These particular files come directly from our model; other users of the data might decide to make different decisions about how the code is represented. If you need more information, please write to the email address below.

The zipped files in the upper directory (acl_emnlp.zip) include files used for reproducing previous experiments using the Zubr toolkit. Please see the following to learn more: https://github.com/yakazimir/zubr_public

Alternative Signature Formats

Recently, we've been thinking about normalizing the function signature representations and mapping them into logical representations. Details can be found in a brief technical report: https://arxiv.org/abs/1804.00987

To facilitate the ideas in this note, we have a simple script in bin/ for converting signatures to alternative representations. For example, typing the following

python bin/formatter.py --data_loc other_data/py27/nltk/rank_list_orig.txt --format lisp

will convert the NLTK target representations (provided in a tabular format) to a lisp-like FOL representation.
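
To apply the same conversion to several projects at once, the command above can be wrapped in a small driver. The sketch below only uses the two flags shown in this README (--data_loc and --format lisp). It assumes that every project under other_data/py27 keeps its rank_list_orig.txt in its own subdirectory, as the nltk example path suggests, and that the script is run from the repository root. The helper names formatter_command and convert_all are hypothetical. Where the script writes its output is not specified here, so the sketch does not try to capture it.

```python
import subprocess
from pathlib import Path
from typing import List


def formatter_command(rank_list: Path, fmt: str = "lisp") -> List[str]:
    """Build the argument list for one call to bin/formatter.py."""
    return ["python", "bin/formatter.py", "--data_loc", str(rank_list), "--format", fmt]


def convert_all(root: str = "other_data/py27", dry_run: bool = True) -> List[List[str]]:
    """Collect (and optionally run) one formatter command per project.

    ASSUMPTION: each project lives in <root>/<project>/rank_list_orig.txt.
    With dry_run=True nothing is executed; the commands are only returned.
    """
    rank_lists = sorted(Path(root).glob("*/rank_list_orig.txt"))
    commands = [formatter_command(path) for path in rank_lists]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError if the script exits non-zero.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    for command in convert_all():
        print(" ".join(command))
```

Running the file as is prints the commands without executing them; passing dry_run=False runs them one by one.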

Code retrieval and Question Answering, Text Generation

We have also used these resources for studying source code retrieval and question answering. See information below:

online demo

References

If you use the polyglot data, please cite the following:

@inproceedings{richardson-berant:2018,
  author    = {Richardson, Kyle  and Berant, Jonathan and  Kuhn, Jonas},
  title     = {Polyglot {S}emantic {P}arsing in {API}s},
  booktitle = {Proceedings of NAACL (to appear)},
  year      = {2018},
  url={https://arxiv.org/abs/1803.06966},
  }

If you use other resources, please cite the following (the second one if you use the Py27 dataset or our extractor tool):

@inproceedings{richardson-kuhn:2017:Long,
  author    = {Richardson, Kyle  and  Kuhn, Jonas},
  title     = {Learning {S}emantic {C}orrespondences in {T}echnical {D}ocumentation},
  booktitle = {Proceedings of the ACL},
  year      = {2017},
  url={http://aclweb.org/anthology/P/P17/P17-1148.pdf},
  }

@inproceedings{richardson-kuhn:2017:Demo,
  author    = {Richardson, Kyle  and  Kuhn, Jonas},
  title     = {Function {A}ssistant: {A} {T}ool for {NL} {Q}uerying of {API}s},
  booktitle = {Proceedings of the EMNLP},
  year      = {2017},
  }

You might also consider citing the following, which is where the Unix and Java portions of the data originally come from (respectively):

@inproceedings{richardson2014unixman,
 title={UnixMan {C}orpus: A {R}esource for {L}anguage {L}earning in the {U}nix {D}omain.},
 author={Richardson, Kyle and Kuhn, Jonas},
 booktitle={Proceedings of LREC},
 year={2014},
 url={http://www.lrec-conf.org/proceedings/lrec2014/pdf/823_Paper.pdf},
}

@inproceedings{deng2014semantic,
 title={Semantic approaches to software component retrieval with English queries.},
 author={Deng, Huijing and Chrupa\l{}a, Grzegorz},
 booktitle={Proceedings of LREC},
 year={2014},
 url={http://www.lrec-conf.org/proceedings/lrec2014/pdf/106_Paper.pdf},
 }

Contact

If you have any questions, or find errors, please write to the address below:

kyle@ims.uni-stuttgart.de

website
