tud-ccc/compy-learn

ComPy-Learn

ComPy-Learn is a framework for defining and exploring program representations for machine learning on source code (ML4CODE) tasks. Although its focus is on compiler optimization tasks, ComPy-Learn can also be used in other domains such as software engineering or systems security.

Project goals

  • Exploration of the best-performing code representation and model: Depending on the task, different representations and models have proven suitable to different degrees. Finding the best-performing combination is not obvious and currently requires empirical evaluation. ComPy-Learn provides a common framework for evaluating different representations on a given task to find the best-performing one.
  • Design and discovery of new representations: Custom, task-specific representations of code can improve a model's performance. However, extracting representations of program code is tedious and requires low-level development with compiler tools. We aim to remove this burden by letting users define program representations through a simple, high-level programming interface, which allows easier design and faster iteration.
  • Common tools, evaluation pipeline, and datasets: Several promising representations, and models that learn embeddings from them, have been proposed recently. However, each uses its own tools and evaluation pipeline, which makes comparisons against those methods time-consuming and difficult. ComPy-Learn provides a common framework for representations, models, and datasets, and allows their combinations to be evaluated. Implementing a novel representation and model in this framework lets researchers run a complete evaluation with little effort and, at the same time, contributes another widely applicable method to the community.

Design

ComPy-Learn's main components are shown in the pipeline below:

  • compy.representation allows the user to define custom representations of source code (such as the ones from published work) based on the semantic, compiler-internal information that is available, currently from the Clang/LLVM framework. Both linear and graph representations of code are supported.
  • compy.model contains ML models (more precisely, connectors to well-established model libraries) that embed the representations into vectors and then output a prediction.
  • compy.dataset contains datasets of source code for evaluation, along with helper functions that allow integration of new datasets.
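
The three roles above can be illustrated with a small, self-contained sketch. It does not use ComPy-Learn's API: every name in it (DATASET, token_sequence, token_graph, BagOfItemsClassifier, evaluate) is hypothetical, and the regex tokenizer merely stands in for the Clang/LLVM-based extraction. It only shows how a dataset, interchangeable representations, and a model can be combined and compared.

```python
import re
from collections import Counter

# Hypothetical toy dataset of (source code, label) pairs.
DATASET = [
    ("for (int i = 0; i < n; i++) a[i] = b[i] + c[i];", "loop"),
    ("while (k < n) { s += a[k]; k++; }", "loop"),
    ("if (x > 0) y = x; else y = -x;", "branch"),
    ("if (p) return q; else return r;", "branch"),
]


def token_sequence(source):
    """Linear representation: identifiers, numbers, and single symbols in order."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)


def token_graph(source):
    """Graph representation: one edge per pair of consecutive tokens."""
    tokens = token_sequence(source)
    return list(zip(tokens, tokens[1:]))


class BagOfItemsClassifier:
    """Embeds a representation as item counts and predicts the class
    whose summed training counts overlap most with that embedding."""

    def fit(self, reps, labels):
        self.profiles = {}
        for rep, label in zip(reps, labels):
            self.profiles.setdefault(label, Counter()).update(rep)
        return self

    def predict(self, rep):
        vec = Counter(rep)

        def overlap(label):
            profile = self.profiles[label]
            return sum(min(vec[item], profile[item]) for item in vec)

        return max(self.profiles, key=overlap)


def evaluate(represent, model_cls, dataset):
    """Leave-one-out accuracy of one representation/model combination."""
    correct = 0
    for i, (source, label) in enumerate(dataset):
        train = dataset[:i] + dataset[i + 1:]
        model = model_cls().fit(
            [represent(src) for src, _ in train],
            [lab for _, lab in train],
        )
        correct += model.predict(represent(source)) == label
    return correct / len(dataset)


# Compare every representation with the same model on the same dataset.
results = {
    fn.__name__: evaluate(fn, BagOfItemsClassifier, DATASET)
    for fn in (token_sequence, token_graph)
}
print(results)
```

Because each representation is just a function from source code to a sequence of hashable items, adding another one to the comparison means adding one entry to the final loop.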

Supported representations

Currently, the following representations and models from published work are implemented in this framework:

Installation

We supply an installation script that automates the build, test, and installation process. The script currently supports the platforms listed below. Because the process builds ComPy-Learn from its sources, other platforms can be used with a bit of manual installation effort.

  • Ubuntu 16.04
  • Ubuntu 18.04
  • Ubuntu 20.04

To get started on one of the supported platforms, we suggest creating a virtual environment first and then running:

./install_deps.sh ${CUDA}

where ${CUDA} must be one of cpu, cu92, cu100, or cu102, depending on your machine's capabilities.

Once the dependencies have been installed successfully, compile and test ComPy-Learn by running:

python setup.py test

Finally, install ComPy-Learn in order to use it in your project:

python setup.py install
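
If you prefer to script these steps, the following is a minimal sketch of a wrapper around the three commands above. The helper names (CUDA_VARIANTS, install_commands, run_install) are hypothetical and not part of ComPy-Learn. The sketch assumes it is run from the repository root inside the activated virtual environment, and as written it only prints the commands (a dry run).

```python
import subprocess

# CUDA variants listed in the installation instructions.
CUDA_VARIANTS = ("cpu", "cu92", "cu100", "cu102")


def install_commands(cuda):
    """Return the three installation commands, in order, as argument lists."""
    if cuda not in CUDA_VARIANTS:
        raise ValueError(
            f"unknown CUDA variant {cuda!r}; expected one of {CUDA_VARIANTS}"
        )
    return [
        ["./install_deps.sh", cuda],
        ["python", "setup.py", "test"],
        ["python", "setup.py", "install"],
    ]


def run_install(cuda):
    """Run each step in turn; check=True stops at the first failing command."""
    for argv in install_commands(cuda):
        subprocess.run(argv, check=True)


# Dry run: print the commands for the CPU-only variant without executing them.
for argv in install_commands("cpu"):
    print(" ".join(argv))
```

Calling run_install("cpu") instead of the dry-run loop would execute the steps. Rejecting unknown variants up front avoids starting a long build with a mistyped argument.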

An example exploration is located in examples/devmap_exploration.py.

Publications