Skip to content

Mbodin/CoqR

Repository files navigation

A trustworthy Coq formalisation of the R programming language https://travis-ci.org/Mbodin/CoqR.svg?branch=master

Overview of the Project

This project is a Coq formalisation of the R programming language, or more accurately, of the core of a R programming language. It follows principles strongly inspired from the JSCert project:

  • the code is as close as possible from a reference specification (in this case, the source of GNU R—a mirror can be found here). The formalisation has been written in a monadic form, which enables a line-to-line correspondance with the original source. We believe that anyone can check the code similarity after a short training period.
  • the formalisation is runnable, and can be run against test suites. We tested it against GNU R in random tests (see the gen/ folder), custom tests (see the tests/ folder), as well as other test suites for R (see the compare/RTests/ folder).

The goal of this project is to provide a strong basis for anyone wanting to certify in Coq a higher-level formalisation of R, as well as code analyses, etc. Due to its double certification (both code similarity and testing), this project provides a high amount of trust to any proofs built on it. This trust comes with the cost of complexity: R’s semantics is complex, and this project aims at formalising this complexity. This is why linking this project to higher-level specifications would be a huge addition.

Some parts of the original R source code have not been formalised. The main example is the garbage collector, which is not supposed to change the semantics of R programs. R is furthermore composed of several parts, some in C, and some in R it self. Our Coq translation focusses the C part, but we did our best to be able to run the R parts. These parts can be found in the folder Rlib/base.

We have tested our formalisation by extracting it to OCaml (see how to build the project below). We have tested it on several test suites, including random tests (see the gen/ folder after building th project), customs tests with an emphase on corner cases (referred as “Corners” in our publications, see the tests/ folder), as well as other test suites (from GNU R and two different versions of the FastR project, available here and here, see the compare/RTests/ folder). Test suites for the R programming language are quite large. We have filtered out some tests which we consider irrelevant for the current state of this project. More precisely, we have filtered out:

  • tests generating PDFs;
  • tests requiring additional packages and libraries (PCRE, JIT, MASS, etc);
  • tests about character encodings;
  • tests requiring Internet connection;
  • tests generating plots in graphical interfaces;
  • Windows-only tests;
  • tests related to S4 classes;
  • tests about input and outputs (typically manipulating files or sockets);
  • tests related to limiting CPU time for computations;
  • generated tests.

R is a project moving fast. We believe that the approach of sticking as close as possible to R’s source code enables to update our formalisation in the same time than R. But this means that if you are using this repository several years in the future, there might be differences that were not present at the writing time. Most of the project is based on the versions 3.4.2, 3.4.3, and 3.4.4 of R.

More information about this project can be found in the following papers:

Here follows a quick description of the main files of interest:

  • src/Rinternals.v describes the internal structures used by R thorough its source code;
  • the src/core/ folder contains translations of the main functions of R (these functions were called “nucleus functions” in our second paper);
  • the src/features/ folder contains translations of additional functions of R (typically present in the default heap);
  • src/Invariants.v defines some invariants of R’s structures;
  • src/InvariantProofs.v contains proofs that the invariants defined in src/Invariants.v are conserved along the execution.

Note that as we try to be as close as possible to the original source code of R, we are limited in term of how much comment we can write: most of the algorithm of this project were by construction not designed by us! This means that most comments in this project are merely pointers to the original source code of GNU R. For instance, the file src/features/FArithmetic.v corresponds to the file `arithmetic.c` from the actual GNU R source code.

Building the Project

Using esy

The simplest way is to use esy. One way to install it is as follows.

	sudo apt install npm
	sudo npm install --global esy@latest # Tested with version 0.6.2 of esy.

One can then clone the project.

	git clone https://github.com/Mbodin/CoqR.git CoqR
	cd CoqR
	git submodule update --init

Then, typing esy should install all dependencies and build the project.

Once compiled, the following commands are available:

  • esy lrun runs the CoqR interpreter with the base libraries loaded (the loading of the base library will take some time the first time).
  • esy run runs a lighter version of the CoqR interpreter without the base libraries.
  • esy lrun_bisect and esy run_bisect are similar to the other ones, but the program is instrumented by the Bisect coverage tool. Once the program is run on some test suites, one can call esy report to create a coverage report in bisect/report/index.html.

Using opam

Another way is to use opam. In Debian, installing Opam, Git, as well as the other needed libraries, can be done as follows. More information for other systems can be found here for Opam and here for Git.

	sudo apt install opam aspcud m4 perl git
	opam init # This is not needed if you have already installed and initialised Opam before.

It is advised to add the following line into your .bashrc file.

	eval `opam config env`

We recommand a version of OCaml of at least 4.03.

	opam switch 4.04.0

The following lines ensure that the Coq repository is known by Opam.

	opam repo add coq-released https://coq.inria.fr/opam/released
	opam update

The version of Coq used to test this repository is the version 8.6. We can thus pin it in Opam.

	opam pin add coq 8.6

We can now install the required packages.

	opam install ocamlbuild menhir coq coq-flocq coq-paco extlib bisect coq-tlc
	eval `opam config env`

We can now clone the project.

	git clone https://github.com/Mbodin/CoqR.git CoqR
	cd CoqR
	git submodule update --init

The Makefile should now work.

	make

This should create the following program.

	src/runR.native

As-is, this program should run more or less similarly to R (although most features have not been implemented). It accepts a large number of argument to customize its output, which are (briefly) described by the following invocation:

	src/runR.native -expert-mode -help

Initialising the program can take some time. To save computations time, the program enables to save and load previous states from memory. A faster-to-load version is available with the following invocation.

	make run

The first make command also created a html/ folder, with an arguably more readable version of the files and their documentations. See for instance the table of contents in html/toc.html.

The Bisect tool is a coverage tool for OCaml. The following invocation uses it in the context of this R interpreter.

	make run_bisect

This will generate bisect.out files during the computation. To translate these files into a readable report, run the following command. This will create some files in the bisect/ folder, and in particular bisect/report/index.html providing a coverage report.

	make report

Note that make run does not include the R libraries: it only includes what is defined in C in GNU R. Executing the R files of the base library can take some time, but it will lead to the usual R environment. To include this libraries, use the following invocation. Note that the first time that this invocation will be performed, all of the base library will be executed, which will take quite a large amount of time. Once this library will have been executed, the final state will be stored and cached.

	make lrun

If you want to also run Bisect with the R libraries, use the following invocation.

	make lrun_bisect

Files Description

src

This folder contains is a low-level formalisation of R. It is the closest possible to the R reference interpreter (written in C).

Rinternals.v

This file describes the data structures of R as they are in its C code.

RinternalsAux.v

Auxiliary definitions and properties about the data structures defined in Rinternals.v.

This file may take some time to compile.

State.v

This file provides a model for the C memory.

InternalTypes.v

This file describes various internal data types used in the source of R.

Globals.v

This file lists all global variables used in the C source code of R, that are initialised, then never changed.

This file may take some time to compile.

Monads.v

This file provides monads to manipulate R objects easily.

In some ways, this file provides notations enabling to write imperative code in Coq with respect to memory.

Loops.v

This file provides looping monads to easily manipulate R objects. It also defines the structure used in the interpreter to deal with infinite execution.

In some ways, this file provides notations enabling to write imperative loops in Coq.

core/

This folder contains Coq translations of various core functions (that has been called “nucleus functions” at some point) of R from its C code. This folder is organised by the corresponding file name in the C source code of R.

The core functions of R are the ones used thoroughly in R source code. They are internal C functions and are not linked to the initial R environment (defined in the names.c file in C).

Due to some circular dependencies between C files (which is accepted in C, but not in Coq), some functions have been put in a special file Conflicts.v. Note that despite this special file, there still are some functions separated from their original file in other file. They are however always preceeded by a comment warning for the exceptional location of the function definition.

Rcore.v

This file wraps up all the files present in the core/ folder.

Rinit.v

Initialises global variables.

features/

This folder contains Coq translations of various non-core functions of R from its C code. This folder is organised by the corresponding file name in the C source code of R. Most of the functions defined in the files of this folder appear in the initial environment of R, initialised in the file names.c in C.

Rfeatures.v

This file wraps up all the files present in the features/ folder. This file also wraps all these functions into the main function array of names.c. Finally, this files defines the runs structure, enabling Coq functions to loop without explicitly manipulating a fuel.

RfeaturesAux.v

Contains useful lemmae about runs.

This file may take some time to compile.

Rparsing.v

This file formalises functions for parsing R expressions.

MonadTactics.v

Provides tactics to easily manipulate the monads defined in Monads.v and Loops.v.

Path.v

Provides abstractions to reason about the heap.

Invariants.v

States some invariants of R’s heap.

InvariantsAux.v

Contains lemmae about the invariants stated in Invariants.v.

InvariantsTactics.v

Defines tactics to manipulate the invariants defined in Invariants.v.

InvariantsProofs.v

Proofs of some invariants respected by the functions defined in Rcore.v, Rinit.v, and Rfeatures.v.

Extraction.v

Extract R interpreter into OCaml.

extract.ml, extract.mli

These files are extracted by Extraction.v.

print.ml

Contains various OCaml function to print Coq objects.

hooks.ml

Contains some impure functions used by the Coq extracted code (mainly input and output).

runR.ml, runR.native

Main interpreter file. It calls the various functions defined in extract.ml.

The following command provides some help about it.

	src/runR.native -help

gen

This folder is a random test generator. It builds random tests from the gram file, which could in theory be used for other languages than just R.

The tests generated by this program are however of very relative interest, most of them just throwing an error. It was useful during the development to test non-frequent programs.

tests

The file testExtra.R has been build by hand during the development. It consists of more than 500 unit tests, checking for very specific features of R, subjectively considered difficult. Instances of these tests include implicit type conversions or some specific rules for variable look-ahead.

License

The files in the src/ folder are under the GPL license version 2. See the file COPYING for more informations. As this license is the same than R’s license, as well as most programs about R, it seemed the best to help development. If you want to contribute to this project, consider reading the file CODE_OF_CONDUCT.org. Some suggestions for future directions can be found in the file CONTRIBUTING.org.