This artifact is associated with the paper "Understanding GCC Builtins to Develop Better Tools" and provides the builtin raw data, the preprocessed data, and aggregated data of builtin usages. It also contains a record of the manual decisions made as part of this study. Furthermore, it contains all the scripts and applications to reproduce the paper's results.
To make full use of the artifact, we recommend using an Ubuntu 19.04 system (which is assumed for the further instructions). If you are not using Ubuntu, you can download a Virtual Box image from osboxes.org:
The artifact requires the following dependencies:
- git: for downloading the projects from GitHub
- cloc: for retrieving metadata about the projects
- Python 3: for executing the Python scripts
- sqlite3: for opening the database
- Latex: to build the paper
- libcurl4-openssl-dev and libxml2-dev: required by the forecast R package (see here)
- The R packages ggplot2, ggthemes, reshape, and forecast: for plotting
- A recent JDK and Eclipse: to use the Java projects
You can download these dependencies on Ubuntu as follows:
sudo apt install git cloc r-base-core python3 libcurl4-openssl-dev libxml2-dev sqlite3 texlive-full openjdk-11-jdk
echo 'install.packages(c("ggplot2", "ggthemes", "reshape", "forecast"), repos = "http://cran.us.r-project.org")' | sudo R --vanilla
We believe that the artifact works also on other Linux systems. Furthermore, browsing the SQLite3 database should be possible on any operating system that has SQLite3 installed.
The artifact provides the raw data, preprocessed data, and aggregated data of builtin usages. Furthermore, it also records the manual decisions made in parts of the study. Please unpack the database file contained in the database.zip
archive.
The database is stored in the ./database.db file. It is stored as a SQLite3 database, so that it can be easily distributed and shared. For convenience, we suggest to explore the database's content using sqlitebrowser, which is a GUI for SQLite3 databases and can be used on both Linux, Windows, and MacOS. Alternatively, the database can also be accessed on the command line by installing SQLite3 and typing sqlite3 database.db
. The figure below shows an ER diagram created using Dia with the most important entities and relations represented in the database.
Tables with a suffix Unfiltered
indicate raw data that is filtered by a view. For example, the database contains a table GithubProjectUnfiltered
that contains the ~5,000 projects that we initially considered in our study. As described in the paper, we excluded projects having less than 100 LOC as well as GCC and forks of GCC. Rather than removing this projects from the database, we inserted a view GithubProjectView
that filters the project: CREATE VIEW GithubProjectView as SELECT *, CLOC_LOC_C+CLOC_LOC_H AS CLOC FROM GithubProjectUnfiltered WHERE PROCESSED=1 AND CLOC >= 100 AND GITHUB_PROJECT_NAME NOT IN ("gcc", "gcc-xtensa", "gccxml", "rose")
(note that PROCESSED=1
ensures that only those projects are considered that were fully processed by the BuiltinAnalyzer
). Filtering the data rather than removing it allows to reproduce every filtering step. However, having many nested views results in slow queries - querying some of the view takes several minutes. To this end, we implemented a script to persist views in a table (see ./src/includes/sync_views_to_tables.py). Thus, the database also contains a table GithubProject
with the content of the GithubProjectView
view. Thus, if you make changes to the content of the database, make sure to execute ./src/includes/sync_views_to_tables.py to synchronize the views and tables. The sqlitebrowser allows, for example, to easily determine the CREATE
statements for the views (see below).
Most numbers mentioned in the paper are generated directly from the data contained in the database, rather than being hardcoded, to make the results better reproducible and prevent that outdated data is included.
We recommend having a look at the ./generated/latex/generated.tex file that contains many of the commands and tables included in the paper. The ./paper/Makefile generates this data. Type make
in the ./paper to generate it (if missing).
The source code is contained in the ./src directory. Java programs are contained in the ./src/JavaProject directory and include multi-threaded programs to analyze the builtin usage, as well as GUIs for browsing the builtin usage of the projects and classifying their trends (see below).
The Python scripts in ./src were used for various purposes, for example, for fetching the GitHub projects and aggregating data from the database.
The ./src/plots directory contains R scripts to plot the figures contained in the paper.
- For the script that fetches the projects, see the ./src/fetch_github_projects.py script.
- For the coverage score, see the ./coverage directory.
- For how we filtered projects, see the definition of the
GithubProjectView
in ./database.db. - For how we identified builtins from the documentation, see ./defs.
- For how we identified builtins from the GCC source code, and how we identified builtin usages in the ~5,000 projects, see ./BuiltinAnalyzer.
- For how we excluded builtins, see ./src/sync-excludes.py.
- For how we extracted builtin usages, see ./src/README.md.
- See ./src/stat.py for how the stats are generated.
- See ./src/stat.py for how the stats are generated.
See ./src/implementation_effort.py.
- The scripts and data related to this research question are described in ./gcc-builtin-tests/README.md.
We performed the study on over 5,000 GitHub projects, which occupy several hundreds of GB and for which extracting the builtins and commit history requires multiple days. Thus, we did not include the GitHub projects itself into the artifact (they would otherwise be located in the ./projects directory).
However, since we considered only popular GitHub projects, we assume that most projects are still available on GitHub.
Furthermore, the GithubProjectUnfiltered
table, which contains meta data about the projects contains a column PULL_HASH
referencing the projects' last commit when we performed the analysis.
├── LICENSE.md
├── README.md
├── STATUS.md
├── __pycache__
├── coverage
│ ├── README.md
│ └── sampling_software_projects
├── defs
│ ├── README.md
│ └── architecture-specific
│ └── internal
├── excludes
├── gcc-builtin-tests
│ ├── README.md
│ └── tests
│ ├── LICENSE.md
│ ├── README.md
│ └── test-cases
├── generated
│ ├── historical-data
│ │ ├── csv
│ │ └── plots
│ │ ├── pdf
│ │ └── png
│ ├── latex
│ └── plots
├── paper
├── projects
└── src
├── FETCH_GITHUB_PROJECTS.md
├── JavaProject
│ ├── JAVA_PROJECTS.md
│ ├── bin
│ ├── lib
│ └── src
├── __pycache__
├── helper-gcc-doc-extraction
│ └── README.md
├── include
│ └── __pycache__
├── plots
│ └── README.md
└── screenshots