arxiv-code-search

Do authors on arXiv make their code and data available? We're building text mining and machine learning tools to find out!

Our goal is to build a system that can "read" arXiv papers, at scale, and determine if the papers come with publicly available code or data. The planned steps are as follows:

1. Download paper metadata from the arXiv dataset and select papers by category, etc. ✔️ (complete)
2. Download the selected papers from arXiv. ✔️ (complete)
3. Build a labeling system for manually labeling paragraphs from papers. ✔️ (complete)
   - Convert PDFs to text files.
   - Search the text files for keywords and extract the paragraphs that contain them.
   - Save the paragraphs in a file that can be readily labeled.
4. Build a classifier for identifying papers that make their code or data available. 🛠️ (in progress)
   - Fine-tune a BERT model on the labeled paragraphs.
   - Train classical ML models on the embeddings from a BERT model.
5. Deploy the classifier onto an HPC system and classify papers at scale! 🛠️ (in progress)
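
The keyword search and paragraph extraction in the labeling step could look roughly like the sketch below. It is a minimal illustration that assumes the PDFs have already been converted to plain text. The keyword list, the function names, the CSV columns, the sample text, and the paper ID are all hypothetical and are not taken from this repository.

```python
import csv
import io
import re

# Hypothetical keyword list; the project's actual keywords are not listed in this README.
KEYWORDS = ["github", "code is available", "data is available", "zenodo"]


def split_paragraphs(text):
    """Split plain text into paragraphs on one or more blank lines."""
    parts = re.split(r"\n\s*\n", text)
    # Collapse internal whitespace so each paragraph fits on one CSV row.
    return [" ".join(p.split()) for p in parts if p.strip()]


def find_keyword_paragraphs(text, keywords=KEYWORDS):
    """Return (paragraph, matched_keywords) pairs for paragraphs containing any keyword."""
    hits = []
    for para in split_paragraphs(text):
        lowered = para.lower()
        matched = [kw for kw in keywords if kw in lowered]
        if matched:
            hits.append((para, matched))
    return hits


def write_label_file(paper_id, hits, handle):
    """Write one row per paragraph, leaving an empty 'label' column for the annotator."""
    writer = csv.writer(handle)
    writer.writerow(["paper_id", "keywords", "paragraph", "label"])
    for para, matched in hits:
        writer.writerow([paper_id, ";".join(matched), para, ""])


# Made-up text standing in for a converted paper (placeholder URL and ID).
sample_text = """We propose a new method for image segmentation.

Our code is available at
https://github.com/example/repo.

We thank the reviewers for their comments."""

hits = find_keyword_paragraphs(sample_text)
buffer = io.StringIO()
write_label_file("0000.00000", hits, buffer)
print(buffer.getvalue())
```

Matching on lowercase substrings keeps the sketch simple; a real pipeline would likely need to handle hyphenation and line-break artifacts left over from PDF conversion.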

This is active and preliminary research. Stay tuned!

Preliminary Results

Using the labeling system, I've manually labeled several thousand paragraphs (tedious work!). Some results from that work are shown below. You can reproduce the figures in Colab, or view the notebook.

Using a random search, I've trained different classical ML models on the embeddings from a BERT model (the Allen AI SciBERT model). These models classify paragraphs according to whether or not they indicate code OR data availability. I'll be labeling more paragraphs to further improve the results.
Below are the precision-recall and ROC curves for the top-performing random forest model, which was trained with 5-fold cross-validation.
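
As a rough illustration of a random search with 5-fold cross-validation, here is a minimal scikit-learn sketch. It is not the project's training code: the features are synthetic stand-ins for SciBERT embeddings, and the hyperparameter ranges, the number of search iterations, and the scoring metric are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in for paragraph embeddings (assumption): 200 "paragraphs" with
# 32 features each; label 1 means the paragraph indicates code or data availability.
X, y = make_classification(
    n_samples=200, n_features=32, n_informative=8, random_state=0
)

# Hypothetical search space; the ranges used in the project are not stated here.
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 2, 5],
}

# Stratified 5-fold split keeps the class balance similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                      # number of random hyperparameter draws
    scoring="average_precision",   # summarizes the precision-recall curve
    cv=cv,
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

With real data, `X` would be replaced by one embedding vector per labeled paragraph and `y` by the manual labels.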

Here are the precision-recall and ROC curves for the top-performing linear regression model.
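
For readers who want to see how precision-recall and ROC curves are derived from a classifier's scores, the following pure-Python sketch computes the curve points by lowering the decision threshold one distinct score at a time. The labels and scores are made up for illustration and are not results from this project; the function names are hypothetical.

```python
def curve_points(y_true, scores):
    """Return (roc, pr): lists of (fpr, tpr) and (recall, precision) points,
    obtained by lowering the decision threshold over each distinct score."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Rank examples from the highest score to the lowest.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])

    roc, pr = [(0.0, 0.0)], []
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all examples tied at this score are counted.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        roc.append((fp / n_neg, tp / n_pos))
        pr.append((tp / n_pos, tp / (tp + fp)))
    return roc, pr


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up labels (1 = paragraph indicates code or data availability) and scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

roc, pr = curve_points(y_true, scores)
roc_auc = trapezoid_area(roc)
print(roc)
print(pr)
print(round(roc_auc, 4))
```

In practice a library routine would normally be used for this; the sketch only shows what the plotted points represent.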

Project Organization

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io
