Skip to content

CTCycle/NISTADS-data-collection

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

40 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

NISTADS: NIST/ARPA-E dataset composer

1. Project Overview

NISTADS is a python application developed to extract adsorption isotherms data from the NIST/ARPA-E Database of Novel and Emerging Adsorbent Materials (https://adsorption.nist.gov/index.php#home) through their dedicated API. The user can either collect data regarding adsorbent materials and adsorbate species or fetch adsorption isotherm experimental data. Experiments are identified by name upon building the entire database experiments index from the API endpoint. Furthermore, NISTADS exploits the PUG REST API (see https://pubchempy.readthedocs.io/en/latest/ for more information) to enrich the adsorbate species dataset with molecular properties (such as molecular weight, canonical smiles, complexity, heavy atoms, etc.). Eventually, the adsorption isotherm dataset is split into two datasets, one containing data on single component adsorption and the other including experiments with binary mixtures.

Adsorbent material

2. Adsorption datasets

The collected data is saved locally in 4 different .csv files, located in the NISTADS/data folder. Adsorption isotherm datasets are saved in NISTADS/data/experiments, while the adsorbents and adsorbates datasets are saved into NISTADS/data/materials. The former will include the experiments datasets for both single component and binary mixture measurements, while the latter will host datasets on guest and host species.

2. Installation

The installation process is designed for simplicity, using .bat scripts to automatically create a virtual environment with all necessary dependencies. Please ensure that Anaconda or Miniconda is installed on your system before proceeding.

  • The setup/create_environment.bat file, located in the scripts folder, offers a convenient one-click solution to set up your virtual environment.

3. How to use

The project is organized into subfolders, each dedicated to specific tasks.

data: run NISTADS/data/compose_experiments_dataset.py or NISTADS/data/compose_materials_dataset.py to respectively fetch data for adsorption experiments or for the guest/host entries. The data collection operation may take long time due to the large number of queries to perform, and it heavily depends on your internet connection performance (more than 30k experiments are available as of now). You can select a fraction of data that you wish to extract (guest, host, or experiments data), and you can also split the total number of adsorption isotherm experiments in chunks, so that each chunk will be collected and saved as file iteratively. Use the notebook NISTADS/data/dataset_info.ipynb to perform explorative data analysis on the collected datasets.

experimental: contains experimental features to integrate further information into the dataset. Description of chemicals (both adsorbate species and adsorbent materials) can be generated using the pretrained GPT2 model using NISTADS/experimental/gpt_enhancement.py. Due to the model limitations, description may not be very accurate and lack context for more complex molecules and materials.

3.1 Configurations

The NISTADS/config/configurations.py file allows to change the script configuration.

Category Setting Description
Data settings GUEST_FRACTION fraction of adsorbate species data to be fetched
HOST_FRACTION fraction of adsorbent materials data to be fetched
EXP_FRACTION fraction of adsorption isotherm data to be fetched
CHUNK_SIZE fraction of data chunks to extract and save
Series settings MIN_POINTS Minimum number of measurements per experiment
MAX_PRESSURE Max pressure to consider (in Pa)
MAX_UPTAKE Max uptake to consider (in mol/g)

License

This project is licensed under the terms of the MIT license. See the LICENSE file for details.

About

Collect adsorption isotherm data from the NIST/ARPA-E Database

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published