ChemicalIdentifiers.jl

A chemical identifiers search package, using the databases present at CalebBell/chemicals.

Instalation:

using Pkg
Pkg.add("ChemicalIdentifiers.jl")

The databases are downloaded, parsed, processed and stored as Apache Arrow files at the first package usage, so the first usage may take some time.

Usage

This package exports search_chemical, that, given a search string, performs a search on a database of over 70000 compounds, returning a Named Tuple with the identifiers of the substance in question.

julia>using ChemicalIdentifiers
julia> res = search_chemical("water")
(pubchemid = 962, CAS = (7732, 18, 5), formula = "H2O", MW = 18.01528, smiles = "O", InChI = "H2O/h1H2", InChI_key = "XLYOFNOQVPJJNP-UHFFFAOYSA-N", iupac_name = "oxidane", common_name = "water")

#worst case scenario, not found on the present databases
julia> @btime search_chemical("dimethylpyruvic acid22",nothing)
  273.700 μs (264 allocations: 15.05 KiB)   
missing

#common compound found in the short database
julia> @btime search_chemical("methane",nothing)
  7.075 μs (57 allocations: 2.97 KiB)       
(pubchemid = 297, CAS = (74, 82, 8), formula = "CH4", MW = 16.04246, smiles = "C", InChI = "CH4/h1H4", InChI_key = "VNWKTOKETHGBQD-UHFFFAOYSA-N", iupac_name = "methane", common_name = "methane")

A query is usually a string and its type is detected automatically when possible. the supported query types are:

PubChemID: by using any ``<:Integer`(or a string containing an Integer)

julia> search_chemical(8003)
(pubchemid = 8003, CAS = (109, 66, 0), formula = "C5H12", MW = 72.14878, smiles = 
"CCCCC", InChI = "C5H12/c1-3-5-4-2/h3-5H2,1-2H3", InChI_key = "OFBQJSOFQDEBGM-UHFFFAOYSA-N", iupac_name = "pentane", common_name = "pentane")

CAS registry number:by using a Tuple of integers or a string with the digits separated by - :

julia> search_chemical((67,56,1))        
(pubchemid = 887, CAS = (67, 56, 1), formula = "CH4O", MW = 32.04186, smiles = "CO", InChI = "CH4O/c1-2/h2H,1H3", InChI_key = "OKKJLVBELUTLKV-UHFFFAOYSA-N", iupac_name = "methanol", common_name = "methanol")

 search_chemical((67,56,1),nothing) == search_chemical("67-56-1",nothing) #true

SMILES: by using a string starting with SMILES= :

julia> search_chemical("SMILES=N")       
(pubchemid = 222, CAS = (7664, 41, 7), formula = "H3N", MW = 17.03052, smiles = "N", InChI = "H3N/h1H3", InChI_key = "QGZKDVFQNNGYKY-UHFFFAOYSA-N", iupac_name = "azane", common_name = "ammonia")

InChI : by using a string starting with InChI=1/ or InChI=1S/ :

julia> search_chemical("InChI=1/C2H4/c1-2/h1-2H2")     
(pubchemid = 6325, CAS = (74, 85, 1), formula = "C2H4", MW = 28.05316, smiles = "C=C", InChI = "C2H4/c1-2/h1-2H2", InChI_key = "VGGSQFUCUMXWEO-UHFFFAOYSA-N", iupac_name = "ethene", common_name = "ethene")

InChI key : by using a string with the pattern XXXXXXXXXXXXXX-YYYYYYYYFV-P:

julia> search_chemical("IMROMDMJAWUWLK-UHFFFAOYSA-N")
(pubchemid = 11199, CAS = (9002, 89, 5), 
formula = "C2H4O", MW = 44.05256, smiles 
= "C=CO", InChI = "C2H4O/c1-2-3/h2-3H,1H2", InChI_key = "IMROMDMJAWUWLK-UHFFFAOYSA-N", iupac_name = "ethenol", common_name 
= "ethenol")

Searches by CAS and PubChemID are a little bit faster thanks to being encoded as native numeric types, other properties are stored as strings.

The package stores each query in ChemicalIdentifiers.SEARCH_CACHE as a Dict{String,Any}, so subsequent queries on the same (or similar) strings, dont pay the cost of searching in the database.

If you don't want to store the query, you could use search_chemical(query,nothing), or, if you want your own cache to be used, pass your own cache via search_chemical(query,mycache).

Custom Databases

If you want to add your own databases, you could use the (unexported) data utilities to do so. lets say we also want to add the inorganic database located at https://github.com/CalebBell/chemicals/blob/master/chemicals/Identifiers/Inorganic%20db.tsv. we could do:

using ChemicalIdentifiers
inorganic_url = "https://github.com/CalebBell/chemicals/blob/master/chemicals/Identifiers/Inorganic%20db.tsv"
ChemicalIdentifiers.load_data!(:inorganic,url = inorganic_url)
ChemicalIdentifiers.load_db!(:inorganic)

or if you already have a local database:

using ChemicalIdentifiers
filepath = "path/to/my/db.tsv"
ChemicalIdentifiers.load_data!(:custom,file = filepath)
ChemicalIdentifiers.load_db!(:custom)

ChemicalIdentifiers.load_data! will generate a named tuple of file paths (stored in ChemicalIdentifiers.DATA_INFO), and ChemicalIdentifiers.load_db! will use that data to generate the corresponding Apache Arrow files and store those in a scratch space (ChemicalIdentifiers.download_cache). This download cache can be cleaned (in case a download goes wrong) with ChemicalIdentifiers.clear_download_cache!()

The raw databases are then stored in ChemicalIdentifiers.DATA_DB. if the data was already processed, then the arrow files are read directly, saving significant loading time.

In case of adding user databases, those are searched first, so there is a possibility of collision.

Name		Name	Last commit message	Last commit date
Latest commit History 43 Commits
.github/workflows		.github/workflows
src		src
test		test
.gitignore		.gitignore
LICENSE		LICENSE
LocalPreferences.toml		LocalPreferences.toml
Project.toml		Project.toml
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.github/workflows

.github/workflows

src

src

test

test

.gitignore

.gitignore

LICENSE

LICENSE

LocalPreferences.toml

LocalPreferences.toml

Project.toml

Project.toml

README.md

README.md

Repository files navigation

ChemicalIdentifiers.jl

Instalation:

Usage

Custom Databases

About

Releases 10

Packages

Contributors 2

Languages

License

longemen3000/ChemicalIdentifiers.jl

Folders and files

Latest commit

History

Repository files navigation

ChemicalIdentifiers.jl

Instalation:

Usage

Custom Databases

About

Topics

Resources

License

Stars

Watchers

Forks

Languages