Home

Background Information

This wiki describes how to build a compound database from PubChem, an open data source.

Compounds, NOT Substances!

There are two main types of data in PubChem that we are interested in:

Substances: Chemical records submitted to PubChem by more than one source. They may be redundant and disorganized. There are 277,335,999 (277M) substances.
Compounds: Standardized chemical records that PubChem derives by combing through the submitted substances. There are 110,376,731 (110M) compounds. Each is identified by a PubChem Compound ID (CID). For example, CID 222 is ammonia.

In Periodum, we want to use Compounds, not Substances, because we want the most standardized and distilled data. In other words, for this project, we will ignore the following data types for now:

Substances
BioAssays
Targets (genes and proteins)
Pathways
Taxonomy
Patents

Target Entry Number for Database: 110M!

Given the background above, there are 110M compound entries that we need to retrieve in order to build our database.
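
To get a feel for this scale, the arithmetic below estimates how long it would take to retrieve every compound through a request-based API. The rate of 5 requests per second and the batch size of 100 CIDs per request are assumptions chosen for illustration, not figures from this wiki; check PubChem's current usage policy before relying on them.

```python
import math

TOTAL_COMPOUNDS = 110_376_731  # compound count quoted above


def estimate_days(total, cids_per_request, requests_per_second):
    """Return (number of requests, wall-clock days) for a full crawl."""
    n_requests = math.ceil(total / cids_per_request)
    seconds = n_requests / requests_per_second
    return n_requests, seconds / 86_400


# Assumed values: 5 requests/second, and 1 or 100 CIDs per request.
for batch in (1, 100):
    n, days = estimate_days(TOTAL_COMPOUNDS, batch, 5)
    print(f"{batch:>3} CID(s)/request: {n:,} requests, about {days:.1f} days")
```

Under these assumptions, one CID per request takes most of a year, while batching 100 CIDs per request brings the crawl down to a few days. This is one reason the choice of download method matters.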

Compound Examples

Here are some examples of these entries:

Ways to Download PubChem Data

There are 3 main ways to access data:

Individual Record Download: Use the Download button at the top right.
Programmatic Download
Bulk Download
- FTP
  - Compound: Full and incremental data dump for PubChem compounds (without annotations and 3-D conformer models).
  - 3D Compound: Computationally generated 3-D structures for PubChem compounds, along with other 3-D properties such as molecular volume, shape quadrupoles, shape fingerprint, etc.
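
When working from the bulk dump, it helps to know which file holds a given CID. The sketch below assumes the compound dump is split into gzipped SDF files that each cover 500,000 consecutive CIDs and are named with zero-padded start and end CIDs. Both the span and the naming pattern are assumptions to verify against the actual FTP directory listing, and `bulk_file_for_cid` is a hypothetical helper.

```python
FILE_SPAN = 500_000  # assumed number of CIDs covered by one bulk file


def bulk_file_for_cid(cid, span=FILE_SPAN):
    """Return the assumed SDF file name whose CID range contains `cid`."""
    if cid < 1:
        raise ValueError("CIDs start at 1")
    start = (cid - 1) // span * span + 1  # first CID in the file
    end = start + span - 1                # last CID in the file
    return f"Compound_{start:09d}_{end:09d}.sdf.gz"


print(bulk_file_for_cid(222))      # file covering CIDs 1 to 500000
print(bulk_file_for_cid(500_001))  # first CID of the next file
```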

Difference Between PUG-View and PUG-REST

PUG-View can be easily confused with PUG-REST. Although both are REST-like interfaces, they generally serve distinct kinds of data. PUG-REST primarily provides access to data that can be readily structured, such as computed properties of compounds, activity data for assays, and associations (cross-references) between PubChem records and other resources. PUG-View, on the other hand, is intended to support the retrieval of unstructured, largely textual annotation data (e.g., excerpts about handling or first aid procedures for a chemical). The data models for the JSON/XML returned by these services are also completely different. In general, PUG-REST is intended for grabbing small, specific bits of information, whereas PUG-View is used for larger reports, which is another reason PUG-View does not handle multiple records in a single request.
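
As an illustration of the "small, specific bits" style of PUG-REST, the sketch below builds a property request for several CIDs and indexes the rows of a response by CID. The helper names are hypothetical, and the sample payload is hand-written in the response shape assumed here (a `PropertyTable` containing a `Properties` list); it is not live data.

```python
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(cids, properties):
    """Build a PUG-REST URL requesting computed properties for several CIDs."""
    cid_part = ",".join(str(c) for c in cids)
    prop_part = ",".join(properties)
    return f"{PUG_REST}/compound/cid/{cid_part}/property/{prop_part}/JSON"


def rows_by_cid(payload):
    """Index the rows of a PropertyTable response by CID."""
    return {row["CID"]: row for row in payload["PropertyTable"]["Properties"]}


url = property_url([222, 962], ["MolecularFormula", "MolecularWeight"])

# Hand-written sample in the assumed response shape (not fetched live).
sample = {"PropertyTable": {"Properties": [
    {"CID": 222, "MolecularFormula": "H3N"},
    {"CID": 962, "MolecularFormula": "H2O"},
]}}
table = rows_by_cid(sample)
print(url)
print(table[222]["MolecularFormula"])
```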

Users often confuse PUG-View with PUG-REST. While PUG-REST retrieves property values computed by PubChem, PUG-View retrieves annotations collected from other data sources.
Unlike PUG-REST, PUG-View accepts only CIDs (not chemical names, InChIKeys, or other identifiers). To get annotations for non-CID identifiers, first convert them to CIDs, then use those CIDs in PUG-View requests.
Another important difference between PUG-REST and PUG-View is that PUG-View cannot take multiple CIDs in a single request, whereas PUG-REST can. That is, of the following two PUG-View requests, only the first one will work:
- (Correct) https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/1/JSON?heading=Substances+by+Category
- (Incorrect) https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/1,2,3/JSON?heading=Substances+by+Category
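
The two constraints above (CID-only input and one CID per request) can be sketched as URL builders. The function names are hypothetical. The name-to-CID URL follows the PUG-REST pattern as understood here and should be checked against the PUG-REST documentation; the PUG-View URL follows the correct example shown above.

```python
from urllib.parse import quote, urlencode

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest"


def name_to_cid_url(name):
    """PUG-REST URL that resolves a chemical name to CIDs."""
    return f"{BASE}/pug/compound/name/{quote(name)}/cids/JSON"


def pug_view_url(cid, heading=None):
    """PUG-View URL for exactly one CID, optionally limited to a heading."""
    if not isinstance(cid, int) or isinstance(cid, bool):
        raise TypeError("PUG-View takes a single integer CID per request")
    url = f"{BASE}/pug_view/data/compound/{cid}/JSON"
    if heading:
        url += "?" + urlencode({"heading": heading})
    return url


def pug_view_urls(cids, heading=None):
    """One PUG-View URL per CID, since CIDs cannot be combined."""
    return [pug_view_url(cid, heading) for cid in cids]


print(name_to_cid_url("ammonia"))
for u in pug_view_urls([1, 2, 3], heading="Substances by Category"):
    print(u)
```

Passing a string such as "1,2,3" to `pug_view_url` raises a `TypeError` instead of producing a URL that the service would reject.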
