Çağrı Mert Bakırcı edited this page Jan 30, 2022 · 8 revisions

Background Information

This wiki describes how to build a compound database from PubChem, an open data source.

Compounds, NOT Substances!

There are two main types of data in PubChem that we are interested in:

  • Substances: Chemical information submitted to PubChem by more than one source. These records may be repetitive and disorganized. There are 277,335,999 (277M) substances.
  • Compounds: Standardized chemical records that PubChem derives by combing through the submitted substances. There are 110,376,731 (110M) compounds, each identified by a PubChem Compound ID (CID) number. For example, CID=222 belongs to Ammonia.

In Periodum, we want to use Compounds, not Substances, because we want the most standardized and distilled data. In other words, for this project, we will ignore the following data types for now:

  • Substances
  • BioAssays
  • Targets (genes and proteins)
  • Pathways
  • Taxonomy
  • Patents

Target Entry Number for Database: 110M!

Given the background above, we need to scrape 110M compound entries to build our database.
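To get a feel for what 110M entries means in practice, here is a rough sketch in Python that splits a CID range into batches and estimates how long a full pass would take. The batch size and the rate of 5 requests per second are assumptions for illustration, not measured values, and the helper names are hypothetical. It also treats CIDs as a dense range, which is a simplification, since some CIDs in a range may not exist.

```python
import math
from typing import Iterator, List

# Compound count quoted on this page.
TOTAL_COMPOUNDS = 110_376_731


def cid_batches(first_cid: int, last_cid: int, batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive lists of CIDs covering first_cid..last_cid (inclusive)."""
    for start in range(first_cid, last_cid + 1, batch_size):
        stop = min(start + batch_size, last_cid + 1)
        yield list(range(start, stop))


def estimated_days(total: int, batch_size: int, requests_per_second: float) -> float:
    """Estimate wall-clock days for one request per batch at a fixed request rate."""
    requests = math.ceil(total / batch_size)
    seconds = requests / requests_per_second
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    # Assumed rate: 5 requests per second (an assumption, check the current usage policy).
    for size in (1, 100):
        days = estimated_days(TOTAL_COMPOUNDS, size, 5)
        print(f"batch size {size}: about {days:.1f} days")
```

Under these assumptions, one CID per request takes the better part of a year, while 100 CIDs per request brings a full pass down to a few days, so batching matters a lot for this project.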

Compound Examples

Here are some examples of these entries:

Ways to Download PubChem Data

There are 3 main ways to access data:

  • Individual Record Download: Use the Download button at the top right.
  • Programmatic Download
  • Bulk Download
    • FTP
      • Compound: Full and incremental data dump for PubChem compounds (without annotations and 3-D conformer models).
      • 3D Compound: Computationally generated 3-D structures for PubChem compounds, along with other 3-D properties such as molecular volume, shape quadrupoles, shape fingerprint, etc.
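For the bulk route, the compound dump is split into files that each cover a fixed range of CIDs. The sketch below works out which file a given CID would fall into. The base URL, the `Compound_<start>_<end>.sdf.gz` naming pattern, and the chunk size of 500,000 are assumptions made for illustration, so check them against the actual FTP directory listing before relying on them. `bulk_file_name` is a hypothetical helper.

```python
# Assumed location of the full compound dump; verify against the FTP site.
FTP_BASE = "https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/"


def bulk_file_name(cid: int, chunk_size: int = 500_000) -> str:
    """Return the assumed bulk file name whose CID range contains `cid`.

    Assumes files are named Compound_<start>_<end>.sdf.gz with 9-digit,
    zero-padded CID bounds and `chunk_size` CIDs per file.
    """
    if cid < 1:
        raise ValueError("CIDs start at 1")
    start = ((cid - 1) // chunk_size) * chunk_size + 1
    end = start + chunk_size - 1
    return f"Compound_{start:09d}_{end:09d}.sdf.gz"


if __name__ == "__main__":
    # Ammonia (CID=222) would sit in the first file under these assumptions.
    print(FTP_BASE + bulk_file_name(222))
```

If the naming or chunk size differs, only the constants and the format string need to change; the range arithmetic stays the same.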

Difference Between PUG-View and PUG-REST

PUG-View is easily confused with PUG-REST. Both are REST-like interfaces, but they generally serve distinct kinds of data. PUG-REST primarily provides access to data that can be readily structured, such as computed properties of compounds, activity data for assays, and associations (cross-references) between PubChem records and other resources. PUG-View, on the other hand, is intended for retrieving unstructured, largely textual annotation data (e.g., excerpts about handling or first aid procedures for a chemical). The data models for the JSON/XML returned by the two services are also completely different. In general, PUG-REST is meant for grabbing small, specific bits of information, whereas PUG-View is meant for larger reports, which is one more reason PUG-View does not handle multiple records in a single request.
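The difference shows up in how the request URLs are built. The sketch below is a minimal illustration, not a full client: the helper names are hypothetical, and the URL layouts are our reading of the public PUG-REST and PUG-View documentation, so treat them as assumptions to verify. Note that the PUG-REST helper accepts several CIDs at once, while the PUG-View helper takes exactly one.

```python
import json
from typing import Iterable
from urllib.request import urlopen

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest"


def pug_rest_property_url(cids: Iterable[int], properties: Iterable[str]) -> str:
    """Build a PUG-REST URL requesting specific computed properties for one or more CIDs."""
    cid_part = ",".join(str(cid) for cid in cids)
    prop_part = ",".join(properties)
    return f"{BASE}/pug/compound/cid/{cid_part}/property/{prop_part}/JSON"


def pug_view_record_url(cid: int) -> str:
    """Build a PUG-View URL for the full annotation report of a single compound."""
    return f"{BASE}/pug_view/data/compound/{cid}/JSON"


def fetch_json(url: str) -> dict:
    """Download a URL and parse the body as JSON (no retries or rate limiting)."""
    with urlopen(url) as response:
        return json.load(response)


if __name__ == "__main__":
    # Small, structured data for two compounds in one PUG-REST request.
    print(pug_rest_property_url([222, 223], ["MolecularFormula", "MolecularWeight"]))
    # One large, mostly textual report for a single compound via PUG-View.
    print(pug_view_record_url(222))
```

For a database of computed properties, the PUG-REST form is the natural fit; PUG-View becomes relevant only if we later want the textual annotations for individual compounds.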