Home
This wiki describes how to build a compound database from PubChem, an open data source.
There are two main types of data in PubChem that we are interested in:
- Substances: Chemical information submitted to PubChem by its data sources. Because the same chemical can be submitted by more than one source, these records may be repetitive and disorganized. There are 277,335,999 (277M) substances.
- Compounds: Standardized chemical records that PubChem derives by combing through the submitted substances. There are 110,376,731 (110M) compounds, each identified by a PubChem Compound ID (CID). For example, CID 222 is ammonia.
In Periodum, we use Compounds rather than Substances because we want the most standardized and distilled data. For now, this project ignores the following data types:
- Substances
- BioAssays
- Targets (genes and proteins)
- Pathways
- Taxonomy
- Patents
Given the background above, we need to collect 110M compound entries to build our database.
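To get a feel for this scale, a rough back-of-the-envelope estimate helps. The sketch below assumes one request per compound and a ceiling of 5 requests per second. That rate is an assumption based on PubChem's usage policy for programmatic access and should be checked against the current limits; the function name is hypothetical.

```python
# Rough estimate of how long one-request-per-compound retrieval would take.
COMPOUND_COUNT = 110_376_731   # compound count quoted above
REQUESTS_PER_SECOND = 5        # ASSUMED rate limit; verify against PubChem's current policy
SECONDS_PER_DAY = 86_400


def estimated_days(n_records: int, requests_per_second: float) -> float:
    """Return the minimum wall-clock days needed at a fixed request rate."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    return n_records / requests_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    days = estimated_days(COMPOUND_COUNT, REQUESTS_PER_SECOND)
    print(f"About {days:.1f} days of continuous requests")
```

Under these assumptions the total comes to roughly 255 days of uninterrupted requests, which suggests that record-by-record retrieval is a poor fit for the full compound set.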
Here are some examples of these entries:
There are three main ways to access the data:
- Individual Record Download: Use the Download button at the top right.
- Programmatic Download
- Bulk Download
  - FTP
    - Compound: Full and incremental data dump for PubChem compounds (without annotations and 3-D conformer models).
    - 3D Compound: Computationally generated 3-D structures for PubChem compounds, along with other 3-D properties such as molecular volume, shape quadrupoles, shape fingerprint, etc.
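As a minimal sketch of working with the Compound bulk dump, the helper below maps a CID to the archive expected to contain it. The base URL, the 500,000-CID range per file, and the `Compound_<start>_<end>.sdf.gz` naming pattern are assumptions about the layout of the FTP site and are not stated in this wiki; confirm them against the directory listing before use. The function names are hypothetical.

```python
# Sketch: map a CID to the bulk SDF archive expected to contain it.
# ASSUMPTIONS: each archive covers 500,000 consecutive CIDs and is named
# Compound_<start>_<end>.sdf.gz with 9-digit zero padding under BASE_URL.
BASE_URL = "https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/"
CIDS_PER_FILE = 500_000


def bulk_file_name(cid: int) -> str:
    """Return the assumed archive name whose CID range includes `cid`."""
    if cid < 1:
        raise ValueError("CIDs start at 1")
    start = (cid - 1) // CIDS_PER_FILE * CIDS_PER_FILE + 1
    end = start + CIDS_PER_FILE - 1
    return f"Compound_{start:09d}_{end:09d}.sdf.gz"


def bulk_file_url(cid: int) -> str:
    """Return the assumed download URL for the archive containing `cid`."""
    return BASE_URL + bulk_file_name(cid)


if __name__ == "__main__":
    print(bulk_file_url(222))  # archive expected to hold ammonia (CID 222)
```

Computing the name from the CID avoids listing the whole directory when only a few compounds are needed, but it only works if the naming assumptions above hold.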
PUG-View can be easily confused with PUG-REST. Although both are REST-like interfaces, they serve distinct kinds of data. PUG-REST primarily provides access to data that can be readily structured, such as computed properties of compounds, activity data for assays, and associations (cross-references) between PubChem records and other resources. PUG-View, on the other hand, is intended for retrieving unstructured, largely textual annotation data (e.g., excerpts about handling or first aid procedures for a chemical). The data models for the JSON/XML returned by these services are also completely different. In general, PUG-REST is intended for grabbing small, specific bits of information, whereas PUG-View is used for larger reports, which is another reason PUG-View does not handle multiple records in a single request.
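The difference between the two interfaces shows up in the request URLs. The sketch below only builds URLs and makes no network calls; the helper names are hypothetical, and the property names `MolecularFormula` and `MolecularWeight` are used purely as illustrations.

```python
from typing import Iterable

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
PUG_VIEW_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view"


def pug_rest_property_url(cids: Iterable[int], properties: Iterable[str]) -> str:
    """Build a PUG-REST URL for computed properties of one or more CIDs."""
    cid_part = ",".join(str(cid) for cid in cids)
    prop_part = ",".join(properties)
    return f"{PUG_REST_BASE}/compound/cid/{cid_part}/property/{prop_part}/JSON"


def pug_view_url(cid: int) -> str:
    """Build a PUG-View URL for the annotation report of a single CID."""
    return f"{PUG_VIEW_BASE}/data/compound/{cid}/JSON"


if __name__ == "__main__":
    # Structured properties for two compounds in one PUG-REST request.
    print(pug_rest_property_url([222, 223], ["MolecularFormula", "MolecularWeight"]))
    # Full annotation report for ammonia; PUG-View takes one CID per request.
    print(pug_view_url(222))
```

`pug_view_url` deliberately accepts a single integer, mirroring the one-record-per-request behavior of PUG-View.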
- Users often confuse PUG-View with PUG-REST. While PUG-REST retrieves property values computed by PubChem, PUG-View retrieves annotations collected from other data sources.
- Unlike PUG-REST, PUG-View accepts only CIDs (not chemical names, InChIKeys or other identifiers). Therefore, to get annotations for non-CID identifiers, first convert the identifiers to CIDs and then use those CIDs in PUG-View requests.
- Another important difference between PUG-REST and PUG-View is that PUG-View cannot take multiple CIDs in a single request, whereas PUG-REST can. That is, of the following two PUG-View requests, only the first one will work: