BioSchemas/bioschemas-data-harvesting

Bioschemas Data Harvesting

Details of the harvesting of Bioschemas markup from live deployments on the Web.

The initial purpose is to track the harvesting of data for use in Project 29 at the BioHackathon-Europe 2021. The harvesting will be conducted with BMUSE and the data hosted on a server at Heriot-Watt University.

BioHackathon 2021 Harvest

We aim to harvest data from the sites on the Bioschemas live deploy page for which we have a sitemap. We will also include sites where we have a list of URLs. Full details of the datasets to be harvested and their progress can be found on the project board.
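Turning a sitemap into a list of URLs for a scraper could look roughly like the sketch below. It is not part of BMUSE. It assumes a plain `urlset` sitemap that follows the sitemaps.org protocol, and both the function name `urls_from_sitemap` and the example URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return the page URLs found in the <loc> elements of a urlset sitemap."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical two-page sitemap, used only for illustration.
example = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/entry/1</loc></url>
  <url><loc>https://example.org/entry/2</loc></url>
</urlset>"""

print(urls_from_sitemap(example))
```

Sitemap index files, which point to further sitemaps instead of pages, are not handled here and would need an extra pass.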

We have loaded the harvested data into a GraphDB triplestore.

Notes about datasets included in the collection.

Data Harvested with BMUSE

  1. DisProt: 2,044 pages harvested using the dynamic scraper (v0.4.0) on 20 October 2021
  2. MobiDB: 2,083 pages harvested using the dynamic scraper (v0.4.0) on 27 October 2021
  3. Paired Omics: 78 pages harvested using the dynamic scraper (v0.5.0) on 28 October 2021
  4. BridgeDb: 2 pages harvested using the static scraper (v0.5.1) on 2 November 2021
  5. PCDDB: 1,402 pages harvested using the static scraper (v0.5.1) on 2 November 2021
  6. MassBank: 76,253 pages harvested using the static scraper (v0.5.0) on 4 November 2021; 10,326 pages could not be harvested due to errors in the JSON-LD. For loading into the triplestore, the N-Quads files were merged using the command find . -name '*.nq' -exec cat {} \; > massbank.nq as detailed here.
  7. Cosmic: 2,424 pages harvested using the static scraper (v0.5.2) on 4 November 2021
  8. Nanocommons: 3 pages harvested using the static scraper (v0.5.2) on 4 November 2021
  9. Alliance of Genomes: 12 pages harvested using the scraper (v0.5.2) on 5 November 2021
  10. BioVersions: 3 pages harvested using the static scraper (v0.5.2) on 5 November 2021
  11. EGA: 11,834 pages harvested using the scraper (v0.5.2) on 5 November 2021; 745 pages could not be harvested
  12. IFB: 87 pages harvested using the scraper (v0.5.2) on 5 November 2021
  13. PDBe: 672 pages harvested using the scraper (v0.5.2) on 5 November 2021
  14. Prosite: 5,859 pages harvested using the scraper (v0.5.2) on 5 November 2021
  15. UniProt: 3 pages harvested using the static scraper (v0.5.2) on 5 November 2021
  16. FAIRsharing: 6,351 pages harvested using the scraper (v0.5.2) on 6 November 2021
  17. COVID19 Portal: 20 pages harvested using the dynamic scraper (v0.5.2) on 7 November 2021
  18. GBIF: 68,167 pages harvested using the static scraper (v0.5.2) on 7 November 2021
  19. TeSS: 13,940 pages harvested using the scraper (v0.5.2) on 7 November 2021
  20. Scholia:
    • 5,345 pages harvested out of 660k supplied URLs using the dynamic scraper (v0.5.2) on 8 November 2021; 1 page did not scrape
    • 68,974 pages harvested using the dynamic scraper (v0.5.2) on 10 November 2021; 21 pages did not scrape
  21. Protein Ensemble Database (PED): 187 pages harvested using the dynamic scraper (v0.5.2) on 9 November 2021
  22. Bgee: statically scraped (v0.5.2) on 9-10 November 2021
  23. COVIDmine (no longer maintained): 49,959 pages scraped using the dynamic scraper (v0.5.2) on 8 November 2021
  24. MetaNetX: statically scraped (v0.5.2) on 11 November 2021
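The merge step described for MassBank can also be sketched in Python. This is an illustrative alternative to the `find` command, not the procedure that was actually run. The function name `merge_nquads` is hypothetical, and the sketch assumes the N-Quads files are UTF-8 text. It skips the output file if that file already sits inside the source directory, so a rerun does not feed an earlier merge back into itself.

```python
from pathlib import Path


def merge_nquads(source_dir: Path, output_file: Path) -> int:
    """Concatenate every .nq file under source_dir into output_file.

    Returns the number of files merged.
    """
    output_resolved = output_file.resolve()
    # Collect the inputs before opening the output, and leave out the output
    # file itself in case it lives inside source_dir.
    inputs = sorted(
        p for p in source_dir.rglob("*.nq") if p.resolve() != output_resolved
    )
    merged = 0
    with output_file.open("w", encoding="utf-8") as out:
        for path in inputs:
            text = path.read_text(encoding="utf-8")
            out.write(text)
            # N-Quads is line based, so make sure each file ends with a newline.
            if text and not text.endswith("\n"):
                out.write("\n")
            merged += 1
    return merged
```

A call such as `merge_nquads(Path("."), Path("massbank.nq"))` would mirror the command above. Each file is read fully into memory, which is simple but may be slow for very large dumps.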

Data Feeds and Associated Named Graph

We have started testing the loading of data dumps made available through the experimental Schema.org data feed. The following table details the feeds that have been loaded. The raw data is available here.

| Data Source | Date Generated | Date Loaded | Named Graph |
| --- | --- | --- | --- |
| bio.tools | 2021-11-09 | 2021-12-17 | http://bio.tools/comp-tools-0.6-draft/ |
| chembl-28 | 2022-01-15 | 2022-03-04 | https://www.ebi.ac.uk/chembl-28/ |

The following triples were inserted by hand to track the provenance of the data feeds. Note that the location given by pav:retrievedFrom refers to the domain of the data, and the date given by pav:retrievedOn is the date the data was generated. This is to be consistent with the data coming from BMUSE.

# Bio.Tools
INSERT DATA {
<http://bio.tools/comp-tools-0.6-draft/> <http://purl.org/pav/retrievedFrom> <https://bio.tools> .
<http://bio.tools/comp-tools-0.6-draft/> <http://purl.org/pav/retrievedOn> "2021-11-09T09:28:45"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<http://bio.tools/comp-tools-0.6-draft/> a <https://schema.org/DataFeed> .
}

# ChEMBL 28
INSERT DATA {
<https://www.ebi.ac.uk/chembl-28/> <http://purl.org/pav/retrievedFrom> <https://www.ebi.ac.uk/chembl/> .
<https://www.ebi.ac.uk/chembl-28/> <http://purl.org/pav/retrievedOn> "2022-01-15T09:28:45"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<https://www.ebi.ac.uk/chembl-28/> a <https://schema.org/DataFeed> .
}
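If more feeds are loaded, the same three provenance triples could be generated instead of typed by hand. The sketch below is one possible helper, not something used in this repository. The function name `provenance_update` is hypothetical, and it writes the property as pav:retrievedOn, the name used in the description above.

```python
PAV = "http://purl.org/pav/"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"
SCHEMA_DATAFEED = "https://schema.org/DataFeed"


def provenance_update(graph_iri: str, source_iri: str, generated: str) -> str:
    """Build a SPARQL INSERT DATA statement describing one data feed.

    graph_iri  -- IRI of the named graph the feed was loaded into
    source_iri -- domain the data comes from (pav:retrievedFrom)
    generated  -- xsd:dateTime string for when the feed was generated
    """
    return (
        "INSERT DATA {\n"
        f"<{graph_iri}> <{PAV}retrievedFrom> <{source_iri}> .\n"
        f'<{graph_iri}> <{PAV}retrievedOn> "{generated}"^^<{XSD_DATETIME}> .\n'
        f"<{graph_iri}> a <{SCHEMA_DATAFEED}> .\n"
        "}"
    )


print(
    provenance_update(
        "http://bio.tools/comp-tools-0.6-draft/",
        "https://bio.tools",
        "2021-11-09T09:28:45",
    )
)
```

The helper only builds the update text and does no escaping or validation of the IRIs, so it is only suitable for trusted, hand-checked inputs. Sending the statement to the triplestore is left to whichever client or workbench is in use.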
