
Use Case: DOIs among LOD

Tim L edited this page Jun 16, 2014 · 39 revisions

What is first

What we will cover

This page describes how to use DataFAQs to find other LOD Cloud data sources that describe a set of publications with Digital Object Identifiers (DOIs). The pattern could be applied more generally to any property whose values are effectively inverse functional.

Let's get to it!

The DataONE project is working with some documents/datasets with indexing terms, e.g.:

"status"|"doi:10.5063/AA/nceas.226.3"
"status"|"doi:10.5063/AA/nceas.227.15"

"snow"|"doi:10.6073/AA/knb-lter-arc.1423.1"

"%285"|"doi:10.6073/AA/knb-lter-fce.111.5"
"%285"|"doi:10.6073/AA/knb-lter-fce.112.5"
"%285"|"doi:10.6073/AA/knb-lter-fce.108.5"

With some quick csv2rdf4lod'ing, we can get good linked data URIs that reuse bibo:

<http://dx.doi.org/10.5063/AA/nceas.227.15>
   dcterms:isReferencedBy <http://localhost/source/patrice/dataset/index-term-doi-pairs/version/2013-Jan-09> ;
   void:inDataset <http://localhost/source/patrice/dataset/index-term-doi-pairs/version/2013-Jan-09> ;
   a index-term-doi-pairs_vocab:DigitalObject ;
   bibo:doi "10.5063/AA/nceas.227.15" ;
   ov:csvRow "2"^^xsd:integer .

<http://dx.doi.org/10.6073/AA/knb-lter-arc.1423.1>
   dcterms:isReferencedBy <http://localhost/source/patrice/dataset/index-term-doi-pairs/version/2013-Jan-09> ;
   void:inDataset <http://localhost/source/patrice/dataset/index-term-doi-pairs/version/2013-Jan-09> ;
   a index-term-doi-pairs_vocab:DigitalObject ;
   bibo:doi "10.6073/AA/knb-lter-arc.1423.1" ;
   ov:csvRow "4"^^xsd:integer .

Since we want to model this as good Linked Data, we should choose better source and dataset identifiers, per the conventions:

  • source identifier: (what organization produced this data?)
  • dataset identifier: (what would the organization call it?)

Then, we can commit the source data and the enhancement parameters (per these conventions) into an existing csv2rdf4lod node such as LOGD, whose conversion data root is here. After an svn update and a pull of a conversion trigger, it's reconverted and published as Linked Data.

DataFAQs revolves around two things: Datasets and FAqT Services. The data that we show above is the one dataset that we want to work with, i.e. "evaluate". We'll also need to create some FAqT Services to fulfill our use case of finding other Digital Objects that may be described elsewhere in the LOD Cloud.

First, we'll make sure that we have described the dataset appropriately for our use case. All that really matters is that the dataset describes some resources using the bibo vocabulary -- specifically, bibo:doi. We can reuse VoID and csv2rdf4lod's vocabulary to do that:

<http://localhost/source/patrice/dataset/index-term-doi-pairs/version/2013-Jan-09>
   a void:Dataset;
   void:vocabulary bibo:;
   conversion:uses_predicate bibo:doi;
.

Next, we need to model the inputs and outputs for each FAqT Service. The input to a FAqT Service is a dcat:Dataset, but we can and should add additional descriptions that are appropriate for our use case.

  • For each http://dx.doi.org URI, SPARQL-query each Bubble's SPARQL endpoint to see if it is described there (BTE, Between The Edges, can be used to describe the domain of a URI).
  • For each bibo:doi value, SPARQL-query each Bubble's SPARQL endpoint to see if it appears there and, if so, using which predicate.
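The two checks above can be sketched as SPARQL queries. This is a minimal illustration, not the actual FAqT services (which are separate SADI services); the helper functions and the assumption that each endpoint supports the SPARQL protocol over GET are hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical helpers sketching the two per-DOI checks; the real FAqT
# services in DataFAQs are not these functions.

def uri_described_query(doi_uri):
    """ASK whether the dx.doi.org URI is described at an endpoint."""
    return 'ASK { <%s> ?p ?o . }' % doi_uri

def doi_value_query(doi_value):
    """Find which subjects and predicates carry the bare DOI value."""
    return 'SELECT DISTINCT ?s ?p WHERE { ?s ?p "%s" . }' % doi_value

def endpoint_request_url(endpoint, query):
    """GET URL for a SPARQL endpoint (assumed to accept GET requests)."""
    return '%s?%s' % (endpoint, urlencode({'query': query}))
```

Running `doi_value_query` against each endpoint tells us not only whether the DOI appears but also which predicate carries it, which matters because not every bubble will use bibo:doi.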

The final result that we are looking for would look something like the following. The three void:inDataset assertions indicate that the same URI was found in those datasets. The owl:sameAs results when we find other URIs with the same bibo:doi in those sources. For the distinct URIs, we also indicate which LOD bubble they come from using void:inDataset. The URIs in this example are notional.

<http://dx.doi.org/10.6073/AA/knb-lter-arc.1423.1>
   void:inDataset <http://datahub.io/dataset/twc-logd>,
                  <http://datahub.io/dataset/dbpedia>,
                  <http://datahub.io/dataset/vivo-indiana-university>;
   owl:sameAs <http://ieee.rkbexplorer.com/id/10.6073/AA/knb-lter-arc.1423.1> ;
.
<http://ieee.rkbexplorer.com/id/10.6073/AA/knb-lter-arc.1423.1>
   void:inDataset <http://datahub.io/dataset/rkb-explorer-ieee>;
.

Since most LOD Cloud bubbles should have SPARQL endpoints, we'll avoid trying to access and load their void:dataDumps and instead query each SPARQL endpoint from within the FAqT Service. This assumption should be verified: how many bubbles offer endpoints, how many offer data dumps, and how many do both or neither?

We also need to find out all of the bubbles in the lodcloud group. This can be done using the by-ckan-group.py [DataFAQs Core Service](DataFAQs Core Services).
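A sketch of what by-ckan-group.py does: list the datasets in CKAN's lodcloud group. The endpoint URL and response shape below follow CKAN's documented action API (group_package_show), not DataFAQs' own code, so treat the specifics as assumptions.

```python
import json
from urllib.parse import urlencode

# Assumed CKAN action API endpoint on datahub.io (CKAN's standard layout).
CKAN_GROUP_SHOW = 'http://datahub.io/api/3/action/group_package_show'

def group_query_url(group_id, limit=1000):
    """URL that asks CKAN for the packages (datasets) in a group."""
    return '%s?%s' % (CKAN_GROUP_SHOW, urlencode({'id': group_id, 'limit': limit}))

def dataset_names(response_text):
    """Pull dataset names out of a group_package_show JSON response."""
    body = json.loads(response_text)
    return [pkg['name'] for pkg in body.get('result', [])]
```

Fetching `group_query_url('lodcloud')` and feeding the body to `dataset_names` would yield the list of bubble identifiers to evaluate.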

For a given datafaqs:CKANDataset, we need to find the SPARQL endpoint that is serving it. The existing FAqT service lift-ckan (deployed here) accepts a datafaqs:CKANDataset and returns a well-structured RDF description using the DC Terms, VoID, and other vocabularies. This will include the void:sparqlEndpoint property, which we can use to query the dataset. lift-ckan can be used as a datafaqs:DatasetAugmenter within a FAqT Brick configuration (see "augmenters" in DataFAQs Core Services). Dataset augmenters provide additional descriptions about the POSTed URI and can supplement the descriptions provided by direct URI dereferencing. We need to do this because CKAN does not provide useful RDF descriptions when their datasets are dereferenced.

So, we've described our DOI dataset (as using bibo:doi), we've selected all 334 lodcloud datasets, and we've determined each lodcloud bubble's SPARQL endpoint (if available). This is enough metadata for a SADI service to query each SPARQL endpoint for the presence of our DOI URIs and our DOI values, and return a description of the dataset interconnections using the structure that we sketched above.
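Assembling that output could look like the following sketch, which turns query matches into N-Triples following the structure sketched earlier. The function and its input shape are hypothetical; only the void:inDataset and owl:sameAs modeling comes from the example above.

```python
# Hypothetical assembly step: `matches` pairs each URI found for a DOI
# with the CKAN dataset URI of the bubble it was found in.

VOID_IN_DATASET = 'http://rdfs.org/ns/void#inDataset'
OWL_SAME_AS = 'http://www.w3.org/2002/07/owl#sameAs'

def interlink_triples(doi_uri, matches):
    """matches: list of (found_uri, bubble_dataset_uri) pairs."""
    lines = []
    for found_uri, bubble in matches:
        if found_uri == doi_uri:
            # Same URI reused by the bubble: just record where it appears.
            lines.append('<%s> <%s> <%s> .' % (doi_uri, VOID_IN_DATASET, bubble))
        else:
            # Different URI with the same bibo:doi: assert equivalence,
            # and record which bubble the other URI is from.
            lines.append('<%s> <%s> <%s> .' % (doi_uri, OWL_SAME_AS, found_uri))
            lines.append('<%s> <%s> <%s> .' % (found_uri, VOID_IN_DATASET, bubble))
    return '\n'.join(lines)
```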

To make all of this actually happen, we write a FAqT Brick configuration and invoke the df-epoch.sh script. First, though, we'll need to [Install DataFAQs](Installing DataFAQs) and make sure the [environment variables](DATAFAQS environment variables) are set correctly to publish the results to a TDB, Virtuoso, or Sesame triple store. Rest assured, all sorts of metadata and provenance are recorded within a FAqT Brick, so it should be easy to discover, access, and navigate as linked data -- if it's not, let me know!

What is next
