
FAqT Service

Tim L edited this page Jul 5, 2014 · 200 revisions

What's first

What we'll cover

This page will walk you through the steps to create a new FAqT evaluation service. By creating and deploying an evaluation service, others will be able to ask what you think about their dataset by calling your service.

Let's get to it

A FAqT Service is a [SADI](SADI Semantic Web Services framework) service that accepts any dataset URI and returns an RDF-encoded evaluation using the FAqT Vocabulary. If a FAqT Service is invoked during an evaluation epoch, it becomes part of the FAqT Brick that accumulates evaluation results and can be browsed using the FAqT Brick Explorer.

How to create a FAqT service

(These instructions are for Python; we later switched to Java because Python kept falling over on Unicode issues.)

    1. First, git clone your fork of git://github.com/timrdf/DataFAQs.git, which creates a directory DataFAQs on your local system.
    1. Decide the local name and relative path of the service that you want to create.
    • Choosing our new service's relative path keeps it organized among the other services that we have created.
    • The path that we choose organizes the service's source code in our code repository, as well as when it is deployed on a server.
    • e.g. services/sadi/faqt/sparql-service-description is the relative path for named-graphs.py in this code repository. The (same) relative path, services/sadi/faqt/sparql-service-description, also locates named-graphs, the deployed copy of that code, relative to this server.
    • Make the directory for the relative path. For example, if the service's relative path is services/sadi/faqt/sparql-service-description, mkdir -p services/sadi/faqt/sparql-service-description from within DataFAQs/.
    1. Copy the template.
    • cp services/sadi/faqt-template.py <relative-path>/<local-name>.py, e.g. cp services/sadi/faqt-template.py services/sadi/faqt/sparql-service-description/named-graphs.py
    1. Edit your copy of the template to make it your own.
    • cd <relative-path> e.g. cd services/sadi/faqt/sparql-service-description/
    • vi <local-name>.py e.g. vi named-graphs.py
    • 3.A) Replace the value of servicePath = 'services/sadi' (use pwd | sed 's/^.*services/services/').
    • 3.B) Replace TEMPLATE-CLASS-NAME with a name for the python class.
    • 3.C) Replace TEMPLATE-NAME with a name for the service (it will become part of the service's external URI); use the local name that you chose earlier.
    • 3.D) Provide a description in the attribute serviceDescriptionText.
    • 3.E) [optional] Provide a comment in the attribute comment.
    • 3.F) Replace the value of result.protegedc_creator = '' with your email address.
    • 3.G) Replace the value of dev_port = 9106 with a port reserved in the Test ports list below (add a new entry for your service).
    1. Implement the process(self, input, output) method.
    • Set the return values of getInputClass and getOutputClass to characterize your SADI service.
    • Add any new namespace prefixes that you want to use (e.g. ns.register(sd='http://www.w3.org/ns/sparql-service-description#'))
    • Evaluate the dataset URI input.subject in def process(self, input, output): and say what you think about it by describing output. (For the SuRF and rdflib concepts, see SADI Semantic Web Services framework)
    • Use [Beautiful Soup](FAqT Service using Beautiful Soup) or [Ripple](FAqT Service using Ripple)
    • Use SuRF to execute SPARQL queries against the POSTed RDF graph similar to how add-metadata.py does it.
    1. Test your service.
    • Create sample inputs in <TEMPLATE-NAME>-materials/sample-inputs/ (e.g. mondeca.ttl)
    • Temporarily deploy the service on localhost (e.g. python named-graphs.py)
    • Invoke the service
      • Modify the example call that the service offers: curl -H "Content-Type: text/turtle" -d @my.ttl http://localhost:9106/named-graphs
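
Step 3.A derives the value of servicePath with pwd | sed 's/^.*services/services/'. As a rough illustration, the same transformation can be sketched in Python. The function name service_path is hypothetical and is not part of the template; the sketch is written for Python 3, separately from the service code itself.

```python
import re

def service_path(directory):
    """Drop everything before the last 'services' in a directory path.

    Mirrors: pwd | sed 's/^.*services/services/'
    A path that does not contain 'services' is returned unchanged.
    """
    return re.sub(r'^.*services', 'services', directory, count=1)
```

Calling service_path on the absolute path of the service's directory gives the value to paste into servicePath.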

How to run a FAqT service locally

Add the following to __main__, like in add-metadata.py.

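      import sys   # needed for sys.argv if the file does not already import it

      # Arguments: INPUT [MIME-TYPE [OUTPUT]]
      reader   = open(sys.argv[1], "r")
      mimeType = "application/rdf+xml"   # default when no MIME type is given
      if len(sys.argv) > 2:
         mimeType = sys.argv[2]

      # Process the input graph directly, without starting a server.
      graph = resource.processGraph(reader, mimeType)
      reader.close()

      if len(sys.argv) > 3:
         # Write the result to the file named by the third argument.
         writer = open(sys.argv[3], "w")
         writer.write(resource.serialize(graph, mimeType))
         writer.close()
      else:
         print resource.serialize(graph, mimeType)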
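
The argument handling above (input file, optional MIME type, optional output file) can also be isolated into a small function, which makes the defaults easier to see. This is only a sketch: parse_args is a hypothetical name, not something defined by the sadi library or the template, and it is written for Python 3.

```python
def parse_args(argv):
    """Return (input_path, mime_type, output_path) from a sys.argv-style list.

    mime_type defaults to application/rdf+xml; output_path is None when the
    result should be printed to standard output.
    """
    if len(argv) < 2:
        program = argv[0] if argv else 'service.py'
        raise SystemExit('usage: %s INPUT [MIME-TYPE [OUTPUT]]' % program)
    input_path = argv[1]
    mime_type = argv[2] if len(argv) > 2 else 'application/rdf+xml'
    output_path = argv[3] if len(argv) > 3 else None
    return input_path, mime_type, output_path
```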

A second example

In this section, we'll walk through a second example. The FAqT service that we create here will reproduce some of the analysis that LODStats does. On 4 Feb 2012, they report that 59 datasets were accessible via SPARQL endpoints and 142 datasets had SPARQL endpoint errors.

We'll pick one successful dataset (fu-berlin-stitch) and one unsuccessful dataset (2000-us-census-rdf) from their lists and try to reproduce their results.

First, we'll choose the relative URI of our new FAqT evaluation service:

services/sadi/faqt/access/in-sparql-endpoint

We'll make a new directory in our GitHub repository (you could make yours in your fork of this repository if you'd like):

/opt/DataFAQs$ ls

bin
doc
lib
ontology
queries
readme.md
services
ui

/opt/DataFAQs$ mkdir services/sadi/faqt/access/
/opt/DataFAQs$ cd services/sadi/faqt/access/

Then, we'll copy the template and change the names and development port:

/opt/DataFAQs/services/sadi/faqt/access/$ cp ../../faqt-template.py in-sparql-endpoint.py

/opt/DataFAQs/services/sadi/faqt/access/$ vi in-sparql-endpoint.py
 :% s/TEMPLATE-NAME/in-sparql-endpoint/gc
 :% s/TEMPLATE-CLASS-NAME/InSPARQLEndpoint/gc
 :% s/9090/9109/gc
serviceDescriptionText = 'Queries into the void:sparqlEndpoint of the dcat:Dataset and reports if the endpoint is there.'
comment                = 'Initial purpose was to evaluate LOD datasets.'
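
The three vi substitutions above could also be scripted. The following is a minimal sketch, assuming the template contains the literal placeholders TEMPLATE-NAME and TEMPLATE-CLASS-NAME and the default port 9090, as the substitutions suggest. The function customize_template is hypothetical. Unlike vi's c flag, which asks for confirmation on each match, it replaces every occurrence, so review the result before using it.

```python
def customize_template(text, name, class_name, port, template_port=9090):
    """Fill in the service name, class name, and development port."""
    text = text.replace('TEMPLATE-CLASS-NAME', class_name)
    text = text.replace('TEMPLATE-NAME', name)
    # Blunt: replaces the port number wherever it appears in the text.
    text = text.replace(str(template_port), str(port))
    return text
```

For the example in this section, the arguments would be 'in-sparql-endpoint', 'InSPARQLEndpoint', and 9109.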

In a second terminal, we can temporarily deploy the service on localhost (ignore the DeprecationWarning for the md5 and sha modules):

$ cd /opt/DataFAQs/github/DataFAQs/services/sadi/faqt/access
$ python in-sparql-endpoint.py 
...
in-sparql-endpoint running on port 9109. Invoke it with:
curl -H "Content-Type: text/turtle" -d @my.ttl http://localhost:9109/in-sparql-endpoint

So, our service is up and ready for someone to ask it what it thinks about a dataset. We can make sure by opening a third terminal and asking the service to describe itself:

$ cd /opt/DataFAQs/github/DataFAQs/services/sadi/faqt/access
$ curl http://localhost:9109/in-sparql-endpoint

@prefix mygrid: <http://www.mygrid.org.uk/mygrid-moby-service#> .
...
<> a <http://www.mygrid.org.uk/mygrid-moby-service#serviceDescription>;
    rdfs:label "in-sparql-endpoint";
...
<#input> a <http://www.mygrid.org.uk/mygrid-moby-service#parameter>;
    mygrid:objectType <http://www.w3.org/ns/dcat#Dataset>  .
...
<#output> a <http://www.mygrid.org.uk/mygrid-moby-service#parameter>;
    mygrid:objectType <http://purl.org/twc/vocab/datafaqs#EvaluatedDataset> .
...

From this, we see that the evaluation service accepts RDF descriptions of dcat:Datasets and returns RDF descriptions of the same instances that will then be typed as datafaqs:EvaluatedDataset. This conforms to the design of the SADI Semantic Web Services framework.

Let's make the sample input using the examples we are using from LODStats:

$ cd /opt/DataFAQs/services/sadi/faqt/access/
/opt/DataFAQs/services/sadi/faqt/access/$ mkdir -p in-sparql-endpoint-materials/sample-inputs
$ cd in-sparql-endpoint-materials/sample-inputs
$ curl -s http://prefix.cc/dcat,datafaqs.file.n3 > 1-good-1-bad-from-lodstat.ttl

Then make 1-good-1-bad-from-lodstat.ttl list the two datasets that we want to evaluate. The type needs to match the type returned by your evaluation service's getInputClass function (which is used to create the service description above).

@prefix dcat:     <http://www.w3.org/ns/dcat#> .
@prefix datafaqs: <http://purl.org/twc/vocab/datafaqs#> .

<http://thedatahub.org/dataset/fu-berlin-stitch>     a dcat:Dataset .
<http://thedatahub.org/dataset/2000-us-census-rdf>   a dcat:Dataset .
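
If you have many datasets to list, a sample input like the one above can be generated instead of typed. This is a small sketch using plain string formatting; sample_input is a hypothetical helper, and it assumes the URIs contain no characters that would need escaping in Turtle.

```python
PREFIXES = (
    '@prefix dcat:     <http://www.w3.org/ns/dcat#> .\n'
    '@prefix datafaqs: <http://purl.org/twc/vocab/datafaqs#> .\n'
)

def sample_input(dataset_uris, input_class='dcat:Dataset'):
    """Return Turtle that types each URI with the service's input class."""
    lines = [PREFIXES]
    for uri in dataset_uris:
        lines.append('<%s> a %s .' % (uri, input_class))
    return '\n'.join(lines) + '\n'
```

The input_class argument should match whatever your service's getInputClass returns.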

Next, send the descriptions of the datasets to the evaluation service and see what it thinks about them:

curl -H "Content-Type: text/turtle" -d @1-good-1-bad-from-lodstat.ttl http://localhost:9109/in-sparql-endpoint

<http://thedatahub.org/dataset/2000-us-census-rdf> a <http://purl.org/twc/vocab/datafaqs#Unsatisfactory> .

<http://thedatahub.org/dataset/fu-berlin-stitch> a <http://purl.org/twc/vocab/datafaqs#Unsatisfactory> .

Because the template that we copied asserts Unsatisfactory by default, every dcat:Dataset we send this service will be Unsatisfactory until we implement the def process(self, input, output): function.

To do that, we'll need a bit more than the URI of the dataset. A FAqT Service is allowed to assume that the RDF descriptions it receives about a dcat:Dataset already include the RDF obtained by dereferencing the dcat:Dataset's URI. That's because [DataFAQs core](FAqT Brick) does this beforehand as it constructs an evaluation epoch. To avoid setting up a FAqT Brick now, we can grab the RDF descriptions ourselves:

cd /opt/DataFAQs/services/sadi/faqt/access/in-sparql-endpoint-materials/sample-inputs/
curl -sH "Accept: application/rdf+xml" -L http://thedatahub.org/dataset/fu-berlin-stitch   > fu-berlin-stitch.rdf
curl -sH "Accept: application/rdf+xml" -L http://thedatahub.org/dataset/2000-us-census-rdf > 2000-us-census-rdf.rdf
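
The same content-negotiated fetch can be sketched with the Python 3 standard library, whose urlopen follows redirects by default, much as curl -L does. The function names rdf_request and fetch_rdf are hypothetical, and nothing here is part of DataFAQs.

```python
import urllib.request

def rdf_request(dataset_uri, accept='application/rdf+xml'):
    """Build a GET request that asks for an RDF serialization."""
    return urllib.request.Request(dataset_uri, headers={'Accept': accept})

def fetch_rdf(dataset_uri, path):
    """Dereference the dataset URI and save the response body to path."""
    with urllib.request.urlopen(rdf_request(dataset_uri)) as response:
        data = response.read()
    with open(path, 'wb') as out:
        out.write(data)
```

For example, fetch_rdf('http://thedatahub.org/dataset/fu-berlin-stitch', 'fu-berlin-stitch.rdf') corresponds to the first curl call above.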

In there, we see the association between the dataset we're trying to evaluate and the endpoint that we want to make sure works:

<dcat:Dataset rdf:about="http://thedatahub.org/dataset/fu-berlin-stitch">
   <void:sparqlEndpoint rdf:resource="http://www4.wiwiss.fu-berlin.de/stitch/sparql"/>
...
<dcat:Dataset rdf:about="http://thedatahub.org/dataset/2000-us-census-rdf">
   <void:sparqlEndpoint rdf:resource="http://www.rdfabout.com/sparql"/>
...
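
To check which endpoint each downloaded description names, a quick sketch with xml.etree.ElementTree is enough. It is not a full RDF parser: it assumes the RDF/XML has the same shape as the excerpt above, with void:sparqlEndpoint as a direct child of dcat:Dataset. SAMPLE is a trimmed, hand-written stand-in for the downloaded file, and sparql_endpoints is a hypothetical name.

```python
import xml.etree.ElementTree as ET

NS = {
    'rdf':  'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'dcat': 'http://www.w3.org/ns/dcat#',
    'void': 'http://rdfs.org/ns/void#',
}
RDF_ABOUT    = '{%s}about' % NS['rdf']
RDF_RESOURCE = '{%s}resource' % NS['rdf']

SAMPLE = '''<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcat="http://www.w3.org/ns/dcat#"
         xmlns:void="http://rdfs.org/ns/void#">
  <dcat:Dataset rdf:about="http://thedatahub.org/dataset/fu-berlin-stitch">
    <void:sparqlEndpoint rdf:resource="http://www4.wiwiss.fu-berlin.de/stitch/sparql"/>
  </dcat:Dataset>
</rdf:RDF>'''

def sparql_endpoints(rdf_xml):
    """Map each dcat:Dataset URI to the list of its void:sparqlEndpoint URIs."""
    root = ET.fromstring(rdf_xml)
    endpoints = {}
    for dataset in root.iter('{%s}Dataset' % NS['dcat']):
        found = dataset.findall('void:sparqlEndpoint', NS)
        endpoints[dataset.get(RDF_ABOUT)] = [e.get(RDF_RESOURCE) for e in found]
    return endpoints
```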

Seeing the input that we'll be processing, we can implement the function:

   def process(self, input, output):
      print 'processing ' + input.subject

      if input.void_sparqlEndpoint:
         endpoint = input.void_sparqlEndpoint.first
         output.void_sparqlEndpoint = endpoint
         result = {}
         try:
            print '          ',
            print endpoint
            # Ask for one typed resource, first in any named graph, then in the default graph.
            queries = [ 'select distinct ?type where { graph ?g { [] a ?type } } limit 1', 
                        'select distinct ?type where {[] a ?type} limit 1' ]
            for query in queries:
               # Skip the remaining queries once one of them has succeeded.
               if ns.DATAFAQS['Satisfactory'] not in output.rdf_type:
                  store   = Store(reader = 'sparql_protocol', endpoint = endpoint)
                  session = Session(store)
                  session.enable_logging = False
                  result = session.default_store.execute_sparql(query)
                  if result['results'] != None:
                     for binding in result['results']['bindings']:
                        type_uri = binding['type']['value']
                        output.rdf_type.append(ns.DATAFAQS['Satisfactory'])
                        print '          ',
                        print type_uri
         except Exception:
            print '           BAD ENDPOINT'
            output.rdf_type.append(ns.DATAFAQS['Unsatisfactory'])
            # result may still be the initial {} (the request failed before a
            # response arrived), in which case there is no body to read.
            if hasattr(result, 'read'):
               output.datafaqs_error = result.read()
      else:
         print '           NO ENDPOINT'
         output.rdf_type.append(ns.DATAFAQS['Unsatisfactory'])
         output.datafaqs_error = 'Dataset was not described with predicate void:sparqlEndpoint.'

      # No query returned a binding: the endpoint answered but gave us nothing.
      if ns.DATAFAQS['Satisfactory'] not in output.rdf_type:
         output.rdf_type.append(ns.DATAFAQS['Unsatisfactory'])

      output.save()
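
The decision logic in process can be separated from SuRF and the network, which makes it easy to test by hand. The sketch below is written for Python 3 and rests on assumptions: evaluate_endpoint is a hypothetical function, and run_query stands in for whatever executes the SPARQL query (in the service above, that is the SuRF session). It returns the local name of the datafaqs class that the service would assert.

```python
QUERIES = [
    'select distinct ?type where { graph ?g { [] a ?type } } limit 1',
    'select distinct ?type where {[] a ?type} limit 1',
]

def evaluate_endpoint(endpoint, run_query, queries=QUERIES):
    """Return 'Satisfactory' if any query yields a binding, else 'Unsatisfactory'.

    run_query(endpoint, query) is a hypothetical callable that returns a
    SPARQL JSON results dict, or raises if the endpoint misbehaves.
    """
    if not endpoint:
        return 'Unsatisfactory'          # no void:sparqlEndpoint was given
    for query in queries:
        try:
            result = run_query(endpoint, query)
            if result['results']['bindings']:
                return 'Satisfactory'
        except Exception:
            return 'Unsatisfactory'      # bad endpoint: stop at the first failure
    return 'Unsatisfactory'              # endpoint answered, but with no bindings
```

As in the service, a failure on the first query ends the evaluation; the second query is only tried when the first one returns no bindings.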

Redeploy the service (python in-sparql-endpoint.py) and call it for each dataset to see that we can reproduce the results that LODStats reports (fu-berlin-stitch good, 2000-us-census-rdf bad):

$ curl -sH "Content-Type: application/rdf+xml" -d @fu-berlin-stitch.rdf http://localhost:9109/in-sparql-endpoint
@prefix void: <http://rdfs.org/ns/void#> .

<http://thedatahub.org/dataset/fu-berlin-stitch> a <http://purl.org/twc/vocab/datafaqs#Satisfactory>;
    void:sparqlEndpoint <http://www4.wiwiss.fu-berlin.de/stitch/sparql> .

$ curl -sH "Content-Type: application/rdf+xml" -d @2000-us-census-rdf.rdf http://localhost:9109/in-sparql-endpoint
@prefix datafaqs: <http://purl.org/twc/vocab/datafaqs#> .
@prefix void: <http://rdfs.org/ns/void#> .

<http://thedatahub.org/dataset/2000-us-census-rdf> a <http://purl.org/twc/vocab/datafaqs#Unsatisfactory>;
    datafaqs:error """
""";
    void:sparqlEndpoint <http://www.rdfabout.com/sparql> .

After [deploying the service](Sample FAqT deployment to) to its public home, we can register it at the SADI registry and see it listed at http://sadiframework.org/registry/services.

If you want to run an evaluation epoch with just the in-sparql-endpoint evaluation service and the two datasets in the example, use this epoch configuration.

Test ports

This section is used to reserve ports for each FAqT evaluation service, so we can test many at the same time without having collisions. The services listed are available in the repository.

  • 9090 https://github.com/timrdf/DataFAQs/blob/master/services/sadi/ckan/add-metadata.rpy
  • 9091 https://github.com/timrdf/DataFAQs/blob/master/services/sadi/faqt/void-triples.rpy
  • 9092 https://github.com/timrdf/DataFAQs/blob/master/services/sadi/faqt/internet-domain.rpy
  • 9093 redirect-loop.rpy
  • 9094 class-and-predicate-capitalization.rpy
  • 9095 triple-count-accuracy.rpy
  • 9096 instances-are-explicitly-typed.rpy
  • 9097 instances-are-typed-by-domain-and-range.rpy
  • 9098 by-ckan-group.rpy
  • 9099 with-preferred-uri-and-ckan-meta-void.rpy
  • 9100 vocabulary-resolves-to-description.rpy
  • 9101 via-sparql-query.rpy on sparql.tw
  • 9102 void-properties.rpy
  • 9103 predicate-counter.rpy
  • 9104 lodcloud/max-1-tag.rpy
  • 9105 lodcloud/identity
  • 9106 faqt/sparql-service-description/named-graphs.rpy
  • 9107 csv2rdf4lod-as-ckan.rpy
  • 9108 select-datasets/identity.rpy
  • 9109 access/in-sparql-endpoint.rpy (deployed) (meta)
  • 9110 core/select-dataset/by-ckan-tag.rpy
  • 9111 contributor-email.rpy
  • 9112 fake-goef-coverage.rpy
  • 9113 select-datasets/via-sparql-query.rpy
  • 9114 logd-catalog-listing.rpy
  • 9115 wikitable-gspo.rpy
  • 9116 wikitable-fol.rpy
  • 9117 rdf2asn.rpy
  • 9118 lena-example.rpy
  • 9119 faqt/access/void-subset-tree-dumps
  • 9120 faqt/provenance/named-graph-derivation.rpy
  • 9121 core/select-faqts/towards/ckan-tag.rpy
  • 9122 references-instance-hub.rpy
  • 9223 datascape/size.rpy
  • 9224 connected/void-linkset.py
  • 9225 core/augment-dataset/lift-ckan.py
  • 9226 core/augment-dataset/sameas-org.py
  • 9227 access/void-datadump.py
  • 9228 visko-planner.py
  • 9229 w3c-mail-archives.py
  • 9230 w3c-mail-archives-per-month.py
  • 9231 w3c-mail-archives-message.py
  • 9232 via-hypermail/groups.py
  • 9233 by-ckan-installation.py
  • 9234 with-rdf-extension.py
  • 9235 services/sadi/faqt/naming/between-the-edges
  • 9236 services/sadi/faqt/vocabulary/uses/prov
  • 9237 services/sadi/faqt/vocabulary/uses/dcat
  • 9238 services/sadi/faqt/vocabulary/uses/void
  • 9239 services/sadi/faqt/vocabulary/uses/dcterms
  • 9240 services/sadi/bibo/subject-broader.py
  • 9241 lod-tag-and-lodcloud-group-contacts.py

The faqt-template.rpy prints a sample of how to call the service when it starts:

if __name__ == '__main__':
   print resource.name + ' running on port ' + str(resource.dev_port) + '. Invoke it with:'
   print 'curl -H "Content-Type: text/turtle" -d @my.ttl http://localhost:' + str(resource.dev_port) + '/' + resource.name
   sadi.publishTwistedService(resource, port=resource.dev_port)

which is usually either of:

curl -H "Content-Type: text/turtle"         -d @my.ttl http://localhost:9090/add-metadata
curl -H "Content-Type: application/rdf+xml" -d @my.rdf http://localhost:9090/add-metadata
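
The two calls differ only in the Content-Type header, which has to match the serialization of the file being POSTed. As a sketch of the same invocation from Python 3, with hypothetical helper names and an assumed mapping from file extension to MIME type:

```python
import os
import urllib.request

# Assumed mapping, based on the two curl examples above.
CONTENT_TYPES = {'.ttl': 'text/turtle', '.rdf': 'application/rdf+xml'}

def content_type_for(path):
    """Pick the Content-Type from the input file's extension."""
    return CONTENT_TYPES[os.path.splitext(path)[1].lower()]

def invoke_request(service_url, path, body):
    """Build a POST request carrying body (bytes) with a matching Content-Type."""
    headers = {'Content-Type': content_type_for(path)}
    return urllib.request.Request(service_url, data=body, headers=headers)
```

Passing the returned request to urllib.request.urlopen sends it to the running service.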

What's next
