FAqT Service

What's first

Getting started
CKAN, if you're writing a service that uses their API.
Installing DataFAQs, to extend the faqt service (python) superclass.
DataFAQs Core Services for selecting datasets and FAqT Services, and adding descriptions to datasets.

What we'll cover

This page walks you through the steps to create a new FAqT evaluation service. Once you create and deploy an evaluation service, others can ask what you think about their dataset by calling it.

Let's get to it

A FAqT Service is a [SADI](SADI Semantic Web Services framework) service that accepts any dataset URI and returns an RDF-encoded evaluation using the FAqT Vocabulary. If a FAqT Service is invoked during an evaluation epoch, it becomes part of the FAqT Brick that accumulates evaluation results and can be browsed using the FAqT Brick Explorer.

How to create a FAqT service

(These steps are for Python; we have since switched to Java because Python kept falling over on Unicode issues.)

1. First, git clone your fork of git://github.com/timrdf/DataFAQs.git, which creates a directory DataFAQs on your local system.
1. Decide the local name and relative path of the service that you want to create.
- Choosing our new service's relative path keeps it organized among the other services that we have created.
- The path that we choose organizes the service's source code in our code repository, as well as when it is deployed on a server.
- e.g. services/sadi/faqt/sparql-service-description is the relative path for named-graphs.py in this code repository. The same relative path, services/sadi/faqt/sparql-service-description, locates named-graphs, the deployed form of that code, on this server.
- Make the directory for the relative path. For example, if the service's relative path is services/sadi/faqt/sparql-service-description, mkdir -p services/sadi/faqt/sparql-service-description from within DataFAQs/.
1. Copy the template.
- cp services/sadi/faqt-template.py <relative-path>/<local-name>, e.g. cp services/sadi/faqt-template.py services/sadi/faqt/sparql-service-description/named-graphs.py
1. Edit your copy of the template to make it your own.
- cd <relative-path> e.g. cd services/sadi/faqt/sparql-service-description/
- vi <local-name>.py e.g. vi named-graphs.py
- 3.A) Replace the value of servicePath = 'services/sadi' (use pwd | sed 's/^.*services/services/').
- 3.B) Replace TEMPLATE-CLASS-NAME with a name for the python class.
- 3.C) Replace TEMPLATE-NAME with a name for the service (will become part of its external URI); use the local name that you chose when deciding the service's relative path.
- 3.D) Provide a description in the attribute serviceDescriptionText.
- 3.E) [optional] Provide a comment in the attribute comment.
- 3.F) Replace the value of result.protegedc_creator = '' with your email address.
- 3.G) Replace the value of dev_port = 9106 with a port reserved in this list (add a new entry for your service).
1. Implement the process(self, input, output) method.
- Set the return values of getInputClass and getOutputClass to characterize your SADI service.
- Add any new namespace prefixes that you want to use (e.g. ns.register(sd='http://www.w3.org/ns/sparql-service-description#'))
- Evaluate the dataset URI input.subject in def process(self, input, output): and say what you think about it by describing output. (For the SuRF and rdflib concepts, see SADI Semantic Web Services framework)
- Use [Beautiful Soup](FAqT Service using Beautiful Soup) or [Ripple](FAqT Service using Ripple)
- Use SuRF to execute SPARQL queries against the POSTed RDF graph similar to how add-metadata.py does it.
1. Test your service.
- Create sample inputs in <TEMPLATE-NAME>-materials/sample-inputs/ (e.g. mondeca.ttl)
- Temporarily deploy the service on localhost (e.g. python named-graphs.py)
- Invoke the service
  - Modify the example call that the service offers: curl -H "Content-Type: text/turtle" -d @my.ttl http://localhost:9106/named-graphs
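
The pwd | sed 's/^.*services/services/' one-liner suggested for item 3.A can be mirrored in Python if you would rather compute the servicePath value than paste it. This is only a sketch: service_path_from is a hypothetical helper, not part of the template, and it assumes the directory you pass contains a services component.

```python
import re

def service_path_from(directory):
    """Mirror sed 's/^.*services/services/' on a directory path."""
    # The greedy .* consumes everything up to the LAST occurrence of
    # 'services', which is also how the sed expression behaves.
    return re.sub(r'^.*services', 'services', directory)

print(service_path_from('/opt/DataFAQs/services/sadi/faqt/sparql-service-description'))
```

The printed value is what would go on the right-hand side of servicePath in your copy of the template.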

How to run a FAqT service locally

Add the following to __main__, like in add-metadata.py.

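      # Usage: python <service>.py <input-file> [<mime-type> [<output-file>]]
      # (requires import sys at the top of the file)
      reader   = open(sys.argv[1], "r")
      mimeType = "application/rdf+xml"   # default when no mime type is given
      if len(sys.argv) > 2:
         mimeType = sys.argv[2]
      if len(sys.argv) > 3:
         writer = open(sys.argv[3], "w")

      # Parse the input file and run the service over it.
      graph = resource.processGraph(reader, mimeType)

      # Write the result to the output file if one was given, else to stdout.
      if len(sys.argv) > 3:
         writer.write(resource.serialize(graph, mimeType))
         writer.close()
      else:
         print resource.serialize(graph, mimeType)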

A second example

In this section, we'll walk through a second example. The FAqT service that we create here will reproduce some of the analysis that LODStats does. On 4 Feb 2012, they report that 59 datasets were accessible via SPARQL endpoints and 142 datasets had SPARQL endpoint errors.

We'll pick one successful dataset and one unsuccessful dataset from their lists and try to reproduce their results:

http://thedatahub.org/dataset/fu-berlin-stitch reports a "successful" endpoint at http://www4.wiwiss.fu-berlin.de/stitch/sparql
http://thedatahub.org/dataset/2000-us-census-rdf reports an "unsuccessful" endpoint at http://www.rdfabout.com/sparql

First, we'll choose the relative URI of our new FAqT evaluation service:

services/sadi/faqt/access/in-sparql-endpoint

We'll make a new directory in our GitHub repository (you could do yours in your fork of this repository if you'd like):

/opt/DataFAQs$ ls

bin
doc
lib
ontology
queries
readme.md
services
ui

/opt/DataFAQs$ mkdir services/sadi/faqt/access/
/opt/DataFAQs$ cd services/sadi/faqt/access/

Then, we'll copy the template and change the names and development port:

/opt/DataFAQs/services/sadi/faqt/access/$ cp ../../faqt-template.py in-sparql-endpoint.py

/opt/DataFAQs/services/sadi/faqt/access/$ vi in-sparql-endpoint.py
 :% s/TEMPLATE-NAME/in-sparql-endpoint/gc
 :% s/TEMPLATE-CLASS-NAME/InSPARQLEndpoint/gc
 :% s/9090/9109/gc
serviceDescriptionText = 'Queries into the void:sparqlEndpoint of the dcat:Dataset and reports if the endpoint is there.'
comment                = 'Initial purpose was to evaluate LOD datasets.'

In a second terminal, we can temporarily deploy the service on localhost (ignore the DeprecationWarning for the md5 and sha modules):

$ cd /opt/DataFAQs/github/DataFAQs/services/sadi/faqt/access
$ python in-sparql-endpoint.py 
...
in-sparql-endpoint running on port 9109. Invoke it with:
curl -H "Content-Type: text/turtle" -d @my.ttl http://localhost:9109/in-sparql-endpoint

So, our service is up and ready for someone to ask it what it thinks about a dataset. We can make sure by opening a third terminal and asking the service to describe itself:

$ cd /opt/DataFAQs/github/DataFAQs/services/sadi/faqt/access
$ curl http://localhost:9109/in-sparql-endpoint

@prefix mygrid: <http://www.mygrid.org.uk/mygrid-moby-service#> .
...
<> a <http://www.mygrid.org.uk/mygrid-moby-service#serviceDescription>;
    rdfs:label "in-sparql-endpoint";
...
<#input> a <http://www.mygrid.org.uk/mygrid-moby-service#parameter>;
    mygrid:objectType <http://www.w3.org/ns/dcat#Dataset>  .
...
<#output> a <http://www.mygrid.org.uk/mygrid-moby-service#parameter>;
    mygrid:objectType <http://purl.org/twc/vocab/datafaqs#EvaluatedDataset> .
...

From this, we see that the evaluation service accepts RDF descriptions of dcat:Datasets and returns RDF descriptions of the same instances that will then be typed as datafaqs:EvaluatedDataset. This conforms to the design of the SADI Semantic Web Services framework.

Let's make the sample input using the examples we are using from LODStats:

$ cd /opt/DataFAQs/services/sadi/faqt/access/
/opt/DataFAQs/services/sadi/faqt/access/$ mkdir -p in-sparql-endpoint-materials/sample-inputs
$ cd in-sparql-endpoint-materials/sample-inputs
$ curl -s http://prefix.cc/dcat,datafaqs.file.n3 > 1-good-1-bad-from-lodstat.ttl

Then make 1-good-1-bad-from-lodstat.ttl list the two datasets that we want to evaluate. The type needs to match the type returned by your evaluation service's getInputClass function (which is used to create the service description above).

@prefix dcat:     <http://www.w3.org/ns/dcat#> .
@prefix datafaqs: <http://purl.org/twc/vocab/datafaqs#> .

<http://thedatahub.org/dataset/fu-berlin-stitch>     a dcat:Dataset .
<http://thedatahub.org/dataset/2000-us-census-rdf>   a dcat:Dataset .
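
If you have more than a couple of datasets, writing the sample input by hand gets tedious. The following is a minimal sketch (Python 3, standalone, not part of DataFAQs) that generates the same kind of Turtle file; sample_input_turtle is a hypothetical name, and it assumes the URIs need no escaping inside Turtle angle brackets.

```python
def sample_input_turtle(dataset_uris):
    """Return a Turtle document that types each URI as a dcat:Dataset."""
    lines = ['@prefix dcat: <http://www.w3.org/ns/dcat#> .', '']
    for uri in dataset_uris:
        lines.append('<%s> a dcat:Dataset .' % uri)
    return '\n'.join(lines) + '\n'

turtle = sample_input_turtle([
    'http://thedatahub.org/dataset/fu-berlin-stitch',
    'http://thedatahub.org/dataset/2000-us-census-rdf',
])
print(turtle)
```

Redirect the output into 1-good-1-bad-from-lodstat.ttl to use it as the service's sample input.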

Next, send the descriptions of the datasets to the evaluation service and see what it thinks about them:

curl -H "Content-Type: text/turtle" -d @1-good-1-bad-from-lodstat.ttl http://localhost:9109/in-sparql-endpoint

<http://thedatahub.org/dataset/2000-us-census-rdf> a <http://purl.org/twc/vocab/datafaqs#Unsatisfactory> .

<http://thedatahub.org/dataset/fu-berlin-stitch> a <http://purl.org/twc/vocab/datafaqs#Unsatisfactory> .

Because the template that we copied asserts Unsatisfactory by default, every dcat:Dataset we send this service will be Unsatisfactory until we implement the def process(self, input, output): function.
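
The curl call above can also be made from a script, which is handy once you have several sample inputs. This is a sketch using only the Python 3 standard library; build_invocation is a hypothetical helper, and the request is only constructed here, not sent, so it runs without a deployed service.

```python
import urllib.request

def build_invocation(service_url, rdf_bytes, content_type='text/turtle'):
    """Prepare (but do not send) the POST that the curl example makes."""
    return urllib.request.Request(service_url, data=rdf_bytes,
                                  headers={'Content-Type': content_type})

request = build_invocation(
    'http://localhost:9109/in-sparql-endpoint',
    b'<http://thedatahub.org/dataset/fu-berlin-stitch> a <http://www.w3.org/ns/dcat#Dataset> .')
print(request.get_method(), request.full_url)

# With the service running on localhost, the evaluation would be fetched with:
#   print(urllib.request.urlopen(request).read().decode('utf-8'))
```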

To do that, we'll need a bit more than the URI of the dataset. A FAqT Service is allowed to assume that the RDF descriptions it receives about a dcat:Dataset already include the RDF obtained by dereferencing the dcat:Dataset's URI. That's because [DataFAQs core](FAqT Brick) does this beforehand as it constructs an evaluation epoch. To avoid setting up a FAqT Brick now, we can grab the RDF descriptions ourselves:

cd /opt/DataFAQs/services/sadi/faqt/access/in-sparql-endpoint-materials/sample-inputs/
curl -sH "Accept: application/rdf+xml" -L http://thedatahub.org/dataset/fu-berlin-stitch   > fu-berlin-stitch.rdf
curl -sH "Accept: application/rdf+xml" -L http://thedatahub.org/dataset/2000-us-census-rdf > 2000-us-census-rdf.rdf

In there, we see the association between the dataset we're trying to evaluate and the endpoint that we want to make sure works:

<dcat:Dataset rdf:about="http://thedatahub.org/dataset/fu-berlin-stitch">
   <void:sparqlEndpoint rdf:resource="http://www4.wiwiss.fu-berlin.de/stitch/sparql"/>
...
<dcat:Dataset rdf:about="http://thedatahub.org/dataset/2000-us-census-rdf">
   <void:sparqlEndpoint rdf:resource="http://www.rdfabout.com/sparql"/>
...
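
To check quickly which endpoint a downloaded description names, the RDF/XML can be scanned with the Python 3 standard library. This is a rough sketch, not how the service itself reads its input (the service goes through SuRF and rdflib): sparql_endpoints is a hypothetical helper, it only handles the element layout shown above, and the rdf:RDF wrapper and closing tags in the sample are assumptions about the elided parts of the file.

```python
import xml.etree.ElementTree as ET

RDF  = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
DCAT = 'http://www.w3.org/ns/dcat#'
VOID = 'http://rdfs.org/ns/void#'

def sparql_endpoints(rdfxml_text):
    """Map each dcat:Dataset URI to the first void:sparqlEndpoint it lists."""
    endpoints = {}
    root = ET.fromstring(rdfxml_text)
    for dataset in root.iter('{%s}Dataset' % DCAT):
        endpoint = dataset.find('{%s}sparqlEndpoint' % VOID)
        if endpoint is not None:
            about = dataset.get('{%s}about' % RDF)
            endpoints[about] = endpoint.get('{%s}resource' % RDF)
    return endpoints

# Sample shaped like the excerpt above; the wrapper element is assumed.
sample = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcat="http://www.w3.org/ns/dcat#"
         xmlns:void="http://rdfs.org/ns/void#">
  <dcat:Dataset rdf:about="http://thedatahub.org/dataset/fu-berlin-stitch">
    <void:sparqlEndpoint rdf:resource="http://www4.wiwiss.fu-berlin.de/stitch/sparql"/>
  </dcat:Dataset>
</rdf:RDF>"""
print(sparql_endpoints(sample))
```

RDF/XML allows many other serializations of the same triples, so a real RDF parser is the safer choice for anything beyond a spot check.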

Seeing the input that we'll be processing, we can implement the function:

   def process(self, input, output):
      # Assumes the template's imports provide ns, Store, and Session (SuRF).
      print 'processing ' + input.subject

      if input.void_sparqlEndpoint:
         endpoint = input.void_sparqlEndpoint.first
         output.void_sparqlEndpoint = endpoint
         result = {}
         try:
            print '          ',
            print endpoint
            # Ask for any typed instance, first in a named graph, then in the default graph.
            queries = [ 'select distinct ?type where { graph ?g { [] a ?type } } limit 1',
                        'select distinct ?type where {[] a ?type} limit 1' ]
            for query in queries:
               # Skip the remaining queries once one of them has succeeded.
               if ns.DATAFAQS['Satisfactory'] not in output.rdf_type:
                  store   = Store(reader = 'sparql_protocol', endpoint = endpoint)
                  session = Session(store)
                  session.enable_logging = False
                  result = session.default_store.execute_sparql(query)
                  if result['results'] is not None:
                     for binding in result['results']['bindings']:
                        rdf_type = binding['type']['value']
                        output.rdf_type.append(ns.DATAFAQS['Satisfactory'])
                        print '          ',
                        print rdf_type
         except Exception as e:
            print '           BAD ENDPOINT'
            # result may still be the empty dict (no read()) if the failure
            # happened before a response arrived; report the exception then.
            output.datafaqs_error = result.read() if hasattr(result, 'read') else str(e)
      else:
         print '           NO ENDPOINT'
         output.datafaqs_error = 'Dataset was not described with predicate void:sparqlEndpoint.'

      # Bad endpoint, no endpoint, and no results all end up Unsatisfactory.
      if ns.DATAFAQS['Satisfactory'] not in output.rdf_type:
         output.rdf_type.append(ns.DATAFAQS['Unsatisfactory'])

      output.save()
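
The decision that process makes can be pulled apart from the network call, which makes it testable without a live endpoint. The sketch below (Python 3, standalone) is not the service's code: has_typed_instance and verdict are hypothetical names, the dictionaries imitate the SPARQL JSON results layout that the code above indexes into, and the example.org type URI is made-up sample data.

```python
def has_typed_instance(result):
    """True if a SPARQL JSON result holds at least one ?type binding."""
    if not isinstance(result, dict):
        return False  # e.g. an error page came back instead of parsed JSON
    bindings = (result.get('results') or {}).get('bindings') or []
    return any('type' in binding for binding in bindings)

def verdict(results_per_query):
    """Satisfactory as soon as any query's result shows a typed instance."""
    if any(has_typed_instance(result) for result in results_per_query):
        return 'Satisfactory'
    return 'Unsatisfactory'

good  = {'results': {'bindings': [
    {'type': {'type': 'uri', 'value': 'http://example.org/Drug'}}]}}
empty = {'results': {'bindings': []}}

print(verdict([empty, good]))
print(verdict([empty, None]))
```

As in the service, one answered query is enough: the named-graph query may return nothing while the default-graph query succeeds.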

Redeploy the service (python in-sparql-endpoint.py) and call it for each dataset to see that we can reproduce the results that LODStats reports (fu-berlin-stitch good, 2000-us-census-rdf bad):

$ curl -sH "Content-Type: application/rdf+xml" -d @fu-berlin-stitch.rdf http://localhost:9109/in-sparql-endpoint
@prefix void: <http://rdfs.org/ns/void#> .

<http://thedatahub.org/dataset/fu-berlin-stitch> a <http://purl.org/twc/vocab/datafaqs#Satisfactory>;
    void:sparqlEndpoint <http://www4.wiwiss.fu-berlin.de/stitch/sparql> .

$ curl -sH "Content-Type: application/rdf+xml" -d @2000-us-census-rdf.rdf http://localhost:9109/in-sparql-endpoint
@prefix datafaqs: <http://purl.org/twc/vocab/datafaqs#> .
@prefix void: <http://rdfs.org/ns/void#> .

<http://thedatahub.org/dataset/2000-us-census-rdf> a <http://purl.org/twc/vocab/datafaqs#Unsatisfactory>;
    datafaqs:error """
""";
    void:sparqlEndpoint <http://www.rdfabout.com/sparql> .

After [deploying the service](Sample FAqT deployment to) to its public home, we can register it at the SADI registry and see it listed at http://sadiframework.org/registry/services.

If you want to run an evaluation epoch with just the in-sparql-endpoint evaluation service and the two datasets in the example, use this epoch configuration.

Test ports

This section is used to reserve ports for each FAqT evaluation service, so we can test many at the same time without having collisions. The services listed are available in the repository.

9090 https://github.com/timrdf/DataFAQs/blob/master/services/sadi/ckan/add-metadata.rpy
9091 https://github.com/timrdf/DataFAQs/blob/master/services/sadi/faqt/void-triples.rpy
9092 https://github.com/timrdf/DataFAQs/blob/master/services/sadi/faqt/internet-domain.rpy
9093 redirect-loop.rpy
9094 class-and-predicate-capitalization.rpy
9095 triple-count-accuracy.rpy
9096 instances-are-explicitly-typed.rpy
9097 instances-are-typed-by-domain-and-range.rpy
9098 by-ckan-group.rpy
9099 with-preferred-uri-and-ckan-meta-void.rpy
9100 vocabulary-resolves-to-description.rpy
9101 via-sparql-query.rpy on sparql.tw
9102 void-properties.rpy
9103 predicate-counter.rpy
9104 lodcloud/max-1-tag.rpy
9105 lodcloud/identity
9106 faqt/sparql-service-description/named-graphs.rpy
9107 csv2rdf4lod-as-ckan.rpy
9108 select-datasets/identity.rpy
9109 access/in-sparql-endpoint.rpy (deployed) (meta)
9110 core/select-dataset/by-ckan-tag.rpy
9111 contributor-email.rpy
9112 fake-goef-coverage.rpy
9113 select-datasets/via-sparql-query.rpy
9114 logd-catalog-listing.rpy
9115 wikitable-gspo.rpy
9116 wikitable-fol.rpy
9117 rdf2asn.rpy
9118 lena-example.rpy
9119 faqt/access/void-subset-tree-dumps
9120 faqt/provenance/named-graph-derivation.rpy
9121 core/select-faqts/towards/ckan-tag.rpy
9122 references-instance-hub.rpy
9223 datascape/size.rpy
9224 connected/void-linkset.py
9225 core/augment-dataset/lift-ckan.py
9226 core/augment-dataset/sameas-org.py
9227 access/void-datadump.py
9228 visko-planner.py
9229 w3c-mail-archives.py
9230 w3c-mail-archives-per-month.py
9231 w3c-mail-archives-message.py
9232 via-hypermail/groups.py
9233 by-ckan-installation.py
9234 with-rdf-extension.py
9235 services/sadi/faqt/naming/between-the-edges
9236 services/sadi/faqt/vocabulary/uses/prov
9237 services/sadi/faqt/vocabulary/uses/dcat
9238 services/sadi/faqt/vocabulary/uses/void
9239 services/sadi/faqt/vocabulary/uses/dcterms
9240 services/sadi/bibo/subject-broader.py
9241 lod-tag-and-lodcloud-group-contacts.py
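
Before adding an entry, it can help to check the list mechanically. This is a small sketch (Python 3, standalone) that assumes the list is kept as plain lines of the form port, space, description; parse_reservations, collisions, and next_free_port are hypothetical helpers, not part of DataFAQs.

```python
def parse_reservations(text):
    """Turn 'PORT description' lines into a list of (port, description) pairs."""
    pairs = []
    for line in text.splitlines():
        parts = line.split(None, 1)
        if parts and parts[0].isdigit():
            pairs.append((int(parts[0]), parts[1] if len(parts) > 1 else ''))
    return pairs

def collisions(pairs):
    """Return the ports that are reserved more than once, in ascending order."""
    seen, dupes = set(), set()
    for port, _ in pairs:
        (dupes if port in seen else seen).add(port)
    return sorted(dupes)

def next_free_port(pairs):
    """Suggest the port just above the highest reserved one."""
    return max(port for port, _ in pairs) + 1

reserved = parse_reservations("""9240 services/sadi/bibo/subject-broader.py
9241 lod-tag-and-lodcloud-group-contacts.py""")
print(collisions(reserved), next_free_port(reserved))
```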

When it starts, faqt-template.rpy prints a sample of how to call the service:

if __name__ == '__main__':
   print resource.name + ' running on port ' + str(resource.dev_port) + '. Invoke it with:'
   print 'curl -H "Content-Type: text/turtle" -d @my.ttl http://localhost:' + str(resource.dev_port) + '/' + resource.name
   sadi.publishTwistedService(resource, port=resource.dev_port)

which is usually either of:

curl -H "Content-Type: text/turtle"         -d @my.ttl http://localhost:9090/add-metadata
curl -H "Content-Type: application/rdf+xml" -d @my.rdf http://localhost:9090/add-metadata
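
The two curl calls differ only in the Content-Type that matches the sample file. If you script your tests, that choice can be made from the file extension. A minimal sketch (Python 3), assuming your sample inputs use the .ttl and .rdf extensions as in the examples; content_type_for is a hypothetical helper.

```python
import os

# Only the two serializations used in the examples on this page.
CONTENT_TYPES = {
    '.ttl': 'text/turtle',
    '.rdf': 'application/rdf+xml',
}

def content_type_for(filename):
    """Pick the Content-Type header to send for a sample input file."""
    extension = os.path.splitext(filename)[1].lower()
    if extension not in CONTENT_TYPES:
        raise ValueError('No known RDF content type for %r' % filename)
    return CONTENT_TYPES[extension]

print(content_type_for('my.ttl'))
print(content_type_for('my.rdf'))
```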

What's next

FAqT Service using Ripple
FAqT Service using Beautiful Soup
FAqT Service with Secondary Parameters
Sample FAqT deployment, to see how we use twistd to deploy the FAqT evaluation service from a working copy of the GitHub repository.
FAqT Bricks accumulate evaluations provided by FAqT Services.
DataFAQs Core Services
SADI Semantic Web Services framework
