RDF Schema contains erroneous rdfs:domain triples #276

Open
stmbaier opened this issue Jan 4, 2024 · 4 comments

stmbaier commented Jan 4, 2024

Description

The Turtle file containing the RDF(S) representation of the CWL SALAD schema, available at http://commonwl.org/v1.2/cwl.ttl (or alternatively generated by running schema-salad-tool --print-rdfs on the full schema CommonWorkflowLanguage.yml in the cwl-v1.2 repository), contains the following statement:

@base:basename a rdf:Property ;
    rdfs:domain @base:Directory,
        @base:File .

This states (by inference, see RDF 1.1 Semantics entailment pattern 'rdfs2') that every subject of a triple with the predicate https://w3id.org/cwl/cwl#basename is an instance of both the class @base:Directory and the class @base:File. This is clearly not the case, as not every File is a Directory (and the two even have different fields in CWL).
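To make the entailment concrete, here is a minimal sketch of rule rdfs2 applied to plain tuples, without any RDF library. The `cwl:` prefix stands in for `@base:` above, and the instance `ex:file1` with its value is a hypothetical example, not something from the CWL schema.

```python
# Minimal sketch of RDFS entailment rule rdfs2 over plain (s, p, o) tuples.
# "cwl:" stands in for "@base:"; "ex:file1" is a hypothetical instance.
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"


def apply_rdfs2(triples):
    """Return the rdf:type triples entailed by rdfs2:
    (p rdfs:domain C) and (s p o) entail (s rdf:type C)."""
    # Collect every declared domain class per property.
    domains = {}
    for s, p, o in triples:
        if p == RDFS_DOMAIN:
            domains.setdefault(s, set()).add(o)
    # Every subject using the property gets ALL of its domain classes.
    inferred = set()
    for s, p, o in triples:
        for cls in domains.get(p, ()):
            inferred.add((s, RDF_TYPE, cls))
    return inferred


schema = {
    ("cwl:basename", RDFS_DOMAIN, "cwl:Directory"),
    ("cwl:basename", RDFS_DOMAIN, "cwl:File"),
}
data = {("ex:file1", "cwl:basename", "example.txt")}

inferred = apply_rdfs2(schema | data)
for triple in sorted(inferred):
    print(triple)
```

The single `cwl:basename` triple is enough for `ex:file1` to be typed as both `cwl:Directory` and `cwl:File`, which is the unintended conclusion described above.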

Analysis

Perhaps this error is due to the fact that the semantics of the object-oriented model and of RDFS do not match:

  • In RDFS it is not possible to narrow down the domain and range of a property in a sub-class or in different classes (properties are always global).
  • In Apache Avro it is instead totally fine to use the same name 'basename' for a field in different records with different types 'File' and 'Directory', as the properties are not globally visible and valid.

Solution ideas

  • Either the same field name is in different records with different types mapped to different IRIs (as the same field name indicates some conformity in use, the IRIs could still build a property-hierarchy, stating in the super-property the overlap in their usage, and in the sub-properties their difference (but this cannot be formulated with the current standard))
  • or schema-salad-tool infers only the most specific type that is valid as the domain/range of a property in every context in which the property is used, by following the class hierarchy upwards until a common class for all domain/range usages of the property is found (rather than, as now, emitting every most specific type).
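The second idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the `parents` mapping and the parent class `FileOrDirectory` are hypothetical, the function is not part of schema-salad-tool, and with multiple inheritance the "nearest" common class may not be unique. The fallback `rdfs:Resource` is used because every resource is an instance of it.

```python
def ancestors(cls, parents):
    """Return cls plus all of its superclasses, nearest first (breadth-first)."""
    seen, queue = [], [cls]
    while queue:
        current = queue.pop(0)
        if current not in seen:
            seen.append(current)
            queue.extend(parents.get(current, []))
    return seen


def common_domain(classes, parents, top="rdfs:Resource"):
    """Pick the nearest class that is equal to or a superclass of every class
    in `classes` (assumed non-empty); fall back to `top` if none exists."""
    classes = list(classes)
    for candidate in ancestors(classes[0], parents):
        if all(candidate in ancestors(c, parents) for c in classes[1:]):
            return candidate
    return top


# Hypothetical hierarchy: both records extend an assumed common parent.
parents = {"File": ["FileOrDirectory"], "Directory": ["FileOrDirectory"]}

print(common_domain(["File", "Directory"], parents))
print(common_domain(["File", "Workflow"], parents))
```

When the classes share no declared parent, the sketch falls back to `rdfs:Resource`, which is always valid but carries no information.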
@tetron
Member

tetron commented Jan 4, 2024

Either the same field name is in different records with different types mapped to different IRIs (as the same field name indicates some conformity in use, the IRIs could still build a property-hierarchy, stating in the super-property the overlap in their usage, and in the sub-properties their difference (but this cannot be formulated with the current standard))

That's explicit in schema salad, if two classes use the same field name, it is intended that the field name is semantically the same with the same predicate.

However, when I designed the RDFS export, I don't think I was aware of the entailment rules. I think I interpreted domain to mean "this property may be found on this type" but not "the presence of this property implies it must be all of these types".

I believe the correct way to fix this in the CWL schema specifically would be to find the situations where different classes use the same field, and extract that field to a parent type. So the presence of "basename" would entail that the object was "the class of objects that have basename (of which File and Directory are subtypes)" and not "must be both a File and a Directory".
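Finding the affected fields could look roughly like this. It is a sketch only: the records are given as already-parsed Python dictionaries, the field types are simplified, and fields inherited through `extends` are not taken into account. The function name `shared_fields` is made up for this example.

```python
from collections import defaultdict


def shared_fields(records):
    """Map each field name declared by more than one record to those record names."""
    users = defaultdict(set)
    for record in records:
        if record.get("type") != "record":
            continue
        fields = record.get("fields", [])
        # Fields may be a list of {"name": ...} entries or a mapping keyed by name.
        if isinstance(fields, dict):
            names = list(fields)
        else:
            names = [field["name"] for field in fields]
        for name in names:
            users[name].add(record["name"])
    return {name: sorted(recs) for name, recs in users.items() if len(recs) > 1}


# Simplified excerpt; the real CWL records have many more fields.
records = [
    {"name": "File", "type": "record",
     "fields": [{"name": "basename", "type": "string?"},
                {"name": "checksum", "type": "string?"}]},
    {"name": "Directory", "type": "record",
     "fields": {"basename": "string?", "listing": "Any"}},
]

print(shared_fields(records))  # {'basename': ['Directory', 'File']}
```

Each field reported this way is a candidate for extraction into a common parent type.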

For example the "Documented" type that implies a field called doc. Although that's defined as rdfs:comment so it is possible that also has incorrect entailment?

Do you have a particular project where you are using RDFS entailment rules with CWL? While linked data underlies the CWL definition, applications that actually map a CWL document to triples have been fairly niche, so I'd love to hear more.

@stmbaier
Author

stmbaier commented Jan 9, 2024

That's explicit in schema salad, if two classes use the same field name, it is intended that the field name is semantically the same with the same predicate.

Totally agree with that, that seems like the most straightforward interpretation.

I believe the correct way to fix this in the CWL schema specifically would be to find the situations where different classes use the same field, and extract that field to a parent type. So the presence of "basename" would entail that the object was "the class of objects that have basename (of which File and Directory are subtypes)" and not "must be both a File and a Directory".

You mean something like this, where @base:FileSystemRessource is the parent type:

@base:basename a rdf:Property ;
    rdfs:domain @base:FileSystemRessource.

In this case the parent type has to be defined in the SALAD schema itself, too, by defining File and Directory as extensions of FileSystemRessource:

- name: File
  type: record
  extends: [FileSystemRessource]
...
- name: Directory
  type: record
  extends: [FileSystemRessource]

Which leads to subclass relationships in RDF:

@base:File rdfs:subClassOf @base:FileSystemRessource.
@base:Directory rdfs:subClassOf @base:FileSystemRessource.

In this case the RDFS-only modeling seems fine, as the entailment that every file and every directory is a filesystem resource is correct (rules 'rdfs2' + 'rdfs9' in RDF 1.1 Semantics).
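As a rough check of the parent-type modeling, the following sketch applies rdfs2 and rdfs9 to plain tuples until nothing new is derived. The `cwl:` prefix stands in for `@base:`, and the instances `ex:dir1` and `ex:file1` are hypothetical.

```python
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"
RDFS_SUBCLASS = "rdfs:subClassOf"


def entail(triples):
    """Apply rdfs2 and rdfs9 repeatedly until no new triples appear."""
    closure = set(triples)
    while True:
        new = set()
        for s, p, o in closure:
            for s2, p2, o2 in closure:
                # rdfs2: (p rdfs:domain C), (s p o) => (s rdf:type C)
                if p2 == RDFS_DOMAIN and s2 == p:
                    new.add((s, RDF_TYPE, o2))
                # rdfs9: (C rdfs:subClassOf D), (s rdf:type C) => (s rdf:type D)
                if p == RDF_TYPE and p2 == RDFS_SUBCLASS and s2 == o:
                    new.add((s, RDF_TYPE, o2))
        if new <= closure:
            return closure
        closure |= new


graph = {
    ("cwl:basename", RDFS_DOMAIN, "cwl:FileSystemRessource"),
    ("cwl:File", RDFS_SUBCLASS, "cwl:FileSystemRessource"),
    ("cwl:Directory", RDFS_SUBCLASS, "cwl:FileSystemRessource"),
    ("ex:dir1", "cwl:basename", "data"),      # hypothetical instance
    ("ex:file1", RDF_TYPE, "cwl:File"),       # hypothetical instance
}

types = {(s, o) for s, p, o in entail(graph) if p == RDF_TYPE}
for pair in sorted(types):
    print(pair)
```

Here `ex:dir1` is only inferred to be a `cwl:FileSystemRessource`, and nothing forces it to be a `cwl:File` or a `cwl:Directory`.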

However, I would like to point out that this approach doesn't work in every case. For example:
If one models a record for 'Man' and one for 'Woman', each with the field 'parentOf', one could define the domain of the 'parentOf' predicate as the class 'Parent'. The approach breaks down if 'Man' and 'Woman' are then defined as extensions of type 'Parent', because this would entail that every man and every woman is also a parent, which is certainly not true.
If instead the domain of 'parentOf' is defined as the class 'Person', the approach is valid, as every man and every woman is a person.

If one wants to avoid building a class hierarchy altogether (as in the 'parentOf' case with a 'Parent' class), one could use OWL 2 instead, with a union of class expressions (this could also be suitable for Apache Avro unions used in fields, which still need to be translated to a correct rdfs:range triple):

@base:basename a rdf:Property ;
    rdfs:domain [ a owl:Class ; owl:unionOf (@base:Directory @base:File) ] .

But OWL 2 introduces a much more complex set of possibilities to model semantics.

For example the "Documented" type that implies a field called doc. Although that's defined as rdfs:comment so it is possible that also has incorrect entailment?

In RDF, this is currently stated as:

rdfs:comment a rdf:Property ;
    rdfs:domain sld:Documented .

This entails that every entity that has an rdfs:comment property attached is an instance of sld:Documented. From my point of view, the correctness of this entailment depends on the semantics of the sld:Documented class, which is not further defined. If the intention was to collect only the parts of the schema that are documented, it is wrong, because every entity, regardless of its type, becomes an instance of this class as soon as it has an rdfs:comment property. If the intention was simply to denote all entities that have some form of human-readable documentation, it is fine.

Do you have a particular project where you are using RDFS entailment rules with CWL? While linked data underlies the CWL definition, applications that actually map a CWL document to triples have been fairly niche, so I'd love to hear more.

I am working on my master's thesis "Automated transformation processing for dynamic RDF information integration and XML document generation" and would like to use CWL for the general transformation processing. As the primary data management takes place in RDF, all Workflows and CommandLineTools have to be transformed into RDF to be persisted (which is why I noticed this mistake). Ultimately, the idea is to be able to generate workflows (partially) automatically using entailment rules and additional annotations on the CWL constructs (which are only visible within the RDF database to generate jobs, and are removed before the job is actually submitted to a CWL runner).

@stmbaier
Author

Maybe a completely different approach works better here: instead of fiddling around with the open-world assumption and the global nature of properties in RDFS and OWL, one could use SHACL to define a shape for every record type.
There, property values are restricted locally, only for the instances a shape targets, which seems to fit the Apache Avro data modeling logic nicely.

At the moment I am trying to do this by deriving SHACL shapes from the RDF triples of the Common Workflow Language using SHACL rules.
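The difference between local and global constraints can be illustrated with a small sketch. This is not SHACL and uses no SHACL library; it only mimics the idea of a closed shape per record type with plain Python dictionaries. The `SHAPES` table is a simplified, assumed excerpt of the File and Directory fields.

```python
# Not SHACL itself: a plain-Python imitation of closed, per-class shapes.
# The field lists are a simplified, assumed excerpt of the CWL records.
SHAPES = {
    "File": {"basename": str, "size": int},
    "Directory": {"basename": str, "listing": list},
}


def validate(node, shapes=SHAPES):
    """Return violation messages for `node` against the shape of its declared class."""
    cls = node.get("class")
    shape = shapes.get(cls)
    if shape is None:
        return [f"no shape for class {cls!r}"]
    violations = []
    for key, value in node.items():
        if key == "class":
            continue
        expected = shape.get(key)
        if expected is None:
            # The constraint is local: only this class's shape is consulted.
            violations.append(f"{key!r} is not allowed on {cls}")
        elif not isinstance(value, expected):
            violations.append(f"{key!r} should be {expected.__name__}")
    return violations


print(validate({"class": "File", "basename": "a.txt", "size": 3}))
print(validate({"class": "Directory", "basename": "d", "size": 3}))
```

Unlike an rdfs:domain statement, the presence of `basename` here says nothing about the type of the node; it is only checked against the shape of the class the node already declares.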

@tetron
Member

tetron commented Jan 17, 2024

I am aware of SHACL but haven't actually studied it, so I don't know exactly how it compares to RDFS. My impression is that SHACL is a much better fit than RDFS for modeling nested data structures such as are found in CWL, so a Schema salad-to-SHACL translation is probably a productive line of research. In that case, the problematic RDFS definitions could probably be dropped in favor of SHACL definitions?
