Modularity #118

Open
timothee-haudebourg opened this issue Feb 16, 2023 · 0 comments
Labels
enhancement (New feature or request), scope:semantics (Related to the semantics of TreeLDR)

Comments

@timothee-haudebourg
Collaborator

This issue covers the quite large topic of how to make TreeLDR modular.

What is modularity in this context

The input of TreeLDR is a set of RDF datasets that are processed into an inner data model, then given to various (code) generators. For now, the generators are all embedded into the compiler. This is a bad design that I expect will cause many problems in the long run:

  • As the number of generators grows, the source code will become hard to maintain. The number of dependencies will grow, as will the time needed to refactor parts of the core compiler.
  • As the size of the input dataset grows, the time spent compiling will grow. This may become a problem with datasets such as schema.org that are too large to be processed every time a generator is called.

Modularity means:

  • Splitting the core compiler and generators into independent programs.
  • Pre-processing datasets into a reusable data format.

Pre-processing

The primary task of the TreeLDR compiler is to take the input dataset triples, infer new triples according to the semantics of RDF/RDFS/OWL/TreeLDR, and store the resulting triples in a final data structure for easy access by the generators. The idea here would be to create an intermediate file format that stores the triples, including the inferred ones, so they can be accessed later without calling the compiler again. This is very similar to the way traditional compilers create an intermediate object file *.o for each compiled file before merging them into the final executable.

The intermediate file describes a Model Theoretic Interpretation of the processed dataset.
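To make the idea concrete, here is a minimal sketch in Python of "infer once, store, reload later". Everything in it is an assumption for illustration: the JSON layout, the function names `infer`, `save` and `load`, and the fact that only two RDFS rules (rdfs9 and rdfs11) are applied. It is not the TreeLDR file format, which is not defined in this issue.

```python
import json
import os
import tempfile

RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def infer(triples):
    """Close a set of triples under two RDFS rules (hypothetical, partial)."""
    closed = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in closed:
            if p != SUBCLASS_OF:
                continue
            for s2, p2, o2 in closed:
                # rdfs11: rdfs:subClassOf is transitive.
                if p2 == SUBCLASS_OF and s2 == o:
                    new.add((s, SUBCLASS_OF, o2))
                # rdfs9: an instance of a class is an instance of its superclasses.
                if p2 == RDF_TYPE and o2 == s:
                    new.add((s2, RDF_TYPE, o))
        if not new <= closed:
            closed |= new
            changed = True
    return closed


def save(triples, path):
    """Write the triples (input and inferred) to a hypothetical JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sorted(triples), f)


def load(path):
    """Read the triples back without running inference again."""
    with open(path, encoding="utf-8") as f:
        return {tuple(t) for t in json.load(f)}


triples = {
    (":Cat", SUBCLASS_OF, ":Mammal"),
    (":Mammal", SUBCLASS_OF, ":Animal"),
    (":tom", RDF_TYPE, ":Cat"),
}
closed = infer(triples)

path = os.path.join(tempfile.mkdtemp(), "dataset.json")
save(closed, path)
reloaded = load(path)  # a generator could start from here
```

A generator would then only call `load`, which is where the time saving on large datasets would come from.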

Composition Problem

The main challenge is to make sure the resulting interpretations are composable with each other. For instance, consider the following graph:

@prefix : <http://example.org/> .
:foo :prop _:0 .
:bar :prop _:1 .
:baz :prop _:1 .

A valid interpretation of this graph can merge the blank nodes like so:

I(_:0) = I(_:1) = blank
I(:foo) = foo
I(:bar) = bar
I(:baz) = baz
I(:prop) = prop
EXT(prop) = { <foo, blank>, <bar, blank>, <baz, blank> }
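The interpretation above can be written down directly as a mapping from names to resources. The following Python sketch uses plain strings for names and resources, and a hypothetical helper `extension` to compute EXT(prop) from the triples; none of these names come from TreeLDR.

```python
# Triples of the example graph, using plain strings as names.
triples = [
    (":foo", ":prop", "_:0"),
    (":bar", ":prop", "_:1"),
    (":baz", ":prop", "_:1"),
]

# The interpretation I from the example: both blank nodes
# are mapped to the same resource "blank".
I = {
    "_:0": "blank",
    "_:1": "blank",
    ":foo": "foo",
    ":bar": "bar",
    ":baz": "baz",
    ":prop": "prop",
}


def extension(triples, I, prop):
    """Return EXT(prop): the pairs <I(s), I(o)> of every triple whose
    predicate is interpreted as `prop`."""
    return {(I[s], I[o]) for s, p, o in triples if I[p] == prop}


ext = extension(triples, I, "prop")
```

Note that `ext` no longer records whether baz was originally linked to `_:0` or `_:1`, since both are now the single resource "blank".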

Merging structurally equivalent blank nodes is something TreeLDR does all the time, since most of the time they refer to the same resource and merging them can greatly reduce the complexity of the final model. However, now consider the following graph:

@prefix : <http://example.org/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
:prop rdf:type owl:FunctionalProperty .
:foo :prop _:2 .
:bar :prop _:3 .
_:2 owl:differentFrom _:3 .

Imagine we want to update our previous interpretation to include this new knowledge. Because :prop is declared as an owl:FunctionalProperty, each subject has at most one value for it, so according to the OWL semantics I(_:0) = I(_:2) (both are values of :prop for :foo) and I(_:1) = I(_:3) (both are values of :prop for :bar). But because of the _:2 owl:differentFrom _:3 statement we also have I(_:2) != I(_:3), which means I(_:0) != I(_:1). Yet we already decided in our previous interpretation that I(_:0) = I(_:1) = blank, and we cannot go back: the information needed to separate them is lost, so we no longer know which of I(_:0) or I(_:1) is such that <baz, I(_:?)> is in EXT(prop).

This shows that we cannot update or compose interpretations this way.
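The contradiction can be reproduced with a small union-find, which is one common way to represent "these names denote the same resource". This is only an illustrative sketch in Python: the `UnionFind` class is hypothetical, and the merges forced by owl:FunctionalProperty are applied by hand rather than derived by an inference engine.

```python
class UnionFind:
    """Tracks which names have been merged into the same resource."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving keeps the trees shallow.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)


uf = UnionFind()

# First graph: the eager interpretation merges the two blank nodes.
uf.union("_:0", "_:1")

# Second graph: :prop is functional, so the values of :prop
# for the same subject must be the same resource.
uf.union("_:0", "_:2")  # :foo :prop _:0  and  :foo :prop _:2
uf.union("_:1", "_:3")  # :bar :prop _:1  and  :bar :prop _:3

# _:2 owl:differentFrom _:3 requires the two to stay distinct,
# but the earlier merge has already made them equal.
conflict = uf.find("_:2") == uf.find("_:3")
```

Here `conflict` is true, and a union-find has no operation to undo the first merge, which mirrors the "we cannot go back" situation described above.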

Solution to the Composition Problem

So we cannot decide on a single interpretation from just a subset of the processed datasets. One solution is to build two interpretations:

  • Maximal Interpretation: this is a conservative interpretation where two names (IRIs, blank node identifiers, literals) are never interpreted the same unless explicitly stated in the graph (with owl:sameAs for instance).
  • Optimal Interpretation: this is the non-conservative interpretation where two names are merged at liberty unless explicitly stated otherwise (with owl:differentFrom for instance).

We can easily compose interpretation pairs:

  • Since the maximal interpretation is conservative, if two resources are interpreted the same, they must be interpreted the same. If they are interpreted the same in one maximal interpretation and not in the other, then the resources must be merged in the composition.
  • Since the optimal interpretation is liberal, if two resources are not interpreted the same, they must not be interpreted the same. If they are interpreted differently in one optimal interpretation and not in the other, then the composed interpretation must be refined. Fortunately, merged resources can be separated by looking at the maximal interpretation.
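One possible way to picture this composition is sketched below in Python. It rests on several assumptions that are not part of this issue: a pair is stored as two constraint sets (`same` and `different`) rather than as two explicit interpretations, only blank nodes are merged "at liberty", the optimal interpretation is built greedily, and equalities forced across graphs by owl:FunctionalProperty are supplied by hand. The class `InterpretationPair` and its methods are hypothetical names.

```python
from dataclasses import dataclass


def classes(names, same):
    """Group names into the equivalence classes induced by `same` pairs."""
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in same:
        parent[find(a)] = find(b)
    groups = {}
    for n in names:
        groups.setdefault(find(n), set()).add(n)
    return [frozenset(g) for g in groups.values()]


@dataclass(frozen=True)
class InterpretationPair:
    names: frozenset
    same: frozenset       # pairs that must denote the same resource
    different: frozenset  # pairs that must denote different resources

    def maximal(self):
        """Conservative: merge only what `same` forces."""
        return set(classes(self.names, self.same))

    def optimal(self):
        """Liberal: start from the maximal classes, then greedily merge
        blank-node classes that no `different` pair keeps apart."""
        merged = []
        for cls in sorted(self.maximal(), key=sorted):
            for i, target in enumerate(merged):
                if self._can_merge(cls, target):
                    merged[i] = target | cls
                    break
            else:
                merged.append(cls)
        return set(merged)

    def _can_merge(self, a, b):
        if not all(n.startswith("_:") for n in a | b):
            return False  # assumption: IRIs are never merged at liberty
        return not any(
            (x, y) in self.different or (y, x) in self.different
            for x in a for y in b
        )

    def compose(self, other):
        """Union the constraints; both interpretations are rebuilt from them."""
        return InterpretationPair(
            self.names | other.names,
            self.same | other.same,
            self.different | other.different,
        )


first = InterpretationPair(
    names=frozenset({":foo", ":bar", ":baz", "_:0", "_:1"}),
    same=frozenset(),
    different=frozenset(),
)
second = InterpretationPair(
    names=frozenset({":foo", ":bar", "_:0", "_:1", "_:2", "_:3"}),
    # Assumed to be already inferred from owl:FunctionalProperty.
    same=frozenset({("_:0", "_:2"), ("_:1", "_:3")}),
    different=frozenset({("_:2", "_:3")}),
)
composed = first.compose(second)
```

With these inputs, `first.optimal()` merges `_:0` and `_:1`, while `composed.optimal()` keeps `{_:0, _:2}` and `{_:1, _:3}` apart. The separation is possible because the optimal interpretation is rebuilt from the maximal classes instead of from the earlier merged result.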
timothee-haudebourg added the enhancement and scope:semantics labels on Apr 28, 2023