
Support probabilistic knowledge representations #78

Open
SemanticBeeng opened this issue Dec 15, 2020 · 10 comments

SemanticBeeng commented Dec 15, 2020

Knowledge representations may have a probabilistic nature, and capturing that is especially important for complex business domains where data is non-stationary or contextual.

This is often seen in machine learning models like #WordEmbedding, #ContextualWordEmbedding, #KnowledgeGraphEmbedding, etc.

Often, semantic models are distilled by experts who rely on implicit expertise and extensive curation processes; this makes the final models lossy, hard to trace back to the original source data, and ultimately less useful to non-experts than they could be.

It would be very useful to have the ability to represent probabilistic knowledge in the semantic web world so that the models are more robust, defensible, trusted and accessible.

The intuition behind probabilistic RDF seems somewhat related to language models like #ContextualWordEmbedding, which are probabilistic (they have #DistributionalSemantics, as opposed to lexical models) but also capture context, thus making the concepts more grounded and easily traceable to the original resources they were extracted from. 🤔
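For concreteness, here is a minimal sketch of one way a confidence score can be attached to an RDF statement today, using Python's rdflib and standard RDF reification. The ex: namespace and the ex:confidence property are made up for illustration, not an existing vocabulary:

```python
from rdflib import Graph, Namespace, BNode, Literal, RDF

EX = Namespace("http://example.org/")  # hypothetical namespace
g = Graph()
g.bind("ex", EX)

# Standard RDF reification: wrap the uncertain triple in a Statement
# node, then attach a (made-up) confidence score to that node.
stmt = BNode()
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, EX.Alice))
g.add((stmt, RDF.predicate, EX.worksFor))
g.add((stmt, RDF.object, EX.AcmeCorp))
g.add((stmt, EX.confidence, Literal(0.87)))

print(g.serialize(format="turtle"))
```

Where RDF-star is supported, the same annotation can be written much more compactly as a quoted triple, but the representational question (what the score means, how it composes under inference) is the same.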

Resources

  1. "Scalable Uncertainty Treatment Using Triplestores and the OWL 2 RL Profile"
    PR-OWL, Multi-Entity #Bayesian Networks (#MEBN), "hybrid ontologies … deterministic and probabilistic parts"

  2. "Probabilistic RDF"
    "build a logical model of RDF with uncertainty"

  3. "Combining RDF Graph Data and Embedding Models for an Augmented Knowledge Graph"
    "integrated #RDF data with vector space models", #knowledgeGraph, #wordEmbedding, #graphEmbedding

  4. "FoodEx2vec: New foods' representation for advanced food data analysis"
    See also the FoodOn ontology.

draggett (Member) commented Dec 15, 2020

  1. Machine learning is dependent on statistics, as are many kinds of reasoning that deal with the uncertainties, incompleteness and inconsistencies commonplace in everyday situations.
  2. Human reasoning isn't based upon logic or the laws of probability, and instead makes use of mental models of examples, metaphors and analogies, see e.g. the work of Philip Johnson-Laird.

The current mindset for the Semantic Web is oblivious to these points, instead narrowly focusing on deductive logic and model theory. There is, however, a great deal to be gained by studying over 500 million years of neural evolution and decades of work across the cognitive sciences. Cognitive AI seeks to mimic human memory, reasoning, learning and natural language processing at a functional level. This involves a combination of symbolic graphs, sub-symbolic statistics, rules and graph algorithms, along with a willingness to adopt an interdisciplinary approach to research, something the incentives of academic careers unfortunately tend to discourage.

The W3C Cognitive AI Community Group is formalising the Chunks graph data and rules language, inspired by John Anderson's earlier work on ACT-R, a popular cognitive architecture. Chunks is easier to work with than Turtle, JSON and JSON-LD, and it includes the means to map to RDF URIs where needed. As such, Chunks is a viable candidate for Easier RDF, and one that opens up new vistas of opportunity to give computing a human touch.
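For readers unfamiliar with the notation: a chunk is a typed record of name/value pairs. The sketch below is loosely based on the Community Group's draft documents; the type, ids and property names here are made up for illustration:

```
# a chunk has a type, an optional id, and name/value pairs
dog dog1 {
  name "fido"
  age 4
  friend dog2   # values can reference other chunks by id
}
```

Compare this with the equivalent Turtle, which needs prefixes, full IRIs and explicit triples; the claim above is that this record-like syntax is the easier mental model.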

akuckartz commented

See also #71

SemanticBeeng (Author) commented Dec 25, 2020

> The current mindset for the Semantic Web is oblivious to these points ...

Indeed, hence my proposal to somehow combine the two areas so that humans and machines can synergize better, with humans largely in control; the undesirable but likely alternatives leave machines and/or "elites" in control (plutocracy, etc.).

Also, humans are basically statistically blind, which makes it very hard to make good decisions in larger groups.

I would be grateful for more comments on the specific topics and resources I mentioned.

iherman (Member) commented Dec 25, 2020

Just a historical point: several years ago the W3C did set up a “W3C Uncertainty Reasoning for the World Wide Web Incubator Group” which did publish a report. The report as well as the charter referred to above contain a large number of references and use cases, but there was no real follow-up on the report in terms of a W3C WG, i.e., for standardization. As far as I can remember there wasn't a clear, standardization-ready approach to go with, and the interest from W3C members was mild, to say the least.

I have not followed the evolution since 2008 (I have drifted away from the subject area), and I do not know whether the area is more mature than it was back then (afaik, the topic was picked up at some subsequent ISWC conferences as workshops).

SemanticBeeng (Author) commented Dec 25, 2020

> Report 31 March 2008

Thanks, interesting; I will review it.
It is very old, though, and probabilistic knowledge representations in ML have evolved dramatically since.
Also, this is not only about reasoning/inference but also about the knowledge representations themselves.
In the ML world there are a lot of knowledge graph embedding frameworks, but no standard for the knowledge graph representation itself.

I am not really expecting the semantic web community to cross the chasm by itself, and building the bridges will require a lot of knowledge from other areas.
I am just proposing that we need to cross it, and seeking interest.

iherman (Member) commented Dec 25, 2020

Good to know that things have evolved. The main reason I gave the reference is my experience (as a W3C staff member) of the mild interest of the semantic web community, back then, in engaging in more systematic standardization work, and I frankly do not know whether it would be easier now. I sincerely hope so...

Cc @pchampin

chiarcos commented Jan 5, 2021

> In the ML world there are a lot of knowledge graph embedding frameworks, but no standard for the knowledge graph representation itself.
> ...
> I am not really expecting the semantic web community to cross the chasm by itself, and building the bridges will require a lot of knowledge from other areas.
> I am just proposing that we need to cross it, and seeking interest.

Within the OntoLex W3C CG, we are currently developing a novel module on Frequency, Attestation and Corpus-based Information (OntoLex-FrAC). Among other things, corpus-based information includes embeddings (in the "word embedding" sense, as well as other numerical representations), with the specific goal of being applicable to any OntoLex concept (which includes ontologies in general). Representing word embeddings in RDF doesn't bring much benefit, but for sense and concept embeddings that get easily detached from their definition, the situation is quite different. The current status will be presented at SemDeep on Jan 8th, 2021.
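To make the representational idea concrete, here is a rough sketch (in Python with rdflib) of attaching an embedding, plus the minimal provenance that makes it reusable, directly to a concept in the graph. The ex: property names are placeholders, not the FrAC vocabulary, which was still in draft at the time:

```python
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/")  # placeholder, not the FrAC namespace
g = Graph()

concept = URIRef("http://example.org/lexicon#bank_financial")
vec = [0.12, -0.53, 0.98, 0.07]  # toy 4-dimensional concept embedding

# Store the vector plus the metadata needed to reuse it later:
# which model produced it and on which corpus it was trained.
g.add((concept, EX.embedding, Literal(" ".join(map(str, vec)))))
g.add((concept, EX.embeddingModel, Literal("word2vec-skipgram")))
g.add((concept, EX.embeddingCorpus, Literal("example-corpus-v1")))
```

The point is not the encoding of the vector (a space-separated literal is just the simplest option) but that the embedding stays attached to the concept it describes instead of living in a detached binary file.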

SemanticBeeng (Author) commented Jan 5, 2021

Glad to hear about OntoLex.

> concept embeddings that get easily detached from their definition

Maybe it is worth mentioning in this regard entity typing, knowledge graph embedding, and other approaches that combine NLProc with external knowledge from knowledge graphs (which can be thought of as preceding ontologies and as stores of "definitions"). I hope I have not missed your meaning.

Recently I also found "Embedding OWL Ontologies with OWL2Vec".
My highlights are here: https://twitter.com/semanticbeeng/status/1345712396229337096.
A very rich review by @cmungall is here: https://twitter.com/chrismungall/status/1313296287861600256.

Related: Onto2Vec, OPA2Vec, RDF2Vec, Node2Vec
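To make the shared idea behind this *2Vec family concrete, here is a toy sketch of the RDF2Vec recipe: random walks over the graph are treated as sentences and fed to word2vec. It assumes rdflib and gensim; "ontology.ttl" is a placeholder input, and real implementations (e.g., pyRDF2Vec) add many refinements:

```python
import random
from rdflib import Graph
from gensim.models import Word2Vec

g = Graph().parse("ontology.ttl")  # placeholder input file

# Build an adjacency list: subject -> [(predicate, object), ...]
adj = {}
for s, p, o in g:
    adj.setdefault(s, []).append((p, o))

def walk(start, depth=4):
    """One random walk, recorded as alternating entity/predicate labels."""
    path, node = [str(start)], start
    for _ in range(depth):
        edges = adj.get(node)
        if not edges:
            break
        p, o = random.choice(edges)
        path += [str(p), str(o)]
        node = o
    return path

# Several walks per entity become the "corpus" for word2vec.
walks = [walk(s) for s in list(adj) for _ in range(10)]
model = Word2Vec(sentences=walks, vector_size=100, window=5, min_count=1)
```

The differences between Onto2Vec, OPA2Vec, RDF2Vec and Node2Vec lie mainly in what is walked over (axioms, annotations, plain edges) and how walks are biased, not in this overall pipeline.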

chiarcos commented Mar 4, 2021

> > concept embeddings that get easily detached from their definition
>
> Maybe it is worth mentioning in this regard entity typing, knowledge graph embedding, and other approaches that combine NLProc with external knowledge from knowledge graphs (which can be thought of as preceding ontologies and as stores of "definitions"). I hope I have not missed your meaning.
>
> Related: Onto2Vec, OPA2Vec, RDF2Vec, Node2Vec

That's exactly what we had in mind. The difference is that these approaches are typically oriented towards algorithms, i.e., towards creating embeddings from knowledge graphs or extending knowledge graphs with embedding-based techniques. The OntoLex extension is purely representational: it is about storing (and re-using) such information together with a knowledge graph. With a standard vocabulary for this purpose, it becomes possible to provide APIs that store and load such data bundles more efficiently, and in a way that ensures the user eventually has access to both the embeddings and the underlying graph.

For certain kinds of knowledge graphs, having both information sources may be less essential, because the embeddings themselves are a stochastic approximation and generalization over the knowledge graph and can be readily applied in different settings.

For lexical information, however, this is very different, because there is a ground truth (in the dictionaries/wordnets) from which we want to deviate only if the explicit information we have is insufficient (e.g., for out-of-vocabulary words). And provenance is key here, because dictionaries differ widely in scale, quality, methodology and purpose, and if we aggregate over different dictionaries (which is a good idea in general, to improve coverage), we should be able to keep track of that information.

As an example, the MUSE dictionaries by Facebook (https://github.com/facebookresearch/MUSE) have certain valid uses, but they contain a lot of noise, as they are (I guess) automatically created from translation memories. For creating multilingual embeddings they are probably sufficient, but not for applications in MT or localization. The Apertium dictionaries (https://apertium.org/), on the other hand, are much more carefully curated and specifically designed for MT. They are general-purpose, though, and lack support for specific domains; for some languages they are small, and they emphasize lexical concepts, whereas function words may be left out for certain languages. The existing multilingual WordNets, again, are great resources that can be used to complement dictionaries with semantic concepts, but they lack coverage of certain grammatical categories and have some imbalances in the taxonomy they posit. Now, we can just combine all that data in a single lexical knowledge graph, hoping that this compensates for the respective biases, and then induce or create embeddings over it. But if we run into any weird behaviour in downstream applications, we should at least be able to track whether it is related to a specific source of information, so that we can disable, replace or fix it.
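As a sketch of the kind of bookkeeping meant here (assuming rdflib; the file names and graph URIs are made up), loading each source into its own named graph keeps provenance queryable, so that a noisy source can later be disabled, replaced or fixed in isolation:

```python
from rdflib import Dataset, URIRef

ds = Dataset()
sources = {  # hypothetical local copies of the resources above
    "muse": "muse_en_de.ttl",
    "apertium": "apertium_en_de.ttl",
    "wordnet": "wordnet_multilingual.ttl",
}
for name, path in sources.items():
    # one named graph per source, so every triple keeps its origin
    g = ds.graph(URIRef(f"http://example.org/graph/{name}"))
    g.parse(path, format="turtle")

# To debug weird downstream behaviour, query one source in isolation:
apertium_only = ds.graph(URIRef("http://example.org/graph/apertium"))
```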

Of course, embeddings can be calculated on the fly as well. But when it comes to multilingual applications, lexical resources can get quite substantial, e.g., https://github.com/acoli-repo/acoli-dicts. (Sorry for not providing a triple count; the basic unit we operate with is upwards of 10,000 translations per language pair.) So inducing embeddings over this graph for every individual application is both a massive waste of energy and a compatibility hazard if different applications are supposed to operate on the same embeddings (inducing embeddings involves non-deterministic aspects).

SemanticBeeng (Author) commented

> For certain kinds of knowledge graphs, having both information sources may be less essential, because the embeddings themselves are a stochastic approximation and generalization over the knowledge graph and can be readily applied in different settings.

Indeed!
