Add an option to retrieve a text only wikidata definition from entity ? #146

Open
Lucaterre opened this issue Jul 7, 2022 · 7 comments
Lucaterre commented Jul 7, 2022

Hello @kermitt2,

I am leaving this feature proposal here. Sorry in advance if I used the wrong terminology (preprocess, clean, etc.).

  • Context

Currently, the kb/concept/ query endpoint returns the concept definition with MediaWiki-style markup.

Output example for the concept "Victor Hugo":

'''''' (; 26 February 1802 – 22 May 1885) was a French poet, novelist, and dramatist of the [[Romanticism|Romantic movement]]. Hugo is considered to be one of the greatest and best-known French writers. Outside of France, his most famous works are the novels '''', 1862, and ''[[The Hunchback of Notre-Dame]]'', 1831. In France, Hugo is known primarily for his poetry collections, such as '''' (''The Contemplations'') and '''' (''The Legend of the Ages'').
  • Expected behavior

A definition without markup, for example (cf. https://en.wikipedia.org/wiki/Victor_Hugo):

Victor-Marie Hugo (26 February 1802 – 22 May 1885) was a French poet, novelist, and dramatist of the Romantic movement. Hugo is considered to be one of the greatest and best-known French writers. Outside of France, his most famous works are the novels Les Misérables, 1862, and The Hunchback of Notre-Dame, 1831. In France, Hugo is known primarily for his poetry collections, such as The Contemplations and The Legend of the Ages.
  • Suggestion

I don't know whether this is complicated to implement, but it could be done in two different ways:

  1. The user can choose to retrieve a "clean" definition by adding an optional parameter to the kb/concept endpoint, for example "raw":"true" or "clean":"true".

  2. The response includes both a "definition_raw" key (with MediaWiki markup) and a "definition_clean" key (without markup).

I think this would be useful for people who build additional features on top of the entity data (here, the definition), since they would not need to add their own text preprocessing function.
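
For comparison, the kind of client-side preprocessing function mentioned above might look like the minimal sketch below. It is not part of entity-fishing: the name strip_mediawiki and the regular expressions are assumptions for illustration, and the sketch only handles internal links and bold/italic quote markup, not templates, references, or tables.

```python
import re

# [[target|label]] -> label, [[target]] -> target (internal links only)
_LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")
# Runs of two or more apostrophes mark italic/bold text in MediaWiki markup
_QUOTES = re.compile(r"'{2,}")


def strip_mediawiki(text: str) -> str:
    """Hypothetical helper: remove basic MediaWiki markup from a definition."""
    text = _LINK.sub(r"\1", text)
    text = _QUOTES.sub("", text)
    # Collapse whitespace left behind by removed markup
    return re.sub(r"\s{2,}", " ", text).strip()


if __name__ == "__main__":
    raw = "a dramatist of the [[Romanticism|Romantic movement]]"
    print(strip_mediawiki(raw))
```

A regex pass like this is fragile compared with a real MediaWiki parser, which is part of why having the service return plain text directly is attractive.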

What do you think?

Regards,
Lucas Terriel

Owner

kermitt2 commented Jul 7, 2022

Hello @Lucaterre,

Thanks for the issue.

Yes, we can do this: a query parameter would select either plain text or the MediaWiki format for the definition field. A plain-text conversion method already exists:

https://github.com/kermitt2/entity-fishing/blob/master/src/main/java/com/scienceminer/nerd/utilities/mediaWiki/MediaWikiParser.java#L117

Author

Lucaterre commented Jul 7, 2022

Thank you for your answer! Nice that a method is already available :)

That is indeed the idea. To clarify my issue a little more (though I think it is what you described):

Consider an optional query parameter "plain_text" (maybe not the best name), set to "false" by default, in which case the response returns the definition in MediaWiki format.

Now imagine a request such as:

$ curl 'https://cloud.science-miner.com/nerd/service/kb/concept/Q90?lang=fr&plain_text=true'

The response would then return a plain-text definition instead of the definition in MediaWiki format.

I don't know whether there is any value in keeping both definitions (plain text and MediaWiki) in the same response; it probably depends on the use case. (That's an open question.)

Owner

kermitt2 commented Jul 8, 2022

What about something like this:

$ curl 'https://cloud.science-miner.com/nerd/service/kb/concept/Q90?lang=fr&definition=mediawiki'

The parameter name definition is more precise about the expected behavior, and a non-boolean value leaves room for more formats: it could be mediawiki (default), plain_text, or maybe another one in the future. Maybe definition_format rather than definition?
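
The design point here, a named format value instead of a boolean flag, can be sketched as follows. This is an illustration in Python and not the project's Java code; DefinitionFormat and parse_definition_format are hypothetical names, and the two values are the ones proposed in this comment.

```python
from enum import Enum
from typing import Optional


class DefinitionFormat(Enum):
    # Values as proposed in the discussion; a new format is one more member.
    MEDIAWIKI = "mediawiki"
    PLAIN_TEXT = "plain_text"


def parse_definition_format(value: Optional[str]) -> DefinitionFormat:
    """Hypothetical parser: map a query parameter value to a format."""
    if value is None or value.strip() == "":
        # Parameter omitted: keep the existing behavior (MediaWiki markup).
        return DefinitionFormat.MEDIAWIKI
    try:
        return DefinitionFormat(value.strip().lower())
    except ValueError:
        allowed = ", ".join(f.value for f in DefinitionFormat)
        raise ValueError(
            f"unsupported definition format {value!r}; expected one of: {allowed}"
        ) from None


if __name__ == "__main__":
    print(parse_definition_format("plain_text"))
```

With a boolean such as plain_text=true, adding a third format later would need a second flag; with an enumerated value it only needs a new member.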

Author

Lucaterre commented Jul 8, 2022

I agree: definition_format seems fine and is more explicit as a parameter name than definition alone (which is confusing, since the user might think the parameter makes retrieving the definition optional).

OK for mediawiki as the default value of the parameter (this makes sense, since it is the original format of the definition).

Just curious, what other "cross-mediawiki" formats do you think of in the future? HTML, Markdown for example?

Owner

kermitt2 commented Jul 8, 2022

> Just curious, what other "cross-mediawiki" formats do you think of in the future? HTML, Markdown for example?

Yes, I was thinking of these two possible formats.

@kermitt2 kermitt2 added this to the 0.0.6 milestone Jul 8, 2022
@kermitt2 kermitt2 self-assigned this Jul 19, 2022
@kermitt2 kermitt2 changed the title Add an option to retrieve a clean/preprocess wikidata definition from entity ? Add an option to retrieve a text only wikidata definition from entity ? Jan 7, 2023
Owner

kermitt2 commented Jan 7, 2023

This is implemented in 2557847:

  • default format for definitions is the original MediaWiki format (as before)
  • optional supported format is "plain text"

The REST API parameter is definitionFormat, with value Mediawiki (default) or PlainText (as requested in this issue). I am using Java naming conventions for the parameter, since this is a Java project.

Example:

curl -X GET 'http://localhost:8090/service/kb/concept/Q190712?definitionFormat=PlainText'
{ "rawName" : "First Battle of the Marne", "preferredTerm" : "First Battle of the Marne", "confidence_score":0, "wikipediaExternalRef":171325, "wikidataId" : "Q190712", "definitions" : [ { "definition" : "The First Battle of the Marne was a battle of the First World War fought from 5 to 12 September 1914. It was fought in a collection of skirmishes around the Marne River Valley. It resulted in an Entente victory against the German armies in the west. The battle was the culmination of the Retreat from Mons and pursuit of the Franco-British armies which followed the Battle of the Frontiers in August and reached the eastern outskirts of Paris.", "source" : "wikipedia-en", "lang" : "en" } ] ... }

https://nerd.readthedocs.io/en/latest/restAPI.html#get-kb-concept-id
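
From a client's point of view, the request above could be assembled as in the following sketch. The helper names build_concept_url and extract_definitions are hypothetical; the path kb/concept/{id}, the parameters lang and definitionFormat, and the "definitions" / "definition" response keys are taken from the examples in this thread. The sketch only builds the URL and reads a decoded JSON response; it does not perform the HTTP call.

```python
from typing import List, Optional
from urllib.parse import urlencode


def build_concept_url(
    base_url: str,
    concept_id: str,
    definition_format: str = "Mediawiki",
    lang: Optional[str] = None,
) -> str:
    """Hypothetical helper: build a kb/concept URL with definitionFormat."""
    params = {}
    if lang is not None:
        params["lang"] = lang
    params["definitionFormat"] = definition_format
    # urlencode joins parameters with '&' after the single leading '?'
    return f"{base_url.rstrip('/')}/kb/concept/{concept_id}?{urlencode(params)}"


def extract_definitions(response: dict) -> List[str]:
    """Collect the definition strings from a decoded kb/concept response."""
    return [
        entry["definition"]
        for entry in response.get("definitions", [])
        if "definition" in entry
    ]


if __name__ == "__main__":
    print(build_concept_url("http://localhost:8090/service", "Q190712", "PlainText"))
```

The format value is passed through unchanged, so the exact spelling expected by the server (for example PlainText) is left to the caller.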

Owner

kermitt2 commented Jan 9, 2023

I also added html as a format:

curl -X GET 'http://localhost:8090/service/kb/concept/Q190712?definitionFormat=html'
{ "rawName" : "First Battle of the Marne", "preferredTerm" : "First Battle of the Marne", "confidence_score":0, "wikipediaExternalRef":171325, "wikidataId" : "Q190712", "definitions" : [ { "definition" : "<p>The <b>First Battle of the Marne</b> was a battle of the <a href=\"https://en.wikipedia.org/wiki/First_World_War\" title=\"First World War\">First World War</a> fought from 5 to 12 September 1914. It was fought in a collection of skirmishes around the Marne River Valley. It resulted in an <a href=\"https://en.wikipedia.org/wiki/Allies_of_World_War_I\" title=\"Allies of World War I\">Entente</a> victory against the <a href=\"https://en.wikipedia.org/wiki/German_Army_(German_Empire)\" title=\"German Army (German Empire)\">German</a> armies in the west. The battle was the culmination of the <a href=\"https://en.wikipedia.org/wiki/Retreat_from_Mons\" title=\"Retreat from Mons\">Retreat from Mons</a> and pursuit of the Franco-British armies which followed the <a href=\"https://en.wikipedia.org/wiki/Battle_of_the_Frontiers\" title=\"Battle of the Frontiers\">Battle of the Frontiers</a> in August and reached the eastern outskirts of Paris.<p>", "source" : "wikipedia-en", "lang" : "en" } ]  ... }
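
One use of the html format is to keep the links that plain text discards. The sketch below, which assumes the definition has already been read from the JSON response, uses Python's standard html.parser to recover both the visible text and the (label, href) pairs. DefinitionHTMLParser and parse_html_definition are hypothetical names, not part of entity-fishing.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class DefinitionHTMLParser(HTMLParser):
    """Collects visible text and (label, href) pairs from an HTML definition."""

    def __init__(self) -> None:
        super().__init__()
        self.text_parts: List[str] = []
        self.links: List[Tuple[str, str]] = []
        self._href = None   # href of the <a> element currently open, if any
        self._label: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._label = []

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._label), self._href))
            self._href = None

    def handle_data(self, data):
        self.text_parts.append(data)
        if self._href is not None:
            self._label.append(data)


def parse_html_definition(html_definition: str):
    parser = DefinitionHTMLParser()
    parser.feed(html_definition)
    parser.close()  # flush any buffered text
    return "".join(parser.text_parts).strip(), parser.links


if __name__ == "__main__":
    sample = '<p>A battle of the <a href="https://en.wikipedia.org/wiki/First_World_War">First World War</a>.</p>'
    print(parse_html_definition(sample))
```

The standard parser is lenient about unbalanced tags, so a definition that does not close its paragraph element is still processed.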
