Handling overlapping annotations by DBpedia Spotlight #42

Open
ChristophLeonhardt opened this issue Mar 7, 2024 · 1 comment

Comments

@ChristophLeonhardt
Collaborator

Issue

Occasionally, DBpedia Spotlight returns overlapping annotations.

Take the following example:

library(dbpedia)

doc <- "Der Deutsche Bundestag tagt in Berlin."

uri_table <- get_dbpedia_uris(
  x = doc,
  language = getOption("dbpedia.lang"),
  api = getOption("dbpedia.endpoint") # German endpoint
)

In phrases such as

"Der Deutsche Bundestag"

(found in GermaParl), both "Der Deutsche Bundestag" and "Bundestag" are annotated as entities. They may share the same URI but do not have to. Depending on the input format, this causes different problems. For character vectors, it at least overestimates the number of entities: the example above contains only one instance of "Bundestag", so counting the two URIs as two instances would be incorrect in most cases. For CWB corpora, we currently have no way to encode these overlapping annotations.
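The overcounting can be made concrete with a minimal sketch in base R. It does not use the actual return value of get_dbpedia_uris(): the table is written by hand, the columns `start`, `end`, `text` and `group` are assumptions, and `find_overlap_groups()` is a hypothetical helper. Offsets are 1-based and inclusive.

```r
doc <- "Der Deutsche Bundestag tagt in Berlin."

# Hand-written spans (not Spotlight output); column names are assumptions.
annotations <- data.frame(
  start = c(1L, 14L, 32L),
  end   = c(22L, 22L, 37L)
)
annotations$text <- substring(doc, annotations$start, annotations$end)

# Hypothetical helper: spans that overlap directly or via a chain of
# overlaps receive the same group id.
find_overlap_groups <- function(start, end) {
  ord <- order(start, -end)
  group <- integer(length(start))
  current_group <- 0L
  current_end <- -Inf
  for (i in ord) {
    if (start[i] > current_end) {
      # No overlap with anything seen so far: open a new group.
      current_group <- current_group + 1L
      current_end <- end[i]
    } else {
      # Overlap: stay in the group and extend its right boundary.
      current_end <- max(current_end, end[i])
    }
    group[i] <- current_group
  }
  group
}

annotations$group <- find_overlap_groups(annotations$start, annotations$end)

nrow(annotations)                  # 3 annotations
length(unique(annotations$group))  # 2 distinct positions in the text
```

Counting rows treats "Der Deutsche Bundestag" and "Bundestag" as two mentions, whereas counting groups treats them as one.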

In this issue, I'll demonstrate three variations of overlapping entity annotations. The technical solution might be similar for all three scenarios, but there are conceptual aspects to discuss. The following considerations assume that we do not want to keep overlapping annotations but resolve them to a single annotation. Other solutions could be considered as well.

Embedded Annotations

In the example above, "Der Deutsche Bundestag", one entity is completely embedded in the other. This could be resolved by checking for overlapping entities and limiting the output to one of the following: the entity included in all annotations ("Bundestag"), the longest entity ("Der Deutsche Bundestag") or, using the scores provided by DBpedia Spotlight, the most "similar" (in terms of confidence) entity. Are there better options? This could be controlled either by an additional argument in get_dbpedia_uris() or by an option. I am not sure what constitutes good practice here.

Overlapping Entities

While in the example above, one entity is part of another, there are other examples in which the annotations merely overlap. I found an example for this in a speech by Angela Merkel (PlPr 16/46, page 4479; https://dserver.bundestag.de/btp/16/16046.pdf; abbreviated for this example):

"Die Mauer fiel"

In this example, DBpedia Spotlight identifies two entities: "Die Mauer" and "Mauer fiel". Both refer to the same URI. See the following chunk:

doc <- "Die Mauer fiel"

uri_table <- get_dbpedia_uris(
  x = doc,
  language = getOption("dbpedia.lang"),
  api = getOption("dbpedia.endpoint") # German endpoint
)

As in the case above, if we only counted the number of URIs, the number of references to "Berliner Mauer" would be overestimated: it would be counted twice although the term occurs only once.

Here, resolving the overlapping entities to one annotation seems more complicated than above: which of the two is more correct? Combining both would yield the entity "Die Mauer fiel", which might be artificial. It would also be possible to reduce the entity to the tokens occurring in both overlapping spans (i.e. "Mauer"). Might this be more appropriate? This approach would also be applicable to the embedded entities above, but does it always work as expected?
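The two options (combining both spans or reducing them to the shared part) can be sketched on character offsets. This is only an illustration under assumptions: the spans are written by hand with 1-based, inclusive offsets, and `merge_spans()` and `intersect_spans()` are hypothetical helpers, not functions of the package.

```r
doc <- "Die Mauer fiel"

# Hand-written character offsets of the two overlapping annotations.
span_a <- c(start = 1L, end = 9L)    # "Die Mauer"
span_b <- c(start = 5L, end = 14L)   # "Mauer fiel"

# Hypothetical helper: smallest span covering both input spans.
merge_spans <- function(a, b) {
  c(start = min(a[["start"]], b[["start"]]),
    end   = max(a[["end"]], b[["end"]]))
}

# Hypothetical helper: the part shared by both spans, or NULL if the
# spans do not overlap.
intersect_spans <- function(a, b) {
  start <- max(a[["start"]], b[["start"]])
  end <- min(a[["end"]], b[["end"]])
  if (start > end) return(NULL)
  c(start = start, end = end)
}

combined <- merge_spans(span_a, span_b)
shared <- intersect_spans(span_a, span_b)

substring(doc, combined[["start"]], combined[["end"]])  # "Die Mauer fiel"
substring(doc, shared[["start"]], shared[["end"]])      # "Mauer"
```

On character offsets the shared part happens to coincide with a full token here. For CWB corpora the same logic would presumably have to operate on token positions so that no token is cut in half.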

Interestingly, as_subcorpus() in combination with read() seems to work just fine (at least as long as the URI is the same for both parts of the overlap):

library(polmineR) # corpus(), as.speeches() and read() are polmineR functions

sc <- corpus("GERMAPARL2") |>
  subset(speaker_name == "Angela Merkel") |>
  subset(protocol_date == "2006-09-06") |>
  as.speeches(s_attribute_name = "speaker_name",
              s_attribute_date = "protocol_date",
              gap = 50) |>
  _[[1]]


speech_annotation <- get_dbpedia_uris(
  x = sc,
  language = getOption("dbpedia.lang"),
  max_len = 5600L,
  confidence = 0.35,
  support = 20,
  api = getOption("dbpedia.endpoint"), # German endpoint
  verbose = FALSE,
  expand_to_token = TRUE
)

read(sc,
     annotation = as_subcorpus(speech_annotation))

Overlapping Entities with the Same Starting Position

This is a special case of the first variation: an entity can be embedded in another entity while both share the same starting position. The following example (taken from a speech by Heinrich von Brentano in the Bundestag; PlPr 3/118, page 6801; https://dserver.bundestag.de/btp/03/03118.pdf) illustrates this:

doc <- "Ölbild Kaiser Wilhelms I."

uri_table <- get_dbpedia_uris(
  x = doc,
  language = getOption("dbpedia.lang"),
  api = getOption("dbpedia.endpoint")
)

In this case, "Kaiser Wilhelms I." and "Kaiser" are both annotated as entities. They also have different URIs assigned to them.

Since this also results in warnings when applied to CWB corpora, I will create a separate issue for this scenario.
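The three variations described so far can be told apart from the span boundaries alone. The following sketch assumes 1-based, inclusive character offsets; `classify_overlap_type()` and its labels are hypothetical and only illustrate the distinction.

```r
# Hypothetical helper: classify how two spans relate to each other.
classify_overlap_type <- function(start_a, end_a, start_b, end_b) {
  # Spans that do not share any position do not overlap.
  if (end_a < start_b || end_b < start_a) return("none")
  a_in_b <- start_a >= start_b && end_a <= end_b
  b_in_a <- start_b >= start_a && end_b <= end_a
  if (a_in_b || b_in_a) {
    # Identical spans also fall into this category.
    if (start_a == start_b) return("embedded_same_start")
    return("embedded")
  }
  "partial"
}

# Offsets counted by hand in the example documents of this issue.
bundestag <- classify_overlap_type(1L, 22L, 14L, 22L) # "Der Deutsche Bundestag" / "Bundestag"
mauer     <- classify_overlap_type(1L, 9L, 5L, 14L)   # "Die Mauer" / "Mauer fiel"
kaiser    <- classify_overlap_type(8L, 25L, 8L, 13L)  # "Kaiser Wilhelms I." / "Kaiser"

c(bundestag, mauer, kaiser)
```

For the three examples this returns "embedded", "partial" and "embedded_same_start", in that order.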

Possible Solution

Assuming that overlapping entities will not be encoded, we need to decide how to handle these overlaps. What I can imagine is an option or an argument that states whether the shortest entity (or the actual overlapping token?), the longest entity or the most similar entity should be kept. The terminology of "longest" and "shortest" is loosely inspired by the CWB manual for CQP queries. It should probably be checked how other tools handle this as well.

In the examples above, this would mean something like

| Entity | Shortest / Overlapping Entity | Longest Entity | Most Similar Entity |
| --- | --- | --- | --- |
| [Der Deutsche [Bundestag]] | Bundestag | Der Deutsche Bundestag | Der Deutsche Bundestag |
| [Die [Mauer] fiel] | Mauer | Die Mauer fiel | Mauer fiel |
| [[Kaiser] Wilhelms I.] | Kaiser | Kaiser Wilhelms I. | Kaiser Wilhelms I. |

Notes:

  • in the "Entity" column, the separate token spans are represented with pairs of square brackets, i.e. "Der Deutsche Bundestag" is a span and "Bundestag" is a span, "Die Mauer" is a span and "Mauer fiel" is a span, etc.
  • the entity in the column "Most Similar" is based on the "similarityScore" column in the resources data.table retrieved from DBpedia Spotlight. These values can be very close.
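How such an argument could behave is sketched below in base R. Everything here is an assumption made for illustration: `resolve_overlaps()`, its `keep` argument and the columns `start`, `end`, `text` and `similarity` are hypothetical, and the similarity values are made-up placeholders rather than Spotlight output.

```r
# Hypothetical helper: keep one annotation per group of overlapping spans.
resolve_overlaps <- function(x, keep = c("longest", "shortest", "most_similar")) {
  keep <- match.arg(keep)
  if (nrow(x) == 0L) return(x)

  # Sort by start (longer spans first on ties) and open a new group
  # whenever a span starts after everything seen so far has ended.
  x <- x[order(x$start, -x$end), , drop = FALSE]
  running_end <- cummax(x$end)
  new_group <- c(TRUE, x$start[-1] > running_end[-nrow(x)])
  x$group <- cumsum(new_group)

  # Higher score wins; ties go to the first row within a group.
  len <- x$end - x$start + 1L
  score <- switch(keep,
    longest = len,
    shortest = -len,
    most_similar = x$similarity
  )
  keep_row <- unlist(
    lapply(split(seq_len(nrow(x)), x$group),
           function(idx) idx[which.max(score[idx])]),
    use.names = FALSE
  )
  out <- x[keep_row, setdiff(names(x), "group"), drop = FALSE]
  rownames(out) <- NULL
  out
}

mauer <- data.frame(
  start = c(1L, 5L),
  end = c(9L, 14L),
  text = c("Die Mauer", "Mauer fiel"),
  similarity = c(0.98, 0.99), # made-up placeholder values
  stringsAsFactors = FALSE
)

resolve_overlaps(mauer, keep = "longest")$text      # "Mauer fiel"
resolve_overlaps(mauer, keep = "shortest")$text     # "Die Mauer"
resolve_overlaps(mauer, keep = "most_similar")$text # "Mauer fiel"
```

Note that this sketch only selects among the annotations that already exist. It therefore differs from the table for "Die Mauer fiel", where "longest" is the combined span and "shortest" is the shared token "Mauer"; those variants would require constructing a new span.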

Discussion

The question is how this behavior should be handled.

  • When is this behavior problematic?
  • Should it be addressed with an argument or an option for get_dbpedia_uris()?
  • Alternatively, the return value could contain both overlapping annotations, which would then have to be filtered later on.
  • What would the arguments and defaults look like?
@ChristophLeonhardt
Collaborator Author

To provide an update: Our current line of reasoning is that get_dbpedia_uris() should return all found entity annotations. Overlaps should be filtered afterwards. This is drafted in detect_overlap() and categorize_overlap().
