Warnings caused by overlapping annotations when processing CWB corpora #43

Open
ChristophLeonhardt opened this issue Mar 7, 2024 · 1 comment

Comments

@ChristophLeonhardt
Collaborator

Issue

DBpedia Spotlight can return multiple entity annotations for the same token. In issue #42, I described the general issues with this. In one scenario, the overlapping entities share the same starting position. This is problematic for CWB corpora.

See the following example:

library(polmineR)
library(dbpedia)

sc <- corpus("GERMAPARL2") |>
  subset(speaker_name == "Heinrich von Brentano") |>
  subset(protocol_date == "1960-06-22") |>
  as.speeches(s_attribute_name = "speaker_name",
              s_attribute_date = "protocol_date",
              gap = 50) |>
  _[[1]]

get_dbpedia_uris(
  x = sc,
  language = getOption("dbpedia.lang"),
  max_len = 5600L,
  confidence = 0.35,
  support = 20,
  api = getOption("dbpedia.endpoint"), # German endpoint
  verbose = FALSE,
  expand_to_token = TRUE
)

Running this produces warnings of the form:

Warning: longer object length is not a multiple of shorter object length

Likely Cause

This seems to be due to these lines in get_dbpedia_uris():

dbpedia/R/dbpedia.R

Lines 610 to 620 in f4dc779

tab <- links[,
  list(
    cpos_left = dt[.SD[["start"]] == dt[["start"]]][["id"]],
    cpos_right = expand_fun(.SD),
    dbpedia_uri = .SD[["dbpedia_uri"]],
    text = .SD[["text"]],
    types = .SD[["types"]]
  ),
  by = "start",
  .SDcols = c("start", "end", "dbpedia_uri", "text", "types")
]

DBpedia Spotlight adds different URIs to overlapping spans of tokens which share the same starting position. Since the starting position is the same for both annotations, they end up in the same by = "start" group, so .SD[["start"]] contains more than one value. The comparison .SD[["start"]] == dt[["start"]] then has to recycle vectors of incompatible lengths, which appears to be what causes the warning.
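A minimal sketch of this recycling behaviour in base R, using made-up character offsets rather than the actual GermaParl data. The names token_start and group_start are illustrative stand-ins for dt[["start"]] and .SD[["start"]]; they do not appear in the package.

```r
# Toy token table: start offsets of three tokens (made-up values).
token_start <- c(0L, 5L, 12L)

# Two annotations sharing one start offset, i.e. one by = "start" group.
group_start <- c(5L, 5L)

# Comparing a length-2 vector with a length-3 vector recycles the shorter
# one; R signals a warning because 3 is not a multiple of 2.
msg <- tryCatch(
  {
    group_start == token_start
    "no warning"
  },
  warning = function(w) conditionMessage(w)
)
print(msg)
```

With a single annotation per group, the left-hand side has length one and the comparison recycles silently, which would explain why the warning only shows up for annotations sharing a start position.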

Possible solution

If we do not want to encode overlapping entities (in CWB corpora at least), we have to decide which annotation to keep and which to omit. Some options are already discussed in issue #42.

To circumvent the specific issue here, the first step would be to check whether there are multiple annotations for a single token (span). For this, a check like

if (any(table(resources$start) > 1)) 

could be added before resources is reduced to resources_min in get_dbpedia_uris() for subcorpora.
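As a rough sketch, such a check could also report which starting positions are affected. The toy resources table below is made up for illustration (placeholder URIs, arbitrary offsets) and only mimics the start and end columns mentioned above.

```r
# Toy stand-in for the resources table (made-up offsets and placeholder URIs).
resources <- data.frame(
  start = c(10L, 10L, 42L),
  end = c(17L, 29L, 48L),
  dbpedia_uri = c("uri_a", "uri_b", "uri_c"),
  stringsAsFactors = FALSE
)

# Start positions that occur more than once.
dup_starts <- unique(resources$start[duplicated(resources$start)])
has_same_start <- length(dup_starts) > 0L

if (has_same_start) {
  message(
    "Multiple annotations share start position(s): ",
    paste(dup_starts, collapse = ", ")
  )
}
```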

Then, it would be possible to introduce an argument which describes what to do in these cases.
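One way such an argument might behave is sketched below. The helper resolve_same_start() and its keep argument are hypothetical and not part of the dbpedia package; the policies "longest" and "shortest" are just two conceivable choices, and the toy table uses made-up offsets and placeholder URIs.

```r
# Hypothetical helper: decide which annotation to keep per start position.
resolve_same_start <- function(resources,
                               keep = c("all", "longest", "shortest")) {
  keep <- match.arg(keep)
  if (keep == "all") {
    return(resources)
  }
  span_len <- resources$end - resources$start
  # Sort so that the preferred span comes first within each start position.
  sort_key <- if (keep == "longest") -span_len else span_len
  sorted <- resources[order(resources$start, sort_key), , drop = FALSE]
  # Keep only the first (preferred) row per start position.
  sorted[!duplicated(sorted$start), , drop = FALSE]
}

# Toy input with two annotations starting at offset 10.
resources <- data.frame(
  start = c(10L, 10L, 42L),
  end = c(17L, 29L, 48L),
  dbpedia_uri = c("uri_a", "uri_b", "uri_c"),
  stringsAsFactors = FALSE
)

kept_longest <- resolve_same_start(resources, keep = "longest")
kept_shortest <- resolve_same_start(resources, keep = "shortest")
```

Defaulting to "all" would leave current behaviour unchanged, so the reduction would only happen when explicitly requested.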

Discussion

I am not sure about argument names and defaults. In addition, since this happens very rarely (for GermaParl at least), an option that sets the default behavior in such cases could be considered instead of an additional argument for get_dbpedia_uris(). But I am not sure whether that is good practice.
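For comparison, the option-based variant might look roughly like this. The option name dbpedia.overlap and the helper get_overlap_policy() are assumptions made for this sketch; the package does not define them. Only options(), getOption() and match.arg() are actual base R functions.

```r
# Hypothetical option name; the dbpedia package does not define it.
get_overlap_policy <- function() {
  policy <- getOption("dbpedia.overlap", default = "all")
  # Reject values outside the known policies.
  match.arg(policy, choices = c("all", "longest", "shortest"))
}

options(dbpedia.overlap = NULL)       # option unset: fall back to "all"
default_policy <- get_overlap_policy()

options(dbpedia.overlap = "longest")  # session-wide override
session_policy <- get_overlap_policy()

options(dbpedia.overlap = NULL)       # clean up
```

The trade-off is that an option changes results silently across a session, whereas an explicit argument keeps the choice visible in each call.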

As discussed in issue #42, there might be other options.

@ChristophLeonhardt
Collaborator Author

Grouping by both "start" and "end", rather than by "start" alone, should prevent this warning. To do this, "end" was added to the by argument in the chunk quoted above.

This should result in all overlapping annotations being kept in get_dbpedia_uris() without confusing different annotations with the same starting position. Overlaps should be handled by a new set of functions after get_dbpedia_uris().
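The effect of the wider grouping key can be checked on a toy table. This sketch uses base R and made-up offsets instead of the data.table call in the package; it only shows that two annotations with the same start but different ends fall into separate groups once "end" is part of the key.

```r
# Toy annotation table: two spans start at offset 10 but end differently.
links <- data.frame(
  start = c(10L, 10L, 42L),
  end = c(17L, 29L, 48L)
)

# Group sizes when grouping by start alone: one group holds two rows.
size_by_start <- table(links$start)

# Group sizes when grouping by the (start, end) pair: every group has one row.
size_by_span <- table(paste(links$start, links$end, sep = "-"))

max_by_start <- max(size_by_start)
max_by_span <- max(size_by_span)
```

Two annotations covering exactly the same span would still share a group, so that case would need separate handling.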
