DBpedia Spotlight can return multiple entity annotations for the same token. In issue #42, I described the general issues with this. In one scenario, the overlapping entities share the same starting position. This is problematic for CWB corpora.
See the following example:
library(polmineR)
library(dbpedia)

sc <- corpus("GERMAPARL2") |>
  subset(speaker_name == "Heinrich von Brentano") |>
  subset(protocol_date == "1960-06-22") |>
  as.speeches(s_attribute_name = "speaker_name",
              s_attribute_date = "protocol_date",
              gap = 50) |>
  _[[1]]

get_dbpedia_uris(
  x = sc,
  language = getOption("dbpedia.lang"),
  max_len = 5600L,
  confidence = 0.35,
  support = 20,
  api = getOption("dbpedia.endpoint"), # German endpoint
  verbose = FALSE,
  expand_to_token = TRUE
)
The call emits warnings of the following form:
Warning: longer object length is not a multiple of shorter object length
Likely Cause
This seems to be due to lines 610 to 620 of dbpedia/R/dbpedia.R (at commit f4dc779) in get_dbpedia_uris().
DBpedia Spotlight adds different URIs to overlapping spans of tokens which share the same starting position. Since the starting position is the same for both annotations, dt[.SD[["start"]] == dt[["start"]]] is true for more than one token in the subcorpus. This causes the warning.
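To illustrate how this kind of warning can arise, here is a minimal base R sketch. It is not the code from get_dbpedia_uris(); the vectors and the helper capture_warning() are invented for illustration. It only shows that comparing a group of two identical start positions against a longer vector of token starts recycles the shorter vector and warns when the lengths do not divide evenly.

```r
# Invented start positions, one per token in a toy subcorpus.
token_starts <- c(0L, 10L, 25L, 40L, 51L)

# Two annotations sharing start position 10 versus a single annotation.
shared_start <- c(10L, 10L)
single_start <- 10L

# Hypothetical helper: returns the warning message, or NA if none was raised.
capture_warning <- function(expr) {
  tryCatch({ expr; NA_character_ }, warning = function(w) conditionMessage(w))
}

# Length 2 against length 5: the shorter vector is recycled and R warns.
msg_shared <- capture_warning(shared_start == token_starts)

# Length 1 against length 5: plain scalar comparison, no warning.
msg_single <- capture_warning(single_start == token_starts)

print(msg_shared)
print(msg_single)
```

Note that the warning only appears when the longer length is not a multiple of the shorter one, so the same problem may go unnoticed for other subcorpora.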
Possible solution
If we do not want to encode overlapping entities (in CWB corpora at least), we have to decide which annotation to keep and which to omit. Some options are already discussed in issue #42.
To circumvent the specific issue here, the first step would be to check whether there are multiple annotations for a single token (span). For this, a check like
if (any(table(resources$start) > 1))
could be added before resources is reduced to resources_min in get_dbpedia_uris() for subcorpora.
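A minimal sketch of such a check on a toy table might look like this. The column names start, end and uri are an assumption about the shape of resources, and the values are invented; the real object in get_dbpedia_uris() may differ.

```r
# Hypothetical annotation table; values are invented for illustration.
resources <- data.frame(
  start = c(0L, 10L, 10L, 40L),
  end   = c(6L, 16L, 29L, 48L),
  uri   = c("uri_a", "uri_b", "uri_c", "uri_d"),
  stringsAsFactors = FALSE
)

# Count annotations per start position.
start_counts <- table(resources$start)
has_shared_start <- any(start_counts > 1)

# Collect the affected rows so they can be reported or handled separately.
shared_starts <- as.integer(names(start_counts)[start_counts > 1])
conflicts <- resources[resources$start %in% shared_starts, , drop = FALSE]

if (has_shared_start) {
  message(nrow(conflicts), " annotations share a start position")
  print(conflicts)
}
```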
Then, it would be possible to introduce an argument which describes what to do in these cases.
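As a rough sketch of what such an argument could do: the helper resolve_shared_starts(), the argument name overlap and its values "keep", "longest" and "shortest" are all hypothetical and only meant to make the idea concrete. The toy table again assumes start, end and uri columns.

```r
# Hypothetical helper, not part of the dbpedia package.
resolve_shared_starts <- function(resources,
                                  overlap = c("keep", "longest", "shortest")) {
  overlap <- match.arg(overlap)

  # Nothing to do if all annotations are kept or no start is shared.
  if (overlap == "keep" || !any(duplicated(resources$start))) {
    return(resources)
  }

  span_length <- resources$end - resources$start

  # Sort by start, then by span length so the preferred span comes first.
  secondary <- if (overlap == "longest") -span_length else span_length
  sorted <- resources[order(resources$start, secondary), , drop = FALSE]

  # Keep only the first (preferred) annotation per start position.
  sorted[!duplicated(sorted$start), , drop = FALSE]
}

# Invented example data.
resources <- data.frame(
  start = c(0L, 10L, 10L, 40L),
  end   = c(6L, 16L, 29L, 48L),
  uri   = c("uri_a", "uri_b", "uri_c", "uri_d"),
  stringsAsFactors = FALSE
)

kept_all      <- resolve_shared_starts(resources, overlap = "keep")
kept_longest  <- resolve_shared_starts(resources, overlap = "longest")
kept_shortest <- resolve_shared_starts(resources, overlap = "shortest")
```

Defaulting to "keep" in this sketch leaves existing behavior unchanged unless the user asks for a reduction.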
Discussion
I am not sure about argument names and defaults. In addition, since this happens very rarely (for GermaParl at least), an option that sets the default behavior in such cases could be considered instead of an additional argument for get_dbpedia_uris(). But I am not sure if that is good practice.
As discussed in issue #42, there might be other options.
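For the option-based variant, one common R pattern is an argument whose default is read from an option, so both routes stay available. In the sketch below, the option name "dbpedia.overlap", the function annotate_sketch() and the allowed values are hypothetical.

```r
# "dbpedia.overlap" is a hypothetical option name used for illustration only.
annotate_sketch <- function(overlap = getOption("dbpedia.overlap", default = "keep")) {
  # Validate the value against the (hypothetical) set of strategies.
  match.arg(overlap, choices = c("keep", "longest", "shortest"))
}

# No option set: the fallback default applies.
options(dbpedia.overlap = NULL)
default_choice <- annotate_sketch()

# Option set once per session: it becomes the default.
options(dbpedia.overlap = "longest")
option_choice <- annotate_sketch()

# An explicit argument still overrides the option.
explicit_choice <- annotate_sketch(overlap = "shortest")

# Clean up the session option again.
options(dbpedia.overlap = NULL)
```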
Merging the tables on more than "start" alone should prevent this error from happening. To do this, "end" was added to the by argument in the chunk quoted above.
This should result in all overlapping annotations being kept in get_dbpedia_uris() without confusing different annotations with the same starting position. Overlaps should be handled by a new set of functions after get_dbpedia_uris().
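The effect of the additional key can be sketched with base R's merge(), which stands in here for the data.table operation used in the package. Both tables, including the cpos_left and cpos_right columns, are invented for illustration.

```r
# Invented annotations: two entities share start 10 but differ in end.
annotations <- data.frame(
  start = c(10L, 10L, 40L),
  end   = c(16L, 29L, 48L),
  uri   = c("uri_short", "uri_long", "uri_other"),
  stringsAsFactors = FALSE
)

# Invented token-level spans for the same annotations.
spans <- data.frame(
  start      = c(10L, 10L, 40L),
  end        = c(16L, 29L, 48L),
  cpos_left  = c(3L, 3L, 9L),
  cpos_right = c(3L, 5L, 9L)
)

# Keyed on "start" only: the two rows with start 10 are crossed with
# each other, so annotations get paired with the wrong span.
by_start <- merge(annotations, spans, by = "start")

# Keyed on "start" and "end": each annotation matches exactly one span.
by_both <- merge(annotations, spans, by = c("start", "end"))

print(nrow(by_start))
print(nrow(by_both))
```

With both keys, annotations that share a start position stay distinct, which matches the intent of keeping all overlaps and resolving them in a later step.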