Warnings caused by overlapping annotations when processing CWB corpora #43

ChristophLeonhardt · 2024-03-07T14:33:33Z

Issue

DBpedia Spotlight can return multiple entity annotations for the same token. I described the general problems this causes in issue #42. In one scenario, the overlapping entities share the same starting position. This is problematic for CWB corpora.

See the following example:

library(polmineR)
library(dbpedia)

sc <- corpus("GERMAPARL2") |>
  subset(speaker_name == "Heinrich von Brentano") |>
  subset(protocol_date == "1960-06-22") |>
  as.speeches(s_attribute_name = "speaker_name",
              s_attribute_date = "protocol_date",
              gap = 50) |>
  _[[1]]

get_dbpedia_uris(
  x = sc,
  language = getOption("dbpedia.lang"),
  max_len = 5600L,
  confidence = 0.35,
  support = 20,
  api = getOption("dbpedia.endpoint"), # German endpoint
  verbose = FALSE,
  expand_to_token = TRUE
)

This produces warnings of the following form:

Warning: longer object length is not a multiple of shorter object length

Likely Cause

This seems to be due to these lines in get_dbpedia_uris():

dbpedia/R/dbpedia.R

Lines 610 to 620 in f4dc779

    
           tab <- links[, 
        
                        list( 
        
                          cpos_left = dt[.SD[["start"]] == dt[["start"]]][["id"]], 
        
                          cpos_right = expand_fun(.SD), 
        
                          dbpedia_uri = .SD[["dbpedia_uri"]], 
        
                          text = .SD[["text"]], 
        
                          types = .SD[["types"]] 
        
                        ), 
        
                        by = "start", 
        
                        .SDcols = c("start", "end", "dbpedia_uri", "text", "types") 
        
           ]

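DBpedia Spotlight adds different URIs to overlapping spans of tokens which share the same starting position. Since the starting position is the same for both annotations, grouping by "start" puts both annotations into the same group, so .SD[["start"]] holds more than one value when it is compared with dt[["start"]] in dt[.SD[["start"]] == dt[["start"]]]. The shorter vector is then recycled against the longer one, which causes the warning.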
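The recycling behind this warning can be reproduced in base R without the package. This is a minimal sketch with made-up offsets: token_start stands in for dt[["start"]] and annotation_start for .SD[["start"]] in a group that contains two annotations. Both names and all values are assumptions for illustration only.

```r
# Hypothetical character offsets of three tokens (stand-in for dt[["start"]]).
token_start <- c(0L, 5L, 12L)

# One annotation per start position: a clean comparison, no warning.
single_match <- which(5L == token_start)

# Two annotations sharing the same start (stand-in for .SD[["start"]]).
annotation_start <- c(5L, 5L)

# Comparing a length-2 vector with a length-3 vector triggers recycling;
# the warning is caught here so its message can be inspected.
msg <- tryCatch(
  {
    annotation_start == token_start
    NA_character_
  },
  warning = function(w) conditionMessage(w)
)
print(msg)
```

Note that R only warns when the longer length is not a multiple of the shorter one, so the same situation can also pass silently and still yield wrong positions.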

Possible solution

If we do not want to encode overlapping entities (in CWB corpora at least), we have to decide which annotation to keep and which to omit. Some options are already discussed in issue #42.

To circumvent the specific issue here, the first step would be to check whether there are multiple annotations for a single token (span). For this, a check like

if (any(table(resources$start) > 1))

could be added before resources is reduced to resources_min in get_dbpedia_uris() for subcorpora.
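As a rough sketch of what such a check could look like, the following uses a toy data.frame in place of resources. The column names start and end follow the chunk quoted above; the offsets and the URI placeholders are invented for illustration.

```r
# Toy stand-in for `resources`; values are assumptions, not real Spotlight output.
resources <- data.frame(
  start = c(5L, 5L, 20L),
  end = c(11L, 18L, 26L),
  dbpedia_uri = c("uri_a", "uri_b", "uri_c"),
  stringsAsFactors = FALSE
)

# The check proposed above: does any start position occur more than once?
has_shared_start <- any(table(resources$start) > 1)

if (has_shared_start) {
  # Collect all annotations whose start position is shared with another one.
  dup_starts <- unique(resources$start[duplicated(resources$start)])
  shared <- resources[resources$start %in% dup_starts, , drop = FALSE]
  print(shared)
}
```

anyDuplicated(resources$start) > 0 would be an equivalent base R test that avoids building the full table.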

Then, it would be possible to introduce an argument which describes what to do in these cases.
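One way such an argument might look is sketched below. Everything here is hypothetical: the helper resolve_same_start(), the argument name overlap and its values "keep", "longest" and "shortest" are not part of the dbpedia package and only illustrate the idea of a user-selectable strategy.

```r
# Hypothetical helper, NOT part of the dbpedia package.
# `overlap` decides what happens to annotations sharing a start position.
resolve_same_start <- function(resources,
                               overlap = c("keep", "longest", "shortest")) {
  overlap <- match.arg(overlap)
  if (overlap == "keep") {
    return(resources)
  }
  span_len <- resources$end - resources$start
  # Sort so that the preferred annotation comes first within each start.
  sort_key <- if (overlap == "longest") -span_len else span_len
  sorted <- resources[order(resources$start, sort_key), , drop = FALSE]
  # Keep only the first (preferred) annotation per start position.
  sorted[!duplicated(sorted$start), , drop = FALSE]
}

# Toy input with two annotations starting at offset 5 (invented values).
resources <- data.frame(
  start = c(5L, 5L, 20L),
  end = c(11L, 18L, 26L),
  dbpedia_uri = c("uri_a", "uri_b", "uri_c"),
  stringsAsFactors = FALSE
)

kept_longest <- resolve_same_start(resources, overlap = "longest")
kept_shortest <- resolve_same_start(resources, overlap = "shortest")
kept_all <- resolve_same_start(resources)
```

Using match.arg() makes the first value the default, so the default strategy is visible in the function signature.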

Discussion

I am not sure about argument names and defaults. In addition, since this happens very rarely (for GermaParl at least), an alternative to an additional argument for get_dbpedia_uris() would be an option which sets the default behavior in such cases. But I am not sure if that is good practice.

As discussed in issue #42, there might be other options.


ChristophLeonhardt · 2024-03-27T14:06:45Z

Merging the tables on more than "start" alone should prevent this warning from occurring. To do this, "end" was added to the by argument in the chunk quoted above.

This should result in all overlapping annotations being kept in get_dbpedia_uris() without confusing different annotations with the same starting position. Overlaps should be handled by a new set of functions after get_dbpedia_uris().
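The effect of adding "end" to by can be illustrated with a toy data.table. This is only a sketch: links here is an invented three-row table, not the object built inside get_dbpedia_uris(), and the offsets and URI placeholders are assumptions.

```r
library(data.table)

# Toy stand-in for `links`: two annotations share start 5 but differ in end.
links <- data.table(
  start = c(5L, 5L, 20L),
  end = c(11L, 18L, 26L),
  dbpedia_uri = c("uri_a", "uri_b", "uri_c")
)

# Grouping by "start" alone: the group for start 5 contains two rows.
by_start <- links[, .N, by = "start"]

# Grouping by "start" and "end": every annotation is its own group.
by_span <- links[, .N, by = c("start", "end")]

print(by_start)
print(by_span)
```

With one row per group, any per-group lookup on the start position works with a single value again, while both overlapping annotations remain in the result.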
