NAs in cpos_left in output for `get_dbpedia_uris()` for subcorpora #44

ChristophLeonhardt · 2024-03-07T19:27:30Z

Issue

As discussed in issue #26, DBpedia Spotlight occasionally annotates entities which do not perfectly align with token spans. The fix discussed in issue #26 is incomplete, however.

Working with GermaParl, it became apparent that tokenization can be tricky for the left entity boundary as well. In phrases like "G-8-Gipfel" (which, in GermaParl, is often tokenized into two tokens, "G" and "-8-Gipfel"), the entity identified by DBpedia Spotlight is "Gipfel", which starts in the middle of a token. This is a problem because we join tokens and entities on their starting positions: the entity's start offset matches no token's start offset, which leads to an NA value in the left corpus position of the entity.
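The mismatch can be reproduced with a toy example. The character offsets and corpus positions below are invented, `end` is assumed to be the offset of the last character, and the lookup uses base R `match()` rather than the package's actual join. It is only a sketch of why an entity that starts mid-token ends up with NA on the left.

```r
# Hypothetical tokens for the string "G-8-Gipfel" (0-based character offsets).
tokens <- data.frame(
  id    = c(100L, 101L),        # corpus positions (invented)
  token = c("G", "-8-Gipfel"),
  start = c(0L, 1L),
  end   = c(0L, 9L)             # assumption: offset of the last character
)

# The entity "Gipfel" starts in the middle of the second token, at offset 4.
entity <- data.frame(text = "Gipfel", start = 4L, end = 9L)

# Look up corpus positions by exact start / end offset.
cpos_left  <- tokens$id[match(entity$start, tokens$start)]
cpos_right <- tokens$id[match(entity$end, tokens$end)]

cpos_left   # NA: no token starts at offset 4
cpos_right  # 101: the token "-8-Gipfel" ends at offset 9
```

The right boundary resolves because entity and token end at the same offset; only the left boundary is lost.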

Potential Solutions

If we want to address this, we could use the same approach as suggested for issue #26: expand the span to the previous token boundary. For this, we could compare the starting positions of the entity and the tokens and choose the previous token, using an extended version of the expand_fun() auxiliary function introduced earlier:

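# `dt` (token table) and `expand_to_token` are taken from the enclosing scope.
expand_fun <- function(.SD, direction) {
  if (direction == "right") {
    # Exact match: a token ends where the entity ends.
    cpos_right <- dt[.SD[["end"]] == dt[["end"]]][["id"]]
    if (length(cpos_right) == 0 && isTRUE(expand_to_token)) {
      # No exact match: expand to the first token ending after the entity.
      cpos_right <- dt[["id"]][which(dt[["end"]] > .SD[["end"]])[1]]
    }
    cpos_right
  } else {
    # Exact match: a token starts where the entity starts.
    cpos_left <- dt[.SD[["start"]] == dt[["start"]]][["id"]]
    if (length(cpos_left) == 0 && isTRUE(expand_to_token)) {
      # No exact match: expand to the last token starting before the entity.
      cpos_vec <- which(dt[["start"]] < .SD[["start"]])
      cpos_left <- dt[["id"]][cpos_vec[length(cpos_vec)]]
    }
    cpos_left
  }
}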

This would make it necessary to adjust the following chunk as well:

tab <- links[,
             list(
               cpos_left = expand_fun(.SD, direction = "left"),
               cpos_right = expand_fun(.SD, direction = "right"),
               dbpedia_uri = .SD[["dbpedia_uri"]],
               text = .SD[["text"]],
               types = .SD[["types"]]
             ),
             by = "start",
             .SDcols = c("start", "end", "dbpedia_uri", "text", "types")
]

The possibility that there are incomplete annotations for "cpos_left" should be considered here as well:

dbpedia/R/dbpedia.R

Line 693 in 4a8fd3c

if (isTRUE(drop_inexact_annotations) & any(is.na(tab[["cpos_right"]]))) {
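
A minimal sketch of how that condition could cover both boundaries. `tab` and `drop_inexact_annotations` are toy stand-ins for the objects in dbpedia.R, the values are invented, and dropping the affected rows is an assumption about what follows the check, not a copy of the package code.

```r
# Toy stand-ins for the objects used in dbpedia.R (values are invented).
tab <- data.frame(
  cpos_left  = c(10L, NA, 30L),
  cpos_right = c(11L, 21L, NA),
  text       = c("Berlin", "Gipfel", "Bundestag")
)
drop_inexact_annotations <- TRUE

# An annotation is incomplete if either boundary could not be matched.
incomplete <- is.na(tab[["cpos_left"]]) | is.na(tab[["cpos_right"]])

if (isTRUE(drop_inexact_annotations) && any(incomplete)) {
  # Assumption for illustration: incomplete annotations are simply dropped.
  tab <- tab[!incomplete, , drop = FALSE]
}
tab
```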

Discussion

As with issue #26, this should be optional and comes with some conceptual considerations, in particular whether it always makes sense to expand the entity span to match the token span.

This might also not be very efficient, as the check is performed separately for each entity.
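
One possible way to avoid the per-entity lookup, sketched in base R rather than with the package's data.table code: if the token offsets of a text are sorted in increasing order, `findInterval()` can resolve all entities in one vectorized call. For the left boundary it returns the last token starting at or before the entity start, which covers both the exact match and the expansion to the previous token boundary. The data below are invented, and this is not what the development branch does.

```r
# Hypothetical token table, sorted by offset (values are invented).
tokens <- data.frame(
  id    = c(100L, 101L, 102L),
  start = c(0L, 1L, 11L),
  end   = c(0L, 9L, 20L)
)

# Two hypothetical entities: the first starts mid-token, the second ends mid-token.
entity_start <- c(4L, 11L)
entity_end   <- c(9L, 15L)

# Left: index of the last token with start <= entity start (0 means none).
idx_left <- findInterval(entity_start, tokens$start)
idx_left[idx_left == 0L] <- NA_integer_
cpos_left <- tokens$id[idx_left]

# Right: index of the first token with end >= entity end.
# An index past the last token yields NA when subsetting.
idx_right <- findInterval(entity_end, tokens$end, left.open = TRUE) + 1L
cpos_right <- tokens$id[idx_right]

data.frame(cpos_left, cpos_right)
```

This always expands to the token boundary, so the `expand_to_token = FALSE` case would still need the exact-match comparison.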

ChristophLeonhardt · 2024-03-27T14:01:27Z

This has been implemented as discussed above in a development branch.

The points raised under "Discussion" should still be considered.

ChristophLeonhardt added a commit that referenced this issue Mar 27, 2024

entity spans expand to start of token (#44)

2d88fc9
