NAs in cpos_left in output for get_dbpedia_uris() for subcorpora #44

Open
ChristophLeonhardt opened this issue Mar 7, 2024 · 1 comment

Comments

@ChristophLeonhardt
Collaborator

Issue

As discussed in issue #26, DBpedia Spotlight occasionally annotates entities which do not perfectly align with token spans. The fix discussed in issue #26 is incomplete, however.

Working with GermaParl, it became apparent that tokenization can be tricky for the left entity boundary as well. In phrases like "G-8-Gipfel" (which in GermaParl is often tokenized into two tokens, "G" and "-8-Gipfel"), the entity identified by DBpedia Spotlight is "Gipfel", which starts in the middle of a token. This is a problem because tokens and entities are joined on their starting positions: the entity's start offset matches no token start, so the left corpus position of the entity becomes "NA".
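The mismatch can be reproduced with a small toy example. This is only a sketch in base R: the sentence, the corpus positions, and the character offsets (0-based start, exclusive end) are assumptions made for illustration and are not taken from GermaParl or from the package internals.

```r
# Toy token table for "Der G-8-Gipfel tagt"; ids stand in for corpus positions.
# Offsets are assumed to be 0-based with an exclusive end.
tokens <- data.frame(
  id    = 100:103,
  word  = c("Der", "G", "-8-Gipfel", "tagt"),
  start = c(0L, 4L, 5L, 15L),
  end   = c(3L, 5L, 14L, 19L)
)

# Hypothetical Spotlight annotation: "Gipfel" starts inside the token "-8-Gipfel".
entity <- data.frame(text = "Gipfel", start = 8L, end = 14L)

# Exact matching on offsets, analogous to joining on start and end positions.
cpos_left  <- tokens$id[match(entity$start, tokens$start)]
cpos_right <- tokens$id[match(entity$end, tokens$end)]

print(cpos_left)   # no token starts at offset 8
print(cpos_right)  # the token "-8-Gipfel" ends at offset 14
```

In this toy case the right boundary resolves to a token, but the left boundary does not, which is the situation described above.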

Potential Solutions

If we want to address this, we could use the same approach as suggested for issue #26: expand the span to the previous token boundary. For this, we could compare the starting positions of the entity and the tokens and choose the previous token using an extended version of the expand_fun() auxiliary function introduced earlier:

expand_fun <- function(.SD, direction) {
  if (direction == "right") {
    # Exact match: a token ends where the entity ends.
    cpos_right <- dt[.SD[["end"]] == dt[["end"]]][["id"]]
    if (length(cpos_right) == 0 && isTRUE(expand_to_token)) {
      # No exact match: expand to the first token ending after the entity.
      cpos_right <- dt[["id"]][which(dt[["end"]] > .SD[["end"]])[1]]
    }
    cpos_right
  } else {
    # Exact match: a token starts where the entity starts.
    cpos_left <- dt[.SD[["start"]] == dt[["start"]]][["id"]]
    if (length(cpos_left) == 0 && isTRUE(expand_to_token)) {
      # No exact match: expand to the last token starting before the entity.
      cpos_vec <- which(dt[["start"]] < .SD[["start"]])
      cpos_left <- dt[["id"]][cpos_vec[length(cpos_vec)]]
    }
    cpos_left
  }
}

This would make it necessary to adjust the following chunk as well:

tab <- links[,
             list(
               cpos_left = expand_fun(.SD, direction = "left"),
               cpos_right = expand_fun(.SD, direction = "right"),
               dbpedia_uri = .SD[["dbpedia_uri"]],
               text = .SD[["text"]],
               types = .SD[["types"]]
             ),
             by = "start",
             .SDcols = c("start", "end", "dbpedia_uri", "text", "types")
]

The possibility that there are incomplete annotations for "cpos_left" should be considered here as well:

if (isTRUE(drop_inexact_annotations) & any(is.na(tab[["cpos_right"]]))) {
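A minimal sketch of how the check could cover both boundaries is shown below. The toy table and the decision to drop incomplete rows inside the `if` are assumptions for illustration; what the package actually does in that branch is not shown in this issue.

```r
# Toy result table: the second entity has no left corpus position.
tab <- data.frame(
  cpos_left  = c(100L, NA),
  cpos_right = c(100L, 102L)
)
drop_inexact_annotations <- TRUE

# An annotation counts as incomplete if either boundary is missing.
incomplete <- is.na(tab[["cpos_left"]]) | is.na(tab[["cpos_right"]])

if (isTRUE(drop_inexact_annotations) && any(incomplete)) {
  # Assumed handling: drop the incomplete rows.
  tab <- tab[!incomplete, , drop = FALSE]
}

print(tab)
```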

Discussion

As with issue #26, this should be optional and comes with some conceptual considerations, in particular whether it always makes sense to expand the entity span to match the token span.

This might also not be very efficient, as the check runs separately for each entity.
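One possible way to avoid the per-entity check is to resolve all boundaries in a single vectorized step. The following is a sketch in base R using `findInterval()`, not the implemented solution. It assumes that token offsets are sorted in ascending order and that expansion is always wanted (as with `expand_to_token = TRUE`); the function name `expand_vectorized()` and the toy data are hypothetical.

```r
# Hypothetical helper: map entity offsets to corpus positions in one pass.
# Assumes tokens$start and tokens$end are sorted in ascending order.
expand_vectorized <- function(tokens, links) {
  # Left boundary: last token whose start is <= the entity start.
  left_idx <- findInterval(links$start, tokens$start)
  left_idx[left_idx == 0L] <- NA_integer_  # entity starts before the first token

  # Right boundary: first token whose end is >= the entity end.
  # An index past the last token yields NA when subsetting.
  right_idx <- findInterval(links$end, tokens$end, left.open = TRUE) + 1L

  data.frame(
    cpos_left  = tokens$id[left_idx],
    cpos_right = tokens$id[right_idx],
    text       = links$text
  )
}

# Toy data for "Der G-8-Gipfel tagt" (0-based start, exclusive end; assumed).
tokens <- data.frame(
  id    = 100:103,
  start = c(0L, 4L, 5L, 15L),
  end   = c(3L, 5L, 14L, 19L)
)
links <- data.frame(
  text  = c("Der", "Gipfel"),
  start = c(0L, 8L),
  end   = c(3L, 14L)
)

res <- expand_vectorized(tokens, links)
print(res)
```

Exact matches are returned unchanged, and only entities that start or end inside a token are expanded. The conceptual question of whether expansion is always appropriate remains open.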

@ChristophLeonhardt
Collaborator Author

This has been implemented as discussed above in a development branch.

The issues raised under "Discussion" should still be considered.
