Checking for types_src results in an error if element in types column is unnamed #41

Open
ChristophLeonhardt opened this issue Mar 5, 2024 · 0 comments

Comments

@ChristophLeonhardt
Collaborator

Issue

In some cases, elements of the types column returned by get_dbpedia_uris() are not named lists. This is a) inconsistent and b) causes an error when filtering with types_src, which relies on the elements of this column being named.

Example

See the following example:

library(dbpedia)
library(quanteda)

inaugural_paragraphs <- data_corpus_inaugural |>
  corpus_subset(Year == 2021) |>
  corpus_reshape(to = "paragraphs")

get_dbpedia_uris(
  x = inaugural_paragraphs["2021-Biden.145"],
  language = getOption("dbpedia.lang"),
  max_len = 5600L,
  confidence = 0.5,
  support = 20,
  types = character(),
  api = getOption("dbpedia.endpoint"), # English endpoint
  verbose = FALSE,
  progress = FALSE
)

This will result in an error:

Error in FUN(X[[i]], ...) : subscript out of bounds

Likely underlying issue

Currently, populating the types column in get_dbpedia_uris() usually yields either an empty list (if the entity has no types) or a list of lists containing the entity types (if it has types). The names of the nested lists refer to the source/ontology the type is derived from.

This fails, however, if the document passed to get_dbpedia_uris() contains only one entity and all of its types come from one source. In this case, the types are added to the column as unnamed list elements. This seems to happen only if resource_min (the data.table containing the entities) has a single row.
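
To make the difference concrete, here is a minimal sketch of the two shapes described above. The values are illustrative assumptions, not actual output of get_dbpedia_uris(), and the second source name ("Schema") is chosen only as an example.

```r
# Illustrative values only; not actual output of get_dbpedia_uris().

# Expected shape of one cell in the types column:
# nested lists, each named after the source the types come from.
cell_named <- list(
  DBpedia = list("Person", "Agent"),
  Schema  = list("Person")
)

# Shape described for the single-entity, single-source case:
# the type is present, but the source name is missing.
cell_unnamed <- list("Person")

names(cell_named)             # "DBpedia" "Schema"
is.null(names(cell_unnamed))  # TRUE
```

Any code that selects types by source name can only work with the first shape.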

Error with types_src

This is inconsistent in itself and should be addressed. In addition, the missing name causes an error in the subsequent step that extracts and filters the types by their source via the types_src argument, because that step relies on the elements in types being named.
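
The following sketch shows how a name-based lookup behaves on named and unnamed elements in base R. The helper types_from_source() is hypothetical and only stands in for whatever the package does internally; the issue does not state which exact form the unnamed element takes, so both plausible forms are shown.

```r
# Hypothetical helper, not part of the dbpedia package:
# select the types of one source by name.
types_from_source <- function(cell, src) {
  unlist(cell[[src]])
}

# Named element: lookup by source works.
cell_named <- list(DBpedia = list("Person", "Agent"))
ok <- types_from_source(cell_named, "DBpedia")

# Unnamed list: `[[` with a name silently returns NULL.
cell_unnamed <- list("Person")
silent <- types_from_source(cell_unnamed, "DBpedia")

# Bare character vector: `[[` with a name raises an error.
cell_collapsed <- "Person"
msg <- tryCatch(
  types_from_source(cell_collapsed, "DBpedia"),
  error = function(e) conditionMessage(e)
)

ok      # "Person" "Agent"
silent  # NULL
msg     # "subscript out of bounds"
```

Either outcome breaks filtering by types_src: the types are dropped silently or the call stops with the error reported above.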

Potential Solution

I think that when preparing the types for the column, it would be necessary to check if

  • there are only types for a single element
  • these types are all from the same source

If there is only one type from a single source, e.g. "Person" from "DBpedia", wrapping this value in an additional list() should work.
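
A rough sketch of such a check, assuming the source name is still known at the point where the column is built. The helper ensure_named_types() and its src argument are hypothetical and not part of the dbpedia package.

```r
# Hypothetical helper, not part of the dbpedia package.
# `src` is an assumed argument: an unnamed element no longer carries its
# source, so the caller has to supply it.
ensure_named_types <- function(cell, src) {
  # Entity without types: keep the empty list.
  if (length(cell) == 0L) return(list())
  # Already named by source: nothing to do.
  if (!is.null(names(cell))) return(cell)
  # Unnamed: wrap the types in an additional list and name it after the source.
  stats::setNames(list(as.list(unlist(cell))), src)
}

fixed <- ensure_named_types(list("Person"), src = "DBpedia")
names(fixed)      # "DBpedia"
fixed[["DBpedia"]]  # list("Person")
```

With this normalization, a single-row result has the same shape as a multi-row one, so name-based filtering via types_src sees a consistent structure.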
