entity_types_map (#40) and types_src (#41) more robust
ChristophLeonhardt committed Mar 6, 2024
1 parent f4dc779 commit 4a8fd3c
Showing 8 changed files with 83 additions and 67 deletions.
4 changes: 2 additions & 2 deletions DESCRIPTION
@@ -1,8 +1,8 @@
Package: dbpedia
Type: Package
Title: R Wrapper for DBpedia Spotlight
-Version: 0.1.2
-Date: 2024-02-26
+Version: 0.1.2.9001
+Date: 2024-03-06
Authors@R: c(
person("Andreas", "Blaette", role = c("aut", "cre"), email = "andreas.blaette@uni-due.de", comment = c(ORCID = "0000-0001-8970-8010")),
person("Christoph", "Leonhardt", role = "aut")
6 changes: 6 additions & 0 deletions NEWS.md
@@ -1,3 +1,9 @@
+## dbpedia v0.1.2.9001
+* `entity_types_map()` now creates assignments again (#40) and returns them as character vectors
+* `entity_types_map()` also passes all arguments when used with data.table objects
+* `types_src` works in `get_dbpedia_uris()` for documents with a single type (#41)
+* messages for `types_src` follow verbosity set by the argument `verbose`
+
## dbpedia v0.1.2
* `get_dbpedia_uris()` has new argument `types` to filter results.
* `dbpedia_spotlight_status()` without warnings if docker not available / not running #32.
51 changes: 30 additions & 21 deletions R/dbpedia.R
@@ -424,24 +424,33 @@ setMethod(
new = c("dbpedia_uri", "text", "start", "types")
)
setcolorder(resources_min, c("start", "text", "dbpedia_uri", "types"))

resources_min[, "start" := as.integer(resources_min[["start"]]) + 1L]
+
+# See issue 41.
+types_list <- strsplit(x = resources_min[["types"]], split = ",")
+
resources_min[, "types" := lapply(
-strsplit(x = resources_min[["types"]], split = ","),
+types_list,
function(x){
if (length(x) == 0L) return(list())
spl <- strsplit(x, split = ":")
-split(
+types <- split(
x = unlist(lapply(spl, `[`, 2L)),
f = unlist(lapply(spl, `[`, 1L))
)
+if (length(types) == 1L & length(types_list) == 1L) {
+list(types)
+} else {
+types
+}
}
)]

if (length(types_src) > 0L){
src_all <- unique(unlist(lapply(resources_min[["types"]], names)))
src_unused <- setdiff(src_all, types_src)
-if (length(src_unused) > 0L)
+if (length(src_unused) > 0L & isTRUE(verbose))
cli_alert_info(
"dropping available types from: {paste(src_unused, collapse = ' / ')}"
)
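
The hunk above turns the comma-separated `types` string returned for each entity into a list named by source, and then reports the sources that `types_src` leaves out. A minimal base-R sketch of that step follows. The input strings and the helper name `parse_types()` are made up for illustration; the package itself does this inside a `data.table` assignment, which is where the single-row special case comes in.

```r
# Hypothetical input: one comma-separated "source:type" string per entity.
types_raw <- c(
  "DBpedia:Person,DBpedia:Agent,Schema:Person",
  "",
  "Wikidata:Q5"
)

# Hypothetical helper mirroring the anonymous function in the diff:
# split each "source:type" pair and group the types by their source.
parse_types <- function(x) {
  if (length(x) == 0L) return(list())
  spl <- strsplit(x, split = ":")
  split(
    x = unlist(lapply(spl, `[`, 2L)),
    f = unlist(lapply(spl, `[`, 1L))
  )
}

types_list <- strsplit(x = types_raw, split = ",")
types_parsed <- lapply(types_list, parse_types)

# Sources present in the data but not requested via `types_src`
# (the value of `types_src` here is an assumption for the example).
types_src <- "DBpedia"
src_all <- unique(unlist(lapply(types_parsed, names)))
src_unused <- setdiff(src_all, types_src)

str(types_parsed[[1]])
print(src_unused)
```

The entity with an empty `types` string ends up with an empty list, so it contributes no source names to `src_all`.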
@@ -481,17 +490,17 @@ setMethod("get_dbpedia_uris", "AnnotatedPlainTextDocument", function(x, language


#' Get DBpedia links.
-#'
-#' #' @details
-#' `expand_to_token` is a rather experimental feature that resolves mismatches
-#' between entity spans and token spans by expanding the former to the last
-#' character position of the corresponding token. See issue #26 in the `dbpedia`
-#' GitHub repository.
-#' The configuration of the `httr::GET()` calls used can be controlled using
-#' `httr::config()`. A relevant scenario is SSL verification issues that can be
-#' addressed using `httr::set_config(httr::config(ssl_verifypeer = 0L))` (at own
-#' risk!). The error "HTTP/2 stream 1 was not closed cleanly before end of the
-#' underlying stream" can be addressed using
+#'
+#' @details - `expand_to_token` is a rather experimental feature that resolves
+#' mismatches between entity spans and token spans by expanding the former to
+#' the last character position of the corresponding token. See issue #26 in the
+#' `dbpedia` GitHub repository.
+#' - The configuration of the `httr::GET()` calls
+#' used can be controlled using `httr::config()`. A relevant scenario is SSL
+#' verification issues that can be addressed using
+#' `httr::set_config(httr::config(ssl_verifypeer = 0L))` (at own risk!). The
+#' error "HTTP/2 stream 1 was not closed cleanly before end of the underlying
+#' stream" can be addressed using
#' `httr::set_config(httr::config(http_verson = 1.1))`
#'
#' @param x A `subcorpus` (`xml`, ...) object. Will be coerced to
@@ -505,9 +514,9 @@ setMethod("get_dbpedia_uris", "AnnotatedPlainTextDocument", function(x, language
#' as threshold before DBpedia Spotlight includes a link into the report.
#' @param api An URL of the DBpedia Spotlight API.
#' @param types A `character` vector to restrict result returned to certain
-#' entity types, such as 'Company' or 'Organization'. If the `character`
+#' entity types, such as 'Company' or 'Organization'. If the `character`
#' vector is empty (default), no restrictions are applied.
-#' @param support The number of indegrees at Wikidata. Useful for limiting the
+#' @param support The number of indegrees at Wikidata. Useful for limiting the
#' the number of results by excluding insignificant entities.
#' @param types_src A `character` vector specifying knowledge bases as sources
#' for entity types. If provided, columns following the pattern '(src)_type'
@@ -530,9 +539,9 @@ setMethod("get_dbpedia_uris", "AnnotatedPlainTextDocument", function(x, language
#' @return A `data.table` with the following columns:
#' - *dbpedia_uri*: The DBpedia URI.
#' - *text*: Text that has been annotated
-#' - *types*: Recognized entity types, for each row a named list, if available
-#' entries such as 'DBpedia', 'Schema', 'Wikidata', 'DUL'
-#' Depending on the input object, further columns may be available.
+#' - *types*: Recognized entity types, for each row a named list, if available
+#' entries such as 'DBpedia', 'Schema', 'Wikidata', 'DUL'.
+#' Depending on the input object, further columns may be available.
#' @exportMethod get_dbpedia_uris
#' @importFrom cli cli_alert_warning cli_progress_step cli_alert_danger
#' cli_progress_done cli_alert_info
42 changes: 19 additions & 23 deletions R/entity_types.R
@@ -1,12 +1,12 @@
-#' Map types returned by DBpedia Spotlight to a limited set of classes
+#' Map types returned by DBpedia Spotlight to a limited set of categories
#'
#' This function takes the output of `get_dbpedia_uris()` and compares values in
#' the `types` column with a named character vector. The main purpose of this
-#' function is to reduce the number of types to a limited set of classes.
+#' function is to reduce the number of types to a limited set of categories.
#'
#' @param x A `data.table` with DBpedia URIs.
-#' @param mapping_vector A `named character vector` with desired class names (as
-#' names) and types from the DBpedia ontology as values. For example:
+#' @param mapping_vector A `named character vector` with desired category names
+#' (as names) and types from the DBpedia ontology as values. For example:
#' c("PERSON" = "DBpedia:Person"). Can contain more than one pair of class and
#' type.
#' @param other a `character vector` with the name of the class of all types not
@@ -70,32 +70,28 @@ setMethod(
if (!is.character(other) | length(other) > 1)
stop(format_error("{.var other} not character vector of length {.val 1}."))

-lapply(
+sapply(
x,
function(el){
-# types is a list of lists. Transform to single character vector.
-type_list <- unlist(el, recursive = FALSE)
-
-# An unintended consequence here is that you may get DBpedia1, DBpedia2, ...
-
-types_with_class_raw <- lapply(
-seq_along(type_list),
+types_with_category_raw <- lapply(
+seq_along(el),
function(i) {
-list_name <- names(type_list)[[i]]
-list_elements <- type_list[[i]]
+list_name <- names(el)[[i]]
+list_elements <- el[[i]]
paste0(list_name, ":", list_elements)
})
-types_with_class <- intersect(unlist(types_with_class_raw), mapping_vector)
-
-if (length(types_with_class) > 0L) {
-match_idx <- which(mapping_vector %in% types_with_class)
-
-class_name <- paste(
+
+types_with_category <- intersect(unlist(types_with_category_raw), mapping_vector)
+
+if (length(types_with_category) > 0L) {
+match_idx <- which(mapping_vector %in% types_with_category)
+
+category <- paste(
sort(unique(names(mapping_vector)[match_idx])),
collapse = "|"
)
} else {
-class_name <- other
+category <- other
}
}
)
@@ -116,9 +112,9 @@ setMethod(

if (verbose)
cli_alert_info(
-"mapping values in column {.var types} to new column {.var class}"
+"mapping values in column {.var types} to new column {.var category}"
)

-x[, class := entity_types_map(x = x[["types"]])]
+x[, category := entity_types_map(x = x[["types"]], mapping_vector = mapping_vector, other = other, verbose = verbose)]
x
})
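
The mapping logic changed in `R/entity_types.R` can be tried outside the package with a short base-R sketch. It assumes the per-entity structure described for the `types` column (a list named by source) and uses illustrative type names. The helper name `map_one()` is hypothetical, and `vapply()` stands in for the `sapply()` call in the diff so that the return type is explicit.

```r
# Illustrative mapping: category names as names, prefixed types as values.
mapping_vector <- c(
  PERSON = "DBpedia:Person",
  ORGANIZATION = "DBpedia:Organisation",
  ORGANIZATION = "DBpedia:Company"
)
other <- "MISC"

# Made-up `types` entries, one per entity.
types <- list(
  list(DBpedia = c("Person", "Agent"), Schema = "Person"),
  list(DBpedia = c("Company", "Organisation", "Agent")),
  list(DBpedia = c("Person", "Company")),
  list(Wikidata = "Q5"),
  list()
)

# Hypothetical helper following the steps in the diff.
map_one <- function(el) {
  # Prefix every type with its source, e.g. "DBpedia:Person".
  prefixed <- unlist(lapply(
    seq_along(el),
    function(i) paste0(names(el)[[i]], ":", el[[i]])
  ))
  matched <- intersect(prefixed, mapping_vector)
  if (length(matched) > 0L) {
    match_idx <- which(mapping_vector %in% matched)
    # Several matching categories are sorted and joined with "|".
    paste(sort(unique(names(mapping_vector)[match_idx])), collapse = "|")
  } else {
    other
  }
}

category <- vapply(types, map_one, character(1))
print(category)
```

Entities without any mapped type, including those with an empty list, fall back to the value of `other`, and an entity matching two categories gets both names joined by `|`.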
6 changes: 3 additions & 3 deletions R/utils.R
@@ -105,11 +105,11 @@ as_chunks <- function(x, size){
#' Transform table with DBpedia URIs to subcorpus.
#'
#' @param x A `data.table` with DBpedia URIs.
-#' @param highlight_by A `character vector` of the column in which entity names
-#' are annotated. Defaults to NULL.
+#' @param highlight_by A `character vector` of the column in which the types of
+#' entities are annotated. Defaults to NULL.
#' @details If a `character vector` is supplied to `highlight_by`, selected
#' entity types (PERSON, LOCATION, ORGANIZATION, MISC) are assigned specific
-#' color codes. Other entities in the column are assigned a single color.
+#' color codes. Other types in the column are assigned a single color.
#' @importFrom fs path
#' @export
as_subcorpus <- function(x, highlight_by = NULL){
6 changes: 3 additions & 3 deletions man/as_subcorpus.Rd

8 changes: 4 additions & 4 deletions man/entity_types_map.Rd

27 changes: 16 additions & 11 deletions man/get_dbpedia_uris.Rd
