get_dbpedia_uris() aborts: Error: protect(): protection stack overflow #52

Closed
ablaette opened this issue Apr 11, 2024 · 6 comments

@ablaette
Contributor

Running this ...

library(polmineR)
library(RcppCWB)
library(dbpedia)
library(dplyr)

p_size <- cl_attribute_size(corpus = "GERMAPARL2", attribute = "p", attribute_type = "s")

p_strucs <- s_attr("GERMAPARL2", s_attribute = "ne", registry = Sys.getenv("CORPUS_REGISTRY")) %>%
  s_attr_size()  %>%
  (`-`)(1) %>%
  seq(from = 0L, to = .) %>%
  get_region_matrix(corpus = "GERMAPARL2", s_attribute = "ne", strucs = .) %>%
  .[, 1L] %>%
  cl_cpos2struc(corpus = "GERMAPARL2", s_attribute = "p", cpos = .) %>%
  unique()


logfile <- tempfile()
message("Using logfile: ", logfile)

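# NOTE: `decade` is not defined in this snippet; presumably it holds the first three digits of a year.
decade_regex <- sprintf("^%d\\d-\\d{2}-\\d{2}", decade)

# `p_strucs_speech` is derived as in the minimal example further down in this thread.
p_types <- cl_struc2str(corpus = "GERMAPARL2", s_attribute = "p_type", struc = p_strucs)
p_strucs_speech <- p_strucs[which(p_types == "speech")]
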
paras <- corpus("GERMAPARL2") %>%
  subset(p %in% !!p_strucs_speech) %>%
  subset(grepl(!!decade_regex, protocol_date)) %>%
  split(s_attribute = "p", values = FALSE)

uritab_paragraphs <- get_dbpedia_uris(
  x = paras,
  language = getOption("dbpedia.lang"),
  max_len = 5600L,
  confidence = 0.35,
  support = 20, 
  api = getOption("dbpedia.endpoint"),
  logfile = logfile,
  retry = 3,
  verbose = FALSE,
  expand_to_token = TRUE,
  progress = TRUE,
  s_attribute = "ne_type"
)

Results in this error:
Error: protect(): protection stack overflow

This Stack Overflow thread suggests a potential solution:
https://stackoverflow.com/questions/32826906/how-to-solve-protection-stack-overflow-issue-in-r-studio

So should I include something such as

options(expressions = 5e5)

before this expression?

@ablaette
Contributor Author

ablaette commented Apr 12, 2024

The error does not occur when I break up the entire corpus into smaller pieces (legislative periods), except for the 17th legislative period of GERMAPARL2, where I still see it. Observations:

  • The 17th legislative period is longer (more paragraphs) than any previous legislative period.
  • From the logfile we learn that entity linking is complete, i.e. all calls to get_dbpedia_uris() are successful.
  • So the issue appears to lie with the list of data.table objects that is passed into rbindlist().

To be tested experimentally:

options(expressions = 5e5)

We might also look at Cstack_info()
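
A minimal sketch of how these settings could be inspected before rerunning the job, using only base R. As far as I know, the pointer protection stack that protect() complains about is sized by the `--max-ppsize` startup option (see `?Memory`) and not by `expressions`, so raising `expressions` alone may not be sufficient. That is an assumption to be checked, not a confirmed fix.

```r
# Current limit on nested expressions (R's default is 5000)
expr_limit <- getOption("expressions")
print(expr_limit)

# C stack information: size, current usage, growth direction, evaluation depth
stack <- Cstack_info()
print(stack)

# Raise the expression limit for this session; 5e5 is the maximum R accepts
old <- options(expressions = 5e5)

# ... the call to get_dbpedia_uris() would go here ...

# Restore the previous setting afterwards
options(old)
```

The size of the protection stack cannot be changed from within a running session. It would have to be set when R is started, e.g. `R --max-ppsize=500000`.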

This is a minimal version of the code I used that resulted in the error:

library(RcppCWB)
library(polmineR)
library(dplyr)
library(dbpedia)

logfile <- tempfile()

p_strucs <- s_attr("GERMAPARL2", s_attribute = "ne", registry = Sys.getenv("CORPUS_REGISTRY")) %>%
  s_attr_size()  %>%
  (`-`)(1) %>%
  seq(from = 0L, to = .) %>%
  get_region_matrix(corpus = "GERMAPARL2", s_attribute = "ne", strucs = .) %>%
  .[, 1L] %>%
  cl_cpos2struc(corpus = "GERMAPARL2", s_attribute = "p", cpos = .) %>%
  unique()

p_types <- cl_struc2str(corpus = "GERMAPARL2", s_attribute = "p_type", struc = p_strucs)
p_strucs_speech <- p_strucs[which(p_types == "speech")]

paras <- corpus("GERMAPARL2") %>%
  subset(p %in% !!p_strucs_speech) %>%
  subset(protocol_lp == "17") %>%
  split(s_attribute = "p", values = FALSE)

uritab <- get_dbpedia_uris(
  x = paras,
  language = getOption("dbpedia.lang"),
  max_len = 5600L,
  confidence = 0.35,
  support = 20, 
  api = getOption("dbpedia.endpoint"),
  logfile = logfile,
  retry = 3,
  verbose = FALSE,
  expand_to_token = TRUE,
  progress = TRUE,
  s_attribute = "ne_type"
)

@ablaette
Contributor Author

To get a better understanding of the issue, I tried to provoke it with the code below. Unfortunately, it runs without a problem. How can we provoke the error?

library(data.table)
dt <- data.table(
  A = 1:100,
  B = 1:100,
  C = 1:100,
  D = 1:100,
  E = 1:100,
  F = 1:100,
  G = 1:100,
  H = 1:100,
  I = 1:100,
  J = rep(list(a = "asdf", b = "asdf", c = "sdf"), times = 100)
)
dts <- lapply(1:500000, function(i) copy(dt))
foo <- rbindlist(dts)

@ablaette
Contributor Author

Confirmed: The error does not occur when we drop the column "types", which holds list values. For now, dropping the column is implemented only in the get_dbpedia_uris() method for subcorpus_bundle objects. A consistent implementation across methods is still to be done.

@ChristophLeonhardt
Collaborator

I think I can second that.

With the nested lists in types, calling rbindlist() results in the error you described when the list of data.tables returned within get_dbpedia_uris() gets large. Dropping the types column seems to be a good solution. Information on types can be stored in other ways given the mechanism around types_src.
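
The idea of dropping the column before binding can be sketched on toy data as follows. This is only an illustration with data.table: apart from `types`, the column names and values are made up, and the snippet is not the code used inside get_dbpedia_uris(). It also makes no claim to reproduce the protect() error at this small scale.

```r
library(data.table)

# Toy stand-in for one per-paragraph result: 'types' is a list column
# whose single cell holds a nested list (hypothetical values)
make_result <- function(i) {
  data.table(
    start = i,
    text = "Berlin",
    types = list(list(DBpedia = c("Place", "City"), Wikidata = "Q515"))
  )
}

results <- lapply(1:3, make_result)

# Drop the nested list column from every table before binding
results_flat <- lapply(
  results,
  function(dt) dt[, setdiff(names(dt), "types"), with = FALSE]
)

combined <- rbindlist(results_flat)
print(combined)
```

Selecting the remaining columns returns a new table and leaves the original per-paragraph tables untouched, so the type information is still available if it needs to be stored separately.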

@ablaette
Contributor Author

ablaette commented May 9, 2024

We now have the argument types_drop to remove the 'types' column, and the protect() error disappears when the column is dropped. What remains is to convey this point in the documentation.

@ablaette
Contributor Author

ablaette commented May 9, 2024

I added a paragraph explaining this issue in the documentation of the get_dbpedia_uris()-method.
