
Progressive performance loss when processing large data #50

Open

ablaette (Contributor) commented Apr 8, 2024

When I tried to run get_dbpedia_uris() on the entire GERMAPARL2 corpus, I had to abort because the processing time per paragraph kept increasing for no obvious reason. To record some observations:

  • The initial progress status message said that processing would take 3 days. When I returned after a few days, the estimated 'time of arrival' had grown to 5 days. At that point a bit more than half of the data (1.8 million of 3.0 million paragraphs) had been processed.
  • Running htop from the shell did not give me any specific insight about the process: cores were used as expected and main memory had not yet been exhausted.
  • There was still about 25 GB of hard disk space left.
  • RStudio's memory usage indicator reported 10 GB in use, but I am not entirely sure this figure is correct.

Concerning the logfile:

  • It does not cover all of the data that has been processed: I started the process on April 1, but the first entries in the logfile are from April 4.
  • I would have expected 1.8 million entries in the logfile, but it is only 67,212 lines long (see the sketch below for checking its coverage).
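
A minimal sketch for quantifying the coverage of the logfile (assuming the "[timestamp] message" line format that the plots below rely on):

library(magrittr)

logfile <- "~/Lab/tmp/entitylinking.log"
log_lines <- readLines(logfile)

# Number of log entries
length(log_lines)

# First and last timestamp, i.e. the period actually covered by the logfile
log_lines %>%
  gsub("^\\[(.*?)\\]\\s.*?$", "\\1", .) %>%
  as.POSIXct() %>%
  range()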

As a consequence, it is not possible to analyse when and why the drop in processing speed occurred. Anyway, these are some preliminary insights:

How many paragraphs have been processed per hour? Here we do not see a decrease. My assumption is that the decrease occurred before the period covered by the logfile.

library(magrittr)
library(lubridate)
library(ggplot2)
library(dplyr)

logfile <- "~/Lab/tmp/entitylinking.log"

# Extract the timestamp from each log line, floor it to the hour and
# count the entries (= processed paragraphs) per hour
logfile %>%
  readLines() %>%
  gsub("^\\[(.*?)\\]\\s.*?$", "\\1", .) %>%
  as.POSIXct() %>%
  lubridate::floor_date(unit = "hour") %>%
  as_tibble() %>%
  group_by(value) %>%
  summarise(N = n()) %>%
  ggplot(aes(x = value, y = N)) +
    geom_line()

[Plot: paragraphs processed per hour]

How long did it take to process one paragraph? This plot is much less telling and quite overloaded.

# Time difference between consecutive log entries, i.e. the (approximate)
# processing time per paragraph, in processing order
logfile %>%
  readLines() %>%
  gsub("^\\[(.*?)\\]\\s.*?$", "\\1", .) %>%
  as.POSIXct() %>%
  diff() %>%
  as_tibble() %>%
  mutate(id = 1L:nrow(.)) %>%
  ggplot(aes(x = id, y = value)) +
  geom_line()

[Plot: processing time per paragraph, in processing order]

What is the distribution of processing time?

# Histogram of the time differences between consecutive log entries
logfile %>%
  readLines() %>%
  gsub("^\\[(.*?)\\]\\s.*?$", "\\1", .) %>%
  as.POSIXct() %>%
  diff() %>%
  as.numeric() %>%
  hist(main = "Distribution of processing time", xlab = "seconds")

[Plot: histogram of processing times (seconds)]

Quite a few paragraphs took a very long time to be processed. We should analyse in more depth what the features of these paragraphs are. One possibility: requests that fail, followed by a waiting period until processing the paragraph is retried? A sketch for extracting the slowest entries from the logfile follows below.
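
A minimal sketch for pulling the slowest entries out of the logfile together with their full log messages; the 60-second retry threshold is a hypothetical value, not something I have verified in the code:

library(magrittr)
library(dplyr)

logfile <- "~/Lab/tmp/entitylinking.log"
log_lines <- readLines(logfile)

timestamps <- log_lines %>%
  gsub("^\\[(.*?)\\]\\s.*?$", "\\1", .) %>%
  as.POSIXct()

# Processing time attributed to each entry = time elapsed since the previous entry
secs <- c(NA, as.numeric(diff(timestamps), units = "secs"))

# The 20 slowest entries with their full log messages, to see what these
# paragraphs have in common (length, failed requests, retries, ...)
tibble(line = log_lines, seconds = secs) %>%
  arrange(desc(seconds)) %>%
  slice_head(n = 20)

# How many entries exceed a hypothetical retry timeout of 60 seconds?
sum(secs > 60, na.rm = TRUE)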

I attach the logfile for further analysis.

entitylinking.log
