
Progressive performance loss when processing large data #50

Open

ablaette (Contributor) commented Apr 8, 2024

When I tried to run get_dbpedia_uris() on the entire GERMAPARL2 corpus, I had to abort because the processing time per paragraph kept increasing for no obvious reason. To record some observations:

  • The initial progress status message said that processing would take 3 days. When I returned after a few days, the estimated 'time of arrival' had grown to 5 days. At that point a bit more than half of the data (1.8 million of 3.0 million paragraphs) had been processed.
  • Running htop from the shell did not give me any specific insight about the process: cores were used as expected and main memory had not yet been exhausted.
  • There was still about 25 GB of hard disk space left.
  • RStudio's memory usage indicator reported 10 GB in use, but I am not entirely sure this figure is correct.

Concerning the logfile:

  • It does not cover all of the data that has been processed: I started the process on April 1, but the first entries in the logfile are from April 4.
  • I would have expected 1.8 million entries in the logfile, but it is only 67,212 lines long (see the sketch below for checking its coverage).
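
A minimal sketch for quantifying the coverage of the logfile (assuming the "[timestamp] message" line format that the plots below rely on):

library(magrittr)

logfile <- "~/Lab/tmp/entitylinking.log"
log_lines <- readLines(logfile)

# Number of log entries
length(log_lines)

# First and last timestamp, i.e. the period actually covered by the logfile
log_lines %>%
  gsub("^\\[(.*?)\\]\\s.*?$", "\\1", .) %>%
  as.POSIXct() %>%
  range()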

As a consequence, it is not possible to analyse when and why the drop in processing speed occurred. Anyway, these are some preliminary insights:

How many paragraphs have been processed per hour? Here we do not see a decrease. My assumption is that the decrease occurred before the period covered by the logfile.

library(magrittr)
library(lubridate)
library(ggplot2)
library(dplyr)

logfile <- "~/Lab/tmp/entitylinking.log"

# Extract the timestamp from each log line, floor it to the hour and
# count the entries (= processed paragraphs) per hour
logfile %>%
  readLines() %>%
  gsub("^\\[(.*?)\\]\\s.*?$", "\\1", .) %>%
  as.POSIXct() %>%
  lubridate::floor_date(unit = "hour") %>%
  as_tibble() %>%
  group_by(value) %>%
  summarise(N = n()) %>%
  ggplot(aes(x = value, y = N)) +
    geom_line()

[Plot: paragraphs processed per hour]

How long did it take to process one paragraph? This plot is much less telling and quite overloaded.

# Time difference between consecutive log entries, i.e. the (approximate)
# processing time per paragraph, in processing order
logfile %>%
  readLines() %>%
  gsub("^\\[(.*?)\\]\\s.*?$", "\\1", .) %>%
  as.POSIXct() %>%
  diff() %>%
  as_tibble() %>%
  mutate(id = 1L:nrow(.)) %>%
  ggplot(aes(x = id, y = value)) +
  geom_line()

[Plot: processing time per paragraph, in processing order]

What is the distribution of processing time?

# Histogram of the time differences between consecutive log entries
logfile %>%
  readLines() %>%
  gsub("^\\[(.*?)\\]\\s.*?$", "\\1", .) %>%
  as.POSIXct() %>%
  diff() %>%
  as.numeric() %>%
  hist(main = "Distribution of processing time", xlab = "seconds")

[Plot: histogram of processing times (seconds)]

Quite a few paragraphs took a very long time to be processed. We should analyse in more depth what the features of these paragraphs are. One possibility: requests that fail, followed by a waiting period until processing the paragraph is retried? A sketch for extracting the slowest entries from the logfile follows below.
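
A minimal sketch for pulling the slowest entries out of the logfile together with their full log messages; the 60-second retry threshold is a hypothetical value, not something I have verified in the code:

library(magrittr)
library(dplyr)

logfile <- "~/Lab/tmp/entitylinking.log"
log_lines <- readLines(logfile)

timestamps <- log_lines %>%
  gsub("^\\[(.*?)\\]\\s.*?$", "\\1", .) %>%
  as.POSIXct()

# Processing time attributed to each entry = time elapsed since the previous entry
secs <- c(NA, as.numeric(diff(timestamps), units = "secs"))

# The 20 slowest entries with their full log messages, to see what these
# paragraphs have in common (length, failed requests, retries, ...)
tibble(line = log_lines, seconds = secs) %>%
  arrange(desc(seconds)) %>%
  slice_head(n = 20)

# How many entries exceed a hypothetical retry timeout of 60 seconds?
sum(secs > 60, na.rm = TRUE)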

I attach the logfile for further analysis.

entitylinking.log
