When I tried to run `get_dbpedia_uris()` on the entire GERMAPARL2 corpus, I had to abort because, for whatever reason, the processing time per paragraph increased. To record some observations:
The initial progress status message said that processing time would be 3 days. When I returned after a few days, the estimated 'time of arrival' was up to 5 days. This was when a bit more than half of the data (1.8 million of 3.0 million paragraphs) had been processed.
Running `htop` from the shell did not give me any specific insight about the process: cores were used as expected and main memory had not yet been exhausted.
There was still about 25 GB of hard disk space left.
The information RStudio provides on memory consumption said that 10 GB were used, but I am not entirely sure that this information was correct.
Concerning the logfile:
It does not cover the entire data that has been processed: I started the process on April 1, but the first entries in the logfile are from April 4.
I would have expected 1.8 million entries in the logfile, but its length is only 67,212 lines.
As a consequence, it is not possible to analyse when and why the slump in processing speed occurred. Anyway, these are some preliminary insights:
How many paragraphs have been processed per hour? Here, we do not see a decrease. My assumption is that the decrease occurred before the coverage of the logfile.
How long did it take to process one paragraph? This is much less telling; it is quite overloaded.
What is the distribution of processing time?
There are quite a few paragraphs that took a long, long time to be processed. We should analyse in further depth what the features of such paragraphs are. One possibility: requests that fail, followed by the waiting period before processing the paragraph is retried.
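The retry hypothesis is easy to quantify in the abstract. The actual retry settings used during entity linking are unknown to me, so the parameters below (5 retries, 1 s base delay, doubling backoff) are purely hypothetical; the sketch only shows that a paragraph whose requests keep failing accumulates waiting time geometrically:

```python
def total_retry_delay(max_retries=5, base_delay=1.0, backoff=2.0):
    """Cumulative waiting time if every request attempt fails.

    Before retry i, the process waits base_delay * backoff**i seconds,
    so the total wait grows geometrically with the number of retries.
    The parameter values are hypothetical, not the package's actual ones.
    """
    return sum(base_delay * backoff**i for i in range(max_retries))

# 5 retries with 1 s base delay and doubling: 1 + 2 + 4 + 8 + 16 = 31 s
print(total_retry_delay())  # 31.0
```

Even a modest retry schedule like this would make a failing paragraph take two orders of magnitude longer than a successful one, which would be consistent with a long tail in the processing-time distribution.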
I attach the logfile for further analysis.
entitylinking.log
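For the further analysis, a minimal sketch of how per-hour throughput and per-paragraph gaps could be extracted from the logfile. The timestamp pattern (`YYYY-MM-DD HH:MM:SS` at the start of each line) is an assumption and would need to be adjusted to the actual format of entitylinking.log:

```python
import re
from collections import Counter
from datetime import datetime

def throughput_per_hour(lines, pattern=r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"):
    """Count log entries per hour and measure gaps between entries.

    Assumes one log entry per processed paragraph, each line starting
    with an ISO-like timestamp; the pattern is a placeholder for the
    real format of entitylinking.log.
    """
    per_hour = Counter()
    stamps = []
    for line in lines:
        m = re.match(pattern, line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        stamps.append(ts)
        per_hour[ts.strftime("%Y-%m-%d %H")] += 1
    # gaps between consecutive entries approximate per-paragraph time
    gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
    return per_hour, gaps

# illustrative input, not real log lines
lines = [
    "2024-04-04 10:00:01 paragraph 1 done",
    "2024-04-04 10:00:03 paragraph 2 done",
    "2024-04-04 11:30:00 paragraph 3 done",
]
per_hour, gaps = throughput_per_hour(lines)
print(per_hour)  # Counter({'2024-04-04 10': 2, '2024-04-04 11': 1})
print(gaps)      # [2.0, 5397.0]
```

A histogram of `gaps` would show whether the long-running paragraphs form a separate mode (as one would expect from failed-then-retried requests) or just a heavy tail.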