Performance issues on processing huge collections -> revise multithreading implementation #45

proycon · 2018-01-30T15:11:09Z

Processing of huge amounts of pre-tokenised FoLiA documents (for Nederlab) goes unexpectedly slow, despite disabling various modules (--skip=mcpa). In about 24 hours, about 90 documents have been processed.

Frog is called on a directory as follows (to eliminate initialisation overhead):

frog --skip=mcpa -override tokenizer.rulesFile=tokconfig-nld-historical --xmldir "." --threads 40 --testdir input/ -x

Log excerpt of a single document (rarity:/scratch/proycon/morr001cryp01_01.tok.folia.xml) in a long-running batch (days if not weeks):

frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took:  0 seconds, 1 milliseconds and 510 microseconds
frog-:CGN tagging took:   42 seconds, 593 milliseconds and 430 microseconds
frog-:NER took:           30 seconds, 188 milliseconds and 829 microseconds
frog-:Mblem took:         0 seconds, 803 milliseconds and 711 microseconds
frog-:Frogging in total took: 89 seconds, 930 milliseconds and 350 microseconds
frog-:resulting FoLiA doc saved in ./mole002refe01_01_0005.tok.folia.xml
frog-:Frogging input/moll013albu01_01.tok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took:  0 seconds, 16 milliseconds and 169 microseconds
frog-:CGN tagging took:   263 seconds, 846 milliseconds and 415 microseconds
frog-:NER took:           184 seconds, 152 milliseconds and 171 microseconds
frog-:Mblem took:         5 seconds, 328 milliseconds and 128 microseconds
frog-:Frogging in total took: 559 seconds, 893 milliseconds and 805 microseconds
frog-:resulting FoLiA doc saved in ./moll013albu01_01.tok.folia.xml
frog-:Frogging input/mont003zome01_01.tok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took:  0 seconds, 47 milliseconds and 859 microseconds
frog-:CGN tagging took:   57 seconds, 762 milliseconds and 530 microseconds
frog-:NER took:           37 seconds, 773 milliseconds and 605 microseconds
frog-:Mblem took:         1 seconds, 304 milliseconds and 201 microseconds
frog-:Frogging in total took: 120 seconds, 269 milliseconds and 869 microseconds
frog-:resulting FoLiA doc saved in ./mont003zome01_01.tok.folia.xml
----------------------------- (my emphasis) ------------------------
frog-:Frogging input/morr001cryp01_01.tok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took:  0 seconds, 31 milliseconds and 59 microseconds
frog-:CGN tagging took:   322 seconds, 971 milliseconds and 28 microseconds
frog-:NER took:           728 seconds, 839 milliseconds and 11 microseconds
frog-:Mblem took:         20 seconds, 219 milliseconds and 307 microseconds
frog-:Frogging in total took: 1494 seconds, 657 milliseconds and 222 microseconds
-----------------------------------------------------
frog-:resulting FoLiA doc saved in ./morr001cryp01_01.tok.folia.xml
frog-:Frogging input/moul004vowe01_01.tok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took:  0 seconds, 7 milliseconds and 995 microseconds
frog-:CGN tagging took:   336 seconds, 695 milliseconds and 921 microseconds
frog-:NER took:           111 seconds, 161 milliseconds and 469 microseconds
frog-:Mblem took:         3 seconds, 0 milliseconds and 52 microseconds
frog-:Frogging in total took: 508 seconds, 408 milliseconds and 533 microseconds
frog-:resulting FoLiA doc saved in ./moul004vowe01_01.tok.folia.xml
frog-:Frogging input/mouw001brah01_01.tok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took:  0 seconds, 37 milliseconds and 979 microseconds
frog-:CGN tagging took:   30 seconds, 709 milliseconds and 748 microseconds
frog-:NER took:           28 seconds, 536 milliseconds and 162 microseconds
frog-:Mblem took:         0 seconds, 966 milliseconds and 700 microseconds
frog-:Frogging in total took: 75 seconds, 622 milliseconds and 890 microseconds
frog-:resulting FoLiA doc saved in ./mouw001brah01_01.tok.folia.xml
frog-:Frogging input/muld014janf01_01.tok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!

(full log in rarity:/scratch/proycon/frog.log)

Comparison; a standalone run on only the highlighted document (without -nostdout) :

frog-:Frogging morr001cryp01_01.tok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took:  0 seconds, 17 milliseconds and 448 microseconds
frog-:CGN tagging took:   17 seconds, 26 milliseconds and 152 microseconds
frog-:NER took:           65 seconds, 981 milliseconds and 710 microseconds
frog-:Mblem took:         28 seconds, 174 milliseconds and 786 microseconds
frog-:Frogging in total took: 487 seconds, 310 milliseconds and 584 microseconds

I have some minor suggestions for better debugging:

Let frog output a date/timestamp before and after processing (makes it easier to inspect a long logs running for days)
Is some normalisation of timings possible (in addition to the absolute numbers), e.g. divided by the total amount of tokens?

Other possibility for testdir:

Parallellise document processing instead of modules? (if not too complicated)

The text was updated successfully, but these errors were encountered:

proycon · 2018-01-30T15:22:55Z

with --nostdout: frog-:Frogging in total took: 475 seconds, 806 milliseconds and 28 microseconds

(might not mean much, load varies)

proycon · 2018-01-30T15:32:22Z

With tokenisation:

frog-:Frogging morr001cryp01_01.notok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took:  3 seconds, 359 milliseconds and 505 microseconds
frog-:CGN tagging took:   15 seconds, 951 milliseconds and 475 microseconds
frog-:NER took:           59 seconds, 497 milliseconds and 132 microseconds
frog-:Mblem took:         25 seconds, 428 milliseconds and 531 microseconds
frog-:Frogging in total took: 495 seconds, 298 milliseconds and 53 microseconds

proycon · 2018-01-30T15:37:50Z

Reducing threads to 4 instead of 40 (no tokenisation, no stdout):

frog-:Frogging morr001cryp01_01.tok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took:  0 seconds, 49 milliseconds and 709 microseconds
frog-:CGN tagging took:   11 seconds, 411 milliseconds and 928 microseconds
frog-:NER took:           50 seconds, 811 milliseconds and 141 microseconds
frog-:Mblem took:         3 seconds, 60 milliseconds and 229 microseconds
frog-:Frogging in total took: 67 seconds, 930 milliseconds and 822 microseconds

proycon · 2018-01-30T15:39:52Z

single threaded:

frog-:Frogging morr001cryp01_01.tok.folia.xml
frog-tok-:ucto: --filter=NO is automatically set. inputclass equals outputclass!
frog-:tokenisation took:  0 seconds, 17 milliseconds and 784 microseconds
frog-:CGN tagging took:   9 seconds, 568 milliseconds and 987 microseconds
frog-:NER took:           44 seconds, 400 milliseconds and 315 microseconds
frog-:Mblem took:         1 seconds, 873 milliseconds and 655 microseconds
frog-:Frogging in total took: 56 seconds, 17 milliseconds and 539 microseconds

proycon · 2018-01-30T15:46:16Z

Another minor feature request for debugging:

frog-:Initialization done. (could it report time for this as well?)

proycon · 2018-01-30T15:48:10Z

So I think the conclusion is to go for parallellisation of document processing when --testdir is used instead of modules, and to set a smaller amount of threads?

kosloot · 2018-01-30T16:26:49Z

Some timestamps have been added already.

kosloot · 2018-02-05T13:28:33Z

I did a quick examination of the possibility to handle several files in parallel in Frog, but this is quite difficult to accomplish.
Every thread will need it's own copy of all the needed modules. Like CGN tagger, NER Tagger, MBLEM etc.
This is not an easy task. MBT provides a kind of 'cloning' mechanism. But MBLEM and MBMA don't yet.
(although based on Timbl, which HAS a clone() function)
Implementing this is therefore not easy. And only fruitful when there a 'enough' files to process in parallel.

A more direct way might be to create a wrapper that uses the Frog Server and feed that with several files in parallel. My guess is, that such a wrapper is workable, and faster to accomplish.

For now, I don't consider this a bug, but a design decision.

proycon assigned kosloot Jan 30, 2018

proycon added the enhancement label Jan 30, 2018

proycon changed the title ~~Performance issues on processing huge collections~~ Performance issues on processing huge collections -> revise multithreading implementation Feb 1, 2018

proycon added the bug label Feb 1, 2018

kosloot removed the bug label Feb 5, 2018

proycon mentioned this issue Feb 5, 2018

[dbnl] Redesign frog component in pipeline for performance LanguageMachines/PICCL#13

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance issues on processing huge collections -> revise multithreading implementation #45

Performance issues on processing huge collections -> revise multithreading implementation #45

proycon commented Jan 30, 2018 •

edited

proycon commented Jan 30, 2018

proycon commented Jan 30, 2018

proycon commented Jan 30, 2018

proycon commented Jan 30, 2018

proycon commented Jan 30, 2018

proycon commented Jan 30, 2018

kosloot commented Jan 30, 2018

kosloot commented Feb 5, 2018 •

edited

Performance issues on processing huge collections -> revise multithreading implementation #45

Performance issues on processing huge collections -> revise multithreading implementation #45

Comments

proycon commented Jan 30, 2018 • edited

proycon commented Jan 30, 2018

proycon commented Jan 30, 2018

proycon commented Jan 30, 2018

proycon commented Jan 30, 2018

proycon commented Jan 30, 2018

proycon commented Jan 30, 2018

kosloot commented Jan 30, 2018

kosloot commented Feb 5, 2018 • edited

proycon commented Jan 30, 2018 •

edited

kosloot commented Feb 5, 2018 •

edited