
mgiza++ force alignment: segmentation fault when reloading a big N table #2

Open
lefterav opened this issue Nov 19, 2014 · 6 comments


@lefterav

I am trying to produce word alignments for individual sentences. For this purpose I am using the "force align" functionality of mgiza++. Unfortunately, when I load a big N table (fertility table), mgiza crashes with a segmentation fault.

In particular, I initially ran mgiza on the full training parallel corpus using the default settings of the Moses script:

/project/qtleap/software/moses-2.1.1/bin/training-tools/mgiza  -CoocurrenceFile /local/tmp/elav01/selection-mechanism/systems/de-en/training/giza.1/en-de.cooc -c /local/tmp/elav01/selection-mechanism/systems/de-en/training/prepared.1/en-de-int-train.snt -m1 5 -m2 0 -m3 3 -m4 3 -model1dumpfrequency 1 -model4smoothfactor 0.4 -ncpus 24 -nodumps 0 -nsmooth 4 -o /local/tmp/elav01/selection-mechanism/systems/de-en/training/giza.1/en-de -onlyaldumps 0 -p0 0.999 -s /local/tmp/elav01/selection-mechanism/systems/de-en/training/prepared.1/de.vcb -t /local/tmp/elav01/selection-mechanism/systems/de-en/training/prepared.1/en.vcb

Afterwards, via the mgiza force-align script, I ran the following command:

/project/qtleap/software/moses-2.1.1/mgizapp-code/mgizapp//bin/mgiza giza.en-de/en-de.gizacfg -c /local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/prepared./en-de.snt -o /local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/giza./en-de -s /local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/prepared./de.vcb -t /local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/prepared./en.vcb -m1 0 -m2 0 -mh 0 -coocurrence /local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/giza./en-de.cooc -restart 11 -previoust giza.en-de/en-de.t3.final -previousa giza.en-de/en-de.a3.final -previousd giza.en-de/en-de.d3.final -previousn giza.en-de/en-de.n3.final -previousd4 giza.en-de/en-de.d4.final -previousd42 giza.en-de/en-de.D4.final -m3 0 -m4 1

This runs fine, until I get the following error:

  We are going to load previous N model from giza.en-de/en-de.n3.final

Reading fertility table from giza.en-de/en-de.n3.final

Segmentation fault (core dumped)

The N table that fails has about 300k entries, so I wanted to check whether its size is the problem. I truncated the table to 60k entries, and it works! But the alignments are not good.

I am struggling to fix this, so any help would be appreciated. I am running a freshly installed mgiza on Ubuntu 12.04.

@hala-maghout

Hi,
I'm having the same problem mentioned above by Lefteris. I'm running the latest version of MGIZA on openSUSE 12.2. I ran the force-align-moses script to align new data. The error message I get when loading the N table is:

We are going to load previous N model from giza.ja-en/ja-en.n3.final
Reading fertility table from giza.ja-en/ja-en.n3.final
./force-align-moses.sh: line 40: 984 Segmentation fault $MGIZA giza.$TGT-$SRC/$TGT-$SRC.gizacfg -c $ROOT/corpus/$TGT-$SRC.snt -o $ROOT/giza.${TGT}-${SRC}/$TGT-${SRC} -s $ROOT/corpus/$SRC.vcb -t $ROOT/corpus/$TGT.vcb -m1 0 -m2 0 -mh 0 -coocurrence $ROOT/giza.${TGT}-${SRC}/$TGT-${SRC}.cooc -restart 11 -previoust giza.$TGT-$SRC/$TGT-$SRC.t3.final -previousa giza.$TGT-$SRC/$TGT-$SRC.a3.final -previousd giza.$TGT-$SRC/$TGT-$SRC.d3.final -previousn giza.$TGT-$SRC/$TGT-$SRC.n3.final -previousd4 giza.$TGT-$SRC/$TGT-$SRC.d4.final -previousd42 giza.$TGT-$SRC/$TGT-$SRC.D4.final -m3 0 -m4 1

I have 787264 entries in the ja-en.n3.final file. I reduced the N table size and it worked as well. Any suggestions on how to solve this?

Many thanks

@prajdabre

Hello,

I think this problem occurs in the file NTables.cpp, more specifically in the following lines of code:

while (!inf.eof()) {
    nFert++;
    inf >> ws >> tok;
    // Reject token ids above the compile-time vocabulary limit.
    if (tok > MAX_VOCAB_SIZE) {
        cerr << "NTables:readNTable(): unrecognized token id: " << tok << '\n';
        exit(-1);
    }
    // Read MAX_FERTILITY probabilities for this token id.
    for (i = 0; i < MAX_FERTILITY; i++) {
        inf >> ws >> prob;
        getRef(tok, i) = prob;
    }
}

Maybe at some point an out-of-bounds array access occurs. Perhaps MAX_FERTILITY is at fault? I am just speculating.
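
For illustration, a more defensive version of that loop might look like the sketch below. This is not the actual mgiza code: probs[tok][i] stands in for getRef(tok, i), the allocated table size bounds tok instead of the MAX_VOCAB_SIZE constant, and the stream-state checks replace the eof() test (which runs the loop body one extra time on a stale token at end of file).

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <vector>

// Sketch only: probs[tok][i] plays the role of getRef(tok, i).
void readNTableSafely(std::ifstream &inf,
                      std::vector<std::vector<double> > &probs,
                      unsigned maxFertility) {
    unsigned tok;
    double prob;
    // Testing the extraction itself (instead of inf.eof()) stops cleanly
    // at end of file without reusing the previous token id.
    while (inf >> tok) {
        if (tok >= probs.size()) {  // bound by the allocated table, not a constant
            std::cerr << "readNTable: token id out of range: " << tok << '\n';
            std::exit(-1);
        }
        for (unsigned i = 0; i < maxFertility; i++) {
            if (!(inf >> prob)) {
                std::cerr << "readNTable: truncated row for token " << tok << '\n';
                std::exit(-1);
            }
            probs[tok][i] = prob;
        }
    }
}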

Hope this helps.

@hieuhoang
Contributor

I'm closing this issue because it hasn't been answered for a while. Reopen if you want to carry on the conversation.

@lefterav
Author

lefterav commented Feb 3, 2015

This is a show-stopper for the force-alignment feature, and it seems it has not been solved. I would like to keep this open, and I would be happy to help with further debugging.

@lefterav lefterav reopened this Feb 3, 2015
@hieuhoang
Contributor

No worries. It might be a good idea to make your data available so people can reproduce the problem; otherwise the issue isn't going to get anywhere.

@alvations
Contributor

alvations commented Nov 24, 2016

I'm having the same problem with Chinese-English. mgiza on en-zh works, but on zh-en it died after HMM training started, following Model 1:

Normalizing T 
 DONE Normalizing 
Model1: (5) TRAIN CROSS-ENTROPY 7.45211 PERPLEXITY 175.109
Model1: (5) VITERBI TRAIN CROSS-ENTROPY 8.16385 PERPLEXITY 286.791
Model 1 Iteration: 5 took: 107 seconds
Entire Model1 Training took: 525 seconds
NOTE: I am doing iterations with the HMM model!
Read classes: #words: 316590  #classes: 50
Actual number of read words: 316592 stored words: 316356
Read classes: #words: 825717  #classes: 50
Actual number of read words: 825719 stored words: 824545

==========================================================
Hmm Training Started at: Thu Nov 24 09:35:52 2016

-----------
Hmm: Iteration 1
Dump files 0 it 1 noIterations 5 dumpFreq 0
Reading more sentence pairs into memory ... 
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
Segmentation fault (core dumped)

I suspect it's the fertility too, but it's rather strange, because I would expect it to be <= MAX_FERTILITY: the default ratio in clean-corpus-n.perl is set to 9, and MAX_FERTILITY is set to 9. Oh wait, it's zero-indexed, so < MAX_FERTILITY corresponds to ratio=9 in clean-corpus-n.perl.

Unusually high fertility will almost always happen, especially when aligning logographic languages (Japanese/Chinese) to alphabetic ones. But such pairs are rather rare (< 200K sentence pairs in my 10M sample), and most of them are probably misaligned sentences or non-monotonic sentence alignments.

I'm turning the max ratio down to 5 at cleaning, and I suppose mgiza will be happy. Let's see in 5-6 hours.


So training works when I have the fertility set to 5, 6, 7, 8, and even 9.

I've double-checked: if the ratio is set <= 9 when cleaning, this shouldn't occur. I don't know how, but I had rogue lines with ratio > 9 that snuck in, and mgiza by default doesn't like that.
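
For what it's worth, a quick standalone pass over the corpus can flag those rogue pairs before training. This is just a sketch under some assumptions: the file names are placeholders, and it expects whitespace-tokenized, line-aligned files, the same setup clean-corpus-n.perl works on. It flags both over-long length ratios (the extreme fertilities suspected above) and empty lines (the "Forbidden zero sentence length" error in the log above).

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Count whitespace-separated tokens on one line.
static size_t countTokens(const std::string &line) {
    std::istringstream iss(line);
    std::string tok;
    size_t n = 0;
    while (iss >> tok) n++;
    return n;
}

int main() {
    std::ifstream src("corpus.zh"), tgt("corpus.en");  // placeholder paths
    std::string s, t;
    const double maxRatio = 9.0;  // matches the clean-corpus-n.perl default discussed above
    size_t lineNo = 0;
    while (std::getline(src, s) && std::getline(tgt, t)) {
        lineNo++;
        size_t ls = countTokens(s), lt = countTokens(t);
        // Empty sentences and extreme length ratios are both flagged.
        if (ls == 0 || lt == 0 || ls > maxRatio * lt || lt > maxRatio * ls) {
            std::cout << "drop line " << lineNo
                      << " (" << ls << "/" << lt << " tokens)\n";
        }
    }
    return 0;
}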
