
mgiza++ force alignment: segmentation fault when reloading a big N table #2

Open
lefterav opened this issue Nov 19, 2014 · 6 comments


@lefterav

I am trying to produce word alignments for individual sentences. For this purpose I am using the "force align" functionality of mgiza++. Unfortunately, when I load a big N table (fertility table), mgiza crashes with a segmentation fault.

In particular, I initially ran mgiza on the full training parallel corpus using the default settings of the Moses script:

/project/qtleap/software/moses-2.1.1/bin/training-tools/mgiza  -CoocurrenceFile /local/tmp/elav01/selection-mechanism/systems/de-en/training/giza.1/en-de.cooc -c /local/tmp/elav01/selection-mechanism/systems/de-en/training/prepared.1/en-de-int-train.snt -m1 5 -m2 0 -m3 3 -m4 3 -model1dumpfrequency 1 -model4smoothfactor 0.4 -ncpus 24 -nodumps 0 -nsmooth 4 -o /local/tmp/elav01/selection-mechanism/systems/de-en/training/giza.1/en-de -onlyaldumps 0 -p0 0.999 -s /local/tmp/elav01/selection-mechanism/systems/de-en/training/prepared.1/de.vcb -t /local/tmp/elav01/selection-mechanism/systems/de-en/training/prepared.1/en.vcb

Afterwards, via the mgiza force-align script, I ran the following command:

/project/qtleap/software/moses-2.1.1/mgizapp-code/mgizapp//bin/mgiza giza.en-de/en-de.gizacfg -c /local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/prepared./en-de.snt -o /local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/giza./en-de -s /local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/prepared./de.vcb -t /local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/prepared./en.vcb -m1 0 -m2 0 -mh 0 -coocurrence /local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/giza./en-de.cooc -restart 11 -previoust giza.en-de/en-de.t3.final -previousa giza.en-de/en-de.a3.final -previousd giza.en-de/en-de.d3.final -previousn giza.en-de/en-de.n3.final -previousd4 giza.en-de/en-de.d4.final -previousd42 giza.en-de/en-de.D4.final -m3 0 -m4 1

This runs fine, until I get the following error:

  We are going to load previous N model from giza.en-de/en-de.n3.final

Reading fertility table from giza.en-de/en-de.n3.final

Segmentation fault (core dumped)

The N table that fails has about 300k entries, so I wanted to check whether its size is the problem. I truncated the table to 60k entries, and it works! But the alignments are not good.

I am struggling to fix this, so any help would be appreciated. I am running a freshly installed mgiza on Ubuntu 12.04.

@hala-maghout

Hi,
I'm having the same problem mentioned above by Lefteris. I'm running the latest version of MGIZA on openSUSE 12.2. I ran the force-align-moses script to align new data. The error message I get when loading the N table is:

We are going to load previous N model from giza.ja-en/ja-en.n3.final
Reading fertility table from giza.ja-en/ja-en.n3.final
./force-align-moses.sh: line 40: 984 Segmentation fault $MGIZA giza.$TGT-$SRC/$TGT-$SRC.gizacfg -c $ROOT/corpus/$TGT-$SRC.snt -o $ROOT/giza.${TGT}-${SRC}/$TGT-${SRC} -s $ROOT/corpus/$SRC.vcb -t $ROOT/corpus/$TGT.vcb -m1 0 -m2 0 -mh 0 -coocurrence $ROOT/giza.${TGT}-${SRC}/$TGT-${SRC}.cooc -restart 11 -previoust giza.$TGT-$SRC/$TGT-$SRC.t3.final -previousa giza.$TGT-$SRC/$TGT-$SRC.a3.final -previousd giza.$TGT-$SRC/$TGT-$SRC.d3.final -previousn giza.$TGT-$SRC/$TGT-$SRC.n3.final -previousd4 giza.$TGT-$SRC/$TGT-$SRC.d4.final -previousd42 giza.$TGT-$SRC/$TGT-$SRC.D4.final -m3 0 -m4 1

I have 787264 entries in the ja-en.n3.final file. I reduced the N table size and it worked as well. Any suggestions on how to solve this?

Many thanks

@prajdabre

Hello,

I think this problem occurs in the file NTables.cpp, more specifically in the following lines of code:

while (!inf.eof()) {
    nFert++;
    inf >> ws >> tok;
    // Reject token ids above the compile-time vocabulary limit.
    if (tok > MAX_VOCAB_SIZE) {
        cerr << "NTables:readNTable(): unrecognized token id: " << tok << '\n';
        exit(-1);
    }
    // Read MAX_FERTILITY probabilities for this token id.
    for (i = 0; i < MAX_FERTILITY; i++) {
        inf >> ws >> prob;
        getRef(tok, i) = prob;
    }
}

Maybe at some point an out-of-bounds array access occurs. Perhaps MAX_FERTILITY is at fault? I am just speculating.
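
For illustration, a more defensive version of that loop might look like the sketch below. This is not the actual mgiza code: probs[tok][i] stands in for getRef(tok, i), the allocated table size bounds tok instead of the MAX_VOCAB_SIZE constant, and the stream-state checks replace the eof() test (which runs the loop body one extra time on a stale token at end of file).

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <vector>

// Sketch only: probs[tok][i] plays the role of getRef(tok, i).
void readNTableSafely(std::ifstream &inf,
                      std::vector<std::vector<double> > &probs,
                      unsigned maxFertility) {
    unsigned tok;
    double prob;
    // Testing the extraction itself (instead of inf.eof()) stops cleanly
    // at end of file without reusing the previous token id.
    while (inf >> tok) {
        if (tok >= probs.size()) {  // bound by the allocated table, not a constant
            std::cerr << "readNTable: token id out of range: " << tok << '\n';
            std::exit(-1);
        }
        for (unsigned i = 0; i < maxFertility; i++) {
            if (!(inf >> prob)) {
                std::cerr << "readNTable: truncated row for token " << tok << '\n';
                std::exit(-1);
            }
            probs[tok][i] = prob;
        }
    }
}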

Hope this helps.

@hieuhoang
Contributor

I'm closing this issue because it hasn't been answered for a while. Reopen if you want to carry on the conversation.

@lefterav
Author

lefterav commented Feb 3, 2015

This is a show-stopper for the force-alignment feature, and it seems it has not been solved. I would like to keep this open, and I would be happy to help with further debugging.

@lefterav lefterav reopened this Feb 3, 2015
@hieuhoang
Contributor

No worries. It might be a good idea to make your data available so people can reproduce the problem; otherwise the issue isn't going to get anywhere.

@alvations
Contributor

alvations commented Nov 24, 2016

I'm having the same problem with Chinese-English. mgiza on en-zh works, but on zh-en it died after HMM training started, following Model 1:

Normalizing T 
 DONE Normalizing 
Model1: (5) TRAIN CROSS-ENTROPY 7.45211 PERPLEXITY 175.109
Model1: (5) VITERBI TRAIN CROSS-ENTROPY 8.16385 PERPLEXITY 286.791
Model 1 Iteration: 5 took: 107 seconds
Entire Model1 Training took: 525 seconds
NOTE: I am doing iterations with the HMM model!
Read classes: #words: 316590  #classes: 50
Actual number of read words: 316592 stored words: 316356
Read classes: #words: 825717  #classes: 50
Actual number of read words: 825719 stored words: 824545

==========================================================
Hmm Training Started at: Thu Nov 24 09:35:52 2016

-----------
Hmm: Iteration 1
Dump files 0 it 1 noIterations 5 dumpFreq 0
Reading more sentence pairs into memory ... 
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
ERROR: Forbidden zero sentence length 0
Segmentation fault (core dumped)

I suspect it's the fertility too, but it's rather strange, because I would expect it to be <= MAX_FERTILITY: the default ratio in clean-corpus-n.perl is set to 9, and MAX_FERTILITY is set to 9. Oh wait, it's zero-indexed, so < MAX_FERTILITY corresponds to ratio=9 in clean-corpus-n.perl.

Unusually high fertility will almost always happen, especially when aligning logographic languages (Japanese/Chinese) to alphabetic ones. But such pairs are rather rare (< 200K sentence pairs in my 10M sample), and most of them are probably misaligned sentences or non-monotonic sentence alignments.

I'm turning the max ratio down to 5 at cleaning, and I suppose mgiza will be happy. Let's see in 5-6 hours.


So training works when I have the fertility set to 5, 6, 7, 8, and even 9.

I've double-checked: if the ratio is set <= 9 when cleaning, this shouldn't occur. I don't know how, but I had rogue lines with ratio > 9 that snuck in, and mgiza by default doesn't like that.
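
For what it's worth, a quick standalone pass over the corpus can flag those rogue pairs before training. This is just a sketch under some assumptions: the file names are placeholders, and it expects whitespace-tokenized, line-aligned files, the same setup clean-corpus-n.perl works on. It flags both over-long length ratios (the extreme fertilities suspected above) and empty lines (the "Forbidden zero sentence length" error in the log above).

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Count whitespace-separated tokens on one line.
static size_t countTokens(const std::string &line) {
    std::istringstream iss(line);
    std::string tok;
    size_t n = 0;
    while (iss >> tok) n++;
    return n;
}

int main() {
    std::ifstream src("corpus.zh"), tgt("corpus.en");  // placeholder paths
    std::string s, t;
    const double maxRatio = 9.0;  // matches the clean-corpus-n.perl default discussed above
    size_t lineNo = 0;
    while (std::getline(src, s) && std::getline(tgt, t)) {
        lineNo++;
        size_t ls = countTokens(s), lt = countTokens(t);
        // Empty sentences and extreme length ratios are both flagged.
        if (ls == 0 || lt == 0 || ls > maxRatio * lt || lt > maxRatio * ls) {
            std::cout << "drop line " << lineNo
                      << " (" << ls << "/" << lt << " tokens)\n";
        }
    }
    return 0;
}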
