Non-deterministic results when -ncpus != 1 (mgiza bin) #26

Open
cgr71ii opened this issue Jun 7, 2022 · 4 comments


cgr71ii commented Jun 7, 2022

Hi!

I have been using mgiza and noticed that the generated files do not contain the same information across executions, not even the same number of lines. This happens when -ncpus != 1. I tested with the same input files, setting -ncpus to 1, 2 and 8. Only with -ncpus 1 did two executions produce exactly the same output files.

Command:

ncpus="1" # deterministic
#ncpus="2" # non-deterministic
#ncpus="8" # non-deterministic

for iteration in 1 2; do
  mgiza -ncpus $ncpus -CoocurrenceFile corpus.fr-en.cooc -c corpus.fr-en-int-train.snt -m1 5 -m2 0 -m3 3 -m4 3 -mh 5 -m5 0 -model1dumpfrequency 1 -o test${iteration}.ncpus${ncpus}.corpus.fr-en -s corpus.en.vcb -t corpus.fr.vcb -emprobforempty 0.0 -probsmooth 1e-7
done

# Compare each output file of run 1 with its counterpart from run 2 (line order ignored)
for f1 in "test1.ncpus${ncpus}.corpus.fr-en"*; do
  f2="test2${f1#test1}"
  c=$(comm -3 <(sort "$f1") <(sort "$f2") | wc -l)

  if [[ "$c" != "0" ]]; then
    echo "Not equal: $f1 - $f2"
  fi
done
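
The loop above compares sorted copies of each file, so it cannot tell a pure reordering of lines apart from a real change in content. The following is a minimal sketch of a stricter check, not part of mgiza: `classify_pair` is a hypothetical helper name, and it assumes `cmp`, `sort` and `mktemp` are available.

```shell
# classify_pair FILE1 FILE2 (hypothetical helper, not part of mgiza)
# Prints "identical" if the files are byte-for-byte equal,
# "reordered" if they hold the same lines in a different order,
# and "different" otherwise.
classify_pair() (
  f1=$1
  f2=$2
  if cmp -s "$f1" "$f2"; then
    echo "identical"
    exit 0
  fi
  # Sort both files into temporary copies and compare those.
  s1=$(mktemp) && s2=$(mktemp) || exit 2
  LC_ALL=C sort "$f1" > "$s1"
  LC_ALL=C sort "$f2" > "$s2"
  if cmp -s "$s1" "$s2"; then
    result="reordered"
  else
    result="different"
  fi
  rm -f "$s1" "$s2"
  echo "$result"
)
```

Inside the comparison loop, `classify_pair "$f1" "$f2"` could be printed next to each file pair to see whether only the line order changes between runs.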

The files were generated with Bitextor 8.2 using data from this WARC. The files needed to reproduce the results are attached to this issue (for corpus.fr-en.cooc.1.zip and corpus.fr-en.cooc.2.zip, decompress both and run cat corpus.fr-en.cooc.1 corpus.fr-en.cooc.2 > corpus.fr-en.cooc).

input_mgiza.zip
corpus.fr-en.cooc.2.zip
corpus.fr-en.cooc.1.zip

hieuhoang (Contributor) commented

I doubt anyone will look into it. Why is it a problem? In fact, I'm surprised that -ncpus 1 is deterministic.

cgr71ii (Author) commented Jun 8, 2022

Well, since there is no proper documentation I could check, I assumed this was not the expected behavior. Since you are not surprised by it, am I wrong in thinking that non-deterministic output is the expected behavior?

hieuhoang (Contributor) commented Jun 9, 2022

You're right that the results should be either deterministic or non-deterministic regardless of how many threads are used.

I don't know the code that well, so don't take my word for it. In my mind, it should be non-deterministic during training due to randomness in word clustering. However, you seem to have found it to be non-deterministic even during inference. That could be an issue.

I'm not sure who can come to your rescue; mgiza is abandonware these days. Perhaps @edwardgao, the original author, has some time.

Btw, running the command with your data crashes for me. I'm not sure if that has anything to do with it.

cgr71ii (Author) commented Jun 9, 2022

I have run the commands again and they work for me. Have you run

cat corpus.fr-en.cooc.1 corpus.fr-en.cooc.2 > corpus.fr-en.cooc

? I had to split the file to be able to upload it to the issue.
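
To rule out a bad reassembly as the cause of the crash, the concatenated file can be checked against its parts. This is only a sketch: `check_concat` is a hypothetical helper name, and it assumes the two decompressed parts and the reassembled file are in the current directory.

```shell
# check_concat OUT PART... (hypothetical helper)
# Prints "ok" if OUT is byte-identical to the parts concatenated in the
# given order, and "mismatch" otherwise.
check_concat() (
  out=$1
  shift
  if cat "$@" | cmp -s - "$out"; then
    echo "ok"
  else
    echo "mismatch"
  fi
)
```

For the files in this issue the call would be `check_concat corpus.fr-en.cooc corpus.fr-en.cooc.1 corpus.fr-en.cooc.2`.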

If you share the log, perhaps I can work out whether something is wrong with my installation.
