Can batch translation on CPU result in different output? #693

Open
robertBrnnn opened this issue Jan 19, 2022 · 9 comments
Labels: cpu (Issues related to CPU execution), question (Further information is requested)

Comments

@robertBrnnn

I have a CPU model that produces different outputs for the same strings at different times.

I think it could be related to the bug from #546, where batch translation yielded different results on GPU. I'm currently using CTranslate2 1.20.1, so there are a lot of updates I'm missing.

Alternatively, I recall that batch translation on GPU can produce slightly different numerical results, and I'm curious whether the same can happen with batch translation on CPU.

@guillaumekln (Collaborator)

Yes, the same string can have different outputs in batch translation on CPU.

I know this can happen with Intel MKL (default backend on Intel CPU) and oneDNN (default on AMD CPU). The numerical result of the dot product attention can be slightly different depending on the number of padding positions in the input.

If you are running on an Intel CPU, it is possible to work around this issue by enabling strict numerical reproducibility. Try setting this environment variable:

MKL_CBWR=AUTO,STRICT
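
For reference, a minimal sketch of one way to apply this from Python; the model directory and tokens below are placeholders, and exporting the variable in the shell before launching the process works just as well, as long as it happens before MKL is initialized:

```python
import os

# Must be set before CTranslate2 (and therefore MKL) is loaded.
os.environ["MKL_CBWR"] = "AUTO,STRICT"

import ctranslate2

translator = ctranslate2.Translator("ende_ctranslate2/", device="cpu")  # hypothetical model directory

# Pre-tokenized input; the actual tokenization depends on the model.
batch = [["▁Hello", "▁world", "!"], ["▁€", "18", ".", "10"]]
results = translator.translate_batch(batch)
```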

@robertBrnnn (Author)

I actually use AMD for deployment, which is unfortunate 😔

If I set CT2_USE_MKL=1 on an AMD CPU, will CTranslate2 use MKL?

With CT2_USE_MKL=1 and MKL_CBWR=AUTO,STRICT, I'm guessing results would be reproducible, with the caveat that it'll be slower because of how MKL handles AMD CPUs.

@guillaumekln (Collaborator)

If I set CT2_USE_MKL=1 on an AMD CPU, will CTranslate2 use MKL?

Yes.

I requested a similar flag in oneDNN, but they don't plan to implement it.
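
For what it's worth, a minimal sketch of the combination being discussed, assuming the variables only take effect if they are set before CTranslate2 (and MKL) is loaded:

```python
import os

# Force the MKL backend even on a non-Intel CPU, then request reproducible numerics.
os.environ["CT2_USE_MKL"] = "1"
os.environ["MKL_CBWR"] = "AUTO,STRICT"

import ctranslate2  # import only after the environment is set
```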

@guillaumekln added the cpu and question labels on Jan 21, 2022
@robertBrnnn (Author)

robertBrnnn commented Jan 24, 2022

I've run a couple of tests with AMD and Intel CPUs, and MKL_CBWR=AUTO,STRICT doesn't seem to work with either. I can get reproducible output from both Intel and AMD using MKL_CBWR=COMPATIBLE; surprisingly, the AMD CPUs perform much better than the Intel ones with this flag.
Are the ctranslate2 wheels built with MKL 2019 Update 3 or an earlier version? I'm guessing they're built against an earlier MKL version, since MKL_CBWR=AUTO,STRICT doesn't seem to work and that flag was only introduced in MKL 2019 Update 3.

I see the wheels are built using very recent versions now.
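
For completeness, a sketch of the configuration that gave reproducible output on both CPU families in these tests; as above, the variables need to be in place before the library initializes MKL:

```python
import os

# Reproducible across runs on both Intel and AMD in the tests above,
# at the cost of a slower code path.
os.environ["CT2_USE_MKL"] = "1"        # only needed on non-Intel CPUs, where oneDNN is the default
os.environ["MKL_CBWR"] = "COMPATIBLE"

import ctranslate2
```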

@guillaumekln (Collaborator)

According to the Intel documentation, MKL_CBWR=COMPATIBLE is indeed the only configuration that is supported on non-Intel CPUs:

Only the MKL_CBWR_COMPATIBLE option is supported on non-Intel CPUs.

which would explain why MKL_CBWR=AUTO,STRICT does not work on AMD. However, it should still work as expected on Intel. Can you double-check it was correctly set in your test on Intel?

Are the ctranslate2 wheels built with MKL 2019 Update 3 or an earlier version?

They use recent MKL versions. For example CTranslate2 1.20.1 wheels were already using Intel MKL 2021.2.

@robertBrnnn (Author)

This is the output with MKL_VERBOSE=1 set on an Intel CPU; CNR is being set to AUTO,STRICT:

MKL_VERBOSE SAXPBY(10,0x7ff56cb8eeb8,0x7ff56019f700,1,0x7ff56cb8eec0,0x7ff56019f700,1) 350ns CNR:AUTO,STRICT Dyn:1 FastMM:1 TID:0  NThr:4
MKL_VERBOSE ISAMAX(512,0x7ff56814c6c0,1) 276ns CNR:AUTO,STRICT Dyn:1 FastMM:1 TID:1  NThr:1
MKL_VERBOSE SGEMM(N,N,1,10,10,0x7ff56cb8ed38,0x7ff560005440,1,0x7ff54c013880,10,0x7ff56cb8ed40,0x7ff5600e91c0,1) 9.57us CNR:AUTO,STRICT Dyn:1 FastMM:1 TID:0  NThr:4
MKL_VERBOSE SAXPBY(10,0x7ff56cb8eeb8,0x7ff5600e91c0,1,0x7ff56cb8eec0,0x7ff5600e91c0,1) 194ns CNR:AUTO,STRICT Dyn:1 FastMM:1 TID:0  NThr:4
MKL_VERBOSE GEMM_S8U8S32(T,N,C,1536,20,512,0x7ff56d38f4c0,0x562f12bf7d40,512,0x7ff56d38f530,0x7ff5682860c0,512,0x7ff56d38f518,0x7ff56d38f4c8,0x7ff56818fd00,1536,0x562f14807000) 125.56us CNR:AUTO,STRICT Dyn:1 FastMM:1 TID:0  NThr:4
MKL_VERBOSE SGEMM_BATCH_STRIDED(T,N,2,1,64,0x7ff56d38f658,0x7ff56810f580,64,128,0x7ff5681edf00,64,64,0x7ff56d38f660,0x7ff568147ec0,2,2,160) 11.22us CNR:AUTO,STRICT Dyn:1 FastMM:1 TID:0  NThr:4
MKL_VERBOSE SGEMM_BATCH_STRIDED(N,N,64,1,2,0x7ff56d38f658,0x7ff56818fd00,64,128,0x7ff5681826c0,2,2,0x7ff56d38f660,0x7ff568147ec0,64,64,160) 12.36us CNR:AUTO,STRICT Dyn:1 FastMM:1 TID:0  NThr:4
MKL_VERBOSE SAXPBY(5120,0x7ff56cb8ec08,0x7ff5601c0300,1,0x7ff56cb8ec10,0x7ff5601c0300,1) 3.32us CNR:AUTO,STRICT Dyn:1 FastMM:1 TID:0  NThr:4
MKL_VERBOSE ISAMAX(512,0x7ff568160fc0,1) 328ns CNR:AUTO,STRICT Dyn:1 FastMM:1 TID:2  NThr:1

When it's set to COMPATIBLE on Intel, the output is consistent but significantly slower.

In one Intel doc I see:

Intel and Intel compatible CPUs have a few instructions, such as approximation instructions rcpps/rsqrtps, that may return different results. Setting the branch to MKL_CBWR_COMPATIBLE ensures that Intel® oneAPI Math Kernel Library does not use these instructions and forces a single Intel SSE2-only code path to be executed.

which seems to suggest that even STRICT CNR doesn't guarantee consistent results; only COMPATIBLE mode will.
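
A rough way to quantify the slowdown mentioned above is to time the same translation script under the different settings in separate processes (the CNR mode has to be fixed before MKL initializes, so it can't be toggled within one process). This is only a sketch; translate_batch.py is a hypothetical script that loads the model and translates a fixed test set:

```python
import os
import subprocess
import time

def timed_run(env_overrides):
    """Run the translation script once with extra environment variables and return the wall time."""
    env = dict(os.environ, **env_overrides)
    start = time.perf_counter()
    subprocess.run(["python", "translate_batch.py"], env=env, check=True)
    return time.perf_counter() - start

print("default:        ", timed_run({}))
print("CNR AUTO,STRICT:", timed_run({"MKL_CBWR": "AUTO,STRICT"}))
print("CNR COMPATIBLE: ", timed_run({"MKL_CBWR": "COMPATIBLE"}))
```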

@guillaumekln (Collaborator)

guillaumekln commented Jan 27, 2022

Thanks for the feedback. I have not seen a case where MKL_CBWR=AUTO,STRICT is not sufficient to get the same outputs on the same CPU. Not sure it matters, but are you running a vanilla or relative Transformer?

In any case, guaranteeing consistent results is generally hard. The easiest approach is to accept that translations can have slight variations, but I understand it is hard to explain that to end users.

Right now I'm not aware of another workaround without a performance penalty, but I will keep exploring.

@robertBrnnn (Author)

Thanks Guillaume.

It's a very small subset of content that experiences this with MKL_CBWR=AUTO,STRICT. We mainly notice it for short numeric strings like currency patterns, but there are some short phrases too.

it is hard to explain that to end users.

Definitely! The most noticeable issue we get is with currency strings: €18.10 might be translated to French as 18,10 € the first time and as 18h10 the next.

Not sure it matters, but are you running a vanilla or relative Transformer?

It's a vanilla Transformer.

In any case, guaranteeing consistent results is generally hard.

Yeah, I completely understand; it's not an easy thing to fix.

We've switched from batch to synchronous translation on CPU without much of a performance impact, if any, so I'm quite happy to stick with synchronous translation. We actually did the same for our GPU deployments previously. Consistent output is more of a priority for us, so we're willing to use synchronous translation over batch if it guarantees consistent results.
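
A small sketch of the kind of check this relies on: translate a batch, then translate each sentence alone, and compare the best hypotheses. The exact result format differs between CTranslate2 versions (a list of dicts per example in 1.x, TranslationResult objects later), so the helper below hedges for both; the model directory and tokens are placeholders.

```python
import ctranslate2

translator = ctranslate2.Translator("ende_ctranslate2/", device="cpu")  # hypothetical model directory

batch = [
    ["▁€", "18", ".", "10"],
    ["▁Hello", "▁world", "!"],
]

def best_tokens(result):
    # CTranslate2 1.x returns a list of hypothesis dicts per example; newer
    # versions return TranslationResult objects with a .hypotheses attribute.
    if hasattr(result, "hypotheses"):
        return result.hypotheses[0]
    return result[0]["tokens"]

# Batched translation: padding in the batch can change the numerics.
batched = [best_tokens(r) for r in translator.translate_batch(batch)]

# One sentence at a time: no padding, so the output does not depend on the rest of the batch.
single = [best_tokens(translator.translate_batch([tokens])[0]) for tokens in batch]

for b, s in zip(batched, single):
    print("match" if b == s else "MISMATCH", b, s)
```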

@guillaumekln pinned this issue on Mar 4, 2022