
argon2-opencl fails on CPU and MIC #5417

Open · solardiz opened this issue Jan 6, 2024 · 5 comments

@solardiz (Member) commented Jan 6, 2024

A known shortcoming/bug of the argon2-opencl format is that it fails self-test on CPU(-like) devices, as tested with the ancient Intel OpenCL and AMD APP SDK that we have on our online dev boxes and with the recent Intel OpenCL that @alainesp has on his laptop. We don't know exactly why; a guess is that this has something to do with our use of local memory.

The format works on most GPUs, the only exception identified so far being Intel HD Graphics, where it also fails.

The failures on CPUs and Intel GPU are FAILED (cmp_one(1)). The failure on MIC includes segfaults.
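For what it's worth, CPU OpenCL devices generally emulate local memory in ordinary global memory, so any assumption that local memory behaves like a GPU's on-chip scratchpad would be suspect there. A minimal sketch of the query that shows this for a given device, assuming a valid `cl_device_id dev` (the names here are illustrative, not our actual variables):

```c
#include <stdio.h>
#include <CL/cl.h>

/* Hedged sketch: report how the device implements "local" memory.
 * CPU implementations typically report CL_GLOBAL (emulated in RAM),
 * while discrete GPUs report CL_LOCAL (dedicated on-chip memory). */
static void report_local_mem(cl_device_id dev)
{
	cl_device_local_mem_type lm_type;
	cl_ulong lm_size;

	clGetDeviceInfo(dev, CL_DEVICE_LOCAL_MEM_TYPE, sizeof(lm_type), &lm_type, NULL);
	clGetDeviceInfo(dev, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(lm_size), &lm_size, NULL);

	printf("local mem: %s, %llu bytes\n",
	       lm_type == CL_LOCAL ? "dedicated (CL_LOCAL)" : "emulated (CL_GLOBAL)",
	       (unsigned long long)lm_size);
}
```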

solardiz added the bug label Jan 6, 2024
@solardiz (Member, Author) commented Jan 6, 2024

FWIW, the contents of `out` after the `pre_processing` kernel on Intel and AMD OpenCL on CPU match the GPU's (so they must be correct). On Intel HD Graphics, they don't match, so we seem to have/trigger a separate bug there.

So, not surprisingly, the main issue appears to be beyond pre-processing. This is consistent with this format already failing on CPUs before @alainesp moved the pre-processing from host to device.
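A comparison like this can be made by reading the buffer back to the host after the kernel and diffing the dumps between devices. A minimal sketch, assuming an in-order `cl_command_queue queue`, the `out` buffer as `cl_mem out_buf`, and its size `out_bytes` (these identifiers are placeholders, not the format's actual variables):

```c
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Hedged debugging sketch: read the device buffer back after the
 * pre_processing kernel and print it as hex, so dumps from runs on
 * different devices can be diffed. */
static void dump_buffer(cl_command_queue queue, cl_mem out_buf, size_t out_bytes)
{
	unsigned char *host = malloc(out_bytes);
	size_t i;

	if (!host)
		return;

	/* Blocking read on an in-order queue: the preceding kernel has
	 * completed by the time the data lands in host[]. */
	if (clEnqueueReadBuffer(queue, out_buf, CL_TRUE, 0, out_bytes,
	                        host, 0, NULL, NULL) == CL_SUCCESS) {
		for (i = 0; i < out_bytes; i++)
			printf("%02x%c", host[i], (i & 31) == 31 ? '\n' : ' ');
		printf("\n");
	}
	free(host);
}
```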

@solardiz (Member, Author) commented Jan 6, 2024

Overriding these didn't make a difference (still works on GPUs, fails on CPUs):

```c
#define upsample(a, b) (((ulong)(a) << 32) | (b))
#define mul_hi(a, b) ((ulong)(a) * (b) >> 32)
```
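(For context: these built-ins show up in the BlaMka-style multiply at the core of Argon2's compression function, which is presumably why they were worth ruling out. A rough sketch of that kind of use, not the exact kernel code:)

```c
/* Rough sketch (assumption, not the argon2-opencl kernel verbatim):
 * BlaMka computes x + y + 2 * (lo32(x) * lo32(y)), and the 64-bit
 * product of the two low halves is typically assembled with mul_hi()
 * for the high word and upsample() to glue the halves together. */
ulong blamka_f(ulong x, ulong y)
{
	uint xlo = (uint)x;
	uint ylo = (uint)y;

	/* upsample(hi, lo) == ((ulong)hi << 32) | lo */
	return x + y + 2 * upsample(mul_hi(xlo, ylo), xlo * ylo);
}
```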

@solardiz (Member, Author) commented Jan 6, 2024

With the below hack and `shmemSize` forced to 32 KiB, it still works on a GPU, but still fails on CPUs like before:

```diff
-       uint warp   = (get_local_id(1) * get_local_size(0) + get_local_id(0)) / THREADS_PER_LANE;
+       uint warp   = (get_global_id(1) * get_global_size(0) + get_global_id(0)) / THREADS_PER_LANE;
```

So the issue is probably not specific to the behavior of `get_local_*` on CPU.

@alainesp (Contributor) commented Jan 6, 2024

Maybe we should print a warning to the user when we detect a CPU or Intel GPU, in addition to the self-test failure? That would explain the situation a little more.

@solardiz (Member, Author) commented Apr 6, 2024

In #5420, @magnumripper shows a macOS system where the format works for the first few test vectors on HD Graphics (edit: specifically, on Intel(R) UHD Graphics 630), only failing at FAILED (cmp_one(10)).
