md5crypt-opencl on Intel Alder Lake GPU: CL_OUT_OF_RESOURCES (-5) error in opencl_md5crypt_fmt_plug.c:404 - Copy data back #5439

Open
cpatulea opened this issue Feb 22, 2024 · 3 comments

@cpatulea
Contributor

cpatulea commented Feb 22, 2024

Hi, I've got an Intel Alder Lake (N100) with an "Intel Xe (Gen 12.2) GPU" and I'm looking to use md5crypt-opencl on it.

I'm getting this error:

$ /snap/john-the-ripper/610/run/john --format=md5crypt-opencl shadow
Device 1: Intel(R) Graphics [0x46d1]
Using default input encoding: UTF-8
Loaded 1 password hash (md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL])
0: OpenCL CL_OUT_OF_RESOURCES (-5) error in opencl_md5crypt_fmt_plug.c:404 - Copy data back
Segmentation fault
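
For anyone collecting similar reports, the error line above can be picked apart mechanically. The following is a minimal Python sketch and not part of John the Ripper: `parse_opencl_error` is a hypothetical helper name, and the regular expression is inferred only from the single line quoted here, so other formats of the message may not match. The numeric codes are the standard values from the Khronos `cl.h` header.

```python
import re

# Subset of OpenCL status codes as defined in the Khronos cl.h header.
CL_ERROR_NAMES = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
    -3: "CL_COMPILER_NOT_AVAILABLE",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
}

# Assumption: pattern inferred from the one error line in this report.
ERROR_RE = re.compile(
    r"OpenCL (?P<name>CL_\w+) \((?P<code>-?\d+)\) error in "
    r"(?P<file>[\w.]+):(?P<line>\d+) - (?P<what>.+)"
)


def parse_opencl_error(text):
    """Return a dict describing the first OpenCL error in text, or None."""
    m = ERROR_RE.search(text)
    if m is None:
        return None
    code = int(m.group("code"))
    return {
        "name": m.group("name"),
        "code": code,
        # True when the printed name agrees with the cl.h value.
        "known": CL_ERROR_NAMES.get(code) == m.group("name"),
        "file": m.group("file"),
        "line": int(m.group("line")),
        "what": m.group("what").strip(),
    }


sample = ("0: OpenCL CL_OUT_OF_RESOURCES (-5) error in "
          "opencl_md5crypt_fmt_plug.c:404 - Copy data back")
print(parse_opencl_error(sample))
```

The "what" field is the step label printed by the host code ("Copy data back"), which is where the failure was detected.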

I searched for similar reports and also tried each of these settings:

GWS=64
GWS=524288

Neither changed the outcome (still CL_OUT_OF_RESOURCES).

Code link from error message:

BENCH_CLERROR(clEnqueueReadBuffer(queue[gpu_id], mem_out, CL_FALSE,

clinfo

Number of platforms                               1
  Platform Name                                   Intel(R) OpenCL HD Graphics
  Platform Vendor                                 Intel(R) Corporation
  Platform Version                                OpenCL 3.0
  Platform Profile                                FULL_PROFILE
...
  Platform Name                                   Intel(R) OpenCL HD Graphics
Number of devices                                 1
  Device Name                                     Intel(R) Graphics [0x46d1]
  Device Vendor                                   Intel(R) Corporation
  Device Vendor ID                                0x8086
  Device Version                                  OpenCL 3.0 NEO
  Device UUID                                     86800000-d146-0000-0000-000000000000
  Driver UUID                                     32322e34-332e-3234-3539-350000000000
...
  Driver Version                                  22.43.24595
  Device OpenCL C Version                         OpenCL C 1.2
...
  Latest conformance test passed                  v2022-04-22-00
  Device Type                                     GPU
  Device PCI bus info (KHR)                       PCI-E, 0000:00:02.0
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               24
  Max clock frequency                             750MHz
  Device IP (Intel)                               0xc0000 (0.192.0)
  Device ID (Intel)                               18129
  Slices (Intel)                                  1
  Sub-slices per slice (Intel)                    2
  EUs per sub-slice (Intel)                       16
  Threads per EU (Intel)                          7
  Feature capabilities (Intel)                    DP4A
  Device Partition                                (core)
    Max number of sub-devices                     0
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             512x512x512
  Max work group size                             512
  Preferred work group size multiple (device)     64
  Preferred work group size multiple (kernel)     64
  Max sub-groups per work group                   64
  Sub-group sizes (Intel)                         8, 16, 32
...
  Global memory size                              13229461504 (12.32GiB)
  Error Correction support                        No
  Max memory allocation                           4294959104 (4GiB)
  Unified memory for Host and Device              Yes
  Shared Virtual Memory (SVM) capabilities        (core)
    Coarse-grained buffer sharing                 Yes
    Fine-grained buffer sharing                   No
    Fine-grained system sharing                   No
    Atomics                                       No
  Unified Shared Memory (USM)                     (cl_intel_unified_shared_memory)
  Host USM capabilities (Intel)                   USM access, USM atomic access
  Device USM capabilities (Intel)                 USM access, USM atomic access
  Single-Device USM caps (Intel)                  USM access, USM atomic access
  Cross-Device USM caps (Intel)                   (n/a)
  Shared System USM caps (Intel)                  (n/a)
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
...

john --list=build-info

$ /snap/john-the-ripper/610/run/john --list=build-info
Version: 1.9.0-jumbo-1+bleeding-39db7dd63e 2023-09-20 17:02:33 -0300
Build: linux-gnu 64-bit x86_64 AVX2 AC OMP OPENCL
SIMD: AVX2, interleaving: MD4:3 MD5:3 SHA1:1 SHA256:1 SHA512:1
System-wide exec: /snap/john-the-ripper/current/run
System-wide home: /snap/john-the-ripper/current/run
Private home: ~/.john
CPU tests: AVX2
CPU fallback binary: john-avx-omp
OMP fallback binary: john-avx2
$JOHN is /snap/john-the-ripper/current/run/
Format interface version: 14
Max. number of reported tunable costs: 4
Rec file version: REC4
Charset file version: CHR3
CHARSET_MIN: 1 (0x01)
CHARSET_MAX: 255 (0xff)
CHARSET_LENGTH: 24
SALT_HASH_SIZE: 1048576
SINGLE_IDX_MAX: 2147483648
SINGLE_BUF_MAX: 4294967295
Effective limit: Number of salts vs. SingleMaxBufferSize
Max. Markov mode level: 400
Max. Markov mode password length: 30
gcc version: 11.4.0
GNU libc version: 2.35 (loaded: 2.36)
OpenCL headers version: 1.2
Crypto library: OpenSSL
OpenSSL library version: 030000020	(loaded: 0300000b0)
OpenSSL 3.0.2 15 Mar 2022	(loaded: OpenSSL 3.0.11 19 Sep 2023)
GMP library version: 6.2.1
File locking: fcntl()
fseek(): fseek
ftell(): ftell
fopen(): fopen
memmem(): System's
times(2) sysconf(_SC_CLK_TCK) is 100
Using times(2) for timers, resolution 10 ms
HR timer: clock_gettime(), latency 42 ns
Total physical host memory: 15770 MiB
Available physical host memory: 12074 MiB
Terminal locale string: en_US.UTF-8
Parsed terminal locale: UTF-8

Input file

root:$1$uMJfnnig$O6<snip>X1:16314:0:99999:7:::

I'm familiar with modifying and building code, so let me know if there's something I can try in the code.

@cpatulea changed the title from "md5crypt-opencl: CL_OUT_OF_RESOURCES (-5) error in opencl_md5crypt_fmt_plug.c:404 - Copy data back" to "md5crypt-opencl on Intel Alder Lake GPU: CL_OUT_OF_RESOURCES (-5) error in opencl_md5crypt_fmt_plug.c:404 - Copy data back" on Feb 22, 2024
@solardiz
Member

Hi @cpatulea. Thank you for reporting this. Can you try these:

john --format=md5crypt-opencl --test -v=5
john --format=phpass-opencl --test -v=5
john --format=pbkdf2-hmac-md5-opencl --test -v=5
john --format=md5crypt-opencl --skip-self-test shadow

@cpatulea
Contributor Author

john --format=md5crypt-opencl --test -v=5

$ /snap/john-the-ripper/610/run/john --format=md5crypt-opencl --test -v=5
initUnicode(UNICODE, RAW/RAW)
RAW -> RAW -> RAW
Device 1: Intel(R) Graphics [0x46d1]
Benchmarking: md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL]... Loaded 68 hashes with 33 different salts to test db from test vectors
Options used: -I opencl -cl-mad-enable -cl-std=CL1.2 -D__GPU__ -DDEVICE_INFO=34 -D__SIZEOF_HOST_SIZE_T__=8 -DDEV_VER_MAJOR=22 -DDEV_VER_MINOR=43 -D_OPENCL_COMPILER -DPLAINTEXT_LENGTH=15 $JOHN/opencl/cryptmd5_kernel.cl
binary size 340456
LWS=7 GWS=49 (7 blocks) 0: OpenCL CL_OUT_OF_RESOURCES (-5) error in opencl_md5crypt_fmt_plug.c:404 - Copy data back
Segmentation fault

john --format=phpass-opencl --test -v=5

$ /snap/john-the-ripper/610/run/john --format=phpass-opencl --test -v=5
initUnicode(UNICODE, RAW/RAW)
RAW -> RAW -> RAW
Device 1: Intel(R) Graphics [0x46d1]
Benchmarking: phpass-opencl ($P$9) [MD5 OpenCL 4x]... Loaded 49 hashes with 18 different salts to test db from test vectors
Options used: -I opencl -cl-mad-enable -cl-std=CL1.2 -D__GPU__ -DDEVICE_INFO=34 -D__SIZEOF_HOST_SIZE_T__=8 -DDEV_VER_MAJOR=22 -DDEV_VER_MINOR=43 -D_OPENCL_COMPILER -DV_WIDTH=4 -DPLAINTEXT_LENGTH=39 $JOHN/opencl/phpass_kernel.cl
binary size 124976
LWS=7 GWS=49 (7 blocks) PASS,
Test mask: ?a?a?l?u?d?d?s
Calculating best GWS for LWS=16; max. 100 ms single kernel invocation.
Raw speed figures including buffer transfers:
Tuning for iteration count of 2048 and password length 7
xfer: 5.052 us, crypt: 12.135 ms, xfer: 4.856 us
gws:       256    84314 c/s       84314 rounds/s   12.145 ms per crypt_all()!
xfer: 7.447 us, crypt: 18.871 ms, xfer: 5.802 us
gws:       512   108447 c/s      108447 rounds/s   18.884 ms per crypt_all()+
xfer: 10.364 us, crypt: 33.706 ms, xfer: 8.351 us
gws:      1024   121453 c/s      121453 rounds/s   33.724 ms per crypt_all()+
xfer: 21.041 us, crypt: 59.523 ms, xfer: 13.635 us
gws:      2048   137546 c/s      137546 rounds/s   59.558 ms per crypt_all()+
xfer: 31.614 us, crypt: 116.152 ms (exceeds 100 ms)
xfer: 15.937 us, crypt: 33.715 ms, xfer: 8.237 us
gws:      1024   121400 c/s      121400 rounds/s   33.739 ms per crypt_all()-
Calculating best LWS for GWS=2048
Testing LWS=16 GWS=2048 ... 238.094 ms+
Testing LWS=32 GWS=2048 ... 238.107 ms
Testing LWS=64 GWS=2048 ... 238.093 ms
Testing LWS=128 GWS=2048 ... 238.127 ms
Testing LWS=256 GWS=2048 ... 238.402 ms
Testing LWS=512 GWS=2048 ... 288.140 ms
Calculating best GWS for LWS=16; max. 200 ms single kernel invocation.
Raw speed figures including buffer transfers:
xfer: 6.875 us, crypt: 10.545 ms, xfer: 4.036 us
gws:       192    72754 c/s       72754 rounds/s   10.556 ms per crypt_all()!
xfer: 6.718 us, crypt: 12.438 ms, xfer: 4.725 us
gws:       384   123377 c/s      123377 rounds/s   12.449 ms per crypt_all()+
xfer: 8.645 us, crypt: 23.740 ms, xfer: 6.641 us
gws:       768   129316 c/s      129316 rounds/s   23.755 ms per crypt_all()+
xfer: 16.666 us, crypt: 44.440 ms, xfer: 10.786 us
gws:      1536   138167 c/s      138167 rounds/s   44.467 ms per crypt_all()+
xfer: 19.739 us, crypt: 85.624 ms, xfer: 24.266 us
gws:      3072   143436 c/s      143436 rounds/s   85.668 ms per crypt_all()+
xfer: 44.791 us, crypt: 168.012 ms, xfer: 58.454 us
gws:      6144   146185 c/s      146185 rounds/s  168.115 ms per crypt_all()+
xfer: 73.333 us, crypt: 332.766 ms (exceeds 200 ms)
xfer: 20.052 us, crypt: 85.626 ms, xfer: 20.820 us
gws:      3072   143438 c/s      143438 rounds/s   85.667 ms per crypt_all()-
LWS=16 GWS=6144 (384 blocks) DONE
Speed for cost 1 (iteration count) of 2048
Warning: "Many salts" test limited: 12/256
Many salts:	145996 c/s real, 145996 c/s virtual
Only one salt:	145996 c/s real, 145276 c/s virtual
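
As a reading aid for the tuning output above: the reported c/s figures appear consistent with GWS times the vector width (the `-DV_WIDTH=4` build option, "4x" in the format label) divided by the time per crypt_all(). This is an inference from the logged numbers, not something taken from the John the Ripper source, and `candidates_per_second` is a hypothetical helper name. A minimal Python sketch of the arithmetic:

```python
def candidates_per_second(gws, vector_width, crypt_all_ms):
    """Candidates hashed per second for one crypt_all() call."""
    return gws * vector_width / (crypt_all_ms / 1000.0)


# (GWS, ms per crypt_all(), reported c/s) copied from the phpass-opencl log.
rows = [
    (256, 12.145, 84314),
    (512, 18.884, 108447),
    (6144, 168.115, 146185),
]
V_WIDTH = 4  # from -DV_WIDTH=4 in the logged build options

for gws, ms, reported in rows:
    estimate = candidates_per_second(gws, V_WIDTH, ms)
    print(f"gws={gws:5d} estimate={estimate:9.0f} reported={reported}")
```

Small differences between the estimate and the reported value are expected because the log rounds the time to three decimal places.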

john --format=pbkdf2-hmac-md5-opencl --test -v=5

$ /snap/john-the-ripper/610/run/john --format=pbkdf2-hmac-md5-opencl --test -v=5
initUnicode(UNICODE, RAW/RAW)
RAW -> RAW -> RAW
Device 1: Intel(R) Graphics [0x46d1]
Benchmarking: PBKDF2-HMAC-MD5-opencl [PBKDF2-MD5 OpenCL 4x]... Loaded 20 hashes with 19 different salts to test db from test vectors
Options used: -I opencl -cl-mad-enable -cl-std=CL1.2 -D__GPU__ -DDEVICE_INFO=34 -D__SIZEOF_HOST_SIZE_T__=8 -DDEV_VER_MAJOR=22 -DDEV_VER_MINOR=43 -D_OPENCL_COMPILER -DHASH_LOOPS=333 -DOUTLEN=16 -DPLAINTEXT_LENGTH=64 -DV_WIDTH=4 $JOHN/opencl/pbkdf2_hmac_md5_kernel.cl
binary size 504112
LWS=7 GWS=49 (7 blocks) PASS,
Test mask: ?a?a?l?u?d?d?s
Calculating best GWS for LWS=16; max. 100 ms single kernel invocation.
Raw speed figures including buffer transfers:
Tuning for iterations of 1000 and password length 7
P xfer: 5.781 us, init: 145.312 us, loop: 3x3.691 ms, final: 10.052 us, res xfer: 4.222 us
gws:       256    90922 c/s   182025844 rounds/s   11.262 ms per crypt_all()!
P xfer: 8.750 us, init: 154.166 us, loop: 3x5.932 ms, final: 15.677 us, res xfer: 6.247 us
gws:       512   113664 c/s   227555328 rounds/s   18.017 ms per crypt_all()+
P xfer: 12.083 us, init: 180.156 us, loop: 3x10.390 ms, final: 36.458 us, res xfer: 7.585 us
gws:      1024   130155 c/s   260570310 rounds/s   31.469 ms per crypt_all()+
P xfer: 18.385 us, init: 325 us, loop: 3x18.268 ms, final: 54.427 us, res xfer: 12.175 us
gws:      2048   148067 c/s   296430134 rounds/s   55.326 ms per crypt_all()+
P xfer: 36.041 us, init: 637.291 us, loop: 3x35.554 ms, final: 106.927 us, res xfer: 34.276 us
gws:      4096   152137 c/s   304578274 rounds/s  107.691 ms per crypt_all()+
P xfer: 87.604 us, init: 1.125 ms, loop: 3x68.558 ms, final: 205.416 us, res xfer: 73.477 us
gws:      8192   157857 c/s   316029714 rounds/s  207.578 ms per crypt_all()+
P xfer: 259.843 us, init: 2.111 ms, loop: 3x136.129 ms (exceeds 100 ms)
P xfer: 46.927 us, init: 642.760 us, loop: 3x35.560 ms, final: 109.739 us, res xfer: 30.445 us
gws:      4096   152089 c/s   304482178 rounds/s  107.726 ms per crypt_all()-
Calculating best LWS for GWS=8192
Testing LWS=16 GWS=8192 ... 205.680 ms+
Testing LWS=32 GWS=8192 ... 205.680 ms
Testing LWS=64 GWS=8192 ... 205.679 ms
Testing LWS=128 GWS=8192 ... 205.702 ms
Testing LWS=256 GWS=8192 ... 205.743 ms
Testing LWS=512 GWS=8192 ... 217.335 ms
Calculating best GWS for LWS=16; max. 200 ms single kernel invocation.
Raw speed figures including buffer transfers:
P xfer: 5.416 us, init: 169.791 us, loop: 3x3.689 ms, final: 8.750 us, res xfer: 9.778 us
gws:       192    68051 c/s   136238102 rounds/s   11.285 ms per crypt_all()!
P xfer: 7.395 us, init: 154.218 us, loop: 3x5.924 ms, final: 17.864 us, res xfer: 5.451 us
gws:       384    85362 c/s   170894724 rounds/s   17.993 ms per crypt_all()+
P xfer: 11.510 us, init: 165.156 us, loop: 3x7.387 ms, final: 21.822 us, res xfer: 7.245 us
gws:       768   137070 c/s   274414140 rounds/s   22.411 ms per crypt_all()+
P xfer: 17.760 us, init: 298.906 us, loop: 3x13.659 ms, final: 42.343 us, res xfer: 12.499 us
gws:      1536   148296 c/s   296888592 rounds/s   41.430 ms per crypt_all()+
P xfer: 28.281 us, init: 488.020 us, loop: 3x26.237 ms, final: 83.177 us, res xfer: 18.874 us
gws:      3072   154590 c/s   309489180 rounds/s   79.487 ms per crypt_all()+
P xfer: 57.656 us, init: 841.458 us, loop: 3x51.375 ms, final: 155.781 us, res xfer: 56.939 us
gws:      6144   157996 c/s   316307992 rounds/s  155.547 ms per crypt_all()+
P xfer: 166.718 us, init: 1.643 ms, loop: 3x101.661 ms, final: 301.093 us, res xfer: 129.374 us
gws:     12288   159669 c/s   319657338 rounds/s  307.835 ms per crypt_all()+
P xfer: 594.479 us, init: 3.195 ms, loop: 3x202.240 ms (exceeds 200 ms)
P xfer: 59.062 us, init: 848.020 us, loop: 3x51.389 ms, final: 160.260 us, res xfer: 51.444 us
gws:      6144   157949 c/s   316213898 rounds/s  155.594 ms per crypt_all()-
LWS=16 GWS=12288 (768 blocks) DONE
Speed for cost 1 (iterations) of 1000 and 10000
Raw:	29170 c/s real, 29170 c/s virtual

john --format=md5crypt-opencl --skip-self-test shadow

$ /snap/john-the-ripper/610/run/john --format=md5crypt-opencl --skip-self-test shadow
Device 1: Intel(R) Graphics [0x46d1]
Using default input encoding: UTF-8
Loaded 1 password hash (md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL])
LWS=32 GWS=192 (6 blocks)
Proceeding with single, rules:Single
Press 'q' or Ctrl-C to abort, 'h' for help, almost any other key for status
0: OpenCL CL_OUT_OF_RESOURCES (-5) error in opencl_md5crypt_fmt_plug.c:404 - Copy data back
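
A note on the "LWS=... GWS=... (N blocks)" lines in these logs: the block count matches GWS divided by LWS, that is, the number of work-groups in a one-dimensional launch. The short Python sketch below only checks that relationship against the values printed in this thread; `work_groups` is a hypothetical helper name, not a John the Ripper function.

```python
def work_groups(gws, lws):
    """Number of work-groups ("blocks") for a 1-D launch."""
    if lws <= 0 or gws % lws != 0:
        raise ValueError("GWS must be a positive multiple of LWS")
    return gws // lws


# (LWS, GWS, blocks) triples copied from the logs in this thread.
reported = [(7, 49, 7), (32, 192, 6), (16, 6144, 384), (16, 12288, 768)]

for lws, gws, blocks in reported:
    print(f"LWS={lws} GWS={gws} blocks={work_groups(gws, lws)} "
          f"matches_log={work_groups(gws, lws) == blocks}")
```

The failing md5crypt-opencl runs used very small launches (49 and 192 work-items), so the error here does not look like a consequence of an oversized GWS.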

@solardiz
Member

Thanks @cpatulea. So the issue is specific to md5crypt-opencl. While we could possibly have a bug in there causing this, that format works just fine on many other devices, including on older Intel HD Graphics with older Intel OpenCL backend. So I don't see what we'd do about your report now, other than being aware of it.

Also, as you can see, these related formats are quite slow on this GPU. Even if you do get md5crypt-opencl working, its speed will probably be similar to what you're getting on the CPU cores, so you'll at most double the total speed by using both CPU and GPU at once (or less than double, especially if the total TDP limit kicks in). You can estimate this by benchmarking --format=phpass on the CPU and comparing the result to what you're getting on this GPU. md5crypt should scale similarly.
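
The estimate can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the GPU figure is the phpass-opencl result from the log above, the CPU figure is a made-up placeholder to be replaced with a real `--format=phpass` benchmark, and `efficiency` is a hypothetical factor standing in for any slowdown from the shared TDP limit.

```python
def combined_speedup(cpu_cps, gpu_cps, efficiency=1.0):
    """Speedup over CPU-only when CPU and GPU run at the same time.

    efficiency < 1.0 models both devices slowing down under a shared
    power limit (hypothetical; measure it rather than trusting a guess).
    """
    total = (cpu_cps + gpu_cps) * efficiency
    return total / cpu_cps


gpu_phpass = 145996  # c/s, "Many salts" figure from the phpass-opencl log
cpu_phpass = 145000  # c/s, placeholder only: substitute your own CPU benchmark

print(f"no throttling:  {combined_speedup(cpu_phpass, gpu_phpass):.2f}x")
print(f"20% throttling: {combined_speedup(cpu_phpass, gpu_phpass, 0.8):.2f}x")
```

When the two speeds are equal and nothing throttles, the result is exactly 2x, which is the "at most double" case; any throttling pulls it below that.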

OTOH, this isn't bad for a 6W, $50 CPU.
