[BUG] Silent failure on multi-threaded runs #614

Open
ivan-pi opened this issue Mar 9, 2024 · 1 comment

ivan-pi commented Mar 9, 2024

Describe the bug

likwid-pin appears to fail silently when using more than one thread: the command exits almost immediately and the application writes nothing to standard output.

To Reproduce

  • LIKWID command and/or API usage: $ likwid-pin -V 2 -c 0,1 ./albm

  • LIKWID version and download source (GitHub, FTP, package manager, ...): likwid-pin -- Version 5.3.0 (commit: 0123456789)

  • Operating system: Linux maxwell 5.15.0-100-generic #110~20.04.1-Ubuntu SMP

  • Does your application use libraries like MPI, OpenMP or Pthreads? Yes, OpenMP.

  • Are you using the MarkerAPI (CPU code instrumentation)? No.

To Reproduce with a LIKWID command

Please supply the output of the command with -V 3 added to the command:

(base) ivan@maxwell:~/lrz/rbfxlbm/build$ likwid-pin -V 3 -c 0,1 ./albm
DEBUG - [hwloc_init_cpuInfo:359] HWLOC CpuInfo Family 6 Model 167 Stepping 1 Vendor 0x0 Part 0x0 isIntel 1 numHWThreads 16 activeHWThreads 16
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 0 Thread 0 Core 0 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 8 Thread 1 Core 0 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 1 Thread 0 Core 1 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 9 Thread 1 Core 1 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 2 Thread 0 Core 2 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 10 Thread 1 Core 2 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 3 Thread 0 Core 3 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 11 Thread 1 Core 3 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 4 Thread 0 Core 4 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 12 Thread 1 Core 4 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 5 Thread 0 Core 5 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 13 Thread 1 Core 5 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 6 Thread 0 Core 6 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 14 Thread 1 Core 6 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 7 Thread 0 Core 7 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 15 Thread 1 Core 7 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_cacheTopology:798] HWLOC Cache Pool ID 0 Level 1 Size 49152 Threads 2
DEBUG - [hwloc_init_cacheTopology:798] HWLOC Cache Pool ID 1 Level 2 Size 524288 Threads 2
DEBUG - [hwloc_init_cacheTopology:798] HWLOC Cache Pool ID 2 Level 3 Size 16777216 Threads 16
DEBUG - [affinity_init:547] Affinity: Socket domains 1
DEBUG - [affinity_init:549] Affinity: CPU die domains 1
DEBUG - [affinity_init:554] Affinity: CPU cores per LLC 8
DEBUG - [affinity_init:557] Affinity: Cache domains 1
DEBUG - [affinity_init:561] Affinity: NUMA domains 1
DEBUG - [affinity_init:562] Affinity: All domains 5
DEBUG - [affinity_addNodeDomain:370] Affinity domain N: 16 HW threads on 8 cores
DEBUG - [affinity_addSocketDomain:401] Affinity domain S0: 16 HW threads on 8 cores
DEBUG - [affinity_addDieDomain:438] Affinity domain D0: 16 HW threads on 8 cores
DEBUG - [affinity_addCacheDomain:474] Affinity domain C0: 16 HW threads on 8 cores
DEBUG - [affinity_addMemoryDomain:504] Affinity domain M0: 16 HW threads on 8 cores
DEBUG - [create_lookups:290] T 0 T2C 0 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 1 T2C 1 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 2 T2C 2 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 3 T2C 3 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 4 T2C 4 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 5 T2C 5 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 6 T2C 6 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 7 T2C 7 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 8 T2C 0 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 9 T2C 1 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 10 T2C 2 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 11 T2C 3 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 12 T2C 4 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 13 T2C 5 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 14 T2C 6 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 15 T2C 7 T2S 0 T2D 0 T2LLC 0 T2M 0
Evaluated CPU string to CPUs: 0,1
Running: ./albm
Using 2 thread(s) (cpuset: 0x3)
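
The last log line reports cpuset: 0x3. Reading that value as a bitmask with one bit per hardware thread is an assumption about the log format, although it agrees with the "Evaluated CPU string to CPUs: 0,1" line above it. Under that assumption, a small sketch for decoding such a mask (the helper name cpus_from_mask is made up for illustration):

```python
def cpus_from_mask(mask):
    """Return the indices of the set bits in a cpuset bitmask, lowest first."""
    cpus = []
    bit = 0
    while mask:
        if mask & 1:
            cpus.append(bit)
        mask >>= 1
        bit += 1
    return cpus


print(cpus_from_mask(0x3))  # prints [0, 1]
print(cpus_from_mask(0x1))  # prints [0]
```

So the mask itself looks correct for -c 0,1, which suggests the CPU selection was parsed as intended and the problem lies after that step.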

In contrast, with a single thread I get:

...
Evaluated CPU string to CPUs: 0
[likwid-pin] Main PID -> hwthread 0 - OK
Running: ./albm
Using 1 thread(s) (cpuset: 0x1)
 num_steps =         1000
 tau / dt ratio =   2.0000000E-02
 CFL  =   0.6270693    
 U0   =   1.1547005E-02
 Mach =   2.0000000E-02
 Re   =    1000.000    
 Everything okay
       51486     1081185
 In assembly routine:
    n   =        51485
    nnz =      1081185
    rownnz_max =           21
    rhs_max =            9
 Attempting to allocate memory
 n =        51485 , nz =           21 , q =            9
 sysclock (s)    3.43853497505188     
 mlups    14.9729455041758     
 ompwtime (s)    3.43853306770325     
 mlups    14.9729538096440     
 Total time (s)   3.43853306770325     
 Collision time ratio   1.559326410374301E-002
 Streaming time ratio   0.984065665745844     

If I run the application directly, it works as expected:

(base) ivan@maxwell:~/lrz/rbfxlbm/build$ OMP_NUM_THREADS=2 ./albm
 num_steps =         1000
 tau / dt ratio =   2.0000000E-02
 CFL  =   0.6270693    
 U0   =   1.1547005E-02
 Mach =   2.0000000E-02
 Re   =    1000.000    
 Everything okay
       51486     1081185
 In assembly routine:
    n   =        51485
    nnz =      1081185
    rownnz_max =           21
    rhs_max =            9
 Attempting to allocate memory
 n =        51485 , nz =           21 , q =            9
 sysclock (s)    1.81032705307007     
 mlups    28.4396107920625     
 ompwtime (s)    1.81032490730286     
 mlups    28.4396445013620     
 Total time (s)   1.81032490730286     
 Collision time ratio   1.993282543925349E-002
 Streaming time ratio   0.979440022346742     
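
Because the two-thread run ends without any application output, it may help to record how the process ended and whether anything went to standard error. The following is a minimal diagnostic sketch, not part of LIKWID; run_and_report is a hypothetical helper, and the demo command is a stand-in so the script runs anywhere.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd, capture both output streams, and describe how it ended."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode < 0:
        # On POSIX, subprocess reports a child killed by signal N as -N.
        ending = "killed by signal %d" % -proc.returncode
    else:
        ending = "exit status %d" % proc.returncode
    return ending, proc.stdout, proc.stderr


if __name__ == "__main__":
    # Stand-in command that exits with status 3.
    demo = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(run_and_report(demo)[0])  # prints: exit status 3
```

To apply it to this report, pass ["likwid-pin", "-c", "0,1", "./albm"] as cmd. The status seen that way is the one likwid-pin returns, which may or may not reflect how ./albm itself terminated.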

ivan-pi added the bug label Mar 9, 2024

TomTheBear (Member) commented

Thanks for reporting. I have never seen such behavior.

Does it work with other applications and multiple threads? Are you using some computing library like TBB, Cilk+, SYCL, ...? If it is OpenMP, is it one of the common implementations (GCC, LLVM, Intel)?
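
One way to check the first question without the original application is a small stand-in program that starts a few threads and prints the CPUs each may run on. The sketch below is hypothetical and not taken from the thread: it uses plain Python threads rather than OpenMP, it is Linux-only, and whether likwid-pin pins these threads the same way it pins OpenMP threads is an assumption to verify, not a known result.

```python
import os
import threading


def report_affinity(num_threads=2):
    """Start num_threads threads and record the CPUs each one may run on."""
    results = {}
    lock = threading.Lock()

    def worker(index):
        # On Linux, pid 0 refers to the calling thread for sched_getaffinity.
        cpus = sorted(os.sched_getaffinity(0))
        with lock:
            results[index] = cpus

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    for index, cpus in sorted(report_affinity().items()):
        print("thread", index, "allowed CPUs:", cpus)
```

Running the script directly and then under likwid-pin -c 0,1 gives two listings to compare. If the second run also exits without output, the problem is probably not specific to the OpenMP runtime used by ./albm.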
