fixing two bugs which result in crashes on a 24-core a64fx (related to issue #599) #601

Open
wants to merge 1 commit into
base: master

jdomke
Contributor

@jdomke jdomke commented Feb 1, 2024

bug 1: numCachesPerSocket is 0, so numberOfDomains is calculated incorrectly, resulting in writes to unallocated memory.

bug 2: the 24-core version of the A64FX is essentially a 48-core part with some cores disabled, but the kernel does not number the core IDs consecutively, resulting in segfaults in LIKWID.

@@ -639,7 +641,7 @@ proc_init_nodeTopology(cpu_set_t cpuSet)
         if (NULL != (fp = fopen (bdata(file), "r")))
         {
             bstring src = bread ((bNread) fread, fp);
-            hwThreadPool[i].coreId = ownatoi(bdata(src));
+            hwThreadPool[i].coreId = (++last_coreid);
Member

Why use a simple counter instead of the content of the core_id file from sysfs? LIKWID historically used a lot of counters for HW threads, NUMA nodes and other topological entities, but this has proven to be error-prone. We changed a lot of code in the past to use counters less frequently and rely on OS info instead.

Contributor Author

Because core_id on that system is not 0, 1, 2, ..., 23; instead it has holes like 0, 2, 3, 4, 7, ..., 48, depending on which cores have been disabled in the CPU's "firmware". The issue is that some of the LIKWID code allocates arrays based on the number of cores and uses core_id as the index, so you get out-of-bounds accesses. I tracked this down with valgrind to see why LIKWID was segfaulting on that system. Usually the OS takes care of remapping core_id into a consecutive number range, but for some reason that kernel fix is not available for this system/CPU. I admit that it is a very rare bug...
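One way to reconcile the two positions in this thread (keep the OS-provided core_id, but never use it directly as an array index) is to derive a separate dense index from the sparse IDs. The following is only a minimal sketch under that assumption: `densify_core_ids` and its parameters are hypothetical names and are not part of LIKWID or of this patch, and the input is assumed to hold the core IDs of the hardware threads within one CMG.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of LIKWID.
 * os_ids[i] : core_id as reported by the OS for hardware thread i (left untouched)
 * dense[i]  : output, a consecutive index in 0..k-1, assigned in order of
 *             first appearance, where k is the number of distinct core IDs
 * Returns k. Threads sharing an OS core ID receive the same dense index. */
static size_t densify_core_ids(const int *os_ids, size_t n, int *dense)
{
    assert(n == 0 || (os_ids != NULL && dense != NULL));

    size_t distinct = 0;
    for (size_t i = 0; i < n; i++) {
        size_t j;
        /* Reuse the index of an earlier thread with the same OS core ID. */
        for (j = 0; j < i; j++) {
            if (os_ids[j] == os_ids[i]) {
                dense[i] = dense[j];
                break;
            }
        }
        /* No earlier match: this is a new core, hand out the next index. */
        if (j == i) {
            dense[i] = (int)distinct++;
        }
    }
    return distinct;
}
```

With such a mapping, arrays sized by the number of cores would be indexed by `dense[i]`, while the original core_id stays available for reporting. The quadratic search is acceptable only because the per-CMG thread count is small.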

Contributor Author

Sorry for my clumsy explanation, but just to give you an idea, here is how it actually looks on the system:

⇒  cat /sys/devices/system/cpu/cpu*/topology/core_id | sort -u
0
1
10
11
5
6
7
8

Instead of counting 0,...,5 within each CMG, you get IDs in the range 0,...,11 with missing numbers.
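The failure mode described here (an array sized by the core count, indexed by a core_id that can exceed that count) can be caught with a simple range check before any indexing happens. This is a hedged sketch only: `core_ids_fit_array` is a hypothetical function name, not existing LIKWID code, and the sample values in the usage note are the distinct IDs from the `sort -u` output above rather than a per-CMG listing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical check, not part of LIKWID: returns true only if every
 * core ID lies in [0, array_len), i.e. is safe to use as an index into
 * an array with array_len entries. */
static bool core_ids_fit_array(const int *core_ids, size_t n_ids, size_t array_len)
{
    assert(n_ids == 0 || core_ids != NULL);

    for (size_t i = 0; i < n_ids; i++) {
        if (core_ids[i] < 0 || (size_t)core_ids[i] >= array_len) {
            return false; /* this ID would index past the end of the array */
        }
    }
    return true;
}
```

For the eight distinct IDs listed above, a call with `array_len` set to 8 reports a problem because 10 and 11 are out of range, whereas consecutive IDs 0 to 5 checked against an array of 6 entries pass.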
