Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Temperature reporting fix for AMD Zen CPUs #1176

Draft
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

JohnsonGlenT
Copy link

Reports Package temperature across CPU dies for
AMD Zen CPU's. Temperature is now selected by
package die Tccd[X] for for temperature reporting
and ignores the Tctl reporting.

Tested and Validated with:
Ryzen 3600 on Artix Linux
Ryzen Threadripper 3970x on Arch Linux

Reconfigured temperature reporting for AMD Zen Architecture to reflect
how libsensors reports temperature

Need to test with additional AMD zen architectures namely ones with
multiple packages

Tested with: Ryzen 3600 on Artix Linux
Use list of k10temp sensor temperatures to populate temperatures
for each die package. Assuming that AMD Zen CPU cores are listed
by die.
Reports Package temperature across CPU dies for
AMD Zen CPU's. Temperature is now selected by
package die (Tccd[X]) for temperature reporting
and ignores the Tctl reporting.

Tested and Validated with:
Ryzen 3600
Ryzen Threadripper 3970x
linux/LibSensors.c Outdated Show resolved Hide resolved
data[i] = data[1];
/* Check AMD Zen CPUs packages, and mirror temps across packages */
if (coreTempCount != existingCPUs) {
double temp[coreTempCount];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is left partially uninitialized if data contains NAN values.

@BenBE BenBE added enhancement Extension or improvement to existing feature Linux 🐧 Linux related issues labels Jan 27, 2023
Moved conditional check for AMD Zen CPUs after HT/SMT check

Added additional conditional check for AMD Zen CPUs to ensure it wont be
triggered when no temps are reported

Fixed possible garbage value reporting if triggered by another process
Comment on lines +281 to +283

/* No further adjustments */
goto out;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be one level up, because the special-casing should end here regardless of a global temperature found or not.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the goto statement should stay here and a fallback conditional should be created for the case where coreTempCount is 1 and data[0] has the only value in it.

This check can be added into the current conditional statement, however, from looking at all the other conditional statements, we should have that moved out into its own check so that it remains a fallback and is not coupled with the section specified for AMD Zen CPU temperature reporting. (change idea added below)

I may be needlessly worried about a impossible or niche hardware/platform config.

Suggested change
/* No further adjustments */
goto out;
/* No further adjustments */
goto out;
}
/* AMD Zen CPU can report coreTempCount as 1 - use fallback conditional */
}
/* Fallback for single temperature count reporting */
if (coreTempCount == 1 && !isnan(data[0])) {
for (unsigned int i = 1; i <= existingCPUs; i++) {
data[i] = data[0];
/* No further adjustments */
goto out;
}

Created fallback for when only one temperature is actually reported.
Fixed static analyzers by setting initial value of temp[0]
@BenBE BenBE added this to the 3.3.0 milestone Feb 5, 2023
@CyberX5
Copy link

CyberX5 commented Feb 27, 2023

Hi, I compiled with these patches and it seems to not work correctly.

This is on a Ryzen 5900X.

Screenshot_2023-02-27_16-35-01

@JohnsonGlenT
Copy link
Author

JohnsonGlenT commented Mar 4, 2023

Hi, I compiled with these patches and it seems to not work correctly.

This is on a Ryzen 5900X.

Screenshot_2023-02-27_16-35-01

can you submit the output of sensors?

@CyberX5
Copy link

CyberX5 commented Mar 4, 2023

Oh sorry about that. I was thinking of including that initially.

sensors k10temp-pci-00c3

k10temp-pci-00c3
Adapter: PCI adapter
Tctl:         +41.4°C
Tccd1:        +41.2°C
Tccd2:        +38.2°C
sensors -u k10temp-pci-00c3

k10temp-pci-00c3
Adapter: PCI adapter
Tctl:
  temp1_input: 44.125
Tccd1:
  temp3_input: 37.750
Tccd2:
  temp4_input: 39.250

@JohnsonGlenT
Copy link
Author

@CyberX5
Can you run and upload the ryzen.log that https://github.com/JohnsonGlenT/htop/tree/ryzen5900X branch creates.

@C0rn3j
Copy link
Contributor

C0rn3j commented Mar 12, 2023

Just adding a test result (compiled your fresh logging branch)

7600X works great - top is current stable, bottom is your branch
image

@CyberX5
Copy link

CyberX5 commented Mar 12, 2023

@CyberX5 Can you run and upload the ryzen.log that https://github.com/JohnsonGlenT/htop/tree/ryzen5900X branch creates.

Sure 🙂 ryzen.log

@JohnsonGlenT
Copy link
Author

@CyberX5 What OS/kernel/Distro are you on?

From the log, your system seems to be reporting coreTempCount as 3. I am assuming it is counting the Tctl, which so far seems to be an issue not with your CPU, but it may be how htop is dealing with polling the k10temp sensors output on older kernels. If this issue will exist on major versions of the kernel, namely LTS, additional changes may need to be made.

@CyberX5
Copy link

CyberX5 commented Mar 14, 2023

@JohnsonGlenT Well, I do want to apologize it does seem to be an issue on my end, I'm on Arch Linux, 6.1 LTS Kernel, but I run a custom kernel config and I'm guessing that's the problem, now I do keep that in mind so I did test this on the stock Arch kernel before reporting, but must have messed something up...

ryzen-6.1.19-1-lts.log Here's the log from the stock Arch LTS kernel. CoreTempCount is 2 now.

Also, I have no idea what config option could be causing this... Haven't seen anything like this in any other program 🤷

Now, I have encountered a different issue.

Screenshot_2023-03-14_04-43-55

To my understanding the CCD layout of this CPU is 0-5,12-17 = CCD1 and 6-11,18-23 = CCD2.

lstopo for reference:
Screenshot_2023-03-14_04-58-08

I actually have the columns set to 2 for this reason, so visually 1 row would contain both threads of a core. For example thread 0 and 12 would be core 0.

Here's a command I found that lists sibling threads. Found here.

grep -H . /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | sort -n -t ':' -k 2 -u
/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0,12
/sys/devices/system/cpu/cpu1/topology/thread_siblings_list:1,13
/sys/devices/system/cpu/cpu2/topology/thread_siblings_list:2,14
/sys/devices/system/cpu/cpu3/topology/thread_siblings_list:3,15
/sys/devices/system/cpu/cpu4/topology/thread_siblings_list:4,16
/sys/devices/system/cpu/cpu5/topology/thread_siblings_list:5,17
/sys/devices/system/cpu/cpu6/topology/thread_siblings_list:6,18
/sys/devices/system/cpu/cpu7/topology/thread_siblings_list:7,19
/sys/devices/system/cpu/cpu8/topology/thread_siblings_list:8,20
/sys/devices/system/cpu/cpu9/topology/thread_siblings_list:9,21
/sys/devices/system/cpu/cpu10/topology/thread_siblings_list:10,22
/sys/devices/system/cpu/cpu11/topology/thread_siblings_list:11,23

And again sorry about that ❤️

@C0rn3j
Copy link
Contributor

C0rn3j commented Mar 14, 2023

My test log above with stable htop that is mostly broken is from I believe Arch Linux with stock 6.2.2 kernel.

@JohnsonGlenT
Copy link
Author

To my understanding the CCD layout of this CPU is 0-5,12-17 = CCD1 and 6-11,18-23 = CCD2.

lstopo for reference: Screenshot_2023-03-14_04-58-08

I actually have the columns set to 2 for this reason, so visually 1 row would contain both threads of a core. For example thread 0 and 12 would be core 0.

@CyberX5 This is likely a problem. However, if htop reports CPU based on logical numbering, not physical, then the current temperature readout is correct. I will take a look into this, and will update on my findings.

@CyberX5
Copy link

CyberX5 commented Mar 14, 2023

@CyberX5 This is likely a problem. However, if htop reports CPU based on logical numbering, not physical, then the current temperature readout is correct. I will take a look into this, and will update on my findings.

Ah I see, interesting, didn't think of that. 🙂

@JohnsonGlenT
Copy link
Author

@CyberX5 This is likely a problem. However, if htop reports CPU based on logical numbering, not physical, then the current temperature readout is correct. I will take a look into this, and will update on my findings.

Ah I see, interesting, didn't think of that. slightly_smiling_face

I have confirmed that htop reads CPU number by physical id. I will update when I have a fix that is ready for testing.

@BenBE BenBE marked this pull request as draft March 15, 2023 06:30
@JohnsonGlenT
Copy link
Author

JohnsonGlenT commented Mar 16, 2023

@CyberX5 This is likely a problem. However, if htop reports CPU based on logical numbering, not physical, then the current temperature readout is correct. I will take a look into this, and will update on my findings.

Ah I see, interesting, didn't think of that. slightly_smiling_face

I have confirmed that htop reads CPU number by physical id. I will update when I have a fix that is ready for testing.

siblist.txt

So, as of doing looking with other zen architectures (threadripper 3970X, file attached) there seems to be no uniform way to ensure mapping of the CPU thread siblings will receive the same temperature reading, except for polling the thread siblings on every update tick. In this case I think mapping the temp by logical process number is a good enough solution without possibly affecting other systems, or needing to overhaul CPU startup/update loop.

leahneukirchen added a commit to leahneukirchen/htop that referenced this pull request Dec 25, 2023
leahneukirchen added a commit to leahneukirchen/htop that referenced this pull request Dec 25, 2023
@BenBE BenBE modified the milestones: 3.3.0, 3.4.0 Dec 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Extension or improvement to existing feature Linux 🐧 Linux related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants