Remaining issues with RDNA3 and 0.5.2 (kernel 6.7) #255

Open
43615 opened this issue Jan 24, 2024 · 28 comments

Comments

@43615

43615 commented Jan 24, 2024

Note that I do have the kernel parameter amdgpu.ppfeaturemask=0xffffffff.

  • Fan speed always shows 0. Seems to be a global kernel/driver issue (sensors also shows 0).
  • Static fan control doesn't work. Can't tell if there's a cutoff due to the above. Curve works fine.
  • Power limit can't be raised above 350 W. Slider goes to 389, which is also wrong (this card has a default of 420).
@ilya-zlobintsev
Owner

  • Fan speed reading: not much can be done on LACT's side if it's a lower level reporting issue
  • Static fan speed: the static setting works by setting a curve with all of the points at the same speed. Are you sure the behaviour is different when you're using a custom curve? What I've previously seen during testing is that the GPU might have a point below which it turns off the fan regardless of settings, but once it crosses that point it starts using the configured speed. Maybe that's what is happening?
  • Power limit: this is a known issue on the kernel side, it's being worked on.

@43615
Author

43615 commented Jan 24, 2024

Glad to hear about the power limit, hopefully that's coming soon.
As for the fan control: It ramps up correctly when using the curve, but static speed doesn't seem to do anything at any level (even 100%). Seems like it doesn't change the speed at all. I might test it some more tomorrow, but it's hard to get accurate results due to the broken speed reading.

@ilya-zlobintsev
Owner

If you can manage to replicate the proper "static speed" behaviour using a curve (by having all of its points on the same speed), then please tell me what that curve looks like. The current implementation uses a single minimum temperature point and fills the rest with the maximum; it might behave differently if the curve is configured in some other way.

You can check the actual curve that's applied in:

cat /sys/class/drm/card*/device/gpu_od/fan_ctrl/fan_curve
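
For anyone who wants to try the "all points at the same speed" idea by hand, here is a minimal sketch that prints such a flat curve in the "index temp speed" form that fan_curve accepts. The function name `flat_fan_curve` and the temperature points are assumptions chosen for illustration; this is not LACT's actual implementation.

```bash
# Print a five-point flat curve, one "index temp speed" triple per line.
# The temperatures (40..80 C) are illustrative; adjust them to your card's
# OD_RANGE as reported by the fan_curve file.
flat_fan_curve() {
    speed=$1
    index=0
    for temp in 40 50 60 70 80; do
        echo "$index $temp $speed"
        index=$((index + 1))
    done
}
```

Each printed line would then need to be written to fan_curve separately, followed by a final "c" to commit.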

@43615
Author

43615 commented Jan 28, 2024

Sorry, didn't get around to this until now.
Apparently I can't actually get it to a custom speed at all! The curve does get applied judging by your command, but there doesn't seem to be any effect.
It only ramps up when hot (testing with a short benchmark) and nothing I change affects that behavior. Again, I can only test it by ear due to the broken readout.

@Dominik-Zehnter-17

I have a 7800XT and have tested controlling the fan curves. If you remain below a usage limit on the GPU, fan control does nothing. Only once you have higher usage (e.g. when booting up a game) does the GPU apply your fan curves. This seems to be a hardware/driver issue that LACT has no control over. Run a game in the background, then play with your fan curves; that's what worked for me.

@Nama

Nama commented Jan 30, 2024

On my XFX SPEEDSTER MERC 310 AMD Radeon RX 7900 XT the fan speed is correctly read.

Setting a static fan speed or a curve doesn't error, but won't work. Same with tuxclocker.

This works:

echo "0 36 20" | sudo tee /sys/class/drm/card0/device/gpu_od/fan_ctrl/fan_curve
echo "1 40 30" | sudo tee /sys/class/drm/card0/device/gpu_od/fan_ctrl/fan_curve
echo "2 45 35" | sudo tee /sys/class/drm/card0/device/gpu_od/fan_ctrl/fan_curve
echo "3 50 40" | sudo tee /sys/class/drm/card0/device/gpu_od/fan_ctrl/fan_curve
echo "4 55 45" | sudo tee /sys/class/drm/card0/device/gpu_od/fan_ctrl/fan_curve
echo "c" | sudo tee /sys/class/drm/card*/device/gpu_od/fan_ctrl/fan_curve
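
The same sequence (one write per point, then "c" to commit) can be wrapped in a small helper that validates the points before touching the file. This is only a sketch: `apply_fan_curve` is a hypothetical name, and it assumes the "index temp speed" format used in the commands above.

```bash
# Write curve points to a fan_curve file, then commit them with "c".
# Usage (as root), with the path depending on your card:
#   apply_fan_curve /sys/class/drm/card0/device/gpu_od/fan_ctrl/fan_curve \
#       "0 36 20" "1 40 30" "2 45 35" "3 50 40" "4 55 45"
apply_fan_curve() {
    curve_file=$1
    shift
    # Check every point first so a typo does not leave a half-written curve.
    for point in "$@"; do
        if ! printf '%s\n' "$point" | grep -Eq '^[0-9]+ [0-9]+ [0-9]+$'; then
            echo "invalid curve point: '$point'" >&2
            return 1
        fi
    done
    # Each point needs its own write.
    for point in "$@"; do
        echo "$point" > "$curve_file" || return 1
    done
    echo "c" > "$curve_file"
}
```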

But it's not possible to change the mode to manual for static fan speed:

# echo 1 > /sys/class/drm/card0/device/hwmon/hwmon2/pwm1_enable
# cat /sys/class/drm/card0/device/hwmon/hwmon2/pwm1_enable
2
# echo 100 > /sys/class/drm/card0/device/hwmon/hwmon2/pwm1
# dmesg
amdgpu: manual fan speed control should be enabled first

debug.tgz
Can't upload .tar files here, maybe change it to .tgz

PS: I had to chmod o+rw /var/run/lactd.sock to make the GUI connect to the daemon.

@ilya-zlobintsev
Owner

This works:

Setting a curve through lact should use exactly the same commands, can you check how the contents of fan_curve differ between setting it manually through these commands and setting the same curve in lact?

But it's not possible to change the mode to manual for static fan speed:

This is expected: the hwmon interface is read-only on RDNA3.

I had to chmod o+rw /var/run/lactd.sock to make the GUI connect to the daemon.

You should add your user's group to the start of admin_groups under daemon in /etc/lact/config.yaml.

@Nama

Nama commented Jan 31, 2024

nvm, can't get it working again
echoing:

# cat /sys/class/drm/card0/device/gpu_od/fan_ctrl/fan_curve
OD_FAN_CURVE:
0: 30C 55%
1: 40C 65%
2: 45C 70%
3: 50C 75%
4: 55C 80%
OD_RANGE:
FAN_CURVE(hotspot temp): 25C 100C
FAN_CURVE(fan speed): 15% 100%

LACT:

# cat /sys/class/drm/card0/device/gpu_od/fan_ctrl/fan_curve
OD_FAN_CURVE:
0: 40C 100%
1: 50C 100%
2: 60C 100%
3: 70C 100%
4: 80C 100%
OD_RANGE:
FAN_CURVE(hotspot temp): 25C 100C
FAN_CURVE(fan speed): 15% 100%
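
One way to rule out an out-of-range point as the cause is to check the curve against the OD_RANGE limits printed in the same file. The sketch below does that with awk; `check_fan_curve_range` is a hypothetical helper name, and it assumes the output format shown above.

```bash
# Read fan_curve contents on stdin and report curve points that fall outside
# the OD_RANGE limits printed in the same file. Exits non-zero if any do.
# Usage:
#   check_fan_curve_range < /sys/class/drm/card0/device/gpu_od/fan_ctrl/fan_curve
check_fan_curve_range() {
    awk '
        BEGIN { n = 0; bad = 0 }
        # Curve points look like "0: 30C 55%"; adding 0 drops the unit suffix.
        /^[0-9]+:/     { idx[n] = $1; temp[n] = $2 + 0; speed[n] = $3 + 0; n++ }
        # Range lines end with "<min> <max>".
        /hotspot temp/ { tmin = $(NF - 1) + 0; tmax = $NF + 0 }
        /fan speed/    { smin = $(NF - 1) + 0; smax = $NF + 0 }
        END {
            for (i = 0; i < n; i++) {
                if (temp[i] < tmin || temp[i] > tmax || speed[i] < smin || speed[i] > smax) {
                    print "out of range: " idx[i], temp[i] "C", speed[i] "%"
                    bad = 1
                }
            }
            exit bad
        }
    '
}
```

Both curves shown above pass this check, which suggests the problem is not a point outside the allowed range.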

I got it to spin up once a few days ago, but it was weird and didn't seem right...

This is expected, the hwmon interface is readonly on RDNA3

maaaaan, I thought everything's implemented on 6.7 >_<

@ilya-zlobintsev
Owner

maaaaan, I thought everything's implemented on 6.7 >_<

This isn't a missing feature; it's a change in how the GPU firmware works. You're supposed to use the new fan_curve and target temperature/speed interfaces instead.

@ilya-zlobintsev
Owner

There have been some updates regarding the power limit setting in kernel 6.7.3:

drm/amd/pm: update the power cap setting
drm/amd/pm: Fetch current power limit from FW

It's worth checking if that helps with the incorrect limit
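
To check whether the running kernel is at least 6.7.3, a version comparison along these lines may help. It is a sketch only: `kernel_at_least` is a hypothetical helper, and it relies on `sort -V`, which is available in GNU coreutils but not guaranteed everywhere.

```bash
# Succeed if the given kernel release (default: the running kernel) is at
# least the required version. Distro suffixes such as "-arch1-1" are dropped
# before comparing.
kernel_at_least() {
    required=$1
    current=${2:-$(uname -r)}
    current=${current%%-*}
    # With version sort, the smaller of the two versions comes first.
    [ "$(printf '%s\n%s\n' "$required" "$current" | sort -V | head -n 1)" = "$required" ]
}

# Example:
#   kernel_at_least 6.7.3 && echo "kernel includes 6.7.3 or later"
```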

@FerrumMaster

But there is a problem with OC in general. Enabling a static fan speed prevents OC settings from being saved; they reset back to stock.

OC-wise it is still a mess. While kernel 6.8 now allows you to set the right power limit, it applies it in a weird fashion that breaks clocking higher, so you get slower performance.

@misaligar

misaligar commented Feb 23, 2024

I have a 7900 XTX on Arch Linux. Power limit works but fan control doesn't. The system still turns the fan on/off at the card's built-in thresholds instead of following the custom curve I set up using the LACT GUI. I have tried both the curve and static. No changes to the fan speed at all. Any recommendations?

Note that the OC is enabled, system rebooted, and I do have the kernel parameter amdgpu.ppfeaturemask=0xffffffff
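
A quick way to confirm that the parameter actually reached the kernel is to look for it in the boot command line. The helper below is a minimal sketch with a hypothetical name, `ppfeaturemask_from_cmdline`; it only parses text and does not change anything.

```bash
# Print the amdgpu.ppfeaturemask value from a kernel command line, if present.
# With no argument, the running kernel's /proc/cmdline is used.
ppfeaturemask_from_cmdline() {
    cmdline=${1:-$(cat /proc/cmdline)}
    # Put each parameter on its own line, then keep only the value we want.
    printf '%s\n' "$cmdline" | tr ' ' '\n' | sed -n 's/^amdgpu\.ppfeaturemask=//p'
}

# Example:
#   ppfeaturemask_from_cmdline
```

Empty output means the parameter was not found on the command line.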

Debug file: LACT-sysfs-snapshot-20240223-193349.zip

@ilya-zlobintsev
Owner

System still turns on/off the fan at built-in card thresholds

Unfortunately there isn't anything you can do about this currently. It will use your custom settings after it crosses the threshold, but you cannot configure this threshold.

@misaligar

System still turns on/off the fan at built-in card thresholds

Unfortunately there isn't anything you can do about this currently. It will use your custom settings after it crosses the threshold, but you cannot configure this threshold.

Thanks for responding to my message. Does this mean LACT will never work for my card? Or is it something that can be fixed?

@ilya-zlobintsev
Owner

If the driver adds support for configuring this, then LACT will have an option for it.

@misaligar

misaligar commented Feb 24, 2024

Thank you again. I much appreciate you taking the time to reply. If you don't mind, one final question: what's the best way to find out if the driver will add support for configuring fan curves? Should I follow the Linux kernel updates?

I found the following which appears to be adding fan control support for RDNA3 cards. Not sure why mine still doesn't work though.

https://lore.kernel.org/lkml/CAPM=9txd+1FtqU-R_8Zr_UePUzu7QUWsDBV1syKBo16v_gx2XQ@mail.gmail.com/

Linux arch 6.7.6-arch1-1 #1 SMP PREEMPT_DYNAMIC Fri, 23 Feb 2024 16:31:48 +0000 x86_64 GNU/Linux

@ilya-zlobintsev
Owner

Fan control itself is supported, the card will use your custom fan speed settings, but only after a builtin threshold when the fan gets turned on - that's the part you cannot currently configure.

As for updates: kernel changelog will have info about it if something changes, you can also track these issues in amd's repo:
https://gitlab.freedesktop.org/drm/amd/-/issues/2406
https://gitlab.freedesktop.org/drm/amd/-/issues/2402

@misaligar

but only after a builtin threshold when the fan gets turned on

Oh, I see it now. That makes sense. I was wondering why the fan speed goes up and down randomly. So it does recognize my custom curve, but it's still tied to the built-in thresholds. Thanks for the explanation. I will check out the links you included.

@In-line
Contributor

In-line commented Mar 3, 2024

XFX merc 310 7900XTX
ArchLinux 6.7.6-zen-1-2

Any change to the fan settings results in an Input/Output error, after which settings can't be changed again. The following manual script works, although I would prefer the GUI to be fixed.

#!/usr/bin/bash

GPU_DEVICE="/sys/class/drm/card1/device"
GPU_SYSFS_FAN="$GPU_DEVICE/gpu_od/fan_ctrl"
GPU_SYSFS_HWMON="$GPU_DEVICE/hwmon/hwmon0"

POWER_LIMIT=402 # watts
GPU_FAN_CURVE="$GPU_SYSFS_FAN/fan_curve"
GPU_FAN_CURVE_0="0 30 15"
GPU_FAN_CURVE_1="1 40 30"
GPU_FAN_CURVE_2="2 50 60"
GPU_FAN_CURVE_3="3 60 70"
GPU_FAN_CURVE_4="4 75 100"

GPU_FAN_TARGET="$GPU_SYSFS_FAN/fan_target_temperature"
GPU_FAN_TARGET_TEMP="85"

echo "Setting fan curve"
echo "$GPU_FAN_CURVE_0" > "$GPU_FAN_CURVE"
echo "$GPU_FAN_CURVE_1" > "$GPU_FAN_CURVE"
echo "$GPU_FAN_CURVE_2" > "$GPU_FAN_CURVE"
echo "$GPU_FAN_CURVE_3" > "$GPU_FAN_CURVE"
echo "$GPU_FAN_CURVE_4" > "$GPU_FAN_CURVE"
echo "c" > "$GPU_FAN_CURVE"
echo "Committed fan curve"

echo "Setting power limit"
# power1_cap expects microwatts, hence the conversion from watts
echo "$((POWER_LIMIT * 1000000))" > "$GPU_SYSFS_HWMON/power1_cap"
echo "Committed power limit"

cat "$GPU_SYSFS_FAN/fan_curve"
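
Since the hwmon index differs between systems (hwmon0 here, hwmon2 in an earlier comment) and the driver may reject values above its reported maximum, the power limit step could be made a little more defensive. The following is a sketch under those assumptions; `watts_to_microwatts` and `set_power_cap` are hypothetical helper names, and the check relies on the `power1_cap_max` file in the same hwmon directory.

```bash
# Convert watts to the microwatt value expected by power1_cap.
watts_to_microwatts() {
    echo "$(($1 * 1000000))"
}

# Write a power cap after checking it against power1_cap_max.
# Usage (as root), letting the shell resolve the hwmon index:
#   for dir in /sys/class/drm/card1/device/hwmon/hwmon*; do
#       set_power_cap "$dir" 402
#   done
set_power_cap() {
    hwmon_dir=$1
    watts=$2
    cap=$(watts_to_microwatts "$watts")
    max=$(cat "$hwmon_dir/power1_cap_max") || return 1
    if [ "$cap" -gt "$max" ]; then
        echo "requested ${watts} W exceeds reported maximum of $((max / 1000000)) W" >&2
        return 1
    fi
    echo "$cap" > "$hwmon_dir/power1_cap"
}
```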

@In-line
Contributor

In-line commented Mar 3, 2024

After some debugging, the problematic line of code turns out to be the one below; ignoring the error here fixes the issue on the 7900 XTX. I would try to prepare a patch or workaround, but I'm not sure how this would impact RDNA2 or older cards.

// Reset the power profile mode for switching to/from manual performance level
self.daemon_client
    .set_power_profile_mode(&gpu_id, None)
    .context("Could not set default power profile mode")?;

@ilya-zlobintsev
Owner

@In-line could you post the full error that happens when you try to apply settings as well as your /etc/lact/config.yaml? The line you linked doesn't change anything fan related, but it does trigger a reapply of existing settings, so maybe it is trying to apply an invalid configuration.

@In-line
Contributor

In-line commented Mar 3, 2024

@ilya-zlobintsev Already fixed it myself in #279

@dinotheextinct

Uhm, I have a problem: regardless of which game I start with LACT, my GPU is stuck at 100% usage. The GPU clock is kind of "locked" around 2200 MHz and the voltage stays around 750 mV.

signal-2024-05-04-112043

Once I change ANY setting and apply it while the game is running that "lock" is lifted and the GPU seems to ignore any settings made with LACT.

LACT-sysfs-snapshot-20240504-113539.tar.gz

@In-line
Contributor

In-line commented May 4, 2024

@dinotheextinct More info please.

Kernel version, mesa version, LACT version, distribution, etc..

@dinotheextinct

Is the info not in the sysfs snapshot?

@dinotheextinct

Kernel 6.8.8-1-default
glxinfo | grep Mesa
client glx vendor string: Mesa Project and SGI
OpenGL core profile version string: 4.6 (Core Profile) Mesa 24.0.5
OpenGL version string: 4.6 (Compatibility Profile) Mesa 24.0.5
OpenGL ES profile version string: OpenGL ES 3.2 Mesa 24.0.5
LACT 0.5.4
Opensuse Tumbleweed

@In-line
Contributor

In-line commented May 4, 2024

@dinotheextinct You're using version 0.5.3 of LACT. The RX 7900 has known problems with it; update to the latest version. This is what I fetched from info.json in the sysfs snapshot:

{
  "initramfs_type": null,
  "system_info": {
    "amdgpu_overdrive_enabled": true,
    "commit": "d99cfdf",
    "kernel_version": "6.8.8-1-default",
    "profile": "release",
    "version": "0.5.3"
  }
}
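
For anyone who wants to check this themselves, the LACT version can be pulled out of an extracted info.json with a one-line sed filter. This is a sketch only: `lact_version_from_info` is a hypothetical name, the location of info.json inside the snapshot archive is not shown here, and it assumes the one-key-per-line layout above.

```bash
# Print the value of the "version" key from a LACT info.json file.
# The pattern requires a quote directly before "version", so the
# "kernel_version" key is not matched.
lact_version_from_info() {
    sed -n 's/.*"version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}

# Example, after extracting the snapshot archive:
#   lact_version_from_info path/to/info.json
```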

@dinotheextinct

Sorry, I just updated it. The issue is exactly the same after updating. I've attached a new sysfs snapshot:
LACT-sysfs-snapshot-20240504-120732.tar.gz
