New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Corsair Commander Pro occasional unexpected OS error #698
Comments
Can you add the output immediately preceding (i.e. current and previous two liquidctl executions) an error from |
From the linked gitlab issue, looks like the kernel driver returns an What is listed under the hwmon dir ( (This seems to be the upstream repo.) |
As is also reflected in the directory, currently there are 5 out of 6 possible fans connected and 1 out of 2 possble LED controllers. |
Fails with voltage this time. Looks like the device is getting confused for some reason, or the kernel driver is not sending something correctly? Perhaps the requests need to be spaced out a bit? Does it fail if you try with |
It might be wise to manually unload the kernel driver instead of trying |
So, as advised, I unloaded the driver with |
Have you noticed any unreasonable values being reported now? We don't check the specific (supposed) error flag that the kernel driver checks, so we might just be reporting garbage. On the other hand, it's also possible that the kernel driver check was resulting in false positives, and that the actual data was fine.
The direct access code paths in liquidctl have not been deprecated. Their use is discouraged when a hwmon driver is available, but not every system runs Linux and has the corresponding hwmon driver loaded. When a hwmon driver is available, there are generally a few advantages to using it:
The liquidctl direct access code path, on the other hand, generally supports more features. Footnotes
|
Now that you mention it, there actually are unreasonable values. I didn't check what was actually being reported, just if it errors out. And looking at the RPM graph, there is definitely something off. The purplish lines are all the fans connected to the Commander Pro and as you can see the RPM seems way off because the fans are obviously not spinning at 10'000 RPM for one singular second before returning (unless they actually intend to, but don't have enough time to ramp). The same goes for dipping down to 0 before going back up. Edit: It seems like the RPM spikes have actually stopped as soon as I terminated the watch command. Maybe they were caused by multiple applications trying to access the data? However, I restarted the command again and now it doesn't seem to be happening anymore? It's quite confusing for me at the moment. Also when looking at the voltage more closely I see a few outliers that might hint at issues, here are a few select ones:
Now I'm not familiar with the actual electronics, but it seems a bit unlikely to me that these voltages are actually correct. That would either cause the device to crash or take damage over time. But it kinda looks like the values for the rails might be getting mixed up from time to time as it seems like the values are shifted by one? Not sure, just throwing around ideas here. |
Yes, that makes a lot of sense, especially if any one of those applications/processes is using hidraw (i.e. direct access). These applications don't synchronize (like I said, liquidctl could to a lot better and at least synchronize across liquidctl instances, but other applications certainly wouldn't synchronize with liquidctl), and trample over each other when talking with the device, especially given how this specific protocol works. That said, similar race issues (either errors or unreasonable values as a result of those races) when only using hwmon would mean a bug in that driver. But given the protocol and a cursory glance at the driver's source code, I think that's unlikely. Any chance there was a non-hwmon process talking to the device all this time, even when your watch command was using hwmon? |
So after restarting and fully disabling CoolerControl to rule out any conflicts, I wasn't able to reproduce the error in three hours. From my understanding regarding the different discussions, this probably means, that it actually might be a CoolerControl issue since it seems to cause race issues trying to access the data. At least for me, that seems to be the only logical conclusion, since turning the CoolerControl service back on after three hours of |
I'm not sure it's that simple: when you first reported the issue, both Cooler Control and the watch command were both using the hwmon driver, and races are not supposed to be a problem when using it. I think the more likely scenarios are:
If either is confirmed to be the case here, then the fix should be done in the kernel driver. |
On your part, it would help to confirm the issue just using the kernel driver, that is, without the involvement of liquidctl, coolercontrol, or any other program of the sort. I would start by spamming reads of all hwmon attributes from the device in quick succession and/or from racing threads. |
I think I may have found an issue with the kernel driver:
With just (b) alone this means that any unexpected/spurious report from the device will, during a brief window, take the place of the actual report the driver is waiting for. But because of (a) and (c), it's also possible for a spurious input report to change what command will be sent to the device next, possibly explaining why the device seems to report some invalid commands. And this would happen as part of a data race with hidcore/usbhid (so it might not just replace the command, but instead change it in rather hard to predict ways). @aleksamagicka, do you think this makes sense? |
Yes...
... also yes, in case something arrives between Also, if I see correctly, just doing a The driver seems to be written without hidraw in mind. Unfortunately, the command outputs don't seem to return what the command was AFAICS, so parsing them in And it even says so in the header:
|
I'm glad to help diagnose. Could you perhaps lead me how to do that effectively? I have never touched the hwmon drivers directly, so I have no idea what would count as an effective spam in terms of their capabilities. |
I'll report what we found to the upstream repo you found.
Yes, but it seems that reports can be switched and commands corrupted even just through hwmon (as in the setup the user originally reported, where both coolercontrol and liquidctl were using hwmon). |
Yes, that's just what I'm pondering right now after reading through the thread again... |
It does lock it, but that doesn't prevent the device from sending a report anyway (for some reason) and hid-core from processing it and calling the That said, whether the device can ever send a spurious report without some interference from hidraw is an open question. But, given the little information we (and, probably, Marcus too) have about the device, it might not be wise to assume it can't ever do that... even if it may appear to be rare. |
True, but if it wants to act possessed without giving info as to why, probably not much we can do 😆 I'd still rather make static int ccp_raw_event(struct hid_device *hdev, struct hid_report *report, u8 *data, int size)
{
struct ccp_device *ccp = hid_get_drvdata(hdev);
/* only copy buffer when requested */
if (!completion_done(&ccp->wait_input_report)) {
memcpy(ccp->buffer, data, min(IN_BUFFER_SIZE, size));
complete_all(&ccp->wait_input_report);
}
return 0;
} ... possibly with a spinlock around, so that it only parses once after a reinit (as to what it's parsing - well...). Would be much easier if only the protocol repeated back the command ID. |
It might be enough to run a few simultaneous loops of: grep "" /sys/class/hwmon/hwmonX/*_input Where If we're on the right track, these simultaneous loops should stress the driver as much (or more) than liquidctl + cooler control were doing before. Regarding which loop to use, I'm not sure what will be best approach to reproduce the issue we're looking for. You can start with many instances of And make sure to have no liquidctl and cooler control instances running while running these tests. If you can't reproduce the issue with just grep loops, then maybe also try to see if two or more |
Indeed, the protocol doesn't give us much choice. But perhaps synchronizing with
I think this is also an improvement.
Hm, related to this, I forget what was the conclusion (only that the docs had some issue)... is |
Agreed.
From the docs:
IMO the first part applies to completing as well, since those functions also touch |
I followed your instruction and let ten instances of this bash script run for 20 minutes, however no errors so far. #!/bin/bash
set -e;
while true;
do grep "" /sys/class/hwmon/hwmon3/*_input;
done; I then added ten instances of However, as soon as I started the Also, there is another thing I remembered, CoolerControl does also have the ability to control the RGB. Since the hwmon driver does not support the RGB control, is it possible that there is a weird interaction where sending an RGB signal needs to use the liquidctl implementation while the fan speed related things use the hwmon driver? I tried confirming this theory by running instances of |
That sounds good...? If the device were sending something on its own it probably would have errored at some point if I'm not missing something.
I don't know what it does exactly, but perhaps it ran
Yes, that's how it works generally. Use hwmon if/where possible, fall back to direct access otherwise. The point being that devices are usually specific about what they are communicating, unlike this one. So, what can at least be improved in the driver:
|
Hello, developer of corsair-cpro driver here, In the driver, every access to buffer and the completion is enveloped in a mutex. There should be no races, when accessing only the hwmon attributes. (please correct me, if I am wrong) But there can be races when using hidraw. All RGB software is using this (since there is no kernel support for RGB yet) |
Except in
That's true, but I think we can reduce the possibility as much as possible. I've made a branch at my fork for the improvements I've noted earlier, and I'll announce when I have something to show.
Indeed, disabling hidraw would net no good, and we'd just get issue reports. |
Opened a PR upstream, please take a look. |
The commits from the PR I linked above have just been merged by Guenter upstream. I marked them with Fixes: so they should be propagated to earlier kernel versions. |
Describe the bug
When constantly monitoring the status of the Corsair Commander Pro, liquidctl produces an error saying “ERROR: Corsair Commander Pro: unexpected OS error: OSError(95, 'Operation not supported')”, I stumbled upon the issue because I use CoolerControl which constantly monitors the devices thus regularly logging errors due to this liquidctl error (issue). It is possible that this might be an issue with hwmon driver, but I thought reporting this issue level by level before reaching the kernel makes sense, in case anything else is causing this issue.
Commands executed
Output of all relevant commands with
--debug
flagAffected device
Corsair Commander Pro
Does your version of liquidctl support the device in question?
Yes, my version supports it
Operating system and version
Pop!_OS 22.04
Installation method
I upgraded the Ubuntu repo version with
sudo pip install liquidctl --upgrade
to ensure it's not an issue with the outdated version.Version of liquidctl
liquidctl v1.13.0 (Linux-6.8.0-76060800daily20240311-generic-x86_64-with-glibc2.35)
The text was updated successfully, but these errors were encountered: