Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dccutil 2.0 regression, sporadic errors (Display not found, unsupported feature code) #347

Open
fabian-thomas opened this issue Oct 30, 2023 · 27 comments

Comments

@fabian-thomas
Copy link

fabian-thomas commented Oct 30, 2023

First of all, thanks for this amazing tool! Much appreciated!

Unfortunately the upgrade to ddcutil 2.0 broke it to some part for me. I'm now running 2.0.0-1 on Arch Linux. Downgrading to 1.4.1-1 fixes all issues so it seems to only be related to the ddcutil package.

Issues I noticed with the new version:

  • Sometimes ddcutil getvcp 10 --display 1 works perfectly fine, but sometimes it gives me Display not found. In other cases I get Unsupported feature code or errors about the pci edid file not existing (Error opening "/sys/devices/pci0000:00/0000:00:02.0/edid", No such file or directory).
  • ddcutil detect prints phantom displays: Associated non-phantom display: bus /dev/i2c-10. But that seems to still be there when I downgrade ddcutil. But then it does not say it's a phantom display. But everything other than that works just fine. (Note that I have not seen such duplicated displays prior to version 2)

Feel free to give me any instructions on what information I could further provide or if not super hard, how to debug this issue myself. :)

Ah, in case that helps:
My monitor is connected via display port to an USB-C hub. Also, since it doesn't seem confidential, here is the output of ddcutil detect:

Invalid display
   I2C bus:  /dev/i2c-7
   DRM connector:           card1-eDP-1
   EDID synopsis:
      Mfg id:               AUO - UNK
      Model:
      Product code:         8429  (0x20ed)
      Serial number:
      Binary serial number: 0 (0x00000000)
      Manufacture year:     2018,  Week: 0
   DDC communication failed. (getvcp of feature x10 returned Error_Info[DDCRC_RETRIES in ddc_write_read_with_retry, causes: EREMOTEIO(10)])
   This is an eDP laptop display. Laptop displays do not support DDC/CI.

Phantom display
   Associated non-phantom display: bus /dev/i2c-10
   I2C bus:  /dev/i2c-8
   DRM connector:           card1-DP-1
   EDID synopsis:
      Mfg id:               MSI - Microstep
      Model:                Optix MAG27CQ
      Product code:         5218  (0x1462)
      Serial number:        0000000000001
      Binary serial number: 1 (0x00000001)
      Manufacture year:     2018,  Week: 13
   VCP version:         2.1

Display 2
   I2C bus:  /dev/i2c-10
   DRM connector:           card1-DP-7
   EDID synopsis:
      Mfg id:               MSI - Microstep
      Model:                Optix MAG27CQ
      Product code:         5218  (0x1462)
      Serial number:        0000000000001
      Binary serial number: 1 (0x00000001)
      Manufacture year:     2018,  Week: 13
   VCP version:         2.1
@rockowitz
Copy link
Owner

The intermittent communication errors suggest to me that there may be a problem with the new dynamic sleep algorithm. What happens if you run with option --disable-dynamic-sleep?

Secondarily, I suspect that the phantom display problem arises because because of something unexpected in /sys.
As a first step, please execute ddcutil environment --verbose and submit the output as an attachment.

Thank you!

@rockowitz rockowitz added the bug label Oct 30, 2023
@fabian-thomas
Copy link
Author

The intermittent communication errors suggest to me that there may be a problem with the new dynamic sleep algorithm. What happens if you run with option --disable-dynamic-sleep?

That does seem to fix the reliability issues.

Secondarily, I suspect that the phantom display problem arises because because of something unexpected in /sys. As a first step, please execute ddcutil environment --verbose and submit the output as an attachment.

Here you go: https://gist.github.com/fabian-thomas/3a4b5664a14fd75aea757f45e636dc15
There is one with --disable-dynamic-sleep and one without. The one without dynamic sleep got executed while brightness readings were unreliable.
But I did not see any phantom displays yet again. So maybe this is a non-issue. At least it is for me. But I can report back and create another environment file if they are back.

@rockowitz
Copy link
Owner

@fabian-thomas, @akdevservices

Debugging these intermittent failures remotely is likely going to be a pain, but let's give it a go.

Please invoke ddcutil with options --istats and ---enable-dynamic-sleep. Immediately after issuing a command that fails with what seems to be dynamic sleep errors, please send the following:

  • the command output
  • a copy of file $HOME/.cache/ddcutil/dsa

Thank you.

@rockowitz
Copy link
Owner

@fabian-thomas Re the phantom display issues:

  • What is the model of your hub?
  • "/sys/devices/pci0000:00/0000:00:02.0/edid" is a malformed sysfs node name. Please execute with option --trcfunc is_phantom_display and when you see that error message send me the output. Thanks.

@fabian-thomas
Copy link
Author

For the first:
https://gist.github.com/fabian-thomas/27d9a81d819d9648d6ebf735e879e4d2

I used:
ddcutil getvcp 10 --display 1 --istats --enable-dynamic-sleep > istats
Stderr was: Display not found

@fabian-thomas
Copy link
Author

For the second:

@fabian-thomas
Copy link
Author

I came up with an idea where this phantom display may come from, but I don't know much about the Linux display/dev stack, so it might be that this makes absolutely no sense:
I occassionally run both xorg and wayland at the same time, also today. I saw that wayland creates new devices for the same monitor. Maybe the phantom displays appear for that reason?

@rockowitz
Copy link
Owner

The phantom displays is a device driver issue. The problem is that for one of the displays connected to the dock, it appears as both a /dev/i2c device on the dock and on the laptop port to which the dock is connected.

See this bug report for an explanation by one of the i915 developers of why this happens and why it's hard to fix.

@fabian-thomas
Copy link
Author

Thanks for looking that up. I'm happy that this is only a cosmetic problem for me 😅

@rockowitz
Copy link
Owner

@fabian-thomas , @akdevservices Branch 2.0.2-dev contains changes to resolve or at least further diagnose the problems reported in this issue.

I suspect that the intermittent failures are the result of interaction between the dynamic sleep algorithm and existing special handling of the DDC Null Response, which some monitors abuse as a way to indicate an unsupported feature. As a first step, message "Maximum retries ... for DDC Null Response exceeded" is written to the terminal and the system log if the number of retries has been reduced for DDC Null Response. Do you see this message along with the intermittent failures?

As for phantom displays, the algorithm was changed for release 2.0, so I would expect to see more displays reported as phantom. The one problem I saw in your output is the valid display was numbered 2 instead of 1. This should now be corrected.

@fabian-thomas
Copy link
Author

Sorry for letting this wait so long. Forgot to do this on the weekend. I've now build from 2.0.2-dev and it seems like the detection is completely broken now. At least my monitor is not recognized anymore by ddcutil detect. Also with ddcutil detect --disable-dynamic-sleep.

@rockowitz
Copy link
Owner

Development option --f7 disables phantom display detection. Please submit the output of ddcutil detect --verbose ---disable-dynamic-sleep and ddcutil detect --verbose --disable-dynamic-sleep --f7. Also please submit recent entries in the system log. (Command journalctl -t ddcuti -b reports ddcutil entries for the current boot.) Thank you.

@fabian-thomas
Copy link
Author

fabian-thomas commented Nov 9, 2023

https://gist.github.com/fabian-thomas/27df948b345a57071b2506b0ff0b714c

Both outputs are identical, nevertheless I uploaded both. The journal seems to not have anything related to this detect call. Note that the errors in there come from invocations in the morning (9 am), and now its 8 pm for me.

@rockowitz
Copy link
Owner

Am I correctly inferring that ddcutil detect does not report the monitor attached to your hub?

If so, please execute ddcutil detect -v --f2 and submit the output as an attachment. Thank you.

@fabian-thomas
Copy link
Author

Yes, the docked monitor is never detected with the dev branch. Here is the output:
https://gist.github.com/fabian-thomas/c17fc52795957ed2144e35c18de3250b

@rockowitz
Copy link
Owner

I just pushed extensive changes to branch 2.0.2-dev. Please run your tests again against the updated 2.0.2-dev branch.

If errors persist, please execute ddcutil detect --verbose --f2 --trace ddc --trace ddcio --trace i2c. The amount of output will be insane, but hopefully I can find the needle in the haystack. What I can see is that the i915 driver is not using /sys in the way I would expect based on information from the i915 driver folks, but ddcutil should have been able to cope with that.

Thank you for your help diagnosing the bugs you've reported. Once this issue is resolved I should be able to push a 2.0.2 release, or at least a release candidate.

@fabian-thomas
Copy link
Author

It looks like you haven't pushed your changes. At least I can't fetch them. 😅

@rockowitz
Copy link
Owner

My bad. Done.

@fabian-thomas
Copy link
Author

Ok, so now I tried again with the old commit (a9a4ea1) and weirdly now everything works perfectly fine. That means, both detect and getvcp. Also, I see no phantom display anymore. I can't explain why. So currently everything is as stable as with ddcutil 1.

I did hibernate in between, but I've probably done that between the other two invocations too. Maybe unplugging the dock fixed something, but I've probably done that too before.

Here is the output with latest dev:
https://gist.github.com/fabian-thomas/51f146fc0fb498b24d90df9140e55f5b

It also seems to work perfectly fine.

@rockowitz
Copy link
Owner

Thank you for the quick test. Do keep me posted. If you encounter problems again, the output with just trace class DDC should be sufficient, i.e. ddcutil detect --verbose --f2 --trace ddc

@fabian-thomas
Copy link
Author

Weird, the problems are back this morning, after hibernating yesterday. Cant find the display in ddc detect.

Here are the logs with commit 71a6d25:
https://gist.github.com/fabian-thomas/e7f6e7535bb5f924a142b2116f87213e

@rockowitz
Copy link
Owner

rockowitz commented Nov 11, 2023

Reading the EDID is failing. This can be seen at approx line 677 and following, function i2c_get_raw_edid_by_fd() and i2c_get_edid_bytes_directly_using_ioctl(). There appears be a problem with your dock/hibernation/driver i915.

At first glance, I don't see any difference between the code for these functions between 1.4.1 and 2.0.2-dev, but I believe you when you say you didn't see the problem in 1.4.1. When the problem occurs, what do you get when you execute ddcutil detect --verbose --f2 --trace i2c?

@rockowitz
Copy link
Owner

rockowitz commented Nov 11, 2023

Also, when you hit the bug, what does get-edid -b 10|parse-edid show?

@fabian-thomas
Copy link
Author

It seems like above logs already included one with i2c tracing enabled. Nevertheless, here are more logs:
https://gist.github.com/fabian-thomas/c56491c16598f98967054a5f1a39dbb7

It looks like get-edid works perfectly fine, while ddcutil fails when reading the edid. Should I try to bisect this issue with git-bisect?

@rockowitz
Copy link
Owner

Yes, there have been changes in how the EDID is read, attempting to make it more efficient and robust. Apparently not.

The problem can be seen at lines 676 and following in your trace file, where the EDID read is entirely x00.

Background: There are variations in how the EDID is read. First, /sys is checked. Depending on how the video driver populates /sys, it may be possible to associate an EDID with a /dev/i2c bus.

If the EDID must explicitly be read from the monitor, there's a 2x2 matrix of possibilities:

The EDID can be read with the full stack normally used for reading from /dev/i2c devices. Alternatively, there's special code for simply reading the EDID, making it easier to insert special handling. Normally, the EDID is read using the dedicated EDID code. Alternatively if experimental option --f5 is specified, the standard stack is used.

Driver i2c-dev provides two interfaces, one using ioctl() and the second using a file io abstraction layered on top of it. The latter introduces the possibility of EBUSY errors, so the ioctl() interface is preferred. Depending on version, there's an incompatibility between the i2c-dev driver and the proprietary nvidia driver, which can be worked around using the file io interface. This is unfortunate, because there's a large amount of code that I'd rather eliminate, and because it introduces the need for option -force-slave-address. However for our testing it may be helpful. By default, the ioctl() interface is used. However, option --use-file-io causes the file io interface (write()/read()) to be used.

So when you encounter the problem, try executing the detect command with each of the 4 possibilities: no options, option --f5 only, option --use-file-io only , and both --f5 and --use-file-io. Along with these, use options --verbose --f2 --trcfile i2c_bus_core.c

If none of these consistently show success, then yes please try bisecting the source.

Thank you.

@fabian-thomas
Copy link
Author

I couldn't get the logging to work. None of the commands change the output. Both ddcutil detect --verbose --f2 --trcfile i2c_bus_core.c --use-file-io --f5 and ddcutil detect --verbose --f2 --trcfile i2c_bus_core.c result in the same output with 28b9cc4. Not sure what I did wrong.

But good news is that git bisect worked like a charm. The faulty commit seems to be 3808a0a:

3808a0ad55cb2315ca3c53c42e58b0049939fb5f is the first bad commit
commit 3808a0ad55cb2315ca3c53c42e58b0049939fb5f
Author: Sanford Rockowitz (/shared/home/rock/dot_gitconfig) <rockowitz@minsoft.com>
Date:   Thu Nov 2 09:08:54 2023 -0400

    rework i2c_check_bus()

    also:
    -  report "Is laptop display" instead of "Is eDP device" and "Is LVDS device"
    - add is_laptop_drm_connector_name()
    - iftest out is_laptop_drm_connector(), IS_EDP_DEVICE(), IS_LVDS_DEVICE()

 src/i2c/i2c_bus_core.c | 113 ++++++++++++++++++++++++++++++-------------------
 1 file changed, 69 insertions(+), 44 deletions(-)

I had a quick look through the changes but I can only guess what may have went wrong. Hopefully you have an idea. Otherwise I'll just revert parts of the commit and try to find the faulty part like that.

@rockowitz
Copy link
Owner

Commit 3808a0a is dated 11/2/2023. This recent commit is after the 2.0.0 release, so it can't be the cause of a bug that appeared in that release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants