Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Downloaded logs from OrangePi5 are not chronological when run on robot #1307

Open
crschardt opened this issue Apr 4, 2024 · 3 comments
Open
Labels
bug Something isn't working

Comments

@crschardt
Copy link
Contributor

crschardt commented Apr 4, 2024

Describe the bug
Many times, we ask for logs to help diagnose the problem, but the logs pulled from an OrangePi by journalctl have the results from all the boots mixed together like:

-- Boot be037651ff23461c9ad449dc23c6964d --
Jan 30 22:17:07 photonvision systemd[1]: Started Service that runs PhotonVision.
-- Boot b22d720b09b74c8ca77afdca8c176f36 --
Jan 30 22:17:08 photonvision systemd[1]: Started Service that runs PhotonVision.
-- Boot 49093dc146794ba0979239c037ee3acb --
Jan 30 22:17:09 photonvision java[656]: [2024-01-30 22:17:09] [General - Main] [INFO] Native libraries loaded.
-- Boot a32b0a81609f46dca824d8ff5fd5721c --
Jan 30 22:17:09 photonvision java[669]: [2024-01-30 22:17:09] [General - Main] [INFO] Native libraries loaded.
-- Boot 49093dc146794ba0979239c037ee3acb --
Jan 30 22:17:09 photonvision java[656]: [2024-01-30 22:17:09] [Camera - RknnDetectorJNI] [INFO] Successfully loaded shared object librga.so
Jan 30 22:17:09 photonvision java[656]: [2024-01-30 22:17:09] [Camera - RknnDetectorJNI] [INFO] Successfully loaded shared object librknnrt.so
Jan 30 22:17:09 photonvision java[656]: [2024-01-30 22:17:09] [Camera - RknnDetectorJNI] [INFO] Successfully loaded shared object librknn_jni.so
-- Boot 4c92eb2e580d4c0087e2db1f07524fc8 --
Jan 30 22:17:09 photonvision java[678]: [2024-01-30 22:17:09] [General - Main] [INFO] Native libraries loaded.
-- Boot a32b0a81609f46dca824d8ff5fd5721c --
Jan 30 22:17:09 photonvision java[669]: [2024-01-30 22:17:09] [Camera - RknnDetectorJNI] [INFO] Successfully loaded shared object librga.so

The entries above come from 5 different boots of the coprocessor. They also aren't necessarily in any chronological order, so when looking at them, you can't be sure that the earlier boots are listed first in the log output. The result is a tangled log that is nearly impossible to make sense of.

To Reproduce
Steps to reproduce the behavior:

  1. Install Photonvision on an OrangePi5 connected to a robot and without internet access
  2. Reboot the processor several times
  3. Download logs and observe that the entries are intermingled.

Proposed Root Cause
The OrangePi doesn't have a real-time clock (RTC), so it must query a network time provider via NTP every time it boots to get the correct time. If an NTP server isn't available, then it defaults to using the time that the file /var/lib/systemd/timesync/clock was last updated. In versions of systemd prior to 250, this file is only updated when the system gets a valid response from a timeserver. A coprocessor mounted on a robot never connects to a timeserver, so this file is never updated and every time it boots, it sets the clock to the same time. When journalctl outputs the logs, it sorts the entries by the timestamps, which are the same across multiple boots, so the entries get mixed together. Ubuntu 22.04 (jammy) comes with systemd version 249.11.

systemd version 250 includes SaveIntervalSec setting for timesyncd, which will update the clock file regularly regardless of it getting a timeserver connection. This provides a generally monotonic clock on any coprocessor, even it it never gets a valid time sync and lacks an RTC. Ubuntu versions from 23.04 include this feature.

systemd version 254 includes a fix that will order journal entries by boot ID if the system doesn't have a RTC. This version is available in Ubuntu 24.04 (pre-release as of 4/3/2024).

Additional context
sd-journal: find a smart way to order log lines from different boots if neither has a useful wallclock time.
journald: support ordering journal entries by boot ID if no RTC is available

@crschardt crschardt added the bug Something isn't working label Apr 4, 2024
@crschardt
Copy link
Contributor Author

crschardt commented Apr 4, 2024

Updating the version of Ubuntu will directly address this problem, but it may not be desirable to update to a very recent version until it has been thoroughly tested. In the meantime, we could keep track of the boots in the logs and that would allow us to disentangle them with a tool like this simple python script:

from pathlib import Path
from collections import defaultdict

TANGLED = R"C:\path\to\file\photonvision-journalctl.txt"

def detangle(tangled: Path): 
    logs = defaultdict(list)
    boot = "0"

    with tangled.open() as infile:
        for line in infile.readlines():
            if line.startswith("-- Boot"):
                boot = line.split(" ")[2]
            else: 
                logs[boot].append(line)
    
    fixed = tangled.with_name(tangled.stem + "_fixed.txt")
    print(f"Found {len(logs)} boots.")
    with fixed.open("w") as outfile:
        for boot, lines in logs.items():
            print(f"{boot} - {len(lines)}")
            outfile.write(f"\n-- Boot {boot} --\n")
            outfile.writelines(lines)

if __name__ == "__main__":
    detangle(Path(TANGLED))

@mcm001
Copy link
Contributor

mcm001 commented Apr 4, 2024

Are you sure those are actually unique boots? Is there any other cause that could make those happen?

@crschardt
Copy link
Contributor Author

Very reasonable questions. I've been researching it and I think that there is strong evidence that this is the cause, but I haven't had the time to definitively prove that it is the root cause, so I've updated my original report to call it a "Proposed Root Cause". Here is some additional support for suggested root cause.

Every time the system boots, it generates a random boot id which is stored at /proc/sys/kernel/random/boot_id. This is the value that journalctl uses in the logs. You can get a list of the boots in chronological order using journalctl --list-boots.

I had access to our robot today and downloaded logs and also ran journalctl --list-boots. Every boot returned by --list-boots is present in the downloaded logs. There was a small problem with a journal file being truncated, so I was only able to get a list of 21 boots, even though there were 84 total boots recorded in the full logs, but I don't think that discredits the evidence that these are unique boots.

The jumbled logs come from coprocessors on robots, but they don't happen for me on my coprocessor connected to my home router, since it has access to a timeserver every time it boots.

The sources I linked describe this exact same behavior for coprocessors that don't have access to a way to set and maintain the clock. They also provide fixes that should correct the problem if we upgrade to a distro with a newer version of systemd. I haven't come across any other explanation for this behavior. This isn't to say that there can't be one, but I think there is strong evidence pointing to the root cause that I described.

I have some ideas for tracking the boots in order so that we can straighten out the logs. I'll submit a PR once I have time to get something working.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants