Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Syncing stuck on XX% on one device, despite the synced files are all ok and checksums match on all devices involved. #9482

Open
digital-spinner opened this issue Mar 25, 2024 · 28 comments
Labels
bug A problem with current functionality, as opposed to missing functionality (enhancement) needs-triage New issues needed to be validated

Comments

@digital-spinner
Copy link

digital-spinner commented Mar 25, 2024

What happened?

I was syncing a new device (Windows fresh install) to my pool of devices, and synchronization went well but at the end on that new device Syncthing shows 26% syncing progress and it is completely stuck since 12h+. All other devices show status synced (up to date). The checksum of all files within multiple directories match, so it means that desired content is properly synced, but Syncthing says something else. I tried rescaning, restarting etc. nothing changed. It is just completely stuck.

In the log I see a lot of those entries related to the affected directories and device reporting stuck syncing:

[XXXXX] 2024/03/24 22:46:48 INFO: Puller (folder "adsfasdfasdf" (uuuuu-uuuuu), item "asdfasdfasdf"): syncing: no connected device has the required version of this file
.......................
[XXXXX] 2024/03/24 22:46:49 INFO: Folder "adsfasdfasdf" (uuuuu-uuuuu) isn't making sync progress - retrying in 2m10s.

I think the same thing happen to me we another device few hours before, but I was not paying sufficient attention and I just recreated sync pair and resynced one affected folder. But then data transfer was not completed and half of the files were missing on the newest device. Now I have 3 affected folders and a lot of more data which is actually synced.

Also what is worth to mention, there should be some kind of better error message maybe, because it is not possible to reattach missing source device because there is no more, plus only viable info can be found buried in log files.

image

More info on previous sync stuck problem: https://forum.syncthing.net/t/missing-half-of-the-photo-files-but-the-status-shows-100-completed-on-src-and-target/21841/3

Syncthing version

v1.27.4

Platform & operating system

Windows 11 PRO x64, Linux Ubuntu 22 x64

Browser version

Firefox 124.0.1 (64-bit)

Relevant log output

[XXXXX] 2024/03/24 22:46:48 INFO: Puller (folder "adsfasdfasdf" (uuuuu-uuuuu), item "wsugfuxtfc756q"): syncing: no connected device has the required version of this file
..........
..........
..........
[XXXXX] 2024/03/24 22:46:49 INFO: Folder "adsfasdfasdf" (uuuuu-uuuuu) isn't making sync progress - retrying in 2m10s.
@digital-spinner digital-spinner added bug A problem with current functionality, as opposed to missing functionality (enhancement) needs-triage New issues needed to be validated labels Mar 25, 2024
@digital-spinner
Copy link
Author

digital-spinner commented Mar 25, 2024

I have removed the sync directory from the Syncthing only from the newest device and readded it again with default settings except Alphabetic download order (pointing to already synced data folder, entering the desired directory ID) and after the scan Syncthing said that everything is synced OK.
Again I have checked the files checksums (calculated + verified on all additional devices added to my "master" node) and all the data match.

I guess there is some kind of a bug in Syncthing and it happen already twice for me. I will probably setup a VM for testing purposes and I will try to make another full sync just for my curiosity. If there is any advise what shall I check during that process, please let me know. One think which may be a culprit is that the one directory has approx 200_000 super small files and maybe because of that Syncthing got stuck also with other directories, IDK. This is super weird.

@digital-spinner
Copy link
Author

Shouldn't the Syncthing rescan the whole directory when it sees such error like: INFO: Puller (folder "adsfasdfasdf" (uuuuu-uuuuu), item "wsugfuxtfc756q"): syncing: no connected device has the required version of this file when there is no possibility that another "proper" source of data exist? I guess something got screwed with the Syncthing's database or there is some kind of logical issue in the code which triggers such state.

@digital-spinner
Copy link
Author

I have reproduced the issue for the 2nd time.
I created fresh VM with Windows 11 PRO x64 and left it for the night to synchronize with 2 devices from my pool.
The effect is the same - device C (this newest) got stuck with device A (the oldest one).
Please see attached pictures. What I see is that on picture A, the device A has a remote with the name of A but it should be C - WTF happen here?!?!?!

I will preserve the whole Syncthing configs with logs from A, B and C if anybody wants to ask for more details, maybe I will be able to dig through it.

Explanation:
A - The oldest one
B - New one I have added few days ago
C - Test device

What I see is that the newest device C got the same name as my device A somehow.
And probably this is why this thing got stuck.

Please tell me what should I look for in the logs to help fixing this issue.
I can even run that for the 4th time if it helps but it is boring ;-)

A

B

C

@digital-spinner
Copy link
Author

digital-spinner commented Mar 26, 2024

In the config file I have found such definition which points to DEVICE C, but the name of that DEVICE is WRONG
First you see device A PROPERLY, and than you can see device C with the same name as device A which is not true.

    <device id="IYZXYIA-XXXXXXXXXXXXXXXXXXXXXXXXXX" name="A" compression="metadata" introducer="false" skipIntroductionRemovals="false" introducedBy="">
        <address>dynamic</address>
        <paused>false</paused>
        <autoAcceptFolders>false</autoAcceptFolders>
        <maxSendKbps>0</maxSendKbps>
        <maxRecvKbps>0</maxRecvKbps>
        <maxRequestKiB>0</maxRequestKiB>
        <untrusted>false</untrusted>
        <remoteGUIPort>0</remoteGUIPort>
        <numConnections>0</numConnections>
    </device>
    <device id="MDJW2O6-XXXXXXXXXXXXXXXXXXXX" name="A" compression="metadata" introducer="false" skipIntroductionRemovals="false" introducedBy="">
        <address>dynamic</address>
        <paused>false</paused>
        <autoAcceptFolders>false</autoAcceptFolders>
        <maxSendKbps>0</maxSendKbps>
        <maxRecvKbps>0</maxRecvKbps>
        <maxRequestKiB>0</maxRequestKiB>
        <untrusted>false</untrusted>
        <remoteGUIPort>0</remoteGUIPort>
        <numConnections>0</numConnections>
    </device>

@digital-spinner
Copy link
Author

digital-spinner commented Mar 26, 2024

Interesting... Why "C" got a name of "A"?

  47380,68: [IYZXY] 2024/03/25 20:13:19 INFO: Established secure connection to MDJW2O6 at 192.168.100.111:22000-192.168.103.180:22000/tcp-server/TLS1.3-TLS_AES_128_GCM_SHA256/LAN-P10-5U02V35F9VLU0HIN9DG4KK1CAI
  47381,42: [IYZXY] 2024/03/25 20:13:19 INFO: Device MDJW2O6 client is "syncthing v1.27.4" named "TEST" at 192.168.100.111:22000-192.168.103.180:22000/tcp-server/TLS1.3-TLS_AES_128_GCM_SHA256/LAN-P10-5U02V35F9VLU0HIN9DG4KK1CAI
  47383,42: [IYZXY] 2024/03/25 20:13:46 INFO: Device MDJW2O6 folder "...... INDEX THING"...
  ....
  ....
  47559,62: [IYZXY] 2024/03/25 23:54:30 INFO: Lost primary connection to MDJW2O6 at 192.168.100.111:22000-192.168.103.180:22000/tcp-server/TLS1.3-TLS_AES_128_GCM_SHA256/LAN-P10-5U02V35F9VLU0HIN9DG4KK1CAI: Syncthing is being stopped (0 remain)
  47560,49: [IYZXY] 2024/03/25 23:54:30 INFO: Connection to MDJW2O6 at 192.168.100.111:22000-192.168.103.180:22000/tcp-server/TLS1.3-TLS_AES_128_GCM_SHA256/LAN-P10-5U02V35F9VLU0HIN9DG4KK1CAI closed: Syncthing is being stopped
  47582,42: [IYZXY] 2024/03/26 06:51:02 INFO: Device MDJW2O6-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX is "A" at [dynamic]
  47612,68: [IYZXY] 2024/03/26 06:51:04 INFO: Established secure connection to MDJW2O6 at 192.168.100.111:22000-192.168.103.180:22000/tcp-client/TLS1.3-TLS_AES_128_GCM_SHA256/LAN-P10-5U07A9SQA9SGOCGIL17P4SFECU
  47613,42: [IYZXY] 2024/03/26 06:51:04 INFO: Device MDJW2O6 client is "syncthing v1.27.4" named "TEST" at 192.168.100.111:22000-192.168.103.180:22000/tcp-client/TLS1.3-TLS_AES_128_GCM_SHA256/LAN-P10-5U07A9SQA9SGOCGIL17P4SFECU

@acolomb
Copy link
Member

acolomb commented Mar 26, 2024

The device name is whatever you put in there. If you leave it blank, it will copy the remote device's name as configured there, maybe that wasn't set right at the time?

Thank you very much for digging into this. It should prove really helpful to get logs if you can reproduce, although I understand it's tedious. Just maybe better to move it to the forum, as long as the bug isn't clearly identified. It helps keep a better overview in the issue tracker, for when a developer starts working on a fix. Thus we like to have the debugging sessions on the forum and just describe the final findings here.

@digital-spinner
Copy link
Author

digital-spinner commented Mar 26, 2024

IMO the bug is clearly described, isn't it?

Why my remote is getting stuck from one of devices point of view?
Why my main device sees the remote at the same name like it's own when other devices sees it properly?

I don't care about the names, I just care about sync stuck and trying to find something useful. The bug is about sync stuck, all other things are extras.

As a .net dev myself I like as much info as it can be provided in one place, no too look for it, so I did this way. But It looks like I have to start learning a GO to try to dig into some code and see if something catches my eyes or try to debug it.

@digital-spinner
Copy link
Author

Something is really wrong with my setup or IDK. I'm having the same situation usually with one client all the times, It is getting stuck even with very small transfers like few MiB. I'm going to get rid of it and replace with a fresh new config.

@digital-spinner
Copy link
Author

digital-spinner commented Apr 3, 2024

Again, the same story. I have added a new REMOTE which synced fine (in terms of files integrity), but some of the source devices (one Android, and one Windows 11 PRO) are reporting that there are out of sync files and syncing process pending on their GUIs. Any guidance how I can debug this thing? I can's see anything useful in the logs though, there is no errors, nothing, even I checked all the debug output options.

Worth to notice that this time I'm not seeing any log entries: "Puller isn't making sync progress", and the names of the devices seems to be ok (I have recreated one device from scratch - the one which was failing me before, but yet again the new one fails!).

What is super strange, that list of 629 out of sync items, shows me that the MOST of the files were modified by some other device which is nowhere to be found or it was not involved to any activities during the synchronization. I had such device before (IYZXYIA - please see previous experiment with sync) but it is no more (I removed it and it is completely gone along with the whole VM), why it is shown here as a "Mod. Device" in that list? And the rest of those files should not be marked as modified. Anyway that list is a lie, because as I wrote everything is already synced which I verified by checksum calculations.

Please advise.

image

@w4grfw
Copy link

w4grfw commented Apr 4, 2024

Same here, Syncthing v1.27.5, Linux (64-bit Intel/AMD):
Device A (send only folder) → Device B (receive only folder)
(Browser-)GUI on says syncing (99%, 1 item out of sync), (browser-)GUI on B says everything up to date.
Complained item exits on both devices with same sha512sum.

@JulienPalard
Copy link

JulienPalard commented Apr 5, 2024

Same here:

  • From my device I see : Éléments non synchronisés 2 élément(s), ~9,63 MiB
  • From the other device I see it's 100% in sync.

I checked the sha1sum of the two elements, they are the same.

Also I did not modified those files since like 2018, so that's strange seeing this popping right now.

My device is running syncthing 1.27.2~ds4-1 (on Debian Trixie), the other one is running syncthing 1.19.2~ds1-1+b4 (on Debian bookworm).

I grepped the out of sync files in the logs: nothing:

$ journalctl --user -u syncthing --grep sh3d
-- No entries --

(yes the two "out of sync" files are sh3d files).

In my side I have a .stignore containing:

(?d).venv
.envrc

in the remote there's no .stignore file.

I don't know where to gather more info from here.

@imsodin
Copy link
Member

imsodin commented Apr 5, 2024

My device is running syncthing 1.27.2~ds4-1 (on Debian Trixie), the other one is running syncthing 1.19.2~ds1-1+b4 (on Debian bookworm).

While your issue might be the same one as otherwise reported here, there a lot of changes and some very relevant to components involved here. Generally involving multiple versions increases the problem space a lot and we don't support/troubleshoot those. If you want support/fixes for distro versions, please report to your distros bug tracker.

As to the overall issue: We are very aware that there is a problem with "some index information missing", it's also not new. Unfortunately a way to reproduce it reliably and independently has yet to come up. There's no lack of looking into it, multiple maintainers did multiple times. I know this doesn't help anyone and isn't a satisfactory update, unfortunately it's the only one I have.

@digital-spinner
Copy link
Author

@imsodin because I think I can easily reproduce this issue (got it 4 times in a row) I guess I have to learn the GO and how to debug it (I'm a .NET developer and have no knowledge or any experience with the GO, but at least I may try to do so). Can you give me some starting points where should I look first? Or where I can learn how effectively debug the GO please? If I can have some kind of conditional break point thing here than I think that won't be so difficult maybe - at least I learn something new.

@edi9999
Copy link

edi9999 commented Apr 5, 2024

I think I also have this issue now.

I have a script that tells me when a last successful 100% sync was finished, and it tells me it was 1 day, 6 hours, 9 minutes ago.

Usually a full sync is done every 5 minutes or so.

So it likely means that in a recent change, something got broken.

Both of my servers run syncthing version v1.27.5

@imsodin
Copy link
Member

imsodin commented Apr 6, 2024

The latest RC has a potential fix for this/someting related: https://github.com/syncthing/syncthing/releases/tag/v1.27.6-rc.1
See #9498 resp. https://forum.syncthing.net/t/some-deletions-and-modifications-getting-stuck-in-sync-since-v1-27-5/21893

@digital-spinner I'd suggest to try the RC to see if you can still reprodcue. Generally for this "kind of reproducer" (run in a large production system), I think logging is more likely to be helpful than a debugger (there's of course a debugger for go too). As for where to look: That's problematic, because it can be basically anywhere given how elusive it is. The forum thread linked above gives a good example: indexhandler.go is where the additional logging went, as that's a central piece of functionality to exchange index information. The potential fix is in the DB. However it might also concievable be in lib/protocol or many places in lib/model. The hard thing here is not go, it's the domain (I went from go to .NET, went fine, I think the other way around should be even easier).

@edi9999
Copy link

edi9999 commented Apr 9, 2024

I have just updated my instances to 1.27.6 and it works well now !

@digital-spinner
Copy link
Author

@imsodin thank you, I'm going to test it asap, maybe even today at night, because my sync takes few hours and I need to setup fresh vm for testing on my homelab.

@digital-spinner
Copy link
Author

I saw there is a 1.27.6 version update in place already. I have upgraded my devices accordingly and.... No luck.
I resycned whole device worth of 600 GiB of data and got ~31k items out of sync. I have not tested the checksums but I'm also investigating another new issue like wrong directory names on that one target device.

IDK what is going on..

Also there is no info in the log that puller is not getting any files.. but it should be.

@digital-spinner
Copy link
Author

I think during last synchronization it started to pouring files into a new directories with names of folder ID instead of the name it should, still IDK what is going on here, but I'm going to run it again.

@digital-spinner
Copy link
Author

Another run and it just stopped pumping - view from the device that I'm resyncing. Other deices also report no active transfers.
It just got stuck but in different way then previously.

I have a lot of those in the log..
2024-04-11 20:12:38 Connection from XXXXXXXXXXXXXXX at XXXXXXXX:22000 (tcp-server) rejected: already connected to this device

image

Update: it looks like it started to sync again.. Well let me see what it does.

@digital-spinner
Copy link
Author

digital-spinner commented Apr 11, 2024

Now it went to a crawl speed, when usually it was not respecting the speed limits going the first time. CPU usage is almost nonexistent. There are no new log entries despite those that it finished initial scans and some other sharing chit-chat. It is maxing at 6 MiB/sec almost doing nothing. It should go full speed but it is not. I leave it here. I have a backup so I don't care if it damages something.

I wonder if I should recreate all my network from scratch if this kind of things started happening.

.... restart from UI crashed / stalled it completely, I restarted the client device and it is going on again at full speed (ignoring speed limits).

.... (few minutes later) my linux machine gets disconnected often which was not happening with previous version and Syncthing is randomly stuck on sync also (some of those ends up of disconnection of everything with that machine (xrdp, ssh dies))..

image

I'm going to create a fresh Windows machine for testing I guess on different env. So far I think it is much worse than it was before with older version. I never had a connection issues of any kind before.

@digital-spinner
Copy link
Author

I was super annoyed with the issues I got as described and I have recreated the env with fewer devices with fresh 3 instances of Syncthing (I have wiped the configuration directories) with the same payload about 600 GiB of data and guess what?

Again, the synchronization got stuck. But this time at least it didn't broke my connection (not once) with the 3rd device (which is actually the same as previously but with wiped Syncthing configuration).

So no, for me issue still exist as it was before. It super annoys me but I have no time currently to dig into GO LANG world to see what is going on by myself. I guess having the smaller payload will not generate this issue so maybe many people can live with it, but apparently not me as it looks like.

@digital-spinner
Copy link
Author

I did some more testing with fresh instances. What I have noticed the Syncthing is more stable and not causing so many syncing issues when I decreased the devices performance - testing on the VM I just simply decreased the cores amount just to 2 and RAM to 2GB on linux machine and it got synced easier (just one resetup / resync was needed instead of many previously with 6 CPU cores). Another thing is that if the syncing directories are smaller I guess it is also more stable.

Anyway the issue persist and it is worse than before IMO.

@lanmarc77
Copy link

lanmarc77 commented Apr 25, 2024

It seems I have the same issue. Only one of my 3 devices claims syncing issues. Still the files were transferred correctly.
I figured out that when I set the modification date of one of those not synced file to +1s it does sync that file correctly and removes it from the not synced state. So writing a script which sets all modification timestamps +1s wait for sync and then -1s to stick with the original should for files mitigate the issue.
Still this does not work for directories. Once all files were "timestamp synced" correctly the directories are stuck. Even if I change the modification timestamp of such a directory it does not get updated.
All my devices are Linux.

@bgoodmansf
Copy link

I think I'm seeing the same issue. 3 devices (2 Linux, 1 Windows 11) all running V.1.27.6. Each device says it is synced and up to date, but one device constantly shows another device as "Syncing (50%, 8.22GiB)".

I observe that the description of the out of sync remote says:
"Out of Sync Items 2,651 items, ~8.22GiB"

If I click on the items link, the "Out of Sync Items" window is empty. Perhaps this is a clue.

@FlavioCestMoi
Copy link

I've noticed the same issue: although all folders appear to be fully synchronized, not all remote devices are (version 1.27.6 for all devices). On my Raspberry Pi (running Linux 64-bit ARM), configured as a "Receive encrypted" folder type, problems arise between this Raspberry Pi and other devices (Windows 11/Android/Linux), as well as from other devices to this Raspberry Pi. However, other devices do not encounter this issue. It seems that the "Receive encrypted" mode is responsible for this malfunction, as I haven't noticed any problems in my case when this mode isn't activated. Additionally, the number of unsynchronized files between device A and the Raspberry Pi differs from that between the Raspberry Pi and device A, and all encounter connection problems with the Raspberry Pi and vice versa.

@bgoodmansf
Copy link

I will add that doing a rescan on the folder or restarting Syncthing does not help. However, running Syncthing with the "--reset-database" parameter clears the error.

@AFulgens
Copy link

AFulgens commented May 9, 2024

Just adding my 2¢ to the issue, I have the same, syncing between multiple machines, one Win10 (Machine A) and one Win11 (Machine B), whilst having an Ubuntu (Machine C) and another Win11 (Machine D) in the following setup:

  • A ↔ B
  • A ↔ C
  • A ↔ D
  • C ↔ D

On A it says that B is out of sync by 290B (2 files, both temporary files already deleted; I really should have excluded them...)
On B it says that A is out of sync by 57.8GiB (6278 files, they are however correctly synced).
C/D are OK.

Starting syncthing on both A and B with --reset-deltas did NOT solve the problem.
--reset-database on B did the trick, and clear the "out of sync" issue on B, not on A though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A problem with current functionality, as opposed to missing functionality (enhancement) needs-triage New issues needed to be validated
Projects
None yet
Development

No branches or pull requests

10 participants