Flashing errors with recent Windows update #1025

Open
microbit-carlos opened this issue Apr 28, 2023 · 14 comments

Comments

@microbit-carlos
Contributor

microbit-carlos commented Apr 28, 2023

A recent Windows 10 and Windows 11 update has started triggering checksum and timeout errors on DAPLink.

This has been reported by micro:bit & Calliope users, and we have been able to replicate it on Windows 10 and 11 when the OS is kept up to date. We haven't tried Windows 8.1, but that reached end of life last January.

Triggering Windows Update

The cumulative updates have been found and installed using this Microsoft catalogue:
https://www.catalog.update.microsoft.com/Search.aspx?q=Cumulative%20Update%20Windows%2011%2022H2%20x64

Windows 11 22H2

I went through installing and uninstalling cumulative updates, and in my findings the problem is triggered when installing 2023-02 Cumulative Update Preview for Windows 11 Version 22H2 for x64-based Systems (KB5022913) from the 28th of February, which updates Windows 11 22H2 to OS Build 22621.1344.

The previous cumulative update KB5022845 from the 14th of Feb (OS Build 22621.1265) doesn't trigger this issue.

Windows 11 21H2

The Microsoft update catalog doesn't show any updates for Win 11 21H2 since November 2022, so I won't bother to test this Windows version.

Windows 10 22H2

The issue was triggered for me using 2023-03 Cumulative Update Preview for Windows 10 Version 22H2 for x64-based Systems (KB5023773) from the 21st of March, which updates the OS to build 1904x.2788.

The previous cumulative update KB5023696 from the 14th of March (OS Build 1904x.2728) doesn't trigger this issue.

Windows 10 21H2

We've also been able to replicate this issue in Win 10 21H2, and Microsoft is still releasing updates for this OS version, so it should be possible to identify the specific cumulative update that introduced this issue.
I probably won't be looking into this one, nor Win 10 20H2, as it's unlikely to provide any additional useful information.

Failure modes

Note
It's worth mentioning that the micro:bit V2 port contains an additional feature where, if DAPLink encounters an error, it will reflash the target with a small custom programme that scrolls the error code on the micro:bit LED matrix display.
This is relevant because on some occasions this error programme is not flashed.

We've encountered a few different ways in which errors emerge:

  • The file transfer takes a little bit less time than usual, the target programme doesn't run, DAPLink MSD drive remounts with a fail.txt
    • For micro:bit specifically, in this case the error programme is flashed and the error code scrolls on the display
    • fail.txt errors we've seen so far:
      • Checksum error (error code 21) - the most common error
      • An error occurred during the transfer (error code 3)
      • Timeout error (error code 4)
      • Update sent was incomplete (error code 37)
      • Blocks out of order (error code 6)
  • The file transfer gets stuck at some point and can take a significant time before it errors (2 to 3 minutes in my Win 11 test environment)
    • Explorer shows an error transferring the file, first screenshot below
    • The target programme doesn't run, the DAPLink MSD drive remounts with one of the two assert.txt files listed below
    • I'm not 100% sure, but I think possibly DAPLink is crashing in this scenario, as I am using Windows VMs and the MSD drive unmounts from the VM and mounts back in my host OS, which usually only happens when DAPLink crashes.
  • In some cases the explorer error is different, as shown in screenshots 2 and 3
    • I haven't been able to replicate these myself, but have gotten multiple people reporting them, so trying to get more info at the moment

The errors are not triggered on every flash, and different users have reported different error frequencies. In our internal testing some teammates measured a 20% failure rate and others up to 60%. Some users have reported errors happening on "almost every flash".

We've used micro:bit Universal Hex files for the majority of these tests, which are a bit more resilient to this issue (more info in the "Identifying the Cause" section), so other DAPLink users flashing Intel Hex files might encounter this problem more often (it's also likely that the micro:bit user that reported an error on "almost every flash" was using Intel Hex files as well).

[screenshot 1]
[screenshot 2]
[screenshot 3]

Assert
File: ../../../source/daplink/drag-n-drop/vfs_manager.c
Line: 361
Source: Application
Hexdumps
fffffff1
20000fc0
20005e88
00000000
20005eac
00000000
00000000
1fffeb9c
Assert
File: ../../../source/daplink/drag-n-drop/vfs_manager.c
Line: 361
Source: Application
Hexdumps
fffffff1
20000fc0
20005e88
00000000
20005eac
00000000
00000000
1fffeb9c

Identifying the Cause

I’ve collected a couple of RTT logs from DAPLink with additional debug prints to track how the OS writes the file blocks to disk, and to peek at the actual data. While it’s still a bit early (I need more time to capture more data and analyse it), initial findings point at the problem being caused by file blocks being sent out of order by the OS.

In previous Windows versions, the file blocks are sent in order, but after the listed Windows updates are installed it looks like some file blocks are first sent as zeros, and then later down the file transfer the blocks are sent again with the real file data.

For example:

  • OS sends file blocks 0 to 512 correctly
  • Blocks 512 to 1024 are sent as zeros
  • Blocks 1024 to 1536 are sent correctly
  • Real data for blocks 512 to 1024 are sent at this point

And this can happen more than once on the same file transfer.
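To make the sequence above concrete, here is a minimal Python sketch of why this write order is harmless for a normal disk but breaks a consumer that parses the data as a stream. It is not DAPLink code: the 512-byte chunk size just mirrors the ranges in the example, and `sequential_view` is a hypothetical stand-in for any receiver that only accepts data at the next expected offset.

```python
from typing import List, Tuple

Write = Tuple[int, bytes]  # (offset, data)

CHUNK = 512  # assumption: mirrors the ranges used in the example above


def os_write_sequence(payload: bytes, chunk: int = CHUNK) -> List[Write]:
    """Order from the example: the second chunk arrives as zeros first,
    and its real data is only sent after the third chunk."""
    chunks = [(i, payload[i:i + chunk]) for i in range(0, len(payload), chunk)]
    first, second, third = chunks[:3]
    zeros = (second[0], bytes(len(second[1])))
    return [first, zeros, third, second]


def random_access_view(writes: List[Write], size: int) -> bytes:
    """What a real disk ends up with: later writes overwrite earlier ones."""
    image = bytearray(size)
    for offset, data in writes:
        image[offset:offset + len(data)] = data
    return bytes(image)


def sequential_view(writes: List[Write]) -> Tuple[bytes, List[int]]:
    """Hypothetical stream consumer: accepts a write only at the next
    expected offset and ignores anything else."""
    stream = bytearray()
    ignored = []
    for offset, data in writes:
        if offset == len(stream):
            stream += data
        else:
            ignored.append(offset)
    return bytes(stream), ignored


# Payload without any zero bytes, so zero-filled chunks are easy to spot.
payload = bytes((i % 251) + 1 for i in range(3 * CHUNK))
writes = os_write_sequence(payload)
disk = random_access_view(writes, len(payload))
stream, ignored = sequential_view(writes)

print("disk image intact:", disk == payload)
print("stream intact:", stream == payload)
print("ignored rewrite offsets:", ignored)
```

The final disk image is correct because the rewrite lands on top of the zeros, while the stream consumer has already processed the zero-filled chunk by the time the real data shows up.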

However, not every file transfer sends blocks out of order; some are sent in order and it all works fine.

The checksum errors are encountered when the OS sends a block filled with zeros and DAPLink tries to calculate the checksum of an Intel Hex record. I still need to capture a better log for timeout errors, but I believe those are usually triggered when out-of-order blocks are ignored by DAPLink: once the OS has finished sending the file, DAPLink keeps waiting for more data to arrive (as the ignored blocks are not counted when measuring how much file data was transferred) until it eventually times out.

For the micro:bit specifically we use Universal Hex files, a superset of the Intel Hex format, which contains data for micro:bit V1 and micro:bit V2 in the same file. In file transfers where the out-of-order blocks correspond only to a section of the Universal Hex file that is not relevant to the target MCU being flashed, the flash can still be successful. So while I haven't yet compared failure rates of Intel vs Universal Hex, it's very likely Intel Hex (and bin) files fail more frequently.
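For reference, the Intel Hex checksum rule can be sketched in a few lines of Python. A record is valid when all of its bytes (byte count, address, record type, data and checksum) sum to zero modulo 256. This is only an illustration of the format rule, not DAPLink's parser, and `record_checksum_ok` is a hypothetical helper name. How DAPLink itself handles raw zero bytes in the middle of a record isn't shown in this thread, so the sketch simply rejects them.

```python
def record_checksum_ok(record: str) -> bool:
    """Return True if `record` is a well-formed Intel Hex record whose
    bytes sum to 0 modulo 256."""
    if not record.startswith(":"):
        return False
    try:
        raw = bytes.fromhex(record[1:].strip())
    except ValueError:
        # Non-hex characters, e.g. NUL bytes from a zero-filled block.
        return False
    # Layout: count(1) + address(2) + type(1) + data(count) + checksum(1)
    if len(raw) < 5 or raw[0] != len(raw) - 5:
        return False
    return sum(raw) % 256 == 0


good = ":10010000214601360121470136007EFE09D2190140"
end_of_file = ":00000001FF"

# Same record with its first data byte (0x21) replaced by 0x00.
altered = good[:9] + "00" + good[11:]

# Same record with its tail overwritten by NUL bytes, roughly what a
# record straddling a zero-filled block would look like.
nul_filled = good[:20] + "\x00" * (len(good) - 20)

for name, rec in [("good", good), ("end_of_file", end_of_file),
                  ("altered", altered), ("nul_filled", nul_filled)]:
    print(name, record_checksum_ok(rec))
```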

A checksum error log and Universal Hex file can be found here:

(Also note that because there is a lot of log data captured, data is sometimes dropped, so it might look like some blocks are not being sent, but we can look at the variables tracking the file size transferred to confirm that data has been processed, it's just that the RTT buffer was likely full).

Workarounds

Using robocopy with the /z flag, for restartable mode, seems to work so far.

For example, with the terminal at the path where your file.hex is located, and assuming DAPLink is mounted as drive E:\:

robocopy /z . E:\ file.hex
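
If you want to script this workaround, a minimal Python wrapper could look like the sketch below. The helper names (`robocopy_command`, `robocopy_succeeded`, `flash_with_robocopy`) are made up for illustration. The only robocopy behaviour relied on is the /z flag from the command above and the documented convention that exit codes of 8 or higher indicate at least one failure.

```python
import subprocess
from typing import List


def robocopy_command(source_dir: str, drive: str, filename: str) -> List[str]:
    """Build the same command line as above: robocopy /z <source> <dest> <file>."""
    return ["robocopy", "/z", source_dir, drive, filename]


def robocopy_succeeded(exit_code: int) -> bool:
    """robocopy exit codes are bit flags; 8 or higher means a copy failure."""
    return 0 <= exit_code < 8


def flash_with_robocopy(source_dir: str, drive: str, filename: str) -> bool:
    """Run robocopy (Windows only) and report whether the copy itself succeeded."""
    result = subprocess.run(robocopy_command(source_dir, drive, filename))
    return robocopy_succeeded(result.returncode)
```

On Windows, `flash_with_robocopy(".", "E:\\", "file.hex")` would be the equivalent of the command above. Note this only tells you the copy completed; it's still worth checking whether the drive remounts with a fail.txt.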

Also, WebUSB flashing works, so for Intel Hex and bin files this demo from DAPJs can still flash the boards:
https://armmbed.github.io/dapjs/examples/daplink-flash/web.html

For micro:bit Universal Hex files, this online WebUSB tool will work too:
https://microbit.org/tools/webusb-hex-flashing/

@mathias-arm
Collaborator

@polat-ahmet reported in #1032:

While flashing the max32666fthr board with Drag and Drop using max32625 debugger, it often fails(nearly 70% fail, %30 success). It's ok when I try with small size(~40kb) binary, but I'm having trouble with bigger size(~500kb) binary.

It was working properly before. I observed the problem using Windows updates KB5026361, KB5025221. When I uninstall these updates and tried it on KB5023696, I did not encounter any problems, successfully flash.

@free2create

free2create commented Jun 29, 2023

@microbit-carlos In case this is related. I am seeing this error when using the microbit on Ubuntu 23.04. It initially works a few times then the timeouts, 503, start happening. I haven't tried stopping/starting the USB bus yet since one bus impacts the keyboard and the other my wireless. But I could dig in deeper.

@microbit-carlos FYI: For same hardware I booted into Windows 10 and it worked every time. This sort of feels like USB emulation is incomplete so could this bug be on microbit side ? By that I mean is that after file is dropped into microbit the USB connection seems to be reset so users have another go at flashing again. This USB reset process may be faulty and some required DAPLink API calls are not made, but should have been.

@ozersa
Contributor

ozersa commented Jul 3, 2023

@polat-ahmet reported in #1032:

While flashing the max32666fthr board with Drag and Drop using max32625 debugger, it often fails(nearly 70% fail, %30 success). It's ok when I try with small size(~40kb) binary, but I'm having trouble with bigger size(~500kb) binary.
It was working properly before. I observed the problem using Windows updates KB5026361, KB5025221. When I uninstall these updates and tried it on KB5023696, I did not encounter any problems, successfully flash.

@mathias-arm This is a critical issue for us; we would appreciate an estimate of when this issue can be fixed.

@microbit-carlos
Contributor Author

We had an update from Microsoft that they expect to release a fix in the September Windows update 🎉

@fesc-q

fesc-q commented Sep 14, 2023

Tested with Windows 11 22H2 22621.2283 September 12th 2023 build.
Issue is still reproduced

@microbit-carlos
Contributor Author

Yes, it looks like the update has been pushed back to October, hopefully it'll finally be out by then.

@fesc-q

fesc-q commented Oct 12, 2023

From Microsoft

checked internally the update is released to fix the issue already this week.

I tested the Windows 11 22H2 22621.2428 Oct 10th 2023 build.
The issue is still reproduced on an old DAPLink release and a relatively new J-Link OB release.

@top-5

top-5 commented Oct 21, 2023

I can still consistently repro this bug on Win 11 22H2 23560.1000 insider preview. DAPLink Build ID: v0257-gc782a5ba
Doesn't seem like any recent updates would fix much yet.

@selimgullulu

selimgullulu commented Nov 1, 2023

@polat-ahmet reported in #1032:

While flashing the max32666fthr board with Drag and Drop using max32625 debugger, it often fails(nearly 70% fail, %30 success). It's ok when I try with small size(~40kb) binary, but I'm having trouble with bigger size(~500kb) binary.
It was working properly before. I observed the problem using Windows updates KB5026361, KB5025221. When I uninstall these updates and tried it on KB5023696, I did not encounter any problems, successfully flash.

@mathias-arm This is a critical issue for us; we would appreciate an estimate of when this issue can be fixed.

Issue is still reproduced while flashing fw to the MAX32625PICO

@microbit-carlos
Contributor Author

This should be fixed with Windows 11 build 22621.2506, released on the 31st of October.
https://support.microsoft.com/en-gb/topic/october-31-2023-kb5031455-os-builds-22621-2506-and-22631-2506-preview-6513c5ec-c5a2-4aaf-97f5-44c13d29e0d4

I've tested this build with a BBC micro:bit with DAPLink 0257 and could not replicate the issue anymore.
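
For anyone checking a fleet of machines, here is a rough Python sketch that classifies a Windows 11 22H2 build string (as shown by `winver`) using only the build numbers mentioned in this thread. The function name and labels are made up for illustration, and the classification of builds that weren't explicitly tested here is an inference, not something confirmed by Microsoft.

```python
from typing import Tuple

# Build numbers taken from this thread (Windows 11 22H2 only).
FIRST_AFFECTED = (22621, 1344)  # KB5022913, first build seen triggering the issue
FIX_REPORTED = (22621, 2506)    # KB5031455, could not replicate on this build


def parse_build(text: str) -> Tuple[int, int]:
    """Parse a 'major.revision' build string such as '22621.2506'."""
    major, revision = text.strip().split(".")
    return int(major), int(revision)


def classify_win11_22h2(build: str) -> str:
    """Classify a build relative to the builds reported in this thread."""
    parsed = parse_build(build)
    if parsed[0] != 22621:
        return "unknown"  # other Windows versions are not covered here
    if parsed < FIRST_AFFECTED:
        return "before regression"
    if parsed >= FIX_REPORTED:
        return "fix reported"
    return "affected"


for build in ["22621.1265", "22621.1344", "22621.2428", "22621.2506", "19045.2788"]:
    print(build, "->", classify_win11_22h2(build))
```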

@felix-qorvo @top-5 @selimgullulu could you update to this version and try again? Thanks!

@selimgullulu

This should be fixed with Windows 11 build 22621.2506, released on the 31st of October. https://support.microsoft.com/en-gb/topic/october-31-2023-kb5031455-os-builds-22621-2506-and-22631-2506-preview-6513c5ec-c5a2-4aaf-97f5-44c13d29e0d4

I've tested this build with a BBC micro:bit with DAPLink 0257 and could not replicate the issue anymore.

@felix-qorvo @top-5 @selimgullulu could you update to this version and try again? Thanks!

Hi @microbit-carlos , is there an equivalent update for Windows 10? This seems to be for Windows 11.
Thanks
Selim

@microbit-carlos
Contributor Author

microbit-carlos commented Nov 10, 2023

I don't know, sorry. Do you have the latest cumulative update installed (probably KB5031445)? And do you still see the issue there?

@fesc-q

fesc-q commented Nov 14, 2023

@selimgullulu

Hi, I installed KB5031445 on two different Windows 10 laptops (Surface & Dell) and the drag&drop success rate was 100%. I'm waiting for confirmation from some colleagues about the resolution. In the meantime, can you please let me know if your PCs are also using encryption for storage? Has anyone experienced this problem on a PC withOUT encryption? Thanks.

7 participants