Monotonic memory growth bug in azcopy jobs show <jobID> for large job, significantly worse with --with-status flag #2642

Open
jidicula opened this issue Apr 7, 2024 · 0 comments

Which version of the AzCopy was used?

Note: The version is visible when running AzCopy without any argument
  • 10.23.0
  • 10.24.0
  • 10.25.0-Preview-1

Which platform are you using? (ex: Windows, Mac, Linux)

Linux: 6.5.0-1017-azure #17~22.04.1-Ubuntu SMP Sat Mar 9 10:04:07 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

What command did you run?

Note: Please remove the SAS to avoid exposing your credentials. If you cannot remember the exact command, please retrieve it from the beginning of the log file.
  • azure-storage-azcopy jobs show <jobID> --with-status=Failed
  • azure-storage-azcopy jobs show <jobID>

What problem was encountered?

Out of memory kill from the OS

How can we reproduce the problem in the simplest way?

Run either of the above commands, with any of the above AzCopy versions, on an Ubuntu VM against a large completed job (mine had 225 million files).

Have you found a mitigation/solution?

The only workaround I have for inspecting errors is to grep the job's logs for COPYFAILED and pipe that to a separate file for further examination:

grep COPYFAILED ~/.azcopy/<jobID>* > logged-failures.txt
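The same workaround can be wrapped in a small helper that also reports how many failures were found. This is only a sketch: the function name collect_failures and its arguments are hypothetical, and it assumes nothing about the log format beyond the COPYFAILED marker used above.

```shell
# Collect COPYFAILED lines for one job and print how many were found.
# Usage: collect_failures <log-dir> <job-id> <output-file>
collect_failures() {
  local log_dir="$1" job_id="$2" out="$3"
  # -h drops the "filename:" prefix grep adds when the glob matches several files.
  # grep exits non-zero when nothing matches; "|| true" keeps that from looking like a failure.
  grep -h COPYFAILED "$log_dir/$job_id"* > "$out" || true
  # Print the number of collected lines (tr strips padding some wc versions add).
  wc -l < "$out" | tr -d ' '
}
```

For example, collect_failures ~/.azcopy "$JOB_ID" logged-failures.txt, where JOB_ID holds the job ID. grep streams the logs line by line, so memory use stays flat regardless of job size.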

I noticed that when running azure-storage-azcopy jobs show <jobID> --with-status=Failed for a large job (~370 TB over 225 million files), the command exits with status 137 and prints Killed to stderr. This appears to be the kernel's out-of-memory killer sending SIGKILL (signal 9) to the azcopy process.
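For reference, the link between exit status 137 and signal 9 comes from the shell convention that a process terminated by signal N is reported with status 128 + N. A minimal sketch of decoding such a status follows; the helper name describe_exit is hypothetical.

```shell
# Explain a shell exit status: values above 128 conventionally mean
# "terminated by signal (status - 128)".
describe_exit() {
  local status="$1"
  if [ "$status" -gt 128 ]; then
    local signum=$((status - 128))
    # kill -l <number> prints the signal name, e.g. KILL for 9.
    echo "killed by signal $signum (SIG$(kill -l "$signum"))"
  else
    echo "exited with status $status"
  fi
}

describe_exit 137   # 137 - 128 = 9
```

SIGKILL alone does not prove the OOM killer was responsible. On most Linux systems the kernel log records the kill, so something like dmesg | grep -i "out of memory" (possibly requiring root) should confirm it.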

Is this a known bug?

Some data

I captured some crude logs with free on an Ubuntu 22.04 ARM64 VM in Azure that was running nothing but azure-storage-azcopy jobs show <jobID> --with-status=Failed in a tmux session. System RAM usage grows monotonically until the OS kills azcopy. I haven't fully correlated the samples with azcopy's invocation, but azcopy is definitely killed before my memory sample collection completes.
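One way to tie memory samples to the azcopy process itself, rather than to system-wide usage from free, would be to sample that process's resident set size while it runs. The following is a rough sketch, assuming Linux with /proc mounted; the function name sample_rss and the log layout (epoch seconds, then RSS in kB) are hypothetical and not what produced the logs below.

```shell
# Sample the resident set size (RSS) of one command until it exits.
# Usage: sample_rss <logfile> <interval-seconds> <command> [args...]
sample_rss() {
  local logfile="$1" interval="$2"
  shift 2
  "$@" &                       # start the command under observation
  local pid=$!
  : > "$logfile"               # truncate or create the log
  while kill -0 "$pid" 2>/dev/null; do
    # VmRSS in /proc/<pid>/status is reported in kB on Linux.
    local rss
    rss=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status" 2>/dev/null)
    if [ -n "$rss" ]; then
      printf '%s %s\n' "$(date +%s)" "$rss" >> "$logfile"
    fi
    sleep "$interval"
  done
  wait "$pid"                  # return the command's exit status (137 if it was SIGKILLed)
}
```

For example, sample_rss rss.log 1 azure-storage-azcopy jobs show "$JOB_ID" --with-status=Failed, with JOB_ID set to the job ID. Because the sampler stops when the process exits, the last line of the log is the final RSS reading before the kill.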

I've reproduced this with various combinations of Go and AzCopy versions:

| AzCopy version | Go 1.18.1 | Go 1.22.2 |
|---|---|---|
| azcopy 10.23.0 | azcopy-10.23.0-go1.18.1-linux-arm64-memprofile.log | azcopy-10.23.0-go1.22.2-linux-arm64-memprofile.log |
| azcopy 10.24.0 | azcopy-10.24.0-go1.18.1-linux-arm64-memprofile.log | azcopy-10.24.0-go1.22.2-linux-arm64-memprofile.log |
| azcopy 10.25.0-Preview-1 | didn't test | azcopy-10.25.0-Preview-1-go1.22.2-linux-arm64-memprofile.log |

I also captured a single set of free samples with azcopy 10.25.0-Preview-1 and Go 1.22.2 while running plain azure-storage-azcopy jobs show <jobID>. It also shows a monotonic memory increase, but the azcopy command completes before the system runs out of memory: azcopy-10.25.0-Preview-1-go1.22.2-linux-arm64-summary-memprofile.log
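To check a sample log for monotonic growth without plotting it, a short awk filter can flag the first reading that drops. This sketch assumes the memory value is the first whitespace-separated field on each line, which may not match the attached logs; the name is_monotonic is hypothetical.

```shell
# Report whether the first column of a file never decreases.
# Usage: is_monotonic <file>   (prints "monotonic" or the first drop found)
is_monotonic() {
  awk '
    NR > 1 && ($1 + 0) < prev {
      printf "drop at line %d: %s -> %s\n", NR, prev, $1 + 0
      bad = 1
      exit
    }
    { prev = $1 + 0 }
    END { if (!bad) print "monotonic" }
  ' "$1"
}
```

Equal consecutive readings count as monotonic here, so a plateau is not reported as a drop.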

Here's how the system memory usage for each of these scenarios looks when plotted together:

[Image: plot of system memory usage over time for each scenario]
