panic: cannot allocate memory on job with many files. #2632

Open
alexpersin opened this issue Apr 2, 2024 · 6 comments
Comments

alexpersin commented Apr 2, 2024

Which version of the AzCopy was used?

AzCopy 10.24.0

Which platform are you using? (ex: Windows, Mac, Linux)

Linux, Ubuntu 20.04

What command did you run?

export AZCOPY_CONCURRENCY_VALUE=3000
export AZCOPY_JOB_PLAN_LOCATION="/mnt/azcopy_plans"
export AZCOPY_LOG_LOCATION="/mnt/azcopy_logs"
azcopy copy "<source storage account><source SAS>" "<destination storage account><destination sas>" --recursive --block-blob-tier=hot --log-level=warning

What problem was encountered?

The job copies ~1e9 blobs averaging 600 KB in size between two Azure storage accounts; it failed after about 2 days with:

82.6 %, 516821789 Done, 0 Failed, 100208211 Pending, 0 Skipped, 617030000 Total (scanning...), 2-sec Throughput (Mb/s): 18029.1914
panic: cannot allocate memory

goroutine 1 [running]:
github.com/Azure/azure-storage-azcopy/v10/common.PanicIfErr(...)
        /home/vsts/work/1/s/common/lifecyleMgr.go:711
github.com/Azure/azure-storage-azcopy/v10/ste.JobPartPlanFileName.Map({0xc0af829700, 0x32})
        /home/vsts/work/1/s/ste/JobPartPlanFileName.go:73 +0x19b
github.com/Azure/azure-storage-azcopy/v10/ste.(*jobMgr).AddJobPart2(0xc000342400, 0xc042c23590)
        /home/vsts/work/1/s/ste/mgr-JobMgr.go:453 +0x2a6
github.com/Azure/azure-storage-azcopy/v10/jobsAdmin.ExecuteNewCopyJobPartOrder({0x0, {0x2fd571ba, 0x7589, 0xad4a, {0x62, 0xc4, 0x1c, 0xe7, 0xf7, 0xd4, ...}}, ...})
        /home/vsts/work/1/s/jobsAdmin/init.go:194 +0x2b5
github.com/Azure/azure-storage-azcopy/v10/cmd.inprocSend({0x10c512e, 0x10}, {0xe7f2c0?, 0xc00053b180?}, {0xe7f300?, 0xc0590087b0?})
        /home/vsts/work/1/s/cmd/rpc.go:39 +0x138
github.com/Azure/azure-storage-azcopy/v10/cmd.glob..func7({0x10c512e?, 0x0?}, {0xe7f2c0?, 0xc00053b180?}, {0xe7f300?, 0xc0590087b0?})
        /home/vsts/work/1/s/cmd/rpc.go:31 +0x32
github.com/Azure/azure-storage-azcopy/v10/cmd.addTransfer(_, {{0xc0e12c3080, 0x52}, {0xc0e12c3260, 0x58}, 0x0, {0x0, 0xedcf14962, 0xc07c5c9110}, 0x3020, ...}, ...)
        /home/vsts/work/1/s/cmd/copyEnumeratorHelper.go:24 +0xb2
github.com/Azure/azure-storage-azcopy/v10/cmd.(*CookedCopyCmdArgs).initEnumerator.func5({{0xc059b53e9c, 0x1b}, 0x0, {0x0, 0xedcf14962, 0xc07c5c9110}, {0x0, 0x0, 0x0}, 0x3020, ...})
        /home/vsts/work/1/s/cmd/copyEnumeratorInit.go:321 +0x5a7
github.com/Azure/azure-storage-azcopy/v10/cmd.processIfPassedFilters({_, _, _}, {{0xc059b53e9c, 0x1b}, 0x0, {0x0, 0xedcf14962, 0xc07c5c9110}, {0x0, ...}, ...}, ...)
        /home/vsts/work/1/s/cmd/zc_enumerator.go:839 +0x9f
github.com/Azure/azure-storage-azcopy/v10/cmd.(*blobTraverser).parallelList(0xc00054b810, 0xc000151b60, {0xc0001b243a, 0xc}, {0xc0006481e8, 0x6}, {0x0, 0x0}, 0x0, 0xc0004aea80, ...)
        /home/vsts/work/1/s/cmd/zc_traverser_blob.go:437 +0x379
github.com/Azure/azure-storage-azcopy/v10/cmd.(*blobTraverser).Traverse(0xc00054b810, 0x0?, 0x0?, {0x199cb28, 0x0, 0x0})
        /home/vsts/work/1/s/cmd/zc_traverser_blob.go:308 +0xaa8
github.com/Azure/azure-storage-azcopy/v10/cmd.(*CopyEnumerator).enumerate(0xc00064cfc0)
        /home/vsts/work/1/s/cmd/zc_enumerator.go:787 +0x42
github.com/Azure/azure-storage-azcopy/v10/cmd.(*CookedCopyCmdArgs).processCopyJobPartOrders(0xc000835680)
        /home/vsts/work/1/s/cmd/copy.go:1618 +0xe8c
github.com/Azure/azure-storage-azcopy/v10/cmd.(*CookedCopyCmdArgs).process(0xc000672000?)
        /home/vsts/work/1/s/cmd/copy.go:1273 +0x65
github.com/Azure/azure-storage-azcopy/v10/cmd.init.2.func2(0xc0007c1680?, {0xc000488900?, 0x2?, 0x4?})
        /home/vsts/work/1/s/cmd/copy.go:2023 +0x1f4
github.com/spf13/cobra.(*Command).execute(0xc0007c1680, {0xc0004888c0, 0x4, 0x4})
        /home/vsts/go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:860 +0x663
github.com/spf13/cobra.(*Command).ExecuteC(0x195d140)
        /home/vsts/go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:974 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
        /home/vsts/go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:902
github.com/Azure/azure-storage-azcopy/v10/cmd.Execute({0xc00003c014?, 0xc000146110?}, {0xc00003c079?, 0x110323b?}, 0x74?, {0x2fd571ba, 0x7589, 0xad4a, {0x62, 0xc4, ...}})
        /home/vsts/work/1/s/cmd/root.go:220 +0x106
main.main()
        /home/vsts/work/1/s/main.go:84 +0x507

Running azcopy jobs resume then fails in under a minute with:

panic: cannot allocate memory

goroutine 1 [running]:
github.com/Azure/azure-storage-azcopy/v10/common.PanicIfErr(...)
        /home/vsts/work/1/s/common/lifecyleMgr.go:711
github.com/Azure/azure-storage-azcopy/v10/ste.JobPartPlanFileName.Map({0xc0022c8332, 0x32})
        /home/vsts/work/1/s/ste/JobPartPlanFileName.go:73 +0x19b
github.com/Azure/azure-storage-azcopy/v10/jobsAdmin.(*jobsAdmin).ResurrectJob(0xc0001f97c0, {0x989facaa, 0xa273, 0x8a45, {0x53, 0x22, 0xc9, 0x7b, 0xcb, 0xb7, ...}}, ...)
        /home/vsts/work/1/s/jobsAdmin/JobsAdmin.go:392 +0x1e5
github.com/Azure/azure-storage-azcopy/v10/jobsAdmin.ResumeJobOrder({{0x989facaa, 0xa273, 0x8a45, {0x53, 0x22, 0xc9, 0x7b, 0xcb, 0xb7, 0xa9, ...}}, ...})
        /home/vsts/work/1/s/jobsAdmin/init.go:240 +0xfb
github.com/Azure/azure-storage-azcopy/v10/cmd.inprocSend({0x10bd7ae, 0x9}, {0xe7fcc0?, 0xc000520a80?}, {0xe7f140?, 0xc0001aa630?})
        /home/vsts/work/1/s/cmd/rpc.go:60 +0x8f8
github.com/Azure/azure-storage-azcopy/v10/cmd.glob..func7({0x10bd7ae?, 0x24?}, {0xe7fcc0?, 0xc000520a80?}, {0xe7f140?, 0xc0001aa630?})
        /home/vsts/work/1/s/cmd/rpc.go:31 +0x32
github.com/Azure/azure-storage-azcopy/v10/cmd.resumeCmdArgs.process({{0x7ffda695b564, 0x24}, {0x0, 0x0}, {0x0, 0x0}, {0x7ffda695b596, 0x87}, {0x7ffda695b630, 0x8a}})
        /home/vsts/work/1/s/cmd/jobsResume.go:406 +0x90f
github.com/Azure/azure-storage-azcopy/v10/cmd.init.9.func2(0xc000352280?, {0xc00040b710?, 0x1?, 0x3?})
        /home/vsts/work/1/s/cmd/jobsResume.go:221 +0x38
github.com/spf13/cobra.(*Command).execute(0xc000352280, {0xc00040b6b0, 0x3, 0x3})
        /home/vsts/go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:860 +0x663
github.com/spf13/cobra.(*Command).ExecuteC(0x195d140)
        /home/vsts/go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:974 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
        /home/vsts/go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:902
github.com/Azure/azure-storage-azcopy/v10/cmd.Execute({0xc00003c014?, 0xc000190100?}, {0xc00003c079?, 0x110323b?}, 0x74?, {0xe562f92d, 0x8d50, 0xa947, {0x7b, 0x61, ...}})
        /home/vsts/work/1/s/cmd/root.go:220 +0x106
main.main()
        /home/vsts/work/1/s/main.go:84 +0x507

The VM had plenty of available memory at the time of the crash:

[screenshot: VM memory-usage graph showing ample free memory]

The process used no more than 8 GB of memory before crashing when attempting to resume the job. The plan files total 239 GB, and the VM is a Standard D96d v5 (96 vCPUs, 384 GiB of memory).

Five other similar jobs were running at the same time on other VMs on other directories with the same setup, and all crashed after similar amounts of time.

How can we reproduce the problem in the simplest way?

Run a similarly sized job?

Have you found a mitigation/solution?

No, I am unable to resume the job.

adreed-msft (Member) commented:

Ah. So, this doesn't sound like actual "memory" causing the crash per se; the crash occurred while attempting to map the job plan file into memory. I notice this is a massive job. It's entirely possible that the job plan files' memory mappings simply eat through the allocatable mapping space.

This is a known AzCopy issue (and something I'd like to address, but we don't encounter transfers of this scale very often). We usually mitigate it by breaking jobs down into smaller, more manageable chunks. If your files are separated into folders, or there is some consistent naming scheme to filter against, AzCopy has pattern/path filters. If there's no way to filter by name, breaking the job down by last-modified time (LMT) with --include-before and --include-after may be another strategy.
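A plausible way to see why mapping can fail long before RAM runs out: on Linux, mmap returns ENOMEM once a process exceeds vm.max_map_count mappings (default 65530), regardless of free physical memory. The sketch below is a back-of-the-envelope estimate only, and the ~10,000-transfers-per-plan-file figure is an assumption about AzCopy internals, not a documented guarantee:

```python
# Rough estimate: number of job-part plan files (each one memory-mapped)
# for a very large AzCopy job, compared against the default Linux limit
# on mappings per process. TRANSFERS_PER_PART is an assumed figure.
import math

TRANSFERS_PER_PART = 10_000          # assumed plan-file granularity
DEFAULT_MAX_MAP_COUNT = 65_530       # Linux default vm.max_map_count

def estimated_plan_mappings(total_files: int) -> int:
    """Rough count of plan files AzCopy would mmap for a job this size."""
    return math.ceil(total_files / TRANSFERS_PER_PART)

parts = estimated_plan_mappings(617_030_000)   # total from the progress line
print(parts)                                   # 61703
print(parts > DEFAULT_MAX_MAP_COUNT * 0.9)     # True: within ~6% of the cap
```

If this is indeed the failure mode, raising the limit (for example `sysctl -w vm.max_map_count=262144`) might buy headroom, though splitting the job remains the safer fix.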


alexpersin commented Apr 2, 2024

Thanks for the prompt reply @adreed-msft. As a test I tried adding an --include-regex='2024-01-*' argument to filter by pattern, but after about an hour it had not started copying any files. I think it might have been listing the whole directory and applying the regex to each blob. I can try filtering by LMT instead; would that be more efficient? In the code it looks like both get applied as filters in the same way.

The next level in directory structure has about 1e4 directories per parent. My next approach will be to write a script that runs azcopy on each of those 10k subdirectories sequentially: the jobs would be well-sized but my concern is that the overhead of starting each job would slow things down. I could maybe include a few in each azcopy --include-path arg to group them together?

I guess there's no way to recover the plan files at this point and I'll have to run azcopy syncs to avoid copying the data twice? (It's a cross-region transfer so using the --overwrite IfSourceNewer option would be much more expensive.)
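The per-subdirectory batching idea above could be sketched as follows: group the subdirectory names into --include-path batches (the flag accepts a semicolon-separated list of path prefixes), so each azcopy invocation stays well under the problem size. The directory names, SAS placeholders, and batch size here are illustrative, not prescriptive:

```python
# Sketch: batch subdirectories into azcopy invocations via --include-path,
# which takes a semicolon-separated list of path prefixes. All names and
# the batch size are placeholders for illustration.
from typing import Iterator

def batches(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield successive fixed-size groups from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def build_commands(subdirs: list[str], batch_size: int = 50) -> list[str]:
    """Build one azcopy command line per batch of subdirectories."""
    cmds = []
    for group in batches(subdirs, batch_size):
        include = ";".join(group)
        cmds.append(
            "azcopy copy '<source><SAS>' '<dest><SAS>' "
            f"--recursive --include-path '{include}'"
        )
    return cmds

subdirs = [f"dir{i:05d}" for i in range(10_000)]   # stand-in for real names
cmds = build_commands(subdirs)
print(len(cmds))   # 200 invocations of 50 subdirectories each
```

Running the generated commands sequentially (or a few at a time) amortizes the per-job startup cost while keeping each plan-file set small.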

adreed-msft (Member) commented:

Try --include-pattern instead of --include-regex here if you want glob-style wildcards. Keep in mind that a regex uses .* for "any character, zero or more times" and .+ for "any character, one or more times".
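This distinction matters for the earlier '2024-01-*' attempt: interpreted as a regex, that pattern means "2024-01" followed by zero or more literal hyphens, so it also matches names with no date suffix at all. A quick check of plain regex semantics (not AzCopy's matching internals):

```python
# Regex vs. glob semantics: in a regular expression, '-*' means "zero or
# more hyphens", not "a hyphen followed by anything" as in a shell glob.
import re

glob_style = r"2024-01-*"    # as regex: '2024-01' + zero or more '-'
regex_style = r"2024-01-.*"  # '2024-01-' + any characters

name = "2024-01-15/blob.bin"

print(bool(re.match(glob_style, "2024-01")))   # True: zero trailing hyphens
print(bool(re.match(glob_style, name)))        # True, via the bare prefix
print(bool(re.match(regex_style, name)))       # True
print(bool(re.match(regex_style, "2024-01")))  # False: hyphen + more required
```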

alexpersin (Author) commented:

Does --include-pattern only work on file names, though? https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-files#use-wildcard-characters

adreed-msft (Member) commented:

Yes, --include-pattern operates on the file name only.

jidicula commented Apr 7, 2024

👋 Hey @alexpersin! A few weeks ago I finished an S3-to-Azure copy about half this size (~370 TB over 225 million files) without hitting any OOM issues, using the default concurrency values; it completed in about 54 hours. Have you had any luck running AzCopy without setting AZCOPY_CONCURRENCY_VALUE?

I came across your issue while looking into #2642 🙂.
