panic: cannot allocate memory on job with many files. #2632

Open
alexpersin opened this issue Apr 2, 2024 · 6 comments
Comments

alexpersin commented Apr 2, 2024

Which version of the AzCopy was used?

AzCopy 10.24.0

Which platform are you using? (ex: Windows, Mac, Linux)

Linux, Ubuntu 20.04

What command did you run?

export AZCOPY_CONCURRENCY_VALUE=3000
export AZCOPY_JOB_PLAN_LOCATION="/mnt/azcopy_plans"
export AZCOPY_LOG_LOCATION="/mnt/azcopy_logs"
azcopy copy "<source storage account><source SAS>" "<destination storage account><destination sas>" --recursive --block-blob-tier=hot --log-level=warning

What problem was encountered?

The job copies ~1e9 blobs averaging 600 KB in size between two Azure storage accounts; it failed after about 2 days with:

82.6 %, 516821789 Done, 0 Failed, 100208211 Pending, 0 Skipped, 617030000 Total (scanning...), 2-sec Throughput (Mb/s): 18029.1914
panic: cannot allocate memory

goroutine 1 [running]:
github.com/Azure/azure-storage-azcopy/v10/common.PanicIfErr(...)
        /home/vsts/work/1/s/common/lifecyleMgr.go:711
github.com/Azure/azure-storage-azcopy/v10/ste.JobPartPlanFileName.Map({0xc0af829700, 0x32})
        /home/vsts/work/1/s/ste/JobPartPlanFileName.go:73 +0x19b
github.com/Azure/azure-storage-azcopy/v10/ste.(*jobMgr).AddJobPart2(0xc000342400, 0xc042c23590)
        /home/vsts/work/1/s/ste/mgr-JobMgr.go:453 +0x2a6
github.com/Azure/azure-storage-azcopy/v10/jobsAdmin.ExecuteNewCopyJobPartOrder({0x0, {0x2fd571ba, 0x7589, 0xad4a, {0x62, 0xc4, 0x1c, 0xe7, 0xf7, 0xd4, ...}}, ...})
        /home/vsts/work/1/s/jobsAdmin/init.go:194 +0x2b5
github.com/Azure/azure-storage-azcopy/v10/cmd.inprocSend({0x10c512e, 0x10}, {0xe7f2c0?, 0xc00053b180?}, {0xe7f300?, 0xc0590087b0?})
        /home/vsts/work/1/s/cmd/rpc.go:39 +0x138
github.com/Azure/azure-storage-azcopy/v10/cmd.glob..func7({0x10c512e?, 0x0?}, {0xe7f2c0?, 0xc00053b180?}, {0xe7f300?, 0xc0590087b0?})
        /home/vsts/work/1/s/cmd/rpc.go:31 +0x32
github.com/Azure/azure-storage-azcopy/v10/cmd.addTransfer(_, {{0xc0e12c3080, 0x52}, {0xc0e12c3260, 0x58}, 0x0, {0x0, 0xedcf14962, 0xc07c5c9110}, 0x3020, ...}, ...)
        /home/vsts/work/1/s/cmd/copyEnumeratorHelper.go:24 +0xb2
github.com/Azure/azure-storage-azcopy/v10/cmd.(*CookedCopyCmdArgs).initEnumerator.func5({{0xc059b53e9c, 0x1b}, 0x0, {0x0, 0xedcf14962, 0xc07c5c9110}, {0x0, 0x0, 0x0}, 0x3020, ...})
        /home/vsts/work/1/s/cmd/copyEnumeratorInit.go:321 +0x5a7
github.com/Azure/azure-storage-azcopy/v10/cmd.processIfPassedFilters({_, _, _}, {{0xc059b53e9c, 0x1b}, 0x0, {0x0, 0xedcf14962, 0xc07c5c9110}, {0x0, ...}, ...}, ...)
        /home/vsts/work/1/s/cmd/zc_enumerator.go:839 +0x9f
github.com/Azure/azure-storage-azcopy/v10/cmd.(*blobTraverser).parallelList(0xc00054b810, 0xc000151b60, {0xc0001b243a, 0xc}, {0xc0006481e8, 0x6}, {0x0, 0x0}, 0x0, 0xc0004aea80, ...)
        /home/vsts/work/1/s/cmd/zc_traverser_blob.go:437 +0x379
github.com/Azure/azure-storage-azcopy/v10/cmd.(*blobTraverser).Traverse(0xc00054b810, 0x0?, 0x0?, {0x199cb28, 0x0, 0x0})
        /home/vsts/work/1/s/cmd/zc_traverser_blob.go:308 +0xaa8
github.com/Azure/azure-storage-azcopy/v10/cmd.(*CopyEnumerator).enumerate(0xc00064cfc0)
        /home/vsts/work/1/s/cmd/zc_enumerator.go:787 +0x42
github.com/Azure/azure-storage-azcopy/v10/cmd.(*CookedCopyCmdArgs).processCopyJobPartOrders(0xc000835680)
        /home/vsts/work/1/s/cmd/copy.go:1618 +0xe8c
github.com/Azure/azure-storage-azcopy/v10/cmd.(*CookedCopyCmdArgs).process(0xc000672000?)
        /home/vsts/work/1/s/cmd/copy.go:1273 +0x65
github.com/Azure/azure-storage-azcopy/v10/cmd.init.2.func2(0xc0007c1680?, {0xc000488900?, 0x2?, 0x4?})
        /home/vsts/work/1/s/cmd/copy.go:2023 +0x1f4
github.com/spf13/cobra.(*Command).execute(0xc0007c1680, {0xc0004888c0, 0x4, 0x4})
        /home/vsts/go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:860 +0x663
github.com/spf13/cobra.(*Command).ExecuteC(0x195d140)
        /home/vsts/go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:974 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
        /home/vsts/go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:902
github.com/Azure/azure-storage-azcopy/v10/cmd.Execute({0xc00003c014?, 0xc000146110?}, {0xc00003c079?, 0x110323b?}, 0x74?, {0x2fd571ba, 0x7589, 0xad4a, {0x62, 0xc4, ...}})
        /home/vsts/work/1/s/cmd/root.go:220 +0x106
main.main()
        /home/vsts/work/1/s/main.go:84 +0x507

Running azcopy jobs resume then fails in under a minute with:

panic: cannot allocate memory

goroutine 1 [running]:
github.com/Azure/azure-storage-azcopy/v10/common.PanicIfErr(...)
        /home/vsts/work/1/s/common/lifecyleMgr.go:711
github.com/Azure/azure-storage-azcopy/v10/ste.JobPartPlanFileName.Map({0xc0022c8332, 0x32})
        /home/vsts/work/1/s/ste/JobPartPlanFileName.go:73 +0x19b
github.com/Azure/azure-storage-azcopy/v10/jobsAdmin.(*jobsAdmin).ResurrectJob(0xc0001f97c0, {0x989facaa, 0xa273, 0x8a45, {0x53, 0x22, 0xc9, 0x7b, 0xcb, 0xb7, ...}}, ...)
        /home/vsts/work/1/s/jobsAdmin/JobsAdmin.go:392 +0x1e5
github.com/Azure/azure-storage-azcopy/v10/jobsAdmin.ResumeJobOrder({{0x989facaa, 0xa273, 0x8a45, {0x53, 0x22, 0xc9, 0x7b, 0xcb, 0xb7, 0xa9, ...}}, ...})
        /home/vsts/work/1/s/jobsAdmin/init.go:240 +0xfb
github.com/Azure/azure-storage-azcopy/v10/cmd.inprocSend({0x10bd7ae, 0x9}, {0xe7fcc0?, 0xc000520a80?}, {0xe7f140?, 0xc0001aa630?})
        /home/vsts/work/1/s/cmd/rpc.go:60 +0x8f8
github.com/Azure/azure-storage-azcopy/v10/cmd.glob..func7({0x10bd7ae?, 0x24?}, {0xe7fcc0?, 0xc000520a80?}, {0xe7f140?, 0xc0001aa630?})
        /home/vsts/work/1/s/cmd/rpc.go:31 +0x32
github.com/Azure/azure-storage-azcopy/v10/cmd.resumeCmdArgs.process({{0x7ffda695b564, 0x24}, {0x0, 0x0}, {0x0, 0x0}, {0x7ffda695b596, 0x87}, {0x7ffda695b630, 0x8a}})
        /home/vsts/work/1/s/cmd/jobsResume.go:406 +0x90f
github.com/Azure/azure-storage-azcopy/v10/cmd.init.9.func2(0xc000352280?, {0xc00040b710?, 0x1?, 0x3?})
        /home/vsts/work/1/s/cmd/jobsResume.go:221 +0x38
github.com/spf13/cobra.(*Command).execute(0xc000352280, {0xc00040b6b0, 0x3, 0x3})
        /home/vsts/go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:860 +0x663
github.com/spf13/cobra.(*Command).ExecuteC(0x195d140)
        /home/vsts/go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:974 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
        /home/vsts/go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:902
github.com/Azure/azure-storage-azcopy/v10/cmd.Execute({0xc00003c014?, 0xc000190100?}, {0xc00003c079?, 0x110323b?}, 0x74?, {0xe562f92d, 0x8d50, 0xa947, {0x7b, 0x61, ...}})
        /home/vsts/work/1/s/cmd/root.go:220 +0x106
main.main()
        /home/vsts/work/1/s/main.go:84 +0x507

The VM had plenty of available memory at the time of the crash:

[screenshot: VM memory-usage graph showing ample free memory]

The process used no more than 8 GB of memory before crashing when attempting to resume the job. The plan files total 239 GB, and the VM is a Standard D96d v5 (96 vCPUs, 384 GiB of memory).

Five other similar jobs were running at the same time on other VMs on other directories with the same setup, and all crashed after similar amounts of time.

How can we reproduce the problem in the simplest way?

Run a similarly sized job?

Have you found a mitigation/solution?

No, I am unable to resume the job.

adreed-msft (Member) commented:

Ah. So, this doesn't sound like actual "memory" causing the crash per se; the crash occurred while attempting to map the job plan file into memory. I notice this is a massive job. It's entirely possible that the job plan files' memory mappings simply eat through the allocatable mapping space.

This is a known AzCopy issue (and something I'd like to address, but we don't encounter transfers of this scale very often). We usually mitigate it by breaking jobs down into smaller, more manageable chunks. If your files are separated into folders, or there is some consistent naming scheme to filter against, AzCopy has pattern/path filters. If there's no way to filter by name, breaking the job down by last-modified time (LMT) with --include-before and --include-after may be another strategy.
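A plausible way to see why mapping can fail long before RAM runs out: on Linux, mmap returns ENOMEM once a process exceeds vm.max_map_count mappings (default 65530), regardless of free physical memory. The sketch below is a back-of-the-envelope estimate only, and the ~10,000-transfers-per-plan-file figure is an assumption about AzCopy internals, not a documented guarantee:

```python
# Rough estimate: number of job-part plan files (each one memory-mapped)
# for a very large AzCopy job, compared against the default Linux limit
# on mappings per process. TRANSFERS_PER_PART is an assumed figure.
import math

TRANSFERS_PER_PART = 10_000          # assumed plan-file granularity
DEFAULT_MAX_MAP_COUNT = 65_530       # Linux default vm.max_map_count

def estimated_plan_mappings(total_files: int) -> int:
    """Rough count of plan files AzCopy would mmap for a job this size."""
    return math.ceil(total_files / TRANSFERS_PER_PART)

parts = estimated_plan_mappings(617_030_000)   # total from the progress line
print(parts)                                   # 61703
print(parts > DEFAULT_MAX_MAP_COUNT * 0.9)     # True: within ~6% of the cap
```

If this is indeed the failure mode, raising the limit (for example `sysctl -w vm.max_map_count=262144`) might buy headroom, though splitting the job remains the safer fix.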


alexpersin commented Apr 2, 2024

Thanks for the prompt reply @adreed-msft. As a test I tried adding an --include-regex='2024-01-*' argument to filter by pattern, but after about an hour it had not started copying any files. I think it might have been listing the whole directory and applying the regex to each blob. I can try filtering by LMT instead; would that be more efficient? In the code it looks like both get applied as filters in the same way.

The next level in directory structure has about 1e4 directories per parent. My next approach will be to write a script that runs azcopy on each of those 10k subdirectories sequentially: the jobs would be well-sized but my concern is that the overhead of starting each job would slow things down. I could maybe include a few in each azcopy --include-path arg to group them together?

I guess there's no way to recover the plan files at this point and I'll have to run azcopy syncs to avoid copying the data twice? (It's a cross-region transfer so using the --overwrite IfSourceNewer option would be much more expensive.)
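The per-subdirectory batching idea above could be sketched as follows: group the subdirectory names into --include-path batches (the flag accepts a semicolon-separated list of path prefixes), so each azcopy invocation stays well under the problem size. The directory names, SAS placeholders, and batch size here are illustrative, not prescriptive:

```python
# Sketch: batch subdirectories into azcopy invocations via --include-path,
# which takes a semicolon-separated list of path prefixes. All names and
# the batch size are placeholders for illustration.
from typing import Iterator

def batches(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield successive fixed-size groups from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def build_commands(subdirs: list[str], batch_size: int = 50) -> list[str]:
    """Build one azcopy command line per batch of subdirectories."""
    cmds = []
    for group in batches(subdirs, batch_size):
        include = ";".join(group)
        cmds.append(
            "azcopy copy '<source><SAS>' '<dest><SAS>' "
            f"--recursive --include-path '{include}'"
        )
    return cmds

subdirs = [f"dir{i:05d}" for i in range(10_000)]   # stand-in for real names
cmds = build_commands(subdirs)
print(len(cmds))   # 200 invocations of 50 subdirectories each
```

Running the generated commands sequentially (or a few at a time) amortizes the per-job startup cost while keeping each plan-file set small.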

adreed-msft (Member) commented:

Try --include-pattern instead of --include-regex here if you want glob-style wildcards. Keep in mind that a regex uses .* for "any character, zero or more times" and .+ for "any character, one or more times".
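This distinction matters for the earlier '2024-01-*' attempt: interpreted as a regex, that pattern means "2024-01" followed by zero or more literal hyphens, so it also matches names with no date suffix at all. A quick check of plain regex semantics (not AzCopy's matching internals):

```python
# Regex vs. glob semantics: in a regular expression, '-*' means "zero or
# more hyphens", not "a hyphen followed by anything" as in a shell glob.
import re

glob_style = r"2024-01-*"    # as regex: '2024-01' + zero or more '-'
regex_style = r"2024-01-.*"  # '2024-01-' + any characters

name = "2024-01-15/blob.bin"

print(bool(re.match(glob_style, "2024-01")))   # True: zero trailing hyphens
print(bool(re.match(glob_style, name)))        # True, via the bare prefix
print(bool(re.match(regex_style, name)))       # True
print(bool(re.match(regex_style, "2024-01")))  # False: hyphen + more required
```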

alexpersin (Author) commented:

Does --include-pattern only work on file names, though? https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-files#use-wildcard-characters

adreed-msft (Member) commented:

Yes, --include-pattern operates on the file name only.

jidicula commented Apr 7, 2024

👋 Hey @alexpersin! A few weeks ago I finished an S3-to-Azure copy about half this size (~370 TB over 225 million files) without hitting any OOM issues, using the default concurrency values; it completed in about 54 hours. Have you had any luck running AzCopy without setting AZCOPY_CONCURRENCY_VALUE?

I came across your issue while looking into #2642 🙂.
