
AzCopy downloads with --from-to BlobPipe fails when downloading multiple files #2575

Open
robinlandstrom opened this issue Feb 14, 2024 · 4 comments
@robinlandstrom

Which version of the AzCopy was used?

azcopy version 10.23.0

Which platform are you using? (ex: Windows, Mac, Linux)

Linux x86-64

What command did you run?

# Downloading multiple files to local dir works
$ azcopy cp "https://ACCOUNT.blob.core.windows.net/CONTAINER/*?SAS" --include-pattern="*.csv.gz" .
100.0 %, 3 Done, 0 Failed, 0 Pending, 0 Skipped, 3 Total,
...

# Downloading multiple files to pipe fails silently with exitcode 0
$ azcopy cp "https://ACCOUNT.blob.core.windows.net/CONTAINER/*?SAS" --include-pattern="*.csv.gz" --from-to BlobPipe | pv > /dev/null
0,00  B 0:00:01 [0,00  B/s]
$ azcopy cp "https://ACCOUNT.blob.core.windows.net/CONTAINER/*?SAS" --include-pattern="*.csv.gz" --from-to BlobPipe > testfiles
$ echo $?
0
$ du testfiles
0       testfiles

# Downloading single file to pipe works
$ azcopy cp "https://ACCOUNT.blob.core.windows.net/CONTAINER/data01.csv.gz?SAS" --include-pattern="*.csv.gz" --from-to BlobPipe | pv > /dev/null
345MiB 0:01:25 [4,03MiB/s]

What problem was encountered?

I expect azcopy to be able to download multiple files to a pipe with --from-to BlobPipe, but it does not work.

How can we reproduce the problem in the simplest way?

Try to download multiple files from a storage account with --from-to BlobPipe.

Have you found a mitigation/solution?

Not yet. It is probably possible to do it in multiple steps: list the files first and then start multiple azcopy processes.

@vibhansa-msft
Member

When you download multiple files or a directory to a pipe, what is your expectation? AzCopy downloads files in parallel, and writing the contents of multiple files to a single pipe does not make sense.

@robinlandstrom
Author

Streaming multiple files in a single pipe can absolutely make sense.

My expectation is output similar to downloading all the files and then cat-ing them to a pipe, but without all the files landing on disk in between. The order of the files does not matter in my case, but one complete file should be streamed before the next file is added to the stream.

The use case is streaming through compressed CSV data larger than the disk/memory available on the machine and calculating metrics/stats:

azcopy cp "https://ACCOUNT.blob.core.windows.net/CONTAINER/*?SAS" --include-pattern="*.csv.gz" .  && \
cat *.csv.gz | \
mlr --gzin --csv cut -f "Fields,I,want" | ... # Do some fancy metrics calculation 

@vibhansa-msft
Member

This will not work because AzCopy does not download files in any particular order; rather, all of them are downloaded in parallel in small chunks (blocks). As each block arrives, it is sent to the output file (the pipe in your case). This means the data of the files being downloaded would be intermixed, not ordered so that you can expect one file in full before another begins. This goes against AzCopy's design of downloading blocks in parallel and hence cannot be honored.
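
As an illustration of the interleaving concern (plain shell, not AzCopy itself), two writers sharing a single pipe can produce arbitrarily mixed output:

# Two parallel writers on one pipe: the reader sees runs of A and B interleaved in
# no guaranteed order, which is why concatenating parallel downloads needs per-file ordering.
( seq 1 200000 | sed 's/^/A /' & seq 1 200000 | sed 's/^/B /' & wait ) |
  awk '{print $1}' | uniq -c | head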

@robinlandstrom
Author

AzCopy does the right thing with just one file: it pipes the data of the file in order before the whole file has finished downloading.

The workaround below might be useful for someone else if BlobPipe for multiple files is not implemented in AzCopy.

# List the blobs, extract the blob names, then stream each blob through the pipe one at a time
export AZ_BASE_URL=https://ACCOUNT.blob.core.windows.net/CONTAINER/PATH/
export AZ_SAS='...'
azcopy list "${AZ_BASE_URL}?${AZ_SAS}" | grep -oP 'INFO: \K[^;]+' | grep '\.csv\.gz$' | while read -r f; do
  azcopy cp "${AZ_BASE_URL}${f}?${AZ_SAS}" --from-to BlobPipe
done | pv | # Do streaming processing of multiple files here...

This probably breaks for blobs with ';' in the name though. I am a bit surprised that azcopy list has an --output-type json mode but, as far as I can find, no way to get just blob names or blob URLs in a structured manner.
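
For reference, a sketch of a JSON-based variant of the same loop, assuming the --output-type json output emits one JSON object per line whose MessageContent string contains the blob's Path (the field names MessageType, MessageContent, and Path are assumptions here and may differ between azcopy versions):

# Assumed schema: {"MessageType": ..., "MessageContent": "<json string with Path>", ...} per line.
# Verify the actual field names against your azcopy version before relying on this.
azcopy list "${AZ_BASE_URL}?${AZ_SAS}" --output-type json |
  jq -r 'select(.MessageType == "ListObject") | .MessageContent | fromjson | .Path' |
  grep '\.csv\.gz$' |
  while read -r f; do
    azcopy cp "${AZ_BASE_URL}${f}?${AZ_SAS}" --from-to BlobPipe
  done | pv > /dev/null  # Replace with the actual streaming processing of multiple files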
