
Filtering attributes/tags/access tier when transferring blobs between storage accounts #2621

Open
catalin-micu opened this issue Mar 25, 2024 · 9 comments


catalin-micu commented Mar 25, 2024

AzCopy 10.23

Linux OS

azcopy copy "source_storage_account_container" "destination_storage_account_container" --recursive

Problem: Copying entire storage containers and using azcopy to filter some blobs

There is an unpredictable amount of data, scattered throughout the container, that we want to filter out. We are talking about petabytes worth of data in total. We can identify all the data that needs to be filtered. Due to internal policies, we cannot alter the data (we cannot rename it, add a prefix, or anything of the sort, so we cannot use --exclude-pattern or --exclude-regex), nor can we archive it. These two options are out of the question.

What I want to do is filter data in a storage account to storage account transfer, through azcopy copy, based on either a tag, or access tier (everything is currently hot tier, but unwanted data can be moved to cool or cold) or any other blob attribute that can be assigned to the data, without changing names, directory structure or archiving.

Can this be done?

@souravgupta-msft souravgupta-msft self-assigned this Mar 25, 2024
@souravgupta-msft (Member)

Hi @catalin-micu, filtering blobs by tags or access tier is currently not supported. You can use one of the following ways to filter blobs during copy.

  • Name-based, using the include/exclude pattern/regex flags.
  • Last-modified-time-based, using the include-after or include-before flags.
  • Other flags you can refer to are include-path, exclude-path and exclude-blob-type.
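To make the options above concrete, here is a sketch of the flag-based filters. The account and container URLs, patterns, and paths are hypothetical placeholders (SAS tokens omitted), and each azcopy invocation is echoed rather than executed so the sketch can be read and adapted without azcopy installed.

```shell
# Hypothetical source/destination container URLs (append SAS tokens in practice).
SRC="https://srcaccount.blob.core.windows.net/mycontainer"
DST="https://dstaccount.blob.core.windows.net/mycontainer"

# 1. Name-based filtering with the include/exclude pattern or regex flags:
echo azcopy copy "$SRC" "$DST" --recursive --include-pattern "*.csv"
echo azcopy copy "$SRC" "$DST" --recursive --exclude-regex "^logs/.*"

# 2. Last-modified-time filtering:
echo azcopy copy "$SRC" "$DST" --recursive --include-after "2024-01-01T00:00:00Z"

# 3. Path- and blob-type-based filtering:
echo azcopy copy "$SRC" "$DST" --recursive --exclude-path "dir1;dir2" --exclude-blob-type "AppendBlob"
```

Note that all of these filter on name, path, time, or blob type only; none of them can see tags or access tier, which is the gap this issue asks about.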


catalin-micu commented Mar 27, 2024

All my data is Block blob at the moment. Is there a way to change that?

@souravgupta-msft (Member)

Do you mean changing the blob type from Block Blob to Append Blob or Page Blob? If yes, then there is no direct way to do that.
Can you use last modified time for filtering the blobs during copy?

@catalin-micu (Author)

Yes, I meant changing the blob type; I understand it's not possible.
I can't use the last modified timestamp either, because there is no pattern to when this data I'm trying to filter was uploaded.
The situation is this: over the course of years, from time to time, wrong data was uploaded. Now I need to move the whole content of the storage account, preferably filtering out this wrong data. The only thing I can identify about the wrong data is the directory name. All directory names (for both good and bad data) are UUIDs, so I can't use any pattern filtering there. We are talking about hundreds to thousands of directories, so listing each name I want to filter in the AzCopy command is also not an option.

Is there anything else worth trying? I was leaning towards filtering based on blob tags or blob attributes, but it does not seem possible.

@souravgupta-msft (Member)

What blob attribute do you want to use for filtering the wrong data (other than tags or access tier)?

@catalin-micu (Author)

I don't have anything specific in mind; basically anything I can set to a known value on all the wrong data and then pass to azcopy as a filter, be it a blob property, a directory property, anything.

@catalin-micu (Author)

Alright, I see a feature-request label was added. To summarize: I would most like to filter by access tier.

@schoag-msft (Member)

Blob Inventory (https://learn.microsoft.com/azure/storage/blobs/blob-inventory) captures metadata/attributes on objects, such as Access Tier. You could use a Blob Inventory report as an input to AzCopy with the --list-of-files parameter (https://github.com/Azure/azure-storage-azcopy/wiki/Listing-specific-files-to-transfer).
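A minimal sketch of that workflow, assuming a CSV-format inventory report whose columns include Name and AccessTier — the sample rows and column positions below are fabricated assumptions, so check your inventory rule's actual schema. It keeps only Hot-tier blobs (the data to transfer) and writes their names, one per line, in the form --list-of-files expects.

```shell
# Fabricated three-row stand-in for a real Blob Inventory CSV report.
cat > inventory.csv <<'EOF'
Name,Creation-Time,Content-Length,AccessTier
good/one.bin,2023-01-01T00:00:00Z,100,Hot
bad/two.bin,2023-02-01T00:00:00Z,200,Cool
good/three.bin,2023-03-01T00:00:00Z,300,Hot
EOF

# Keep data rows (skip the header) whose AccessTier column is "Hot" and emit
# just the blob name. NOTE: a plain comma split breaks if blob names contain
# commas; a real report may need a proper CSV parser.
awk -F',' 'NR > 1 && $4 == "Hot" { print $1 }' inventory.csv > files-to-copy.txt

# The resulting list would then be fed to AzCopy, e.g.:
#   azcopy copy "<source-container-url>" "<dest-container-url>" --list-of-files files-to-copy.txt
```

As the next comment notes, this scales poorly when the filtered list itself runs to millions of entries per job.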

@catalin-micu (Author)

Interesting solution but, sadly, it won't work because of performance issues. The resulting list of files would have millions of entries, every time, for each of the multiple transfer jobs I will do (200+).


4 participants