AzCopy Sync do not provide 'include-path' parameter. #2594

Open
leelax22 opened this issue Feb 28, 2024 · 4 comments

Comments

@leelax22

Which version of the AzCopy was used?

azcopy version 10.23.0

Which platform are you using? (ex: Windows, Mac, Linux)

Linux

What command did you run?

azcopy sync --exclude-path
azcopy copy --include-path

What problem was encountered?

I use AzCopy to sync Azure Files in one region (region1) to Azure Files in another region (region2).
Azure Files supports GRS, but GRS does not provide a test-failover function.
So, for training scenarios, we use AzCopy to keep the Azure file shares in sync.
The problem is that there are too many files spread across many folders.
AzCopy recommends keeping each job under 10 million files.
So I decided to split the folders across separate jobs.

For example, the Azure Files source contains folders 1, 2, 3 and 4, and each folder has 10 million files.
First, I used these commands:
[azcopy sync "src" "dst" --exclude-path "3;4"
azcopy sync "src" "dst" --exclude-path "1;2"]
But they did not run the way I expected. AzCopy scanned all of the folders (1, 2, 3, 4), and only after that, I guess, did the sync itself run.
There are so many files that scanning folders 3 and 4 is a waste of time.
Then I found that the "azcopy copy" command has an --include-path parameter.
[azcopy copy "src" "dst" --include-path "1;2"
azcopy copy "src" "dst" --include-path "3;4"]
Unlike azcopy sync, azcopy copy --include-path does not scan all of the folders.
I wonder why azcopy sync does not skip scanning folders even when they are listed in exclude-path.

I hope azcopy sync can get an '--include-path' parameter too, or that '--exclude-path' can skip processing of the excluded folders entirely.
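
The folder split described above can be scripted. This is a minimal sketch in Python, assuming the placeholder "src" and "dst" values and the four example folders from this issue. `chunk` and `build_copy_jobs` are hypothetical helper names, and the only AzCopy option used is the `--include-path` option of `azcopy copy` shown above.

```python
import shlex


def chunk(items, size):
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_copy_jobs(src, dst, folders, per_job):
    """Build one `azcopy copy` argument list per group of folders.

    Folder names are joined with ';', the separator used by --include-path
    in the commands above.
    """
    jobs = []
    for group in chunk(folders, per_job):
        jobs.append(
            ["azcopy", "copy", src, dst, "--include-path", ";".join(group)]
        )
    return jobs


# "src" and "dst" are placeholders, as in the issue text.
jobs = build_copy_jobs("src", "dst", ["1", "2", "3", "4"], per_job=2)

for argv in jobs:
    # shlex.join quotes the ';'-separated value so a shell does not split it.
    print(shlex.join(argv))
```

Passing each argument list to `subprocess.run` instead of printing it avoids shell quoting problems with the `;` separator altogether.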

Here is test example.

PS C:\Users\Zenuser\Desktop\azcopy> azcopy sync "https://newjeans.file.core.windows.net/newjeans/edms/LocalDisk01/?sv" "https://newjeans2.file.core.windows.net/newjeans2/edms/LocalDisk01/?sv" --delete-destination=true --exclude-path="500;501;502"

image

The exclude-path parameter was applied correctly.

2024/02/28 08:08:44 ==> REQUEST/RESPONSE (Try=1/26.4232ms, OpTime=56.933ms) -- RESPONSE SUCCESSFULLY RECEIVED
HEAD https://newjeans.file.core.windows.net/newjeans/edms%2FLocalDisk01/500/file2.zip?se=2024-04-17T16%3A02%3A01Z&sig=-REDACTED-&sp=rwdlc&spr=https&srt=sco&ss=f&st=2024-02-28T08%3A02%3A01Z&sv=2022-11-02
X-Ms-Request-Id: [406692d6-f01a-0014-6f1d-6ac308000000]

2024/02/28 08:08:44 ==> REQUEST/RESPONSE (Try=1/25.647ms, OpTime=40.164ms) -- RESPONSE SUCCESSFULLY RECEIVED
HEAD https://newjeans.file.core.windows.net/newjeans/edms%2FLocalDisk01/500/file4.zip?se=2024-04-17T16%3A02%3A01Z&sig=-REDACTED-&sp=rwdlc&spr=https&srt=sco&ss=f&st=2024-02-28T08%3A02%3A01Z&sv=2022-11-02

But the log file contains entries like the ones above, which look like AzCopy is scanning the exclude-path folders.

Thank you for reading. AzCopy works well for me, and it would be even better if this point were improved. If there is something I missed, please let me know. I would really appreciate it.

@siminsavani-msft
Member

Hi @leelax22 ! This is on our radar and we will update this thread accordingly!

@leelax22
Author

leelax22 commented Feb 29, 2024

I found a workaround for my situation using the 'include-regex' parameter.
I'm not used to writing regex, but MS Copilot helped me build it.

I made 2 jobs with these parameters:
--include-regex "^[0-4]?[0-9]?[0-9]/."
--include-regex "^([5-9][0-9]?[0-9]|9[0-9][0-9])/."

But when I checked the job logs, I found that each job still scanned something across all of the folders (0~999).
I need to test more...
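
Before running the two jobs, it may help to check which folders each pattern actually selects. The sketch below uses Python's `re` module as a stand-in. It assumes the folders are named 0 to 999 without zero padding, that the pattern is applied to the relative path of each file, and that AzCopy's regex engine treats these simple patterns the same way. `folders_matched` and `JOB2_STRICT` are hypothetical names introduced here.

```python
import re

# The two patterns quoted in the comment above.
JOB1 = re.compile(r"^[0-4]?[0-9]?[0-9]/.")
JOB2 = re.compile(r"^([5-9][0-9]?[0-9]|9[0-9][0-9])/.")

# Hypothetical stricter variant: exactly three digits, first digit 5-9.
JOB2_STRICT = re.compile(r"^[5-9][0-9][0-9]/.")


def folders_matched(pattern, folder_numbers):
    """Return the folder numbers whose sample file path matches the pattern."""
    return {n for n in folder_numbers if pattern.match(f"{n}/file.zip")}


all_folders = range(1000)  # assumption: folders are named 0..999, unpadded
job1 = folders_matched(JOB1, all_folders)
job2 = folders_matched(JOB2, all_folders)
job2_strict = folders_matched(JOB2_STRICT, all_folders)

overlap = sorted(job1 & job2)
print("job1 covers 0-499 only:", job1 == set(range(500)))
print("folders matched by both jobs:", overlap[:3], "...", overlap[-3:])
print("strict variant covers 500-999 only:", job2_strict == set(range(500, 1000)))
```

Under these assumptions the second pattern also matches the two-digit folders 50 to 99, because the middle `[0-9]?` is optional, so those folders would be processed by both jobs. Requiring exactly three digits, as in `JOB2_STRICT`, removes the overlap.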

@leelax22 leelax22 reopened this Feb 29, 2024
@leelax22
Author

leelax22 commented Feb 29, 2024

The "azcopy copy --include-path" log shows no scanning of the folders that are not in "include-path",
while the "azcopy sync --exclude-path" log shows scanning of all folders, including the ones in "exclude-path".

So I thought the 'exclude' parameter first scans every folder and only afterwards excludes whatever does not satisfy the condition.

But after testing some more jobs, I now guess the difference is between 'sync' and 'copy', not between 'exclude' and 'include'.

So I wonder whether my problem can be solved.
Job1: LocalDisk01/0-499(folders)
Job2: LocalDisk01/500-999(folders)

Each job has fewer than 10 million files.
If I use azcopy copy with the --include parameter, I would have to delete the out-of-sync files manually (which is impossible).
If I use azcopy sync with the --exclude parameter, each job scans all of the folders, so everything is scanned twice, which takes a very long time.

I don't know whether this can be solved easily. @@
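
To illustrate the manual cleanup that the copy-based approach would need, here is a minimal sketch in Python. It assumes you already have two listings of relative file paths, one for the source and one for the destination, obtained by whatever means; the sample paths and the helper name `stale_destination_files` are hypothetical. It only computes candidates for deletion and does not call AzCopy.

```python
def stale_destination_files(source_paths, destination_paths):
    """Return paths that exist at the destination but not at the source.

    These are the files a sync with --delete-destination=true would remove,
    and that a plain copy leaves behind.
    """
    return sorted(set(destination_paths) - set(source_paths))


# Hypothetical listings for one job (folders 0-499 in the example above).
source_listing = ["0/a.zip", "0/b.zip", "1/c.zip"]
destination_listing = ["0/a.zip", "0/old.zip", "1/c.zip", "1/removed.zip"]

to_delete = stale_destination_files(source_listing, destination_listing)
print(to_delete)
```

With tens of millions of files the listings themselves are expensive to produce, which is the same scanning cost this issue is about.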

@leelax22
Author

This is the log from [azcopy sync --include-regex "^[0-4]?[0-9]?[0-9]/."]

2024/02/29 02:08:13 ==> REQUEST/RESPONSE (Try=1/45.956ms, OpTime=86.4368ms) -- RESPONSE SUCCESSFULLY RECEIVED
HEAD https://newjeans.file.core.windows.net/newjeans/edms%2FLocalDisk01/500/xxxxx-1022.file?se=2024-04-17T16%3A02%3A01Z&sig=-REDACTED-&sp=rwdlc&spr=https&srt=sco&ss=f&st=2024-02-28T08%3A02%3A01Z&sv=2022-11-02
X-Ms-Request-Id: [b5e47669-001a-004d-34b4-6a448b000000]

LocalDisk01/500 does not match the regex, but some API calls were still made for it according to [f24fd023-bf16-1a4f-4a36-8de234e5e7e5-scanning.log].

Even if I run the job separately, I am concerned that the scanning time will take too long if the entire folder is scanned twice.
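
To see how much of the scan touches folders outside the intended range, the HEAD requests in a scanning log can be counted per top-level folder. This is a rough sketch in Python that assumes the log lines look like the excerpts quoted in this issue. The function name `count_head_requests_by_folder` and the `base` argument are hypothetical, and the query strings in the sample lines are shortened.

```python
import re
from collections import Counter
from urllib.parse import unquote, urlsplit

HEAD_LINE = re.compile(r"^HEAD\s+(\S+)")


def count_head_requests_by_folder(log_lines, base="edms/LocalDisk01/"):
    """Count HEAD requests per first folder level below `base`."""
    counts = Counter()
    for line in log_lines:
        match = HEAD_LINE.match(line.strip())
        if not match:
            continue  # timestamps, request IDs, blank lines
        # Decode %2F etc. so the path can be compared against `base`.
        path = unquote(urlsplit(match.group(1)).path)
        index = path.find(base)
        if index < 0:
            continue
        folder, separator, _ = path[index + len(base):].partition("/")
        if separator:  # only count entries that sit inside a folder
            counts[folder] += 1
    return counts


# Sample lines modeled on the log excerpts above (query strings shortened).
sample = [
    "2024/02/28 08:08:44 ==> REQUEST/RESPONSE (Try=1/26.4232ms, OpTime=56.933ms) -- RESPONSE SUCCESSFULLY RECEIVED",
    "HEAD https://newjeans.file.core.windows.net/newjeans/edms%2FLocalDisk01/500/file2.zip?sig=-REDACTED-",
    "X-Ms-Request-Id: [406692d6-f01a-0014-6f1d-6ac308000000]",
    "HEAD https://newjeans.file.core.windows.net/newjeans/edms%2FLocalDisk01/500/file4.zip?sig=-REDACTED-",
]

counts = count_head_requests_by_folder(sample)
print(counts)
```

Running this over the real scanning log for each job would show whether the excluded folders account for a large share of the requests.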
