List and compare files in different storage systems #382

Open
jefftucker opened this issue Jun 1, 2021 · 0 comments
Labels
enhancement New feature or request

This feature would let a user enter two locations (for example two S3 buckets, or an S3 bucket and a Swift folder), and Motuz would output a list of all files in each location along with their sizes. It could optionally show the set intersection, union, and/or difference, so that a user can find duplicate files (matched on name and file size) and files that are present in one location but NOT in the other. This would help users manage their data and use their storage more efficiently, for example by removing duplicate data or copying over only the missing files.
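
To make the set operations concrete, here is a minimal sketch. It assumes each location has already been listed into a `{relative_path: size_in_bytes}` mapping; the names `Comparison` and `compare_listings` are hypothetical and not part of Motuz.

```python
from typing import Dict, NamedTuple, Set


class Comparison(NamedTuple):
    identical: Set[str]     # same relative path and same size in both locations
    size_differs: Set[str]  # same relative path, different size
    only_in_a: Set[str]     # present in A, missing from B
    only_in_b: Set[str]     # present in B, missing from A


def compare_listings(a: Dict[str, int], b: Dict[str, int]) -> Comparison:
    """Compare two {relative_path: size_in_bytes} listings (hypothetical helper)."""
    shared = a.keys() & b.keys()  # set intersection on paths
    identical = {p for p in shared if a[p] == b[p]}
    return Comparison(
        identical=identical,
        size_differs=set(shared) - identical,
        only_in_a=set(a.keys() - b.keys()),  # set difference A \ B
        only_in_b=set(b.keys() - a.keys()),  # set difference B \ A
    )


# Made-up example data, for illustration only.
bucket = {"data/a.csv": 100, "data/b.csv": 250, "logs/run.log": 40}
folder = {"data/a.csv": 100, "data/b.csv": 300, "notes.txt": 12}

result = compare_listings(bucket, folder)
print(sorted(result.identical))     # ['data/a.csv']
print(sorted(result.size_differs))  # ['data/b.csv']
print(sorted(result.only_in_a))     # ['logs/run.log']
print(sorted(result.only_in_b))     # ['notes.txt']
```

Files in `identical` are candidates for de-duplication, and `only_in_a` / `only_in_b` are the files a user might want to copy across. A name-and-size match is only a heuristic: two files can share both and still differ in content.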

Sample implementation:

If I were to compare an S3 bucket to a POSIX file system manually, I would do the following steps:

  1. run "aws s3 ls --recursive --summarize s3://bucket > bucket.txt"
  2. run "ls -alR /path/to/folder > folder.txt"
  3. canonicalize the entries in both bucket.txt and folder.txt so that each line shows the path relative to the root folder/bucket, the file name, and the size in bytes
  4. sort both files by path and file name
  5. run "diff bucket.txt folder.txt" to see which files are in both locations and which are in only one.
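
Steps 1 to 3 could be sketched roughly as follows. This assumes the object rows written by `aws s3 ls --recursive` have the layout `date time size key`, and it walks the local directory with `os.walk` instead of parsing `ls -alR` output, which is a substitution on my part. The helper names `parse_s3_ls` and `walk_local` are hypothetical.

```python
import os
import tempfile
from typing import Dict, Iterable


def parse_s3_ls(lines: Iterable[str]) -> Dict[str, int]:
    """Turn `aws s3 ls --recursive` output into {key: size_in_bytes}."""
    listing = {}
    for line in lines:
        parts = line.rstrip("\r\n").split(None, 3)
        # Object rows are assumed to look like "<date> <time> <size> <key>".
        # Blank lines and the "Total Objects:" / "Total Size:" rows added by
        # --summarize have fewer than four fields and are skipped.
        if len(parts) == 4 and parts[2].isdigit():
            listing[parts[3]] = int(parts[2])
    return listing


def walk_local(root: str) -> Dict[str, int]:
    """Return {path_relative_to_root: size_in_bytes} for regular files under root."""
    listing = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.isfile(full):  # skips broken symlinks and special files
                rel = os.path.relpath(full, root).replace(os.sep, "/")
                listing[rel] = os.path.getsize(full)
    return listing


# Made-up sample of the assumed listing format.
sample = """\
2021-06-01 10:15:00        100 data/a.csv
2021-06-01 10:16:30        250 data/b.csv

Total Objects: 2
   Total Size: 350
"""
bucket_listing = parse_s3_ls(sample.splitlines())

# Build a throwaway directory so the example runs anywhere.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "data"))
    with open(os.path.join(root, "data", "a.csv"), "wb") as fh:
        fh.write(b"x" * 100)
    folder_listing = walk_local(root)

print(bucket_listing)
print(folder_listing)
```

Because both helpers produce the same canonical shape, steps 4 and 5 reduce to sorting the keys and comparing the two mappings. Keys that begin with whitespace would be mis-parsed by this simple split, so a real implementation would more likely use the storage system's API than text output.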

The feature is essentially these five steps, applied to any two folders, buckets, etc. in whatever storage systems Motuz supports. If the comparison has to be submitted as a job whose results the user checks later, that would most likely be fine.

Nice to have:

  • compare the file hash if the storage system makes that readily available in the metadata
  • force the creation of the hash for each file in each location and include this in the results. This could then highlight files with the same name and a different hash or the same hash but a different name/path.
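
The hash-based comparison could look something like the sketch below. It assumes SHA-256 computed by streaming each file, and a `{relative_path: hex_digest}` mapping per location; `file_sha256` and `hash_report` are hypothetical names. Hashes stored in metadata may not be directly comparable across systems (an S3 ETag, for instance, is not a plain MD5 of the content for multipart uploads), which is one reason forcing a fresh hash can be useful.

```python
import hashlib
import os
import tempfile
from collections import defaultdict
from typing import Dict, List, Tuple


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_report(
    a: Dict[str, str], b: Dict[str, str]
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Return (paths with different hashes, (path_a, path_b) pairs sharing a hash)."""
    # Same name/path, different hash.
    changed = sorted(p for p in a.keys() & b.keys() if a[p] != b[p])
    # Same hash, different name/path: index B by hash, then look up each A entry.
    by_hash_b = defaultdict(list)
    for path, digest in b.items():
        by_hash_b[digest].append(path)
    moved = sorted(
        (pa, pb)
        for pa, digest in a.items()
        for pb in by_hash_b.get(digest, [])
        if pa != pb
    )
    return changed, moved


# Check the streaming hash against hashing the same bytes in one call.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    with open(demo_path, "wb") as fh:
        fh.write(b"hello")
    streamed = file_sha256(demo_path)
print(streamed == hashlib.sha256(b"hello").hexdigest())  # True

# Placeholder digests ("h1", "h2", ...) stand in for real hex strings.
loc_a = {"x.txt": "h1", "y.txt": "h2", "z.txt": "h3"}
loc_b = {"x.txt": "h1", "y.txt": "h9", "copy/z.txt": "h3"}
changed, moved = hash_report(loc_a, loc_b)
print(changed)  # ['y.txt']
print(moved)    # [('z.txt', 'copy/z.txt')]
```

Hashing every file means reading every byte, so for remote storage this is likely the expensive part of the job and a good fit for the run-later model described above.
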

@jefftucker added the enhancement label on Jun 1, 2021