Discussion: improvements/replacements for s3-find #332

Open
Kirill888 opened this issue Aug 25, 2021 · 3 comments

@Kirill888
Member

Introduction

s3-find is a library function and a CLI utility for listing S3 buckets with some basic "globbing" support. It's an important tool for keeping various databases in sync with S3 buckets, and also for data investigations. But it has some serious issues and performance pitfalls.

Problems

The main problem is the dependency on aiobotocore (which can be "fixed" by moving away from the async model and into a threaded model). There are also some limitations in the way globbing works.
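To make the "threaded model" idea concrete, here is a minimal sketch of a recursive listing driven by a thread pool instead of an async event loop. It is not the current s3-find implementation. The names `threaded_find`, `ListDir`, `fake_list_dir` and `FAKE_BUCKET` are hypothetical, and the listing callable is injected so the sketch runs without network access. With plain boto3, that callable would presumably wrap `list_objects_v2` with `Delimiter="/"`, reading `CommonPrefixes` and `Contents` from each response page.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Callable, List, Tuple

# Hypothetical signature: list_dir(prefix) -> (sub_prefixes, keys)
ListDir = Callable[[str], Tuple[List[str], List[str]]]


def threaded_find(list_dir: ListDir, root: str, max_workers: int = 8) -> List[str]:
    """Walk a prefix tree, listing each sub-prefix in a worker thread."""
    found: List[str] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = {pool.submit(list_dir, root)}
        while pending:
            # Block until at least one listing call has finished.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                sub_prefixes, keys = fut.result()
                found.extend(keys)
                # Fan out: every discovered sub-prefix becomes a new task.
                for prefix in sub_prefixes:
                    pending.add(pool.submit(list_dir, prefix))
    return sorted(found)


# In-memory stand-in for a bucket, used only for illustration.
FAKE_BUCKET = ["a/x.yaml", "a/b/y.yaml", "a/b/z.txt", "c/w.yaml"]


def fake_list_dir(prefix: str) -> Tuple[List[str], List[str]]:
    """Mimic a delimiter-based listing: direct keys plus immediate sub-prefixes."""
    subs, keys = set(), []
    for key in FAKE_BUCKET:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if "/" in rest:
            subs.add(prefix + rest.split("/", 1)[0] + "/")
        else:
            keys.append(key)
    return sorted(subs), keys


if __name__ == "__main__":
    for key in threaded_find(fake_list_dir, ""):
        print(key)
```

Because listing calls are I/O bound, threads should give similar concurrency to the async version while depending only on the synchronous client. Glob filtering could then be applied to the returned keys as a separate step.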

Actions

Let's discuss what we want to do about this tool and evaluate alternatives such as s5cmd and minio/mc.

@Kirill888
Member Author

Kirill888 commented Aug 25, 2021

Refs #167 #30 #149 #105

@alexgleith
Contributor

s4cmd (https://github.com/bloomreach/s4cmd) is a pure Python implementation built for performance.

I agree that we should try to remove the aiobotocore dependency.

@emmaai
Contributor

emmaai commented Nov 22, 2021

Ref #167: it can do //**/ fine, but s3-to-dc would be better with more informative messages when it can't deal with certain patterns, rather than the generic message "Added 0 datasets and failed 0 datasets".
