
Add more Store implementations #9

Open
Hirevo opened this issue Oct 13, 2019 · 6 comments
Labels
C-enhancement Category: Enhancement good first issue Good for newcomers help wanted Extra attention is needed P-medium Priority: Medium

Comments

@Hirevo
Owner

Hirevo commented Oct 13, 2019

This is a tracking issue for additional Store implementations.

Here is a non-exhaustive list of potential candidates for a Store implementation:

  • S3Storage:
    This would store crate tarballs and rendered README pages into Amazon's S3.
  • RemoteStorage:
    Requires implementing a companion server.
    The idea is to have a reserved folder on another machine running that companion server; you would then point Alexandrie to that server, and it would store crate tarballs and README pages on that machine on behalf of the registry.
    The point of this is to offer an easy way to outsource crate storage for people who have multiple machines available to them.
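To make the list above concrete, here is a minimal sketch of what a Store-style abstraction over such backends could look like, with a trivial in-memory backend as a reference. The trait and names used here (`Store`, `get_crate`, `store_crate`, `MemStore`) are hypothetical and do not reflect Alexandrie's actual trait:

```rust
use std::collections::HashMap;
use std::io;

// Hypothetical sketch of a `Store`-like trait; the real Alexandrie
// trait may differ in names and signatures.
pub trait Store {
    /// Retrieve the tarball for `name` at `version`.
    fn get_crate(&self, name: &str, version: &str) -> io::Result<Vec<u8>>;
    /// Store the tarball for `name` at `version`.
    fn store_crate(&mut self, name: &str, version: &str, data: Vec<u8>) -> io::Result<()>;
}

/// A trivial in-memory backend, useful as a reference implementation.
pub struct MemStore {
    blobs: HashMap<String, Vec<u8>>,
}

impl MemStore {
    pub fn new() -> Self {
        MemStore { blobs: HashMap::new() }
    }

    fn key(name: &str, version: &str) -> String {
        format!("{}/{}", name, version)
    }
}

impl Store for MemStore {
    fn get_crate(&self, name: &str, version: &str) -> io::Result<Vec<u8>> {
        self.blobs
            .get(&Self::key(name, version))
            .cloned()
            .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, "no such crate"))
    }

    fn store_crate(&mut self, name: &str, version: &str, data: Vec<u8>) -> io::Result<()> {
        self.blobs.insert(Self::key(name, version), data);
        Ok(())
    }
}
```

An S3Storage or RemoteStorage backend would implement the same trait against its remote service instead of a local map.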
@Hirevo Hirevo added C-enhancement Category: Enhancement good first issue Good for newcomers help wanted Extra attention is needed P-medium Priority: Medium labels Oct 13, 2019
@Hirevo Hirevo added this to the Core feature-set milestone Oct 13, 2019
@Hirevo Hirevo pinned this issue Oct 13, 2019
@Hirevo Hirevo changed the title from "Add more Store implementors" to "Add more Store implementations" Oct 13, 2019
@danieleades
Contributor

I think using Artifactory for binary storage would be useful for a lot of enterprise users.

@jgallagher-cs
Contributor

Hi, and thanks for this project! I'm evaluating it for internal use at my company, and I'd like to add support for S3 storage. I've started by trying to use rusoto_s3. Implementing the "get" methods is no problem, but implementing the "store" methods seems to be pretty painful:

  • The put_object method wants a StreamingBody as its payload.
  • Constructing a StreamingBody (which is a type alias for rusoto_core::ByteStream) appears to need a Stream<Item = Result<bytes::Bytes, std::io::Error>> + Send + Sync + 'static. That already seems difficult to satisfy with the current store type bound of data: T where T: io::Read, both because of the inherent pain of converting a blocking reader into a stream and because the extra Send + Sync + 'static bounds are missing. But on top of that...
  • Apparently S3 puts actually require you to know the content size up front. So in addition to the type-bound restrictions and the sync-to-async conversion, we also have to know the full size ahead of time.

It seems like the only way to satisfy that is to read the full data into a Vec<u8>, which can trivially be converted into a ByteStream. I found another S3 library that has a method taking a blocking std::io::Read for puts, and that's exactly what it does.

Looking higher up the call stack, the store methods are called from alexandrie/src/crates/publish.rs, and it looks like the callers have already read the full data into memory. So passing a T: Read down to the store methods, which would read it into a new Vec, would just be making a copy. That doesn't seem terrible, but it doesn't seem ideal either. It looks like with a small amount of work, the caller could pass an owned Vec<u8> down to the store methods, which would avoid the double copy.

Do you have any interest in changing the store methods to take a Vec<u8> instead of a T: Read? Or would you rather keep T: Read but need to make a copy when storing to S3? (Or if you see another avenue out of this, that'd be great too!)
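The buffering workaround described above can be sketched as follows. `buffer_payload` is a hypothetical helper, not part of Alexandrie or rusoto; it shows why a `T: Read` parameter forces a full in-memory copy before an S3 put can even be attempted:

```rust
use std::io::{self, Read};

// With a `T: Read` parameter, an S3 backend has to buffer the whole
// payload before it can build a streaming body: rusoto's `ByteStream`
// implements `From<Vec<u8>>`, and S3 needs the content size up front.
fn buffer_payload<T: Read>(mut data: T) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    data.read_to_end(&mut buf)?; // full in-memory copy; size now known
    // with rusoto, one would then do:
    // let body = rusoto_core::ByteStream::from(buf);
    Ok(buf)
}
```

If the caller already holds a `Vec<u8>`, passing it down directly makes this copy unnecessary.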

@Hirevo
Owner Author

Hirevo commented Jul 17, 2020

@jgallagher-cs Hello, and thank you for the interest and willingness to help!
It indeed would be really cool to have the ability to leverage S3 as a storage option.

Reading your comment made me realize that the Storage abstraction is trying to do more than what is actually needed from it.
You're completely right that we have no use for streaming bytes into the storage without having buffered the entire crate already.
I most likely made the mistake of adding this trait method as a way-too-early guess that it could be useful one day. But, as you've shown, even if streaming were needed or desired, the current implementation is not the right one.

I am in favor of changing the API as you suggested, by simply taking a Vec<u8>.
I think it models better what the registry really does and needs.

@jgallagher-cs
Contributor

👍 Great, I'll make that change.

Do you want S3 support gated behind a Cargo feature? It ends up pulling in quite a few dependencies, so I would assume "yes", but features don't really play nicely with workspaces (e.g., rust-lang/cargo#7507). The docs for building with S3 support would be something like

cd alexandrie
cargo build --release --features s3

:(
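Assuming the feature is wired up the usual way for optional dependencies, the Cargo.toml side could look roughly like this (the crate versions and feature name below are illustrative only, not Alexandrie's actual manifest):

```toml
# Hypothetical sketch: gating the S3 backend behind a Cargo feature.
[dependencies]
rusoto_core = { version = "0.44", optional = true }
rusoto_s3 = { version = "0.44", optional = true }

[features]
# Enabling `s3` pulls in the optional rusoto dependencies.
s3 = ["rusoto_core", "rusoto_s3"]
```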

@Hirevo
Owner Author

Hirevo commented Jul 23, 2020

I would tend to be in favor of putting it behind a feature flag.
Since the user already has to choose a database implementation using a feature flag, I would say that is a price we have already paid.
Also, I think that handling features in workspaces is going to improve in usability in the future, according to this issue.

jgallagher-cs added a commit to jgallagher-cs/alexandrie that referenced this issue Jul 24, 2020
This is primarily to simplify the S3 implementation (which would
otherwise have to read the data into a local Vec, effectively making an
in-memory copy since we already had the data in memory anyway). See
discussion at
Hirevo#9 (comment).
@Xuanwo

Xuanwo commented Sep 18, 2023

Hi, I'm the committer of OpenDAL, a data access layer that allows users to easily and efficiently retrieve data from various storage services in a unified way.

Do you think it's a good idea to integrate OpenDAL as a Store implementation, so that users can use s3, azblob, gcs, fs, or hdfs as the storage backend?
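For illustration, such an integration could route reads and writes through an OpenDAL Operator, with the service choice (S3, Azblob, Gcs, Fs, Hdfs, ...) driven by configuration. This is only a sketch under that assumption: the builder method names follow OpenDAL's documented API but vary between releases, and the paths shown are made up, so treat it as illustrative rather than working code:

```rust
// Hypothetical sketch of backing a Store implementation with OpenDAL.
// The `Fs` service is used here for simplicity; swapping in `S3` or
// another service would leave the read/write code unchanged.
use opendal::{services, Operator};

async fn demo() -> opendal::Result<()> {
    // Builder configuration is illustrative; exact methods depend on
    // the OpenDAL release in use.
    let builder = services::Fs::default().root("/tmp/alexandrie-storage");
    let op = Operator::new(builder)?.finish();

    // Store and retrieve a crate tarball through the unified API.
    op.write("crates/serde/1.0.0.crate", b"tarball".to_vec()).await?;
    let _data = op.read("crates/serde/1.0.0.crate").await?;
    Ok(())
}
```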

Projects
None yet
Development

No branches or pull requests

4 participants