Design and implement alternative backend storage mechanism and DatasetType #593

robertbartel · 2024-04-30T19:22:40Z

Design and implement one (or potentially several) alternative backend storage mechanisms other than OBJECT_STORE and MinIO. Also remove the FILESYSTEM DatasetType and related code if this isn't the (or a) direction chosen.

christophertubbs · 2024-04-30T19:46:12Z

I recommend keeping a DatasetType for FILESYSTEM or having a similar ability to push/pull data to the file system as a fallback implementation, such as for local dev or for testing purposes. I would not suggest that this functionality be available in a non-dev environment, however. This will yield a backend storage mechanism that, while not necessarily desirable, will be "usable" in just about any environment. This guarantees that there will almost always be at least one implementation that is usable in a majority of situations, say if an object store is unavailable or network access is unavailable.

Imagine the difference between the default webserver for Django being... Django in dev but using Gunicorn in a deployed environment or using SQLite in dev but using Postgres when deployed.

aaraney · 2024-04-30T19:55:02Z

Yeah, I am kind of opposed to this proposal and to having a FILESYSTEM backing storage type. If the minio Python library does not let us connect to and use other S3-compliant stores, I think we should swap it for something that gives us more flexibility. In my mind, this is analogous to your Django analogy, @christophertubbs. Sure, we might want to change storage service providers, but I don't think constraining ourselves to S3 is really a constraint. It is the SQL of object store APIs.

I would rather mock out a file-system-backed S3 client than add an additional tower of abstractions to support a seemingly limited use case.
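To make the mocking idea concrete, here is a minimal sketch of a file-system-backed stand-in for an object store client. Everything in it is an assumption for illustration: the class name `FileSystemObjectStore` and the `put_object`, `get_object`, and `list_objects` methods are hypothetical and do not mirror the signatures of the minio library or of any DMOD code.

```python
import tempfile
from pathlib import Path
from typing import List


class FileSystemObjectStore:
    """Hypothetical stand-in for an S3-style client; each object is a file on disk."""

    def __init__(self, root: Path) -> None:
        self._root = Path(root)

    def _path(self, bucket: str, key: str) -> Path:
        # Keys may contain "/", so they map onto nested directories under the bucket.
        # Note: this sketch does no validation of keys such as "../x".
        return self._root / bucket / key

    def put_object(self, bucket: str, key: str, data: bytes) -> None:
        path = self._path(bucket, key)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get_object(self, bucket: str, key: str) -> bytes:
        return self._path(bucket, key).read_bytes()

    def list_objects(self, bucket: str) -> List[str]:
        base = self._root / bucket
        if not base.is_dir():
            return []
        # Report keys relative to the bucket, using "/" separators on every platform.
        return sorted(
            p.relative_to(base).as_posix() for p in base.rglob("*") if p.is_file()
        )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = FileSystemObjectStore(Path(tmp))
        store.put_object("datasets", "forcing/2024.csv", b"t,precip\n")
        print(store.list_objects("datasets"))
```

A test double like this only helps in single-node dev or test setups; it does nothing for the multi-node case, where every worker would need to see the same files.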

robertbartel · 2024-04-30T20:57:03Z

@christophertubbs, the big problem with a FILESYSTEM dataset type - the reason it's not already implemented - is the distributed computation. To really start using it, we would have to implement custom synchronization capabilities across the different nodes, either at the service level (much more complicated) or at the worker level. We could implement some sort of dev-only conditional logic instead, but I don't think that ends up being considerably less work than worker-level synchronization. And while I don't really care for that idea, it would bring some benefits.

@aaraney, I'd argue that S3 is technically already an alternative, distinct backing storage type. The necessary pieces to support it are going to be very similar to those for OBJECT_STORE, but I'm fairly sure they are different enough that it needs to be distinguished. Even if there are ways we could avoid doing that, right now we can only have one manager per DatasetType. We'd lose the ability to connect to a local object store. Perhaps removing the one-manager-per-type restriction is really the thing that needs to be changed, but regardless, this keeps S3 support from simply snapping into place without consequences.
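As a rough illustration of what relaxing the one-manager-per-type restriction could look like, the sketch below keys managers by both type and a name. It is not DMOD's actual code: the `ManagerRegistry` class, its methods, the manager names, and the trimmed-down `DatasetType` enum are all hypothetical.

```python
from enum import Enum
from typing import Dict


class DatasetType(Enum):
    # Illustrative members only; not DMOD's actual enum definition.
    OBJECT_STORE = "OBJECT_STORE"
    FILESYSTEM = "FILESYSTEM"


class ManagerRegistry:
    """Hypothetical registry allowing several named managers per DatasetType."""

    def __init__(self) -> None:
        self._managers: Dict[DatasetType, Dict[str, object]] = {}

    def register(self, dataset_type: DatasetType, name: str, manager: object) -> None:
        by_name = self._managers.setdefault(dataset_type, {})
        if name in by_name:
            raise ValueError(f"{dataset_type.name} manager {name!r} already registered")
        by_name[name] = manager

    def get(self, dataset_type: DatasetType, name: str) -> object:
        # Raises KeyError when the type or the name is unknown.
        return self._managers[dataset_type][name]


if __name__ == "__main__":
    registry = ManagerRegistry()
    # Plain objects stand in for real manager instances.
    registry.register(DatasetType.OBJECT_STORE, "local-minio", object())
    registry.register(DatasetType.OBJECT_STORE, "remote-s3", object())
    print(registry.get(DatasetType.OBJECT_STORE, "remote-s3"))
```

With something along these lines, a local object store and a remote S3 endpoint could coexist under one type, at the cost of every dataset needing to record which named manager owns it.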

The main things I had in mind here are something using distributed Swarm volumes (probably using the Container Storage Interface) and S3. I don't necessarily expect this to be done too terribly soon (though we should probably start working to support S3 one way or another in the nearer term); mostly I wanted to make sure it is being tracked.

robertbartel · 2024-05-31T14:12:43Z

I did some looking into options for CSI-based cluster volumes. This looks promising for the future, but it involves standing up (or otherwise having access to) other services. I think an SMB option may be the simplest to embed within a DMOD deployment, though we'll probably want to eventually have NFS also.

robertbartel added the epic (a large, high-level task composed of (sub-epic) tasks) and maas (MaaS Workstream) labels on Apr 30, 2024

robertbartel mentioned this issue May 29, 2024

Investigate IO performance issue with workers and object-store-backed datasets #637

Open
