
refactor data access part 1 models validators [Please do not merge] #2007

Draft · wants to merge 3 commits into base: dev
Conversation

amitupreti
Contributor

Context:
We are breaking PR #1967 into smaller PRs (easier to review and work on). This branch is expected to merge into #1967, not dev.

This PR introduces the DataAccess model and validators

Quick summary about the model: the DataSource model should be used to decide where a project's files are
stored (determined by data_location) and how they can be accessed (determined by access_mechanism).
A single project can have multiple DataSources.

About the fields
files_available - determines whether the files can be viewed/downloaded for the given type of data source. (@kshalot had notes about this field here: #1967 (comment))

email - For GCP group access, this would store the email of the group.

uri - The URI for the data on the external service. For s3 this would be of the form s3://<bucket_name>, for gsutil this would be of the form gs://<bucket_name>
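To make the field descriptions above concrete, here is a minimal, framework-agnostic sketch of the shape of a DataSource record. The enum members and field names follow the description in this PR, but the exact names and types in the real (Django) model may differ.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DataLocation(Enum):
    # Illustrative values; the PR's actual choices/constants may differ.
    DIRECT = "direct"
    GOOGLE_BIGQUERY = "google_bigquery"
    GOOGLE_CLOUD_STORAGE = "google_cloud_storage"
    AWS_OPEN_DATA = "aws_open_data"
    AWS_S3 = "aws_s3"


class AccessMechanism(Enum):
    GOOGLE_GROUP_EMAIL = "google_group_email"
    S3 = "s3"
    RESEARCH_ENVIRONMENT = "research_environment"


@dataclass
class DataSource:
    """One storage location + access mechanism for a project's files.

    A project can have multiple DataSources.
    """
    project_id: int
    data_location: DataLocation
    access_mechanism: Optional[AccessMechanism] = None
    files_available: bool = False      # can files be viewed/downloaded?
    email: Optional[str] = None        # GCP group email, for group access
    uri: Optional[str] = None          # e.g. s3://<bucket_name> or gs://<bucket_name>
```

A project stored on GCS and accessed via a Google group would then be, for example, `DataSource(project_id=1, data_location=DataLocation.GOOGLE_CLOUD_STORAGE, access_mechanism=AccessMechanism.GOOGLE_GROUP_EMAIL, email="group@example.org", uri="gs://some-bucket")`.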

Quick Summary about validators

The validation is based on four aspects: required fields, forbidden fields, required access mechanisms, and forbidden access mechanisms.

  1. Required Fields: For each data location (such as Google BigQuery, Google Cloud Storage, AWS Open Data, and AWS S3), certain fields must be present. For instance, Google BigQuery requires an 'email', while Google Cloud Storage, AWS Open Data, and AWS S3 require a 'uri'. If a required field is missing, a validation error is raised.

  2. Forbidden Fields: Conversely, for certain data locations, some fields must not be present. For example, for 'Direct' data location, 'uri' and 'email' fields should not be present. If they are found, a validation error is raised.

  3. Required Access Mechanisms: Each data location may also require one of several specified access mechanisms. For instance, Google BigQuery and Google Cloud Storage can require either a 'Google Group Email' or a 'Research Environment' access mechanism, while AWS Open Data and AWS S3 require an 'S3' access mechanism. If none of the acceptable access mechanisms are found, a validation error is raised.

  4. Forbidden Access Mechanisms: Finally, some data locations forbid certain access mechanisms. Specifically, the 'Direct' data location forbids the 'Google Group Email', 'S3', and 'Research Environment' access mechanisms. If any of these are present, a validation error is raised.
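The four rule types above can be sketched as lookup tables keyed by data location, plus a single validator that checks them in order. This is a hedged, plain-Python illustration of the logic described; the string keys and the helper name `validate_data_source` are assumptions for this sketch, not the PR's actual API (the real code raises Django validation errors).

```python
# Rule tables, keyed by data location (string stand-ins for the model's choices).
REQUIRED_FIELDS = {
    "google_bigquery": {"email"},
    "google_cloud_storage": {"uri"},
    "aws_open_data": {"uri"},
    "aws_s3": {"uri"},
}
FORBIDDEN_FIELDS = {
    "direct": {"uri", "email"},
}
REQUIRED_MECHANISMS = {
    "google_bigquery": {"google_group_email", "research_environment"},
    "google_cloud_storage": {"google_group_email", "research_environment"},
    "aws_open_data": {"s3"},
    "aws_s3": {"s3"},
}
FORBIDDEN_MECHANISMS = {
    "direct": {"google_group_email", "s3", "research_environment"},
}


def validate_data_source(data_location, access_mechanism=None, **fields):
    """Raise ValueError if the field/mechanism rules for data_location are violated."""
    # 1. Required fields must be present and non-empty.
    for name in REQUIRED_FIELDS.get(data_location, set()):
        if not fields.get(name):
            raise ValueError(f"{data_location} requires the '{name}' field")
    # 2. Forbidden fields must be absent.
    for name in FORBIDDEN_FIELDS.get(data_location, set()):
        if fields.get(name):
            raise ValueError(f"{data_location} forbids the '{name}' field")
    # 3. One of the acceptable access mechanisms must be used.
    allowed = REQUIRED_MECHANISMS.get(data_location)
    if allowed and access_mechanism not in allowed:
        raise ValueError(f"{data_location} requires one of: {sorted(allowed)}")
    # 4. Forbidden access mechanisms must not be used.
    if access_mechanism in FORBIDDEN_MECHANISMS.get(data_location, set()):
        raise ValueError(f"{data_location} forbids the '{access_mechanism}' mechanism")
```

For example, `validate_data_source("direct", uri="s3://bucket")` would raise, because the Direct location forbids the `uri` field.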

Quick note about the interface
This is so that we can quickly test that the validators work, and create data sources.

 Discussed in [Issue MIT-LCP#1987](MIT-LCP#1927 (comment)),

 Quick summary: the DataSource model will decide where a project's files
 are stored (determined by `data_location`) and how they can be
 accessed (determined by `access_mechanism`). A single project
 can have multiple DataSources.

 About the fields
 `files_available` - determines if the files can be
 viewed/downloaded for the given type of datasource.

 `email` - For GCP group access, this would store the email of the group.

`uri` - The URI for the data on the external service.
For s3 this would be of the form s3://<bucket_name>,
for gsutil this would be of the form gs://<bucket_name>
Data Location and AccessMechanism are tightly coupled.

For example, as of now Research Environment access is only implemented
for GOOGLE_CLOUD_STORAGE, and Direct data access is only available for
resources that are stored directly on the server.

The validator will first check that the appropriate fields are provided
for the given data location type, and then check that an expected
access mechanism is used for that data location.
Note: this does not upload the data to the location.
Currently the upload is expected to be done separately,
before adding the data source.
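As a small illustration of the URI forms mentioned above (`s3://<bucket_name>` for S3 and `gs://<bucket_name>` for gsutil), a scheme check might look like the following. `URI_SCHEMES` and `uri_has_expected_scheme` are hypothetical names for this sketch only and are not part of the PR.

```python
# Expected URI prefix per data location (illustrative; not the PR's actual table).
URI_SCHEMES = {
    "aws_s3": "s3://",
    "aws_open_data": "s3://",
    "google_cloud_storage": "gs://",
}


def uri_has_expected_scheme(data_location, uri):
    """Return True if the URI starts with the scheme expected for data_location.

    Locations without an entry in URI_SCHEMES (e.g. direct storage) are not checked.
    """
    prefix = URI_SCHEMES.get(data_location)
    return prefix is None or uri.startswith(prefix)
```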
@amitupreti amitupreti requested a review from kshalot May 17, 2023 20:30
@amitupreti amitupreti marked this pull request as draft May 17, 2023 20:31