
[WIP] refactor DataAccess to DataSource #1967

Draft
wants to merge 12 commits into base: dev
Conversation

amitupreti
Contributor

Context:
As the first step to solving #1927, we want to refactor the current code to centralize all of the project's data access mechanisms through DataSource. The goal of this PR is to refactor the current project to use DataSource without changing the access logic.

List of expected changes

  1. Create DataSource model

  2. Refactor code to use DataSource and control all access mechanisms through DataSource
    2.1 Direct Access (files stored on the server)
    2.2 Research Environment Access (all other access disabled; users can only create the Research Environment)
    2.3 GCP, AWS (WIP)

  3. Create a migration script to move the access location from DataAccess to DataSource for all existing projects (Pending):
    a. direct access
    b. RE
    c. AWS, GCP

@amitupreti
Contributor Author

amitupreti commented Apr 10, 2023

@kshalot opened for early feedback.

Currently, I did a rough refactor to control access to project files through direct access (files on the server) and the research environment (GCS). I wanted to see if I am on the right track or totally off-course :D

Please review the commits serially, otherwise it might be confusing to review.

@amitupreti amitupreti force-pushed the au/feature/refactor_data_access branch from a7a7f42 to 4f8686c Compare April 10, 2023 23:35
Member

@kshalot kshalot left a comment


A few initial comments

Comment on lines 65 to 67
projects_with_research_environment = DataSource.objects.filter(
access_mechanism=DataSource.AccessMechanism.RESEARCH_ENVIRONMENT,
project=self).values_list('project_id', flat=True)
Member

I don't think this is the way we want to go - let's not make any assumptions regarding which data source is expected. Right now we still operate with the assumption that being authorized to view a project's files means that you can access all of its data sources. Therefore, the accessible_by and is_authorized methods should simply return the list of accessible projects (or True in case of is_authorized). We can then use the DataSource relation to only choose the relevant datasets.

So research environments would get the list of all accessible projects using a generic method and simply filter out the irrelevant ones. We could also create more methods, e.g.

def can_access_directly(): ...
def can_create_research_environment(): ...

This will be even more apparent if we decide to refactor the authorization to, for example, the access grant model (see #1927 (comment)).
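The split described here (a generic authorization check plus thin per-mechanism helpers) could be sketched like this. Plain dataclasses stand in for the actual Django models; the helper names are the ones suggested in the comment, and everything else is illustrative:

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for the Django models, just to illustrate the
# split: is_authorized stays generic, and thin helpers combine it with
# the DataSource relation.

@dataclass
class DataSource:
    access_mechanism: str  # e.g. "direct" or "research-environment"

@dataclass
class Project:
    data_sources: list = field(default_factory=list)

    def is_authorized(self, user) -> bool:
        # Generic check only: event, DUA, credentialing, etc.
        # Deliberately knows nothing about data sources.
        return bool(user.get("credentialed"))

    def _has_source(self, mechanism) -> bool:
        return any(ds.access_mechanism == mechanism for ds in self.data_sources)

    def can_access_directly(self, user) -> bool:
        return self.is_authorized(user) and self._has_source("direct")

    def can_create_research_environment(self, user) -> bool:
        return self.is_authorized(user) and self._has_source("research-environment")

project = Project(data_sources=[DataSource("research-environment")])
user = {"credentialed": True}
print(project.can_access_directly(user))              # False
print(project.can_create_research_environment(user))  # True
```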

Member

Which case is this covering? Why do we need to check whether a source exists to check whether a project is accessible by a user? This just makes this method less generic, since it only makes sense in case of HDN.

I think my previous point still stands - we should only check access mechanisms here (event, DUA, credentialing etc.) and not sources.

Member

Looking at the can_create_research_environment part of the code, this looks like a leftover from some old logic

Contributor Author

Yes, I agree. This method should be generic, and we should create separate methods that use accessible_by and is_authorized to check access via a specific data access mechanism.

I see your point: the accessible_by method should only give us the projects that the user has access to. This method should not care what access mechanisms are set for the project. Also, I appreciate your question; it is super helpful to think about the design, and I will remember to ask myself the same while working on other functions.

physionet-django/physionet/settings/base.py (outdated, resolved)
physionet-django/project/modelcomponents/access.py (outdated, resolved)
physionet-django/project/projectfiles/gcs.py (outdated, resolved)
physionet-django/project/modelcomponents/access.py (outdated, resolved)
@amitupreti
Contributor Author

> A few initial comments

Thank you @kshalot, I super appreciate the feedback. Let me get on it right away.

Member

@kshalot kshalot left a comment

@amitupreti I left a few comments. The base looks good, but what I think would be very helpful here is if we could see how this new DataSource plays into the existing codebase, i.e. trying to use this model to control whether a project's files are visible etc. (instead of the way it is done now, which uses the allow_file_downloads boolean).

My comments might be a bit all over the place - if you encounter any trouble planning the next steps here then let me know. We can jump on a call and figure this out together. Whatever works for you!

Comment on lines +702 to +704
if not settings.ENABLE_CLOUD_RESEARCH_ENVIRONMENTS:
self.fields['access_mechanism'].choices = [
choice for choice in self.fields['access_mechanism'].choices if choice[0] != 'research-environment']
Member

I'm wondering whether we want to care about this being enabled or not... It makes sense and it would be my first instinct to do so, but right now it seems that research environments are singled out as something that can be present or not.

The HDN deployment is not integrated with any other data source other than local/research environment, so in that case it makes no sense showing anything other than that. Research environments are the only ones that have an explicit setting to disable them, but still.

I guess it doesn't hurt to have this, it just got me thinking!

Contributor Author

I can see what you mean. It makes sense that for an HDN-like deployment, we might want to have just the local/research environment and disable other stuff.

From what I understand, we might want to allow deployments to have any combination of access mechanisms,
so what if we have a configurable env var like
ACCESS_MECHANISMS = 'research-environment,direct,aws-s3'
and use that?

Or maybe, since this is something only the Project Editor or a console admin can do, we might be okay with showing them all the options and trusting them to use the right ones.
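The configurable-env idea could look something like this. ACCESS_MECHANISMS is a hypothetical setting (not one that exists in the codebase), and the mechanism keys are illustrative, not the model's actual choices:

```python
import os

# ACCESS_MECHANISMS is the hypothetical env var floated above; the
# mechanism keys are illustrative, not the model's actual choices.
ALL_CHOICES = [
    ("direct", "Direct"),
    ("research-environment", "Research Environment"),
    ("aws-s3", "AWS S3"),
]

def filter_choices(choices, raw):
    """Keep only the choices whose key appears in the comma-separated value."""
    enabled = {m.strip() for m in raw.split(",") if m.strip()}
    return [c for c in choices if c[0] in enabled]

raw = os.environ.get("ACCESS_MECHANISMS", "research-environment,direct")
print(filter_choices(ALL_CHOICES, raw))
```

A form's `__init__` could then assign the filtered list to its `access_mechanism` field choices, the same way the snippet above does for research environments.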


project = models.ForeignKey('project.PublishedProject',
related_name='data_sources', db_index=True, on_delete=models.CASCADE)
files_available = models.BooleanField(default=False)
Member

I'm going to go back to this to give it some more thought - the files_available boolean is dubious now, seeing it in practice. It's a bit confusing what it's supposed to achieve unless you were part of the previous conversations 🤔

I think it's going to be more clear once we see it actually used in practice, i.e. replace the current DataAccess with DataSource. Originally the idea of this boolean was to denote the source that gives us access to the file browser. I think at this point we could not care about this and assume only the direct source gives us direct file download access. We could easily go back to this and make it more generic in the future.

The point is that in theory the GCS source could also give us access to files, since we are using buckets in HDN to store the projects, with the exact same UX as PhysioNet has for direct upload. But right now the mechanisms are tightly coupled with their specific use cases, so maybe we could just not care about it for now. This would mean we wouldn't initially need the files_available bool.

physionet-django/physionet/settings/base.py (outdated, resolved)
physionet-django/project/projectfiles/gcs.py (outdated, resolved)
physionet-django/project/projectfiles/local.py (outdated, resolved)
@amitupreti
Contributor Author

> @amitupreti I left a few comments. The base looks good, but what I think would be very helpful here is if we could see how this new DataSource plays into the existing codebase, i.e. trying to use this model to control whether a project's files are visible etc. (instead of the way it is done now, which uses the allow_file_downloads boolean).
>
> My comments might be a bit all over the place - if you encounter any trouble planning the next steps here then let me know. We can jump on a call and figure this out together. Whatever works for you!

Thank you. This is super helpful

@amitupreti amitupreti force-pushed the au/feature/refactor_data_access branch from 4f8686c to 17c3791 Compare April 21, 2023 20:23
@amitupreti
Contributor Author

Hi @kshalot, I abstracted the creation of the DataSource on project publish into a separate class, which is called after the project is published.

Regarding files_available, I left it as is; it is currently used with direct access (files stored and served from the server). So with direct access and files_available=True, the assumption is that the user can view the files in the browser and also download them with wget, which I think works only for direct access/local. The reason for leaving it is, as you said, later we might want to let users see/access files from a GCS bucket through PhysioNet.

Member

@kshalot kshalot left a comment

@amitupreti I left a few comments. I think we are generally going in the right direction.

Correct me if I'm wrong though - this is still not integrated with the project page, right? I.e. it does not control whether the file tab is shown, whether the S3 message is shown etc?

I think you should at least integrate the direct access with the file viewer so we can see it in action. This is the most important part, because the file viewer is not part of the current DataAccess model (the one which you are refactoring into DataSource).

Controls all access to project data.
"""
class DataLocation(models.TextChoices):
DIRECT = 'DI', 'Direct'
Member

Is there any advantage to using abbreviations like DI or AS3 over DIRECT, AWS_S3, AWS_OPEN_DATA, etc.? I guess it will save a teeny-tiny amount of space, but it makes things a bit incomprehensible when looking at it from the data's perspective.

Comment on lines +203 to +232
if self.data_location == self.DataLocation.GOOGLE_BIGQUERY:
if self.access_mechanism != self.AccessMechanism.GOOGLE_GROUP_EMAIL:
raise ValidationError('Google BigQuery data sources must use the Google Group Email access mechanism.')
if not self.email:
raise ValidationError('Google BigQuery data sources must have an email address.')
elif self.data_location == self.DataLocation.GOOGLE_CLOUD_STORAGE:
if self.access_mechanism != self.AccessMechanism.GOOGLE_GROUP_EMAIL:
raise ValidationError('Google Cloud Storage data sources must use the Google Group Email access '
'mechanism.')
if not self.uri:
raise ValidationError('Google Cloud Storage data sources must have an uri address.')
elif self.data_location == self.DataLocation.AWS_OPEN_DATA:
if self.access_mechanism != self.AccessMechanism.S3:
raise ValidationError('AWS Open Data data sources must use the S3 access mechanism.')
if not self.uri:
raise ValidationError('AWS Open Data data sources must have a URI.')
elif self.data_location == self.DataLocation.AWS_S3:
if self.access_mechanism != self.AccessMechanism.S3:
raise ValidationError('AWS S3 data sources must use the S3 access mechanism.')
if not self.uri:
raise ValidationError('AWS S3 data sources must have a URI.')
elif self.data_location == self.DataLocation.DIRECT:
if self.email:
raise ValidationError('Direct data sources must not have an email address.')
if self.uri:
raise ValidationError('Direct data sources must not have a URI.')
else:
raise ValidationError('Invalid data location.')
Member

This is very hard to follow. You could approach it differently - define what exactly is required for each data source and have a single piece of logic that validates it.

required_fields = {
  self.DataLocation.DIRECT: [],
  self.DataLocation.AWS_S3: ["uri"],
  ...
}

forbidden_fields = {
  self.DataLocation.DIRECT: ["uri", "email"],
  ...
}

for required_field in required_fields[self.data_location]:
  if not getattr(self, required_field): # Or something like that
    raise ValidationError("...")

for forbidden_field in forbidden_fields[self.data_location]:
  if getattr(self, forbidden_field): # Or something like that
    raise ValidationError("...")

Member

Same goes for data access:

data_location_access_mechanisms = {
  self.DataLocation.DIRECT: None
  self.DataLocation.AWS_S3: self.AccessMechanism.AWS_S3,
  ...
}
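Putting the two sketches together, a runnable version of the table-driven clean() might look like this. Plain strings stand in for the model's DataLocation/AccessMechanism choices, ValueError stands in for Django's ValidationError, and the concrete field requirements are illustrative:

```python
from dataclasses import dataclass

# Stand-ins for the model's choices; the requirements below are
# illustrative, not the actual validation rules.
DIRECT, AWS_S3, GCS = "direct", "aws-s3", "gcs"
S3, GOOGLE_GROUP_EMAIL = "s3", "google-group-email"

REQUIRED_FIELDS = {DIRECT: [], AWS_S3: ["uri"], GCS: ["uri"]}
FORBIDDEN_FIELDS = {DIRECT: ["uri", "email"], AWS_S3: ["email"], GCS: []}
EXPECTED_MECHANISM = {DIRECT: None, AWS_S3: S3, GCS: GOOGLE_GROUP_EMAIL}

@dataclass
class Source:
    data_location: str
    access_mechanism: str = None
    uri: str = ""
    email: str = ""

def clean(source):
    if source.data_location not in REQUIRED_FIELDS:
        raise ValueError("Invalid data location.")
    for name in REQUIRED_FIELDS[source.data_location]:
        if not getattr(source, name):
            raise ValueError(f"{source.data_location} data sources must have a {name}.")
    for name in FORBIDDEN_FIELDS[source.data_location]:
        if getattr(source, name):
            raise ValueError(f"{source.data_location} data sources must not have a {name}.")
    expected = EXPECTED_MECHANISM[source.data_location]
    if expected is not None and source.access_mechanism != expected:
        raise ValueError(f"{source.data_location} data sources must use the {expected} access mechanism.")

clean(Source(AWS_S3, access_mechanism=S3, uri="s3://bucket"))  # passes silently
```

Adding a new location then only touches the three dicts, not the branching logic.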

Comment on lines 685 to 690
def __init__(self, **kwargs):
self.data_location = kwargs.get('data_location', None)
self.files_available = kwargs.get('files_available', None)
self.email = kwargs.get('email', None)
self.uri = kwargs.get('uri', None)
self.access_mechanism = kwargs.get('access_mechanism', None)
Member

Hmm, why did you opt for keyword arguments here? Why not:

Suggested change
def __init__(self, **kwargs):
self.data_location = kwargs.get('data_location', None)
self.files_available = kwargs.get('files_available', None)
self.email = kwargs.get('email', None)
self.uri = kwargs.get('uri', None)
self.access_mechanism = kwargs.get('access_mechanism', None)
def __init__(self, data_location=None, files_available=None, email=None, uri=None, access_mechanism=None):
self.data_location = data_location
self.files_available = files_available
self.email = email
self.uri = uri
self.access_mechanism = access_mechanism

uri=self.uri,
)

@staticmethod
Member

I know self is not needed here, but this is called on an instance everywhere (see your uses of this method - you first instantiate the class). Maybe @staticmethod is not necessary here.

files_available=True,
data_location=DataSource.DataLocation.DIRECT,
)
elif (settings.DEFAULT_PROJECT_ACCESS_MECHANISM == DataSource.DataLocation.RESEARCH_ENVIRONMENT
Member

That's too many conditions IMO, it should be extracted somewhere. It's also not clear why the condition is there in the first place. I.e. why do we need all 3 settings to be set to a certain value.

If the settings are that coupled, then maybe we should opt for a different setup. Have one environment variable, e.g. RESEARCH_ENVIRONMENT_DEPLOYMENT that will set all three (DEFAULT_PROJECT_ACCESS_MECHANISM, DEFAULT_PROJECT_DATA_LOCATION and STORAGE_TYPE) if set to True and if set to False it will set the two from the condition above.
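The single-switch setup suggested here could be sketched as follows. The three derived setting names come from the comment; the concrete values ("gcs", "GCP", "direct", "LOCAL") are assumptions for illustration:

```python
# RESEARCH_ENVIRONMENT_DEPLOYMENT is the single switch proposed above;
# the derived values are illustrative, not the codebase's actual ones.
def derive_settings(env):
    if env.get("RESEARCH_ENVIRONMENT_DEPLOYMENT", "False") == "True":
        return {
            "DEFAULT_PROJECT_ACCESS_MECHANISM": "research-environment",
            "DEFAULT_PROJECT_DATA_LOCATION": "gcs",
            "STORAGE_TYPE": "GCP",
        }
    return {
        "DEFAULT_PROJECT_ACCESS_MECHANISM": "direct",
        "DEFAULT_PROJECT_DATA_LOCATION": "direct",
        "STORAGE_TYPE": "LOCAL",
    }

print(derive_settings({"RESEARCH_ENVIRONMENT_DEPLOYMENT": "True"}))
```

With this, the coupled three-way condition collapses into a check on one flag.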

@amitupreti
Contributor Author

> Correct me if I'm wrong though - this is still not integrated with the project page, right? I.e. it does not control whether the file tab is shown, whether the S3 message is shown etc?

I think I had added that part of the integration, i.e. using DataSource to control access in the project in the Direct Download refactor commits. Let me check your comments, make things clear, and write up how everything comes into play in detail.

It's kind of chaotic right now. I will work on it and update the code.

Thanks again

@amitupreti amitupreti marked this pull request as draft April 27, 2023 14:01
Here we are creating the DataSource model, which will work exactly like DataAccess but will let us manage all project-related access through this model. This will make it easier in the future to manage access, as everything will be centralized here.

Similarly, as discussed in MIT-LCP#1927, we can use this model to work with `AccessGrant`.
Context: The goal here is to refactor so that access is controlled through DataSource without changing the existing logic.
The existing logic is: if the project has the research environment enabled, the user should not be able to view/download files (they can only create a research environment and play with the data there).

In this commit, we divided the access into 2 sections (first for RE, then the rest). The outcome should be the same as before: if RE is enabled, show `published_project_denied_downloads.html`, else show the normal file access stuff.
Replaces allow_file_downloads with can_view_files (this controls whether a user has download access to the project, by using the DataSource).

Here the idea is that all the direct download stuff will be inside
```
{% if can_view_files %}
```

and the other access methods like gcp, aws will be outside this block.

So the logic will be
1. If the project is an RE project, show the RE warning message
2. Else
2.1 If the project has other access outside RE and direct download, show that.
2.2 If the project has direct download access, then show the direct download stuff

To make it easier to understand the changes, I am breaking the commits down into very small chunks.

Note that 2.1 and 2.2 don't have to be either/or (currently a project can be directly accessible and also accessible via GCS, AWS, etc.)
The p and h5 text should appear for both direct download and the other GCP/AWS access mechanisms, so bringing them outside the can_view_files if block.
Move the other access method (GCP) outside of the can_view_files if block.

The logic before this commit was: if a project does not have an access policy but has project.gcp, show the GCP stuff.

In this commit, we keep the same logic.
Clean up an unnecessary/empty else statement.
Move the data_access stuff outside the if can_view_files block.
The current logic was: if a user can download the project files, check data_access and show the access method details for the ones that exist. So it should be safe to bring this block outside the if can_view_files block.

Note that, currently, for GCP and AWS we are still using the old data_access methods. We will refactor those in a future commit (the current series of commits refactors the direct download while maintaining the same logic).
When a project is published, we should always create a DataSource object based on the project's storage type, because with DataSource all access will go through it; otherwise the files won't be accessible.

This commit creates the DataSource automatically when a project is published.

The type of DataSource created depends on the settings. For PhysioNet, the default data storage is direct; for HDN it is GCP (defined by STORAGE_TYPE) and the access mechanism is Research Environment.

For the auto-creation of the DataSource, I added 2 new environment variables to control the location and access_mechanism of the DataSource.

Note: the DataSourceCreator doesn't belong in access.py; I had to keep it there for a short time, until we refactor the authorization into a separate app, to avoid a circular import.
It looks like we cannot import app models in settings, because during initialization the settings run before the apps are loaded, which throws an error.
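The auto-creation described in this commit message could be sketched roughly like this. The returned dict stands in for an actual DataSource.objects.create(...) call, and the setting/field values are assumptions based on the commit text, not the actual code:

```python
# Sketch of the DataSourceCreator behavior described above: pick the
# DataSource attributes from settings when a project is published.
# The dict stands in for DataSource.objects.create(...); values are
# assumptions based on the commit message.
def build_data_source(project_id, settings):
    if settings.get("STORAGE_TYPE") == "GCP":
        # HDN-style deployment: GCP storage, Research Environment access.
        return {
            "project_id": project_id,
            "data_location": "gcs",
            "access_mechanism": "research-environment",
            "files_available": False,
        }
    # PhysioNet default: files stored and served directly from the server.
    return {
        "project_id": project_id,
        "data_location": "direct",
        "access_mechanism": None,
        "files_available": True,
    }

print(build_data_source(1, {"STORAGE_TYPE": "GCP"})["access_mechanism"])  # research-environment
```

Keeping this selection in one place is what lets the publish flow stay storage-agnostic.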