
[WIP] refactor DataAccess to DataSource #1967

Draft
wants to merge 12 commits into base: dev
Conversation

amitupreti
Contributor

Context:
As the first step to solving #1927, we want to refactor the current code to centralize all of the project's data access mechanisms through DataSource. The goal of this PR is to refactor the current project to use DataSource without changing the access logic.

List of expected changes

  1. Create DataSource model

  2. Refactor code to use DataSource and control all access mechanisms through DataSource
    2.1 Direct Access (files stored on the server)
    2.2 Research Environment Access (all other access disabled; users can only create the Research Environment)
    2.3 GCP, AWS (WIP)

  3. Create a migration script to move the access location from DataAccess to DataSource for all existing projects (Pending):
    a. direct access
    b. RE
    c. AWS, GCP

@amitupreti
Contributor Author

amitupreti commented Apr 10, 2023

@kshalot opened for early feedback.

Currently, I did a rough refactor to control access to project files through direct access (files on the server) and the research environment (GCS). I wanted to see if I am on the right track or totally off-course :D

Please review the commits serially, otherwise it might be confusing to review.

@amitupreti amitupreti force-pushed the au/feature/refactor_data_access branch from a7a7f42 to 4f8686c Compare April 10, 2023 23:35
Member

@kshalot kshalot left a comment


A few initial comments

Comment on lines 65 to 67
projects_with_research_environment = DataSource.objects.filter(
access_mechanism=DataSource.AccessMechanism.RESEARCH_ENVIRONMENT,
project=self).values_list('project_id', flat=True)
Member

I don't think this is the way we want to go - let's not make any assumptions regarding which data source is expected. Right now we still operate with the assumption that being authorized to view a project's files means that you can access all of its data sources. Therefore, the accessible_by and is_authorized methods should simply return the list of accessible projects (or True in case of is_authorized). We can then use the DataSource relation to only choose the relevant datasets.

So research environments would get the list of all accessible projects using a generic method and simply filter out the irrelevant ones. We could also create more methods, e.g.

def can_access_directly(): ...
def can_create_research_environment(): ...

This will be even more apparent if we decide to refactor the authorization to, for example, the access grant model (see #1927 (comment)).
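The split described here (a generic authorization check plus thin per-mechanism helpers) could be sketched like this. Plain dataclasses stand in for the actual Django models; the helper names are the ones suggested in the comment, and everything else is illustrative:

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for the Django models, just to illustrate the
# split: is_authorized stays generic, and thin helpers combine it with
# the DataSource relation.

@dataclass
class DataSource:
    access_mechanism: str  # e.g. "direct" or "research-environment"

@dataclass
class Project:
    data_sources: list = field(default_factory=list)

    def is_authorized(self, user) -> bool:
        # Generic check only: event, DUA, credentialing, etc.
        # Deliberately knows nothing about data sources.
        return bool(user.get("credentialed"))

    def _has_source(self, mechanism) -> bool:
        return any(ds.access_mechanism == mechanism for ds in self.data_sources)

    def can_access_directly(self, user) -> bool:
        return self.is_authorized(user) and self._has_source("direct")

    def can_create_research_environment(self, user) -> bool:
        return self.is_authorized(user) and self._has_source("research-environment")

project = Project(data_sources=[DataSource("research-environment")])
user = {"credentialed": True}
print(project.can_access_directly(user))              # False
print(project.can_create_research_environment(user))  # True
```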

Member

Which case is this covering? Why do we need to check whether a source exists to check whether a project is accessible by a user? This just makes this method less generic, since it only makes sense in case of HDN.

I think my previous point still stands - we should only check access mechanisms here (event, DUA, credentialing etc.) and not sources.

Member

Looking at the can_create_research_environment part of the code, this looks like a leftover from some old logic

Contributor Author

Yes, I agree. This method should be generic, and we should create separate methods that use accessible_by and is_authorized to check access via a specific data access mechanism.

I see your point: the accessible_by method should only give us the projects that the user has access to. This method should not care what access mechanisms are set for the project. Also, I appreciate your question; it is super helpful to think about the design, and I will remember to ask myself the same while working on other functions.

physionet-django/physionet/settings/base.py (outdated, resolved)
physionet-django/project/modelcomponents/access.py (outdated, resolved)
physionet-django/project/projectfiles/gcs.py (outdated, resolved)
physionet-django/project/modelcomponents/access.py (outdated, resolved)
@amitupreti
Contributor Author

> A few initial comments

Thank you @kshalot, I super appreciate the feedback. Let me get on it right away.

Member

@kshalot kshalot left a comment

@amitupreti I left a few comments. The base looks good, but what I think would be very helpful here is if we could see how this new DataSource plays into the existing codebase, i.e. trying to use this model to control whether a project's files are visible etc. (instead of the way it is done now, which uses the allow_file_downloads boolean).

My comments might be a bit all over the place - if you encounter any trouble planning the next steps here then let me know. We can jump on a call and figure this out together. Whatever works for you!

Comment on lines +702 to +704
if not settings.ENABLE_CLOUD_RESEARCH_ENVIRONMENTS:
self.fields['access_mechanism'].choices = [
choice for choice in self.fields['access_mechanism'].choices if choice[0] != 'research-environment']
Member

I'm wondering whether we want to care about this being enabled or not... It makes sense and it would be my first instinct to do so, but right now it seems that research environments are singled out as something that can be present or not.

The HDN deployment is not integrated with any other data source other than local/research environment, so in that case it makes no sense showing anything other than that. Research environments are the only ones that have an explicit setting to disable them, but still.

I guess it doesn't hurt to have this, it just got me thinking!

Contributor Author

I can see what you mean. It makes sense that for an HDN-like deployment, we might want to have just the local/research environment and disable other stuff.

From what I understand, we might want to allow deployments to have any combination of access mechanisms,
so what if we have a configurable env var like
ACCESS_MECHANISMS = 'research-environment,direct,aws-s3'
and use that?

Or maybe, since this is something only the Project Editor or a console admin can do, we might be okay with showing them all the options and trusting them to use the right ones.
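The configurable-env idea could look something like this. ACCESS_MECHANISMS is a hypothetical setting (not one that exists in the codebase), and the mechanism keys are illustrative, not the model's actual choices:

```python
import os

# ACCESS_MECHANISMS is the hypothetical env var floated above; the
# mechanism keys are illustrative, not the model's actual choices.
ALL_CHOICES = [
    ("direct", "Direct"),
    ("research-environment", "Research Environment"),
    ("aws-s3", "AWS S3"),
]

def filter_choices(choices, raw):
    """Keep only the choices whose key appears in the comma-separated value."""
    enabled = {m.strip() for m in raw.split(",") if m.strip()}
    return [c for c in choices if c[0] in enabled]

raw = os.environ.get("ACCESS_MECHANISMS", "research-environment,direct")
print(filter_choices(ALL_CHOICES, raw))
```

A form's `__init__` could then assign the filtered list to its `access_mechanism` field choices, the same way the snippet above does for research environments.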


project = models.ForeignKey('project.PublishedProject',
related_name='data_sources', db_index=True, on_delete=models.CASCADE)
files_available = models.BooleanField(default=False)
Member

I'm going to go back to this to give it some more thought - the files_available boolean is dubious now, seeing it in practice. It's a bit confusing what it's supposed to achieve unless you were part of the previous conversations 🤔

I think it's going to be more clear once we see it actually used in practice, i.e. replace the current DataAccess with DataSource. Originally the idea of this boolean was to denote the source that gives us access to the file browser. I think at this point we could not care about this and assume only the direct source gives us direct file download access. We could easily go back to this and make it more generic in the future.

The point is that in theory the GCS source could also give us access to files, since we are using buckets in HDN to store the projects, with the exact same UX as PhysioNet has for direct upload. But right now the mechanisms are tightly coupled with their specific use cases, so maybe we could just not care about it for now. This would mean we wouldn't initially need the files_available bool.

physionet-django/physionet/settings/base.py (outdated, resolved)
physionet-django/project/projectfiles/gcs.py (outdated, resolved)
physionet-django/project/projectfiles/local.py (outdated, resolved)
@amitupreti
Contributor Author

> @amitupreti I left a few comments. The base looks good, but what I think would be very helpful here is if we could see how this new DataSource plays into the existing codebase, i.e. trying to use this model to control whether a project's files are visible etc. (instead of the way it is done now, which uses the allow_file_downloads boolean).
>
> My comments might be a bit all over the place - if you encounter any trouble planning the next steps here then let me know. We can jump on a call and figure this out together. Whatever works for you!

Thank you. This is super helpful

@amitupreti amitupreti force-pushed the au/feature/refactor_data_access branch from 4f8686c to 17c3791 Compare April 21, 2023 20:23
@amitupreti
Contributor Author

Hi @kshalot, I abstracted the creation of the DataSource on project publish into a separate class, which is called after the project is published.

Regarding files_available, I left it as is; it is currently used with direct access (files stored and served from the server). So with direct access and files_available=True, the assumption is that the user can view the files in the browser and also download them with wget, which I think works only for direct access/local. The reason for leaving it is, as you said, later we might want to let users see/access files from a GCS bucket through PhysioNet.

Member

@kshalot kshalot left a comment

@amitupreti I left a few comments. I think we are generally going in the right direction.

Correct me if I'm wrong though - this is still not integrated with the project page, right? I.e. it does not control whether the file tab is shown, whether the S3 message is shown etc?

I think you should at least integrate the direct access with the file viewer so we can see it in action. This is the most important part, because the file viewer is not part of the current DataAccess model (the one which you are refactoring into DataSource).

Controls all access to project data.
"""
class DataLocation(models.TextChoices):
DIRECT = 'DI', 'Direct'
Member

Is there any advantage to using abbreviations like DI or AS3 over DIRECT, AWS_S3, AWS_OPEN_DATA, etc.? I guess it will save a teeny-tiny amount of space, but it makes things a bit incomprehensible when looking at it from the data's perspective.

Comment on lines +203 to +232
if self.data_location == self.DataLocation.GOOGLE_BIGQUERY:
if self.access_mechanism != self.AccessMechanism.GOOGLE_GROUP_EMAIL:
raise ValidationError('Google BigQuery data sources must use the Google Group Email access mechanism.')
if not self.email:
raise ValidationError('Google BigQuery data sources must have an email address.')
elif self.data_location == self.DataLocation.GOOGLE_CLOUD_STORAGE:
if self.access_mechanism != self.AccessMechanism.GOOGLE_GROUP_EMAIL:
raise ValidationError('Google Cloud Storage data sources must use the Google Group Email access '
'mechanism.')
if not self.uri:
raise ValidationError('Google Cloud Storage data sources must have an uri address.')
elif self.data_location == self.DataLocation.AWS_OPEN_DATA:
if self.access_mechanism != self.AccessMechanism.S3:
raise ValidationError('AWS Open Data data sources must use the S3 access mechanism.')
if not self.uri:
raise ValidationError('AWS Open Data data sources must have a URI.')
elif self.data_location == self.DataLocation.AWS_S3:
if self.access_mechanism != self.AccessMechanism.S3:
raise ValidationError('AWS S3 data sources must use the S3 access mechanism.')
if not self.uri:
raise ValidationError('AWS S3 data sources must have a URI.')
elif self.data_location == self.DataLocation.DIRECT:
if self.email:
raise ValidationError('Direct data sources must not have an email address.')
if self.uri:
raise ValidationError('Direct data sources must not have a URI.')
else:
raise ValidationError('Invalid data location.')
Member

This is very hard to follow. You could approach it differently - define what exactly is required for each data source and have a single piece of logic that validates it.

required_fields = {
  self.DataLocation.DIRECT: [],
  self.DataLocation.AWS_S3: ["uri"],
  ...
}

forbidden_fields = {
  self.DataLocation.DIRECT: ["uri", "email"],
  ...
}

for required_field in required_fields[self.data_location]:
  if not getattr(self, required_field): # Or something like that
    raise ValidationError("...")

for forbidden_field in forbidden_fields[self.data_location]:
  if getattr(self, forbidden_field): # Or something like that
    raise ValidationError("...")

Member

Same goes for data access:

data_location_access_mechanisms = {
  self.DataLocation.DIRECT: None
  self.DataLocation.AWS_S3: self.AccessMechanism.AWS_S3,
  ...
}
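Putting the two sketches together, a runnable version of the table-driven clean() might look like this. Plain strings stand in for the model's DataLocation/AccessMechanism choices, ValueError stands in for Django's ValidationError, and the concrete field requirements are illustrative:

```python
from dataclasses import dataclass

# Stand-ins for the model's choices; the requirements below are
# illustrative, not the actual validation rules.
DIRECT, AWS_S3, GCS = "direct", "aws-s3", "gcs"
S3, GOOGLE_GROUP_EMAIL = "s3", "google-group-email"

REQUIRED_FIELDS = {DIRECT: [], AWS_S3: ["uri"], GCS: ["uri"]}
FORBIDDEN_FIELDS = {DIRECT: ["uri", "email"], AWS_S3: ["email"], GCS: []}
EXPECTED_MECHANISM = {DIRECT: None, AWS_S3: S3, GCS: GOOGLE_GROUP_EMAIL}

@dataclass
class Source:
    data_location: str
    access_mechanism: str = None
    uri: str = ""
    email: str = ""

def clean(source):
    if source.data_location not in REQUIRED_FIELDS:
        raise ValueError("Invalid data location.")
    for name in REQUIRED_FIELDS[source.data_location]:
        if not getattr(source, name):
            raise ValueError(f"{source.data_location} data sources must have a {name}.")
    for name in FORBIDDEN_FIELDS[source.data_location]:
        if getattr(source, name):
            raise ValueError(f"{source.data_location} data sources must not have a {name}.")
    expected = EXPECTED_MECHANISM[source.data_location]
    if expected is not None and source.access_mechanism != expected:
        raise ValueError(f"{source.data_location} data sources must use the {expected} access mechanism.")

clean(Source(AWS_S3, access_mechanism=S3, uri="s3://bucket"))  # passes silently
```

Adding a new location then only touches the three dicts, not the branching logic.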

Comment on lines 685 to 690
def __init__(self, **kwargs):
self.data_location = kwargs.get('data_location', None)
self.files_available = kwargs.get('files_available', None)
self.email = kwargs.get('email', None)
self.uri = kwargs.get('uri', None)
self.access_mechanism = kwargs.get('access_mechanism', None)
Member

Hmm, why did you opt for keyword arguments here? Why not:

Suggested change
def __init__(self, **kwargs):
self.data_location = kwargs.get('data_location', None)
self.files_available = kwargs.get('files_available', None)
self.email = kwargs.get('email', None)
self.uri = kwargs.get('uri', None)
self.access_mechanism = kwargs.get('access_mechanism', None)
def __init__(self, data_location=None, files_available=None, email=None, uri=None, access_mechanism=None):
self.data_location = data_location
self.files_available = files_available
self.email = email
self.uri = uri
self.access_mechanism = access_mechanism

uri=self.uri,
)

@staticmethod
Member

I know self is not needed here, but this is called on an instance everywhere (see your uses of this method - you first instantiate the class). Maybe @staticmethod is not necessary here.

files_available=True,
data_location=DataSource.DataLocation.DIRECT,
)
elif (settings.DEFAULT_PROJECT_ACCESS_MECHANISM == DataSource.DataLocation.RESEARCH_ENVIRONMENT
Member

That's too many conditions IMO, it should be extracted somewhere. It's also not clear why the condition is there in the first place. I.e. why do we need all 3 settings to be set to a certain value.

If the settings are that coupled, then maybe we should opt for a different setup. Have one environment variable, e.g. RESEARCH_ENVIRONMENT_DEPLOYMENT that will set all three (DEFAULT_PROJECT_ACCESS_MECHANISM, DEFAULT_PROJECT_DATA_LOCATION and STORAGE_TYPE) if set to True and if set to False it will set the two from the condition above.
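The single-switch setup suggested here could be sketched as follows. The three derived setting names come from the comment; the concrete values ("gcs", "GCP", "direct", "LOCAL") are assumptions for illustration:

```python
# RESEARCH_ENVIRONMENT_DEPLOYMENT is the single switch proposed above;
# the derived values are illustrative, not the codebase's actual ones.
def derive_settings(env):
    if env.get("RESEARCH_ENVIRONMENT_DEPLOYMENT", "False") == "True":
        return {
            "DEFAULT_PROJECT_ACCESS_MECHANISM": "research-environment",
            "DEFAULT_PROJECT_DATA_LOCATION": "gcs",
            "STORAGE_TYPE": "GCP",
        }
    return {
        "DEFAULT_PROJECT_ACCESS_MECHANISM": "direct",
        "DEFAULT_PROJECT_DATA_LOCATION": "direct",
        "STORAGE_TYPE": "LOCAL",
    }

print(derive_settings({"RESEARCH_ENVIRONMENT_DEPLOYMENT": "True"}))
```

With this, the coupled three-way condition collapses into a check on one flag.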

@amitupreti
Contributor Author

> Correct me if I'm wrong though - this is still not integrated with the project page, right? I.e. it does not control whether the file tab is shown, whether the S3 message is shown etc?

I think I had added that part of the integration, i.e. using DataSource to control access in the project in the Direct Download refactor commits. Let me check your comments, make things clear, and write up how everything comes into play in detail.

It's kind of chaotic right now. I will work on it and update the code.

Thanks again

@amitupreti amitupreti marked this pull request as draft April 27, 2023 14:01
Here we are creating the DataSource model, which will work exactly like DataAccess but will let us manage all project-related access through this model. This will make it easier in the future to manage access, as everything will be centralized here.

Similarly, as discussed in MIT-LCP#1927, we can use this model to work with `AccessGrant`.
Context: The goal here is to refactor so that access is controlled through DataSource without changing the existing logic.
The existing logic is: if the project has the research environment enabled, the user should not be able to view/download files (they can only create a research environment and play with the data there).

In this commit, we divided the access into 2 sections (first for RE, then the rest). The outcome should be the same as before: if RE is enabled, show `published_project_denied_downloads.html`, else show the normal file access stuff.
Replaces allow_file_downloads with can_view_files (this controls whether a user has download access to the project, by using the DataSource).

Here the idea is that all the direct download stuff will be inside
```
{% if can_view_files %}
```

and the other access methods like gcp, aws will be outside this block.

So the logic will be
1. If the project is an RE project, show the RE warning message
2. Else
2.1 If the project has other access outside RE and direct download, show that.
2.2 If the project has direct download access, then show the direct download stuff

To make it easier to understand the changes, I am breaking the commits down into very small chunks.

Note that 2.1 and 2.2 don't have to be either/or (currently a project can be directly accessible and also accessible via GCS, AWS, etc.)
The p and h5 text should appear for both direct download and the other GCP/AWS access mechanisms, so bringing them outside the can_view_files if block.
Move the other access method (GCP) outside of the can_view_files if block.

The logic before this commit was: if a project does not have an access policy but has project.gcp, show the GCP stuff.

In this commit, we keep the same logic.
Clean up an unnecessary/empty else statement.
Move the data_access stuff outside the if can_view_files block.
The current logic was: if a user can download the project files, check data_access and show the access method details for the ones that exist. So it should be safe to bring this block outside the if can_view_files block.

Note that, currently, for GCP and AWS we are still using the old data_access methods. We will refactor those in a future commit (the current series of commits refactors the direct download while maintaining the same logic).
When a project is published, we should always create a DataSource object based on the project's storage type, because with DataSource all access will go through it; otherwise the files won't be accessible.

This commit creates the DataSource automatically when a project is published.

The type of DataSource created depends on the settings. For PhysioNet, the default data storage is direct; for HDN it is GCP (defined by STORAGE_TYPE) and the access mechanism is Research Environment.

For the auto-creation of the DataSource, I added 2 new environment variables to control the location and access_mechanism of the DataSource.

Note: the DataSourceCreator doesn't belong in access.py; I had to keep it there for a short time, until we refactor the authorization into a separate app, to avoid a circular import.
It looks like we cannot import app models in settings, because during initialization the settings run before the apps are loaded, which throws an error.
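The auto-creation described in this commit message could be sketched roughly like this. The returned dict stands in for an actual DataSource.objects.create(...) call, and the setting/field values are assumptions based on the commit text, not the actual code:

```python
# Sketch of the DataSourceCreator behavior described above: pick the
# DataSource attributes from settings when a project is published.
# The dict stands in for DataSource.objects.create(...); values are
# assumptions based on the commit message.
def build_data_source(project_id, settings):
    if settings.get("STORAGE_TYPE") == "GCP":
        # HDN-style deployment: GCP storage, Research Environment access.
        return {
            "project_id": project_id,
            "data_location": "gcs",
            "access_mechanism": "research-environment",
            "files_available": False,
        }
    # PhysioNet default: files stored and served directly from the server.
    return {
        "project_id": project_id,
        "data_location": "direct",
        "access_mechanism": None,
        "files_available": True,
    }

print(build_data_source(1, {"STORAGE_TYPE": "GCP"})["access_mechanism"])  # research-environment
```

Keeping this selection in one place is what lets the publish flow stay storage-agnostic.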