
Generic dataset module and specific s3_datasets module - part 2 (Create datasets_base and move enums) #1257

Merged
merged 11 commits into main on May 15, 2024

Conversation

@dlpzx (Contributor) commented May 7, 2024

Feature or Bugfix

⚠️ This PR should be merged after #1250.

  • Refactoring

Detail

As explained in the design for #1123 we are trying to implement a generic datasets_base module that can be used by any type of datasets in a generic way.

This PR:

  • Creates the skeleton of the datasets_base module consisting of 3 packages: db, api, services. And adds the __init__ file.
  • Adds the dependency of s3_datasets to datasets_base in the __init__ file of the s3_datasets module
  • Moves datasets_enums to datasets_base
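As a rough sketch of what the skeleton's interfaces could look like, the following is a minimal, self-contained illustration of a module declaring a dependency on another. The class and method names (`ModuleInterface`, `is_supported`, `depends_on`, `ImportMode`) are assumptions modeled on data.all's module-interface conventions, not the exact code:

```python
from enum import Enum, auto
from typing import List, Set


class ImportMode(Enum):
    """Load modes a module interface may support (values illustrative)."""
    API = auto()
    CDK = auto()
    CDK_CLI_EXTENSION = auto()


class ModuleInterface:
    """Minimal stand-in for data.all's module interface contract."""

    @staticmethod
    def is_supported(modes: Set[ImportMode]) -> bool:
        raise NotImplementedError

    @staticmethod
    def depends_on() -> List[type]:
        # By default a module depends on nothing.
        return []


class DatasetBaseApiModuleInterface(ModuleInterface):
    """datasets_base: loaded whenever the API mode is requested."""

    @staticmethod
    def is_supported(modes: Set[ImportMode]) -> bool:
        return ImportMode.API in modes


class S3DatasetApiModuleInterface(ModuleInterface):
    """s3_datasets: declares its dependency on datasets_base, mirroring
    the dependency added in the s3_datasets __init__ file."""

    @staticmethod
    def is_supported(modes: Set[ImportMode]) -> bool:
        return ImportMode.API in modes

    @staticmethod
    def depends_on() -> List[type]:
        return [DatasetBaseApiModuleInterface]
```

A loader can then topologically resolve `depends_on()` so that datasets_base is always initialized before s3_datasets.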

Relates

Security

Please answer the questions below briefly where applicable, or write N/A. Based on
OWASP 10 (https://owasp.org/Top10/en/).

  • Does this PR introduce or modify any input fields or queries - this includes
    fetching data from storage outside the application (e.g. a database, an S3 bucket)?
    • Is the input sanitized?
    • What precautions are you taking before deserializing the data you consume?
    • Is injection prevented by parametrizing queries?
    • Have you ensured no eval or similar functions are used?
  • Does this PR introduce any functionality or component that requires authorization?
    • How have you ensured it respects the existing AuthN/AuthZ mechanisms?
    • Are you logging failed auth attempts?
  • Are you using or adding any cryptographic features?
    • Do you use standard, proven implementations?
    • Are the used keys controlled by the customer? Where are they stored?
  • Are you introducing any new policies/roles/users?
    • Have you used the least-privilege principle? How?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@dlpzx dlpzx marked this pull request as ready for review May 8, 2024 06:37
@dlpzx dlpzx requested a review from SofiaSazonova May 8, 2024 06:37
@dlpzx (Contributor, Author) commented May 8, 2024

Tested in AWS:

  • Loading of modules works as expected (e.g. we can see the Datasets module)
  • All FE views show enums as normal (e.g. DatasetRoles, Topics, classification)
  • Edit dataset and add topics works as expected
  • Edit dataset confidentiality works as expected
  • createEnvironment and createDataset work as expected

👀 Because of how the loader is configured for the CDK and CDK_CLI_EXTENSION load modes, datasets_base needs to be added explicitly in the config.json file. That was going to happen either way, so I added it already.

@dlpzx dlpzx marked this pull request as draft May 8, 2024 07:59
@dlpzx dlpzx marked this pull request as ready for review May 13, 2024 13:55
@dlpzx dlpzx requested a review from noah-paige May 13, 2024 13:57
@@ -9,6 +9,9 @@
"datapipelines": {
"active": true
},
"datasets_base": {
Contributor commented:
I think ideally we would have datasets_base be automatically activated if any of the child dataset modules (i.e. s3_datasets) is active, rather than exposing it here in config.json.

Configuring the datasets_base module as active does not really mean anything on its own; it is the child dataset modules that are meaningful to activate or not.

I am not sure how easy it is to do such a thing, or whether it requires a big shift in the loader logic, but it would be worth it if it is an easy fix.

Contributor commented:

I saw your comment that it is required here because of how we load the CDK and CDK_CLI_EXTENSION ImportModes - but can you explain why that is, and whether there is an easy way to resolve it?

Contributor (Author) commented:

First of all, datasets_base will be added to config.json at some point (see the full design steps in #1123), because some parameters are relevant for all datasets, not only for s3_datasets or redshift_datasets (e.g. confidentiality). That is why I did not give it more thought.

But if you are curious, the issue is a bit hidden: the loader has some methods to check that it initialized the modules correctly. With the current implementation, I think a module cannot have a CDK module interface without also having a CDK_CLI_EXTENSION one unless it is a config.json module. I think the problem lies not so much in the loader as in the way we are using it in the cdk proxy. I will take a closer look.

Contributor commented:

In earlier versions, such as v2.4.0, I think we never exposed datasets_base as a module in config.json, and we had the same type of __init__.py where it was only supported by CDK and not CDK_CLI_EXTENSION.

I am trying to understand why we could do that then but cannot now, and playing around a bit with how these modules are imported using the loader class.

Either way, I tested these changes and they look good, so I will approve the PR but do some more digging here on the side.

Contributor commented:

I think I found the issue as to why there is an error when we remove datasets_base from config.json:

  • When we load_modules(modes={ImportMode.CDK_CLI_EXTENSION}):
    • We first import all modules in_config as part of def _load_modules(...)
    • At this point in time that theoretically includes s3_datasets and dataset_sharing
    • When we import dataset_sharing, the following line at the top level of its __init__.py subsequently imports the s3_datasets and datasets_base modules:
from dataall.modules.dataset_sharing.db.share_object_repositories import ShareEnvironmentResource
  • Then, when we _check_loading_correct(...) in loader.py:
    • it fails on L252 with an ImportError, since datasets_base is in sys.modules.keys() but not in checked_modules_names (datasets_base is not in the config / expected_load)

Previously we never exposed datasets_base or dataset_sharing as configurable modules in config.json, so we avoided this problem. Moving the line

from dataall.modules.dataset_sharing.db.share_object_repositories import ShareEnvironmentResource

in backend/dataall/modules/dataset_sharing/__init__.py into def __init__(self): of class SharingApiModuleInterface() resolves the loading issue, but it may need some additional testing to ensure the share APIs still work as expected.

@noah-paige (Contributor) left a comment:

Still doing some digging on the loader class and on exposing datasets_base in config.json, but I tested this PR and the changes look good.

@dlpzx (Contributor, Author) commented May 15, 2024

@noah-paige I will implement the import change in another PR, probably together with other modules that also have global imports in their __init__ files.

@dlpzx dlpzx merged commit e9862cb into main May 15, 2024
9 checks passed
dlpzx added a commit that referenced this pull request May 15, 2024
### Feature or Bugfix
- Bugfix

### Detail
The loader class, which loads modules depending on the module interface,
loads all module `__init__` files and then uses the interfaces to load
specific parts of each module or to register/perform other operations.
The issue is that in some modules we import parts of the
package at the top level of the `__init__` file, which means that no
matter what the module interface is, those imports will be executed.

Let's see it with an example. I am loading modules using the
`CDK_CLI_EXTENSION` mode in the `cdk_cli_wrapper`. The loader loads
all `__init__` files of the modules in config.json and then uses the
interface to load only the pipelines module. The issue is that the
`__init__` files of the modules have also imported things like
`DashboardRepository` or `SagemakerStudioEnvironmentResource`:
pieces of code that are not needed in `CDK_CLI_EXTENSION`.

On its own this would be a minor issue: some extra pieces of code get loaded, okay.
The bigger problem comes from "child modules". After loading
everything, the loader checks whether what has been loaded has its correct interface
or is a config.json module. Modules such as `feed` that are not
directly specified in config.json can be imported indirectly
through these top-level imports in the `__init__` files and cause the
loading to fail.
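The consistency check described above can be sketched as follows. The function and variable names here are assumptions, not data.all's exact API; the point is that any imported dataall module that is neither declared in config.json nor accounted for by a checked module interface raises an ImportError:

```python
def check_loading_correct(imported_module_names, modules_in_config, checked_module_names):
    """Sketch of the loader's post-load consistency check (names assumed).

    imported_module_names: module names observed in sys.modules after loading
    modules_in_config:     top-level module names declared in config.json
    checked_module_names:  fully qualified names accounted for by interfaces
    """
    prefix = 'dataall.modules.'
    for name in imported_module_names:
        if not name.startswith(prefix):
            continue  # stdlib or third-party module, not our concern
        top_level = name[len(prefix):].split('.')[0]
        if top_level not in modules_in_config and name not in checked_module_names:
            raise ImportError(f'Module {top_level} was imported but never checked')


# datasets_base gets pulled in indirectly by dataset_sharing's top-level
# import, but it is neither in config.json nor checked -> loading fails.
imported = ['dataall.modules.dataset_sharing', 'dataall.modules.datasets_base']
try:
    check_loading_correct(imported, {'dataset_sharing'}, set())
    failed = False
except ImportError:
    failed = True
assert failed
```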

We have 2 options:
1. Make our loader aware of the source of each file it imports
2. Be strict with how we import in the init files

In this PR I went for option 2, because our loader is complex enough and is
already name-sensitive. I think adding more logic would make it very
difficult to maintain and might have unexpected results.

### Relates
- #1257 - the PR in which we discovered the issue

dlpzx added a commit that referenced this pull request May 17, 2024
…te DatasetBase db model and S3Dataset model) (#1258)

### Feature or Bugfix
⚠️ This PR should be merged after #1257.
- Feature
- Refactoring

### Detail
As explained in the design for #1123 we are trying to implement a
generic `datasets_base` module that can be used by any type of datasets
in a generic way.

**This PR does**:
- Adds a generic `DatasetBase` model in datasets_base.db that is used in
s3_datasets.db to build the `S3Dataset` model using joined table
inheritance in
[sqlalchemy](https://docs.sqlalchemy.org/en/20/orm/inheritance.html)
- Renames all usages of `Dataset` to `S3Dataset` (in the future some will be
returned to `DatasetBase`, but for the moment we keep them as
`S3Dataset`)
- Adds a migration script that backfills the `datasets` table and renames
`s3_datasets`. ⚠️ In the process of migrating we perform some
"scary" operations on the dataset table; if for any reason the migration
encounters an issue, it could result in catastrophic loss of information.
For this reason this
[PR](#1267) implements RDS
snapshots on migrations.

**This PR does not**:
- Migrate the Feed definition: registration stays as
`FeedRegistry.register(FeedDefinition('Dataset', S3Dataset))`, keeping
`Dataset` as the resource type for `S3Dataset`. That is out of scope for
this PR.
- Same for the GlossaryRegistry registration: we keep
`object_type='Dataset'` to avoid backwards-compatibility issues.
- Change the resourceType for permissions: we keep using a
generic `Dataset` as the target for S3 permissions. If we split
permissions into `DatasetBase` permissions and `S3Dataset` permissions, we
will do it in a different PR.

#### Remarks
Inserting new `S3Dataset` items does not require any changes. SQLAlchemy
joined inheritance automatically inserts a row into the parent
table and then another into the child table, as explained in this
stackoverflow
[link](https://stackoverflow.com/questions/39926937/sqlalchemy-how-to-insert-a-joined-table-inherited-class-instance-when-the-pare)
(I was not able to find it in the official docs)
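A minimal, self-contained sketch of the joined table inheritance pattern described above. The column names and the in-memory SQLite engine are illustrative; data.all's actual models carry many more columns:

```python
from sqlalchemy import Column, ForeignKey, String, create_engine, text
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class DatasetBase(Base):
    """Generic parent table shared by all dataset types."""
    __tablename__ = 'datasets'
    datasetUri = Column(String, primary_key=True)
    label = Column(String)
    datasetType = Column(String)  # discriminator column
    __mapper_args__ = {
        'polymorphic_identity': 'DatasetBase',
        'polymorphic_on': datasetType,
    }


class S3Dataset(DatasetBase):
    """Child table joined to the parent via the shared primary key."""
    __tablename__ = 's3_datasets'
    datasetUri = Column(String, ForeignKey('datasets.datasetUri'), primary_key=True)
    S3BucketName = Column(String)
    __mapper_args__ = {'polymorphic_identity': 'S3Dataset'}


engine = create_engine('sqlite://')
Base.metadata.create_all(engine)

with Session(engine) as session:
    # A single add() produces two INSERTs, one into `datasets` and one into
    # `s3_datasets`, handled automatically by joined table inheritance.
    session.add(S3Dataset(datasetUri='uri-1', label='demo', S3BucketName='my-bucket'))
    session.commit()
    parent_rows = session.execute(text('SELECT COUNT(*) FROM datasets')).scalar()
    child_rows = session.execute(text('SELECT COUNT(*) FROM s3_datasets')).scalar()
```

Querying `S3Dataset` back transparently joins the two tables, which is why renamed code can keep working against a single model class.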


### Relates
- #1123 
- #955 
- #1267

@dlpzx dlpzx deleted the feat/generic-dataset-model-refactoring-2 branch May 22, 2024 06:56