
Generic dataset module and specific s3_datasets module - part 2 (Create datasets_base and move enums) #1257

Merged
merged 11 commits into main on May 15, 2024

Conversation

@dlpzx (Contributor) commented May 7, 2024

Feature or Bugfix

⚠️ This PR should be merged after #1250.

  • Refactoring

Detail

As explained in the design for #1123 we are trying to implement a generic datasets_base module that can be used by any type of datasets in a generic way.

This PR:

  • Creates the skeleton of the datasets_base module consisting of 3 packages: db, api, services. And adds the __init__ file.
  • Adds the dependency of s3_datasets to datasets_base in the __init__ file of the s3_datasets module
  • Moves datasets_enums to datasets_base
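As a rough sketch of what the skeleton's interfaces could look like, the following is a minimal, self-contained illustration of a module declaring a dependency on another. The class and method names (`ModuleInterface`, `is_supported`, `depends_on`, `ImportMode`) are assumptions modeled on data.all's module-interface conventions, not the exact code:

```python
from enum import Enum, auto
from typing import List, Set


class ImportMode(Enum):
    """Load modes a module interface may support (values illustrative)."""
    API = auto()
    CDK = auto()
    CDK_CLI_EXTENSION = auto()


class ModuleInterface:
    """Minimal stand-in for data.all's module interface contract."""

    @staticmethod
    def is_supported(modes: Set[ImportMode]) -> bool:
        raise NotImplementedError

    @staticmethod
    def depends_on() -> List[type]:
        # By default a module depends on nothing.
        return []


class DatasetBaseApiModuleInterface(ModuleInterface):
    """datasets_base: loaded whenever the API mode is requested."""

    @staticmethod
    def is_supported(modes: Set[ImportMode]) -> bool:
        return ImportMode.API in modes


class S3DatasetApiModuleInterface(ModuleInterface):
    """s3_datasets: declares its dependency on datasets_base, mirroring
    the dependency added in the s3_datasets __init__ file."""

    @staticmethod
    def is_supported(modes: Set[ImportMode]) -> bool:
        return ImportMode.API in modes

    @staticmethod
    def depends_on() -> List[type]:
        return [DatasetBaseApiModuleInterface]
```

A loader can then topologically resolve `depends_on()` so that datasets_base is always initialized before s3_datasets.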

Relates

Security

Please answer the questions below briefly where applicable, or write N/A. Based on
OWASP 10 (https://owasp.org/Top10/en/).

  • Does this PR introduce or modify any input fields or queries - this includes
    fetching data from storage outside the application (e.g. a database, an S3 bucket)?
    • Is the input sanitized?
    • What precautions are you taking before deserializing the data you consume?
    • Is injection prevented by parametrizing queries?
    • Have you ensured no eval or similar functions are used?
  • Does this PR introduce any functionality or component that requires authorization?
    • How have you ensured it respects the existing AuthN/AuthZ mechanisms?
    • Are you logging failed auth attempts?
  • Are you using or adding any cryptographic features?
    • Do you use standard, proven implementations?
    • Are the used keys controlled by the customer? Where are they stored?
  • Are you introducing any new policies/roles/users?
    • Have you used the least-privilege principle? How?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@dlpzx dlpzx marked this pull request as ready for review May 8, 2024 06:37
@dlpzx dlpzx requested a review from SofiaSazonova May 8, 2024 06:37
@dlpzx (Contributor, Author) commented May 8, 2024

Tested in AWS:

  • Loading of modules works as expected (e.g. we can see the Datasets module)
  • All FE views show enums as normal (e.g. DatasetRoles, Topics, classification)
  • Edit dataset and add topics works as expected
  • Edit dataset confidentiality works as expected
  • createEnvironment and createDataset work as expected

👀 Because of how the loader is configured for the CDK and CDK_CLI_EXTENSION load modes, datasets_base needs to be added explicitly in the config.json file. That was going to happen either way, so I added it already.

@dlpzx dlpzx marked this pull request as draft May 8, 2024 07:59
@dlpzx dlpzx marked this pull request as ready for review May 13, 2024 13:55
@dlpzx dlpzx requested a review from noah-paige May 13, 2024 13:57
@@ -9,6 +9,9 @@
"datapipelines": {
"active": true
},
"datasets_base": {
Contributor commented:
I think ideally we would have datasets_base be automatically activated if any of the child dataset modules (i.e. s3_datasets) is active, rather than exposing it here in config.json.

Configuring the datasets_base module as active does not really mean anything on its own; it is the child dataset modules that are meaningful to activate or not.

I am not sure how easy it is to do such a thing, or whether it requires a big shift in the loader logic, but it would be worth it if it is an easy fix.

Contributor commented:

I saw your comment that it is required here because of how we load the CDK and CDK_CLI_EXTENSION ImportModes - but can you explain why that is, and whether there is an easy way to resolve it?

Contributor (Author) commented:

First of all, datasets_base will be added to config.json at some point (see the full design steps in #1123), because some parameters are relevant for all datasets, not only for s3_datasets or redshift_datasets (e.g. confidentiality). That is why I did not give it more thought.

But if you are curious, the issue is a bit hidden: the loader has some methods to check that it initialized the modules correctly. With the current implementation, I think a module cannot have a CDK module interface without also having a CDK_CLI_EXTENSION one unless it is a config.json module. I think the problem lies not so much in the loader as in the way we are using it in the cdk proxy. I will take a closer look.

Contributor commented:

In earlier versions, such as v2.4.0, I think we never exposed datasets_base as a module in config.json, and we had the same type of __init__.py where it was only supported by CDK and not CDK_CLI_EXTENSION.

I am trying to understand why we could do that then but cannot now, and playing around a bit with how these modules are imported using the loader class.

Either way, I tested these changes and they look good, so I will approve the PR but do some more digging here on the side.

Contributor commented:

I think I found the issue as to why there is an error when we remove datasets_base from config.json:

  • When we load_modules(modes={ImportMode.CDK_CLI_EXTENSION}):
    • We first import all modules in_config as part of def _load_modules(...)
    • At this point in time that theoretically includes s3_datasets and dataset_sharing
    • When we import dataset_sharing, the following line at the top level of its __init__.py subsequently imports the s3_datasets and datasets_base modules:
from dataall.modules.dataset_sharing.db.share_object_repositories import ShareEnvironmentResource
  • Then, when we _check_loading_correct(...) in loader.py:
    • it fails on L252 with an ImportError, since datasets_base is in sys.modules.keys() but not in checked_modules_names (datasets_base is not in the config / expected_load)

Previously we never exposed datasets_base or dataset_sharing as configurable modules in config.json, so we avoided this problem. Moving the line

from dataall.modules.dataset_sharing.db.share_object_repositories import ShareEnvironmentResource

in backend/dataall/modules/dataset_sharing/__init__.py into def __init__(self): of class SharingApiModuleInterface() resolves the loading issue, but it may need some additional testing to ensure the share APIs still work as expected.

@noah-paige (Contributor) left a comment:

Still doing some digging on the loader class and on exposing datasets_base in config.json, but I tested this PR and the changes look good.

@dlpzx (Contributor, Author) commented May 15, 2024

@noah-paige I will implement the import change in another PR, probably together with other modules that also have global imports in their __init__ files.

@dlpzx dlpzx merged commit e9862cb into main May 15, 2024
9 checks passed
dlpzx added a commit that referenced this pull request May 15, 2024
### Feature or Bugfix
- Bugfix

### Detail
The loader class, which loads modules depending on the module interface,
loads all module `__init__` files and then uses the interfaces to load
specific parts of each module or to register/perform other operations.
The issue is that in some modules we import parts of the
package at the top level of the `__init__` file, which means that no
matter what the module interface is, those imports will be executed.

Let's see it with an example. I am loading modules using the
`CDK_CLI_EXTENSION` mode in the `cdk_cli_wrapper`. The loader loads
all `__init__` files of the modules in config.json and then uses the
interface to load only the pipelines module. The issue is that the
`__init__` files of the modules have also imported things like
`DashboardRepository` or `SagemakerStudioEnvironmentResource`:
pieces of code that are not needed in `CDK_CLI_EXTENSION`.

On its own this would be a minor issue: some extra pieces of code get loaded, okay.
The bigger problem comes from "child modules". After loading
everything, the loader checks whether what has been loaded has its correct interface
or is a config.json module. Modules such as `feed` that are not
directly specified in config.json can be imported indirectly
through these top-level imports in the `__init__` files and cause the
loading to fail.
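The consistency check described above can be sketched as follows. The function and variable names here are assumptions, not data.all's exact API; the point is that any imported dataall module that is neither declared in config.json nor accounted for by a checked module interface raises an ImportError:

```python
def check_loading_correct(imported_module_names, modules_in_config, checked_module_names):
    """Sketch of the loader's post-load consistency check (names assumed).

    imported_module_names: module names observed in sys.modules after loading
    modules_in_config:     top-level module names declared in config.json
    checked_module_names:  fully qualified names accounted for by interfaces
    """
    prefix = 'dataall.modules.'
    for name in imported_module_names:
        if not name.startswith(prefix):
            continue  # stdlib or third-party module, not our concern
        top_level = name[len(prefix):].split('.')[0]
        if top_level not in modules_in_config and name not in checked_module_names:
            raise ImportError(f'Module {top_level} was imported but never checked')


# datasets_base gets pulled in indirectly by dataset_sharing's top-level
# import, but it is neither in config.json nor checked -> loading fails.
imported = ['dataall.modules.dataset_sharing', 'dataall.modules.datasets_base']
try:
    check_loading_correct(imported, {'dataset_sharing'}, set())
    failed = False
except ImportError:
    failed = True
assert failed
```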

We have 2 options:
1. Make our loader aware of the source of each file it imports
2. Be strict with how we import in the init files

In this PR I went for option 2, because our loader is complex enough and is
already name-sensitive. I think adding more logic would make it very
difficult to maintain and might have unexpected results.

### Relates
- #1257 - the PR in which we discovered the issue

dlpzx added a commit that referenced this pull request May 17, 2024
…te DatasetBase db model and S3Dataset model) (#1258)

### Feature or Bugfix
⚠️ This PR should be merged after #1257.
- Feature
- Refactoring

### Detail
As explained in the design for #1123 we are trying to implement a
generic `datasets_base` module that can be used by any type of datasets
in a generic way.

**This PR does**:
- Adds a generic `DatasetBase` model in datasets_base.db that is used in
s3_datasets.db to build the `S3Dataset` model using joined table
inheritance in
[sqlalchemy](https://docs.sqlalchemy.org/en/20/orm/inheritance.html)
- Renames all usages of `Dataset` to `S3Dataset` (in the future some will be
returned to `DatasetBase`, but for the moment we keep them as
`S3Dataset`)
- Adds a migration script that backfills the `datasets` table and renames
`s3_datasets`. ⚠️ In the process of migrating we perform some
"scary" operations on the dataset table; if for any reason the migration
encounters an issue, it could result in catastrophic loss of information.
For this reason this
[PR](#1267) implements RDS
snapshots on migrations.

**This PR does not**:
- Migrate the Feed definition: registration stays as
`FeedRegistry.register(FeedDefinition('Dataset', S3Dataset))`, keeping
`Dataset` as the resource type for `S3Dataset`. That is out of scope for
this PR.
- Same for the GlossaryRegistry registration: we keep
`object_type='Dataset'` to avoid backwards-compatibility issues.
- Change the resourceType for permissions: we keep using a
generic `Dataset` as the target for S3 permissions. If we split
permissions into `DatasetBase` permissions and `S3Dataset` permissions, we
will do it in a different PR.

#### Remarks
Inserting new `S3Dataset` items does not require any changes. SQLAlchemy
joined inheritance automatically inserts a row into the parent
table and then another into the child table, as explained in this
stackoverflow
[link](https://stackoverflow.com/questions/39926937/sqlalchemy-how-to-insert-a-joined-table-inherited-class-instance-when-the-pare)
(I was not able to find it in the official docs)
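A minimal, self-contained sketch of the joined table inheritance pattern described above. The column names and the in-memory SQLite engine are illustrative; data.all's actual models carry many more columns:

```python
from sqlalchemy import Column, ForeignKey, String, create_engine, text
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class DatasetBase(Base):
    """Generic parent table shared by all dataset types."""
    __tablename__ = 'datasets'
    datasetUri = Column(String, primary_key=True)
    label = Column(String)
    datasetType = Column(String)  # discriminator column
    __mapper_args__ = {
        'polymorphic_identity': 'DatasetBase',
        'polymorphic_on': datasetType,
    }


class S3Dataset(DatasetBase):
    """Child table joined to the parent via the shared primary key."""
    __tablename__ = 's3_datasets'
    datasetUri = Column(String, ForeignKey('datasets.datasetUri'), primary_key=True)
    S3BucketName = Column(String)
    __mapper_args__ = {'polymorphic_identity': 'S3Dataset'}


engine = create_engine('sqlite://')
Base.metadata.create_all(engine)

with Session(engine) as session:
    # A single add() produces two INSERTs, one into `datasets` and one into
    # `s3_datasets`, handled automatically by joined table inheritance.
    session.add(S3Dataset(datasetUri='uri-1', label='demo', S3BucketName='my-bucket'))
    session.commit()
    parent_rows = session.execute(text('SELECT COUNT(*) FROM datasets')).scalar()
    child_rows = session.execute(text('SELECT COUNT(*) FROM s3_datasets')).scalar()
```

Querying `S3Dataset` back transparently joins the two tables, which is why renamed code can keep working against a single model class.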


### Relates
- #1123 
- #955 
- #1267

@dlpzx dlpzx deleted the feat/generic-dataset-model-refactoring-2 branch May 22, 2024 06:56