
Generic dataset module and specific s3_datasets module - part 3 (Create DatasetBase db model and S3Dataset model) #1258

Merged
merged 28 commits into main from feat/generic-dataset-model-refactoring-3 on May 17, 2024

Conversation

dlpzx (Contributor) commented May 7, 2024

Feature or Bugfix

⚠️ This PR should be merged after #1257.

  • Feature
  • Refactoring

Detail

As explained in the design for #1123, we are implementing a generic datasets_base module that any type of dataset can reuse.

This PR does:

  • Adds a generic DatasetBase model in datasets_base.db, which s3_datasets.db uses to build the S3Dataset model via SQLAlchemy joined table inheritance (see the sketch after this list)
  • Renames all usages of Dataset to S3Dataset (in the future some will return to DatasetBase, but for the moment we keep them as S3Dataset)
  • Adds a migration script that backfills the dataset table and renames the old table to s3_dataset. ⚠️ The migration performs some "scary" operations on the dataset table; if for any reason it encounters an issue, it could result in catastrophic loss of information, which is why this PR implements RDS snapshots on migrations.
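
A minimal sketch of the joined table inheritance described above, with simplified columns (the real dataall models define many more fields, and the column types here are assumptions):

```python
from sqlalchemy import Column, String, ForeignKey
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class DatasetBase(Base):
    __tablename__ = 'dataset'
    datasetUri = Column(String, primary_key=True)
    label = Column(String, nullable=False)
    # Discriminator column: tells SQLAlchemy which subclass a row maps to
    datasetType = Column(String, nullable=False)
    __mapper_args__ = {
        'polymorphic_identity': 'DatasetBase',
        'polymorphic_on': datasetType,
    }

class S3Dataset(DatasetBase):
    __tablename__ = 's3_dataset'
    # The child's primary key is also a foreign key to the parent table;
    # this parent/child key link is what makes the inheritance "joined".
    datasetUri = Column(String, ForeignKey('dataset.datasetUri'), primary_key=True)
    S3BucketName = Column(String, nullable=False)
    __mapper_args__ = {'polymorphic_identity': 'S3'}
```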

This PR does not:

  • Feed registration stays as FeedRegistry.register(FeedDefinition('Dataset', S3Dataset)), keeping 'Dataset' as the resource type for S3Dataset. Migrating the Feed definition is out of the scope of this PR.
  • Exactly the same for the GlossaryRegistry registration: we keep object_type='Dataset' to avoid backwards-compatibility issues.
  • It does not change the resourceType for permissions. We keep using a generic Dataset as the target for S3 permissions; splitting permissions into DatasetBase permissions and S3Dataset permissions would happen in a different PR.

Remarks

Inserting new S3Dataset items does not require any changes. SQLAlchemy joined inheritance automatically inserts a row into the parent table and then another into the child table, as explained in a Stack Overflow answer (I was not able to find this behavior in the official docs). See the illustration below.
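
A hypothetical illustration of that behavior, reusing the sketch models above (identifiers are made up):

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import Session

engine = create_engine('sqlite://', echo=True)  # echo=True prints the emitted SQL
Base.metadata.create_all(engine)

with Session(engine) as session:
    ds = S3Dataset(datasetUri='abc123', label='my-dataset', S3BucketName='my-bucket')
    session.add(ds)
    # On commit, SQLAlchemy emits two INSERTs: first into 'dataset'
    # (parent), then into 's3_dataset' (child) with the same datasetUri.
    session.commit()
```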

Relates

Security

Please answer the questions below briefly where applicable, or write N/A. Based on OWASP 10.

  • Does this PR introduce or modify any input fields or queries - this includes
    fetching data from storage outside the application (e.g. a database, an S3 bucket)?
    • Is the input sanitized?
    • What precautions are you taking before deserializing the data you consume?
    • Is injection prevented by parametrizing queries?
    • Have you ensured no eval or similar functions are used?
  • Does this PR introduce any functionality or component that requires authorization?
    • How have you ensured it respects the existing AuthN/AuthZ mechanisms?
    • Are you logging failed auth attempts?
  • Are you using or adding any cryptographic features?
    • Do you use standard, proven implementations?
    • Are the used keys controlled by the customer? Where are they stored?
  • Are you introducing any new policies/roles/users?
    • Have you used the least-privilege principle? How?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

dlpzx added 2 commits May 15, 2024 10:13 — merged branch 'feat/generic-dataset-model-refactoring-2' into 'feat/generic-dataset-model-refactoring-3', resolving conflicts in:

  • backend/dataall/modules/dataset_sharing/services/dataset_sharing_service.py
  • backend/dataall/modules/s3_datasets/api/dataset/resolvers.py
  • backend/dataall/modules/s3_datasets/db/dataset_models.py
  • backend/dataall/modules/s3_datasets/services/dataset_service.py
  • backend/dataall/modules/s3_datasets/services/dataset_table_service.py
@dlpzx dlpzx marked this pull request as ready for review May 15, 2024 09:17
@dlpzx dlpzx changed the title WIP - Generic dataset module and specific s3_datasets module - part 3 (Create DatasetBase db model and S3Dataset model) Generic dataset module and specific s3_datasets module - part 3 (Create DatasetBase db model and S3Dataset model) May 15, 2024
dlpzx (Contributor Author) commented May 15, 2024

Testing (before changes from PR review)

  • locally update and list pre-existing datasets
  • locally update and create a new dataset - check that it is created successfully in the database, indexed in the catalog, and that its permissions are created for the resource type Datasets
  • locally run all migration scripts from scratch - without data
  • locally downgrade each of the migration scripts one by one - without data
  • locally downgrade both migration scripts at once - without data
  • in a pre-existing AWS deployment, merge this branch and check the migration executes correctly
  • in a pre-existing AWS deployment, check previous datasets are listed and can be accessed (checking permissions)
  • in a pre-existing AWS deployment, create a new Dataset

@dlpzx dlpzx requested a review from noah-paige May 15, 2024 16:26
noah-paige (Contributor) commented:
Overall left some minor comments - additionally tested out the migration scripts back-filling a single dataset and all works locally for me as well.

Will do one last look through first thing tomorrow once you are finished with your checklist of testing as well.

session.commit()
session.close()

# Update non-nullable columns
noah-paige (Contributor):
On downgrade I am getting an error column "label" of relation "s3_dataset" contains null values even after we set the label value in the above for loop

dlpzx (Contributor Author):
I will re-test with the changes from PR review, see the results below

dlpzx (Contributor Author) commented May 16, 2024

Testing after changes from review

  • locally create datasets with migrations up to revision 458572580709, then update the schema with alembic migration d059eead99c2 -> check that datasetType is backfilled, foreign keys are updated, and the dataset table is renamed (a simplified sketch of this kind of migration follows after this list)
  • downgrade back to 458572580709 -> check that datasetType is deleted and the datasettypes enum object as well. Check that we now have a foreign key fk_dataset_env_uri and the dataset table is called dataset
  • upgrade to head -> check that the new dataset table is backfilled and the datasettypes enum is used for the dataset type
  • same for downgrade
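
A highly simplified sketch of the data migration exercised above. The revision ids come from this checklist, but the body is illustrative, not the actual d059eead99c2 script, and the column set is an assumption:

```python
import sqlalchemy as sa
from alembic import op
from sqlalchemy.dialects import postgresql

revision = 'd059eead99c2'
down_revision = '458572580709'

def upgrade():
    # Rename the existing table so it becomes the child table of the hierarchy
    op.rename_table('dataset', 's3_dataset')
    # Create the new generic parent table
    op.create_table(
        'dataset',
        sa.Column('datasetUri', sa.String(), primary_key=True),
        sa.Column('label', sa.String(), nullable=False),
        sa.Column('datasetType', postgresql.ENUM('S3', name='datasettypes'), nullable=False),
    )
    # Backfill: one parent row per pre-existing S3 dataset
    op.execute(
        'INSERT INTO dataset ("datasetUri", label, "datasetType") '
        'SELECT "datasetUri", label, \'S3\' FROM s3_dataset'
    )
    # Link child to parent
    op.create_foreign_key(
        's3_dataset_datasetUri_fkey', 's3_dataset', 'dataset',
        ['datasetUri'], ['datasetUri'],
    )
```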

dlpzx added a commit that referenced this pull request May 16, 2024
### Feature or Bugfix
- Feature

### Detail
Alembic migrations can get complex, and in some cases we use alembic not only for schema migrations but also for data migrations. When moving columns with data from one table to another we might accidentally make a mistake in a migration script. We strive to test all migration scripts and avoid bugs in such sensitive operations, but to protect users from the catastrophic case of a bug, a service issue or any other exceptional situation, this PR introduces the creation of manual database snapshots before running alembic migration scripts.

This PR modifies the db_migration handler that is triggered with every backendStack update. It checks whether there are new migration scripts (i.e. whether the current head in the database differs from the new head in the code). If so, it creates a cluster snapshot.

Remarks:
- Snapshots cannot be created while the cluster is not `available`, so the PR introduces a check that waits for this condition. If the Lambda timeout is reached while waiting for the cluster, the CICD pipeline will fail and will need to be retried
- Alembic migration scripts can still run while a snapshot is being created
- Snapshots are incremental: the first one will take a long time, but subsequent snapshots will be faster

A rough sketch of this pre-migration snapshot logic follows below.
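
This sketch is illustrative only, assuming an Aurora cluster and boto3; the function name, identifiers, and polling strategy are made up, not the actual handler code:

```python
import time
import boto3

rds = boto3.client('rds')

def snapshot_before_migrations(cluster_id: str, snapshot_id: str) -> None:
    # Snapshots can only be taken while the cluster is 'available',
    # so poll its status first; the Lambda timeout bounds this wait.
    while True:
        cluster = rds.describe_db_clusters(DBClusterIdentifier=cluster_id)['DBClusters'][0]
        if cluster['Status'] == 'available':
            break
        time.sleep(30)
    # Manual snapshots are incremental, so after the first one this is fast.
    rds.create_db_cluster_snapshot(
        DBClusterIdentifier=cluster_id,
        DBClusterSnapshotIdentifier=snapshot_id,
    )
```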

### Relates
- #1258 - This PR is a good example of complex data migration
operations.

noah-paige (Contributor) commented:
Last last things:

When running an alembic autogenerate migration I get the following:

def upgrade():
    # ### commands auto generated by Alembic - please adjust! ###
    op.alter_column('dataset', 'datasetType',
               existing_type=postgresql.ENUM('S3', name='datasettypes'),
               type_=sa.Enum('S3', name='datasettype'),
               existing_nullable=False)
    op.drop_constraint('s3_dataset_bucket_datasetUri_fkey', 'dataset_bucket', type_='foreignkey')
    op.create_foreign_key(None, 'dataset_bucket', 'dataset', ['datasetUri'], ['datasetUri'], ondelete='CASCADE')
    # ### end Alembic commands ###

Our models and migration scripts may not be in sync?

noah-paige (Contributor) commented May 17, 2024

Changing datasetUri in dataset_bucket table model at backend/dataall/modules/s3_datasets/db/dataset_models.py to

    datasetUri = Column(String, ForeignKey('s3_dataset.datasetUri', ondelete='CASCADE'), nullable=False)

fixes the foreign key change generated by alembic

dlpzx (Contributor Author) commented May 17, 2024

Hello hello, thanks @noah-paige for such a deep review :)

  • foreign_key in dataset_bucket - I can make the change, looks reasonable
  • postgresql.ENUM vs sa.Enum - the problem with sa.Enum is that it forces the creation of the enum type itself. I used postgresql.ENUM because it offers create_type=False, which reuses the existing datasettypes object; otherwise the migration fails. See the illustration after this list.
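
A minimal illustration of that option, assuming the datasettypes type already exists in the database (the column definition is simplified):

```python
from sqlalchemy import Column
from sqlalchemy.dialects import postgresql

# create_type=False tells SQLAlchemy not to emit CREATE TYPE for this enum;
# the pre-existing 'datasettypes' object in the database is reused instead.
# sa.Enum has no such flag, so it would try (and fail) to create the type again.
dataset_types = postgresql.ENUM('S3', name='datasettypes', create_type=False)

datasetType = Column(dataset_types, nullable=False)
```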

noah-paige (Contributor) left a review:
tested and looks good - approving

@dlpzx dlpzx merged commit 7bf62d7 into main May 17, 2024
9 checks passed
dlpzx added a commit that referenced this pull request May 21, 2024
… (Create DatasetBaseRepository and move DatasetLock) (#1276)

### Feature or Bugfix
⚠️ merge after #1258 
- Refactoring

### Detail
As explained in the design for #1123, we are implementing a generic `datasets_base` module that any type of dataset can reuse.

In this small PR:
- we move the generic DatasetLock model to datasets_base
- we move the DatasetLock db operations to the datasets_base DatasetBaseRepository
- we move activity to DatasetBaseRepository

### Relates
- #1123 
- #955 

@dlpzx dlpzx deleted the feat/generic-dataset-model-refactoring-3 branch May 22, 2024 06:56