Support different reset modes during import #26

PeterJCLaw · 2023-12-11T17:33:44Z

This introduces support for "reset modes"¹ -- different approaches for ensuring a clean database before importing the fresh data. Initial support is for three modes:

drop-database: the default; drops the database & re-creates it.
drop-tables: drops the tables the Django codebase is aware of, useful if the Django database user doesn't have access to drop the entire database.
none: no attempt to reset the database, useful if the user has already manually configured the database or otherwise wants more control over setup.

Using the second of these in a staging-type environment is the main motivation for this change.

This PR also updates the existing PostgresSequences extra to cope with a sequence of a given name already being present by overwriting it. This doesn't feel like the best solution (the result if the import somehow lists the same sequence twice may be unexpected), but seemed probably the simplest for now given what devdata is typically used for.

Review by commit may be useful, they're in a hopefully sensible order for that.

Testing is achieved by adding reset modes to all existing import tests as well as creating a few more. I've also tested this against the use-case I originally had (importing against an existing database part of a staging environment).

Fixes #23

TODO:

🐞 ensure the django migrations table is cleared when using the drop-tables mode -> tested & fixed in 4227139
validate behaviour of none when importing over the top of an existing matching schema -> test demonstrating result in 84a2b03

This doesn't feel like the best name, however "strategies" is already taken in this project and I didn't want to confuse that. Open to changing this if we can come up with something better! ↩

…port Currently the only mode is the existing behaviour, however this opens the door to other approaches.

This is unlikely to be common, however it provides an escape hatch for advanced users who want to do their own thing.

This is slightly wassteful, but should mean we cheaply get good coverage over all the interesting model relations we come up with.

Sequences don't nicely fit into one of just schema or data, they're somewhat inherently both. Given that Django's "loaddata" over-writes existing rows in tables, it seems reasonable to do something similar for sequences -- even if that means we actually drop the sequence and fully re-create it.

PeterJCLaw · 2023-12-11T18:34:05Z

@lirsacc this may be of interest as I think you've wanted this before too.

danpalmer · 2023-12-11T23:45:40Z

Thanks for this Peter! I've not done a line by line review yet, I need to let this one stew for a few days if that's ok? The problem makes sense, having a solution in this codebase also makes sense, and this looks like a reasonable option.

A few questions/alternatives to discuss....

The use-case of the staging environment makes sense, but I didn't feel the use case was clear from the description of how to use it in the readme. The default is nicely labelled, but I wonder if we can come up with a more compelling decision tree or something where the motivation for all three options is obvious, such that there's no confusion on which is the best option.
Related, what are the downsides of using the per-table approach? Unless ridiculous, performance isn't a concern here I don't think. Do we need a full DB drop option?
What would this look like if it was a property of the strategy – i.e. whether the strategy wants its underlying data reset or not? Strategies may be idempotent or not (I think?). Would it be simpler if they always were or weren't?
Can we unify the reset modes and the sequences behaviour? Alternatively, do they need their own reset modes – drop/rewind?
Should this be a command line flag or a setting, or both?

src/devdata/reset_modes.py

PeterJCLaw · 2023-12-12T00:27:16Z

Thanks for this Peter! I've not done a line by line review yet, I need to let this one stew for a few days if that's ok?

Yup, sure.

The problem makes sense, having a solution in this codebase also makes sense, and this looks like a reasonable option.

A few questions/alternatives to discuss....

Good thoughts 👍, thanks for the prompt review :)

The use-case of the staging environment makes sense, but I didn't feel the use case was clear from the description of how to use it in the readme. The default is nicely labelled, but I wonder if we can come up with a more compelling decision tree or something where the motivation for all three options is obvious, such that there's no confusion on which is the best option.

👍 will have a think.

Related, what are the downsides of using the per-table approach? Unless ridiculous, performance isn't a concern here I don't think.

I think the main one is that it assumes that the existing database is close enough to what the current codebase defines that it can use the current models to empty the database. There are various ways that might not be true (even just being on a different branch might hit it) which could either fail loudly or not clear things out that the user might expect would be removed.
Sequences are the main example which come to mind for the non-failing case, though other tables also feel possible. I don't know what else might be left behind (especially on databases other than Postgres).

One thing which has just occurred to me and I suspect isn't covered by tests here (and is almost certainly a 🐞) -- the django migrations table isn't being cleared out, so that'll need addressing.

Do we need a full DB drop option?

Aside from the limitations of the alternatives, I think there's a couple of reasons to keep the drop-database option:

it makes this a non-breaking change (at least when kept as the default), and
it's potentially handy in cases there may not be an existing database (e.g: getting started in local dev).

Note that neither the drop-tables nor the none options handle the lack of an existing database currently. From what I recall you mentioned at one point that it had been challenging to detect an existing database (I think this was in the context of why there's confirmation around dropping the database even if there isn't one), so I didn't look into handling that here. It may be that this has gotten easier now though, so perhaps there's more which could be done.

What would this look like if it was a property of the strategy – i.e. whether the strategy wants its underlying data reset or not? Strategies may be idempotent or not (I think?). Would it be simpler if they always were or weren't?

Interesting, I hadn't thought about pushing this down to that layer.

This might fit alongside the none or drop-tables approaches (or variations thereof), though obviously not with the drop-database mode. To me this feels like it would end up pushing towards removing drop-database and maybe encouraging more complexity in the strategies to make them idempotent.

Thinking it through, another source of complexity would be handling the case of a single model with several strategies. For me this feels like it'd be getting too complex and having the reset separate is going to be quite a bit simpler. I guess it depends on how advanced you want this tool to get in terms of what it offers around working with an existing database.

Can we unify the reset modes and the sequences behaviour? Alternatively, do they need their own reset modes – drop/rewind?

Perhaps. Feels like this might need a bit more thinking -- currently there's no relationship between these modes and the later processing. I'll admit sequences weren't something I'd thought about until actually building this, so they ended up getting a simple tweak rather than being designed for explicitly.

Should this be a command line flag or a setting, or both?

Hadn't thought about making it a setting. While my use-case is one where I mostly want a different mode in different environments, it's not clear to me that that would always be the case nor that that's necessarily a good motivation for a setting. My immediate thought is that this feels like a thing you want to control when you run the command (or write the wrapper script) rather than something that's a property of the codebase.
Things like django-configurations do allow settings to be properties of the environment but I think I'd (personally) still find I wanted the CLI flag.

danpalmer

Re: pushing down into strategies...

Thinking it through, another source of complexity would be handling the case of a single model with several strategies. For me this feels like it'd be getting too complex and having the reset separate is going to be quite a bit simpler.

Very good point, I think this answers this one entirely for me, thanks!

Re: command line flag...

While my use-case is one where I mostly want a different mode in different environments, it's not clear to me that that would always be the case nor that that's necessarily a good motivation for a setting. My immediate thought is that this feels like a thing you want to control when you run the command (or write the wrapper script) rather than something that's a property of the codebase.

Sounds good. Let's start with a command-line flag. I think you're right that this is a property of the particular development environment and the current aim of the developer at a point in time, rather than a configuration around the runtime environment that might be configured with something like Configurations. We can always add a setting later if there's any desire for it.

Thanks for all your time on this and on replying in detail to the comments! Let's go with the approach in this PR. I've done a line by line review now.

Don't forget the potential bug about the handling of the migrations table, if it is necessary.

README.md

src/devdata/extras.py

src/devdata/management/commands/devdata_import.py

This happens in self.dump_data_for_import, likely I missed this in 78f06a3.

This introduces a util to simplify this and adds usage to a couple of places which had been missed.

Essentially we get a merging of the two sources of data.

Importing, even in the 'none' case, is potentially destructive so we want to give the user a chance to bail.

Also includes docs for the Postgres sequences extra.

PeterJCLaw · 2023-12-20T16:47:48Z

@danpalmer thanks for the review. I think I've addressed everything now -- both the TODOs and the review comments, so would be great if you had time for another pass. Hopefully review by commit of the new changes is clear enough, but happy to answer any further questions you have.

PeterJCLaw · 2023-12-20T16:51:26Z

Ah, just for clarity: I still haven't done any manual testing here and probably won't have time this side of the new year. The updates since the first review do include better testing though so I'm a little more confident in how this will behave now. Happy to wait & do some manual testing though if that's desired.

PeterJCLaw · 2024-01-02T14:22:41Z

@danpalmer hope you've had a good break, do you know when you might have a chance to have another look at this PR?

danpalmer

Thanks so much for this @PeterJCLaw. Go ahead and merge when you're ready!

PeterJCLaw · 2024-01-08T12:52:29Z

src/devdata/reset_modes.py

+        with connection.cursor() as cursor:
+            table_names = connection.introspection.table_names(cursor)
+
+            models = connection.introspection.installed_models(table_names)
+
+            if MigrationRecorder(connection).has_table():
+                models.add(MigrationRecorder.Migration)
+
+            with connection.schema_editor() as editor:
+                for model in models:
+                    editor.delete_model(model)


Hrm, should this be transactional? That way we'd either drop all the tables or none, which is probably better than maybe dropping some?

A failure mode here is if the user has permission for some of the tables but not all.

Given that we're not dropping the database it might even be nice to have the whole import be transactional, but that likely requires deeper changes.

I'm going to leave this as-is for the moment -- following the pattern of the drop-database mode which can also leave things part complete (even though that's perhaps less likely).

PeterJCLaw · 2024-01-08T13:06:39Z

src/devdata/reset_modes.py

+                for model in models:
+                    editor.delete_model(model)


A progress bar here might be nice. And/or some generic feedback that the "reset" portion of the import has completed.

I'm going to leave this as-is. I still think that a progress report here would be good, however making it clear that this is on the reset side feels non-trivial without some more general status reporting during import.

PeterJCLaw · 2024-01-08T16:58:25Z

Have now tested this in the use-case I originally had in mind, it seems to work :)

PeterJCLaw force-pushed the reset-modes branch from f41f27a to 6b5d600 Compare December 11, 2023 17:53

PeterJCLaw added 9 commits December 11, 2023 18:18

Introduce the idea of different modes to clear the database before im…

d79475c

…port Currently the only mode is the existing behaviour, however this opens the door to other approaches.

Allow users to opt-out from resetting the database

765c6f8

This is unlikely to be common, however it provides an escape hatch for advanced users who want to do their own thing.

Validate all reset modes when testing imports

c555467

This is slightly wassteful, but should mean we cheaply get good coverage over all the interesting model relations we come up with.

Make this assertion slightly more debuggable

fb9cec6

Extract util for easier import testing

78f06a3

Avoid duplicating this slug

f06dd84

Support resetting a database via dropping all tables

87566b2

Document the available reset modes and why they exist

b94d9df

PeterJCLaw force-pushed the reset-modes branch from 6b5d600 to b94d9df Compare December 11, 2023 18:18

PeterJCLaw marked this pull request as ready for review December 11, 2023 18:33

PeterJCLaw requested a review from danpalmer December 11, 2023 18:33

danpalmer reviewed Dec 11, 2023

View reviewed changes

src/devdata/reset_modes.py Outdated Show resolved Hide resolved

danpalmer approved these changes Dec 17, 2023

View reviewed changes

README.md Show resolved Hide resolved

src/devdata/extras.py Show resolved Hide resolved

src/devdata/management/commands/devdata_import.py Outdated Show resolved Hide resolved

PeterJCLaw added 4 commits December 20, 2023 15:13

Remove redundant migrations dumping

b8d61fa

This happens in self.dump_data_for_import, likely I missed this in 78f06a3.

Ensure Django's migrations table is consistently present

05da3db

This introduces a util to simplify this and adds usage to a couple of places which had been missed.

Simplify lookup of installed models

e7e3f20

Handle resetting of the django_migrations table too

4227139

PeterJCLaw mentioned this pull request Dec 20, 2023

django_db_blocker doesn't seem to be doing anything #34

Open

PeterJCLaw added 2 commits December 20, 2023 16:27

Demonstrate the result of reset-mode none over existing data

7d50fbd

Essentially we get a merging of the two sources of data.

Add example command lines to show usage

219e810

PeterJCLaw force-pushed the reset-modes branch from beef078 to 219e810 Compare December 20, 2023 16:27

PeterJCLaw added 2 commits December 20, 2023 16:43

Always confirm imports

3ad2e85

Importing, even in the 'none' case, is potentially destructive so we want to give the user a chance to bail.

Improve documentation on how the reset modes behave

729a19f

Also includes docs for the Postgres sequences extra.

PeterJCLaw requested a review from danpalmer December 20, 2023 16:47

danpalmer approved these changes Jan 5, 2024

View reviewed changes

PeterJCLaw commented Jan 8, 2024

View reviewed changes

PeterJCLaw merged commit d4a90d0 into danpalmer:main Jan 8, 2024
17 checks passed

PeterJCLaw deleted the reset-modes branch January 8, 2024 18:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support different reset modes during import #26

Support different reset modes during import #26

PeterJCLaw commented Dec 11, 2023 •

edited

PeterJCLaw commented Dec 11, 2023

danpalmer commented Dec 11, 2023

PeterJCLaw commented Dec 12, 2023 •

edited

danpalmer left a comment

PeterJCLaw commented Dec 20, 2023

PeterJCLaw commented Dec 20, 2023

PeterJCLaw commented Jan 2, 2024

danpalmer left a comment

PeterJCLaw Jan 8, 2024

PeterJCLaw Jan 8, 2024

PeterJCLaw Jan 8, 2024

PeterJCLaw Jan 8, 2024

PeterJCLaw commented Jan 8, 2024

Support different reset modes during import #26

Support different reset modes during import #26

Conversation

PeterJCLaw commented Dec 11, 2023 • edited

Footnotes

PeterJCLaw commented Dec 11, 2023

danpalmer commented Dec 11, 2023

PeterJCLaw commented Dec 12, 2023 • edited

danpalmer left a comment

Choose a reason for hiding this comment

PeterJCLaw commented Dec 20, 2023

PeterJCLaw commented Dec 20, 2023

PeterJCLaw commented Jan 2, 2024

danpalmer left a comment

Choose a reason for hiding this comment

PeterJCLaw Jan 8, 2024

Choose a reason for hiding this comment

PeterJCLaw Jan 8, 2024

Choose a reason for hiding this comment

PeterJCLaw Jan 8, 2024

Choose a reason for hiding this comment

PeterJCLaw Jan 8, 2024

Choose a reason for hiding this comment

PeterJCLaw commented Jan 8, 2024

PeterJCLaw commented Dec 11, 2023 •

edited

PeterJCLaw commented Dec 12, 2023 •

edited