Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support different reset modes during import #26

Merged
merged 17 commits into from Jan 8, 2024

Conversation

PeterJCLaw
Copy link
Collaborator

@PeterJCLaw PeterJCLaw commented Dec 11, 2023

This introduces support for "reset modes"1 -- different approaches for ensuring a clean database before importing the fresh data. Initial support is for three modes:

  • drop-database: the default; drops the database & re-creates it.
  • drop-tables: drops the tables the Django codebase is aware of, useful if the Django database user doesn't have access to drop the entire database.
  • none: no attempt to reset the database, useful if the user has already manually configured the database or otherwise wants more control over setup.

Using the second of these in a staging-type environment is the main motivation for this change.

This PR also updates the existing PostgresSequences extra to cope with a sequence of a given name already being present by overwriting it. This doesn't feel like the best solution (the result if the import somehow lists the same sequence twice may be unexpected), but seemed probably the simplest for now given what devdata is typically used for.

Review by commit may be useful, they're in a hopefully sensible order for that.

Testing is achieved by adding reset modes to all existing import tests as well as creating a few more. I've also tested this against the use-case I originally had (importing against an existing database part of a staging environment).

Fixes #23

TODO:

  • 🐞 ensure the django migrations table is cleared when using the drop-tables mode -> tested & fixed in 4227139
  • validate behaviour of none when importing over the top of an existing matching schema -> test demonstrating result in 84a2b03

Footnotes

  1. This doesn't feel like the best name, however "strategies" is already taken in this project and I didn't want to confuse that. Open to changing this if we can come up with something better!

…port

Currently the only mode is the existing behaviour, however this
opens the door to other approaches.
This is unlikely to be common, however it provides an escape hatch
for advanced users who want to do their own thing.
This is slightly wassteful, but should mean we cheaply get good
coverage over all the interesting model relations we come up with.
Sequences don't nicely fit into one of just schema or data, they're
somewhat inherently both. Given that Django's "loaddata" over-writes
existing rows in tables, it seems reasonable to do something similar
for sequences -- even if that means we actually drop the sequence
and fully re-create it.
@PeterJCLaw
Copy link
Collaborator Author

@lirsacc this may be of interest as I think you've wanted this before too.

@danpalmer
Copy link
Owner

Thanks for this Peter! I've not done a line by line review yet, I need to let this one stew for a few days if that's ok? The problem makes sense, having a solution in this codebase also makes sense, and this looks like a reasonable option.

A few questions/alternatives to discuss....

  • The use-case of the staging environment makes sense, but I didn't feel the use case was clear from the description of how to use it in the readme. The default is nicely labelled, but I wonder if we can come up with a more compelling decision tree or something where the motivation for all three options is obvious, such that there's no confusion on which is the best option.
  • Related, what are the downsides of using the per-table approach? Unless ridiculous, performance isn't a concern here I don't think. Do we need a full DB drop option?
  • What would this look like if it was a property of the strategy – i.e. whether the strategy wants its underlying data reset or not? Strategies may be idempotent or not (I think?). Would it be simpler if they always were or weren't?
  • Can we unify the reset modes and the sequences behaviour? Alternatively, do they need their own reset modes – drop/rewind?
  • Should this be a command line flag or a setting, or both?

@PeterJCLaw
Copy link
Collaborator Author

PeterJCLaw commented Dec 12, 2023

Thanks for this Peter! I've not done a line by line review yet, I need to let this one stew for a few days if that's ok?

Yup, sure.

The problem makes sense, having a solution in this codebase also makes sense, and this looks like a reasonable option.

A few questions/alternatives to discuss....

Good thoughts 👍, thanks for the prompt review :)

  • The use-case of the staging environment makes sense, but I didn't feel the use case was clear from the description of how to use it in the readme. The default is nicely labelled, but I wonder if we can come up with a more compelling decision tree or something where the motivation for all three options is obvious, such that there's no confusion on which is the best option.

👍 will have a think.

  • Related, what are the downsides of using the per-table approach? Unless ridiculous, performance isn't a concern here I don't think.

I think the main one is that it assumes that the existing database is close enough to what the current codebase defines that it can use the current models to empty the database. There are various ways that might not be true (even just being on a different branch might hit it) which could either fail loudly or not clear things out that the user might expect would be removed.
Sequences are the main example which come to mind for the non-failing case, though other tables also feel possible. I don't know what else might be left behind (especially on databases other than Postgres).

One thing which has just occurred to me and I suspect isn't covered by tests here (and is almost certainly a 🐞) -- the django migrations table isn't being cleared out, so that'll need addressing.

Do we need a full DB drop option?

Aside from the limitations of the alternatives, I think there's a couple of reasons to keep the drop-database option:

  • it makes this a non-breaking change (at least when kept as the default), and
  • it's potentially handy in cases there may not be an existing database (e.g: getting started in local dev).

Note that neither the drop-tables nor the none options handle the lack of an existing database currently. From what I recall you mentioned at one point that it had been challenging to detect an existing database (I think this was in the context of why there's confirmation around dropping the database even if there isn't one), so I didn't look into handling that here. It may be that this has gotten easier now though, so perhaps there's more which could be done.

  • What would this look like if it was a property of the strategy – i.e. whether the strategy wants its underlying data reset or not? Strategies may be idempotent or not (I think?). Would it be simpler if they always were or weren't?

Interesting, I hadn't thought about pushing this down to that layer.

This might fit alongside the none or drop-tables approaches (or variations thereof), though obviously not with the drop-database mode. To me this feels like it would end up pushing towards removing drop-database and maybe encouraging more complexity in the strategies to make them idempotent.

Thinking it through, another source of complexity would be handling the case of a single model with several strategies. For me this feels like it'd be getting too complex and having the reset separate is going to be quite a bit simpler. I guess it depends on how advanced you want this tool to get in terms of what it offers around working with an existing database.

  • Can we unify the reset modes and the sequences behaviour? Alternatively, do they need their own reset modes – drop/rewind?

Perhaps. Feels like this might need a bit more thinking -- currently there's no relationship between these modes and the later processing. I'll admit sequences weren't something I'd thought about until actually building this, so they ended up getting a simple tweak rather than being designed for explicitly.

  • Should this be a command line flag or a setting, or both?

Hadn't thought about making it a setting. While my use-case is one where I mostly want a different mode in different environments, it's not clear to me that that would always be the case nor that that's necessarily a good motivation for a setting. My immediate thought is that this feels like a thing you want to control when you run the command (or write the wrapper script) rather than something that's a property of the codebase.
Things like django-configurations do allow settings to be properties of the environment but I think I'd (personally) still find I wanted the CLI flag.

Copy link
Owner

@danpalmer danpalmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: pushing down into strategies...

Thinking it through, another source of complexity would be handling the case of a single model with several strategies. For me this feels like it'd be getting too complex and having the reset separate is going to be quite a bit simpler.

Very good point, I think this answers this one entirely for me, thanks!

Re: command line flag...

While my use-case is one where I mostly want a different mode in different environments, it's not clear to me that that would always be the case nor that that's necessarily a good motivation for a setting. My immediate thought is that this feels like a thing you want to control when you run the command (or write the wrapper script) rather than something that's a property of the codebase.

Sounds good. Let's start with a command-line flag. I think you're right that this is a property of the particular development environment and the current aim of the developer at a point in time, rather than a configuration around the runtime environment that might be configured with something like Configurations. We can always add a setting later if there's any desire for it.


Thanks for all your time on this and on replying in detail to the comments! Let's go with the approach in this PR. I've done a line by line review now.

Don't forget the potential bug about the handling of the migrations table, if it is necessary.

README.md Show resolved Hide resolved
src/devdata/extras.py Show resolved Hide resolved
src/devdata/management/commands/devdata_import.py Outdated Show resolved Hide resolved
This happens in self.dump_data_for_import, likely I missed this in 78f06a3.
This introduces a util to simplify this and adds usage to a couple
of places which had been missed.
Importing, even in the 'none' case, is potentially destructive so
we want to give the user a chance to bail.
Also includes docs for the Postgres sequences extra.
@PeterJCLaw
Copy link
Collaborator Author

@danpalmer thanks for the review. I think I've addressed everything now -- both the TODOs and the review comments, so would be great if you had time for another pass. Hopefully review by commit of the new changes is clear enough, but happy to answer any further questions you have.

@PeterJCLaw
Copy link
Collaborator Author

Ah, just for clarity: I still haven't done any manual testing here and probably won't have time this side of the new year. The updates since the first review do include better testing though so I'm a little more confident in how this will behave now. Happy to wait & do some manual testing though if that's desired.

@PeterJCLaw
Copy link
Collaborator Author

@danpalmer hope you've had a good break, do you know when you might have a chance to have another look at this PR?

Copy link
Owner

@danpalmer danpalmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for this @PeterJCLaw. Go ahead and merge when you're ready!

Comment on lines +96 to +106
with connection.cursor() as cursor:
table_names = connection.introspection.table_names(cursor)

models = connection.introspection.installed_models(table_names)

if MigrationRecorder(connection).has_table():
models.add(MigrationRecorder.Migration)

with connection.schema_editor() as editor:
for model in models:
editor.delete_model(model)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm, should this be transactional? That way we'd either drop all the tables or none, which is probably better than maybe dropping some?

A failure mode here is if the user has permission for some of the tables but not all.

Given that we're not dropping the database it might even be nice to have the whole import be transactional, but that likely requires deeper changes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to leave this as-is for the moment -- following the pattern of the drop-database mode which can also leave things part complete (even though that's perhaps less likely).

Comment on lines +105 to +106
for model in models:
editor.delete_model(model)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A progress bar here might be nice. And/or some generic feedback that the "reset" portion of the import has completed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to leave this as-is. I still think that a progress report here would be good, however making it clear that this is on the reset side feels non-trivial without some more general status reporting during import.

@PeterJCLaw
Copy link
Collaborator Author

Have now tested this in the use-case I originally had in mind, it seems to work :)

@PeterJCLaw PeterJCLaw merged commit d4a90d0 into danpalmer:main Jan 8, 2024
17 checks passed
@PeterJCLaw PeterJCLaw deleted the reset-modes branch January 8, 2024 18:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support running from an empty database without dropping/re-creating
2 participants