
Add migrate endpoint to move artifacts to another storage backend #4298

Open · wants to merge 1 commit into base: main
Conversation

gerrod3
Contributor

@gerrod3 gerrod3 commented Aug 17, 2023

fixes: #3358

@gerrod3 gerrod3 force-pushed the migrate-backend branch 4 times, most recently from 6c09731 to 49669c1 Compare August 18, 2023 03:09
- kwargs["pulp_domain"] = request.pulp_domain.name
+ kwargs["pulp_domain"] = task.pulp_domain.name
Member

This is not a semantic change, right? Should we maybe always try to use get_domain instead?

Contributor Author

request.domain == get_domain(), so there is no semantic change. This change is for when the task is set in a separate domain from the calling domain. In this case, migrate is called (or can be called) from the default domain, since domains are themselves a part of the default domain.
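The distinction could be sketched roughly like this (helper and attribute names are illustrative, not Pulp's actual code): when building a domain-scoped URL for a dispatched task, the domain must come from the task, not from the request that happened to dispatch it.

```python
from types import SimpleNamespace

# Illustrative sketch, not Pulp's actual code: build a domain-scoped URL,
# preferring the task's domain over the request's when both are available.
def domain_url(path, *, task=None, request=None):
    if task is not None:
        domain = task.pulp_domain.name  # the domain the task was dispatched into
    else:
        domain = request.pulp_domain.name  # the domain the call came in on
    return f"/pulp/{domain}/api/v3/{path}/"

request = SimpleNamespace(pulp_domain=SimpleNamespace(name="default"))
task = SimpleNamespace(pulp_domain=SimpleNamespace(name="foo-domain"))
print(domain_url("artifacts", request=request))             # /pulp/default/api/v3/artifacts/
print(domain_url("artifacts", task=task, request=request))  # /pulp/foo-domain/api/v3/artifacts/
```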

Member

But should you not call migrate from within the affected domain?
The fact that domains live in the default domain is a bit confusing, I guess.

Contributor Author

Yeah, this is a question I haven't figured out yet: where to put the new endpoint. Currently it is on the domains endpoint, specifically on an instance of a domain (/pulp/default/api/v3/domains/domain_href/migrate/), but it could equally be a new standalone endpoint that figures out what to do based on the domain it ran in (/pulp/foo-domain/api/v3/migrate/). Not sure which approach is best; any suggestions?

Member

@mdellweg mdellweg Aug 22, 2023

Self-referential stuff is always weird...
Would POST /pulp/foo-domain/api/v3/domain/migrate/ work? Where /pulp/foo-domain/api/v3/domain/ is a self-inspecting domain detail viewset?

Contributor Author

Yes, that would work as well. Would you like for me to try it out?

Member

I'd prefer that.

@action(detail=True, methods=["patch"])
def migrate(self, request, **kwargs):
"""
Migrate the domain's storage backend to a new one.
Member

Do we need to make the migrating of data and the updating of storage credentials more atomic?

Here's the scenario I worry about:

  1. user wants to migrate and starts a /migration/ request
  2. user still has content-change operations submitted, e.g. a cron-dispatched sync which queues and waits while content is migrating
  3. the content migration finishes and the subsequent content-change task begins to run
  4. because the credentials aren't updated, that new content is saved in the old backend once again

Before thinking over ideas on how to fix this, am I understanding this PR right that this could happen? If so, what do you all think about this?

Contributor Author

Only on the default domain can your scenario happen, which is why I explicitly instruct users to update their config file to the new backend and restart Pulp after the migration (hopefully users of domains won't be migrating the default domain very often). The migration task locks the domain so no other task can run, and for all other non-default domains the task will also update that domain's storage settings to the new backend. Therefore any queued up task will run with the new backend after the migration.

Note: the default domain is special since it loads its storage values from the config file, which with dynaconf means they could be stored anywhere, so it was impractical to update these values automatically.

Member

I didn't realize they update the configs after migration. My concern is not an issue. Regarding the default domain, I don't think we can do any better either. Thanks for explaining this.

Member

Should we let the default domain configure its storage backend in the database too, then?
Or what about this: we disallow using this endpoint on the default domain, and make this migration for the default domain a management command that checks that services are shut down? And on startup, all services check that the storage settings match the db?

Member

@ipanova ipanova Aug 23, 2023

Since the default domain is special and essentially requires a manual switch-over, my recommendation would be to document the necessary steps (two?) and add a warning note making it clear that until those steps are performed, all content, including new content, will be saved to/served from the old backend.
And that another migrate task might be needed, in case these steps were omitted, to bring over the newly created diff of content.

Member

@mdellweg you have a valid point

Contributor Author

@gerrod3 gerrod3 Aug 28, 2023

Ok, I've tried to reach a compromise between having the default domain update immediately upon migration, but still allow for its settings to be determined by the config file. Here is the new logic:

  • All domains have their storage instance built using their field storage_settings. For the default domain this field is empty and thus will use the default values from settings.
  • Upon migration the domain's storage_settings is updated to the new values. This way all domains, including the default domain, will now properly use the new storage backend after the migration.
  • If migrating the default domain, a message is put into the migration task to remind the user to update their config with the new storage settings before the next DB migration.
  • Upon pulpcore-manager migrate, Pulp will, in a post-migrate hook, update the default domain's saved fields (storage_class, redirect_to_object_storage, hide_distributions) to match the values in the config file. storage_settings will be reset back to empty to ensure that the default storage is again built from settings.
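The post-migrate reset in the last bullet could look roughly like this (a sketch only: the field names follow the comment above, but the config keys and the wiring into Django's post-migrate machinery are assumed, not Pulp's actual code):

```python
from types import SimpleNamespace

def reset_default_domain(domain, config):
    """Sync the default domain's saved fields back to the config file values
    and clear storage_settings so the default storage is rebuilt from settings."""
    domain.storage_class = config["storage_class"]
    domain.redirect_to_object_storage = config["redirect_to_object_storage"]
    domain.hide_distributions = config["hide_distributions"]
    domain.storage_settings = {}  # reset: default storage comes from settings again
    return domain

# Demo with a stand-in object instead of the real Domain model.
domain = SimpleNamespace(
    storage_class="old.Backend",
    redirect_to_object_storage=False,
    hide_distributions=False,
    storage_settings={"bucket": "old"},
)
config = {
    "storage_class": "new.Backend",
    "redirect_to_object_storage": True,
    "hide_distributions": False,
}
reset_default_domain(domain, config)
print(domain.storage_class, domain.storage_settings)  # new.Backend {}
```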

Member

Sorry for keeping you busy with this one, but I think we should try to find a bulletproof way.
If I understand your compromise correctly, the admin will run the migration, whereby the default domain will change its config. Now they miss reading the message, and upon running pulpcore-manager migrate the system will be borked again. Is this correct?

What if we follow an approach more similar to the allowed checksums setting?

  1. You adjust the config.
  2. You restart the services (they may warn you about the mismatch).
  3. You kick off the migrate task for the default domain without any parameters (they come from the settings).
    3.1 If the settings fail to validate, it will error out in the view.

Contributor Author

@gerrod3 gerrod3 Aug 30, 2023

Can you explain more of what you mean by 3 and 3.1? I am not following your logic.

EDIT

Now they miss reading the message and upon running pulpcore-manager migrate the system will be borked again. Is this correct?

Yes, the system will be bricked until they update the config and run pulpcore-manager migrate. I've added a check into settings.py for mismatched settings that logs a warning to the user.

Member

Yes, the system will be bricked until they update the config and run pulpcore-manager migrate. I've added a check into settings.py for mismatched settings that logs a warning to the user.

But by the time this message is shown (and it may even be missed by the admin), Pulp is reconfigured to use the old (probably already outdated) storage, and with the first new artifact there, you end up with two inconsistent storages. That is the situation I try to avoid.

Line of thinking: how much less special can we make the default domain? What if the default domain, like any other, just always carries its storage definition? So when you want to change that, you modify the settings, restart your services, and tell Pulp to rectify the situation. That will be your migrate task, but it should take its parameters from the config now, instead of the request body (or we make it fail when they differ...). They get validated before the task is dispatched (that was 3.1). Now at all times the database will be the truth, and the settings are the desired state the database is supposed to match eventually.
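The flow described here could be sketched as follows (all names are illustrative, not Pulp's actual code): the view derives the migration parameters from settings rather than the request body, validates them, and refuses to dispatch when nothing would change.

```python
def prepare_default_domain_migration(db_settings, config_settings, validate):
    """Validate config-derived storage settings before dispatching the task."""
    if not validate(config_settings):
        raise ValueError("storage settings failed validation")  # step 3.1: error in the view
    if config_settings == db_settings:
        raise ValueError("settings already match the database; nothing to migrate")
    return config_settings  # these become the dispatched task's parameters

# Demo: the DB still records the old backend, the config names the new one.
db = {"storage_class": "old.Backend"}
cfg = {"storage_class": "new.Backend"}
params = prepare_default_domain_migration(db, cfg, validate=lambda s: "storage_class" in s)
print(params)  # {'storage_class': 'new.Backend'}
```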

pulpcore/app/tasks/migrate.py (outdated conversation, resolved)

domain.storage_class = data["storage_class"]
domain.storage_settings = data["storage_settings"]
domain.save(update_fields=["storage_class", "storage_settings"])
pb.increment()
Contributor

How errors are handled is one of my biggest concerns with this feature. If I'm reading this right:

  • If an error occurs right away (i.e. new storage settings are wrong), the error gets thrown and nothing is updated
  • If an error occurs halfway through the migration (or let's say it gets cancelled), some of the artifacts would be migrated to the new storage but some would not, and the domain would not be updated to the new storage, correct? And then a user would need to migrate again to finish migrating the rest of the items, or else specify a new third configuration, whereby everything would be migrated to the third storage.

Does that sound accurate?

Contributor Author

Yep, that's correct for both scenarios. I'll add a check for whether the artifact is missing on the current backend, and tell the user to run the repair command or delete the artifact if so.
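The pre-flight check described here could be sketched like this (a stand-in object replaces the real Django Storage backend; only an `exists()` method is assumed):

```python
class FakeStorage:
    """Minimal stand-in for a Django Storage backend; only exists() is needed."""
    def __init__(self, files):
        self.files = set(files)
    def exists(self, name):
        return name in self.files

def find_missing_files(names, current_storage):
    """Return file names absent from the current backend, which the user
    should repair or delete before (or instead of) migrating them."""
    return [n for n in names if not current_storage.exists(n)]

current = FakeStorage({"artifact/ab/cd"})
print(find_missing_files(["artifact/ab/cd", "artifact/ef/01"], current))
# ['artifact/ef/01']
```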

date = now()
continue
break

Contributor

what happens to the files stored in the old storage? Are they just left behind?

Contributor Author

Correct, I don't attempt to delete them. The user can decide what to do with their old backend.

Member

Do we document that?
Also I think we eventually need to provide tools for this. I'd be ok to leave that for later...

Contributor Author

It's documented in the endpoint's method comment, which appears in the auto-generated docs. I don't think there is a need for a tool to delete files from the storage. We already entrust users to bring their own storage, so I think they can manage their files on it, but I agree that if we decide we need a tool, it would be best to add it later.

artifacts = Artifact.objects.filter(pulp_domain=domain)
date = now()

with ProgressReport(
Member

Now here's a new question coming to my mind. Are there settings that can change without it being a new bucket? Thinking about key-rotation here...
Is there any way we can identify "this is the same bucket, do not move the artifacts"? Or even "with the old settings you cannot reach the bucket anymore"?

Contributor Author

Still need to think about this some more, but I do believe there is a way. I check to see if the artifact is present in the new location, but depending on the number of artifacts this could still be slow and doing unnecessary work.
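The "skip if already present in the new location" check could be sketched like this (stand-in storage objects replace Django storages; only `exists`/`open`/`save` are assumed, and the real code would stream rather than read whole files):

```python
import io

class FakeStorage:
    """Minimal stand-in for a Django Storage backend."""
    def __init__(self, files=None):
        self.files = dict(files or {})
    def exists(self, name):
        return name in self.files
    def open(self, name):
        return io.BytesIO(self.files[name])
    def save(self, name, fileobj):
        self.files[name] = fileobj.read()

def migrate_file(name, old_storage, new_storage):
    """Copy one file to the new backend, skipping it when already present
    (e.g. same bucket, or a resumed migration)."""
    if new_storage.exists(name):
        return "skipped"
    with old_storage.open(name) as f:
        new_storage.save(name, f)
    return "copied"

old = FakeStorage({"artifact/ab/cd": b"data", "artifact/ef/01": b"more"})
new = FakeStorage({"artifact/ab/cd": b"data"})  # already migrated earlier
print([migrate_file(n, old, new) for n in sorted(old.files)])  # ['skipped', 'copied']
```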

@gerrod3 gerrod3 marked this pull request as draft September 18, 2023 19:29

stale bot commented Dec 17, 2023

This pull request has been marked 'stale' due to lack of recent activity. If there is no further activity, the PR will be closed in another 30 days. Thank you for your contribution!

@stale stale bot added the stale label Dec 17, 2023
@gerrod3 gerrod3 removed the stale label Jan 4, 2024

stale bot commented Apr 9, 2024

This pull request has been marked 'stale' due to lack of recent activity. If there is no further activity, the PR will be closed in another 30 days. Thank you for your contribution!

@stale stale bot added the stale label Apr 9, 2024
@bmbouter
Member

I met with @jlsherrill yesterday and he brought this up as a feature their usage will need this year. I'm hoping we can release it upstream and test it some there before putting it into use by @jlsherrill 's team

@gerrod3 and @goosemania could you give some thought to when this could be continued? I'm not sure what else you have going on so I'm just starting the convo here.


stale bot commented Apr 24, 2024

This pull request is no longer marked for closure.

@stale stale bot removed the stale label Apr 24, 2024
@gerrod3 gerrod3 force-pushed the migrate-backend branch 2 times, most recently from d2347ad to 651737e Compare April 24, 2024 14:17
@gerrod3
Contributor Author

gerrod3 commented Apr 24, 2024

Ok, I've updated the PR to make the default domain less special. The storage settings for the default domain will now also be saved in the database. This will be accomplished through a data migration (should be ZDU safe). When migrating the default domain the endpoint will use the values from the config file so that settings will still be the final source of truth for the default domain. This will allow users to change the values in their settings file and then simply run the migrate endpoint to move their artifacts to the new backend, all without restarting Pulp.

I'm still unsure of the best way to inform users if there is a disparity between settings and the default domain. Right now I have a check in settings.py file, but it doesn't feel like the right place.

@gerrod3 gerrod3 force-pushed the migrate-backend branch 3 times, most recently from d96ec2e to aa42810 Compare May 14, 2024 14:12
@gerrod3 gerrod3 marked this pull request as ready for review May 14, 2024 15:09
Member

@mdellweg mdellweg left a comment

Not a full review...
Did not look at the tests yet.

def default_domain_settings(apps, schema_editor):
Domain = apps.get_model("core", "Domain")
default_domain = Domain.objects.get(name="default")
config_settings = StorageSettingsSerializer.get_default_domain_settings(settings)
Member

This is a bit dangerous. Maybe it's better at this point to keep a frozen copy of the serializer.

Comment on lines -193 to +195
- object_parameters = serializers.DictField(default={})
+ object_parameters = serializers.DictField(default=dict())
Member

This isn't really a change. Did you intend to change to default=dict to make it use a callable?
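The reviewer's point, in plain Python (a sketch of the semantics only, not DRF's internals): `default={}` and `default=dict()` both build one dict at class-definition time that every consumer shares, while `default=dict` is a callable that yields a fresh dict each time it is asked for.

```python
class FieldWithSharedDefault:
    default = {}  # evaluated once; the same dict is handed out every time

    def get_default(self):
        return self.default


class FieldWithCallableDefault:
    default = dict  # a callable; invoked for a fresh dict on every use

    def get_default(self):
        return self.default() if callable(self.default) else self.default


shared = FieldWithSharedDefault()
a = shared.get_default()
a["key"] = "leaks"
print(shared.get_default())  # {'key': 'leaks'} — the mutation is visible to everyone

fresh = FieldWithCallableDefault()
b = fresh.get_default()
b["key"] = "does not leak"
print(fresh.get_default())  # {}
```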

)
row = cursor.fetchone()
mismatched = []
for i, setting in enumerate(
Member

would

for k, v in zip((DEFAULT_FILE_STORAGE, ...), row):

work?
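The zip-based rewrite being suggested pairs each settings key with the corresponding column of the fetched row and collects the mismatches. A sketch with illustrative setting names and row values (not Pulp's actual ones):

```python
# Keys in the same order as the columns of the fetched DB row.
SETTING_KEYS = ("DEFAULT_FILE_STORAGE", "MEDIA_ROOT", "MEDIA_URL")

# Live settings values (illustrative).
current_settings = {
    "DEFAULT_FILE_STORAGE": "storages.backends.s3boto3.S3Boto3Storage",
    "MEDIA_ROOT": "/var/lib/pulp/media",
    "MEDIA_URL": "/media/",
}

# What cursor.fetchone() returned (illustrative).
row = ("pulpcore.app.models.storage.FileSystem", "/var/lib/pulp/media", "/media/")

# Pair each key with its DB column and keep the ones that disagree.
mismatched = [k for k, v in zip(SETTING_KEYS, row) if current_settings[k] != v]
print(mismatched)  # ['DEFAULT_FILE_STORAGE']
```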

Successfully merging this pull request may close these issues.

As a domain user, I can change my storage system credentials and Pulp will migrate my data