
[CEP] Migrate RepeaterRecord to SQL #28314

Open
kaapstorm opened this issue Aug 6, 2020 · 7 comments
Assignees
Labels
CEP: done CEP CommCare Enhancement Proposal

Comments

@kaapstorm
Contributor

kaapstorm commented Aug 6, 2020

Abstract

Using SQL for repeat records offers significant performance improvements for Data Forwarding.

Motivation

More projects are using Data Forwarding, and the volume of payloads being sent is increasing. This increases the negative impact of current inefficiencies.

HQ is becoming more vulnerable to the effects of downtime of remote systems. As our integration offerings around FHIR, COVID-19 and vaccinations improve, these risks compound.

Specification (updated 2020-12-10)

Current state

There are two execution paths to forwarding a repeat record payload:

  1. A Django signal (e.g. on form submission, or app change) triggers a repeat record to be registered, and RepeatRecord.attempt_forward_now() is called.

  2. The check_repeaters() scheduled task iterates repeat record instances, and calls RepeatRecord.attempt_forward_now().

A series of calls, spanning tasks and the RepeatRecord and Repeater models, send the payload and handle the response or failure.

Problems with current state

There are three low-hanging problems with this approach; the first two are related to Couch, and the third is a design issue that is clear in hindsight:

  1. RepeatRecord instances are iterated independently of their Repeater: We iterate the repeat records of paused repeaters. In the past we've managed to improve performance noticeably just by delaying the next forwarding attempt for paused repeaters' repeat records by a week instead of a day. With SQL relationships, that workaround becomes unnecessary: we could simply filter out the repeat records of paused repeaters. (And when a repeater is unpaused, we can forward its repeat records immediately, aligning behaviour with user expectations.)

  2. RepeatRecord instances are forwarded independently of their peers: e.g. if an attempt to forward a new form submission fails, its repeat record is scheduled to be retried an hour later (then three hours, nine hours, etc.). If a second form is submitted a moment after the first, we try to forward it immediately, unaware that an attempt to the same remote service failed a moment ago. It will probably fail too; and if it succeeds, that payload will have been forwarded out of sequence. Again, SQL relationships make it trivial to associate repeat records with each other:

    1. We can batch forwarding attempts at consistent intervals.

    2. And we can always send payloads in the sequence in which they were registered.

  3. If we refactor RepeatRecord.fire() from a method to a function, and pull in some of the code that is spread across Repeater and RepeatRecord, we can simplify the code for sending a payload quite a bit.
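To make problem 2 concrete, here is an illustrative sketch (not HQ code; all names are hypothetical) of what repeater-level backoff could look like: all of a repeater's pending records would share one retry schedule, instead of each record backing off independently. The 1h/3h/9h progression mirrors the per-record schedule described above, capped at MAX_RETRY_WAIT.

```python
from datetime import datetime, timedelta

# Illustrative sketch only: next retry time is computed at the *repeater*
# level, so all of a repeater's pending records share one schedule.
MIN_RETRY_WAIT = timedelta(hours=1)
MAX_RETRY_WAIT = timedelta(days=7)


def next_retry_interval(failed_attempts):
    """Exponential backoff: 1h, 3h, 9h, ... capped at 7 days."""
    interval = MIN_RETRY_WAIT * (3 ** max(failed_attempts - 1, 0))
    return min(interval, MAX_RETRY_WAIT)


def next_attempt_at(last_attempt_at, failed_attempts):
    """When the repeater (and all its pending records) should next be tried."""
    return last_attempt_at + next_retry_interval(failed_attempts)
```

With a schedule like this, a second form submitted a moment after a failure would simply wait for the repeater's next attempt, and would be sent in registration order when that attempt succeeds.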

Proposed changes

New models and tasks
  • Create SQL models for RepeatRecord, SQLRepeatRecord and SQLRepeatRecordAttempt.

  • Create SQLRepeaterStub only to link RepeatRecords. Repeater data and functionality would remain in existing Couch models. For now, Repeaters are not migrated to SQL.

  • Repeater.register() would create a corresponding SQLRepeaterStub if it doesn't exist, add a new SQLRepeatRecord instance, and pass the SQLRepeaterStub (not the repeat record) to an attempt_forward_now() function.

  • The repeater, not the repeat record, is retried later if a remote service is unavailable. attempt_forward_now() calls a process_repeater() task, which iterates a repeater's repeat records in the order in which they were registered and sends their payloads. Repeaters are processed concurrently by the process_repeater() task, which is assigned to the repeat record queue; that queue uses green threads on Production. (Production has 200 repeat record queue workers and just under 1400 repeaters, although probably fewer than half of the repeaters are in use.)

  • Instead of two Couch requests for every repeat record, SQL repeat records are fetched in chunks of 1000.
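A minimal sketch of the processing loop described above (function names match the proposal; the implementation details are assumptions): iterate a repeater's pending records in registration order, in chunks of 1000, and stop at the first failure so that later payloads are never sent out of sequence.

```python
from itertools import islice

CHUNK_SIZE = 1000  # SQL repeat records fetched in chunks of 1000


def iter_in_chunks(records, chunk_size=CHUNK_SIZE):
    """Yield successive chunks of `records` (stands in for chunked SQL fetches)."""
    it = iter(records)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def process_repeater(pending_records, send_payload):
    """Sketch of the process_repeater() task.

    `pending_records` is assumed to be sorted by registered_on;
    `send_payload` returns True on success. Returns the records sent.
    """
    sent = []
    for chunk in iter_in_chunks(pending_records):
        for record in chunk:
            if not send_payload(record):
                return sent  # back off; retry the repeater later
            sent.append(record)
    return sent
```

Stopping at the first failure is what preserves ordering: the failed record and everything registered after it wait for the repeater's next attempt.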

Old repeat records

Couch code remains in place for existing repeat records. Couch repeat records are soft-migrated to SQL as they become due for retrying, and are forwarded using the new process_repeater() task.

After seven days (MAX_RETRY_WAIT) the only repeat records not yet migrated will be old succeeded or cancelled records. These can be migrated using a management command, or ignored, or deleted.
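The soft-migration rule can be sketched as follows (field and state names are assumptions for illustration): a Couch record is migrated only when it comes due for retrying, and succeeded or cancelled records are never due, which is why they are the only ones left after seven days.

```python
from datetime import datetime

# States in which a repeat record can still become due for retrying.
# Succeeded and cancelled records never come due, so they are never
# soft-migrated by this path.
DUE_STATES = {'pending', 'failed'}


def select_due_for_migration(couch_records, now):
    """Return the Couch records that should be soft-migrated to SQL now."""
    return [
        rec for rec in couch_records
        if rec['state'] in DUE_STATES and rec['next_check'] <= now
    ]
```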

Repeat Records Report

The repeat records report would be subclassed for SQL repeat records. If a domain has SQL repeat records, the SQLRepeatRecordReport would be used.

Datadog

Couch and SQL workflows would feed the same metrics.

Rollback command

Rollback would be a two-step process:

  1. Revert the PR that creates SQLRepeatRecord instances and changes the check_repeaters() task.
  2. Use a management command to reverse the migration of Couch repeat records, and create Couch records for new SQL records.

Impact on users

The change would not interrupt the delivery of repeat records.

Old repeat records would not appear in the Repeat Records Report until they were migrated to SQL.

Impact on hosting

The process_repeater() task would use the same workers as the process_repeat_record() task did. No impact on hosting.

Deploy process

Deploy would involve 2 PRs:

  1. Models, tasks, the SQLRepeatRecordReport, and the rollback management command. This PR can be deployed early. It would not change functionality.
  2. A "switch" PR that would start creating SQLRepeatRecord instances instead of Couch RepeatRecord instances, and soft-migrate Couch repeat records.

Backwards compatibility

When HQ switches to the new SQL workflow, support would still exist for Couch repeat records.

Release timeline

The code changes should not conflict with ongoing FHIR work, but would definitely benefit that work if they were in place before the FHIR work is released. Ideally these changes should be deployed before that work, which is due to complete its first deliverables in Q1 2021.

Open questions and issues

  • What should a comprehensive plan to QA this migration cover?

  • Scale: There are about 1400 repeaters and over 72 million repeat records on Production. If we switch on a Monday, there will be the fewest repeat records to soft-migrate. If we want to migrate old succeeded and cancelled repeat records, it would be best done over the following weekend, when we could use idle repeat record queue workers. Migrating 72 million repeat records will take several days. Do we want to migrate them all? Do we want a cut-off age, e.g. only migrating records less than a year old?

(I will create a follow-up CEP on the topic of Repeat Record TTL.)

@kaapstorm kaapstorm added the CEP CommCare Enhancement Proposal label Aug 6, 2020
@kaapstorm kaapstorm self-assigned this Aug 6, 2020
@snopoke
Contributor

snopoke commented Aug 6, 2020

Thanks for putting this together Norman. A few questions / clarifications:

  • This seems to cover a few different things. The concurrency suggestions seem like they would be independent of the migration to SQL.

  • We already use gevent for the repeater celery workers. Would the concurrency changes you are proposing replace that or work in conjunction?

  • How long do we keep repeat records / attempts? It seems important to me that we should not keep these records forever. If we didn't keep them forever, it would make the migration easier, since we could just wait until all old records are removed instead of migrating them to SQL.

@kaapstorm
Contributor Author

Hey @snopoke . Excellent questions!

  • The concurrency suggestions are independent, but they build on the migration to SQL -- I don't think it would make sense to try to implement them before migrating to SQL, because I think iterating RepeatRecords more accurately should be a prerequisite.

  • I had completely forgotten that Celery can use gevent, and that it uses gevent for the repeat record queue (whose only task is process_repeat_record()). ... That is excellent! And it means that the benefits of using process pool workers + aiohttp would be much less than I had believed, if there would be any benefit at all. ... Given that, and that I think it is obvious that a SQL migration should use a parallel workflow, and should iterate RepeatRecords using related models, it seems this CEP is not really necessary.

  • I'm pretty sure we never delete repeat records / attempts. As you know, we delete RequestLogs after 90 days (corehq.motech.repeaters.tasks.clean_logs()). I will check with the AEs, who may use the Repeat Records Report more often than I do, but 90 days seems like a reasonable amount of time to keep succeeded and cancelled repeat records / attempts too. I'd be curious to hear thoughts on whether we should delete repeat records for paused repeaters.

@snopoke
Contributor

snopoke commented Aug 10, 2020

> I'm pretty sure we never delete repeat records / attempts. As you know, we delete RequestLogs after 90 days (corehq.motech.repeaters.tasks.clean_logs()). I will check with the AEs, who may use the Repeat Records Report more often than I do, but 90 days seems like a reasonable amount of time to keep succeeded and cancelled repeat records / attempts too. I'd be curious to hear thoughts on whether we should delete repeat records for paused repeaters.

I don't see a lot of value in keeping repeat records and their attempts forever. Their main purpose is to provide visibility into any issue and what data is being sent. I would think that having a similar window for repeat record attempts would be a good change. This would also answer the question of what to do with records from cancelled / paused repeaters. If a repeater is paused I think we should continue to create repeat records with the understanding that they would have the same TTL window applied to them. This would give a project 90 days (or whatever the window is) to fix the repeater before they start 'losing' data.

Cancelled repeaters should not create new records (I'm not sure quite what the difference is between paused and cancelled).

@kaapstorm
Contributor Author

The feedback from AEs is that we should delete repeat records, but that we should keep repeat records that have not been delivered successfully for longer than 90 days (because sometimes it has taken longer than 90 days for a project to notice that their data has not been forwarded). They suggested 180 days.

Here is how I understand the suggestion:

| RepeatRecord state                 | TTL                                               |
|------------------------------------|---------------------------------------------------|
| pending (due to be resent)         | registered_on + 180 days                          |
| failed (not yet due to be resent)  | registered_on + 180 days                          |
| cancelled (failed too many times)  | registered_on + 180 days                          |
| succeeded                          | last_checked + 90 days (to match RequestLog TTL)  |
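The suggested TTL rules could be expressed as a small helper like this (a sketch only; the state names and function are illustrative, not HQ code):

```python
from datetime import datetime, timedelta

def expires_on(state, registered_on, last_checked=None):
    """Suggested TTL: succeeded records expire 90 days after last_checked
    (matching the RequestLog TTL); all other states expire 180 days after
    registered_on, per the AEs' suggestion."""
    if state == 'succeeded':
        return last_checked + timedelta(days=90)
    # pending, failed, cancelled
    return registered_on + timedelta(days=180)
```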

Repeaters only have two states: paused and active/unpaused. RepeatRecords are created for them in both states, but payloads are only forwarded when the Repeater is unpaused.

We considered creating a cancelled or stopped state for repeaters where RepeatRecords are not created for them, but now that Repeaters use ConnectionSettings to store their credentials, a user could achieve almost the same result by having a ConnectionSettings instance without a Repeater instance that uses it. If we no longer iterate RepeatRecords for paused Repeaters, and RepeatRecords get deleted after a while, that could also reduce the benefit for creating a new state. So at the moment, I'm leaning towards leaving Repeater states as they are.

@snopoke
Contributor

snopoke commented Aug 12, 2020

> because sometimes it has taken longer than 90 days for a project to notice that their data has not been forwarded

This statement seems like something we should aim to fix rather than just giving more leeway. Having said that, I think the 180-day proposal is still better than the current situation, but if we document this anywhere we should say 90 days, to give us room to drop from 180 in the future.

@kaapstorm
Contributor Author

Totally agree. I think the "Addresses to send notifications" field should be required.

@kaapstorm
Contributor Author

@snopoke I've updated the description to focus on repeat record models, where I think our easiest wins will be. I think conversations about repeat record TTL are useful and important, but it might be better to spawn a separate CEP/issue/conversation for that. I'm hoping we can make progress on part of this if this CEP narrows its focus a bit.

@snopoke snopoke pinned this issue Dec 10, 2020
@snopoke snopoke unpinned this issue Dec 10, 2020
@kaapstorm kaapstorm changed the title [CEP] Migrate Repeaters to SQL [CEP] Migrate RepeaterRecord to SQL Jan 4, 2021
This was referenced Jan 4, 2021