
Election 2022 learnings #30

Open
jonathanstegall opened this issue Nov 9, 2022 · 1 comment
jonathanstegall commented Nov 9, 2022

Schedule frequency learnings

After running this scraper for the 2022 election, there are some observations that we might build upon to make the scraper more reliable:

  1. The Celery queue on Heroku sometimes seems to get overwhelmed without an obvious cause. If that is in fact what is happening (it is hard to say for sure), the queue gets stuck waiting on tasks already in progress instead of starting new ones, and the data falls behind.
  2. Increasing the ELECTION_DAY_RESULT_SCRAPE_FREQUENCY setting seemed to help. We finished the night at 300 seconds instead of 180, but still had to restart the Heroku app a few times. A value closer to 600 might be ideal, but that requires some work (see below, and the first sketch after this list).
  3. Having the scrape_results_chain task combine the results scraper and the election meta scraper was, in theory, the right way to keep the two together, but this doesn't seem to hold up on Heroku's infrastructure. Sometimes the two tasks run completely independently of each other, so the data reports the wrong time for when it was last updated (see the second sketch after this list).
  4. The code was intended to keep revising the updated value in the elections table so it would stay current with the files on the Secretary of State website. This doesn't always work, though, because the files are sometimes updated at different times and the scraper ends up using a value that is not the most recent.
  5. It's impossible to say definitively when a specific contest was last updated by the Secretary of State, so we were right not to try to display that.
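
For reference, here is a minimal sketch of how that frequency could be wired into a Celery beat schedule, assuming ELECTION_DAY_RESULT_SCRAPE_FREQUENCY holds a number of seconds; the broker URL and task path are assumptions, not necessarily how this repo is laid out:

```python
import os

from celery import Celery

app = Celery(
    "election_scraper",
    broker=os.getenv("REDIS_URL", "redis://localhost:6379/0"),
)

# Assumption: the env var holds a number of seconds; default to 300, which is
# where we ended election night.
SCRAPE_FREQUENCY = int(os.getenv("ELECTION_DAY_RESULT_SCRAPE_FREQUENCY", "300"))

app.conf.beat_schedule = {
    "scrape-results-chain": {
        # Hypothetical task path; the real task is wherever scrape_results_chain lives.
        "task": "scraper.tasks.scrape_results_chain",
        "schedule": SCRAPE_FREQUENCY,  # seconds between runs
    },
}
```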
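
And a sketch of what the chaining is supposed to guarantee, namely that the meta scraper only starts after the results scraper finishes. Task names and signatures here are assumptions:

```python
from celery import chain, shared_task


@shared_task
def scrape_results():
    # Hypothetical: download and load the results CSVs.
    ...


@shared_task
def scrape_election_meta(_previous_result=None):
    # Hypothetical: refresh the election metadata, including the updated timestamp.
    # Celery passes the previous task's return value as the first argument in a chain.
    ...


@shared_task
def scrape_results_chain():
    # chain() should only start scrape_election_meta after scrape_results
    # finishes; the observation above is that on Heroku the two sometimes
    # appear to run independently anyway.
    return chain(scrape_results.s(), scrape_election_meta.s()).apply_async()
```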

Making the scheduler smarter

One possible way to improve scheduling in the future might be something like this:

  1. When the scraper runs, gather the list of all relevant CSV files (from scraper_sources.json or wherever), retrieve each file's "Last-Modified" header value from the server, and sort those values in descending order. Use the most recent one for the updated value in the elections table (see the first sketch after this list).
  2. Instead of running every 3 minutes or 5 minutes or whatever, start a run whenever the first CSV gets modified, run again 10 minutes after that modification time, and continue the loop from there (see the second sketch after this list). This is probably complicated to code, but I think it is the ideal way to set the frequency in a stable and accurate way, and it might also better match the frequency the Secretary of State uses.
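
A minimal sketch of the Last-Modified idea, assuming scraper_sources.json maps source names to entries with a "url" key pointing at a CSV (that layout is an assumption):

```python
import json
from email.utils import parsedate_to_datetime

import requests


def most_recent_modification(sources_path="scraper_sources.json"):
    """Return the newest Last-Modified timestamp across all source CSVs."""
    with open(sources_path) as f:
        sources = json.load(f)

    timestamps = []
    for source in sources.values():
        # Assumption: each entry has a "url" for a CSV on the Secretary of State server.
        response = requests.head(source["url"], timeout=10)
        last_modified = response.headers.get("Last-Modified")
        if last_modified:
            timestamps.append(parsedate_to_datetime(last_modified))

    # Sort descending and take the most recent value; this would become the
    # updated value in the elections table.
    timestamps.sort(reverse=True)
    return timestamps[0] if timestamps else None
```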
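
And a rough sketch of the scheduling loop: wait for the first CSV to change, scrape, then anchor each following run to the modification time plus 10 minutes. This illustrates the logic only, not a Celery beat configuration; check_modified and run_scrapers are hypothetical callables:

```python
import time
from datetime import datetime, timedelta, timezone

RUN_INTERVAL = timedelta(minutes=10)
POLL_SECONDS = 30  # how often to check for changes while idle


def run_on_modification_schedule(check_modified, run_scrapers):
    """check_modified() returns the newest Last-Modified timestamp (see the
    sketch above); run_scrapers() kicks off the results + meta scrape."""
    last_seen = check_modified()
    next_run = None

    while True:
        current = check_modified()
        if current and (last_seen is None or current > last_seen):
            # A file changed: scrape now, and schedule the next run relative
            # to the modification time rather than to "now".
            last_seen = current
            run_scrapers()
            next_run = current + RUN_INTERVAL
        elif next_run and datetime.now(timezone.utc) >= next_run:
            run_scrapers()
            next_run = next_run + RUN_INTERVAL

        time.sleep(POLL_SECONDS)
```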
@jonathanstegall (Member, Author) commented:

One thing I'm still very unclear on is the relationship between traffic to the API endpoints (from the front-end interface) and the performance/reliability of the Celery worker. It is running much more smoothly today than it did last night, for example, which makes me think there is at least some relationship.
