
Election 2022 learnings #30

Open
jonathanstegall opened this issue Nov 9, 2022 · 1 comment
jonathanstegall commented Nov 9, 2022

Schedule frequency learnings

After running this scraper for the 2022 election, there are some observations that we might build upon to make the scraper more reliable:

  1. The Celery queue on Heroku sometimes seems to get overwhelmed without an obvious cause. If that is in fact what is happening (it is hard to say for sure), the queue gets stuck waiting on tasks already in progress instead of starting new ones, and the data falls behind.
  2. Increasing the ELECTION_DAY_RESULT_SCRAPE_FREQUENCY setting seemed to help. We finished the night at 300 seconds instead of 180, but still had to restart the Heroku app a few times. A value closer to 600 might be ideal, but that requires some work (see below, and the first sketch after this list).
  3. Having the scrape_results_chain task combine the results scraper and the election meta scraper was, in theory, the right way to keep the two together, but this doesn't seem to hold up on Heroku's infrastructure. Sometimes the two tasks run completely independently of each other, so the data reports the wrong time for when it was last updated (see the second sketch after this list).
  4. The code was intended to keep revising the updated value in the elections table so it would stay current with the files on the Secretary of State website. This doesn't always work, though, because the files are sometimes updated at different times and the scraper ends up using a value that is not the most recent.
  5. It's impossible to say definitively when a specific contest was last updated by the Secretary of State, so we were right not to try to display that.
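
For reference, here is a minimal sketch of how that frequency could be wired into a Celery beat schedule, assuming ELECTION_DAY_RESULT_SCRAPE_FREQUENCY holds a number of seconds; the broker URL and task path are assumptions, not necessarily how this repo is laid out:

```python
import os

from celery import Celery

app = Celery(
    "election_scraper",
    broker=os.getenv("REDIS_URL", "redis://localhost:6379/0"),
)

# Assumption: the env var holds a number of seconds; default to 300, which is
# where we ended election night.
SCRAPE_FREQUENCY = int(os.getenv("ELECTION_DAY_RESULT_SCRAPE_FREQUENCY", "300"))

app.conf.beat_schedule = {
    "scrape-results-chain": {
        # Hypothetical task path; the real task is wherever scrape_results_chain lives.
        "task": "scraper.tasks.scrape_results_chain",
        "schedule": SCRAPE_FREQUENCY,  # seconds between runs
    },
}
```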
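
And a sketch of what the chaining is supposed to guarantee, namely that the meta scraper only starts after the results scraper finishes. Task names and signatures here are assumptions:

```python
from celery import chain, shared_task


@shared_task
def scrape_results():
    # Hypothetical: download and load the results CSVs.
    ...


@shared_task
def scrape_election_meta(_previous_result=None):
    # Hypothetical: refresh the election metadata, including the updated timestamp.
    # Celery passes the previous task's return value as the first argument in a chain.
    ...


@shared_task
def scrape_results_chain():
    # chain() should only start scrape_election_meta after scrape_results
    # finishes; the observation above is that on Heroku the two sometimes
    # appear to run independently anyway.
    return chain(scrape_results.s(), scrape_election_meta.s()).apply_async()
```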

Making the scheduler smarter

One possible way to improve scheduling in the future might be something like this:

  1. When the scraper runs, gather the list of all relevant CSV files (from scraper_sources.json or wherever), retrieve each file's "Last-Modified" header value from the server, and sort those values in descending order. Use the most recent one for the updated value in the elections table (see the first sketch after this list).
  2. Instead of running every 3 minutes or 5 minutes or whatever, start a run whenever the first CSV gets modified, run again 10 minutes after that modification time, and continue the loop from there (see the second sketch after this list). This is probably complicated to code, but I think it is the ideal way to set the frequency in a stable and accurate way, and it might also better match the frequency the Secretary of State uses.
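
A minimal sketch of the Last-Modified idea, assuming scraper_sources.json maps source names to entries with a "url" key pointing at a CSV (that layout is an assumption):

```python
import json
from email.utils import parsedate_to_datetime

import requests


def most_recent_modification(sources_path="scraper_sources.json"):
    """Return the newest Last-Modified timestamp across all source CSVs."""
    with open(sources_path) as f:
        sources = json.load(f)

    timestamps = []
    for source in sources.values():
        # Assumption: each entry has a "url" for a CSV on the Secretary of State server.
        response = requests.head(source["url"], timeout=10)
        last_modified = response.headers.get("Last-Modified")
        if last_modified:
            timestamps.append(parsedate_to_datetime(last_modified))

    # Sort descending and take the most recent value; this would become the
    # updated value in the elections table.
    timestamps.sort(reverse=True)
    return timestamps[0] if timestamps else None
```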
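
And a rough sketch of the scheduling loop: wait for the first CSV to change, scrape, then anchor each following run to the modification time plus 10 minutes. This illustrates the logic only, not a Celery beat configuration; check_modified and run_scrapers are hypothetical callables:

```python
import time
from datetime import datetime, timedelta, timezone

RUN_INTERVAL = timedelta(minutes=10)
POLL_SECONDS = 30  # how often to check for changes while idle


def run_on_modification_schedule(check_modified, run_scrapers):
    """check_modified() returns the newest Last-Modified timestamp (see the
    sketch above); run_scrapers() kicks off the results + meta scrape."""
    last_seen = check_modified()
    next_run = None

    while True:
        current = check_modified()
        if current and (last_seen is None or current > last_seen):
            # A file changed: scrape now, and schedule the next run relative
            # to the modification time rather than to "now".
            last_seen = current
            run_scrapers()
            next_run = current + RUN_INTERVAL
        elif next_run and datetime.now(timezone.utc) >= next_run:
            run_scrapers()
            next_run = next_run + RUN_INTERVAL

        time.sleep(POLL_SECONDS)
```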
@jonathanstegall (Member, Author) commented:

One thing I'm still very unclear on is the relationship between traffic to the API endpoints (from the front-end interface) and the performance/reliability of the Celery worker. It is running much more smoothly today than it did last night, for example, which makes me think there is at least some relationship.
