Consider moving back to SQL queries on replicas rather than consuming Event Stream #73
Also, I just learned we don't really need one database connection per project: multiple projects are hosted on the same database shard, so we could use one connection per shard. According to the documentation, the shards are also available in the `wiki` table, via:

```sql
SELECT DISTINCT slice FROM wiki
WHERE family IN ('wikisource', 'wikipedia', 'wiktionary', 'wikinews')
  AND is_closed = 0;
```

which gives us the 6 shards we'd need to connect to in order to cover all projects.
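The one-connection-per-shard idea could be sketched roughly like this (a minimal illustration with made-up helper and row data, not code from the branch; the real tool would read `dbname` and `slice` pairs from `meta_p.wiki` on the replicas):

```python
from collections import defaultdict

def group_projects_by_shard(wiki_rows):
    """Group (dbname, slice) rows so we can open a single database
    connection per shard instead of one connection per project."""
    shards = defaultdict(list)
    for dbname, slice_name in wiki_rows:
        shards[slice_name].append(dbname)
    return dict(shards)

# Example rows as they might come back from the replicas (illustrative only).
rows = [
    ("enwiki", "s1"),
    ("frwiki", "s6"),
    ("enwiktionary", "s2"),
    ("metawiki", "s7"),
]
print(group_projects_by_shard(rows))
# {'s1': ['enwiki'], 's6': ['frwiki'], 's2': ['enwiktionary'], 's7': ['metawiki']}
```

Each key in the result would then map to one long-lived connection shared by all projects on that shard.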
So I did manage to try this out. It's on my sql branch, specifically this commit. I've had it running locally for a couple of days.
I think the main drawback here is that local development becomes quite a bit more complicated, and I don't really know how to get it to work on Windows. See the README in my branch for details.
Oh and here's what I see on the local DB for the last few days:
It would be good to compare this with prod.
This is good to know! Here's prod data from the same time range:
Data for the 15th is likely incomplete since that's where the tool got stuck last. This might be helpful in diagnosing:
Aha, I think I was not processing
Hopefully with #74 we won't really need to make this switch, but I kept hacking on it a little more just in case. With my latest
So it's not exactly the same as what prod had, but probably good enough. We even see the SQL version capture more hashtags in some cases.
Thanks for investigating this. I'm definitely curious where that difference in the number of hashtags captured comes from, but I'm glad to see there's a way to implement this that isn't as resource-intensive as the previous setup. The tool is currently working OK with the EventStream; the other bug fix seems to have fixed the issue. Given the added complication for local dev, I suggest we park this for now and revisit if further issues with the EventStream give us a reason to.
Agreed. I've been keeping an eye on prod and, although I have seen it lag behind by a few days, it seems to always catch up again with the latest fixes. I still think the SQL version might give us lower latency, but it's not worth the extra complexity. As an aside, would you mind documenting the steps for deploying this on Toolforge when you have a minute? If we do start pursuing this again, I think the next step would be to deploy it under a separate account so we can compare.
We've been having reliability issues that are somewhat difficult to troubleshoot, where the tool stops processing updates. We thought 77d68bd solved this, but it looks like it's back after a few months.
We're not sure the Event Stream is really the cause -- it could just as well be that the container consuming the stream is not running, or something else entirely -- but I wonder whether moving back from the stream to SQL queries, which is where the tool started, wouldn't result in a simpler and more resilient design.
A SQL-based design could also have lower latency, as a database query should be much faster than doing multiple HTTP queries to fetch the same data from Event Stream. For a quick comparison, we can fetch roughly the data we need for the past month with:
This takes 1 min on Toolforge, while my local tool takes many hours to catch up on just a couple of days of backlog. This is not a fair comparison (the SQL query is not doing API calls, and I have a higher latency to the API from home than the tool does), but it's interesting evidence that we should explore further.
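The comparison query itself isn't shown above; purely as an illustration (table and column names assumed from MediaWiki's `recentchanges`/`comment` replica schema, not taken from this thread), something of this general shape could pull a month of edit comments containing "#":

```python
import datetime

# Illustrative only: assumes the MediaWiki recentchanges table joined to the
# comment table, which is how edit summaries are stored on recent schemas.
QUERY = """
SELECT rc_timestamp, rc_title, comment_text
FROM recentchanges
JOIN comment ON rc_comment_id = comment_id
WHERE rc_timestamp >= %(since)s
  AND comment_text LIKE '%%#%%'
""".strip()

def month_ago_timestamp(now):
    """Return a MediaWiki-style YYYYMMDDHHMMSS timestamp ~30 days before now."""
    return (now - datetime.timedelta(days=30)).strftime("%Y%m%d%H%M%S")

print(month_ago_timestamp(datetime.datetime(2021, 6, 15)))  # 20210516000000
```

The `%(since)s` placeholder would be filled by the DB-API driver when the query is executed against a replica connection.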
A sketch of the design:

- Query the `meta` db to find all projects to track (~400, as I write this).
- Use the `size` field of `meta_p.wiki` in partitioning, to ensure the large projects don't end up with the same worker.
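The partitioning idea could be sketched as a greedy longest-processing-time assignment, using the `size` field as the weight (a hypothetical helper, not the tool's actual code; the exact semantics of `size` are whatever `meta_p.wiki` defines):

```python
import heapq

def partition_projects(projects, num_workers):
    """Greedy LPT assignment: sort projects by size descending and always
    hand the next project to the currently least-loaded worker, so the
    biggest projects spread across different workers."""
    # Heap entries are (total_size_assigned_so_far, worker_index).
    heap = [(0, i) for i in range(num_workers)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_workers)]
    for name, size in sorted(projects, key=lambda p: p[1], reverse=True):
        load, worker = heapq.heappop(heap)
        assignment[worker].append(name)
        heapq.heappush(heap, (load + size, worker))
    return assignment

# Made-up sizes for illustration.
projects = [("enwiki", 1000), ("dewiki", 400), ("frwiki", 350), ("eswiki", 300)]
print(partition_projects(projects, 2))
# [['enwiki'], ['dewiki', 'frwiki', 'eswiki']]
```

Note how the single largest project gets a worker mostly to itself, which is the property the last bullet asks for.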