
Consider moving back to SQL queries on replicas rather than consuming Event Stream #73

Open
eggpi opened this issue Aug 6, 2021 · 8 comments

Comments

@eggpi
Contributor

eggpi commented Aug 6, 2021

We've been having reliability issues that are somewhat difficult to troubleshoot, where the tool stops processing updates. We thought 77d68bd solved this, but it looks like it's back after a few months.

We're not sure Event Stream is really the cause (it could just as well be some issue where the container that consumes the stream is not running, or something else), but I wonder whether moving back from the stream to SQL queries, which is where the tool started, wouldn't result in a simpler and more resilient design.

A SQL-based design could also have lower latency, as a database query should be much faster than making multiple HTTP requests to fetch the same data from Event Stream. For a quick comparison, we can fetch roughly the data we need for the past month with:

SELECT rc.rc_title, rc.rc_this_oldid,
       rc.rc_last_oldid, rc.rc_timestamp,
       c.comment_text
FROM recentchanges AS rc JOIN comment AS c
ON rc.rc_comment_id = c.comment_id
WHERE c.comment_text REGEXP '[[:space:]]+#[^#]{3,}' AND rc.rc_timestamp > '20210704000000'
AND NOT rc.rc_bot AND rc.rc_source IN ('mw.edit', 'mw.new');

This takes 1 min on Toolforge, while my local tool takes many hours to catch up on just a couple of days of backlog. This is not a fair comparison (the SQL query is not doing API calls, and I have a higher latency to the API from home than the tool does), but it's interesting evidence that we should explore further.
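For reference, the REGEXP filter in the query above can be expressed client-side too. This is just an illustrative sketch of the same pattern in Python, not code from the tool:

```python
import re

# Same pattern as the SQL REGEXP above: whitespace, then '#',
# then at least 3 characters that aren't another '#'.
HASHTAG_PATTERN = re.compile(r"\s+#[^#]{3,}")

def has_hashtag(comment_text):
    """Return True if an edit summary appears to contain a hashtag."""
    return bool(HASHTAG_PATTERN.search(comment_text))
```

Doing this filtering in the database rather than in Python is the whole point, of course: it avoids shipping every edit summary over the wire.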

A sketch of the design:

  1. Query the meta db to find all projects to track (~400, as I write this):
SELECT dbname
FROM wiki
WHERE family IN ('wikisource', 'wikipedia', 'wiktionary', 'wikinews')
AND is_closed = 0;
  2. Spawn a pool of N worker processes that each poll 1 / N of the databases. Maybe use the size field of meta_p.wiki when partitioning to ensure the large projects don't end up with the same worker.
  3. Within each worker, run the big query above for each database, process the results as we currently do, then sleep for X min + jitter.
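The partitioning and polling steps could look roughly like this. This is only a sketch: `process_recent_changes` is a placeholder for running the big query and handling its rows, and the size-balancing is a simple greedy assignment, not necessarily what we'd ship:

```python
import random
import time

def partition_by_size(wikis, n_workers):
    """Greedily assign each wiki to the currently lightest worker,
    so the large projects spread across different workers.
    `wikis` is a list of (dbname, size) tuples from meta_p.wiki."""
    buckets = [[] for _ in range(n_workers)]
    loads = [0] * n_workers
    for dbname, size in sorted(wikis, key=lambda w: -w[1]):
        i = loads.index(min(loads))  # lightest bucket so far
        buckets[i].append(dbname)
        loads[i] += size
    return buckets

def worker_loop(dbnames, poll_minutes=5):
    """One worker: poll each assigned database, then sleep with jitter."""
    while True:
        for dbname in dbnames:
            process_recent_changes(dbname)  # placeholder, not a real function
        time.sleep(60 * poll_minutes + random.uniform(0, 60))
```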
@eggpi
Contributor Author

eggpi commented Aug 7, 2021

Also, I just learned we don't really need one database connection per project, as multiple projects are hosted on the same database shard and we could use one connection per shard. According to the documentation:

In addition each cluster can be accessed by the name of its Wikimedia production shard which follows the format s${SHARD_NUMBER}.{analytics,web}.db.svc.wikimedia.cloud (for example, s1.analytics.db.svc.wikimedia.cloud hosts the enwiki_p database). The shard where a particular database is located can change over time. You should only use the shard name for opening a database connection if your application requires it for specific performance reasons such as for heavily crosswiki tools which would otherwise open hundreds of database connections.

The shards are also available in the meta_p database:

SELECT DISTINCT slice FROM wiki
WHERE family IN ('wikisource', 'wikipedia', 'wiktionary', 'wikinews') AND is_closed = 0;

which gives the 6 shards we'd need to connect to in order to cover all projects.
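Grouping databases by shard so we open one connection per shard is straightforward. A small sketch, assuming we've already queried meta_p.wiki for (dbname, slice) pairs; `connect` in the comment stands in for whatever MySQL client call we'd actually use:

```python
from collections import defaultdict

def group_by_shard(rows):
    """Map each shard (slice) to the list of databases it serves.
    `rows` is a list of (dbname, slice) tuples from meta_p.wiki."""
    shards = defaultdict(list)
    for dbname, slice_host in rows:
        shards[slice_host].append(dbname)
    return dict(shards)

# Usage sketch: one connection covers every wiki on that shard.
# for shard, dbnames in group_by_shard(rows).items():
#     conn = connect(host=shard, ...)  # placeholder client call
#     for dbname in dbnames:
#         run_big_query(conn, dbname)  # placeholder
```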

eggpi added a commit to eggpi/hashtags that referenced this issue Aug 13, 2021
@eggpi
Contributor Author

eggpi commented Aug 13, 2021

So I did manage to try this out. It's on my sql branch, specifically this commit. I've had it running locally for a couple of days.

  • This is actually not that much more complex than using EventStream, and could totally be worth it.
  • Performance is good, my local tool is keeping up with the changes and is ahead of production (which, by the way, seems late or stuck again as I write this).

I think the main drawback here is that local development becomes quite a bit more complicated, and I don't really know how to get it to work on Windows. See the README in my branch for details.

@eggpi
Contributor Author

eggpi commented Aug 13, 2021

Oh and here's what I see on the local DB for the last few days:

mysql> select date_format(timestamp, '%Y %m %d') as d, count(*) from hashtags_hashtag group by d order by d desc limit 10;
+------------+----------+
| d          | count(*) |
+------------+----------+
| 2021 08 13 |     2021 |
| 2021 08 12 |     1982 |
| 2021 08 11 |     2149 |
| 2021 08 10 |     2662 |
| 2021 08 09 |     3040 |
| 2021 08 08 |     3444 |
| 2021 08 07 |     1925 |
| 2021 08 06 |     1948 |
| 2021 08 05 |     2244 |
| 2021 08 04 |     4494 |
+------------+----------+
10 rows in set (3.20 sec)

It would be good to compare this with prod.

eggpi added a commit to eggpi/hashtags that referenced this issue Aug 16, 2021
@Samwalton9
Member

This is good to know!

Here's prod data from the same time range:

MySQL [hashtagsv2_db]> SELECT DATE(timestamp), COUNT(*) FROM hashtags_hashtag WHERE timestamp > '2021-08-04' GROUP BY DATE(timestamp) desc;
+-----------------+----------+
| DATE(timestamp) | COUNT(*) |
+-----------------+----------+
| 2021-08-15      |    11963 |
| 2021-08-14      |    13006 |
| 2021-08-13      |    13762 |
| 2021-08-12      |    19729 |
| 2021-08-11      |    16860 |
| 2021-08-10      |    10613 |
| 2021-08-09      |    16571 |
| 2021-08-08      |    15349 |
| 2021-08-07      |    15930 |
| 2021-08-06      |     8969 |
| 2021-08-05      |    13402 |
| 2021-08-04      |    15228 |
+-----------------+----------+
12 rows in set, 1 warning (0.11 sec)

Data for the 15th is likely incomplete since that's where the tool got stuck last.

This might be helpful in diagnosing:

MySQL [hashtagsv2_db]> SELECT hashtag, COUNT(*) FROM hashtags_hashtag WHERE timestamp > '2021-08-04' GROUP BY hashtag ORDER BY COUNT(*) DESC limit 10;
+-----------------+----------+
| hashtag         | COUNT(*) |
+-----------------+----------+
| WPWP            |    75677 |
| flickr2commons  |    61838 |
| WPWPTE          |     6962 |
| WPWPTR          |     6003 |
| WPWPSR          |     3052 |
| IABot           |     1932 |
| suggestededit   |     1775 |
| WFH             |     1239 |
| moisdusourçage  |      675 |
| wpwpcrs         |      611 |
+-----------------+----------+
10 rows in set (1.14 sec)

MySQL [hashtagsv2_db]> SELECT domain, COUNT(*) FROM hashtags_hashtag WHERE timestamp > '2021-08-04' GROUP BY domain ORDER BY COUNT(*) DESC limit 10;
+-----------------------+----------+
| domain                | COUNT(*) |
+-----------------------+----------+
| commons.wikimedia.org |    63441 |
| vo.wikipedia.org      |    29760 |
| te.wikipedia.org      |    13884 |
| ms.wikipedia.org      |     8596 |
| mg.wikipedia.org      |     8554 |
| tr.wikipedia.org      |     6452 |
| en.wikipedia.org      |     6444 |
| sr.wikipedia.org      |     4875 |
| ceb.wikipedia.org     |     3444 |
| ha.wikipedia.org      |     1721 |
+-----------------------+----------+
10 rows in set (1.60 sec)

@eggpi
Contributor Author

eggpi commented Aug 19, 2021

Aha, I think I was not processing commons.wikimedia.org. Thanks!

@eggpi
Contributor Author

eggpi commented Aug 27, 2021

Hopefully with #74 we won't really need to make this switch, but I kept hacking on it a little more just in case. With my latest sql branch, I'm seeing the following numbers from 2021-08-04 to 2021-08-15:

mysql> SELECT DATE(timestamp), COUNT(*) FROM hashtags_hashtag WHERE timestamp > '2021-08-04' GROUP BY DATE(timestamp) desc;
+-----------------+----------+
| DATE(timestamp) | COUNT(*) |
+-----------------+----------+
| 2021-08-15      |    18717 |
| 2021-08-14      |    12834 |
| 2021-08-13      |    13399 |
| 2021-08-12      |    19428 |
| 2021-08-11      |    16407 |
| 2021-08-10      |    10270 |
| 2021-08-09      |    15866 |
| 2021-08-08      |    14945 |
| 2021-08-07      |    15585 |
| 2021-08-06      |     8396 |
| 2021-08-05      |    13045 |
| 2021-08-04      |    14943 |
+-----------------+----------+
12 rows in set, 1 warning (0.10 sec)

mysql> SELECT hashtag, COUNT(*) FROM hashtags_hashtag WHERE timestamp > '2021-08-04' GROUP BY hashtag ORDER BY COUNT(*) DESC limit 10;
+----------------+----------+
| hashtag        | COUNT(*) |
+----------------+----------+
| wpwp           |    77717 |
| flickr2commons |    65777 |
| WPWPTE         |     6961 |
| WPWPTR         |     6021 |
| WPWPSR         |     3065 |
| IABot          |     1966 |
| suggestededit  |      808 |
| wpwpcrs        |      642 |
| WPWPRW         |      618 |
| WPWPWAM        |      603 |
+----------------+----------+
10 rows in set (0.10 sec)

mysql> SELECT domain, COUNT(*) FROM hashtags_hashtag WHERE timestamp > '2021-08-04' GROUP BY domain ORDER BY COUNT(*) DESC limit 10;
+-----------------------+----------+
| domain                | COUNT(*) |
+-----------------------+----------+
| commons.wikimedia.org |    67340 |
| vo.wikipedia.org      |    29791 |
| te.wikipedia.org      |    13887 |
| mg.wikipedia.org      |    10887 |
| ms.wikipedia.org      |     8605 |
| tr.wikipedia.org      |     6462 |
| en.wikipedia.org      |     5368 |
| sr.wikipedia.org      |     4896 |
| ceb.wikipedia.org     |     3443 |
| id.wikipedia.org      |     1339 |
+-----------------------+----------+
10 rows in set (0.40 sec)

So it's not exactly the same as what prod had, but probably good enough. We even see the SQL version capture more hashtags in some cases.

@Samwalton9
Member

Thanks for investigating this! I'm definitely curious where that difference in the number of hashtags captured comes from, but I'm glad to see that there's a way to implement this that isn't as resource intensive as the previous setup.

The tool is currently working OK with the EventStream, the other bug fix seems to have fixed the issue. Given the added complication for local dev, I think I'm going to suggest we park this for now and we can revisit if we find a need due to further issues with the EventStream.

@eggpi
Contributor Author

eggpi commented Sep 16, 2021

Agreed. I've been keeping an eye on prod and, although I have seen it lag behind by a few days, it always seems to catch up again with the latest fixes. I still think the SQL version might give us lower latency, but it's not worth the extra complexity.

As an aside, would you mind documenting the steps for deploying this on Toolforge when you have a minute? If we do start pursuing this again, I think the next step would be to deploy it under a separate account so we can compare.
