When a "reindex all" command is issued to Metacat in k8s and too many index workers are deployed, they collectively overwhelm the single Metacat instance with requests, leading to errors that include database connection pool exhaustion and other, as-yet-unexplained, failures.
We need to reproduce and examine these overload mechanisms, then make changes to ensure Metacat can tolerate the load from reindexing.
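For context, a minimal sketch of one possible client-side mitigation: capping in-flight Metacat calls per worker with a semaphore, so N workers × M threads can't pile up more concurrent requests than the database connection pool can serve. The permit count, base URL, and endpoint path are illustrative assumptions (the path follows the DataONE MN API shape), not actual Metacat or indexer configuration.

```java
// Hypothetical sketch: throttle /meta requests per worker with a semaphore.
// The cap of 5 and the URL layout are assumptions for illustration only.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.Semaphore;

public class ThrottledMetacatClient {
    private static final Semaphore PERMITS = new Semaphore(5); // assumed cap
    private static final HttpClient HTTP = HttpClient.newHttpClient();

    static String fetchSystemMetadata(String baseUrl, String pid) throws Exception {
        PERMITS.acquire();   // block here rather than piling onto Metacat
        try {
            HttpRequest req = HttpRequest.newBuilder(
                    URI.create(baseUrl + "/d1/mn/v2/meta/" + pid)).GET().build();
            return HTTP.send(req, HttpResponse.BodyHandlers.ofString()).body();
        } finally {
            PERMITS.release();
        }
    }
}
```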
Good observation. A lot of this overload comes from the many /meta and /object API calls (and associated access control checks) needed to handle indexing. This is a well-known problem for us, and the point of our hashstore storage refactor is that dataone-indexer workers can get the files they need for indexing without making any API calls. In our new design, a call to reindex all will generate a lot of RabbitMQ tasks that contain the job info needed for each indexing job, and the indexing workers can do their thing in parallel without hitting Metacat with REST calls. I suspect that once we get there, our limiting bottlenecks will shift to 1) I/O limits from Ceph, and 2) Solr write limits (although in theory we can shard this and provide horizontal scaling in Solr too).
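To make the design concrete, here is a hedged sketch of such a worker: it consumes reindex tasks from RabbitMQ and reads objects straight from a shared hashstore mount instead of calling Metacat, with a prefetch limit so a huge "reindex all" backlog drains at a controlled rate. The queue name, message format, mount path, and `indexIntoSolr` helper are all assumptions for illustration, not the actual dataone-indexer implementation.

```java
// Hypothetical sketch of a task-driven index worker. Queue name, message
// format, and storage layout are assumptions, not the real design.
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class IndexWorker {
    private static final String QUEUE = "index_tasks";                    // assumed queue name
    private static final Path STORE = Path.of("/var/metacat/hashstore");  // assumed mount

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("rabbitmq");   // assumed k8s service name
        Connection conn = factory.newConnection();
        Channel channel = conn.createChannel();
        channel.queueDeclare(QUEUE, true, false, false, null);

        // Bound unacked deliveries per worker so the backlog from a
        // "reindex all" is consumed at a controlled rate.
        channel.basicQos(10);

        DeliverCallback onTask = (consumerTag, delivery) -> {
            // Assume the task body is just the object's content identifier,
            // which doubles as its relative path inside the hashstore.
            String cid = new String(delivery.getBody(), StandardCharsets.UTF_8);
            long tag = delivery.getEnvelope().getDeliveryTag();
            try {
                byte[] object = Files.readAllBytes(STORE.resolve(cid));
                indexIntoSolr(cid, object);            // hypothetical helper
                channel.basicAck(tag, false);
            } catch (Exception e) {
                channel.basicNack(tag, false, true);   // requeue for retry
            }
        };
        channel.basicConsume(QUEUE, false, onTask, consumerTag -> { });
    }

    private static void indexIntoSolr(String cid, byte[] object) {
        // Placeholder for the Solr write; under this design the bottleneck
        // moves to Ceph I/O and Solr write throughput, not Metacat's API.
    }
}
```

Note that no call in this loop touches Metacat: the worker's only dependencies are the queue, the shared filesystem, and Solr, which is what lets workers scale out without exhausting Metacat's connection pool.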