Remove IndexMap document list #3606

dbutenhof · 2024-01-31T19:52:07Z

PBENCH-1315

The production server, with "only" 108,728 indexed datasets (many more still haven't been migrated from the passthrough server), currently claims 84.1Gb of PostgreSQL storage just for the IndexMap table. Most of this consists of a list of each Opensearch document ID in order to allow using bulk update and delete operations to manage the index. This is straining the capacity of our RDU2 PostgreSQL server.

As an alternative, this PR removes the document list and instead of the bulk update and delete operations uses _delete_by_query and _update_by_query searching for documents in the appropriate indices (which we still store in the IndexMap) by parent dataset resource ID.

Along the way, I noticed that (oops) we were missing the "authorization" subdocument in some of our Elasticsearch documents, which would impact the authenticated search API behaviors. And I acted on a deprecation warning for a camelCase template keyword by replacing it with a snake_case alternative.

NOTE: In the interest of expediently deploying a fix for our SQL bloat in RDU2, this is missing unit testing for update and delete, both of which are tested (in indexed and non-indexed cases) by functional tests.

PBENCH-1315 The production server, with "only" 108,728 indexed datasets (many more still haven't been migrated from the passthrough server), currently claims 84.1Gb of PostgreSQL storage just for the `IndexMap` table. Most of this consists of a list of each Opensearch document ID in order to allow using bulk update and delete operations to manage the index. This is straining the capacity of our RDU2 PostgreSQL server. As an alternative, this PR removes the document list and instead of the bulk update and delete operations uses `_delete_by_query` and `_update_by_query` searching for documents in the appropriate indices (which we still store in the `IndexMap`) by parent dataset resource ID. Along the way, I noticed that (oops) we were missing the `"authorization"` subdocument in some of our Elasticsearch documents, which would impact the authenticated search API behaviors. And I acted on a deprecation warning for a camelCase template keyword by replacing it with a snake_case alternative. __DRAFT__: I've completed the basic implementation, so that can be reviewed, along with the database model unit test, but the API unit test is still pending. (I'll continue working on that as I have time, but thought I might as well post what I've got so far.)

webbnh

Looks good so far, but I'm only a quarter of the way through.

lib/pbench/server/api/resources/query_apis/__init__.py

lib/pbench/server/api/resources/query_apis/datasets/__init__.py

lib/pbench/server/api/resources/query_apis/datasets/datasets.py

1. Remove the nascent `test_datasets_delete.py` unit test; we have a priority to quickly defer this SQL fix to reduce our load on the RDU2 PostgreSQL. 2. I'd successfully run functional tests, but as I'd wanted to debug within the container I'd been running `runlocal --keep`, which skips the dataset `DELETE` and avoided exposing the fundamental bug that `ElasticBase` doesn't expose `_delete`. (Oops.) 3. Some general cleanup. There are probably some `INFO` log messages that could be removed, but for now I'd prefer that this subsystem remain a bit chatty in case we encounter any problems.

webbnh

No issues with the code that you've changed. (I didn't really look at the code that you didn't change. 🙁) I've got a couple of quick, entirely ignorable suggestions (and another which I'm going to pre-resolve).

lib/pbench/server/api/resources/query_apis/datasets/datasets.py

lib/pbench/server/indexer.py

lib/pbench/test/unit/server/database/test_index_map_db.py

dbutenhof added bug Server Indexing Database Operations Related to operation and monitoring of a service labels Jan 31, 2024

dbutenhof self-assigned this Jan 31, 2024

webbnh reviewed Feb 1, 2024

View reviewed changes

webbnh mentioned this pull request Feb 1, 2024

Update b1.0 #3607

Merged

dbutenhof marked this pull request as ready for review February 1, 2024 16:36

webbnh approved these changes Feb 1, 2024

View reviewed changes

lib/pbench/server/api/resources/query_apis/datasets/datasets.py Show resolved Hide resolved

lib/pbench/server/indexer.py Show resolved Hide resolved

lib/pbench/test/unit/server/database/test_index_map_db.py Show resolved Hide resolved

dbutenhof merged commit a74911d into distributed-system-analysis:main Feb 1, 2024
4 checks passed

dbutenhof deleted the querydelete branch February 1, 2024 19:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove IndexMap document list #3606

Remove IndexMap document list #3606

dbutenhof commented Jan 31, 2024 •

edited

webbnh left a comment

webbnh left a comment

Remove IndexMap document list #3606

Remove IndexMap document list #3606

Conversation

dbutenhof commented Jan 31, 2024 • edited

webbnh left a comment

Choose a reason for hiding this comment

webbnh left a comment

Choose a reason for hiding this comment

dbutenhof commented Jan 31, 2024 •

edited