Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove IndexMap document list #3606

Merged
merged 2 commits into from
Feb 1, 2024

Conversation

dbutenhof
Copy link
Member

@dbutenhof dbutenhof commented Jan 31, 2024

PBENCH-1315

The production server, with "only" 108,728 indexed datasets (many more still haven't been migrated from the passthrough server), currently claims 84.1Gb of PostgreSQL storage just for the IndexMap table. Most of this consists of a list of each Opensearch document ID in order to allow using bulk update and delete operations to manage the index. This is straining the capacity of our RDU2 PostgreSQL server.

As an alternative, this PR removes the document list and instead of the bulk update and delete operations uses _delete_by_query and _update_by_query searching for documents in the appropriate indices (which we still store in the IndexMap) by parent dataset resource ID.

Along the way, I noticed that (oops) we were missing the "authorization" subdocument in some of our Elasticsearch documents, which would impact the authenticated search API behaviors. And I acted on a deprecation warning for a camelCase template keyword by replacing it with a snake_case alternative.

NOTE: In the interest of expediently deploying a fix for our SQL bloat in RDU2, this is missing unit testing for update and delete, both of which are tested (in indexed and non-indexed cases) by functional tests.

PBENCH-1315

The production server, with "only" 108,728 indexed datasets (many more still
haven't been migrated from the passthrough server), currently claims 84.1Gb of
PostgreSQL storage just for the `IndexMap` table. Most of this consists of a
list of each Opensearch document ID in order to allow using bulk update and
delete operations to manage the index. This is straining the capacity of our
RDU2 PostgreSQL server.

As an alternative, this PR removes the document list and instead of the bulk
update and delete operations uses `_delete_by_query` and `_update_by_query`
searching for documents in the appropriate indices (which we still store in
the `IndexMap`) by parent dataset resource ID.

Along the way, I noticed that (oops) we were missing the `"authorization"`
subdocument in some of our Elasticsearch documents, which would impact the
authenticated search API behaviors. And I acted on a deprecation warning for
a camelCase template keyword by replacing it with a snake_case alternative.

__DRAFT__: I've completed the basic implementation, so that can be reviewed,
along with the database model unit test, but the API unit test is still
pending. (I'll continue working on that as I have time, but thought I might
as well post what I've got so far.)
@dbutenhof dbutenhof added bug Server Indexing Database Operations Related to operation and monitoring of a service labels Jan 31, 2024
@dbutenhof dbutenhof self-assigned this Jan 31, 2024
Copy link
Member

@webbnh webbnh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good so far, but I'm only a quarter of the way through.

1. Remove the nascent `test_datasets_delete.py` unit test; we have a priority
   to quickly defer this SQL fix to reduce our load on the RDU2 PostgreSQL.
2. I'd successfully run functional tests, but as I'd wanted to debug within
   the container I'd been running `runlocal --keep`, which skips the dataset
   `DELETE` and avoided exposing the fundamental bug that `ElasticBase`
   doesn't expose `_delete`. (Oops.)
3. Some general cleanup.

There are probably some `INFO` log messages that could be removed, but for now
I'd prefer that this subsystem remain a bit chatty in case we encounter any
problems.
@webbnh webbnh mentioned this pull request Feb 1, 2024
@dbutenhof dbutenhof marked this pull request as ready for review February 1, 2024 16:36
Copy link
Member

@webbnh webbnh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues with the code that you've changed. (I didn't really look at the code that you didn't change. 🙁) I've got a couple of quick, entirely ignorable suggestions (and another which I'm going to pre-resolve).

@dbutenhof dbutenhof merged commit a74911d into distributed-system-analysis:main Feb 1, 2024
4 checks passed
@dbutenhof dbutenhof deleted the querydelete branch February 1, 2024 19:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Database Indexing Operations Related to operation and monitoring of a service Server
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants