Update b1.0 #3607

webbnh · 2024-02-01T16:20:00Z

This is the next update for the Pbench Server. Despite how GitHub displays it, this picks up changes to main only since 22 January ("PBENCH-1309"). (The others are already in b1.0.)

This combines a few issues. First, I've wanted to filter based on the unpacked tarball size, but some tarballs are beyond the range of the SQL `INTEGER` type and cause SQL cast errors, so this changes the interpretation of the `int` filter and sort type to `BigInteger`. Second, it cleans up the logging around retried Sync transaction errors, logging warnings only when it can't determine that the error is a PostgreSQL serialization error. (I hope: this is hard to provoke in casual testing.) Finally, it cleans up the logging of cached unpacked size by avoiding two separate logs (without dataset name) on unpack, and by adding a log of the final unpacked size when we compute it.
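To make the range problem concrete, here is a minimal sketch in plain Python (not Pbench code) of why a cast to a 4-byte SQL `INTEGER` fails for large unpacked sizes while an 8-byte `BIGINT`, which is what SQLAlchemy's `BigInteger` maps to on PostgreSQL, does not. The `fits` helper is hypothetical; the 110.5 GB value is the biggest cache size quoted in the report output on this page, read as decimal gigabytes.

```python
# Value ranges of PostgreSQL's 4-byte INTEGER and 8-byte BIGINT types.
SQL_INTEGER_RANGE = (-(2**31), 2**31 - 1)
SQL_BIGINT_RANGE = (-(2**63), 2**63 - 1)


def fits(value: int, bounds: tuple[int, int]) -> bool:
    """Return True if `value` can be stored in a column with these bounds."""
    low, high = bounds
    return low <= value <= high


# Biggest unpacked cache from the report on this page (110.5 GB, decimal).
biggest_cache = 110_500_000_000

print(fits(biggest_cache, SQL_INTEGER_RANGE))  # False: the cast would fail
print(fits(biggest_cache, SQL_BIGINT_RANGE))   # True
```

Anything above 2,147,483,647 bytes (about 2.1 GB) is out of range for `INTEGER`, so any filter that casts the unpacked size to that type can hit the error described above.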

Sort datasets by uploaded time

* PBENCH-1307 End time column update

Unable to update date

* PBENCH-1300 Visualization Page Pagination

Display metadata modal is empty

Overview page displays Public datasets

…s#3595) * Another tweak to intake metadata problems Make sure we can't end up with undefined `metadata`. Record details of `metadata.log` access to `run.controller` without adding a ton of separate messages.

* PBENCH-1216 TOC page update

* Minor logging cleanup Minimize cache logging: details were useful when cache management first went in, but are now disruptive during ops review.

Visualization page not loading

* Move nginx cache into /srv/pbench PBENCH-1316 Our deployed containerized server maps `/var/lib` (the default NGINX cache location) to `/home`, which has only 26Gb free. Instead, point NGINX cache to our large Pbench volume at `/srv/pbench/nginx` in order to be able to transfer larger datasets.

PBENCH-1318 The reclaimer defaulted to 20%, which is inappropriate for an unpack reclaim where we want to free just enough for the unpacked dataset size. Also, to help diagnose, add the last referenced cache date to the reclaim log message.
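The difference between a percentage goal and a "just enough" goal can be sketched as follows. This is an illustration under stated assumptions, not the Pbench reclaimer: `CacheEntry`, `reclaim_target`, and `pick_victims` are hypothetical names, and evicting in least-recently-referenced order is assumed from the mention of the last referenced cache date.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    name: str
    size: int        # bytes consumed by the unpacked dataset
    last_ref: float  # last referenced time, seconds since the epoch


def reclaim_target(total: int, free: int, goal_pct: float = 0.0, goal_bytes: int = 0) -> int:
    """Bytes that must be freed so free space meets the larger of the two goals."""
    wanted_free = max(int(total * goal_pct / 100), goal_bytes)
    return max(wanted_free - free, 0)


def pick_victims(entries: list[CacheEntry], need: int) -> list[CacheEntry]:
    """Select least recently referenced entries until `need` bytes are covered."""
    victims: list[CacheEntry] = []
    freed = 0
    for entry in sorted(entries, key=lambda e: e.last_ref):
        if freed >= need:
            break
        victims.append(entry)
        freed += entry.size
    return victims


# A 20% goal on a 1000-byte volume with 50 bytes free asks for 150 bytes,
# while unpacking an 80-byte dataset only needs 30 more bytes.
print(reclaim_target(1000, 50, goal_pct=20))    # 150
print(reclaim_target(1000, 50, goal_bytes=80))  # 30
```

With a percentage default, an unpack that needs a small amount of space can evict far more cached data than necessary, which is the behavior this commit moves away from.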

* Protect the cache lock better PBENCH-1317 We found a case where a cache lock could "leak" when an error occurs reading a file in the visualize and compare APIs. The file read has now been repackaged with a `finally` to be sure the stream is closed and unlocked on error.
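The `finally` pattern described here can be sketched roughly as below. This is not the Pbench implementation: `cache_lock` is a `threading.Lock` standing in for the real cache lock, and `read_locked` is a hypothetical helper.

```python
import io
import threading

cache_lock = threading.Lock()  # stand-in for the real cache lock


def read_locked(open_stream, consume):
    """Read under the lock; always close the stream and release the lock."""
    cache_lock.acquire()
    stream = None
    try:
        stream = open_stream()
        return consume(stream)
    finally:
        # Runs on success and on error, so the lock cannot "leak".
        if stream is not None:
            stream.close()
        cache_lock.release()


def failing_reader(stream):
    raise ValueError("simulated read error")


source = io.BytesIO(b"payload")
try:
    read_locked(lambda: source, failing_reader)
except ValueError:
    pass

print(source.closed)        # True
print(cache_lock.locked())  # False
```

Without the `finally`, the simulated error would leave the lock held, and later requests for the same cache would block.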

* Add simple report generator

This will report on the state of the ARCHIVE, BACKUP, and CACHE on-disk trees in addition to the state of the SQL database. (I'm going to leave analyzing and reporting on the Opensearch database for another time, since this is "off books" weekend upstream work!)

This packages the ad hoc SQL queries I've been doing to monitor the server as a CLI utility, plus some more.

Here's the output of `pbench-report-generator --all` on the production server:

```
Archive report:
  117,446 tarballs consuming 21.7 TB
  The smallest tarball is 1.0 kB, pbench-user-benchmark__2020.04.03T11.05.44
  The biggest tarball is 41.1 GB, uperf_Azure_RHEL-8.10.0-20240116.45_x86_64_gen2_pci_netvsc_quick_D240125T014727_2024.01.25T01.47.28
Backup report:
  117,447 tarballs consuming 21.7 TB
Cache report:
  97,464 datasets consuming 45.6 TB
  4 datasets have never been unpacked, 0 are missing reference timestamps, 0 have bad size metadata
  The smallest cache is 24.6 kB, pbench-user-benchmark__2020.04.03T11.05.44
  The biggest cache is 110.5 GB, trafficgen_RHOSP16.2-RHEL8.3-nrt-OVS-OFFLOAD-PVP-LossTests_tg:trex_r:none_fs:64,128,256,512,1024,1500_nf:1024_fm:si_td:bi_ml:0.002,0.0005,0.0001_tt:bs__2020-12-26T03:16:38
  The least recently used cache was referenced Dec 11, specjbb2005__2023.09.22T00.22.28
  The most recently used cache was referenced today, uperf_rhel84_4.18.0.277_kernel_10gb_jumbo_2021.01.26T09.51.18
SQL storage report:
  Table                Rows       Storage
  -------------------- ---------- ----------
  alembic_version               1    57.3 kB
  audit                   683,922   224.7 MB
  datasets                117,449    34.3 MB
  templates                    12   221.2 kB
  server_settings               0    24.6 kB
  users                        11    81.9 kB
  dataset_metadata        352,344   217.9 MB
  dataset_operations      340,986    29.1 MB
  api_keys                      5    81.9 kB
  indexmaps               291,510    79.7 GB
Operational states:
  UPLOAD states:
    OK 117,449
  TOOLINDEX states:
    READY 106,112
  INDEX states:
    OK 106,112
    FAILED 494
      CODE 7: 365 Bad metadata.log file encountered
      CODE 1: 128 Operational error while indexing
      CODE 12: 1 Unexpected error encountered
    READY 10,819
```
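As a rough illustration of the kind of summary the archive section reports, here is a small standard-library sketch. It is not the code behind `pbench-report-generator`: `natural_size` and `archive_report` are hypothetical helpers, the `.tar.xz` suffix is an assumption, and decimal (power-of-1000) units are assumed from the `kB`/`GB`/`TB` labels in the output above.

```python
import tempfile
from pathlib import Path

UNITS = ["B", "kB", "MB", "GB", "TB", "PB"]


def natural_size(nbytes: int) -> str:
    """Format a byte count with decimal (power-of-1000) units, e.g. '41.1 GB'."""
    size = float(nbytes)
    for unit in UNITS[:-1]:
        if size < 1000:
            return f"{size:.1f} {unit}"
        size /= 1000
    return f"{size:.1f} {UNITS[-1]}"


def archive_report(root: Path, suffix: str = ".tar.xz") -> str:
    """Count the tarballs under `root` and total their sizes."""
    sizes = [p.stat().st_size for p in root.rglob(f"*{suffix}") if p.is_file()]
    return f"Archive report: {len(sizes):,} tarballs consuming {natural_size(sum(sizes))}"


# Demonstration on a throwaway tree with two small fake tarballs.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / "a.tar.xz").write_bytes(b"x" * 1500)
    (demo_root / "b.tar.xz").write_bytes(b"x" * 500)
    print(archive_report(demo_root))  # Archive report: 2 tarballs consuming 2.0 kB
```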

* Remove IndexMap document list

PBENCH-1315

The production server, with "only" 108,728 indexed datasets (many more still haven't been migrated from the passthrough server), currently claims 84.1Gb of PostgreSQL storage just for the `IndexMap` table. Most of this consists of a list of each Opensearch document ID in order to allow using bulk update and delete operations to manage the index. This is straining the capacity of our RDU2 PostgreSQL server.

As an alternative, this PR removes the document list and instead of the bulk update and delete operations uses `_delete_by_query` and `_update_by_query` searching for documents in the appropriate indices (which we still store in the `IndexMap`) by parent dataset resource ID.

Along the way, I noticed that (oops) we were missing the `"authorization"` subdocument in some of our Elasticsearch documents, which would impact the authenticated search API behaviors. And I acted on a deprecation warning for a camelCase template keyword by replacing it with a snake_case alternative.
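The shape of the by-query requests might look something like the sketch below, which only builds request paths and JSON bodies and sends nothing. The `_delete_by_query` and `_update_by_query` endpoints and the `term` query and `script` body layout are standard Opensearch/Elasticsearch features; the field name `parent_resource_id`, the helper names, and the index names are hypothetical, since the actual document schema is not shown here.

```python
import json

# Hypothetical name of the document field holding the parent dataset resource ID.
PARENT_ID_FIELD = "parent_resource_id"


def _target(indices: list[str], operation: str) -> str:
    """Build a multi-index request path such as /idx-a,idx-b/_delete_by_query."""
    return "/" + ",".join(indices) + "/" + operation


def delete_by_query(indices: list[str], resource_id: str) -> tuple[str, dict]:
    """Path and body to delete every document belonging to one dataset."""
    body = {"query": {"term": {PARENT_ID_FIELD: resource_id}}}
    return _target(indices, "_delete_by_query"), body


def update_by_query(indices: list[str], resource_id: str, source: str, params: dict) -> tuple[str, dict]:
    """Path and body to run a Painless script on every document of one dataset."""
    body = {
        "query": {"term": {PARENT_ID_FIELD: resource_id}},
        "script": {"lang": "painless", "source": source, "params": params},
    }
    return _target(indices, "_update_by_query"), body


path, body = delete_by_query(["idx-a", "idx-b"], "abc123")
print(path)              # /idx-a,idx-b/_delete_by_query
print(json.dumps(body))  # {"query": {"term": {"parent_resource_id": "abc123"}}}
```

The trade-off is that only the index names need to be stored per dataset, at the cost of a search on each update or delete instead of a bulk operation over known document IDs.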

# Conflicts: # dashboard/src/modules/components/ComparisonComponent/PanelContent.jsx

dbutenhof and others added 20 commits December 20, 2023 08:12

PBENCH-1306 (distributed-system-analysis#3588)

e894edb

Sort datasets by uploaded time

End time column update (distributed-system-analysis#3587)

2a1223b

* PBENCH-1307 End time column update

Operations cleanup (distributed-system-analysis#3589)

010037d

PBENCH-1311 (distributed-system-analysis#3590)

b7e6a5f

Unable to update date

Visualization Page Pagination (distributed-system-analysis#3591)

6161099

* PBENCH-1300 Visualization Page Pagination

PBENCH-1313 (distributed-system-analysis#3592)

314ded3

Display metadata modal is empty

PBENCH-1309

442026c

Overview page displays Public datasets

review comments

5d9ce4d

Another tweak to intake metadata problems (distributed-system-analysi…

3db0cf9

…s#3595) * Another tweak to intake metadata problems Make sure we can't end up with undefined `metadata`. Record details of `metadata.log` access to `run.controller` without adding a ton of separate messages.

Update TOC page (distributed-system-analysis#3580)

aaa842a

* PBENCH-1216 TOC page update

Minor logging cleanup (distributed-system-analysis#3598)

58acfe9

* Minor logging cleanup Minimize cache logging: details were useful when cache management first went in, but are now disruptive during ops review.

PBENCH-1314

f3d3148

Visualization page not loading

filter callback function update

280e1a2

unused comp

25997c4

Filter function update

01ce14d

webbnh added the Server label Feb 1, 2024

webbnh requested a review from dbutenhof February 1, 2024 16:20

webbnh self-assigned this Feb 1, 2024

dbutenhof previously approved these changes Feb 1, 2024

dbutenhof and others added 2 commits February 1, 2024 14:48

Merge remote-tracking branch 'origin/main' into update_b1.0

6571152

# Conflicts: # dashboard/src/modules/components/ComparisonComponent/PanelContent.jsx

webbnh dismissed dbutenhof’s stale review via 6571152 February 5, 2024 15:14

webbnh force-pushed the update_b1.0 branch from e10cd80 to 6571152 Compare February 5, 2024 15:14

webbnh requested a review from dbutenhof February 5, 2024 15:15

dbutenhof approved these changes Feb 5, 2024

webbnh merged commit 89f4241 into distributed-system-analysis:b1.0 Feb 5, 2024
4 checks passed

webbnh deleted the update_b1.0 branch February 5, 2024 16:53
