Update b1.0 #3607

Merged
merged 22 commits into from
Feb 5, 2024

Conversation

webbnh
Member

@webbnh webbnh commented Feb 1, 2024

This is the next update for the Pbench Server. Despite how GitHub displays it, this picks up only the changes made to `main` since 22 January ("PBENCH-1309"); the earlier ones are already in `b1.0`.

dbutenhof and others added 20 commits December 20, 2023 08:12
This combines a few issues. First, I've wanted to filter based on the unpacked
tarball size, but some tarballs are beyond the range of the SQL `INTEGER` type
and cause SQL cast errors, so change the interpretation of the `int` filter and
sort type to `BigInteger`. Second, clean up the logging around retried Sync
transaction errors, logging a warning only when it can't determine that the
error is a PostgreSQL serialization error. (I hope: this is hard to provoke in
casual testing.) Finally, clean up the logging of cached unpacked size by
avoiding two separate logs (without dataset name) on unpack, and adding a log
of the final unpacked size when we compute it.
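
For context on the cast errors: PostgreSQL's `INTEGER` is a signed 32-bit type, while `BIGINT` (what a `BigInteger` column maps to) is signed 64-bit. A minimal sketch of the range check follows; the helper names and the sample size are hypothetical illustrations, not code from the Pbench Server.

```python
# Hypothetical illustration only; not taken from the Pbench Server source.
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1   # range of PostgreSQL INTEGER
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1   # range of PostgreSQL BIGINT


def fits_integer(value: int) -> bool:
    """True if the value can be cast to a 32-bit SQL INTEGER."""
    return INT32_MIN <= value <= INT32_MAX


def fits_bigint(value: int) -> bool:
    """True if the value can be cast to a 64-bit SQL BIGINT."""
    return INT64_MIN <= value <= INT64_MAX


# An assumed unpacked size of about 110 GB, expressed in bytes.
unpacked_size = 110_500_000_000

print(fits_integer(unpacked_size))  # False: anything over ~2.1 GB overflows
print(fits_bigint(unpacked_size))   # True
```

Any unpacked size above 2,147,483,647 bytes (about 2.1 GB) falls outside `INTEGER`, which is why a filter that casts to that type can fail on large datasets.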
Sort datasets by uploaded time
* PBENCH-1300
Visualization Page Pagination
Display metadata modal is empty
Overview page displays Public datasets
…s#3595)

* Another tweak to intake metadata problems

Make sure we can't end up with undefined `metadata`.

Record details of `metadata.log` access to `run.controller` without adding
a ton of separate messages.
* Minor logging cleanup

Minimize cache logging: details were useful when cache management first went
in, but are now disruptive during ops review.
Visualization page not loading
* Move nginx cache into /srv/pbench

PBENCH-1316

Our deployed containerized server maps `/var/lib` (the default NGINX cache
location) to `/home`, which has only 26 GB free. Instead, point the NGINX cache
to our large Pbench volume at `/srv/pbench/nginx` so that we can transfer
larger datasets.
PBENCH-1318

The reclaimer defaulted to 20%, which is inappropriate for an unpack reclaim
where we want to free just enough for the unpacked dataset size.

Also, to help diagnose, add the last referenced cache date to the reclaim log
message.
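
The idea of reclaiming "just enough" can be sketched as a least-recently-used selection that stops as soon as the goal is met. This is a simplified illustration under assumed names (`CacheEntry`, `select_for_reclaim`); the actual reclaimer's interface is not shown in this PR.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CacheEntry:
    """Hypothetical stand-in for one unpacked dataset in the cache."""
    name: str
    size: int        # bytes consumed by the unpacked cache
    last_ref: float  # last referenced time, as epoch seconds


def select_for_reclaim(
    entries: List[CacheEntry], free_bytes: int, goal_bytes: int
) -> List[CacheEntry]:
    """Pick least recently used entries until free space reaches the goal."""
    victims = []
    for entry in sorted(entries, key=lambda e: e.last_ref):
        if free_bytes >= goal_bytes:
            break  # enough space is free; stop reclaiming
        victims.append(entry)
        free_bytes += entry.size
    return victims


entries = [
    CacheEntry("a", size=10, last_ref=1.0),
    CacheEntry("b", size=20, last_ref=3.0),
    CacheEntry("c", size=5, last_ref=2.0),
]
# Free 12 bytes starting from none: the oldest ("a") is not enough alone,
# so the next oldest ("c") is reclaimed too, and "b" is left in place.
print([e.name for e in select_for_reclaim(entries, 0, 12)])  # ['a', 'c']
```

Passing the unpacked dataset size as the goal, rather than a fixed percentage of the volume, keeps the reclaim from evicting more cache than the unpack needs.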
* Protect the cache lock better

PBENCH-1317

We found a case where a cache lock could "leak" when an error occurs reading a
file in the visualize and compare APIs. The file read has now been repackaged
with a `finally` to be sure the stream is closed and unlocked on error.
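
The shape of that fix can be sketched as below. This is an assumption-laden illustration: `read_and_release` is a hypothetical helper, and `threading.Lock` stands in for whatever lock the cache manager actually uses.

```python
import io
import threading


def read_and_release(stream, lock):
    """Read a stream whose lock is already held; always close and unlock.

    The `finally` clause runs whether `read()` returns or raises, so the
    lock cannot leak on a read error.
    """
    try:
        return stream.read()
    finally:
        stream.close()
        lock.release()


class FailingStream:
    """Hypothetical stream that fails on read, to exercise the error path."""

    def __init__(self):
        self.closed = False

    def read(self):
        raise OSError("simulated read failure")

    def close(self):
        self.closed = True


# Normal path: data is returned and the lock is released afterwards.
lock = threading.Lock()
lock.acquire()
data = read_and_release(io.BytesIO(b"payload"), lock)

# Error path: the exception propagates, but the lock is still released.
lock.acquire()
failing = FailingStream()
try:
    read_and_release(failing, lock)
except OSError:
    pass
print(lock.locked(), failing.closed)  # False True
```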
* Add simple report generator

This will report on the state of the ARCHIVE, BACKUP, and CACHE on-disk trees
in addition to the state of the SQL database. (I'm going to leave analyzing
and reporting on the OpenSearch database for another time, since this is "off
books" weekend upstream work!)

This packages the ad hoc SQL queries I've been doing to monitor the server as
a CLI utility, plus some more.

Here's the output of `pbench-report-generator --all` on the production server:

```
Archive report:
  117,446 tarballs consuming 21.7 TB
  The smallest tarball is 1.0 kB, pbench-user-benchmark__2020.04.03T11.05.44
  The biggest tarball is 41.1 GB, uperf_Azure_RHEL-8.10.0-20240116.45_x86_64_gen2_pci_netvsc_quick_D240125T014727_2024.01.25T01.47.28
Backup report:
  117,447 tarballs consuming 21.7 TB
Cache report:
  97,464 datasets consuming 45.6 TB
  4 datasets have never been unpacked, 0 are missing reference timestamps, 0 have bad size metadata
  The smallest cache is 24.6 kB, pbench-user-benchmark__2020.04.03T11.05.44
  The biggest cache is 110.5 GB, trafficgen_RHOSP16.2-RHEL8.3-nrt-OVS-OFFLOAD-PVP-LossTests_tg:trex_r:none_fs:64,128,256,512,1024,1500_nf:1024_fm:si_td:bi_ml:0.002,0.0005,0.0001_tt:bs__2020-12-26T03:16:38
  The least recently used cache was referenced Dec 11, specjbb2005__2023.09.22T00.22.28
  The most recently used cache was referenced today, uperf_rhel84_4.18.0.277_kernel_10gb_jumbo_2021.01.26T09.51.18
SQL storage report:
  Table                Rows       Storage   
  -------------------- ---------- ----------
  alembic_version               1    57.3 kB
  audit                   683,922   224.7 MB
  datasets                117,449    34.3 MB
  templates                    12   221.2 kB
  server_settings               0    24.6 kB
  users                        11    81.9 kB
  dataset_metadata        352,344   217.9 MB
  dataset_operations      340,986    29.1 MB
  api_keys                      5    81.9 kB
  indexmaps               291,510    79.7 GB
Operational states:
  UPLOAD states:
          OK  117,449
  TOOLINDEX states:
       READY  106,112
  INDEX states:
          OK  106,112
      FAILED      494
           CODE  7:    365  Bad metadata.log file encountered
           CODE  1:    128  Operational error while indexing
           CODE 12:      1  Unexpected error encountered
       READY   10,819
```
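
As a rough illustration of how the "Archive report" lines could be produced, here is a sketch that walks a directory tree and summarizes the tarballs it finds. Everything here is assumed for illustration (the function names, the `.tar.xz` suffix, and the decimal size formatting); it is not the implementation of `pbench-report-generator`.

```python
import os


def natural_size(nbytes: int) -> str:
    """Format a byte count with decimal (SI) units, e.g. 1500 -> '1.5 kB'."""
    units = ["Bytes", "kB", "MB", "GB", "TB", "PB"]
    size = float(nbytes)
    unit = units[0]
    for unit in units:
        if size < 1000 or unit == units[-1]:
            break
        size /= 1000
    if unit == "Bytes":
        return f"{nbytes} Bytes"
    return f"{size:.1f} {unit}"


def archive_report(root: str, suffix: str = ".tar.xz") -> dict:
    """Count tarballs under root and track total, smallest, and biggest."""
    count, total = 0, 0
    smallest = biggest = None  # (size, name) tuples once a tarball is seen
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(suffix):
                continue
            size = os.path.getsize(os.path.join(dirpath, name))
            count += 1
            total += size
            if smallest is None or size < smallest[0]:
                smallest = (size, name)
            if biggest is None or size > biggest[0]:
                biggest = (size, name)
    return {"count": count, "total": total,
            "smallest": smallest, "biggest": biggest}


def format_report(report: dict) -> str:
    """Render the summary in a style similar to the output shown above."""
    lines = ["Archive report:",
             f"  {report['count']:,} tarballs consuming "
             f"{natural_size(report['total'])}"]
    if report["smallest"]:
        size, name = report["smallest"]
        lines.append(f"  The smallest tarball is {natural_size(size)}, {name}")
    if report["biggest"]:
        size, name = report["biggest"]
        lines.append(f"  The biggest tarball is {natural_size(size)}, {name}")
    return "\n".join(lines)
```

A call such as `print(format_report(archive_report("/path/to/archive")))` would print the summary for whichever tree is passed in; the path is a placeholder.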
@webbnh webbnh added the Server label Feb 1, 2024
@webbnh webbnh requested a review from dbutenhof February 1, 2024 16:20
@webbnh webbnh self-assigned this Feb 1, 2024
dbutenhof
dbutenhof previously approved these changes Feb 1, 2024
dbutenhof and others added 2 commits February 1, 2024 14:48
* Remove IndexMap document list

PBENCH-1315

The production server, with "only" 108,728 indexed datasets (many more still
haven't been migrated from the passthrough server), currently claims 84.1 GB of
PostgreSQL storage just for the `IndexMap` table. Most of this consists of a
list of each OpenSearch document ID in order to allow using bulk update and
delete operations to manage the index. This is straining the capacity of our
RDU2 PostgreSQL server.

As an alternative, this PR removes the document list and replaces the bulk
update and delete operations with `_delete_by_query` and `_update_by_query`,
which search for documents in the appropriate indices (which we still store in
the `IndexMap`) by parent dataset resource ID.
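
To make the by-query approach concrete, here is a sketch of how such requests might be assembled. The helper name, the `run.id` field, the index names, and the `authorization.access` script target are all assumptions for illustration; only the `_delete_by_query` / `_update_by_query` endpoints and the `query` / `script` body keys come from the search engine's REST API.

```python
import json
from typing import List, Optional, Tuple


def by_query_request(
    operation: str,
    indices: List[str],
    dataset_id: str,
    script: Optional[dict] = None,
    id_field: str = "run.id",  # assumed field holding the parent dataset ID
) -> Tuple[str, dict]:
    """Build the path and JSON body for a by-query delete or update."""
    if operation not in ("_delete_by_query", "_update_by_query"):
        raise ValueError(f"unsupported operation: {operation}")
    # Match every document whose parent dataset ID equals dataset_id.
    body = {"query": {"term": {id_field: dataset_id}}}
    if script is not None:
        body["script"] = script
    # Multiple target indices are given as a comma-separated list.
    path = f"/{','.join(indices)}/{operation}"
    return path, body


# Hypothetical index names and resource ID.
path, body = by_query_request(
    "_delete_by_query", ["example-run-data", "example-run-toc"], "abc123"
)
print("POST", path)
print(json.dumps(body))

# An update carries a script; the field being changed here is an assumption.
update_path, update_body = by_query_request(
    "_update_by_query",
    ["example-run-data"],
    "abc123",
    script={
        "lang": "painless",
        "source": "ctx._source.authorization.access = params.access",
        "params": {"access": "public"},
    },
)
```

The trade-off is that the server no longer needs to remember every document ID: it only needs the list of indices per dataset, and the search engine finds the documents itself.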

Along the way, I noticed that (oops) we were missing the `"authorization"`
subdocument in some of our Elasticsearch documents, which would impact the
authenticated search API behaviors. And I acted on a deprecation warning for
a camelCase template keyword by replacing it with a snake_case alternative.
# Conflicts:
#	dashboard/src/modules/components/ComparisonComponent/PanelContent.jsx
@webbnh webbnh merged commit 89f4241 into distributed-system-analysis:b1.0 Feb 5, 2024
4 checks passed
@webbnh webbnh deleted the update_b1.0 branch February 5, 2024 16:53
3 participants