
Maintenance jobs unable to compress all chunks #1741

Open
mikberg opened this issue Nov 4, 2022 · 7 comments
Labels
Bug (Something isn't working)

Comments

@mikberg
Contributor

mikberg commented Nov 4, 2022

Describe the bug

My Promscale instance is almost constantly firing PromscaleMaintenanceJobNotKeepingup, which seems to be because promscale_sql_database_chunks_metrics_uncompressed_count never drops to the alert threshold of 10. Instead, it hovers between roughly 600 and 1100, depending on the maintenance job settings.

I have tried running call prom_api.execute_maintenance(); manually and repeatedly (in a loop), and I have also tried an aggressive schedule for the maintenance jobs, running 4x every 5 minutes. The uncompressed chunk count still seems to hit a "floor" of around 600.
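For reference, a minimal sketch of that manual run from a top-level psql session (the procedure name is taken straight from this report; everything else is illustrative):

-- Run one maintenance pass manually.
CALL prom_api.execute_maintenance();
-- In psql, \watch re-runs the last statement on an interval, e.g. every 5 minutes:
-- \watch 300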

Unfortunately, I haven't been able to run the full debugging query from the runbook, as the database goes into recovery mode whenever I try.
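As a lighter-weight alternative to the full runbook query, something like the sketch below can show where the uncompressed chunks are concentrated. It uses TimescaleDB's timescaledb_information.chunks view and assumes Promscale keeps its metric hypertables in the prom_data schema (that schema name is an assumption on my part):

-- Count uncompressed chunks per metric hypertable (TimescaleDB 2.x).
SELECT hypertable_name, count(*) AS uncompressed_chunks
FROM timescaledb_information.chunks
WHERE hypertable_schema = 'prom_data'
  AND NOT is_compressed
GROUP BY hypertable_name
ORDER BY uncompressed_chunks DESC
LIMIT 20;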

To Reproduce

Not sure.

Expected behavior

promscale_sql_database_chunks_metrics_uncompressed_count dropping below 10 after the maintenance jobs have finished.

Screenshots
[Screenshot: 2022-11-04 at 11:29:37]

Configuration (as applicable)

  • Promscale Connector:
startup.dataset.config: |
  metrics:
    compress_data: true  # default
    default_retention_period: 90d  # default
    default_chunk_interval: 2h  # default is 8h; reduced in an effort to mitigate PromscaleMaintenanceJobRunningTooLong
  traces:
    default_retention_period: 30d  # default
  • TimescaleDB:
shared_buffers: 1280MB
effective_cache_size: 3840MB
maintenance_work_mem: 640MB
work_mem: 8738kB
timescaledb.max_background_workers: 8
max_worker_processes: 13
max_parallel_workers_per_gather: 1
max_parallel_workers: 2
wal_buffers: 16MB
min_wal_size: 2GB
max_wal_size: 4GB
checkpoint_timeout: 900
bgwriter_delay: 10ms
bgwriter_lru_maxpages: 100000
default_statistics_target: 500
random_page_cost: 1.1
checkpoint_completion_target: 0.9
max_connections: 75
max_locks_per_transaction: 64
autovacuum_max_workers: 10
autovacuum_naptime: 10
effective_io_concurrency: 256
timescaledb.last_tuned: '2022-10-28T08:48:02Z'
timescaledb.last_tuned_version: '0.14.1'

Version

  • Distribution/OS:
  • Promscale: 0.16.0, 0.7.0 (extension)
  • TimescaleDB: 2.8.1

Additional context

  • PostgreSQL runs via the Crunchy postgres-operator; the database is allocated 8 GB of memory and uses about 5-6 GB on average.
  • Average ingest is around 2000 samples/sec, per the Grafana dashboard.
@ramonguiu
Contributor

The number of uncompressed chunks depends on the number of unique metric names. Each metric name uses its own hypertable, and at any point in time there shouldn't be more than 2 uncompressed chunks per hypertable: the current one, which current data is being written to, and the previous one, which is kept open for one hour after the current chunk was created to catch late-arriving data.

How many unique metric names do you have?

@mikberg
Contributor Author

mikberg commented Nov 7, 2022

The number of uncompressed chunks depends on the number of unique metric names. Each metric name uses its own hypertable, and at any point in time there shouldn't be more than 2 uncompressed chunks per hypertable: the current one, which current data is being written to, and the previous one, which is kept open for one hour after the current chunk was created to catch late-arriving data.

How many unique metric names do you have?

tsdb=# select count(*) from information_schema.tables where table_schema='prom_metric';
 count
-------
  2617

(or 2015 label values for __name__ in Prometheus; there might be some left-overs).

Thanks, this was very informative. Do I understand correctly that I shouldn't really expect the uncompressed chunk count to fall much below 2 * (number_of_unique_metric_names)? In that case, the default alert threshold of 10 sounds very low?
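A sketch that puts the two numbers side by side, under the same prom_data-schema assumption as above (illustrative only, not the official runbook query):

-- Compare the theoretical ceiling (2 uncompressed chunks per metric hypertable)
-- with the number of uncompressed chunks that currently exist.
SELECT
  2 * count(DISTINCT hypertable_name)       AS expected_ceiling,
  count(*) FILTER (WHERE NOT is_compressed) AS uncompressed_now
FROM timescaledb_information.chunks
WHERE hypertable_schema = 'prom_data';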

@ramonguiu
Contributor

Thanks, this was very informative. Do I understand correctly that I shouldn't really expect the uncompressed chunk count to fall much below 2 * (number_of_unique_metric_names)? In that case, the default alert threshold of 10 sounds very low?

Yes, that's correct. Let me check with the team why the alert is defined like that.

@Harkishen-Singh
Member

Harkishen-Singh commented Nov 14, 2022

I agree, this should be changed to

(
    min_over_time(promscale_sql_database_chunks_metrics_uncompressed_count[1h]) > 2 * promscale_sql_database_metric_count
)

Also, pinging @sumerman in case he knows the reason behind > 10.

@sumerman
Contributor

I agree, this should be changed to

(
    min_over_time(promscale_sql_database_chunks_metrics_uncompressed_count[1h]) > 2 * promscale_sql_database_metric_count
)

Also, I think we should change min_over_time to avg_over_time. Reason: min_over_time in this case seems too strict, since at any given point in the last 1h, if the uncompressed chunks are more than expected, it will alert. Averaging this over 30m should be fine.

Also, pinging @sumerman in case he knows the reason behind > 10.

Thank you. As I have answered elsewhere, my intention when defining this metric was for it to go down to 0; 10 was a safety margin.

@VineethReddy02 added the Bug label Nov 22, 2022
@ramonguiu
Contributor

@sumerman did we fix this?

sumerman added a commit that referenced this issue Dec 14, 2022
on a function used by the maintenance jobs.

It should also fix for #1741
@sumerman
Contributor

I expect #1794 to fix this when it lands

sumerman added a commit that referenced this issue Dec 21, 2022
on a function used by the maintenance jobs.

It should also fix for #1741
alejandrodnm pushed a commit that referenced this issue Dec 23, 2022
on a function used by the maintenance jobs.

It should also fix for #1741