[Feature]: Allow chunk size to be based on time or chunk total size of storage #6720
Comments
@robert-medina-factorypal Agree that this would be useful. Are you able to share the procedure you've currently got running for the dynamic chunking?
@george-gridedge Here's the procedure that I initially implemented (please note that I wrote this a while ago when I was starting with Timescale, so I'm sure there are better ways to do this):

```sql
CREATE OR REPLACE PROCEDURE adjust_chunk_interval_size(job_id int, config jsonb)
    LANGUAGE plpgsql AS
$$
DECLARE
    -- Target chunk size: 25% of the instance memory, in bytes.
    desired_chunk_size_bytes constant bigint NOT NULL := ((config ->> 'instance_memory_megabytes')::int * 0.25 * 1048576)::bigint;
    calculation_method constant text NOT NULL := config ->> 'calculation_method';
    last_nth_entries constant int NOT NULL := coalesce((config ->> 'last_nth_entries')::int, 0);
    schema_name constant text NOT NULL := config ->> 'schema_name';
    table_name constant text NOT NULL := config ->> 'table_name';
    qualified_table_name constant text NOT NULL := format('%I.%I', schema_name, table_name);
    interval_before_next_calculation constant interval NOT NULL := (config ->> 'interval_before_next_calculation')::interval;
    dry_run constant boolean NOT NULL := coalesce((config ->> 'dry_run')::boolean, false);
    interval_for_chunk_desired_size interval;
    latest_chunk_range_end timestamptz;
    job_next_start timestamptz;
BEGIN
    RAISE NOTICE 'desired_chunk_size_bytes: %', desired_chunk_size_bytes;
    RAISE NOTICE 'calculation_method: %', calculation_method;
    RAISE NOTICE 'last_nth_entries: %', last_nth_entries;
    RAISE NOTICE 'qualified_table_name: %', qualified_table_name;
    RAISE NOTICE 'interval_before_next_calculation: %', interval_before_next_calculation;

    IF calculation_method IN ('last_nth_average', 'last_nth_min') AND last_nth_entries <= 0 THEN
        RAISE EXCEPTION 'last_nth_entries must be greater than zero when calculation method is last_nth_*.';
    END IF;

    WITH base_query AS (
        SELECT cds.chunk_schema,
               cds.chunk_name,
               -- Prefer the pre-compression size when the chunk is compressed.
               coalesce(ccs.before_compression_total_bytes, cds.total_bytes) AS chunk_uncompressed_total_bytes,
               trunc(extract('epoch' FROM (tsic.range_end - tsic.range_start)) * 1000)::bigint AS chunk_interval_milliseconds,
               row_number() OVER (ORDER BY tsic.range_start DESC) AS record_number
        FROM chunks_detailed_size(qualified_table_name) cds
        INNER JOIN timescaledb_information.chunks AS tsic
            ON cds.chunk_schema = tsic.chunk_schema
           AND cds.chunk_name = tsic.chunk_name
        LEFT JOIN chunk_compression_stats(qualified_table_name) ccs
            ON cds.chunk_schema = ccs.chunk_schema
           AND cds.chunk_name = ccs.chunk_name
           -- Keep this filter in the join condition: putting it in the WHERE
           -- clause would discard uncompressed chunks despite the LEFT JOIN.
           AND ccs.compression_status = 'Compressed'
    ),
    calculation_method_query AS (
        SELECT CASE
                   WHEN calculation_method IN ('average', 'last_nth_average')
                       THEN (avg(chunk_interval_milliseconds::numeric /
                                 chunk_uncompressed_total_bytes::numeric) *
                             desired_chunk_size_bytes::numeric)::bigint
                   WHEN calculation_method IN ('min', 'last_nth_min')
                       THEN (min(chunk_interval_milliseconds::numeric /
                                 chunk_uncompressed_total_bytes::numeric) *
                             desired_chunk_size_bytes::numeric)::bigint
               END AS milliseconds_for_chunk_desired_size
        FROM base_query
        WHERE CASE
                  WHEN calculation_method IN ('average', 'min')
                      THEN TRUE
                  WHEN calculation_method IN ('last_nth_average', 'last_nth_min')
                      -- Skip the most recent chunk, since data may still be
                      -- inserted into it, and keep the last N chunks before it.
                      THEN record_number > 1 AND record_number - 1 <= last_nth_entries
              END
    )
    SELECT make_interval(secs => trunc(milliseconds_for_chunk_desired_size / 1000.0))
    INTO STRICT interval_for_chunk_desired_size
    FROM calculation_method_query;

    RAISE NOTICE 'interval_for_chunk_desired_size: %', interval_for_chunk_desired_size;
    IF interval_for_chunk_desired_size IS NULL THEN
        RAISE EXCEPTION 'Chunk size interval was not calculated correctly.';
    END IF;

    IF NOT dry_run THEN
        PERFORM set_chunk_time_interval(qualified_table_name, interval_for_chunk_desired_size);
    END IF;

    IF job_id IS NOT NULL THEN
        -- Reschedule the job so it runs shortly before the newest chunk closes.
        SELECT range_end
        INTO STRICT latest_chunk_range_end
        FROM timescaledb_information.chunks
        WHERE hypertable_schema = schema_name
          AND hypertable_name = table_name
        ORDER BY range_start DESC
        LIMIT 1;

        job_next_start := latest_chunk_range_end + interval_for_chunk_desired_size - interval_before_next_calculation;
        IF job_next_start < date_trunc('minute', now()) THEN
            job_next_start := date_trunc('minute', now()) + interval_for_chunk_desired_size -
                              interval_before_next_calculation;
        END IF;

        RAISE NOTICE 'latest_chunk_range_end: %', latest_chunk_range_end;
        RAISE NOTICE 'schedule_interval: %', interval_for_chunk_desired_size - interval_before_next_calculation;
        RAISE NOTICE 'max_runtime: %', interval_before_next_calculation;
        RAISE NOTICE 'next_start: %', job_next_start;

        IF NOT dry_run THEN
            PERFORM alter_job(job_id,
                              schedule_interval => interval_for_chunk_desired_size -
                                                   interval_before_next_calculation,
                              max_runtime => interval_before_next_calculation,
                              config => config,
                              next_start => job_next_start);
        END IF;
    END IF;
END
$$;
```

This procedure is meant to run as a user-defined job: it estimates, from the size and time range of past chunks, the interval that should make one chunk fit within the desired size, applies it with `set_chunk_time_interval`, and then reschedules itself to run again shortly before the next chunk closes.
Let me know if you have any other questions.
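For completeness, here's a sketch of how one might register the procedure as a user-defined job with `add_job`. The config keys mirror the ones the procedure reads; the hypertable name, memory figure, and intervals are just placeholder values:

```sql
-- Register the procedure as a TimescaleDB user-defined job.
-- The initial schedule_interval is a rough guess; the procedure itself
-- tightens it on every run via alter_job().
SELECT add_job(
    'adjust_chunk_interval_size',
    schedule_interval => INTERVAL '1 day',
    config => '{
        "instance_memory_megabytes": 8192,
        "calculation_method": "last_nth_average",
        "last_nth_entries": 5,
        "schema_name": "public",
        "table_name": "conditions",
        "interval_before_next_calculation": "30 minutes",
        "dry_run": true
    }'::jsonb
);
```

Running it first with `"dry_run": true` lets you check the `RAISE NOTICE` output before the job actually alters the chunk interval or its own schedule.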
What problem does the new feature solve?
Describe the problem
As per [Best practices for time partitioning](https://docs.timescale.com/use-timescale/latest/hypertables/about-hypertables/#best-practices-for-time-partitioning), it's recommended to set the `chunk_time_interval` so that one chunk, including its indexes, fits within 25% of main memory.
However, for data that is highly dynamic, setting the chunk size interval can be tricky. Consider the following scenario:
A Timescale instance that has 8 GB of RAM so 25% of the main memory is 2 GB.
On day 1, it was estimated that the volume of data inserted per day is 1.75 GB. So with a chunk interval of 1 day, the chunk and its indexes will most likely fit.
On day 2, the volume of data ingested goes up 20%. Since the chunk interval is still 1 day, there is now a chunk of 2.1 GB, not including its indexes. This chunk does not fit within 25% of the main memory.
On day 3, the volume of data ingested goes down 5%. The chunk interval was not changed, so the chunk will have a size of 1.995 GB, which will most likely still not fit within 25% of the main memory once its indexes are included.
Lastly, assume that the volume of data ingested per day keeps going up or down, so that a chunk including its indexes sometimes does and sometimes does not fit within 25% of the main memory.
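As a sanity check for this scenario, chunk sizes can be compared against the 25% target with `chunks_detailed_size` (the `conditions` hypertable name and the 8 GB instance are just examples matching the scenario above):

```sql
-- Compare each chunk's total size (data + indexes + toast) against
-- 2 GB, i.e. 25% of an 8 GB instance.
SELECT chunk_name,
       pg_size_pretty(total_bytes)  AS chunk_size,
       total_bytes > 2 * 1024 ^ 3  AS exceeds_25_percent_of_8gb
FROM chunks_detailed_size('conditions')
ORDER BY chunk_name;
```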
Why it is important to solve?
Based on the previous scenario, many chunks will not follow the best practices for size, which will have a performance impact.
Did you consider alternative solutions, perhaps outside the database?
Yes. In fact, we have created a procedure, scheduled to run right before a chunk is about to close, that estimates the data volume for the next chunk based on the last N chunks and adjusts the chunk size interval of the hypertable accordingly.
Why is it better to add the feature to TimescaleDB?
If Timescale supported this feature natively, other users with the same problem could benefit without reinventing the wheel, so to speak.
What does the feature do?
Give a high-level overview of what the feature does and how it would work
Use the chunk size interval as usual; however, if the chunk is the current one for the hypertable and its size including indexes is about to exceed 25% of the available memory, change the chunk's range_end so that new records are inserted into a new chunk.
The reason I explicitly say the current chunk of the hypertable is that, if there are other chunks after it, the chunk's range must not be changed; otherwise there would be gaps between the chunks of the hypertable.
In other words, new records will no longer be inserted into the current chunk once its size including indexes exceeds 25% of the main memory, or once the timestamp of the record to be inserted is past the chunk's range_end, whichever comes first.
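Until something like this exists natively, the closest manual lever is `set_chunk_time_interval`, which only affects chunks created after the call; existing chunks keep their original ranges, which is why it cannot cap the size of the chunk currently being written to (the hypertable name below is a placeholder):

```sql
-- Shrink the interval for future chunks when ingest volume grows.
-- The chunk currently receiving inserts is unaffected.
SELECT set_chunk_time_interval('conditions', INTERVAL '18 hours');
```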
Implementation challenges
Challenging issues to overcome