
[Feature]: Allow chunk size to be based on time or chunk total size of storage #6720

Open
robert-medina-factorypal opened this issue Mar 1, 2024 · 2 comments
Labels
feature-request Feature proposal

Comments


robert-medina-factorypal commented Mar 1, 2024

What problem does the new feature solve?

Describe the problem

As per the [Best practices for time partitioning](https://docs.timescale.com/use-timescale/latest/hypertables/about-hypertables/#best-practices-for-time-partitioning), it's recommended to set the `chunk_time_interval` so that one chunk, including its indexes, fits within 25% of main memory.

However, for data that is highly dynamic, setting the chunk size interval can be tricky. Consider the following scenario:

A Timescale instance that has 8 GB of RAM so 25% of the main memory is 2 GB.

On day 1, it was estimated that the volume of data inserted per day is 1.75 GB, so a chunk size of 1 day will most likely fit the chunk and its indexes.

On day 2, the volume of ingested data goes up 20%. Since the chunk size is still 1 day, there is now a chunk of 2.1 GB, not including its indexes. This chunk does not fit within 25% of the main memory.

On day 3, the volume of ingested data goes down 5%. The chunk size interval was not changed, so the chunk will have a size of 1.995 GB, which will most likely not fit within 25% of the main memory once its indexes are included.

Lastly, assume that the volume of data ingested per day keeps going up or down, so that a chunk including its indexes sometimes fits within 25% of the main memory and sometimes does not.
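The arithmetic in this scenario can be checked with a short sketch (figures taken from the description above; sizes are data only, excluding indexes):

```python
# Check whether each day's chunk fits within 25% of main memory,
# using the figures from the scenario above (sizes exclude indexes).

ram_gb = 8
budget_gb = ram_gb * 0.25          # 2.0 GB available per chunk

day1 = 1.75                        # estimated daily ingest on day 1
day2 = day1 * 1.20                 # ingest up 20% on day 2 -> 2.1 GB
day3 = day2 * 0.95                 # ingest down 5% on day 3 -> 1.995 GB

for day, size_gb in enumerate((day1, day2, day3), start=1):
    verdict = "fits" if size_gb <= budget_gb else "does not fit"
    print(f"day {day}: {size_gb:.3f} GB -> {verdict}")
```

Day 3 "fits" only on paper: 1.995 GB leaves almost no headroom for indexes, which is exactly the point of the scenario.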

Why is it important to solve?

Based on the previous scenario, many chunks will not follow the best practices for size, which will have a performance impact.

Did you consider alternative solutions, perhaps outside the database?

Yes. In fact, we created a procedure scheduled to run right before a chunk was about to expire, so we could estimate the data volume for the next chunk based on the last N chunks and adjust the chunk size interval of the hypertable accordingly.

Why is it better to add the feature to TimescaleDB?

If Timescale supported this feature natively, other users with the same problem could benefit without re-inventing the wheel, so to speak.

What does the feature do?

Give a high-level overview of what the feature does and how it would work

Use the chunk size interval as usual; however, if the chunk is the current one for the hypertable and its size including its indexes is about to exceed 25% of the available memory, use this value to change the chunk's range_end so that new records are inserted into a new chunk.

The reason I explicitly mention that the chunk must be the current one of the hypertable is that if there are other chunks after it, the chunk's range must not be changed. Otherwise, there would be gaps between the chunks of the hypertable.

In other words, new records will not be inserted into the current chunk if its size including indexes exceeds 25% of the main memory, or when the timestamp of the record to be inserted is after its range_end, whichever comes first.
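The rule described above could be sketched as a single predicate (hypothetical names; a real implementation would live inside TimescaleDB's insert path, not in user code):

```python
from datetime import datetime

def starts_new_chunk(record_ts: datetime,
                     range_end: datetime,
                     chunk_bytes_with_indexes: int,
                     memory_budget_bytes: int) -> bool:
    """Route a record to a new chunk when the current chunk has hit the
    memory budget (25% of main memory) or the record's timestamp is at
    or past range_end, whichever comes first."""
    over_budget = chunk_bytes_with_indexes >= memory_budget_bytes
    past_range_end = record_ts >= range_end
    return over_budget or past_range_end
```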

Implementation challenges

Challenging issues to overcome

  • Currently, there is no way to change the chunk size of current chunks, so I'm guessing that capability is unavailable for a good reason.
  • Checking the chunk size, including its indexes, for every record that is about to be inserted is a bad idea performance-wise.
  • If the instance is downscaled, chunks that were already modified by the feature will no longer fit within 25% of the main memory.
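For the second challenge, one mitigation would be to amortize the size check rather than measuring the chunk on every insert. A minimal sketch of that idea (hypothetical names, not TimescaleDB internals):

```python
class ChunkSizeGuard:
    """Amortize the cost of chunk-size checks: instead of measuring the
    chunk (data plus indexes) on every insert, probe only once every
    `check_every` inserts."""

    def __init__(self, measure_bytes, budget_bytes, check_every=10_000):
        self._measure_bytes = measure_bytes  # callable returning current chunk size
        self._budget = budget_bytes
        self._check_every = check_every
        self._since_check = 0
        self._over_budget = False

    def record_inserted(self):
        """Call once per insert; returns True once the budget is exceeded."""
        self._since_check += 1
        if self._since_check >= self._check_every:
            self._since_check = 0
            self._over_budget = self._measure_bytes() >= self._budget
        return self._over_budget
```

The trade-off is that a chunk can overshoot the budget by up to `check_every` records' worth of data before the guard notices.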
@robert-medina-factorypal robert-medina-factorypal added the feature-request Feature proposal label Mar 1, 2024
@robert-medina-factorypal robert-medina-factorypal changed the title [Feature]: Allow to set chunk size interval based on time or chunk total size of storage [Feature]: Allow chunk size to be based on time or chunk total size of storage Mar 1, 2024
@george-gridedge

@robert-medina-factorypal Agreed that this would be useful. Are you able to share the procedure you currently have running for the dynamic chunking?

@robert-medina-factorypal

robert-medina-factorypal commented Apr 9, 2024

@george-gridedge Here's the procedure that I initially implemented (please note that I wrote this a while ago when I was just starting with Timescale, so I'm sure there are better ways to do this):

CREATE OR REPLACE PROCEDURE adjust_chunk_interval_size(job_id int, config jsonb)
    LANGUAGE PLPGSQL AS
$$
DECLARE
    desired_chunk_size_bytes         constant bigint NOT NULL   := ((config ->> 'instance_memory_megabytes')::int * 0.25 * 1048576)::bigint;
    calculation_method               constant text NOT NULL     := config ->> 'calculation_method';
    last_nth_entries                 constant int NOT NULL      := coalesce((config ->> 'last_nth_entries')::int, 0);
    schema_name                      constant text NOT NULL     := config ->> 'schema_name';
    table_name                       constant text NOT NULL     := config ->> 'table_name';
    qualified_table_name             constant text NOT NULL     := format('%I.%I', schema_name, table_name);
    interval_before_next_calculation constant interval NOT NULL := (config ->> 'interval_before_next_calculation')::interval;
    dry_run                          constant boolean NOT NULL  := coalesce((config ->> 'dry_run')::boolean, false);
    interval_for_chunk_desired_size           interval;
    latest_chunk_range_end                    timestamptz;
    job_next_start                            timestamptz;
BEGIN
    RAISE NOTICE 'desired_chunk_size_bytes: %', desired_chunk_size_bytes;
    RAISE NOTICE 'calculation_method: %', calculation_method;
    RAISE NOTICE 'last_nth_entries: %', last_nth_entries;
    RAISE NOTICE 'qualified_table_name: %', qualified_table_name;
    RAISE NOTICE 'interval_before_next_calculation: %', interval_before_next_calculation;
    IF calculation_method IN ('last_nth_average', 'last_nth_min') AND last_nth_entries < 0 THEN
        RAISE EXCEPTION 'last_nth_entries must not be less than zero when calculation method is last_nth_*.';
    END IF;
    WITH base_query AS (SELECT cds.chunk_schema,
                               cds.chunk_name,
                               coalesce(ccs.before_compression_total_bytes, cds.total_bytes)                   AS chunk_uncompressed_total_bytes,
                               trunc(extract('epoch' FROM (tsic.range_end - tsic.range_start)) * 1000)::bigint AS chunk_interval_milliseconds,
                               row_number() OVER (ORDER BY tsic.range_start DESC)                              AS record_number
                        FROM chunks_detailed_size(qualified_table_name) cds
                                 INNER JOIN timescaledb_information.chunks AS tsic
                                            ON cds.chunk_schema = tsic.chunk_schema AND
                                               cds.chunk_name = tsic.chunk_name
                                 LEFT JOIN chunk_compression_stats(qualified_table_name) ccs
                                           ON cds.chunk_schema = ccs.chunk_schema AND
                                              cds.chunk_name = ccs.chunk_name
                        WHERE ccs.compression_status = 'Compressed'
                           OR ccs.compression_status IS NULL /* keep uncompressed chunks; the coalesce above falls back to their raw size */),
         calculation_method_query AS (SELECT CASE
                                                 WHEN calculation_method = 'average' OR calculation_method = 'last_nth_average'
                                                     THEN (avg(chunk_interval_milliseconds::numeric /
                                                               chunk_uncompressed_total_bytes::numeric) *
                                                           desired_chunk_size_bytes::numeric)::bigint
                                                 WHEN calculation_method = 'min' OR calculation_method = 'last_nth_min'
                                                     THEN (min(chunk_interval_milliseconds::numeric /
                                                               chunk_uncompressed_total_bytes::numeric) *
                                                           desired_chunk_size_bytes::numeric)::bigint END AS milliseconds_for_chunk_desired_size
                                      FROM base_query
                                      WHERE CASE
                                                WHEN calculation_method = 'average' OR calculation_method = 'min'
                                                    THEN TRUE
                                                WHEN calculation_method = 'last_nth_average' OR
                                                     calculation_method = 'last_nth_min'
                                                    THEN record_number > 1 AND record_number - 1 <= last_nth_entries END /* Skip the latest chunk, into which data may still be inserted */)
    SELECT make_interval(secs => trunc(milliseconds_for_chunk_desired_size / 1000.0))
    INTO STRICT interval_for_chunk_desired_size
    FROM calculation_method_query;
    RAISE NOTICE 'interval_for_chunk_desired_size: %', interval_for_chunk_desired_size;
    IF interval_for_chunk_desired_size IS NULL THEN
        RAISE EXCEPTION 'Chunk size interval was not calculated correctly.';
    END IF;
    IF NOT dry_run THEN
        PERFORM set_chunk_time_interval(qualified_table_name, interval_for_chunk_desired_size);
    END IF;
    IF job_id IS NOT NULL THEN
        SELECT range_end
        INTO STRICT latest_chunk_range_end
        FROM timescaledb_information.chunks
        WHERE hypertable_schema = schema_name
          AND hypertable_name = table_name
        ORDER BY range_start DESC
        LIMIT 1;
        job_next_start = latest_chunk_range_end + interval_for_chunk_desired_size - interval_before_next_calculation;
        IF job_next_start < date_trunc('minute', now()) THEN
            job_next_start = date_trunc('minute', now()) + interval_for_chunk_desired_size -
                             interval_before_next_calculation;
        END IF;
        RAISE NOTICE 'latest_chunk_range_end: %', latest_chunk_range_end;
        RAISE NOTICE 'schedule_interval: %', interval_for_chunk_desired_size - interval_before_next_calculation;
        RAISE NOTICE 'max_runtime: %', interval_before_next_calculation;
        RAISE NOTICE 'next_start: %', job_next_start;
        IF NOT dry_run THEN
            PERFORM alter_job(job_id, schedule_interval => interval_for_chunk_desired_size -
                                                           interval_before_next_calculation,
                              max_runtime => interval_before_next_calculation, config => config,
                              next_start => job_next_start);
        END IF;
    END IF;
END
$$;

This procedure is meant to be a user-defined job that performs the following:

  • Receives configuration regarding:
    • The hypertable to monitor
    • The calculation method
    • When to perform the next calculation
    • If the procedure has side effects
  • The next-chunk interval is calculated using the specified method (average, average of the nth last chunks, min, min of the nth last chunks).
  • If the procedure is enabled to have side effects then the chunk time interval is changed.
  • If the procedure is executed inside a user-defined job context and is enabled to have side effects, then the job's next start is recalculated so that it executes before the current chunk of the hypertable expires.
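The interval calculation at the heart of the procedure (scale a time-per-byte rate from past chunks up to the desired chunk size) can be sketched outside the database, using hypothetical sample figures:

```python
from statistics import mean

def interval_for_desired_size(past_chunks, desired_bytes, method="average"):
    """past_chunks: (interval_ms, uncompressed_bytes) pairs for settled
    chunks. Mirrors the average/min calculation methods used in the
    procedure above; returns the new chunk interval in milliseconds."""
    ratios = [interval_ms / size_bytes for interval_ms, size_bytes in past_chunks]
    pick = mean if method.endswith("average") else min
    return int(pick(ratios) * desired_bytes)

# Example: two past 1-day chunks holding 1.75 GB and 2.1 GB of data,
# targeting 2 GB chunks (25% of an 8 GB instance).
DAY_MS = 86_400_000
GB = 1024 ** 3
history = [(DAY_MS, int(1.75 * GB)), (DAY_MS, int(2.1 * GB))]
new_interval_ms = interval_for_desired_size(history, 2 * GB, method="min")
print(new_interval_ms / 3_600_000)  # new interval in hours, a bit under a day
```

With the `min` method the interval shrinks enough that even the largest recent chunk would have fit the budget; `average` is less conservative.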

Let me know if you have any other questions.
