
Add support for correlated constraints #6727

Open. Wants to merge 1 commit from the correlated_cons branch into main.

Conversation

nikkhils
Contributor

@nikkhils nikkhils commented Mar 4, 2024

Allow users to designate a specific column to be used as a correlated
constraint via the add_dimension() API. We store such correlated
constraints in the dimension-related timescaledb catalog tables. The
"dimension" catalog table has been amended with a "type" column, which
now explicitly stores the type: open, closed, or correlated, as
appropriate.

We create dimension_slice and chunk_constraint entries for chunks which
have correlated constraints on them. The dimension slice entry will
have -inf/+inf as start/end range initially for a given correlated
constraint and the chunk_constraint entry will refer back to this slice
entry.

This start/end range will be refreshed later. One of the entry points
is during compression for now.

We can thus store the min/max values for such correlated constraints
in these catalog tables at the per-chunk level. Note that correlated
constraints do not participate in partitioning of the data. Such a
correlated constraint will be used for chunk pruning if the WHERE
clause of a SQL query specifies ranges on such a column.
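A sketch of the intended usage described above (the named-argument form for marking a dimension as correlated, and the table/column names, are assumptions for illustration, not the confirmed API):

```sql
-- Illustrative sketch only; the "correlated => true" argument form
-- is an assumption for this example.
CREATE TABLE sample_table (
    time       timestamptz NOT NULL,
    sensor_id  integer,
    cpu        double precision
);
SELECT create_hypertable('sample_table', 'time');

-- Register "sensor_id" as a correlated constraint. It does not
-- partition the data; it only records per-chunk min/max ranges.
SELECT add_dimension('sample_table', 'sensor_id', correlated => true);

-- Once compression has refreshed the ranges, a query like this can
-- prune chunks whose [min, max] range does not overlap the predicate:
SELECT * FROM sample_table WHERE sensor_id BETWEEN 100 AND 200;
```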

Disable-check: force-changelog-file

Disable-check: commit-count

@nikkhils nikkhils self-assigned this Mar 4, 2024
@nikkhils nikkhils marked this pull request as draft March 4, 2024 10:53
@nikkhils nikkhils force-pushed the correlated_cons branch 4 times, most recently from c28040d to c5fdccd Compare March 4, 2024 13:46
@nikkhils nikkhils force-pushed the correlated_cons branch 8 times, most recently from 97cc717 to 2482fd5 Compare March 8, 2024 07:47

codecov bot commented Mar 8, 2024

Codecov Report

Attention: Patch coverage is 82.59109%, with 43 lines in your changes missing coverage. Please review.

Project coverage is 81.00%. Comparing base (59f50f2) to head (9c879ec).
Report is 161 commits behind head on main.

❗ Current head 9c879ec differs from the pull request's most recent head 03553c9. Consider uploading reports for commit 03553c9 to get more accurate results.

Files Patch % Lines
src/dimension.c 74.19% 15 Missing and 9 partials ⚠️
src/dimension_slice.c 82.05% 6 Missing and 8 partials ⚠️
src/chunk_constraint.c 94.59% 2 Missing ⚠️
src/process_utility.c 71.42% 0 Missing and 2 partials ⚠️
src/chunk_adaptive.c 88.88% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6727      +/-   ##
==========================================
+ Coverage   80.06%   81.00%   +0.93%     
==========================================
  Files         190      198       +8     
  Lines       37181    37499     +318     
  Branches     9450     9788     +338     
==========================================
+ Hits        29770    30376     +606     
- Misses       2997     3208     +211     
+ Partials     4414     3915     -499     


@nikkhils nikkhils force-pushed the correlated_cons branch 2 times, most recently from 0f5c701 to 7c8a60c Compare March 14, 2024 12:42
@nikkhils nikkhils marked this pull request as ready for review March 14, 2024 12:43
@nikkhils nikkhils force-pushed the correlated_cons branch 6 times, most recently from ad8c04b to eb2fe8b Compare March 21, 2024 13:52
Resolved review threads (outdated): src/dimension.h, src/dimension.c, src/chunk_constraint.h
@nikkhils nikkhils force-pushed the correlated_cons branch 3 times, most recently from 812bb75 to 425fe4c Compare May 3, 2024 09:21
@svenklemm
Member

Have you considered not storing these as dimensions? This seems to be quite an invasive change, and it means that all existing dimension code needs to be checked for whether it handles correlated constraints correctly.

@nikkhils nikkhils force-pushed the correlated_cons branch 2 times, most recently from e613d0c to 00b31f7 Compare May 6, 2024 11:37
Contributor

@erimatnor erimatnor left a comment

Approved. I had a bunch of nits and suggestions, however. You can go through them and address them as you see fit.

Resolved review threads: src/dimension.c (5 threads, 4 outdated), src/dimension_slice.c
Comment on lines 1205 to 1212
snprintf(NameStr(constraint->fd.constraint_name),
NAMEDATALEN,
"%sconstraint_%d",
CC_DIM_PREFIX,
new_slice->fd.id);
Contributor

Yeah, I asked a similar question about the new prefix we use. Not saying it's wrong, just wondering why we are not hooking into existing code or Postgres functions for naming.


-- recompress the partial chunk
SELECT compress_chunk(:'CH_NAME');
WARNING: no index on "sensor_id" found for correlated constraint on chunk "_hyper_1_1_chunk"
Contributor

This WARNING is a bit jarring when compressing. I guess it is related to the fact that min/max is found using a seqscan on the chunk, which is slightly inefficient, but the user will have no idea. Furthermore, nothing is strictly wrong here, so it probably should not be a WARNING. I think an info message is more appropriate when defining the correlated constraint (add_dimension?), but not on every recompression.

I realize this message is inherited from adaptive chunking, but then it was used in a different context. I worry that a user would interpret this as something being wrong. It is unclear what the action to take is, if any. If a user isn't expected to do anything about this I think we should avoid displaying this warning.

Also, as I mentioned above that we might be able to get the min/max as output from the compression in the future, so this might not be relevant in that case.

Contributor Author

Fair enough. Converting it into a DEBUG1 for now.

Contributor Author

@nikkhils nikkhils May 8, 2024

> Yeah, I asked a similar question about the new prefix we use. Not saying its wrong, just wondering why we are not hooking into existing code or postgres functions for naming.

Exposed the existing chunk_constraint_dimension_choose_name function and am using it now.
cc @fabriziomello

Contributor

I do wonder about the performance implications of the extra seqscan. Maybe it is better to implement the optimization of getting the min/max during compression sooner rather than later. Not for this PR, but maybe a follow-up?

It could be good to test that compress_chunk() is not significantly slowed down on bigger chunks.

Contributor Author

@erimatnor typically, if the correlated columns are being used in WHERE clauses, then chances are that an index already exists on such a column.
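As a sketch (the table and column names are taken from the test output in this PR; the index itself is illustrative):

```sql
-- A plain btree index on the correlated column allows the min/max
-- calculation at compress time to use index scans instead of a seqscan.
CREATE INDEX ON sample_table (sensor_id);
```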

Comment on lines 245 to 248
SELECT * FROM _timescaledb_catalog.dimension WHERE type = 'C';
id | hypertable_id | column_name | column_type | aligned | num_slices | partitioning_func_schema | partitioning_func | interval_length | compress_interval_length | integer_now_func_schema | integer_now_func | type
----+---------------+-------------+-------------+---------+------------+--------------------------+-------------------+-----------------+--------------------------+-------------------------+------------------+------
Contributor

Maybe select on the dimension_id returned by remove_dimension? Or join in same query?
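Sketching the suggestion, assuming remove_dimension returns the removed dimension's id:

```sql
-- Sketch: capture the returned id with psql's \gset and select on it
-- directly, instead of filtering the catalog on type = 'C'.
SELECT remove_dimension('sample_table', 'sensor_id') AS dimension_id \gset
SELECT * FROM _timescaledb_catalog.dimension WHERE id = :dimension_id;
```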

Contributor Author

Done.

ERROR: no correlated dimension on column "sensor_id"
-- should only work on correlated dimensions
SELECT * from remove_dimension('sample_table', 'time');
ERROR: no correlated dimension on column "time"
Contributor

I think it would be more informative if the error for other types of dimensions was something like:

range dimension "time" cannot be removed
hint: Only correlated dimensions can be removed.

The current message is slightly confusing because the error assumes that the intention of the user was to remove a correlated dimension.

But just by looking at the API and the function name, it is implied that you can remove any dimension (irrespective of type). So, if a user intentionally tries to remove the "time" dimension (knowing it is a time dimension), the error talks about not finding a correlated dimension. Instead, we should let the user know that removing "time"/range dimensions is not supported.

As an aside, I think it would be easy to support removing dimensions if there are no chunks in the hypertable, which might be useful when you are creating the hypertable "schema" and erroneously add a dimension. We don't need to implement that as part of this PR; just saying it is easy to do.

Contributor Author

@erimatnor the errhint does elaborate that

"remove_dimension can only be used to remove correlated dimension constraints."

Since it can be called on any column, we would have to look up metadata to identify the dimension type of the column and then say that it cannot be removed.

We will have to ensure that we add documentation for remove_dimension mentioning that it currently works on correlated constraints only.
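The metadata lookup being discussed could be sketched roughly as follows (column names taken from the catalog output shown elsewhere in this PR; the hypertable join is an assumption about how the lookup would be written):

```sql
-- Rough sketch: look up the dimension type for the column first, so the
-- error can say e.g. "range dimension ... cannot be removed" instead.
SELECT d.type
FROM _timescaledb_catalog.dimension d
JOIN _timescaledb_catalog.hypertable h ON h.id = d.hypertable_id
WHERE h.table_name = 'sample_table' AND d.column_name = 'time';
```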

Contributor

@mkindahl mkindahl left a comment

Looks good. I have two suggestions on the code (and a few nits, feel free to ignore), but I do not consider them blockers, so approving:

  • I do not think you need to extend the "old" add_dimension function. The new function is built to support this kind of extension, so we only need to retain the old functions for the "simple" case. It is also easier to add this later if necessary, but hard to remove it later if it turns out not to be used.
  • Lower-case c and upper-case C are going to be easy to mix up when deep in debugging. It might be good to use a different lower-case letter.

Resolved review threads: sql/ddl_api.sql (outdated), sql/pre_install/tables.sql, sql/updates/latest-dev.sql (outdated)
Comment on lines +548 to +88
ELSE
'a' -- any.. This should never happen
Contributor

Wouldn't NULL normally be used to indicate "undefined"? Is there a reason why NULL can't be used in this case?

Resolved review thread: sql/updates/reverse-dev.sql (outdated)
src/dimension.h Outdated
Comment on lines 26 to 27
DIMENSION_TYPE_CLOSED = 'c',
DIMENSION_TYPE_CORRELATED = 'C',
Contributor

Lower-case and upper-case c are going to be easy to mix up; I already do this with similar types in PG. I suggest using a different letter.

Resolved review threads: src/dimension.h (outdated), src/dimension_slice.c (outdated)
Comment on lines 91 to +94
SELECT * FROM _timescaledb_catalog.dimension;
id | hypertable_id | column_name | column_type | aligned | num_slices | partitioning_func_schema | partitioning_func | interval_length | compress_interval_length | integer_now_func_schema | integer_now_func
----+---------------+-------------+--------------------------+---------+------------+--------------------------+-------------------+-----------------+--------------------------+-------------------------+------------------
2 | 2 | time | timestamp with time zone | t | | | | 86400000000 | | |
id | hypertable_id | column_name | column_type | aligned | num_slices | partitioning_func_schema | partitioning_func | interval_length | compress_interval_length | integer_now_func_schema | integer_now_func | type
----+---------------+-------------+--------------------------+---------+------------+--------------------------+-------------------+-----------------+--------------------------+-------------------------+------------------+------
2 | 2 | time | timestamp with time zone | t | | | | 86400000000 | | | | o
Contributor

Maybe switch to vertical display since you're changing the output here anyway. It will make it easier to review in the future because changes in the output will be highlighted better.
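In a regression test, the suggestion amounts to something like this (sketch using psql's expanded display):

```sql
-- Sketch: psql's expanded ("vertical") display shows one column per
-- line, so future changes to the output diff much more cleanly.
\x on
SELECT * FROM _timescaledb_catalog.dimension;
\x off
```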

Resolved review thread: tsl/test/sql/correlated_constraints.sql (outdated)
Contributor

@antekresic antekresic left a comment

I have several issues with this PR. I don't think this can work by relying on adaptive chunking code, which has an unproven track record and was written before compression even existed.

Relying on indexes to update the constraint ranges will not work, and doing heap scans on uncompressed chunk regions will not work either. Updating ranges needs to include the compressed data. Creating those ranges before compression also makes very little sense.

Updating ranges by hooking into the function that sets the chunk as partial is going to hurt DML on compressed chunks, which is already a big pain point.

This needs a lot more testing, especially with updates on the column that has the correlated constraint set; I don't see such a test at all.

Finally, a tsbench run showing the performance improvement this promises is a must, since this is an optimization.

@nikkhils
Contributor Author

nikkhils commented May 9, 2024

> I have several issues with this PR. I don't think this can work by relying on adaptive chunking code which has an unproven track record and was written before compression even existed.

@antekresic

We are not relying on the adaptive chunking code at all. We have fixed the issues in the function that it used for calculating min/max for a column using indexes, nothing else.

> Relying on indexes to update the constraint ranges will not work, doing heap scans on uncompressed chunk regions will also not work. Updating ranges needs to include the compressed data. Creating those ranges before compression also makes very little sense.

The ranges are calculated and added when we call compress_chunk, and AFTER a strong lock has been taken on the chunk being compressed. How can the ranges change in that case?

> Updating ranges by hooking into the function that sets the chunk as partial is going to hurt DML on compressed chunks which is already a big pain point.

We do NOT update the range; we just mark it INVALID once as part of marking the chunk partial. The range will be re-calculated only when the partial chunk gets compressed again.

> Needs a lot more testing, especially with updates on the column that has the correlated constraint set, I don't see such a test at all.

As mentioned above, the moment you do DML on a compressed chunk, you mark it partial; that's when we invalidate the range for correlated constraints. This is tested in the PR.

> Finally, a tsbench run showing the performance improvement this promises is a must since this is an optimization.

I did manual runs with data shared by AlexK some time ago and mentioned on the Slack thread that we see the benefits of chunk exclusion, as expected, for queries involving WHERE clauses on these correlated columns.
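The kind of check described can be sketched as follows (illustrative; the table and column names are assumptions carried over from the test examples in this PR):

```sql
-- Sketch: with per-chunk min/max ranges recorded, chunks whose range
-- does not overlap the predicate should be absent from the plan.
EXPLAIN (COSTS OFF)
SELECT * FROM sample_table WHERE sensor_id BETWEEN 100 AND 200;
```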


A "DROP COLUMN" on a column with a correlated constraint on it ends up
removing all relevant entries from the catalog tables.

Also, a new "remove_dimension" API has been introduced to allow removal
of correlated dimension entries from the catalogs.