Add support for correlated constraints #6727
base: main
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #6727     +/- ##
=======================================
+ Coverage   80.06%   81.00%   +0.93%
=======================================
  Files         190      198       +8
  Lines       37181    37499     +318
  Branches     9450     9788     +338
=======================================
+ Hits        29770    30376     +606
- Misses       2997     3208     +211
+ Partials     4414     3915     -499
Have you considered not storing these as dimensions? This seems to be quite an invasive change, and it means that all existing dimension code needs to be checked for whether it handles correlated constraints correctly.
Approved. I had a bunch of nits and suggestions, however. You can go through them and address them as you see fit.
src/dimension_slice.c
Outdated
snprintf(NameStr(constraint->fd.constraint_name),
         NAMEDATALEN,
         "%sconstraint_%d",
         CC_DIM_PREFIX,
         new_slice->fd.id);
Yeah, I asked a similar question about the new prefix we use. Not saying it's wrong, just wondering why we are not hooking into existing code or Postgres functions for naming.
-- recompress the partial chunk
SELECT compress_chunk(:'CH_NAME');
WARNING: no index on "sensor_id" found for correlated constraint on chunk "_hyper_1_1_chunk"
This WARNING is a bit jarring when compressing. I guess this is related to the fact that min/max is found using a seqscan on the chunk, which is slightly inefficient. But the user will have no idea. Furthermore, nothing is strictly wrong here, so it probably should not be a WARNING. I think an info message is more appropriate when defining the correlated constraint (add_dimension?), but not on every recompression.
I realize this message is inherited from adaptive chunking, but there it was used in a different context. I worry that a user would interpret this as something being wrong. It is unclear what the action to take is, if any. If a user isn't expected to do anything about this, I think we should avoid displaying this warning.
Also, as I mentioned above, we might be able to get the min/max as output from the compression in the future, so this might not be relevant in that case.
Fair enough. Converting it into a DEBUG1 for now.
> Yeah, I asked a similar question about the new prefix we use. Not saying it's wrong, just wondering why we are not hooking into existing code or Postgres functions for naming.
Exposed the existing chunk_constraint_dimension_choose_name function and used it now.
cc @fabriziomello
I do wonder about the performance implications of the extra seqscan. Maybe better to implement the optimization of getting the min/max during compression sooner rather than later. Not for this PR, but maybe a follow-up?
It could be good to test that compress_chunk() is not significantly slowed down on bigger chunks.
@erimatnor typically, if the correlated columns are being used in WHERE clauses, then chances are that an index already exists on such a column.
SELECT * FROM _timescaledb_catalog.dimension WHERE type = 'C';
 id | hypertable_id | column_name | column_type | aligned | num_slices | partitioning_func_schema | partitioning_func | interval_length | compress_interval_length | integer_now_func_schema | integer_now_func | type
----+---------------+-------------+-------------+---------+------------+--------------------------+-------------------+-----------------+--------------------------+-------------------------+------------------+------
Maybe select on the dimension_id returned by remove_dimension? Or join in the same query?
Done.
ERROR: no correlated dimension on column "sensor_id"
-- should only work on correlated dimensions
SELECT * from remove_dimension('sample_table', 'time');
ERROR: no correlated dimension on column "time"
I think it would be more informative if the error for other types of dimensions was something like:
range dimension "time" cannot be removed
hint: Only correlated dimensions can be removed.
The current message is slightly confusing because the error assumes that the intention of the user was to remove a correlated dimension.
But just by looking at the API and the function name, it is implied that you can remove any dimension (irrespective of type). So, if a user intentionally tries to remove the "time" dimension (knowing it is a time dimension), the error talks about not finding a correlated dimension. Instead we should let the user know that removing "time"/range dimensions is not supported.
As an aside, I think it would be easy to support removing dimensions if there are no chunks in the hypertable, which might be useful when you are creating the hypertable "schema" and erroneously add a dimension. We don't need to implement that as part of this PR; just saying it is easy to do.
@erimatnor the errhint does elaborate that "remove_dimension can only be used to remove correlated dimension constraints."
Since it can be called on any column, we would have to look up metadata to identify the dimension type of the column before we could say that it cannot be removed.
We will have to ensure that we add documentation for remove_dimension which mentions that it currently works on correlated constraints only.
Looks good. I have two suggestions on the code (and a few nits, feel free to ignore), but I do not consider them blockers, so approving:

- I do not think you need to extend the "old" add_dimension function. The new function is built to support this kind of extension, so we only need to retain the old functions for the "simple" case. It is also easier to add it later if necessary, but hard to remove it later if it turns out not to be used.
- Lower-case c and upper-case C are going to be easy to mix up when deep in debugging. Might be good to use a different lower-case letter.
ELSE
    'a' -- any.. This should never happen
Wouldn't NULL normally be used to indicate "undefined"? Is there a reason why NULL can't be used in this case?
src/dimension.h
Outdated
DIMENSION_TYPE_CLOSED = 'c',
DIMENSION_TYPE_CORRELATED = 'C',
Lower-case and upper-case c are going to be easy to mix up. I already do this with similar types in PG. Suggest using a different letter.
SELECT * FROM _timescaledb_catalog.dimension;
 id | hypertable_id | column_name | column_type              | aligned | num_slices | partitioning_func_schema | partitioning_func | interval_length | compress_interval_length | integer_now_func_schema | integer_now_func
----+---------------+-------------+--------------------------+---------+------------+--------------------------+-------------------+-----------------+--------------------------+-------------------------+------------------
  2 |             2 | time        | timestamp with time zone | t       |            |                          |                   |     86400000000 |                          |                         |
 id | hypertable_id | column_name | column_type              | aligned | num_slices | partitioning_func_schema | partitioning_func | interval_length | compress_interval_length | integer_now_func_schema | integer_now_func | type
----+---------------+-------------+--------------------------+---------+------------+--------------------------+-------------------+-----------------+--------------------------+-------------------------+------------------+------
  2 |             2 | time        | timestamp with time zone | t       |            |                          |                   |     86400000000 |                          |                         |                  | o
Maybe switch to vertical display since you're changing the output here anyway. It will make it easier to review in the future because changes in the output will be highlighted better.
I have several issues with this PR. I don't think this can work by relying on adaptive chunking code, which has an unproven track record and was written before compression even existed.
Relying on indexes to update the constraint ranges will not work, and doing heap scans on uncompressed chunk regions will also not work. Updating ranges needs to include the compressed data. Creating those ranges before compression also makes very little sense.
Updating ranges by hooking into the function that sets the chunk as partial is going to hurt DML on compressed chunks, which is already a big pain point.
This needs a lot more testing, especially with updates on the column that has the correlated constraint set; I don't see such a test at all.
Finally, a tsbench run showing the performance improvement this promises is a must, since this is an optimization.
We are not relying on the adaptive chunking code.
The ranges are calculated and added when we call compress_chunk and AFTER a strong lock has been taken on the chunk being compressed. How can the ranges change in that case?
We do NOT update the range, we just mark them INVALID once as part of marking the chunk partial. The range will be re-calculated next only when the partial chunk gets compressed again.
As mentioned above, the moment you do DML on a compressed chunk, you mark it partial, that's when we invalidate the range for correlated constraints. This is tested in the PR.
I did manual runs with data shared by AlexK some time ago and mentioned on the Slack thread that we see benefits of chunk exclusion as expected for queries involving WHERE clauses on these correlated columns.
Allow users to specify a specific column to be used as a correlated
constraint using the add_dimension() API. We store such correlated
constraints in the dimension-related timescaledb catalog tables. The
"dimension" catalog table has been amended by adding a "type" column.
This column now explicitly stores the type: open, closed or correlated
as appropriate.

We create dimension_slice and chunk_constraint entries for chunks which
have correlated constraints on them. The dimension slice entry will
have -inf/+inf as start/end range initially for a given correlated
constraint, and the chunk_constraint entry will refer back to this slice
entry.

This start/end range will be refreshed later. One of the entry points
is during compression for now.

We can thus store the min/max values for such correlated constraints
in these catalog tables at the per-chunk level. Note that correlated
constraints do not participate in partitioning of the data. Such a
correlated constraint will be used for chunk pruning if the WHERE
clause of a SQL query specifies ranges on such a column.

A "DROP COLUMN" on a column with a correlated constraint on it ends up
removing all relevant entries from the catalog tables. Also, a new
"remove_dimension" API has been introduced to allow removal of
correlated dimension entries from the catalogs.
Disable-check: force-changelog-file
Disable-check: commit-count