Add support for correlated constraints #6727
base: main
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #6727     +/- ##
=======================================
+ Coverage   80.06%   81.00%   +0.93%
=======================================
  Files         190      198       +8
  Lines       37181    37499     +318
  Branches     9450     9788     +338
=======================================
+ Hits        29770    30376     +606
- Misses       2997     3208     +211
+ Partials     4414     3915     -499
Have you considered not storing these as dimensions? This seems to be quite an invasive change, and it means that all existing dimension code needs to be checked for whether it handles correlated constraints correctly.
Approved. I had a bunch of nits and suggestions, however. You can go through them and address them as you see fit.
src/dimension_slice.c
Outdated
snprintf(NameStr(constraint->fd.constraint_name),
         NAMEDATALEN,
         "%sconstraint_%d",
         CC_DIM_PREFIX,
         new_slice->fd.id);
Yeah, I asked a similar question about the new prefix we use. Not saying it's wrong, just wondering why we are not hooking into existing code or Postgres functions for naming.
-- recompress the partial chunk
SELECT compress_chunk(:'CH_NAME');
WARNING: no index on "sensor_id" found for correlated constraint on chunk "_hyper_1_1_chunk"
This WARNING is a bit jarring when compressing. I guess this is related to the fact that min/max is found using a seqscan on the chunk, which is slightly inefficient. But the user will have no idea. Furthermore, nothing is strictly wrong here, so it probably should not be a WARNING. I think an info message is more appropriate when defining the correlated constraint (add_dimension?), but not on every recompression.
I realize this message is inherited from adaptive chunking, but there it was used in a different context. I worry that a user would interpret this as something being wrong. It is unclear what the action to take is, if any. If a user isn't expected to do anything about this, I think we should avoid displaying this warning.
Also, as I mentioned above, we might be able to get the min/max as output from the compression in the future, so this might not be relevant in that case.
Fair enough. Converting it into a DEBUG1 for now.
> Yeah, I asked a similar question about the new prefix we use. Not saying it's wrong, just wondering why we are not hooking into existing code or Postgres functions for naming.
Exposed the existing chunk_constraint_dimension_choose_name function and used it now.
cc @fabriziomello
I do wonder about the performance implications of the extra seqscan. Maybe better to implement the optimization of getting the min/max during compression sooner rather than later. Not for this PR, but maybe a follow-up?
It could be good to test that compress_chunk() is not significantly slowed down on bigger chunks.
@erimatnor typically, if the correlated columns are being used in WHERE clauses, then chances are that an index already exists on such a column.
SELECT * FROM _timescaledb_catalog.dimension WHERE type = 'C';
 id | hypertable_id | column_name | column_type | aligned | num_slices | partitioning_func_schema | partitioning_func | interval_length | compress_interval_length | integer_now_func_schema | integer_now_func | type
----+---------------+-------------+-------------+---------+------------+--------------------------+-------------------+-----------------+--------------------------+-------------------------+------------------+------
Maybe select on the dimension_id returned by remove_dimension? Or join in the same query?
Done.
ERROR: no correlated dimension on column "sensor_id"
-- should only work on correlated dimensions
SELECT * from remove_dimension('sample_table', 'time');
ERROR: no correlated dimension on column "time"
I think it would be more informative if the error for other types of dimensions was something like:
range dimension "time" cannot be removed
hint: Only correlated dimensions can be removed.
The current message is slightly confusing because the error assumes that the intention of the user was to remove a correlated dimension.
But just by looking at the API and the function name, it is implied that you can remove any dimension (irrespective of type). So, if a user intentionally tries to remove the "time" dimension (knowing it is a time dimension), the error talks about not finding a correlated dimension. Instead we should let the user know that removing "time"/range dimensions is not supported.
As an aside, I think it would be easy to support removing dimensions if there are no chunks in the hypertable, which might be useful when you are creating the hypertable "schema" and erroneously add a dimension. We don't need to implement that as part of this PR; just saying it is easy to do.
@erimatnor the errhint does elaborate that "remove_dimension can only be used to remove correlated dimension constraints."
Since it can be called on any column, we would have to look up metadata to identify the dimension type of the column before we could say that it cannot be removed.
We will have to ensure that we add documentation for remove_dimension which mentions that it currently works on correlated constraints only.
Looks good. I have two suggestions on the code (and a few nits, feel free to ignore), but I do not consider them blockers, so approving:

- I do not think you need to extend the "old" add_dimension function. The new function is built to support this kind of extension, so we only need to retain the old functions for the "simple" case. It is also easier to add it later if necessary, but hard to remove it later if it turns out not to be used.
- Lower-case c and upper-case C are going to be easy to mix up when deep in debugging. Might be good to use a different lower-case letter.
ELSE
    'a' -- any.. This should never happen
Wouldn't NULL normally be used to indicate "undefined"? Is there a reason why NULL can't be used in this case?
src/dimension.h
Outdated
DIMENSION_TYPE_CLOSED = 'c',
DIMENSION_TYPE_CORRELATED = 'C',
Lower-case and upper-case c are going to be easy to mix up. I already do this with similar types in PG. Suggest using a different letter.
SELECT * FROM _timescaledb_catalog.dimension;
 id | hypertable_id | column_name | column_type              | aligned | num_slices | partitioning_func_schema | partitioning_func | interval_length | compress_interval_length | integer_now_func_schema | integer_now_func
----+---------------+-------------+--------------------------+---------+------------+--------------------------+-------------------+-----------------+--------------------------+-------------------------+------------------
  2 |             2 | time        | timestamp with time zone | t       |            |                          |                   |     86400000000 |                          |                         |
 id | hypertable_id | column_name | column_type              | aligned | num_slices | partitioning_func_schema | partitioning_func | interval_length | compress_interval_length | integer_now_func_schema | integer_now_func | type
----+---------------+-------------+--------------------------+---------+------------+--------------------------+-------------------+-----------------+--------------------------+-------------------------+------------------+------
  2 |             2 | time        | timestamp with time zone | t       |            |                          |                   |     86400000000 |                          |                         |                  | o
Maybe switch to vertical display since you're changing the output here anyway. It will make it easier to review in the future because changes in the output will be highlighted better.
I have several issues with this PR. I don't think this can work by relying on adaptive chunking code, which has an unproven track record and was written before compression even existed.
Relying on indexes to update the constraint ranges will not work, and doing heap scans on uncompressed chunk regions will also not work. Updating ranges needs to include the compressed data. Creating those ranges before compression also makes very little sense.
Updating ranges by hooking into the function that sets the chunk as partial is going to hurt DML on compressed chunks, which is already a big pain point.
This needs a lot more testing, especially with updates on the column that has the correlated constraint set; I don't see such a test at all.
Finally, a tsbench run showing the performance improvement this promises is a must, since this is an optimization.
We are not relying on the adaptive chunking code.
The ranges are calculated and added when we call compress_chunk and AFTER a strong lock has been taken on the chunk being compressed. How can the ranges change in that case?
We do NOT update the range, we just mark them INVALID once as part of marking the chunk partial. The range will be re-calculated next only when the partial chunk gets compressed again.
As mentioned above, the moment you do DML on a compressed chunk, you mark it partial, that's when we invalidate the range for correlated constraints. This is tested in the PR.
I did manual runs with data shared by AlexK some time ago and mentioned on the Slack thread that we see benefits of chunk exclusion as expected for queries involving WHERE clauses on these correlated columns.
Allow users to specify a specific column to be used as a correlated
constraint using the add_dimension() API. We store such correlated
constraints in the dimension-related timescaledb catalog tables. The
"dimension" catalog table has been amended by adding a "type" column.
This column now explicitly stores the type: open, closed or correlated
as appropriate.

We create dimension_slice and chunk_constraint entries for chunks which
have correlated constraints on them. The dimension slice entry will
have -inf/+inf as start/end range initially for a given correlated
constraint, and the chunk_constraint entry will refer back to this slice
entry.

This start/end range will be refreshed later. One of the entry points
is during compression for now.

We can thus store the min/max values for such correlated constraints
in these catalog tables at the per-chunk level. Note that correlated
constraints do not participate in partitioning of the data. Such a
correlated constraint will be used for chunk pruning if the WHERE
clause of a SQL query specifies ranges on such a column.

A "DROP COLUMN" on a column with a correlated constraint on it ends up
removing all relevant entries from the catalog tables. Also, a new
"remove_dimension" API has been introduced to allow removal of
correlated dimension entries from the catalogs.
Disable-check: force-changelog-file
Disable-check: commit-count