Support ALTER TABLE ADD PRIMARY KEY statements #11895
Conversation
I'm not sure this is the best time to add this feature, as there is a pending PR to rework the binding of indexes (#11867).

Thanks for the quick review! Regarding #11867, I guess that would affect the part about indexes, which should affect only a small part of the code from this PR.

I fixed all comments, let me know if there is anything else I can change!
I renamed it to AddConstraintIndex. By its name, the method belongs there, and I need it later in the DataTable constructor to handle the addition of unique indexes.

Now, when a new unique index is created and a modified instance of the storage is made, the DataTable automatically adds an ART index and indexes the existing rows, throwing an error if uniqueness is not satisfied.

This makes code that needs to access the index after adding it to the table simpler.
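The behavior described above can be sketched roughly as follows. Note that `UniqueIndexSketch` and its methods are hypothetical stand-ins for illustration only, not DuckDB's actual DataTable/ART classes:

```cpp
#include <set>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for an ART-backed unique index; DuckDB's real
// classes are far more involved.
class UniqueIndexSketch {
public:
	// Index all existing rows, throwing if any key appears twice.
	void IndexExistingRows(const std::vector<int> &keys) {
		for (auto key : keys) {
			if (!entries.insert(key).second) {
				throw std::runtime_error("PRIMARY KEY constraint violated: duplicate key");
			}
		}
	}
	bool Contains(int key) const {
		return entries.count(key) > 0;
	}

private:
	std::set<int> entries;
};
```

The point is simply that the uniqueness check happens eagerly while the existing data is being indexed, so callers do not need a separate validation pass afterwards.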
Looks like you need to run the CI again. Until all of the CI is green on your own fork, though, it's fine to leave it like this 👍
This is not necessary because creating a unique index automatically checks the uniqueness.
This PR is now ready for review. I updated the code to the latest master branch and fixed a few additional things.
Hi @frapa, thanks for all your work on this new feature! 🎉 @Tishj asked me to have a go at reviewing your changes. As I'll be reviewing this PR, it might help me if you could close all of the comments that you've resolved. Also, as this is a new feature, it makes sense to rebase the PR.
Hi @frapa! Thanks for picking up this long-standing feature request. The PR is going in a great direction. I've added my comments, most importantly I've requested a lot of additional tests (around concurrency).
Can we have tests to cover the following scenarios?

- Multiple transactions add the same primary key constraint, but only one can commit successfully. All others need to throw an exception on `COMMIT`.
- A few storage tests. Helpful settings can include the following. Also a test with a persistent database.
- Add a PK to an attached database, detach, and reattach.

```
require skip_reload
SET wal_autocheckpoint='1TB';
PRAGMA disable_checkpoint_on_shutdown;
restart
CHECKPOINT;
```
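For the first bullet, one possible shape using `concurrentloop` could be the following. This is only a sketch: the `statement maybe` semantics and the expected error text should be checked against the sqllogictest documentation before relying on it:

```
statement ok
CREATE TABLE tbl (id INTEGER);

concurrentloop i 0 8

statement maybe
ALTER TABLE tbl ADD PRIMARY KEY (id);
----

endloop
```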
Can we have tests to cover the following scenarios?

- Creation fails due to a `NULL` value violation.
- Creation fails because the key type is not supported.
- Creation fails because some types are not supported.
- Switch/shuffle column order.
- Creation fails because there is a PK on other columns.
- Adding a PK to a table+column that already has a `UNIQUE` index must succeed.
- Dropping a table with an added PK constraint, then creating the same table again, and adding that same constraint again.
```cpp
auto &result = *index;
indexes.push_back(std::move(index));
return result;
```
Nit. Can we rewrite this?

```cpp
lock_guard<mutex> lock(indexes_lock);
indexes.push_back(std::move(index));
return *indexes.back();
```
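The suggested rewrite is safe because the returned reference points at the heap allocation owned by the `unique_ptr`, which does not move even when the vector reallocates its pointer storage. A minimal self-contained sketch (the `Index` and `IndexList` types here are hypothetical, not DuckDB's):

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical index type standing in for DuckDB's Index class.
struct Index {
	explicit Index(int id_p) : id(id_p) {
	}
	int id;
};

class IndexList {
public:
	// Lock, move the pointer into the list, then return the pointee
	// through the list itself, as suggested in the review comment.
	Index &AddIndex(std::unique_ptr<Index> index) {
		std::lock_guard<std::mutex> lock(indexes_lock);
		indexes.push_back(std::move(index));
		return *indexes.back();
	}
	std::size_t Count() {
		std::lock_guard<std::mutex> lock(indexes_lock);
		return indexes.size();
	}

private:
	std::mutex indexes_lock;
	std::vector<std::unique_ptr<Index>> indexes;
};
```

References obtained from `AddIndex` remain valid across later insertions, which is what makes returning `*indexes.back()` equivalent to the original `auto &result = *index;` pattern.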
```cpp
physical_index_set_t not_null_columns;
vector<PhysicalIndex> primary_keys;
for (const auto &bound_constr : bound_constraints) {
	if (bound_constr->type == ConstraintType::NOT_NULL) {
```
Nit: Let's revert this back to a `switch`.
```cpp
return bound_constraints;
}

unique_ptr<BoundConstraint> Binder::BindConstraint(Constraint &constraint, const string &table_name,
```
For readability, we can split this into separate functions, like the previously used `BindCheckConstraint` function. Then the `switch` case becomes very readable, and we move the logic into dedicated code blocks.
```cpp
bound_expressions.reserve(columns.size());
for (const auto &column_p : columns) {
	auto &column = column_p.get();
	D_ASSERT(!column.Generated());
```
Should we add a test with generated columns?
```cpp
// fetch types and create expressions for the index from the columns
vector<column_t> column_ids;
vector<unique_ptr<Expression>> unbound_expressions;
vector<unique_ptr<Expression>> bound_expressions;
```
We never use this.
```cpp
	return info->indexes.AddIndex(std::move(art));
}

void DataTable::IndexExistingData(ClientContext &context, Index &index) {
```
This function does not work currently. The row ids need to correspond to the row ids in the storage, which are not always sequential. Otherwise, the row ids we physically store in the index are incorrect. Then, if we try to use an index scan, which uses the row ids of the index, we'll get wrong results.

You need to push the `row_id` identifier to the column ids: `cids.push_back(COLUMN_IDENTIFIER_ROW_ID)`. Then, your chunk will contain an additional vector with the correct row ids. That's the row identifier vector you need to pass to `bound_index.Append`.
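To illustrate why a sequential counter produces wrong row ids after a deletion, here is a small self-contained sketch. The `BuildIndex` helper is hypothetical, not DuckDB code; `row_ids` plays the role of the extra row identifier vector obtained via `COLUMN_IDENTIFIER_ROW_ID`:

```cpp
#include <cstddef>
#include <map>
#include <vector>

using row_t = long long;

// Hypothetical key -> physical row id index. The caller must supply the
// storage's actual row ids, which can have gaps after deletions.
std::map<int, row_t> BuildIndex(const std::vector<int> &keys,
                                const std::vector<row_t> &row_ids) {
	std::map<int, row_t> index;
	for (std::size_t i = 0; i < keys.size(); i++) {
		index[keys[i]] = row_ids[i];
	}
	return index;
}
```

With the actual ids, a lookup lands on the right physical row; with a fresh sequential counter, every key after the deleted row points one slot too early, which is exactly the kind of error an index scan would surface.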
Also, can we spell out the name (`column_ids` instead of `cids`) for readability?
Here is a test to include which currently breaks. The optimized query plan uses an index scan.

```
statement ok
PRAGMA enable_verification

statement ok
CREATE TABLE integers(i INTEGER)

statement ok
INSERT INTO integers SELECT * FROM range(50000);

statement ok
DELETE FROM integers WHERE i = 42;

statement ok
ALTER TABLE integers ADD PRIMARY KEY (i);

query I
SELECT i FROM integers WHERE i = 100
----
100
```
Additionally, there are important scenarios not yet covered by the tests in this PR. What happens if you are scanning the table to build your index while another transaction alters the table and commits its changes? Or drops the table? You can write tests for this by using the `concurrentloop` functionality. Have a look at this test: `test/sql/parallelism/interquery/concurrent_appends.test`. Prior to the `concurrentloop` functionality, we were writing these tests in dedicated `.cpp` files. I.e., for the (older) index functionality, you can have a look at this file: `test/sql/parallelism/interquery/test_concurrent_index.cpp`.
Any time you add a PK constraint, or alter the data in a table after adding the PK constraint, it can help to use an index scan to additionally verify the row ids in the ART. Currently, your tests only cover that the ART keys are valid, which we use for constraint checking. The data in the ART is not tested. We 'only' use it for index scans, but we cannot break them.
This PR implements `ALTER TABLE table ADD PRIMARY KEY (col1, ...);` statements, as per #57.

I suggest reviewing by commit, as I tried to make them thematic and self-contained. However, the first commits basically make changes needed for the last 2-3 commits, so please have a look at the final ones if something is unclear.

Let me know if there is anything I can improve.