Support `ALTER TABLE ADD PRIMARY KEY` statements #11895

frapa · 2024-05-01T16:49:06Z

This PR implements ALTER TABLE table ADD PRIMARY KEY (col1, ...); statements, as per #57.

I suggest reviewing by commit, as I tried to make them thematic and self-contained. However, the first commits basically do changes needed for the last 2-3 commits, so please have a look at the final ones if something's unclear.

Let me know if there is anything I can improve.

Tishj · 2024-05-01T18:00:29Z

I'm not sure this is the best time to add this feature, as there is a pending PR to rework binding of Indexes (#11867)

src/catalog/catalog_entry/duck_table_entry.cpp

src/planner/binder/statement/bind_create_table.cpp

src/catalog/catalog_entry/duck_table_entry.cpp

src/storage/data_table.cpp

src/catalog/catalog_entry/duck_table_entry.cpp

src/storage/data_table.cpp

src/storage/serialization/serialize_parse_info.cpp

src/storage/table_index_list.cpp

frapa · 2024-05-01T19:55:50Z

Thanks for the quick review!

Regarding #11867, I guess that would affect the part about indexes, which should affect only small part of the code from this PR (the DataTable constructor). Review of the rest should be safe.

frapa · 2024-05-05T18:49:51Z

I fixed all comments, let me know if there is anything else I can change!

I renamed it AddConstraintIndex. By the name, this method belongs there, and I need it later in the DataTable class constructor to handle addition of unique indices.

Now, when a new unique index is created and a modified instance of the storage is made, the DataTable will automatically add an ART index and index existing rows, throwing an error if the uniqueness is not satisfied.

This makes code that needs to access the index after the addition to the table simpler.

Tishj · 2024-05-06T12:58:23Z

Looks like you need to run python3 scripts/generate_enum_util.py
Also you'll need to mark the PR as Ready to Review (bottom of the page) to let the CI run here

Until all the CI is green on your own fork though it's fine to leave it like this 👍

…alter_table_add_primary_key

This is not necessary because creating a unique index automatically checks the uniqueness.

…alter_table_add_primary_key # Conflicts: # src/include/duckdb/storage/table/table_index_list.hpp

…alter_table_add_primary_key

frapa · 2024-05-17T06:40:35Z

This PR is now ready for review, I updated the code to the master branch and fixed a few additional things:

I fixed the code after Rework index binding #11867 (basically just cast to bound index).
I added a few tests and fixes for transactions (there was a bug where indexes were retained even after rollback, caused by missing cloning of a vector).

taniabogatsch · 2024-05-22T09:18:47Z

Hi @frapa, thanks for all your work on this new feature! 🎉 @Tishj asked me to have a go at reviewing your changes. As I'll be reviewing this PR, it might help me if you could close all of the comments that you've resolved. Also, as this is a new feature, it makes sense to rebase the PR against the feature branch.

taniabogatsch

Hi @frapa! Thanks for picking up this long-standing feature request. The PR is going in a great direction. I've added my comments, most importantly I've requested a lot of additional tests (around concurrency).

taniabogatsch · 2024-05-22T10:58:18Z

test/sql/alter/alter_col/test_add_primary_key_transaction.test

Can we have tests to cover the following scenario(s)?

Multiple transactions add the same primary key constraint, but only one can commit successfully. All others need to throw an exception on COMMIT.

A few storage tests. Helpful settings can include the following. Also a test with a persistent database.

Add a PK to an attached database, detach, and reattach.

require skip_reload SET wal_autocheckpoint='1TB'; PRAGMA disable_checkpoint_on_shutdown; restart CHECKPOINT;

taniabogatsch · 2024-05-22T11:04:52Z

test/sql/alter/alter_col/test_add_primary_key.test

Can we have tests to cover the following scenarios?

Creation fails due to NULL value violation.

Creation fails because the key type is not supported.

Creation fails because some types are not supported.

Switch/shuffle column order.

Creation fails because there is a PK on other columns.

Adding a PK to a table+column that already has a UNIQUE index must succeed.

Dropping a table with an added PK constraint, then creating the same table again, and adding that same constraint again.

taniabogatsch · 2024-05-22T11:21:06Z

src/storage/table_index_list.cpp

+	auto &result = *index;
 	indexes.push_back(std::move(index));
+	return result;


Nit. Can we rewrite this?

lock_guard<mutex> lock(indexes_lock); indexes.push_back(std::move(index)); return *indexes.back();

taniabogatsch · 2024-05-22T11:35:36Z

src/planner/binder/statement/bind_create_table.cpp

+	physical_index_set_t not_null_columns;
+	vector<PhysicalIndex> primary_keys;
+	for (const auto &bound_constr : bound_constraints) {
+		if (bound_constr->type == ConstraintType::NOT_NULL) {


Nit: Let's revert this back to a switch.

taniabogatsch · 2024-05-22T11:38:32Z

src/planner/binder/statement/bind_create_table.cpp

 	return bound_constraints;
 }

+unique_ptr<BoundConstraint> Binder::BindConstraint(Constraint &constraint, const string &table_name,


For readability, we can split this into separate functions, like the previously used BindCheckConstraint function. Then the switch case becomes very readable, and we move the logic into dedicated code blocks.

taniabogatsch · 2024-05-22T12:22:41Z

src/storage/data_table.cpp

+	bound_expressions.reserve(columns.size());
+	for (const auto &column_p : columns) {
+		auto &column = column_p.get();
+		D_ASSERT(!column.Generated());


Should we add a test with generated columns?

taniabogatsch · 2024-05-22T12:24:53Z

src/storage/data_table.cpp

+	// fetch types and create expressions for the index from the columns
+	vector<column_t> column_ids;
+	vector<unique_ptr<Expression>> unbound_expressions;
+	vector<unique_ptr<Expression>> bound_expressions;


We never use this.

taniabogatsch · 2024-05-22T12:59:02Z

src/storage/data_table.cpp

+	return info->indexes.AddIndex(std::move(art));
+}
+
+void DataTable::IndexExistingData(ClientContext &context, Index &index) {


This function does not work currently. The row ids need to correspond to the row ids in the storage, which are not always sequential. Otherwise, the row ids we physically store in the index are incorrect. Then, if we try to use an index scan, which uses the row ids of the index, we'll get wrong results.

You need to push the row_id identifier to the column ids: cids.push_back(COLUMN_IDENTIFIER_ROW_ID). Then, your chunk will contain an additional vector with the correct row ids. That's the row identifier vector you need to pass to bound_index.Append.

Also, can we spell out the name (column_ids instead of cids) for readability?

Here is a test to include which currently breaks. The optimized query plan uses an index scan.

statement ok PRAGMA enable_verification statement ok CREATE TABLE integers(i integer) statement ok INSERT INTO integers SELECT * FROM range(50000); statement ok DELETE FROM integers WHERE i = 42; statement ok ALTER TABLE integers ADD PRIMARY KEY (i); query I SELECT i FROM integers WHERE i = 100 ---- 100

taniabogatsch · 2024-05-22T13:03:28Z

test/sql/alter/alter_col/test_add_primary_key_transaction.test

Additionally, there are important scenarios not yet covered in the tests in this PR. What happens if you are scanning the table to build your index while another transactions alters the table and commits its changes? Or drops the table? You can write tests for this by using the concurrentloop functionality. Have a look at this test test/sql/parallelism/interquery/concurrent_appends.test. Prior to the concurrentloop functionality, we were writing these tests in dedicated .cpp files. I.e., for the (older) index functionality, you can have a look at this file: test/sql/parallelism/interquery/test_concurrent_index.cpp.

taniabogatsch · 2024-05-22T13:07:15Z

test/sql/alter/alter_col/test_add_primary_key.test

Any time you add a PK constraint, or alter the data in a table after adding the PK constraint, it can help to use an index scan to additionally verify the row ids in the ART. Currently, your tests only cover that the ART keys are valid, which we use for constraint checking. The data in the ART is not tested. We 'only' use it for index scans, but we cannot break them.

szarnyasg requested a review from Mytherin May 1, 2024 16:55

szarnyasg added the Ready For Review label May 1, 2024