Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support ALTER TABLE ADD PRIMARY KEY statements #11895

Open
wants to merge 23 commits into
base: main
Choose a base branch
from

Conversation

frapa
Copy link

@frapa frapa commented May 1, 2024

This PR implements ALTER TABLE table ADD PRIMARY KEY (col1, ...); statements, as per #57.

I suggest reviewing by commit, as I tried to make them thematic and self-contained. However, the first commits basically do changes needed for the last 2-3 commits, so please have a look at the final ones if something's unclear.

Let me know if there is anything I can improve.

@Tishj
Copy link
Contributor

Tishj commented May 1, 2024

I'm not sure this is the best time to add this feature, as there is a pending PR to rework binding of Indexes (#11867)

src/storage/data_table.cpp Outdated Show resolved Hide resolved
src/storage/data_table.cpp Outdated Show resolved Hide resolved
src/storage/data_table.cpp Outdated Show resolved Hide resolved
@frapa
Copy link
Author

frapa commented May 1, 2024

Thanks for the quick review!

Regarding #11867, I guess that would affect the part about indexes, which should affect only small part of the code from this PR (the DataTable constructor). Review of the rest should be safe.

@frapa
Copy link
Author

frapa commented May 5, 2024

I fixed all comments, let me know if there is anything else I can change!

@frapa frapa force-pushed the support_alter_table_add_primary_key branch from f14b1f4 to e05ed09 Compare May 5, 2024 19:16
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 5, 2024 19:16
@frapa frapa force-pushed the support_alter_table_add_primary_key branch from e05ed09 to e7d1570 Compare May 5, 2024 19:20
@frapa frapa marked this pull request as ready for review May 5, 2024 19:20
@frapa frapa force-pushed the support_alter_table_add_primary_key branch from e7d1570 to ce62349 Compare May 5, 2024 19:27
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 5, 2024 19:28
@frapa frapa marked this pull request as ready for review May 5, 2024 19:28
frapa added 7 commits May 5, 2024 21:35
I renamed it AddConstraintIndex.

By the name, this method belongs there,
and I need it later in the DataTable class
constructor to handle addition of unique indices.
Now, when a new unique index is created and a modified
instance of the storage is made, the DataTable will
automatically add an ART index and index existing
rows, throwing an error if the uniqueness is not
satisfied.
This makes code that needs to access the index
after the addition to the table simpler.
@frapa frapa force-pushed the support_alter_table_add_primary_key branch from ce62349 to 70b552a Compare May 5, 2024 19:35
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 5, 2024 19:35
@Tishj
Copy link
Contributor

Tishj commented May 6, 2024

Looks like you need to run python3 scripts/generate_enum_util.py
Also you'll need to mark the PR as Ready to Review (bottom of the page) to let the CI run here

Until all the CI is green on your own fork though it's fine to leave it like this 👍

@frapa frapa marked this pull request as ready for review May 8, 2024 06:25
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 8, 2024 06:29
@frapa frapa force-pushed the support_alter_table_add_primary_key branch from 1f8d27d to 5f19c75 Compare May 8, 2024 19:08
@frapa frapa marked this pull request as ready for review May 13, 2024 06:51
frapa added 3 commits May 13, 2024 20:02
This is not necessary because
creating a unique index
automatically checks the uniqueness.
…alter_table_add_primary_key

# Conflicts:
#	src/include/duckdb/storage/table/table_index_list.hpp
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 13, 2024 18:27
@frapa frapa marked this pull request as ready for review May 16, 2024 18:38
@frapa
Copy link
Author

frapa commented May 17, 2024

This PR is now ready for review, I updated the code to the master branch and fixed a few additional things:

  • I fixed the code after Rework index binding #11867 (basically just cast to bound index).
  • I added a few tests and fixes for transactions (there was a bug where indexes were retained even after rollback, caused by missing cloning of a vector).

@taniabogatsch
Copy link
Contributor

taniabogatsch commented May 22, 2024

Hi @frapa, thanks for all your work on this new feature! 🎉 @Tishj asked me to have a go at reviewing your changes. As I'll be reviewing this PR, it might help me if you could close all of the comments that you've resolved. Also, as this is a new feature, it makes sense to rebase the PR against the feature branch.

Copy link
Contributor

@taniabogatsch taniabogatsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @frapa! Thanks for picking up this long-standing feature request. The PR is going in a great direction. I've added my comments, most importantly I've requested a lot of additional tests (around concurrency).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have tests to cover the following scenario(s)?

  • Multiple transactions add the same primary key constraint, but only one can commit successfully. All others need to throw an exception on COMMIT.
  • A few storage tests. Helpful settings can include the following. Also a test with a persistent database.
  • Add a PK to an attached database, detach, and reattach.
require skip_reload
SET wal_autocheckpoint='1TB';
PRAGMA disable_checkpoint_on_shutdown;
restart
CHECKPOINT;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have tests to cover the following scenarios?

  • Creation fails due to NULL value violation.
  • Creation fails because the key type is not supported.
  • Creation fails because some types are not supported.
  • Switch/shuffle column order.
  • Creation fails because there is a PK on other columns.
  • Adding a PK to a table+column that already has a UNIQUE index must succeed.
  • Dropping a table with an added PK constraint, then creating the same table again, and adding that same constraint again.

Comment on lines +17 to +19
auto &result = *index;
indexes.push_back(std::move(index));
return result;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit. Can we rewrite this?

lock_guard<mutex> lock(indexes_lock);
indexes.push_back(std::move(index));
return *indexes.back();

physical_index_set_t not_null_columns;
vector<PhysicalIndex> primary_keys;
for (const auto &bound_constr : bound_constraints) {
if (bound_constr->type == ConstraintType::NOT_NULL) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Let's revert this back to a switch.

return bound_constraints;
}

unique_ptr<BoundConstraint> Binder::BindConstraint(Constraint &constraint, const string &table_name,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For readability, we can split this into separate functions, like the previously used BindCheckConstraint function. Then the switch case becomes very readable, and we move the logic into dedicated code blocks.

bound_expressions.reserve(columns.size());
for (const auto &column_p : columns) {
auto &column = column_p.get();
D_ASSERT(!column.Generated());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a test with generated columns?

// fetch types and create expressions for the index from the columns
vector<column_t> column_ids;
vector<unique_ptr<Expression>> unbound_expressions;
vector<unique_ptr<Expression>> bound_expressions;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We never use this.

return info->indexes.AddIndex(std::move(art));
}

void DataTable::IndexExistingData(ClientContext &context, Index &index) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function does not work currently. The row ids need to correspond to the row ids in the storage, which are not always sequential. Otherwise, the row ids we physically store in the index are incorrect. Then, if we try to use an index scan, which uses the row ids of the index, we'll get wrong results.

You need to push the row_id identifier to the column ids: cids.push_back(COLUMN_IDENTIFIER_ROW_ID). Then, your chunk will contain an additional vector with the correct row ids. That's the row identifier vector you need to pass to bound_index.Append.

Also, can we spell out the name (column_ids instead of cids) for readability?

Here is a test to include which currently breaks. The optimized query plan uses an index scan.

statement ok
PRAGMA enable_verification

statement ok
CREATE TABLE integers(i integer)

statement ok
INSERT INTO integers SELECT * FROM range(50000);

statement ok
DELETE FROM integers WHERE i = 42;

statement ok
ALTER TABLE integers ADD PRIMARY KEY (i);

query I
SELECT i FROM integers WHERE i = 100
----
100

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, there are important scenarios not yet covered in the tests in this PR. What happens if you are scanning the table to build your index while another transactions alters the table and commits its changes? Or drops the table? You can write tests for this by using the concurrentloop functionality. Have a look at this test test/sql/parallelism/interquery/concurrent_appends.test. Prior to the concurrentloop functionality, we were writing these tests in dedicated .cpp files. I.e., for the (older) index functionality, you can have a look at this file: test/sql/parallelism/interquery/test_concurrent_index.cpp.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any time you add a PK constraint, or alter the data in a table after adding the PK constraint, it can help to use an index scan to additionally verify the row ids in the ART. Currently, your tests only cover that the ART keys are valid, which we use for constraint checking. The data in the ART is not tested. We 'only' use it for index scans, but we cannot break them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants