Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

catalog: support composite primary key #625

Open
skyzh opened this issue Apr 14, 2022 · 18 comments
Open

catalog: support composite primary key #625

skyzh opened this issue Apr 14, 2022 · 18 comments
Assignees
Labels
enhancement New feature or request

Comments

@skyzh
Copy link
Member

skyzh commented Apr 14, 2022

Currently, our catalog cannot handle the case for composite primary key. For example,

create table t1 (x int not null, y int not null, primary key(x, y));
create table t2 (x int not null, y int not null, primary key(y, x));

These two tables will have exactly the same catalog entry, despite that the first one should have (x, y) as composite sort key, and the second one have (y, x). The order is different!

Therefore, we will need to refactor the catalog to correctly record such information. For TableCatalog, we should have a field primary_key_indices, and remove the is_primary field from ColumnDesc.

This task might involve some refactors and some regressions in functionalities. Need plan first.

@skyzh
Copy link
Member Author

skyzh commented Apr 14, 2022

... may also be done together with #590

@skyzh
Copy link
Member Author

skyzh commented Apr 14, 2022

assigned to @shmiwy

@shmiwy
Copy link
Contributor

shmiwy commented Apr 18, 2022

@skyzh
It seems that we don't support primary key? when I type this sql

create table t (a int not null, b int not null, c int, d int, primary key(a, b));

the physicalCreateTable says that neither a nor b is primary key🤣
image

@skyzh
Copy link
Member Author

skyzh commented Apr 18, 2022

Then our binder doesn't support such syntax now 🤣 Please also help refactor the binder to support primary key (a, b) syntax.

Statement::CreateTable { name, columns, .. } => {
let name = &lower_case_name(name);
let (database_name, schema_name, table_name) = split_name(name)?;

For example, the first statement produces AST like:

        CreateTable {
            or_replace: false,
            temporary: false,
            external: false,
            if_not_exists: false,
            name: ObjectName([Ident {
                value: "t1",
                quote_style: None,
            }]),
            columns: [
                ColumnDef {
                    name: Ident {
                        value: "x",
                        quote_style: None,
                    },
                    data_type: Int(None),
                    collation: None,
                    options: [ColumnOptionDef {
                        name: None,
                        option: NotNull,
                    }],
                },
                ColumnDef {
                    name: Ident {
                        value: "y",
                        quote_style: None,
                    },
                    data_type: Int(None),
                    collation: None,
                    options: [ColumnOptionDef {
                        name: None,
                        option: NotNull,
                    }],
                },
            ],
            constraints: [Unique {
                name: None,
                columns: [
                    Ident {
                        value: "x",
                        quote_style: None,
                    },
                    Ident {
                        value: "y",
                        quote_style: None,
                    },
                ],
                is_primary: true,
            }],
            hive_distribution: NONE,
            hive_formats: Some(HiveFormat {
                row_format: None,
                storage: None,
                location: None,
            }),
            table_properties: [],
            with_options: [],
            file_format: None,
            location: None,
            query: None,
            without_rowid: false,
            like: None,
            engine: None,
            default_charset: None,
            collation: None,
        };

We can read the constraints variable and assign the primary keys.

@skyzh
Copy link
Member Author

skyzh commented Apr 18, 2022

create table t (a int not null primary key, b int not null, c int, d int);

By the way, currently the binder only supports primary key at this location 🤣

@shmiwy
Copy link
Contributor

shmiwy commented Apr 18, 2022

Can we add a field called primary_key_indices in TableCatalog without removng is_primary field from ColumnDesc?
I think we can just change the new function for TableCatalog(src/ catalog/ table.rs) around these line,

for col_catalog in columns {
table_catalog.add_column(col_catalog).unwrap();
}
table_catalog

add a judgment about col_catalog.is_primary() before line 44. If it is true, we can record the corresponding col_id in somewhere (maybe a vector). When we end up the for loop, we can change the primary_key_indices.

After we have created the a table, it seems that subsequent queries will not modify the catalog data like primary keys' id of the table. So it seems acceptable that we store redundant data about primary keys.🤣

@shmiwy
Copy link
Contributor

shmiwy commented Apr 18, 2022

By the way, changing the binder to supports this kind of sql query seems easy.

create table t(a int not null, b int not null, c int, d int, primary key(a, b));

(emmm, also I still have one weird mistake I haven't figured out, one sql logic test falled????)🤪

@skyzh
Copy link
Member Author

skyzh commented Apr 18, 2022

(emmm, also I still have one weird mistake I haven't figured out, one sql logic test falled????)🤪

You may send a draft PR and we can investigate it...

Can we add a field called primary_key_indices in TableCatalog without removng is_primary field from ColumnDesc?
I think we can just change the new function for TableCatalog(src/ catalog/ table.rs) around these line,

Of course we can :) But from my perspective, is_primary won't be used after we have refactored all occurrences. We can have such intermediate state, and refactor it later.

@shmiwy
Copy link
Contributor

shmiwy commented Apr 18, 2022

But from my perspective, is_primary won't be used after we have refactored all occurrences. We can have such intermediate state, and refactor it later.

I think it just make some thing like is_primary function for ColumnCatalog easier hhhh(I I find this function used in many places). if we remove it, maybe we need a more complex logic to Implement this function(maybe a extra filed in ColumnDesc to indicate which table the column belongs to, or a complex logic to scan travel from database to schema to table to column just to find whether the column is primary

@shmiwy
Copy link
Contributor

shmiwy commented Apr 18, 2022

Or we need to refactor everything that uses the function (more bugs can emerge hhhh🤩)

@skyzh
Copy link
Member Author

skyzh commented Apr 18, 2022

I think it just make some thing like is_primary function for ColumnCatalog easier

Or we need to refactor everything that uses the function

Exactly. If you want to refactor, we can do this. But storing is_primary inside ColumnDesc is more efficient -- we can do an O(1) hash lookup to determine whether a key is a primary key, and we don't need to refactor everywhere. For the first step, I'd like to retain it.

I find this function used in many places

For most of the places using is_primary (especially in planner and optimizer), it's used to determine the sort key of storage. However, most of the usages only consider single primary key. If we want to extend the case to multiple sort keys, a refactor is unavoidable. For example, BoundColumnRef's is_primary IMO should be removed in the future. We can do this later when we have a more detailed plan. Just keep a single PR concise and only solving one problem.

@shmiwy
Copy link
Contributor

shmiwy commented Apr 19, 2022

still not tell difference between primary key(a, b) and primary key(b, a), i will do this int the not-too-distant future.

@shmiwy
Copy link
Contributor

shmiwy commented Apr 19, 2022

on the first step, I'm going to add two filed in TableCatalog.Inner, one is pk_ids which is a HashSet to indicate which columns are primary, the other one is extra_ipk_ids. extra_pk_ids is a vector and will be used to record which columns are declared by "primary key(c1, c2, ..)" syntax.

for example, if we have a sql like

create table t (a int primary key, b int not null, c int not null, d int , primary key(b, c));

the pk_ids will record the column id for a b c, and the extra_pk_ids will record the column id for b c in order b followed by c.

I'm trying to maintain the data of the two filed without using any of the ColumnDesc information except column id, because we will remove is_primary filed in ColumnDesc in the future and refactor some code. So I will not use any of them.

To achieve this goal, I will refactor some codes when creating a table, add two filed in BoundCreateTable like 'TableCatalog,Inner', get the two filed data from stmt after parsing, pass them through binder, logical_plan .. until I call create_table on storage.

@skyzh
Copy link
Member Author

skyzh commented Apr 19, 2022

create table t (a int primary key, b int not null, c int not null, d int , primary key(b, c));

I think we should forbid this in binder. We should only allow single primary key if a int primary key, or multiple primary keys using primary key(b, c) syntax.

@skyzh
Copy link
Member Author

skyzh commented Apr 19, 2022

So, if

create table t (a int, b int not null, c int not null, d int , primary key(b, c));

We should have hashset to set b -> true, c -> true, and ordered_pk_ids to id of b, id of c.

If

create table t (a int primary key, b int not null, c int not null, d int);

We should set a -> true, while ordered_pk_ids only contain id of a.

I believe in the future, the HashSet won't be used, as all inquiries about pks will consider order. But we can keep it for now, to refactor everything little by little.

@skyzh
Copy link
Member Author

skyzh commented Apr 19, 2022

To achieve this goal, I will refactor some codes when creating a table, add two filed in BoundCreateTable like 'TableCatalog,Inner', get the two filed data from stmt after parsing, pass them through binder, logical_plan .. until I call create_table on storage.

This plan looks great! We can proceed on this.

@skyzh
Copy link
Member Author

skyzh commented Apr 19, 2022

ok, if so, I will only keep one vector in tableCatalog, and forbiden that syntax from binder

Sounds good to me.

@shmiwy
Copy link
Contributor

shmiwy commented Apr 19, 2022

ok🥳

@skyzh skyzh added the enhancement New feature or request label Apr 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants