Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hash Join Issue? #61

Open
yywe opened this issue Sep 28, 2023 · 2 comments
Open

Hash Join Issue? #61

yywe opened this issue Sep 28, 2023 · 2 comments

Comments

@yywe
Copy link

yywe commented Sep 28, 2023

Hi @erikgrinaker,

Thank you for open source this great project. I would like to say this is the best resource to learn database principles. Meanwhile, the code quality is awesome, neat and elegant. Every piece is excellent.

I have a question for one of the optimizer, as shown below:

                (Expression::Field(a, a_label), Expression::Field(b, b_label)) => {
                    let (left_field, right_field) = if a < left_size {
                        ((a, a_label), (b - left_size, b_label))
                    } else {
                        ((b, b_label), (a - left_size, a_label))
                    };
                    Ok(Node::HashJoin { left, left_field, right, right_field, outer })
                }

when it is equal join, here we switch to hash join. based on Field(a) and Field(b). Not sure if I miss anything, but what if the corresponding field (column) are not unique ? if there are duplicate values in the fields. I feel the hash join will miss some rows?

The hash map is created by collect the key value pairs:

let right: HashMap<Value, Row> = rrows
.map(|res| match res {
Ok(row) if row.len() <= r => {
Err(Error::Internal(format!("Right index {} out of bounds", r)))
}
Ok(row) => Ok((row[r].clone(), row)),
Err(err) => Err(err),
})
.collect::<Result<_>>()?;

if there are duplicate values, I think only the last pair will exist.

do I miss anything? Appreciate any feedback. Thanks!

@erikgrinaker
Copy link
Owner

erikgrinaker commented Oct 8, 2023

Yep, you're right, nice find!

toydb> create table a (id int primary key, value string not null);
Created table a
toydb> create table b (id int primary key, value string not null);
Created table b
toydb> insert into a values (1, 'a'), (2, 'b');
Created 2 rows
toydb> insert into b values (1, 'b'), (2, 'b');
Created 2 rows
toydb> select a.id, b.id from a join b on a.value = b.value;
2|2
toydb> explain select a.id, b.id from a join b on a.value = b.value;
Projection: a.id, b.id
└─ HashJoin: inner on a.value = b.value
   ├─ Scan: a
   └─ Scan: b

@laxjesse
Copy link

laxjesse commented Nov 6, 2023

encountered this while playing around a little and wanted to to share in case it was helpful. I believe that this test should actually return duplicate results e.g.

 Result: ["id", "title", "genre", "studio", "rating"]
 [Integer(10), String("Inception"), String("Science Fiction"), String("Warner Bros"), Float(8.8)]
+[Integer(10), String("Inception"), String("Science Fiction"), String("Warner Bros"), Float(8.8)]
+[Integer(1), String("Stalker"), String("Science Fiction"), String("Mosfilm"), Float(8.2)]
 [Integer(1), String("Stalker"), String("Science Fiction"), String("Mosfilm"), Float(8.2)]
 [Integer(4), String("Heat"), String("Action"), String("Warner Bros"), Float(8.2)]
+[Integer(4), String("Heat"), String("Action"), String("Warner Bros"), Float(8.2)]
+[Integer(6), String("Solaris"), String("Science Fiction"), String("Mosfilm"), Float(8.1)]
 [Integer(6), String("Solaris"), String("Science Fiction"), String("Mosfilm"), Float(8.1)]
 [Integer(7), String("Gravity"), String("Science Fiction"), String("Warner Bros"), Float(7.7)]
+[Integer(7), String("Gravity"), String("Science Fiction"), String("Warner Bros"), Float(7.7)]
+[Integer(9), String("Birdman"), String("Comedy"), String("Warner Bros"), Float(7.7)]
 [Integer(9), String("Birdman"), String("Comedy"), String("Warner Bros"), Float(7.7)]
 [Integer(5), String("The Fountain"), String("Science Fiction"), String("Warner Bros"), Float(7.2)]
+[Integer(5), String("The Fountain"), String("Science Fiction"), String("Warner Bros"), Float(7.2)]

(also just to reiterate, thanks for open sourcing this! it's great fun to play with)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants