feat: Add support for random row sampling on binary id columns #1135

Merged
merged 16 commits into from
May 21, 2024

nj1973
Contributor

@nj1973 nj1973 commented May 15, 2024

This PR adds support for random row sampling on binary id columns by:

  1. Casting random binary values to hex
  2. Adding a reverse cast back to binary in the IN list added for random binary id filters
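The two steps above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names `binary_ids_to_hex` and `build_in_filter` are hypothetical, and `unhex` is used as the reverse cast only because it is the Hive function that appears in the queries later in this thread. Other engines would need their own cast expression.

```python
import string


def binary_ids_to_hex(values):
    """Step 1: cast sampled binary id values to hex strings."""
    return [v.hex() for v in values]


def build_in_filter(column, hex_values, reverse_cast="unhex"):
    """Step 2: build an IN list that casts each hex literal back to binary.

    `reverse_cast` is an assumed, engine-specific function name.
    """
    for h in hex_values:
        # Only accept well-formed hex so that literals are safe to embed.
        if len(h) % 2 != 0 or any(c not in string.hexdigits for c in h):
            raise ValueError(f"not a valid hex string: {h!r}")
    casts = ", ".join(f"{reverse_cast}('{h}')" for h in hex_values)
    return f"{column} IN ({casts})"


sampled = [b"DVT-key-5", b"DVT-key-3"]
hex_ids = binary_ids_to_hex(sampled)
print(build_in_filter("`binary_id`", hex_ids))
```

Keeping the sampled values as hex strings means they can be passed around as plain text and only become binary again inside the generated filter.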

Also adds random row options to the DVT_BINARY integration tests. The table has only 5 rows and we validate with a batch size of 5, which at least proves that the filter works on each system. We will probably need more targeted random row integration tests in the future, but that was not the purpose of this PR; I just wanted to confirm that the casting works on all engines.

Unfortunately I had to disable the Hive test because our Hive test instance is not behaving properly when asked to process a multi-element IN list on a binary column. This reproduces outside of DVT.

@nj1973 nj1973 requested a review from a team as a code owner May 15, 2024 14:33
@nj1973 nj1973 marked this pull request as draft May 15, 2024 14:33
@nj1973
Contributor Author

nj1973 commented May 15, 2024

We appear to have an issue on Hive, unrelated to our changes.

Single-item IN list:

select * from pso_data_validator.`dvt_binary` where `binary_id` in (unhex('4456542d6b65792d35'));

+-----------------------+--------------------+------------------------+
| dvt_binary.binary_id  | dvt_binary.int_id  | dvt_binary.other_data  |
+-----------------------+--------------------+------------------------+
| DVT-key-5             | 5                  | Row 5                  |
+-----------------------+--------------------+------------------------+

Add a second item to the IN list:

select * from pso_data_validator.`dvt_binary` where `binary_id` in (unhex('4456542d6b65792d35'),unhex('4456542d6b65792d33'));

+-----------------------+--------------------+------------------------+
| dvt_binary.binary_id  | dvt_binary.int_id  | dvt_binary.other_data  |
+-----------------------+--------------------+------------------------+
+-----------------------+--------------------+------------------------+

No rows selected!
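For reference, decoding the two hex literals shows which keys the second query asks for. This is only a quick sanity check of the literals. That a `DVT-key-3` row exists in the table is an assumption based on the key pattern seen in the first result, not something shown in this thread.

```python
# Decode the hex literals used in the two Hive queries above.
literals = ["4456542d6b65792d35", "4456542d6b65792d33"]
decoded = [bytes.fromhex(h) for h in literals]

for h, raw in zip(literals, decoded):
    print(h, "->", raw.decode("ascii"))
```

The first literal decodes to the `DVT-key-5` key that the single-item query returned, so the two-item query would be expected to return at least that row.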

@pull-request-size pull-request-size bot added size/L and removed size/M labels May 17, 2024
@nj1973
Contributor Author

nj1973 commented May 17, 2024

/gcbrun

@nj1973 nj1973 marked this pull request as ready for review May 20, 2024 13:01
Collaborator

@nehanene15 nehanene15 left a comment

LGTM. Nice job!!

@nj1973
Contributor Author

nj1973 commented May 21, 2024

/gcbrun

@nj1973 nj1973 merged commit c3d2155 into develop May 21, 2024
5 checks passed
@nj1973 nj1973 deleted the issue1070-debug-binary-pks-row-validation branch May 21, 2024 08:44
3 participants