Fix partitioning by DATE column #1057

Open · wants to merge 1 commit into base: main

Conversation

@bnaul commented Apr 5, 2024

Fixes #1056
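For context, a minimal sketch of the use case this addresses, assuming sqlalchemy-bigquery's `bigquery_time_partitioning` table option together with `google.cloud.bigquery.TimePartitioning`; the project, dataset, and column names are made up:

```python
# Hypothetical example of the use case fixed here: partitioning on a plain DATE
# column with type_="DAY" (previously only *_TRUNC-based PARTITION BY clauses
# were generated, so this configuration failed -- see #1056).
import sqlalchemy
from google.cloud import bigquery

metadata = sqlalchemy.MetaData()

events = sqlalchemy.Table(
    "my_dataset.events",  # made-up dataset/table name
    metadata,
    sqlalchemy.Column("event_date", sqlalchemy.DATE),
    sqlalchemy.Column("value", sqlalchemy.Integer),
    bigquery_time_partitioning=bigquery.TimePartitioning(
        field="event_date", type_="DAY"
    ),
)

# Requires credentials for the (made-up) project; with this change the emitted
# DDL should contain `PARTITION BY event_date`.
engine = sqlalchemy.create_engine("bigquery://my-project")
metadata.create_all(engine)
```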

@bnaul requested review from a team as code owners on April 5, 2024 16:45
@bnaul requested a review from farhan0102 on April 5, 2024 16:45
The product-auto-label bot added the size: s, api: bigquery, and size: xl labels and removed the size: s label on Apr 5, 2024
The product-auto-label bot added the size: s label and removed the size: xl label on Apr 5, 2024
Comment on lines 831 to 840
```python
if time_partitioning.field is not None:
    field = time_partitioning.field
    if isinstance(
        table.columns[time_partitioning.field].type,
        sqlalchemy.sql.sqltypes.DATE,
    ):
        return f"PARTITION BY {field}"
    elif isinstance(
        table.columns[time_partitioning.field].type,
        sqlalchemy.sql.sqltypes.TIMESTAMP,
```
chalmerlowe (Collaborator):

@tswast this PR adds logic to handle partitioning on a plain DATE column, above and beyond the existing DATE_TRUNC and TIMESTAMP_TRUNC functions.

Looking at the breakdown of how BigQuery accepts time partitioning via TIMESTAMP_TRUNC and DATE_TRUNC, it feels like we are missing a variety of the options/possibilities.

Am I missing something?

bnaul (Author):

@chalmerlowe can you clarify what you think is missing? It looks like the API expects one of:

Supported values are: HOUR, DAY, MONTH, YEAR

So it seems like TIMESTAMP_TRUNC would work for all of those and DATE_TRUNC would cover MONTH and YEAR on a DATE column, so all that's missing is plain DATE partitioning (i.e. DAY).
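To ground the quoted list of supported values, a small sketch using the BigQuery client library's `TimePartitioningType` constants (the field name is hypothetical):

```python
# The four granularities the BigQuery API accepts for time partitioning, as
# exposed by the google-cloud-bigquery client library.
from google.cloud.bigquery import TimePartitioning, TimePartitioningType

for granularity in (
    TimePartitioningType.HOUR,
    TimePartitioningType.DAY,
    TimePartitioningType.MONTH,
    TimePartitioningType.YEAR,
):
    # "event_date" is a made-up field name.
    tp = TimePartitioning(field="event_date", type_=granularity)
    print(tp.type_, tp.field)
```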

tswast (Collaborator):

| Column Data Type | HOUR | DAY | MONTH | YEAR |
| --- | --- | --- | --- | --- |
| DATE | N/A | Fixed by #1057 | via DATE_TRUNC | via DATE_TRUNC |
| DATETIME | Currently via DATE_TRUNC; TODO: use DATETIME_TRUNC | Currently via DATE_TRUNC; TODO: use DATETIME_TRUNC | Currently via DATE_TRUNC; TODO: use DATETIME_TRUNC | Currently via DATE_TRUNC; TODO: use DATETIME_TRUNC |
| TIMESTAMP | via TIMESTAMP_TRUNC | via TIMESTAMP_TRUNC | via TIMESTAMP_TRUNC | via TIMESTAMP_TRUNC |
| _PARTITIONDATE | N/A | via DATE_TRUNC | via DATE_TRUNC | via DATE_TRUNC |
| _PARTITIONTIME | Not supported; TODO: use TIMESTAMP_TRUNC | Not supported; TODO: use TIMESTAMP_TRUNC | Not supported; TODO: use TIMESTAMP_TRUNC | Not supported; TODO: use TIMESTAMP_TRUNC |
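To illustrate the TODO entries in the matrix, a hypothetical sketch (not the code in this PR; the helper name and structure are made up) of what a fuller dispatch on column type could look like:

```python
# Hypothetical sketch only, not the actual implementation in this PR.
# Illustrates the matrix: pick the truncation function from the column type,
# and fall back to the bare column for DATE + DAY (the case #1057 adds).
import sqlalchemy


def _partition_by_clause(table, field, type_):
    """Return a BigQuery PARTITION BY clause for `field` at granularity `type_`."""
    column_type = table.columns[field].type

    if isinstance(column_type, sqlalchemy.sql.sqltypes.DATE):
        if type_ == "DAY":
            # A DATE column partitions directly by day.
            return f"PARTITION BY {field}"
        # MONTH / YEAR granularity needs DATE_TRUNC.
        return f"PARTITION BY DATE_TRUNC({field}, {type_})"

    if isinstance(column_type, sqlalchemy.sql.sqltypes.DATETIME):
        # Matrix TODO: use DATETIME_TRUNC rather than DATE_TRUNC.
        return f"PARTITION BY DATETIME_TRUNC({field}, {type_})"

    if isinstance(column_type, sqlalchemy.sql.sqltypes.TIMESTAMP):
        return f"PARTITION BY TIMESTAMP_TRUNC({field}, {type_})"

    raise ValueError(f"Unsupported partitioning column type: {column_type!r}")
```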

chalmerlowe (Collaborator):

@bnaul For clarity, my question was less about whether this PR is correct and more focused on the bigger question:

Since we have the function _process_time_partitioning() and it handles two use cases (soon to be three based on this PR), does it cover all the use cases that it ought to?

Tim and I looked at this. I had a matrix to help me identify what the function does, doesn't do, and should do.

Tim cleaned it up and posted it above. Some capabilities are satisfied via an existing BigQuery function (e.g. DATE_TRUNC), but other use cases are either not fully satisfied or not supported at all. Tim provides suggestions for how to resolve those gaps.

NOTE: In no way am I saying you need to do all this work. This is more for me as a maintainer to figure out what I want the end state to be either through this PR or a follow-on PR.

bnaul (Author):

Ah yes, that matrix is very helpful, thanks!

bnaul (Author):

Just curious @chalmerlowe @tswast, any more thoughts on whether this PR should go ahead or wait on additional changes? To me it's distinct enough from the other "Not supported" cases above that it makes sense to keep them separate, but obviously I'm a bit biased 🙂

@chalmerlowe (Collaborator) commented May 2, 2024

@bnaul

Updated: I copied the matrix to a new issue. Beyond that, I intend to accept this PR.
I will need to run it through the Kokoro test suite and will likely take a last look at it as it stands to make sure I am not missing anything with regard to adding this nugget of functionality.

@chalmerlowe added the kokoro:run and kokoro:force-run labels (force Kokoro to re-run the tests) on May 2, 2024
@yoshi-kokoro removed the kokoro:run and kokoro:force-run labels on May 2, 2024
@chalmerlowe (Collaborator):

The Kokoro check is failing because we do not detect full test coverage.
The coverage tool is set to monitor our overall unit test suite and confirm that the unit tests cover all of the paths through the code. We do NOT apply coverage to the tests that run during system testing, which live in the system folder.

Because the test included here is in the system folder, we do not get credit for covering the new branch.

I recommend adding a test to the unit folder that covers this pathway.
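For example, a unit test along these lines might exercise the new branch without a live BigQuery project (the table/column names and the exact assertions are assumptions; the project's existing unit-test fixtures may offer a more idiomatic way to compile DDL):

```python
# Hypothetical unit-test sketch: compile the CREATE TABLE statement offline so
# the coverage tool sees the DATE branch, without needing BigQuery credentials.
import sqlalchemy
from sqlalchemy.schema import CreateTable
from google.cloud import bigquery
from sqlalchemy_bigquery import BigQueryDialect


def test_create_table_date_column_day_partitioning():
    metadata = sqlalchemy.MetaData()
    table = sqlalchemy.Table(
        "some_table",
        metadata,
        sqlalchemy.Column("event_date", sqlalchemy.DATE),
        bigquery_time_partitioning=bigquery.TimePartitioning(
            field="event_date", type_="DAY"
        ),
    )
    ddl = str(CreateTable(table).compile(dialect=BigQueryDialect()))
    # Per the new branch, a DATE column partitions directly, with no *_TRUNC wrapper.
    assert "PARTITION BY event_date" in ddl
    assert "DATE_TRUNC" not in ddl
```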

@chalmerlowe (Collaborator):

@bnaul

I was looking at the possibility of adding a unit test or two to this and ran into a snag.
I am checking with @tswast offline to see if I can get some clarity on what I am seeing.

More to come.

Labels: api: bigquery (Issues related to the googleapis/python-bigquery-sqlalchemy API.), size: s (Pull request size is small.)
Projects: None yet
Development: Successfully merging this pull request may close these issues:

Time partitioning on DATE column fails with type_="DAY"
4 participants