Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-28268: Iceberg: Retrieve row count from iceberg SnapshotSummary in case of iceberg.hive.keep.stats=false #5215

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

zhangbutao
Copy link
Contributor

@zhangbutao zhangbutao commented Apr 25, 2024

What changes were proposed in this pull request?

At present, in case of iceberg.hive.keep.stats=true & hive.compute.query.using.stats=true, HS2 will do a fetch task to get iceberg table's numRows property from HMS to optimize count query.
If iceberg.hive.keep.stats=false, HS2 will always launch tez task to compute table's row count when filing a count query.

However, as we know, iceberg table's metadata has some stats information, we can also just start a fetch task to retrieve the row count from iceberg's snapshot summary when iceberg.hive.keep.stats=false or no stats stored in hms. This can avoid launching tez task to compute the table's row count.

BTW, timetravel or branch/tag has different stats from current snapshot, so we need to get the specified snapshotid based on the different iceberg version. Otherwise, we will get the wrong stats when querying the time travel/branch/tag.

Why are the changes needed?

Does this PR introduce any user-facing change?

No

Is the change a dependency upgrade?

No

How was this patch tested?

Qtest

@@ -237,19 +237,19 @@ STAGE PLANS:
alias: ice01
filterExpr: (a = 22) (type: boolean)
Snapshot ref: branch_test1
Statistics: Num rows: 3 Data size: 291 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 5 Data size: 485 Basic stats: COMPLETE Column stats: COMPLETE
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before this PR, we always get row count of branch/tag/timetravel by the current snapshot summary, which is not right.

Copy link

sonarcloud bot commented Apr 30, 2024

Quality Gate Passed Quality Gate passed

Issues
5 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

@zhangbutao zhangbutao marked this pull request as ready for review May 21, 2024 10:20
@zhangbutao zhangbutao changed the title Iceberg: Retrieve row count from iceberg SnapshotSummary in case of iceberg.hive.keep.stats=false HIVE-28268: Iceberg: Retrieve row count from iceberg SnapshotSummary in case of iceberg.hive.keep.stats=false May 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants