Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove fields from PlanFragment for task serialization #22771

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ZacBlanco
Copy link
Contributor

@ZacBlanco ZacBlanco commented May 16, 2024

Description

This PR removes the statsAndCosts field when a plan fragment is serialized and sent to a worker in a TaskUpdateRequest.

Motivation and Context

The PlanFragment class is serialized in two cases

  1. Sending TaskUpdateRequest's to workers
  2. When the query detail info is requested via the UI

With the addition of new statistics such as histograms, the stats estimates can have a measurable impact on the size of the plan fragments being sent to workers. This leads to the case where we hit the plan task update request size limit set through experimental.internal-communication.max-task-update-size. The stats are not used by workers, so it should be safe to remove. It should also decrease serialization times.

This should allow us to send much larger plans regardless of the statistics which are present in the fragment. This solution also mitigates compatibility issues with the UI that occurred in a previous attempt at this optimization in #13451 that was subsequently reverted by #13739

Impact

Should have no user-facing impact. May affect the CPP worker serialization.

Test Plan

Existing tests. Manual testing to check UI plan view still works.

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== NO RELEASE NOTE ==

@ZacBlanco ZacBlanco force-pushed the upstream-fragment-remove-stats branch 3 times, most recently from 30f6a30 to f683fdf Compare May 16, 2024 21:44
@ZacBlanco ZacBlanco marked this pull request as ready for review May 16, 2024 22:28
Copy link
Contributor

@yingsu00 yingsu00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ZacBlanco In the near future, the workers may make use of the stats to speed up execution, e.g. it can use the NDVs to estimate the max size of a hash table. Would it be possible to control the behavior through a session property and have a finer control of the granularity what to send and what not? E.g. NDVs are usually small and may always be sent. Histograms may be subjected to a size limit, etc.

@ZacBlanco
Copy link
Contributor Author

ZacBlanco commented May 17, 2024

Is there a plan or design document for that @yingsu00? This topic was covered in the most recent optimizer working group meeting and no one else had brought up this capability. @aditi-pandit was present there as well.

In the case that workers need statistics, I think it might necessitate re-designing the way statistics are are sent to workers because with the addition of histograms, we could end up shipping a significantly large amount of redundant information alongside each plan node. I think adding flags to pass only certain bits of information to the workers is out of scope for this PR, but could theoretically be implemented easily with a configuration parameter and some additional logic in the serialization method of the PlanFragment

@yingsu00
Copy link
Contributor

@ZacBlanco There hasn't been a design document yet, but having the NDVs is generally beneficial for query execution. If it's not mentioned before, it will be brought up soon. For example, in the upcoming aggregation pushdown improvement, if I know the NDV then different algorithms could be used to suit the different scales. Take TPCH Q1 for example, there is only 4 distinct groups so everything can be done in registers. Histograms could be useful as well but I don't have a concrete idea how to use them yet. As you said, re-designing the way to send the stats might be necessary. Do you have an idea what the new design might look like?

@aditi-pandit
Copy link
Contributor

@yingsu00 : Is there something special about sending the NDV's to the worker for the optimization you describe above ? We could use NDVs at the co-ordinator itself to decide special processing in certain PlanNodes or use new PlanNodes entirely for the example you have. The worker can also track stats in Vectors already. We could track NDVs for a vector in that codepath as well. That should suit worker runtime optimizations better right ?

elharo
elharo previously approved these changes May 17, 2024
@tdcmeehan tdcmeehan self-assigned this May 17, 2024
@yingsu00
Copy link
Contributor

@yingsu00 : Is there something special about sending the NDV's to the worker for the optimization you describe above ? We could use NDVs at the co-ordinator itself to decide special processing in certain PlanNodes or use new PlanNodes entirely for the example you have. The worker can also track stats in Vectors already. We could track NDVs for a vector in that codepath as well. That should suit worker runtime optimizations better right ?

Yes NDVs could help a number of things in the execution so that the engine could choose different strategies. For another exmaple, growing memory is very costly operation, but if we know the NDVs then we can directly allocate enough memory for hash tables. Having the coordinator choosing the execution strategy is too much as the coordinator doesn't need to know every detailed and fine granular choices the worker make. Having the local counting and sampling doesn't give the table/partition level stats and would make the code more complex. Since we already know the NDVs, why not make use of it? How big are they? Can we separate the NDVs from other stats and estimations and send the NDVs only, if the histograms are too big? If too big, will there be a way to compress/encode them?

@feilong-liu
Copy link
Contributor

I think having a flag to gate the change is generally a good idea, as it's easier to test and turn off if later found problematic. Perhaps another way is to simply not serialize the histogram if it's not to be used, but keep the other fields.

@rschlussel
Copy link
Contributor

If there aren't immediate plans relying on using this on the workers, I think we should stop sending them as per this PR and if it's needed in the future add back only the stats we need in a more targeted way. I'm not even sure the ndvs other than simple table scans with no filters pushed down are reliable enough to be used on the worker. I think you'd be better off getting estimated ndvs during runtime processing for worker optimizations. They'll be much more accurate.

@ZacBlanco one question: will the histogram stats cause any problems for the coordinator UI if they are so large. Should we null those out anyway for all serialization purposes?

@ZacBlanco
Copy link
Contributor Author

ZacBlanco commented May 21, 2024

@rschlussel I think they could present an issue. I can try nulling them out for serialization over in the histogram PR in #22664

rschlussel
rschlussel previously approved these changes May 22, 2024
Copy link
Contributor

@rschlussel rschlussel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good change even if histograms are nulled out anyway for all serialization.

@aaneja aaneja self-requested a review May 23, 2024 17:09
This should allow us to send much larger plans regardless of the
statistics which are present in the fragment. This solution also
mitigates compatibility issues with the UI that occurred in
a previous attempt at this optimization
@ZacBlanco ZacBlanco force-pushed the upstream-fragment-remove-stats branch from f683fdf to a85f988 Compare May 29, 2024 20:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants