Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimizer: minimal rule of range scan #645

Open
skyzh opened this issue May 7, 2022 · 13 comments
Open

optimizer: minimal rule of range scan #645

skyzh opened this issue May 7, 2022 · 13 comments
Assignees

Comments

@skyzh
Copy link
Member

skyzh commented May 7, 2022

After #644 gets implemented, our storage layer would have range scan capability!

To make SQL leverage this property, we can have a range filter scan rule, which does the following:

  • It runs after filter scan rule.
  • It finds a TableScan node with filter.
  • If the filter condition is in the form of cond1 and cond2 and cond3, ..., and there is one condition in the form of pk <op> constant, then we change the TableScan to scan with pk sorted, and apply the begin / end sort key to the scan interface.

We can eventually move forward to #601. But to keep everything simple, we can have this minimal rule.

@skyzh
Copy link
Member Author

skyzh commented May 12, 2022

@shmiwy will handle this 🤣

@yyin-dev
Copy link
Contributor

This looks like a reasonable exercise for an optimizer newbie... If no one is working on it right now, I can take a look?

@yyin-dev
Copy link
Contributor

I understand what a range scan is: specifying the start and end point of a scan, and (hopefully) uses an index to avoid a full table scan.

However, I am not sure what's a filter scan. Reading FilterScanRule, it seems to convert a LogicalFilter to a LogicalTableScan, which I don't see the reason. Can you give some explanation on the filter scan rule, or provide pointers to reference I can read? (I Googled "SQL filter scan" but did not find anything very relevant). Thanks.

@skyzh
Copy link
Member Author

skyzh commented Sep 28, 2022

Filters can be pushed down to the storage layer. We can put the predicate inside TableScanExecutor, so that the storage layer can filter more blocks instead of fetching them to the compute layer.

FilterScanRule merges filter + table scan to a table scan with filter.

@yyin-dev
Copy link
Contributor

Here's my understanding of the problem we are trying to solve, and a few questions:

  1. High level idea: we want to build a "range scan rule" such that, for a table scan with predicates on pks, we use the range scan ability provided by the storage layer to redece the # of blocks we read into memory (and thus improve performance). Sample query: SELECT * FROM table WHERE pk > 42 AND non_pk < 42.

  2. The logical plan for the above sample query will be a LogicalFilter, and the existing FilterScanRule will converts it into a LogicalTableScan with conditions pk > 42 AND non_pk < 42. However, this requires fetching all pages into memory to perform the filtering. The "range scan rule" (that we are going to implement) will see that we have a LogicalTableScan with conditions on the primary key, and rewrite it to a LogicalTableScan on sorted pks (pk > 42) and remaining conditions (non_pk < 42).

My questions:

  1. Is above understanding correct?

  2. According to what I described, it seems that both the existing "filter scan rule" and the new "range scan rule" should be performed by the optimizer. However, the current HeuristicOptimizer stops once it successfully applies a rule. So for the sample query, the optimizer stops after applying FilterScanRule.

    Does that mean we won't create a new RangeScanRule, but will just add more logic to FilterScanRule?

impl HeuristicOptimizer {
    pub fn optimize(&self, mut root: PlanRef) -> PlanRef {
        for rule in &self.rules {
            if let Ok(applied) = rule.apply(root.clone()) {
                root = applied;
                break;
            }
        }
    ...
  1. To provide begin_sort_key and end_sort_key arguments to Transaction::scan, we need to add those fields to LogicalTableScan, right?

@skyzh
Copy link
Member Author

skyzh commented Sep 28, 2022

Is above understanding correct?

Yes. Note that non-pk can also be pushed down to the storage layer. Our column store supports filter scan (on any column) and range filter scan on pk.

According to what I described, it seems that both the existing "filter scan rule" and the new "range scan rule" should be performed by the optimizer. However, the current HeuristicOptimizer stops once it successfully applies a rule. So for the sample query, the optimizer stops after applying FilterScanRule.

Yes, you will need to modify the rule. It would be okay to create a new rule, or modify the existing FilterScanRule. Both will work.

To provide begin_sort_key and end_sort_key arguments to Transaction::scan, we need to add those fields to LogicalTableScan, right?

Yes.

@yyin-dev
Copy link
Contributor

yyin-dev commented Sep 28, 2022

Thanks for replying!

According to what I described, it seems that both the existing "filter scan rule" and the new "range scan rule" should be performed by the optimizer. However, the current HeuristicOptimizer stops once it successfully applies a rule. So for the sample query, the optimizer stops after applying FilterScanRule.

Yes, you will need to modify the rule. It would be okay to create a new rule, or modify the existing FilterScanRule. Both will work.

Suppose we create a new rule, call it RangeScanRule. If we put it after FilterScanRule, it won't be applied, because the heuristic optimizer stops once it applies FilterScanRule; If we put it before FilterScanRule, the optimizer applies it instead of the FilterScanRule, but there will be some duplicate code between the two rules. So I guess the best way is to extend the existing FilterScanRule (and probably give it a new name like PredicatePushdown)?

impl Optimizer {
    pub fn optimize(&mut self, mut plan: PlanRef) -> PlanRef {
        ...
        let mut rules: Vec<Box<(dyn rules::Rule + 'static)>> = vec![
            Box::new(FilterAggRule {}),
            Box::new(FilterJoinRule {}),
            Box::new(LimitOrderRule {}),
        ];
        if self.enable_filter_scan {
            rules.push(Box::new(FilterScanRule {}));
        }
        let hep_optimizer = HeuristicOptimizer { rules };
        plan = hep_optimizer.optimize(plan);
    ...
}

impl HeuristicOptimizer {
    pub fn optimize(&self, mut root: PlanRef) -> PlanRef {
        for rule in &self.rules {
            if let Ok(applied) = rule.apply(root.clone()) {
                root = applied;
                break;
            }
        }

@skyzh
Copy link
Member Author

skyzh commented Sep 28, 2022

heuristic optimizer stops once it applies FilterScanRule

No. The optimizer will should apply ALL rules one by one in the list.

@skyzh
Copy link
Member Author

skyzh commented Sep 28, 2022

Seems like a bug. Need fix 🤣

@yyin-dev
Copy link
Contributor

yyin-dev commented Sep 29, 2022

Seems like a bug. Need fix 🤣

The optimizer should iterate through all rules, and try all of them sequentially, right? If that's the case, it should be an easy fix (and if we have enough tests for the optimizer, it should be relatively easy to verify that this fix won't break anything)?

Just verified that if we remove the break statement (and the optimizer will try apply all rules), all unit tests still pass.

@skyzh
Copy link
Member Author

skyzh commented Sep 29, 2022

Yes. Welcome to submit a PR. Also cc @st1page 🤣

@st1page
Copy link
Collaborator

st1page commented Sep 29, 2022

remove the break statement

+1

we put it after FilterScanRule, it won't be applied, because the heuristic optimizer stops once it applies FilterScanRule; If we put it before FilterScanRule, the optimizer applies it instead of the FilterScanRule, but there will be some duplicate code between the two rules.

BTW, if the order of the rules can affect the optimizing result, we can split them into multi optimize phases. In other words, construct multiple HeuristicOptimizer struct with different rule set.

@yyin-dev
Copy link
Contributor

I am finally picking up this again (after a long time)...

Currently, Transaction::scan looks like this:

fn scan<'a>(
        &'a self,
        begin_sort_key: &'a [DataValue],
        end_sort_key: &'a [DataValue],
        col_idx: &'a [StorageColumnRef],
        is_sorted: bool,
        reversed: bool,
        expr: Option<BoundExpr>,
    ) -> Self::ScanResultFuture<'a> {

Since begin_sort_key and end_sort_key are & [DataValue], there's no way to tell which columns they are referring to. Should scan takes another argument, e.g. sort_key_cols of type & [StorageColumnRef] or [int], to pass this information?

The implementation of range scan currently supports only one key, and always uses the first column: https://github.com/risinglightdb/risinglight/pull/644/files#r867324032 (which seems wrong).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants