Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimizer should consider projections and stored values when selecting indexes #20903

Open
AntoineAA opened this issue May 7, 2024 · 5 comments

Comments

@AntoineAA
Copy link

My Environment

  • ArangoDB Version: 3.11.8 enterprise
  • Deployment Mode: Single Server
  • Deployment Strategy: Manual Start
  • Configuration:
  • Infrastructure: own
  • Operating System: Ubuntu 22.04
  • Total RAM in your machine: 128Gb
  • Disks in use: SSD
  • Used Package: Ubuntu .deb

Component, Query & Data

Affected feature:

index selection when executing a query

AQL query (if applicable):

FOR tm IN z_taxa_meta
	FILTER tm.referential == 'k-world-flora'
	COLLECT families = tm.family_without_author
	SORT families
	RETURN families

29.969 s

FOR tm IN z_taxa_meta OPTIONS { "indexHint": 'idx_1798343241501769728', "forceIndexHint": true }
	FILTER tm.referential == 'k-world-flora'
	COLLECT families = tm.family_without_author
	SORT families
	RETURN families

572.090 ms

AQL explain and/or profile (if applicable):

Query String (217 chars, cacheable: true):
 FOR tm IN z_taxa_meta
 	FILTER tm.referential == 'k-world-flora'
 	COLLECT families = tm.family_without_author
 	SORT families
 	RETURN families

Execution plan:
 Id   NodeType           Est.   Comment
  1   SingletonNode         1   * ROOT
 10   IndexNode         78282     - FOR tm IN z_taxa_meta   /* persistent index scan, index scan + document lookup (projections: `family_without_author`) */    
  5   CalculationNode   78282       - LET #4 = tm.`family_without_author`   /* attribute expression */   /* collections used: tm : z_taxa_meta */
  6   CollectNode       62625       - COLLECT families = #4   /* hash */
  7   SortNode          62625       - SORT families ASC   /* sorting strategy: standard */
  8   ReturnNode        62625       - RETURN families

Indexes used:
 By   Name                      Type         Collection    Unique   Sparse   Cache   Selectivity   Fields                         Stored values   Ranges
 10   idx_1726113601910996992   persistent   z_taxa_meta   true     false    false      100.00 %   [ `referential`, `species` ]   [  ]            (tm.`referential` == "k-world-flora")

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   move-filters-up
  3   remove-redundant-sorts
  4   move-calculations-up-2
  5   move-filters-up-2
  6   use-indexes
  7   remove-filter-covered-by-index
  8   remove-unnecessary-calculations-2
  9   reduce-extraction-to-projection

Optimization rules with highest execution times:
 RuleName                                    Duration [s]
 use-indexes                                      0.00023

82 rule(s) executed, 2 plan(s) created, peak mem [b]: 0, exec time [s]: 0.02075
Query String (215 chars, cacheable: true):
 FOR tm IN z_taxa_meta OPTIONS { "indexHint": 'idx_1798343241501769728', "forceIndexHint": true }
 	FILTER tm.referential == 'k-world-flora'
 	COLLECT families = tm.family_without_author
 	SORT families
 	RETURN families

Execution plan:
 Id   NodeType           Est.   Comment
  1   SingletonNode         1   * ROOT
 10   IndexNode         14911     - FOR tm IN z_taxa_meta   /* persistent index scan, index only (projections: `family_without_author`) */    
  5   CalculationNode   14911       - LET #4 = tm.`family_without_author`   /* attribute expression */   /* collections used: tm : z_taxa_meta */
  6   CollectNode       11928       - COLLECT families = #4   /* hash */
  7   SortNode          11928       - SORT families ASC   /* sorting strategy: standard */
  8   ReturnNode        11928       - RETURN families

Indexes used:
 By   Name                      Type         Collection    Unique   Sparse   Cache   Selectivity   Fields              Stored values                                         Ranges
 10   idx_1798343241501769728   persistent   z_taxa_meta   false    false    false        0.01 %   [ `referential` ]   [ `family_without_author`, `genus_without_author` ]   (tm.`referential` == "k-world-flora")

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   move-filters-up
  3   remove-redundant-sorts
  4   move-calculations-up-2
  5   move-filters-up-2
  6   use-indexes
  7   remove-filter-covered-by-index
  8   remove-unnecessary-calculations-2
  9   reduce-extraction-to-projection

82 rule(s) executed, 2 plan(s) created, peak mem [b]: 0, exec time [s]: 0.00101

Dataset:

  • Type: document
  • Number of documents: 1.565.656
  • Estimated documents size: 1.95 GB
  • Number of indexes: 14
  • Estimated size of indexes: 651.11 MB

Size of your Dataset on disk:

600GB

Replication Factor & Number of Shards (Cluster only):

Steps to reproduce

  1. Run the first query, without indexHint
  2. Run the second query, with indexHint

Problem:

The automatic choice of index (based on selectivity?) results in very poor performance for this query.

As the default chosen index is based on 2 fields, the selectivity does not seem to be good if a filter is applied on only 1 of the 2 fields.

The index we force has less selectivity, but it also has stored values, and the performance is really better (29s vs. 0.5s).

Expected result:

We expect the selected index to match both the filters applied and the return statement (stored values).

@dothebart
Copy link
Contributor

Please add the index definitions in question as well.

@AntoineAA
Copy link
Author

is this OK?

ID Type Unique Sparse Extras Selectivity Est. Fields Stored Values Name Action
0 primary true false 100.00% _key primary
28576209 persistent false false deduplicate: false, estimates: true, cacheEnabled: false 0.01% referential family_without_author, genus_without_author idx_1798343241501769728
14861033977 persistent true false deduplicate: false, estimates: true, cacheEnabled: false 100.00% referential, species idx_1726113601910996992

@jsteemann
Copy link
Contributor

Thanks for reporting this issue.
Currently the optimizer will pick indexes based on how many attributes of the FILTER / SORT conditions are covered by an index, how selective an index is, and how many attributes it covers.
Unfortunately the stored values of an index are currently not taken into account during index selection. The same is true for attributes of documents that are used outside of FILTER / SORT conditions, which may be covered by indexes as well.
Right now the only workaround in this case is to manually add an index hint to the query as you did. Longer term, the optimizer will need to be changed so that it will also take into account which attributes of the collection/indexes are using by the query besides in the FILTER / SORT conditions.

@AntoineAA
Copy link
Author

Hi @jsteemann

OK, thanks for those clarifications.

@dothebart dothebart changed the title optimizer chooses multiple field index instead of single field index (+ stored values) because of selectivity optimizer should consider projections and stored values when selecting indexes May 14, 2024
@dothebart
Copy link
Contributor

I'm turning this to a feature request, since we should eventually learn how to get this working without index hints.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants