optimizer should consider projections and stored values when selecting indexes #20903

AntoineAA · 2024-05-07T10:33:16Z

My Environment

ArangoDB Version: 3.11.8 enterprise
Deployment Mode: Single Server
Deployment Strategy: Manual Start
Configuration:
Infrastructure: own
Operating System: Ubuntu 22.04
Total RAM in your machine: 128Gb
Disks in use: SSD
Used Package: Ubuntu .deb

Component, Query & Data

Affected feature:

index selection when executing a query

AQL query (if applicable):

FOR tm IN z_taxa_meta
	FILTER tm.referential == 'k-world-flora'
	COLLECT families = tm.family_without_author
	SORT families
	RETURN families

29.969 s

FOR tm IN z_taxa_meta OPTIONS { "indexHint": 'idx_1798343241501769728', "forceIndexHint": true }
	FILTER tm.referential == 'k-world-flora'
	COLLECT families = tm.family_without_author
	SORT families
	RETURN families

572.090 ms

AQL explain and/or profile (if applicable):

Query String (217 chars, cacheable: true):
 FOR tm IN z_taxa_meta
 	FILTER tm.referential == 'k-world-flora'
 	COLLECT families = tm.family_without_author
 	SORT families
 	RETURN families

Execution plan:
 Id   NodeType           Est.   Comment
  1   SingletonNode         1   * ROOT
 10   IndexNode         78282     - FOR tm IN z_taxa_meta   /* persistent index scan, index scan + document lookup (projections: `family_without_author`) */    
  5   CalculationNode   78282       - LET #4 = tm.`family_without_author`   /* attribute expression */   /* collections used: tm : z_taxa_meta */
  6   CollectNode       62625       - COLLECT families = #4   /* hash */
  7   SortNode          62625       - SORT families ASC   /* sorting strategy: standard */
  8   ReturnNode        62625       - RETURN families

Indexes used:
 By   Name                      Type         Collection    Unique   Sparse   Cache   Selectivity   Fields                         Stored values   Ranges
 10   idx_1726113601910996992   persistent   z_taxa_meta   true     false    false      100.00 %   [ `referential`, `species` ]   [  ]            (tm.`referential` == "k-world-flora")

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   move-filters-up
  3   remove-redundant-sorts
  4   move-calculations-up-2
  5   move-filters-up-2
  6   use-indexes
  7   remove-filter-covered-by-index
  8   remove-unnecessary-calculations-2
  9   reduce-extraction-to-projection

Optimization rules with highest execution times:
 RuleName                                    Duration [s]
 use-indexes                                      0.00023

82 rule(s) executed, 2 plan(s) created, peak mem [b]: 0, exec time [s]: 0.02075

Query String (215 chars, cacheable: true):
 FOR tm IN z_taxa_meta OPTIONS { "indexHint": 'idx_1798343241501769728', "forceIndexHint": true }
 	FILTER tm.referential == 'k-world-flora'
 	COLLECT families = tm.family_without_author
 	SORT families
 	RETURN families

Execution plan:
 Id   NodeType           Est.   Comment
  1   SingletonNode         1   * ROOT
 10   IndexNode         14911     - FOR tm IN z_taxa_meta   /* persistent index scan, index only (projections: `family_without_author`) */    
  5   CalculationNode   14911       - LET #4 = tm.`family_without_author`   /* attribute expression */   /* collections used: tm : z_taxa_meta */
  6   CollectNode       11928       - COLLECT families = #4   /* hash */
  7   SortNode          11928       - SORT families ASC   /* sorting strategy: standard */
  8   ReturnNode        11928       - RETURN families

Indexes used:
 By   Name                      Type         Collection    Unique   Sparse   Cache   Selectivity   Fields              Stored values                                         Ranges
 10   idx_1798343241501769728   persistent   z_taxa_meta   false    false    false        0.01 %   [ `referential` ]   [ `family_without_author`, `genus_without_author` ]   (tm.`referential` == "k-world-flora")

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   move-filters-up
  3   remove-redundant-sorts
  4   move-calculations-up-2
  5   move-filters-up-2
  6   use-indexes
  7   remove-filter-covered-by-index
  8   remove-unnecessary-calculations-2
  9   reduce-extraction-to-projection

82 rule(s) executed, 2 plan(s) created, peak mem [b]: 0, exec time [s]: 0.00101

Dataset:

Type: document
Number of documents: 1.565.656
Estimated documents size: 1.95 GB
Number of indexes: 14
Estimated size of indexes: 651.11 MB

Size of your Dataset on disk:

600GB

Replication Factor & Number of Shards (Cluster only):

Steps to reproduce

Run the first query, without indexHint
Run the second query, with indexHint

Problem:

The automatic choice of index (based on selectivity?) results in very poor performance for this query.

As the default chosen index is based on 2 fields, the selectivity does not seem to be good if a filter is applied on only 1 of the 2 fields.

The index we force has less selectivity, but it also has stored values, and the performance is really better (29s vs. 0.5s).

Expected result:

We expect the selected index to match both the filters applied and the return statement (stored values).

The text was updated successfully, but these errors were encountered:

dothebart · 2024-05-07T11:49:44Z

Please add the index definitions in question as well.

AntoineAA · 2024-05-07T12:35:53Z

is this OK?

ID	Type	Unique	Sparse	Extras	Selectivity Est.	Fields	Stored Values	Name
0	primary	true	false		100.00%	_key		primary
28576209	persistent	false	false	deduplicate: false, estimates: true, cacheEnabled: false	0.01%	referential	family_without_author, genus_without_author	idx_1798343241501769728
14861033977	persistent	true	false	deduplicate: false, estimates: true, cacheEnabled: false	100.00%	referential, species		idx_1726113601910996992

jsteemann · 2024-05-10T14:11:06Z

Thanks for reporting this issue.
Currently the optimizer will pick indexes based on how many attributes of the FILTER / SORT conditions are covered by an index, how selective an index is, and how many attributes it covers.
Unfortunately the stored values of an index are currently not taken into account during index selection. The same is true for attributes of documents that are used outside of FILTER / SORT conditions, which may be covered by indexes as well.
Right now the only workaround in this case is to manually add an index hint to the query as you did. Longer term, the optimizer will need to be changed so that it will also take into account which attributes of the collection/indexes are using by the query besides in the FILTER / SORT conditions.

AntoineAA · 2024-05-13T14:39:27Z

Hi @jsteemann

OK, thanks for those clarifications.

dothebart · 2024-05-14T09:27:12Z

I'm turning this to a feature request, since we should eventually learn how to get this working without index hints.

dothebart changed the title ~~optimizer chooses multiple field index instead of single field index (+ stored values) because of selectivity~~ optimizer should consider projections and stored values when selecting indexes May 14, 2024

dothebart added 1 Feature 3 Optimizer AQL query optimizer is involved workaround available 3 Index labels May 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

optimizer should consider projections and stored values when selecting indexes #20903

optimizer should consider projections and stored values when selecting indexes #20903

AntoineAA commented May 7, 2024

dothebart commented May 7, 2024

AntoineAA commented May 7, 2024

jsteemann commented May 10, 2024

AntoineAA commented May 13, 2024

dothebart commented May 14, 2024

optimizer should consider projections and stored values when selecting indexes #20903

optimizer should consider projections and stored values when selecting indexes #20903

Comments

AntoineAA commented May 7, 2024

My Environment

Component, Query & Data

Steps to reproduce

dothebart commented May 7, 2024

AntoineAA commented May 7, 2024

jsteemann commented May 10, 2024

AntoineAA commented May 13, 2024

dothebart commented May 14, 2024