Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cluster RemoteNode slowdowns query by 50 times comparing to single instance #20621

Open
Encouse opened this issue Feb 14, 2024 · 3 comments
Open

Comments

@Encouse
Copy link

Encouse commented Feb 14, 2024

My Environment

  • ArangoDB Version: 3.11.5

  • Deployment Mode: Cluster

  • Deployment Strategy: ArangoDB Starter

  • Configuration: arangodb_database_directory: "/var/lib/arangodb/"
    http__keep_alive_timeout: 15000
    cluster__default_replication_factor: 3
    cluster__system_replication_factor: 1
    query__cache_mode: "demand"
    query__tracking_with_bindvars: "false"
    rocksdb__pending_compactions_slowdown_trigger: 17179869184
    log__level: "debug"
    rocksdb__compaction_read_ahead_size: 0
    rocksdb__max_parallel_compactions: 16
    rocksdb__max_subcompactions: 16
    rocksdb__wal_archive_size_limit: 1073741824
    nossl: false

                                               Also in env args there's arango used memory override (we set it to 14GB, having 16GB of memory in each machine)
                                                
                                                (all parameters then formatted right as we use jinja template and ansible to deploy arango)
    
  • Infrastructure: own (5 agents on separate machines, 3 machines with db-server and coordinator on each)

  • Operating System: Ubuntu 20.04

  • Total RAM in your machine: 16GB

  • Disks in use: SSD

  • Used Package: Debian or Ubuntu .deb

Component, Query & Data

Affected feature: AQL subqueries

AQL query (if applicable):
Here we perform the join of interfaces to baremetal servers (id is hash and list of connected interfaces ids is link_hashes.net_interface)

FOR server IN baremetal
    LET interfaces = (FOR net IN net_interface FILTER net.hash IN server.link_hashes.net_interface RETURN net)
    RETURN {server, interfaces}

AQL explain and/or profile (if applicable):

Query String (166 chars, cacheable: false):
 FOR server IN baremetal
     LET interfaces = (FOR net IN net_interface FILTER net.hash IN server.link_hashes.net_interface 
 RETURN net)
     RETURN {server, interfaces}

Execution plan:
 Id   NodeType                  Site  Calls   Items   Filtered   Runtime [s]   Comment
  1   SingletonNode             DBS       1       1          0       0.00000   * ROOT
  2   EnumerateCollectionNode   DBS       7    6261          0       0.01628     - FOR server IN baremetal   /* full collection scan, 1 shard(s)  */
 14   RemoteNode                COOR     14    6261          0       0.38528       - REMOTE
 15   GatherNode                COOR     14    6261          0       0.00702       - GATHER   /* unsorted */
 20   SubqueryStartNode         COOR   6268   12522          0       0.03825       - LET interfaces = ( /* subquery begin */
 16   ScatterNode               COOR   6268   12522          0       0.05023         - SCATTER
 17   RemoteNode                DBS   12522   12522          0      11.21860         - REMOTE
 11   IndexNode                 DBS   12522   22731          0       0.61711         - FOR net IN net_interface   /* persistent index scan, index scan + document lookup, 1 shard(s) */    
 18   RemoteNode                COOR  12522   22731          0      11.01305           - REMOTE
 19   GatherNode                COOR   6286   22731          0       0.04732           - GATHER   /* unsorted */
 21   SubqueryEndNode           COOR   6268    6261          0       0.01915           - RETURN  net ) /* subquery end */
  9   CalculationNode           COOR   6268    6261          0       0.03066       - LET #7 = { "server" : server, "interfaces" : interfaces }   /* simple expression */   /* collections used: server : baremetal */
 10   ReturnNode                COOR   6268    6261          0       0.00005       - RETURN #7

Indexes used:
 By   Name                      Type         Collection      Unique   Sparse   Cache   Selectivity   Fields       Stored values   Ranges
 11   idx_1773683048075755520   persistent   net_interface   true     false    false      100.00 %   [ `hash` ]   [  ]            (net.`hash` IN server.`link_hashes`.`net_interface`)

Optimization rules applied:
 Id   RuleName
  1   use-indexes
  2   remove-filter-covered-by-index
  3   remove-unnecessary-calculations-2
  4   scatter-in-cluster
  5   remove-unnecessary-remote-scatter
  6   splice-subqueries

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Cache Hits/Misses   Filtered   Requests   Peak Mem [b]   Exec Time [s]
           0            0        6261        16470               0 / 0          0      12531       16515072        23.44577

Query Profile:
 Query Stage               Duration [s]
 initializing                   0.00000
 parsing                        0.00004
 optimizing ast                 0.00001
 loading collections            0.00000
 instantiating plan             0.00002
 optimizing plan                0.00024
 instantiating executors        0.00141
 executing                     23.44411
 finalizing                     0.00167

Warnings:
 Code   Message
 1563   in function 'SORTED_UNIQUE()': array expected
 1563   in function 'SORTED_UNIQUE()': array expected
 1563   in function 'SORTED_UNIQUE()': array expected
 1563   in function 'SORTED_UNIQUE()': array expected
 1563   in function 'SORTED_UNIQUE()': array expected
 1563   in function 'SORTED_UNIQUE()': array expected
 1563   in function 'SORTED_UNIQUE()': array expected
 1563   in function 'SORTED_UNIQUE()': array expected
 1563   in function 'SORTED_UNIQUE()': array expected
 1563   in function 'SORTED_UNIQUE()': array expected

Dataset: Dataset is private

Size of your Dataset on disk: 3GB

Replication Factor & Number of Shards (Cluster only): Replication factor - 3, number of shards - 1

Steps to reproduce

  1. Make a query profile or execute
  2. Wait till the end
  3. See that RemoteNode insanely slows down it all

Problem: Cluster performance of RemoteNode is poor, in single instance the same query executes 100 times faster

Expected result: Execution of query above in cluster should be not much slower comparing to a single instance with reasonable time for RemoteNode

@Encouse
Copy link
Author

Encouse commented Feb 14, 2024

In addition - look at RemoteNode with ID 14, it takes 0.3s for 6261 items, and RemoteNode with ID 17 takes 11.2s for 12522 items it's just unfair!

@mpoeter
Copy link
Member

mpoeter commented Feb 16, 2024

The difference is that RemoteNode with ID 14 performs only 14 calls to produce these 6261 items, while the RemoteNode 17 performs 12522 calls. Can you try to reformulate that query so that it does not use a subquery, e.g. using COLLECT?

@Encouse
Copy link
Author

Encouse commented Feb 22, 2024

And if I want to join many relations to one type just like
FOR server IN baremetal LET interfaces = (FOR net IN net_interface FILTER net.hash IN server.link_hashes.net_interface RETURN net) LET virtual_machines = (FOR vm IN virtual_machine FILTER vm.hash IN server.link_hashes.virtual_machine RETURN vm) RETURN {server, interfaces, virtual_machines}
Which means "I want all baremetal servers with their corresponding interfaces and virtual machines for each"

Maybe I'm wrong, but it seems not possible to reformulate the query in such way...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants