Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is there a way to disable query cache (both for CPU and GPU)? #26

Open
dongheuw opened this issue Jan 13, 2022 · 2 comments
Open

Is there a way to disable query cache (both for CPU and GPU)? #26

dongheuw opened this issue Jan 13, 2022 · 2 comments

Comments

@dongheuw
Copy link

dongheuw commented Jan 13, 2022

Hi there,

I'm benchmarking OmnisciDB via pyomnisci, is there a way to disable query cache (both for CPU and GPU) between runs for a single query (disabling omnisci to use any result of the previously run queries but letting it use the tables loaded in memory)?

Also, is there a way to set the number of CPU threads used during query execution (on CPU)? Is num-executors the correct flag to set?

Thanks so much,
Dong

@cdessanti
Copy link

Hi @dongheuw ,

Sorry for the extremely late response.

There isn't any proper automatic result cache in the database; the speedup you get between the first and subsequent runs is because of a series of factors. The most important ones are the following.

  1. all the data is loaded in the proper cache layer (on system ram for CPU execution, on VRAM for GPU execution).
  2. the llvm code that serves the query has been generated, compiled, and cached.
  3. The hash indexes have been calculated and cached in case a join is occurring.
  4. other minors.

To test that, try running some queries and changing the filter. As an example using the NYC Open data of yellow cab.

this query
heavysql> select vendorid,extract(month from tpep_pickup_datetime), sum(total_amount), avg(passenger_count) from yellow_tripdata where tpep_pickup_datetime between timestamp '2010-01-01 00:00:00' and timestamp '2011-01-01 00:00:00' group by 1,2;

┌────────┬──────┬───────────┬─────────────────┐
│vendorid│EXPR$1│   EXPR$2  │      EXPR$3     │
├────────┼──────┼───────────┼─────────────────┤
│VTS     │     4│88972158.69│2.040251185236914│
[CUT RESULTS]
│VTS     │    12│87473959.67│2.039379561238744│
│CMT     │    12│78540890.76│ 1.31927503873061│
└────────┴──────┴───────────┴─────────────────┘
 22 rows returned. Exec: 1781 ms, Total: 1875 ms 

it took 1781 ms to parse, optimize, generate the llvm code, read around 3GB data from disk to system memory, and execute the query.

heavysql> \memory_summary
HeavyDB Server CPU Memory Summary:
            MAX            USE      ALLOCATED           FREE
   51.411,41 MB    2.938,84 MB    8.192,00 MB    5.253,16 MB

Clearing the caches and re-running the query it takes 300ms less, because the 0plan generation step has been

heavysql> alter system clear cpu memory;
heavysql> select vendorid,extract(month from tpep_pickup_datetime), sum(total_amount), avg(passenger_count) from yellow_tripdata where tpep_pickup_datetime between timestamp '2010-01-01 00:00:00' and timestamp '2011-01-01 00:00:00' group by 1,2;
┌────────┬──────┬─────────────┬─────────────────┐
│vendorid│EXPR$1│    EXPR$2   │      EXPR$3     │
├────────┼──────┼─────────────┼─────────────────┤
│VTS     │     4│88.972.158,69│2,040251185236914│
[CUT RESULT]
│VTS     │    12│87.473.959,67│2,039379561238744│
│CMT     │    12│78.540.890,76│ 1,31927503873061│
└────────┴──────┴─────────────┴─────────────────┘
 22 rows returned. Exec: 1423 ms, Total: 1433 ms 

re-running the query with everything in the cache, takes just 1000ms less

heavysql> select vendorid,extract(month from tpep_pickup_datetime), sum(total_amount), avg(passenger_count) from yellow_tripdata where tpep_pickup_datetime between timestamp '2010-01-01 00:00:00' and timestamp '2011-01-01 00:00:00' group by 1,2;
[Omitted results]
22 rows returned. Exec: 731 ms, Total: 732 ms

If we change the filters, the query will reuse the plan and the data cached in memory, so it will take more or less the same

heavysql> select vendorid,extract(month from tpep_pickup_datetime), sum(total_amount), avg(passenger_count) from yellow_tripdata where tpep_pickup_datetime between timestamp '2010-01-02 00:00:00' and timestamp '2010-12-31 00:00:00' group by 1,2;
[Results omitted]
22 rows returned. Exec: 767 ms, Total: 768 ms 

At the time of writing, the number of cores used to run a query is determined by the number of fragments processed by the query. The default fragment size is 32 Million, so a table of 1B of rows will be of 31 fragments, and the queries will use up to 31 threads to run the queries.

The num-executor flag isn't related to the number of CPU threads used to run the query, and in the latest version of the software we added the option to run queries concurrently on CPU and on CPU and GPUs, so depending on the workload is better having less threads run per query rather than all the resources used for a single query.

If you have any further question, let me know.

Candido

@dongheuw
Copy link
Author

Thank you! I will let you know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants