
TPC-H compression seems not to work well #2430

Open
louishust opened this issue Dec 15, 2021 · 24 comments

@louishust

louishust commented Dec 15, 2021

I want to test the memory usage for TPC-H.
I used hyriseConsole to run the TPC-H data generation:

(release)> generate_tpch 5
Generating all TPCH tables (this might take a while) ...
- Loading/Generating tables
- Loading/Generating tables done (54 s 805 ms)
- Encoding tables (if necessary) and generating pruning statistics
-  Encoding 'nation' - encoding applied (375 µs 162 ns)
-  Encoding 'region' - encoding applied (375 µs 872 ns)
-  Encoding 'supplier' - encoding applied (689 ms 556 µs)
-  Encoding 'customer' - encoding applied (1 s 644 ms)
-  Encoding 'partsupp' - encoding applied (2 s 104 ms)
-  Encoding 'part' - encoding applied (2 s 669 ms)
-  Encoding 'orders' - encoding applied (12 s 411 ms)
-  Encoding 'lineitem' - encoding applied (1 min 6 s)
- Encoding tables and generating pruning statistic done (1 min 6 s)
- Adding tables to StorageManager and generating table statistics
-  Added 'nation' (952 µs 576 ns)
-  Added 'region' (1 ms 118 µs)
-  Added 'supplier' (132 ms 589 µs)
-  Added 'customer' (2 s 34 ms)
-  Added 'part' (2 s 910 ms)
-  Added 'partsupp' (10 s 134 ms)
-  Added 'orders' (21 s 328 ms)
-  Added 'lineitem' (1 min 9 s)
- Adding tables to StorageManager and generating table statistics done (1 min 9 s)
- No indexes created as --indexes was not specified or set to false

The memory used:

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 25958 root      20   0  152.1g  12.6g  18432 S   0.0  2.5  29:00.69 hyriseConsole

So the data is about 5 GB, but the memory usage is 12.6 GB.

Then I modified the default encoding type from Dictionary to LZ4 in the generate_and_store function:

--- a/src/benchmarklib/abstract_table_generator.cpp
+++ b/src/benchmarklib/abstract_table_generator.cpp
@@ -207,11 +207,12 @@ void AbstractTableGenerator::generate_and_store() {
     for (auto& table_info_by_name_pair : table_info_by_name) {
       const auto& table_name = table_info_by_name_pair.first;
       auto& table_info = table_info_by_name_pair.second;
+      auto encoding_config = EncodingConfig{SegmentEncodingSpec{EncodingType::LZ4}};

       const auto encode_table = [&]() {
         Timer per_table_timer;
         table_info.re_encoded =
-            BenchmarkTableEncoder::encode(table_name, table_info.table, _benchmark_config->encoding_config);
+            BenchmarkTableEncoder::encode(table_name, table_info.table, encoding_config);
         auto output = std::stringstream{};
         output << "-  Encoding '" + table_name << "' - "
                << (table_info.re_encoded ? "encoding applied" : "no encoding necessary") <<

The memory used:

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 57645 root      20   0  153.4g   4.3g  19200 S   0.0  0.8  94:18.67 hyriseConsole

I want to reduce the memory usage as much as possible.
AFAIK, many column-store databases can reach a high compression ratio.
So is there any best practice for memory tuning?

@Bouncner
Collaborator

Well, there are different things to discuss here. First a question: the 12.5 GB for SF 5, is this the max RSS you have seen?

We do have a rather significant footprint during data ingestion, for TPC-H usually a factor of 2.5 of the actual data. This applies only to data loading, as we create histograms in parallel and (admittedly) are not very resource-sensitive here.
So there is definitely room for improvement, but it has not been our focus recently, as it does not impact the actual benchmarks or a running Hyrise server.

If you care about the data footprint of the running database, I just want to mention that there are various data structures to look at. One of them is secondary indexes, which can be very large (but we don't really use them in TPC-H and they are not created by default). There are data statistics (this includes the mentioned histograms, but those are mostly in the KB to low-MB range after creation). And there are various caches that can grow over time (e.g., the caches for the physical/logical query plans).
They all contribute.

Now to the actual data. If you want to examine the table data footprint, you can run ./hyriseServer --benchmark_data TPC-H:5 to start the server with TPC-H data, use a two-liner in Python to connect to Hyrise (e.g., using psycopg2), and then call select * from meta_segments_accurate to get a full overview of the data consumption.
LZ4 is very well suited for infrequently accessed string data, but less so for integer data, because Hyrise's implementation has a few optimizations for single accesses (e.g., zstd dictionaries and a block-based splitting of LZ4 segments) that are geared towards string data. If you compress integer columns with LZ4, the added data structures might be larger than the actual compressed integer data (they usually don't matter when we talk about strings, though).
A better choice (depending on the data and workload) is usually dictionary encoding (for few distinct values) or frame-of-reference encoding with bit-packing. You can pass an encoding config that selects the encodings per data type (see --full_help on a benchmark binary).
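A minimal sketch of the Python route mentioned above, assuming psycopg2 is installed and hyriseServer is listening on localhost:5432 (additional connection parameters such as dbname or user may be needed depending on the client setup):

# Sketch: connect to a running hyriseServer via psycopg2 and query the total
# segment size from the meta_segments_accurate meta table.
import psycopg2

connection = psycopg2.connect(host="localhost", port=5432)  # assumed default port
connection.autocommit = True
cursor = connection.cursor()

cursor.execute("SELECT SUM(size_in_bytes) FROM meta_segments_accurate")
total_bytes = cursor.fetchone()[0]
print(f"stored table data: {total_bytes / 1024**3:.2f} GiB")

cursor.close()
connection.close()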

We are actively researching this area; it is a super interesting field. In case you have any questions, you can also send me an email and I can share the most recent work with you (it's not yet published).

@Bouncner
Collaborator

Oh, btw: it's not yet in the master branch, but we also support FSST (https://www.vldb.org/pvldb/vol13/p2649-boncz.pdf) encoding. It can be directly merged if you're interested in it.

@Bouncner
Collaborator

(release)> select sum(size_in_bytes) from meta_segments_accurate;
=== Columns
|SUM(size_in_bytes)|
|              long|
|              null|
=== Chunk 0 ===
|<ValueS>          |
|        5107316678|

That's the data for SF 5. For TPC-H with dictionary compression (without any further compression, no bit-packing), the string data often takes more bytes than the raw data, while many integer columns are compressed by a factor of 10x.
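To see which tables and data types dominate that total, the same meta table can be grouped. A sketch under the assumption that meta_segments_accurate also exposes table_name and column_data_type columns (a plain select * shows the actual schema):

# Sketch: group segment sizes by table and data type to see where the bytes go.
import psycopg2

with psycopg2.connect(host="localhost", port=5432) as connection:
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT table_name, column_data_type, SUM(size_in_bytes) "
            "FROM meta_segments_accurate "
            "GROUP BY table_name, column_data_type "
            "ORDER BY SUM(size_in_bytes) DESC"
        )
        for table_name, data_type, size_in_bytes in cursor.fetchall():
            print(f"{table_name:10} {data_type:8} {size_in_bytes / 1024**3:7.2f} GiB")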

@louishust
Author

louishust commented Dec 16, 2021

First a question: the 12.5 GB for SF 5, is this the max RSS you have seen?

There are three phases:

  1. Loading/generating tables: RSS about 18 GB
  2. Encoding tables (if necessary) and generating pruning statistics: RSS dropped below 12.5 GB (I did not record the exact number; see the polling sketch below for one way to capture exact peaks)
  3. Adding tables to StorageManager and generating table statistics: RSS first rose above 12.5 GB and then dropped back to 12.5 GB
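A minimal sketch of how such RSS peaks could be recorded instead of reading them off top, assuming the psutil package is installed and the hyriseConsole PID is passed as the first argument:

# Sketch: poll the resident set size (RSS) of a process once per second and
# report the observed peak when the process exits (or on Ctrl-C).
import sys
import time

import psutil

process = psutil.Process(int(sys.argv[1]))
peak_rss = 0

try:
    while process.is_running():
        peak_rss = max(peak_rss, process.memory_info().rss)
        time.sleep(1)
except (psutil.NoSuchProcess, KeyboardInterrupt):
    pass

print(f"peak RSS: {peak_rss / 1024**3:.1f} GiB")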

@Bouncner
Collaborator

Yeah, I just verified it and it pretty much looks the same on my machine.
After all the data loading is done, the console has around 6.5 GB of data loaded. I haven't looked into where the 1.4 GB on top of the table data comes from, but I wouldn't consider it problematic for now.

Is this overhead of interest for you?

[image]

@Bouncner
Collaborator

In case you want to check what exactly is going on, colleagues have been successfully using Heaptrack for similar tasks: https://milianw.de/blog/heaptrack-a-heap-memory-profiler-for-linux.html

@louishust
Author

Thanks for the help.

Hyrise is an in-memory database, so memory usage is very important for us.
As you know, memory is not as large and cheap as disk.
If the memory usage cannot be reduced to a reasonable ratio, we cannot use it in production.
For example, we have 1 TB of data, which can be stored on a 10 TB disk but not in 512 GB of RAM.
Or does Hyrise have a cluster mode that can use more than one machine's memory?

@Bouncner
Collaborator

Yes, the memory footprint is definitely of high importance. 1 TB of raw data can easily be stored in 512 GB (assuming TPC-H data), but not with the default settings of Hyrise. We are actually even a bit more space-efficient here than other databases (e.g., Umbra, MonetDB, DuckDB). Using "a bit" more storage for TPC-H appears to be the norm (commercial database systems are better in this regard; SAP HANA uses ~4 GB for SF 10).
If you need to store 1 TB in 512 GB of RAM, you need to decide whether you want to (i) use some form of tiering or (ii) provide a compression configuration. We have automated means for that, but they are not in master (with Hyrise being a research database that's also used for teaching).

What's even more important here is the fact that it's not only about the base data. If you want to support multiple concurrent clients, you also need DRAM capacity for query processing. Most databases thus recommend using only half of the DRAM for data and keeping the rest available for query processing.

We don't have a cluster mode, but we are currently working on using disaggregated memory in Hyrise. That is one potential form of data tiering besides tiering to SSD/HDD.

@louishust
Author

Actually, we want to run TPC-H with SF 1000 on 512 GB of RAM.
I am not sure what form of tiering you mean above. Does it require special hardware?
A compression configuration is OK for us; which branch is suitable?

@Bouncner
Collaborator

There are branches that tier segments to secondary storage (e.g., SSD), but they are not in a state where we would recommend using them. That might look different in about 6-8 weeks.

I might have a branch that suits your needs (it will take a while to get all of that into master). I'll try to send you some information in mid-January, OK?

@louishust
Author

OK, thank you very much

@Bouncner
Collaborator

I have to postpone the information to the end of January. I cannot access the test server right now and am going on vacation for a week. I hope that still works for you.

@louishust
Author

It's ok

@Bouncner
Collaborator

Bouncner commented Feb 9, 2022

Hey @louishust, can you tell me a little bit about what you are planning to do?
Do you have a script that connects to a database via the PostgreSQL protocol and issues TPC-H queries? Or would TPC-H benchmark binaries that produce a JSON with runtimes be fine?

We have a branch that should work sufficiently well on a 512 GB machine (still testing it), but depending on what you are planning to do, I'd need to adjust a few more things.

@louishust
Author

Hi @Bouncner ,

We are trying ClickHouse to see if TPC-H 1 TB can run in 512 GB of RAM.

We use hyriseConsole to generate the TPC-H data, and use psql to connect to Hyrise and issue the TPC-H queries.

When the new branch is ready for testing, we can try it.

BTW, does Hyrise support the MySQL protocol?

@Bouncner
Collaborator

We extended the server to encode the data during TPC-H's data generation. Calling ./hyriseServer --benchmark_data TPCH:1000 will be sufficient then (and it already is now; there is no need to use the console).
Further, we no longer build full histograms. Both of those should help a lot.

I am running a few tests to check whether the current approach works fine. I'll keep you posted.

BTW, does Hyrise support the MySQL protocol?

Unfortunately, no. So far, we only support the PostgreSQL protocol.

@louishust
Author

OK

@Bouncner
Collaborator

Bouncner commented Mar 2, 2022

Just to keep you up to date: I am working on memory leaks. It's a bit trickier than I thought.

@louishust
Author

It's ok

@Bouncner
Collaborator

I think it should be running now. I haven't seen any leaks and am testing large scale factors now.
If you're still interested in testing Hyrise with SF 1000, I can post instructions here.

The approach that works is based on a current branch with which we study encoding schemes (a rather memory-efficient configuration has to be used, as Hyrise's default dictionary encoding does not compress long strings well) and that has some modifications for very large scale factors (e.g., encoding the data while the TPC-H data generation is still running).

@louishust
Author

Glad to hear that.
I am still interested in testing Hyrise with SF 1000 on a single machine.

@Bouncner
Collaborator

I guess I have to give up for now. SF 1000 is running on Hyrise, but without reworking the aggregate operator, we will not run on a 512 GB machine. Data loading and compression are not an issue (I have a working branch that gets the data to <400 GB and loads the data concurrently), but when running the queries, the memory consumption exceeds 700 GB with many threads.

Sorry that it took so long to arrive at this unsatisfying answer.

@mweisgut
Collaborator

I have a working branch that gets the data to <400 GB and loads the data concurrently

Could you link that branch here?

@Bouncner
Collaborator

Bouncner commented Aug 18, 2022

The branch is: https://github.com/hyrise/hyrise/tree/martin/sf1000

You can execute TPC-H SF 1000 the following way: ./cmake-build-release/hyriseBenchmarkTPCH -s 1000 --scheduler --data_preparation_cores 8 --encoding simple__LPCompressionSelection_3148260871.json

The encoding configuration is required to reduce the data set size; it's committed to the branch. The data preparation cores are used to limit the concurrency of the encoding. Otherwise, all available cores would encode the 1000 GB of data at once, which is too much for a 512 GB system when the server has many cores.
