A native parquet reader for primitive types #60361
Conversation
This is an automatic comment. The PR description does not match the template. Please edit it accordingly. The error is: More than one changelog category specified: 'Improvement', 'Performance Improvement'.
This is an automated comment for commit 3d7befe with a description of existing statuses. It is updated for the latest CI run ❌
Force-pushed from 5012933 to 5166d8b.
Well, I think these broken tests are not caused by my commit. UnitTestsAsan and ASTFuzzerTestAsan succeeded in the previous run, while the integration test failed because of a logical error. Hoping for further suggestions.
Yes, we need to start the investigation, and then check and fix every failure one by one.
@alexey-milovidov
@copperybean, please resolve conflicts.
Cool, thanks for working on this!
Lots of review comments, mostly unimportant. The only blockers are:
- Seems that TIMESTAMP types will be incorrectly interpreted as seconds, please test.
- The block_missing_values_ptr thing, please test. If it doesn't work, at least add a TODO so we don't forget.
- Pass file metadata to ParquetFileReader::Open to avoid re-reading it for each row group.
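For the third blocker, a minimal sketch of what reusing the footer could look like, assuming the Apache Arrow parquet-cpp API (parquet::ReadMetaData and the metadata overload of ParquetFileReader::Open); the function names here are illustrative, not from the PR:

#include <memory>
#include <arrow/io/interfaces.h>
#include <parquet/file_reader.h>
#include <parquet/metadata.h>
#include <parquet/properties.h>

// Read the footer once per file...
std::shared_ptr<parquet::FileMetaData> readFooterOnce(
    const std::shared_ptr<arrow::io::RandomAccessFile> & source)
{
    return parquet::ReadMetaData(source);
}

// ...and pass it to every subsequent Open call, so the footer is not
// parsed again for each row group.
std::unique_ptr<parquet::ParquetFileReader> openWithCachedFooter(
    std::shared_ptr<arrow::io::RandomAccessFile> source,
    std::shared_ptr<parquet::FileMetaData> metadata)
{
    return parquet::ParquetFileReader::Open(
        std::move(source), parquet::default_reader_properties(), std::move(metadata));
}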
/// If defaults_for_omitted_fields is true, calculate the default values from default expression for omitted fields.
/// Otherwise fill the missing columns with zero values of its type.
BlockMissingValues * block_missing_values_ptr = format_settings.defaults_for_omitted_fields ? &res.block_missing_values : nullptr;
row_group_batch.arrow_column_to_ch_column->arrowTableToCHChunk(res.chunk, *tmp_table, (*tmp_table)->num_rows(), block_missing_values_ptr);
As this comment says, block_missing_values_ptr is used for applying default expressions to omitted fields (here "omitted" means NULLs, I think). The native reader doesn't populate block_missing_values_ptr, so it'll probably break default expressions. Please test this:
create table a (x Int64, y String default 'asd') engine Memory
insert into function file('t.parquet') select 1 as x, null::Nullable(String) as y
insert into a select * from file('t.parquet')
select * from a
┌─x─┬─y───┐
1. │ 1 │ asd │
└───┴─────┘
If it doesn't work, either populate block_missing_values_ptr based on null masks or at least add a TODO comment about that.
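If the first option is taken, a rough sketch of what populating it per column could look like, assuming the native reader already has the decoded null map at hand and that BlockMissingValues::setBit marks a missing value; the helper name and arguments are made up:

// Hypothetical helper, not from the PR: mark NULL rows of one column so that
// default expressions are applied to them later. Assumes a null map where
// 1 means the value is NULL.
static void fillMissingValuesFromNullMap(
    BlockMissingValues & missing_values,
    size_t column_idx,
    const PaddedPODArray<UInt8> & null_map)
{
    for (size_t row = 0; row < null_map.size(); ++row)
        if (null_map[row])
            missing_values.setBit(column_idx, row);
}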
Sorry, I missed this feature. I would like to add a TODO comment, because the block-missing-values logic will be modified again when supporting nested types, so it can be implemented later for all types.
IndividualVisitor && individual_visitor,
RepeatedVisitor && repeated_visitor);
A simpler way to handle nulls is to read all non-null values into a column, then use IColumn::expand to insert default values according to null mask. A little slower because of copying.
I don't mind the current approach, just mentioning the alternative in case it turns out better for the general case of arbitrarily nested arrays, nullables, and tuples.
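For reference, a hedged sketch of that alternative (assuming IColumn::expand fills rows where the mask is 0 with default values; nested and null_map are illustrative names, not from the PR):

// Illustrative only, not the PR's code: decode just the non-null values into
// `nested`, then expand it to the full row count using the null map.
IColumn::Filter not_null_mask(null_map.size());
for (size_t i = 0; i < null_map.size(); ++i)
    not_null_mask[i] = null_map[i] ? 0 : 1;   // 1 = a real value was decoded for this row

// `nested` holds only the decoded non-null values, in order; expand() inserts
// default values at the positions where the mask is 0.
nested->expand(not_null_mask, /*inverted=*/ false);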
OK, I'd prefer to keep the current implementation for now; maybe I will change the logic as you suggest when supporting nested columns.
visitColStrIndexType(dictionary->size(), [&]<typename TColVec>(TColVec *)
{
    const TColVec & col_src = *static_cast<const TColVec *>(col_existing.get());
    reserveColumnStrRows(column, reading_rows_num);

    col_dest.getOffsets().resize(col_src.size());
    for (size_t i = 0; i < col_src.size(); i++)
    {
        auto src_idx = col_src.getData()[i];
        if (0 == src_idx)
        {
            null_map->setNull(i);
        }
        auto dict_chars_cursor = col_dict_str.getOffsets()[src_idx - 1];
        auto str_len = col_dict_str.getOffsets()[src_idx] - dict_chars_cursor;
        auto dst_chars_cursor = col_dest.getChars().size();
        col_dest.getChars().resize(dst_chars_cursor + str_len);

        memcpySmallAllowReadWriteOverflow15(
            &col_dest.getChars()[dst_chars_cursor], &col_dict_str.getChars()[dict_chars_cursor], str_len);
        col_dest.getOffsets()[i] = col_dest.getChars().size();
    }
});
Would it be easier to create a ColumnLowCardinality and call convertToFullColumn() on it?
Well, it would be easier, but the current implementation is more efficient, even if not by much: the new column generated by convertToFullColumn is smaller than reading_rows_num, so it would have to be resized to reading_rows_num again.
By the way, I may implement the logic as you suggest next time. For the current code, I'd prefer not to change it right now; instead, I can add some comments.
Thanks for the comments, I will fix them this weekend.
Change-Id: I83a8ec8271edefcd96cb5b3bcd12f6b545d9dec0
Change-Id: I38b8368b022263d9a71cb3f3e9fdad5d6ca26753
Change-Id: If79741b7456667a8dde3e355d9dc684c2dd84f4f
This reverts commit 5df94b7.
Force-pushed from 5166d8b to 718fa1c.
…18 tidy warnings Change-Id: I3119c44dc764caed0dc471f52ac5e634c75c7b50
Force-pushed from ccf801c to 5d848aa.
Confusing error message when trying to read an array:
Fix for the error message:
case parquet::Type::BYTE_ARRAY:
    return fromByteArray();
case parquet::Type::FIXED_LEN_BYTE_ARRAY:
    return fromFLBA();
It appears that the Bool and FLBA (fixed string) types are not currently supported. Are there any plans to add support for these types in the future?
Bool types are distinct due to their 1-bit width, while ClickHouse uses a UInt8 column vector.
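For context, a hedged sketch of the extra step Bool would need: Parquet PLAIN-encodes booleans bit-packed (LSB first), while ClickHouse wants one byte per row, so decoding has to unpack each bit; the function name is made up.

#include <cstddef>
#include <cstdint>
#include <vector>

// Unpack PLAIN-encoded Parquet booleans (1 bit per value, LSB first)
// into one byte per row, as a ClickHouse UInt8 column would store them.
void unpackPlainBooleans(const uint8_t * packed, size_t rows, std::vector<uint8_t> & out)
{
    out.resize(rows);
    for (size_t i = 0; i < rows; ++i)
        out[i] = (packed[i / 8] >> (i % 8)) & 1;
}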
With the current framework, supporting these types is not difficult. My plan is to support nested types after this PR, then support all Parquet types and enable the 02735_parquet_encoder test.
Change-Id: Iefec91bca223eedaabe302b7891808c6d94eed9d
Force-pushed from ddaf172 to ad5f6f2.
Change-Id: Iff9f5f894e815b184ac35f61b4cac87908c612b5
Force-pushed from dd95beb to b253ca3.
It seems that the ClickHouse special build check failed because of OOM?
It's broken in master; waiting for #64204 (and maybe something else, that one only fixes one of the two OOMing builds).
A native parquet reader for primitive types
ee3e7f2
@@ -93,6 +93,7 @@ static std::map<ClickHouseVersion, SettingsChangesHistory::SettingsChanges> sett
    {"cross_join_min_bytes_to_compress", 0, 1_GiB, "A new setting."},
    {"http_max_chunk_size", 0, 0, "Internal limitation"},
    {"prefer_external_sort_block_bytes", 0, DEFAULT_BLOCK_SIZE * 256, "Prefer maximum block bytes for external sort, reduce the memory usage during merging."},
    {"input_format_parquet_use_native_reader", false, false, "When reading Parquet files, to use native reader instead of arrow reader."},
This should go to 24.6
What a great feature! @copperybean Will you support complex types? @al13n321 Do you have plans to support complex types to make the native parquet reader production ready?
@copperybean it's a cool feature. Do you have plans to support the page index? In that case, we need to skip reading some rows.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
A native parquet reader, which can read parquet binary into ClickHouse Columns directly. This feature can be activated by setting input_format_parquet_use_native_reader to true.

Currently, a parquet file is read through the arrow library: it is read into an arrow table first, and then the arrow table is copied into ClickHouse Columns. There are some shortcomings in performance. For example, the definition and repetition levels are decoded into an int16_t array first, and then the nullability-related valid_bits_ is generated; whereas the null_map and offsets (not included this time) can be generated directly while reading the definition and repetition levels. This feature was first implemented in the product BMR of Baidu AI Cloud, where it has been fully tested.
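To illustrate that point, a hedged sketch of deriving the null map straight from definition levels for a flat nullable column (max_def_level and the other names are illustrative, and nested repetition is ignored):

#include <cstddef>
#include <cstdint>
#include <vector>

// For a non-nested nullable column, the definition level already tells
// whether a value is present, so the null map can be filled directly while
// decoding the levels, with no intermediate valid_bits_ bitmap.
void fillNullMapFromDefLevels(const int16_t * def_levels, size_t rows,
                              int16_t max_def_level, std::vector<uint8_t> & null_map)
{
    null_map.resize(rows);
    for (size_t i = 0; i < rows; ++i)
        null_map[i] = def_levels[i] < max_def_level ? 1 : 0;   // 1 marks a NULL row
}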
Performance Test
As a result, the performance of the current implementation shows an obvious speedup. To generate the test data, a parquet file src.parquet of the TPC-DS table store_sales at scale 5000 is used; there are 35207247 rows in this file. Next, the test data is generated with the following query. Then the performance is tested with the following command.
For each field, two tests are triggered with different input_format_parquet_use_native_reader settings, and a single thread is used. The parquet reading duration is counted as the commit log duration while reading parquet. The CPU model used in this test is Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz. The test result is detailed in the following table.
Documentation entry for user-facing changes