
Many small data files created for a table, unable to optimize the number of data files #21808

Open
jhatcher1 opened this issue May 2, 2024 · 2 comments
Labels
iceberg Iceberg connector


@jhatcher1

Description

Cluster: 1 coordinator, 3 workers
Trino version: 441
Connector: iceberg

Hello! I'm running a query to create a new Iceberg table from an existing Iceberg table, something like this:

CREATE TABLE catalog.schema.new_table
WITH (sorted_by = ARRAY['rowid'])
AS SELECT rowid, value FROM catalog.schema.existing_table;

The data is small after compression (< 100MB), so I expected there to be only one data file for the table, but I'm noticing that the data is being split across multiple data files:

SELECT file_format, file_size_in_bytes, record_count FROM catalog.schema."new_table$files";
 file_format | file_size_in_bytes | record_count
-------------+--------------------+--------------
 PARQUET     |            3594324 |      3530752
 PARQUET     |            4903661 |      4816896
 PARQUET     |            6421286 |      6307840
 PARQUET     |            8255345 |      8110080
 PARQUET     |           10539607 |     10354688
 PARQUET     |           13581970 |     13344768
 PARQUET     |           18151801 |     17836240
 PARQUET     |           27264124 |     26796031
(8 rows)

After this, I tried running:

ALTER TABLE catalog.schema.new_table EXECUTE optimize

as the docs mention:

All files with a size below the optional file_size_threshold parameter (default value for the threshold is 100MB) are merged:

However, even after running that command multiple times, it never merges below 3 files, despite them all being less than 100MB:

 file_format | file_size_in_bytes | record_count
-------------+--------------------+--------------
 PARQUET     |           39828589 |     39157755
 PARQUET     |           27067668 |     26609876
 PARQUET     |           25763435 |     25329664
(3 rows)
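For reference, the threshold can also be passed explicitly to the procedure; a minimal sketch using the documented parameter (the value shown is just the default):

-- Explicitly set the merge threshold (100MB is the documented default)
ALTER TABLE catalog.schema.new_table EXECUTE optimize(file_size_threshold => '100MB');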

Reproduction steps:

  1. Create a table with dummy data:
CREATE TABLE catalog.test_schema.test
WITH (sorted_by = ARRAY['rowid'])
AS SELECT rowid, rowid*2 AS value
FROM (
        SELECT sequential_number AS rowid
        FROM TABLE(sequence(start => 0, stop => 40000000))
);
  2. Observe that there are multiple small data files created:
SELECT * FROM catalog.test_schema."test$files";
  3. Run an optimization on the table:
ALTER TABLE catalog.test_schema.test EXECUTE optimize;
  4. Repeat step 2 and observe that there are still several data files smaller than 100MB for the table (a compact check is sketched after these steps).
  5. After running step 3 multiple times, there are still several data files smaller than 100MB.
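For steps 2 and 4, a quick way to see how many files remain below the threshold is an aggregate over the "$files" metadata table; this is just a sketch, with an approximate 100MB cut-off mirroring the documented default:

-- Count the data files still below (roughly) the 100MB optimize threshold
SELECT count(*) AS small_files, sum(file_size_in_bytes) AS total_bytes
FROM catalog.test_schema."test$files"
WHERE file_size_in_bytes < 100 * 1024 * 1024;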
@findinpath added the iceberg Iceberg connector label on May 16, 2024
@findinpath
Contributor

I tried to reproduce exactly the scenario you pointed out and initially had the following files:

content |                                                                            file_path                                                                            | file_format | record_count | file_size_in_bytes |       column_sizes   >
---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+--------------+--------------------+---------------------->
       0 | hdfs://hadoop-master:9000/user/hive/warehouse/t1-57ebe823af944510bbbaec9f2d745fed/data/20240516_081115_00000_t8p79-27602614-48a2-46eb-9baa-b6db886c2e23.parquet | PARQUET     |       874421 |            1752503 | {1=889537, 2=862660} >
       0 | hdfs://hadoop-master:9000/user/hive/warehouse/t1-57ebe823af944510bbbaec9f2d745fed/data/20240516_081115_00000_t8p79-3c85df99-c2ab-47d5-9253-e330fcb9c711.parquet | PARQUET     |      2262365 |            4518053 | {1=2301435, 2=2216305>
       0 | hdfs://hadoop-master:9000/user/hive/warehouse/t1-57ebe823af944510bbbaec9f2d745fed/data/20240516_081115_00000_t8p79-fc9ace1d-b5c5-4fc1-83f8-0f41ecdf966c.parquet | PARQUET     |      4015604 |            8040134 | {1=4084631, 2=3955190>
       0 | hdfs://hadoop-master:9000/user/hive/warehouse/t1-57ebe823af944510bbbaec9f2d745fed/data/20240516_081115_00000_t8p79-252522f7-225d-47ab-b119-8f9ccc692251.parquet | PARQUET     |      6275254 |           12568807 | {1=6382707, 2=6185785>
       0 | hdfs://hadoop-master:9000/user/hive/warehouse/t1-57ebe823af944510bbbaec9f2d745fed/data/20240516_081115_00000_t8p79-cb981908-7b63-4a96-a209-6882586fb0b2.parquet | PARQUET     |      9906127 |           19865711 | {1=10076133, 2=978926>
       0 | hdfs://hadoop-master:9000/user/hive/warehouse/t1-57ebe823af944510bbbaec9f2d745fed/data/20240516_081115_00000_t8p79-0bdc5507-edec-4ebd-8e05-6a4f5c09e142.parquet | PARQUET     |     16666230 |           33443003 | {1=16946156, 2=164965>
(6 rows)

After optimize, however, there was only one file:

trino> SELECT * from iceberg.default."t1$files";
 content |                                                                            file_path                                                                            | file_format | record_count | file_size_in_bytes |       column_sizes   >
---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+--------------+--------------------+---------------------->
       0 | hdfs://hadoop-master:9000/user/hive/warehouse/t1-57ebe823af944510bbbaec9f2d745fed/data/20240516_081901_00002_t8p79-03481c56-03ef-46da-9482-0bdc1c8f7eb4.parquet | PARQUET     |     40000001 |           80273843 | {1=40660676, 2=396128>
(1 row)

@jhatcher1
Author

After trying some things, I think it might be related to the number of workers in the cluster. When I scaled down to a single worker I was able to optimize down to one file, but with 3 workers I could only get it down to 3 files.
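One way to probe that hypothesis, purely as a sketch (the per-worker-writer explanation is an assumption, and writer-related property names can differ between Trino versions), is to inspect the writer session properties and re-run the procedure with writer scaling disabled:

-- Inspect writer-related session properties on this cluster
SHOW SESSION LIKE '%writer%';

-- Illustrative only: disable writer scaling, then re-run optimize and re-check "$files"
SET SESSION scale_writers = false;
ALTER TABLE catalog.schema.new_table EXECUTE optimize;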
