
Many small data files created for a table, unable to optimize the number of data files #21808

Open
jhatcher1 opened this issue May 2, 2024 · 2 comments
Labels
iceberg Iceberg connector


@jhatcher1

Description

Cluster: 1 coordinator, 3 workers
Trino version: 441
Connector: iceberg

Hello! I'm running a query to create a new Iceberg table from an existing Iceberg table, something like this:

CREATE TABLE catalog.schema.new_table
WITH (sorted_by = ARRAY['rowid'])
AS SELECT rowid, value FROM catalog.schema.existing_table;

The data is small after compression (< 100MB), so I expected there to be only one data file for the table, but I'm noticing that the data is being split across multiple data files:

SELECT file_format, file_size_in_bytes, record_count FROM catalog.schema."new_table$files";
 file_format | file_size_in_bytes | record_count
-------------+--------------------+--------------
 PARQUET     |            3594324 |      3530752
 PARQUET     |            4903661 |      4816896
 PARQUET     |            6421286 |      6307840
 PARQUET     |            8255345 |      8110080
 PARQUET     |           10539607 |     10354688
 PARQUET     |           13581970 |     13344768
 PARQUET     |           18151801 |     17836240
 PARQUET     |           27264124 |     26796031
(8 rows)

After this, I tried running:

ALTER TABLE catalog.schema.new_table EXECUTE optimize

as the docs mention:

All files with a size below the optional file_size_threshold parameter (default value for the threshold is 100MB) are merged:

However, even after running that command multiple times, it never merges below 3 files, despite them all being less than 100MB:

 file_format | file_size_in_bytes | record_count
-------------+--------------------+--------------
 PARQUET     |           39828589 |     39157755
 PARQUET     |           27067668 |     26609876
 PARQUET     |           25763435 |     25329664
(3 rows)
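For reference, the threshold can also be passed explicitly to the procedure; a minimal sketch using the documented parameter (the value shown is just the default):

-- Explicitly set the merge threshold (100MB is the documented default)
ALTER TABLE catalog.schema.new_table EXECUTE optimize(file_size_threshold => '100MB');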

Reproduction steps:

  1. Create a table with dummy data:
CREATE TABLE catalog.test_schema.test
WITH (sorted_by = ARRAY['rowid'])
AS SELECT rowid, rowid*2 AS value
FROM (
        SELECT sequential_number AS rowid
        FROM TABLE(sequence(start => 0, stop => 40000000))
);
  2. Observe that there are multiple small data files created:
SELECT * FROM catalog.test_schema."test$files";
  3. Run an optimization on the table:
ALTER TABLE catalog.test_schema.test EXECUTE optimize;
  4. Repeat step 2 and observe that there are still several data files smaller than 100MB for the table (a compact check is sketched after these steps).
  5. After running step 3 multiple times, there are still several data files smaller than 100MB.
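For steps 2 and 4, a quick way to see how many files remain below the threshold is an aggregate over the "$files" metadata table; this is just a sketch, with an approximate 100MB cut-off mirroring the documented default:

-- Count the data files still below (roughly) the 100MB optimize threshold
SELECT count(*) AS small_files, sum(file_size_in_bytes) AS total_bytes
FROM catalog.test_schema."test$files"
WHERE file_size_in_bytes < 100 * 1024 * 1024;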
@findinpath added the iceberg Iceberg connector label on May 16, 2024
@findinpath
Contributor

I tried to reproduce exactly the scenario you pointed out and initially had the following files:

content |                                                                            file_path                                                                            | file_format | record_count | file_size_in_bytes |       column_sizes   >
---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+--------------+--------------------+---------------------->
       0 | hdfs://hadoop-master:9000/user/hive/warehouse/t1-57ebe823af944510bbbaec9f2d745fed/data/20240516_081115_00000_t8p79-27602614-48a2-46eb-9baa-b6db886c2e23.parquet | PARQUET     |       874421 |            1752503 | {1=889537, 2=862660} >
       0 | hdfs://hadoop-master:9000/user/hive/warehouse/t1-57ebe823af944510bbbaec9f2d745fed/data/20240516_081115_00000_t8p79-3c85df99-c2ab-47d5-9253-e330fcb9c711.parquet | PARQUET     |      2262365 |            4518053 | {1=2301435, 2=2216305>
       0 | hdfs://hadoop-master:9000/user/hive/warehouse/t1-57ebe823af944510bbbaec9f2d745fed/data/20240516_081115_00000_t8p79-fc9ace1d-b5c5-4fc1-83f8-0f41ecdf966c.parquet | PARQUET     |      4015604 |            8040134 | {1=4084631, 2=3955190>
       0 | hdfs://hadoop-master:9000/user/hive/warehouse/t1-57ebe823af944510bbbaec9f2d745fed/data/20240516_081115_00000_t8p79-252522f7-225d-47ab-b119-8f9ccc692251.parquet | PARQUET     |      6275254 |           12568807 | {1=6382707, 2=6185785>
       0 | hdfs://hadoop-master:9000/user/hive/warehouse/t1-57ebe823af944510bbbaec9f2d745fed/data/20240516_081115_00000_t8p79-cb981908-7b63-4a96-a209-6882586fb0b2.parquet | PARQUET     |      9906127 |           19865711 | {1=10076133, 2=978926>
       0 | hdfs://hadoop-master:9000/user/hive/warehouse/t1-57ebe823af944510bbbaec9f2d745fed/data/20240516_081115_00000_t8p79-0bdc5507-edec-4ebd-8e05-6a4f5c09e142.parquet | PARQUET     |     16666230 |           33443003 | {1=16946156, 2=164965>
(6 rows)

After optimize, however, there was only one file:

trino> SELECT * from iceberg.default."t1$files";
 content |                                                                            file_path                                                                            | file_format | record_count | file_size_in_bytes |       column_sizes   >
---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+--------------+--------------------+---------------------->
       0 | hdfs://hadoop-master:9000/user/hive/warehouse/t1-57ebe823af944510bbbaec9f2d745fed/data/20240516_081901_00002_t8p79-03481c56-03ef-46da-9482-0bdc1c8f7eb4.parquet | PARQUET     |     40000001 |           80273843 | {1=40660676, 2=396128>
(1 row)

@jhatcher1
Author

After trying some things, I think it might be related to the number of workers in the cluster. When I scaled down to a single worker I was able to optimize down to one file, but with 3 workers I could only get it down to 3 files.
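One way to probe that hypothesis, purely as a sketch (the per-worker-writer explanation is an assumption, and writer-related property names can differ between Trino versions), is to inspect the writer session properties and re-run the procedure with writer scaling disabled:

-- Inspect writer-related session properties on this cluster
SHOW SESSION LIKE '%writer%';

-- Illustrative only: disable writer scaling, then re-run optimize and re-check "$files"
SET SESSION scale_writers = false;
ALTER TABLE catalog.schema.new_table EXECUTE optimize;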
