The current approach to reading Hive external tables involves three steps:

1. Retrieve all partitions from the HMS.
2. Fetch all data files from the partition directories.
3. Send the data files to the workers.
This approach can result in unbalanced IO costs among workers because data file sizes vary. In addition, the worker side may need to perform data filtering before reading.
To address these issues, a proposed solution is to dynamically distribute the data files among workers. The server would divide the data files into roughly equal-sized slices and the workers would request these slices during data reading. The worker should keep requesting task slices from the server until all tasks are done. The server may preallocate a task queue for each worker to achieve better cache locality. If the allocated task queue is finished, the worker is allowed to steal tasks from other workers.
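The queue layout and stealing policy described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the function names, the round-robin placement of slices, and the steal-from-the-longest-queue policy are all assumptions.

```python
from collections import deque

def preallocate_queues(slices, num_workers):
    """Round-robin slices into per-worker queues (assumed placement;
    the real server could use locality-aware placement instead)."""
    queues = [deque() for _ in range(num_workers)]
    for i, s in enumerate(slices):
        queues[i % num_workers].append(s)
    return queues

def next_slice(queues, worker_id):
    """Serve a worker from its own queue first; when that queue is
    exhausted, steal from the longest remaining queue."""
    own = queues[worker_id]
    if own:
        return own.popleft()
    victim = max(queues, key=len)
    if victim:
        return victim.pop()  # steal from the tail to reduce contention
    return None  # all tasks are done
```

Popping stolen tasks from the tail (while the owner pops from the head) is a common way to keep the owner and the thief from contending for the same slices.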
For example, the server divides the data files into 64 MB slices, and each worker is allowed to read the row groups that start within its slice's range, avoiding single-point stress.
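The 64 MB slicing and the start-within-range rule can be sketched like this (function names are hypothetical; row-group start offsets would come from the file footer in practice). Because every row group starts in exactly one slice, each row group is read by exactly one worker even though slice boundaries cut across row groups.

```python
SLICE_BYTES = 64 * 1024 * 1024  # 64 MB, as in the example above

def split_into_slices(file_len, slice_bytes=SLICE_BYTES):
    """Divide a data file into fixed-size byte ranges [start, end)."""
    return [(start, min(start + slice_bytes, file_len))
            for start in range(0, file_len, slice_bytes)]

def row_groups_for_slice(row_group_offsets, slice_range):
    """A worker reads exactly the row groups whose starting offset
    falls inside its slice."""
    start, end = slice_range
    return [off for off in row_group_offsets if start <= off < end]
```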
Tasks:

- Workers request tasks from the server dynamically.
- The server performs the HDFS listing asynchronously.
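The asynchronous-listing task could look roughly like the sketch below: a background thread lists partition directories and pushes files into a queue, so the server can start handing out slices before the full listing completes. The `list_dir` callable stands in for the actual HDFS listing call; all names here are assumptions for illustration.

```python
import queue
import threading

def list_files_async(partitions, list_dir, out_queue):
    """List partition directories on a background thread, pushing each
    file into out_queue; a None sentinel marks the end of the listing."""
    def run():
        for p in partitions:
            for f in list_dir(p):
                out_queue.put(f)
        out_queue.put(None)  # sentinel: listing complete
    t = threading.Thread(target=run, daemon=True)
    t.start()
    return t

def drain(out_queue):
    """Consume files as they arrive, until the sentinel is seen."""
    files = []
    while (f := out_queue.get()) is not None:
        files.append(f)
    return files
```

In the real server, the consumer would slice each file as it arrives rather than collect the whole list first.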