RFC: Hive distributed processing #220

Open
2 tasks
Clark0 opened this issue Apr 11, 2023 · 1 comment · May be fixed by #204
Labels
enhancement New feature or request

Comments

Clark0 (Collaborator) commented Apr 11, 2023

Enhancement

The current approach to reading Hive external tables involves three steps:

  1. Retrieving all partitions from the HMS
  2. Fetching all data files from the partition directory
  3. Sending the data files to the workers.

This approach can result in unbalanced IO costs among workers because data file sizes vary. In addition, workers may filter out data before reading it.

To address these issues, a proposed solution is to dynamically distribute the data files among workers. The server would divide the data files into roughly equal-sized slices and the workers would request these slices during data reading. The worker should keep requesting task slices from the server until all tasks are done. The server may preallocate a task queue for each worker to achieve better cache locality. If the allocated task queue is finished, the worker is allowed to steal tasks from other workers.
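
The scheme above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's implementation: `Slice`, `split_into_slices`, `SliceScheduler`, and `next_slice` are hypothetical names, preallocation keeps all slices of one file on the same worker queue as a stand-in for cache locality, and a worker with an empty queue steals from the tail of the longest remaining queue.

```python
import threading
from collections import deque
from dataclasses import dataclass
from typing import Optional

SLICE_SIZE = 64 * 1024 * 1024  # assumed slice size in bytes


@dataclass(frozen=True)
class Slice:
    path: str
    start: int  # inclusive byte offset
    end: int    # exclusive byte offset


def split_into_slices(files, slice_size=SLICE_SIZE):
    """Cut each (path, size) pair into consecutive slices of at most slice_size bytes."""
    slices = []
    for path, size in files:
        for start in range(0, size, slice_size):
            slices.append(Slice(path, start, min(start + slice_size, size)))
    return slices


class SliceScheduler:
    """Hypothetical server-side scheduler with one preallocated queue per worker."""

    def __init__(self, slices, num_workers):
        self._lock = threading.Lock()
        self.queues = [deque() for _ in range(num_workers)]
        owner = {}  # file path -> worker index, assigned round-robin per file
        for s in slices:
            w = owner.setdefault(s.path, len(owner) % num_workers)
            self.queues[w].append(s)

    def next_slice(self, worker_id) -> Optional[Slice]:
        """Return the worker's next slice, stealing if its own queue is empty."""
        with self._lock:
            own = self.queues[worker_id]
            if own:
                return own.popleft()
            # Steal from the tail of the longest queue so the victim keeps
            # the slices it would have read next.
            victim = max(self.queues, key=len)
            if victim:
                return victim.pop()
            return None  # all tasks are done
```

A worker would call `next_slice(worker_id)` in a loop and stop once it returns `None`.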

For example, the server divides the data files into slices of 64 MB, and each worker reads only the row groups that start within its slice's range. This avoids stress on a single point.
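
As a rough sketch of the row-group rule, assuming the worker already knows the starting byte offset of each row group in the file (the function name `row_groups_for_slice` and the offsets below are made up for illustration):

```python
MB = 1024 * 1024


def row_groups_for_slice(row_group_offsets, slice_start, slice_end):
    """Return indexes of row groups whose start offset lies in [slice_start, slice_end)."""
    return [
        i
        for i, offset in enumerate(row_group_offsets)
        if slice_start <= offset < slice_end
    ]


# Hypothetical 150 MB file with four row groups and 64 MB slices.
offsets = [4, 50 * MB, 100 * MB, 140 * MB]
first = row_groups_for_slice(offsets, 0, 64 * MB)
second = row_groups_for_slice(offsets, 64 * MB, 128 * MB)
third = row_groups_for_slice(offsets, 128 * MB, 150 * MB)
```

A row group that starts inside a slice but extends past its end is still read in full by the worker holding that slice, so every row group is read exactly once across all slices.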

Tasks:

  • Worker requests tasks from the server dynamically
  • Server can do the HDFS listing asynchronously
Clark0 added the enhancement label Apr 11, 2023
Clark0 (Collaborator, Author) commented Apr 11, 2023

Clark0 linked a pull request Apr 11, 2023 that will close this issue
hustnn assigned hustnn and unassigned hustnn Jul 5, 2023
hustnn mentioned this issue Jul 26, 2023