The current approach to reading Hive external tables involves three steps:

1. Retrieve all partitions from the HMS.
2. Fetch all data files from the partition directories.
3. Send the data files to the workers.
This approach can result in unbalanced IO costs among workers because data file sizes vary. In addition, the worker side may need to perform data filtering before reading.
To address these issues, a proposed solution is to dynamically distribute the data files among workers. The server would divide the data files into roughly equal-sized slices and the workers would request these slices during data reading. The worker should keep requesting task slices from the server until all tasks are done. The server may preallocate a task queue for each worker to achieve better cache locality. If the allocated task queue is finished, the worker is allowed to steal tasks from other workers.
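The queue layout and stealing policy described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the function names, the round-robin placement of slices, and the steal-from-the-longest-queue policy are all assumptions.

```python
from collections import deque

def preallocate_queues(slices, num_workers):
    """Round-robin slices into per-worker queues (assumed placement;
    the real server could use locality-aware placement instead)."""
    queues = [deque() for _ in range(num_workers)]
    for i, s in enumerate(slices):
        queues[i % num_workers].append(s)
    return queues

def next_slice(queues, worker_id):
    """Serve a worker from its own queue first; when that queue is
    exhausted, steal from the longest remaining queue."""
    own = queues[worker_id]
    if own:
        return own.popleft()
    victim = max(queues, key=len)
    if victim:
        return victim.pop()  # steal from the tail to reduce contention
    return None  # all tasks are done
```

Popping stolen tasks from the tail (while the owner pops from the head) is a common way to keep the owner and the thief from contending for the same slices.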
For example, the server divides the data files into 64 MB slices, and each worker is allowed to read the row groups that start within its slice's range, avoiding single-point stress.
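The 64 MB slicing and the start-within-range rule can be sketched like this (function names are hypothetical; row-group start offsets would come from the file footer in practice). Because every row group starts in exactly one slice, each row group is read by exactly one worker even though slice boundaries cut across row groups.

```python
SLICE_BYTES = 64 * 1024 * 1024  # 64 MB, as in the example above

def split_into_slices(file_len, slice_bytes=SLICE_BYTES):
    """Divide a data file into fixed-size byte ranges [start, end)."""
    return [(start, min(start + slice_bytes, file_len))
            for start in range(0, file_len, slice_bytes)]

def row_groups_for_slice(row_group_offsets, slice_range):
    """A worker reads exactly the row groups whose starting offset
    falls inside its slice."""
    start, end = slice_range
    return [off for off in row_group_offsets if start <= off < end]
```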
Tasks:

- Workers request tasks from the server dynamically.
- The server performs the HDFS listing asynchronously.
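The asynchronous-listing task could look roughly like the sketch below: a background thread lists partition directories and pushes files into a queue, so the server can start handing out slices before the full listing completes. The `list_dir` callable stands in for the actual HDFS listing call; all names here are assumptions for illustration.

```python
import queue
import threading

def list_files_async(partitions, list_dir, out_queue):
    """List partition directories on a background thread, pushing each
    file into out_queue; a None sentinel marks the end of the listing."""
    def run():
        for p in partitions:
            for f in list_dir(p):
                out_queue.put(f)
        out_queue.put(None)  # sentinel: listing complete
    t = threading.Thread(target=run, daemon=True)
    t.start()
    return t

def drain(out_queue):
    """Consume files as they arrive, until the sentinel is seen."""
    files = []
    while (f := out_queue.get()) is not None:
        files.append(f)
    return files
```

In the real server, the consumer would slice each file as it arrives rather than collect the whole list first.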