IN optimization and controlling task size during multipartition scan #139

Open · parekuti opened this issue Feb 22, 2017 · 1 comment

@parekuti (Contributor)
Currently there is a limit on how many partitions are supported during a multipartition scan, and raising that limit would degrade performance. Can we start thinking about how far we can raise it without degrading performance or causing other issues? Also, can we have a plan to add more tasks so that a multipartition scan gets more cores? For example (tiering sketched below):

* 0-200 (or a new limit) --> default plan
* Up to a new limit of 400 --> same plan, but somehow create more tasks (still far fewer than 5000)
* Over 400 --> full table scan, the default behavior
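
A minimal sketch of that tiering, purely to illustrate the proposal; all names here (ScanPlan, choosePlan, keysPerTask) and the exact limits are hypothetical, not FiloDB's actual API:

```scala
// Hypothetical tiered plan selection based on the number of partitions scanned.
sealed trait ScanPlan
case object SingleTaskScan extends ScanPlan                     // current default: one Spark task
final case class MultiTaskScan(numTasks: Int) extends ScanPlan  // split partition keys across tasks
case object FilteredFullTableScan extends ScanPlan              // fall back to a full table scan

object ScanPlanner {
  val DefaultLimit  = 200   // current multipartition limit (per this issue)
  val ExtendedLimit = 400   // hypothetical raised limit

  def choosePlan(numPartitions: Int, keysPerTask: Int = 50): ScanPlan =
    if (numPartitions <= DefaultLimit) SingleTaskScan
    else if (numPartitions <= ExtendedLimit)
      // More tasks than one, but far fewer than a full table scan would use
      MultiTaskScan(math.ceil(numPartitions.toDouble / keysPerTask).toInt)
    else FilteredFullTableScan
}
```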

@velvia (Member) commented Feb 24, 2017
Basically, multi-partition queries always run on one Spark partition. We want to enable bigger multi-partition queries that can spread across multiple Spark partitions without invoking filtered full table scans. This will require some intelligent logic.
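
One way to picture that logic, as a hedged sketch rather than FiloDB's actual code: group the requested partition keys into chunks and hand each chunk to its own Spark partition, so Spark schedules one task (and hence one core) per chunk. The reader function and keysPerSparkPartition parameter are placeholders:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object MultiPartitionSpread extends Serializable {
  // Spread a multi-partition query across several Spark partitions: one Spark
  // partition per group of keys, instead of the single task used today.
  // readPartition stands in for the real per-key read from the column store.
  def scan(sc: SparkContext,
           partKeys: Seq[String],
           readPartition: String => Iterator[String],
           keysPerSparkPartition: Int = 50): RDD[String] = {
    val groups = partKeys.grouped(keysPerSparkPartition).toSeq
    sc.parallelize(groups, numSlices = math.max(1, groups.size))
      .flatMap(group => group.iterator.flatMap(readPartition))
  }
}
```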

velvia pushed a commit that referenced this issue Oct 1, 2018
…queries using new shardKeyColumns DatasetOption (#139)

* Replace chunk_size DatasetOption with shardKeyColumns; new CLI option to set during dataset creation
* feat(coordinator): Compute shardKeyHash from query filters
* Fix a MatchError found during flushing/ingestion
* Add chunk-length histogram for ChunkSink writes
* feat(cli): Add --everyNSeconds option to repeatedly query for data
* Don't read or write options column for C* datasets table - not needed anymore
* Make sure no negative watermarks are written in all cases