
Optimize scheduler queue data structures to reduce cache misses and fragmentation #129

Open
benvanik opened this issue May 19, 2020 · 2 comments

@benvanik

From the TODO here:

```cpp
// TODO: Implement a queue that recycles elements to reduce number of
// heap allocations.
using TaskQueue = std::deque<Task>;
using FiberQueue = std::deque<Fiber*>;
using FiberSet = std::unordered_set<Fiber*>;
```

and this:

```cpp
std::vector<Allocator::unique_ptr<Fiber>>
```

`std::unordered_set` and `std::deque` aren't great for memory locality or heap fragmentation, and these currently bypass the marl `Allocator`, so they allocate straight from the system allocator. It'd be good to evaluate some alternative data structures (adding to, or reusing, those already in https://github.com/google/marl/blob/master/include/marl/containers.h) and pooling.
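Not part of marl — just a minimal sketch of what the TODO's "queue that recycles elements" could look like: an intrusive singly-linked FIFO that returns popped nodes to a free list, so steady-state push/pop does no heap allocation. The `RecyclingQueue` name and interface are hypothetical; a real version would allocate nodes through `marl::Allocator`.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical sketch: FIFO queue that keeps popped nodes on a free
// list for reuse. T must be default-constructible and movable.
template <typename T>
class RecyclingQueue {
 public:
  ~RecyclingQueue() {
    clearList(head_);
    clearList(free_);
  }

  void push(T value) {
    Node* node = acquire();
    node->value = std::move(value);
    node->next = nullptr;
    if (tail_) {
      tail_->next = node;
    } else {
      head_ = node;
    }
    tail_ = node;
    ++size_;
  }

  // Pops the front element into *out. Returns false if the queue is empty.
  bool pop(T* out) {
    if (!head_) return false;
    Node* node = head_;
    head_ = node->next;
    if (!head_) tail_ = nullptr;
    *out = std::move(node->value);
    release(node);  // recycle the node instead of deleting it
    --size_;
    return true;
  }

  size_t size() const { return size_; }

 private:
  struct Node {
    T value{};
    Node* next = nullptr;
  };

  Node* acquire() {
    if (free_) {  // reuse a recycled node if one is available
      Node* node = free_;
      free_ = node->next;
      return node;
    }
    return new Node();  // only allocate when the free list is empty
  }

  void release(Node* node) {
    node->next = free_;
    free_ = node;
  }

  static void clearList(Node* node) {
    while (node) {
      Node* next = node->next;
      delete node;
      node = next;
    }
  }

  Node* head_ = nullptr;
  Node* tail_ = nullptr;
  Node* free_ = nullptr;
  size_t size_ = 0;
};
```

The nodes stay warm in cache across reuse, which is the "doing zero work" win; the trade-off is that the free list holds memory until the queue is destroyed.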

@ben-clayton
Member

This could be broken down into two issues:

  1. We're performing 'system' heap allocations outside of the Allocator interface.
  2. `std::unordered_set` and `std::deque` aren't great for memory locality and heap fragmentation.

I definitely agree with 1, and it could be quickly fixed by implementing a `std::allocator` that wraps `marl::Allocator`. I've created #131 to track this. I'd like to get to the point where we can run all the unit tests and examples without a single heap allocation that wasn't performed by the `marl::Allocator`.
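To illustrate the adapter idea: below is a minimal sketch of a `std::allocator`-compatible wrapper. `TrackedAllocator` is a hypothetical stand-in for a marl-like allocator (the real `marl::Allocator` interface differs), and `StlAllocator` satisfies just enough of the C++ Allocator requirements for standard containers to route their allocations through it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical stand-in for a marl-like allocator; it just counts bytes
// so we can verify that containers stop hitting the system heap directly.
struct TrackedAllocator {
  size_t bytesAllocated = 0;
  void* allocate(size_t size) {
    bytesAllocated += size;
    return std::malloc(size);
  }
  void free(void* ptr) { std::free(ptr); }
};

// Minimal std::allocator-compatible adapter (sketch, not marl's API).
// std::allocator_traits fills in the rest of the Allocator requirements.
template <typename T>
class StlAllocator {
 public:
  using value_type = T;

  explicit StlAllocator(TrackedAllocator* a) : allocator_(a) {}
  template <typename U>
  StlAllocator(const StlAllocator<U>& other) : allocator_(other.allocator_) {}

  T* allocate(size_t n) {
    return static_cast<T*>(allocator_->allocate(n * sizeof(T)));
  }
  void deallocate(T* p, size_t) { allocator_->free(p); }

  TrackedAllocator* allocator_;
};

template <typename T, typename U>
bool operator==(const StlAllocator<T>& a, const StlAllocator<U>& b) {
  return a.allocator_ == b.allocator_;
}
template <typename T, typename U>
bool operator!=(const StlAllocator<T>& a, const StlAllocator<U>& b) {
  return !(a == b);
}
```

Usage would look like `std::vector<int, StlAllocator<int>> v(StlAllocator<int>(&alloc));` — the container type changes, but the call sites otherwise stay the same, which keeps the fix mechanical.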

As for 2, I don't disagree, but I've been using various profilers and benchmarks for guidance here. The gradual replacements of std containers with those in marl/containers.h have all been made when they appeared hot in profiles. I suspect you'll find other, more significant optimisation opportunities before these become the bottleneck. If you can show they are a major performance sponge, I'm more than happy to try replacing them.

@benvanik
Author

For 2, I was mainly going off the TODO in the code; I have no empirical results yet to show why anything different would be worthwhile, besides the general principle that "doing zero work and not thrashing the cache is always cheaper than doing any work" :)

Part of this would probably be adding more test cases to scheduler_bench.cpp so that we can stress those structures a bit more (in terms of having a large number of waiters, etc.).
