Lazy-initialize the global reference pool to reduce its overhead when unused #4178

Open · wants to merge 2 commits into main from lockfree-ref-pool

Conversation

adamreichold
Member

@adamreichold adamreichold commented May 11, 2024

As an alternative approach to the discussion in #4174 to at least have something concrete to discuss w.r.t. lock-free approaches.

This ensures that we never need more than a single atomic access (albeit acquire-release instead of relaxed) to check for a dirty pool without adding dependencies or depending on implementation details of std.

If the pool is never used, this should be as cheap as having a separate flag. However, if the pool is used, this does add costs to each deferred reference count update, i.e. allocating the node and also freeing it when the update is applied. This could be alleviated by allocating blocks of pointers instead of boxing them individually, but I think I would want to use a dependency for that.
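
For context, the lock-free variant described here works roughly like the following sketch: a Treiber stack whose nodes are boxed individually, with the dirty check being a single atomic load. This is illustrative only (`usize` stands in for the raw `*mut ffi::PyObject` pointers actually stored), not the PR's code:

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// One node is boxed per deferred reference-count update; this is the
// per-operation allocation cost mentioned above.
struct Node {
    value: usize, // stand-in for a raw `*mut ffi::PyObject`
    next: *mut Node,
}

static HEAD: AtomicPtr<Node> = AtomicPtr::new(ptr::null_mut());

fn push(value: usize) {
    let node = Box::into_raw(Box::new(Node { value, next: ptr::null_mut() }));
    let mut head = HEAD.load(Ordering::Relaxed);
    loop {
        unsafe { (*node).next = head };
        match HEAD.compare_exchange_weak(head, node, Ordering::Release, Ordering::Relaxed) {
            Ok(_) => return,
            Err(actual) => head = actual,
        }
    }
}

// Checking for a dirty pool is a single atomic access.
fn is_dirty() -> bool {
    !HEAD.load(Ordering::Acquire).is_null()
}

// Stealing the whole stack, e.g. when re-acquiring the GIL, and freeing
// the nodes again as the deferred updates are applied.
fn take_all() -> Vec<usize> {
    let mut head = HEAD.swap(ptr::null_mut(), Ordering::Acquire);
    let mut values = Vec::new();
    while !head.is_null() {
        let node = unsafe { Box::from_raw(head) };
        values.push(node.value);
        head = node.next;
    }
    values
}
```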

@davidhewitt
Member

This gets a big 👍 from me; I'd been wondering about lock-free structures myself, though I hadn't ever thought about them in depth.

I'll try to test out the performance of this tomorrow; I suspect it will be similar to #4174 in perf in the empty case and probably quite good in the busy case too...

@adamreichold
Member Author

adamreichold commented May 12, 2024

probably quite good in the busy case too...

I am more sceptical about the busy case, because this will heap-allocate for each and every deferred reference count update. So while the synchronization is cheap, at least the CPU time should significantly increase due to the additional allocator work.

@davidhewitt
Member

Indeed, I ran an extremely quick test this morning and found that the empty case seemed indistinguishable from #4174 (as it is currently implemented) but the benchmark in #4179 performed worse. (Though we should definitely try some multi-threaded benchmarks before concluding anything.)

@adamreichold
Member Author

the empty case seemed indistinguishable from #4174 (as it is currently implemented)

Note that there is one difference insofar as the lock-free implementation will keep the same performance if used intermittently. I do feel that this will only be relevant to the embedding use case though, and hence we might want to prefer the lazy version that is fast when never used or always used. (In the end, I can also see working around slow pthread mutexes on macOS by going for the dirty flag again, even though I would prefer a fast mutex instead.)

@adamreichold
Member Author

adamreichold commented May 18, 2024

@davidhewitt I thought about this some more and I really think we should not place a fair mutex into our Python-Rust entry point, i.e. sync::Mutex on macOS. Not so much because of the overhead of acquiring the mutex by calling into pthreads, but because of the danger of lock convoys and the general throughput limit this implies. Personally, I could ignore performance on macOS until they get their sync::Mutex in order, but I strongly suspect that this is not a universally popular position. ;-)

Just biting the bullet and using parking_lot might be a simple solution but I think @alex would be quite unhappy about the dependency (and so would I).

So I tried a bit harder and pushed a block-based implementation here that sadly adds another tunable (the block capacity) but would allow us to amortize the allocation overhead. Of course, it does imply quite a bit of complicated lock-free code that might actually be worse than a dependency.

(I would not propose to merge the code as it is even if the performance is satisfactory. If we do want this, we should factor out the lock-free blocked stealing stack into a generic, separately documented and tested data structure. I also had a look for a dependency offering this but did not find one; especially the stealing operation appears to be missing from the (mostly defunct) crates in this space.)


codspeed-hq bot commented May 18, 2024

CodSpeed Performance Report

Merging #4178 will not alter performance

Comparing lockfree-ref-pool (8d87c83) with main (b4b780b)

Summary

✅ 67 untouched benchmarks

🆕 1 new benchmark

Benchmarks breakdown

| Benchmark | main | lockfree-ref-pool | Change |
| --- | --- | --- | --- |
| 🆕 drop_many_objects_without_gil | N/A | 44 µs | N/A |

@alex
Contributor

alex commented May 18, 2024

I'm not hard opposed to parking_lot. But I think maybe it suggests we should raise the cfg for disabling the pool entirely up to being a feature so people can easily opt in to not relying on it.

I think the vast majority of extensions (citation needed!) probably are not dropping references while the GIL is not held.
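
For illustration, such an opt-out could look like the following sketch; the `pool` feature name and the abort fallback are hypothetical, not existing pyo3 API (and since Cargo features are additive, the feature would have to be default-on and enable the pool):

```rust
// Hypothetical sketch: gate the pool behind a default-on `pool` feature.
#[cfg(feature = "pool")]
fn register_decref(_obj: *mut std::ffi::c_void) {
    // defer the decrement via the global reference pool, as today
}

#[cfg(not(feature = "pool"))]
fn register_decref(_obj: *mut std::ffi::c_void) {
    // with the pool compiled out there is no safe way to defer the
    // decrement, so dropping without the GIL has to abort (or leak)
    std::process::abort();
}
```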

@adamreichold
Member Author

I put the data structure itself into a separate crate available at

https://github.com/adamreichold/lockfree-collector

where it passes a simple smoke test running under Miri. (Strangely enough, Miri does not seem to care for fetch_update but is fine with an open-coded compare_exchange_weak loop.)
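
For illustration, the two forms are documented to be equivalent; the trivial stand-in operation below (not the collector's actual code) shows the shapes Miri reacted to differently:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// The closure-based form that Miri complained about...
fn bump_via_fetch_update(counter: &AtomicUsize) {
    counter
        .fetch_update(Ordering::AcqRel, Ordering::Acquire, |current| Some(current + 1))
        .unwrap(); // never fails since the closure always returns `Some`
}

// ...and the open-coded `compare_exchange_weak` loop it corresponds to,
// which Miri accepted.
fn bump_via_cas_loop(counter: &AtomicUsize) {
    let mut current = counter.load(Ordering::Acquire);
    while let Err(actual) = counter.compare_exchange_weak(
        current,
        current + 1,
        Ordering::AcqRel,
        Ordering::Acquire,
    ) {
        current = actual;
    }
}
```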

@alex
Contributor

alex commented May 18, 2024 via email

@adamreichold
Member Author

Obligatory: consider filing a bug with miri with a reproducer :-)

Indeed. Did so immediately: rust-lang/miri#3613 (Albeit I am not really sure whether the closure doesn't indeed introduce additional requirements (in the form of LLVM attributes attached to its arguments or captures) that the concurrent accesses violate.)

@davidhewitt
Member

But I think maybe it suggests we should raise the cfg for disabling the pool entirely up to being a feature so people can easily opt in to not relying on it.

If your goal is to have a feature to opt out of the pool (and with it parking_lot), I'm not sure it works like this: since Cargo features are additive, we would need to make a default-on feature that enables the pool, though that's not out of the question of course.


That said, I am excited by the new development on the lock-free data structure and intend to go read lockfree-collector shortly!

@adamreichold force-pushed the lockfree-ref-pool branch 4 times, most recently from 3cee3f0 to 714f4d4 on May 20, 2024
@adamreichold
Member Author

I updated the lock-free collector to never deallocate blocks (just like our current Mutex<Vec<_>> never releases its allocation) while still allowing concurrent use, and would be glad if someone running macOS could compare this with the mutex-based version.

@davidhewitt
Member

👍 I can hopefully test this tomorrow at the PyCon sprints. Today we had a wave of new contributors.

@davidhewitt
Member

With #4200 merged and this branch rebased, the comparison is now fair (as both branches exercise the pool again), so I ran the macOS numbers quickly this morning (Python 3.12, M1 MBP):

```
# Before:
$ python test.py
py 2.0644041011109948e-08
rust 2.7087415975984187e-08

# After:
$ python test.py
py 2.065412502270192e-08
rust 1.902304100804031e-08
```

So at least for macOS (as we somewhat expect) this makes a big difference. I'd be keen to proceed with this as it hopefully removes the user need for the more drastic cfg flags to rip the pool out entirely.

@adamreichold
Member Author

With #4200 merged and this branch rebased, the comparison is now fair (as both branches exercise the pool again), so I ran the macOS numbers quickly this morning (Python 3.12, M1 MBP):

This is the no-op test, right? Did you run anything (besides our single-threaded benchmark, which CodSpeed ran here as well and which did not regress after I made the lock-free collector reuse allocations) that actually exercises the pool? Should I give adding a multi-threaded benchmark here a try?

I'd be keen to proceed with this as it hopefully removes the user need for the more drastic cfg flags to rip the pool out entirely.

Agreed. I guess the main issue is where to put the lockfree-collector crate. I don't think pyo3 should depend on a personal repository of mine. But I would like to keep the data structure in a separate crate, as that allows it to be tested under Miri (and I guess it could be generally useful).

Should I just move the crate into the PyO3 organisation, publish to crates.io and use a proper dependency here? Or do we want to move it into this repository proper?

I think having it here would simplify the release process but further complicate our already complicated CI. I don't think the data structure itself will see too many releases, so personally, I would just put it into the organisation as a separate repository and publish version 0.1.0 so we can use that here.

@adamreichold
Member Author

So I added some benchmarks for dropping without the GIL and for doing so from multiple threads, and I guess I should have started with that: no performance optimization without measurement.
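
The benchmark code itself is not shown in this thread; a rough reconstruction of its shape (the names, counts, and thread layout are guesses, assuming criterion and pyo3's public API):

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use pyo3::prelude::*;

fn drop_many_objects_without_gil(c: &mut Criterion) {
    pyo3::prepare_freethreaded_python();
    c.bench_function("drop_many_objects_without_gil", |b| {
        b.iter(|| {
            // Create owned references while holding the GIL...
            let objects: Vec<PyObject> =
                Python::with_gil(|py| (0..1000).map(|_| py.None()).collect());
            // ...and drop them after `with_gil` returns, so every drop is
            // deferred through the global reference pool.
            drop(objects);
        });
    });
}

fn drop_many_objects_multiple_threads(c: &mut Criterion) {
    pyo3::prepare_freethreaded_python();
    c.bench_function("drop_many_objects_multiple_threads", |b| {
        b.iter(|| {
            // Several threads dropping concurrently put the pool's
            // synchronization under contention.
            let handles: Vec<_> = (0..4)
                .map(|_| {
                    let objects: Vec<PyObject> =
                        Python::with_gil(|py| (0..250).map(|_| py.None()).collect());
                    std::thread::spawn(move || drop(objects))
                })
                .collect();
            for handle in handles {
                handle.join().unwrap();
            }
        });
    });
}

criterion_group!(benches, drop_many_objects_without_gil, drop_many_objects_multiple_threads);
criterion_main!(benches);
```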

After doing that, I am dropping the lock-free collector on the floor, as its overhead of constantly allocating new blocks under contention makes it really slow. (This is admittedly the worst case, and common patterns like using Rayon will exhibit collection running concurrently with enqueuing new decrements, but I guess it does happen somewhere and the performance profile is just too bad.)

I then replaced it with a commit that uses once_cell::sync::Lazy around the global reference pool, hiding its overhead when it is not in use while keeping the same performance profile as the mutex-based solution when it is used. I do expect the no-op test to yield the same results with that approach, i.e. a single atomic access. (Strangely enough, this is even a bit faster than main, which I guess is due to Lazy adding some additional alignment to the mutex itself.)
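
In code, the lazy approach amounts to something like the following sketch (`usize` again stands in for the raw pointers actually stored; this is illustrative, not the commit itself):

```rust
use once_cell::sync::Lazy;
use std::sync::Mutex;

struct ReferencePool {
    // stand-in for the deferred `*mut ffi::PyObject` decrements
    pending_decrefs: Mutex<Vec<usize>>,
}

// The pool is only allocated and initialized on first use.
static POOL: Lazy<ReferencePool> = Lazy::new(|| ReferencePool {
    pending_decrefs: Mutex::new(Vec::new()),
});

fn register_decref(ptr: usize) {
    POOL.pending_decrefs.lock().unwrap().push(ptr);
}

fn update_counts_if_dirty() {
    // `Lazy::get` returns `None` if the pool was never touched, so the
    // hot path here is a single atomic access without taking the lock.
    if let Some(pool) = Lazy::get(&POOL) {
        let pending = std::mem::take(&mut *pool.pending_decrefs.lock().unwrap());
        for _ptr in pending {
            // would call `ffi::Py_DECREF` here
        }
    }
}
```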

So here are the results of the pushed benchmarks:

```
main
----
drop_many_objects       time:   [1.6580 µs 1.6581 µs 1.6583 µs]
drop_many_objects_without_gil
                        time:   [5.1415 µs 5.1476 µs 5.1523 µs]
drop_many_objects_multiple_threads
                        time:   [118.78 µs 120.70 µs 122.45 µs]

lazy
----
drop_many_objects       time:   [1.6948 µs 1.6956 µs 1.6965 µs]
drop_many_objects_without_gil
                        time:   [5.6908 µs 5.6978 µs 5.7038 µs]
drop_many_objects_multiple_threads
                        time:   [105.58 µs 106.61 µs 107.58 µs]

lock-free
---------
drop_many_objects       time:   [1.6589 µs 1.6594 µs 1.6600 µs]
drop_many_objects_without_gil
                        time:   [42.540 µs 42.551 µs 42.563 µs]
drop_many_objects_multiple_threads
                        time:   [7.8457 ms 8.2430 ms 8.6565 ms]
```

I am not sure if the benchmarks are stable enough or if they will need to be added to the Codspeed exceptions.

This also still leaves a fair (i.e. queuing) mutex in there on macOS, which is a throughput hazard (though certainly not as much as my ill-designed lock-free collector) and for which I currently have no other solution than to go back to using parking_lot.

@adamreichold changed the title from "Use a lock-free stack to implement the global reference pool to penalize producers versus consumers" to "Lazy-initialize the global reference pool to reduce its overhead when unused" on Jun 3, 2024
@adamreichold adamreichold marked this pull request as ready for review June 3, 2024 18:10