Lazy-initialize the global reference pool to reduce its overhead when unused #4178

Open · wants to merge 2 commits into main from lockfree-ref-pool

Conversation

adamreichold
Member

@adamreichold adamreichold commented May 11, 2024

As an alternative approach to the discussion in #4174 to at least have something concrete to discuss w.r.t. lock-free approaches.

This ensures that we never need more than a single atomic access (albeit acquire-release instead of relaxed) to check for a dirty pool without adding dependencies or depending on implementation details of std.

If the pool is never used, this should be as cheap as having a separate flag. However, if the pool is used, this does add costs to each deferred reference count update, i.e. allocating the node and also freeing it when the update is applied. This could be alleviated by allocating blocks of pointers instead of boxing them individually, but I think I would want to use a dependency for that.
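
For context, the lock-free variant described here works roughly like the following sketch: a Treiber stack whose nodes are boxed individually, with the dirty check being a single atomic load. This is illustrative only (`usize` stands in for the raw `*mut ffi::PyObject` pointers actually stored), not the PR's code:

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// One node is boxed per deferred reference-count update; this is the
// per-operation allocation cost mentioned above.
struct Node {
    value: usize, // stand-in for a raw `*mut ffi::PyObject`
    next: *mut Node,
}

static HEAD: AtomicPtr<Node> = AtomicPtr::new(ptr::null_mut());

fn push(value: usize) {
    let node = Box::into_raw(Box::new(Node { value, next: ptr::null_mut() }));
    let mut head = HEAD.load(Ordering::Relaxed);
    loop {
        unsafe { (*node).next = head };
        match HEAD.compare_exchange_weak(head, node, Ordering::Release, Ordering::Relaxed) {
            Ok(_) => return,
            Err(actual) => head = actual,
        }
    }
}

// Checking for a dirty pool is a single atomic access.
fn is_dirty() -> bool {
    !HEAD.load(Ordering::Acquire).is_null()
}

// Stealing the whole stack, e.g. when re-acquiring the GIL, and freeing
// the nodes again as the deferred updates are applied.
fn take_all() -> Vec<usize> {
    let mut head = HEAD.swap(ptr::null_mut(), Ordering::Acquire);
    let mut values = Vec::new();
    while !head.is_null() {
        let node = unsafe { Box::from_raw(head) };
        values.push(node.value);
        head = node.next;
    }
    values
}
```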

@davidhewitt
Member

This gets a big 👍 from me; I'd been wondering about lock-free structures myself, though I hadn't ever thought about them in depth.

I'll try to test out the performance of this tomorrow; I suspect it will be similar to #4174 in perf in the empty case and probably quite good in the busy case too...

@adamreichold
Member Author

adamreichold commented May 12, 2024

probably quite good in the busy case too...

I am more sceptical about the busy case, because this will heap-allocate for each and every deferred reference count update. So while the synchronization is cheap, at least the CPU time should significantly increase due to the additional allocator work.

@davidhewitt
Member

Indeed, I ran an extremely quick test this morning and found that the empty case seemed indistinguishable from #4174 (as it is currently implemented) but the benchmark in #4179 performed worse. (Though we should definitely try some multi-threaded benchmarks before concluding anything.)

@adamreichold
Member Author

the empty case seemed indistinguishable from #4174 (as it is currently implemented)

Note that there is one difference insofar as the lock-free implementation will keep the same performance if used intermittently. I do feel that this will only be relevant to the embedding use case though, and hence we might want to prefer the lazy version that is fast when never used or always used. (In the end, I can also see working around slow pthread mutexes on macOS by going for the dirty flag again, even though I would prefer a fast mutex instead.)

@adamreichold
Member Author

adamreichold commented May 18, 2024

@davidhewitt I thought about this some more and I really think we should not place a fair mutex into our Python-Rust entry point, i.e. sync::Mutex on macOS. Not so much because of the overhead of acquiring the mutex by calling into pthreads, but because of the danger of lock convoys and the general throughput limit this implies. Personally, I could ignore performance on macOS until they get their sync::Mutex in order, but I strongly suspect that this is not a universally popular position. ;-)

Just biting the bullet and using parking_lot might be a simple solution but I think @alex would be quite unhappy about the dependency (and so would I).

So I tried a bit harder and pushed a block-based implementation here that sadly adds another tunable (the block capacity) but would allow us to amortize the allocation overhead. Of course, it does imply quite a bit of complicated lock-free code that might actually be worse than a dependency.

(I would not propose to merge the code as it is even if the performance is satisfactory. If we do want this, we should factor out the lock-free blocked stealing stack into a generic, separately documented and tested data structure. I also had a look for a dependency offering this but did not find one; especially the stealing operation appears to be missing from the (mostly defunct) crates in this space.)


codspeed-hq bot commented May 18, 2024

CodSpeed Performance Report

Merging #4178 will not alter performance

Comparing lockfree-ref-pool (8d87c83) with main (b4b780b)

Summary

✅ 67 untouched benchmarks

🆕 1 new benchmark

Benchmarks breakdown

| Benchmark | main | lockfree-ref-pool | Change |
| --- | --- | --- | --- |
| 🆕 drop_many_objects_without_gil | N/A | 44 µs | N/A |

@alex
Contributor

alex commented May 18, 2024

I'm not hard opposed to parking_lot. But I think maybe it suggests we should raise the cfg for disabling the pool entirely up to being a feature so people can easily opt in to not relying on it.

I think the vast majority of extensions (citation needed!) probably are not dropping references while the GIL is not held.
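
For illustration, such an opt-out could look like the following sketch; the `pool` feature name and the abort fallback are hypothetical, not existing pyo3 API (and since Cargo features are additive, the feature would have to be default-on and enable the pool):

```rust
// Hypothetical sketch: gate the pool behind a default-on `pool` feature.
#[cfg(feature = "pool")]
fn register_decref(_obj: *mut std::ffi::c_void) {
    // defer the decrement via the global reference pool, as today
}

#[cfg(not(feature = "pool"))]
fn register_decref(_obj: *mut std::ffi::c_void) {
    // with the pool compiled out there is no safe way to defer the
    // decrement, so dropping without the GIL has to abort (or leak)
    std::process::abort();
}
```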

@adamreichold
Member Author

I put the data structure itself into a separate crate available at

https://github.com/adamreichold/lockfree-collector

where it passes a simple smoke test running under Miri. (Strangely enough, Miri does not seem to care for fetch_update but is fine with an open-coded compare_exchange_weak loop.)
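
For illustration, the two forms are documented to be equivalent; the trivial stand-in operation below (not the collector's actual code) shows the shapes Miri reacted to differently:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// The closure-based form that Miri complained about...
fn bump_via_fetch_update(counter: &AtomicUsize) {
    counter
        .fetch_update(Ordering::AcqRel, Ordering::Acquire, |current| Some(current + 1))
        .unwrap(); // never fails since the closure always returns `Some`
}

// ...and the open-coded `compare_exchange_weak` loop it corresponds to,
// which Miri accepted.
fn bump_via_cas_loop(counter: &AtomicUsize) {
    let mut current = counter.load(Ordering::Acquire);
    while let Err(actual) = counter.compare_exchange_weak(
        current,
        current + 1,
        Ordering::AcqRel,
        Ordering::Acquire,
    ) {
        current = actual;
    }
}
```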

@alex
Contributor

alex commented May 18, 2024 via email

@adamreichold
Member Author

Obligatory: consider filing a bug with miri with a reproducer :-)

Indeed. Did so immediately: rust-lang/miri#3613 (Albeit I am not really sure whether the closure doesn't indeed introduce additional requirements (in the form of LLVM attributes attached to its arguments or captures) that the concurrent accesses violate.)

@davidhewitt
Member

But I think maybe it suggests we should raise the cfg for disabling the pool entirely up to being a feature so people can easily opt in to not relying on it.

If your goal is to have a feature to opt out of the pool (and with it parking_lot), I'm not sure it works like this: since Cargo features are additive, we would need to make a default-on feature that enables the pool, though that's not out of the question of course.


That said, I am excited by the new development on the lock-free data structure and intend to go read lockfree-collector shortly!

@adamreichold force-pushed the lockfree-ref-pool branch 4 times, most recently from 3cee3f0 to 714f4d4 on May 20, 2024
@adamreichold
Member Author

I updated the lock-free collector to never deallocate blocks (just like our current Mutex<Vec<_>> never releases its allocation) while still allowing concurrent use, and would be glad if someone running macOS could compare this with the mutex-based version.

@davidhewitt
Member

👍 I can hopefully test this tomorrow at the PyCon sprints. Today we had a wave of new contributors.

@davidhewitt
Member

With #4200 merged and this branch rebased, the comparison is now fair (as both branches exercise the pool again), so I ran the macOS numbers quickly this morning (Python 3.12, M1 MBP):

```
# Before:
$ python test.py
py 2.0644041011109948e-08
rust 2.7087415975984187e-08

# After:
$ python test.py
py 2.065412502270192e-08
rust 1.902304100804031e-08
```

So at least for macOS (as we somewhat expect) this makes a big difference. I'd be keen to proceed with this as it hopefully removes the user need for the more drastic cfg flags to rip the pool out entirely.

@adamreichold
Member Author

With #4200 merged and this branch rebased, the comparison is now fair (as both branches exercise the pool again), so I ran the macOS numbers quickly this morning (Python 3.12, M1 MBP):

This is the no-op test, right? Did you run anything (besides our single-threaded benchmark, which CodSpeed ran here as well and which did not regress after I made the lock-free collector reuse allocations) that actually exercises the pool? Should I give adding a multi-threaded benchmark here a try?

I'd be keen to proceed with this as it hopefully removes the user need for the more drastic cfg flags to rip the pool out entirely.

Agreed. I guess the main issue is where to put the lockfree-collector crate. I don't think pyo3 should depend on a personal repository of mine. But I would like to keep the data structure in a separate crate, as that allows it to be tested under Miri (and I guess it could be generally useful).

Should I just move the crate into the PyO3 organisation, publish to crates.io and use a proper dependency here? Or do we want to move it into this repository proper?

I think having it here would simplify the release process but further complicate our already complicated CI. I don't think the data structure itself will see too many releases, so personally, I would just put it into the organisation as a separate repository and publish version 0.1.0 so we can use that here.

@adamreichold
Member Author

So I added some benchmarks for dropping without the GIL and for doing so from multiple threads, and I guess I should have started with that: no performance optimization without measurement.
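
The benchmark code itself is not shown in this thread; a rough reconstruction of its shape (the names, counts, and thread layout are guesses, assuming criterion and pyo3's public API):

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use pyo3::prelude::*;

fn drop_many_objects_without_gil(c: &mut Criterion) {
    pyo3::prepare_freethreaded_python();
    c.bench_function("drop_many_objects_without_gil", |b| {
        b.iter(|| {
            // Create owned references while holding the GIL...
            let objects: Vec<PyObject> =
                Python::with_gil(|py| (0..1000).map(|_| py.None()).collect());
            // ...and drop them after `with_gil` returns, so every drop is
            // deferred through the global reference pool.
            drop(objects);
        });
    });
}

fn drop_many_objects_multiple_threads(c: &mut Criterion) {
    pyo3::prepare_freethreaded_python();
    c.bench_function("drop_many_objects_multiple_threads", |b| {
        b.iter(|| {
            // Several threads dropping concurrently put the pool's
            // synchronization under contention.
            let handles: Vec<_> = (0..4)
                .map(|_| {
                    let objects: Vec<PyObject> =
                        Python::with_gil(|py| (0..250).map(|_| py.None()).collect());
                    std::thread::spawn(move || drop(objects))
                })
                .collect();
            for handle in handles {
                handle.join().unwrap();
            }
        });
    });
}

criterion_group!(benches, drop_many_objects_without_gil, drop_many_objects_multiple_threads);
criterion_main!(benches);
```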

After doing that, I am dropping the lock-free collector on the floor, as its overhead of constantly allocating new blocks under contention makes it really slow. (This is admittedly the worst case, and common patterns like using Rayon will exhibit collection running concurrently with enqueuing new decrements, but I guess it does happen somewhere and the performance profile is just too bad.)

I then replaced it with a commit that uses once_cell::sync::Lazy around the global reference pool, hiding its overhead when it is not in use while keeping the same performance profile as the mutex-based solution when it is used. I do expect the no-op test to yield the same results with that approach, i.e. a single atomic access. (Strangely enough, this is even a bit faster than main, which I guess is due to Lazy adding some additional alignment to the mutex itself.)
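
In code, the lazy approach amounts to something like the following sketch (`usize` again stands in for the raw pointers actually stored; this is illustrative, not the commit itself):

```rust
use once_cell::sync::Lazy;
use std::sync::Mutex;

struct ReferencePool {
    // stand-in for the deferred `*mut ffi::PyObject` decrements
    pending_decrefs: Mutex<Vec<usize>>,
}

// The pool is only allocated and initialized on first use.
static POOL: Lazy<ReferencePool> = Lazy::new(|| ReferencePool {
    pending_decrefs: Mutex::new(Vec::new()),
});

fn register_decref(ptr: usize) {
    POOL.pending_decrefs.lock().unwrap().push(ptr);
}

fn update_counts_if_dirty() {
    // `Lazy::get` returns `None` if the pool was never touched, so the
    // hot path here is a single atomic access without taking the lock.
    if let Some(pool) = Lazy::get(&POOL) {
        let pending = std::mem::take(&mut *pool.pending_decrefs.lock().unwrap());
        for _ptr in pending {
            // would call `ffi::Py_DECREF` here
        }
    }
}
```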

So here are the results of the pushed benchmarks:

```
main
----
drop_many_objects       time:   [1.6580 µs 1.6581 µs 1.6583 µs]
drop_many_objects_without_gil
                        time:   [5.1415 µs 5.1476 µs 5.1523 µs]
drop_many_objects_multiple_threads
                        time:   [118.78 µs 120.70 µs 122.45 µs]

lazy
----
drop_many_objects       time:   [1.6948 µs 1.6956 µs 1.6965 µs]
drop_many_objects_without_gil
                        time:   [5.6908 µs 5.6978 µs 5.7038 µs]
drop_many_objects_multiple_threads
                        time:   [105.58 µs 106.61 µs 107.58 µs]

lock-free
---------
drop_many_objects       time:   [1.6589 µs 1.6594 µs 1.6600 µs]
drop_many_objects_without_gil
                        time:   [42.540 µs 42.551 µs 42.563 µs]
drop_many_objects_multiple_threads
                        time:   [7.8457 ms 8.2430 ms 8.6565 ms]
```

I am not sure if the benchmarks are stable enough or if they will need to be added to the Codspeed exceptions.

This also still leaves a fair (i.e. queuing) mutex in there on macOS, which is a throughput hazard (though certainly not as much as my ill-designed lock-free collector) and for which I currently have no other solution than to go back to using parking_lot.

@adamreichold changed the title from "Use a lock-free stack to implement the global reference pool to penalize producers versus consumers" to "Lazy-initialize the global reference pool to reduce its overhead when unused" on Jun 3, 2024
@adamreichold adamreichold marked this pull request as ready for review June 3, 2024 18:10