Replace sort implementations #124032

Voultapher · 2024-04-16T18:28:13Z

This PR replaces the sort implementations with tailor-made ones that strike a balance of run-time, compile-time and binary-size, yielding run-time and compile-time improvements. Regressing binary-size for slice::sort while improving it for slice::sort_unstable. All while upholding the existing soft and hard safety guarantees, and even extending the soft guarantees, detecting strict weak ordering violations with a high chance and reporting it to users via a panic.

slice::sort -> driftsort design document, includes detailed benchmarks and analysis.
slice::sort_unstable -> ipnsort design document, includes detailed benchmarks and analysis.

Why should we change the sort implementations?

In the 2023 Rust survey, one of the questions was: "In your opinion, how should work on the following aspects of Rust be prioritized?". The second place was "Runtime performance" and the third one "Compile Times". This PR aims to improve both.

Why is this one big PR and not multiple?

The current documentation gives performance recommendations for slice::sort and slice::sort_unstable. If for example only one of them were to be changed, this advise would be misleading for some Rust versions. By replacing them atomically, the advise remains largely unchanged, and users don't have to change their code.
driftsort and ipnsort share a substantial part of their implementations.
The implementation of select_nth_unstable uses internals of slice::sort_unstable, which makes it impractical to split changes.

This PR is a collaboration with @orlp.

rustbot · 2024-04-16T18:28:22Z

r? @jhpratt

rustbot has assigned @jhpratt.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

Voultapher · 2024-04-16T18:28:27Z

r? thomcc

matthiaskrgr · 2024-04-16T19:31:32Z

@bors try @rust-timer queue

library/alloc/src/slice/tests.rs

Kobzol · 2024-04-17T12:31:40Z

@bors try

(looks like the try build didn't get through)

bors · 2024-04-17T12:32:51Z

⌛ Trying commit eeee5de with merge a4132fa...

Replace sort implementations This PR replaces the sort implementations with tailor-made ones that strike a balance of run-time, compile-time and binary-size, yielding run-time and compile-time improvements. Regressing binary-size for `slice::sort` while improving it for `slice::sort_unstable`. All while upholding the existing soft and hard safety guarantees, and even extending the soft guarantees, detecting strict weak ordering violations with a high chance and reporting it to users via a panic. * `slice::sort` -> driftsort [design document](https://github.com/Voultapher/sort-research-rs/blob/main/writeup/driftsort_introduction/text.md) * `slice::sort_unstable` -> ipnsort [design document](https://github.com/Voultapher/sort-research-rs/blob/main/writeup/ipnsort_introduction/text.md) #### Why should we change the sort implementations? In the [2023 Rust survey](https://blog.rust-lang.org/2024/02/19/2023-Rust-Annual-Survey-2023-results.html#challenges), one of the questions was: "In your opinion, how should work on the following aspects of Rust be prioritized?". The second place was "Runtime performance" and the third one "Compile Times". This PR aims to improve both. #### Why is this one big PR and not multiple? * The current documentation gives performance recommendations for `slice::sort` and `slice::sort_unstable`. If for example only one of them were to changed, this advise may be misleading for some Rust versions. By replacing them atomically, the advise remains largely unchanged, and users don't have to change their code. * driftsort and ipnsort share a substantial part of their implementations. * The implementation of `select_nth_unstable` uses internals of `slice::sort_unstable`, which makes it impractical to split changes. --- This PR is a collaboration with `@orlp.`

bors · 2024-04-17T14:06:17Z

☀️ Try build successful - checks-actions
Build commit: a4132fa (a4132fa857b71c2cec6719753cc4f31d6f7ea3c4)

rust-timer · 2024-04-17T15:23:06Z

Finished benchmarking commit (a4132fa): comparison URL.

Overall result: ❌ regressions - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	2.2%	[0.2%, 18.4%]	44
Regressions ❌ (secondary)	1.1%	[0.2%, 4.3%]	9
Improvements ✅ (primary)	-0.4%	[-0.5%, -0.3%]	7
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	1.9%	[-0.5%, 18.4%]	51

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	3.6%	[1.0%, 7.3%]	10
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-3.2%	[-3.8%, -2.0%]	3
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	2.1%	[-3.8%, 7.3%]	13

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	3.6%	[0.9%, 19.0%]	24
Regressions ❌ (secondary)	3.2%	[2.2%, 4.2%]	2
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	3.6%	[0.9%, 19.0%]	24

Binary size

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	1.5%	[0.1%, 4.7%]	75
Regressions ❌ (secondary)	2.5%	[0.8%, 4.1%]	76
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	1.5%	[0.1%, 4.7%]	75

Bootstrap: 679.119s -> 683.166s (0.60%)
Artifact size: 316.14 MiB -> 317.49 MiB (0.43%)

Voultapher · 2024-04-17T15:58:56Z

The binary-size results are inline with what we expected, given that slice::sort is more prevalent than slice::sort_unstable it makes sense that the regressions outweigh the improvements.

Regarding compile-times, we looked at those in our analysis here and here. The instruction count metric does not account for gains via parallelizability, right?

orlp · 2024-04-17T22:07:31Z

One thing that I find particularly strange is a 4.31% regression in helloworld-tiny, which doesn't call sort. Is this just due to the sort call in backtrace-rs?

Amanieu · 2024-05-15T16:33:02Z

We discussed this in the library team meeting today: we're happy to add this despite the compilation time regressions, however it would be nice to investigate the regression in html5ever in more detail to see if there is a low-hanging fruit to avoid the regression.

cc @rust-lang/wg-compiler-performance

Kobzol · 2024-05-15T17:02:28Z

Let's do a fresh run to get more updated results.

@bors try @rust-timer queue

Replace sort implementations This PR replaces the sort implementations with tailor-made ones that strike a balance of run-time, compile-time and binary-size, yielding run-time and compile-time improvements. Regressing binary-size for `slice::sort` while improving it for `slice::sort_unstable`. All while upholding the existing soft and hard safety guarantees, and even extending the soft guarantees, detecting strict weak ordering violations with a high chance and reporting it to users via a panic. * `slice::sort` -> driftsort [design document](https://github.com/Voultapher/sort-research-rs/blob/main/writeup/driftsort_introduction/text.md) * `slice::sort_unstable` -> ipnsort [design document](https://github.com/Voultapher/sort-research-rs/blob/main/writeup/ipnsort_introduction/text.md) #### Why should we change the sort implementations? In the [2023 Rust survey](https://blog.rust-lang.org/2024/02/19/2023-Rust-Annual-Survey-2023-results.html#challenges), one of the questions was: "In your opinion, how should work on the following aspects of Rust be prioritized?". The second place was "Runtime performance" and the third one "Compile Times". This PR aims to improve both. #### Why is this one big PR and not multiple? * The current documentation gives performance recommendations for `slice::sort` and `slice::sort_unstable`. If for example only one of them were to changed, this advise may be misleading for some Rust versions. By replacing them atomically, the advise remains largely unchanged, and users don't have to change their code. * driftsort and ipnsort share a substantial part of their implementations. * The implementation of `select_nth_unstable` uses internals of `slice::sort_unstable`, which makes it impractical to split changes. --- This PR is a collaboration with `@orlp.`

bors · 2024-05-15T17:03:43Z

⌛ Trying commit 2d82ce7 with merge 5b76fc7...

bors · 2024-05-15T18:42:00Z

☀️ Try build successful - checks-actions
Build commit: 5b76fc7 (5b76fc7c5bc3549c9069da214ae6095eb65aacac)

rust-timer · 2024-05-15T20:42:22Z

Finished benchmarking commit (5b76fc7): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	1.9%	[0.2%, 18.1%]	52
Regressions ❌ (secondary)	2.1%	[0.9%, 3.6%]	3
Improvements ✅ (primary)	-0.8%	[-1.2%, -0.4%]	2
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	1.8%	[-1.2%, 18.1%]	54

Max RSS (memory usage)

Results (primary 0.9%, secondary 4.2%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	2.4%	[1.2%, 3.7%]	9
Regressions ❌ (secondary)	4.2%	[4.2%, 4.2%]	1
Improvements ✅ (primary)	-6.0%	[-9.1%, -3.0%]	2
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	0.9%	[-9.1%, 3.7%]	11

Cycles

Results (primary 3.4%, secondary 3.0%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	3.4%	[0.5%, 18.9%]	27
Regressions ❌ (secondary)	3.0%	[2.2%, 3.8%]	2
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	3.4%	[0.5%, 18.9%]	27

Binary size

Results (primary 1.4%, secondary 2.4%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	1.5%	[0.0%, 4.0%]	76
Regressions ❌ (secondary)	2.4%	[0.7%, 4.1%]	76
Improvements ✅ (primary)	-0.5%	[-0.8%, -0.3%]	2
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	1.4%	[-0.8%, 4.0%]	78

Bootstrap: 678.357s -> 686.997s (1.27%)
Artifact size: 316.14 MiB -> 319.70 MiB (1.13%)

nnethercote · 2024-05-15T23:04:03Z

Worth noting that there are binary size regressions, too :(

- `slice::sort` -> driftsort https://github.com/Voultapher/sort-research-rs/blob/main/writeup/driftsort_introduction/text.md - `slice::sort_unstable` -> ipnsort https://github.com/Voultapher/sort-research-rs/blob/main/writeup/ipnsort_introduction/text.md Replaces the sort implementations with tailor made ones that strike a balance of run-time, compile-time and binary-size, yielding run-time and compile-time improvements. Regressing binary-size for `slice::sort` while improving it for `slice::sort_unstable`. All while upholding the existing soft and hard safety guarantees, and even extending the soft guarantees, detecting strict weak ordering violations with a high chance and reporting it to users via a panic. In addition the implementation of `select_nth_unstable` is also adapted as it uses `slice::sort_unstable` internals.

Also includes small doc fixes.

Voultapher · 2024-05-16T15:24:45Z

@nnethercote, indeed that's an unfortunate effect of this change. As discussed in the design documents, slice::sort_unstable should generally see improvements in binary-size, and slice::sort regressions. Orson and I invested a serious amount of effort in getting slice::sort as small as it is. Competitive designs in this space like glidesort are 5x larger. Given the 2023 Rust survey results, binary-size was the third lowest priority, and run-time performance the second highest. So I'd argue it's a worthwhile tradeoff.

Kobzol · 2024-05-16T15:32:41Z

CC @diondokter - this might be an interesting use-case for "optimize-for-size" stdlib feature.

Voultapher · 2024-05-16T15:37:25Z

Users that care about binary-size and or compile-times over everything else, are probably better served with https://github.com/Voultapher/tiny-sort-rs.

Voultapher · 2024-05-16T15:59:04Z

@Amanieu I've investigated the html5ever regression. And here are my findings:

After rebasing this branch on master I ran hyperfine --min-runs 5 --prepare 'cargo clean' 'cargo build --release' with different versions of rustc, testing the 8412575 commit of html5ever. All tests were performed on my Zen 3 machine.

Results using rustc 1.78.0-nightly (516b6162a 2024-03-03)

Time (mean ± σ):      3.093 s ±  0.012 s    [User: 10.846 s, System: 2.155 s]
Range (min … max):    3.080 s …  3.107 s    5 runs

Locally built stage2 branch master

Time (mean ± σ):      6.315 s ±  0.039 s    [User: 19.684 s, System: 2.282 s]
Range (min … max):    6.265 s …  6.364 s    5 runs

Locally built stage2 branch a-new-sort

Time (mean ± σ):      6.462 s ±  0.085 s    [User: 20.048 s, System: 2.384 s]
Range (min … max):    6.407 s …  6.610 s    5 runs

Looking at the rustc perf suite results, we see the regression can be attributed entirely to the backend code generation. Which would match our own testing that significantly more LLVM-IR is generated, because the slice::sort impl goes from ~210 to ~850 LoC. The ~18% backend regressions accounts for a ~4.6% total compile-time regression in this scenario. My testing suggests ~2.3% wall-clock regression in this scenario, which was the highest measured regression in the rustc perf suite, suggesting other crates are likely less impacted.

Amanieu · 2024-05-16T17:45:17Z

Does cargo llvm-lines reveal any interesting hotspots or opportunities to reduce the number of monomorphizations?

Voultapher · 2024-05-16T18:40:32Z

Left side old, right new. Other than the fact that sort already played a major role in the LLVM-IR output, I see no clear hotspot.

nnethercote · 2024-05-16T20:31:58Z

I want to make sure I understand the current situation.

strike a balance of run-time, compile-time and binary-size,

AFAICT currently:

run-time: microbenchmarks show speed ups in a lot of cases, but also some slowdowns
compile-time: rustc-perf results are all worse for instructions, cycles, and wall-time
binary size: the same rustc-perf results show universal regressions

Am I overlooking anything?

Orson and I invested a serious amount of effort in getting slice::sort as small as it is. Competitive designs in this space like glidesort are 5x larger. Given the 2023 Rust survey results, binary-size was the third lowest priority, and run-time performance the second highest. So I'd argue it's a worthwhile tradeoff.

I appreciate the effort taken to make this implementation smaller, but the comparison with glidesort doesn't seem relevant; the relevant comparison is the current sort implementation.

Overall, I don't think these results are a worthwhile trade-off at the moment :(

orlp · 2024-05-16T21:50:46Z

@nnethercote I think you are underselling the magnitude of "run-time: microbenchmarks show speed ups in a lot of cases, but also some slowdowns". A lot of the benchmarks in that list are just table-stakes tests against regressions (you can only 'sort' an already ascending input so fast - obviously you won't see a significant improvement in that). It's not just about how many tests you improve/regress on, but also by how much.

Please allow me to highlight the important (in my opinion) tests from that microbenchmark list:

 name                                            diff %  speedup 
 slice::sort_large_random                       -65.49%   x 2.90 
 slice::sort_large_strings                      -37.75%   x 1.61 
 slice::sort_medium_random                      -47.89%   x 1.92 
 slice::sort_small_random                        11.11%   x 0.90 
 slice::sort_unstable_large_random              -47.57%   x 1.91 
 slice::sort_unstable_large_strings             -25.19%   x 1.34 
 slice::sort_unstable_medium_random             -22.15%   x 1.28 
 slice::sort_unstable_small_random              -15.79%   x 1.19

Also note that we have tons more benchmarks in the writeups (1, 2), with relevant real-world distributions.

For example, driftsort has a massive speedup compared to the current slice::sort for input distributions with only few elements (e.g. random_d20):

We are talking speedups that can be an order of magnitude or more... Are we really saying that it's okay the default [T]::sort function you use is 10x slower than what it could be because it regressed compile times on average by ~0.5% and binary size by... ~0.5-2%?

nnethercote · 2024-05-17T07:16:52Z

Thanks for the additional data, that is helpful. I didn't realize that the "design document" links in the PR description included measurements of this sort, so I latched on to the only runtime measurements I could see in this PR.

Based on that, I'm happy to defer to the libs team's judgment.

Voultapher · 2024-05-17T08:19:58Z

I'd appreciate it if participants in this conversation would read the design documents. They include comprehensive analysis and reasoning why this change makes sense in the author's eyes. Among other things they include 6000 benchmark results across 3 different machines, including x86 and Arm. The test and benchmark suite is the culmination of 2 years of research and work.

rustbot assigned jhpratt Apr 16, 2024

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Apr 16, 2024

rustbot assigned thomcc and unassigned jhpratt Apr 16, 2024

This comment has been minimized.

Sign in to view

rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Apr 16, 2024

fbstj reviewed Apr 16, 2024

View reviewed changes

library/alloc/src/slice/tests.rs Show resolved Hide resolved

This comment has been minimized.

Sign in to view

This comment was marked as outdated.

Sign in to view

rustbot assigned joboet and thomcc and unassigned thomcc and joboet Apr 17, 2024

This comment has been minimized.

Sign in to view

rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Apr 17, 2024

Amanieu added the I-libs-nominated The issue / PR has been nominated for discussion during a libs team meeting. label May 15, 2024

joboet mentioned this pull request May 15, 2024

Replace impl Fn with fn in sort::merge_sort::RunVec #121505

Open

Amanieu removed the I-libs-nominated The issue / PR has been nominated for discussion during a libs team meeting. label May 15, 2024

This comment has been minimized.

Sign in to view

rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label May 15, 2024

This comment has been minimized.

Sign in to view

rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label May 15, 2024

Voultapher added 5 commits May 16, 2024 17:08

Fix tidy errors

1a6b0e4

Move BufGuard impl outside of function

d20b119

Turn bare links into automatic links

8300f67

Fix linkchecker doc errors

88fb5ed

Also includes small doc fixes.

Voultapher force-pushed the a-new-sort branch from 2d82ce7 to 88fb5ed Compare May 16, 2024 16:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace sort implementations #124032

Replace sort implementations #124032

Voultapher commented Apr 16, 2024 •

edited

rustbot commented Apr 16, 2024

Voultapher commented Apr 16, 2024

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

matthiaskrgr commented Apr 16, 2024

This comment has been minimized.

This comment has been minimized.

This comment was marked as outdated.

Kobzol commented Apr 17, 2024

bors commented Apr 17, 2024

bors commented Apr 17, 2024

This comment has been minimized.

rust-timer commented Apr 17, 2024

Voultapher commented Apr 17, 2024 •

edited

orlp commented Apr 17, 2024

Amanieu commented May 15, 2024

Kobzol commented May 15, 2024

This comment has been minimized.

bors commented May 15, 2024

bors commented May 15, 2024

This comment has been minimized.

rust-timer commented May 15, 2024

nnethercote commented May 15, 2024

Voultapher commented May 16, 2024 •

edited

Kobzol commented May 16, 2024

Voultapher commented May 16, 2024

Voultapher commented May 16, 2024 •

edited

Amanieu commented May 16, 2024

Voultapher commented May 16, 2024

nnethercote commented May 16, 2024

orlp commented May 16, 2024 •

edited

nnethercote commented May 17, 2024

Voultapher commented May 17, 2024

Replace sort implementations #124032

Are you sure you want to change the base?

Replace sort implementations #124032

Conversation

Voultapher commented Apr 16, 2024 • edited

Why should we change the sort implementations?

Why is this one big PR and not multiple?

rustbot commented Apr 16, 2024

Voultapher commented Apr 16, 2024

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

matthiaskrgr commented Apr 16, 2024

This comment has been minimized.

This comment has been minimized.

This comment was marked as outdated.

Kobzol commented Apr 17, 2024

bors commented Apr 17, 2024

bors commented Apr 17, 2024

This comment has been minimized.

rust-timer commented Apr 17, 2024

Overall result: ❌ regressions - ACTION NEEDED

Voultapher commented Apr 17, 2024 • edited

orlp commented Apr 17, 2024

Amanieu commented May 15, 2024

Kobzol commented May 15, 2024

This comment has been minimized.

bors commented May 15, 2024

bors commented May 15, 2024

This comment has been minimized.

rust-timer commented May 15, 2024

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

nnethercote commented May 15, 2024

Voultapher commented May 16, 2024 • edited

Kobzol commented May 16, 2024

Voultapher commented May 16, 2024

Voultapher commented May 16, 2024 • edited

Amanieu commented May 16, 2024

Voultapher commented May 16, 2024

nnethercote commented May 16, 2024

orlp commented May 16, 2024 • edited

nnethercote commented May 17, 2024

Voultapher commented May 17, 2024

Voultapher commented Apr 16, 2024 •

edited

Voultapher commented Apr 17, 2024 •

edited

Voultapher commented May 16, 2024 •

edited

Voultapher commented May 16, 2024 •

edited

orlp commented May 16, 2024 •

edited