Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace sort implementations #124032

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

Voultapher
Copy link
Contributor

@Voultapher Voultapher commented Apr 16, 2024

This PR replaces the sort implementations with tailor-made ones that strike a balance of run-time, compile-time and binary-size, yielding run-time and compile-time improvements. Regressing binary-size for slice::sort while improving it for slice::sort_unstable. All while upholding the existing soft and hard safety guarantees, and even extending the soft guarantees, detecting strict weak ordering violations with a high chance and reporting it to users via a panic.

  • slice::sort -> driftsort design document, includes detailed benchmarks and analysis.

  • slice::sort_unstable -> ipnsort design document, includes detailed benchmarks and analysis.

Why should we change the sort implementations?

In the 2023 Rust survey, one of the questions was: "In your opinion, how should work on the following aspects of Rust be prioritized?". The second place was "Runtime performance" and the third one "Compile Times". This PR aims to improve both.

Why is this one big PR and not multiple?

  • The current documentation gives performance recommendations for slice::sort and slice::sort_unstable. If for example only one of them were to be changed, this advise would be misleading for some Rust versions. By replacing them atomically, the advise remains largely unchanged, and users don't have to change their code.
  • driftsort and ipnsort share a substantial part of their implementations.
  • The implementation of select_nth_unstable uses internals of slice::sort_unstable, which makes it impractical to split changes.

This PR is a collaboration with @orlp.

@rustbot
Copy link
Collaborator

rustbot commented Apr 16, 2024

r? @jhpratt

rustbot has assigned @jhpratt.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Apr 16, 2024
@Voultapher
Copy link
Contributor Author

r? thomcc

@rustbot rustbot assigned thomcc and unassigned jhpratt Apr 16, 2024
@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@matthiaskrgr
Copy link
Member

@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Apr 16, 2024
@rust-log-analyzer

This comment has been minimized.

@Voultapher

This comment was marked as outdated.

@rustbot rustbot assigned joboet and thomcc and unassigned thomcc and joboet Apr 17, 2024
@Kobzol
Copy link
Contributor

Kobzol commented Apr 17, 2024

@bors try

(looks like the try build didn't get through)

@bors
Copy link
Contributor

bors commented Apr 17, 2024

⌛ Trying commit eeee5de with merge a4132fa...

bors added a commit to rust-lang-ci/rust that referenced this pull request Apr 17, 2024
Replace sort implementations

This PR replaces the sort implementations with tailor-made ones that strike a balance of run-time, compile-time and binary-size, yielding run-time and compile-time improvements. Regressing binary-size for `slice::sort` while improving it for `slice::sort_unstable`. All while upholding the existing soft and hard safety guarantees, and even extending the soft guarantees, detecting strict weak ordering violations with a high chance and reporting it to users via a panic.

* `slice::sort` -> driftsort [design document](https://github.com/Voultapher/sort-research-rs/blob/main/writeup/driftsort_introduction/text.md)

* `slice::sort_unstable` -> ipnsort [design document](https://github.com/Voultapher/sort-research-rs/blob/main/writeup/ipnsort_introduction/text.md)

#### Why should we change the sort implementations?

In the [2023 Rust survey](https://blog.rust-lang.org/2024/02/19/2023-Rust-Annual-Survey-2023-results.html#challenges), one of the questions was: "In your opinion, how should work on the following aspects of Rust be prioritized?". The second place was "Runtime performance" and the third one "Compile Times". This PR aims to improve both.

#### Why is this one big PR and not multiple?

* The current documentation gives performance recommendations for `slice::sort` and `slice::sort_unstable`. If for example only one of them were to changed, this advise may be misleading for some Rust versions. By replacing them atomically, the advise remains largely unchanged, and users don't have to change their code.
* driftsort and ipnsort share a substantial part of their implementations.
* The implementation of `select_nth_unstable` uses internals of `slice::sort_unstable`, which makes it impractical to split changes.

---

This PR is a collaboration with `@orlp.`
@bors
Copy link
Contributor

bors commented Apr 17, 2024

☀️ Try build successful - checks-actions
Build commit: a4132fa (a4132fa857b71c2cec6719753cc4f31d6f7ea3c4)

@rust-timer

This comment has been minimized.

@rust-timer
Copy link
Collaborator

Finished benchmarking commit (a4132fa): comparison URL.

Overall result: ❌ regressions - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
2.2% [0.2%, 18.4%] 44
Regressions ❌
(secondary)
1.1% [0.2%, 4.3%] 9
Improvements ✅
(primary)
-0.4% [-0.5%, -0.3%] 7
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 1.9% [-0.5%, 18.4%] 51

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
3.6% [1.0%, 7.3%] 10
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
-3.2% [-3.8%, -2.0%] 3
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 2.1% [-3.8%, 7.3%] 13

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
3.6% [0.9%, 19.0%] 24
Regressions ❌
(secondary)
3.2% [2.2%, 4.2%] 2
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 3.6% [0.9%, 19.0%] 24

Binary size

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
1.5% [0.1%, 4.7%] 75
Regressions ❌
(secondary)
2.5% [0.8%, 4.1%] 76
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 1.5% [0.1%, 4.7%] 75

Bootstrap: 679.119s -> 683.166s (0.60%)
Artifact size: 316.14 MiB -> 317.49 MiB (0.43%)

@rustbot rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Apr 17, 2024
@Voultapher
Copy link
Contributor Author

Voultapher commented Apr 17, 2024

The binary-size results are inline with what we expected, given that slice::sort is more prevalent than slice::sort_unstable it makes sense that the regressions outweigh the improvements.

Regarding compile-times, we looked at those in our analysis here and here. The instruction count metric does not account for gains via parallelizability, right?

@orlp
Copy link
Contributor

orlp commented Apr 17, 2024

One thing that I find particularly strange is a 4.31% regression in helloworld-tiny, which doesn't call sort. Is this just due to the sort call in backtrace-rs?

@Amanieu Amanieu added the I-libs-nominated The issue / PR has been nominated for discussion during a libs team meeting. label May 15, 2024
@Amanieu
Copy link
Member

Amanieu commented May 15, 2024

We discussed this in the library team meeting today: we're happy to add this despite the compilation time regressions, however it would be nice to investigate the regression in html5ever in more detail to see if there is a low-hanging fruit to avoid the regression.

cc @rust-lang/wg-compiler-performance

@Amanieu Amanieu removed the I-libs-nominated The issue / PR has been nominated for discussion during a libs team meeting. label May 15, 2024
@Kobzol
Copy link
Contributor

Kobzol commented May 15, 2024

Let's do a fresh run to get more updated results.

@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label May 15, 2024
bors added a commit to rust-lang-ci/rust that referenced this pull request May 15, 2024
Replace sort implementations

This PR replaces the sort implementations with tailor-made ones that strike a balance of run-time, compile-time and binary-size, yielding run-time and compile-time improvements. Regressing binary-size for `slice::sort` while improving it for `slice::sort_unstable`. All while upholding the existing soft and hard safety guarantees, and even extending the soft guarantees, detecting strict weak ordering violations with a high chance and reporting it to users via a panic.

* `slice::sort` -> driftsort [design document](https://github.com/Voultapher/sort-research-rs/blob/main/writeup/driftsort_introduction/text.md)

* `slice::sort_unstable` -> ipnsort [design document](https://github.com/Voultapher/sort-research-rs/blob/main/writeup/ipnsort_introduction/text.md)

#### Why should we change the sort implementations?

In the [2023 Rust survey](https://blog.rust-lang.org/2024/02/19/2023-Rust-Annual-Survey-2023-results.html#challenges), one of the questions was: "In your opinion, how should work on the following aspects of Rust be prioritized?". The second place was "Runtime performance" and the third one "Compile Times". This PR aims to improve both.

#### Why is this one big PR and not multiple?

* The current documentation gives performance recommendations for `slice::sort` and `slice::sort_unstable`. If for example only one of them were to changed, this advise may be misleading for some Rust versions. By replacing them atomically, the advise remains largely unchanged, and users don't have to change their code.
* driftsort and ipnsort share a substantial part of their implementations.
* The implementation of `select_nth_unstable` uses internals of `slice::sort_unstable`, which makes it impractical to split changes.

---

This PR is a collaboration with `@orlp.`
@bors
Copy link
Contributor

bors commented May 15, 2024

⌛ Trying commit 2d82ce7 with merge 5b76fc7...

@bors
Copy link
Contributor

bors commented May 15, 2024

☀️ Try build successful - checks-actions
Build commit: 5b76fc7 (5b76fc7c5bc3549c9069da214ae6095eb65aacac)

@rust-timer

This comment has been minimized.

@rust-timer
Copy link
Collaborator

Finished benchmarking commit (5b76fc7): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
1.9% [0.2%, 18.1%] 52
Regressions ❌
(secondary)
2.1% [0.9%, 3.6%] 3
Improvements ✅
(primary)
-0.8% [-1.2%, -0.4%] 2
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 1.8% [-1.2%, 18.1%] 54

Max RSS (memory usage)

Results (primary 0.9%, secondary 4.2%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
2.4% [1.2%, 3.7%] 9
Regressions ❌
(secondary)
4.2% [4.2%, 4.2%] 1
Improvements ✅
(primary)
-6.0% [-9.1%, -3.0%] 2
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 0.9% [-9.1%, 3.7%] 11

Cycles

Results (primary 3.4%, secondary 3.0%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
3.4% [0.5%, 18.9%] 27
Regressions ❌
(secondary)
3.0% [2.2%, 3.8%] 2
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 3.4% [0.5%, 18.9%] 27

Binary size

Results (primary 1.4%, secondary 2.4%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
1.5% [0.0%, 4.0%] 76
Regressions ❌
(secondary)
2.4% [0.7%, 4.1%] 76
Improvements ✅
(primary)
-0.5% [-0.8%, -0.3%] 2
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 1.4% [-0.8%, 4.0%] 78

Bootstrap: 678.357s -> 686.997s (1.27%)
Artifact size: 316.14 MiB -> 319.70 MiB (1.13%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label May 15, 2024
@nnethercote
Copy link
Contributor

Worth noting that there are binary size regressions, too :(

- `slice::sort` -> driftsort
  https://github.com/Voultapher/sort-research-rs/blob/main/writeup/driftsort_introduction/text.md

- `slice::sort_unstable` -> ipnsort
  https://github.com/Voultapher/sort-research-rs/blob/main/writeup/ipnsort_introduction/text.md

Replaces the sort implementations with tailor made ones that strike a
balance of run-time, compile-time and binary-size, yielding run-time and
compile-time improvements. Regressing binary-size for `slice::sort`
while improving it for `slice::sort_unstable`. All while upholding the
existing soft and hard safety guarantees, and even extending the soft
guarantees, detecting strict weak ordering violations with a high chance
and reporting it to users via a panic.

In addition the implementation of `select_nth_unstable` is also adapted
as it uses `slice::sort_unstable` internals.
Also includes small doc fixes.
@Voultapher
Copy link
Contributor Author

Voultapher commented May 16, 2024

@nnethercote, indeed that's an unfortunate effect of this change. As discussed in the design documents, slice::sort_unstable should generally see improvements in binary-size, and slice::sort regressions. Orson and I invested a serious amount of effort in getting slice::sort as small as it is. Competitive designs in this space like glidesort are 5x larger. Given the 2023 Rust survey results, binary-size was the third lowest priority, and run-time performance the second highest. So I'd argue it's a worthwhile tradeoff.

@Kobzol
Copy link
Contributor

Kobzol commented May 16, 2024

CC @diondokter - this might be an interesting use-case for "optimize-for-size" stdlib feature.

@Voultapher
Copy link
Contributor Author

Users that care about binary-size and or compile-times over everything else, are probably better served with https://github.com/Voultapher/tiny-sort-rs.

@Voultapher
Copy link
Contributor Author

Voultapher commented May 16, 2024

@Amanieu I've investigated the html5ever regression. And here are my findings:

After rebasing this branch on master I ran hyperfine --min-runs 5 --prepare 'cargo clean' 'cargo build --release' with different versions of rustc, testing the 8412575 commit of html5ever. All tests were performed on my Zen 3 machine.

Results using rustc 1.78.0-nightly (516b6162a 2024-03-03)

Time (mean ± σ):      3.093 s ±  0.012 s    [User: 10.846 s, System: 2.155 s]
Range (min … max):    3.080 s …  3.107 s    5 runs

Locally built stage2 branch master

Time (mean ± σ):      6.315 s ±  0.039 s    [User: 19.684 s, System: 2.282 s]
Range (min … max):    6.265 s …  6.364 s    5 runs

Locally built stage2 branch a-new-sort

Time (mean ± σ):      6.462 s ±  0.085 s    [User: 20.048 s, System: 2.384 s]
Range (min … max):    6.407 s …  6.610 s    5 runs

image

Looking at the rustc perf suite results, we see the regression can be attributed entirely to the backend code generation. Which would match our own testing that significantly more LLVM-IR is generated, because the slice::sort impl goes from ~210 to ~850 LoC. The ~18% backend regressions accounts for a ~4.6% total compile-time regression in this scenario. My testing suggests ~2.3% wall-clock regression in this scenario, which was the highest measured regression in the rustc perf suite, suggesting other crates are likely less impacted.

@Amanieu
Copy link
Member

Amanieu commented May 16, 2024

Does cargo llvm-lines reveal any interesting hotspots or opportunities to reduce the number of monomorphizations?

@Voultapher
Copy link
Contributor Author

image

Left side old, right new. Other than the fact that sort already played a major role in the LLVM-IR output, I see no clear hotspot.

@nnethercote
Copy link
Contributor

I want to make sure I understand the current situation.

strike a balance of run-time, compile-time and binary-size,

AFAICT currently:

  • run-time: microbenchmarks show speed ups in a lot of cases, but also some slowdowns
  • compile-time: rustc-perf results are all worse for instructions, cycles, and wall-time
  • binary size: the same rustc-perf results show universal regressions

Am I overlooking anything?

Orson and I invested a serious amount of effort in getting slice::sort as small as it is. Competitive designs in this space like glidesort are 5x larger. Given the 2023 Rust survey results, binary-size was the third lowest priority, and run-time performance the second highest. So I'd argue it's a worthwhile tradeoff.

I appreciate the effort taken to make this implementation smaller, but the comparison with glidesort doesn't seem relevant; the relevant comparison is the current sort implementation.

Overall, I don't think these results are a worthwhile trade-off at the moment :(

@orlp
Copy link
Contributor

orlp commented May 16, 2024

@nnethercote I think you are underselling the magnitude of "run-time: microbenchmarks show speed ups in a lot of cases, but also some slowdowns". A lot of the benchmarks in that list are just table-stakes tests against regressions (you can only 'sort' an already ascending input so fast - obviously you won't see a significant improvement in that). It's not just about how many tests you improve/regress on, but also by how much.

Please allow me to highlight the important (in my opinion) tests from that microbenchmark list:

 name                                            diff %  speedup 
 slice::sort_large_random                       -65.49%   x 2.90 
 slice::sort_large_strings                      -37.75%   x 1.61 
 slice::sort_medium_random                      -47.89%   x 1.92 
 slice::sort_small_random                        11.11%   x 0.90 
 slice::sort_unstable_large_random              -47.57%   x 1.91 
 slice::sort_unstable_large_strings             -25.19%   x 1.34 
 slice::sort_unstable_medium_random             -22.15%   x 1.28 
 slice::sort_unstable_small_random              -15.79%   x 1.19

Also note that we have tons more benchmarks in the writeups (1, 2), with relevant real-world distributions.

For example, driftsort has a massive speedup compared to the current slice::sort for input distributions with only few elements (e.g. random_d20):

image

We are talking speedups that can be an order of magnitude or more... Are we really saying that it's okay the default [T]::sort function you use is 10x slower than what it could be because it regressed compile times on average by ~0.5% and binary size by... ~0.5-2%?

image

@nnethercote
Copy link
Contributor

Thanks for the additional data, that is helpful. I didn't realize that the "design document" links in the PR description included measurements of this sort, so I latched on to the only runtime measurements I could see in this PR.

Based on that, I'm happy to defer to the libs team's judgment.

@Voultapher
Copy link
Contributor Author

I'd appreciate it if participants in this conversation would read the design documents. They include comprehensive analysis and reasoning why this change makes sense in the author's eyes. Among other things they include 6000 benchmark results across 3 different machines, including x86 and Arm. The test and benchmark suite is the culmination of 2 years of research and work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
perf-regression Performance regression. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet