Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Small perf regression in #69635 #69710

Closed
ecstatic-morse opened this issue Mar 4, 2020 · 14 comments
Closed

Small perf regression in #69635 #69710

ecstatic-morse opened this issue Mar 4, 2020 · 14 comments
Labels
I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@ecstatic-morse
Copy link
Contributor

ecstatic-morse commented Mar 4, 2020

#69635 (a rollup) caused a small performance regression. Since the regression has persisted through the last half dozen perf runs, it is likely not spurious. There are no outstanding perf runs near the regression (2020-03-02), so I am fairly confident in this attribution. That rollup consists of the following PRs:

I would not expect any of them to negatively impact performance, although I did do a perf run with an unchecked_add in Layout::repeat to no avail (see #69679). The regression is only noticeable in clean/patched incremental builds, so I don't think this is high priority, but it would be nice to know what's going on here.

cc @nnethercote @jonas-schievink @Mark-Simulacrum

@ecstatic-morse ecstatic-morse added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Mar 4, 2020
@Mark-Simulacrum Mark-Simulacrum added T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. I-compiletime Issue: Problems and improvements with respect to compile times. labels Mar 4, 2020
@ecstatic-morse ecstatic-morse removed the I-slow Issue: Problems and improvements with respect to performance of generated code. label Mar 4, 2020
@ecstatic-morse
Copy link
Contributor Author

ecstatic-morse commented Mar 4, 2020

The perf results for #69241, the original revert of #67174, are basically the mirror of the regression mentioned above. #67220, the first rollup to include #67174, also caused a performance regression, but it was attributed to a different PR in that rollup.

Therefore, I'm confident that #67174/#69544 is the root cause of this regression.

#69679 suggests that the optimizer does not benefit from being able to assume that the addition at the start of Layout::repeat does not wrap. Assuming that the invariant cited in #67174 does in fact hold, my only hypothesis is that removing the ? operator causes Layout::repeat to become eligible for inlining, which in turn makes one of its callees large enough to exceed the inlining threshold.

In any case, this seems like it should interest @nnethercote, since it indicates that the code in Layout is hot enough to benefit from micro-optimization. TBH, I'd also like to learn what approach they would use to investigate this kind of issue.

@nnethercote
Copy link
Contributor

Here is some output from a Callgrind profile that I had lying around. This is for a Check-CleanIncr run of ucd, which saw a 4.6% regression:

        .               pub fn repeat(&self, n: usize) -> Result<(Self, usize), LayoutErr> {
        .                   // Warning, removing the checked_add here led to segfaults in #67174. Further
        .                   // analysis in #69225 seems to indicate that this is an LTO-related
        .                   // miscompilation, so #67174 might be able to be reapplied in the future.
        .                   let padded_size = self
        .                       .size()
        .                       .checked_add(self.padding_needed_for(self.align()))
        .                       .ok_or(LayoutErr { private: () })?;
1,313,433 ( 0.01%)          let alloc_size = padded_size.checked_mul(n).ok_or(LayoutErr { private: () })?;
        .
        .                   unsafe {
        .                       // self.align is already known to be valid and alloc_size has been
        .                       // padded already.
        .                       Ok((Layout::from_size_align_unchecked(alloc_size, self.align()), padded_size))
        .                   }
        .               }

It suggests two things:

  • This function is not at all hot, so it's surprising that changing it would have a large perf effect.
  • It (the old version) is getting inlined. (If it wasn't inlined, there would be non-zero instruction counts on the fn line and the closing brace line.)

I can do a proper analysis tomorrow if that would be helpful. That will likely involve running Cachegrind on ucd-Check-CleanIncr with both versions of the compiler, and then doing a diff with cg_diff, and seeing what that finds.

But it might just be simpler to revert the change again? First it caused miscompilation, then it caused a perf regression... maybe this code should just be left alone :)

@ecstatic-morse
Copy link
Contributor Author

ecstatic-morse commented Mar 5, 2020

I'll open a PR that reverts #69544. I'd like to spend some time investigating it as well, since I have no idea why this would affect performance and it seems like a good opportunity to learn. If you feel the urge to look into this, however, don't let me stop you! I'll be pretty slow.

@ecstatic-morse

This comment has been minimized.

@rust-timer

This comment has been minimized.

@ecstatic-morse
Copy link
Contributor Author

@rust-timer build 4bfc61c90b9066ee396739e074d6d05abcfda04e

@rust-timer
Copy link
Collaborator

Queued 4bfc61c90b9066ee396739e074d6d05abcfda04e with parent c79f5f0, future comparison URL.

@ecstatic-morse
Copy link
Contributor Author

ecstatic-morse commented Mar 7, 2020

For anyone who's curious, here are the results of cg_diff master rerevert-67174 on a clean ucd-check run. Negative numbers means the count decreased after reverting #67174.

--------------------------------------------------------------------------------
Ir          I1mr   ILmr   Dr          D1mr      DLmr    Dw          D1mw    DLmw    
--------------------------------------------------------------------------------
-16,399,587 48,614 -1,909 -15,203,345 1,677,474 -18,799 -14,933,944 939,273 145,091  PROGRAM TOTALS

--------------------------------------------------------------------------------
Ir          I1mr   ILmr  Dr          D1mr      DLmr    Dw          D1mw    DLmw     file:function
--------------------------------------------------------------------------------
-12,941,229    644    12 -14,697,113    22,667  33,296 -14,697,093 196,198 103,505  /usr/src/debug/glibc-2.30-34-g994e529a37/string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms
  5,834,640    309   116   1,361,416    22,558   8,013   1,944,880       0       0  ???:_ZN4core3ptr13drop_in_place17h2d334451ff604121E.llvm.3047232575757559832
 -5,834,640   -309   -93  -1,361,416   -22,305  -8,147  -1,944,880       0       0  ???:_ZN4core3ptr13drop_in_place17h2d334451ff604121E.llvm.14951046493754649665
 -4,704,001 -6,786  -275  -1,031,214   -29,888   6,301    -802,849  13,182   4,175  /usr/src/debug/glibc-2.30-34-g994e529a37/malloc/malloc.c:_int_malloc
  2,660,698  2,227     0     958,237        -6       0     481,776      20       0  ???:_ZN5cargo4util8progress5State5print17h25c4eb5dec857e24E.llvm.17714790172995471728
   -826,762  1,803   -15    -153,015   -82,049 -46,742    -153,358   1,001     647  /usr/src/debug/glibc-2.30-34-g994e529a37/malloc/malloc.c:malloc_consolidate
   -806,184   -651   -24    -201,566   -64,090       9      52,412  -6,989     -10  /usr/src/debug/glibc-2.30-34-g994e529a37/malloc/malloc.c:malloc
   -591,222  5,241   -46    -148,617    -4,060     753      84,645  10,875    -157  /usr/src/debug/glibc-2.30-34-g994e529a37/malloc/malloc.c:_int_free
    403,708     40    -2         510         0       0         255       0       0  ???:core::intrinsics::is_nonoverlapping
   -301,845   -275   -26    -115,970   -43,771 -44,619     -33,584       0       0  /usr/src/debug/glibc-2.30-34-g994e529a37/malloc/malloc.c:unlink_chunk.isra.0
    196,931    677   108      54,384    -2,842      -2      43,974   2,948     -19  ???:alloc::raw_vec::RawVec<T,A>::reserve
    186,666    802   -34      40,289    -4,337    -187      14,859     -53      -6  /usr/src/debug/glibc-2.30-34-g994e529a37/malloc/malloc.c:realloc
    146,690 -1,963   -44      31,542     7,476    -668      22,101     -61      -4  /usr/src/debug/glibc-2.30-34-g994e529a37/malloc/malloc.c:_int_realloc
    131,682    342     0      33,170        13       0      32,318      -1       0  ???:_ZN5cargo4util8progress5State4tick17h24f78f2baccd5adbE.llvm.17714790172995471728
    -58,890   -820   203     -14,193    58,285  -2,498      -3,741  -3,269      -2  ???:hashbrown::map::RawEntryBuilderMut<K,V,S>::from_hash
    -31,850    777   -10      -7,350      -721     -38           0       0       0  ???:hashbrown::raw::RawTable<T>::bucket
     25,476      0     0      10,615         0       0           0       0       0  /rustc/7760cd0fbbbf2c59a625e075a5bdfa88b8e30f8a//src/libstd/sys/unix/alloc.rs:__rdl_realloc
     16,984      1     0           0         0       0      10,615       0       0  /rustc/7760cd0fbbbf2c59a625e075a5bdfa88b8e30f8a//src/libstd/alloc.rs:__rdl_realloc

No idea how to attribute this. Maybe it's worth noting that alloc::RawVec::reserve calls Layout::repeat via the following chain:

alloc::RawVec::reserve
alloc::RawVec::reserve_internal
Layout::array
Layout::repeat

@nnethercote
Copy link
Contributor

The Cachegrind diff basically says that the checked_add removal caused additional memcpy and malloc calls. I'm not sure what to make of that, I can't think of an explanation.

(BTW, I usually run Cachegrind with --cache-sim=no because 99% of the time I find instruction counts (Ir) to be the only useful metric.)

@rust-timer
Copy link
Collaborator

Finished benchmarking try commit 4bfc61c90b9066ee396739e074d6d05abcfda04e, comparison URL.

@tmiasko
Copy link
Contributor

tmiasko commented Mar 7, 2020

The different performance might be strictly about the fact that a different
code is actually being compiled and optimized, since the Layout is part of the
standard library.

It would be interesting to test compile time with different core but exactly
the same rustc.

bors added a commit that referenced this issue Mar 10, 2020
Rerevert "Remove checked_add in Layout::repeat"

This change, which originated in #67174 and was reapplied in #69544, seems to have caused a noticeable slowdown in patched/clean incremental builds (see #69710). Revert it for now while we investigate the underlying issue.

r? @nnethercote
@ecstatic-morse
Copy link
Contributor Author

ecstatic-morse commented Mar 11, 2020

The regression ultimately disappeared after #69879 (likely #69799).

@TimDiekmann It appears that minor changes to innocent-looking code in alloc.rs can result in relatively large perf changes. I suspect this will continue as alloc.rs is refactored. I think it would be a good idea to rollup=never or at least do a perf run on PRs that touch the allocator traits so we can more precisely attribute perf regressions. To clarify, this does not mean that changes should be blocked if they do indeed cause a regression.

@TimDiekmann
Copy link
Member

The next upcoming refactoring of AllocRef also regressed the performance slightly: #69889

@nikic
Copy link
Contributor

nikic commented Mar 12, 2021

As far as I can tell, there is nothing really actionable left in this issue, so I'm going to close it.

@nikic nikic closed this as completed Mar 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

7 participants