Performance optimization 😅 #1

Open
wants to merge 16 commits into 12.0

Conversation

wHo-EM-i
No description provided.

thestinger and others added 10 commits December 12, 2021 10:00
Pretend that there is never room to grow the heap in order to prevent
usage of these unsafe legacy functions. There are likely no users of
these in practice as it is inherently broken to use them outside of
malloc.

Signed-off-by: anupritaisno1 <www.anuprita804@gmail.com>
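
For illustration, assuming the "unsafe legacy functions" are brk(2)/sbrk(2), a minimal sketch of "pretend there is never room to grow the heap" could look like this (not the exact bionic code):

```
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

// Sketch only: always report that the program break cannot move, so any
// caller of these legacy interfaces fails cleanly with ENOMEM.
extern "C" int brk(void*) {
  errno = ENOMEM;
  return -1;
}

extern "C" void* sbrk(ptrdiff_t) {
  errno = ENOMEM;
  return reinterpret_cast<void*>(-1);
}
```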
In some cases, it can be useful to load libraries from zip files that
are only available by fd reference. For example, file descriptors of
APKs containing native libraries may be sent via Binder IPC for clients
to use.

Unfortunately, while this linker does support loading libraries from
file descriptors using android_dlopen_ext, using that API is not an
option because our dlopen calls originate from JNI loadLibrary requests
in ART.

This is necessary for compatibility with Google Play Services' dynamic
module system (Dynamite) without weakening the SELinux sandbox to allow
other apps to open module APKs from
/data/user_de/0/com.google.android.gms/app_chimera/m.

Change-Id: If44d5c3faf4f50e4704688b520b197ff151ae05a
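For context, this is roughly what the existing fd-based path through android_dlopen_ext looks like; the point of the change is that JNI loadLibrary requests cannot reach this API. The library name and offset handling below are illustrative only, not taken from this patch:

```
#include <android/dlext.h>
#include <dlfcn.h>
#include <sys/types.h>

// Open a library stored inside an APK/zip that is only reachable by fd.
// "libfoo.so" and entry_offset are placeholders for illustration.
void* load_from_apk_fd(int apk_fd, off64_t entry_offset) {
  android_dlextinfo extinfo = {};
  extinfo.flags = ANDROID_DLEXT_USE_LIBRARY_FD | ANDROID_DLEXT_USE_LIBRARY_FD_OFFSET;
  extinfo.library_fd = apk_fd;
  extinfo.library_fd_offset = entry_offset;  // offset of the stored .so inside the zip
  return android_dlopen_ext("libfoo.so", RTLD_NOW, &extinfo);
}
```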
This reduces entropy of the canary from 64-bit to 56-bit in exchange for
mitigating non-terminated C string overflows.

Signed-off-by: anupritaisno1 <www.anuprita804@gmail.com>
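
A minimal sketch of the idea (helper name hypothetical): the byte at the lowest address of the canary is forced to NUL, so an unterminated C string copy that runs into the canary cannot reproduce it:

```
#include <stdint.h>

// Sketch: clear the low byte of a randomly generated canary. On a
// little-endian 64-bit target this trades 8 bits of entropy (64 -> 56)
// for a guaranteed NUL terminator inside the guard value.
static uintptr_t harden_stack_chk_guard(uintptr_t random_value) {
  return random_value & ~static_cast<uintptr_t>(0xff);
}
```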
The hosts file is normally searched linearly.  This is very slow when
the file is large.  To mitigate this, read the hosts file and sort the
entries in an in-memory cache.  When an address is requested via
gethostbyname or getaddrinfo, binary search the cache.

In cases where the cache is not available, return a suitable error code
and fall back to the existing lookup code.

This has been written to behave as much like the existing lookup code as
possible.  But note bionic and glibc differ in behavior for some corner
cases.  Choose the most standard compliant behavior for these where
possible.  Otherwise choose the behavior that seems most reasonable.

RM-290

Change-Id: I3b322883cbc48b0d76a0ce9d149b59faaac1dc58
(cherry picked from commit ed4c3a6bd449a4ed70645071a440ae146f194116)
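A minimal sketch of the cache lookup described above (types and names hypothetical); the change keeps the existing linear scan as the fallback when the cache cannot be built:

```
#include <algorithm>
#include <string>
#include <vector>

struct HostsEntry {
  std::string name;     // host name from the hosts file
  std::string address;  // textual IPv4/IPv6 address
};

// Assumes `cache` was filled from the hosts file and sorted by name once.
const HostsEntry* lookup_host(const std::vector<HostsEntry>& cache,
                              const std::string& name) {
  auto it = std::lower_bound(
      cache.begin(), cache.end(), name,
      [](const HostsEntry& e, const std::string& n) { return e.name < n; });
  if (it != cache.end() && it->name == name) return &*it;
  return nullptr;  // not cached: caller falls back to the linear lookup
}
```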
If an exact name is not found in the hosts file and the host name
contains at least one dot, search for entries of the form "*.domain",
where domain is the portion of the host name after the first dot.  If
that is not found, repeat the search using that domain as the host name.

Example: a.b.c.example.com would search for the following in turn:
	a.b.c.example.com
	*.b.c.example.com
	*.c.example.com
	*.example.com
	*.com

Change-Id: I4b0bb81699151d5b371850daebf785e35ec9b180
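The search order above can be expressed as a small key generator (sketch, helper name hypothetical):

```
#include <string>
#include <vector>

// Exact name first, then "*." joined with each successively shorter
// domain suffix after the first dot.
std::vector<std::string> hosts_search_keys(const std::string& host) {
  std::vector<std::string> keys{host};
  for (size_t dot = host.find('.'); dot != std::string::npos;
       dot = host.find('.', dot + 1)) {
    keys.push_back("*" + host.substr(dot));  // "*.b.c.example.com", ...
  }
  return keys;
}

// hosts_search_keys("a.b.c.example.com") ->
//   "a.b.c.example.com", "*.b.c.example.com", "*.c.example.com",
//   "*.example.com", "*.com"
```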
…A57/Denver/Krait

Benchmarked on Nextbit Robin (MSM8992)

Before:
                                       iterations      ns/op
BM_string_strlen/8                            50M         75    0.106 GiB/s
BM_string_strlen/64                           10M        159    0.400 GiB/s
BM_string_strlen/512                           2M        819    0.625 GiB/s
BM_string_strlen/1024                       1000k       1547    0.662 GiB/s
BM_string_strlen/8Ki                         200k      12327    0.665 GiB/s
BM_string_strlen/16Ki                        100k      24579    0.667 GiB/s
BM_string_strlen/32Ki                         50k      48950    0.669 GiB/s
BM_string_strlen/64Ki                         20k      97195    0.674 GiB/s

After:
                                       iterations      ns/op
BM_string_strlen/8                            50M         13    0.574 GiB/s
BM_string_strlen/64                         1000k         23    2.703 GiB/s
BM_string_strlen/512                          20M        115    4.414 GiB/s
BM_string_strlen/1024                         10M        206    4.954 GiB/s
BM_string_strlen/8Ki                        1000k       1528    5.359 GiB/s
BM_string_strlen/16Ki                       1000k       2946    5.561 GiB/s
BM_string_strlen/32Ki                        500k       5910    5.544 GiB/s
BM_string_strlen/64Ki                        200k      11842    5.534 GiB/s

Signed-off-by: Bernhard Rosenkraenzer <Bernhard.Rosenkranzer@linaro.org>
Signed-off-by: Jake Weinstein <xboxlover360@gmail.com>
Signed-off-by: Vishalcj17 <vishalcj@aospa.co>
Change-Id: I1e74557046c901afd1356e8ebf3a6c39b0850b87
Change-Id: Ia82860326de114d48dea0e8cd93b836f1e826e15
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Overall, jemalloc performs significantly better than Scudo in Bionic's
real-world memory_replay traces (all times in milliseconds):

+----------------+-------+----------+
|     Trace      | Scudo | jemalloc |
+----------------+-------+----------+
|      SQL       |   27  |    21    |
| Angry Birds 2  |  2236 |   1501   |
|     Camera     |  4251 |   979    |
|  Candy Crush   |  2197 |   1325   |
|     Gmail      |  594  |   463    |
|      Maps      |  434  |   344    |
|     Photos     |  1330 |   477    |
|      PUBG      |  666  |   416    |
| surfaceflinger |  221  |   192    |
| system_server  |  1921 |   1416   |
|    SystemUI    |  102  |    79    |
|    YouTube     |  363  |   294    |
+----------------+-------+----------+

jemalloc also tends to use slightly less memory than Scudo for most
traces. These tests were conducted on desktop x86 Linux with glibc and
the latest stable version of each allocator, but they should still be
relevant. RSS values in KiB:

+----------------+--------+----------+
|     Trace      | Scudo  | jemalloc |
+----------------+--------+----------+
| Angry Birds 2  | 793948 |  746992  |
|     Camera     | 219372 |  251888  |
|  Candy Crush   | 548288 |  550148  |
|     Gmail      | 195236 |  193048  |
|      Maps      | 159860 |  159816  |
|     Photos     | 175436 |  171872  |
|      PUBG      | 233752 |  223572  |
| surfaceflinger | 94736  |  107068  |
| system_server  | 471048 |  484392  |
|    SystemUI    | 54432  |  60740   |
|    YouTube     | 139376 |  142252  |
+----------------+--------+----------+

While not representative of real-world usage, jemalloc also performs
fairly well in synthetic benchmarks (all times in nanoseconds):

+-----------------+---------+----------+
|    Benchmark    |  Scudo  | jemalloc |
+-----------------+---------+----------+
|     alloc 8     |   87.9  |   60.1   |
|     alloc 16    |   87.9  |    60    |
|     alloc 32    |   88.6  |   60.7   |
|     alloc 64    |   88.6  |   59.7   |
|    alloc 512    |   89.2  |    60    |
|    alloc 1024   |   89.4  |   59.8   |
|    alloc 8192   |   89.8  |   65.2   |
|   alloc 16384   |   92.7  |   69.1   |
|   alloc 32768   |   97.2  |    74    |
|   alloc 65536   |   109   |   83.8   |
|   alloc 131072  |  41536  |  42720   |
|    alloc40x 8   |   2156  |   2556   |
|   alloc40x 16   |   2155  |   2244   |
|   alloc40x 32   |   2234  |   2312   |
|   alloc40x 64   |   2234  |   2289   |
|   alloc40x 512  |   2274  |   8171   |
|  alloc40x 1024  |   2397  |   2162   |
|  alloc40x 8192  |   3550  |  78880   |
|  alloc40x 16384 |   3732  |  124454  |
|  alloc40x 32768 |   3849  |  275460  |
|  alloc40x 65536 |   4987  |  727598  |
| alloc40x 131072 | 2745207 | 3067980  |
|   alloc8192 1x  |   464   |   454    |
|   alloc8192 2x  |   510   |   488    |
|   alloc8192 3x  |   587   |   523    |
|   alloc8192 4x  |   665   |   557    |
|   alloc8192 5x  |   742   |   598    |
|   alloc8192 6x  |   818   |   633    |
|   alloc8192 7x  |   884   |   669    |
|   alloc8192 8x  |   960   |   699    |
|   alloc8192 9x  |   1045  |   734    |
|  alloc8192 10x  |   1131  |   770    |
|  alloc8192 11x  |   1207  |   806    |
|  alloc8192 12x  |   1282  |   841    |
|  alloc8192 13x  |   1363  |   877    |
|  alloc8192 14x  |   1442  |   912    |
|  alloc8192 15x  |   1512  |   944    |
|  alloc8192 16x  |   1587  |   978    |
|  alloc8192 24x  |   2256  |  21195   |
|  alloc8192 32x  |   2867  |  45446   |
|  alloc8192 40x  |   3522  |  71618   |
|  alloc8192 48x  |   4126  |  89740   |
|  alloc8192 56x  |   4786  |  114990  |
|  alloc8192 64x  |   5412  |  141082  |
|  alloc8192 72x  |   6049  |  170742  |
|  alloc8192 80x  |   6712  |  198480  |
|  alloc8192 88x  |   7331  |  221557  |
|  alloc8192 96x  |   7976  |  251462  |
|  alloc8192 104x |   8581  |  281626  |
|  alloc8192 112x |   9245  |  313164  |
|  alloc8192 120x |   9914  |  353147  |
|  alloc8192 128x |  10514  |  376625  |
|  alloc8192 136x |  11187  |  408194  |
|  alloc8192 144x |  11802  |  445694  |
|  alloc8192 160x |  13083  |  514547  |
|  alloc8192 176x |  14414  |  582501  |
|  alloc8192 192x |  15746  |  654346  |
|  alloc8192 208x |  17044  |  712620  |
|  alloc8192 224x |  18405  |  769963  |
|  alloc8192 240x |  19744  |  843969  |
|  alloc8192 256x |  21160  |  917803  |
+-----------------+---------+----------+

Scudo performs fairly well for a hardened memory allocator, but we're
optimizing for performance.

Full benchmark data with graphs: https://docs.google.com/spreadsheets/d/1LG_kxaK5cI14gGtnyM-nNNmfpMdV9Vh-LtYoq7H5J4s/edit

Change-Id: Ia4901eedfaa2c9779678c5b6532979de4919ee01
enh-google and others added 6 commits January 9, 2022 20:28
A decent chunk of the logcat profile is spent formatting the timestamps
for each line, and most of that time was going to snprintf(3). We should
find all the places that could benefit from a lighter-weight "format an
integer" and share something between them, but this is easy for now.

Before:

-----------------------------------------------------------
Benchmark                 Time             CPU   Iterations
-----------------------------------------------------------
BM_time_strftime        781 ns          775 ns       893102

After:

-----------------------------------------------------------
Benchmark                 Time             CPU   Iterations
-----------------------------------------------------------
BM_time_strftime        170 ns          168 ns      4139487

Much of the remaining time is in tzset() which seems unfortunate.

Test: treehugger
Change-Id: Ie0f7ee462ff1b1abea6f87d4a9a996d768e51056
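The "lighter-weight format an integer" idea, sketched (not the actual bionic helper): write a fixed-width, zero-padded decimal field directly instead of calling snprintf(3) for each strftime conversion:

```
// Writes `value` as exactly `width` decimal digits (zero-padded) and
// returns a pointer just past the last digit. No locale, no format
// parsing, no varargs.
static char* format_decimal(char* out, int value, int width) {
  char* end = out + width;
  for (char* p = end; p != out; ) {
    *--p = static_cast<char>('0' + value % 10);
    value /= 10;
  }
  return end;
}

// e.g. formatting "HH:MM:SS" with three calls instead of one snprintf():
//   p = format_decimal(p, tm.tm_hour, 2); *p++ = ':';
//   p = format_decimal(p, tm.tm_min, 2);  *p++ = ':';
//   p = format_decimal(p, tm.tm_sec, 2);
```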
clang was getting in the way of a strftime(3) optimization, and smaller
hammers weren't working, and this seems like the right choice for libc
anyway? If we have code that can usefully be optimized, we should do it
in the source. In general, though, no libc/libm author should be
ignorant of memset(3) or memcpy(3), and would have used them themselves if
it made sense. (And the compiler isn't using profiling data or anything;
it's just always assuming it should use the functions, and doesn't
consider whether the cost of the calls can be amortized or not.)

Test: treehugger
Change-Id: Ia7e22623e47bfbfcfe46c1af0d95ef7e8669c0f6
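To illustrate the behavior being disabled (the exact flag is not named in the message; -fno-builtin or -ffreestanding are the usual candidates, so treat this as an assumption): Clang recognizes plain loops like the one below and replaces them with a call into libc, which is what was getting in the way here.

```
// Without builtin recognition disabled, Clang typically compiles this
// loop into "call memset" rather than the loop as written.
void clear_buffer(char* buf, unsigned long n) {
  for (unsigned long i = 0; i < n; ++i) {
    buf[i] = 0;
  }
}
```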
From a logcat profile:
```
     |--95.06%-- convertPrintable(char*, char const*, unsigned long)
     |    |--13.95%-- [hit in function]
     |    |
     |    |--35.96%-- mbrtoc32
     |    |    |--82.72%-- [hit in function]
     |    |    |
     |    |    |--11.07%-- mbsinit
     |    |    |
     |    |    |--5.96%-- @plt
```
I think we'd assumed that mbsinit() would be inlined, but since these
functions aren't all in wchar.cpp, it wasn't being inlined. This change moves the
implementation into a (more clearly named) inline function so we can
trivially reclaim that 11%+6%.

Benchmarks before:
```
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_stdlib_mbrtowc_1            8.03 ns         7.95 ns     87144997
BM_stdlib_mbrtowc_2            22.0 ns         21.8 ns     32002437
BM_stdlib_mbrtowc_3            30.0 ns         29.7 ns     23517699
BM_stdlib_mbrtowc_4            37.4 ns         37.1 ns     18895204
BM_stdlib_mbstowcs_ascii     792373 ns       782484 ns          890 bytes_per_second=609.389M/s
BM_stdlib_mbstowcs_wide    15836785 ns     15678316 ns           44 bytes_per_second=30.4138M/s
```

Benchmarks after:
```
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_stdlib_mbrtowc_1            5.76 ns         5.72 ns    121863813
BM_stdlib_mbrtowc_2            17.1 ns         16.9 ns     41487260
BM_stdlib_mbrtowc_3            24.2 ns         24.0 ns     29141629
BM_stdlib_mbrtowc_4            30.3 ns         30.1 ns     23229291
BM_stdlib_mbstowcs_ascii     783506 ns       775389 ns          903 bytes_per_second=614.965M/s
BM_stdlib_mbstowcs_wide    12787003 ns     12672642 ns           55 bytes_per_second=37.6273M/s
```

Bug: http://b/206523398
Test: treehugger
Change-Id: If8c6c39880096ddd2cbd323c68dca82e9849ace6
Signed-off-by: Vishalcj17 <vishalcj@aospa.co>
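A minimal sketch of the inline helper (name hypothetical): the initial-state check becomes header-only, so callers outside wchar.cpp stop paying for an out-of-line mbsinit() call and its @plt indirection:

```
#include <stddef.h>
#include <wchar.h>

// Sketch: an mbstate_t is in its initial state when it is null or all zero.
static inline bool mbstate_is_initial(const mbstate_t* ps) {
  if (ps == nullptr) return true;
  const unsigned char* bytes = reinterpret_cast<const unsigned char*>(ps);
  for (size_t i = 0; i < sizeof(*ps); ++i) {
    if (bytes[i] != 0) return false;
  }
  return true;
}
```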
This saves a couple of syscalls in the common case, and also lets static
binaries run in a chroot without /dev/null as long as
stdin/stdout/stderr are actually connected to something (which the
toybox maintainer tried to do).

Test: manual with strace
Change-Id: Ic9a28896a07304a3bd428acfd9ddca9d22015f6e
Signed-off-by: Vishalcj17 <vishalcj@aospa.co>
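A hedged sketch of the approach (not the exact bionic code): probe fds 0-2 with fcntl(2) and only fall back to /dev/null for descriptors that are actually closed, so the common case never touches /dev/null at all:

```
#include <fcntl.h>
#include <unistd.h>

static void ensure_stdio_fds_are_open() {
  for (int fd = 0; fd <= 2; ++fd) {
    if (fcntl(fd, F_GETFD) != -1) continue;  // already connected to something
    int null_fd = open("/dev/null", O_RDWR);
    if (null_fd == -1) return;  // e.g. a chroot with no /dev/null
    if (null_fd != fd) {
      dup2(null_fd, fd);
      close(null_fd);
    }
  }
}
```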
This way, critical string functions are always at the start of a
cacheline.

Change-Id: I049e88d88a043911093641f44d9846fa5f6f3982
Suggested-By: Wilco Dijkstra <wilco.dijkstra@arm.com>
Test: TBD
Signed-off-by: Vishalcj17 <vishalcj@aospa.co>
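Illustration only (the actual change targets the assembly entry macro): the effect is equivalent to pinning each hot string routine to a 64-byte boundary, e.g.:

```
#include <stddef.h>

// A function aligned to a 64-byte boundary always starts at the
// beginning of a cacheline on typical ARM cores.
__attribute__((aligned(64)))
size_t my_strlen(const char* s) {
  size_t n = 0;
  while (s[n] != '\0') ++n;
  return n;
}
```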
This optimization is extracted from cortex-strings and bionic-ized,
and applied to arm-v7a cpus.

Stringbench results
https://android.git.linaro.org/gitweb/platform/external/stringbench.git

-----------------------------------------------------------------------
Nexus 6P (MSM8994):

Before:
15000 chars: 154.493394 seconds
 5000 chars:  51.545608 seconds

After:
15000 chars: 18.374188 seconds (88% improvement)
 5000 chars:  7.494449 seconds (85% improvement)
------------------------------------------------------------------------

OnePlus 3 (MSM8996):

Before:
15000 chars: 166.577121 seconds
 5000 chars: 60.121731 seconds

After:
15000 chars: 13.684960 seconds (91% improvement)
 5000 chars:  5.413961 seconds (90% improvement)
------------------------------------------------------------------------

Razer Phone (MSM8998)

Before:
15000 chars: 215.932986 seconds
 5000 chars:  72.147830 seconds

After:
15000 chars: 17.342402 seconds (92% improvement)
 5000 chars:  4.397512 seconds (94% improvement)
-------------------------------------------------------------------------

Change-Id: I1c3fb0c89ce2b3ee7e44f492367b6caf6db58ccf
Signed-off-by: Yingshiuan Pan <yingshiuan.pan@linaro.org>
Signed-off-by: Vishalcj17 <vishalcj@aospa.co>
wHo-EM-i pushed a commit to wHo-EM-i/bionic that referenced this pull request Aug 18, 2022
The first time should_trace() returns true, bionic_trace_begin() calls
open() on trace_marker.

The problem is that open() can call bionic_trace_begin(). We've observed
this happening, for example when:

* fdtrack is enabled. dlopen("libfdtrack.so") can be used to enable
  fdtrack on a process.
* ThreadA is busy unwinding inside fdtrack and is holding an fdtrack
  internal mutex.
* ThreadB calls bionic_trace_begin() for the first time since the
  property "debug.atrace.tags.enableflags" contains ATRACE_TAG_BIONIC.
* ThreadB calls open("/sys/kernel/tracing/trace_marker"). Since fdtrack
  is enabled, ThreadB tries to do unwinding as well.
* ThreadB, inside fdtrack's unwinding tries to grab the same mutex that
  ThreadA is holding.
* Mutex contention is reported using bionic_systrace, therefore
  bionic_trace_begin() is called again on ThreadB.
* ThreadB tries to grab g_lock in bionic_systrace.cpp, but that's
  already held by ThreadB itself, earlier on the stack. Therefore
  ThreadB is stuck.

I managed to reproduce the above scenario by manually pausing ThreadA
inside unwinding with a debugger and letting ThreadB hit
bionic_trace_begin() for the first time.

We could avoid using g_lock while calling open() (either by releasing
g_lock and reacquiring it later, or by using atomics), but
bionic_trace_begin() would try to call open() again. In my tests, open()
does not call bionic_trace_begin() a third time, because fdtrack has
reentrancy protection, but there might be another code path inside open
that calls bionic_trace_begin again (it could be racy or only happen in
certain configurations).

This commit fixes the problem by implementing reentrancy protection in
bionic_systrace.

Sample callstack from ThreadA deadlocked before the fix:
```
  * frame #0: 0x0000007436db077c libc.so`syscall at syscall.S:41
    frame #1: 0x0000007436db0ba0 libc.so`bionic_trace_begin(char const*) [inlined] __futex(ftx=0x000000743737a548, op=<unavailable>, value=2, timeout=0x0000000000000000, bitset=-1) at bionic_futex.h:45:16
    frame #2: 0x0000007436db0b8c libc.so`bionic_trace_begin(char const*) [inlined] __futex_wait_ex(ftx=0x000000743737a548, value=2) at bionic_futex.h:66:10
    frame #3: 0x0000007436db0b78 libc.so`bionic_trace_begin(char const*) [inlined] Lock::lock(this=0x000000743737a548) at bionic_lock.h:67:7
    frame #4: 0x0000007436db0b74 libc.so`bionic_trace_begin(char const*) [inlined] should_trace() at bionic_systrace.cpp:38:10
    frame #5: 0x0000007436db0b74 libc.so`bionic_trace_begin(message="Contending for pthread mutex") at bionic_systrace.cpp:59:8
    frame #6: 0x0000007436e193e4 libc.so`NonPI::MutexLockWithTimeout(pthread_mutex_internal_t*, bool, timespec const*) [inlined] NonPI::NormalMutexLock(mutex=0x0000007296cae9f0, shared=0, use_realtime_clock=false, abs_timeout_or_null=0x0000000000000000) at pthread_mutex.cpp:592:17
    frame #7: 0x0000007436e193c8 libc.so`NonPI::MutexLockWithTimeout(mutex=0x0000007296cae9f0, use_realtime_clock=false, abs_timeout_or_null=0x0000000000000000) at pthread_mutex.cpp:719:16
    frame #8: 0x0000007436e1912c libc.so`::pthread_mutex_lock(mutex_interface=<unavailable>) at pthread_mutex.cpp:839:12 [artificial]
    frame #9: 0x00000071a4e5b290 libfdtrack.so`std::__1::mutex::lock() [inlined] std::__1::__libcpp_mutex_lock(__m=<unavailable>) at __threading_support:256:10
    frame #10: 0x00000071a4e5b28c libfdtrack.so`std::__1::mutex::lock(this=<unavailable>) at mutex.cpp:31:14
    frame #11: 0x00000071a4e32634 libfdtrack.so`unwindstack::Elf::Step(unsigned long, unwindstack::Regs*, unwindstack::Memory*, bool*, bool*) [inlined] std::__1::lock_guard<std::__1::mutex>::lock_guard(__m=0x0000007296cae9f0) at __mutex_base:104:27
    frame #12: 0x00000071a4e32618 libfdtrack.so`unwindstack::Elf::Step(this=0x0000007296cae9c0, rel_pc=66116, regs=0x0000007266ca0470, process_memory=0x0000007246caa130, finished=0x0000007ff910efb4, is_signal_frame=0x0000007ff910efb0) at Elf.cpp:206:31
    frame #13: 0x00000071a4e2b3b0 libfdtrack.so`unwindstack::LocalUnwinder::Unwind(this=0x00000071a4ea1528, frame_info=<unavailable>, max_frames=34) at LocalUnwinder.cpp:102:22
    frame #14: 0x00000071a4e2a3ec libfdtrack.so`fd_hook(event=<unavailable>) at fdtrack.cpp:119:18
    frame #15: 0x0000007436dbf684 libc.so`::__open_2(pathname=<unavailable>, flags=<unavailable>) at open.cpp:72:10
    frame #16: 0x0000007436db0a04 libc.so`bionic_trace_begin(char const*) [inlined] open(pathname=<unavailable>, flags=524289) at fcntl.h:63:12
    frame #17: 0x0000007436db09f0 libc.so`bionic_trace_begin(char const*) [inlined] get_trace_marker_fd() at bionic_systrace.cpp:49:25
    frame #18: 0x0000007436db09c0 libc.so`bionic_trace_begin(message="pthread_create") at bionic_systrace.cpp:63:25
```

Bug: 213642769
Change-Id: I10d331859045cb4a8609b007f5c6cf2577ff44df
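A minimal sketch of the reentrancy guard (names hypothetical; the real patch lives in bionic_systrace.cpp): a per-thread flag makes the nested call from open() -> fdtrack -> mutex-contention tracing return immediately instead of deadlocking on state already held earlier in the same stack:

```
// Sketch: per-thread re-entrancy guard around the trace entry point.
static thread_local bool g_tracing_in_progress = false;

void bionic_trace_begin_sketch(const char* message) {
  if (g_tracing_in_progress) return;  // re-entered on this thread: bail out
  g_tracing_in_progress = true;
  // ... original body: check the atrace property, open trace_marker,
  // and write the "B|pid|message" record ...
  g_tracing_in_progress = false;
}
```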