Linux: FlushProcessWriteBuffers using membarrier when available #20949
Conversation
@tmds thank you a lot for implementing this. The change itself looks good, but I would like to understand its performance impact on the architectures we support before merging it. The reason is that @sdmaclea has experimentally made a similar change in #12259, but the performance hit on ARM64 was so bad that the time to run CoreCLR Pri1 tests went up by 50%. |
Looks like this is using a different flag than #12259. Hopefully it is more performant. It may also be that this could still be enabled in cases where mlock fails even if performance was worse by making it a fallback instead of the preferred mechanism. |
I have just realized there is a problem. We build all portable Linux x64 builds on CentOS 7, but that one has an old kernel 3.x, which doesn't include the syscall and the sys/syscall.h doesn't include the __NR_membarrier definition. So we will need to define the __NR_membarrier value if it is not defined to make our portable tarballs use it. |
I've pushed a change to define it on x64 when undefined. |
14de832
to
fd7bbfa
Compare
ubuntu build failed with:
@dotnet-bot test Ubuntu x64 Checked Innerloop Build and Test please |
src/pal/src/thread/process.cpp
Outdated
@@ -61,6 +61,14 @@ SET_DEFAULT_DEBUG_CHANNEL(PROCESS); // some headers have code with asserts, so d | |||
#include <stdint.h> | |||
#include <dlfcn.h> | |||
|
|||
#ifdef __linux__ | |||
#include <sys/syscall.h> // __NR_membarrier | |||
// Ensure __NR_membarrier is defined for x64 portable build. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why just amd64
? The portable builds are meant to work everywhere, not just on x64.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I see - the syscall ids are different between architectures
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was actually thinking that we would add it for all the four architectures we support. But let me check what is our build system for portable builds for ARM and ARM64, maybe that's not necessary for the others.
This benchmarks these method against each other: #include <unistd.h>
#include <sys/syscall.h>
#include <time.h>
#include <stdio.h>
#include <pthread.h>
#include <assert.h>
#include <sys/mman.h>
# define membarrier(...) syscall(__NR_membarrier, __VA_ARGS__)
enum membarrier_cmd
{
MEMBARRIER_CMD_QUERY = 0,
MEMBARRIER_CMD_GLOBAL = (1 << 0),
MEMBARRIER_CMD_GLOBAL_EXPEDITED = (1 << 1),
MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED = (1 << 2),
MEMBARRIER_CMD_PRIVATE_EXPEDITED = (1 << 3),
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED = (1 << 4),
MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE = (1 << 5),
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE = (1 << 6)
};
void InterlockedIncrement(int* ptr)
{
__sync_add_and_fetch(ptr, 1);
}
int GetVirtualPageSize()
{
return 4096;
}
long long timediff(struct timespec* start, struct timespec* finish)
{
return (finish->tv_sec - start->tv_sec) * 1000000LL + (finish->tv_nsec - start->tv_nsec) / 1000;
}
int main()
{
int status;
struct timespec start, finish;
// MEMBARRIER_CMD_PRIVATE_EXPEDITED
int mask = membarrier(MEMBARRIER_CMD_QUERY, 0);
if (mask >= 0 &&
mask & MEMBARRIER_CMD_PRIVATE_EXPEDITED)
{
// Register intent to use the private expedited command.
status = membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0);
assert (status == 0);
}
const int count = 1000;
clock_gettime(CLOCK_MONOTONIC, &start);
for (int i = 0; i < count; i++)
{
status = membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
assert (status == 0);
}
clock_gettime(CLOCK_MONOTONIC, &finish);
printf("MEMBARRIER_CMD_PRIVATE_EXPEDITED: Execution took %lld us\n", timediff(&start, &finish));
// MEMBARRIER_CMD_GLOBAL
clock_gettime(CLOCK_MONOTONIC, &start);
for (int i = 0; i < count; i++)
{
status = membarrier(MEMBARRIER_CMD_GLOBAL, 0);
assert (status == 0);
}
clock_gettime(CLOCK_MONOTONIC, &finish);
printf("MEMBARRIER_CMD_GLOBAL: Execution took %lld us\n", timediff(&start, &finish));
// Dirty page
int* helperPage = (int*)(mmap(0, 4096, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0));
assert(helperPage != MAP_FAILED);
status = mlock(helperPage, GetVirtualPageSize());
assert (status == 0);
pthread_mutex_t flushProcessWriteBuffersMutex;
status = pthread_mutex_init(&flushProcessWriteBuffersMutex, NULL);
assert (status == 0);
clock_gettime(CLOCK_MONOTONIC, &start);
for (int i = 0; i < count; i++)
{
status = pthread_mutex_lock(&flushProcessWriteBuffersMutex);
assert (status == 0);
// Changing a helper memory page protection from read / write to no access
// causes the OS to issue IPI to flush TLBs on all processors. This also
// results in flushing the processor buffers.
status = mprotect(helperPage, GetVirtualPageSize(), PROT_READ | PROT_WRITE);
assert (status == 0);
// Ensure that the page is dirty before we change the protection so that
// we prevent the OS from skipping the global TLB flush.
InterlockedIncrement(helperPage);
status = mprotect(helperPage, GetVirtualPageSize(), PROT_NONE);
assert (status == 0);
status = pthread_mutex_unlock(&flushProcessWriteBuffersMutex);
assert (status == 0);
}
clock_gettime(CLOCK_MONOTONIC, &finish);
printf("Dirty page: Execution took %lld us\n", timediff(&start, &finish));
} On my x64 laptop this gives:
MEMBARRIER_CMD_PRIVATE_EXPEDITED is the fastest and MEMBARRIER_CMD_GLOBAL is much much slower. |
@tmds thank you for the benchmark! I have tried to run it on my RPi3 running arm32 version of Raspbian upgraded to the latest version of packages that fortunately included kernel 4.14. The results are:
|
@tmds I have also looked at our portable builds for arm / arm64 Linux. The builds for arm are targeting Ubuntu 14.04 and they don't have the __NR_membarrier defined in the headers. The arm64 ones are targeting ubuntu 16.04 and they have the __NR_membarrier defined in the headers. So we need to do what you did for amd64 also for arm. But to keep things consistent, I'd prefer doing that for arm64 and x86 too. |
# define __NR_membarrier 389 | ||
# elif defined(__aarch64__) | ||
# define __NR_membarrier 283 | ||
# elif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took these numbers from: https://fedora.juszkiewicz.com.pl/syscalls.html.
The table is auto generated based on kernel source. On amd64 and i386 they match what I have in my headers. I haven't verified the arm and arm64 values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can confirm the arm64 value, it matches what I can see in the headers of the rootfs that we use for the lab builds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you!
# elif defined(__aarch64__) | ||
# define __NR_membarrier 283 | ||
# elif | ||
# error Unknown architecture |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this to verify CI is matching one of the above.
@janvorli according to this article, |
We would need to do figure out impact on the ecosystem before doing this. As far as I know, we have not traced any crash on ARM back to FlushProcessWriteBuffers not working. Having a bug that is maybe hit once in a blue moon may be better than having slow or non-working system. |
@tmds Thank you for doing this! |
Based on my historical experience with ARM microkernel development, I believe that mprotect still needs to generate IPI to flush TLB on ARM. ARM64 is a different story though. |
…VATE_EXPEDITED if available Basically a port of dotnet#20949 to GCToOSInterface
…VATE_EXPEDITED if available Basically a port of dotnet#20949 to GCToOSInterface
…VATE_EXPEDITED if available Basically a port of dotnet#20949 to GCToOSInterface
…VATE_EXPEDITED if available (#23778) Basically a port of dotnet/coreclr#20949 to GCToOSInterface
…VATE_EXPEDITED if available (#23778) Basically a port of dotnet/coreclr#20949 to GCToOSInterface
…VATE_EXPEDITED if available (#23778) Basically a port of dotnet/coreclr#20949 to GCToOSInterface
…et/coreclr#20949) * Linux: FlushProcessWriteBuffers using membarrier when available * Ensure __NR_membarrier is defined for x64 portable build. * Define __NR_membarrier on all portable architectures Commit migrated from dotnet/coreclr@baee9b6
…VATE_EXPEDITED if available (dotnet/coreclr#23778) Basically a port of dotnet/coreclr#20949 to GCToOSInterface Commit migrated from dotnet/coreclr@2c6c6c9
Implements https://github.com/dotnet/coreclr/issues/1563
Fixes https://github.com/dotnet/coreclr/issues/18634 https://github.com/dotnet/coreclr/issues/20215 https://github.com/dotnet/corefx/issues/33030
@janvorli ptal