Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Linux: FlushProcessWriteBuffers using membarrier when available #20949

Merged
merged 3 commits into from Nov 20, 2018

Conversation

@janvorli
Copy link
Member

@tmds thank you a lot for implementing this. The change itself looks good, but I would like to understand its performance impact on the architectures we support before merging it. The reason is that @sdmaclea has experimentally made a similar change in #12259, but the performance hit on ARM64 was so bad that the time to run CoreCLR Pri1 tests went up by 50%.
We should get some idea on that from the CI test runs.

@sdmaclea
Copy link

Looks like this is using a different flag than #12259. Hopefully it is more performant.

It may also be that this could still be enabled in cases where mlock fails even if performance was worse by making it a fallback instead of the preferred mechanism.

@janvorli
Copy link
Member

I have just realized there is a problem. We build all portable Linux x64 builds on CentOS 7, but that one has an old kernel 3.x, which doesn't include the syscall and the sys/syscall.h doesn't include the __NR_membarrier definition. So we will need to define the __NR_membarrier value if it is not defined to make our portable tarballs use it.

@tmds
Copy link
Member Author

tmds commented Nov 12, 2018

I've pushed a change to define it on x64 when undefined.

@tmds
Copy link
Member Author

tmds commented Nov 13, 2018

ubuntu build failed with:

09:48:32 Error compiling /mnt/j/workspace/dotnet_coreclr/master/x64_checked_ubuntu_innerloop_prtest/Tools/Microsoft.DotNet.Build.Tasks.dll: Could not load file or assembly 'Microsoft.Build.Framework, Version=15.1.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a'. The system cannot find the file specified.
09:48:32 (Exception from HRESULT: 0x80070002)
09:48:32 Error: file "/mnt/j/workspace/dotnet_coreclr/master/x64_checked_ubuntu_innerloop_prtest/Tools/Microsoft.DotNet.Build.Tasks.dll" or one of its dependencies was not found
0

@dotnet-bot test Ubuntu x64 Checked Innerloop Build and Test please
@dotnet-bot test
Ubuntu x64 Checked Innerloop Build and Test (Jit - TieredCompilation=0) please

@@ -61,6 +61,14 @@ SET_DEFAULT_DEBUG_CHANNEL(PROCESS); // some headers have code with asserts, so d
#include <stdint.h>
#include <dlfcn.h>

#ifdef __linux__
#include <sys/syscall.h> // __NR_membarrier
// Ensure __NR_membarrier is defined for x64 portable build.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why just amd64? The portable builds are meant to work everywhere, not just on x64.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see - the syscall ids are different between architectures

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was actually thinking that we would add it for all the four architectures we support. But let me check what is our build system for portable builds for ARM and ARM64, maybe that's not necessary for the others.

@tmds
Copy link
Member Author

tmds commented Nov 15, 2018

This benchmarks these method against each other:

#include <unistd.h>
#include <sys/syscall.h>
#include <time.h>
#include <stdio.h>
#include <pthread.h>
#include <assert.h>
#include <sys/mman.h>

# define membarrier(...)  syscall(__NR_membarrier, __VA_ARGS__)

enum membarrier_cmd
{
    MEMBARRIER_CMD_QUERY                                 = 0,
    MEMBARRIER_CMD_GLOBAL                                = (1 << 0),
    MEMBARRIER_CMD_GLOBAL_EXPEDITED                      = (1 << 1),
    MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED             = (1 << 2),
    MEMBARRIER_CMD_PRIVATE_EXPEDITED                     = (1 << 3),
    MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED            = (1 << 4),
    MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE           = (1 << 5),
    MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE  = (1 << 6)
};

void InterlockedIncrement(int* ptr)
{
    __sync_add_and_fetch(ptr, 1);
}

int GetVirtualPageSize()
{
    return 4096;
}

long long timediff(struct timespec* start, struct timespec* finish)
{
    return (finish->tv_sec - start->tv_sec) * 1000000LL + (finish->tv_nsec - start->tv_nsec) / 1000;
}

int main()
{
    int status;
    struct timespec start, finish;
    // MEMBARRIER_CMD_PRIVATE_EXPEDITED
    int mask = membarrier(MEMBARRIER_CMD_QUERY, 0);
    if (mask >= 0 &&
        mask & MEMBARRIER_CMD_PRIVATE_EXPEDITED)
    {
        // Register intent to use the private expedited command.
        status = membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0);
        assert (status == 0);
    }
    const int count = 1000;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < count; i++)
    {
        status = membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);    
        assert (status == 0);
    }
    clock_gettime(CLOCK_MONOTONIC, &finish);
    printf("MEMBARRIER_CMD_PRIVATE_EXPEDITED: Execution took %lld us\n", timediff(&start, &finish));

    // MEMBARRIER_CMD_GLOBAL
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < count; i++)
    {
        status = membarrier(MEMBARRIER_CMD_GLOBAL, 0);
        assert (status == 0);
    }
    clock_gettime(CLOCK_MONOTONIC, &finish);
    printf("MEMBARRIER_CMD_GLOBAL: Execution took %lld us\n", timediff(&start, &finish));

    // Dirty page
    int* helperPage = (int*)(mmap(0, 4096, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0));
    assert(helperPage != MAP_FAILED);
    status = mlock(helperPage, GetVirtualPageSize());
    assert (status == 0);
    pthread_mutex_t flushProcessWriteBuffersMutex;
    status = pthread_mutex_init(&flushProcessWriteBuffersMutex, NULL);
    assert (status == 0);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < count; i++)
    {
        status = pthread_mutex_lock(&flushProcessWriteBuffersMutex);
        assert (status == 0);

        // Changing a helper memory page protection from read / write to no access
        // causes the OS to issue IPI to flush TLBs on all processors. This also
        // results in flushing the processor buffers.
        status = mprotect(helperPage, GetVirtualPageSize(), PROT_READ | PROT_WRITE);
        assert (status == 0);

        // Ensure that the page is dirty before we change the protection so that
        // we prevent the OS from skipping the global TLB flush.
        InterlockedIncrement(helperPage);

        status = mprotect(helperPage, GetVirtualPageSize(), PROT_NONE);
        assert (status == 0);

        status = pthread_mutex_unlock(&flushProcessWriteBuffersMutex);
        assert (status == 0);
    }
    clock_gettime(CLOCK_MONOTONIC, &finish);
    printf("Dirty page: Execution took %lld us\n", timediff(&start, &finish));
}

On my x64 laptop this gives:

MEMBARRIER_CMD_PRIVATE_EXPEDITED: Execution took 1623 us
MEMBARRIER_CMD_GLOBAL: Execution took 8523616 us
Dirty page: Execution took 5409 us

MEMBARRIER_CMD_PRIVATE_EXPEDITED is the fastest and MEMBARRIER_CMD_GLOBAL is much much slower.

@janvorli
Copy link
Member

@tmds thank you for the benchmark! I have tried to run it on my RPi3 running arm32 version of Raspbian upgraded to the latest version of packages that fortunately included kernel 4.14.

The results are:

MEMBARRIER_CMD_PRIVATE_EXPEDITED: Execution took 1748 us
MEMBARRIER_CMD_GLOBAL: Execution took 50147106 us
Dirty page: Execution took 7569 us

@janvorli
Copy link
Member

@tmds I have also looked at our portable builds for arm / arm64 Linux. The builds for arm are targeting Ubuntu 14.04 and they don't have the __NR_membarrier defined in the headers. The arm64 ones are targeting ubuntu 16.04 and they have the __NR_membarrier defined in the headers.

So we need to do what you did for amd64 also for arm. But to keep things consistent, I'd prefer doing that for arm64 and x86 too.

# define __NR_membarrier 389
# elif defined(__aarch64__)
# define __NR_membarrier 283
# elif
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took these numbers from: https://fedora.juszkiewicz.com.pl/syscalls.html.
The table is auto generated based on kernel source. On amd64 and i386 they match what I have in my headers. I haven't verified the arm and arm64 values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can confirm the arm64 value, it matches what I can see in the headers of the rootfs that we use for the lab builds.

Copy link
Member

@janvorli janvorli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you!

# elif defined(__aarch64__)
# define __NR_membarrier 283
# elif
# error Unknown architecture
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this to verify CI is matching one of the above.

@tmds
Copy link
Member Author

tmds commented Nov 19, 2018

@janvorli according to this article, mprotect doesn't generate IPIs on ARM. Maybe (on ARM) we should fall back to MEMBARRIER_CMD_GLOBAL and fail if neither membarrier cmds are available? That means kernels before 4.3 (Nov 2015) won't work and before 4.14 (Nov 2017) use the slow option. wdyt?

@jkotas
Copy link
Member

jkotas commented Nov 20, 2018

Maybe (on ARM) we should fall back to MEMBARRIER_CMD_GLOBAL and fail if neither membarrier cmds are available?

We would need to do figure out impact on the ecosystem before doing this.

As far as I know, we have not traced any crash on ARM back to FlushProcessWriteBuffers not working. Having a bug that is maybe hit once in a blue moon may be better than having slow or non-working system.

@jkotas jkotas merged commit baee9b6 into dotnet:master Nov 20, 2018
@jkotas
Copy link
Member

jkotas commented Nov 20, 2018

@tmds Thank you for doing this!

@janvorli
Copy link
Member

Based on my historical experience with ARM microkernel development, I believe that mprotect still needs to generate IPI to flush TLB on ARM. ARM64 is a different story though.

VSadov added a commit to VSadov/coreclr that referenced this pull request Apr 6, 2019
…VATE_EXPEDITED if available

Basically a port of  dotnet#20949   to  GCToOSInterface
VSadov added a commit to VSadov/coreclr that referenced this pull request Apr 6, 2019
…VATE_EXPEDITED if available

Basically a port of  dotnet#20949   to  GCToOSInterface
VSadov added a commit to VSadov/coreclr that referenced this pull request Apr 6, 2019
…VATE_EXPEDITED if available

Basically a port of  dotnet#20949   to  GCToOSInterface
jkotas pushed a commit that referenced this pull request Apr 6, 2019
…VATE_EXPEDITED if available (#23778)

Basically a port of  #20949   to  GCToOSInterface
Suchiman pushed a commit to Suchiman/corert that referenced this pull request Jun 10, 2019
…VATE_EXPEDITED if available (#23778)

Basically a port of  dotnet/coreclr#20949   to  GCToOSInterface
Suchiman pushed a commit to Suchiman/corert that referenced this pull request Jun 11, 2019
…VATE_EXPEDITED if available (#23778)

Basically a port of  dotnet/coreclr#20949   to  GCToOSInterface
jkotas pushed a commit to dotnet/corert that referenced this pull request Jun 11, 2019
…VATE_EXPEDITED if available (#23778)

Basically a port of  dotnet/coreclr#20949   to  GCToOSInterface
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
…et/coreclr#20949)

* Linux: FlushProcessWriteBuffers using membarrier when available

* Ensure __NR_membarrier is defined for x64 portable build.

* Define __NR_membarrier on all portable architectures


Commit migrated from dotnet/coreclr@baee9b6
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
…VATE_EXPEDITED if available (dotnet/coreclr#23778)

Basically a port of  dotnet/coreclr#20949   to  GCToOSInterface

Commit migrated from dotnet/coreclr@2c6c6c9
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
4 participants