Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dump and Reset stats data on demand using SIGUSR1 signal #1857

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

TejaswineeL
Copy link
Contributor

@TejaswineeL TejaswineeL commented Apr 23, 2024

Description of the changes

Currently Gramine does not support to display the stats values if user wants to collect only for a particular duration (e.g., only during perf benchmarking runs). This PR features displaying and resetting the stats data on demand using SIGUSR1 signal. That way user can collect data only for a particular duration. This also eliminates the noise added in the stats data during the start and stop of the process.

How to test this PR?

Please follow the instruction here to enable SGX stats. Now execute the workload in Gramine SGX. Send SIGUSR1 signal using command kill -SIGUSR1 <pid> to dump and reset the stats. This will show the current stats values on the command line and then will reset the stats values. Example:
1. Add below manifest options to gramine/CI-Examples/redis/redis-server.manifest.template.
sgx.debug = true
sgx.enable_stats=true
2. Build the redis example: make clean && make SGX=1
3. Run redis server: gramine-sgx redis-server &
4. Get the redis server pid and run kill -SIGUSR1 <pid> to dump and reset the stats.
5. Immediately run the benchmarks using command src/src/redis-benchmark
6. Once benchmark run is complete, run kill -SIGUSR1 <pid> again to get the stats values corresponding to only the
benchmark execution.
8. Summation(total) value of stats values of all the threads are displayed at the end.
9. That’s how we get the stats values and the summation of them corresponding to the benchmark run. The stats are
displayed on the command line.

Same steps can be tried with MongoDB, MySQL etc..


This change is Reviewable

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 3 files reviewed, 12 unresolved discussions, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @TejaswineeL)


pal/src/host/linux-sgx/host_exception.c line 18 at r1 (raw file):

                     */

#include <dirent.h>

I feel like you added a lot of unnecessary includes. Please remove the not-needed ones.


pal/src/host/linux-sgx/host_exception.c line 35 at r1 (raw file):

#include "ucontext.h"

#define MAX_DBG_THREADS 4096

But this macro is declared somewhere else, why not #include <where-it-declared.h>?


pal/src/host/linux-sgx/host_exception.c line 38 at r1 (raw file):

static const int ASYNC_SIGNALS[] = {SIGTERM, SIGCONT};
static int send_sigusr1_signal_to_children(void);

What's the point of this declaration?


pal/src/host/linux-sgx/host_exception.c line 199 at r1 (raw file):

}

static int send_sigusr1_signal_to_children() {

Need (void) here


pal/src/host/linux-sgx/host_exception.c line 203 at r1 (raw file):

    for (size_t i = 1; i < MAX_DBG_THREADS; i++) {
        int child_tid = ((struct enclave_dbginfo*)DBGINFO_ADDR)->thread_tids[i];

TID can legitimately be 0. Please use another struct field to determine whether this array item is a legit thread.


pal/src/host/linux-sgx/host_exception.c line 222 at r1 (raw file):

    static const uint64_t SLEEP_STEP   = 1000000;   /* 100 steps before capped */

    if(g_sgx_enable_stats) {

Please change this cascade of IFs to early-returns:

if (!g_sgx_enable_stats)
    return;

if (DO_SYSCALL(gettid) != g_host_pid) {
    /* non-main threads, just dump thread-local stats */
}

pal/src/host/linux-sgx/host_exception.c line 229 at r1 (raw file):

            uint64_t sleep_time    = 0;

            while((no_of_children) > (__atomic_load_n(&no_of_children_visited, __ATOMIC_RELAXED))) {

You can't use RELAXED if you use ACK_REL elsewhere


pal/src/host/linux-sgx/host_exception.c line 234 at r1 (raw file):

                        sleep_time += SLEEP_STEP;
                    struct timespec tv = {.tv_sec = 0, .tv_nsec = sleep_time};
                    (void)DO_SYSCALL(nanosleep, &tv, /*rem=*/NULL);

All these constants seem irrelevant, I suggest to just do sched_yield() instead of nanosleep. And remove all constants with LOOP and SLEEP.


pal/src/host/linux-sgx/host_exception.c line 240 at r1 (raw file):

                }
            }
            update_and_print_stats(true);

Please add /*process_wide=*/


pal/src/host/linux-sgx/host_exception.c line 240 at r1 (raw file):

                }
            }
            update_and_print_stats(true);

But this means that main thread does not reset his local stats?


pal/src/host/linux-sgx/host_thread.c line 75 at r1 (raw file):

        __atomic_exchange_n(&g_aex_cnt, 0, __ATOMIC_ACQ_REL);
        __atomic_exchange_n(&g_sync_signal_cnt, 0, __ATOMIC_ACQ_REL);
        __atomic_exchange_n(&g_async_signal_cnt, 0, __ATOMIC_ACQ_REL);

Why exchange, and not store?


pal/src/host/linux-sgx/host_thread.c line 107 at r1 (raw file):

    int ret;

    /* set GS reg of this thread to thread's TCB; */

Why is this needed?

@TejaswineeL TejaswineeL changed the title Debug statsAdd a support for debug-stats dumping and resetting on demand dump reset Dump and reset stats data on demand using SIGUSR1 signal Apr 25, 2024
@TejaswineeL TejaswineeL changed the title Dump and reset stats data on demand using SIGUSR1 signal Dump and Reset stats data on demand using SIGUSR1 signal Apr 25, 2024
Signed-off-by: TejaswineeL <tejaswinee.ramdas.langhe@intel.com>
@TejaswineeL TejaswineeL force-pushed the debug_stats_dump_reset branch 2 times, most recently from 8caf722 to 5fcbfed Compare April 25, 2024 08:35
Signed-off-by: TejaswineeL <tejaswinee.ramdas.langhe@intel.com>
Signed-off-by: TejaswineeL <tejaswinee.ramdas.langhe@intel.com>
Copy link
Contributor Author

@TejaswineeL TejaswineeL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 3 files reviewed, 12 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv)


pal/src/host/linux-sgx/host_exception.c line 203 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

TID can legitimately be 0. Please use another struct field to determine whether this array item is a legit thread.

Approach 1:

As other structures storing children thread ids are not accessible to host_exception.c (based on the makefile),
I propose the below:

at (https://github.com/gramineproject/gramine/blob/53f5511123d51e7baf9cf98ffc7d3ad6096e6c6b/pal/src/host/linux-sgx/host_thread.c#L165)
we have g_enclave_thread_num keeping track of the count of tid assignment of the array thread_tids[]
I have an idea of using this variable to decide legit thread_tids[] inorder to send SIGUSR1 signal
we will be doing below changes if we go ahead with this idea.

https://github.com/TejaswineeL/gramine/blob/14a8be3adcb801ed4a9a1ff65afb178448111d86/pal/src/host/linux-sgx/host_exception.c#L197
/* using g_enclave_thread_num in the for loop condition check */
spinlock\_lock(&g\_enclave\_thread\_map\_lock);
    for (size\_t i = 1; i < g\_enclave\_thread\_num; i++) {
        int child\_tid = ((struct enclave\_dbginfo\*)DBGINFO\_ADDR)->thread\_tids\[i\];
        log\_always("child tid with g\_enclave\_thread\_num %d", child\_tid);
            DO\_SYSCALL(tkill, child\_tid, SIGUSR1);
            signal\_counter++;
    }
spinlock\_unlock(&g\_enclave\_thread\_map\_lock);

static size_t g_enclave_thread_num = 0;

/* removing static keyword so that we can use these variables in host_exception.c file */
spinlock\_t g\_enclave\_thread\_map\_lock = INIT\_SPINLOCK\_UNLOCKED;
size\_t g\_enclave\_thread\_num = 0;

stats_PR/changes_stats_or/gramine/pal/src/host/linux-sgx/pal_tcb.h
/* adding below lines to the file included in host_exception.c*/
extern size\_t g\_enclave\_thread\_num;
extern spinlock\_t g\_enclave\_thread\_map\_lock;

Approach 2:

All the children thread ids are present at location /proc/pid/task. This location only stores the main process id and children thread ids.
I can validate thread ids of the array thread_tids[] with the tids present at /proc//task during execution
Below snippet gives us the children thread ids

-defining the filepath whre we can find the children thread ids
snprintf(filepath, sizeof(filepath), "/proc/%d/task/", g\_host\_pid);

-Fetching the children tids from the filepath location with files

   struct dirent \*entry;
   dir = opendir(filename);
    if (dir == NULL) {
        perror("Error opening directory");
        return -1;
    }
    while ((entry = readdir(dir)) != NULL) {
        if (entry->d\_type == DT\_DIR && atoi(entry->d\_name) != g\_host\_pid) {
           if (strcmp(entry->d\_name, ".") == 0 || strcmp(entry->d\_name, "..") == 0) {
                // Skip current directory and parent directory
                continue;
            }
            printf(" thread id from proc %s", entry->d\_name);
         }
    }
    closedir(dir);

Kindly, Let us know which of the above approaches would you like.

Copy link
Contributor Author

@TejaswineeL TejaswineeL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 3 files reviewed, 12 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv)


pal/src/host/linux-sgx/host_exception.c line 203 at r1 (raw file):

Previously, TejaswineeL (Tejaswinee) wrote…

Approach 1:

As other structures storing children thread ids are not accessible to host_exception.c (based on the makefile),
I propose the below:

at (https://github.com/gramineproject/gramine/blob/53f5511123d51e7baf9cf98ffc7d3ad6096e6c6b/pal/src/host/linux-sgx/host_thread.c#L165)
we have g_enclave_thread_num keeping track of the count of tid assignment of the array thread_tids[]
I have an idea of using this variable to decide legit thread_tids[] inorder to send SIGUSR1 signal
we will be doing below changes if we go ahead with this idea.

https://github.com/TejaswineeL/gramine/blob/14a8be3adcb801ed4a9a1ff65afb178448111d86/pal/src/host/linux-sgx/host_exception.c#L197
/* using g_enclave_thread_num in the for loop condition check */
spinlock\_lock(&g\_enclave\_thread\_map\_lock);
    for (size\_t i = 1; i < g\_enclave\_thread\_num; i++) {
        int child\_tid = ((struct enclave\_dbginfo\*)DBGINFO\_ADDR)->thread\_tids\[i\];
        log\_always("child tid with g\_enclave\_thread\_num %d", child\_tid);
            DO\_SYSCALL(tkill, child\_tid, SIGUSR1);
            signal\_counter++;
    }
spinlock\_unlock(&g\_enclave\_thread\_map\_lock);

static size_t g_enclave_thread_num = 0;

/* removing static keyword so that we can use these variables in host_exception.c file */
spinlock\_t g\_enclave\_thread\_map\_lock = INIT\_SPINLOCK\_UNLOCKED;
size\_t g\_enclave\_thread\_num = 0;

stats_PR/changes_stats_or/gramine/pal/src/host/linux-sgx/pal_tcb.h
/* adding below lines to the file included in host_exception.c*/
extern size\_t g\_enclave\_thread\_num;
extern spinlock\_t g\_enclave\_thread\_map\_lock;

Approach 2:

All the children thread ids are present at location /proc/pid/task. This location only stores the main process id and children thread ids.
I can validate thread ids of the array thread_tids[] with the tids present at /proc//task during execution
Below snippet gives us the children thread ids

-defining the filepath whre we can find the children thread ids
snprintf(filepath, sizeof(filepath), "/proc/%d/task/", g\_host\_pid);

-Fetching the children tids from the filepath location with files

   struct dirent \*entry;
   dir = opendir(filename);
    if (dir == NULL) {
        perror("Error opening directory");
        return -1;
    }
    while ((entry = readdir(dir)) != NULL) {
        if (entry->d\_type == DT\_DIR && atoi(entry->d\_name) != g\_host\_pid) {
           if (strcmp(entry->d\_name, ".") == 0 || strcmp(entry->d\_name, "..") == 0) {
                // Skip current directory and parent directory
                continue;
            }
            printf(" thread id from proc %s", entry->d\_name);
         }
    }
    closedir(dir);

Kindly, Let us know which of the above approaches would you like.

Approach1 logs
Sharing below the logs showing threads_tids[] array contents and value of g_enclave_thread_num used in approach1,

image.png

Here, threads_tids[0] always denotes the main thread tid (which we dont have to send SIGUSR1 signal to)threads_tids[1] to threads_tids[5] shows the children tids. threads_tids[6] and threads_tids[7] are empty.

Approach2
Sharing below the contents of dir /proc/pid/task showing children tids

image copy 1.png

Plan is to validate children tids got from threads_tids[] with children tids of /proc/pid/task

Kindly, Let us know which approach would you prefer.

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 3 files at r1, 2 of 2 files at r4, all commit messages.
Reviewable status: all files reviewed, 22 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @TejaswineeL)


-- commits line 12 at r4:
We don't use fixups to fixups. Each next fixup is supposed to be a fixup to the first (main) commit. Please check other PRs from other maintainers, who they do it.


pal/src/host/linux-sgx/host_exception.c line 18 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

I feel like you added a lot of unnecessary includes. Please remove the not-needed ones.

Is dirent.h needed? This header is about file system directories, but you don't use them in your patch?


pal/src/host/linux-sgx/host_exception.c line 203 at r1 (raw file):

Previously, TejaswineeL (Tejaswinee) wrote…

Approach1 logs
Sharing below the logs showing threads_tids[] array contents and value of g_enclave_thread_num used in approach1,

image.png

Here, threads_tids[0] always denotes the main thread tid (which we dont have to send SIGUSR1 signal to)threads_tids[1] to threads_tids[5] shows the children tids. threads_tids[6] and threads_tids[7] are empty.

Approach2
Sharing below the contents of dir "/proc/pid/task " showing children tids

image copy 1.png

Plan is to validate children tids got from threads_tids[] with children tids of '/proc/pid/task'

Kindly, Let us know which approach would you prefer.

I'm sorry that I confused you. I now realized that TID 0 (= PID 0) is the kernel PID on the host Linux. So actually we're safe here, and you can consider tid == 0 as "no actual Gramine thread".

So your code is correct, and please excuse me that you had to go research the possibilities :(


pal/src/host/linux-sgx/host_exception.c line 240 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

But this means that main thread does not reset his local stats?

Could you answer my question? I still think that the main thread doesn't reset its local stats.


pal/src/host/linux-sgx/host_exception.c line 194 at r4 (raw file):

static int send_sigusr1_signal_to_children(void) {
    int signal_counter= 0;

= -- add space


pal/src/host/linux-sgx/host_exception.c line 194 at r4 (raw file):

static int send_sigusr1_signal_to_children(void) {
    int signal_counter= 0;

That's a bad variable name. Please rename similar to how you have it in another func: no_of_children


pal/src/host/linux-sgx/host_exception.c line 194 at r4 (raw file):

static int send_sigusr1_signal_to_children(void) {
    int signal_counter= 0;

We don't like to use int in cases of clearly unsigned "counter" integers. So please use size_t, that's the safest option.


pal/src/host/linux-sgx/host_exception.c line 196 at r4 (raw file):

    int signal_counter= 0;

    for (size_t i = 1; i < MAX_DBG_THREADS; i++) {

Why do you start with 1?

If you do this to NOT send the signal to yourself, then please add a func argument and use it for checking:

static int send_sigusr1_signal_to_children(int my_tid) {
    ...
    for (size_t i = 0; i < MAX_DBG_THREADS; i++) {
        int child_tid = ((struct enclave_dbginfo*)DBGINFO_ADDR)->thread_tids[i];
        if (child_tid == my_tid)
            continue;
}

pal/src/host/linux-sgx/host_exception.c line 198 at r4 (raw file):

    for (size_t i = 1; i < MAX_DBG_THREADS; i++) {
        int child_tid = ((struct enclave_dbginfo*)DBGINFO_ADDR)->thread_tids[i];
        if(child_tid > 0) {

No need for > 0, simply:

if (child_tid) {

pal/src/host/linux-sgx/host_exception.c line 208 at r4 (raw file):

static void dump_and_reset_stats(void)
{

Please use our coding style in this PR. In particular, here { should be on the same line as the func declaration:

static void dump_and_reset_stats(void) {

pal/src/host/linux-sgx/host_exception.c line 209 at r4 (raw file):

static void dump_and_reset_stats(void)
{
    static atomic_int no_of_children_visited = 0;

You don't need atomic_int if you use __atomic_xxx() operations. These are two different and unrelated ways of using atomics.

Here, you can just use int.


pal/src/host/linux-sgx/host_exception.c line 210 at r4 (raw file):

{
    static atomic_int no_of_children_visited = 0;
    static const uint64_t LOOP_ATTEMPTS_MAX = 10000;   /* rather arbitrary */

Why you keep this const? Just remove it and always do sched_yield() (and don't do CPU_RELAX() at all)


pal/src/host/linux-sgx/host_exception.c line 212 at r4 (raw file):

    static const uint64_t LOOP_ATTEMPTS_MAX = 10000;   /* rather arbitrary */

    if(DO_SYSCALL(gettid) == g_host_pid) {

if ( -- please add space in between. Also, please check all other such places in your PR.


pal/src/host/linux-sgx/host_exception.c line 214 at r4 (raw file):

    if(DO_SYSCALL(gettid) == g_host_pid) {
        int no_of_children = send_sigusr1_signal_to_children();
        uint64_t loop_attempts = 0;

Useless variable.


pal/src/host/linux-sgx/host_exception.c line 216 at r4 (raw file):

        uint64_t loop_attempts = 0;

        /* Wait here until all the children are done processing the signal. */

Remove this comment, it's obvious what the code does


pal/src/host/linux-sgx/host_exception.c line 217 at r4 (raw file):

        /* Wait here until all the children are done processing the signal. */
        while((no_of_children) > (__atomic_load_n(&no_of_children_visited, __ATOMIC_ACQUIRE))) {

Please swap these two expressions places, so it's more readable ("while number of visited children is still less than the total number of children")


pal/src/host/linux-sgx/host_exception.c line 229 at r4 (raw file):

        __atomic_store_n(&no_of_children_visited, 0, __ATOMIC_RELEASE);
    } else {
        log_always("----- DUMPTING and RESETTING SGX STATS -----");

You sure we want to print this out? Sounds rather useless, because update_and_print_stats() will also print an information line.


pal/src/host/linux-sgx/host_exception.c line 235 at r4 (raw file):

    PAL_HOST_TCB* tcb = pal_get_host_tcb();
    int ret = pal_host_tcb_reset_stats(tcb);

Why do you do pal_get_host_tcb() in this function and pass it to pal_host_tcb_reset_stats()? You can move that into pal_host_tcb_reset_stats() itself, just like is done in update_nad_print_stats().


pal/src/host/linux-sgx/host_exception.c line 237 at r4 (raw file):

    int ret = pal_host_tcb_reset_stats(tcb);
    if(ret < 0)
        return;

There's no need for return as this is a void function and you're at the end of this function.

Since you don't need return, you also don't need ret.


pal/src/host/linux-sgx/host_exception.c line 245 at r4 (raw file):

    __UNUSED(uc);

    if(g_sgx_enable_stats)

if ( -- space


pal/src/host/linux-sgx/host_exception.c line 248 at r4 (raw file):

        dump_and_reset_stats();

    return;

There's no need in return in a void function. Remove this line.


pal/src/host/linux-sgx/host_thread.c line 107 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Why is this needed?

I still don't see a need in this arch_prctl() syscall. Why did you add it here?


pal/src/host/linux-sgx/host_thread.c line 69 at r4 (raw file):

                   "  # of async signals:  %lu",
                   pid, g_eenter_cnt, g_eexit_cnt, g_aex_cnt,
                   g_sync_signal_cnt, g_async_signal_cnt);

All these g_ variables must be read via __atomic_load_n().

Copy link
Contributor Author

@TejaswineeL TejaswineeL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 22 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv)


pal/src/host/linux-sgx/host_exception.c line 18 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Is dirent.h needed? This header is about file system directories, but you don't use them in your patch?

That's right. Its not needed. I will remove in the upcoming patch.


pal/src/host/linux-sgx/host_exception.c line 35 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

But this macro is declared somewhere else, why not #include <where-it-declared.h>?

The Macro is declared in #include "gdb_integration/sgx_gdb.h. This is already included.


pal/src/host/linux-sgx/host_exception.c line 240 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Could you answer my question? I still think that the main thread doesn't reset its local stats.

Main thread does reset its local stats. Unconditional code snippet shown below does that.
https://github.com/gramineproject/gramine/pull/1857/files#diff-36ddf4f0901a94224ef8f08faa82ad332c9afc3ed952d5950e5597f110464ec7R235

Here's the logs of the main thread stats dump and reset
image.png


pal/src/host/linux-sgx/host_exception.c line 196 at r4 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Why do you start with 1?

If you do this to NOT send the signal to yourself, then please add a func argument and use it for checking:

static int send_sigusr1_signal_to_children(int my_tid) {
    ...
    for (size_t i = 0; i < MAX_DBG_THREADS; i++) {
        int child_tid = ((struct enclave_dbginfo*)DBGINFO_ADDR)->thread_tids[i];
        if (child_tid == my_tid)
            continue;
}

Yes, thread_tids[0] is the main thread id. It is the same tid to which we send kill -SIGUSR1 signal via command line.
I will add the check you suggested I will use var name as main_tid instead of my_tid.


pal/src/host/linux-sgx/host_exception.c line 229 at r4 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

You sure we want to print this out? Sounds rather useless, because update_and_print_stats() will also print an information line.

The intention is to differentiate usual display of stats (as a result of sgx.enable_stats=true) from 'dump and reset' of stats (with kill -SIGUSR1). Also, it will acknowledge the receival of kill -SIGUSR1 signal


pal/src/host/linux-sgx/host_thread.c line 107 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

I still don't see a need in this arch_prctl() syscall. Why did you add it here?

I see that without adding arch_prctl(), the stats are getting reset successfully.
Removing arch_prctl() in the upcoming patch.

Signed-off-by: TejaswineeL <tejaswinee.ramdas.langhe@intel.com>
@TejaswineeL TejaswineeL requested a review from dimakuv May 6, 2024 11:38
Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r5, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @TejaswineeL)

a discussion (no related file):
Please update the Performance tuning and analysis page in the documentation accordingly.

You can take inspiration from this patch: https://github.com/gramineproject/gramine/pull/1751/files



pal/src/host/linux-sgx/host_exception.c line 240 at r1 (raw file):

Previously, TejaswineeL (Tejaswinee) wrote…

Main thread does reset its local stats. Unconditional code snippet shown below does that.
https://github.com/gramineproject/gramine/pull/1857/files#diff-36ddf4f0901a94224ef8f08faa82ad332c9afc3ed952d5950e5597f110464ec7R235

Here's the logs of the main thread stats dump and reset
image.png

Ok, you're right, pal_host_tcb_reset_stats() is called unconditionally


pal/src/host/linux-sgx/host_exception.c line 229 at r4 (raw file):

Previously, TejaswineeL (Tejaswinee) wrote…

The intention is to differentiate usual display of stats (as a result of 'sgx.enable_stats=true') from 'dump and reset' of stats (with 'kill -SIGUSR1'). Also, it will acknowledge the receival of 'kill -SIGUSR1' signal

Ok. Don't we want the same message on the main thread?


pal/src/host/linux-sgx/host_exception.c line 248 at r4 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

There's no need in return in a void function. Remove this line.

You forgot about this comment.


pal/src/host/linux-sgx/host_exception.c line 192 at r5 (raw file):

static size_t send_sigusr1_signal_to_children(pid_t main_tid) {
    size_t no_of_signal_sent = 0;

no_of_signal_sent -> no_of_signals_sent


pal/src/host/linux-sgx/host_exception.c line 213 at r5 (raw file):

        size_t no_of_children = send_sigusr1_signal_to_children(g_host_pid);

        while ((__atomic_load_n(&no_of_children_visited, __ATOMIC_ACQUIRE)) < (no_of_children)) {

There's no reason to have no_of_children in parentheses, please remove them


pal/src/host/linux-sgx/host_exception.c line 219 at r5 (raw file):

        __atomic_store_n(&no_of_children_visited, 0, __ATOMIC_RELEASE);
    } else {
        log_always("----- DUMPTING and RESETTING SGX STATS -----");

DUMPING (you have a typo)


pal/src/host/linux-sgx/host_thread.c line 70 at r5 (raw file):

                   pid, __atomic_load_n(&g_eenter_cnt, __ATOMIC_ACQUIRE), __atomic_load_n(&g_eexit_cnt, __ATOMIC_ACQUIRE),
                   __atomic_load_n(&g_aex_cnt, __ATOMIC_ACQUIRE), __atomic_load_n(&g_sync_signal_cnt, __ATOMIC_ACQUIRE),
                   __atomic_load_n(&g_async_signal_cnt, __ATOMIC_ACQUIRE));

Please now re-wrap these lines to 100-chars-per-line limit.

Signed-off-by: TejaswineeL <tejaswinee.ramdas.langhe@intel.com>
Copy link
Contributor Author

@TejaswineeL TejaswineeL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 4 files reviewed, 7 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv)


pal/src/host/linux-sgx/host_exception.c line 229 at r4 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Ok. Don't we want the same message on the main thread?

Yes we want. Adding it.

@TejaswineeL TejaswineeL requested a review from dimakuv May 10, 2024 11:26
Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r6, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @TejaswineeL)


Documentation/performance.rst line 106 at r6 (raw file):

   trends.

It is also possible to reset the performance statistics interactively, using

to reset the performance statistics -> to dump and reset SGX-related statistics


Documentation/performance.rst line 108 at r6 (raw file):

It is also possible to reset the performance statistics interactively, using
``SIGUSR1`` signal. This helps to collect performance statistics only for a
particular period e.g., skipping the Gramine startup and application

Please add a comma before e.g.


Documentation/performance.rst line 111 at r6 (raw file):

initialization time and concentrating only on the actual application processing.
Send ``SIGUSR1`` using command ``kill -SIGUSR1 <gramine-pid>``  (note the
minus sign before <gramine-pid>). Sending multiple ``SIGUSR1`` will result

Please remove the (note the minus sign ... ) sentence, it is not relevant here.


Documentation/performance.rst line 113 at r6 (raw file):

minus sign before <gramine-pid>). Sending multiple ``SIGUSR1`` will result
in a sequential dump and reset of the statistics, each dump and reset of
statistics will be done after the previous ``SIGUSR1``.

The whole Sending multiple ... sentence seems redundant, as it is clear that "dump and reset" means that the statistics will be reset, and one can send a next SIGUSR1.

Please just remove the whole sentence.


pal/src/host/linux-sgx/host_exception.c line 215 at r6 (raw file):

        while ((__atomic_load_n(&no_of_children_visited, __ATOMIC_ACQUIRE)) < no_of_children) {
            DO_SYSCALL(sched_yield);
        }

Please add an empty line after this line


pal/src/host/linux-sgx/host_exception.c line 221 at r6 (raw file):

    } else {
        log_always("----- DUMPING and RESETTING SGX STATS -----");
        update_and_print_stats(/*process_wide=*/false);

Actually, when you have many child threads, won't their outputs be mixed? Looks like we need some locking here...


pal/src/host/linux-sgx/host_exception.c line 287 at r6 (raw file):

    ret = set_signal_handler(SIGUSR1, handle_sigusr1);
    if (ret < 0)
        goto err;

Actually, can you move this set of signal handler to right-after set_signal_handler(SIGCONT)? There it would make more sense (the order of signals will be more readable).

Copy link
Contributor Author

@TejaswineeL TejaswineeL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 4 files reviewed, 8 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv)


pal/src/host/linux-sgx/host_exception.c line 221 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Actually, when you have many child threads, won't their outputs be mixed? Looks like we need some locking here...

Yes, that's right, their outputs get mixed. I added a change, where "DUMPING and RESETTING SGX STATS" log is displayed only once. The log is displayed once per SIGUSR1 signal, at the start of the dumping and resetting operation.

@TejaswineeL TejaswineeL requested a review from dimakuv May 14, 2024 10:18
Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r7, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @TejaswineeL)


-- commits line 27 at r7:
Please don't perform git merge (and don't allow atomatic git merge). That's for the future.


pal/src/host/linux-sgx/host_exception.c line 221 at r6 (raw file):

Previously, TejaswineeL (Tejaswinee) wrote…

Yes, that's right, their outputs get mixed. I added a change, where "DUMPING and RESETTING SGX STATS" log is displayed only once. The log is displayed once per SIGUSR signal, at the start of the dumping and resetting operation.

Ok, this looks reasonable to me.

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @TejaswineeL)

a discussion (no related file):
I followed the steps in the PR description and successfully tested the Redis example. For me, the PR is ready. Thanks for this effort @TejaswineeL !



-- commits line 27 at r7:

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Please don't perform git merge (and don't allow atomatic git merge). That's for the future.

Accidentally put a blocking comment. Unblocking now.

@dimakuv
Copy link
Contributor

dimakuv commented May 14, 2024

Jenkins, test this please

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @TejaswineeL)


pal/src/host/linux-sgx/host_thread.c line 35 at r7 (raw file):

    static atomic_ulong g_aex_cnt          = 0;
    static atomic_ulong g_sync_signal_cnt  = 0;
    static atomic_ulong g_async_signal_cnt = 0;

I didn't notice this before, but all these variables use atomic_ (C99 atomics syntax). We decided to not use it in new code, so please modify all these to be simple uint64_t:

    static uint64_t g_eenter_cnt       = 0;
    ...
    static uint64_t g_async_signal_cnt = 0;

@TejaswineeL TejaswineeL requested a review from dimakuv May 16, 2024 05:40
Signed-off-by: TejaswineeL <tejaswinee.ramdas.langhe@intel.com>
@TejaswineeL TejaswineeL force-pushed the debug_stats_dump_reset branch 3 times, most recently from 531a84e to 003f356 Compare May 16, 2024 06:03
Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r8, all commit messages.
Reviewable status: all files reviewed, all discussions resolved, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners

@dimakuv
Copy link
Contributor

dimakuv commented May 16, 2024

Jenkins, test this please

Copy link
Contributor Author

@TejaswineeL TejaswineeL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dimakuv , Could you please help me understand how to fix the 2 jenkins build test failure.

Reviewable status: all files reviewed, all discussions resolved, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners

Copy link
Contributor Author

@TejaswineeL TejaswineeL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kailun-qin I am not able to see the jenkins failure reports
Could you please help me understand how to fix the 2 jenkins build test failure

Reviewable status: all files reviewed, all discussions resolved, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners

Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two pipelines are not working properly yet, please ignore. They should have been disabled, sorry.

Reviewable status: all files reviewed, all discussions resolved, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners

Copy link
Contributor Author

@TejaswineeL TejaswineeL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkow , No worries. Dimitri has approved the changes. Could you please have a look at the PR and let me know your comments.

Reviewable status: all files reviewed, all discussions resolved, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants