
Liveness / performance bug in PooledThreadExecutor #2744

Open
hunjmes opened this issue Nov 9, 2023 · 3 comments

hunjmes (Contributor) commented Nov 9, 2023

Describe the bug

My team noticed a performance / "liveness" bug when using the PooledThreadExecutor to handle a large number of requests to S3. Our repro is reliable, but unfortunately too complex to include with this report, so a description will have to suffice.

We observed many "producer" threads waiting for the SDK to complete requests they had added to the PooledThreadExecutor's queue, while almost all of the PooledThreadExecutor's "consumer" ThreadTasks sat idle, blocked at the WaitOne() call in ThreadTask::MainTaskRunner().

The simplest description of the bug is that ThreadTask::MainTaskRunner() does not hold a mutex between checking m_executor.HasTasks() and waiting on m_executor.m_sync.WaitOne(). This allows a concurrent "producer" thread to add a task to the queue after MainTaskRunner() has concluded that the queue is empty, but before it waits on the Semaphore.

In such a case, the queue has a task to be performed, but the ThreadTask does not realize this.
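
For context, the failing loop has roughly this shape (a paraphrase, not an exact copy of the SDK source; see aws-cpp-sdk-core's ThreadTask.cpp for the real code). The race window is marked in the comments:

void ThreadTask::MainTaskRunner()
{
    while (m_continue)
    {
        // HasTasks() and PopTask() each take m_queueLock internally,
        // but the lock is released again before the wait below.
        while (m_continue && m_executor.HasTasks())
        {
            auto fn = m_executor.PopTask();
            if (fn)
            {
                (*fn)();
                Aws::Delete(fn);
            }
        }

        // <-- race window: HasTasks() just returned false, but a
        // producer can enqueue a task right here, before the wait
        // below begins; the ThreadTask then idles despite pending work.
        if (m_continue)
        {
            m_executor.m_sync.WaitOne();
        }
    }
}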

Expected Behavior

When pushing a task to the PooledThreadExecutor's queue, if a ThreadTask is available, then it should execute that task.

Current Behavior

PooledThreadExecutor's queue is non-empty, but ThreadTasks remain idle.

In my team's repro, overall throughput drops dramatically, with hundreds of threads all ultimately waiting on the PooledThreadExecutor's Semaphore. (Eventually, in our repro, the "producer" threads fill their queues, so they end up blocked on the ThreadTasks, which are not executing the work on the PooledThreadExecutor's queue.)

Reproduction Steps

Our repro runs at a scale that is not feasible to include here.

Possible Solution

Function ThreadTask::MainTaskRunner() needs to hold the mutex until it waits on the Semaphore. Unfortunately, the SDK's Semaphore class doesn't take a mutex parameter, so the function would probably have to use std::condition_variable, instead. (Note that this code already uses std::lock_guard and std::mutex, so also using std::unique_lock and std::condition_variable should not be a problem.)

A possible fix might look something like this (with corresponding changes made elsewhere):

void ThreadTask::MainTaskRunner()
{
    while (m_continue)
    {
        std::unique_lock<std::mutex> lock(m_executor.m_queueLock);
        // HasTasks() and PopTask() must no longer acquire m_queueLock
        // internally, since this thread now holds it.
        while (m_continue && m_executor.HasTasks())
        {
            auto fn = m_executor.PopTask();
            lock.unlock();   // run the task without holding the queue lock
            if (fn)
            {
                (*fn)();
                Aws::Delete(fn);
            }
            lock.lock();
        }

        if (m_continue)
        {
            // One plausible predicate: sleep until shutdown is
            // requested or new work arrives. The predicate re-checks
            // the queue under m_queueLock, so no wakeup can be lost.
            m_executor.m_cv.wait(lock,
                [this] { return !m_continue || m_executor.HasTasks(); });
        }
    }
}
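
For completeness, the "corresponding changes made elsewhere" would presumably mean pushing under the same lock and notifying the condition variable. A hypothetical sketch, assuming a new m_cv member replaces m_sync (the member names, allocation tag, and omitted overflow-policy handling are illustrative, not the SDK's actual code):

bool PooledThreadExecutor::SubmitToThread(std::function<void()>&& fn)
{
    auto* fnCpy = Aws::New<std::function<void()>>(
        "PooledThreadExecutor", std::forward<std::function<void()>>(fn));

    {
        std::lock_guard<std::mutex> locker(m_queueLock);
        m_tasks.push(fnCpy);
    }

    // Because the push above happened under m_queueLock, a ThreadTask
    // that holds the lock from its HasTasks() check through its wait
    // can no longer miss this task: at worst it sees the task before
    // waiting, at best this notify wakes it.
    m_cv.notify_one();
    return true;
}

Notifying outside the lock is deliberate: the waiter re-checks HasTasks() under m_queueLock via the wait predicate, so the wakeup cannot be lost, and the woken thread doesn't immediately block on a still-held mutex.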

Additional Information/Context

No response

AWS CPP SDK version used

1.11.x

Compiler and Version used

gcc 8

Operating System and version

Linux 5.4.258

hunjmes added the "bug" and "needs-triage" labels Nov 9, 2023
hunjmes (Contributor, Author) commented Nov 10, 2023

jmklix (Member) commented Nov 14, 2023

Thanks for opening this issue and creating a PR! Reviewing the PR

jmklix added the "pending-release" and "p2" labels and removed the "needs-triage" label Nov 14, 2023
jmklix assigned jmklix and unassigned jmklix Nov 14, 2023
prm-james-hill commented
I'm also seeing this on tag 1.11.68

@jmklix is there an update on this? The PR still looks unreviewed
