
DicomServer performance regression #1776

Closed
gofal opened this issue Apr 30, 2024 · 6 comments · Fixed by #1777


gofal commented Apr 30, 2024

Describe the bug
Since commit a6e0d0a, the performance of DicomServer has regressed, which has an impact on high-throughput systems.

We have a DicomServer handling approximately 8000 associations per hour on average (Sectra sends each series in a separate association). After updating to a version newer than August 2023, performance decreases dramatically.

To Reproduce
Start a server in a high-load environment :-)

Environment
Fellow Oak DICOM version: 5.1.2
OS: Windows Server 2019
Platform: .NET 6


gofal commented Apr 30, 2024

I identified two causes.

First, in DicomServer.ListenForConnectionsAsync there is a loop that accepts new associations, creates an SCP instance, notifies that a service has started, and then waits for the next connection.
So the next connection is blocked until all of the above is done. This line of code, for example,

                    // We don't actually care about the values inside the channel, they just serve as a notification that a service has connected
                    await _servicesChannel.Writer.WriteAsync(numberOfServices, _cancellationToken);

asynchronously writes a value into a channel, blocking the loop and therefore preventing new clients from connecting. Additionally, the continuation waited until the thread that started the WriteAsync call was available again. So the first step was to add ".ConfigureAwait(false)" to this call, which already improved things a little.
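
A minimal sketch of the pattern, assuming a simplified accept loop and an unbounded channel (names like _servicesChannel and numberOfServices follow the snippet above; the real loop in DicomServer does more work per connection):

    using System.Threading;
    using System.Threading.Channels;
    using System.Threading.Tasks;

    class AcceptLoopSketch
    {
        private readonly Channel<int> _servicesChannel = Channel.CreateUnbounded<int>();

        public async Task ListenForConnectionsAsync(CancellationToken cancellationToken)
        {
            var numberOfServices = 0;
            while (!cancellationToken.IsCancellationRequested)
            {
                // ... accept a TCP connection and create the SCP instance here ...
                numberOfServices++;

                // Notify the cleanup loop. Without ConfigureAwait(false), the continuation
                // waits for the original context/thread before the loop can accept the next client.
                await _servicesChannel.Writer
                    .WriteAsync(numberOfServices, cancellationToken)
                    .ConfigureAwait(false);
            }
        }
    }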


gofal commented Apr 30, 2024

But then I finally realized that this notification is only important for the cleanup loop and does not need to be awaited at all. Therefore I changed it to fire-and-forget. Everything seemed to work fine, and performance became significantly better, but still not as good as before.
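
A sketch of what fire-and-forget could look like here, assuming the channel is unbounded so TryWrite always succeeds (the exact change is in the linked PR):

    // Instead of awaiting the write, just enqueue the notification and move on.
    // For an unbounded channel, TryWrite cannot fail, so no notification is lost.
    _servicesChannel.Writer.TryWrite(numberOfServices);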


gofal commented Apr 30, 2024

The other thing I noticed was that the client streams were only closed some time later. Sectra only starts its next clients after the previous ones were closed.
In DicomServer.RemoveUnusedServicesAsync there is an endless loop that awaits the current clients as well as a read from the notification channel. If there are new connections - which happens very often in a high-load scenario - the whole loop starts again, including locking the _services collection etc.

                    var tasks = new List<Task>(numberOfDicomServices + 1);
                    var anotherServiceHasStarted = _servicesChannel.Reader.ReadAsync(_cancellationToken).AsTask();
                    tasks.Add(anotherServiceHasStarted);
                    tasks.AddRange(runningDicomServices.Select(s => s.Task));
                    var winner = await Task.WhenAny(tasks).ConfigureAwait(false);
                    if (winner == anotherServiceHasStarted)
                    {
                        try
                        {
                            await anotherServiceHasStarted;
                        }
                        catch(OperationCanceledException)
                        {
                            // If the server is disposed while we were waiting, deal with that gracefully
                            break;
                        }
                        catch (ChannelClosedException)
                        {
                            // If the server is disposed while we were waiting, deal with that gracefully
                            break;
                        }
                        
                        // If another service started, we must restart the Task.WhenAny with the new set of running service tasks
                        Logger.LogDebug("Another DICOM service has started while the cleanup was waiting for one or more DICOM services to complete");
                    }
                    else
                    {
                        Logger.LogDebug("One or more running DICOM services have completed");
                        break;
                    }

So it often takes longer until the servicesToDispose list is iterated and the services are actually disposed.

Additionally, the Channel behaves differently than the previous AsyncManualResetEvent.
With the AsyncManualResetEvent: if, in this endless loop, the task awaiting WaitAsync() is started but the winner of the race is a client that has finished, then in the background there is still that one task awaiting WaitAsync(). If the whole loop then starts again from the beginning after the cleanup is done, WaitAsync() returns immediately and the newly added service can be cleaned up.

With the Channel it is different. In this endless loop the client tasks as well as a task that reads from the channel are started. If a client task wins this race and a cleanup is started, but meanwhile a new client connects, then the already-started channel reader takes this entry from the channel. So after the cleanup there is a newly connected client, but the channel is empty because of that background read task.
When execution then returns to the top of the method, it awaits the next entry in the channel - and this is awaited even though a client is connected.
Reading from the channel consumes the notification, so it works only once and is then effectively reset, while the AsyncManualResetEvent stays signaled and can be observed several times.
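
A standalone sketch of that difference (illustrative only, not fo-dicom code), showing how an already-started channel read consumes a notification that an AsyncManualResetEvent would have kept signaled:

    using System.Threading;
    using System.Threading.Channels;

    var channel = Channel.CreateUnbounded<int>();

    // The cleanup loop starts a reader task as one of the contestants in Task.WhenAny.
    var pendingRead = channel.Reader.ReadAsync(CancellationToken.None).AsTask();

    // A finished client wins the race and the cleanup runs. Meanwhile a new client
    // connects and writes its notification into the channel...
    channel.Writer.TryWrite(1);

    // ...but the still-pending reader task consumes that notification in the background.
    await pendingRead;

    // Back at the top of the loop, the channel is empty again: a new ReadAsync would
    // now wait, even though a newly connected service is there to be observed.
    // An AsyncManualResetEvent would still be set at this point and return immediately.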


gofal commented Apr 30, 2024

While removing the clients from the _services collection was not that important, it was far more important that the network streams were closed as soon as the clients were done. I solved this by calling Dispose directly after the Task completes in RunningDicomService. Dispose is also called in RemoveUnusedServicesAsync, but DicomService handles being disposed several times.
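
A minimal sketch of that idea, assuming a RunningDicomService-like wrapper that owns a disposable service and its task (the actual change is in the linked PR):

    using System;
    using System.Threading.Tasks;

    static class RunningServiceSketch
    {
        // Dispose the service as soon as its task completes, so the network stream
        // is closed immediately instead of waiting for the cleanup loop.
        public static async Task RunAndDisposeAsync(IDisposable service, Task serviceTask)
        {
            try
            {
                await serviceTask.ConfigureAwait(false);
            }
            finally
            {
                // A later Dispose from RemoveUnusedServicesAsync is harmless,
                // because DicomService tolerates being disposed more than once.
                service.Dispose();
            }
        }
    }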


gofal commented Apr 30, 2024

After that change the performance regression was gone. Please review the PR.


amoerie commented Apr 30, 2024

Thanks for the elaborate investigation and explanation!

Unfortunately we are still migrating our high-traffic ingestion code base away from our v4 fork, so it is hard for me to detect these issues until then. (I can tell you that we already stress the DicomFile reader and writer A LOT.)
The timeline for that ingestion migration is somewhere next year.

It would be very nice if we could use the benchmarks to detect these performance regressions, but I'm not sure how that would work yet.

Until then, I'll review your work as soon as I get the chance. Your explanations sound very reasonable at first sight.
