Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Long delay from publish to consumer event #1252

Open
lukebakken opened this issue Sep 16, 2022 Discussed in #1249 · 32 comments
Open

Long delay from publish to consumer event #1252

lukebakken opened this issue Sep 16, 2022 Discussed in #1249 · 32 comments
Assignees

Comments

@lukebakken
Copy link
Contributor

Reported by @boskjoett

Discussed in #1249

Originally posted by boskjoett September 9, 2022
We are consistently, but with random frequency, experiencing long delays (more than a second) from the time a publisher delivers a message on the bus with BasicPublish() until a consumer (EventingBasicConsumer) is notified by a Received event.
We have tried both with and without auto-ack and using multiple consumers, but without any effect.
Our message bus is not overloaded, we can reproduce it with very few short messages.
We don't believe it is network delays because we can also reproduce the random delays in a docker-compose stack where both the producer, the consumer and RabbitMQ is running in containers in the same stack and the same docker network.

Code to reproduce:

@lukebakken lukebakken added this to the 6.4.1 milestone Sep 16, 2022
@lukebakken lukebakken self-assigned this Sep 16, 2022
@lukebakken
Copy link
Contributor Author

@bording @bollhals @stebet @danielmarbach - pinging you just to see if anything immediately comes to mind. The issue is that the first call to BasicPublish appears to introduce latency that does not happen with subsequent calls. I suppose this additional latency could be on the consumer side for the first time a message is received. I'll get a packet trace to see if I can determine on "which side" the issue lies.

The time taken to establish the connection is not part of this calculation, of course.

@bollhals
Copy link
Contributor

The first time all the code will have to be jitted by the runtime, this will for sure take some time.

To test this you'd have to send one message, disconnect and build up the connection again in the same process before starting the test sends.

@rabbitmq rabbitmq deleted a comment from bollhals Sep 17, 2022
@michaelklishin
Copy link
Member

A workload that opens a connection per message published should not be considered typical for any RabbitMQ client. Should we spend our time on workloads that use reasonably long-lived connections and services?

@boskjoett
Copy link

boskjoett commented Sep 19, 2022

The first time all the code will have to be jitted by the runtime, this will for sure take some time.

The Native Image Generator (Ngen.exe) tool could be used to verify this assumption
https://learn.microsoft.com/en-us/previous-versions/dotnet/netframework-2.0/6t9t5wcf(v=vs.80)?redirectedfrom=MSDN

@lukebakken
Copy link
Contributor Author

Thanks @bollhals and @boskjoett, I'll start investigating there.

@bording
Copy link
Collaborator

bording commented Sep 19, 2022

@lukebakken Since you mention that the python version of this test doesn't show a delay, have you tried mixing the python and C# versions to see which pairs also demonstrate the problem? Should be able to help narrow down where the problem is that way.

@danielmarbach
Copy link
Collaborator

danielmarbach commented Sep 19, 2022

@lukebakken could it be related to the flow control synchronization?

@lukebakken
Copy link
Contributor Author

have you tried mixing the python and C# versions to see which pairs also demonstrate the problem?

No but that's an excellent suggestion. Thank you.

@boskjoett
Copy link

@lukebakken could it be related to the flow control synchronization?

Indeed it could. It is exactly those lines you highlighted that we suspect are causing the delays.

lukebakken added a commit that referenced this issue Sep 19, 2022
If clients wish to do something on `channel.flow`, the handler should be used. This library should not do anything on its own.

Fixes #1252
@lukebakken
Copy link
Contributor Author

I see the additional latency when combining Python / .NET both ways, but the latency is greater when the publisher is .NET and the consumer is Python.

Removing the wait event around flow control didn't have an effect (commit), but it's something I'm going to remove anyway since no other client tries to handle the channel.flow method in this manner.

I'm going to try using ngen next.

@lukebakken
Copy link
Contributor Author

lukebakken commented Sep 19, 2022

Hm, using crossgen doesn't seem to affect the initial message latency. Here's what I did:

  • Enable PublishReadyToRun in both csproj files (https://github.com/lukebakken/rabbitmq-dotnet-client-1060/blob/boskjoett-main/consumer/consumer.csproj#L8, for instance)
  • In the producer and consumer dirs of the above repo & branch, run the following:
    git clean -xffd
    dotnet publish --configuration Release --use-current-runtime --no-self-contained --verbosity detailed
    
  • See producer-publish.txt for the output which shows crossgen was run.
  • Run consumer then producer (producer run in the same manner):
    $ cd consumer/bin/Release/net6.0/arch-x64/
    $ dotnet consumer.dll
    
  • Same initial latency 🤔
    lbakken@shostakovich ~/development/lukebakken/rabbitmq-dotnet-client-1060/consumer (boskjoett-main =)
    $ cd bin/Release/net6.0/arch-x64/
    lbakken@shostakovich ~/development/lukebakken/rabbitmq-dotnet-client-1060/consumer/bin/Release/net6.0/arch-x64 (boskjoett-main =)
    $ dotnet consumer.dll
    CONSUMER: waiting for messages...
    CONSUMER received at 09/19/2022 16:16:43.550872, sent at 09/19/2022 16:16:43.537436 - iteration: 1, delay: 00:00:00.0134368
    CONSUMER received at 09/19/2022 16:16:48.550556, sent at 09/19/2022 16:16:48.548227 - iteration: 2, delay: 00:00:00.0023299
    CONSUMER received at 09/19/2022 16:16:53.551818, sent at 09/19/2022 16:16:53.549104 - iteration: 3, delay: 00:00:00.0027149
    ^CCTRL-C pressed, exiting!
    

@lukebakken
Copy link
Contributor Author

Next I'm going to test what @bollhals suggests in this comment:

To test this you'd have to send one message, disconnect and build up the connection again in the same process before starting the test sends.

@lukebakken
Copy link
Contributor Author

This is the consumer output from this branch where the consumer and producer both do a "first iteration" where a connection/channel are created to send one message, then both closed, then the "real" test begins. I did not use crossgen in this case because I am not running dotnet publish first.

Notice that the first iteration of the real test still shows more latency, but not nearly as much as in the above test or in @boskjoett's initial report.

Seems like the "warm up" of the library has something to do with this. I'm going to port these tests to Java to see if that library and runtime environment shows the same behavior.

lbakken@shostakovich ~/development/lukebakken/rabbitmq-dotnet-client-1060/consumer (boskjoett-main *=)
$ dotnet run
CONSUMER: waiting for messages...
CONSUMER first iteration - waiting for one message
CONSUMER first iteration - waiting for one message
CONSUMER first iteration - waiting for one message
CONSUMER first iteration - waiting for one message
CONSUMER first iteration - waiting for one message
CONSUMER first iteration - waiting for one message
CONSUMER first iteration - waiting for one message
CONSUMER first iteration done (message: 09/19/2022 16:44:54.564770), disconnecting and re-connecting...
CONSUMER: waiting for messages...
CONSUMER received at 09/19/2022 16:45:04.595576, sent at 09/19/2022 16:45:04.592378 - iteration: 1, delay: 00:00:00.0031985
CONSUMER received at 09/19/2022 16:45:07.595046, sent at 09/19/2022 16:45:07.593387 - iteration: 2, delay: 00:00:00.0016592
CONSUMER received at 09/19/2022 16:45:10.595347, sent at 09/19/2022 16:45:10.593740 - iteration: 3, delay: 00:00:00.0016079
CONSUMER received at 09/19/2022 16:45:13.595644, sent at 09/19/2022 16:45:13.594106 - iteration: 4, delay: 00:00:00.0015383
CONSUMER received at 09/19/2022 16:45:16.595780, sent at 09/19/2022 16:45:16.594453 - iteration: 5, delay: 00:00:00.0013271
^CCTRL-C pressed, exiting!

@boskjoett
Copy link

boskjoett commented Sep 20, 2022

I have updated my test programs in this repo
https://github.com/boskjoett/rabbitmq-dotnet-client-1060
so that the producer now sends a burst of 1 to 9 messages to the consumer at a time.

It shows two things:

  1. The delivery time of the first burst of messages is 4 to 5 times longer than the subsequent message bursts.
  2. The delivery time of the subsequent bursts varies a lot. More than I would expect. Is this due to TCP flow control?

Output from the test

CONSUMER: waiting 5 seconds to try initial connection
CONSUMER: waiting for messages...
CONSUMER received message 1 at 09-20-2022 09:23:52.473, delay: 30,5195 ms
CONSUMER received message 2 at 09-20-2022 09:23:52.510, delay: 49,3794 ms
CONSUMER received message 3 at 09-20-2022 09:23:52.510, delay: 49,8232 ms
CONSUMER received message 4 at 09-20-2022 09:23:57.469, delay: 5,5229 ms
CONSUMER received message 5 at 09-20-2022 09:23:57.470, delay: 5,2937 ms
CONSUMER received message 6 at 09-20-2022 09:24:02.497, delay: 12,1088 ms
CONSUMER received message 7 at 09-20-2022 09:24:02.502, delay: 9,989 ms
CONSUMER received message 8 at 09-20-2022 09:24:02.505, delay: 8,349 ms
CONSUMER received message 9 at 09-20-2022 09:24:02.512, delay: 15,5252 ms
CONSUMER received message 10 at 09-20-2022 09:24:02.521, delay: 23,2979 ms
CONSUMER received message 11 at 09-20-2022 09:24:02.521, delay: 22,9266 ms
CONSUMER received message 12 at 09-20-2022 09:24:02.522, delay: 22,1546 ms
CONSUMER received message 13 at 09-20-2022 09:24:02.522, delay: 20,4214 ms
CONSUMER received message 14 at 09-20-2022 09:24:07.517, delay: 3,8164 ms
CONSUMER received message 15 at 09-20-2022 09:24:07.518, delay: 2,6562 ms
CONSUMER received message 16 at 09-20-2022 09:24:07.518, delay: 2,971 ms
CONSUMER received message 17 at 09-20-2022 09:24:07.519, delay: 3,1154 ms
CONSUMER received message 18 at 09-20-2022 09:24:12.535, delay: 13,0199 ms
CONSUMER received message 19 at 09-20-2022 09:24:12.541, delay: 12,4516 ms
CONSUMER received message 20 at 09-20-2022 09:24:12.543, delay: 8,7846 ms
CONSUMER received message 21 at 09-20-2022 09:24:12.545, delay: 4,7128 ms
CONSUMER received message 22 at 09-20-2022 09:24:12.549, delay: 3,7252 ms
CONSUMER received message 23 at 09-20-2022 09:24:12.560, delay: 6,016 ms
CONSUMER received message 24 at 09-20-2022 09:24:12.564, delay: 3,1196 ms
CONSUMER received message 25 at 09-20-2022 09:24:12.567, delay: 3,7524 ms

@michaelklishin
Copy link
Member

How do we define "a lot"? Publisher rate will vary, in particular with classic queues v1, because of both flow control and because they move messages to disk in batches due to certain internal metric.

Use a quorum queue or a CQv2 to reduce this variability to primarily internal flow control.

@boskjoett
Copy link

You are right. It is probably not fair to say "a lot" when it is in the milliseconds range. Our queues are not durable and messages are transient, so I don't expect writes to disk.

@lukebakken
Copy link
Contributor Author

Thanks for the additional testing code @boskjoett. I will update my Python code to do the same to see if the same behavior exists.

@lukebakken
Copy link
Contributor Author

As I suspected, the Python code is not affected by the same sort of delays - lukebakken/rabbitmq-dotnet-client-1060@e847411

At this time I'll try @boskjoett's latest code with the main branch of this repo. There's a chance this issue has been addressed. If not, we can discuss next steps.

@bollhals
Copy link
Contributor

As I suspected, the Python code is not affected by the same sort of delays - lukebakken/rabbitmq-dotnet-client-1060@e847411

What was the output for it?

@boskjoett
Copy link

We have discovered that the long random delays (more than a second) we have experienced in our code using RabbitMQ were caused by lack of threads in the .NET Threadpool. Creating new threads on demand can take several hundred milliseconds. Inspired by this line of code in your test application
https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/main/projects/TestApplications/MassPublish/Program.cs#:~:text=ThreadPool.SetMinThreads(16%20*%20Environment.ProcessorCount%2C%2016%20*%20Environment.ProcessorCount)%3B
we increased the Threadpool size and now the long random delays are gone.
There is still the initial delay, when sending the first message, but we can live with that.

@bollhals
Copy link
Contributor

We have discovered that the long random delays (more than a second) we have experienced in our code using RabbitMQ were caused by lack of threads in the .NET Threadpool. Creating new threads on demand can take several hundred milliseconds. Inspired by this line of code in your test application

https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/main/projects/TestApplications/MassPublish/Program.cs#:~:text=ThreadPool.SetMinThreads(16%20*%20Environment.ProcessorCount%2C%2016%20*%20Environment.ProcessorCount)%3B

we increased the Threadpool size and now the long random delays are gone.

Threads are not blocked for the basicpublish method, nor anywhere in the receiving. So if this fixes the issue for you, you either do more than just basicPublish (e.g. QueueDeclare) or they're blocked by something else in your application.

Do you know what blocks them?

@danielmarbach
Copy link
Collaborator

Generally changing the thread pool limits is something that I have rarely ever seen being useful in production. On the flip side I've seen the problems actually getting worse in certain scenarios because people who override it no longer benefit from the default "smarts" of the ramp up of the thread pool and the optimizations done over time in the thread pool. Of course the thread pool is general purpose and mileage may vary but that being said the defaults are battle tested against hundreds of scenarios. There is a reason why there is a caution warning on the page in the remarks section

https://learn.microsoft.com/en-us/dotnet/api/system.threading.threadpool.setminthreads?view=net-6.0#remarks

Have you tried switching to the async consumer? Have you tried stopping copying the body and see of things are related to GC pauses? Also how is that synthetic benchmark you have in the sample representative of the problem you are trying to solve? What are you trying to achieve that a few MS of delay would actually matter? In your real prod system what are you planning to do there inside the receiving code?

@boskjoett
Copy link

The threadpool issue is not in the RabbitMQ .NET Client, but in our application code using it, where it sends and receives messages. We have code that waits on ManualResetEvents, etc. You are right that the correct solution is not to fiddle with the Threadpool size, but to use async/await properly.

@lukebakken
Copy link
Contributor Author

@bollhals this is the edited output from the first two batches of messages sent via the Python producer/consumer programs:

CONSUMER received at 2022-09-20 14:12:27.305867, sent at 2022-09-20 14:12:27.305110 - iteration 1, delay: 0:00:00.000757
CONSUMER received at 2022-09-20 14:12:27.306376, sent at 2022-09-20 14:12:27.305300 - iteration 2, delay: 0:00:00.001076
CONSUMER received at 2022-09-20 14:12:27.306736, sent at 2022-09-20 14:12:27.305443 - iteration 3, delay: 0:00:00.001293
CONSUMER received at 2022-09-20 14:12:27.306982, sent at 2022-09-20 14:12:27.305633 - iteration 4, delay: 0:00:00.001349
CONSUMER received at 2022-09-20 14:12:27.307101, sent at 2022-09-20 14:12:27.305835 - iteration 5, delay: 0:00:00.001266
CONSUMER received at 2022-09-20 14:12:27.307210, sent at 2022-09-20 14:12:27.306037 - iteration 6, delay: 0:00:00.001173
CONSUMER received at 2022-09-20 14:12:27.307314, sent at 2022-09-20 14:12:27.306245 - iteration 7, delay: 0:00:00.001069
CONSUMER received at 2022-09-20 14:12:27.307416, sent at 2022-09-20 14:12:27.306449 - iteration 8, delay: 0:00:00.000967
CONSUMER received at 2022-09-20 14:12:27.307517, sent at 2022-09-20 14:12:27.306659 - iteration 9, delay: 0:00:00.000858
CONSUMER received at 2022-09-20 14:12:32.314942, sent at 2022-09-20 14:12:32.312136 - iteration 10, delay: 0:00:00.002806
CONSUMER received at 2022-09-20 14:12:32.316449, sent at 2022-09-20 14:12:32.313078 - iteration 11, delay: 0:00:00.003371
CONSUMER received at 2022-09-20 14:12:32.317333, sent at 2022-09-20 14:12:32.313794 - iteration 12, delay: 0:00:00.003539
CONSUMER received at 2022-09-20 14:12:32.318486, sent at 2022-09-20 14:12:32.314793 - iteration 13, delay: 0:00:00.003693
CONSUMER received at 2022-09-20 14:12:32.318616, sent at 2022-09-20 14:12:32.315836 - iteration 14, delay: 0:00:00.002780
CONSUMER received at 2022-09-20 14:12:32.318728, sent at 2022-09-20 14:12:32.316799 - iteration 15, delay: 0:00:00.001929
CONSUMER received at 2022-09-20 14:12:32.318833, sent at 2022-09-20 14:12:32.317764 - iteration 16, delay: 0:00:00.001069

@boskjoett
Copy link

You may close this issue. It is not a real issue, as long as it only occurs during startup.

@lukebakken
Copy link
Contributor Author

This is something to investigate for 7.0

@lukebakken lukebakken modified the milestones: 6.4.1, 7.0.0 Nov 17, 2022
@lukebakken
Copy link
Contributor Author

I'm re-opening this just to be sure that version 7.0 fixes this issue:

https://groups.google.com/g/rabbitmq-users/c/V_foZnLFVjs

@lukebakken lukebakken reopened this Jun 2, 2023
@lukebakken
Copy link
Contributor Author

lukebakken commented Nov 20, 2023

Hi everyone,

I have just tagged version 7.0.0-alpha.1, which includes async methods for the full public API -

https://github.com/rabbitmq/rabbitmq-dotnet-client/releases/tag/v7.0.0-alpha.1

Once this is pushed to NuGet by GitHub Actions, it would be really really great if you could confirm that the async API resolves this issue in your environment (provided that it's relatively easy to update your test code to the new API).

I will take some time today to re-run my test project with 7.0.0-alpha.1 - https://github.com/lukebakken/rabbitmq-dotnet-client-1060/tree/main

Thanks!

@lukebakken
Copy link
Contributor Author

It looks like the initial delay is still there using 7.0.0-alpha.1. The delay is reduced as @bollhals theorized here and as can be demonstrated in this branch.

It looks like this initial latency is due to runtime startup, so I'm going to close this issue.

@lukebakken lukebakken closed this as not planned Won't fix, can't repro, duplicate, stale Nov 20, 2023
@danielmarbach
Copy link
Collaborator

Have you tried playing around with the JIT settings to further investigate if it is indeed the JIT overhead?

https://learn.microsoft.com/en-us/dotnet/core/runtime-config/compilation

You can also control tiered JIT as part of the csproj

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <TieredCompilation>false</TieredCompilation>
  </PropertyGroup>

</Project>

@lukebakken lukebakken reopened this Nov 21, 2023
@lukebakken
Copy link
Contributor Author

I have not! I'll give that a try.

@lukebakken lukebakken removed this from the 7.0.0 milestone May 9, 2024
@lukebakken
Copy link
Contributor Author

Removing from the version 7 milestone since it is not version 7 specific.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants