Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AMQPTimeoutException occurs during socket interruptions #1094

Open
mrtus opened this issue Jun 5, 2023 · 14 comments
Open

AMQPTimeoutException occurs during socket interruptions #1094

mrtus opened this issue Jun 5, 2023 · 14 comments
Assignees
Milestone

Comments

@mrtus
Copy link

mrtus commented Jun 5, 2023

Version(s) affected: v3.5.2

Description
When a SIGNAL occurs, the exception AMQPTimeoutException is thrown at AMQPIOReader::wait::102
Since the channel technically did not timeout, but was interrupted... It is hard to differentiate between stream interruptions and those timeouts, which could be a symptom of something completely different.

We rely on scaling our workers based on CPU usage & available messages on the queue. Whenever consumers are being scaled down a SIGNAL is sent, we try to handle those but of course when a connection is open and interupted... the situation becomes a little bit harder.

How to reproduce
Have a channel waiting for a message
Send a SIGNAL to interrupt the stream
Exception is thrown

Possible Solution
Instead of throwing a "timeout" exception, I would like to propose to instead throw a AMQPConnectionInterruptedException or something a-like.

We can most likely throwing such an exception based on calculating if the timeout is really a timeout (the time a wait took is longer than the given timeout param). Or am I missing some insights here?

Let me hear your thoughts on this issue.
Or if there would be a better solution to handle a SIGNAL?

Thank you

@ramunasd
Copy link
Member

I guess there might be another exception for this case, something like AMQPWaitInterruptedException. But we can introduce it only in next major version, because it can be considered as breaking change.

For now You can use signal handler to save some kind of flag about interruption. When you have it, then AMQPTimeoutException can be handled differently.

@mrtus
Copy link
Author

mrtus commented Jun 20, 2023

Hi @ramunasd thank you for your reply

I guess there might be another exception for this case, something like AMQPWaitInterruptedException. But we can introduce it only in next major version, because it can be considered as breaking change.

That would be awesome, yes!
An alternative solution could be to extend AMQPWaitInterruptedException with the existing AMQPTimeoutException, but maybe that is cheating versioning and entirely not correct 😛.

For now You can use signal handler to save some kind of flag about interruption. When you have it, then AMQPTimeoutException can be handled differently.

That is exactly how I resolved the use-case for now, and it is working nicely too.

@matheusjohannaraujo
Copy link

Is this the error you are having???

I'm having this error when I go to read the queue, basically every time the anonymous sub callback takes more than 10 seconds to execute, this disconnect happens on localhost.

[2023-06-24 21:21:17] Error: Connection reset by peer in /opt/lampp/htdocs/vendor/php-amqplib/php-amqplib/PhpAmqpLib/Wire/IO/StreamIO.php line 266
Trace: #0 /opt/lampp/htdocs/vendor/php-amqplib/php-amqplib/PhpAmqpLib/Connection/AbstractConnection.php(410): PhpAmqpLib\Wire\IO\StreamIO->write('\x01\x00\x01\x00\x00\x00\r\x00<\x00P\x00\x00\x00\x00...')
#1 /opt/lampp/htdocs/vendor/php-amqplib/php-amqplib/PhpAmqpLib/Connection/AbstractConnection.php(540): PhpAmqpLib\Connection\AbstractConnection->write('\x01\x00\x01\x00\x00\x00\r\x00<\x00P\x00\x00\x00\x00...')
#2 /opt/lampp/htdocs/vendor/php-amqplib/php-amqplib/PhpAmqpLib/Channel/AbstractChannel.php(245): PhpAmqpLib\Connection\AbstractConnection->send_channel_method_frame(1, Array, Object(PhpAmqpLib\Wire\AMQPWriter))
#3 /opt/lampp/htdocs/vendor/php-amqplib/php-amqplib/PhpAmqpLib/Channel/AMQPChannel.php(789): PhpAmqpLib\Channel\AbstractChannel->send_method_frame(Array, Object(PhpAmqpLib\Wire\AMQPWriter))
#4 /opt/lampp/htdocs/lib/SimpleRabbitMQ.php(160): PhpAmqpLib\Channel\AMQPChannel->basic_ack(4)
#5 [internal function]: Lib\SimpleRabbitMQ->Lib\{closure}(Object(PhpAmqpLib\Message\AMQPMessage))
#6 /opt/lampp/htdocs/vendor/php-amqplib/php-amqplib/PhpAmqpLib/Channel/AMQPChannel.php(1048): call_user_func(Object(Closure), Object(PhpAmqpLib\Message\AMQPMessage))
#7 [internal function]: PhpAmqpLib\Channel\AMQPChannel->basic_deliver(Object(PhpAmqpLib\Wire\AMQPBufferReader), Object(PhpAmqpLib\Message\AMQPMessage))
#8 /opt/lampp/htdocs/vendor/php-amqplib/php-amqplib/PhpAmqpLib/Channel/AbstractChannel.php(217): call_user_func(Array, Object(PhpAmqpLib\Wire\AMQPBufferReader), Object(PhpAmqpLib\Message\AMQPMessage))
#9 /opt/lampp/htdocs/vendor/php-amqplib/php-amqplib/PhpAmqpLib/Channel/AbstractChannel.php(363): PhpAmqpLib\Channel\AbstractChannel->dispatch('60,60', '\x1Famq.ctag-gFEq7...', Object(PhpAmqpLib\Message\AMQPMessage))
#10 /opt/lampp/htdocs/lib/SimpleRabbitMQ.php(169): PhpAmqpLib\Channel\AbstractChannel->wait()

@ascheja
Copy link

ascheja commented Jul 6, 2023

Seems like we ran into this too with our consumers. In my opinion instead of throwing an exception in case of an interrupted select I think this library should just call pcntl_signal_dispatch() (if available) and then resume the select for the remaining timeout.

The workaround I found for now is essentially blocking all signals via pcntl_sigprocmask() and then at certain "safe-points" unblock the signals, call pcntl_signal_dispatch(), block the signals again and then continue with consuming messages.

@matheusjohannaraujo
Copy link

As this problem was bothering me and I needed a quick solution as my queues are quite large, I changed the approach to using enqueue/amqp-lib. Now I don't have that problem anymore.

@vanessasoutoc
Copy link

Hi, i have too problem. Help..

local.ERROR: The connection timed out after 3 sec while awaiting incoming data vendor/php-amqplib/php-amqplib/PhpAmqpLib/Wire/AMQPIOReader.php:102

@matheusjohannaraujo
Copy link

Vê o enqueue/amqp-lib Vanessa. Talvez ajude você assim como me ajudou.

@mrtus
Copy link
Author

mrtus commented Sep 11, 2023

My recommendation is to loop Channel::wait until your condition to loop has expired rather than rely on the fact that nothing can happen while waiting for a response.

while ($conditionToQuit !== false && $channel->is_open()) {
  try {
    $channel->wait(null, false, self::WAIT_TIMEOUT);
  } catch (AMQPTimeoutException) {
  }
}

$channel->close();

Or when you need to be able to handle shutdown SIGNALs gracefully

pcntl_signal(SIGQUIT, fn() => $interrupted = true);

while ($conditionToQuit !== false && $channel->is_open() && $interrupted !== true) {
  try {
    $channel->wait(null, false, self::WAIT_TIMEOUT);
  } catch (AMQPTimeoutException) {
  }
}

$channel->close();

This solution will work for both of you @ascheja @vanessasoutoc, depending on your needs of course.
I hope that helps?!

@matheusjohannaraujo the issue you are describing (based on your exception stack trace) is something completely different and has nothing to do with mentioned timeouts nor signals.

But to be clear, the issue I am describing is about differentiating between the real and correct AMQPTimeoutException and the same exception that is triggered by an OS SIGNAL which is not correctly represented.

@vanessasoutoc
Copy link

Great @mrtus, but solution not work for me.
I followed the tip @matheusjohannaraujo and not use this lib.
Thanks.

@ramunasd
Copy link
Member

@mrtus Can You test latest master?

@mrtus
Copy link
Author

mrtus commented Oct 11, 2023

@mrtus Can You test latest master?

Yes, I will take some time to test this out but from the looks of it the change will get rid of the incorrect timeout exception 👏.

I do have a couple of early thoughts based on the changes that were made:

  • Let's take an example where e.g. a timeout would be set to 3600s, In a K8S stack, two signals will occur, a SIGQUIT and after a grace period (let's say 20s) a SIGTERM and then a force stop of the process occurs without mercy.
    This now means that the ->wait timeout must be at most equal to the grace period to still be able to have a clean shutdown of the script.
    There will be applications that run on different infrastructure stack which are relying on the signal to have the ->wait interrupted (with the exception as a result of course). With that reasoning, applications with configuration from the above example will run for another 3600s at most.

  • It takes the control of signals away from the application and forces a lower timeout to be configured to compensate for the change.

  • Signals are meant to interrupt, so this will have a big impact on numerous applications.

With those insights, and the, in my opinion, changed behaviour of the ->wait method, this is probably a breaking change for many.
On another note, one could very much argue this could simply be labeled as a bug too, but alas as we all know, one persons bug is another persons feature.

I'm curious about your thoughts on these insights?

@ramunasd
Copy link
Member

Thanks for questions. They are indeed interesting. The thing is that this library is low level tool which can be used in many different ways and not all of them are correct ones. This library is not super sophiscated solution which works out of the box in exactly expected way. Wait loop is most important part for consumers and there is numerous ways how to implement it. I suspect You have assumption that everyone will use simple endless loop without signal callbacks. That's for sure not right.

First point is specific edge case when there is no network activity, no heartbeats and super long timeouts. Yes, this way script can be killed while waiting for network packets. Everyone who setups library in this way should understand what they are doing. Long timeouts and long polling intervals are bad practice.

Second point, we do not take control of signals. There is no changes regarding signal handlers. You can catch signal and stop Your loop in exactly same way. Moreover, if You need high responsiveness, then You can use non blocking wait() and end loop right away. BTW, we already have methods which behaves as Your recommendation - AMQPChannel::consume() and AMQPChannel::stopConsume()

Regarding third point - this change will affect only some applications and in specific edge cases. I could think only about situations when there is no network activity and somebody wants to interrupt library ASAP. That would not work anymore, but everything else statys the same.

Lastly, I believe this change will improve stability regardless of changed behaviour. The better solution would required more refactorings and more braeking changes. We can plan it for future, but for now let's try this way.

@mrtus
Copy link
Author

mrtus commented Oct 12, 2023

The thing is that this library is low level tool which can be used in many different ways and not all of them are correct ones. This library is not super sophiscated solution which works out of the box in exactly expected way. Wait loop is most important part for consumers and there is numerous ways how to implement it. I suspect You have assumption that everyone will use simple endless loop without signal callbacks. That's for sure not right.

And I agree with you, however as I mentioned, "one persons bug is another persons feature." this is sadly the truth for many.
I do not assume everyone uses endless loops, however we do have to assume there will be a lot of people.

Second point, we do not take control of signals. There is no changes regarding signal handlers. You can catch signal and stop Your loop in exactly same way. Moreover, if You need high responsiveness, then You can use non blocking wait() and end loop right away. BTW, we already have methods which behaves as Your recommendation - AMQPChannel::consume() and AMQPChannel::stopConsume()

No we do not take control of signals in a literal way, it does however happen implicitly by ignoring the signal interruption.

Nontheless I am perfectly okay with the applied changes, I simply wanted to point out the impact that this change could have and I do believe the same that this change will indeed improve stability.

@ramunasd
Copy link
Member

Fix is released in version 3.6.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants