Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve RD_KAFKA_RESP_ERR__TIMED_OUT to be able to differentiate cases #343

Open
moukoublen opened this issue Feb 5, 2020 · 11 comments
Open

Comments

@moukoublen
Copy link

rkmessage_tmp.err = RD_KAFKA_RESP_ERR__TIMED_OUT;

Hello,
This is a question mostly rather than an issue 馃槉

I'd like to ask why the default behavior of librdkafka (return null if no message appears during the specified timeout) has been replaced by returning the RD_KAFKA_RESP_ERR__TIMED_OUT error code.

To my understanding having a null message (no message) after the specified timeout is not an error case. It is something expected isn't it?

Especially in cases where a small timeout is being used the application could have many RD_KAFKA_RESP_ERR__TIMED_OUT error code every minute.

Could this "hide" a timeout error that could arrive from rd_kafka_consumer_poll pointing out actual timeout during broker communication?

Thanks 鈽猴笍

@nick-zh
Copy link
Collaborator

nick-zh commented Feb 5, 2020

hey @moukoublen

Sry i adjusted my answer i misunderstood. I am actually not sure, maybe it had something to do with differentiating eof and timeout. I would need to dig into this, but not sure if that is worth the while :D
And yeah timeout is definitely not an error, naming is maybe not the best, same goes for RD_KAFKA_RESP_ERR__NO_ERROR

Cheers

@moukoublen
Copy link
Author

moukoublen commented Feb 5, 2020

Hi

I checked librdkafka examples
https://github.com/edenhill/librdkafka/blob/36283d1f8870be0824d97940abcf04d6d202a0ec/examples/consumer.c#L210

                rkm = rd_kafka_consumer_poll(rk, 100);
                if (!rkm)
                        continue; /* Timeout: no message within 100ms,
                                   *  try again. This short timeout allows
                                   *  che

and it seems to me that when the timeout passes without message rd_kafka_consumer_poll just returns a null message (I also tested locally).

That's why I am confused. 馃槙

EDIT:
So with the current implementation of php rdkafka during a "quiet" period with no messages the application could have many RD_KAFKA_RESP_ERR__TIMED_OUT.

And how could we separate the timeout from php rdkafka which means no message arrived with a possible timeout coming actually from rd_kafka_consumer_poll pointing out an actual timeout with broker communication.

@nick-zh
Copy link
Collaborator

nick-zh commented Feb 5, 2020

@moukoublen the differentiation right now is not possible, both will end in a timeout. So if the brokers recovers in a certain amount of time, you might not get an error at all, just timeouts.

@moukoublen
Copy link
Author

@nick-zh thanks for the answers so far 馃槃

So right now in php-rdkafka there is a differentiation from the default behavior of librdkafka

In (librdkafka) rd_kafka_consumer_poll when the timeout passes without any message arriving a null message pointer is being returned.

But in php-rdkafka in case of null message pointer the RD_KAFKA_RESP_ERR__TIMED_OUT is being returned specifically.

    /* ... php-rdkafka - kafka_consumer.c#L398 ... */
    rkmessage = rd_kafka_consumer_poll(intern->rk, timeout_ms);

    if (!rkmessage) {
        rkmessage_tmp.err = RD_KAFKA_RESP_ERR__TIMED_OUT;
        rkmessage = &rkmessage_tmp;
    }

By doing this -in php application level- we cannot separate an actual RD_KAFKA_RESP_ERR__TIMED_OUT coming from rd_kafka_consumer_poll (which could possible occur because of a timeout in the communication with the broker) from the normal scenario where no message has arrived so far.

Wouldn't it be better if php-rdkafka returns another error code in this case? Or perhaps no error code at all - I don't know.

What do you think?

@nick-zh
Copy link
Collaborator

nick-zh commented Feb 5, 2020

@moukoublen thanks for taking the time as well, i think this is a good discussion.
I totally agree, that it would be preferable, to being able to properly differentiate both cases:

  • broker timeout
  • poll timeout

I am not really sure how we can achieve this, from what i see, when there is nothing to consume, librdkafka internally also sets timed out as last error:

https://github.com/edenhill/librdkafka/blob/133bfb8336fba80a881fb5d2f7cc8e7268c9fec4/src/rdkafka.c#L2847

and the comment of the const is also rather general:

	/** Operation timed out */
	RD_KAFKA_RESP_ERR__TIMED_OUT = -185,

I am not that deep into the lib itself and i am unsure if both cases can be differentiatet properly, do you know that?

Overall, when i see it from an application point of view, if i have a short timeout because of network, i probably
would still treat it the same as if i had received no message and just continue polling, what do you think?

@Steveb-p
Copy link
Contributor

Steveb-p commented Feb 5, 2020

@nick-zh yes, librdkafka does return the time out (in source code), but sample that @moukoublen linked (sample) is an implementation example that we - maybe - should follow.

Since this causes difficulty in determining whether we have a broken connection or just no messages it would be in our interest to find a way to decide for ourselves 馃槃

@nick-zh
Copy link
Collaborator

nick-zh commented Feb 5, 2020

I haven't found the code in librdkafka, where a broker connection issue results in RD_KAFKA_RESP_ERR__TIMED_OUT and will be returned as such.
If librdkafka does not differentiate those cases, i question if we should / need to.
If we want to do it, we probably need a new const though, since i was unable to find one in librdkafka.
But as i said, since you most likely want to continue polling, not sure if it is worth the while.
If the broker is truly down, after some time, this will result in an error, which we can handle.

@moukoublen
Copy link
Author

moukoublen commented Feb 5, 2020

@nick-zh @Steveb-p Correct me if I'm wrong here but to my understanding in librdkafka, when the poll operation finishes without message it sets the timeout error with rd_kafka_set_last_error but it returns NULL.
https://github.com/edenhill/librdkafka/blob/133bfb8336fba80a881fb5d2f7cc8e7268c9fec4/src/rdkafka.c#L2847-L2852

And because of that rd_kafka_consumer_poll returns NULL also.
https://github.com/edenhill/librdkafka/blob/133bfb8336fba80a881fb5d2f7cc8e7268c9fec4/src/rdkafka.c#L2934

So in normal scenario where "no message arrived during the timeout period" there is no message at all an no error code (apart from the global variable rd_kafka_last_error_code).

I understand what @nick-zh says. But my only concern is that:

If there are not other scenarios where RD_KAFKA_RESP_ERR__TIMED_OUT is being returned inside the message from librdkafka then we can safely ignore this error code inside the message (that is introduced in php-rdkafka layer) in php application level (since it is not an error per se).

But if there are other scenarios where RD_KAFKA_RESP_ERR__TIMED_OUT is being returned inside the message from librdkafka then it would be difficult to decide what to do -in php application layer- when a message with this error code is being received.

Thanks for your answers so far 馃槂

@nick-zh
Copy link
Collaborator

nick-zh commented Feb 5, 2020

@moukoublen yeah i totally agree, the next thing with null is (i might be mistaken), the current behaviour returns NULL for eof as well (unless i set enable.partition.eof). I am sadly only partially familiar with librdkafka internals, so i am not sure if timeout is the only case where RD_KAFKA_RESP_ERR__TIMED_OUT is set.
I also understand, that it would be nice ofc, to differentiate:

  • poll timeout
  • broker timeout
  • eof

I personally can live with the current state (since in the cases i have, i don't need to differentiate the timeouts, i will just keep polling in both cases), but if somebody wants to provide a PR to improve this, please do not hesitate!
I really appreciate the input and the discussion.

Steveb-p added a commit to Steveb-p/php-rdkafka that referenced this issue Feb 16, 2020
This fixes arnaud-lb#343 and fixes arnaud-lb#342. Allows for long-running scripts to continue
consuming messages from broker, while not making user think that there is a
connectivity error. In such case librdkafka will populate `err` property anyway.
Steveb-p added a commit to Steveb-p/php-rdkafka that referenced this issue Feb 17, 2020
This fixes arnaud-lb#343 and fixes arnaud-lb#342. Allows for long-running scripts to continue
consuming messages from broker, while not making user think that there is a
connectivity error. In such case librdkafka will populate `err` property anyway.
Steveb-p added a commit to Steveb-p/php-rdkafka that referenced this issue Feb 17, 2020
This fixes arnaud-lb#343 and fixes arnaud-lb#342. Allows for long-running scripts to continue
consuming messages from broker, while not making user think that there is a
connectivity error. In such case librdkafka will populate `err` property anyway.
@Steveb-p
Copy link
Contributor

I think we can use extern rd_kafka_last_error_code to tell us what was the result of the polling operation?

https://github.com/edenhill/librdkafka/blob/45a1e38f53953ad2a93566ea020289c8d01d407d/src/rdkafka_int.h#L826-L829

extern rd_kafka_resp_err_t RD_TLS rd_kafka_last_error_code;

static RD_UNUSED RD_INLINE
rd_kafka_resp_err_t rd_kafka_set_last_error (rd_kafka_resp_err_t err,
					     int errnox) {
        if (errnox) {
                /* MSVC:
                 * This is the correct way to set errno on Windows,
                 * but it is still pointless due to different errnos in
                 * in different runtimes:
                 * https://social.msdn.microsoft.com/Forums/vstudio/en-US/b4500c0d-1b69-40c7-9ef5-08da1025b5bf/setting-errno-from-within-a-dll?forum=vclanguage/
                 * errno is thus highly deprecated, and buggy, on Windows
                 * when using librdkafka as a dynamically loaded DLL. */
                rd_set_errno(errnox);
        }
	rd_kafka_last_error_code = err;
	return err;
}

Since poll operation set error code to 0,0 if successful, then this way we could reliably tell between actual poll errors and not set this error flag if there is simply no message?

Steveb-p added a commit to Steveb-p/php-rdkafka that referenced this issue Feb 20, 2020
This fixes arnaud-lb#343 and fixes arnaud-lb#342. Allows for long-running scripts to continue
consuming messages from broker, while not making user think that there is a
connectivity error. In such case librdkafka will populate `err` property anyway.
@nick-zh
Copy link
Collaborator

nick-zh commented Jun 2, 2020

Sorry for the late reply, so rd_kafka_last_error is actually only used for the low level api (low level consumer / producer). I will try to find if we can improve this, but imho is low prio right now.

@nick-zh nick-zh changed the title RD_KAFKA_RESP_ERR__TIMED_OUT on no message after poll Improve RD_KAFKA_RESP_ERR__TIMED_OUT to be able to differentiate cases Jun 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants