
ErrNoProgress due to parsing/concurrency issue? #1188

Open
zachxu42 opened this issue Sep 2, 2023 · 5 comments · May be fixed by #1192


zachxu42 commented Sep 2, 2023

Describe the bug

We're using the library across hundreds of instances reading from a Kafka cluster, and the rate of these ErrNoProgress errors is alarmingly high, at around 10/s across all the components. I believe they're all from this bit of logic, and I noticed a change was made to handle this particular error more gracefully. But nonetheless it would still close the connection and open a new one. Closing and opening tens of connections per second can add significant load to the Kafka cluster and impact performance.

More concretely, I wonder if the c.concurrency() == 1 check is catching some cases introduced by concurrency/parsing issues/bugs in the library, as opposed to actual data corruption on the wire. There are many scenarios that could cause this symptom, for example (a sketch of the framing misread follows this list):

  • A certain goroutine releases the read lock prematurely before properly advancing the buffer for the response meant for it. The next goroutine would interpret the first 8 bytes as the size and correlation id, getting gibberish values.
  • A certain goroutine incorrectly parses the bytes in the buffer and either leaves some bytes unprocessed in the buffer or advances into the bytes meant for the next response.
  • A certain goroutine writes the request, but calls leave before processing the response. (Looks like this could be happening already.)
  • Any message initiated by the cluster that's not a response message and does not have a correlation id. (Still reading through the Kafka protocol; not sure if this is possible)
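
To make the first scenario concrete, here is a minimal, self-contained sketch (plain Go, not kafka-go code) of how leftover bytes from one response get decoded by the next reader as a bogus size/correlation ID, assuming the standard Kafka framing of a 4-byte big-endian size followed by a 4-byte correlation ID:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

func main() {
	var buf bytes.Buffer

	// A well-formed frame: size=8 (4-byte correlation ID + 4 bytes of body),
	// correlation ID=42, then the body.
	binary.Write(&buf, binary.BigEndian, int32(8))
	binary.Write(&buf, binary.BigEndian, int32(42))
	buf.Write([]byte{0xDE, 0xAD, 0xBE, 0xEF})

	// Reader 1 consumes the header but releases its "turn" without draining
	// the 4-byte body, leaving stale bytes in the shared buffer.
	var size, corrID int32
	binary.Read(&buf, binary.BigEndian, &size)
	binary.Read(&buf, binary.BigEndian, &corrID)
	fmt.Printf("reader 1: size=%d correlationID=%d (body left unread)\n", size, corrID)

	// Reader 2 now decodes the stale body bytes as a new frame header and
	// gets a nonsense size -- the kind of state that ends in ErrNoProgress
	// or a dropped connection.
	var bogusSize int32
	binary.Read(&buf, binary.BigEndian, &bogusSize)
	fmt.Printf("reader 2: bogus size=%d\n", bogusSize)
}
```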

Something else worth mentioning: if I add a 50ms sleep between iterations in the readloop, this error completely goes away, which further suggests it might be due to some sort of contention.

In other words, there's probably data corruption, but I don't believe it's introduced by the transport layer (TCP). We should get to the bottom of it and eliminate the root cause, instead of simply dropping the connection and starting over.

Kafka Version

  • What version(s) of Kafka are you testing against?
    2.5.1
  • What version of kafka-go are you using?
    v0.4.42

To Reproduce
Run many consumers and observe the increase in reader errors or, in older versions of the library, the log message the kafka reader got an unknown error reading...multiple Read calls return no data or error

Expected Behavior
Very few to zero ErrNoProgress errors due to corrupt data, even when there are many consumers.

Observed Behavior
Tens of reader errors due to ErrNoProgress, which lead to frequent reader reconnections.

@zachxu42 zachxu42 added the bug label Sep 2, 2023
@zachxu42 zachxu42 changed the title ErrNoProgress potentially misfiring? ErrNoProgress due to parsing error? Sep 2, 2023
@zachxu42 zachxu42 changed the title ErrNoProgress due to parsing error? ErrNoProgress due to parsing/buffer reading issue? Sep 2, 2023
@zachxu42 zachxu42 changed the title ErrNoProgress due to parsing/buffer reading issue? ErrNoProgress due to parsing/concurrency issue? Sep 2, 2023

zachxu42 commented Sep 4, 2023

Spent a day poking around, and found the most promising theory to be the following:

  • A goroutine couldn't finish reading all the bytes meant for it; leaving unfinished bytes in the buffer corrupts the data for the next reader goroutine, causing the ErrNoProgress error here.
  • The first goroutine didn't finish consuming all the bytes in the read buffer because it hit r.empty=true here
  • r.empty was set to true because highWaterMark == offset here
  • When highWaterMark == offset, I tried logging size and remaining, and saw that there are clearly bytes in the buffer (e.g. highWaterMark == offset with size 366 and remaining 279). So this goroutine, which is currently holding the read lock, should not just give up.
  • I see a previous change was made and reverted around how we handle the case where highWaterMark == offset. I tried that change, and all the ErrNoProgress errors went away. Over a couple of hours I saw one burst of UnexpectedEOF errors, which I think is what caused the revert of that change.
  • Upon further investigation, these UnexpectedEOF errors happen when highWaterMark == offset and remain == 0. Setting r.empty=true only when both conditions are true feels like the right logic (see the sketch below).
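
A minimal sketch of that proposed condition, using illustrative names rather than kafka-go's actual internal fields:

```go
package main

import "fmt"

// batchIsEmpty sketches the condition proposed above (illustrative names,
// not kafka-go's real fields): only treat the fetch batch as empty when the
// high water mark equals the fetch offset AND no bytes remain unread for
// this response.
func batchIsEmpty(highWaterMark, offset int64, remain int) bool {
	return highWaterMark == offset && remain == 0
}

func main() {
	// highWaterMark == offset but 279 bytes still in the buffer: the old
	// logic gave up here and left the buffer misaligned; the proposed
	// logic keeps reading.
	fmt.Println(batchIsEmpty(1000, 1000, 279)) // false -> keep reading
	fmt.Println(batchIsEmpty(1000, 1000, 0))   // true  -> genuinely empty
}
```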


zachxu42 commented Sep 4, 2023

cc'ing @rhansen2. Sorry, I know it's been more than 2 years, but I wanted to keep you in the loop since you're the author of #788. Do you by any chance remember any details regarding your statement in that PR? 🙏

Seems like in newer versions of Kafka the high water mark being equal to the offset does not mean there are no messages in the batch.

@zachxu42 zachxu42 linked a pull request Sep 8, 2023 that will close this issue
@aratz-lasa

Same here. After 24h of debugging I ended up with the same conclusion as you, @zachxu42: highWaterMark == offset is not correct. HighWatermark is the last offset that is available, not the last+1.

Moreover, there are other cases where a similar thing happens. If an empty batch is received, the library does not clear the buffer correctly, and the next read gets a no-progress error.
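
A minimal sketch of the kind of drain step that implies (hypothetical helper, not kafka-go's API): whatever remains of an empty batch's response should be discarded before the read lock is released, so the next reader starts on a frame boundary.

```go
package main

import (
	"bufio"
	"io"
	"strings"
)

// discardRemainder is a hypothetical helper (not part of kafka-go): after an
// empty batch, the unread tail of the fetch response still has to be drained,
// otherwise the next reader starts decoding mid-response and eventually
// reports ErrNoProgress.
func discardRemainder(r *bufio.Reader, remain int64) error {
	if remain <= 0 {
		return nil
	}
	// io.CopyN reads exactly `remain` bytes (or fails), realigning the buffer.
	_, err := io.CopyN(io.Discard, r, remain)
	return err
}

func main() {
	r := bufio.NewReader(strings.NewReader("stale response tail..."))
	_ = discardRemainder(r, 8) // drop 8 leftover bytes before the next read
}
```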


zachxu42 commented Sep 22, 2023

Thanks @aratz-lasa for the comment. Can you please elaborate on this part?

HighWatermark is the last offset that is available, not the last+1.

Also, I never fully understood the reason behind this highWaterMark == offset check. My understanding is that highWaterMark is the offset of the last fully replicated message, and offset is the offset of the next message to read. When they are equal, doesn't that mean there's exactly one message to read? Then that should be allowed.

The twist is, as I mentioned above, sometimes you get highWaterMark == offset AND remain == 0. That means there's only one message to read, and that message is empty? In my PR linked above I only skip reading in that case. Actually, why check highWaterMark at all? If we just return an empty batch when remain == 0, wouldn't that be just as correct, or more so?
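
For illustration, the simpler check suggested here would look something like this (illustrative names only, not a claim about kafka-go's internals):

```go
package main

import "fmt"

// batchIsEmptyAlt is the simpler check suggested above: decide emptiness
// purely from the bytes remaining in the response, without consulting
// highWaterMark at all.
func batchIsEmptyAlt(remain int) bool {
	return remain == 0
}

func main() {
	fmt.Println(batchIsEmptyAlt(0))   // true  -> return an empty batch
	fmt.Println(batchIsEmptyAlt(279)) // false -> keep reading the response
}
```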

@pracucci

HighWatermark is the last offset that is available, not the last+1.
Also I never fully understand the reason behind this highWaterMark == offset check. My understanding is highWaterMark is the offset for the last fully replicated message. And offset is the offset for the next message to read. When they are equal, doesn't that mean there's exactly 1 message to read? Then that should be allowed.

I think you both are just saying the same thing.
