
ErrNoProgress due to parsing/concurrency issue? #1188

Open
zachxu42 opened this issue Sep 2, 2023 · 5 comments · May be fixed by #1192


zachxu42 commented Sep 2, 2023

Describe the bug

We're using the library across hundreds of instances reading from a Kafka cluster, and the rate of these ErrNoProgress errors is alarmingly high, at around 10/s across all the components. I believe they're all from this bit of logic, and I noticed a change was made to handle this particular error more gracefully. But nonetheless it would still close the connection and open a new one. Closing and opening tens of connections per second can add significant load to the Kafka cluster and impact performance.

More concretely, I wonder if the c.concurrency() == 1 check is catching some cases introduced by concurrency/parsing issues/bugs in the library, as opposed to actual data corruption on the wire. There are many scenarios that could cause this symptom, for example (a sketch of the framing misread follows this list):

  • A certain goroutine releases the read lock prematurely before properly advancing the buffer for the response meant for it. The next goroutine would interpret the first 8 bytes as the size and correlation id, getting gibberish values.
  • A certain goroutine incorrectly parses the bytes in the buffer and either leaves some bytes unprocessed in the buffer or advances into the bytes meant for the next response.
  • A certain goroutine writes the request, but calls leave before processing the response. (Looks like this could be happening already.)
  • Any message initiated by the cluster that's not a response message and does not have a correlation id. (Still reading through the Kafka protocol; not sure if this is possible)
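
To make the first scenario concrete, here is a minimal, self-contained sketch (plain Go, not kafka-go code) of how leftover bytes from one response get decoded by the next reader as a bogus size/correlation ID, assuming the standard Kafka framing of a 4-byte big-endian size followed by a 4-byte correlation ID:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

func main() {
	var buf bytes.Buffer

	// A well-formed frame: size=8 (4-byte correlation ID + 4 bytes of body),
	// correlation ID=42, then the body.
	binary.Write(&buf, binary.BigEndian, int32(8))
	binary.Write(&buf, binary.BigEndian, int32(42))
	buf.Write([]byte{0xDE, 0xAD, 0xBE, 0xEF})

	// Reader 1 consumes the header but releases its "turn" without draining
	// the 4-byte body, leaving stale bytes in the shared buffer.
	var size, corrID int32
	binary.Read(&buf, binary.BigEndian, &size)
	binary.Read(&buf, binary.BigEndian, &corrID)
	fmt.Printf("reader 1: size=%d correlationID=%d (body left unread)\n", size, corrID)

	// Reader 2 now decodes the stale body bytes as a new frame header and
	// gets a nonsense size -- the kind of state that ends in ErrNoProgress
	// or a dropped connection.
	var bogusSize int32
	binary.Read(&buf, binary.BigEndian, &bogusSize)
	fmt.Printf("reader 2: bogus size=%d\n", bogusSize)
}
```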

Something else worth mentioning: if I add a 50ms sleep between iterations in the readloop, this error completely goes away, which further suggests it might be due to some sort of contention.

In other words, there's probably data corruption, but I don't believe it's introduced by the transport layer (TCP). We should get to the bottom of it and eliminate the root cause, instead of simply dropping the connection and starting over.

Kafka Version

  • What version(s) of Kafka are you testing against?
    2.5.1
  • What version of kafka-go are you using?
    v0.4.42

To Reproduce
Run many consumers and observe the increase in reader errors or, in older versions of the library, the log message the kafka reader got an unknown error reading...multiple Read calls return no data or error

Expected Behavior
Very few to zero ErrNoProgress errors due to corrupt data, even when there are many consumers.

Observed Behavior
Tens of reader errors due to ErrNoProgress, which lead to frequent reader reconnections.

@zachxu42 zachxu42 added the bug label Sep 2, 2023
@zachxu42 zachxu42 changed the title ErrNoProgress potentially misfiring? ErrNoProgress due to parsing error? Sep 2, 2023
@zachxu42 zachxu42 changed the title ErrNoProgress due to parsing error? ErrNoProgress due to parsing/buffer reading issue? Sep 2, 2023
@zachxu42 zachxu42 changed the title ErrNoProgress due to parsing/buffer reading issue? ErrNoProgress due to parsing/concurrency issue? Sep 2, 2023

zachxu42 commented Sep 4, 2023

Spent a day poking around, and found the most promising theory to be the following:

  • A goroutine couldn't finish reading all the bytes meant for it; leaving unfinished bytes in the buffer corrupts the data for the next reader goroutine, causing the ErrNoProgress error here.
  • The first goroutine didn't finish consuming all the bytes in the read buffer because it hit r.empty=true here
  • r.empty was set to true because highWaterMark == offset here
  • When highWaterMark == offset, I tried logging size and remaining, and saw that there are clearly bytes in the buffer (e.g. highWaterMark == offset with size 366 and remaining 279). So this goroutine, which is currently holding the read lock, should not just give up.
  • I see a previous change was made and reverted around how we handle the case where highWaterMark == offset. I tried that change, and all the ErrNoProgress errors went away. Over a couple of hours I saw one burst of UnexpectedEOF errors, which I think is what caused the revert of that change.
  • Upon further investigation, these UnexpectedEOF errors happen when highWaterMark == offset and remain == 0. Setting r.empty=true only when both conditions are true feels like the right logic (see the sketch below).
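
A minimal sketch of that proposed condition, using illustrative names rather than kafka-go's actual internal fields:

```go
package main

import "fmt"

// batchIsEmpty sketches the condition proposed above (illustrative names,
// not kafka-go's real fields): only treat the fetch batch as empty when the
// high water mark equals the fetch offset AND no bytes remain unread for
// this response.
func batchIsEmpty(highWaterMark, offset int64, remain int) bool {
	return highWaterMark == offset && remain == 0
}

func main() {
	// highWaterMark == offset but 279 bytes still in the buffer: the old
	// logic gave up here and left the buffer misaligned; the proposed
	// logic keeps reading.
	fmt.Println(batchIsEmpty(1000, 1000, 279)) // false -> keep reading
	fmt.Println(batchIsEmpty(1000, 1000, 0))   // true  -> genuinely empty
}
```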


zachxu42 commented Sep 4, 2023

cc'ing @rhansen2. Sorry, I know it's been more than 2 years, but I wanted to keep you in the loop since you're the author of #788. Do you by any chance remember any details regarding your statement in that PR? 🙏

Seems like in newer versions of Kafka the high water mark being equal to the offset does not mean there are no messages in the batch.

@zachxu42 zachxu42 linked a pull request Sep 8, 2023 that will close this issue
@aratz-lasa

Same here. After 24h of debugging I ended up with the same conclusion as you, @zachxu42: highWaterMark == offset is not correct. HighWatermark is the last offset that is available, not the last+1.

Moreover, there are other cases where a similar thing happens. If an empty batch is received, the library does not clear the buffer correctly, and the next read gets a no-progress error.
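
A minimal sketch of the kind of drain step that implies (hypothetical helper, not kafka-go's API): whatever remains of an empty batch's response should be discarded before the read lock is released, so the next reader starts on a frame boundary.

```go
package main

import (
	"bufio"
	"io"
	"strings"
)

// discardRemainder is a hypothetical helper (not part of kafka-go): after an
// empty batch, the unread tail of the fetch response still has to be drained,
// otherwise the next reader starts decoding mid-response and eventually
// reports ErrNoProgress.
func discardRemainder(r *bufio.Reader, remain int64) error {
	if remain <= 0 {
		return nil
	}
	// io.CopyN reads exactly `remain` bytes (or fails), realigning the buffer.
	_, err := io.CopyN(io.Discard, r, remain)
	return err
}

func main() {
	r := bufio.NewReader(strings.NewReader("stale response tail..."))
	_ = discardRemainder(r, 8) // drop 8 leftover bytes before the next read
}
```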


zachxu42 commented Sep 22, 2023

Thanks @aratz-lasa for the comment. Can you please elaborate on this part?

HighWatermark is the last offset that is available, not the last+1.

Also, I never fully understood the reason behind this highWaterMark == offset check. My understanding is that highWaterMark is the offset of the last fully replicated message, and offset is the offset of the next message to read. When they are equal, doesn't that mean there's exactly one message to read? Then that should be allowed.

The twist is, as I mentioned above, sometimes you get highWaterMark == offset AND remain == 0. That means there's only one message to read, and that message is empty? In my PR linked above I only skip reading in that case. Actually, why check highWaterMark at all? If we just return an empty batch when remain == 0, wouldn't that be just as correct, or more so?
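
For illustration, the simpler check suggested here would look something like this (illustrative names only, not a claim about kafka-go's internals):

```go
package main

import "fmt"

// batchIsEmptyAlt is the simpler check suggested above: decide emptiness
// purely from the bytes remaining in the response, without consulting
// highWaterMark at all.
func batchIsEmptyAlt(remain int) bool {
	return remain == 0
}

func main() {
	fmt.Println(batchIsEmptyAlt(0))   // true  -> return an empty batch
	fmt.Println(batchIsEmptyAlt(279)) // false -> keep reading the response
}
```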

@pracucci

HighWatermark is the last offset that is available, not the last+1.
Also I never fully understand the reason behind this highWaterMark == offset check. My understanding is highWaterMark is the offset for the last fully replicated message. And offset is the offset for the next message to read. When they are equal, doesn't that mean there's exactly 1 message to read? Then that should be allowed.

I think you both are just saying the same thing.
