ErrNoProgress due to parsing/concurrency issue? #1188
Comments
Spent a day poking around, and found the most promising theory to be the following
cc'ing @rhansen2. Sorry, I know it's been more than 2 years, but I wanted to keep you in the loop since you're the author of #788. Do you by any chance remember any details regarding your statement in that PR? 🙏
Same here. After 24h of debugging I reached the same conclusion as you, @zachxu42. Moreover, there are other cases where a similar thing happens. If an empty batch is received, the library does not clear the buffer correctly, and the next message read gets a no progress error.
Thanks @aratz-lasa for the comment. Can you please elaborate on this part?
Also, I never fully understood the reason behind this
The twist is, as I mentioned above, sometimes you get
I think you both are just saying the same thing.
Describe the bug
We're using the library across hundreds of instances reading from a Kafka cluster, and the rate of these ErrNoProgress errors is alarmingly high, at around 10/s across all the components. I believe they all come from this bit of logic, and I noticed a change was made to handle this particular error more gracefully. Nonetheless, it still closes the connection and opens a new one. Closing and opening tens of connections per second can add significant load to the Kafka cluster and impact performance.
More concretely, I wonder if the c.concurrency() == 1 check is catching some cases introduced by concurrency/parsing issues/bugs in the library as opposed to actual data corruption on the wire. There are many scenarios that could cause this symptom, for example [...] leave before processing the response. (Looks like this could be happening already.)
Something else worth mentioning is that if I add a 50ms sleep between iterations in the read loop, this error goes away completely, which also suggests this might be due to some sort of contention.
In other words, there's probably data corruption, but I don't believe it's introduced by the transport layer (TCP). We should get to the bottom of it and eliminate the root cause instead of simply dropping the connection and starting over.
Kafka Version
To Reproduce
Run many consumers and observe the increase of reader errors or, in the older version of the library, the log:
the kafka reader got an unknown error reading...multiple Read calls return no data or error
Expected Behavior
Very few to zero ErrNoProgress due to corrupt data even when there are many consumers.
Observed Behavior
Tens of reader errors due to ErrNoProgress which lead to frequent reader reconnections.