
Kafka producer does not report _ISR_INSUFF as the error code when the ISR count is insufficient; instead reports _MSG_TIMED_OUT #1711

Open
prashantochatterjee opened this issue Feb 19, 2024 · 2 comments

prashantochatterjee commented Feb 19, 2024

Description

The issue occurs when a node is brought down in a 3-node cluster, leaving one of the topic partitions with an ISR count of 1 against a min-ISR setting of 2. The Kafka producer client is configured to report errors via the error_cb callback. The error code is reported as _MSG_TIMED_OUT instead of the more informative _ISR_INSUFF, which led me to believe that a message was being sent to the node that was down.

How to reproduce

  • Bring down a node in a minimal 3-node cluster so that the ISR count for one or more topic partitions falls below the minimum ISR configured for the topic/broker.
  • Create a producer with the acks=all setting.
  • Produce messages to a topic partition affected by the min-ISR insufficiency (a minimal sketch follows this list).
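A minimal reproduction sketch, assuming the cluster state above; the bootstrap servers and topic name are hypothetical, and only the settings relevant to this report (acks=all, error_cb, a delivery callback) are shown:

    # Minimal reproduction sketch (hypothetical bootstrap servers and topic).
    from confluent_kafka import Producer

    def error_cb(err):
        # Transport-level errors (e.g. _TRANSPORT for the downed node) arrive here.
        print(f"error_cb: {err}")

    def delivery_cb(err, msg):
        # With the partition under min-ISR, err is reported as _MSG_TIMED_OUT,
        # not _ISR_INSUFF, once message.timeout.ms expires.
        if err is not None:
            print(f"delivery failed: {err}")
        else:
            print(f"delivered to {msg.topic()} [{msg.partition()}]")

    p = Producer({
        'bootstrap.servers': 'broker-0:9092,broker-1:9092,broker-2:9092',  # hypothetical
        'acks': 'all',
        'error_cb': error_cb,
    })

    # Produce to a topic whose partition has ISR count < min.insync.replicas.
    p.produce('under-min-isr-topic', value=b'payload', on_delivery=delivery_cb)
    p.flush(600)  # wait past message.timeout.ms (default 300000 ms)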

Checklist

Please provide the following information:

  • confluent-kafka-python and librdkafka version (confluent_kafka.version() and confluent_kafka.libversion()):

confluent_kafka.version()
('2.3.0', 33751040)
confluent_kafka.libversion()
('2.3.0', 33751295)

  • Apache Kafka broker version: 3.6.0

  • Client configuration:

        {
            # Error reporting
            'error_cb': self._error_notification,
            'debug': 'broker,topic,msg',
            # Use compression
            'compression.codec': 'lz4',
            # Set acks for reliability
            'acks': 'all',
            'max.in.flight.requests.per.connection': 1,
            # Configuration settings for HA
            'topic.metadata.refresh.interval.ms': 120000
        }
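    Note that message.timeout.ms is not set in this configuration, so librdkafka's default of 300000 ms (5 minutes) applies; that is the window during which the produce request is retried before the delivery report surfaces _MSG_TIMED_OUT.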

  • Operating system: Debian GNU/Linux 11 (bullseye)

  • Provide client logs (with 'debug': '..' as necessary)
    2024-02-18 09:14:18,997 [PID=1:TID=Thread-6482:cfx.rda_messaging.kafka_client:_error_notification:788] ERROR - KafkaError{code=_TRANSPORT,val=-195,str="sasl_ssl://rda-kafka-controller-2.rda-kafka-controller-headless.rda-fabric.svc.cluster.local:9093/2: Connection setup timed out in state CONNECT (after 30025ms in state CONNECT, 1 identical error(s) suppressed)"}
    2024-02-18 09:14:18,998 [PID=1:TID=Thread-6482:cfx.rda_messaging.kafka_client:_delivery_notification:759] ERROR - Message delivery failed: KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}! Message=b'{"type": "SOURCE-EVENT", "sourceEventId": "f2b4d3de-46a6-4c7c-9243-f57c2489e18b", "id": "f2b4d3de-46a6-4c7c-9243-f57c2489e18b", "sourceSystemId": "56795ec7-f19b-429b-913b-b69bdc4a6baf", "sourceSystemName": "Kafka", "projectId": "337ddf08-cc96-11ee-8d32-0a26cc0b6dec", "eventCategory": "alerts", "customerId": "ac0c16424518460c8c0ef0f632515731", "createdat": 1708244619015.706, "sourceReceivedAt": 1708244619015.6702, "parentEventId": null, "payload": "{\"customerId\": \"58c5fe78-ca58-46a8-8db9-a172fec7d2bc\", \"assetId\": \"021456a7-b56a-4e0b-93a0-017f60a5d75f\", \"componentId\": \"f21056f7-cd02-43f6-b44c-c2fe2bcd12d1\", \"alertCategory\": \"Disk Partition\", \"alertType\": \"Disk Partition\", \"sourceSystemId\": \"CFX_PULSE\", \"projectId\": \"814ab092-83c3-4d0c-8805-063936ba3666\", \"environmentId\": \"a74b7e03-5d48-45fc-95fc-8df9ec47f3b2\", \"severity\": \"MAJOR\", \"assetType\": \"infra-account\", \"assetIpAddress\": \"10.199.0.118\", \"assetName\": \"SIM-m0.0-1-250\", \"componentName\": \"1.1\", \"sourceMechanism\": \"SNMP\", \"message\": \"Disk Partition \\\"flash[Disk:flash]\\\" Utilization [80.95%] exceeded configured value 25.0%\", \"raisedAt\": 1708241018415.1145, \"minimumOccurrence\": 0, \"sourceSystemName\": \"Syslog_udp_Notifications\"}", "status": "Completed"}', Key=b'f2b4d3de-46a6-4c7c-9243-f57c2489e18b'
    .................
    2024-02-19 05:26:45,329 [PID=1:TID=Thread-70:cfx.rda_messaging.kafka_client:_error_notification:789] ERROR - KafkaError{code=_TRANSPORT,val=-195,str="sasl_ssl://rda-kafka-broker-2.rda-kafka-broker-headless.rda-fabric.svc.cluster.local:9093/bootstrap: Connection setup timed out in state CONNECT (after 30028ms in state CONNECT)"} {"sourceevent": {"id": "06497124-e2fb-4cd3-a348-2cf14b3d849a", "received": 1708320405328.9392}}
    %7|1708320417.127|PRODUCE|rdkafka#producer-3| [thrd:sasl_ssl://rda-kafka-controller-0.rda-kafka-controller-headless]: sasl_ssl://rda-kafka-controller-0.rda-kafka-controller-headless.rda-fabric.svc.cluster.local:9093/0: ac0c16424518460c8c0ef0f632515731.ingestion-tracker [12]: Produce MessageSet with 1 message(s) (1025 bytes, ApiVersion 7, MsgVersion 2, MsgId 0, BaseSeq -1, PID{Invalid}, lz4)
    %7|1708320417.170|REQERR|rdkafka#producer-3| [thrd:sasl_ssl://rda-kafka-controller-0.rda-kafka-controller-headless]: sasl_ssl://rda-kafka-controller-0.rda-kafka-controller-headless.rda-fabric.svc.cluster.local:9093/0: ProduceRequest failed: Broker: Not enough in-sync replicas: explicit actions Retry,MsgNotPersisted
    %7|1708320417.170|MSGSET|rdkafka#producer-3| [thrd:sasl_ssl://rda-kafka-controller-0.rda-kafka-controller-headless]: sasl_ssl://rda-kafka-controller-0.rda-kafka-controller-headless.rda-fabric.svc.cluster.local:9093/0: ac0c16424518460c8c0ef0f632515731.ingestion-tracker [12]: MessageSet with 1 message(s) (MsgId 0, BaseSeq -1) encountered error: Broker: Not enough in-sync replicas (actions Retry,MsgNotPersisted)

  • Provide broker log excerpts

  • Critical issue

pranavrth (Member) commented:

The _ISR_INSUFF error is retried internally by the producer until message.timeout.ms is reached, after which the _MSG_TIMED_OUT error is thrown. That is why _MSG_TIMED_OUT is reported instead of _ISR_INSUFF. This is by design.
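A sketch of how this behavior surfaces in client code, assuming confluent-kafka-python; the bootstrap server and topic name are hypothetical. The per-attempt broker error ("Not enough in-sync replicas") is only visible in the debug log (as in the REQERR lines above), while the delivery callback only ever sees the final _MSG_TIMED_OUT:

    # Sketch: what the delivery callback sees when min-ISR is not met.
    from confluent_kafka import Producer, KafkaError

    def delivery_cb(err, msg):
        if err is None:
            return
        if err.code() == KafkaError._MSG_TIMED_OUT:
            # All internal retries were exhausted within message.timeout.ms;
            # the per-attempt ISR cause is only visible in the debug log.
            print(f"timed out after retries: {msg.topic()} [{msg.partition()}]")
        elif err.code() == KafkaError._ISR_INSUFF:
            # Per the explanation above, this branch is not reached today:
            # the error is retried internally rather than surfaced here.
            print("insufficient in-sync replicas")

    p = Producer({
        'bootstrap.servers': 'broker-0:9092',  # hypothetical
        'acks': 'all',
        'debug': 'msg',                # shows "REQERR ... Not enough in-sync replicas"
        'message.timeout.ms': 30000,   # shorten the retry window for testing
    })
    p.produce('under-min-isr-topic', b'payload', on_delivery=delivery_cb)
    p.flush(60)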

pranavrth self-assigned this May 22, 2024
prashantochatterjee (Author) commented:

@pranavrth Thanks for the explanation. Is there any way the _ISR_INSUFF code can take precedence over the timeout? As it stands, the real issue escapes attention.
