feat: DLQ unprocessable messages on ingest-events #66236

Merged
merged 13 commits into master from ingest-events-dlq on Mar 15, 2024

Conversation

lynnagara
Member

@lynnagara lynnagara commented Mar 4, 2024

This PR depends on getsentry/sentry-kafka-schemas#230

It attempts to be somewhat cautious about DLQing and avoids DLQing
anything that is even potentially retriable. This could be tweaked
later.

If a non-retriable exception is raised, the consumer tries to further determine whether the message
is actually bad (or whether it was some transient, retriable error) by validating the message against
the schema. The message is DLQed only if that validation also fails.

An alternative (and imo better) approach would be to validate the
schema of failed messages and only then put them into the DLQ.
However no schema is currently registered for this topic so this
cannot be done easily.
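
Roughly, the decision logic in the consumer looks like the following minimal sketch. This is not the exact diff: the function body is heavily abbreviated, and the surrounding plumbing (metrics, project lookup, process_event) is assumed from the existing ingest code.

import msgpack
import sentry_kafka_schemas
from arroyo.backends.kafka import KafkaPayload
from arroyo.dlq import InvalidMessage
from arroyo.types import BrokerValue, Message


class Retriable(Exception):
    # Placeholder for the PR's Retriable exception type (assumed shape):
    # it marks an error as "crash and retry", never "send to the DLQ".
    pass


def process_simple_event_message(raw_message: Message[KafkaPayload]) -> None:
    raw_payload = raw_message.payload.value
    try:
        message = msgpack.unpackb(raw_payload, use_list=False)
        ...  # look up the project and hand off to process_event (elided)
    except Exception as exc:
        if isinstance(exc, Retriable):
            # Known-transient failures crash the consumer so the message is retried.
            raise
        # Otherwise, check whether the raw payload even conforms to the
        # ingest-events schema. If it does, assume the failure was transient
        # (network, database, deploy) and crash instead of DLQing.
        codec = sentry_kafka_schemas.get_codec("ingest-events")
        try:
            codec.decode(raw_payload, validate=True)
        except Exception:
            raw_value = raw_message.value
            assert isinstance(raw_value, BrokerValue)
            # Arroyo routes InvalidMessage to the configured DLQ topic.
            raise InvalidMessage(raw_value.partition, raw_value.offset)
        raise
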
@lynnagara lynnagara requested review from a team as code owners March 4, 2024 18:44
@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Mar 4, 2024
)
try:
    raw_payload = raw_message.payload.value
    message: IngestMessage = msgpack.unpackb(raw_payload, use_list=False)
Member Author

none of this code changed, it's just indented in try/except block now


return process_event(message, project, reprocess_only_stuck_events)
except Exception as exc:
raise Retriable(exc)
Member Author

just to be safe and avoid dlqing anything potentially retriable, we don't dlq any exception from the process_event function
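
For context, this is roughly what that wrapping looks like, as a hedged sketch: Retriable and process_event are the names used in the PR, but the standalone function and its exact placement are simplifications, not the actual diff.

class Retriable(Exception):
    # Assumed shape of the PR's Retriable exception: it flags an error as
    # "crash and retry the message", never "route to the DLQ".
    def __init__(self, exc: Exception) -> None:
        super().__init__(str(exc))
        self.original = exc


def run_event_processing(message, project, reprocess_only_stuck_events, process_event):
    # process_event is passed in only to keep the sketch self-contained; in the
    # PR it is the existing ingest-consumer function.
    try:
        return process_event(message, project, reprocess_only_stuck_events)
    except Exception as exc:
        # Anything that fails inside process_event is treated as potentially
        # retriable, so it never becomes an InvalidMessage/DLQ entry.
        raise Retriable(exc)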


return process_event(message, project, reprocess_only_stuck_events)
except Exception as exc:
# Non retriable exceptions raise InvalidMessage, which Arroyo will DLQ.
Member

@untitaker untitaker Mar 4, 2024

not sure this makes a ton of sense. why put retriable errors in the DLQ if nobody can retry those DLQ items by consuming that DLQ?

Member Author

@lynnagara lynnagara Mar 4, 2024

It's the non-retriable errors, not the retriable ones, that go in the DLQ, so the consumer does not get stuck on those. The retriable errors are things that might be caused by network issues, temporary unavailability of the database, even a deploy, etc., and should not be DLQed.

Member

that's what I mean. IMO retriable errors should go into the DLQ as well. the DLQ is meant to be replayed.

Member Author

@lynnagara lynnagara Mar 4, 2024

I think in an ideal world we'd do that eventually but I'm not sure if we're there yet, and we haven't tried it anywhere. IMO inspection of events and replaying needs to be built so it can be done in a really fast and easy manner before taking this step.

Today, DLQing everything in the case of temporary blips might cause a longer and more manual recovery period and comes with its own risks that I haven't fully thought through yet.

Let's do it as a follow up later.

Member

Would INC-660 messages have been marked as Retriable or InvalidMessage in this case?

Member Author

In the meantime, I'm going to work on defining a schema for this topic so we can use that to determine what is valid or not.

Member

No, what you have here seems good to start with, just that we should continue thinking about how to turn it into a general purpose failure handling tool.

Member Author

Yes, agree this is not an end state. Just want to get the DLQ in place, and we can tweak what actually gets DLQed later.

Member

During INC-660, we crashed inside process_event, here:

cache_key = event_processing_store.store(data)

See this sentry error.

The kafka message itself was populated, but the event JSON inside was empty.

@lynnagara How often does the consumer currently crash because of temporary network outages? If it's rare enough (say, once per month), I would personally go ahead and DLQ all exceptions.

I do believe this PR is an improvement (any DLQ is better than none), just pointing out that it would not have helped with INC-660.

Member Author

Ok, I realised it's going to be a lot of work to figure out what's retriable and what isn't on a consumer-by-consumer basis. So I rewrote this to check against the schema whenever an exception is raised to decide whether to DLQ it or not. Since there wasn't a schema for ingest-events, I added one here: getsentry/sentry-kafka-schemas#230

@lynnagara lynnagara requested a review from a team March 4, 2024 19:18

codecov bot commented Mar 4, 2024

Codecov Report

Attention: Patch coverage is 90.47619% with 6 lines in your changes missing coverage. Please review.

Project coverage is 82.99%. Comparing base (7006163) to head (6550ee1).
Report is 5 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #66236      +/-   ##
==========================================
- Coverage   84.26%   82.99%   -1.28%     
==========================================
  Files        5307     5307              
  Lines      237299   237319      +20     
  Branches    41053    41053              
==========================================
- Hits       199965   196965    -3000     
- Misses      37116    40135    +3019     
- Partials      218      219       +1     
Files Coverage Δ
src/sentry/consumers/__init__.py 76.74% <ø> (ø)
src/sentry/ingest/consumer/processors.py 88.73% <93.75%> (+0.41%) ⬆️
src/sentry/ingest/consumer/simple_event.py 90.00% <87.09%> (+7.39%) ⬆️

... and 281 files with indirect coverage changes

@lynnagara
Member Author

lynnagara commented Mar 11, 2024

@untitaker @jjbayer Can I pick on you two to review this? Would be nice to have some eyes from both the ingest + unified consumer perspective.

Member

@jjbayer jjbayer left a comment

Looks good to me, but should we add tests for the new code paths? E.g. one with an invalid message and one that mocks a connection error in process_event.
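
A rough pytest sketch of those two cases might look like the following. The message-building helper, the valid_ingest_payload fixture, and the exact signature of process_simple_event_message are assumptions for illustration, not part of this PR; a real test would also need a project fixture (or a mock for the project lookup).

from datetime import datetime
from unittest import mock

import msgpack
import pytest
from arroyo.backends.kafka import KafkaPayload
from arroyo.dlq import InvalidMessage
from arroyo.types import BrokerValue, Message, Partition, Topic

from sentry.ingest.consumer.simple_event import process_simple_event_message


def make_message(raw: bytes) -> Message[KafkaPayload]:
    # Wrap raw bytes the way Arroyo delivers them from the broker.
    value = BrokerValue(
        KafkaPayload(None, raw, []), Partition(Topic("ingest-events"), 0), 1, datetime.now()
    )
    return Message(value)


def test_unparseable_message_is_dlqed() -> None:
    # Garbage that fails both processing and schema validation should be DLQed.
    with pytest.raises(InvalidMessage):
        process_simple_event_message(make_message(b"not msgpack"), "events", False)


def test_transient_error_is_not_dlqed(valid_ingest_payload: dict) -> None:
    # A schema-valid message that hits a transient error (here a mocked
    # connection error in process_event) should crash the consumer, not be DLQed.
    with mock.patch(
        "sentry.ingest.consumer.simple_event.process_event",
        side_effect=ConnectionError("connection refused"),
    ):
        with pytest.raises(Exception) as excinfo:
            process_simple_event_message(
                make_message(msgpack.packb(valid_ingest_payload)), "events", False
            )
        assert not isinstance(excinfo.value, InvalidMessage)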

@@ -259,6 +259,7 @@ def ingest_events_options() -> list[click.Option]:
"static_args": {
"consumer_type": "events",
},
"dlq_topic": Topic.INGEST_EVENTS_DLQ,
Contributor

Has the topic already been created in all environments?

Member Author

yes

"attachments": Topic.INGEST_ATTACHMENTS,
}


Contributor

This does not cover the attachments topic.
The functions for attachments are process_attachments_and_events and decode_and_process_chunks.
I think we should cover all of them, but if you want to do attachments in a separate PR, that's OK.

Member Author

Yes. One step at a time.

codec.decode(raw_payload, validate=True)
except Exception:
raw_value = raw_message.value
assert isinstance(raw_value, BrokerValue)
Contributor

Can it be anything else?

Member Author

no

if default_topic != "ingest-events":
raise

codec = sentry_kafka_schemas.get_codec(default_topic)
Contributor

Isn't this a heavy operation? Any reason not to instantiate the codec only once?

Member Author

@lynnagara lynnagara Mar 14, 2024

it's cached in the library after the first time
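
Illustratively (relying on the caching behaviour described above, not re-verified here):

import sentry_kafka_schemas

codec = sentry_kafka_schemas.get_codec("ingest-events")  # first call loads and compiles the schema
codec = sentry_kafka_schemas.get_codec("ingest-events")  # subsequent calls are served from the library's cache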

Comment on lines +88 to +90
raise InvalidMessage(raw_value.partition, raw_value.offset)

raise

return process_event(message, project, reprocess_only_stuck_events)
Contributor

I thought we were discussing inverting the way the DLQ classifies retriable errors: instead of putting the burden on the consumer developer to identify errors that should make us route the message to the DLQ, asking consumer developers to identify errors that should trigger a crash and treating everything else as InvalidMessage.

Is that still the case, and are you planning to deal with this in a separate PR/project?

Member Author

This is still the case, but in this specific function it's very difficult because of the specific logic and external systems that are depended on, and the number of things that can fail here. Practically, I think checking for both an exception and then the schema as well is the safest approach in this one case.

@lynnagara
Member Author
Member Author

@jjbayer I brought back the retriable exception around the processing store blocks to handle the case where the message still passes the schema but the inner part of the payload is invalid.
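
A minimal sketch of that, assuming the event_processing_store.store call mentioned earlier in the thread and a placeholder Retriable type; the exact placement inside process_event is simplified.

class Retriable(Exception):
    # Placeholder for the PR's Retriable exception type.
    pass


def store_event_payload(event_processing_store, data):
    # event_processing_store.store(data) is the call that crashed during INC-660.
    # Failures here are explicitly marked retriable: even if the outer Kafka
    # message passed schema validation, a problem at this point makes the
    # consumer crash and retry rather than send the message to the DLQ.
    try:
        return event_processing_store.store(data)
    except Exception as exc:
        raise Retriable(exc)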

@lynnagara lynnagara merged commit 9ba8562 into master Mar 15, 2024
50 checks passed
@lynnagara lynnagara deleted the ingest-events-dlq branch March 15, 2024 20:25

sentry-io bot commented Mar 17, 2024

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ Retriable: [Errno 111] Connection refused ingest_consumer.process_event


JonasBa pushed a commit that referenced this pull request Mar 17, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Apr 2, 2024