Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing DataDog events caused by UnknownError #2013

Open
nicbell opened this issue Apr 29, 2024 · 9 comments
Open

Missing DataDog events caused by UnknownError #2013

nicbell opened this issue Apr 29, 2024 · 9 comments
Labels
bug Something isn't working

Comments

@nicbell
Copy link

nicbell commented Apr 29, 2024

Describe the bug

During events analysis at the Datadog console, we've noticed that some RUM custom action events, tracked from our Android app, are missing.

Later, going through Android logcat on a test device, we've noticed a regular stream of error messages like this:

00:04:28.537  E  [_dd.sdk_core.default]: Batch f7771fc7-c67b-4896-9b94-1758a4e6704d [79836 bytes] (RUM Request) failed because of an unknown error; the batch was dropped.
00:04:44.985  E  [_dd.sdk_core.default]: Batch 5378e67d-57be-44ac-99fd-f796050f2fed [116177 bytes] (RUM Request) failed because of an unknown error; the batch was dropped.
00:04:59.789  E  [_dd.sdk_core.default]: Batch 9a127d4a-eccc-4c9e-af8c-04d87ddb7f08 [90146 bytes] (RUM Request) failed because of an unknown error; the batch was dropped.

It looks like some batches are randomly dropped but the errors don't really help us investigate why.

We believe these are coming from this code in the SDK.
https://github.com/DataDog/dd-sdk-android/blob/release/2.8.0/dd-sdk-android-core/src/main/kotlin/com/datadog/android/core/internal/data/upload/UploadStatus.kt#L87

We've seen this on Datadog Android SDK versions 2.6.0, 2.7.0 and 2.8.0.

Could you please help to identify the source of the issue?

Reproduction steps

We don't know how reproduced it.

Logcat logs

No response

Expected behavior

No response

Affected SDK versions

2.6.0, 2.7.0 and 2.8.0.

Latest working SDK version

None

Did you confirm if the latest SDK version fixes the bug?

Yes

Kotlin / Java version

1.9.22

Gradle / AGP version

8.3.1

Other dependencies versions

Tried with work manager 2.9.0 & 2.8.1

Device Information

No response

Other relevant information

No response

@nicbell nicbell added the bug Something isn't working label Apr 29, 2024
@0xnm
Copy link
Contributor

0xnm commented Apr 30, 2024

Hello @nicbell!

In order to better understand the issue, I have the following questions:

  • does it happen only on a particular device/OS version?
  • do you see such errors only for RUM, or, if you are using other features, you see such errors also for, say, Logs, etc.?
  • do you see any successful uploads on this device, or all the RUM uploads are like that?
  • do you see any other errors from Datadog SDK in Logcat?

@MikeFot
Copy link

MikeFot commented May 1, 2024

Hello! I can provide provide additional info on the issue as I am working with @nicbell:

General Info

  • This is happening across multiple devices and supported OS versions in our Production Environment
  • We can verify that the events are being tracked correctly by our other Analytics implementations and we can compare the numbers
  • This is mostly happening when the app is in the background e.g. there is a push notification rendered event that we can verify is tracked in another platform 8300 times but only 2300 times in Datadog, even though they are going through the same in-app funnel so we have verified they are submitted to the analytics integrations identically
    • We have also verified that 8300 is the expected number from other sources
  • Success rate for events showing up on the Datadog Dashboards is much higher when the app is in the foreground but there is still a percentage of missing events

We are looking at ~70% of events lost when the app is in the Background and possibly ~10% in the Foreground. Please note that these are Rough estimates as this is an ongoing investigation.

Additional Investigations Performed

  • As @nicbell mentioned, we have tried out 2.6.0, 2.7.0 and 2.8.0
  • Additionally reverted to SDK version 2.1.0 and noticed that background events were being tracked more reliably
    • There were still missing events in general and the aforementioned error is in the logs but not at the scale that we are observing in the tested versions
  • Downgraded our Work Manager to match the version packaged with the Datadog SDK but there was no improvement

Error Log

This is the only Datadog error we are seeing in the logs, and we have tried with all log levels, and the additional sdk log artifact enabled just in case it provided more info. The error is logged multiple times per session.

[_dd.sdk_core.default]: Batch 238f8a16-f0ea-4eef-a7b0-43bf873d65fb [18036 bytes] (RUM Request) failed because of an unknown error; the batch was dropped.

Example of this occurring consecutively within 1 second with timestamps:

11:51:32.467  E  [_dd.sdk_core.default]: Batch 5da558ea-607e-4191-bbf1-6b0b83b67812 [132314 bytes] (RUM Request) failed because of an unknown error; the batch was dropped.
11:51:40.923  E  [_dd.sdk_core.default]: Batch 42bb555e-24ef-4982-b2c9-50c6eb836bf4 [39492 bytes] (RUM Request) failed because of an unknown error; the batch was dropped.
11:51:48.583  E  [_dd.sdk_core.default]: Batch 07a75b5c-5418-42b3-a1eb-feab9648f5d8 [136818 bytes] (RUM Request) failed because of an unknown error; the batch was dropped.
11:51:55.512  E  [_dd.sdk_core.default]: Batch 3671f32d-2b38-4103-bb09-22a41ae758d4 [415 bytes] (RUM Request) failed because of an unknown error; the batch was dropped.
11:52:08.268  E  [_dd.sdk_core.default]: Batch f22c39ac-8c79-4088-a257-78f4be337672 [653 bytes] (RUM Request) failed because of an unknown error; the batch was dropped.

Work Manager

We can occasionally see the Work Manager succeeding. Have not seen any erroneous Worker results in the logs.

Worker result SUCCESS for Work [ id=4df6f445-729c-44c4-9b52-c7c9b6725bad, tags={ com.datadog.android.core.internal.data.upload.UploadWorker, DatadogBackgroundUpload } ]

Artifacts Implemented

  • com.datadoghq:dd-sdk-android-okhttp
  • com.datadoghq:dd-sdk-android-rum
  • com.datadoghq:dd-sdk-android-trace

and tried out com.datadoghq:dd-sdk-android-logs as well for the sake of debugging the issue.

Additional log warnings

The only warnings we are seeing are listed below.

[INFO:CONSOLE(1)] "Unsatisfied version 5.12.0 from one-host of shared singleton module @datadog/browser-rum (required ^4.50.1)

and in some cases

[_dd.sdk_core.default]: A RUM event was detected, but no view is active. To track views automatically, try calling the RumConfiguration.Builder.useViewTrackingStrategy() method. You can also track views manually using the RumMonitor.startView() and RumMonitor.stopView() methods.

@0xnm
Copy link
Contributor

0xnm commented May 2, 2024

Hello @MikeFot!

Thanks for the details. I don't think it is somehow related to the background events, here we have a some non-successful status code returned by the Datadog intake, meaning something is wrong with the payload.

I've added a small change in the referenced PR above, which is now a part of 2.9.0 release, which will add response status code to the error message. Can you please try then version 2.9.0 of our SDK and tell us what is the status code reported, so that we can proceed with investigation?

@nicbell
Copy link
Author

nicbell commented May 2, 2024

Thanks @0xnm we will try it out and report back.

@MikeFot
Copy link

MikeFot commented May 3, 2024

Thank you very much for adding this log!

Ran the app with the new version today through several scenarios and I can see the status code 200 being reported.

09:33:55.652  E  [_dd.sdk_core.default]: Batch 24f97b16-7bc4-4a69-977c-0fc81202f918 [41625 bytes] (RUM Request) failed because of an unknown error (status code = 200); the batch was dropped.
09:34:06.887  E  [_dd.sdk_core.default]: Batch 58cb65fa-a72b-47b3-9f49-6218e8431f13 [100250 bytes] (RUM Request) failed because of an unknown error (status code = 200); the batch was dropped.
09:34:17.069  E  [_dd.sdk_core.default]: Batch 4dd34e0a-9353-4bb1-b091-dd469e4ea0b2 [59533 bytes] (RUM Request) failed because of an unknown error (status code = 200); the batch was dropped.
09:34:26.265  E  [_dd.sdk_core.default]: Batch 07047f3f-6c01-4eb9-948d-0e847e630337 [48981 bytes] (RUM Request) failed because of an unknown error (status code = 200); the batch was dropped.
09:35:15.274  E  [_dd.sdk_core.default]: Batch 9079f43f-2376-43c7-b5e5-9f17bb283e6c [30183 bytes] (RUM Request) failed because of an unknown error (status code = 200); the batch was dropped.
09:36:34.126  E  [_dd.sdk_core.default]: Batch 2fe7197d-1145-40b9-ad4a-edf335fc0b51 [156802 bytes] (RUM Request) failed because of an unknown error (status code = 200); the batch was dropped.
09:36:53.408  E  [_dd.sdk_core.default]: Batch 318530dc-b14a-4333-bf1f-5a1e2eca6002 [60887 bytes] (RUM Request) failed because of an unknown error (status code = 200); the batch was dropped.
09:37:02.593  E  [_dd.sdk_core.default]: Batch 765d632f-0d38-4da6-9a28-5d64f6516c4c [83955 bytes] (RUM Request) failed because of an unknown error (status code = 200); the batch was dropped.
09:37:10.905  E  [_dd.sdk_core.default]: Batch 7e9f77d8-6785-49c5-84e0-2545bb668fc5 [83834 bytes] (RUM Request) failed because of an unknown error (status code = 200); the batch was dropped.
09:37:18.508  E  [_dd.sdk_core.default]: Batch af856a59-410b-4f7a-88af-5ae13df0e71b [90671 bytes] (RUM Request) failed because of an unknown error (status code = 200); the batch was dropped.
09:37:32.533  E  [_dd.sdk_core.default]: Batch 0a555ffd-85d4-4a5e-a7f4-a7961f55c039 [12886 bytes] (RUM Request) failed because of an unknown error (status code = 200); the batch was dropped.
09:37:46.397  E  [_dd.sdk_core.default]: Batch 5c1db6d5-1be1-41f5-9575-4e76e4cd8f07 [415 bytes] (RUM Request) failed because of an unknown error (status code = 200); the batch was dropped.

@0xnm
Copy link
Contributor

0xnm commented May 3, 2024

@MikeFot Thanks a lot for a quick return. This log indeed makes things a bit more clear, I have additional questions now:

  • Are you using proxy/custom endpoint in Datadog SDK setup? Or events are sent directly to Datadog intake?
  • Which datacenter do you have configured? Is it US1, EU1, etc.?

@MikeFot
Copy link

MikeFot commented May 3, 2024

Hi!

  • RumConfiguration is set up with a custom endpoint
  • We are using DatadogSite.US1 in Configuration

Cheers!

@0xnm
Copy link
Contributor

0xnm commented May 3, 2024

I think here the issue is that your custom endpoint is replying with status code 200, while Datadog intake replies with status code 202 when request is accepted. In this case, there is a difference between the specification of your custom endpoint and Datadog intake.

Ideally, custom endpoint should act as a proxy and forward directly the communication with Datadog intake as-is (that is why it is advised to use the proxy configuration instead of custom endpoint).

I may suggest you to analyze the communication between your custom endpoint and Datadog intake for these failing requests, and if there is indeed a difference in status codes reported, then to adjust endpoint configuration or switch to the proxy setup if it works for you.

@MikeFot
Copy link

MikeFot commented May 3, 2024

Thank you very much! We are going to investigate and let you know 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants