SDK repeatedly retries bad requests #126

theolampert · 2024-02-28T11:03:58Z

Network requests fail with a 400 error code referring to the length of user_id or device_id. Both of these values appear to be set in the request data so I'm unsure why this fails. The client will then repeatedly retry this, in the sampled screenshot it's sent around 800 network requests.

The client will also repeatedly call flush and fill up it's local event cache, in my case the size of this cache is around ~600mb, my colleague's is a few gb. This will peg CPU usage and gradually increase memory usage as well ad disk writes.

This is also occuring in production and reported by by performance monitoring we have in Sentry.

There's a few things going on here but the main issue I see is a bug in the retry logic of the client that I haven't been able to track down.

The text was updated successfully, but these errors were encountered:

theolampert · 2024-02-28T11:23:34Z

@izaaz It seems like handleBadRequestResponse in PersistentStorageResponseHandler has a bug in its retry logic, there are no checks to see if the original request issue has been resolved before adding eventsToRetry back into the event pipeline. I also notice that the config variable flushMaxRetries is never used.

izaaz · 2024-02-28T16:47:01Z

Thanks for raising this issue @theolampert - We are taking a look at this

bwised · 2024-03-03T03:13:53Z

For me the memory overload is not gradual...it's very quick. After about 30 to 45 seconds after launch I get a massive burst of these and then they don't stop.

Background Task 2,174 ("amplitude"), was created over 30 seconds ago. In applications running in the background, this creates a risk of termination. Remember to call UIApplication.endBackgroundTask(_:) for your task in a timely manner to avoid this.

Some stuff is going bonkers.

izaaz · 2024-03-04T22:27:04Z

We have a repro for this issue. We'll keep this thread posted.

bejewel-sangwon · 2024-03-05T07:31:05Z

@izaaz When do you expect this issue to be resolved?

izaaz · 2024-03-07T00:19:31Z

@bejewel-sangwon - there is one other issue that is currently being worked on and this will be worked on right after that. ETA is about 2 weeks from now.

This should only be happening if the client is sending a bad request. Most likely because of a bad userid/device id. Can you please check if you are sending device ids/user ids < 5 characters long? If yes, can you make sure the minIdLength property is set correctly in the config?

bwised · 2024-03-08T14:49:56Z

My user ids are always UUID strings and aren't ever sent empty. So length isn't a problem.

bwised · 2024-03-09T05:17:28Z

I actually am seeing events in a bunch of .tmp files in the app's Documents/amplitude folder that do have user_id set to null. I'm not sure how that is happening, though.

[{"os_version":"17.4","revenue":null,"session_id":1709960900263,"time":1709961064701,"region":null,"device_brand":null,"idfa":null,"location_lng":null,"idfv":"77512EE1-96DE-4A2D-808A-230EBFDDE79C","device_model":"arm64","event_type":"Initialize","dma":null,"user_id":null,"platform":"iOS","library":"amplitude-swift\/1.3.6","os_name":"ios","product_id":null,"adid":null,"app_version":null,"carrier":"Unknown","city":null,"device_manufacturer":"Apple","revenue_type":null,"event_properties":{"Classification":"Analytic","Is Sandbox":"false","What":"Firebase"},"ip":"$remote","location_lat":null,"insert_id":"C1824F8D-ABDF-4230-9699-0CDAFC819BF8","device_id":"77512EE1-96DE-4A2D-808A-230EBFDDE79C","country":null,"language":"en-US","event_id":221,"price":null,"partner_id":null,"version_name":"3.19","quantity":null},{"os_version":"17.4","revenue":null,"session_id":1709960900263,"idfa":null,"time":1709961064710,"device_brand":null,"region":null,"location_lng":null,"idfv":"77512EE1-96DE-4A2D-808A-230EBFDDE79C","device_model":"arm64","event_type":"[Amplitude] Application Opened","user_id":null,"dma":null,"platform":"iOS","library":"amplitude-swift\/1.3.6","os_name":"ios","product_id":null,"adid":null,"carrier":"Unknown","device_manufacturer":"Apple","app_version":null,"city":null,"revenue_type":null,"event_properties":{"[Amplitude] From Background":false,"[Amplitude] Build":"1","[Amplitude] Version":"3.19"},"ip":"$remote","location_lat":null,"device_id":"77512EE1-96DE-4A2D-808A-230EBFDDE79C","insert_id":"BE2ED514-0A67-4AB5-9E40-DF7482019089","country":null,"language":"en-US","event_id":222,"price":null,"partner_id":null,"version_name":"3.19","quantity":null}]

izaaz · 2024-03-09T05:36:30Z

user id being null is ok from Amplitide's perspective. A request to amplitude needs either a user id or device id or both. We only check for length when it's not null.

bwised · 2024-03-09T19:46:13Z

I managed to catch in the debugger today that randomly I get 400's coming back from the backend with the error "Invalid API key". And this is in the same app instance where it's been previously happy uploading events. Then the backend decides the API key is invalid.

It's at this point that the contents of the Documents/amplitude/amplitude-swift-storage--$default_instance.events.index folder explodes.

(And as a side note: Configuration.flushMaxRetries is not used anywhere in the Amplitude code yet)

Edit: I've confirmed that when this happens, HttpClient.getRequestData(events:) is creating the requestPayload with a blank apiKey. I'm continuing to debug.

Edit: I've discovered that I have something that posts an event to the Amplitude SDK before Amplitude is initialized, which causes AmplitudeDestinationPlugin to create an HttpClient with a configuration with no apiKey.

bwised · 2024-03-09T20:26:47Z

Yeah, so the workaround for this problem is to not try to post events before the Amplitude SDK is initialized. LOL

bejewel-changkyu · 2024-03-11T02:51:18Z

The problem with this issue is that even if a trigger occurred due to an incorrect user_id, the event does not disappear but continues to accumulate like zombies, eventually causing the app itself to crash.

If it were an incorrect event, it should have disappeared from the list, but it keeps accumulating, resembling a DDos attack with a huge number of requests, eventually overwhelming the app until it crashes.

I hope there will be a prompt solution. 🙏

I will also attach a video showing the related logs.

default.mp4

izaaz · 2024-03-15T19:25:10Z

@bwised Thanks for bringing that to our attention too.

@bejewel-changkyu Looking at your log, it looks like your event type is empty?

crleona · 2024-03-16T00:13:10Z

@theolampert @bwised @bejewel-changkyu I've landed a fix for the issue, would it be possible to verify this is working on your end? It may take some time for affected devices to stop sending requests, but they should eventually converge to the expected behavior. Thanks!

theolampert · 2024-03-18T17:00:28Z

@crleona Internally this appears to be working for us, we'll release tomorrow and let you know if the issue is completely resolved.

crleona · 2024-03-19T01:47:10Z

Thanks @theolampert, this is now released in 1.4.1. Please reopen this issue if you encounter any issues with the release.

theolampert · 2024-03-19T15:09:34Z

@crleona Hey, testing this internally, unfortunately we're seeing the same issues with 1.4.1 – @schurigeln will reach out later today with more details and screenshots.

theolampert · 2024-03-20T08:06:14Z

It seems to still be related to using user IDs that are below a certain length, logging out of that particular account works correctly.

There are then still two issues here, the backend has an undocumented minimum user ID length and the SDK is still able to get itself into a state in which it continuously retries bad requests.

izaaz · 2024-03-22T19:59:53Z

@theolampert we'll update the docs to make this more explicit.

Wrt the SDK retrying - currently, the response from the server indicates the first event that had bad data and the SDK drops that event and retries the rest. We are working with the backend team to get the full list so that the SDK can drop all bad requests in one go and not have to retry often. The fix that was pushed should at least make sure you end up with too many duplicates event locally. It still has a slow draining issue which we'll address soon.

From a principle perspective, we've always leaned towards not dropping any data on the SDK. Which is why we don't drop the data when we notice that the user id length is less than 5.

Keeping this ticket open while we get both of these resolved.

theolampert added the bug Something isn't working label Feb 28, 2024

izaaz mentioned this issue Feb 28, 2024

Custom instance name appears to cause issues reconciling past events, causing memory leak and crashes #128

Open

izaaz assigned crleona Mar 15, 2024

crleona closed this as completed Mar 19, 2024

crleona reopened this Mar 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SDK repeatedly retries bad requests #126

SDK repeatedly retries bad requests #126

theolampert commented Feb 28, 2024 •

edited

theolampert commented Feb 28, 2024

izaaz commented Feb 28, 2024

bwised commented Mar 3, 2024 •

edited

izaaz commented Mar 4, 2024

bejewel-sangwon commented Mar 5, 2024 •

edited

izaaz commented Mar 7, 2024 •

edited

bwised commented Mar 8, 2024 •

edited

bwised commented Mar 9, 2024

izaaz commented Mar 9, 2024

bwised commented Mar 9, 2024 •

edited

bwised commented Mar 9, 2024

bejewel-changkyu commented Mar 11, 2024

izaaz commented Mar 15, 2024

crleona commented Mar 16, 2024

theolampert commented Mar 18, 2024

crleona commented Mar 19, 2024

theolampert commented Mar 19, 2024

theolampert commented Mar 20, 2024

izaaz commented Mar 22, 2024

SDK repeatedly retries bad requests #126

SDK repeatedly retries bad requests #126

Comments

theolampert commented Feb 28, 2024 • edited

theolampert commented Feb 28, 2024

izaaz commented Feb 28, 2024

bwised commented Mar 3, 2024 • edited

izaaz commented Mar 4, 2024

bejewel-sangwon commented Mar 5, 2024 • edited

izaaz commented Mar 7, 2024 • edited

bwised commented Mar 8, 2024 • edited

bwised commented Mar 9, 2024

izaaz commented Mar 9, 2024

bwised commented Mar 9, 2024 • edited

bwised commented Mar 9, 2024

bejewel-changkyu commented Mar 11, 2024

izaaz commented Mar 15, 2024

crleona commented Mar 16, 2024

theolampert commented Mar 18, 2024

crleona commented Mar 19, 2024

theolampert commented Mar 19, 2024

theolampert commented Mar 20, 2024

izaaz commented Mar 22, 2024

theolampert commented Feb 28, 2024 •

edited

bwised commented Mar 3, 2024 •

edited

bejewel-sangwon commented Mar 5, 2024 •

edited

izaaz commented Mar 7, 2024 •

edited

bwised commented Mar 8, 2024 •

edited

bwised commented Mar 9, 2024 •

edited