New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Socket connection drops next notification if invalid notification is sent #14
Comments
My guess is that this issue is being caused by the fact that you're using typical non-blocking sockets, which are implemented a bit weirdly in Ruby by default. I ran into a similar experience, and switching to kgio (http://bogomips.org/kgio/) resolved it. Voodoo, I know -- but what works, works. |
I was thinking of trying kgio to see if it helped, I will try this. Thank On Friday, June 22, 2012, Kenneth Keiter wrote:
|
KGIO does not appear to resolve this situation (on Lead Zeppelin at least). I have tried this on JRuby and got the same results. After testing 3 different SSL socket implementations, I'm starting to arrive at the conclusion that there is no way to fix this except to introduce an IO.select and wait an arbitrary amount of time for an error before using the connection again. Again, if people more apt at socket programming see an obvious problem here, feel free to chime in, it's entirely possible that we're just not doing some cryptic thing that we're supposed to do. Implementing the 0 version of the protocol is another option on the table that might resolve this problem (at the expense of better error reporting). |
I had this kind of error using apn_on_rails , I didn't find out a good solution too. Used to rescue the error with the device id of the bad token and reopen a connection and start with the next device. I'm curious about what you'll find out. |
The only implementation that does this correctly AFAICT is Lead Zeppelin, which uses IO.select to wait an arbitrary time to see if an error comes back. This makes it able to send less messages though. My one and only idea for solving this would be to try to implement the first version of the protocol (0) and see if the problem goes away. I'm really surprised that I'm the only person that has brought this up before.. I found this problem immediately with very basic testing. |
I'm running into this as well. Any solutions in the past three months? |
Nothing has been attempted yet. My only idea so far is to try to inplement I implemented an apns solution around the same time called lead_zeppelin FWIW, every apns implementation i've tried has had the exact same problem. I want to try to implement the original protocol with grocer to see if it On Friday, October 12, 2012, kevboh wrote:
|
Sounds good. I'll check out lead_zeppelin. |
for #43 I introduced 2 different IO.selects
I tried sleep(1) before reading, but that sometimes waited too long, so the apple connection was closed and I couldn't get the error information. |
@vanstee @kbrock @kyledrake @Aleph7 I want to try to clear some things up, get us all on the same page, and then propose a possible, not-waiting-after-every-write, solution. TL;DR: avoid waiting after every write and move to a producer/consumer setup. The problem(s)
Current approachesThe approaches so far all use an An alternative approachWhat if we used a couple of thread-safe queues and the producer/consumer pattern? So there is a queue of When an error is found For more background with better descriptions of the problems, current approaches, and even this kind of queue-based alternative, check out this post: http://redth.info/the-problem-with-apples-push-notification-ser/ |
Hi @stevenharman, @vanstee, @kyledrake, and @Aleph7 We are all on the same page: It is not a viable option to block after every push. But please keep in mind that One implementationYou can find the actual implementation in our copy of pusher.rb. notifications.each do | notification |
push notification
if check_for_issues_non_blocking
rewind_to notification_after_error
end
end
if check_for_issues_blocking(1.second)
rewind_to notification_after_error
end Having the 1 second block at the very end allows us to get the response code from Apple. We can confidently clear out all state once this method is finished. Also, if we were not blocking at the end and Apple did send a response code. We may not come into the method for another few seconds. Apple will have already closed the socket and the Apple response code will have been lost. We can use
|
@stevenharman Totally agree with your "alternative approach". Seems like our best option. Although, we'll need a separate queue for the Any thoughts on also providing a blocking API as well? @kbrock I think blocking code is probably less complex which might be a good argument for it. For that "second example" are you saying, that we can call select on the connection for both read and write, knowing that the write connection will immediately available, and then just check the errors for the response from Apple? That's pretty clever, although I'm still leaning towards doing a blocking read in another thread so far. |
I think the main issue is the secondary thread needs to communicate with the primary thread to say: you need to close the connection and you need to rewind back and send those alerts again. Also, after leaving the producer, the secondary thread will not have a primary thread to tell that we need to rewind/ send the alerts again. Maybe it is a 3 threaded model I guess I have tended to avoid the threaded producer/consumer code when possible. It just seems to have so many synchronization and possible race conditions. |
I think introducing threads overcomplicates things. Something simple like this should suffice
We read without blocking after each write. And at the end of the loop we do a blocking read so that we don't miss any late replies. If there is a reply we go back to the identifier that failed. In the case of "simple notifications" we would just resume where we left, which will potentially miss some notifications. Nothing to do about that. Am I missing something? |
Wasn't exactly sure what this meant. The non-threaded example still has the potential for missing messages, so we would have to do the same history queue thing for that example as well. |
I don't see why it has potential for missing messages. With messages you mean notifications or replies? And are you referring to the "simple" or the "enhanced" case? |
Both would still have the potential for sending notifications to Apple after they have decided to no longer receive data on the socket due to an invalid notification. Since we're using Nagles algorithm to combine packets sent over the wire (as recommended by Apple), there's the potential for Apple to receive a few notifications from us before they have checked the validity of the first notification. Here's a quick example: Let's say we send 3 notifications, but the first one is invalid. We'll send all three down the pipe. Apple will start inspecting them and forwarding them to the correct devices. However the first notification has an invalid token so Apple sends us an error response, closes the connection, and throws away the other messages we've already sent. So we would still have to check the identifier in the error response and rewind to that position to make sure we didn't lose any messages. I'm totally not an expert on networking stuff, so I might be misunderstanding how this actually works. Let me know if I missed anything! |
That is why you use the identifier to know which notification failed and restart from there. See the actual implementation in my comment on #21 (comment) |
I'm running an algorithm similar to the pull request 21 algorithm 10-20 times. Playing with some different variables. We have yet to loose a packet or send a duplicate. Do note: when I introduced a sleep instead of a blocking read at the very end, I did get errors without a response code. So I feel strongly about that. |
One thing to keep in mind - the producer/consumer model has a very nice, and desirable, side benefit of also being thread safe. The read-after-each-send-and-once-more-at-the-end is not, in and of itself, thread safe. I'm not exactly sure what the implementation would look like, but I know that it really helps to keep threaded code simple and concise - so that would be a guiding principle. |
There is a notification sender, and a notification error receiver. Both of which need to act upon a single socket. In the read after write case - both read and write and close are happening on the primary thread by the pusher. In both cases, the pusher can only be used by a single thread. Personally, I'm running this in event machine and as expected, the blocking on write to apple is not working for me. |
I should have been more clear - I meant that from the consumer side, this would make I am making the assumption that the As you've said, there is a single sender thread which only writes and closes the socket. So yes, we are only sending via one thread, but those writes are small and fast, and by using the single connection for all writes we save a huge amount on the overhead of opening/closing the sockets. Does that make sense, or am I misunderstanding, confused, or don't-know-what-I'm-talking-about? (I would not be offended if you said the later. 😄 ) |
Regardless of how it comes out, a non-threaded as an option would be appreciated. Our use-case for Grocer is via Sidekiq, so using threads with my threads feels like a fragile house full of cards. We instantiate a separate |
Depending on how you're sending messages, you need to be careful of this as Apple will can consider many short-lived connections to be a DDoS and shut you down. |
Yeah, I saw that in some of the documentation, forums. The threads are long lived, and the sockets don't look like they'll timeout on their own (the pusher never closes them as a surprise, from what I can see). We also limit our concurrency to that particular Sidekiq instance to 15. --edit |
Let's go digging! since this was never resolved... APNS has some oddities in regards to what they expect: Note: I am not sure how this applies to the HTTP/2 API Dropping Connection on Failure
The reason why people do the synchronous blocking: Async Error Responses
Possible SolutionsMulti-Threaded Strategy
Implementation Details
Apple guidelines indicate: Single-Threaded Strategy
User Strategy
@stevenharman do you have any suggestions as to how we should implement these? Should we implement them all ? I was thinking we use some sort of pluggable strategy approach. |
Hi, I'm no longer using grocer, so things may have changed, but I did use grocer and have much success with it. In my situation we could not loose any messages, so we did solve this issue. The actual problem is in the catch / retry logic on a connection close. It retries the send and forgot to ask the connection, "wait, did you want to leave a message?" The tricky part is the messages have to be resent. A typical API shouldn't keep a history of messages around. Because that causes a memory leak. But that is what is required here. Whether keeping it in the connection object, or in the queue / postgres. Also, going over ssl is a little tricky, because you call a Your solution is to either a) cache messages along with the connection and be able to resend them when a connection retry is necessary, or b) move it out into a service. I removed the retry logic, and kept a "memory leak" / message cache along with each connection. If I had unlimited time, I would have created a service rather than treating apns/google as an api. Just put messages into postgres to be sent out. May have lost a little in throughput since that can't keep up with rabbit mq, but I would have been able to capture metrics on the number of retries per telephone number and come up with meaningful statistics. Best of luck. update: |
The HTTP/2 API will make life a lot easier, but when! |
With the changes to push tokens in iOS 9.0 this bug has made grocer completely unreliable. iOS 9.0 results in a new push token being generated when an app is deleted then re-installed. The old token is invalidated. Often it will take the feedback services hours to report it, but Apple rejects it pretty much immediately. I think that if there's no clear solution to this the README should be updated with a warning. The project simply isn't usable without this being fixed. |
Hello @idyll, I don't use Grocer in my current product, and so am unable to dedicate much time to it other than smaller maintenance things. In the event you (or anyone else reading this) would like to help out, I think a reasonable way forward is to implement the recently released HTTP/2 push provider.. Building that will "fix" this issue (and a handful of others) by way of allowing us to ignore the awkward and indeterminate push APIs that have caused this issue. I would be happy to help with such an effort, but I can't be the one driving it at this time. Thank you. |
There's an implementation here: https://github.com/alloy/lowdown but you'd need some sort of connection pooling and reuse to really make it work. But that's not really the point. This situation breaks existing apps once the users upgrade to iOS 9. I'm just saying this merits an update to the readme to warn people. And probably a statement asking for help on a PR to move to the HTTP/2 interface. |
How's this: A WARNING TO APPLE SHARE CROPPERSUsing an Apple SDK to develop software for your livelihood is dangerous. They will give you a broken implementation of a broken protocol written by somebody that's clearly never written an API before and has zero concept of error handling, then randomly add breaking changes to it, making your barely working broken implementation now 100% absolutely unusably broken. Expect everything to break, and expect Apple to not care or provide any way to communicate with the authors. If you want to write reliable software, or software that doesn't make you feel like you're cooking soup for prisoners in a Siberian gulag, you should not under any circumstances develop software with an Apple SDK or API. |
Important NoteiOS 9.0 (and subsequent versions) introduced a subtle change to the way push tokens provided. If you delete and then re-install an application the push token is invalidated and a new push token is generated. This is important because the feedback service does not deliver the list of invalidated tokens quickly enough to prevent you from using the now invalidated token. There is currently a bug in grocer (#14) that will cause the APNS socket connect to hang up and fail to send subsequent notifications when one of these invalid tokens is used. This bug combined with the change to push tokens in iOS results in varied reliability of push notification delivery. This may or may not affect you, but if you are seeing a large amount of undelivered notifications - specifically when sending multiple messages in quick succession - it is likely that you are coming up against this. Right now we are looking for help moving over to Apple's HTTP/2 notification endpoint which should address this situation, but the current maintainer doesn't have time to do this work. |
So just to clarify, if we send an invalid notification, we'll lose the notifications that we send quickly after it, but the pusher does eventually get the error message and re-open the connection, so that at some point it will resume being able to send notifications successfully again, right? |
@mohamedhafez The messages you send right after sending a bad token will be lost unless you are tracking messages sent. |
@kbrock just the messages immediately after, or all messages afterwards, forever? |
@mohamedhafez it is documented as such: let G be good messages. GGGGGBGGGGGGR All messages right of the bad message will be dropped. |
@AlexRiedler my question is what happens after the R? Does the pusher object recover by opening a new connection, or do we lose all messages after the R as well? |
@mohamedhafez the connection is closed immediately by APNS; so it has to re-establish a connection. |
@mohamedhafez Once Grocer realizes the connection has been closed, it will open a new connection and resume sending messages. So any messages sent between the B message and R will be lost. Unless you're tracking them and then try to re-send them. |
|
Assuming they're good notifications, yes.
The R isn't even an error message sent back by APNS, it's that they close their end of the socket, and so suddenly Grocer can't push any more messages. Internally, Grocer checks before sending each notification to make sure the connection is open. When it finds it's closed, it will open a new one and use that. |
Perfect, thanks for your patience guys:) |
So, this is the problem I ran into on Lead Zeppelin (https://github.com/geoloqi/lead_zeppelin). It's kindof a big problem right now, and I haven't figured out a graceful way to fix it yet.
If you send an illegit message (such as a message with a bad device token), the connection will drop without throwing an exception for the next message, and that next message will effectively be dropped.
Example:
I don't know what the source of this problem is.. whether it's something in Ruby, something on Apple's APNS servers, or some sort of buffering issue, or what.
I worked around this problem by putting in an IO.select to wait for a specific period of time, to see if the socket becomes available for reading. With the enhanced protocol, being readable indicates that there was an error. But the only way to do this is to wait for a response before using it again, due to the silent failure. But this solution sucks, because it slows everything down.
You can see the implementation of that IO.select here: https://github.com/geoloqi/lead_zeppelin/blob/master/lib/lead_zeppelin/apns/gateway.rb#L96
The solution I was pondering to deal with this problem was to switch back to the version 0 protocol, but then of course you don't get the enhanced error response, which also kindof sucks.
The real solution would be for APNS to send a confirmation bit like they should, or drop the connection correctly... but that's not going to happen anytime soon. Really at a loss for how to solve this.
The text was updated successfully, but these errors were encountered: