How to clear ongoing observe notifications from transit on CoapEndpoint stop? #2245
Comments
First of all, observe/notification is defined in RFC 7641. See there in 4.5 "Server-Side Requirements - Transmission":
To me this looks like the implementation is in "spec", while your expectation isn't. Sometimes not all cases can be covered in one specific way, because for other cases a different way is preferable. What makes me wonder about your description: an "in transit" notification should be replaced by the "new notification", keeping the retransmission state (number and current timeout, see ObserveLayer.NotificationController.onRetransmission()). But your description is different. Could you please provide a capture so I can check that?
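The replacement behavior described here can be modeled roughly like this (plain Java; the class and method names below are invented for illustration and are not Californium's actual API): a new notification supersedes the one in transit, but the retransmission counter and current timeout carry over, so the retransmission budget is not reset.

```java
/**
 * Toy model of RFC 7641 server-side transmission: when a new notification
 * is produced while an older one is still being retransmitted, the new one
 * replaces it but inherits the retransmission state (count and current
 * timeout). All names here are invented for illustration.
 */
class NotificationSlot {
    String payload;
    int retransmissionCount;
    long currentTimeoutMillis;

    NotificationSlot(String payload, long initialTimeoutMillis) {
        this.payload = payload;
        this.currentTimeoutMillis = initialTimeoutMillis;
    }

    /** One retransmission attempt: counter up, timeout doubles (binary backoff). */
    void onRetransmission() {
        retransmissionCount++;
        currentTimeoutMillis *= 2;
    }

    /** A fresh notification replaces the payload but keeps the retransmission state. */
    void replaceWith(String newPayload) {
        this.payload = newPayload;
    }

    public static void main(String[] args) {
        NotificationSlot slot = new NotificationSlot("temp=20", 2000);
        slot.onRetransmission(); // 1st retry, timeout now 4000
        slot.onRetransmission(); // 2nd retry, timeout now 8000
        slot.replaceWith("temp=21");
        System.out.println(slot.payload + " retries=" + slot.retransmissionCount
                + " timeout=" + slot.currentTimeoutMillis);
    }
}
```

The point is only that `replaceWith` does not touch `retransmissionCount` or `currentTimeoutMillis`: the newest state travels in the old notification's "slot".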
What does that mean? Stop the Java engine?
Do you mean the transmission of the notification? Then try "MessageObserver.onTransferComplete". Anyway, I guess once I understand what is really done, the answer will be simpler.
It indeed looks like the canceling of observations in case of timed-out notifications is the specified behavior; thanks for the clarification.
By client stop I mean that the (Leshan) client's CoapServer and its endpoints are stopped, using their stop methods. So the stop will stop the Endpoint, which in turn stops the Connector. In this situation, do you still think that the in-progress notification should be cleaned up by the next notification? In this case, the next notification will only come after the client is started again. I will see if I can provide a capture somehow, but it might be difficult due to company policies :/
I guess no one cared about observe/notify and start/stop in the past 10 years ;-).
In stop, I don't think that someone should send a new notification, or? I currently have no idea what will be best for observe/notify relations. For a common exchange it's easier, because that is more or less "short living". I guess the answer depends more on the use case for the stop/start, and that then gives an idea about the intention regarding the observe/notify relation.
That's OK. I guess even a unit test with stop/start will show the issue. I'm currently no longer in a job where I can spend the required time to fix such things in the short term. Let's see how long it takes. By the way, I'm curious: in the end you want to emulate a "lot of clients", and the obvious solution would be to use NIO (non-blocking). You preferred a workaround, which now shows some issues. Maybe for the short term: if I want to simulate quite a lot of clients (1,000,000 or so), I usually just use RPK clients and do frequent new handshakes. It depends on some settings (identity), but with that the connection and exchange store can be filled up a lot. And once they are filled, I don't see a large difference between having 2,000 or 200,000 clients really active.
Yeah, I guess this whole approach in our use case has been pretty far from the most common one :D
Ok, you got my attention: could you elaborate on the NIO approach you had in mind? But yes, basically this start/stop workaround has been a pain, and if there were some other way to look at this, while keeping the resource usage sensible, I would be very interested in hearing more. I guess the main reason for this approach was that Leshan/Californium didn't really allow keeping the clients on all the time, CPU/thread resource wise. At least that was my impression, as the communication began to break the more Leshan clients we had on simultaneously. But again, this implementation of mine is probably very naive in some ways. The requirements are basically that we need to always use the same DTLS session and X.509 certificates, and handle these long-lasting observe relations.
You already got Simon's answer on that in your issue in Leshan. He pointed to Californium issue #1203. Californium already supports using a shared thread pool for many parts. But the synchronous DatagramSocket unfortunately requires a dedicated thread for the blocking receive function. For using Californium as a central server that doesn't matter, but for using it for massive client endpoints it does. NIO may overcome that, but it will come with some implementation and, mainly, test effort. Anyway, I guess if you can spend some time providing something like a unit test for your issue, I will check what could be done about it.
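For the NIO idea, here is a minimal JDK-only sketch (not Californium code, just an illustration of the approach): several UDP "endpoints" are served by a single selector thread, instead of one blocking `DatagramSocket.receive()` thread per endpoint.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.charset.StandardCharsets;

/**
 * Sketch: many UDP endpoints drained by one selector thread.
 * Plain JDK NIO, not Californium code.
 */
public class NioEndpointsSketch {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        // Register a few non-blocking UDP channels with the one selector.
        DatagramChannel[] endpoints = new DatagramChannel[3];
        for (int i = 0; i < endpoints.length; i++) {
            DatagramChannel ch = DatagramChannel.open();
            ch.configureBlocking(false);
            ch.bind(new InetSocketAddress("127.0.0.1", 0));
            ch.register(selector, SelectionKey.OP_READ);
            endpoints[i] = ch;
        }
        // Send one datagram to each endpoint from a throwaway channel.
        try (DatagramChannel sender = DatagramChannel.open()) {
            for (DatagramChannel ch : endpoints) {
                sender.send(ByteBuffer.wrap("ping".getBytes(StandardCharsets.UTF_8)),
                        ch.getLocalAddress());
            }
        }
        // One thread drains all endpoints via select(); no per-socket blocking receive.
        int received = 0;
        int spins = 0;
        ByteBuffer buf = ByteBuffer.allocate(64);
        while (received < endpoints.length && spins++ < 50) {
            selector.select(1000);
            for (SelectionKey key : selector.selectedKeys()) {
                DatagramChannel ch = (DatagramChannel) key.channel();
                buf.clear();
                if (ch.receive(buf) != null) {
                    received++;
                }
            }
            selector.selectedKeys().clear();
        }
        System.out.println("received " + received + " datagrams on one thread");
        selector.close();
        for (DatagramChannel ch : endpoints) {
            ch.close();
        }
    }
}
```

The design point: with `configureBlocking(false)` and a `Selector`, the number of threads no longer scales with the number of client endpoints, which is exactly what the blocking `DatagramSocket` forces on Californium today.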
Just as a fast shot, not tested, nor really sure if it addresses the issue at all: add that snippet to the
and then test it.
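The snippet itself did not survive the formatting here. As a rough stand-alone model of the intended behavior (names invented for illustration, not the actual Californium classes; the real fix lives in Californium's endpoint/observe layers), the cleanup could look like this:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Toy model of the proposed cleanup: an endpoint tracks notifications that
 * are "in transit" (sent but not yet ACKed). On stop(), those are canceled,
 * so a later notification is not postponed by a stale one after restart.
 * All class and method names are invented for illustration.
 */
class ObserveEndpointModel {
    static final class Notification {
        final String resource;
        volatile boolean canceled;
        Notification(String resource) { this.resource = resource; }
    }

    private final Map<String, Notification> inTransit = new ConcurrentHashMap<>();
    private volatile boolean running = true;

    /** Send a notification unless a previous one for the resource is still in transit. */
    boolean sendNotification(String resource) {
        if (!running) {
            return false;
        }
        Notification previous = inTransit.get(resource);
        if (previous != null && !previous.canceled) {
            return false; // postponed: previous notification still in transit
        }
        inTransit.put(resource, new Notification(resource));
        return true;
    }

    /** ACK from the peer clears the in-transit state. */
    void onAcknowledged(String resource) {
        inTransit.remove(resource);
    }

    /** Stop cancels all in-transit notifications. */
    void stop() {
        running = false;
        for (Notification n : inTransit.values()) {
            n.canceled = true;
        }
    }

    void start() {
        running = true;
    }

    public static void main(String[] args) {
        ObserveEndpointModel endpoint = new ObserveEndpointModel();
        endpoint.sendNotification("/3/0/9"); // sent, never ACKed
        endpoint.stop();                     // cancels the in-transit notification
        endpoint.start();
        // Without the cancel in stop(), this would stay postponed:
        System.out.println("next notification sent: " + endpoint.sendNotification("/3/0/9"));
    }
}
```

This mirrors the reported scenario: without the cancel in `stop()`, the second `sendNotification` would be postponed by the stale in-transit entry after restart.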
I tried to come up with some kind of unit test for this, but to be honest I could not really wrap my head around the Californium code. If you have some pointers on how to proceed with the testing, I can try again at some point, but for now I couldn't really do anything :(
I guess this is the part that the Worker class handles inside the DTLSConnector? Since at least that seems to be outside the executors one can pass in via
Some examples of unit tests around observe:
Yes, DTLSConnector, Worker receiver and DTLSConnector, Receiver
Fixes issue eclipse-californium#2245. Cancel pending responses and notifies on stop as well. Canceled notifications are no longer in transit. Signed-off-by: Achim Kraus <achim.kraus@cloudcoap.net>
If possible, please retest with PR #2246 and check if this fixes your issue.
Any news on this?
Hi, unfortunately I've been busy with other things and haven't been able to test this yet. Will try to do it this week and share the results. |
As long as I know that you plan to test it, I will keep it open. I'm currently considering scheduling the minor release 3.12 for June 6.
Hiya. Replicating the situation is pretty difficult in the real environment, but based on the local testing I've done, this seems to do the trick. No more "in transit" logging on the next notification. I will keep an eye out in case the behavior shows up again at some point. I think this can be closed for now at least. Thanks for all the help Achim, appreciate it 🙏
Background
We are using Leshan 2.0.0-M14 (Californium 3.9.1) in our LwM2M client implementation, communicating with a server using Leshan 2.0.0-M9 (Californium 3.7.0). We are actually running multiple clients on the same machine, connecting to the same server instance.
Posting this here as this is regarding the way the observations are handled in Californium.
The issue
In our use case, the server observes some resources on the client. We have noticed that the notifications from the client are not sent to the server in a situation where the client (i.e. the client's CoapEndpoint) has been stopped while the previous notification was not yet ACKed.
For example:
1. Client is started, and the server observes a resource of the client.
2. Client sends a notification on resource change, but the server, for some reason, does not answer immediately.
3. Client is stopped, while the server has still not ACKed the notification.
4. Later, the client is started again and tries to send a new notification to the server.
5. The notification send fails, as the previous notification is still "in transit".
From our clients' logs:
When we stop the LwM2M client, it calls the `stop` method on the CoapServer. Relevant Leshan snippet in here. Looks to me like the `ObserveRelation` is keeping the previous notification, despite it being in "canceled" state, and then this leads to postponing the future notifications in here.
Questions
Is there something we should do on `stop` or `start`, so that it won't cause this on later notifications?