
How to clear ongoing observe notifications from transit on CoapEndpoint stop? #2245

Ozame opened this issue Apr 22, 2024 · 16 comments

@Ozame

Ozame commented Apr 22, 2024

Background

We are using Leshan 2.0.0-M14 (Californium 3.9.1) in our LwM2M client implementation, communicating with a server using Leshan 2.0.0-M9 (Californium 3.7.0). We are running multiple clients on the same machine, all connecting to the same server instance.

Posting this here since it concerns the way observations are handled in Californium.

The issue

In our use case, the server observes some resources on the client. We have noticed that the notifications from the client are not sent to the server in a situation where the client (i.e. the client's CoapEndpoint) has been stopped while the previous notification was not yet ACKed.

For example:

  1. Client is started, and server observes a resource of the client.

  2. Client sends a notification on resource change, but the server, for some reason, does not answer immediately.

  3. Client is stopped while the server still has not ACKed the notification.

  4. Later, the client is started again. It tries to send a new notification to the server.

  5. Sending the notification fails, because the previous notification is still "in transit".

From our clients' logs:

2024-04-22 09:25:37,560 [executor] TRACE- o.e.c.c.o.ObserveRelation - in transit CON-2.05   MID=48271, Token=FFB1D80C889E8CA9, OptionSet={"Observe":1414, "Content-Format":"application/senml+cbor"}, canceled 86 A3 00 ....
2024-04-22 09:25:37,560 [executor] DEBUG- o.e.c.c.n.UdpMatcher - tracking open request [KeyMID[10.222.22.22:5683-48304], KeyToken[10.222.22.22:5683-B0A47C35FE9F3213]]
2024-04-22 09:25:37,560 [executor] TRACE- o.e.c.s.DTLSConnector - connection available for /10.222.22.22:5683,null
2024-04-22 09:25:37,560 [executor] TRACE- o.e.c.c.c.Message - Message transfer completed CON-2.05   MID=   -1, Token=null, OptionSet={"Observe":1624, "Content-Format":"application/senml+cbor"}, 86 A3 00 ....
2024-04-22 09:25:37,560 [executor] DEBUG- o.e.c.c.n.s.ObserveLayer - a former notification is still in transit. Postponing CON-2.05   MID=   -1, Token=null, OptionSet={"Observe":1631, "Content-Format":"application/senml+cbor"}, 86 A3 00 ....
2024-04-22 09:25:37,560 [executor] TRACE- o.e.c.s.DTLSConnector - Sending application layer message to [MAP(10.222.22.22:5683)]

When we stop the LwM2M client, it calls the stop method on the CoapServer. The relevant Leshan snippet is here.

It looks to me like the ObserveRelation is keeping the previous notification, despite it being in the "canceled" state, and this leads to postponing future notifications here.

Questions

  • Is there a way to know when a notification-related exchange has completed, either successfully or by timing out? If we could know when it times out, for example, we could wait for that before stopping the client.
    • It looks like when a notification times out, the observe relation is canceled. Is there a way to change this behavior so that the observations are never canceled?
  • Alternatively, can you suggest a way to clean up this "in transit" notification on Endpoint stop or start, so that it won't block later notifications?
@boaks
Contributor

boaks commented Apr 22, 2024

First of all, observe/notification is defined in RFC 7641. There, in 4.5 "Server-Side Requirements - Transmission":

The client's acknowledgement of a confirmable notification signals
that the client is interested in receiving further notifications. If
a client rejects a confirmable or non-confirmable notification with a
Reset message, or if the last attempt to retransmit a confirmable
notification times out, then the client is considered no longer
interested and the server MUST remove the associated entry from the
list of observers.

To me this seems like the implementation is in "spec", while your expectation isn't. Sometimes not all cases are covered in a specific way, because for other cases another way is preferable.

What makes me wonder about your description:

A "in transit notification" should be replaced by the "new notification" keeping the state of retransmission (number and current timeout, see ObserveLayer.NotificationController.onRetransmission()). But your description is different.

Could you please provide a capture to check that?

@boaks
Contributor

boaks commented Apr 22, 2024

Client is stopped

What does that mean? Stopping the Java VM?

@boaks
Contributor

boaks commented Apr 22, 2024

Is there a way to know when a notification-related exchange has completed, either successfully or by timing out?

Do you mean the transmission of the notification? Then try MessageObserver.onTransferComplete.
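
For example, something like this (just a sketch; "NotificationTracker" is a made-up helper, assuming you have access to the outgoing notification Response):

import org.eclipse.californium.core.coap.MessageObserverAdapter;
import org.eclipse.californium.core.coap.Response;

// Hypothetical helper: attach an observer to the outgoing notification so the
// application learns when the transfer completes or times out.
class NotificationTracker {

	static void track(Response notification) {
		notification.addMessageObserver(new MessageObserverAdapter() {

			@Override
			public void onTransferComplete() {
				// transfer finished - e.g. safe to stop the endpoint now
			}

			@Override
			public void onTimeout() {
				// last retransmission timed out - per RFC 7641 the observe
				// relation is considered canceled
			}
		});
	}
}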

Anyway, I guess once I understand what is really being done, the answer will be simpler.

@Ozame
Author

Ozame commented Apr 23, 2024

To me this seems like the implementation is in "spec", while your expectation isn't. Sometimes not all cases are covered in a specific way, because for other cases another way is preferable.

It indeed looks like canceling the observation when a notification times out is the specified behavior; thanks for the clarification.

A "in transit notification" should be replaced by the "new notification" keeping the state of retransmission (number and current timeout, see ObserveLayer.NotificationController.onRetransmission()). But your description is different.

Is this onRetransmission called every time the notification send is retried? If so, our case is probably a bit different:

By client stop I mean that the (Leshan) client's CoapServer and its endpoints are stopped, using their stop method.

So the stop will stop the Endpoint, which in turn stops the Connector. In this situation, do you still think that the in-progress notification should be cleaned up by the next notification? In this case, the next notification will only come after CoapServer#start is called and a resource change is triggered. I suspect that the stop/start might affect things here.
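
For clarity, the cycle looks roughly like this (simplified sketch with made-up names and port, not our actual code):

import org.eclipse.californium.core.CoapResource;
import org.eclipse.californium.core.CoapServer;

// Simplified sketch of the stop/start cycle (hypothetical names/port).
public class StopStartCycle {

	public static void main(String[] args) throws InterruptedException {
		CoapResource resource = new CoapResource("data");
		resource.setObservable(true);

		CoapServer server = new CoapServer(5683);
		server.add(resource);
		server.start();

		// ... the server side observes "data", notifications are being sent ...

		server.stop();        // stops the CoapEndpoints and their Connectors

		Thread.sleep(10000);  // client is "off" for a while

		server.start();       // restart the endpoints
		resource.changed();   // a resource change triggers the next notification
	}
}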

I will see if I can provide a capture somehow, but it might be difficult due to company policies :/

@boaks
Contributor

boaks commented Apr 23, 2024

I guess, no one cared in the past 10 years about observe/notify and start/stop ;-).

In this situation, do you still think that the in-progress notification should be cleaned up by the next notification?

When stopped, I don't think that someone should send a new notification, or?

I currently have no idea what would be best for observe/notify relations. For a common exchange it's easier, because that is more or less "short-lived". I guess the answer depends more on the use case for the stop/start, and that then gives an idea about the intention regarding the observe/notify relation.

but it might be difficult due to company policies

That's OK. I guess even a unit test with stop/start will show the issue. I'm no longer in a job where I can spend the required time to fix such things in the short term. Let's see how long it takes.

By the way, I'm curious: in the end you want to emulate a "lot of clients", and the obvious solution would be to use NIO (non-blocking). You preferred a workaround, which now shows some issues. Maybe for the short term: if I want to simulate quite a lot of clients (1,000,000 or so), I usually just use RPK clients and do frequent new handshakes. It depends on some settings (identity), but with that the connection and exchange stores can be filled up a lot. And once they are filled, I don't see a large difference between having 2,000 or 200,000 clients really active.

@Ozame
Author

Ozame commented Apr 23, 2024

I guess, no one cared in the past 10 years about observe/notify and start/stop ;-).

Yeah, I guess our approach here has not exactly been the most common one :D

By the way, I'm curious: in the end you want to emulate a "lot of clients", and the obvious solution would be to use NIO (non-blocking). You preferred a workaround, which now shows some issues.

OK, you got my attention; could you elaborate on the NIO approach you had in mind?

But yes, basically this start/stop workaround has been a pain, and if there were some other way to approach this while keeping the resource usage sensible, I would be very interested in hearing more. I guess the main reason for this approach was that Leshan/Californium didn't really allow keeping the clients on all the time, CPU/thread resource wise. At least that was my impression, as the communication began to break the more Leshan clients we had running simultaneously. But again, this implementation of mine is probably quite naive in some ways.

The requirements are basically that we always need to use the same DTLS session and X.509 certificates, and handle these long-lasting observe relations.

@boaks
Contributor

boaks commented Apr 23, 2024

could you elaborate on the NIO approach you had in mind?

You already got Simon's answer on that in your Leshan issue. He pointed to Californium issue #1203.

For many parts, Californium already supports using a shared thread pool. Unfortunately, the synchronous DatagramSocket requires a dedicated thread for the blocking receive function. When using Californium as a central server that doesn't matter, but when using it for massive numbers of client endpoints it does.
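
Just to illustrate the limitation, nothing Californium-specific (plain JDK sketch):

import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.SocketException;

// Each connector needs a loop like this, and receive() blocks the thread, so
// every additional client endpoint costs at least one more receiver thread.
public class BlockingReceiveSketch {

	public static void main(String[] args) throws SocketException {
		final DatagramSocket socket = new DatagramSocket(0);
		Thread receiver = new Thread(new Runnable() {

			@Override
			public void run() {
				byte[] buffer = new byte[2048];
				DatagramPacket packet = new DatagramPacket(buffer, buffer.length);
				while (!socket.isClosed()) {
					try {
						socket.receive(packet); // blocks until a datagram arrives
						// hand the datagram over to the shared protocol thread pool
					} catch (IOException e) {
						break;
					}
				}
			}
		}, "udp-receiver");
		receiver.start();
	}
}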

NIO may overcome that, but it will come with some implementation and, mainly, testing effort.

Anyway, if you can spend some time providing something like a unit test for your issue, I will check what can be done about it.
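
If it helps as a starting point, such a test could look roughly like this (untested sketch, made-up names and port; the hard part is ensuring the server is stopped while the CON notification is still unacknowledged, which on localhost probably requires dropping or delaying the ACK):

import org.eclipse.californium.core.CoapClient;
import org.eclipse.californium.core.CoapHandler;
import org.eclipse.californium.core.CoapObserveRelation;
import org.eclipse.californium.core.CoapResource;
import org.eclipse.californium.core.CoapResponse;
import org.eclipse.californium.core.CoapServer;
import org.eclipse.californium.core.server.resources.CoapExchange;

// Untested outline: observe a resource, stop the server while a notification is
// pending, restart, and check that the next notification still goes out.
public class ObserveStopStartSketch {

	public static void main(String[] args) throws Exception {
		CoapResource resource = new CoapResource("obs") {

			@Override
			public void handleGET(CoapExchange exchange) {
				exchange.respond("value " + System.nanoTime());
			}
		};
		resource.setObservable(true);

		CoapServer server = new CoapServer(5683);
		server.add(resource);
		server.start();

		CoapClient client = new CoapClient("coap://localhost:5683/obs");
		CoapObserveRelation relation = client.observe(new CoapHandler() {

			@Override
			public void onLoad(CoapResponse response) {
				System.out.println("notification: " + response.getResponseText());
			}

			@Override
			public void onError() {
				System.err.println("observe failed");
			}
		});

		resource.changed();  // notification goes out ...
		server.stop();       // ... and the server is stopped before it is ACKed

		server.start();      // later: restart and trigger the next notification
		resource.changed();  // this is where the stale "in transit" one gets in the way

		Thread.sleep(2000);  // give the notification time to arrive (or not)
		relation.reactiveCancel();
		client.shutdown();
		server.destroy();
	}
}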

@boaks
Contributor

boaks commented Apr 23, 2024

Just as a fast shot, not tested, and I'm not really sure if it addresses the issue at all:

Add this snippet to ObserveLayer.NotificationController:

		@Override
		public void onCancel() {
			// run on the exchange's executor, like the other notification handling
			exchange.execute(new Runnable() {

				@Override
				public void run() {
					ObserveRelation relation = exchange.getRelation();
					// check for a notification postponed behind the canceled one
					Response next = relation.getNextNotification(response, true);
					if (next != null) {
						// cancel it as well, so it does not stay pending
						next.cancel();
					}
				}
			});
		}

and then test it.

@Ozame
Author

Ozame commented Apr 24, 2024

I tried to come up with some kind of unit test for this, but to be honest I could not really wrap my head around the Californium code. If you have some pointers on how to proceed with the testing, I can try again at some point, but for now I couldn't really get anywhere :(

For many parts, Californium already supports using a shared thread pool. Unfortunately, the synchronous DatagramSocket requires a dedicated thread for the blocking receive function. When using Californium as a central server that doesn't matter, but when using it for massive numbers of client endpoints it does.

I guess this is the part that the Worker class handles inside the DTLSConnector? At least that seems to be outside of the executors one can pass in via DTLSConnector#setExecutor.

@boaks
Contributor

boaks commented Apr 25, 2024

@boaks
Contributor

boaks commented Apr 25, 2024

I guess this is the part that the Worker class handles inside the DTLSConnector?

Yes, the DTLSConnector Worker receiver and the DTLSConnector Receiver.

boaks added a commit to boaks/californium that referenced this issue Apr 25, 2024
Fixes issue eclipse-californium#2245.
Cancel pending responses and notifies on stop as well.
Canceled notifications are not longer in transit.

Signed-off-by: Achim Kraus <achim.kraus@cloudcoap.net>
@boaks
Contributor

boaks commented Apr 25, 2024

If possible, please retest with PR #2246 and check if it fixes your issue.

boaks added a commit that referenced this issue May 3, 2024
Fixes issue #2245.
Cancel pending responses and notifies on stop as well.
Canceled notifications are not longer in transit.

Signed-off-by: Achim Kraus <achim.kraus@cloudcoap.net>
@boaks
Contributor

boaks commented May 14, 2024

Any news on this?
Can we close it?

@Ozame
Author

Ozame commented May 14, 2024

Hi, unfortunately I've been busy with other things and haven't been able to test this yet. Will try to do it this week and share the results.

@boaks
Contributor

boaks commented May 14, 2024

As long as I know that you plan to test it, I will keep it open.

I'm currently considering scheduling the minor release 3.12 for June 6.
So, if it's possible to get some feedback before then, it may also be possible to incorporate this in that next minor release.

@Ozame
Author

Ozame commented May 23, 2024

Hiya. Replicating the situation is pretty difficult in the real environment, but based on the local testing I've done, this seems to do the trick. No more "in transit" logging on the next notification. I will keep an eye out in case the behavior shows up again at some point.

I think this can be closed for now at least.

Thanks for all the help Achim, appreciate it 🙏
