Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SARA-R4 module closes socket unexpectedly #165

Open
arnoutdekimo opened this issue Nov 14, 2023 · 11 comments
Open

SARA-R4 module closes socket unexpectedly #165

arnoutdekimo opened this issue Nov 14, 2023 · 11 comments

Comments

@arnoutdekimo
Copy link

arnoutdekimo commented Nov 14, 2023

Hi,

I ran into an unexpected issue while trying to fetch a large file over HTTP, using the socket lib.
(This relates a bit to #161, where also the reference to some example code was put).

Scenario:

  • I am opening a socket using uSockConnect to my own HTTP server (non SSL)
  • I set the socket idle timeout using uSockOptionSet(sock, U_SOCK_OPT_LEVEL_SOCK, U_SOCK_OPT_RCVTIMEO, ..) to 10s
  • Then write a GET request to get a big file (217kB), using uSockWrite
  • I then, in a loop, call uSockRead until either it returns and error (-1) or I get all the data I want

The issue:

Sometimes this works..
But sometimes, it doesn't, and in this case, I see unexpected things:

In the ubxlib log, I see:

0:01:24:783 - AT+USORD=0,1024
0:01:24:898 -
0:01:24:938 -
0:01:24:938 - +USORD: 0,1024,"o GET file "%s"; status code was %d![00][00][00]WARNING: APPLYING HARDCODED DEFAULTS![00][00][00][00][00][00]web.be[00][00]%s[00][00]chat.layers.be:8080[00]HCJx9y-_RKepara6bQJ3pw[00][00]afe23915adf799a51e5792ba3ea27c7a9385d76b[00][00][00][00]%u: %s Built %s %s[00][00]15:21:56[00][00][00][00]Nov 13 2023[00]internet[00][00][00][00]m2m.business[00][00][00][00]cmnet[00][00][00]cmATclientreadbytes-READrequest 1 bytes
0:01:24:948 - ATclientreadbytes-result 1 bytes
0:01:24:948 - ATclientreadbytes-READrequest 1024 bytes
0:01:24:948 - wap[00][00][00][00]3gnet[00][00][00]uninet[00]uninet[00]uninet[00][00]internet.t-mobile[00]t-mobile[00]tm[00][00]ibox.tim.it[00][00][00][00]web.omnitel.it[00][00][00][00]internet.wind.biz[00][00][00][00]open.softbank.ne.jp[00]opensoftbank[00]ebMNuX1FIHg9d3DA[00]smile.world[00]dna1trop[00]so2t3k3m2a[00][00]bmobilewap[00][00][00]mpr2.bizho.net[00]Mopera U[00][00]bmobile.ne.jp[00]bmobile@wifi2[00]bmobile[00][00]public4.m2minternet.com[00][00][00][00]internet.simobil.si[00][00][00][00]internet.tusmobil.si[00][00][00][00]online.telia.se[00][00][00][00]services.telenor.se[00][00][00][00]mobileinternet.tele2.se[00][00][00][00]gprs.swisscom.ch[00][00][00][00]internet[00][00][00]click[00][00][00][00]mobile.o2.co.uk[00]faster[00]web[00]mobile.o2.co.uk[00]bypass[00]web[00]payandgo.o2.co.uk[00]payandgo[00]payandgo[00][00]internet[00]web[00]web[00]pp.vodafone.co.uk[00]wap[00]wap[00][00]three.co.uk[00][00][00][00]jtm2m[00][00][00][00]epc.tmobile.com[00][00][00]fast.tmobile.com[00][00][00][00]phone[00][00][00]wap.cingular[00]WAP@CINGULARGPRS.COM[00]CINGULAR1[00]isp.cingular[00]ISP@CINGULARGPRS.COM[00]CINGULAR1[00][00]n"
0:01:25:033 -
0:01:25:033 - OK
0:01:25:033 -
0:01:25:033 - +UUSORD: 0,13448
0:01:25:034 -
0:01:25:034 - +UUSOCL: 0
0:01:25:034 - ATclientreadbytes-result 1024 bytes
0:01:25:035 - AT+USORD=0,1024
0:01:25:149 -
0:01:25:189 - +CME ERROR: Operation not allowed

Note the +UUSOCL: 0 URC here. This seems to indicate that the remote closed the session.
However, since I run my own HTTP server, I took some wireshark captures of the actual data. This does not align..

image

6.9s after the first packet was captured, the big transfer starts from the LTE modem.
In my logs, this corresponds to the +- 34 second mark:

0:00:33:594 - +USOCR: 0
0:00:33:594 -
0:00:33:594 - OK
0:00:33:595 - U_SOCK: socket created, descriptor 0, network handle 0x20019848, socket handle 0.
0:00:33:595 - U_SOCK: option set command 4102:0x0fff called on descriptor 0 with value [4 int32s] 10 (0x0000000a) 0 (0x00000000) 500000 (0x0007a120) 0 (0x00000000)
0:00:33:596 - U_SOCK: timeout for socket descriptor 0 set to 10500 ms.
0:00:33:596 - U_SOCK: connecting socket to "84.199.6.178:8080"...
0:00:33:596 - AT+USOCO=0,"84.199.6.178",8080
0:00:33:700 -
0:00:33:780 - OK
0:00:33:780 - U_SOCK: socket with descriptor 0, network handle 0x20019848, socket handle 0, is connected to address "84.199.6.178:8080".
0:00:33:780 - AT+USOWR=0,128
0:00:33:883 -
0:00:33:913 - @get /getOTA.html
0:00:33:974 -
0:00:33:975 - [00][9a][03][08]'[00] [a0]'[00] [00][00][00][00][09][00][00][00]<)[00] [f0]'[00] [a3][99][03][08]Cd[03][08][a3][99][03][08]'[00] [f0]'[00] [1d][9a][03][08]-[a9][03][08][f0]'[00] [03][00][00][00]<)[00] [f4][eb][03][08][01][00][00][00]0[a9][03][08][c7][9a][03][08]'[00] [94]([00] [00][00][00][00]p([00] [84][00][00][00][ff][ff][ff][ff]
0:00:34:019 - +USOWR: 0,128
0:00:34:019 -
0:00:34:020 - OK
0:00:34:020 - AT+USORD=0,0
0:00:34:122 -
0:00:34:162 -
0:00:34:162 -
0:00:34:162 -
0:00:34:162 - +USORD: 0,0
0:00:34:163 - OK
0:00:34:163 -
0:00:34:238 - +UUSORD: 0,1376

image

I then see, that the module sends a FIN, 35 seconds after the fetching of data!

This is totally unexpected. In fact, on AT, the transfer just keeps going on and on (it's just slow), until the 1min 25s mark, when it finally reports a remote-initiated closure.

So, I see two main things here:

  • The modem sends a FIN packet on itself ==> This is a major issue, and it is not clear to my why this happens. (There is the possibility that this is caused by some firewall, middlebox, actively intervening somehow on the connection, but that sounds unlikely. )
  • The modem still allows the data to be read out .. until a certain point, when it then reports the connection being lost

If I'd have to guess, this "could" be caused by something like the following:

  • The modem sees 30s of idle traffic over cellular (since the UART is just so slow)
  • Somehow this seems to trigger something to decide to cut the connection. A FIN is sent to the remote side, and the modem itself only records the connection being closed after 1.3 minutes..

But this is unexpected no?
Am I doing anything wrong here?

Thanks for your feedback,
Arnout

@arnoutdekimo
Copy link
Author

[
socket-error-faileddownload.zip
In attachment a salae capture of this

@arnoutdekimo
Copy link
Author

Pff, and now I am getting even different behavior:

  • Again, on the internet side, everything looks perfect (
    superslowconn.zip

  • But on the AT interface, now the data comes in increadibly slowly. I didn't start the salae capture from the start, but you can see what happens here anyway (
    superslow-connection.zip

The logs look like this now:

0:01:02:244 - U_CELL_SOCK: found it at "84.199.6.178".
0:01:02:244 - AT+USOCR=6
0:01:02:345 -
0:01:02:425 - +USOCR: 0
0:01:02:425 -
0:01:02:426 - OK
0:01:02:426 - U_SOCK: socket created, descriptor 0, network handle 0x20019890, socket handle 0.
0:01:02:426 - U_SOCK: option set command 4102:0x0fff called on descriptor 0 with value [4 int32s] 10 (0x0000000a) 0 (0x00000000) 500000 (0x0007a120) 0 (0x00000000)
0:01:02:427 - U_SOCK: timeout for socket descriptor 0 set to 10500 ms.
0:01:02:427 - U_SOCK: connecting socket to "84.199.6.178:8080"...
0:01:02:427 - AT+USOCO=0,"84.199.6.178",8080
0:01:02:531 -
0:01:03:121 - OK
0:01:03:121 - U_SOCK: socket with descriptor 0, network handle 0x20019890, socket handle 0, is connected to address "84.199.6.178:8080".
0:01:03:122 - AT+USOWR=0,128
0:01:03:224 -
0:01:03:254 - @get /getOTA.html
0:01:03:316 -
0:01:03:316 - [00][9a][03][08]'[00] [a0]'[00] [00][00][00][00][07][00][00][00]<)[00] [f0]'[00] [a3][99][03][08]Cd[03][08][a3][99][03][08]'[00] [f0]'[00] [1d][9a][03][08]-[a9][03][08][f0]'[00] [03][00][00][00]<)[00] [f4][eb][03][08][01][00][00][00]0[a9][03][08][c7][9a][03][08]'[00] [94]([00] [00][00][00][00]p([00] [84][00][00][00][ff][ff][ff][ff]
0:01:03:330 - +USOWR: 0,128
0:01:03:350 -
0:01:03:350 - OK
0:01:03:351 - AT+USORD=0,0
0:01:03:453 -
0:01:03:493 -
0:01:03:493 -
0:01:03:493 -
0:01:03:493 - +USORD: 0,0
0:01:03:494 - OK
0:01:03:494 - AT+USORD=0,0
0:01:03:596 -
0:01:03:636 -
0:01:03:636 -
0:01:03:636 -
0:01:03:636 - +USORD: 0,0
0:01:03:637 - OK
0:01:03:637 - AT+USORD=0,0
0:01:03:739 -

...

And then every so ofton data does get through for a moment

0:01:09:356 -
0:01:09:356 -
0:01:09:356 -
0:01:09:356 - +USORD: 0,0
0:01:09:357 - OK
0:01:09:357 -
0:01:09:452 - +UUSORD: 0,1376
0:01:09:453 - AT+USORD=0,1024
0:01:09:564 -
0:01:09:604 -
0:01:09:604 - +USORD: 0,1024,"[00][00][04] [95][05][01][08]I[c4][00][08]i[c4][00][08][89][c4][00][08][a9][c4][00][08][c9][c4][00][08][e5][05][01][08][00][00][00][00][00][00][00][00][00][00][00][00][a1][02][02][08][e9][c4][00][08][00][00][00][00]A[02][02][08][d5][c6][01][08][e5][05][01][08][e5][05][01][08][f9][c4][00][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08]
0:01:09:611 - [c5][00][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08]![c5][00][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08]5[c5][00][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05]ATclientreadbytes-READrequest 1 bytes
0:01:09:616 - ATclientreadbytes-result 1 bytes
0:01:09:616 - ATclientreadbytes-READrequest 1024 bytes
0:01:09:616 - [01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08]I[c5][00][08]][c5][00][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08]q[c5][00][08][00][00][00][00][85][c5][00][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][00][00][00][00][e5][05][01][08][e5][05][01][08][e5][05][01][08][00][00][00][00][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][00][00][00][00][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][00][00][00][00][00][00][00][00][00][00][00][00][e5][05][01][08][e5][05][01][08][00][00][00][00][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01]+[11][f8][01];[01]([bf][9a]B[f7][d0][d0][1a]pG[03]F[13][f8][01]+[00][fb][d1][18][1a][01]8pG[81][f0][00]A[02][e0][00][bf][83][f0][00]C0[b5]O[ea]A[04]O[ea]C[05][94][ea][05][0f][08][bf][90][ea][02][0f][1f][bf]T[ea][00][0c]U[ea][02][0c][7f][ea]d[7f][ea]e[00][f0][e2][80]O[ea]TT[d4][eb]UU[b8][bf]mB[0c][dd],D[80][ea][02][02][81][ea][03][03][82][ea][00][00][83][ea][01][01][80][ea][02][02][81][ea][03][03]6-[88][bf]0[bd][11][f0][00]OO[ea][01]1O[f4][80][1c]L[ea][11]1[02][d0]@ba[eb]A[01][13][f0][00]OO[ea][03]3L[ea][13]3[02][d0]RBc[eb]C[03][94][ea][05][0f][00][f0][a7][80][a4][f1][01][04][d5][f1] [0e]
0:01:09:680 - [db][02][fa][0e][fc]"[fa][05][f2][80][18]A[f1][00][01][03][fa][0e][f2][80][18]C[fa][05][f3]YA[0e][e0][a5][f1] [05][0e][f1] [0e][01]*[03][fa][0e][fc]([bf]L[f0][02][0c]C[fa][05][f3][c0][18]Q[eb][e3]q[01][f0][00]E[07][d5]O[f0][00][0e][dc][f1][00][0c]~[eb][00][00]n[eb][01][01][b1][f5][80][1f][1b][d3][b1][f5][00][1f][0c][d3]I[08][ea1][01]<([bf][b1][f5][80][1f][e9][d2][91][f0][00][0f][04][bf][01]F[00] [b1][fa][81][f3][08][bf] 3[a3][f1][0b][03][b3][f1]"
0:01:09:688 - ATclientreadbytes-result 1024 bytes
0:01:09:688 -
0:01:09:708 - OK
0:01:09:708 -
0:01:09:708 - +UUSORD: 0,352
0:01:09:709 - AT+USORD=0,1024
0:01:09:823 -
0:01:09:863 -
0:01:09:863 - +USORD: 0,352," [02][0c][da][0c]2[08][dd][02][f1][14][0c][c2][f1][0c][02][01][fa][0c][f0]![fa][02][f1][0c][e0][02][f1][14][02][d8][bf][c2][f1] [0c][01][fa][02][f1] [fa][0c][fc][dc][bf]A[ea][0c][01][90]@[e4][1a][a2][bf][01][eb][04]Q)C0[bd]o[ea][04][04][1f]<[1c][da][0c]4[0e][dc][04][f1][14][04][c4][f1] [02] [fa][04][f0][01][fa][02][f3]@[ea][03][00]![fa][04][f3]E[ea][03][01]0[bd][c4][f1][0c][04][c4][f1] [02] [fa][02][f0][01][fa][04][f3]@[ea][03][00])F0[bd]![fa][04][f0])F0[bd][94][f0][00][0f][83][f4][80][13][06][bf][81][f4][80][11][01]4[01]=N[e7][7f][ea]d[18][bf][7f][ea]e)[d0][94][ea][05][0f][08][bf][90][ea][02][0f][05][d0]T[ea][00][0c][04][bf][19]F[10]F0[bd][91][ea][03][0f][1e][bf][00]![00] 0[bd]
[ea]T[05][d1]ATclientreadbytes-READrequest 1 bytes
0:01:09:872 - ATclientreadbytes-result 1 bytes
0:01:09:872 - ATclientreadbytes-READrequest 352 bytes
0:01:09:872 - @[00]IA([bf]A[f0][00]A0[bd][14][f5][80][04]<[bf][01][f5][80][11]0[bd][01][f0][00]EE[f0][fe]AA[f4]p[01]O[f0][00][00]0[bd][7f][ea]d[1a][bf][19]F[10]F[7f][ea]e[1c][bf][0b]F[02]FP[ea][01]4[06][bf]R[ea][03]5[91][ea][03][0f]A[f4][00]!0[bd][00][bf][90][f0][00][0f][04][bf][00]!pG0[b5]O[f4][80]d[04][f1]2[04]O[f0][00][05]O[f0][00][01]P[e7][00][bf][90][f0][00][0f][04][bf][00]!pG0[b5]O[f4][80]d[04][f1]2[04][10][f0][00]E"
0:01:09:898 -
0:01:09:898 - OK
0:01:09:899 - ATclientreadbytes-result 352 bytes
0:01:09:899 - AT+USORD=0,0
0:01:10:002 -
0:01:10:042 -

0:01:10:042 -
0:01:10:042 -
0:01:10:042 - +USORD: 0,0
0:01:10:043 - OK
0:01:10:043 - AT+USORD=0,0
0:01:10:145 -
0:01:10:185 -
0:01:10:185 -
0:01:10:185 -
0:01:10:185 - +USORD: 0,0
0:01:10:186 - OK
0:01:10:186 - AT+USORD=0,0
0:01:10:288 -

The frustrating part is that all seems to go well on the internet side of things, but that on the packetization on the AT interface stuff goes really slow, and I don't understand why?

@RobMeades
Copy link
Contributor

RobMeades commented Nov 14, 2023

Hi there: first thing to say is thanks for the very detailed report, second thing to say is that, since this cellular, there is the cellular network's firewall to consider in the equation; whether it is anything to do with this or not I don't know, just thought I'd mention it. And of course there is a radio path in the way.

Comments/questions:

(1) the +UUSORD: 0,xxxx you can see is the module indicating how many bytes it already has buffered, internally within the module, received from the far end. In your first log there is a lot of data buffered at the point that the +UUSOCL is received:

0:01:25:033 - +UUSORD: 0,13448
0:01:25:034 -
0:01:25:034 - +UUSOCL: 0

This aligns with your suggestion that the process of reading data from the module is lagging the IP traffic significantly. Of course, the IP behaviour within the module will have a different timing to the activity logged at that AT interface; the "wireshark equivalent" is inside the module somewhere.

I don't know how large the module's internal IP buffers are for received data but I'd guess that, at some point, it must stop receiving IP traffic from the server and wait for the MCU to read the stuff it has buffered out, and it may be that your 10 second socket timeout kicks-in. I don't have my copy of Stevens et al to hand, but I would guess that the socket option to do a receive timeout is set locally, at "your" end of the IP link, so if that link does nothing for 10 seconds then the module should terminate the IP link (not sure what else it would do)?

(2) concerning the variability of throughput, it would be worth you putting in a few calls to check what the radio environment is like, in case that is a factor. Just call uCellInfoRefreshRadioParameters() every so often, maybe between packet receives; no need to do more than that, with logging on as you have it, radio stuff will be printed in the log.

@arnoutdekimo
Copy link
Author

Hi,

I don't know how large the module's internal IP buffers are for received data but I'd guess that, at some point, it must stop receiving IP traffic from the server and wait for the MCU to read the stuff it has buffered out, and it may be that your 10 second socket timeout kicks-in. I don't have my copy of Stevens et al to hand, but I would guess that the socket option to do a receive timeout is set locally, at "your" end of the IP link, so if that link does nothing for 10 seconds then the module should terminate the IP link (not sure what else it would do)?

So coming back to this, what the module -should- normally do is the following:

  • It should transparently execute the posix socket calls, and behave just like it were another interface on , say a linux system. This means:
  • When a (TCP) socket open command is sent, it should initiate a connection. TCP connections advertise receive windows. Just like a linux OS, the module could advertise that it has room for e.g. 10 kB of data as buffer, before the connection has to stall until the application has time to consume all data
  • When a write command is sent, it should write the data onto the TCP connection (assuming the server's receive window is not full (server's will never be full - it's more than fast enough))
  • When a read command is sent, it should fetch data from its internal buffer, and send that part over AT. When complete, it should then send a TCP ACK updating its receive window size (saying it can now receive more data)

At no point in time, should the module decide to -close- a connection, without the application explicitly saying to do so, unless malformed TCP packets or whatever were to arrive.

Coming back to your question regarding the socket timeout - This is normally used for a the blocking read call. The read call will normally block forever, until some data is available, some errror on the socket happens (e.g. remote closes the connection), OR a timeout happens. But again, this timeout should never cause a connection to be actively closed by the module/OS.

@arnoutdekimo
Copy link
Author

FYI, this is the log file with after every succesful read, doing the ucell parameter refresh
successfulltransfer.zip

Note that this time it succeeded. I'll run it a couple of more times and add a failure as well

@arnoutdekimo
Copy link
Author

This is a failure:

failure-withcelldbg.zip
socketerrorlog.zip

@RobMeades
Copy link
Contributor

Ah, yes, I should have checked the code, the U_SOCK_OPT_LEVEL_SOCK U_SOCK_OPT_RCVTIMEO is purely a local thing, nothing to do with the module at all.

So we are back to what reason there might be for an +UUSOCL. The signal strength and SNR from your radio parameters calls are all good, and stable as well, so no problem there. I've plotted the amount of data that +UUSORD reports in your socketerrorlog.zip and it is pretty constant, no obvious problem there:

image

There are 192 of them, so likely around 192 kbytes downloaded when the +UUSOCL occurs. I am not aware of a way to query the module for a socket error code once a socket has been closed but then again I wouldn't: I have posted a question to the relevant person internally.

@arnoutdekimo
Copy link
Author

Thanks!

@RobMeades
Copy link
Contributor

RobMeades commented Nov 15, 2023

OK, so the exact person I need, who is specifically expert in the IP stack implementation of SARA-R422, is out of the office until Monday but, generically, I have confirmed that, to get a cause out of the module, will require a debug trace from the debug trace port (USB) of the module. For you to be able to get that, in the SARA-422 case, assuming you have access to the USB port, would unfortunately also need an NDA.

Alternatively, I can try doing exactly the same thing with a SARA-R422 from my desk here and hope the problem occurs for me also: I guess I can just bounce off your server and pull that same file?

@arnoutdekimo
Copy link
Author

Hi,

NDA is no problem - feel free to get in touch directly for that.
You are free to test with my server (as long as it is up - this is just for testing). Note that because of the speedups from #159 I now don't see the issue anymore (for now) (I suspect because the transfer happens in some +- 40s instead of 1.5 minutes).

Kind regards,
Arnout

@RobMeades
Copy link
Contributor

The +UUSOCL issue being duration-related is kinda weird, not what I expected. I guess it could be timing related and that has moved the goal posts. Anyway, please drop me a line, rob.meades@u-blox.com, and we can sort the NDA in any case; might be useful for you to have module logging running.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants