Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sockets get stuck in drain-wait state on ESP32 in AP mode #2297

Open
GaryOtt opened this issue Nov 28, 2022 · 16 comments
Open

Sockets get stuck in drain-wait state on ESP32 in AP mode #2297

GaryOtt opened this issue Nov 28, 2022 · 16 comments
Labels
ESP32 This is only a problem on ESP32-based devices

Comments

@GaryOtt
Copy link
Contributor

GaryOtt commented Nov 28, 2022

EDIT: Scroll down to #2297 (comment) for later test code.


A failure to pipe data to a TCP socket can occur on the ESP32 build. I have not tested UDP sockets.

The socket may get stuck in a drain wait condition for tens of seconds, sometimes minutes, sometimes seemingly infinitely. This continues in ESP-IDF 3.3.6 based builds.


Details of my test:

test_code.zip

In this test, the WiFi is in AP mode and Espruino is running a web server (esp32_code.js). A script running on my development machine (test_http.js) sends HTTP requests repeatedly. Requests are sent 5 seconds after the completion of the last. The content of a file loaded from Storage is piped to the HTTP response.

The web server code keeps a check on the progress of requests and logs a timeout after 30 seconds of inactivity. When everything works fine, there is a continuous stream of data piped to the client but as the below video shows, sometimes it just stops for minutes at a time.

Results: https://youtu.be/4FyuYrBg8ck

0.00 Script is already uploading to ESP32.
0.23 Upload completes.
0.29 MacBook is connected to the WiFi AP hosted by the ESP32.
0.30 I start the node script to make repeated HTTP requests.
0.36 First HTTP request is completed and we see 'Finished' logged.
0.41 Second HTTP request made.
0.44 Some very minor pauses seen.
0.46 Second HTTP completes.
(third, fourth and fifth requests complete perfectly).
1.20 We get some significant pauses.
1.57 30 second timeout elapses.
3.00 Rest of page delivered.
3.13 More pauses.
3.43 30 second timeout elapses.
3.54 End

@MaBecker MaBecker added the ESP32 This is only a problem on ESP32-based devices label Nov 28, 2022
@MaBecker
Copy link
Contributor

Thanks, for posting, hope you find some time to fix it .

@GaryOtt
Copy link
Contributor Author

GaryOtt commented Nov 28, 2022

Yes, I have a few things to attend to first but I plan to be working on this by the end of the week.

@GaryOtt
Copy link
Contributor Author

GaryOtt commented Dec 1, 2022

This issue is client dependent. I experience it when connecting using a MacBook running Big Sir 11.7.1. I have been told of problems with clients using iOS and to a lesser extent Windows. There are no issues when I use my android phone and there were no issues when I used a now dead Dell laptop running Linux Mint.

The problem occurs when the select() call on line 279 of network_esp32.c returns 0. I'm still trying to understand what might cause that.

/// Send data if possible. returns nBytes on success, 0 on no data, or -1 on failure
int net_esp32_send(JsNetwork *net, SocketType socketType, int sckt, const void *buf, size_t len) {
NOT_USED(net);
fd_set writefds;
FD_ZERO(&writefds);
FD_SET(sckt, &writefds);
struct timeval time;
time.tv_sec = 0;
time.tv_usec = 0;
int n = select(sckt+1, 0, &writefds, 0, &time);
if (n==SOCKET_ERROR ) {
// we probably disconnected so just get rid of this
return -1;
} else if (FD_ISSET(sckt, &writefds)) {
int flags = 0;
#if !defined(SO_NOSIGPIPE) && defined(MSG_NOSIGNAL)
flags |= MSG_NOSIGNAL;
#endif
if (socketType & ST_UDP) {
JsNetUDPPacketHeader *header = (JsNetUDPPacketHeader*)buf;
sockaddr_in sin;
sin.sin_family = AF_INET;
sin.sin_addr.s_addr = *(in_addr_t*)&header->host;
sin.sin_port = htons(header->port);
DBG("Send %d %x:%d", len - sizeof(JsNetUDPPacketHeader), *(uint32_t*)&header->host, header->port);
n = (int)sendto(sckt, buf + sizeof(JsNetUDPPacketHeader), header->length, flags, (struct sockaddr *)&sin, sizeof(sockaddr_in)) + sizeof(JsNetUDPPacketHeader);
DBG("Send bytes %d", n);
} else {
n = (int)send(sckt, buf, len, flags);
}
return n;
} else
return 0; // just not ready
}

I have changed the #include statements to match ESP-IDF 3.3.6 example code. It compiles and runs but the problem persists unchanged.

Now:

#include <string.h>

#include "lwip/err.h"
#include "lwip/sockets.h"
#include "lwip/sys.h"
#include <lwip/netdb.h>

#include "network_esp32.h"

@gfwilliams
Copy link
Member

Thanks for the update! Interesting about the client issue - even though they're all just running Node.js?

At least on desktop browsers, I seem to remember having some issue where they could leave the HTTP connection open in the hope that they can make another request. The server now responds with a Connection:close header.

I wonder whether this could be something similar? I guess once the max number of sockets are open, the ESP32 will probably just wait until the last one times out and closes which might produce this behaviour.

I also note your HTTP server code is quite big, and uses Storage. Personally, I'd try and strip it right down to the absolute bare minimum (just creating and sending the Strings directly), and maybe try and work without even using piping. It may be that at some point during that it starts working and might help to find out what the issue is

@GaryOtt
Copy link
Contributor Author

GaryOtt commented Dec 1, 2022

The clients mentioned used a variety of web browsers. The choice of browser does not affect the issue. The node script is just something I wrote to give me a consistent test that is transparent about what is going on and easy to use. It makes sure to finish each request before starting another so as to avoid concurrent connections. Also worth noting that sometimes it occurs on the first or second request.

Yes, you're right. I shall simplify the web server and see what I get. Quite honestly, coming back to this after 6 weeks or so, I've forgotten what testing I've already done.

@MaBecker
Copy link
Contributor

MaBecker commented Dec 1, 2022

Here are the simple tests that had been created by one of the ESP32 developer:

https://github.com/espruino/Espruino/tree/master/targets/esp32/tests

@GaryOtt
Copy link
Contributor Author

GaryOtt commented Dec 1, 2022

This is a much simplified test that I should have shared in the first place. It is exhibiting the same behavior.

//jshint esversion:6
const wifi = require("Wifi");

wifi.startAP(
	"EspruinoESP32",

	{
		authMode: "open"
	},

	(err) => {
		//This is not for meter level errors. This is a system failure.
		console.log("WiFi AP active. IP address:", wifi.getAPIP().ip);

		server = require("http").createServer((req, res) => {
			console.log("HTTP Request received: ", req.url);

			res.writeHead(200, { "Content-Type": "text/plain" });

			let len    = 33074;   //The length of the file
			let offset = 0;
			let chunk  = 0;

			while (offset < len) {
				chunk = Math.min(len - offset, 256);
				offset += chunk;

				res.write(".".padEnd(chunk, "*"));
			}

			res.end();
		});

		server.listen(80);
		console.log("Listening...");
	}
);

@GaryOtt
Copy link
Contributor Author

GaryOtt commented Dec 1, 2022

The test at https://github.com/espruino/Espruino/blob/039ffb54415713e926f31c46c57ec4f170176422/targets/esp32/tests/simple_web_server.js appears much less problematic. There might be some occasional slowing of responses but at least I can say it is not so obviously problematic. I am guessing that is because the response is so small that it never requires a second packet.

My tests replicate problems seen delivering a web interface consisting of a HTML file (33K), a CSS file (17K), a Javascript file (64K) and a SVG file (1K).

@gfwilliams
Copy link
Member

Yes - I'd expect the code in #2297 (comment) to have issues - because JavaScript is expecting to be non-blocking, and by writing 33k of data in one go you're just blocking execution of everything else until sending all the data for that one socket finishes.

... but the piping in your initial socket example should be better - or actually just handling the drain event directly to send another chunk of data: http://www.espruino.com/Reference#l_httpSRs_drain

@GaryOtt
Copy link
Contributor Author

GaryOtt commented Dec 2, 2022

Just a quick note on where I got to.

I don't hold high hopes that there is any JavaScript code that can side step the issue because all methods eventually use net_esp32_send() to transmit data. But I was going to try code that sets up a TCP socket server anyway, just to make sure that no problem is introduced by the HTTP layer. I will try directly using the drain event (as suggested) because it might surprise me.

Something interesting I have seen is that the TCP socket server example in ESP-IDF does not use select() so I need to investigate if select() can safely be discarded like that or if it foregoes flow control completely. Where I left off (I've been stuck in meetings all day), I was trying to find documentation about how the send() function handles a drain wait scenarios. I might find it blocks until it can send, it might be undefined behavior, or it might return an error. There is a viable solution if an error response can differentiate between drain-wait conditions and connection loss.

@gfwilliams
Copy link
Member

It is possible that select actually blocks... On Linux (IIRC - it was a while back), you have to configure select so that it returns immediately regardless of whether the socket is ready or not.

If it hasn't been configured like that on ESP32 I guess that might somehow be able to cause a deadlock (eg it's waiting to send data out, but it can't because buffers are full with data that needs to be read in).

@GaryOtt
Copy link
Contributor Author

GaryOtt commented Jan 3, 2023

For the purpose of recreating this issue, I recommend these scripts:
Socket Drain Wait Test.zip

I have created many different tests. Each test resulted in more questions and more tests, until I eventually run out of avenues of investigation. My testing was conducted on two ESP32 development boards and on a "Data Logger" (a device including an ESP32 from the project which this problem first arose).

My testing was done in the following stages:

  • Recreated problem seen in application in a test script
  • Eliminated hardware as a factor.
  • Confirmed that the problem is not related to the HTTP layer.
  • Confirmed that the rate of problems increases with the number of concurrent connections
  • Confirmed that the rate of problems increases with other background activity

My final attempts to fix the problem were to try to pin the WiFi task to Core 1. This achieved no appreciable difference and I subsequently found a comment on a forum somewhere that the ESP-IDF option that I was trying to use, although selectable is not implemented.

That's where I got to before the missus ordered me to have a break over Christmas. I intend to return to this in the coming weeks.

The code, raw data and notes for all my tests are here: Drain Wait Problem.zip

@gfwilliams
Copy link
Member

Just wondering - I know you're using ESP32 as an AP - do you get the same behaviour when it's just connected to an existing WiFi network?

@GaryOtt
Copy link
Contributor Author

GaryOtt commented Nov 15, 2023

Hi Gordon, I have not tried the ESP32 in station mode. I will see if I can try that today.

@GaryOtt
Copy link
Contributor Author

GaryOtt commented Nov 15, 2023

The problem is not at all apparent in station mode.

Method:

I had an ESP32 respond to new TCP connections by transferring 40,000 bytes and then closing the connection. The client opened 5 connections initially and another each time a connection was closed.

In station mode, I left the test running for 45 minutes without observing any problems. in the order of 150,000 connections were opened.

In AP mode, timeout errors were observed after about 3 minutes.

Just as I was ready to report my results, it occurred to me that I had changed more than one variable. I have previously seen that the client device can impact the severity of the problem with Apple devices being most affected. So I repeated the station mode test after setting up a MacBook as a hotspot. Station mode again worked perfectly for a full hour.

@gfwilliams
Copy link
Member

Ok, thanks - that's really interesting. So I think you did narrow it down pretty well before, but that basically clinches it - that it's something buried inside the ESP-IDF, so hopefully an upgraded version may fix this.

It also explains why more people weren't reporting it being an issue - I think the normal use case for most people is when it's in station mode.

@MaBecker MaBecker changed the title Sockets get stuck in drain-wait state on ESP32 Sockets get stuck in drain-wait state on ESP32 in AP mode Apr 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ESP32 This is only a problem on ESP32-based devices
Projects
None yet
Development

No branches or pull requests

3 participants