twemproxy only sees one of the memcached servers in the pool #534

Closed
jslusher opened this issue Aug 18, 2017 · 3 comments
jslusher commented Aug 18, 2017

It's my understanding that when a server in the twemproxy pool gets ejected, the other server in the pool should still be available for caching. It seems that when I take out memcached-1 only, the proxy itself becomes unavailable. If I take out memcached-2 from the pool, everything operates normally, except that there doesn't seem to be any indication in the logs that the server leaves or returns to the pool.

I have tested that both memcached servers are available directly. If I put one or the other memcached server by itself in the pool configuration, it's available through the proxy, but only memcached-1 is available if I have them both in the pool. I've tried ordering them differently and it doesn't seem to make a difference. A tcpdump only ever shows traffic to memcached-1 when they are both in the pool. When nutcracker is restarted, I only see ARP traffic going to one of the two servers, never both.

To reproduce:
(nutcracker version 0.4.1 on CentOS 7)
/etc/nutcracker/nutcracker.yml

bad_pool:
  listen: 127.0.0.1:22122
  hash: fnv1a_64
  distribution: ketama
  auto_eject_hosts: true
  timeout: 400
  server_retry_timeout: 30000
  server_failure_limit: 3
  servers:
   - 10.10.10.33:11211:1 memcached-1
   - 10.10.10.34:11211:1 memcached-2

telnet 127.0.0.1 22122

Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
set testing 1 0 3
one
STORED

ssh 10.10.10.33:

sudo systemctl stop memcached

telnet console:

get testing
SERVER_ERROR Connection refused
Connection closed by foreign host.

nutcracker logs for sequence:

[2017-08-18 11:08:46.894] nc_core.c:43 max fds 1024 max client conns 989 max server conns 3
[2017-08-18 11:08:46.894] nc_stats.c:851 m 4 listening on '0.0.0.0:22222'
[2017-08-18 11:08:46.894] nc_proxy.c:217 p 6 listening on '127.0.0.1:22122' in memcache pool 0 'bad_pool' with 2 servers
[2017-08-18 11:08:56.457] nc_proxy.c:377 accepted c 8 on p 6 from '127.0.0.1:41122'
[2017-08-18 11:09:11.595] nc_request.c:96 req 1 done on c 8 req_time 1160.716 msec type REQ_MC_SET narg 2 req_len 24 rsp_len 8 key0 'testing' peer '127.0.0.1:41122' done 1 error 0
[2017-08-18 11:14:00.115] nc_response.c:118 s 9 active 0 is done
[2017-08-18 11:14:00.116] nc_core.c:237 close s 9 '10.50.20.35:11211' on event 00FF eof 1 done 1 rb 8 sb 24
[2017-08-18 11:14:06.887] nc_core.c:237 close s 9 '10.50.20.35:11211' on event FFFFFF eof 0 done 0 rb 0 sb 0: Connection refused
[2017-08-18 11:14:06.887] nc_request.c:96 req 4 done on c 8 req_time 0.597 msec type REQ_MC_GET narg 2 req_len 13 rsp_len 33 key0 'testing' peer '127.0.0.1:41122' done 1 error 0
[2017-08-18 11:14:06.887] nc_core.c:237 close c 8 '127.0.0.1:41122' on event FF00 eof 0 done 0 rb 37 sb 41: Operation not permitted

rposky commented Aug 21, 2017

This sounds like expected behavior for twemproxy, which does not retry failed requests against the remaining server members. The client will need to respond appropriately to such failures, perhaps by retrying the request.
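For illustration only (a sketch, not anything in twemproxy itself), a client speaking the memcached text protocol to the proxy listener from the config above could retry along these lines:

import socket
import time

PROXY = ("127.0.0.1", 22122)  # nutcracker listener from the pool config above

def get_with_retry(key, attempts=4, delay=0.2):
    """Issue a memcached-text-protocol GET via the proxy, retrying on failure."""
    for _ in range(attempts):
        try:
            with socket.create_connection(PROXY, timeout=1) as s:
                s.sendall(f"get {key}\r\n".encode())
                reply = s.recv(4096).decode()
            if not reply.startswith("SERVER_ERROR"):
                return reply  # "VALUE ... END" on a hit, "END" on a miss
        except OSError:
            pass  # proxy dropped the connection; treat as a failed attempt
        time.sleep(delay)  # repeated failures should count toward server_failure_limit
    return None

print(get_with_retry("testing"))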

The "testing" key is mapped to a server in the pool, which would explain why you can deactivate "memcached-2" to no apparent effects, since "memcached-1" is selected to service the request. The pool is configured to eject hosts after 3 errors, so in the testing scenario that you have provided, I would expect the 4th request for key "testing" to evaluate against "memached-2".

@TysonAndre
Collaborator

Also, server_retry_timeout: 30000 means it will take 30 seconds before twemproxy attempts to reconnect to an ejected server; until those 30 seconds have elapsed, all traffic will be sent to the other server.
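For example (illustrative value only), lowering that window in the pool definition makes an ejected server rejoin sooner, at the cost of more frequent reconnect attempts:

bad_pool:
  listen: 127.0.0.1:22122
  hash: fnv1a_64
  distribution: ketama
  auto_eject_hosts: true
  timeout: 400
  server_retry_timeout: 2000   # retry an ejected server after 2s instead of 30s
  server_failure_limit: 3
  servers:
   - 10.10.10.33:11211:1 memcached-1
   - 10.10.10.34:11211:1 memcached-2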

I think the planned heartbeat/failover patches in #608 may result in faster reconnections when a server recovers, once those changes are merged into twitter/twemproxy, though that may change before the planned 0.6.0 release.

@TysonAndre
Collaborator

If twemproxy didn't reconnect after more than 30 seconds, the changes planned for 0.6.0 also refactor the reconnection logic significantly and may end up fixing that.

0.5.x also fixes some memory corruption errors.
