twemproxy only sees one of the memcached servers in the pool #534

Closed
jslusher opened this issue Aug 18, 2017 · 3 comments
jslusher commented Aug 18, 2017

It's my understanding that when a server in the twemproxy pool gets ejected, the other server in the pool should still be available for caching. It seems that when I take out memcached-1 only, the proxy itself becomes unavailable. If I take out memcached-2 from the pool, everything operates normally, except that there doesn't seem to be any indication in the logs that the server leaves or returns to the pool.

I have tested that both memcached servers are available directly. If I put one or the other memcached server by itself in the pool configuration, it's available through the proxy, but only memcached-1 is available if I have them both in the pool. I've tried ordering them differently and it doesn't seem to make a difference. A tcpdump only ever shows traffic to memcached-1 when they are both in the pool. When nutcracker is restarted, I only see ARP traffic going to one of the two servers, never both.

To reproduce:
(nutcracker version 0.4.1 on CentOS 7)
/etc/nutcracker/nutcracker.yml

bad_pool:
  listen: 127.0.0.1:22122
  hash: fnv1a_64
  distribution: ketama
  auto_eject_hosts: true
  timeout: 400
  server_retry_timeout: 30000
  server_failure_limit: 3
  servers:
   - 10.10.10.33:11211:1 memcached-1
   - 10.10.10.34:11211:1 memcached-2

telnet 127.0.0.1 22122

Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
set testing 1 0 3
one
STORED

ssh 10.10.10.33:

sudo systemctl stop memcached

telnet console:

get testing
SERVER_ERROR Connection refused
Connection closed by foreign host.

nutcracker logs for sequence:

[2017-08-18 11:08:46.894] nc_core.c:43 max fds 1024 max client conns 989 max server conns 3
[2017-08-18 11:08:46.894] nc_stats.c:851 m 4 listening on '0.0.0.0:22222'
[2017-08-18 11:08:46.894] nc_proxy.c:217 p 6 listening on '127.0.0.1:22122' in memcache pool 0 'bad_pool' with 2 servers
[2017-08-18 11:08:56.457] nc_proxy.c:377 accepted c 8 on p 6 from '127.0.0.1:41122'
[2017-08-18 11:09:11.595] nc_request.c:96 req 1 done on c 8 req_time 1160.716 msec type REQ_MC_SET narg 2 req_len 24 rsp_len 8 key0 'testing' peer '127.0.0.1:41122' done 1 error 0
[2017-08-18 11:14:00.115] nc_response.c:118 s 9 active 0 is done
[2017-08-18 11:14:00.116] nc_core.c:237 close s 9 '10.50.20.35:11211' on event 00FF eof 1 done 1 rb 8 sb 24
[2017-08-18 11:14:06.887] nc_core.c:237 close s 9 '10.50.20.35:11211' on event FFFFFF eof 0 done 0 rb 0 sb 0: Connection refused
[2017-08-18 11:14:06.887] nc_request.c:96 req 4 done on c 8 req_time 0.597 msec type REQ_MC_GET narg 2 req_len 13 rsp_len 33 key0 'testing' peer '127.0.0.1:41122' done 1 error 0
[2017-08-18 11:14:06.887] nc_core.c:237 close c 8 '127.0.0.1:41122' on event FF00 eof 0 done 0 rb 37 sb 41: Operation not permitted

rposky commented Aug 21, 2017

This sounds like expected behavior for twemproxy, which does not retry failed requests against the remaining server members. The client will need to respond appropriately to such failures, perhaps by retrying the request.
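For illustration only (a sketch, not anything in twemproxy itself), a client speaking the memcached text protocol to the proxy listener from the config above could retry along these lines:

import socket
import time

PROXY = ("127.0.0.1", 22122)  # nutcracker listener from the pool config above

def get_with_retry(key, attempts=4, delay=0.2):
    """Issue a memcached-text-protocol GET via the proxy, retrying on failure."""
    for _ in range(attempts):
        try:
            with socket.create_connection(PROXY, timeout=1) as s:
                s.sendall(f"get {key}\r\n".encode())
                reply = s.recv(4096).decode()
            if not reply.startswith("SERVER_ERROR"):
                return reply  # "VALUE ... END" on a hit, "END" on a miss
        except OSError:
            pass  # proxy dropped the connection; treat as a failed attempt
        time.sleep(delay)  # repeated failures should count toward server_failure_limit
    return None

print(get_with_retry("testing"))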

The "testing" key is mapped to a server in the pool, which would explain why you can deactivate "memcached-2" to no apparent effects, since "memcached-1" is selected to service the request. The pool is configured to eject hosts after 3 errors, so in the testing scenario that you have provided, I would expect the 4th request for key "testing" to evaluate against "memached-2".

@TysonAndre
Collaborator

Also, server_retry_timeout: 30000 means it will take 30 seconds before twemproxy attempts to reconnect to an ejected server; until those 30 seconds have elapsed, all traffic will be sent to the other server.
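For example (illustrative value only), lowering that window in the pool definition makes an ejected server rejoin sooner, at the cost of more frequent reconnect attempts:

bad_pool:
  listen: 127.0.0.1:22122
  hash: fnv1a_64
  distribution: ketama
  auto_eject_hosts: true
  timeout: 400
  server_retry_timeout: 2000   # retry an ejected server after 2s instead of 30s
  server_failure_limit: 3
  servers:
   - 10.10.10.33:11211:1 memcached-1
   - 10.10.10.34:11211:1 memcached-2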

I think the planned heartbeat/failover patches in #608 may result in faster reconnections when a server recovers, once those changes are merged into twitter/twemproxy, though that may change before the planned 0.6.0 release.

@TysonAndre
Collaborator

If twemproxy didn't reconnect after more than 30 seconds, the changes planned for 0.6.0 also refactor the reconnection logic significantly and may end up fixing that.

0.5.x also fixes some memory corruption errors.
