Priam on new ASG instance tells Cassandra to gossip with dead ASG EC2 instance #686

Open
amr46 opened this issue Jul 3, 2018 · 3 comments

amr46 commented Jul 3, 2018

Setup:

  • Running Cassandra 3.11 and Priam 3.11.
  • ASG with 3 nodes, one in each of the AZs 1a, 1b, and 1c.
  • Manually terminated the node in 1c via the AWS console.
  • A new node comes up shortly after. Priam's log on the new node shows that it detected the old 1c node as dead.
  • is_replace_token returns true, and get_replaced_ip returns the IP of the dead node.
  • The old node is marked as '-dead' in AWS SimpleDB (sdb) and shows as DN in nodetool status.
  • The new node starts Priam, which TRIES to start Cassandra but keeps telling it that the old 1c node is still in gossip. Cassandra cannot connect to the downed node and aborts on startup.
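
The two Priam values mentioned in the setup can be read on the new node before Cassandra starts. A minimal Python sketch, assuming Priam serves its REST API under http://localhost:8080/Priam/REST/v1/cassconfig and returns plain text; the base URL, the response format, and the helper names are assumptions, not something confirmed in this thread:

```python
from typing import Callable
from urllib.request import urlopen

# Assumed base URL for Priam's REST API on the local node; adjust to your deployment.
PRIAM_BASE = "http://localhost:8080/Priam/REST/v1/cassconfig"


def http_get(url: str) -> str:
    """Fetch a URL and return the response body as stripped text."""
    with urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8").strip()


def replacement_state(fetch: Callable[[str], str] = http_get) -> dict:
    """Return what Priam reports for is_replace_token and get_replaced_ip."""
    is_replace = fetch(f"{PRIAM_BASE}/is_replace_token").lower() == "true"
    # Only ask for the replaced IP when Priam says this node is a replacement.
    replaced_ip = fetch(f"{PRIAM_BASE}/get_replaced_ip") if is_replace else ""
    return {"is_replace_token": is_replace, "replaced_ip": replaced_ip}
```

Calling replacement_state() once on first boot and again after the tomcat8 restart would record the change in is_replace_token described in this issue.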

Fix:

  • Running service tomcat8 restart on the new node fixes the problem.
  • After the restart, is_replace_token returns false, so no IP is replaced and no gossip with the dead node is attempted.
  • After the restart, nodetool status on the other nodes shows the new node in place of the 'DN' node.
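
To confirm from a surviving node whether the old instance is still listed as DN, the nodetool status output can be parsed. A rough Python sketch; the sample output, addresses, and function names are made up for illustration, and only the two-letter status codes (Up/Down followed by Normal/Leaving/Joining/Moving) are relied on:

```python
# Illustrative output only; the addresses, loads, and host IDs are invented.
SAMPLE = """\
Datacenter: us-east
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns  Host ID  Rack
UN  10.0.1.5   1.2 GiB   1       ?     aaaa     1a
UN  10.0.2.6   1.1 GiB   1       ?     bbbb     1b
DN  10.0.3.7   1.3 GiB   1       ?     cccc     1c
"""


def nodes_by_status(status_output: str) -> dict:
    """Map each node address in `nodetool status` output to its two-letter code."""
    result = {}
    for line in status_output.splitlines():
        parts = line.split()
        # Node rows start with <U|D><N|L|J|M>, followed by the address.
        if (len(parts) >= 2 and len(parts[0]) == 2
                and parts[0][0] in "UD" and parts[0][1] in "NLJM"):
            result[parts[1]] = parts[0]
    return result


def down_nodes(status_output: str) -> list:
    """Return the addresses reported as Down/Normal (DN)."""
    return [addr for addr, code in nodes_by_status(status_output).items() if code == "DN"]
```

Running down_nodes on the output from a live node before and after the restart shows whether the dead 1c address has dropped out of the ring.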

Questions:

  • Why is Cassandra not able to replace the dead node?
  • Why is Cassandra able to start successfully after the Priam restart, ignoring the dead node?

amr46 commented Jul 12, 2018

I think that the protocol used by Priam is incorrect:
If is_replace = true and Priam is attempting to replace a downed node, that node might be unavailable altogether. Priam has explicitly marked this downed node as dead, so no communication with it should be expected.

When Cassandra is started in replace mode, it attempts to talk to the downed node and fails whenever the node doesn't exist in gossip. Hence the replace can never happen without manual intervention.
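
If that reading is right, one possible guard is to request replace mode only while the dead IP is still known to gossip on the live nodes. A hypothetical Python sketch of that decision; this is not Priam's actual logic, the function name and the gossip_endpoints input are invented for illustration, and it assumes the replace address is handed to Cassandra through the cassandra.replace_address system property:

```python
def replace_jvm_args(is_replace: bool, replaced_ip: str, gossip_endpoints: set) -> list:
    """Return extra JVM args for Cassandra startup.

    Replace mode is requested only when the dead IP is still present in the
    set of endpoints that the live nodes know about through gossip.
    """
    if is_replace and replaced_ip and replaced_ip in gossip_endpoints:
        return [f"-Dcassandra.replace_address={replaced_ip}"]
    # Otherwise start normally, which matches what was observed after the restart.
    return []
```

For example, replace_jvm_args(True, "10.0.3.7", {"10.0.1.5", "10.0.2.6"}) returns an empty list, so Cassandra would start without trying to replace an address that gossip no longer knows.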

@arunagrawal84 thx for helping out in the past, could you comment on this?

arunagrawal84 (Contributor) commented

@amr46 can you please confirm whether the other 2 nodes (in the other AZs) are marked as seed nodes as well?


amr46 commented Aug 25, 2018

I will have to replicate the environment first. I will get back to you as soon as I can, in the week of 9/3, @arunagrawal84.
