
Support: Configuration issues with multiple nodes #20

Open
justinfx opened this issue Mar 3, 2020 · 19 comments

@justinfx
Contributor

justinfx commented Mar 3, 2020

I have a number of questions about the Config, now that I actually have Olric embedded in my application and working, and I am spinning up 2 local instances as a cluster experiment.

Question 1
I've found (just as you documented) that the memberlist configuration is complicated. There are a lot of ports that need to be allocated for each instance so that it doesn't conflict with another instance and so that the instances can actually connect into a cluster. For each instance on the same host I have to set:

  1. Config.Name to something like 0.0.0.0:<unique port>
  2. Config.MemberlistConfig.BindPort to a unique port
  3. Config.Peers to point at the unique bind port of the other instance

Do I need to worry about Config.MemberlistConfig.AdvertisePort? It seems to not conflict when I don't adjust it on either host. But it WILL conflict and error if I am not using peers and two node instances on the same host use the same AdvertisePort.
What is the difference between the Name port and the Bind port, and do I really need to make sure to have 2 unique ports per instance?

Question 2
For Peers, is that just a seed list where I only have to point it at one other node for the new node to connect to the entire cluster? Or does every node need a complete list of every other node, like a memcached configuration?
When I start NodeA, I may not have started it with a known peer list. Then I start NodeB with a Peer list pointing at NodeA (great!). When I start NodeA up again, I need to make sure to point it at NodeB (I think?).
I'm hoping I only need a single other seed node in the peer list. My goal would be to somehow point the peer list at a single service discovery endpoint to find another node in the active cluster. Otherwise it would be more difficult to dynamically scale the service if you always need to know the address of the other nodes when you scale.
Aside from the seed node question, the node discovery problem may be outside the scope of Olric. What I would likely end up doing is having each Olric node register itself with Consul, and then check Consul for one other node to use as its peer list.

Question 3
When I start local NodeA and NodeB, I can confirm that an item cached on NodeA can be retrieved by NodeB. However I can't seem to find the configuration option that gets NodeA to actually back up the item to NodeB. That is, NodeB isn't storing the cached item, so when NodeA goes down, the next request to NodeB is a cache miss. What is the combination of config options that will result in backing up the cached item to at least one other node?

Question 4
When I cache an item into NodeA, and then retrieve it for the first time in NodeB, the gob encoder in NodeB spits out an error gob: name not registered for interface: "*mylib.MyCachedItem", which results in a cache miss. When it then fills the cache, subsequent requests will now work correctly since the gob encoder knows the type. To fix this, I had to add this to my application:

import "encoding/gob"

func init() {
	// Register the concrete type so gob can decode it behind an interface value.
	gob.Register(&MyCachedItem{})
}

Should this be documented somewhere in Olric? The fact that it uses gob is kind of an implementation detail.

Thanks for the support!

@buraksezer buraksezer self-assigned this Mar 3, 2020
@buraksezer buraksezer added the question Further information is requested label Mar 3, 2020
@buraksezer
Owner

Answer for Q2

For Peers, is that just a seed list where I only have to point it at one other node for the new node to connect to the entire cluster? Or does every node need a complete list of every other node, like a memcached configuration?

A new node needs only one alive member to discover the entire cluster. This feature is provided by memberlist.

When I start NodeA up again, I need to make sure to point it at NodeB (I think?).

I didn't test this but I think NodeA cannot discover NodeB again in this scenario. We need an automatic peer discovery system here.

I'm hoping I only need a single other seed node in the peer list.

A single alive node should be enough for discovery, but it's good to maintain a static list of peers if possible.

Clearly, we need to have a service discovery integration here. I may work on a possible Consul integration. It's already in my task list for v0.3.x. It looks easy to implement: https://sreeninet.wordpress.com/2016/04/17/service-discovery-with-consul/

@buraksezer
Owner

Answer for Q3

What is the combination of config options that will result in backing up the cached item to at least one other node?

Backup count is controlled by Config.ReplicaCount. It's 1 by default. Feel free to set it to 2 if you want a replica of every entry in the cache.
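
For example, a minimal sketch of setting this on an embedded member (config.New("lan") and the field name come from this thread; treat it as an illustration, not the definitive setup):

import "github.com/buraksezer/olric/config"

func replicatedConfig() *config.Config {
	c := config.New("lan")
	c.ReplicaCount = 2 // keep the primary copy plus one backup on another node
	return c
}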

@buraksezer
Owner

Answer for Q4

the gob encoder in NodeB spits out an error gob: name not registered for interface: "*mylib.MyCachedItem",

Good catch! Actually Olric registers these structs[0] before encoding but the remote side also needs to register the structs before decoding. I missed this detail. I'm going to document this behaviour of Gob.

[0] https://github.com/buraksezer/olric/blob/master/serializer/serializer.go#L47

Answer for Q1

Config.Name to something like 0.0.0.0:<unique port>

Config.Name is the name of this server in the cluster. It has to be unique. It can be an IP address (0.0.0.0:unique_port is still valid) or a domain name (ip-172-31-40-220.eu-west-1.compute.internal:3320 or olric1.myserver.com:3320). Olric also runs a TCP server on this host:port. I preferred this design to avoid generating and storing random IDs for nodes.

What do you think? Would you like to give more details about your deployment scenarios?

Config.MemberlistConfig.BindPort to a unique port

Config.MemberlistConfig.BindPort is for memberlist. The default port is 3322.

Config.Peers to point at the unique bind port of the other instance

A list of domain names or IP addresses of the other nodes. Like this: <node1.olric:3322, node2.olric:3322>. Please note that the port number is used by memberlist.

Do I need to worry about Config.MemberlistConfig.AdvertisePort?

This is a bit confusing. I'm going to prepare a more detailed answer on this topic when I have time. Currently you should use the same values for AdvertiseAddr and AdvertisePort.

What is the difference between the Name port and the Bind port, and do I really need to make sure to have 2 unique ports per instance?

Config.MemberlistConfig.BindPort is used by HashiCorp's memberlist. The default BindPort is 3322.

Config.Name is used by Olric for communication between clients and the rest of the cluster. It should be unique in the entire cluster. It's good to use a domain name or IP address as Config.Name.
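
To make the port layout concrete, here is a minimal sketch of a second embedded instance on the same host (the field names are the ones discussed in this issue; the ports and addresses are just examples, and olric.New/Start reflect the embedded API as I understand it):

package main

import (
	"log"

	"github.com/buraksezer/olric"
	"github.com/buraksezer/olric/config"
)

func main() {
	// Second instance on the same host: both ports must differ from the first instance's.
	c := config.New("lan")
	c.Name = "127.0.0.1:3321"            // Olric's own TCP server: clients + node-to-node traffic
	c.MemberlistConfig.BindPort = 3323   // memberlist gossip / failure-detection port
	c.Peers = []string{"127.0.0.1:3322"} // memberlist port of the already-running instance

	db, err := olric.New(c)
	if err != nil {
		log.Fatal(err)
	}
	// Start runs the server and blocks until Shutdown is called.
	if err := db.Start(); err != nil {
		log.Fatal(err)
	}
}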

I may add more details for Q1.

Thank you!

@justinfx
Contributor Author

justinfx commented Mar 3, 2020

Thanks for confirming that the peer list is just a bootstrap list, and for suggesting that you might look into Consul service discovery! Very helpful to know.

Backup count is controlled by Config.ReplicaCount. It's 1 by default. Feel free to set it to 2 if you want a replica of every entry in the cache.

I tried setting the count to 2 and performing the following steps:

  1. Start NodeA and allow first processing request to generate cached item
  2. Start NodeB and make same request. NodeB finds distributed cached item from NodeA
  3. Wait a bit and notice some logging output about Moving DMap: ... and memberlist: Initiating push/pull sync on both nodes
  4. Kill NodeA
  5. Re-run request on NodeB and notice a cache miss and re-fill

Is there some long interval I need to wait for the async replication? The stats on NodeB don't end up reflecting the same size as NodeA to indicate a replication of the cache item.

What do you think? Would you like to give more details about your deployment scenarios?

This current cluster test isn't 100% realistic of a production deployment since the nodes would run on different hosts. However, sometimes I run a low-configured dev instance alongside a production instance on the same node. So when dealing with the embedded olric configuration, every extra unique port that has to be claimed translates into another port configuration option I have to expose through my app and keep track of to avoid startup clashes. These olric ports are in addition to my primary application tcp port... so it's... a lot of ports :-)
My short term production deployment will be on different hosts. My goal after that is to set this up in Kubernetes as a service that can auto-scale. That means it will need to be able to spin up between 1-N instances of my app (with olric embedded) to accommodate load. These will need to be able to look at Consul or a service discovery registry at startup to know the value of the peer list to use. That way they never need to actually know the peer ports statically. My own app actually uses nats.io for its peer discovery. When instances start up, they advertise to a channel and can get replies about the other members.

@justinfx
Contributor Author

justinfx commented Mar 4, 2020

Config.Name is the name of this server in the cluster. It has to be unique. It can be an IP address (0.0.0.0:unique_port is still valid) or a domain name (ip-172-31-40-220.eu-west-1.compute.internal:3320 or olric1.myserver.com:3320). Olric also runs a TCP server on this host:port. I preferred this design to avoid generating and storing random IDs for nodes.

Is this Name/port broadcasted to the other nodes through the member communication? My reason for asking is, if I didn't need the option of connecting a client to a specific tcp port, or I was happy to look at the log to see the actual port, is it feasible for Config.Name to accept a value like "hostname:" and let the port be automatically selected by net.Listen? I tried doing this, but apparently an internal olric client wants to connect and isn't actually figuring out the port that was assigned by the Listen.
If I were able to do this, I would only then have to really care about reserving 1 port, which is the memberlist bind port.

@buraksezer
Owner

Actually I picked two ports. The first one, 3320, is for Olric itself: it's used for client communication and for syncing data/state between Olric nodes. The second one, 3322, is for the memberlist library; the peer discovery and failure detection system (memberlist) uses that port. Olric only processes NodeJoin/NodeLeave events from the memberlist library. There is no other business logic here.

I don't fully understand why you need to select random ports for Olric nodes. As I understand it, you only need to use a unique domain name or IP address for each Olric node in your configuration file.

Is this Name/port broadcasted to the other nodes through the member communication?

Yes, it is. Config.Name is actually used by the internal/transport package to run a TCP server, and the memberlist library propagates Config.Name to the other nodes.

It's a hack, but I use the following in the tests to get a random port from the operating system. You may want to use it to generate random ports before creating a configuration for an Olric node.

func getRandomPort() (string, error) {
	// Asking the OS to listen on port 0 makes it pick a free port for us.
	addr, err := net.ResolveTCPAddr("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}

	l, err := net.ListenTCP("tcp", addr)
	if err != nil {
		return "", err
	}
	defer l.Close()

	// Extract the port the OS actually assigned before the listener is closed.
	hostport := l.Addr().String()
	_, port, err := net.SplitHostPort(hostport)
	if err != nil {
		return "", err
	}
	return port, nil
}

I imagine something like this:

	port, err := getRandomPort()
	if err != nil {
		// handle err
	}
	c := config.Config{
		Name: hostname + ":" + port,
		// ...
	}

This is how I implement integration tests to test multiple Olric nodes: https://github.com/buraksezer/olric/blob/master/dmap_test.go#L174

Note: I'm going to write something more detailed during the day. It's 08:00 AM in Istanbul :)

@justinfx
Contributor Author

justinfx commented Mar 4, 2020

If you want to run two applications containing Olric on the same host with the default settings, one of them will fail with port conflicts. To avoid the conflicts, one must set 2 different unique ports per instance.

Actually I ended up using the same random port selection logic before I saw your example. Because I don't necessarily need external client access to olric, and the nodes all broadcast the port around anyway, I just default the client port to random.
Now for the memberlist bind port, I've also allowed the option for my application to randomize the port, because I can always have it push its actual port to Consul or some service discovery to derive a peer list later.

@buraksezer
Owner

buraksezer commented Mar 4, 2020

Is there some long interval I need to wait for the async replication?

There may be two different answers for that question:

1-

memberlist can be tuned to be more responsive to network events. I prefer to use the defaults. memberlist.DefaultLANConfig() seems fine to me. So when you call config.New("lan"), Olric runs with that config.

The following lines are from the DefaultLANConfig function in memberlist/config.go:

TCPTimeout:              10 * time.Second,       // Timeout after 10 seconds
IndirectChecks:          3,                      // Use 3 nodes for the indirect ping
RetransmitMult:          4,                      // Retransmit a message 4 * log(N+1) nodes
SuspicionMult:           4,                      // Suspect a node for 4 * log(N+1) * Interval
SuspicionMaxTimeoutMult: 6,                      // For 10k nodes this will give a max timeout of 120 seconds
PushPullInterval:        30 * time.Second,       // Low frequency
ProbeTimeout:            500 * time.Millisecond, // Reasonable RTT time for LAN
ProbeInterval:           1 * time.Second,        // Failure check every second
DisableTcpPings:         false,                  // TCP pings are safe, even with mixed versions
AwarenessMaxMultiplier:  8,                      // Probe interval backs off to 8 seconds

GossipNodes:          3,                      // Gossip to 3 nodes
GossipInterval:       200 * time.Millisecond, // Gossip more rapidly
GossipToTheDeadTime:  30 * time.Second,       // Same as push/pull
GossipVerifyIncoming: true,
GossipVerifyOutgoing: true,

These variables are also used by HashiCorp's Serf tool. From the documentation:

gossip_nodes - The number of random nodes to send gossip messages to per gossip_interval. Increasing this number causes the gossip messages to propagate across the cluster more quickly at the expense of increased bandwidth. The default is 3.

gossip_interval - The interval between sending messages that need to be gossiped that haven't been able to piggyback on probing messages. If this is set to zero, non-piggyback gossip is disabled. By lowering this value (more frequent) gossip messages are propagated across the cluster more quickly at the expense of increased bandwidth. The default is 200ms.

probe_interval - The interval between random node probes. Setting this lower (more frequent) will cause the cluster to detect failed nodes more quickly at the expense of increased bandwidth usage. The default is 1s.

probe_timeout - The timeout to wait for an ack from a probed node before assuming it is unhealthy. This should be at least the 99-percentile of RTT (round-trip time) on your network. The default is 500ms and is a conservative value suitable for almost all realistic deployments.

retransmit_mult - The multiplier for the number of retransmissions that are attempted for messages broadcasted over gossip. The number of retransmits is scaled using this multiplier and the cluster size. The higher the multiplier, the more likely a failed broadcast is to converge at the expense of increased bandwidth. The default is 4.

suspicion_mult - The multiplier for determining the time an inaccessible node is considered suspect before declaring it dead. The timeout is scaled with the cluster size and the probe_interval. This allows the timeout to scale properly with expected propagation delay with a larger cluster size. The higher the multiplier, the longer an inaccessible node is considered part of the cluster before declaring it dead, giving that suspect node more time to refute if it is indeed still alive. The default is 4.
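
If faster failure detection matters more than bandwidth, these knobs can be tightened on an embedded member. A minimal sketch (the field names are taken from the excerpt above; the values are illustrative, not recommendations):

import (
	"time"

	"github.com/buraksezer/olric/config"
)

func fastFailureDetection() *config.Config {
	c := config.New("lan") // starts from memberlist.DefaultLANConfig()
	// Probe and gossip more aggressively so dead nodes are detected sooner,
	// at the cost of more network traffic.
	c.MemberlistConfig.ProbeInterval = 500 * time.Millisecond
	c.MemberlistConfig.ProbeTimeout = 250 * time.Millisecond
	c.MemberlistConfig.GossipInterval = 100 * time.Millisecond
	c.MemberlistConfig.SuspicionMult = 3
	return c
}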

2-

When a node goes down, Olric doesn't try to create a new copy of the data from the primary copy. Instead, it applies a technique called read-repair. When you call get on a key, if the backups are out of sync or don't have the key, the primary owner sends the most recent key/value pair to its replicas.

I tried to explain it in the readme. https://github.com/buraksezer/olric#read-repair-on-dmaps

The stats on NodeB don't end up reflecting the same size as NodeA to indicate a replication of the cache item.

The size may be different, but the key/value pair is available, so it's okay. This is a subtle and harmless "bug" in the storage engine.

@justinfx
Contributor Author

justinfx commented Mar 4, 2020

Ok, ReadRepair=true seems to be the magic I was after. Once I enabled this, I started seeing messages like Moving DMap: NAME when a node comes back up, and it looks like the cache item is transferred to the new node. The effect is that I can do a rolling upgrade of the application version and presumably not lose the entire cache. Thanks!

@buraksezer
Owner

I think a clarification is needed. When a node goes down, Olric doesn't make a new copy for missing replicas immediately. This kind of inconsistency is covered by an umbrella term called entropy. There are different approaches to decrease entropy in a distributed system. Olric currently implements quorum-based replica control and read-repair techniques to keep entropy low. A Merkle-tree based anti-entropy system that makes a full scan and repair is planned.

So when a node goes down, the replica count for keys decreases. That's clear. Olric doesn't make a full sync immediately. Instead, when you call put or get functions on the keys, Olric ensures that the replication factor is satisfied and that the most recent version of the key/value pair is replicated across the cluster.
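
A minimal sketch of the knobs involved (I'm assuming the quorum-related fields on config.Config here; the values are illustrative):

import "github.com/buraksezer/olric/config"

func quorumConfig() *config.Config {
	c := config.New("lan")
	c.ReplicaCount = 2  // primary copy + one backup
	c.WriteQuorum = 2   // a write succeeds only after both copies acknowledge it
	c.ReadQuorum = 1    // a read can be answered by a single copy
	c.ReadRepair = true // Get calls re-sync stale or missing backups
	return c
}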

Please take a look at this scenario:

  1. Run NodeA and NodeB. ReplicaCount=2, ReadRepair=true
  2. Everything is OK. You call put or get functions on DMaps.
  3. NodeB goes down. Now NodeA is the single server in the cluster. The replica count for keys is 1. All data remains available. NodeA owns everything and it's the cluster coordinator (the oldest node).
  4. NodeB is up again. NodeA moves some partitions and backup partitions to NodeB. You should see messages like this: Moving DMap: NAME. Backups or primary copies may be transferred in this case.
  5. NodeB owns half of the partitions as primary owner and functions as a backup node for the partitions which are owned by NodeA.
  6. At that point, the replica count is still 1, but when you call get/put functions on the keys, Olric repairs the backups gradually (see the sketch below).
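
In code terms, a sketch of step 6, assuming the embedded DMap API (NewDMap/Put/Get) and a db started with ReplicaCount=2 and ReadRepair=true:

import (
	"log"

	"github.com/buraksezer/olric"
)

func touchKey(db *olric.Olric) {
	dm, err := db.NewDMap("mydmap")
	if err != nil {
		log.Fatal(err)
	}
	// With ReadRepair enabled, a Get not only returns the value but also pushes
	// the most recent version to any backup that is missing it or holds a stale
	// copy, so the replication factor recovers gradually as keys are touched.
	if err := dm.Put("some-key", "some-value"); err != nil {
		log.Fatal(err)
	}
	if _, err := dm.Get("some-key"); err != nil {
		log.Fatal(err)
	}
}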

I couldn't find an optimal way to make a full sync of backup partitions in the case of network partitioning. We may try to find a way if it's vital for you.

@justinfx
Contributor Author

justinfx commented Mar 5, 2020

Thank you for this detailed explanation. I thought I had confirmed a kind of automatic value replication when I alternated between taking down NodeA and NodeB and confirmed I would still get a cache hit on the other node. It seemed with Repair enabled that it was keeping a copy on both Nodes.
So I can't say I am totally clear on the difference between read repair and a full sync backup. It seems like the current solution is better than what I currently have, in the sense that I get some form of cache value resilience between cycled upgrades to each node. And missing some values is better than losing the whole cache.

@buraksezer
Owner

buraksezer commented Mar 5, 2020

So I can't say I am totally clear on the difference between read repair and a full sync backup.

I couldn't find an optimal way for a node to decide when to make a full replica sync. With ReadRepair, keys are replicated and updated on the backup nodes if needed, and read-repair is triggered by a Get call.

I'm not willing to increase the coordination level between the cluster members. But we may still work on a new way to handle it.

@buraksezer
Owner

Hey @justinfx, I hope you are well. I have been working on Kubernetes integration for a couple of weeks. It seems to work fine. You may want to check this out: https://github.com/buraksezer/olric#kubernetes

@justinfx
Contributor Author

@buraksezer , very cool! I've been doing a lot of k8s investigation lately. It's great to see that the new plugins are continuing that pattern of providing both a plugin and library interface. My primary integration of olric is embedded into a distributed app. So I imagine I will need the k8s discovery plugin for that.
I see your documentation has been expanding nicely. Did you want to update your service discovery section to also point at nats support?
https://github.com/buraksezer/olric/#service-discovery
https://github.com/justinfx/olric-nats-plugin

@buraksezer
Owner

buraksezer commented Apr 23, 2020

@justinfx thank you! olric-cloud-plugin will provide a service discovery interface for k8s and many other cloud providers including Amazon AWS, Google Cloud and Microsoft Azure. It uses hashicorp/go-discover under the hood. You can use the plugin in your distributed app. It's designed for this purpose.

I'm expanding the documentation these days. I'll test and add olric-nats-plugin in a couple of days.

Keep in touch!

@buraksezer
Owner

Hi @justinfx

It has been a long time since our last contact. I saw your recent PR to the olric-nats-plugin repository. I wonder whether you use Olric in production?

@justinfx
Contributor Author

Hi @buraksezer. Well I had that one project where I introduced Olric and set up a helm deployment for kubernetes. But then we ran into some unrelated infrastructure problems with putting it on kubernetes, so I ended up shelving the new version. There is probably still a way to deploy this to our older setup, so I could possibly revisit it again soon.

But I do still want to use Olric in production, and I have a new project that will need a distributed cache in front of database accesses.

Do you have any production success stories from other projects yet? Interested to know where it is in serious use.

@buraksezer
Owner

Do you have any production success stories from other projects yet? Interested to know where it is in serious use.

We have a few users. One of the big ones is IBM Cloud. See #82 and #106 for details. The other one is LilithGames. They are building a proxy tool that uses Olric, but I don't have any idea about the current status of the project: https://github.com/LilithGames/spiracle

Open source projects by indie developers and communities:

Some of them are probably inactive. There may be unknown users. Olric has ~300 clones per day.

@justinfx
Contributor Author

Thanks for that info. Congrats on receiving so much activity! Well if I end up using Olric on this new project then I am sure you will hear from me. I'm hoping to use it as an embedded cache to the application nodes, instead of the previous version that used varnish in front of the application nodes.
