direct reads in high availability mode #153

Closed
mapshen opened this issue Jan 7, 2019 · 22 comments

mapshen commented Jan 7, 2019

Hi Ryan,

Recently noticed a "strange" behavior of Monstache and am hoping you could shed some light on it.

Sometimes, because of an index template change, we need to recreate an index. By that, we mean stopping Monstache, deleting the current index, updating the index template, and restarting Monstache immediately. However, when restarted this way, Monstache doesn't seem to realize the old index is gone and will not perform a direct read to recreate it.

Here are my steps to reproduce it:

  1. Run two Monstache processes in high availability mode with

     resume = true
     replay = false
     cluster-name = "test"

     The index of interest is created.
  2. Stop both processes
  3. Remove the index
  4. Update the template for the index
  5. Restart both processes
  6. No index is created

I suspect it may have something to do with monstache.cluster, which has a TTL index on expireAt, because this issue does not come up if we wait about a minute before restarting Monstache.


rwynn commented Jan 8, 2019

@mapshen I wasn't able to recreate this problem yet with your information. I am assuming that you also have direct-read-namespaces configured in addition to the settings above? That would be required to copy over the data after an index delete since replay = false.

Can you try to start the processes with verbose = true? Usually, when an active member of the cluster exits it deletes the record of itself in monstache.cluster so that another process can take over. Do you think maybe this did not happen and so when you restarted the 2 processes they both came up in the paused state?

Does this happen if you do a use monstache; db.cluster.remove({}); between steps 5 and 6?


rwynn commented Jan 8, 2019

I can sort of replicate this by using killall -9 monstache to kill both processes. In this case the shutdown logic doesn't get run, so the document in monstache.cluster remains. Then when I start the 2 processes they both come up in the paused state. The problem is that even though they are paused, the direct reads still happen, but since both processes are not "active" they throw away the direct-read documents instead of indexing them. This is to prevent multiple processes from re-indexing the same direct reads. I guess it assumes that one process will always be active. But in this case both are paused while the direct reads are happening.


mapshen commented Jan 8, 2019

Yes, I do use direct-read-namespaces as well, and when I run use monstache; db.cluster.remove({}); between steps 5 and 6, the problem doesn't happen.
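For reference, on top of the settings above the config has roughly the following (just a sketch; the namespace and URLs are placeholders for our real values):

# placeholders below; the real namespace and endpoints differ
direct-read-namespaces = ["mydb.mycollection"]
elasticsearch-urls = ["http://elasticsearch:9200"]
mongo-url = "mongodb://mongo:27017"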

What you described above is exactly what I am seeing. When the 2 processes start, they come up in the paused state. However, as you know, the old document (0ec41c92595f) in monstache.cluster will be dropped within 30 seconds because of the TTL index:

  {
    "v": 1,
    "key": {
      "expireAt": 1
    },
    "name": "expireAt_1",
    "ns": "monstache.cluster",
    "background": true,
    "expireAfterSeconds": 30
  }

When that happens, one of the two new processes (927dc273ea23) will register itself and resume work

 monstache> db.cluster.find()
{
  "_id": "test",
  "expireAt": ISODate("2019-01-08T22:08:13.993Z"),
  "pid": 34,
  "host": "0ec41c92595f"
}
Fetched 1 record(s) in 1ms

 monstache> db.cluster.find()
Fetched 0 record(s) in 0ms

 monstache> db.cluster.find()
{
  "_id": "test",
  "expireAt": ISODate("2019-01-08T22:09:14.362Z"),
  "pid": 34,
  "host": "927dc273ea23"
}
Fetched 1 record(s) in 3ms

Logs from 927dc273ea23 also support this observation:

INFO 2019/01/08 22:08:44 Joined cluster test
INFO 2019/01/08 22:08:44 Pausing work for cluster test
INFO 2019/01/08 22:08:44 Listening for events
INFO 2019/01/08 22:08:44 Starting http server at :8080
INFO 2019/01/08 22:08:44 Sending systemd READY=1
INFO 2019/01/08 22:09:14 Resuming work for cluster test

But 927dc273ea23 still doesn't process the direct reads, even though I think in this case it should.

I also turned on verbose this time, and the log output didn't change.

It seems the shutdown procedure never runs for me, because the document in monstache.cluster always remains after the process stops. I run monstache in Docker using a wrapper script as the image entry point, and it seems that when I run docker stop on a monstache container, the SIGTERM gets caught by the wrapper rather than monstache, so Docker has to send a SIGKILL after the grace period and the graceful shutdown never happens?

By the way, is it correct that the timestamp written to monstache.monstache is only used for oplog tailing events, not direct reads?


mapshen commented Jan 8, 2019

In addition, an easier way to reproduce this is:

  1. pick a large collection (> 10 million docs in my case)
  2. start 2 monstache processes to create an elastic index for that collection
  3. stop (kill -9) the 2 monstache processes before the index finishes building
  4. start 2 new monstache processes before the old doc in monstache.cluster is removed

You will see that the elastic index does not resume populating.


rwynn commented Jan 8, 2019

Thanks for confirming the behavior. One of the issues is the following code, which turns off direct reads when the process is in a cluster and is not enabled:

https://github.com/rwynn/monstache/blob/v4.13.0/monstache.go#L3730

The assumption there is that if the process is not enabled, then some other process must be (and must be doing the direct reads), and in that case we don't want every process in the cluster to re-perform the same direct reads. I'm not sure if it would be better to have them all process the direct reads regardless of whether or not they are enabled.

In this case that assumption is bad, because all processes are starting disabled due to the zombie process still lingering in the cluster collection.

The problem with using the timestamp in monstache.monstache with direct reads is that (a) sometimes you really want to sync the whole collection regardless of the last oplog event you processed (e.g. you deleted the index) and (b) direct reads don't have a reliable way to factor in time. They don't have an oplog entry with a timestamp, only the doc itself. ObjectIds have a time component, but you may not always use an ObjectId for your _id, and the time component of an ObjectId reflects when the doc was created, not when it was last updated.

The best you can do in monstache for delta direct reads is to set up a pipeline with a $match clause that targets a date/timestamp field in your document with a $gt operator.
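For documents that carry a last-modified field, the pipeline would boil down to a single stage like this (a sketch; updatedAt and the cutoff date are placeholders for your own field and watermark):

// updatedAt and the cutoff are placeholders for your own field and watermark
[
  { "$match": { "updatedAt": { "$gt": ISODate("2019-01-01T00:00:00Z") } } }
]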

Here is a link to the shutdown logic that resets the cluster state, which does not appear to be happening here:
https://github.com/rwynn/monstache/blob/v4.13.0/monstache.go#L3529

mapshen changed the title from "index recreation" to "direct reads in high availability mode" on Jan 9, 2019

mapshen commented Jan 9, 2019

Clearly it was my oversight when dockerizing monstache. I made a change on my end, and now docker stop stops the container gracefully and monstache.cluster gets cleaned up on exit. See the details here if you're interested.

Although this means the two cases I mentioned above are no longer a concern, I found a third scenario where this would happen:

  1. pick a large collection (> 10 million docs in my case)
  2. start 2 monstache processes (ba62d4e7fdbe and 3d5ef1f2efdd) to create an elastic index for that collection
  3. stop (gracefully) ba62d4e7fdbe when it's still working on direct reads
  4. 3d5ef1f2efdd will resume work in no time but it won't process direct reads any more

So for high availability, monstache only works for oplog-tailing events, not direct reads.

Thanks for pointing me to the source code; that really helps. To be honest, I haven't read all the code, but I was wondering: is there logic for checking whether the direct reads are complete? If so, we could save that state to monstache.monstache as well, so that when 3d5ef1f2efdd takes over, it knows whether it should re-perform the same direct reads.


rwynn commented Jan 9, 2019

You raise a valid issue. I think it would be difficult to track and reliably clean up the state of direct reads. I think we could add an option such that all processes do direct reads and only active processes track the oplog (they would be redundant but maybe not a big overhead?).

Or, a different approach would be to turn off direct reads in the configs of the cluster processes and then add a third docker container that only does direct reads. This 3rd (non-clustered) container could be set up with -disable-change-events so that it doesn't track the oplog at all but just does direct reads. Maybe also configure this container with -exit-after-direct-reads and then auto-restart it after an interval (say 30 minutes or longer) with unless-stopped? I think you would need a custom entry point to support the delay. Or use some trick like...

docker run --restart unless-stopped --entrypoint "/bin/ash" golang:1.11.4-alpine3.8 -c "/usr/local/go/bin/go version && /bin/sleep 10"
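Or the custom entry point could be as simple as this (untested sketch; the binary and config paths are assumptions, and 1800 seconds is an arbitrary delay):

#!/bin/sh
# one round of direct reads, then sleep so the restart policy schedules the next round
# (/bin/monstache and /config.toml are assumed paths)
/bin/monstache -f /config.toml -disable-change-events -exit-after-direct-reads
sleep 1800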

I was thinking about scheduled direct-reads (restart after an interval) but thought maybe that should be handled outside of monstache.


mapshen commented Jan 10, 2019

This scenario could potentially be dangerous, in the sense that it's possible for us to lose data without knowing it when one process goes down during direct reads, which is something we can't afford to live with.

I put some thought into this, and I think I'm in favor of your third-container approach. What I have in mind is that all 3 containers would still share the same configuration file, but the first one would take the -disable-change-events and -exit-after-direct-reads options and the other two would take disable-direct-reads from the command line. Since we don't have a disable-direct-reads option yet, do you think this is something we could add to make my life much easier? :)

P.S.
We both agree it's difficult to track the status of direct reads, and thus by HA we mean that when one process goes down, a second process should re-perform the direct reads from the beginning. In your approach, the second process is a restart of the first failed process, and will not start until the first one fails.

I have two other ideas, in which the second process starts alongside the first one but stands by until the first fails.

When running in HA mode, processes would not perform direct reads automatically on startup, i.e., they would not start goroutines to read the docs from a collection. Based on this, my two ideas are:

  1. They will first check who the primary is. Only the primary does direct reads and all the others enter the paused state. Whenever a secondary becomes the primary, it re-performs the direct reads. Again, like you said, this would be redundant and an overhead.

  2. Similar to writing the timestamp to monstache.monstache, we could also write the state of the direct reads there:

{
  "_id": "<cluster-name>_oplog",
  "ts": Timestamp(1547028156, 1)
}
{
  "_id": "<cluster-name>_direct_reads"
  "expireAt": ISODate("2019-01-08T22:08:13.993Z"),
  "is_complete": false
}

Then the process at work would first check monstache.monstache and only start direct reads if the "_direct_reads" doc doesn't exist or "is_complete" is false. When it shuts down, the "_direct_reads" doc would not be removed, just like the "_oplog" doc. When a second process takes over, it would perform the same check and would not perform direct reads if "is_complete" is true.

Note that there would be a TTL index on the "expireAt" field of the "_direct_reads" doc, in order to clean it up eventually. The downside is that when we want to rebuild an index (stop the containers, remove the index, update the template), we can't restart the containers right away because of the leftover "_direct_reads" doc.
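That TTL index could be created the same way as the one on monstache.cluster, e.g. (a sketch; the 30-second expiry is just an example value):

// sketch: TTL index on the proposed expireAt field; 30s expiry is an example value
use monstache
db.monstache.createIndex({ "expireAt": 1 }, { "expireAfterSeconds": 30 })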


rwynn commented Jan 10, 2019

I will respond at more length when I get some time. But one thing: MONSTACHE_DIRECT_READ_NS is supported as an env var. You could use that instead of the config file, since there is no disable flag. Just leave the var out for one of the processes.

MONSTACHE_DIRECT_READ_NS=db.col1,db.col2
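For example, for the one container that should do the direct reads (a sketch; it assumes the rwynn/monstache image runs the monstache binary as its entry point and that the config is mounted at /config.toml):

# assumed: rwynn/monstache entry point is the monstache binary; config mounted at /config.toml
docker run -d \
  -e MONSTACHE_DIRECT_READ_NS=db.col1,db.col2 \
  -v $PWD/config.toml:/config.toml \
  rwynn/monstache -f /config.toml

The other containers would run the same command without the -e flag.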


mapshen commented Jan 11, 2019

Good call, Ryan. Will test it out and circle back.

rwynn added commits that referenced this issue Jan 14, 2019

rwynn commented Jan 15, 2019

@mapshen I put in some fixes related to this in v4.13.1. I am very hesitant to add any additional state in MongoDB, so at this point there is no direct-reads-complete flag. But monstache now ensures that:

  • When a process starts up in cluster mode and detects it is not enabled it will wait until it is enabled before proceeding to consume events. It will not turn off its direct reads.
  • When a process that was enabled becomes disabled it will wait until it is re-enabled before proceeding to consume events.

By wait I mean that the goroutine consuming events from a channel will be blocked until an active status is determined. This whole thing is predicated on the fact that the goroutines producing data from MongoDB cannot advance unless the channels they write to are emptied. So by not reading on those channels, the producers (direct readers) feel back pressure and are effectively paused.
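Roughly the idea, as a standalone Go sketch (not the actual monstache code):

package main

import (
	"fmt"
	"time"
)

func main() {
	docs := make(chan int)         // unbuffered: the producer blocks until someone reads
	enabled := make(chan struct{}) // closed once this process becomes the active cluster member

	// Producer: stands in for a direct reader pulling docs out of MongoDB.
	go func() {
		for i := 1; i <= 3; i++ {
			docs <- i // blocks here (back pressure) while the consumer is paused
			fmt.Println("direct reader produced doc", i)
		}
		close(docs)
	}()

	// Consumer: refuses to read until the process is enabled, which pauses the producer.
	go func() {
		<-enabled
		for d := range docs {
			fmt.Println("indexed doc", d)
		}
	}()

	time.Sleep(2 * time.Second) // while paused, nothing is produced or indexed
	close(enabled)              // become active; the consumer drains and the producer resumes
	time.Sleep(time.Second)     // give the goroutines time to finish before exiting
}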

I think this should fix all the cases that you outlined in this issue. For the first case, where all the monstache processes are killed forcibly without cleanup, monstache will wait until the zombie process is expired by MongoDB (at most 30 seconds) and then proceed to process the direct reads.

In the case where the cleanup does occur and one of the new active processes is killed before completing the direct reads, the subsequent active process in the cluster will still sync direct reads (repeating any from the killed process).


mapshen commented Jan 16, 2019

@rwynn Thanks a lot for the fix, much appreciated.

I did a test in my dev environment with two monstache processes running as a cluster: I killed the active process after it finished the direct reads, so the subsequent one started a full re-sync. It worked as expected. RAM usage increased by about 50% on the Elasticsearch data nodes during the re-sync, and the amount of disk space used had doubled by the time the re-sync was done, which we can accept given this is the worst-case scenario. Of course, the nodes will try to utilize all the CPUs available.

In sum, we are going down this road.


rwynn commented Jan 16, 2019

@mapshen that's interesting about the disk usage. I can see the RAM usage increasing, but I'm surprised about the disk, since the documents should be overwriting existing ones, not adding more to the index. Maybe this will decrease again when Elasticsearch merges segments?


mapshen commented Jan 16, 2019

@rwynn Yes, the disk usage went back to normal later on, but I haven't found a good explanation of why the merges only happen after the re-sync is complete. It also seems to be what happens when I build the indexes from scratch.

rwynn closed this as completed on Jan 18, 2019

mapshen commented Apr 25, 2019

@rwynn I think we need to reopen this, as it seems I have found a case in which this solution fails (and it just happened in our production environment).

Here are the steps to reproduce it:

  1. pick a collection (296 docs in my case)
  2. start 2 monstache processes (ba62d4e7fdbe and 3d5ef1f2efdd) to create an elastic index for that collection, ba62d4e7fdbe being the active one
  3. when the index finishes populating, remove some docs (285 docs in my case) from the collection and wait for the index to catch up
  4. stop ba62d4e7fdbe, and 3d5ef1f2efdd picks up the work
  5. check the index again and you will see the index has a higher count than what is in mongo, which is supposed to be 11 docs

If you take a look at those extra docs, you will find they are the docs we removed from mongo earlier.

The reason, I suspect, is that although 3d5ef1f2efdd doesn't consume events from a channel when it's not active, old events get queued up. When it starts to work, it processes those events as if they had just come in, when they should have been discarded.

The consequence is that I have to manually clean up data in elastic periodically, given that failover happens once in a while.

Would we be able to fix this? Let me know if you need more information. Many thanks in advance!

rwynn reopened this on Apr 26, 2019

rwynn commented Apr 26, 2019

@mapshen I will take a look at fixing this issue. One measure monstache has to prevent old data from getting into Elasticsearch is the use of version numbers (used by default but not used when index-as-update is enabled).

Monstache uses the timestamp from the oplog as a version number, so that old data gets rejected and the index stays consistent with MongoDB. However, when it comes to deletes, Elasticsearch garbage collects deleted versions after 60s by default. This means the rejection does not happen if the 2nd process takes over more than 60s after the delete is performed by the 1st. For example, I just tried the scenario but killed the main process immediately after performing the deletes. The 2nd process took over and sent the buffered index requests, but since the version number was less than the current one, they were rejected. Obviously, this won't work in the general case, since the process may take over well after the 60s has elapsed. You can change 60s to a higher value using the index.gc_deletes setting in Elasticsearch, but it probably isn't a great idea to keep that very high.
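For example (a sketch; the index name and the 10m value are placeholders):

# sketch: raise the deleted-version GC window on one index; name and value are placeholders
curl -X PUT "localhost:9200/my-index/_settings" -H 'Content-Type: application/json' -d '{ "index.gc_deletes": "10m" }'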

All that having been said, I think there are some measures I can take to ensure that the accumulated buffer of change events is cleared when a process takes over for another. The trick is to make sure the original reason for this issue stays fixed: all the direct reads should still be performed when a process takes over. So I cannot just discard those; I will need to rerun them.
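Roughly, something along the lines of a non-blocking drain before going active (a standalone Go sketch, not the actual monstache code):

package main

import "fmt"

// drain discards whatever is already buffered on ch without blocking,
// so stale change events queued up while paused are not replayed.
func drain(ch chan string) {
	for {
		select {
		case e := <-ch:
			fmt.Println("discarding stale event:", e)
		default:
			return
		}
	}
}

func main() {
	events := make(chan string, 8)
	events <- "delete doc 1" // queued while this process was paused
	events <- "index doc 2"
	drain(events) // clear stale events before taking over
	fmt.Println("buffer cleared; direct reads will be re-run from scratch")
}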


mapshen commented Apr 26, 2019

Agreed, clearing the buffer and re-performing the direct reads will do.

Happy to help test the fix. Let me know!

rwynn added commits that referenced this issue Apr 26, 2019

mapshen commented Apr 29, 2019

@rwynn Thanks for the quick turnaround and it looks good in my testing.

However, it seems we've got a regression here. Currently, the behavior is that the http server won't start until a process first becomes "enabled". Since we use something like curl -f http://localhost:8080/healthz || exit 1 for health checks, the inactive one is considered unhealthy and may subsequently be shut down and/or restarted.


rwynn commented Apr 30, 2019

@mapshen Thanks for reporting the issue with the http server. I just checked in a fix across all branches. Do you think you would be able to do a local build and let me know if you run into any issues?


mapshen commented Apr 30, 2019

@rwynn all good now in my testing.

Had to change things up a bit to support local builds on my end though ;)


mapshen commented May 1, 2019

Are we going to make a release for this?


rwynn commented May 2, 2019

@mapshen just pushed a new release. thanks.

rwynn closed this as completed on May 30, 2019