Replication stopped for several databases after some days, couchdb version 3.3.1 #4897
Comments
Thanks Nick. We bounced our nodes to sync the doc count with the master (source), but we found that replication for the same databases has stopped once again. We have 300 databases, and 6 or 7 of them stopped replicating again on the same nodes. Is there anything we should check at the database/shard level to see whether there is an issue with these databases? We have not yet deployed your recommended config changes; we will apply them, observe for a few days, and let you know.
Nick, we are seeing the same issue after configuring our cluster with the new values you recommended. We have 300 databases; after we bounced all the nodes (5 nodes) with the new configuration (retries_per_request=5, worker_processes = 4 , worker_processes=20), replication stopped for some databases after a few days.
Error log:
Jan 04 09:52:20 item-cache-ro-us-east4-5 couchdb[3345357]: couchdb@item-cache-ro-us-east4-5.c.pr-store-xxxx.internal <0.18866.1840> -------- Replicator, request GET to "https://item-cache-master.pr-store-xxxxxx.str.xxxxxx.com/item_store-104/_changes?feed=continuous&style=all_docs&since=23237954-g1AAAAJveJzLYWBgYMlgTmEwT84vTc5ISXLILEnN1U1OTM5I1c1NLC5JLdI11EvWKyjSLS7JLwKK5eclFiVn6GXmAaXyEnNygAYwJTIk2f___z8rgzmJISLYMxcoxm5kYWKRnGRCvskEXGVM0FVJDkAyqR7msEhRdrDDLC2MDRJNjcg3nOLgymMBkgwNQArotv2QUCt2BjvOLDHNMiUtjWahZkKk4w5AHAeN0p3rwY4zMDExNzC1JN-CLAAhTNNR&timeout=40000" failed due to error {connection_closed,mid_stream}
Note: the above error message appears for all databases. The same message had also appeared earlier for this database (item_store-104), but replication did not stop then.
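For what it's worth, the failing request can be reproduced by hand to see whether the source itself or something in between (e.g. a load balancer) is closing the connection mid-stream. A minimal sketch with curl; the hostname, credentials, and since=now value below are placeholders, not the exact values from the log:

# Request a continuous changes feed with the same timeout the replicator used.
# The server should end the feed cleanly with a "last_seq" line after ~40s;
# if the connection drops before that, something in the path is cutting it.
curl -sv --user admin:password \
  "https://item-cache-master.example.com/item_store-104/_changes?feed=continuous&style=all_docs&since=now&timeout=40000"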
Do the source servers show any errors in the logs? Are there other issues with the endpoints when the condition occurs; do basic doc reads and writes still work?
You could try reducing the batch size a bit more (2000 -> 500) and the worker count as well (4 -> 2), and increasing the timeout a bit more (120000 -> 240000). See the sketch below.
Another idea is to upgrade to 3.3.3 with the latest deb package, as we upgraded Erlang/OTP there to fix a memory leak issue. There have also been some replication fixes since then.
If possible, when you have identified a case of a stopped db, try updating the source doc and see what happens: does that change get propagated to the target?
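For reference, a sketch of applying those suggested values at runtime through the node config API (the host, credentials, and the _local node alias here are placeholders; the same keys can equally be edited in the [replicator] section of local.ini followed by a node restart):

# Apply the suggested replicator settings on one node; repeat on each node.
curl -s -X PUT --user admin:password \
  http://127.0.0.1:5984/_node/_local/_config/replicator/worker_batch_size -d '"500"'
curl -s -X PUT --user admin:password \
  http://127.0.0.1:5984/_node/_local/_config/replicator/worker_processes -d '"2"'
curl -s -X PUT --user admin:password \
  http://127.0.0.1:5984/_node/_local/_config/replicator/connection_timeout -d '"240000"'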
We updated the config with the values you recommended but saw the same issue: replication runs smoothly for a few days and then stops for a few databases.
We have a load balancer between the Master (5 nodes) and the Replica (5 nodes); the replica accesses the master through the load balancer, and the timeout on the load balancer is 2 min (120 sec). We see replication crash after 2 minutes for all databases, but it restarts automatically. However, for some databases the replication restarts and then remains "running" even though the connection has been timed out by the load balancer. The CouchDB scheduler does not restart the replication for those databases since they are not in the "crashed" state. Is there anything we can try from our side?
Thanks for trying the settings out.
Hmm, nothing obvious jumps out just from the description above. I wonder if you could collaborate with the load balancer team to see if they'd make an exception (e.g. a longer idle timeout) for the replication connections. Also, perhaps try upgrading to the latest 3.3.3 release. If that doesn't work, try to pick a particular replication and see if you can get access logs from the load balancer, the CouchDB replication host, and the endpoint to see exactly where it stops: is it waiting on a particular request, and what was the latest request made?
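One way to trace a single replication, sketched with curl (the doc id and replication id are the ones from the _scheduler output later in this issue, standing in for whichever stuck replication is being traced; credentials are placeholders):

# Scheduler's view of the replication document (state, last_updated, counters).
curl -s --user admin:password \
  http://localhost:5984/_scheduler/docs/_replicator/d5d4402593cd4f349a16afd480b56769

# Scheduler's view of the running job with its recent event history;
# a 404 here while the doc above still reports "running" means the job is gone.
curl -s --user admin:password \
  "http://localhost:5984/_scheduler/jobs/fa8e979a5cd08b7f5bc79312621e5d78+continuous"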
Replication stopped working after some time
We have 322 databases, but we found that replication on some of them has not been working for several days.
CouchDB version 3.3.1
Below is the information for one database which stopped replicating after 2023-11-23:
Database information on the Master (source) cluster:
{
"instance_start_time": "1642704430",
"db_name": "item_store-54",
"purge_seq": "0-g1AAAAFDeJytzsEJgDAMBdCgrtOCKHjUVdK0UKW2UutZN9PNNOAA9WAOP_ADjzgAqGypoaOwkdVqGJOZBSFZI2Zck4mikSSXKNYUInfBYyQrR88nj84xUPiKEy5eN8-EkBHbj-LximderLOiGjjV_tN_CKp_qekB84Fvhg",
"update_seq": "18822108-g1AAAAJveJzLYWBgYMlgTmEwT84vTc5ISXLILEnN1U1OTM5I1c1NLC5JLdI11kvWKyjSLS7JLwKK5eclFiVn6GXmAaXyEnNygAYw5bEASYYDQOo_EGRlMCcxuB9ZlgsUY08ySksxtzQk3wICjjMh0nENEMfthzhutxjYcWmWJkZmKUnkW0DAcYYEHZfkACST6uGhtkgA7DALc4tko-QU8g2nONQSGZLsYa7yEIK6KinJ0CTJknyTswAsLNQN",
"sizes": {
"file": 3859096600,
"external": 2393485316,
"active": 3329731810
},
"props": {},
"doc_del_count": 0,
"doc_count": 4223553,
"disk_format_version": 8,
"compact_running": false,
"cluster": {
"q": 4,
"n": 2,
"w": 2,
"r": 2
}
}
Database information on the Read-only (Target) cluster (to see the doc count difference):
{
"instance_start_time": "1642718965",
"db_name": "item_store-54",
"purge_seq": "0-g1AAAAFXeJytzrsNwzAMBFAi8TpUkU8XIF6FognIhiwFlFzHmyWbxQQ8gFXkirvu4SIAdOE8wIPzwmHw_VhlRiYOgppxKShU6g2vjt1LsdSsgnNOpBzcmKpoohhNOaXOGr42P8tE0MLeG9l1Zz-N7OWQ9b21f__zKYF_7t60ATasdX0",
"update_seq": "18634483-g1AAAAKXeJzLYWBgYMlgTmGwSc4vTc5ISXLILEnN1U1OTM5I1S3K1y0t1k1NLC4x0TXWS9YrKNItLskvStXNzc9LLErO0MvMK0ktykvMyQGawpTHAiQZDgCp_0CQlcGcxODOX5sLFGM3NbZMNjZNpNAWYpxpSqQzGyDO3A9xJlMc2JnJFonmyabmFNpCjDONCDozyQFIJtXDQtLtpRrYiebJRkZGSWkU2kCdkExkSLKHx3T0J7D7DMwSLcxTDCg0PgsAqfbfxA",
"sizes": {
"file": 3721901100,
"external": 2390184264,
"active": 3321577757
},
"props": {},
"doc_del_count": 0,
"doc_count": 4220250,
"disk_format_version": 8,
"compact_running": false,
"cluster": {
"q": 4,
"n": 2,
"w": 2,
"r": 2
}
}
_scheduler/docs information (it shows the state as 'running', but last_updated is 2023-11-23):
{
"database": "_replicator",
"doc_id": "d5d4402593cd4f349a16afd480b56769",
"id": "fa8e979a5cd08b7f5bc79312621e5d78+continuous",
"node": "couchdb@item-cache-ro-us-east4-1.c.pr-xxxx-xxxx.internal",
"source": "https://item-cache-master.pr-store-xxxxx.str.xxxxxx.com/item_store-54/",
"target": "http://localhost:5984/item_store-54/",
"state": "running",
"info": {
"revisions_checked": 25012513,
"missing_revisions_found": 9805364,
"docs_read": 9785878,
"docs_written": 9785878,
"changes_pending": 0,
"doc_write_failures": 0,
"bulk_get_docs": 2907167,
"bulk_get_attempts": 2907167,
"checkpointed_source_seq": "18689135-g1AAAAJveJyd0EEKwjAQBdBgKz2GJ0ix0ZC4sgvBcySTQCptKmld6xW8iDvXerM6tOI20s0MzMCfx9SEkNQlhghoL-CMLqveNhQUOEsb1fU20E0O-TnQrm8DzlqvAri88rjyqq4xYKGI3g_DcHKJJsd71uAs2xrJC8HnJ0dURVSlS6z6-oOtHiNMCgkMzPzwCIxHYT7FSm7Y0PaacOI54jTjO6HZ_AMRHPsT955w388dliPOciHXIOcfOH0AiZDS3A",
"source_seq": "18689136-g1AAAAJveJyd0EEKwjAQBdBgiz2GJ0ix0ZC4sgvBcySTQCptKmld6xW8iDvXerM6tOI20s0MzMCfx9SEkNQlhghoL-CMLqveNhQUOEsb1fU20E0O-TnQrm8DzlqvAri88rjyqq4xYKGI3g_DcHKJJsd71uAs2xrJC8HnJ0dURVSlS6z6-oOtHiNMCgkMzPzwCIxHYT7FSm7Y0PaacOI54jTjO6HZ_AMRHPsT955w388dliPOciHXIOcfOH0AidjS3Q",
"through_seq": "18689136-g1AAAAJveJyd0EEKwjAQBdBgiz2GJ0ix0ZC4sgvBcySTQCptKmld6xW8iDvXerM6tOI20s0MzMCfx9SEkNQlhghoL-CMLqveNhQUOEsb1fU20E0O-TnQrm8DzlqvAri88rjyqq4xYKGI3g_DcHKJJsd71uAs2xrJC8HnJ0dURVSlS6z6-oOtHiNMCgkMzPzwCIxHYT7FSm7Y0PaacOI54jTjO6HZ_AMRHPsT955w388dliPOciHXIOcfOH0AidjS3Q"
},
"error_count": 0,
"last_updated": "2023-11-23T08:31:41Z",
"start_time": "2023-11-22T08:02:38Z",
"source_proxy": null,
"target_proxy": null
},
_active_tasks Information
{
"node": "couchdb@item-cache-ro-us-east4-1.c.pr-xxxxx-xxxxx.internal",
"pid": "<0.28061.3417>",
"process_status": "waiting",
"bulk_get_attempts": 2907167,
"bulk_get_docs": 2907167,
"changes_pending": 0,
"checkpoint_interval": 5000,
"checkpointed_source_seq": "18689135-g1AAAAJveJyd0EEKwjAQBdBgKz2GJ0ix0ZC4sgvBcySTQCptKmld6xW8iDvXerM6tOI20s0MzMCfx9SEkNQlhghoL-CMLqveNhQUOEsb1fU20E0O-TnQrm8DzlqvAri88rjyqq4xYKGI3g_DcHKJJsd71uAs2xrJC8HnJ0dURVSlS6z6-oOtHiNMCgkMzPzwCIxHYT7FSm7Y0PaacOI54jTjO6HZ_AMRHPsT955w388dliPOciHXIOcfOH0AiZDS3A",
"continuous": true,
"database": "shards/90000000-9fffffff/_replicator.1642718882",
"doc_id": "d5d4402593cd4f349a16afd480b56769",
"doc_write_failures": 0,
"docs_read": 9785878,
"docs_written": 9785878,
"missing_revisions_found": 9805364,
"replication_id": "fa8e979a5cd08b7f5bc79312621e5d78+continuous",
"revisions_checked": 25012513,
"source": "https://item-cache-master.pr-xxxxx.xxxxxx.str.xxxxxx.com/item_store-54/",
"source_seq": "18689136-g1AAAAJveJyd0EEKwjAQBdBgiz2GJ0ix0ZC4sgvBcySTQCptKmld6xW8iDvXerM6tOI20s0MzMCfx9SEkNQlhghoL-CMLqveNhQUOEsb1fU20E0O-TnQrm8DzlqvAri88rjyqq4xYKGI3g_DcHKJJsd71uAs2xrJC8HnJ0dURVSlS6z6-oOtHiNMCgkMzPzwCIxHYT7FSm7Y0PaacOI54jTjO6HZ_AMRHPsT955w388dliPOciHXIOcfOH0AidjS3Q",
"started_on": 1700728303,
"target": "http://localhost:5984/item_store-54/",
"through_seq": "18689136-g1AAAAJveJyd0EEKwjAQBdBgiz2GJ0ix0ZC4sgvBcySTQCptKmld6xW8iDvXerM6tOI20s0MzMCfx9SEkNQlhghoL-CMLqveNhQUOEsb1fU20E0O-TnQrm8DzlqvAri88rjyqq4xYKGI3g_DcHKJJsd71uAs2xrJC8HnJ0dURVSlS6z6-oOtHiNMCgkMzPzwCIxHYT7FSm7Y0PaacOI54jTjO6HZ_AMRHPsT955w388dliPOciHXIOcfOH0AidjS3Q",
"type": "replication",
"updated_on": 1700729154,
"user": null
},
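For reference, the task above comes from the _active_tasks endpoint; a sketch (assuming curl and jq, with placeholder credentials) that filters to the replication task for this database:

# Show only replication tasks whose target is item_store-54.
curl -s --user admin:password http://localhost:5984/_active_tasks \
  | jq '[.[] | select(.type == "replication" and (.target | test("item_store-54")))]'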
Expected Behaviour
The doc counts on the Master (source) and on the target should match.
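A quick way to spot databases whose counts have drifted, sketched with curl and jq (the source/target hostnames, credentials, and database list are placeholders):

# Compare doc_count between source and target for a few databases.
SRC=https://item-cache-master.example.com
TGT=http://localhost:5984
for db in item_store-54 item_store-104; do
  s=$(curl -s --user admin:password "$SRC/$db" | jq .doc_count)
  t=$(curl -s --user admin:password "$TGT/$db" | jq .doc_count)
  echo "$db source=$s target=$t"
done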
Your Environment
{
"couchdb": "Welcome",
"version": "3.3.1",
"git_sha": "1fd50b82a",
"uuid": "53152281baf76390a563a08be3b3fd09",
"features": [
"access-ready",
"partitioned",
"pluggable-storage-engines",
"reshard",
"scheduler"
],
"vendor": {
"name": "The Apache Software Foundation"
}
}
Replication Configuration
[replicator]
max_jobs = 1500
interval = 60000
max_churn = 20
max_history = 7
worker_batch_size = 2000
worker_processes = 32
http_connections = 300
connection_timeout = 120000
retries_per_request = 10
checkpoint_interval = 5000
cluster_quiet_period = 600
PRETTY_NAME="Ubuntu 22.04.2 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.2 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
Additional Context
We have some other databases which have stopped replicating for a few days. We have 4 different target clusters (5 nodes each) and have seen the same issue on these clusters for different databases. After bouncing the clusters the problem goes away for a few days and then appears again.
_scheduler/jobs does not show these replications, which indicates that these databases are no longer scheduled for replication.
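A sketch of how such cases can be listed on a target node, assuming curl and jq with placeholder credentials: it compares the replication ids that _scheduler/docs still reports as "running" against the ids that actually exist in _scheduler/jobs.

# Replication ids with a scheduler doc in state "running" but no scheduler job.
curl -s --user admin:password "http://localhost:5984/_scheduler/jobs?limit=10000" \
  | jq -r '.jobs[].id' | sort > /tmp/jobs.txt
curl -s --user admin:password "http://localhost:5984/_scheduler/docs?limit=10000" \
  | jq -r '.docs[] | select(.state == "running") | .id' | sort > /tmp/docs.txt
comm -23 /tmp/docs.txt /tmp/jobs.txt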