Nouveau Availability #5002

Open
Sliosh opened this issue Mar 12, 2024 · 9 comments

Comments

@Sliosh
Contributor

Sliosh commented Mar 12, 2024

Hello.

We are planning to deploy CouchDB to all our customers' sites, so we are evaluating how we will roll out CouchDB. It would be good to know when we can expect version 3.4.0. Is there a possible timeframe known at this point in time? And will Nouveau be included (as stable) in 3.4.0?

Thanks for your help.

@nickva
Contributor

nickva commented Mar 13, 2024

Nouveau will be included in the 3.4.0 release.

There is no set timeframe, but currently the main thing we're waiting on is review of the Nouveau deb packaging PR in apache/couchdb-pkg#125. I am planning to review it this week or over the weekend.

@Sliosh
Contributor Author

Sliosh commented Mar 13, 2024

Thanks. This sounds good to me. I have a second question about Nouveau. We want to have one or more CouchDB nodes, each with a corresponding Nouveau node, in different offices of our customers. We need to sync the data between the nodes, and everything should still work on every node even if the network between the nodes is lost; the nodes should get synced after the network is restored. We use clustering with n = NODE_COUNT and q = 1 so that we get all data on every node. This seems to work as expected, but Nouveau doesn't work if any node is down. We get the following message if a node (in this example couchdb@deacsrewdbbt2.customer.de) is down; note that it is not included in the error response below:

"error": "badrecord",
    "reason": "[{{shard,<<\"shards/00000000-ffffffff/foo.1710344337\">>,\n         'couchdb@deacsrewdbbt1.customer.com',<<\"foo\">>,\n         [0,4294967295],\n         #Ref<0.1884847839.2480668675.227080>,\n         [{props,[]}]},\n  nil},\n {{shard,<<\"shards/00000000-ffffffff/foo.1710344337\">>,\n         'couchdb@deacsrewdbbt3.customer.com',<<\"foo\">>,\n         [0,4294967295],\n         #Ref<0.1884847839.2480668675.227078>,\n         [{props,[]}]},\n  nil}]",
    "ref": 3715306381

This is the log excerpt from CouchDB server 3, which received the request:

[error] 2024-03-14T08:26:05.582072Z couchdb@deacsrewdbbt3.customer.com <0.902.0> 844d099269 req_err(3715306381) badrecord : [{{shard,<<"shards/00000000-ffffffff/foo.1710344337">>,
         'couchdb@deacsrewdbbt1.customer.com',<<"foo">>,
         [0,4294967295],
         #Ref<0.1884847839.2480668675.227080>,
         [{props,[]}]},
  nil},
 {{shard,<<"shards/00000000-ffffffff/foo.1710344337">>,
         'couchdb@deacsrewdbbt3.customer.com',<<"foo">>,
         [0,4294967295],
         #Ref<0.1884847839.2480668675.227078>,
         [{props,[]}]},
  nil}]
    [<<"nouveau_fabric_search:handle_message/3 L84">>,<<"rexi_utils:process_mailbox/6 L55">>,<<"nouveau_fabric_search:go/4 L64">>,<<"nouveau_httpd:handle_search_req/6 L103">>,<<"nouveau_httpd:handle_search_req/3 L56">>,<<"chttpd:handle_req_after_auth/2 L416">>,<<"chttpd:process_request/1 L394">>,<<"chttpd:handle_request_int/1 L329">>]
[notice] 2024-03-14T08:26:05.582265Z couchdb@deacsrewdbbt3.customer.com <0.902.0> 844d099269 10.249.4.203:5984 192.168.244.83 admin GET /foo/_design/foo/_nouveau/search?q=_id:doc1706* 500 ok 38

Is it not supported for Nouveau to work with only the local node in this case, or do I need to change something for this to work?

Thanks.

@Sliosh changed the title from "Version 3.4.0 and Nouveau" to "Nouveau Availability" on Mar 15, 2024
@rnewson
Member

rnewson commented Mar 25, 2024

Nouveau will be marked EXPERIMENTAL in CouchDB 3.4.0 as we gather feedback from the community.

I certainly expect fault tolerance from Nouveau, so I will look into your finding this week.

@rnewson
Member

rnewson commented Mar 25, 2024

As an aside, we strongly recommend that CouchDB clusters do not span locations (offices, in your case). The nodes of any given cluster should be very close together (<1 ms ping time). For your use case we'd recommend a cluster per office, using the HTTP replication facility to sync data between offices.
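
For example, a continuous replication can be set up by writing a document into the _replicator database of each cluster; the hostnames, credentials and database name below are placeholders, not taken from this thread:

curl -X PUT http://admin:password@office-a.example.com:5984/_replicator/pull-foo-from-office-b \
     -H "Content-Type: application/json" \
     -d '{
           "source": "http://admin:password@office-b.example.com:5984/foo",
           "target": "http://admin:password@office-a.example.com:5984/foo",
           "continuous": true
         }'

A mirror-image document on the office-b cluster pulls changes the other way, so each office keeps working while the link is down and catches up once it is restored.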

@rnewson
Member

rnewson commented Mar 25, 2024

Further aside: you don't have to have a Nouveau node for each CouchDB node; you can safely point multiple CouchDB nodes at the same Nouveau node. Whether this is better or worse for you will depend on what you're doing and the performance specs of the server(s) Nouveau is running on. One-to-one is a sensible place to start, though.
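
For example, each CouchDB node can be pointed at a shared Nouveau server in its local.ini; treat the exact key names here as from memory and double-check them against the 3.4 documentation:

[nouveau]
enable = true
; URL of the Nouveau server this CouchDB node should talk to
url = http://nouveau-host.example.com:5987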

rnewson added a commit that referenced this issue Mar 25, 2024
reported in #5002

the badrecord is because we execute the second clause of handle_message
and mess up the internal state (just returning the Counters rather than a
state record around it)
@rnewson
Member

rnewson commented Mar 25, 2024

I figured it out and have posted a PR to fix the fault tolerance.
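
In rough terms, the shape of the fix looks like this (a simplified sketch only, not the literal diff; the record is cut down and drop_node/2 is a stand-in for the real helper):

%% Simplified sketch of the shape of the fix, not the literal diff.
-record(state, {counters}).

%% When a worker node goes down, the down-handling clause must return the
%% full #state{} record; returning the bare counters is what later blows
%% up with badrecord when another clause pattern-matches on #state{}.
handle_message({rexi_DOWN, _, {_, Node}, _}, _Worker, #state{counters = Counters0} = State) ->
    Counters1 = drop_node(Node, Counters0),   %% drop_node/2 is a stand-in helper
    {ok, State#state{counters = Counters1}};  %% bug was: {ok, Counters1}
handle_message(_Other, _Worker, State) ->
    {ok, State}.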

@Sliosh
Contributor Author

Sliosh commented Mar 25, 2024

Thanks for the fast fix. I will test this later.

as an aside, we strongly recommend that couchdb clusters do not span locations

I thought that with the following configuration there would be no problem using a cluster instead of replication:

[cluster]
q=1
n=3
w=1
r=1

But if that's not supported, we need to switch to replication. In that case we need to replicate database deletes in the application layer, right?

you don't have to have a nouveau node for each couchdb node

We do this because we need every location to keep working independently if there is a network failure between them.

I still have some questions about how I should set up CouchDB for our use case. Where should I ask these questions? This issue is not the right place. Should I use another issue or the mailing list?

@rnewson
Member

rnewson commented Mar 25, 2024

Our Slack is the best place for this kind of chat (couchdb.slack.com).

A couple of notes:

  1. The 'r' and 'w' fields under [cluster] do nothing; they have not been used by the code for several years now.
  2. Erlang clusters (and therefore also CouchDB clusters) need low latency between nodes, and reliable networking too. If these conditions aren't met, we recommend separate clusters that use replication to push data around, as our replication system is tolerant of high latency and unreliable networking.

@rnewson
Member

rnewson commented Mar 25, 2024

Oh, and 3) yes, you would need to delete databases within each cluster separately, as database deletion is not propagated by replication.
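
For example, the application would issue the delete against each cluster's endpoint itself (placeholder hostnames and credentials):

curl -X DELETE http://admin:password@office-a.example.com:5984/foo
curl -X DELETE http://admin:password@office-b.example.com:5984/foo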
