Nouveau Availability #5002

Open
Sliosh opened this issue Mar 12, 2024 · 9 comments

Comments

@Sliosh
Contributor

Sliosh commented Mar 12, 2024

Hello.

We are planning to deploy CouchDB to all our customers' sites, so we are evaluating how we will roll out CouchDB. It would be good to know when we can expect version 3.4.0. Is there a possible timeframe known at this point in time? And will Nouveau be included (as stable) in 3.4.0?

Thanks for your help.

@nickva
Contributor

nickva commented Mar 13, 2024

Nouveau will be included in the 3.4.0 release.

There is no set timeframe, but currently the main thing we're waiting on is review of the Nouveau deb packaging PR in apache/couchdb-pkg#125. I am planning to review it this week or over the weekend.

@Sliosh
Contributor Author

Sliosh commented Mar 13, 2024

Thanks. This sounds good to me. I have a second question about Nouveau. We want to have one or more CouchDB nodes, each with a corresponding Nouveau node, in different offices of our customers. We need to sync the data between the nodes, and everything should still work on every node even if the network between the nodes is lost; the nodes should get synced after the network is restored. We use clustering with n = NODE_COUNT and q = 1 so that we get all data on every node. This seems to work as expected, but Nouveau doesn't work if any node is down. We get the following message if a node (in this example couchdb@deacsrewdbbt2.customer.de) is down; note that it is not included in the error response below:

"error": "badrecord",
    "reason": "[{{shard,<<\"shards/00000000-ffffffff/foo.1710344337\">>,\n         'couchdb@deacsrewdbbt1.customer.com',<<\"foo\">>,\n         [0,4294967295],\n         #Ref<0.1884847839.2480668675.227080>,\n         [{props,[]}]},\n  nil},\n {{shard,<<\"shards/00000000-ffffffff/foo.1710344337\">>,\n         'couchdb@deacsrewdbbt3.customer.com',<<\"foo\">>,\n         [0,4294967295],\n         #Ref<0.1884847839.2480668675.227078>,\n         [{props,[]}]},\n  nil}]",
    "ref": 3715306381

This is the log excerpt from CouchDB server 3, which received the request:

[error] 2024-03-14T08:26:05.582072Z couchdb@deacsrewdbbt3.customer.com <0.902.0> 844d099269 req_err(3715306381) badrecord : [{{shard,<<"shards/00000000-ffffffff/foo.1710344337">>,
         'couchdb@deacsrewdbbt1.customer.com',<<"foo">>,
         [0,4294967295],
         #Ref<0.1884847839.2480668675.227080>,
         [{props,[]}]},
  nil},
 {{shard,<<"shards/00000000-ffffffff/foo.1710344337">>,
         'couchdb@deacsrewdbbt3.customer.com',<<"foo">>,
         [0,4294967295],
         #Ref<0.1884847839.2480668675.227078>,
         [{props,[]}]},
  nil}]
    [<<"nouveau_fabric_search:handle_message/3 L84">>,<<"rexi_utils:process_mailbox/6 L55">>,<<"nouveau_fabric_search:go/4 L64">>,<<"nouveau_httpd:handle_search_req/6 L103">>,<<"nouveau_httpd:handle_search_req/3 L56">>,<<"chttpd:handle_req_after_auth/2 L416">>,<<"chttpd:process_request/1 L394">>,<<"chttpd:handle_request_int/1 L329">>]
[notice] 2024-03-14T08:26:05.582265Z couchdb@deacsrewdbbt3.customer.com <0.902.0> 844d099269 10.249.4.203:5984 192.168.244.83 admin GET /foo/_design/foo/_nouveau/search?q=_id:doc1706* 500 ok 38

Is it not supported for Nouveau to work with only the local node in this case, or do I need to change something for this to work?

Thanks.

@Sliosh changed the title from "Version 3.4.0 and Nouveau" to "Nouveau Availability" on Mar 15, 2024
@rnewson
Member

rnewson commented Mar 25, 2024

Nouveau will be marked EXPERIMENTAL in CouchDB 3.4.0 as we gather feedback from the community.

I certainly expect fault tolerance from Nouveau, so I will look into your finding this week.

@rnewson
Member

rnewson commented Mar 25, 2024

As an aside, we strongly recommend that CouchDB clusters do not span locations (offices, in your case). The nodes of any given cluster should be very close together (<1 ms ping time). For your use case we'd recommend a cluster per office, using the HTTP replication facility to sync data between offices.
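
For example, a continuous replication can be set up by writing a document into the _replicator database of each cluster; the hostnames, credentials and database name below are placeholders, not taken from this thread:

curl -X PUT http://admin:password@office-a.example.com:5984/_replicator/pull-foo-from-office-b \
     -H "Content-Type: application/json" \
     -d '{
           "source": "http://admin:password@office-b.example.com:5984/foo",
           "target": "http://admin:password@office-a.example.com:5984/foo",
           "continuous": true
         }'

A mirror-image document on the office-b cluster pulls changes the other way, so each office keeps working while the link is down and catches up once it is restored.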

@rnewson
Member

rnewson commented Mar 25, 2024

Further aside: you don't have to have a Nouveau node for each CouchDB node; you can safely point multiple CouchDB nodes at the same Nouveau node. Whether this is better or worse for you will depend on what you're doing and the performance specs of the server(s) Nouveau is running on. One-to-one is a sensible place to start, though.
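
For example, each CouchDB node can be pointed at a shared Nouveau server in its local.ini; treat the exact key names here as from memory and double-check them against the 3.4 documentation:

[nouveau]
enable = true
; URL of the Nouveau server this CouchDB node should talk to
url = http://nouveau-host.example.com:5987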

rnewson added a commit that referenced this issue Mar 25, 2024
reported in #5002

the badrecord is because we execute the second clause of handle_message
and mess up the internal state (just returning the Counters rather than a
state record around it)
@rnewson
Member

rnewson commented Mar 25, 2024

I figured it out and have posted a PR to fix the fault tolerance.
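
In rough terms, the shape of the fix looks like this (a simplified sketch only, not the literal diff; the record is cut down and drop_node/2 is a stand-in for the real helper):

%% Simplified sketch of the shape of the fix, not the literal diff.
-record(state, {counters}).

%% When a worker node goes down, the down-handling clause must return the
%% full #state{} record; returning the bare counters is what later blows
%% up with badrecord when another clause pattern-matches on #state{}.
handle_message({rexi_DOWN, _, {_, Node}, _}, _Worker, #state{counters = Counters0} = State) ->
    Counters1 = drop_node(Node, Counters0),   %% drop_node/2 is a stand-in helper
    {ok, State#state{counters = Counters1}};  %% bug was: {ok, Counters1}
handle_message(_Other, _Worker, State) ->
    {ok, State}.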

@Sliosh
Contributor Author

Sliosh commented Mar 25, 2024

Thanks for the fast fix. I will test this later.

as an aside, we strongly recommend that couchdb clusters do not span locations

I thought that with the following configuration there would be no problem using a cluster instead of replication:

[cluster]
q=1
n=3
w=1
r=1

But if that's not supported, we need to switch to replication. In that case we need to replicate database deletes in the application layer, right?

you don't have to have a nouveau node for each couchdb node

We do this because we need every location to keep working independently if there is a network failure between them.

I still have some questions about how I should set up CouchDB for our use case. Where should I ask these questions? This issue is not the right place. Should I use another issue or the mailing list?

@rnewson
Member

rnewson commented Mar 25, 2024

Our Slack is the best place for this kind of chat (couchdb.slack.com).

A couple of notes:

  1. The 'r' and 'w' fields under [cluster] do nothing; they have not been used by the code for several years now.
  2. Erlang clusters (and therefore also CouchDB clusters) need low latency between nodes, and reliable networking too. If these conditions aren't met, we recommend separate clusters that use replication to push data around, as our replication system is tolerant of high latency and unreliable networking.

@rnewson
Member

rnewson commented Mar 25, 2024

Oh, and 3) yes, you would need to delete databases within each cluster separately, as database deletion is not propagated by replication.
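
For example, the application would issue the delete against each cluster's endpoint itself (placeholder hostnames and credentials):

curl -X DELETE http://admin:password@office-a.example.com:5984/foo
curl -X DELETE http://admin:password@office-b.example.com:5984/foo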
