
view has the wrong signature #4994

Open
sergey-safarov opened this issue Feb 26, 2024 · 20 comments

Comments

@sergey-safarov

Description

On one node (db-2.example.com) we caught error logs:

[error] 2024-02-26T19:53:42.735715Z couchdb@db-2.example.com <0.21142.1038> -------- ./data/.shards/e0000000-ffffffff/account/a1/15/78067a1db9f83fa4da53f6bacc16.1663539188_design/mrview/fa7a1a9e5db7f6873529299ea929daad.view has the wrong signature: expected: <<250,122,26,158,93,183,246,135,53,41,41,158,169,41,218,173>> but got <<87,225,55,84,101,143,50,221,45,61,42,165,141,102,202,177>>
[error] 2024-02-26T19:54:07.738902Z couchdb@db-2.example.com <0.22173.1076> -------- ./data/.shards/40000000-5fffffff/account/a1/15/78067a1db9f83fa4da53f6bacc16.1663539188_design/mrview/fa7a1a9e5db7f6873529299ea929daad.view has the wrong signature: expected: <<250,122,26,158,93,183,246,135,53,41,41,158,169,41,218,173>> but got <<87,225,55,84,101,143,50,221,45,61,42,165,141,102,202,177>>

This then triggers high CPU usage and makes the 3-node CouchDB cluster unresponsive.

The logs contain messages like:
fabric_worker_timeout get_db_info

[error] 2024-02-26T20:00:39.603653Z couchdb@db-2.example.com <0.26790.1086> 182b91a95a fabric_worker_timeout get_db_info,'couchdb@db-0.example.com',<<"shards/20000000-3fffffff/account/56/3b/1e36302175513343e6a13f6a1372-202308.1690848013">>

fabric_worker_timeout open_doc

[error] 2024-02-26T20:00:40.479851Z couchdb@db-2.example.com <0.2074.1115> -------- fabric_worker_timeout open_doc,'couchdb@db-2.example.com',<<"shards/40000000-5fffffff/account/e3/2d/d743a02e6ee4e28f02755e070d33.1685539862">>

fabric_worker_timeout open_doc

[error] 2024-02-26T20:00:45.480715Z couchdb@db-2.example.com <0.26670.936> a6eca4ce74 fabric_worker_timeout open_doc,'couchdb@db-1.example.com',<<"shards/40000000-5fffffff/account/e3/2d/d743a02e6ee4e28f02755e070d33.1685539862">>

Steps to Reproduce

Not known.

Expected Behaviour

An error in one view should not stop the functionality of a 3-node cluster.

Your Environment

  • CouchDB version used: 3.3.2
  • Browser name and version: no browser used
  • Operating system and version: CentOS 8

Additional Context

Used the apache/couchdb:3.3.2 Docker container.

@nickva (Contributor) commented Feb 26, 2024

That happens when a view shard file is opened but has a view signature that's not current. It comes from:

"~s has the wrong signature: expected: ~p but got ~p",

The expected signature matches file path:

binary:encode_hex(<<250,122,26,158,93,183,246,135,53,41,41,158,169,41,218,173>>).
<<"FA7A1A9E5DB7F6873529299EA929DAAD">>

The other one is:

binary:encode_hex(<<87,225,55,84,101,143,50,221,45,61,42,165,141,102,202,177>>).
<<"57E13754658F32DD2D3D2AA58D66CAB1">>
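For convenience, the same conversion can be done outside an Erlang shell. A small Python sketch, using the byte values copied from the log line above:

```python
# Byte values copied from the "wrong signature" log line; the expected
# signature should match the hex string embedded in the .view file name.
expected = bytes([250, 122, 26, 158, 93, 183, 246, 135,
                  53, 41, 41, 158, 169, 41, 218, 173])
actual = bytes([87, 225, 55, 84, 101, 143, 50, 221,
                45, 61, 42, 165, 141, 102, 202, 177])

print(expected.hex())  # fa7a1a9e5db7f6873529299ea929daad -> matches the file name
print(actual.hex())    # 57e13754658f32dd2d3d2aa58d66cab1 -> the on-disk signature
```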

I don't recall seeing this error very often. Is there any chance your view shard files were moved, copied, restored from a backup of a much older couch instance, or mounted on a volume shared across multiple nodes?

Is it easy to reproduce? Did it just happen once, or is it a regular occurrence?

This then triggers high CPU usage and makes the 3-node CouchDB cluster unresponsive.

That's expected as the next action after the log is to reset the view shard and rebuild. So in other words, after the view rebuilds, it should be back to normal.

@rnewson (Member) commented Feb 26, 2024

I would also ask that question: have the files been moved or renamed outside of CouchDB's control? This error is not normal, and is a protection mechanism (effectively an assertion). Another possibility is that you have multiple nodes pointing at the same shared volume, trashing each other's state.

@sergey-safarov (Author)

Is there any chance your view shard files were moved, copied, restored from a backup of a much older couch instance, or mounted on a volume shared across multiple nodes?

I do not think we did any of the above.
We started the Docker container and did not move CouchDB files.
Also, we do not use shared volumes.
We make CouchDB backups using bulk requests, but did not restore a CouchDB database in this time period.

Is it easy to reproduce? Did it just happen once, or is it a regular occurrence?

This is the first time.

after the view rebuilds, it should be back to normal.

This did not happen within 17 minutes, so we recreated the Docker container.

@sergey-safarov (Author)

Just checked via the AWS console: the CouchDB volume has "Multi-Attach enabled: no".
So the volume is mounted to only one node.

@sergey-safarov (Author)

after the view rebuilds, it should be back to normal.

Also, the issue with the view file happened on one node in the cluster, but this does not explain why CPU was consumed on the other nodes.
The view rebuild should happen on one node, and the two other nodes should not be affected.

@nickva (Contributor) commented Mar 14, 2024

I am not sure exactly why more CPU would be consumed on the other nodes. It could be that the view shards there hadn't caught up as much as the shard that was reset, so they started to get built. When this view was reset, perhaps there were active requests waiting to receive rows (responses) from it, so waiting view clients might have been piling up.

You can check the number of waiting view clients with: https://docs.couchdb.org/en/stable/api/ddoc/common.html#get--db-_design-ddoc-_info

To see if any view builds are taking place, try using https://docs.couchdb.org/en/stable/api/server/common.html#active-tasks
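For instance, a quick sketch of pulling the relevant fields out of a _design/{ddoc}/_info response. The view_index field names (signature, updater_running, waiting_clients) come from the CouchDB docs, but the JSON body below is a made-up example, not real output:

```python
import json

# Made-up example of a GET /{db}/_design/{ddoc}/_info response body.
# Field names are documented; the values are invented for illustration.
body = """
{
  "name": "myddoc",
  "view_index": {
    "signature": "fa7a1a9e5db7f6873529299ea929daad",
    "updater_running": true,
    "waiting_clients": 12
  }
}
"""

vi = json.loads(body)["view_index"]
print(vi["waiting_clients"])   # clients piled up waiting on the view
print(vi["updater_running"])   # True while the index is (re)building
```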

@sergey-safarov (Author)

We have reproduced the same issue on the server without docker.

@jcoglan (Contributor) commented Apr 3, 2024

It appears that it is possible for the design doc signature to vary according to something we have not yet determined. The client cluster that we (Neighbourhoodie) are investigating gives the following results for one of the design docs affected by a bad index signature:

$ cdb '/' | jq '{ version, git_sha }'
{
  "version": "3.3.3",
  "git_sha": "40afbcfc7"
}

$ cdb '/{db}/_design/{id}' | md5sum
ef932f29b40bc1795360ea095051d782  -

$ cdb '/{db}/_design/{id}/_info' | jq '.view_index.signature'
"c15a2c0300aefa07abdea125eba80a98"

We copied this design doc and saved it into a local dev cluster running the same CouchDB version/commit, and it gave the same md5sum and signature.

However, when we put this same design doc into the CouchDB service installed by Homebrew, it gave the same md5sum but a different signature:

$ cdb '/{db}/_design/{id}/_info' | jq '.view_index.signature'
"20ae103a2d660c16f4cbbc43703ede09"

This indicates that it's possible for the same design document to produce a different signature on different systems, possibly on different nodes of the same cluster, depending on what the cause is.

Environment of each test:

  • Customer production system: Linux, CouchDB v3.3.3, SpiderMonkey 1.8.5, Erlang 24.3.4.15, ICU library 60.3, collator 153.80, algorithm 10

  • Local dev cluster: macOS 13.6.4, CouchDB v3.3.3, SpiderMonkey 91, Erlang 25.3.2.9, ICU library 74.2, collator 153.121, algorithm 15.1

  • Homebrew install: macOS 13.6.4, CouchDB v3.3.3, SpiderMonkey 91, Erlang 26.2.1, ICU library 73.2, collator 153.120, algorithm 15

@nickva (Contributor) commented Apr 3, 2024

@jcoglan I suspect the term_to_binary output is somehow different between architectures or Erlang versions.

View signatures are computed in:

SigInfo = {Views, Language, DesignOpts, couch_index_util:sort_lib(Lib)},
{ok, IdxState#mrst{sig = couch_hash:md5_hash(?term_to_bin(SigInfo))}}.

See if you can add a log statement to dump the SigInfo term there, then run the term through term_to_binary in an Erlang prompt on all Erlang/architecture versions and see if it produces the same result.

Erlang 26 introduced the deterministic term_to_binary option, so maybe we should start using that (though it is only guaranteed to be stable within the same Erlang/OTP release).

The other thing that's changed is UTF8 encoding of atoms. That's probably what the problem is here.

See some discussion from last year about it: #4467 (comment)
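As a language-neutral analogy (Python here, not CouchDB's actual code): since the signature is an MD5 over a serialized term, any change in the serialization format changes the signature, even when the underlying data is identical:

```python
import hashlib
import json

# Conceptual analogy only: the same logical term, serialized two different
# ways, yields two different MD5 "signatures".
sig_info = {"views": ["consumed", "consumed_by_callid"], "language": "javascript"}

enc_a = json.dumps(sig_info).encode()                         # default spacing
enc_b = json.dumps(sig_info, separators=(",", ":")).encode()  # compact spacing

sig_a = hashlib.md5(enc_a).hexdigest()
sig_b = hashlib.md5(enc_b).hexdigest()
print(sig_a == sig_b)  # False: same data, different encoding, different signature
```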

@rnewson (Member) commented Apr 3, 2024

(node1@127.0.0.1)17> Views.
[{mrview,0,0,0,[],
         [{<<"consumed">>,<<"_sum">>}],
         <<"function (doc) { if (doc.pvt_type != 'allotment_consumption' || doc.pvt_deleted) ret"...>>,
         nil,[]},
 {mrview,1,0,0,
         [<<"consumed_by_callid">>],
         [],
         <<"function (doc) { if (doc.pvt_type != 'allotment_consumption' || doc.pvt_deleted)"...>>,
         nil,[]}]
(node1@127.0.0.1)18> Language.
<<"javascript">>
(node1@127.0.0.1)19> DesignOpts.
[]
(node1@127.0.0.1)20> Libs.
[]
(node1@127.0.0.1)21> couch_util:to_hex_bin(couch_hash:md5_hash(term_to_binary({Views, Language, DesignOpts, Libs}))).
<<"c15a2c0300aefa07abdea125eba80a98">>
(node1@127.0.0.1)22> couch_util:to_hex_bin(couch_hash:md5_hash(term_to_binary({Views, Language, DesignOpts, Libs}, [{minor_version, 2}]))).
<<"20ae103a2d660c16f4cbbc43703ede09">>

couchdb 3.3.3 does not use term_to_bin everywhere, so its output will vary based on whether it runs inside OTP 26 or earlier.

@rnewson (Member) commented Apr 3, 2024

Fixed since 3.3.3 with commit 453c698.

@rnewson (Member) commented Apr 3, 2024

Noting that couchdb 3.3.3 explicitly rejects OTP 26:

make
==> snappy (compile)
ERROR: OTP release 26 does not match required regex 23|24|25

@rnewson (Member) commented Apr 3, 2024

rebar.config.script:    {require_otp_vsn, "23|24|25"},

@nickva (Contributor) commented Apr 3, 2024

Good find @rnewson, it's the {minor_version, 1} option and the atom encoding. Any terms with atoms, produced while running on Erlang 26 before that fix, would definitely have a different hash.

On 26

> term_to_binary(foo, [{minor_version,1}]).
<<131,100,0,3,102,111,111>>

> term_to_binary(foo).
<<131,119,3,102,111,111>>

But I wonder if there is something else besides atoms there that could cause non-determinism. I'd hope not, because that would be terrible. We don't use maps, refs, or pids there, but binary reference chunks might sneak in...?
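The two encodings of the atom foo shown above differ at the byte level (external-format tag 100, ATOM_EXT with a 2-byte length, vs tag 119, SMALL_ATOM_UTF8_EXT with a 1-byte length), so any hash over them differs too. Checking with the bytes copied from the shell output:

```python
import hashlib

# Bytes copied from the two term_to_binary(foo, ...) outputs above.
old_enc = bytes([131, 100, 0, 3, 102, 111, 111])  # {minor_version,1}: ATOM_EXT
new_enc = bytes([131, 119, 3, 102, 111, 111])     # OTP 26 default: SMALL_ATOM_UTF8_EXT

print(hashlib.md5(old_enc).hexdigest() == hashlib.md5(new_enc).hexdigest())  # False
```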

@rnewson (Member) commented Apr 3, 2024

@nickva term_to_binary is fragile and we should not be using it for view signatures (at least). The original decision was made, I'm sure, out of expediency and/or (benign) ignorance. Forcing {minor_version, 1} is likely a solid fix for years to come, but the only truly safe path is to define the view signature algorithm explicitly.

@nickva (Contributor) commented Apr 4, 2024

We have reproduced the same issue on the server without docker.

@sergey-safarov that's good news. Can it be turned into a script, or is it fairly simple to describe the steps?

@sergey-safarov (Author)

For me, it isn't easy to reproduce; it happens randomly.
The CouchDB server runs for several weeks without this error, and then it can happen.

@nickva (Contributor) commented Apr 5, 2024

I searched through our logs and found a few "has the wrong signature" errors as well. In our case they all happened on nodes that were being decommissioned, while database shards were migrating to new nodes. I wonder if there is a higher chance of it happening when there is a network partition, or when the shard map changes while the design document is updated at the same time...

@janl (Member) commented Apr 9, 2024

noting that couchdb 3.3.3 explicitly rejects OTP 26;

I seem to have been naughty when making the 3.3.3 Mac binaries. I swear I did this for a good reason, but I don’t recall it at the moment. I think the 25-jit failed on ARM Macs, but I’m not sure: https://github.com/janl/build-couchdb-mac/blob/master/build.sh#L72-L73

None of this is relevant to the issue in the ticket; it just clarifies what @jcoglan reported.

nickva added a commit that referenced this issue Apr 10, 2024
When we upgrade empty view files from 2.x, we end up skipping the commit.
Subsequently, we will see a "wrong signature" error when the view is opened
later. The error is benign as we'd end up resetting an empty view, but it may
surprise an operator. To avoid this, ensure to always commit after upgrading
old views.

Issue: #4994
@janl (Member) commented Apr 10, 2024

@jcoglan and I have come up with a repro for the 'wrong signature' event: https://gist.github.com/jcoglan/0a5feb4af2a496ce10c9b80cf02ea28f. Our theory is that when an index is first queried following a v2->v3 upgrade, https://github.com/apache/couchdb/blob/3.3.3/src/couch_mrview/src/couch_mrview_index.erl#L121 will rename the file and return the old signature. However, maybe_update_index_file/1 just renames the file; it does not change its content. https://github.com/apache/couchdb/blob/3.3.3/src/couch_mrview/src/couch_mrview_index.erl#L127 matches and the normal index update path is followed, but if the index is empty then no new content and no new header is written, so the old signature remains in the file. The next time the view is queried, maybe_update_index_file/1 will do nothing (the old file does not exist) and return ok, so we hit the clause where "wrong signature" appears: https://github.com/apache/couchdb/blob/3.3.3/src/couch_mrview/src/couch_mrview_index.erl#L139
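The failure path above can be sketched as a toy simulation (Python, purely illustrative; the file layout is invented and the signature strings are just reused from the log at the top of the thread, none of this matches CouchDB's real view file format):

```python
import os
import tempfile

# Toy model of the upgrade bug: the 2.x view file is renamed to the new
# path, but because the index is empty no new header is ever written, so
# the file still records the OLD signature while the opener expects NEW.
old_sig = "57e13754658f32dd2d3d2aa58d66cab1"
new_sig = "fa7a1a9e5db7f6873529299ea929daad"

tmp = tempfile.mkdtemp()
old_path = os.path.join(tmp, old_sig + ".view")
with open(old_path, "w") as f:
    f.write(old_sig)  # "header" stores the signature the index was built with

# maybe_update_index_file/1 analogue: rename only, contents untouched
new_path = os.path.join(tmp, new_sig + ".view")
os.rename(old_path, new_path)

with open(new_path) as f:
    stored = f.read()
print(stored == new_sig)  # False -> "has the wrong signature: expected ... but got ..."
```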
