Develop 3.0 #29

Open · wants to merge 547 commits into develop-3.0
Conversation

martinsumner (Owner)

No description provided.

Initial upload of PB API for aae fold
Also correct query string for fetch_clocks
To develop-2.9 to pick up necessary changes for #1732
There are problems caused by a previous mistake that merged mas-i1691-ttaaefullsync into develop-2.9.

This was reverted, but now updating mas-i1691-ttaaefullsync picks up the reversion, and starts removing stuff from mas-i1691-ttaaefullsync.

This merges back in develop-2.9 and reverts the reversion.
PB API, already supported with HTTP
Add a reaper process which will take reap requests, and also allow for direct reaps from the riak_client.

This will allow for reaping of tombstones, which is useful when running delete_mode=keep.  It is intended to make that delete_mode easier to support, by removing the need to keep tombstones perpetually.
Only accessible via riak_client().  Will only work with `tictacaae_storeheads` enabled if running tictac aae in parallel mode.

find_tombs will return a list of keys and integers, where the integers are the delete_hashes required to reap the tombstone (without an additional read)
Fold that allows for all tombstones to be reaped dynamically by the folding process
Also corrects the accumulator required for reap_tombs.
This is lower cost than running find_tombs just to calculate the length of the list.
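Taken together, the two folds can be driven from riak_client().  A minimal sketch is below; the query tuple shapes and the riak_client:aae_fold/2 call are assumptions based on the description above, so the real arities in riak_kv_clusteraae_fsm may differ.

```erlang
%% Hypothetical sketch only - the query tuple shapes are assumptions, not
%% the confirmed API of this PR.
-module(tombstone_fold_sketch).
-export([count_then_reap/1]).

count_then_reap(Bucket) ->
    {ok, C} = riak:local_client(),
    %% find_tombs: returns keys with their delete_hash, so a later reap can
    %% proceed without an additional read.
    {ok, Tombs} =
        riak_client:aae_fold({find_tombs, Bucket, all, all, all}, C),
    %% reap_tombs: reaps dynamically as the fold runs (here assumed to use a
    %% `local` method), which is cheaper than find_tombs when only the count
    %% of reaped tombstones is needed.
    {ok, ReapCount} =
        riak_client:aae_fold({reap_tombs, Bucket, all, all, all, local}, C),
    {length(Tombs), ReapCount}.
```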
This adds a riak_kv_eraser with equivalent behaviour to the riak_kv_reaper (although those similarities are not currently captured via a behaviour).  The eraser can queue up deletes, and there is an addition of an aae_fold to send deletes to the queue.

Some extra work is still required:

1. Make this a behaviour (a rough callback sketch follows below)
2. Add the API changes for fold
3. Should the eraser still double-check the vector clock?
4. Should it, like the reaper, only erase when primaries are up?
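As a sketch of what work item 1 might look like, the shared queue-and-consume shape could be captured in a behaviour along these lines.  The module and callback names here are hypothetical, not code from this PR.

```erlang
%% Hypothetical behaviour sketch - names are illustrative only.
-module(riak_kv_queue_worker_sketch).

%% Perform one queued item of work: reap a tombstone (riak_kv_reaper) or
%% delete an object (riak_kv_eraser).  Returning requeue pushes the item
%% back onto the queue.
-callback handle_work_item(WorkItem :: term()) ->
    ok | {requeue, term()}.

%% Limits for the in-memory queue before further items are rejected or
%% spilled elsewhere.
-callback queue_limits() ->
    {MemQueueLength :: pos_integer(), QueueLimit :: pos_integer()}.
```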
Stops a big binary being dropped into the logs when the request fails on binary_to_existing_atom/2
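A minimal sketch of the guard pattern implied here (the helper name is hypothetical): decode the requested atom inside a try, so a bad request returns a short error rather than the whole binary being formatted into the crash report.

```erlang
%% Hypothetical helper - not the actual function touched by this commit.
decode_fold_type(TypeBin) when is_binary(TypeBin) ->
    try
        {ok, binary_to_existing_atom(TypeBin, utf8)}
    catch
        error:badarg ->
            %% Unknown atom: return a compact error instead of crashing and
            %% dumping the full binary into the log.
            {error, unknown_fold_type}
    end.
```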
martinsumner and others added 30 commits May 12, 2022 18:39
Expand on use of riak_kv_overflow_queue so that it is used by the riak_kv_replrtq_src, as well as riak_kv_reaper and riak_kv_eraser.

This means that larger queue sizes can be supported for riak_kv_replrtq_src without having to worry about compromising the memory of the node. This should allow repl_keys_range AAE folds to generate very large replication sets without clogging the node worker pool with pauses that let real-time replication keep up.

The overflow queues are deleted on shutdown (if there is a queue on disk). The feature exists to allow for larger queues without memory exhaustion; persistence is not used to carry queues across restarts.

Overflow Queues extended to include a 'reader' queue which may be used for read_repairs. Currently this queue is only used for the repair_keys_range query and the read-repair trigger.
Introduces a reader overflowq for doing read repair operations.  Initially this is used for:

- repair_keys_range aae_fold - avoids the pausing of the fold that would block the worker pool;
- repair on key_amnesia - triggers the required repair rather than causing an AAE delta;
- repair on finding a corrupted object when folding to rebuild aae_store - previously the fold would crash, and the AAE store would therefore never be rebuilt.  [This PR](martinsumner/kv_index_tictactree#106) is required to make this consistent in both AAE solutions.
Merges removed the stat updates for ttaae full-sync (detected by riak_test).

A log had been introduced in riak_kv_replrtq_peer that could crash (detected by riak_test).

The safety change to avoid coordination in full-sync, by setting the time for the first work item from the beginning of the next hour, makes sense with 24 slices (one per hour) ... but less sense with different values. The riak_test which uses a very high slice_count to avoid delays then failed.
The replrtq_srcqueuelimit was removed, as the queue is now an overflow queue and should have its limit configured via replrtq_overflow_limit.

However, removing the old config item altogether will result in upgrades not being possible without changes to configuration files - so the old option is instead retained as a hidden config item to be ignored.
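For illustration, a cuttlefish-style schema sketch of the arrangement described above.  The defaults shown are assumptions; the real mappings live in the riak_kv schema.

```erlang
%% Hypothetical schema sketch - see the riak_kv schema for the real mappings.
{mapping, "replrtq_overflow_limit", "riak_kv.replrtq_overflow_limit", [
    {datatype, integer},
    {default, 1000000}          %% assumed default
]}.

%% Old name kept so existing riak.conf files still load, but hidden and ignored.
{mapping, "replrtq_srcqueuelimit", "riak_kv.replrtq_srcqueuelimit", [
    {datatype, integer},
    {default, 300000},          %% assumed default
    hidden
]}.
```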
* Update rebar.config

* Update rebar.config
* Add test to illustrate issue

* Do not crash when object's contents is an empty list
* Add missing function clause for repair_keys_range in convert_fold, to unbreak aae_fold for that case

* Thread the converted aae_fold query through riak_client, to complete the previous commit
This means that get requests will use bucket-type level Primary Read settings

Co-authored-by: Peter Tihanyi <peter.tihanyi@otpbank.hu>
* Add reip/3

To allow for reip without loading the riak_core application

* Use alternate name

* Update riak_kv_console.erl

* Update riak_kv_console.erl

* reip_manual inputs are atoms

* Add warning to update riak.conf file after reip

* Make clear where attention is required

And return 'ok' to make clear the op was successful

* Update rebar.config
Expected release candidate for 3.0.12
Make situation clearer in log

Rather than change behaviour, just make it clear that the warning can be ignored when shutting down.  This should avoid unnecessary concern.
* Read repair - configure to repair primary only

By default, the behaviour should be unchanged.  However, it is now configurable to read repair primary vnodes only - fallback vnodes will not be repaired on failing GETs; they will only receive new PUTs.

See the schema change for more details (a hedged sketch of the option shape follows this commit list).

* get_option returns value not {K, V}

* Add ability to suspend AAE

* Add logging of read repairs

Initially to troubleshoot in test - but perhaps of generic use.

* Handle handoff put through standard put code

Rather than replicating the standard PUT code piecemeal within do_diffobj_put/3, use do_handoff_put/3 and the standard prepare_put/2 and perform_put/3 functions used in normal PUTs.

The effect of this is that any optimisations in the normal PUT workflow will now automatically be used for handoffs.  Of particular relevance at this point is the HEAD (not GET) before PUT optimisation available with the leveled backend.  If there are large objects, and objects which already exist in the receiving vnode are to be handed off (such as in hinted handoff), then this increases efficiency.

Some spec improvements to help with some editors that do not like the fun() type.  Some indent reductions to improve readability.

* Make HookReason part of PutArgs

This allows the same code to be used for both handoff and put.

* Revert defaulting of properties

As riak_core has been updated to ensure bucket types are exchanged prior to a join committing

* Add helpful operator functions to riak_client

To make recovery of nodes easier, adding some helper functions to riak_client.

* Update branches

Remove legacy thumbs

* Update rebar.config
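For the primary-only read repair change above, the configuration might look something like the following cuttlefish mapping.  The option name here is hypothetical; the actual name and default are defined in the schema change referenced in that commit.

```erlang
%% Hypothetical mapping - the real option name/default are in the schema change.
{mapping, "read_repair_primaryonly", "riak_kv.read_repair_primaryonly", [
    {datatype, {enum, [enabled, disabled]}},
    {default, disabled}   %% disabled preserves the existing behaviour
]}.
```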
* Never GET before PUT

As if_not_modified and if_none_match are not supported via the HTTP API for non-consistent PUTs - simply build the object from a new object, as with the PB API.

* More webmachine friendly override

resource_exists/2 is a key part of the flow for both GET and PUT, as conditional HTTP headers require this check for PUTs.  Therefore only override resource_exists (and don't fetch) when it is a PUT and those conditional headers do not exist.

* Attempt to tidy and refactor delete

So that delete does not require a fetch

* Pipe-cleaned delete path

* Add if_not_modified conflict check

To mimic if_not_modified feature via PB API

* Use hyphen not underscore

To be consistent with other HTTP headers

* Revert to 404 if DELETE not_found

Also ensure the timeouts passed in on a delete are respected, and passed through the riak_client to the FSM.
* Handoff deletes

* Make delete handoff configurable
* Log Fragmentation

With the leveled backend, memory fragmentation is an ongoing concern.  So, here the regular compaction callback is used to log information about carrier sizes and carrier block sizes for key allocators.

* Correct log format

* Update riak_kv_leveled_backend.erl

* Update to account for review comments

* Spawn function to log

There may be a very large number of allocators, so the cost of calling recon_alloc:fragmentation/1 and processing the outputs could be high - don't lock the vnode in this case (see the sketch after this commit list).

* Don't make log function depend on format of recon_alloc:fragmentation/1

* Update rebar.config
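A rough sketch of the approach described above, not the backend's actual code: gather recon_alloc:fragmentation/1 in a spawned process so the vnode is not blocked, and log the returned terms without assuming their format.  The lager:info call stands in for whichever logging call the backend uses.

```erlang
%% Illustrative sketch - helper name is hypothetical.
log_fragmentation() ->
    spawn(fun() ->
              %% recon_alloc:fragmentation/1 returns one entry per allocator
              %% instance; this may be a long list on large machines.
              Frags = recon_alloc:fragmentation(current),
              lager:info("Allocator fragmentation (~w instances): ~p",
                         [length(Frags), Frags])
          end).
```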
Don't crash vnode in this case
When fetching an object for replication, enough nodes are found if the expected clock matches - but that might then fail on node_confirms.  Don't generate the node_confirms error response if there has been a match on the expected clock.
Avoid overloading the eraser/reaper process mailbox by sending the requests in batches (as already happened with range_repl), and waiting for a response.

When a job is used, not local, the batching is done from the clusteraae_fsm.  This mechanism existed prior to this commit, and has not been changed, but has been extended to support the last-batch overflow.
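A generic sketch of the batching pattern described (the request message and helper are hypothetical): accumulate a batch, then make a synchronous call so the fold cannot run ahead of the worker's mailbox.

```erlang
%% Hypothetical batching helper - the worker's actual protocol may differ.
send_in_batches([], _Worker, _BatchSize) ->
    ok;
send_in_batches(Keys, Worker, BatchSize) ->
    {Batch, Rest} = lists:split(min(BatchSize, length(Keys)), Keys),
    %% A synchronous call (rather than a cast per key) means the sender waits
    %% for an ack, so the mailbox never holds more than one batch at a time.
    ok = gen_server:call(Worker, {request_batch, Batch}, infinity),
    send_in_batches(Rest, Worker, BatchSize).
```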
* Use 'RR' when a prunable vclock is replicated

There may be some situations whereby a vector clock grows beyond the prescribed limits on the source cluster - in particular following read repair.

In these cases the new object needs to be replicated but with the same resulting vector clock (assuming no siblings).  If the same vector clock does not result on the sink - any full-sync operation may continuously detect the delta, but not be able to resolve it (as the sink vnodes prune each time).

The 'rr' option will, on riak_kv_vnode, ensure pruning is bypassed so that we avoid pruning on a sink, if we have not pruned on a source.  The 'rr' option is only used when the clock is prunable (as otherwise the delta could occur in the reverse direction).

The 'rr' option also blocks some sibling constraint checks (e.g. the maximum number of siblings).  However, as the most likely cause of it being applied is 'rr' on the src side - this is still generally a win for replication consistency.

* Switch logic to put_fsm

We already know the bucket props at this point.  The case only needs to be considered when `asis` - so this should also work for riak_repl AAE full-sync (a hedged sketch of the decision follows below).

* Lose a line
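A hedged sketch of the decision as described; the prunability helper is hypothetical, and the actual check lives in the put_fsm using the bucket's prune parameters.

```erlang
%% Illustrative only - helper names are hypothetical.
replication_put_options(true = _Asis, Clock, BucketProps) ->
    case clock_prunable(Clock, BucketProps) of
        true  -> [rr];  %% bypass pruning on the sink, as none happened on the source
        false -> []     %% leave pruning enabled, so deltas cannot flow the other way
    end;
replication_put_options(false, _Clock, _BucketProps) ->
    [].
```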