Releases: Derecho-Project/derecho

v2.4.1rc

09 May 22:29
323f068
v2.4.1rc Pre-release

This is a pre-release of a new API, implemented in PR #279, that allows removing a copy during persistence. The new API is not compatible with v2.4.0.

What's Changed

Full Changelog: v2.4.0...v2.4.1rc

v2.4.0

09 May 17:48
v2.4.0
7109b15

New Features

  • Out-of-Band RDMA transfers (i.e. using buffers not managed by Derecho) - PR #256
  • GPU-direct and CUDA support for out-of-band transfers - PRs #273 and #275. This is enabled with the new compile-time configuration flag ENABLE_HMEM.
  • Optionally split derecho.cfg into 2 configuration files (derecho.cfg and derecho_node.cfg) - PR #268
  • Configuration option renamed from "leader IP" to "contact IP" - PR #269
  • Allow Replicated objects to get new-view callbacks - PR #250
  • Split debug logs into multiple modules with independent log levels - PR #254
  • Persistent<T> objects throw exceptions derived from std::exception instead of throwing integers - PR #253
  • Configuration-related string constants are exposed as constants rather than macros - PR #260
  • External clients can gracefully exit instead of abruptly closing their connection - PR #266
  • New methods on GroupProjection pointers to allow RPC methods to read group configuration settings - PR #265
  • Global stability callback is not triggered for RPC messages delivered to Replicated objects, only for "raw" Derecho multicast messages. This is not documented in a PR but is a significant API change. (Changed in commit eb5b281)
  • Type aliases node_id_t and ip_addr_t are contained within the derecho:: namespace instead of being placed in the global namespace, to reduce conflicts with other libraries. (Changed in commit 9a0e5ad)
  • Macros that enable/disable features at compile time (e.g. USE_VERBS_API) are written into a generated config.h file by CMake, instead of passed to every compiler invocation as a -D flag
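
Since Persistent<T> objects now throw exceptions derived from std::exception rather than integers (PR #253), application code that used to catch an int can catch the standard base class instead. The following is a minimal sketch of that pattern, not Derecho code: persistent_error, load_version, and try_load are hypothetical names chosen for illustration, and only the std::exception base is taken from the notes above.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for an exception type in the persistence layer.
// The real class names in Derecho may differ; only the std::exception base
// is taken from the release notes.
struct persistent_error : public std::runtime_error {
    explicit persistent_error(const std::string& what_arg)
        : std::runtime_error(what_arg) {}
};

// Hypothetical operation that fails for negative version numbers.
int load_version(long version) {
    if (version < 0) {
        throw persistent_error("invalid version " + std::to_string(version));
    }
    return 0;
}

// Catching by const std::exception& handles any exception in the hierarchy
// and exposes a readable message, which a thrown int could not provide.
std::string try_load(long version) {
    try {
        load_version(version);
        return "ok";
    } catch (const std::exception& e) {
        return std::string("error: ") + e.what();
    }
}
```

A handler written as catch (int) would no longer match anything thrown this way, so such handlers probably need to be updated when moving to v2.4.0.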

Bug fixes

  • Serialization problems related to inconsistent usage of context_ptr<const T> - issue #204, PRs #240, #270
  • Node crashes caused by an external client connecting while the group is starting - PR #243
  • Node crashes during total restart due to new members being unaware of the total restart - issue #252, PR #257
  • Node configured as the restart leader could not rejoin if it crashed - PR #269
  • Minor errors and inconsistencies in CMake files - PRs #255, #259
  • Dependency-installation scripts neglecting to clean up their temporary work directories - PR #258
  • Timestamp order could be inconsistent among replicas if clocks drifted - PR #267
  • Incomplete synchronization on graceful shutdowns - PR #263
  • Signed persistent logs did not properly handle Delta-supporting Persistent objects in some cases - PR #271

v2.3.0

19 May 18:24
v2.3.0
3aef3d8

This version includes some API changes as well as new features.

Notable Changes

  • The type ExternalCaller<T> is now named PeerCaller<T> to reflect the fact that it is not "external" to the Derecho group; it represents a Derecho group member that is not in subgroup type T (but is in the same top-level group as type T).
  • The type ExternalGroup is now named ExternalGroupClient to reflect the fact that it represents a client process that will communicate with the Derecho group, not the entire group or a group member.
  • The bundled mutils-serialization library now uses uint8_t* instead of char* as the type that represents a "pointer to a plain byte array" (in the to_bytes and from_bytes functions). The serialization functions for Derecho objects, and DEFAULT_SERIALIZATION_SUPPORT macro for user-defined replicated types, have been correspondingly updated. See issue #218 and pull request #223.
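
To illustrate the move from char* to uint8_t* for plain byte arrays, here is a small self-contained sketch of a type that serializes itself into a uint8_t buffer. The Point type is hypothetical, and the member signatures are assumptions made for this example; they are not copied from mutils-serialization.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical user type. The member names echo the to_bytes and from_bytes
// functions mentioned above, but the signatures are illustrative assumptions.
struct Point {
    std::int32_t x;
    std::int32_t y;

    std::size_t bytes_size() const { return sizeof(x) + sizeof(y); }

    // Writes the fields into a plain byte array; returns the bytes written.
    std::size_t to_bytes(std::uint8_t* buffer) const {
        std::memcpy(buffer, &x, sizeof(x));
        std::memcpy(buffer + sizeof(x), &y, sizeof(y));
        return bytes_size();
    }

    // Reconstructs a Point from a byte array produced by to_bytes.
    static Point from_bytes(const std::uint8_t* buffer) {
        Point p{};
        std::memcpy(&p.x, buffer, sizeof(p.x));
        std::memcpy(&p.y, buffer + sizeof(p.x), sizeof(p.y));
        return p;
    }
};
```

User code that previously declared char buffers for serialization would likely need to change the buffer type (or add explicit casts) to match the new pointer type.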

New Features

  • External clients (processes running outside the Derecho group) can now be sent notifications by members of the Derecho group. To enable this feature, the Derecho subgroup (replicated type) that the client communicates with must inherit from NotificationSupport, and register the notify method as P2P-callable. The macro REGISTER_RPC_FUNCTIONS_WITH_NOTIFICATION can be used instead of REGISTER_RPC_FUNCTIONS when declaring the subgroup's class, in order to ensure that notify is registered. See pull request #239
  • Persistent objects now have a getDeltaSignature<DeltaType>() method that can retrieve a signature from a Delta-supporting Persistent object only if it matches a user-provided search function. This is similar to the existing getDelta<DeltaType>() function that accepts a user-provided function as a parameter. See pull request #220
  • Group has a new get_num_subgroups<SubgroupType>() method, which returns the number of subgroups of the same type that exist in the current configuration, and a new get_my_subgroup_indexes<SubgroupType>() method, which returns a vector of subgroup indexes (of that type) that the local node belongs to.
  • Replicated<T> now has the methods get_global_persistence_frontier() and get_global_verified_frontier(), which allow application code to learn the highest version number that has reached global-persistence stability or global-signature stability. In addition, the method wait_for_global_persistence_frontier() will block until a specified version number has reached global persistence. See pull request #225
  • RPC-callable functions in replicated types can now learn the ID of the calling node by calling _Group::get_rpc_caller_id() within their function body. See #227 and #228.

Bugs fixed

  • The P2P send mechanism was not thread-safe and could suffer from a race condition between the P2P sending thread and the SST predicates (RPC-handling) thread. This was fixed by removing internal state from the P2PConnection object so that it could be accessed concurrently without the concurrent threads modifying shared data. See #217
  • Some internal Derecho files incorrectly used #include with < > instead of #include with quotes to include other Derecho files, which could cause compile errors when trying to rebuild the library after it is already installed: the < > syntax defaults to searching system library locations before files in the local source tree. The double-quotes syntax should always be used to refer to files within the same library.
  • CMakeLists.txt had some errors in the way it packaged the Derecho library for installation. It should be capable of handling custom installation locations now.
  • Throughput was lower than expected in groups that sent large numbers of both P2P and ordered messages. This turned out to be caused by high contention between the P2P thread and the predicates thread for the RDMA queue pair managed by LibFabric: LibFabric internally used a biased spinlock to protect this resource, and one thread would end up starving during periods of contention. We now require LibFabric to be configured with spinlocks disabled, so that it uses fair mutexes instead (see commit ae97bea)

v2.2.2

22 Oct 19:20
v2.2.2
066253c

More bug fixes have been implemented and tested. This version should be used in preference to 2.2.1 or 2.2.0, since it's much more stable.

Bugs Fixed

  • View changes could get "stuck" for a variety of reasons if many nodes joined and left in a short period of time, as documented in issue #213. Fixed in #216.
  • Nodes that issued several concurrent multicasts could become deadlocked in RemoteInvoker::receive_response because all calls to the same function's receive_response would share the same receive_response_mutex. This bug was actually introduced in #211 when we changed the way responses were delivered to PendingResults objects in order to fix another bug; previously, there was no mutex in receive_response. Also fixed in #216.
  • The report_failure callback in RPCManager, called by P2PConnectionManager, could deadlock trying to acquire view_mutex while holding a p2p_connection_mutex. Fixed by making RPCManager keep track of external connections on its own, so it doesn't need to acquire view_mutex at all (also in #216).
  • Group members that handle P2P messages from external clients could crash if they attempted to send a reply to an external client after it disconnected, as documented in #214. Fixed by ee9a622

Other Improvements

  • CMakeLists.txt now declares a more recent CMake version, specifically 3.15.4 rather than 2.8.1. This reflects the version of CMake we've actually been using, and avoids generating warnings on newer systems (CMake 3.21 has started emitting warnings if the version required in CMakeLists.txt is older than 2.8.12).
  • CMakeLists.txt now specifies that we require the C++17 standard to compile.
  • Nodes produce fewer warnings and errors when shutting down "cleanly." A node that marks itself as failed will no longer attempt to freeze its own SST row (which causes a segmentation fault), and a leader that marks itself as failed will no longer throw an exception or warn about a potential partitioning event. (Fixed in a3443bb and 64c0396)

v2.2.1

27 Aug 17:39
v2.2.1
d3aef87

This is a minor release just to ensure an important bug fix is available.

New Features

  • QueryResults objects can now be polled in a non-blocking manner by calling QueryResults::is_ready(), before calling get() on either the QueryResults itself or one of the reply futures contained in its ReplyMap. Added in pull request #209.
  • The JSON-formatted subgroup layout file can now specify which reserved node IDs should be configured as senders, and which should be configured as non-senders. Added in #210.
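
The non-blocking polling idea behind QueryResults::is_ready() can be illustrated with the standard library alone. The sketch below is an analogy built on std::future, not Derecho's implementation; the free function is_ready here is a hypothetical helper.

```cpp
#include <chrono>
#include <future>

// Returns true if the future already holds a value or an exception, without
// blocking. This mirrors the idea of polling before calling get(); it is not
// Derecho code.
template <typename T>
bool is_ready(const std::future<T>& f) {
    return f.valid() &&
           f.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
}
```

A caller can loop on such a check while doing other work and call get() only once it returns true, at which point get() should return without waiting.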

Bugs Fixed

  • The fixed-size array of PendingResults objects could overflow if a node generated more than 4096 concurrent RPC requests, as documented in #205. This was fixed in #211 by making PendingResults heap-allocated instead.
  • The new-view callback in RPCManager could get stuck in an infinite loop due to a mistake in iterator usage. Fixed in 25df1d2.

v2.2.0

20 Jul 17:57
v2.2.0
03edaf8

This version adds some new features needed by Cascade, and fixes several bugs discovered since our last release.

New Features

  • RPC functions on Replicated Objects must now be labeled as either P2P-callable or ordered-callable using tag_p2p or tag_ordered (instead of the previous tag function). The macros P2P_TARGETS and ORDERED_TARGETS can be used within REGISTER_RPC_FUNCTIONS to tag functions appropriately when writing a Replicated Object class. A P2P-callable function should not modify any replicated object state (and must be const), while an ordered-callable function can modify the object's state but cannot be called with a p2p_send. See #178 and #186.
  • QueryResults objects returned from ordered_send calls can now be used to determine whether the new version (object state) created by the ordered_send has finished persisting. QueryResults::await_local_persistence() blocks until the version has finished persisting locally, and QueryResults::await_global_persistence() blocks until the version has finished persisting on all replicas. These functions have the same semantics as std::future<void>::get(), so they can only be called once. See #167 and #194.
  • The DefaultSubgroupAllocator can be configured by reading a JSON string specified in derecho.cfg instead of by constructing SubgroupAllocationPolicy objects. In addition, it now has the ability to "reserve" certain node IDs for certain shards, instead of always assigning them in a round-robin fashion. More details are documented in README.md; also see #206.
  • If a node catches an exception (derived from std::exception) while processing an incoming RPC function call, it now returns the exception's description to the caller. This means derecho::remote_exception_occurred will produce a more useful error message when it is thrown on the caller's side, instead of simply stating that some kind of exception occurred while invoking an RPC function. This was added while fixing #198.
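
The notes above say the persistence-await functions share the semantics of std::future<void>::get() and can therefore be called only once. The following sketch shows what that one-shot behavior looks like for a plain std::future; await_once is a hypothetical helper written for this example and is not part of Derecho.

```cpp
#include <future>

// Hypothetical helper, not part of Derecho: waits on a one-shot future and
// reports whether the wait actually happened. std::future<void>::get()
// releases the shared state, so valid() is false afterwards and a second
// get() would be undefined behavior.
bool await_once(std::future<void>& f) {
    if (!f.valid()) {
        return false;  // already consumed, or never associated with a result
    }
    f.get();  // blocks until the result is ready, then invalidates f
    return true;
}
```

Code that needs to observe persistence from several places would, under this model, have to record the outcome of the first wait rather than waiting again.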

Bugs Fixed

  • A potential deadlock between the predicate-handling thread and the P2P-message thread when a new external client joins the group. Details in #195 (and #197)
  • Sending an RPC reply that exceeds max_reply_payload_size would cause the recipient of the reply to segfault. This now causes an exception on the sender of the reply, which is sent back to the receiver in a remote_exception_occurred message. See #198
  • A few test programs needed to be updated with bug fixes discovered and applied in other test programs: #183, #189, #191

Dependencies

  • Moved to libfabric v1.12.1 (see #199)
  • Added a dependency on nlohmann_json 3.9.0 (for the new JSON layout feature)

v2.1.0

18 Nov 03:26
v2.1.0
bf1554b

Added a new feature, and fixed a few bugs.

New Features

  • Persistent objects (i.e. Replicated objects with Persistent fields) can now generate signed logs of their update history, as described in pull request #179
  • A small wrapper library around OpenSSL is implemented in the openssl/ directory; this is used to support the signed-logs feature
  • New and improved performance tests in the applications/tests/performance_tests/ directory
  • derecho::Group can be constructed with multiple DeserializationContexts, in case each subgroup needs its own DeserializationContext (see issue #162)
  • The Persistent<T> type now has a getDelta() function (and associated getDeltaByIndex()) that can be used when a Persistent field supports delta-based logs, as described in pull request #173
  • Added some new accessor functions to ExternalGroup (the external client class) for retrieving the current number of subgroups and shards

Bugfixes

Fixed the following bugs:

  • Our copy of the mutils-serialization library was missing support for some STL containers due to missing forward-declarations (#170)
  • The RDMC sending thread unnecessarily waited for persistence to finish (#176)

v2.0.1

26 Jun 21:50
v2.0.1

A few small but important bug fixes over v2.0.

Bugfixes

  • Fixed a regression where external clients failed to construct a tcp_connections object because tcp_connections started asserting that ip_addrs_and_ports was not empty; tcp_connections now allows this parameter to be empty, as it is for external clients.
  • Fixed inconsistent usage of -1 and INT64_MAX to represent an invalid index within Persistent logs; invalid indexes now always use the constant INVALID_INDEX.
  • Fixed a bug in persistent_test (src/persistent/test.cpp) that could cause a stack overflow.
  • Updated the Derecho version number encoded in CMakeLists.txt to reflect the current Derecho version number.
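
The INVALID_INDEX fix is an instance of using a single sentinel constant instead of several ad hoc values. The sketch below shows the pattern in isolation; the value assigned to INVALID_INDEX and the helper functions are assumptions made for illustration, since the notes above only state that one named constant replaced the mix of -1 and INT64_MAX.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Assumed value, for illustration only.
constexpr std::int64_t INVALID_INDEX = std::numeric_limits<std::int64_t>::max();

// Hypothetical lookup: returns the position of `version` in the log,
// or INVALID_INDEX if the version is absent.
std::int64_t find_index(const std::vector<std::int64_t>& log_versions,
                        std::int64_t version) {
    for (std::size_t i = 0; i < log_versions.size(); ++i) {
        if (log_versions[i] == version) {
            return static_cast<std::int64_t>(i);
        }
    }
    return INVALID_INDEX;
}

// Every caller tests against the same constant, so a producer that returned
// -1 and a consumer that checked for INT64_MAX can no longer disagree.
bool is_valid_index(std::int64_t idx) { return idx != INVALID_INDEX; }
```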

v2.0.0

25 Jun 22:28

Major improvements since v0.9.2

New Features

  • Enabled external client API.
  • Revived code for using ibverbs API with flow control.
  • Derecho can now be configured with a "restart leader" distinct from its normal leader, and if the enable_backup_restart_leaders option is True, it can also use a list of multiple restart leaders in priority order.
  • Added a build script.
  • The type T in Replicated is now aware of the subgroup ID it belongs to.
  • Moved the ObjectStore out to the new Cascade project.
  • Renamed rpc_port to state_transfer_port in configuration.

Bugfixes

  • Fixed the completion queue overrun issue.
  • Avoided relying on ibv_wc::wr_id (or fi_cq_err_entry::op_context) to determine which remote node failed when a request posted to a queue pair fails. Now uses timeout logic to detect the failure.
  • Refactored p2p_connection code.
  • Persistent state is now stored in a file named by a SHA256 hash string instead of a type string, which could exceed the filename length limit.
  • Fixed TCP listen() backlog issue contributing to the slow startup with many nodes.

Dependencies

  • Moved to spdlog v1.3.1
  • Moved to libfabric v1.7.0

Known Issues

  • An update to a subgroup type with Persistent fields is acknowledged (by derecho::rpc::QueryResults<>) once all shard members have delivered the update. Persistence is processed in the background, off the critical path. Applications must use local_persistence_callback or global_persistence_callback in the group constructor to learn when updates have been persisted locally or globally.
 /**
  * Bundles together a set of callback functions for message delivery events.
  * These will be invoked by MulticastGroup or ViewManager to hand control back
  * to the client if it wants to implement custom logic to respond to each
  * message's arrival. (Note, this is a client-facing constructor argument,
  * not an internal data structure).
  */
 struct CallbackSet {
     message_callback_t global_stability_callback;
     persistence_callback_t local_persistence_callback = nullptr;
     persistence_callback_t global_persistence_callback = nullptr;
 };
 
  • A Derecho node cannot yet join multiple subgroups. We plan to add this feature (overlapping subgroups) in the near future.
  • Slow startup: RDMC takes several seconds to start. See #160 for more information.
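
As a rough illustration of the first known issue, the sketch below fills in a CallbackSet so that the application learns when a version has been persisted globally. It is a self-contained approximation: the callback type aliases are simplified stand-ins (the real message_callback_t and persistence_callback_t are defined by Derecho and take different parameters), and PersistenceTracker is a hypothetical helper.

```cpp
#include <cstdint>
#include <functional>

// Simplified stand-ins so the sketch compiles on its own; these are
// assumptions, not the definitions used by Derecho.
using version_t = std::int64_t;
using subgroup_id_t = std::uint32_t;
using message_callback_t = std::function<void(subgroup_id_t, version_t)>;
using persistence_callback_t = std::function<void(subgroup_id_t, version_t)>;

struct CallbackSet {
    message_callback_t global_stability_callback;
    persistence_callback_t local_persistence_callback = nullptr;
    persistence_callback_t global_persistence_callback = nullptr;
};

// Hypothetical helper that records the highest version reported as
// globally persisted. It must outlive the callbacks it hands out, because
// the lambda captures `this`.
struct PersistenceTracker {
    version_t latest_global = -1;

    CallbackSet make_callbacks() {
        CallbackSet callbacks;
        callbacks.global_persistence_callback =
            [this](subgroup_id_t, version_t version) {
                if (version > latest_global) {
                    latest_global = version;
                }
            };
        return callbacks;
    }
};
```

In a real application the resulting CallbackSet would be handed to the group constructor, and the update would be treated as durable only after the tracker has seen its version.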

v2.0.0rc

05 May 20:57
7f68235
v2.0.0rc Pre-release

A candidate for the stable Derecho (v2.0.0) release. Because we have added many features since v0.9.2, we are skipping release number v1.

Major improvements since v0.9.2

New Features

  • Enabled external client API.
  • Revived code for using ibverbs API with flow control.
  • Derecho can now be configured with a "restart leader" distinct from its normal leader, and if the enable_backup_restart_leaders option is True, it can also use a list of multiple restart leaders in priority order.
  • Added a build script.
  • The type T in Replicated is now aware of the subgroup ID it belongs to.
  • Moved the ObjectStore out to the new Cascade project.

Bugfixes

  • Fixed the completion queue overrun issue.
  • Avoided relying on ibv_wc::wr_id (or fi_cq_err_entry::op_context) to determine which remote node failed when a request posted to a queue pair fails. Now uses timeout logic to detect the failure.
  • Refactored p2p_connection code.
  • Persistent state is now stored in a file named by a SHA256 hash string instead of a type string, which could exceed the filename length limit.

Dependencies

  • Moved to spdlog v1.3.1
  • Moved to libfabric v1.7.0

Known Issues

  • An update to a subgroup type with Persistent fields is acknowledged (by derecho::rpc::QueryResults<>) once all shard members have delivered the update. Persistence is processed in the background, off the critical path. Applications must use local_persistence_callback or global_persistence_callback in the group constructor to learn when updates have been persisted locally or globally.
 /**
  * Bundles together a set of callback functions for message delivery events.
  * These will be invoked by MulticastGroup or ViewManager to hand control back
  * to the client if it wants to implement custom logic to respond to each
  * message's arrival. (Note, this is a client-facing constructor argument,
  * not an internal data structure).
  */
 struct CallbackSet {
     message_callback_t global_stability_callback;
     persistence_callback_t local_persistence_callback = nullptr;
     persistence_callback_t global_persistence_callback = nullptr;
 };
 
  • A Derecho node cannot yet join multiple subgroups. We plan to add this feature (overlapping subgroups) in the near future.
  • Slow startup: a Derecho application takes several seconds to start.