v24.1.1

@vbotbuildovich vbotbuildovich released this 01 May 01:27
· 540 commits to dev since this release
b5ade3f

New Features

  • Adds new cluster and topic level configurations for write caching feature. by @bharathv in #16924
  • PR #17009 write caching - raft implementation by @bharathv
  • Enables write caching by default in dev container mode. by @bharathv in #17677
  • Add rpk security roles, a new command space to manage your Redpanda roles. by @r-vasquez in #17538
  • Introduce --allow-role and --deny-role flags for rpk acl commands by @oleiman in #17416
  • Introduces GET /v1/security/users/roles (Admin API) by @oleiman in #17155
  • Introduces /v1/security/roles/{role}/members Admin API endpoint for reading and updating RBAC role members. by @oleiman in #17153
  • #17679 rpk security acl list now supports --format=json by @rockwotj in #17684
  • Data Transforms now support writing to multiple output topics. The REDPANDA_OUTPUT_TOPIC environment variable exposed in transforms has been removed and replaced by REDPANDA_OUTPUT_TOPIC_%d, one for each output topic specified. by @rockwotj in #16946
  • rpk transform deploy now supports multiple output topics by @rockwotj in #16950
  • The golang transform-sdk gains the ability to write to multiple output topics.
    This feature can only be used in Redpanda v24.1.x or newer. by @rockwotj in #16978
  • The rust transform-sdk gains the ability to write to multiple output topics.
    This feature can only be used in Redpanda v24.1.x or newer. by @rockwotj in #17007
  • Publish log (i.e. stderr/stdout) output from data transforms exclusively to an internally managed Redpanda topic (_redpanda.transform_logs). Data transform logs will no longer appear in broker logs. by @oleiman in #16485
  • Introduce rpk transform logs NAME to view logs for a transform by @rockwotj in #16923
  • #16075 Data Transform's Rust SDK now supports a Schema Registry Client. by @rockwotj in #16464
  • Topic-aware partition balancing, which attempts to spread partition replicas topic-wise across a cluster. This behavior is controlled by the partition_autobalancing_topic_aware config property (enabled by default). by @ztlpn in #17263
  • Tiered Storage now supports using Azure VM user-assigned managed identities for securely accessing
    Azure Blob Storage by @andijcr in #17157
  • Topic recovery and ‘whole-cluster restore’ from Tiered Storage now perform integrity checks on metadata to ensure that each partition can be recovered successfully by @andijcr in #16915
  • You can now create namespaces in Redpanda Cloud using rpk cloud namespace. by @r-vasquez in #16685
  • #13175 rpk debug bundle now includes a CPU profile of the requested nodes. by @r-vasquez in #16414
  • #16107 You can print a schema now using rpk registry schema get --print-schema. by @r-vasquez in #16109
  • #16623 rpk redpanda config bootstrap now supports bootstrapping your advertised addresses configuration. by @r-vasquez in #16652
  • Adds a new metric vectorized_storage_log_compacted_away_bytes for compaction observability in local storage. by @andijcr in #17579
  • Adds a new public metric redpanda_cluster_latest_cluster_metadata_manifest_age to track the age of the cluster_metadata_manifest in cloud storage. by @andijcr in #17404
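
The REDPANDA_OUTPUT_TOPIC_%d naming scheme described above for Data Transforms can be illustrated with a small sketch. This is not SDK code: the helper name `output_topics` is hypothetical, and the sketch assumes the indices start at 0 and are contiguous, which these notes do not state. Python is used only to show the naming pattern; the transform SDKs named in these notes are Go and Rust.

```python
import os
from typing import List, Mapping


def output_topics(env: Mapping[str, str] = os.environ) -> List[str]:
    """Collect output topic names from REDPANDA_OUTPUT_TOPIC_<n> variables.

    Assumption (not confirmed by the release notes): indices start at 0
    and are contiguous, so scanning stops at the first missing index.
    """
    topics = []
    index = 0
    while True:
        name = env.get(f"REDPANDA_OUTPUT_TOPIC_{index}")
        if name is None:
            break
        topics.append(name)
        index += 1
    return topics


if __name__ == "__main__":
    # Example with an explicit mapping instead of the real environment.
    print(output_topics({
        "REDPANDA_OUTPUT_TOPIC_0": "orders-clean",
        "REDPANDA_OUTPUT_TOPIC_1": "orders-errors",
    }))
```

Passing the environment as a mapping keeps the function easy to exercise without touching the real process environment.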

Bug Fixes

  • Aggregates partitions in some cloud storage metrics when the aggregate_metrics cluster config is set to true. by @ballard26 in #16336
  • Fix a bug that could lead to raft log inconsistencies when 2 out of 3 nodes in a configuration are changed. by @ztlpn in #17675
  • Fix a bug that caused Redpanda to ignore, until the next restart, config values that were reset to their defaults. by @ztlpn in #16504
  • Fix a bug where logging in a transform could cause the transform to not make progress. by @rockwotj in #17186
  • Fix a crash that happened when a cluster that was partially in recovery mode tried to upload consumer offsets to cloud storage. by @ztlpn in #17013
  • Fix a memory leak when using transactions with many different producer IDs. by @rockwotj in #15797
  • Fix a potential cloud storage cache access time tracker file corruption during shutdown. by @nvartolomei in #16648
  • Fix a race condition between suffix truncation / delete records and adjacent segment compaction that can lead to crashes and data loss. by @nvartolomei in #17019
  • Fix a rare bug where http client connections would vanish from the connection pool leading to various operations hanging while waiting for an http client. by @nvartolomei in #15681
  • Fix an issue where rpk transform logs waits for records without the --follow flag specified. by @rockwotj in #17832
  • Fix an issue with Cargo.toml when initializing a Rust Data Transform project via rpk transform init by @rockwotj in #15934
  • Fix initial_leader_epoch/KIP-320 handling in fetch requests. It was ignored until now, which prevented consumers from correctly detecting suffix truncation. For Redpanda (and Raft), this is a minor problem since suffix truncation is a very improbable event. by @nvartolomei in #17674
  • Fix internal RPC client connection stall after more than 2^32 requests are sent. by @ztlpn in #16156
  • Fix large allocation in partition manifest. by @dotnwat in #16160
  • Fix oversized allocation in storage. by @Lazin in #16642
  • Fix the starter code for Rust projects in rpk transform init by @rockwotj in #16180
  • Fix tiered-storage housekeeping problem that may cause replaced segments to pile up if the spillover is enabled. by @Lazin in #16163
  • Fixed a few oversized allocations for some admin server endpoints. by @rockwotj in #16551
  • Fixed the values for the rpc client in/out bytes metric by @ballard26 in #17933
  • Fixes rpk transform init --install-deps so that an explicit true value is not needed. by @rockwotj in #17831
  • Fixes a bug in windowed compaction that could cause Redpanda to crash when an error occurs while reading batches. by @andrwng in #16928
  • Fixes a bug of config_frontend methods getting called on shards other than the controller shard. by @pgellert in #17088
  • Fixes a bug that may prevent redpanda from shutting down cleanly when auditing is enabled by @graphcareful in #16315
  • Fixes a concurrency issue in transform offset commits pertaining to taking/applying snapshots. by @bharathv in #17383
  • Fixes a crash if a WebAssembly function is deployed that immediately crashes. by @rockwotj in #15939
  • Fixes a crash that could happen when reading from local storage with a large number of segments that all do not contain user data. by @andrwng in #18075
  • Fixes a plausible correctness issue with idempotent requests during replication failures. by @bharathv in #16706
  • Fixes a race between compaction and Raft recovery for compacted topics that could result in aborted transactional data batches being visible. by @andrwng in #16295
  • Fixes an improper initialization of metrics related to controller snapshot uploads. by @andrwng in #16070
  • Fixes an issue where using the CPU profiler with running Data Transforms could cause the process to deadlock. by @rockwotj in #17877
  • Fixes issue that causes the connection to hang when an unsupported compression type is passed via an incremental_alter_configs request by @graphcareful in #16399
  • Fixes lock starvation during transform offset commits. by @bharathv in #17402
  • Have fetch handler ensure rack awareness is enabled before performing follower fetching by @michael-redpanda in #15883
  • Prevent an assertion from being triggered when Wasm VMs fail immediately. by @rockwotj in #15933
  • Prevent detecting leader epoch advancement when state is not up to date by @mmaslankaprv in #16560
  • Prevent reactor stalls querying leadership information for large clusters by @rockwotj in #17473
  • Protect against a very rare scenario where after node restart, some of the partition replicas hosted on that node could not take part in leader elections. by @ztlpn in #16068
  • Redpanda used to accept an empty string in redpanda.rack in node config. This would cause issues in Kafka operations. Redpanda will now error on startup if redpanda.rack is set to an empty string. by @michael-redpanda in #15835
  • Redpanda will now correctly handle an empty rack ID provided in a fetch request by @michael-redpanda in #15846
  • Reduces maximum log line size from 1MiB to 128KiB to reduce occurrences of memory allocation failures by @michael-redpanda in #17922
  • Report runtime public metrics by task queue for all cores, not just core 0 by @rockwotj in #16154
  • Return a HTTP 400 error code when deploying a transform to a topic that doesn't exist instead of a 500 by @rockwotj in #17011
  • Schema Registry: Deleted schemas no longer reappear after certain compaction patterns on the _schemas topic. by @BenPope in #17091
  • #15042 Fixes a bug in the tiered storage time-based query implementation that could result in a consumer hang when consuming very old data. by @andrwng in #16645
  • #15201 Fix assertion triggered by interleaving of log flush and log truncation followed by append by @Lazin in #16105
  • #15603 cluster config aliases are accepted while reading from yaml by @andijcr in #15605
  • #15674 Fix an issue where new configs would continually revert to legacy defaults after an upgrade. by @oleiman in #15761
  • #15722 #7946 Fix an issue where create topics responses would show incorrect partition count and replication factor by @oleiman in #16410
  • #15811 Several additional metrics will have their "partition" label aggregated away (i.e., into a single series per remaining label set with no partition label, whose value is the sum of all input series with the same label set and different partition labels). This is already the default behavior for most metrics, but this change extends it to almost all remaining metrics. by @travisdowns in #15966
  • #15839 Safer handling of unknown properties in local state by @andijcr in #15840
  • #15909 Prevent oversized allocs when group fetching from many partitions. by @rockwotj in #15918
  • #16129 Fixes a bug in SASL user deletion and update where usernames with a + symbol in the username were prevented from being deleted by @pgellert in #16694
  • #16251 rpk: fixed a bug where the --password flag could not be used along with the new configuration flag -X pass in clusters where basic authentication was enabled. by @r-vasquez in #16278
  • #16259 Fixes a bug that would previously cause read replicas to report the wrong value for the redpanda_kafka_max_offset metric. by @andrwng in #16263
  • #16320 Prevent oversized allocation with large amounts of controller metadata by @rockwotj in #16381
  • #16322 Fix graceful shutdown of the TS archive area retention procedure. by @Lazin in #16382
  • #16402 Fix invalidated iterator access when yielding control to the scheduler in data transforms debug endpoint. by @rockwotj in #16421
  • #16479 Fix timequery error that triggered full partition scan by @Lazin in #16503
  • #16521 Avoid a large contiguous allocation when creating thousands of topics in a single CreateTopics request. by @travisdowns in #16529
  • #16552 #16064 rpk tune --output-script: Add a missing new line in the ballast file tuner when using the --output-script flag by @r-vasquez in #16576
  • #16578 Fixes a bug in CreateTopicsResponse to now return all the configs of the topic, not just the topic-specific override configs. by @pgellert in #16922
  • #16612 fixes small inconsistency between Kafka and Redpanda when trying to query end_offset of an empty log by @mmaslankaprv in #17789
  • #16643 Fixed deleting Data Transforms with names that had URL unsafe characters by @rockwotj in #16858
  • #16840 Fixes a bug with TLS metrics where expiration timestamps would not advance on certificate reload by @oleiman in #17233
  • #17086 fixed enabling cloud storage in existing clusters by @mmaslankaprv in #17112
  • #17731 Fix incorrect log truncations caused by delayed replication requests. by @ztlpn in #17895
  • #17788 Fix problem in Tiered-Storage that could potentially cause consumers to get stuck by @Lazin in #17805
  • #17827 fix a race between eviction and producer registration that results in an invalid transaction state. by @bharathv in #17880
  • #2225 Fix reported config source for cleanup.policy by reporting DEFAULT_CONFIG instead of DYNAMIC_TOPIC_CONFIG for the default value. by @pgellert in #17456
  • ext4 is no longer incorrectly detected as ext2 (all of ext2, 3 and 4 are assumed to be ext4). by @travisdowns in #13496
  • fixed a problem leading to a use-after-free (UAF) error while calculating cloud storage usage by @mmaslankaprv in #17975
  • fixed incorrect fetch offset validation by @mmaslankaprv in #16146
  • fixes a bug where compacting away the last aborted data batch in a segment could leave a gap in the segment ranges, causing readers to get stuck and blocking raft recovery by @andrwng in #16295
  • prevents partial consumer group recovery by @mmaslankaprv in #17673
  • rpk: prevent a segfault when creating a profile from a cloud that is not in ready state. by @r-vasquez in #17553
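
The "partition" label aggregation described in #15811 above (one series per remaining label set, whose value is the sum over partitions) can be sketched as follows. This illustrates the arithmetic only and is not Redpanda's implementation; the function name `aggregate_partition_label`, the `(labels, value)` input shape, and the sample label names are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

LabelSet = Tuple[Tuple[str, str], ...]


def aggregate_partition_label(
    series: Iterable[Tuple[Dict[str, str], float]],
) -> Dict[LabelSet, float]:
    """Sum series that differ only in their "partition" label.

    Each input item is (labels, value). The result maps the remaining
    label set (sorted, with "partition" dropped) to the summed value.
    """
    totals: Dict[LabelSet, float] = defaultdict(float)
    for labels, value in series:
        # Drop the partition label; sort so that label order does not matter.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != "partition"))
        totals[key] += value
    return dict(totals)


if __name__ == "__main__":
    samples = [
        ({"topic": "orders", "partition": "0"}, 10.0),
        ({"topic": "orders", "partition": "1"}, 5.0),
        ({"topic": "logs", "partition": "0"}, 2.0),
    ]
    for key, total in aggregate_partition_label(samples).items():
        print(dict(key), total)
```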

Improvements

  • Add new upload management mechanism by @Lazin in #16684
  • Add an internal debugging endpoint to see the committed progress of data transforms. by @rockwotj in #16185
  • Add Prometheus metrics for data transforms logging by @oleiman in #16566
  • Introduce "trust_file_crc32c" metric to export a checksum for each trust file in the system. by @oleiman in #17539
  • Marks the fetch_read_strategy cluster configuration property as deprecated by @ballard26 in #17191
  • Record result of running microbenchmarks in a file using JSON format by @mfleming in #15912
  • #14069 Adds wait_ms parameter to CPU profiler admin API. The API will wait for wait_ms milliseconds then return the profile samples collected during that period of time. by @ballard26 in #14468
  • spillover manifests are enabled by default for clusters that did not explicitly set a value or null by @andijcr in #16172
  • new metric vectorized_ntp_archiver_compacted_replaced_bytes by @andijcr in #17627
  • Removes plan_and_execute_latency_us histograms from internal metrics by @ballard26 in #17191
  • A clearer error message when the data directory does not exist or is the wrong type by @travisdowns in #15821
  • Add a dedicated CPU scheduling policy for Data Transforms by @rockwotj in #16114
  • Add error messages when an unsupported Schema Registry client is used in Data Transforms by @rockwotj in #17591
  • Adds UpdateRoleMembership function to rpk admin API client. by @bojand in #17589
  • Adds a new cluster configuration property fetch_read_strategy. This property determines which fetch execution strategy Redpanda will use to fulfill a fetch request. The newly introduced non_polling execution strategy is the default for this property with the polling strategy being included to make backporting possible. by @ballard26 in #15328
  • Adds a new public metric redpanda_raft_recovery_partition_movement_consumed_bandwidth that tracks how much bandwidth is currently in use for raft recovery. This helps tune raft_learner_recovery_rate. by @bharathv in #16842
  • Adds metric for number of quorum ack requests with write caching. by @bharathv in #17179
  • Adds new cluster configuration parameter minimum_topic_replications that permits Redpanda administrators to set a minimum permitted replication factor for newly created topics. by @michael-redpanda in #15777
  • Adds observability into producer evictions in each shard. by @bharathv in #16724
  • Caches the connection's local address, removing the need to make a system call to fetch this value when auditing events. by @graphcareful in #15931
  • Changes what the kafka_latency_fetch_latency metric measures to be the time the first fetch_ntps_in_parallel takes. by @ballard26 in #17720
  • Data Transforms written in Golang now use a non-buffered write mechanism by @rockwotj in #15922
  • Data Transforms written in Rust now use a non-buffered write mechanism by @rockwotj in #15923
  • Fix large wasm module deployments by @rockwotj in #16722
  • Handle missing data transform logs topic in rpk transform logs by @rockwotj in #17830
  • Improve concurrency control and metadata consistency in Tiered-Storage. by @Lazin in #16774
  • Improve the log segment upload mechanism to avoid reading files from disk directly. by @Lazin in #16999
  • Improved handling of follower fetching offset validation when used with relaxed consistency by @mmaslankaprv in #16475
  • Improves observability by allowing Redpanda to detect that some internal processes are stuck. by @Lazin in #16466
  • Increase data_transforms_logging_buffer_capacity_bytes from 100KiB to 500KiB by @oleiman in #16972
  • Introduces a new non-polling fetch execution strategy that decreases CPU utilization of fetch requests and fetch request latency. by @ballard26 in #15328
  • Large allocations are now logged by default (similar to reactor stalls) by @StephanDollberg in #16828
  • Making the Redpanda README.md more beautiful and functional! by @WillemKauf in #17822
  • Prevent metadata oversized allocations when requesting many topics. by @rockwotj in #15751
  • Publish total reclaimable space to avoid stuck decommission scenario. by @dotnwat in #16354
  • Reduces the number of allocations performed by the auditing subsystem by @graphcareful in #16056
  • SIMD instructions are generated by default for WebAssembly binaries when building with rpk. by @rockwotj in #16296
  • Schema Registry: Improve retry logic for delete_config and delete_subject_permanent by @BenPope in #17905
  • Schema Registry: Improve tombstoning when deleting a subject by @BenPope in #17905
  • Support changing the timeout for WebAssembly functions by @rockwotj in #15967
  • Support dynamically changing the limit for WebAssembly binary size by @rockwotj in #15967
  • The full_raft_configuration_recovery_pattern cluster config property is now deprecated and will be ignored by Redpanda. by @andrwng in #16890
  • This PR partially reverts the change such that strict retention remains enabled after upgrade unless it had been explicitly disabled before the upgrade. by @dotnwat in #16077
  • Transform SDKs for Schema Registry now publish their ABI version for compatibility checks with the broker. by @rockwotj in #17593
  • Transforms now start at a record relative to the deploy time instead of the time at which the VM starts.
    This allows for easier testing of transforms, as one does not have to wait for the VM to boot before producing. by @rockwotj in #17590
  • Validate transform code at deploy time to ensure the correct SDK is used. by @rockwotj in #16290
  • #13023 rpk container now prints the container logs in case of failures during startup. by @r-vasquez in #17780
  • #14116 SR/PP will now reply with a 500 error if an internal service semaphore has been completely exhausted by @michael-redpanda in #15977
  • #14814 Added new metric to provide Follower Fetching feature observability by @mmaslankaprv in #15799
  • #15818 rpk cluster health: now --exit-when-healthy enables --watch when provided. by @r-vasquez in #16106
  • #15900 Internal kafka client now uses asynchronous compression (when possible) to reduce possibility of oversized allocations and reactor stalls by @michael-redpanda in #15920
  • #16162 get_cluster_uuid returns a correctly formatted string by @andijcr in #16183
  • #16552 #16064 rpk tune --output-script: rpk now creates a file for you if the provided file does not exist. by @r-vasquez in #16576
  • #16758 cluster: Avoid oversize allocs for topic creation and configuration by @BenPope in #16982
  • #16771 Data Transform builds in rpk now uses tinygo v0.31.1 by @rockwotj in #16856
  • #16795 Added ability to change transactional manage topic properties by @mmaslankaprv in #16797
  • #16901 Added EHOSTUNREACH to retry-able error code list by @michael-redpanda in #16902
  • #17197 more accurate node status reporting by @mmaslankaprv in #17625
  • #17368 Improves error feedback when Redpanda is given an invalid number of partitions during either topic creation or when the partition count for a topic is increased. by @michael-redpanda in #17369
  • #17925 Allocate and rebalance partition replicas in random order to prevent an undesirable pattern when many partitions have the same replica set. by @ztlpn in #17962
  • #8809 Node-wide throughput throttling is now fair and responsive. by @BenPope in #16441
  • [rpk] more informative error message display on create topic failure by @michael-redpanda in #15837
  • rpk acl user has been moved to rpk security user. Old command is soft deprecated. by @r-vasquez in #17664
  • rpk acl has been moved to rpk security acl. Old command is soft deprecated. by @r-vasquez in #17664
  • rpk profile has been reworked in an attempt to be simpler; see PR #17038 for more detail by @twmb in #17038
  • rpk redpanda start now knows about IO characteristics on EC2 instance types i4i, is4gen and im4gn in addition to i3en and i3, allowing more effective IO scheduling even if rpk iotune has not been run. by @travisdowns in #17220
  • rpk transform deploy --file now supports https:// URLs by @rockwotj in #16050
  • rpk transform deploy takes a --file flag to deploy a compiled WebAssembly binary. by @rockwotj in #15932
  • verbose_logging_timeout_max node config, to prevent setting TRACE and DEBUG log level indefinitely by @oleiman in #15798
  • better ERROR|WARN messages for candidate_creation_error by @andijcr in #17090
  • better control of memory usage in storage layer. by @mmaslankaprv in #16846
  • ensure only positive values are accepted for log_segment_ms_min/max by @andijcr in #16513
  • largely reduced number of health report copies by @mmaslankaprv in #17863
  • less overhead of health report collection by @mmaslankaprv in #17158
  • minimising the possibility of data loss when using relaxed consistency replication by @mmaslankaprv in #17915
  • new metric redpanda_cloud_storage_cloud_log_size reports size in bytes of the log in cloud storage by @andijcr in #17445
  • optimized updating leadership metadata with health reports by @mmaslankaprv in #16512
  • preventing large allocation in partition balancer code by @mmaslankaprv in #16917
  • rpk: Remove 10s timeout in rpk profile create by @r-vasquez in #16836
  • rpk: change hash for backup filenames to use sha256 by @andrewhsu in #17568
  • rpk: tune script uses sha256sum instead of md5sum by @andrewhsu in #17595
  • rpk: use go module version of common proto definitions. by @bojand in #17628
  • skipping overhead of collecting node health report for each node separately. by @mmaslankaprv in #17715
  • smaller memory footprint when running with a large number of topics that have a small partition count by @mmaslankaprv in #16247

Full Changelog: v23.3.1...v24.1.1