Kueue v0.6.0-rc.3

Pre-release
@alculquicondor released this 12 Feb 21:00
· 290 commits to main since this release
v0.6.0-rc.3
5a0a714

Changes since v0.5.0:

Changes by Kind

API Change

  • A stopPolicy field in the ClusterQueue allows holding or draining a ClusterQueue (#1299, @trasc)
  • Add MultiKueue garbage collection. (#1643, @trasc)
  • Add Path location type for MultiKueue cluster KubeConfigs (#1640, @trasc)
  • Add the config field .waitForPodsReady.requeuingTimestamp to allow admins to configure the timestamp used when sorting workloads that were evicted due to their Pods not becoming ready in time. (#1542, @nstogner)
  • Basic implementation of MultiKueue for Job. This doesn't include support for live status updates. (#1313, @trasc)
  • Extend the information returned for pending workloads in a cluster queue to include the fields used to determine the workload position, as well as the position itself. (#1362, @PBundyra)
  • Extend visibility API by adding an endpoint that allows a user to fetch information about pending workloads and their position in LocalQueue. (#1365, @PBundyra)
  • Introduces an on-demand API endpoint for fetching pending workloads in a cluster queue (#1251, @PBundyra)
  • Support a backoff re-queueing mechanism for the waitForPodsReady (#1709, @tenzen-y)
  • The OwnerReferences field in PendingWorkload's metadata is now filled with the information about the owning Job (#1378, @PBundyra)
  • The lendingLimit field in ClusterQueue's quotas allows restricting how much of the ClusterQueue's unused resources can be borrowed by other ClusterQueues in the cohort. In other words, a quota equal to nominal-lendingLimit is reserved for exclusive use by the ClusterQueue. (#1385, @B1F030)
  • Visibility.PendingWorkload no longer implements the runtime.Object interface (#1386, @PBundyra)
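
The lendingLimit arithmetic described above can be sketched in a few lines of Go. This is only an illustration of the nominal-lendingLimit relationship, not Kueue's implementation: the function name exclusiveQuota is hypothetical, quantities are plain integers rather than Kubernetes resource quantities, and treating an unset (nil) lendingLimit as "everything unused may be lent" is an assumption.

```go
package main

import "fmt"

// exclusiveQuota returns the part of the nominal quota that other
// ClusterQueues in the cohort cannot borrow: nominal - lendingLimit.
// Assumption: a nil lendingLimit means lending is not restricted,
// so nothing is reserved exclusively.
func exclusiveQuota(nominal int64, lendingLimit *int64) int64 {
	if lendingLimit == nil || *lendingLimit >= nominal {
		return 0
	}
	return nominal - *lendingLimit
}

func main() {
	four := int64(4)
	// With nominal=10 and lendingLimit=4, 6 units stay exclusive.
	fmt.Println(exclusiveQuota(10, &four)) // prints 6
	// Without a lendingLimit, all unused quota may be lent.
	fmt.Println(exclusiveQuota(10, nil)) // prints 0
}
```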

Feature

  • Add HA support for the visibility API (#1554, @astefanutti)

  • Add MultiKueue support for JobSet (#1606, @trasc)

  • Add Prebuilt Workload support for JobSets. (#1575, @trasc)

  • Add events for transitions of the provisioning AdmissionCheck (#1271, @stuton)

  • Add live status updates for multikueue jobs (#1668, @trasc)

  • Add prebuilt workload support for batch/job. (#1358, @trasc)

  • Add support for groups of plain Pods. (#1319, @achernevskii)

  • Add validation for clusterQueue: when cohort is empty, borrowingLimit must be nil. (#1525, @B1F030)

  • Allow configuring featureGates on helm charts. (#1314, @B1F030)

  • Allow decreasing reclaimable pods to 0 for a suspended job (#1277, @yaroslava-serdiuk)

  • At log level 6, the usage of ClusterQueues and cohorts is included in logs.

    The status of the internal cache and queues is also logged on demand when a SIGUSR2 is sent to kueue, regardless of the log level. (#1528, @alculquicondor)

  • Increase the default number of reconcilers for Pod and Workload objects to 5, each. (#1589, @alculquicondor)

  • Jobs preserve their position in the queue if the number of pods changes before being admitted (#1223, @yaroslava-serdiuk)

  • Make the image build setting CGO_ENABLED configurable (#1391, @anishasthana)

  • Fix the RBAC for visibility into LocalQueues (#1412, @PBundyra)

  • Support RayCluster as a queue-able workload in Kueue (#1520, @vicentefb)

  • Support for a mechanism to suspend a running Job without requeueing (#1252, @vicentefb)

  • Support for preemption while borrowing (#1397, @mimowo)

  • Support for retry of provisioning request.

    When ProvisioningACC is enabled, and there are existing ProvisioningRequests, they are going to be recreated.
    This may cause job failures for some long-running jobs which were using the ProvisioningRequests. (#1351, @mimowo)

  • The image gcr.io/k8s-staging-kueue/debug:main, along with the script ./hack/dump_cache.sh, can be used to trigger a dump of the internal cache into the logs. (#1541, @alculquicondor)

  • The leaderElection field in the Configuration API is now defaulted.
    Leader election is now enabled by default. (#1598, @astefanutti)

  • The priority sorting within the cohort can be disabled by setting --prioritySortingWithinCohort to false (#1406, @yaroslava-serdiuk)

  • The Visibility.PendingWorkload object has the metav1.CreationTimestamp field filled with the value from the corresponding kueue.Workload (#1404, @PBundyra)

Bug or Regression

  • Add missing RBAC on integration finalizers sub-resources (#1486, @astefanutti)

  • Add Mutating WebhookConfigurations for the AdmissionCheck, RayJob, and JobSet to helm charts (#1567, @B1F030)

  • Add Validating/Mutating WebhookConfigurations for the KubeflowJobs like PyTorchJob (#1460, @tenzen-y)

  • Added event for QuotaReserved and fixed event for Admitted to trigger when admission checks complete (#1436, @trasc)

  • Prevent finished Workloads from blocking quota after a Kueue restart (#1689, @trasc)

  • Avoid recreating a Workload for a finished Job and finalize a job when the workload is declared finished. (#1383, @achernevskii)

  • Do not (re)create ProvReq if the state of admission check is Ready (#1617, @mimowo)

  • Fix Kueue crashing at the log level 6 when re-admitting workloads (#1644, @mimowo)

  • Fix a bug in the pod integration that caused unexpected errors when the pod isn't found (#1512, @achernevskii)

  • Fix a bug where a workload representing a pod group was deleted soon after being marked as finished.
    This affected workloads that were preempted during their lifetime. (#1683, @mimowo)

  • Fix a bug where plain pods managed by Kueue remained in a terminating state forever. (#1342, @tenzen-y)

  • Fix a bug in the client-go libraries that prevented operating on cluster-scoped resources such as ClusterQueue and ResourceFlavor. (#1294, @tenzen-y)

  • Fix fungibility policy Preempt where it was not able to utilize the next flavor if preemption was not possible. (#1366, @alculquicondor)

  • Fix handling of preemption within a cohort when there is no borrowingLimit. In that case,
    during preemption, the permitted resources to borrow were calculated as if borrowingLimit=0, instead of unlimited.

    As a consequence, when using reclaimWithinCohort, it was possible that a workload scheduled to a ClusterQueue with no borrowingLimit would preempt more workloads than needed, even though it could fit by borrowing. (#1561, @mimowo)

  • Fix the synchronization of the admission check state based on the second provisioning request (#1585, @mimowo)

  • Fixed fungibility policy whenCanPreempt: Preempt. The admission should happen in the flavor for which preemptions were issued. (#1332, @alculquicondor)

  • Kueue replicas are advertised as Ready only once the webhooks are functional.

    This allows users to hold off their first requests until the Kueue deployment is available, so that
    early requests don't fail. (#1676, @mimowo)

  • Pending workload from StrictFIFO ClusterQueue doesn't block borrowing from other ClusterQueues (#1399, @yaroslava-serdiuk)

  • Remove deleted pending workloads from the cache (#1679, @astefanutti)

  • Remove finalizer from Workloads that are orphaned (have no owners). (#1523, @achernevskii)

  • Trigger an eviction for an admitted Job after an admission check changed state to Rejected. (#1562, @trasc)

  • Visibility endpoints return 404 code for non-existent queues (#1415, @PBundyra)

  • Webhooks are served in non-leading replicas (#1509, @astefanutti)
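
The borrowingLimit fix above (#1561) comes down to the difference between an unset limit and a limit of zero. The following Go sketch is a toy model of that distinction, not Kueue's scheduler logic: maxBorrow, fitsByBorrowing, and the cohortLendable input are hypothetical names, and quantities are plain integers.

```go
package main

import "fmt"

// maxBorrow returns how much a ClusterQueue may borrow from its cohort.
// A nil borrowingLimit means "no limit", so only the capacity the cohort
// can lend (cohortLendable, a hypothetical input) bounds the result.
func maxBorrow(borrowingLimit *int64, cohortLendable int64) int64 {
	if borrowingLimit == nil || *borrowingLimit > cohortLendable {
		return cohortLendable
	}
	return *borrowingLimit
}

// fitsByBorrowing reports whether a request fits without preemption,
// counting both the queue's nominal quota and what it may borrow.
func fitsByBorrowing(request, nominal, used int64, borrowingLimit *int64, cohortLendable int64) bool {
	return used+request <= nominal+maxBorrow(borrowingLimit, cohortLendable)
}

func main() {
	zero := int64(0)
	// No borrowingLimit: the request fits by borrowing, no preemption needed.
	fmt.Println(fitsByBorrowing(2, 4, 4, nil, 5)) // prints true
	// Treating the unset limit as 0 (the old behavior) makes it look unfit.
	fmt.Println(fitsByBorrowing(2, 4, 4, &zero, 5)) // prints false
}
```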

Other (Cleanup or Flake)

  • Adding a toleration to a job leads to a workload update (#1304, @stuton)
  • Expose utility functions to set up jobframework reconcilers and webhooks (#1630, @tenzen-y)