Releases: volcano-sh/volcano
v1.8.2
Changes since v1.8.1
- fix wrong pods field format output of queue status (#3287 @Monokaix)
- add ignored csi provisioner when compute csi resources (#3286 @Monokaix)
- fix k8s.io/dynamic-resource-allocation go mod not found err (#3272 @Monokaix)
- fix: JSON marshal error for unsupported type: func() (#3282 @lowang-bh)
- fix job CRD metadata.annotations: Too long error (#3267 @Monokaix)
- fix queue update validation err when status.allocated empty (#3266 @Monokaix)
- fix grafana dashboard format err (#3265 @Monokaix)
- update parameter BestEffort of taskInfo after changing parameter InitResreq (#3232 @Lily922)
- fix: allocated field in queue status is calculated incorrectly (#3221 @shusley244)
- Avoid repeatedly creating links to obtain node metrics (#3229 @wangyang0616)
- skip 'pods' resource when checking if the Resource is empty (#3224 @Lily922)
- queue realcapability change to min dimension of queue capability and … (#3219 @Monokaix)
- support preemption when the number of pods of a node reaches the upper limit (#3202 @Lily922)
- Delete duplicate logs generated by the predicate_helper method (#3214 @guoqinwill)
- support preempting task with bound status (#3209 @Lily922)
- support preemption when the number of attachment volumes of a node reaches the upper limit (#3212 @Lily922)
- fix: task scheduling latency metrics are not accurate (#3128 @lowang-bh)
- backfill add score process (#3164 @lowang-bh)
- Obtains the actual load data of a node from the custom metrics API (#3181 @wangyang0616)
- Update the default value of parameter worker-threads-for-podgroup to 5 (#3180 @Lily922)
- update volcano.sh/apis version (#3166 @Lily922)
v1.8.1
Changes since v1.8.0
- fix: the pod anti-affinity constraint fails (#3140 @wangyang0616)
- add podGroup status to session cache, fix the bug of repeatedly sending podGroup update requests when there is no conditions field. (#3125 @Lily922)
- add reSync task callback (#3119 @Monokaix)
- successfully scheduled events will not be reported repeatedly for podGroup resource (#3117 @Lily922)
- add reSync task callback (#3114 @Monokaix)
- volcano adapt k8s v1.27 (#3101 @Mufengzhe)
- add featuregates for volcano capabilities (#3093 @Monokaix)
- msg information optimization; preemption logic optimization (#3082 @wangyang0616)
- fix nodelock issue when using gang-scheduling (#3060 @wangyang0616)
- pods are preferentially scheduled to machines that meet the current session resources (#3035 @wangyang0616)
- optimize the jobflow architecture design diagram (#3025 @wangyang0616)
- use one command of helm install to do smooth upgrade (#3017 @lowang-bh)
- remove node out of sync state (#3006 @Monokaix)
- fix: the task pipeline status is incompatible with cluster autoscaler (#3002 @wangyang0616)
- when Volcano is uninstalled, two resources will remain (#2992 @gj199575)
What's Changed
- [cherry-pick for release-1.8] msg information optimization; preemption logic optimization by @wangyang0616 in #3082
- [cherry-pick for release-1.8]Add featuregates for volcano capabilities by @Monokaix in #3093
- [cherry-pick for release 1.8]volcano adapt k8s v1.27 by @Mufengzhe in #3101
- [cherry-pick for release-1.8]successfully scheduled events will not be reported repeatedly for podGroup resource by @Lily922 in #3117
- [cherry-pick for release-1.8]Add reSync task callback by @Monokaix in #3119
- [cherry-pick for release-1.8]Add podGroup status to session cache, fix the bug of repeatedly sending podGroup update requests when there is no conditions field. by @Lily922 in #3125
- Update image version for release v1.8.1 by @Mufengzhe in #3136
- [cherry-pick for release-1.8]fix: the pod anti-affinity constraint fails by @wangyang0616 in #3140
- [cherry-pick for release-1.8]:feat:add printing of MemStats in dumpall by @xiao-jay in #3098
Full Changelog: v1.8.0...v1.8.1
v1.8.0
What's New
Add JobFlow to support lightweight workflow orchestration
The workflow orchestration engine is widely used in high-performance computing, AI biomedicine, image processing, beauty, game AGI, scientific computing and other scenarios, helping users simplify the management of multiple parallel tasks and dependencies, and greatly improving the overall computing efficiency.
JobFlow is a lightweight task flow orchestration engine that focuses on Volcano job orchestration. It provides Volcano with job probes, job completion dependencies, job failure rate tolerance, and other diverse job dependency types, and supports complex process control primitives. The specific capabilities are as follows:
- Support large-scale job management and complex task flow orchestration.
- Support real-time query of the running status and task progress of all associated jobs.
- Support automatic operation and scheduled start of jobs to reduce labor costs.
- Different action strategies can be set for different tasks, and the corresponding actions are triggered when a task meets certain conditions, such as retry on timeout or migration on node failure.
Refer to the links for more details. (JobFlow doc, @hwdef, @lowang-bh, @zhoumingcheng)
Support vGPU scheduling and isolation
Since the rise of ChatGPT, research and development of large AI models has accelerated, and many different large models have been launched one after another. In production environments, users face pain points such as low resource utilization and inflexible GPU resource allocation. They have to purchase a large amount of redundant heterogeneous computing power to meet business needs, and because heterogeneous computing power is expensive, this places a heavy burden on enterprises.
Starting from version 1.8, Volcano provides a general abstract framework for sharing devices (GPU, NPU, FPGA...), on which developers can build multiple types of custom shared devices. Volcano currently supports GPU device multiplexing and resource isolation based on this framework. Details are as follows:
- GPU sharing: Each task can apply to use part of the resources of a GPU card, and the GPU card can be shared among multiple tasks.
- Device memory control: GPU can be allocated according to device memory (for example: 3000M) or allocated in proportion (for example: 50%) to realize GPU virtualization resource isolation capability.
Refer to the links for more details.
- How to use vGPU function (@archlitchi)
- How to add a new heterogeneous computing power sharing strategy (@archlitchi)
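The two ways of requesting GPU memory described above (an absolute amount such as 3000M, or a proportion such as 50%) can be illustrated with a small sketch. This is not Volcano's implementation or API: the `GpuCard` class, the function names, and the megabyte unit are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class GpuCard:
    """Hypothetical model of one physical GPU card shared by several tasks."""
    total_mem_mb: int
    allocations: Dict[str, int] = field(default_factory=dict)  # task name -> MB

    def free_mem_mb(self) -> int:
        return self.total_mem_mb - sum(self.allocations.values())


def resolve_request_mb(card: GpuCard,
                       mem_mb: Optional[int] = None,
                       mem_percent: Optional[int] = None) -> int:
    """Turn a request given as absolute memory OR as a percentage into MB."""
    if (mem_mb is None) == (mem_percent is None):
        raise ValueError("specify exactly one of mem_mb or mem_percent")
    if mem_mb is not None:
        return mem_mb
    return card.total_mem_mb * mem_percent // 100


def allocate(card: GpuCard, task: str,
             mem_mb: Optional[int] = None,
             mem_percent: Optional[int] = None) -> bool:
    """Reserve part of the card for a task; return False if it does not fit."""
    need = resolve_request_mb(card, mem_mb, mem_percent)
    if need > card.free_mem_mb():
        return False
    card.allocations[task] = need
    return True
```

With a 16000 MB card, a 3000 MB request and a 50% request can share the card, while a second 50% request no longer fits.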
Support the preemption capability for GPU and user-defined resources
Before version 1.8, Volcano supported preemption only for basic resources such as CPU and memory. GPU resources and user-managed resources such as NPU and network resources were not supported.
In version 1.8, the predicate logic is refactored to return more detailed statuses, such as Unschedulable and UnschedulableAndUnresolvable, for different scenarios.
GPU preemption has been released on top of the optimized framework, and user-developed scheduling plugins based on Volcano can be adapted and upgraded according to business scenarios.
Refer to the link for more details. (#2916, @wangyang0616)
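The difference between the two statuses can be sketched as follows. The sketch assumes the same meaning these status names have in the Kubernetes scheduler framework: Unschedulable means evicting pods might help, while UnschedulableAndUnresolvable means preemption cannot help. The node and task dictionaries and the function names are hypothetical and are not Volcano's plugin interface.

```python
from enum import Enum


class Status(Enum):
    SUCCESS = "Success"
    UNSCHEDULABLE = "Unschedulable"
    UNSCHEDULABLE_AND_UNRESOLVABLE = "UnschedulableAndUnresolvable"


def gpu_predicate(node: dict, task: dict) -> Status:
    """Hypothetical GPU predicate returning a detailed status."""
    if node["gpu_total"] < task["gpu"]:
        # The node can never fit the task, even if every pod is evicted.
        return Status.UNSCHEDULABLE_AND_UNRESOLVABLE
    if node["gpu_total"] - node["gpu_used"] < task["gpu"]:
        # The node is full right now, but evicting victims could free GPUs.
        return Status.UNSCHEDULABLE
    return Status.SUCCESS


def preemption_candidates(nodes: list, task: dict) -> list:
    """Only nodes that are merely Unschedulable are worth preempting on."""
    return [n["name"] for n in nodes
            if gpu_predicate(n, task) is Status.UNSCHEDULABLE]
```

Returning a detailed status rather than a plain boolean lets the preempt action skip nodes where eviction would be pointless.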
Support ElasticSearch monitoring systems in node load-aware scheduling and rescheduling
The status of a Kubernetes cluster changes in real time as tasks are created and terminated. In some scenarios, such as adding or deleting nodes, changing the affinity of Pods and Nodes, or dynamically changing job lifecycles, problems can occur: resource utilization becomes unbalanced, nodes go offline because of performance bottlenecks, and so on. Load-aware scheduling and rescheduling can help users solve these problems.
Prior to Volcano version 1.8, load-aware scheduling and rescheduling supported only Prometheus. Starting from version 1.8, Volcano optimizes the framework for acquiring monitoring metrics and adds support for the ElasticSearch monitoring system.
Refer to the links for more details.
Optimize Volcano's ability to schedule microservices
Add Kubernetes default scheduler plugin enable and disable switch
Volcano is a unified scheduling system that supports not only computing jobs such as AI and big data, but also microservice workloads. It is compatible with scheduling plugins of the Kubernetes default scheduler such as PodTopologySpread, VolumeZone, VolumeLimits, NodeAffinity, and PodAffinity, and these plugin capabilities are enabled by default in Volcano.
Since Volcano 1.8, the Kubernetes default scheduling plugins can be turned on and off individually through the configuration file, and all of them are turned on by default. To turn off some plugins, for example PodTopologySpread and VolumeZone, set the corresponding values in the predicate plugin to false.
Refer to the links for more details. (#2748, @jiangkaihua)
Enhance scheduler to keep compatibility with ClusterAutoscaler
In the Kubernetes platform, Volcano is not only used as a scheduler for batch computing services, but also used as a scheduler for general services. Node horizontal scaling is one of the core functions of Kubernetes, which plays an important role in coping with the surge of user traffic and saving operating costs. Volcano optimizes job scheduling and other related logic, and enhances the compatibility and interaction with ClusterAutoscaler, mainly in the following two aspects:
- The pod that enters the pipeline state in the scheduling phase triggers capacity expansion in time.
- Candidate nodes are graded in gradients to reduce the impact of cluster terminating pods on scheduling load, and prevent pods from entering invalid pipeline states, resulting in cluster expansion by mistake.
Refer to the links for more details. (#2782, #3000, @wangyang0616)
Provide tolerance for exception of device plugin
When a device plugin crashes or fails to report resources for some reason, and the total resource amount of the node becomes less than the allocated resource amount, Volcano considers the node data inconsistent, marks the node as OutOfSync, isolates it, and stops scheduling any new workload to it. This isolation mechanism had side effects on the cluster; for example, the device plugin itself had no chance to be scheduled to the OutOfSync node. In Volcano v1.8, the mechanism is enhanced to tolerate device plugin exceptions: non-GPU workloads such as the device plugin are still allowed to be scheduled to an OutOfSync node.
Refer to the link for more details. (#2999, @Monokaix)
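One way to picture the relaxed isolation is the sketch below. It is only an interpretation of the description above, not the code from #2999: it assumes a node counts as inconsistent for a resource when the allocated amount exceeds the reported total, and that a task is still admitted as long as it requests none of the inconsistent resources. The function names and the `nvidia.com/gpu` resource key in the usage note are illustrative.

```python
def out_of_sync_resources(capacity: dict, allocated: dict) -> set:
    """Resources whose allocated amount exceeds what the node now reports."""
    return {name for name, used in allocated.items()
            if used > capacity.get(name, 0)}


def can_schedule(task_request: dict, capacity: dict, allocated: dict) -> bool:
    """Admit a task unless it asks for a resource that is out of sync.

    Capacity checks for the remaining resources are omitted for brevity.
    """
    broken = out_of_sync_resources(capacity, allocated)
    return not any(task_request.get(name, 0) > 0 for name in broken)
```

For a node whose device plugin stopped reporting GPUs (capacity 0, allocated 2), a CPU-only pod such as the device plugin itself is still admitted, while a pod requesting `nvidia.com/gpu` is not.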
Add helm charts for Volcano
As more and more users run Volcano in production and cloud environments, simple and standard installation is crucial. Since version 1.8, Volcano has optimized how chart packages are published and archived, standardized the installation and usage process, and completed the migration of historical versions v1.6 and v1.7 to the new Helm repository.
Refer to the link for more details. (Volcano helm-charts, @wangyang0616)
Other Notable Changes
- rework device sharing in volcano(#2643, @archlitchi)
- style(resource_info): replace 0, -1 with Zero,Infinity(#2650, @kingeasternsun)
- perf(preempt): remove used copy(#2652, @kingeasternsun)
- Add podGroup completed phase(#2667, @waiterQ)
- delete redundant import alias(#2675, @shoothzj)
- delete redundant type conversion(#2627, @shoothzj)
- Extract MetricsClient and NodeMetrics to support other metrics platform(#2678, @shoothzj)
- upgrade klog package version to latest (#2682, @waiterQ)
- Update how_to_use_gpu_sharing.md(#2686, @z2Zhang)
- Rename AddPrePredicateFn annotation(#2689, @zbbkeepgoing)
- Remove duplicate import in session.go(#2690, @zbbkeepgoing)
- Optimize e2e runtime: reduce pytorch-plugin image download time(#2691, @wangyang0616)
- Fix typo in tdm-plugin.md(#2692, @shoothzj)
- volcano metrics source support elasticsearch (#2694, @shoothzj)
- Skip stmt when tasks is empty (#2696, @zbbkeepgoing)
- Add rescheduling related location logs ([#2698](https://github.com/volcano-...
v1.7.0
What's New
Enhanced Plugin for PyTorch Jobs
As one of the most popular AI frameworks, PyTorch has been widely used in deep learning fields such as computer vision and natural language processing. More and more users turn to Kubernetes to run PyTorch in containers for higher resource utilization and parallel processing efficiency.
Volcano 1.7 enhanced the plugin for PyTorch Jobs, freeing you from the manual configuration of container ports, MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK environment variables.
Other enhanced plugins include those for TensorFlow, MPI, and PyTorch Jobs. They are designed to help you run computing jobs on desired training frameworks with ease.
Volcano also provides an extended development framework for you to tailor Job plugins to your needs.
Refer to the links for more details. (#2313, @ccchenjiahuan)
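To show what manual configuration the plugin saves, the sketch below derives the four environment variables for every pod of a job. It is an illustration under assumptions, not the plugin's code: the pod naming scheme `<job>-<role>-<index>`, the default port 23456, and the rule that the master takes rank 0 are all assumed here.

```python
def pytorch_env(job: str, master_replicas: int, worker_replicas: int,
                port: int = 23456) -> dict:
    """Return {pod name: env vars} for a hypothetical PyTorch job."""
    world_size = master_replicas + worker_replicas
    master_addr = f"{job}-master-0"  # assumed hostname of the first master pod
    envs = {}
    rank = 0
    # Masters are numbered first so that the first master gets RANK 0.
    for role, count in (("master", master_replicas), ("worker", worker_replicas)):
        for index in range(count):
            envs[f"{job}-{role}-{index}"] = {
                "MASTER_ADDR": master_addr,
                "MASTER_PORT": str(port),
                "WORLD_SIZE": str(world_size),
                "RANK": str(rank),
            }
            rank += 1
    return envs
```

For a job with one master and two workers, every pod sees `WORLD_SIZE=3`, and the second worker gets `RANK=2`.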
Ray on Volcano
Ray is a unified framework for extending AI and Python applications. It can run on any machine, cluster, cloud, and Kubernetes cluster. Its community and ecosystem are growing steadily.
As machine learning workloads host computing jobs at a higher density than ever before, single-node environments can no longer provide enough resources for training tasks. This is where Ray comes in: it seamlessly coordinates the resources of the entire cluster, instead of a single node, to run the same set of code. Ray is designed for common scenarios and any type of workload.
For users running multiple types of Jobs, Volcano partners with Ray to provide high-performance batch scheduling. Ray on Volcano has been released in KubeRay 0.4.
Refer to the links for more details. (#2601(#755) @tgaddair)
Enhance Scheduling for Kubernetes long-running services
This enhancement makes Volcano fully compatible with the Kubernetes default scheduler for long-running services. With this enhancement, users can use Volcano to uniformly schedule long-running services and batch workloads in a single cluster.
Refer to the links for more details:
- support multi scheduler name for scheduler and webhook(#2393, @jinzhejz)
- Add nodeVolumeLimits plugin (#2458, @jiangkaihua)
- Volcano support volumeZone plugin (#2480, @jiangkaihua)
- Add podTopologySpread plugin (#2487, @Monokaix)
- Add selector spread plugin (#2500, @elinx)
Support Kubernetes v1.25
This feature is designed to make Volcano compatible with Kubernetes 1.25.
Refer to the links for more details. (#2533, @wangyang0616)
Support multi-arch images for Volcano
This feature is designed to cross-compile volcano images of different architectures. For example, compile an image for the ARM64 architecture on an AMD64 machine.
Refer to the links for more details.(#2435, @ccchenjiahuan)
Optimize Queue Status Information
This feature is designed to enrich the information of the queue. Through this function, users can view the resource allocation of queues in real time, which is convenient for administrators to dynamically plan resources.
Refer to the links for more details.(#2592, @jiangkaihua)
Other Notable Changes
- change enqueue to optional action(#2309, @wpeng102)
- Add documentation on ttlSecondsAfterFinished(#2314, @jsolbrig)
- remove redundant parentheses(#2316, @lucming)
- update go.mod to add queue.spec.Affinity(#2319, @qiankunli)
- Support JobReady for extender plugin(#2334, @xiaoxubeii)
- add jobflow design docs(#2339, @zhoumingcheng)
- deploy webhook by yaml(#2346, @hwdef)
- add details for nodegroup doc(#2347, @qiankunli)
- change e2e dependencies of makefile(#2350, @lucming)
- update go to 1.18(#2353, @hwdef)
- clean up the code(#2360, @lucming)
- add csiNode cache for plugin(#2371, @wpeng102)
- add rest config into ssn(#2378, @wpeng102)
- Update field comment(#2386, @zhoumingcheng)
- use patch to replace update pod operator(#2392, @wpeng102)
- get csinodes from ssn(#2399, @wpeng102)
- Consider initContainer GPUs quota in calculating(#2423, @kerthcet)
- Some cleanups in job_info.go(#2434, @kerthcet)
- Add initContainer GPU number when calculating GPUs(#2440, @kerthcet)
- Optimize the way to build images in makefile(#2445, @hwdef)
- add a flag to control whether inherit owner annotations when podgroup…(#2461, @elinx)
- Update CA insert method in webhooks(#2463, @jiangkaihua)
- chore: remove duplicate word in comments(#2470, @Abirdcfly)
- add plugin registration log(#2477, @Monokaix)
- Modify format verification by gofmt(#2499, @jiangkaihua)
- scheduler support ephemeral-storage resources(#2505, @WulixuanS)
- delete task qos limit in webhook(#2513, @waiterQ)
- enable https healthz listen(#2523, @waiterQ)
- Use RWMutex in framework(#2525, @kerthcet)
- Realias scheduling api version name in package imports(#2526, @kerthcet)
- Bump ginkgo version to v2.3.0(#2532, @kerthcet)
- upgrade golangci-lint to v1.50.0(#2537, @waiterQ)
- move prefilter out of predicates to improve performance(#2580, @elinx)
- Move spark e2e integration from self-hosted to github-hosted(#2590, @Yikun)
- Add node image information to the cache of the scheduler(#2593, @wangyang0616)
- By default, the preemption function of gang and drf is turned off(#2613, @wangyang0616)
- The referenced Volcano API version is updated to 1.7(#2618, @wangyang0616)
- update image to v1.7.0-beta.0(#2628, @william-wang)
- update image to v1.7.0(#2636, @wangyang0616)
Bug Fixes
- fix: proportion metrics accuracy(#2297, @LY-today)
- fix scheduler cache waitforcachesync(#2307, @xiaoanyunfei)
- To record the start and end time of job scheduling(#2318, @dontan001)
- fix convertQuanToPercent func(#2325, @autumn0207)
- fix defaultMetricsInternal variable(#2326, @autumn0207)
- filter the rescheduling strategies which contain victim functions(#2342, @Thor-wl)
- fix bug in task dependsOn(#2351, @hwdef)
- fix ci error about mpi plugin struct naming is not standardized(#2354, @hwdef)
- try to get old pg when new pg not exist(#2400, @Akiqqqqqqq)
- fix scheduler panic when webhook is not ready(#2410, @hwdef)
- bugfix: panic if queue already exists(#2413, @elinx)
- fix nil pointer in jobCache.update(#2420, @Akiqqqqqqq)
- fix README.md clearly(#2427, @waiterQ)
- Fix calculating available gpu num error(#2441, @kerthcet)
- fix performance downgrade issue(#2443, @wpeng102)
- docs: fix error in how to confi...
v1.6.0
What's New
Support Dynamic Scheduling Based on Real Node Load
This feature aims to schedule pods based on real node load instead of requested resources, which optimizes node resource utilization. Currently, pods are scheduled based on requested resources and node allocatable resources rather than actual node usage. This leads to unbalanced resource usage across compute nodes: a pod may be scheduled to a node with higher usage and a lower allocation rate. This is not what users expect; users expect the usage of each node to be balanced. For more details, refer to https://github.com/volcano-sh/volcano/blob/master/docs/design/usage-based-scheduling.md. (#2023, #2129 @william-wang )
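The contrast between request-based and usage-based placement can be shown with a toy example. The node fields and the two selection functions are hypothetical and only compare the two criteria; the actual plugin's scoring is described in the design document linked above.

```python
def pick_by_request(nodes: list) -> str:
    """Choose the node with the most unrequested CPU (request-based view)."""
    return max(nodes, key=lambda n: n["allocatable_cpu"] - n["requested_cpu"])["name"]


def pick_by_usage(nodes: list) -> str:
    """Choose the node with the most CPU actually idle (usage-based view)."""
    return max(nodes, key=lambda n: n["allocatable_cpu"] - n["used_cpu"])["name"]


# node-b looks emptier by requests but is much busier in reality.
NODES = [
    {"name": "node-a", "allocatable_cpu": 16, "requested_cpu": 12, "used_cpu": 3},
    {"name": "node-b", "allocatable_cpu": 16, "requested_cpu": 4, "used_cpu": 14},
]
```

Here the request-based view picks node-b, the node that is already heavily loaded, while the usage-based view picks node-a.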
Support Rescheduling Based on Real Node Load
This feature enables users to regularly rebalance node utilization based on real node resource usage, which is well suited to long-running workloads such as deployments. All rescheduling policies and the check interval can be configured according to custom scenarios. For more details, refer to https://github.com/volcano-sh/volcano/blob/master/docs/design/rescheduling.md. (#2174, #2184 @Thor-wl )
Support Elastic Job Scheduling
This feature allows Volcano to schedule a Volcano job based on the [min,max] configuration in the job, which improves the resource utilization rate and shortens the execution time of training jobs. For more details, refer to https://github.com/volcano-sh/volcano/blob/master/docs/design/elastic-scheduler.md. (#2105, @qiankunli )
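A minimal sketch of the [min,max] idea follows, under assumptions of my own: resources are counted in abstract slots, jobs are visited in priority order, a job is admitted only if its minimum fits, and leftover slots are then handed out up to each job's maximum. The actual policy is described in the design document linked above and may differ.

```python
def elastic_allocate(jobs: list, free_slots: int) -> dict:
    """jobs: list of (name, min_slots, max_slots) in priority order."""
    alloc = {}
    admitted = []
    # Pass 1: guarantee the minimum, all or nothing per job.
    for name, lo, hi in jobs:
        if lo <= free_slots:
            alloc[name] = lo
            free_slots -= lo
            admitted.append((name, hi))
        else:
            alloc[name] = 0
    # Pass 2: give leftover slots to admitted jobs, up to their maximum.
    for name, hi in admitted:
        extra = min(hi - alloc[name], free_slots)
        alloc[name] += extra
        free_slots -= extra
    return alloc
```

With 8 free slots, jobs `("a", 2, 5)` and `("b", 3, 4)` both start and `a` grows to its maximum; with only 4 slots, `b` is not admitted and `a` uses the remainder.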
Add MPI Job Plugin
This feature provides a new Volcano job plugin, the MPI plugin. It makes it more convenient for MPI users to use a Volcano job instead of manually making connections between hosts of different roles, registering the required environment variables, and so on. For more details, refer to https://github.com/volcano-sh/volcano/blob/master/docs/design/distributed-framework-plugins.md. (#2237, @hwdef )
Other Notable Changes
- update helm version in install.sh(#2103, @hwdef )
- modify the way to install the controller-gen(#2104, @hwdef )
- add shuffle action(#2174, @Thor-wl )
- add e2e Spark integration test(#2113, @Yikun )
- if only one candidate node, no need do scoring for it(#2122, @wpeng102 )
- skip verify init container SecurityContext.Privileged(#2125, @zrss )
- add design doc for usage based scheduling(#2023, @william-wang )
- add usage based scheduling plugin(#2129, @william-wang )
- support elastic annotation in preempt/reclaim plugin(#2105, @qiankunli )
- add design doc for Enhance-Generate-PodGroup-OwnerReferences-for-Normal-Pod(#2151, @wpeng102 )
- allow no retry when task failed(#2154, @merryzhou )
- remove useless code in task-topology's manager.go(#2159, @HeGaoYuan )
- add user guidance for svc plugin(#2162, @Thor-wl )
- add user guidance of env plugin(#2153, @Thor-wl )
- add user guidance for ssh plugin(#2168, @Thor-wl )
- add user guidance about how to configure volcano scheduler(#2177, @Thor-wl )
- add user guidance about how to configure job and task policy(#2179, @Thor-wl )
- add overhead for pod request(#2170, @jiangxiaobin96 )
- rename ClusterRole from prometheus to prometheus-volcano(#2178, @SimonYang-CS )
- add image pull secret for volcano-admission-init job(#2185, @SimonYang-CS )
- add rescheduling plugin(#2184, @Thor-wl )
- feat(scheduler): support resource quota consideration during pod group enqueue procedure(#1345, @merryzhou )
- add priorityClassName for rescheduler(#2200, @jiangxiaobin96 )
- allow privilege containers to pass the admission webhook validation by default(#2222, @Thor-wl )
- clean up metrics of deleted objects(#2230, @xiaoanyunfei )
- sunset the reservation plugin and elect reserve actions(#2236, @william-wang )
- add more deploy switches on helm(#2267, @shinytang6 )
Bug Fixes
- fix dynamic provision ut case error(#2133, @wpeng102 )
- fix: add jobUID into job's podgroup name ensure podgroup's unique(#2140, @FengXingYuXin)
- fix: Add mirror for Spark volcano IT(#2163, @Yikun )
- fix controller job cache not sync latest version issue(#2169, @wpeng102 )
- fix task MinAvailable issue(#2176, @merryzhou )
- fix calculate inqueue resource bug in opensession(#2214, @zbbkeepgoing )
- fix id of gpu devices never delete when number gpu decrease(#2215, @WingkaiHo)
- fix numa divided by zero(#2216, @elinx)
- fix helm install(#2218, @zirain )
- fix api-server deny empty admission response with PatchType set(#2267, @elinx)
- feat exclude unhealthy devices(#2267, @YongjiaHe)
- fix unhealthy gpu data struc array(#2267, @YongjiaHe)
- fix high priority task cannot preempt low priority task when queue is overused(#2267, @wpeng102 )
- avoid panic for query prometheus no data(#2267, @waiterQ )
- modify prometheus.query.result judg(#2267, @waiterQ )
- fix(scheduler): fix jobStarvingFn logic(#2271, @shinytang6 )
v1.5.1
Changes since v1.5.0
- bug fix: fix the driver pod can not be created due to unreasonable admit (#2081 @william-wang )
- bug fix: fix error message in TestValidateJobCreate ( #2077 @william-wang )
- bug fix: Open state queue can be deleted ( #2077 @Yikun )
- bug fix: upgrade webhook from v1beta1 to v1 to make sure volcano webhook work on K8S 1.22+ ( #2077 @william-wang )
- bug fix: fix the proportion plugin that ignore the inqueue resource in running jobs( #2057 @Thor-wl )
- bug fix: set the initial phase to be pending for podgroup ( #2057 @Thor-wl )
- bug fix: regenerate installer/volcano-development-arm64.yaml to fix arm64 deployment ( #2030 @hwdef )
- bug fix: fix queue allocated exceeds capability ( #2035 @aidaizyy @Thor-wl )
v1.5.0
Changes since v1.5.0-Beta
- bug fix: fix some concurrent map bugs in numaware-aware(#1968, @huone1 @Jason-Liu-Dream )
- bug fix: fix the scheduler stuck after delete resourcequota for namespace(#1978, @william-wang )
- bug fix: add individual development yamls for volcano v1.5(#2004, @hwdef )
v1.4.1
Changes since v1.4.0
- bug fix: fix panic in setNodeState function when node is nil(#1970, @Thor-wl )
- bug fix: fix possible panic when 'SetNode' is called(#1952, @william-wang )
- bug fix: fix some concurrent map bugs in numaware-aware(#1969, @huone1 @Jason-Liu-Dream )
- bug fix: all pods is existing when restart count exceed max retry(#1997, @william-wang )
- bug fix: add individual development yamls for volcano v1.4(#2002, @hwdef )
- bug fix: optimize resource comparison functions for performance(#2026, @huone1 )
v1.5.0-Beta
What's New
Support Task Dependency
In most mainstream computing platforms such as MPI and Tensorflow, different pods take on different roles, for example master and worker. Depending on how the platform works, either the master or the workers must be started first. This feature provides the ability to enforce the correct start order. For more details, refer to https://github.com/volcano-sh/volcano/blob/master/docs/design/task-launch-order-within-job.md. (#1920, #1833, @hwdef @shinytang6 @Thor-wl )
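Deriving a valid start order from task dependencies is a topological sort. The sketch below uses Python's standard `graphlib` module for this; the dictionary format (task name mapped to the set of tasks it waits for) is an assumption for illustration and not the job spec syntax, which is described in the design document linked above.

```python
from graphlib import TopologicalSorter


def launch_order(depends_on: dict) -> list:
    """Return task names so that every task comes after the tasks it depends on.

    depends_on maps a task name to the set of task names that must start first.
    Raises graphlib.CycleError if the dependencies form a cycle.
    """
    return list(TopologicalSorter(depends_on).static_order())
```

For example, `launch_order({"worker": {"master"}, "master": set()})` places `master` before `worker`.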
Support Reserve Resource for Queue
This feature provides the ability to reserve resources for specified queues to make sure there are always guaranteed resources for urgent jobs, instead of waiting for resource release or being preempted. For more details, refer to https://github.com/volcano-sh/volcano/blob/master/docs/design/queue-guarantee-resource-reservation-design.md (#1905, #1904, @qiankunli )
Support Specified Nodes for Volcano in Cluster
In some scenarios such as multiple schedulers, Volcano needs to be responsible for only part of the nodes in the cluster. This feature enables users to configure the nodes that Volcano is responsible for. For more details, refer to #1834 (#1821, @qiankunli )
Add Tensorflow Job Plugin
Volcano provides a unified object for job management which allows users to run AI training such as Tensorflow, Pytorch, Mxnet, and MPI with Volcano Job and enjoy the enhanced lifecycle management. However, it is a bit complex for some users. This feature adds a Tensorflow plugin based on the Volcano job plugin framework, which reduces the complexity of running Tensorflow with Volcano and makes it easy to use. For more details, refer to https://github.com/volcano-sh/volcano/blob/master/docs/design/distributed-framework-plugins.md (#1874, @LuBingtan )
Other Notable Changes
- update CRD version to v1(#1919, @Thor-wl )
- update golang to v1.17(#1912, @Thor-wl )
- optimize: reuse predicate error on same task group(#1906, @justadogistaken )
- default sort task by index(#1898, @xiaoanyunfei )
- update PriorityClass from v1beta1 to v1 for go-client(#1897, @lc2705 )
- support label volcano.sh/task-priority(#1896, @qiankunli )
- enhance the security of TLS client authentication for webhook(#1895, @huone1 )
- add healthz and metric switch for deploy controller and scheduler(#1888, @huone1 )
- add elastic scheduler design doc(#1887, @qiankunli )
- add eventhandler framework proposal(#1886, @sivanzcw )
- add an argument csi-storage to control the storage capacity resource(#1875, @huone1 )
- add rbac for csinode(#1871, @Thor-wl )
- feat: add imagelocality priority to nodeOrder(#1868, @justadogistaken )
- optimize the CA parse Process(#1862, @huone1 )
- ignore the update event if pod is allocated in cache but not present in NodeName(#1857, @xing0821)
- improve taintTolerationScore interPodAffinityScore throughput when failure occurs(#1856, @justadogistaken )
- switch the order of patch and existence check(#1852, @zzr93)
- support to set healthz address and metrics address(#1849, @huone1 )
- add nodeSelector design doc(#1834, @qiankunli )
- enhance the volcano topology framework(#1762, @huone1 )
- support preempt with priority plugin alone(#1757, @Thor-wl )
- support reserved node(#1821, @qiankunli )
- support multi-cluster scheduling in framework(#1521, @william-wang )
- cleanup scheduler cache informerFactory(#1831, @xiaoanyunfei )
- cleanup AddPriorityClass(#1828, @xiaoanyunfei )
- clean addNumaInfo(#1829, @xiaoanyunfei )
- clean up readAdmissionConf(#1823, @xiaoanyunfei )
- Remove default quota info in NewNamespaceCollection(#1817, @zen-xu )
- add multiple tasks support(#1820, @hwdef )
- refactor the cache to support batch bind api for better performance(#1796, @huone1 )
- optimize resource comparison functions for performance(#1769, @huone1 )
- optimize some logs in admission process(#1738, @huone1 )
- add setting MinResources to pg for normal pod(#1666, @huone1 )
- don't return err message when the pod isn't in the nodeinfo cache(#1478, @huone1 )
- update vendor for resource reservation(#1494, @huone1 )
- Proposal: Add Machine Learning Framework Plugins in Volcano(#1806, @LuBingtan )
- upgrade spf13/cobra version to 1.2.1(#1801, @marffin)
- refactor the volcano to support multi-scheduler with each job and node get corresponding scheduler based on hash.(#1795, @william-wang )
- support multi-scheduler for k8s workload deployment, etc(#1792, @huone1 )
- use root context(#1715, @lowang-bh )
- add design docs for task-level advanced scheduling policy(#1630, @hwdef )
- Add livenessProbe and readinessProbe in Grafana Container(#1788, @dipanjank )
- enhance the admission conf check(#1799, @huone1 )
- Catch add pod out of sync error(#1783, @zhiyuone )
- add UT for elect action(#1780, @Thor-wl )
- Adding oidc import to enable vcctl work with oidc cluster(#1793, @igormishsky)
- Add job conditions (status&lastTransitionTime)(#1764, @HecarimV )
- add ut coverage report for v1.4.0(#1766, @Thor-wl )
- refactor the Jobinfo functions to reduce redundant computing(#1745, @william-wang )
Bug Fixes
- fix scheduling process starts even if resource synchronization is not complete(#1916, @huone1 )
- fix: allocate ut for "two Jobs on one node"(#1913, @justadogistaken )
- fix the deep clone of JobInfo(#1883, @lc2705 )
- fix the security alert from Kubernetes(#1873, @Thor-wl )
- fix pod cannot be allocated with sufficient resource(#1851, @aidaizyy )
- fix: avoid chan block within taintTolerationScore(#1848, @justadogistaken )
- fix: scheduler crash fatal error: concurrent map writes(#1847, @Jason-Liu-Dream)
- fix syntax error in function Remove GPUIndexPatch(#1841, @Thor-wl )
- fix: All pods is existing when restart count exceed max retry(#1719, @LuBingtan )
- fix there is nil pointer access in function setNodeState(#1800, @huone1 )
- fix OOM will occur if pod info is sync before node info(#1662, @huone1 )
- fix controller panic when creating a large number of pods(#1814, @huone1 )
- fix a problem about equivalence ecache feature (#1593, @huone1 )
- fix(scheduler) gang plugin task min available check(#1732, @king-jingxiang)
- fix: fix possible panic when 'SetNode' is called(#1685, @eggiter )
- fix bug that vcjob is not completed when maxRetry is 1(#1746, @Thor-wl )
- fix the security alerts(#1770, @Thor-wl )
- fix(scheduler): improve job/task clone func(#1729, @shinytang6 )
- fix broken grafana dashboard configuration.(#1773, @dipanjank )
v1.4.0
Changes since v1.4.0-Beta
- fix bug about not record queue label in metric(#1722, @lowang-bh )
- fix: do not set taskInfo.NodeName to empty when nodeInfo.RemoveTask is called(#1716, @eggiter )
- fix(underused): Do not check overused when there is no UnderUsedResourceFn added(#1726, @eggiter )
- pass kubeClient to admission service(#1730, @hack-qian)
- upgrade k8s to v1.19.11 because of security notification(#1733, @Thor-wl )
- optimize some logs in admission process(#1738, @huone1 )
- change the Mutex to RWMutex in predicateCache(#1741, @william-wang )
- fix vcjob not work when mount volume(#1742, @Thor-wl )
- e2e cases about pod affinity skip cancel(#1743, @Thor-wl )
- fix bug that vcjob is not completed when maxRetry is 1(#1746, @Thor-wl )
- fix gen-admission-secret.sh(#1752, @yahaa )