
apis: add MetricPrediction crd #1875

Open · wants to merge 1 commit into base: main

Conversation

@zwzhang0107 (Contributor) commented Jan 29, 2024

Ⅰ. Describe what this PR does

Define a metric prediction CRD for recommendation and prediction.

The following YAML defines a MetricPrediction called mericprediction-sample.

The spec of mericprediction-sample declares that it needs a resource prediction for a workload:

  • the prediction is at container level for a Deployment called nginx
  • the metric types are cpu and memory, collected from metrics-server
  • the distribution profiler is used, which computes statistics over historical usage (a small sketch of such a computation follows the example below)

The status of mericprediction-sample returns the cpu and memory profiling results for all containers of the nginx workload.

apiVersion: analysis.koordinator.sh/v1alpha1
kind: MetricPrediction
metadata:
  name: mericprediction-sample
  namespace: default
spec:
  target:
    type: workload
    workload:
      apiVersion: apps/v1
      kind: Deployment
      name: nginx
      hierarchy:
        level: container
  metric:
    source: metricServer
    metricServer:
      resources: [cpu, memory]
  profilers:
  - name: recommendation-sample
    model: distribution
    distribution:
      # args
status:
  results:
  - profilerName: recommendation-sample
    model: distribution
    distributionResult:
      items:
      - id:
          level: container
          name: nginx-container
        resources:
        - name: cpu
          avg: 6850m
          quantiles:
            # ...
            p95: 7950m
            p99: 8900m
          stdDev: 759m
          firstSampleTime: 2024-01-29T07:15:56Z
          lastSampleTime: 2024-01-30T07:15:56Z
          totalSamplesCount: 10000
          updateTime: 2024-01-30T07:16:56Z
          conditions: []
        - name: memory
          avg: 1000Mi
          quantiles:
            # ...
            p95: 1100Mi
            p99: 1200Mi
          stdDev: 100Mi
          firstSampleTime: 2024-01-29T07:15:56Z
          lastSampleTime: 2024-01-30T07:15:56Z
          totalSamplesCount: 10000
          updateTime: 2024-01-30T07:16:56Z
          conditions: []
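
As a rough illustration (not part of this PR, and the Go types below are invented for the example), the distribution profiler described above boils down to computing statistics such as the average, quantiles, and standard deviation over historical usage samples:

package main

import (
	"fmt"
	"math"
	"sort"
)

// distributionStats mirrors the per-resource fields of distributionResult in the
// example status above (avg, quantiles, stdDev); the Go types are illustrative only.
type distributionStats struct {
	Avg      float64
	P95, P99 float64
	StdDev   float64
	Samples  int
}

// profileDistribution computes simple statistics over historical usage samples,
// which is what the distribution profiler is described to do. A real profiler
// would typically use a decaying histogram instead of raw samples.
func profileDistribution(samples []float64) distributionStats {
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)

	var sum, sqSum float64
	for _, s := range sorted {
		sum += s
		sqSum += s * s
	}
	n := float64(len(sorted))
	avg := sum / n
	quantile := func(q float64) float64 {
		idx := int(math.Ceil(q*n)) - 1
		if idx < 0 {
			idx = 0
		}
		return sorted[idx]
	}
	return distributionStats{
		Avg:     avg,
		P95:     quantile(0.95),
		P99:     quantile(0.99),
		StdDev:  math.Sqrt(sqSum/n - avg*avg),
		Samples: len(sorted),
	}
}

func main() {
	// CPU usage samples in cores, e.g. scraped from metrics-server.
	usage := []float64{6.1, 6.9, 7.2, 6.5, 8.0, 7.9, 6.4, 7.1, 8.9, 6.8}
	fmt.Printf("%+v\n", profileDistribution(usage))
}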

Ⅱ. Does this pull request fix one issue?

More information can be found in #1880.

Ⅲ. Describe how to verify it

Ⅳ. Special notes for reviews

Integrating with the Metric Prediction Framework
The Metric Prediction Framework is a kind of "deep module", providing algorithms and prediction models in the backend. Multiple profilers could be built with Metric Prediction as a foundation. Here are some scenarios showing how the framework can be used.

  • Resource Recommender for Workload
    The spec of Recommendation declares that it needs the recommended resources (CPU and memory) for a Deployment named nginx-sample, and recommendResources in the status shows the result for each container.
apiVersion: analysis.koordinator.sh/v1alpha1
kind: Recommendation
metadata:
  name: recommendation-sample
  namespace: recommender-sample
spec:
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-sample
status:
  recommendResources:
    containerRecommendations:
    - containerName: nginx-container
      target:
        cpu: 4742m
        memory: 262144k

The recommendation is calculated from quantile values of historical metrics. Using Metric Prediction as the profiling model, the requirement of recommendation-sample can be expressed as a MetricPrediction.
For different kinds of workloads, the recommendation can select a specific quantile value from MetricPrediction, for example p95 for a Deployment and the average for a Job, then add a 10–15% safety margin (a minimal sketch of this calculation follows the MetricPrediction example below).

apiVersion: analysis.koordinator.sh/v1alpha1
kind: MetricPrediction
metadata:
  name: mericprediction-sample
  namespace: default
spec:
  target:
    type: workload
    workload:
      apiVersion: apps/v1
      kind: Deployment
      name: nginx-sample
      hierarchy:
        level: container
  metric:
    source: metricServer
    metricServer:
      resources: [cpu, memory]
  profilers:
  - name: recommendation-sample
    model: distribution
    distribution:
      # args
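
A minimal sketch of the recommendation calculation described above, assuming hypothetical Go types that mirror the distributionResult fields from the example status; the quantile choice (p95 for a Deployment, average for a Job) and the 15% safety margin follow the prose, everything else is illustrative:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// distributionResource mirrors one entry of distributionResult.items[].resources[]
// from the example status above (simplified; field names are illustrative only).
type distributionResource struct {
	Name      string
	Avg       resource.Quantity
	Quantiles map[string]resource.Quantity
}

// recommend picks a quantile depending on the workload kind (p95 for long-running
// Deployments, the average for Jobs) and adds a safety margin on top.
func recommend(r distributionResource, workloadKind string, margin float64) *resource.Quantity {
	base := r.Avg
	if workloadKind == "Deployment" {
		if q, ok := r.Quantiles["p95"]; ok {
			base = q
		}
	}
	value := base.AsApproximateFloat64() * (1 + margin)
	// Scale back to milli-units so the result stays a valid Kubernetes quantity.
	return resource.NewMilliQuantity(int64(value*1000), resource.DecimalSI)
}

func main() {
	cpu := distributionResource{
		Name: "cpu",
		Avg:  resource.MustParse("6850m"),
		Quantiles: map[string]resource.Quantity{
			"p95": resource.MustParse("7950m"),
			"p99": resource.MustParse("8900m"),
		},
	}
	// p95 (7950m) plus a 15% margin -> roughly 9142m.
	fmt.Println(recommend(cpu, "Deployment", 0.15).String())
}
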
  • Hotspot Prediction by Time-Series Metrics
    Pod orchestration on a node varies over time, and each pod has its own cycle of resource usage. The NodeQoS CR below describes the usage prediction derived from time-series-based workload metric predictions.
apiVersion: analysis.koordinator.sh/v1alpha1
kind: NodeQoS
metadata:
  name: node-sample
spec:
  usagePredictionPolicy: workloadByTime
status:
  usageOverTime:
  - timeWindow: "0~1" # 0~1 hour
    max:
      cpu: 6039m
      memory: 18594k
    average:
      cpu: 4028m
      memory: 15782k
    p95:
      cpu: 5731m
      memory: 18043k
  - timeWindow: "1~2" # 1~2 hour
    max:
      cpu: 6039m
      memory: 18594k
    average:
      cpu: 4028m
      memory: 15782k
    p95:
      cpu: 5731m
      memory: 18043k

The usageOverTime result in NodeQoS is aggregated from the MetricPrediction of all workloads currently running on the Node, so that the descheduler can check whether any node will be overloaded in the near future and then rebalance some pods to other nodes (a rough aggregation sketch follows the MetricPrediction example below).

apiVersion: analysis.koordinator.sh/v1alpha1
kind: MetricPrediction
metadata:
  name: mericprediction-sample
  namespace: default
spec:
  target: # workload
  metric:
    source: metricServer
    metricServer:
      resources: [cpu, memory]
    prometheus:
    - resource: memoryBandwidth
      name: container_memory_bandwidth
  profilers:
  - name: timeseries-sample
    model: timeseries-trend
    timeseries-trend: # args
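
A rough sketch, with assumed types that are not part of this PR, of how the node-level usageOverTime could be aggregated: the per-workload predictions for each time window are summed per resource, and the descheduler would compare the totals against node allocatable to find future hotspots.

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// windowUsage holds the predicted usage of one workload for one time window;
// the struct is illustrative and simplified from the NodeQoS example above.
type windowUsage struct {
	TimeWindow string
	P95        map[string]resource.Quantity // resource name -> predicted p95
}

// aggregateNodeUsage sums the p95 predictions of all workloads on the node
// per time window, producing per-window, per-resource node-level totals.
func aggregateNodeUsage(workloads [][]windowUsage) map[string]map[string]*resource.Quantity {
	node := map[string]map[string]*resource.Quantity{}
	for _, windows := range workloads {
		for _, w := range windows {
			if node[w.TimeWindow] == nil {
				node[w.TimeWindow] = map[string]*resource.Quantity{}
			}
			for name, q := range w.P95 {
				if node[w.TimeWindow][name] == nil {
					node[w.TimeWindow][name] = resource.NewQuantity(0, resource.DecimalSI)
				}
				node[w.TimeWindow][name].Add(q)
			}
		}
	}
	return node
}

func main() {
	nginx := []windowUsage{{TimeWindow: "0~1", P95: map[string]resource.Quantity{"cpu": resource.MustParse("5731m")}}}
	redis := []windowUsage{{TimeWindow: "0~1", P95: map[string]resource.Quantity{"cpu": resource.MustParse("1200m")}}}
	for window, usage := range aggregateNodeUsage([][]windowUsage{nginx, redis}) {
		// Prints the summed cpu p95 for window 0~1 (6931m).
		fmt.Println(window, usage["cpu"].String())
	}
}
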
  • Interference Detection for Workload Outliers
    A Pod may suffer interference at runtime due to resource contention on the node, which can be analyzed through CPI, PSI, CPU scheduling latency, etc. Specify an algorithm such as OCSVM in MetricPrediction, and the resulting model will be available in the status.
apiVersion: analysis.koordinator.sh/v1alpha1
kind: MetricPrediction
metadata:
  name: mericprediction-sample
  namespace: default
spec:
  target: # workload
  metric:
    prometheus:
    - resource: cpi
      name: container_cpi
    - resource: psi_cpu
      name: container_psi_cpu
    - resource: csl
      name: container_cpu_scheduling_latency
  profilers:
  - name: interference-sample
    model: OCSVM
    ocsvm: # args

The Interference Manager will parse the corresponding workload model and send it to koordlet. koordlet will execute QoS strategies once it finds that some pod is an outlier according to recent metrics (an illustrative sketch of this check follows).

(image: koordetector)
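
Purely for illustration (none of these types exist in koordlet), a sketch of how koordlet could consume the delivered model: score recent pod metrics against it and trigger a QoS strategy when the pod is flagged as an outlier.

package main

import "fmt"

// InterferenceModel abstracts whatever model (e.g. an OCSVM decision function)
// the Interference Manager delivers to koordlet; this interface is hypothetical.
type InterferenceModel interface {
	// IsOutlier reports whether the recent metric sample (cpi, psi_cpu, csl, ...)
	// falls outside the learned normal region.
	IsOutlier(sample map[string]float64) bool
}

// thresholdModel is a stand-in implementation: flag the pod when CPI exceeds
// a learned upper bound. A real OCSVM model would evaluate its decision function.
type thresholdModel struct{ cpiUpperBound float64 }

func (m thresholdModel) IsOutlier(sample map[string]float64) bool {
	return sample["cpi"] > m.cpiUpperBound
}

func main() {
	var model InterferenceModel = thresholdModel{cpiUpperBound: 1.8}
	recent := map[string]float64{"cpi": 2.1, "psi_cpu": 0.35}
	if model.IsOutlier(recent) {
		// Here koordlet would execute QoS strategies, e.g. throttling best-effort pods.
		fmt.Println("pod flagged as interference outlier")
	}
}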

V. Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests
  • All checks passed in make test

codecov bot commented Jan 29, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 67.54%. Comparing base (07e51fa) to head (4531674).
Report is 117 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1875      +/-   ##
==========================================
+ Coverage   67.23%   67.54%   +0.30%     
==========================================
  Files         410      413       +3     
  Lines       45662    46072     +410     
==========================================
+ Hits        30702    31120     +418     
+ Misses      12742    12696      -46     
- Partials     2218     2256      +38     
Flag Coverage Δ
unittests 67.54% <ø> (+0.30%) ⬆️

Flags with carried forward coverage won't be shown.

@koordinator-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign hormes after the PR has been reviewed.
You can assign the PR to them by writing /assign @hormes in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@saintube (Member)


typo: mericprediction -> metricprediction

@hormes (Member) commented Jan 31, 2024

Add some user stories to help understand how the API is used

@zwzhang0107 (Contributor, Author)

Updated with more user stories.

@saintube (Member) left a comment


/lgtm

apis/analysis/v1alpha1/condition.go (review thread resolved, outdated)
apis/analysis/v1alpha1/groupversion_info.go (review thread resolved)
apis/analysis/v1alpha1/metric_spec.go (review thread resolved, outdated)
// API version of the referent
APIVersion string `json:"apiVersion,omitempty"`
// Hierarchy indicates the hierarchy of the target for profiling
Hierarchy ProfileHierarchy `json:"hierarchy,omitempty"`
Member
The WorkloadRef makes sense, but the Hierarchy field doesn't look connected to the workload reference; is the definition really appropriate here? I noticed that PodSelectorRef has the same situation, so is Hierarchy a field that needs to be described independently?

Contributor Author
The Hierarchy field is related to Workload; it means the metric prediction is at container or pod level, which is only effective for K8s workloads.
Metric Prediction can also work for other types of workloads beyond K8s, such as a FaaS job. Although the workload is not defined in K8s, the metrics are recorded in Prometheus format.

function_cpu_usage{service="service-word-count", function="f-map-name", slice="slice-0"} 1.2
function_cpu_usage{service="service-word-count", function="f-map-name", slice="slice-1"} 1.3
function_cpu_usage{service="service-word-count", function="f-reduce-name", slice="all"} 2

This means a workload called service-word-count, which consists of two jobs (map and reduce).
For the resource recommendation scenario, we want to profile service-word-count/f-map-name and service-word-count/f-reduce-name.
This workload can be defined as an AnalysisTargetPrometheusLabelGroup, and we should support service/function as the key for aggregation in PrometheusLabelGroup, which does not need the Hierarchy field.

Member
This information looks important and should be added to the proposal.

Contributor Author
We can add more explanation when we support PrometheusLabelGroup.

@hormes (Member) commented Feb 1, 2024

In the case where there is another layer in the usage scenario mentioned earlier, does MetricPrediction need to be a CRD?

apis/analysis/v1alpha1/metric_spec.go (review thread resolved, outdated)
apis/analysis/v1alpha1/metric_spec.go (review thread resolved, outdated)
apis/analysis/v1alpha1/metric_spec.go (review thread resolved, outdated)
apis/analysis/v1alpha1/condition.go (review thread resolved, outdated)
apis/analysis/v1alpha1/metric_spec.go (review thread resolved, outdated)
// PrometheusMetric defines the prometheus metric to be analyzed
type PrometheusMetric struct {
// Resource defines the key of resource to be analyzed
Resource v1.ResourceName `json:"name,omitempty"`
Member
I wonder if the name resource is suitable here. For example, is CPI a resource?

Contributor Author
Maybe it should just be defined as name here.

apis/analysis/v1alpha1/profiler.go (review thread resolved, outdated)
apis/analysis/v1alpha1/profiler.go (review thread resolved)
@zwzhang0107 (Contributor, Author) commented Feb 20, 2024

In the case where there is another layer in the usage scenario mentioned earlier, does MetricPrediction need to be a CRD?

@hormes The Recommendation controller in Koordinator does not need to create a MetricPrediction CR in the APIServer, which means MetricPrediction is an internal protocol in this scenario: the Recommendation CR is converted to an internal MetricPrediction for the framework.

A MetricPrediction CR will be created in the following scenarios:

  • An external controller wants to use the Prediction module of Koordinator; the MetricPrediction CRD then acts as an API between the external controller and Koordinator.
  • Before developing a new profiler controller, a MetricPrediction can be created to run experiments and demos ahead of the implementation, for example to compare whether the ARIMA or Prophet algorithm should be used in the NodeQoS controller.

First we will support the usage scenarios above, and the development will take two steps:

  • MetricPrediction framework with the Distribution model, using the resource prediction scenario to verify that the framework works well. The framework can then be extended with more algorithm models, such as Interference Detection.
  • Recommendation controller based on the MetricPrediction framework, considering workload type (Job/Service), OOM events, etc.

Signed-off-by: 佑祎 <zzw261520@alibaba-inc.com>
@koordinator-bot

New changes are detected. LGTM label has been removed.

@koordinator-bot removed the lgtm label Feb 20, 2024
// Source defines the source of metric, which can be metric server or prometheus
Source MetricSourceType `json:"source"`
// MetricServer defines the metric server source, which is effective when source is metric server
MetricServer *MetricServerSource `json:"metricServer,omitempty"`
Member
Suggested change:
- MetricServer *MetricServerSource `json:"metricServer,omitempty"`
+ MetricsAPI *MetricsAPIMetricSource `json:"metricsAPI,omitempty"`

@zwzhang0107 (Contributor, Author)
/hold until we have implemented the first user story

stale bot commented Jun 2, 2024

This issue has been automatically marked as stale because it has not had recent activity.
This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Close this issue or PR with /close

Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Jun 2, 2024