Loki helm chart version upgrade from 5.44.4 to 6.5.0 issues in Single Binary deployment mode. #12912

Closed
numa1985 opened this issue May 8, 2024 · 10 comments
Labels
area/helm type/bug Somehing is not working as expected upgrade

Comments

@numa1985

numa1985 commented May 8, 2024


We are using Azure Kubernetes Service (AKS) with 1 system node in a system node pool and 3 user nodes in a user node pool to deploy Loki.

Kubernetes version: 1.29.2

  1. Affinity worked fine in 5.44.4 and all the pods landed in the user node pool. After upgrading to 6.5.0, setting affinity produces the error below.

Error

coalesce.go:286: warning: cannot overwrite table with non table for loki.singleBinary.affinity (map[podAntiAffinity:map[requiredDuringSchedulingIgnoredDuringExecution:[map[labelSelector:map[matchLabels:map[app.kubernetes.io/component:single-binary]] topologyKey:kubernetes.io/hostname]]]])
May 7th 2024 10:59:51 Error
coalesce.go:286: warning: cannot overwrite table with non table for loki.singleBinary.affinity (map[podAntiAffinity:map[requiredDuringSchedulingIgnoredDuringExecution:[map[labelSelector:map[matchLabels:map[app.kubernetes.io/component:single-binary]] topologyKey:kubernetes.io/hostname]]]])
May 7th 2024 10:59:51 Error
Error: UPGRADE FAILED: execution error at (loki/templates/validate.yaml:31:4): You have more than zero replicas configured for both the single binary and simple scalable targets. If this was intentional change the deploymentMode to the transitional 'SingleBinary<->SimpleScalable' mode
May 7th 2024 10:59:51 Error
Helm Upgrade returned non-zero exit code: 1. Deployment terminated.
May 7th 2024 10:59:51 Fatal
The remote script failed with exit code 1

ubuntu@NARU-Pr5530:~$ kubectl describe pod loki-chunks-cache-0 -n loki | tail -5
  Type     Reason             Age    From                Message
  ----     ------             ----   ----                -------
  Warning  FailedScheduling   2m43s  default-scheduler   0/4 nodes are available: 1 Insufficient memory, 4 Insufficient cpu. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod.
  Warning  FailedScheduling   2m42s  default-scheduler   0/4 nodes are available: 1 Insufficient memory, 4 Insufficient cpu. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod.
  Normal   NotTriggerScaleUp  2m40s  cluster-autoscaler  pod didn't trigger scale-up: 1 max node group size reached

If we do not set affinity in 6.5.0, a few pods land on the system node, run into resource limits there and fail. We also do not want any pods to land on the system node.
Is there any way to fix this?
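
One possible way to keep pods off the system node, sketched under assumptions: a `nodeSelector` on the `kubernetes.azure.com/mode` label that the affinity rule in this issue's 5.44.4 values already relies on. This is a stricter mechanism than the preferred node affinity used there, and it assumes the 6.x chart exposes `nodeSelector` on each component listed, which is worth checking against the values.yaml of the chart version in use. The `chunksCache` settings are likewise only a guess at what would clear the `Insufficient cpu` / `Insufficient memory` events; the thread does not say what the author actually changed.

```yaml
# Sketch only, not the author's confirmed fix.
# Assumes each component below accepts `nodeSelector` in the 6.x chart.
singleBinary:
  nodeSelector:
    kubernetes.azure.com/mode: user
gateway:
  nodeSelector:
    kubernetes.azure.com/mode: user
chunksCache:
  # The pod above was Pending for lack of CPU/memory, so the cache size
  # (in MB) may need lowering; if the cache is not needed, setting
  # `enabled: false` here is another option (assumed to exist in 6.x).
  allocatedMemory: 1024
  nodeSelector:
    kubernetes.azure.com/mode: user
resultsCache:
  nodeSelector:
    kubernetes.azure.com/mode: user
```

Because `nodeSelector` is a hard requirement, pods stay Pending if the user pool has no room, whereas the preferred affinity would let them fall back to the system node.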

values.yaml (used in 5.44.4)

--- https://github.com/grafana/loki/blob/main/production/helm/loki/values.yaml

loki:
  auth_enabled: false
  query_scheduler:
    max_outstanding_requests_per_tenant: 2048
  query_range:
    parallelise_shardable_queries: false
    split_queries_by_interval: 0
  commonConfig:
    replication_factor: 1
  storage:
    type: filesystem

singleBinary:
  replicas: 1
  persistence:
    size: 50Gi
    enableStatefulSetAutoDeletePVC: true
  affinity: |
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - preference:
            matchExpressions:
              - key: kubernetes.azure.com/mode
                operator: In
                values:
                  - user
          weight: 50
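
Helm prints `cannot overwrite table with non table` when a chart default is a map and the supplied value is not one. The default shown in the warning for `singleBinary.affinity` is a map, while the 5.44.4 values above pass a string through the `|` block scalar. A minimal sketch of the same rule written as a map follows, assuming that mismatch is the cause; the thread does not show the author's actual fix.

```yaml
# Sketch only: the node affinity from the 5.44.4 values, as a YAML map
# instead of a `|` string, matching the shape of the 6.x default.
singleBinary:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50
          preference:
            matchExpressions:
              - key: kubernetes.azure.com/mode
                operator: In
                values:
                  - user
```

Since Helm merges maps key by key, the chart's default `podAntiAffinity` would presumably stay in place alongside this `nodeAffinity` unless it is overridden as well.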

values.yaml (used in 6.5.0)

--- https://github.com/grafana/loki/blob/main/production/helm/loki/values.yaml

deploymentMode: SingleBinary
loki:
  auth_enabled: false
  query_scheduler:
    max_outstanding_requests_per_tenant: 2048
  query_range:
    parallelise_shardable_queries: false
  limits_config:
    split_queries_by_interval: 0
  commonConfig:
    replication_factor: 1
  storage:
    type: filesystem
  schemaConfig:
    configs:
      - from: 2024-04-01
        object_store: filesystem
        store: tsdb
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  ingester:
    chunk_encoding: snappy
  tracing:
    enabled: true
  querier:
    max_concurrent: 1

backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0

singleBinary:
  replicas: 1
  persistence:
    size: 50Gi
    enableStatefulSetAutoDeletePVC: true
    enabled: true
  extraArgs:
    - -config.expand-env=true

chunksCache:
  allocatedMemory: 1024
  writebackSizeLimit: 10MB

  2. After upgrading to 6.5.0, the loki-0 pod goes into CrashLoopBackOff with the error below.

Error

ubuntu@NARU-Pr5530:~$ kubectl logs loki-0 -n loki
failed parsing config: /etc/loki/config/config.yaml: yaml: unmarshal errors:
  line 2: field Error not found in type loki.ConfigWrapper
ubuntu@NARU-Pr5530:~$

ubuntu@NARU-Pr5530:~$ kubectl get pods -n loki -o wide
NAME                            READY   STATUS             RESTARTS      AGE   IP              NODE                                 NOMINATED NODE   READINESS GATES
loki-0                          0/1     CrashLoopBackOff   1 (11s ago)   74s   10.101.80.28    aks-npu2-21504394-vmss00000f         <none>           <none>
loki-canary-6qngw               1/1     Running            0             74s   10.101.80.158   aks-npsystem01-10976478-vmss000000   <none>           <none>
loki-canary-6v6bz               1/1     Running            0             75s   10.101.80.136   aks-npu2-21504394-vmss000000         <none>           <none>
loki-canary-krnqv               1/1     Running            0             75s   10.101.80.240   aks-npu2-21504394-vmss00000f         <none>           <none>
loki-canary-twcl5               1/1     Running            0             75s   10.101.80.213   aks-npu2-21504394-vmss00000h         <none>           <none>
loki-chunks-cache-0             0/2     Pending            0             74s   <none>          <none>                               <none>           <none>
loki-gateway-668c5dff6c-l7hd5   1/1     Running            0             74s   10.101.80.173   aks-npsystem01-10976478-vmss000000   <none>           <none>
loki-results-cache-0            2/2     Running            0             74s   10.101.80.175   aks-npsystem01-10976478-vmss000000   <none>           <none>

Kindly do the needful.

@JStickler added the area/helm, type/bug, and upgrade labels on May 13, 2024
@numa1985
Author

We currently have no monitoring because of this issue. It would be really great if someone could assist with this.

@numa1985
Author

numa1985 commented May 20, 2024

I was able to fix the affinity issue. The only issue I have not been able to figure out is the config issue. Please find the values.yaml below. Kindly do the needful. Thanks.

[3 images attached]

@numa1985
Author

numa1985 commented May 20, 2024

Was able to fix all the issues. Thanks.

@krptg0

krptg0 commented May 23, 2024

> Was able to fix all the issues. Thanks

Care to share how you solved the issue?

@numa1985
Author

@krptg0: The issue was related to the "parallelise_shardable_queries: true" setting. In the chart version we used before (5.44.4) it sat under "loki.query_range", but after the upgrade to 6.5.0 it has to be moved to loki.structuredConfig.query_range. The Grafana documentation page should also be updated to say so until there is a permanent fix. This seems to be a bug in the latest chart, and I saw that another user had already raised an issue for it a few weeks ago.

5.44.4

loki:
  query_scheduler:
    max_outstanding_requests_per_tenant: 2048
  query_range:
    parallelise_shardable_queries: false
    split_queries_by_interval: 0

6.5.0

loki:
  commonConfig:
    replication_factor: 1
  query_scheduler:
    max_outstanding_requests_per_tenant: 2048
  structuredConfig:
    query_range:
      parallelise_shardable_queries: true

Thanks

@sslny57

sslny57 commented May 23, 2024

@krptg0 I don't see that solving the issue. Would you be able to share the config that worked for you?

helm upgrade --reset-values my-loki -f values-loki.yaml grafana/loki -n vector --debug --version 6.5.2
upgrade.go:155: [debug] preparing upgrade for my-loki
upgrade.go:536: [debug] resetting values to the chart's original version
coalesce.go:286: warning: cannot overwrite table with non table for loki.singleBinary.affinity (map[podAntiAffinity:map[requiredDuringSchedulingIgnoredDuringExecution:[map[labelSelector:map[matchLabels:map[app.kubernetes.io/component:single-binary]] topologyKey:kubernetes.io/hostname]]]])
coalesce.go:286: warning: cannot overwrite table with non table for loki.read.affinity (map[podAntiAffinity:map[requiredDuringSchedulingIgnoredDuringExecution:[map[labelSelector:map[matchLabels:map[app.kubernetes.io/component:read]] topologyKey:kubernetes.io/hostname]]]])
coalesce.go:286: warning: cannot overwrite table with non table for loki.tableManager.affinity (map[podAntiAffinity:map[requiredDuringSchedulingIgnoredDuringExecution:[map[labelSelector:map[matchLabels:map[app.kubernetes.io/component:table-manager]] topologyKey:kubernetes.io/hostname]]]])
coalesce.go:286: warning: cannot overwrite table with non table for loki.write.affinity (map[podAntiAffinity:map[requiredDuringSchedulingIgnoredDuringExecution:[map[labelSelector:map[matchLabels:map[app.kubernetes.io/component:write]] topologyKey:kubernetes.io/hostname]]]])
coalesce.go:286: warning: cannot overwrite table with non table for loki.gateway.affinity (map[podAntiAffinity:map[requiredDuringSchedulingIgnoredDuringExecution:[map[labelSelector:map[matchLabels:map[app.kubernetes.io/component:gateway]] topologyKey:kubernetes.io/hostname]]]])
coalesce.go:286: warning: cannot overwrite table with non table for loki.backend.affinity (map[podAntiAffinity:map[requiredDuringSchedulingIgnoredDuringExecution:[map[labelSelector:map[matchLabels:map[app.kubernetes.io/component:backend]] topologyKey:kubernetes.io/hostname]]]])
coalesce.go:286: warning: cannot overwrite table with non table for loki.singleBinary.affinity (map[podAntiAffinity:map[requiredDuringSchedulingIgnoredDuringExecution:[map[labelSelector:map[matchLabels:map[app.kubernetes.io/component:single-binary]] topologyKey:kubernetes.io/hostname]]]])
coalesce.go:286: warning: cannot overwrite table with non table for loki.read.affinity (map[podAntiAffinity:map[requiredDuringSchedulingIgnoredDuringExecution:[map[labelSelector:map[matchLabels:map[app.kubernetes.io/component:read]] topologyKey:kubernetes.io/hostname]]]])
coalesce.go:286: warning: cannot overwrite table with non table for loki.gateway.affinity (map[podAntiAffinity:map[requiredDuringSchedulingIgnoredDuringExecution:[map[labelSelector:map[matchLabels:map[app.kubernetes.io/component:gateway]] topologyKey:kubernetes.io/hostname]]]])
coalesce.go:286: warning: cannot overwrite table with non table for loki.backend.affinity (map[podAntiAffinity:map[requiredDuringSchedulingIgnoredDuringExecution:[map[labelSelector:map[matchLabels:map[app.kubernetes.io/component:backend]] topologyKey:kubernetes.io/hostname]]]])
coalesce.go:286: warning: cannot overwrite table with non table for loki.write.affinity (map[podAntiAffinity:map[requiredDuringSchedulingIgnoredDuringExecution:[map[labelSelector:map[matchLabels:map[app.kubernetes.io/component:write]] topologyKey:kubernetes.io/hostname]]]])
coalesce.go:286: warning: cannot overwrite table with non table for loki.tableManager.affinity (map[podAntiAffinity:map[requiredDuringSchedulingIgnoredDuringExecution:[map[labelSelector:map[matchLabels:map[app.kubernetes.io/component:table-manager]] topologyKey:kubernetes.io/hostname]]]])
Error: UPGRADE FAILED: execution error at (loki/templates/validate.yaml:40:4): You must provide a schema_config for Loki, one is not provided as this will be individual for every Loki cluster. See https://grafana.com/docs/loki/latest/operations/storage/schema/ for schema information. For quick testing (with no persistence) add --set loki.useTestSchema=true
helm.go:84: [debug] execution error at (loki/templates/validate.yaml:40:4): You must provide a schema_config for Loki, one is not provided as this will be individual for every Loki cluster. See https://grafana.com/docs/loki/latest/operations/storage/schema/ for schema information. For quick testing (with no persistence) add --set loki.useTestSchema=true
UPGRADE FAILED
main.newUpgradeCmd.func2
        helm.sh/helm/v3/cmd/helm/upgrade.go:229
github.com/spf13/cobra.(*Command).execute
        github.com/spf13/cobra@v1.8.0/command.go:983
github.com/spf13/cobra.(*Command).ExecuteC
        github.com/spf13/cobra@v1.8.0/command.go:1115
github.com/spf13/cobra.(*Command).Execute
        github.com/spf13/cobra@v1.8.0/command.go:1039
main.main
        helm.sh/helm/v3/cmd/helm/helm.go:83
runtime.main
        runtime/proc.go:267
runtime.goexit
        runtime/asm_amd64.s:1650

I am currently on 5.47.2

helm ls -a
NAME         NAMESPACE   REVISION   UPDATED                                 STATUS     CHART            APP VERSION
my-grafana   vector      2          2024-05-21 11:41:27.3348885 +0530 IST   deployed   grafana-7.3.11   10.4.1
my-loki      vector      1          2024-05-21 10:58:05.3864634 +0530 IST   deployed   loki-5.47.2      2.9.6
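
The `You must provide a schema_config for Loki` failure in the output above means the values file has no `loki.schemaConfig`. A minimal sketch follows, modeled on the 6.5.0 values posted at the top of this issue. The `from` date and the filesystem store are carried over from that post as placeholders rather than recommendations. On a cluster that already holds data from 5.x, the schema presumably has to match what that deployment was using, so treat this only as the shape of the block.

```yaml
# Sketch only: shape of loki.schemaConfig, copied from the 6.5.0 values
# earlier in this issue. Adapt `from`, `object_store`, and `store` to the
# existing deployment before using it.
loki:
  schemaConfig:
    configs:
      - from: 2024-04-01
        object_store: filesystem
        store: tsdb
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
```

For a throwaway test without persistence, the error message itself points to `--set loki.useTestSchema=true` as an alternative.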

@sslny57

sslny57 commented May 23, 2024

values_29042024_loki.txt

helm upgrade --reset-values my-loki -f values_29042024.yaml grafana/loki -n vector --debug --version 6.5.2

With the attached Helm values file the upgrade went through, but a couple of pods still ran into errors.
output.txt

@sslny57

sslny57 commented May 23, 2024

In the gateway pod:

Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  17m                   default-scheduler  Successfully assigned vector/my-loki-gateway-548dd78cd8-wrgnd to ip-10-0-1-223.eu-west-2.compute.internal
  Normal   Pulled     17m                   kubelet            Container image "docker.io/nginxinc/nginx-unprivileged:1.24-alpine" already present on machine
  Normal   Created    17m                   kubelet            Created container nginx
  Normal   Started    17m                   kubelet            Started container nginx
  Warning  Unhealthy  2m8s (x101 over 16m)  kubelet            Readiness probe errored: strconv.Atoi: parsing "http": invalid syntax

@sslny57

sslny57 commented May 23, 2024




NAME                                              READY   STATUS             RESTARTS      AGE
loki-backend-0                                    2/2     Running            3 (12m ago)   12m
loki-backend-1                                    1/2     CrashLoopBackOff   3 (13s ago)   75s
loki-canary-9hrdt                                 1/1     Running            0             25m
loki-canary-gqktk                                 1/1     Running            0             24m
loki-canary-q6r28                                 1/1     Running            0             23m
loki-canary-rbgbl                                 1/1     Running            0             26m
loki-read-b76c4bff4-kv9qj                         1/1     Running            0             81s
loki-read-b76c4bff4-sjjg4                         1/1     Running            0             50s
loki-write-0                                      1/1     Running            0             25m
loki-write-1                                      0/1     Running            0             8s
loki-write-2                                      1/1     Running            0             80s
my-grafana-7cfd6ffc59-cjhtp                       1/1     Running            0             27m
my-loki-chunks-cache-0                            2/2     Running            0             12m
my-loki-gateway-548dd78cd8-wrgnd                  0/1     Running            0             27m
my-loki-gateway-66f8b59d65-75z95                  0/1     Running            0             34m
my-loki-grafana-agent-operator-6b4f987557-655hx   1/1     Running            0             27m
my-loki-logs-5sr6b                                2/2     Running            0             2d10h
my-loki-logs-cdskt                                2/2     Running            0             2d11h
my-loki-logs-jvdnv                                2/2     Running            0             21m
my-loki-logs-z28sp                                2/2     Running            0             2d11h
my-loki-results-cache-0                           2/2     Running            0             12m
my-vector-0                                       1/1     Running            0             26m


$  kubectl logs loki-backend-1 -c loki
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x288 pc=0x22f02b0]

goroutine 1 [running]:
github.com/grafana/loki/v3/pkg/loki.(*Loki).updateConfigForShipperStore(0xc000a2ff40?)
        /src/loki/pkg/loki/modules.go:755 +0xb0
github.com/grafana/loki/v3/pkg/loki.(*Loki).initBloomStore(0xc000bf3500)
        /src/loki/pkg/loki/modules.go:715 +0x68
github.com/grafana/dskit/modules.(*Manager).initModule(0xc000a62708, {0x7ffd42c2c27d, 0x7}, 0x1?, 0xc0016800c0?)
        /src/loki/vendor/github.com/grafana/dskit/modules/modules.go:136 +0x1f7
github.com/grafana/dskit/modules.(*Manager).InitModuleServices(0x0?, {0xc0008f4910, 0x1, 0xc000c36360?})
        /src/loki/vendor/github.com/grafana/dskit/modules/modules.go:108 +0xd8
github.com/grafana/loki/v3/pkg/loki.(*Loki).Run(0xc000bf3500, {0x0?, {0x4?, 0x3?, 0x4912940?}})
        /src/loki/pkg/loki/loki.go:453 +0x9d
main.main()
        /src/loki/cmd/loki/main.go:122 +0x113b

@sslny57

sslny57 commented May 23, 2024

Fixed this by making the following change to the Helm values:

https://github.com/grafana/loki/blob/main/production/helm/loki/values.yaml#L337-L345

  readinessProbe:
    httpGet:
      path: /
      port: http-metrics
    initialDelaySeconds: 15
    timeoutSeconds: 1
