This repository has been archived by the owner on Oct 22, 2021. It is now read-only.

log-cache NODE_INDEX incorrect, does not reference bpm.yml when multi_az enabled #1106

Open
ShuangMen opened this issue Aug 31, 2020 · 4 comments
Labels: bug, unscheduled

Comments

ShuangMen (Contributor) commented Aug 31, 2020

Describe the bug
When deploying kubecf with multi_az=true, all log-cache jobs start with NODE_INDEX=0, which prevents the log-cache cluster from working properly.

cf logs does not work.
cf push xxx and cf app xxx fail with a client timeout.

To Reproduce
cf-operator: 5.2.0
kubecf: v2.2.3
Deploy kubecf with multi_az=true.
Configure log-cache with more than one instance.
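
For reference, a minimal sketch of such a deployment; the chart path and the sizing key are assumptions, so verify them against your kubecf version's values.yaml:

# hypothetical install command; value paths assumed, check values.yaml
helm install kubecf ./kubecf-v2.2.3.tgz \
  --namespace kubecf \
  --set multi_az=true \
  --set 'sizing.log-cache.instances=2'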

The resulting pods, for example:

$ k get pod -n kubecf |grep log-cache
log-cache-z0-0                           10/10   Running     0          18m
log-cache-z0-1                           10/10   Running     0          11m
log-cache-z1-0                           10/10   Running     0          18m
log-cache-z1-1                           10/10   Running     0          11m

Log in to a log-cache container (take log-cache-z1-0 for example) and check the NODE_INDEX environment variable:

sh-4.4# printenv |grep NODE
NODE_INDEX=0
NODE_ADDRS=log-cache-z0-0:8080,log-cache-z0-1:8080,log-cache-z1-0:8080,log-cache-z1-1:8080
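
The same check across all four pods shows that every instance received index 0. A hypothetical loop to confirm (container name taken from the pod spec):

# expected four distinct values (0 1 2 3); observed 0 for every pod
for p in log-cache-z0-0 log-cache-z0-1 log-cache-z1-0 log-cache-z1-1; do
  kubectl exec -n kubecf "$p" -c log-cache-log-cache -- printenv NODE_INDEX
done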

Check the file /var/vcap/jobs/log-cache/config/bpm.yml:

sh-4.4# cat bpm.yml |grep NODE_INDEX
    NODE_INDEX: "2"

Check the log-cache logs:

$ k logs log-cache-z1-0 -c log-cache-log-cache -n kubecf
2020/08/31 02:16:50 WARNING: proto: file "egress.proto" is already registered
A future release will panic on registration conflicts. See:
https://developers.google.com/protocol-buffers/docs/reference/go/faq#namespace-conflict

2020/08/31 02:16:50 WARNING: proto: file "ingress.proto" is already registered
A future release will panic on registration conflicts. See:
https://developers.google.com/protocol-buffers/docs/reference/go/faq#namespace-conflict

2020/08/31 02:16:50.027381 Starting Log Cache...
FIELD NAME:             TYPE:          ENV:                    REQUIRED:  VALUE:
Config.Addr             string         ADDR                    true       :8080
Config.QueryTimeout     time.Duration  QUERY_TIMEOUT           false      10s
Config.MemoryLimit      uint           MEMORY_LIMIT_PERCENT    false      50
Config.MaxPerSource     int            MAX_PER_SOURCE          false      100000
Config.NodeIndex        int            NODE_INDEX              false      0
Config.NodeAddrs        []string       NODE_ADDRS              false      [log-cache-z0-0:8080 log-cache-z0-1:8080 log-cache-z1-0:8080 log-cache-z1-1:8080]
TLS.CAPath              string         CA_PATH                 true       /var/vcap/jobs/log-cache/config/certs/ca.crt
TLS.CertPath            string         CERT_PATH               true       /var/vcap/jobs/log-cache/config/certs/log_cache.crt
TLS.KeyPath             string         KEY_PATH                true       /var/vcap/jobs/log-cache/config/certs/log_cache.key
MetricsServer.Port      uint16         METRICS_PORT            false      6060
MetricsServer.CAFile    string         METRICS_CA_FILE_PATH    false      /var/vcap/jobs/log-cache/config/certs/metrics_ca.crt
MetricsServer.CertFile  string         METRICS_CERT_FILE_PATH  false      /var/vcap/jobs/log-cache/config/certs/metrics.crt
MetricsServer.KeyFile   string         METRICS_KEY_FILE_PATH   false      /var/vcap/jobs/log-cache/config/certs/metrics.key
2020/08/31 02:16:50 Metrics endpoint is listening on [::]:6060

Check the log-cache statefulset in each zone:

$ k describe statefulset log-cache-z0 -n kubecf |grep NODE_INDEX
      NODE_INDEX:              0
$ k describe statefulset log-cache-z1 -n kubecf |grep NODE_INDEX
      NODE_INDEX:              0

NODE_INDEX is set in the Environment of the log-cache-log-cache container:

 Environment:
      ADDR:                    :8080
      CA_PATH:                 /var/vcap/jobs/log-cache/config/certs/ca.crt
      CERT_PATH:               /var/vcap/jobs/log-cache/config/certs/log_cache.crt
      KEY_PATH:                /var/vcap/jobs/log-cache/config/certs/log_cache.key
      MAX_PER_SOURCE:          100000
      MEMORY_LIMIT_PERCENT:    50
      METRICS_CA_FILE_PATH:    /var/vcap/jobs/log-cache/config/certs/metrics_ca.crt
      METRICS_CERT_FILE_PATH:  /var/vcap/jobs/log-cache/config/certs/metrics.crt
      METRICS_KEY_FILE_PATH:   /var/vcap/jobs/log-cache/config/certs/metrics.key
      METRICS_PORT:            6060
      NODE_ADDRS:              log-cache-z0-0:8080,log-cache-z0-1:8080,log-cache-z1-0:8080,log-cache-z1-1:8080
      NODE_INDEX:              0          
      QUERY_TIMEOUT:           10s
      KUBE_AZ:                 us-south-1
      BOSH_AZ:                 us-south-1
      CF_OPERATOR_AZ:          us-south-1
      AZ_INDEX:                1
      REPLICAS:                2
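
From the values above, the index rendered into bpm.yml appears to follow zone offset * REPLICAS + pod ordinal: for log-cache-z1-0 that is 1 * 2 + 0 = 2, which matches the bpm.yml value. A sketch of that inference (not the actual kubecf template logic, and whether AZ_INDEX is zero- or one-based is not confirmed by this report):

# hypothetical derivation of the expected per-pod index
zone="${HOSTNAME%-*}"                         # log-cache-z1-0 -> log-cache-z1
zone_offset="${zone##*-z}"                    # -> 1
ordinal="${HOSTNAME##*-}"                     # -> 0
echo $(( zone_offset * REPLICAS + ordinal ))  # 1 * 2 + 0 = 2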

Issue:
The log-cache job starts with the container environment value NODE_INDEX=0 instead of the value in bpm.yml, so all log-cache jobs run with NODE_INDEX=0 and fail to work as a cluster.
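
One way to confirm that the job really received the container value rather than the bpm.yml one is to read the environment of the running process directly; a hypothetical check, assuming pgrep is available in the container image:

# a 0 here confirms the container env overrode bpm.yml's 2
sh-4.4# pid=$(pgrep -o -f log-cache)
sh-4.4# tr '\0' '\n' < /proc/"$pid"/environ | grep '^NODE_INDEX'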

Expected behavior
Each log-cache job joins the cluster with the correct NODE_INDEX, i.e. the value from bpm.yml.

Environment
cf-operator: 5.2.0
kubecf: v2.2.3

ShuangMen added the bug label on Aug 31, 2020
cf-gitbot commented:

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/174568554

The labels on this GitHub issue will be updated when the story is started.

manno (Member) commented Oct 8, 2020

This should be fixed by https://www.pivotaltracker.com/story/show/174661471

manno (Member) commented Feb 23, 2021

Looks like this was never fixed. I added https://www.pivotaltracker.com/story/show/177062682
However, Kubecf uses log-cache as a singleton.

jandubois (Member) commented:

> However, Kubecf uses log-cache as a singleton.

It does, but it is still possible that we'll have to change that in the future... It is only a singleton because of memory leak issues, but making it a singleton causes early expiration of logs, so it is not really a solution, even if you accept that cf logs doesn't need to be HA...
