This repository has been archived by the owner on Oct 22, 2021. It is now read-only.

log-cache NODE_INDEX incorrect, does not reference bpm.yml when multi_az enabled #1106

Open
ShuangMen opened this issue Aug 31, 2020 · 4 comments
Labels: bug, unscheduled

Comments

ShuangMen (Contributor) commented Aug 31, 2020

Describe the bug
When deploying kubecf with multi_az=true, all log-cache jobs start with NODE_INDEX=0, which prevents the log-cache cluster from working properly.

cf logs does not work.
cf push xxx and cf app xxx fail with a client timeout.

To Reproduce
cf-operator: 5.2.0
kubecf: v2.2.3
Deploy kubecf with multi_az=true.
Configure log-cache with more than one instance.
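
For reference, a minimal sketch of such a deployment; the chart path and the sizing key are assumptions, so verify them against your kubecf version's values.yaml:

# hypothetical install command; value paths assumed, check values.yaml
helm install kubecf ./kubecf-v2.2.3.tgz \
  --namespace kubecf \
  --set multi_az=true \
  --set 'sizing.log-cache.instances=2'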

The resulting pods, for example:

$ k get pod -n kubecf |grep log-cache
log-cache-z0-0                           10/10   Running     0          18m
log-cache-z0-1                           10/10   Running     0          11m
log-cache-z1-0                           10/10   Running     0          18m
log-cache-z1-1                           10/10   Running     0          11m

Log in to a log-cache container (take log-cache-z1-0 for example) and check the NODE_INDEX environment variable:

sh-4.4# printenv |grep NODE
NODE_INDEX=0
NODE_ADDRS=log-cache-z0-0:8080,log-cache-z0-1:8080,log-cache-z1-0:8080,log-cache-z1-1:8080
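
The same check across all four pods shows that every instance received index 0. A hypothetical loop to confirm (container name taken from the pod spec):

# expected four distinct values (0 1 2 3); observed 0 for every pod
for p in log-cache-z0-0 log-cache-z0-1 log-cache-z1-0 log-cache-z1-1; do
  kubectl exec -n kubecf "$p" -c log-cache-log-cache -- printenv NODE_INDEX
done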

Check the file /var/vcap/jobs/log-cache/config/bpm.yml:

sh-4.4# cat bpm.yml |grep NODE_INDEX
    NODE_INDEX: "2"

Check the log-cache logs:

$ k logs log-cache-z1-0 -c log-cache-log-cache -n kubecf
2020/08/31 02:16:50 WARNING: proto: file "egress.proto" is already registered
A future release will panic on registration conflicts. See:
https://developers.google.com/protocol-buffers/docs/reference/go/faq#namespace-conflict

2020/08/31 02:16:50 WARNING: proto: file "ingress.proto" is already registered
A future release will panic on registration conflicts. See:
https://developers.google.com/protocol-buffers/docs/reference/go/faq#namespace-conflict

2020/08/31 02:16:50.027381 Starting Log Cache...
FIELD NAME:             TYPE:          ENV:                    REQUIRED:  VALUE:
Config.Addr             string         ADDR                    true       :8080
Config.QueryTimeout     time.Duration  QUERY_TIMEOUT           false      10s
Config.MemoryLimit      uint           MEMORY_LIMIT_PERCENT    false      50
Config.MaxPerSource     int            MAX_PER_SOURCE          false      100000
Config.NodeIndex        int            NODE_INDEX              false      0
Config.NodeAddrs        []string       NODE_ADDRS              false      [log-cache-z0-0:8080 log-cache-z0-1:8080 log-cache-z1-0:8080 log-cache-z1-1:8080]
TLS.CAPath              string         CA_PATH                 true       /var/vcap/jobs/log-cache/config/certs/ca.crt
TLS.CertPath            string         CERT_PATH               true       /var/vcap/jobs/log-cache/config/certs/log_cache.crt
TLS.KeyPath             string         KEY_PATH                true       /var/vcap/jobs/log-cache/config/certs/log_cache.key
MetricsServer.Port      uint16         METRICS_PORT            false      6060
MetricsServer.CAFile    string         METRICS_CA_FILE_PATH    false      /var/vcap/jobs/log-cache/config/certs/metrics_ca.crt
MetricsServer.CertFile  string         METRICS_CERT_FILE_PATH  false      /var/vcap/jobs/log-cache/config/certs/metrics.crt
MetricsServer.KeyFile   string         METRICS_KEY_FILE_PATH   false      /var/vcap/jobs/log-cache/config/certs/metrics.key
2020/08/31 02:16:50 Metrics endpoint is listening on [::]:6060

Check the log-cache statefulset in each zone:

$ k describe statefulset log-cache-z0 -n kubecf |grep NODE_INDEX
      NODE_INDEX:              0
$ k describe statefulset log-cache-z1 -n kubecf |grep NODE_INDEX
      NODE_INDEX:              0

NODE_INDEX is set in the Environment of the log-cache-log-cache container:

 Environment:
      ADDR:                    :8080
      CA_PATH:                 /var/vcap/jobs/log-cache/config/certs/ca.crt
      CERT_PATH:               /var/vcap/jobs/log-cache/config/certs/log_cache.crt
      KEY_PATH:                /var/vcap/jobs/log-cache/config/certs/log_cache.key
      MAX_PER_SOURCE:          100000
      MEMORY_LIMIT_PERCENT:    50
      METRICS_CA_FILE_PATH:    /var/vcap/jobs/log-cache/config/certs/metrics_ca.crt
      METRICS_CERT_FILE_PATH:  /var/vcap/jobs/log-cache/config/certs/metrics.crt
      METRICS_KEY_FILE_PATH:   /var/vcap/jobs/log-cache/config/certs/metrics.key
      METRICS_PORT:            6060
      NODE_ADDRS:              log-cache-z0-0:8080,log-cache-z0-1:8080,log-cache-z1-0:8080,log-cache-z1-1:8080
      NODE_INDEX:              0          
      QUERY_TIMEOUT:           10s
      KUBE_AZ:                 us-south-1
      BOSH_AZ:                 us-south-1
      CF_OPERATOR_AZ:          us-south-1
      AZ_INDEX:                1
      REPLICAS:                2
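
From the values above, the index rendered into bpm.yml appears to follow zone offset * REPLICAS + pod ordinal: for log-cache-z1-0 that is 1 * 2 + 0 = 2, which matches the bpm.yml value. A sketch of that inference (not the actual kubecf template logic, and whether AZ_INDEX is zero- or one-based is not confirmed by this report):

# hypothetical derivation of the expected per-pod index
zone="${HOSTNAME%-*}"                         # log-cache-z1-0 -> log-cache-z1
zone_offset="${zone##*-z}"                    # -> 1
ordinal="${HOSTNAME##*-}"                     # -> 0
echo $(( zone_offset * REPLICAS + ordinal ))  # 1 * 2 + 0 = 2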

Issue:
The log-cache job starts with the container environment value NODE_INDEX=0 instead of the value in bpm.yml, so all log-cache jobs run with NODE_INDEX=0 and fail to work as a cluster.
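
One way to confirm that the job really received the container value rather than the bpm.yml one is to read the environment of the running process directly; a hypothetical check, assuming pgrep is available in the container image:

# a 0 here confirms the container env overrode bpm.yml's 2
sh-4.4# pid=$(pgrep -o -f log-cache)
sh-4.4# tr '\0' '\n' < /proc/"$pid"/environ | grep '^NODE_INDEX'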

Expected behavior
Each log-cache job joins the cluster with the correct NODE_INDEX, i.e. the value from bpm.yml.

Environment
cf-operator: 5.2.0
kubecf: v2.2.3

ShuangMen added the bug label on Aug 31, 2020
cf-gitbot commented:

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/174568554

The labels on this GitHub issue will be updated when the story is started.

manno (Member) commented Oct 8, 2020

This should be fixed by https://www.pivotaltracker.com/story/show/174661471

manno (Member) commented Feb 23, 2021

Looks like this was never fixed. I added https://www.pivotaltracker.com/story/show/177062682
However, Kubecf uses log-cache as a singleton.

jandubois (Member) commented:

> However, Kubecf uses log-cache as a singleton.

It does, but it is still possible that we'll have to change that in the future... It is only a singleton because of memory leak issues, but making it a singleton causes early expiration of logs, so it is not really a solution, even if you accept that cf logs doesn't need to be HA...
