Splunk Operator: add some troubleshooting advices. #1293

Open
yaroslav-nakonechnikov opened this issue Feb 22, 2024 · 5 comments
Labels: documentation, Q2

Comments

@yaroslav-nakonechnikov

Please select the type of request

Enhancement

Tell us more

Describe the request
I need a better understanding of the phase statuses reported for the CRDs:

$ for crd in $(kubectl get crd | grep splunk | awk -F ' ' '{print $1}'); do echo $crd; kubectl get $crd -n splunk-operator; done
clustermanagers.enterprise.splunk.com
NAME    PHASE   MANAGER   DESIRED   READY   AGE
39188   Ready                               6d1h
clustermasters.enterprise.splunk.com
No resources found in splunk-operator namespace.
indexerclusters.enterprise.splunk.com
NAME          PHASE   MASTER   MANAGER   DESIRED   READY   AGE
site1-39188   Error            Ready     3         3       6d1h
site2-39188   Error            Ready     3         3       6d1h
site3-39188   Error            Ready     3         3       6d1h
site4-39188   Error            Ready     4         3       6d1h
site5-39188   Error            Ready     3         3       6d1h
site6-39188   Error            Ready     3         3       6d1h
licensemanagers.enterprise.splunk.com
NAME    PHASE   AGE
39188   Ready   6d1h
licensemasters.enterprise.splunk.com
No resources found in splunk-operator namespace.
monitoringconsoles.enterprise.splunk.com
NAME    PHASE     DESIRED   READY   AGE
39188   Pending                     6d1h
searchheadclusters.enterprise.splunk.com
NAME      PHASE   DEPLOYER   DESIRED   READY   AGE
e-39188   Error   Error      3         3       6d1h
standalones.enterprise.splunk.com
NAME      PHASE   DESIRED   READY   AGE
c-39188   Ready   1         1       6d1h

Ready is clear: everything works as expected.
Pending is understandable: the pod is not running yet.
Error is not clear at all.
For example:

$ kubectl get pods -n splunk-operator | grep site
splunk-site1-39188-indexer-0                          1/1     Running   0               67m
splunk-site1-39188-indexer-1                          1/1     Running   0               67m
splunk-site1-39188-indexer-2                          1/1     Running   0               67m
splunk-site2-39188-indexer-0                          1/1     Running   0               67m
splunk-site2-39188-indexer-1                          1/1     Running   0               67m
splunk-site2-39188-indexer-2                          1/1     Running   0               67m
splunk-site3-39188-indexer-0                          1/1     Running   0               67m
splunk-site3-39188-indexer-1                          1/1     Running   0               67m
splunk-site3-39188-indexer-2                          1/1     Running   0               67m
splunk-site5-39188-indexer-0                          1/1     Running   0               67m
splunk-site5-39188-indexer-1                          1/1     Running   0               67m
splunk-site5-39188-indexer-2                          1/1     Running   0               67m
splunk-site6-39188-indexer-0                          1/1     Running   0               67m
splunk-site6-39188-indexer-1                          1/1     Running   0               67m
splunk-site6-39188-indexer-2                          1/1     Running   0               67m

So I would expect Ready for 5 of the 6 resources, but every indexer cluster resource reports Error.
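One way to see at a glance which resources are not Ready is to filter the table output on the PHASE column. The following is a minimal sketch and not part of the operator: `not_ready` is a hypothetical helper name, and it assumes PHASE is always the second column, as in the listings above.

```shell
# Hypothetical helper: reads `kubectl get <crd>` table output on stdin and
# prints NAME and PHASE for every row whose PHASE is not "Ready".
# Assumes PHASE is the second column, as in the listings in this issue.
not_ready() {
  awk '$1 != "NAME" && NF >= 2 && $2 != "Ready" { print $1, $2 }'
}

# Example (requires cluster access):
#   kubectl get indexerclusters.enterprise.splunk.com -n splunk-operator | not_ready
```

The header row is skipped by name rather than by position, so the output of several `kubectl get` calls can be concatenated and piped through the same filter.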

$ kubectl describe indexerclusters.enterprise.splunk.com site4-39188 -n splunk-operator
....
Status:
  Idxc Password Changed Secrets:
  Cluster Manager Phase:  Ready
  indexer_secret_changed_flag:
  indexing_ready_flag:                       true
  initialized_flag:                          true
  maintenance_mode:                          false
  namespace_scoped_secret_resource_version:  23540
  Peers:
    active_bundle_id:  28AE1272BABED6776EC277C6C661CCCC
    bucket_count:      64
    GUID:              681F5B56-E71A-43F6-ABC1-A9C11B7D1A9E
    is_searchable:     true
    Name:              splunk-site4-39188-indexer-0
    Status:            Up
    active_bundle_id:  28AE1272BABED6776EC277C6C661CCCC
    bucket_count:      56
    GUID:              1C571956-06BF-43E1-B706-7626B15B0C32
    is_searchable:     true
    Name:              splunk-site4-39188-indexer-1
    Status:            Up
    active_bundle_id:  28AE1272BABED6776EC277C6C661CCCC
    bucket_count:      64
    GUID:              4564371D-1FC6-4A9C-A5D2-18A26DB70783
    is_searchable:     true
    Name:              splunk-site4-39188-indexer-2
    Status:            Up
  Phase:               Error
  Ready Replicas:      3
  Replicas:            4
  Selector:            app.kubernetes.io/instance=splunk-site4-39188-indexer
  service_ready_flag:  true
Events:                <none>

This gives no error code or message, just Phase: Error.
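Although the status carries no error message, the Replicas and Ready Replicas fields in the output above do differ (4 versus 3). A small sketch that pulls that gap out of `kubectl describe` output follows. `replica_gap` is a hypothetical helper name, and a gap only hints at where to look; it does not explain why the phase is Error.

```shell
# Hypothetical helper: reads `kubectl describe` output on stdin and compares
# the "Replicas" and "Ready Replicas" fields. If a field appears more than
# once, the last occurrence wins.
replica_gap() {
  awk -F': *' '
    /^ *Ready Replicas:/ { ready = $2 + 0; have_ready = 1 }
    /^ *Replicas:/       { desired = $2 + 0; have_desired = 1 }
    END {
      if (!have_ready || !have_desired) {
        print "replica counts not found"
        exit 1
      }
      if (ready < desired)
        print "missing " (desired - ready) " of " desired " replicas"
      else
        print "all " desired " replicas ready"
    }'
}

# Example (requires cluster access):
#   kubectl describe indexerclusters.enterprise.splunk.com site4-39188 \
#     -n splunk-operator | replica_gap
```

For the site4-39188 output shown above, this would report one missing replica out of four.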

Expected behavior
There should be an article with a troubleshooting section that gives advice on how to check these statuses.

@yaroslav-nakonechnikov
Author

So, I tried downgrading from 2.5.1 (and 2.5.0) to 2.4.0, and it worked, which I did not expect at all.

@vivekr-splunk
Collaborator

Hi @yaroslav-nakonechnikov, when you say it worked after downgrading to 2.4.0, can you elaborate on what worked? Are you saying the phase changed to "Ready"?

@yaroslav-nakonechnikov
Author

@vivekr-splunk Yes, the phase became Ready, and site4 was recreated, as expected.

@yaroslav-nakonechnikov
Author

This is becoming critical.

The custom resources are created:

[yn@ip-10-216-35-48 /]$ kubectl get indexerclusters -n splunk-operator
NAME          PHASE   MASTER   MANAGER   DESIRED   READY   AGE
site1-32002   Error            Ready     3         0       77m
site2-32002   Error            Ready     3         0       77m
site3-32002   Error            Ready     3         0       77m
site4-32002   Error            Ready     3         0       77m
site5-32002   Error            Ready     3         0       77m
site6-32002   Error            Ready     3         5       5d4h

In the manager I see the following logs:

2024-03-19T13:49:32.244551457Z  INFO    start   {"controller": "indexercluster", "controllerGroup": "enterprise.splunk.com", "controllerKind": "IndexerCluster", "IndexerCluster": {"name":"site5-32002","namespace":"splunk-operator"}, "namespace": "splunk-operator", "name": "site5-32002", "reconcileID": "34cb7d35-7ea9-4608-9181-0380bf16d8aa", "indexercluster": "splunk-operator/site5-32002", "CR version": "4044269"}
2024-03-19T13:49:32.244647588Z  INFO    ApplyConfigMap  No changes for ConfigMap        {"controller": "indexercluster", "controllerGroup": "enterprise.splunk.com", "controllerKind": "IndexerCluster", "IndexerCluster": {"name":"site5-32002","namespace":"splunk-operator"}, "namespace": "splunk-operator", "name": "site5-32002", "reconcileID": "34cb7d35-7ea9-4608-9181-0380bf16d8aa", "name": "splunk-site5-32002-indexer-defaults", "namespace": "splunk-operator"}
2024-03-19T13:49:32.413422389Z  INFO    ApplyService    No update to existing Service   {"controller": "indexercluster", "controllerGroup": "enterprise.splunk.com", "controllerKind": "IndexerCluster", "IndexerCluster": {"name":"site5-32002","namespace":"splunk-operator"}, "namespace": "splunk-operator", "name": "site5-32002", "reconcileID": "34cb7d35-7ea9-4608-9181-0380bf16d8aa", "name": "splunk-site5-32002-indexer-headless", "namespace": "splunk-operator"}
2024-03-19T13:49:32.413454413Z  INFO    ApplyService    No update to existing Service   {"controller": "indexercluster", "controllerGroup": "enterprise.splunk.com", "controllerKind": "IndexerCluster", "IndexerCluster": {"name":"site5-32002","namespace":"splunk-operator"}, "namespace": "splunk-operator", "name": "site5-32002", "reconcileID": "34cb7d35-7ea9-4608-9181-0380bf16d8aa", "name": "splunk-site5-32002-indexer-service", "namespace": "splunk-operator"}
2024-03-19T13:49:32.4136215Z    INFO    ApplyConfigMap  No changes for ConfigMap        {"controller": "indexercluster", "controllerGroup": "enterprise.splunk.com", "controllerKind": "IndexerCluster", "IndexerCluster": {"name":"site5-32002","namespace":"splunk-operator"}, "namespace": "splunk-operator", "name": "site5-32002", "reconcileID": "34cb7d35-7ea9-4608-9181-0380bf16d8aa", "name": "splunk-splunk-operator-probe-configmap", "namespace": "splunk-operator"}
2024-03-19T13:49:32.413793087Z  INFO    getLivenessProbe        LivenessProbe   {"controller": "indexercluster", "controllerGroup": "enterprise.splunk.com", "controllerKind": "IndexerCluster", "IndexerCluster": {"name":"site5-32002","namespace":"splunk-operator"}, "namespace": "splunk-operator", "name": "site5-32002", "reconcileID": "34cb7d35-7ea9-4608-9181-0380bf16d8aa", "name": "site5-32002", "namespace": "splunk-operator", "Configured": "&Probe{ProbeHandler:ProbeHandler{Exec:&ExecAction{Command:[/mnt/probes/livenessProbe.sh],},HTTPGet:nil,TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:240,TimeoutSeconds:30,PeriodSeconds:30,SuccessThreshold:0,FailureThreshold:60,TerminationGracePeriodSeconds:nil,}"}
2024-03-19T13:49:32.413815103Z  INFO    getReadinessProbe       ReadinessProbe  {"controller": "indexercluster", "controllerGroup": "enterprise.splunk.com", "controllerKind": "IndexerCluster", "IndexerCluster": {"name":"site5-32002","namespace":"splunk-operator"}, "namespace": "splunk-operator", "name": "site5-32002", "reconcileID": "34cb7d35-7ea9-4608-9181-0380bf16d8aa", "name": "site5-32002", "namespace": "splunk-operator", "Configured": "&Probe{ProbeHandler:ProbeHandler{Exec:&ExecAction{Command:[/mnt/probes/readinessProbe.sh],},HTTPGet:nil,TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:240,TimeoutSeconds:30,PeriodSeconds:30,SuccessThreshold:0,FailureThreshold:60,TerminationGracePeriodSeconds:nil,}"}
2024-03-19T13:49:32.41382608Z   INFO    getStartupProbe StartupProbe    {"controller": "indexercluster", "controllerGroup": "enterprise.splunk.com", "controllerKind": "IndexerCluster", "IndexerCluster": {"name":"site5-32002","namespace":"splunk-operator"}, "namespace": "splunk-operator", "name": "site5-32002", "reconcileID": "34cb7d35-7ea9-4608-9181-0380bf16d8aa", "name": "site5-32002", "namespace": "splunk-operator", "Configured": "&Probe{ProbeHandler:ProbeHandler{Exec:&ExecAction{Command:[/mnt/probes/startupProbe.sh],},HTTPGet:nil,TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:40,TimeoutSeconds:30,PeriodSeconds:60,SuccessThreshold:0,FailureThreshold:60,TerminationGracePeriodSeconds:nil,}"}
2024-03-19T13:49:32.41399514Z   INFO    isClusterManagerReadyForUpgrade kind is set to  {"controller": "indexercluster", "controllerGroup": "enterprise.splunk.com", "controllerKind": "IndexerCluster", "IndexerCluster": {"name":"site5-32002","namespace":"splunk-operator"}, "namespace": "splunk-operator", "name": "site5-32002", "reconcileID": "34cb7d35-7ea9-4608-9181-0380bf16d8aa", "name": "site5-32002", "namespace": "splunk-operator", "kind": "IndexerCluster"}
2024-03-19T13:49:32.414144671Z  INFO    updateCRStatus  Trying to update        {"controller": "indexercluster", "controllerGroup": "enterprise.splunk.com", "controllerKind": "IndexerCluster", "IndexerCluster": {"name":"site5-32002","namespace":"splunk-operator"}, "namespace": "splunk-operator", "name": "site5-32002", "reconcileID": "34cb7d35-7ea9-4608-9181-0380bf16d8aa", "original cr version": "4044269", "count": 0}
2024-03-19T13:49:32.428603698Z  INFO    updateCRStatus  Status update successful        {"controller": "indexercluster", "controllerGroup": "enterprise.splunk.com", "controllerKind": "IndexerCluster", "IndexerCluster": {"name":"site5-32002","namespace":"splunk-operator"}, "namespace": "splunk-operator", "name": "site5-32002", "reconcileID": "34cb7d35-7ea9-4608-9181-0380bf16d8aa", "original cr version": "4044269", "current CR version": "4044269", "updated CR version":"4044269"}
2024-03-19T13:49:32.428634239Z  INFO    updateCRStatus  Cache is reflecting the latest CR       {"controller": "indexercluster", "controllerGroup": "enterprise.splunk.com", "controllerKind": "IndexerCluster", "IndexerCluster": {"name":"site5-32002","namespace":"splunk-operator"}, "namespace": "splunk-operator", "name": "site5-32002", "reconcileID": "34cb7d35-7ea9-4608-9181-0380bf16d8aa", "original cr version": "4044269", "updated CR version": "4044269"}
2024-03-19T13:49:32.428639088Z  INFO    Requeued        {"controller": "indexercluster", "controllerGroup": "enterprise.splunk.com", "controllerKind": "IndexerCluster", "IndexerCluster": {"name":"site5-32002","namespace":"splunk-operator"}, "namespace": "splunk-operator", "name": "site5-32002", "reconcileID": "34cb7d35-7ea9-4608-9181-0380bf16d8aa", "indexercluster": "splunk-operator/site5-32002", "period(seconds)": 5}
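
Every line in this excerpt is logged at INFO level, so nothing here explains the Error phase. To check a longer log for non-INFO entries that belong to one resource, a filter along these lines may help. This is only a sketch: `cr_problems` is a hypothetical helper name, and it assumes the layout shown above, with the level in the second whitespace-separated field and the resource appearing as `"name":"<cr>"`. The deployment and container names in the usage comment are also assumptions and may differ per installation.

```shell
# Hypothetical helper: reads operator manager logs on stdin and prints the
# lines for the given custom resource whose level is not INFO.
# $1: custom resource name, e.g. site5-32002
cr_problems() {
  grep -F "\"name\":\"$1\"" | awk '$2 != "INFO"'
}

# Example (requires cluster access; deployment and container names are assumed):
#   kubectl logs deployment/splunk-operator-controller-manager -c manager \
#     -n splunk-operator | cr_problems site5-32002
```

An empty result means the operator logged nothing above INFO for that resource, which is itself worth noting when reporting the problem.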

But the StatefulSets are not created:

[yn@ip-10-216-35-48 /]$ kubectl get statefulset -n splunk-operator
NAME                              READY   AGE
splunk-32002-cluster-manager      1/1     41m
splunk-32002-license-manager      1/1     62m
splunk-32002-monitoring-console   0/1     24m
[yn@ip-10-216-35-48 /]$

Why? What is wrong?
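
To list which indexer clusters have no StatefulSet, the two name lists can be compared. The sketch below is a rough illustration, not an operator feature: `missing_statefulsets` is a hypothetical helper name, and the `splunk-<cr>-indexer` naming is an assumption inferred from the pod names and selector shown earlier in this issue.

```shell
# Hypothetical helper: prints the StatefulSet name expected for each indexer
# cluster that is absent from the list of existing StatefulSets.
# $1: space-separated IndexerCluster names
# $2: space- or newline-separated StatefulSet names
missing_statefulsets() {
  for cr in $1; do
    expected="splunk-${cr}-indexer"   # assumed naming, inferred from pod names
    case " $(echo $2) " in
      *" $expected "*) ;;             # StatefulSet exists
      *) echo "$expected" ;;          # StatefulSet is missing
    esac
  done
}

# Example (requires cluster access):
#   crs=$(kubectl get indexerclusters.enterprise.splunk.com -n splunk-operator \
#     -o jsonpath='{.items[*].metadata.name}')
#   sts=$(kubectl get statefulset -n splunk-operator \
#     -o jsonpath='{.items[*].metadata.name}')
#   missing_statefulsets "$crs" "$sts"
```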

@vivekr-splunk
Collaborator

@yaroslav-nakonechnikov, we acknowledge that error codes and messages are not clearly documented. We're currently planning to revamp the error handling process and include error messages in the status to provide a clearer explanation of why the reconciliation fails. We'll keep you informed once this is completed.
