
Splunk Operator: indexers don't start with 9.1.2 #1260

Closed
yaroslav-nakonechnikov opened this issue Dec 13, 2023 · 15 comments

Comments

@yaroslav-nakonechnikov

Please select the type of request

Bug

Tell us more

Describe the request
All nodes start as expected, but the indexers do not.

Expected behavior
Everything works as it did before the upgrade.

Splunk setup on K8S
EKS

Reproduction/Testing steps

  • Start a cluster with 9.1.1, then replace the Splunk image with 9.1.2 (see the image-swap sketch below)
  • Or just try to start a cluster with 9.1.2
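A minimal sketch of the image swap on an indexer cluster CR, assuming the operator's spec.image field; the CR name, namespace, and clusterManagerRef below are hypothetical placeholders, and the apiVersion may differ between operator releases:

apiVersion: enterprise.splunk.com/v4   # adjust to the apiVersion your operator release serves
kind: IndexerCluster
metadata:
  name: site1            # hypothetical CR name
  namespace: splunk      # hypothetical namespace
spec:
  clusterManagerRef:
    name: cm             # hypothetical ClusterManager CR name
  replicas: 3
  image: splunk/splunk:9.1.2   # was splunk/splunk:9.1.1 before the swap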

K8s environment
1.28

Additional context (optional)

@yaroslav-nakonechnikov
Author

yaroslav-nakonechnikov commented Dec 13, 2023

After several tests, I can pinpoint what breaks.

In the cluster manager definition we had this:

smartstore:
    defaults:
      maxGlobalDataSizeMB: 0
      maxGlobalRawDataSizeMB: 0
      volumeName: smartstore
    indexes:
    - hotlistBloomFilterRecencyHours: 1
      hotlistRecencySecs: 3600
      name: tf-test
      remotePath: tf-test/
      volumeName: smartstore
    volumes:
    - endpoint: https://s3-eu-central-1.amazonaws.com
      name: smartstore
      path: bucket-for-smart-store
      provider: aws
      region: eu-central-1
      storageType: s3

When we removed that block and recreated the cluster manager and indexers, everything started to work.

The behavior is the same with splunk-operator versions 2.4.0 and latest.

@yaroslav-nakonechnikov
Author

Final tests show that the problem is in the defaults section.
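
For reference, a sketch of the same smartstore block with the defaults section removed, i.e. the workaround that made the indexers start; everything else is copied unchanged from the spec above:

smartstore:
    indexes:
    - hotlistBloomFilterRecencyHours: 1
      hotlistRecencySecs: 3600
      name: tf-test
      remotePath: tf-test/
      volumeName: smartstore
    volumes:
    - endpoint: https://s3-eu-central-1.amazonaws.com
      name: smartstore
      path: bucket-for-smart-store
      provider: aws
      region: eu-central-1
      storageType: s3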

@yaroslav-nakonechnikov
Author

Further investigation shows that splunk-operator creates these default settings:

[splunk@splunk-site1-indexer-0 splunk]$ bin/splunk btool indexes  list --debug | grep "\[default\]"
/opt/splunk/etc/peer-apps/splunk-operator/local/indexes.conf                 [default]
[splunk@splunk-site1-indexer-0 splunk]$ cat /opt/splunk/etc/peer-apps/splunk-operator/local/indexes.conf
[default]
repFactor = auto
maxDataSize = auto
homePath = $SPLUNK_DB/$_index_name/db
coldPath = $SPLUNK_DB/$_index_name/colddb
thawedPath = $SPLUNK_DB/$_index_name/thaweddb

[volume:smartstore]
storageType = remote
path = s3://bucket-for-smart-store
remote.s3.endpoint = https://s3-eu-central-1.amazonaws.com
remote.s3.auth_region = eu-central-1

and it does not work with the definition from the CRD.

Also, we had some default settings defined in a custom-built app, and that also breaks indexer startup. So something changed that should not have been touched.
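
For comparison, a sketch of what the CRD defaults block would plausibly map to in that generated [default] stanza; maxGlobalDataSizeMB and maxGlobalRawDataSizeMB are standard SmartStore settings in indexes.conf, but whether the operator is meant to write them into this particular file is an assumption on my side:

[default]
repFactor = auto
maxDataSize = auto
homePath = $SPLUNK_DB/$_index_name/db
coldPath = $SPLUNK_DB/$_index_name/colddb
thawedPath = $SPLUNK_DB/$_index_name/thaweddb
# expected from the CRD defaults block, but absent in the generated file shown above
maxGlobalDataSizeMB = 0
maxGlobalRawDataSizeMB = 0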

@vivekr-splunk
Collaborator

Hello @yaroslav-nakonechnikov, are you using IRSA with PrivateLink?

@yaroslav-nakonechnikov
Author

yaroslav-nakonechnikov commented Dec 18, 2023

@vivekr-splunk, no, we don't use PrivateLink.

The main point is that the same config was working fine with 9.1.1.

@vivekr-splunk
Collaborator

Hello @yaroslav-nakonechnikov, this has been fixed in the upcoming 9.1.3 and 9.0.7 releases, and also in 9.2.1.

@yaroslav-nakonechnikov
Author

Still the same issue with 9.1.3:

FAILED - RETRYING: Restart the splunkd service - Via CLI (5 retries left).
FAILED - RETRYING: Restart the splunkd service - Via CLI (4 retries left).
FAILED - RETRYING: Restart the splunkd service - Via CLI (3 retries left).
FAILED - RETRYING: Restart the splunkd service - Via CLI (2 retries left).
FAILED - RETRYING: Restart the splunkd service - Via CLI (1 retries left).

RUNNING HANDLER [splunk_common : Restart the splunkd service - Via CLI] ********
fatal: [localhost]: FAILED! => {
    "attempts": 60,
    "changed": true,
    "cmd": [
        "/opt/splunk/bin/splunk",
        "restart",
        "--answer-yes",
        "--accept-license"
    ],
    "delta": "0:00:11.173687",
    "end": "2024-01-25 15:06:11.736729",
    "rc": 10,
    "start": "2024-01-25 15:06:00.563042"
}

STDOUT:

splunkd is not running.

Splunk> 4TW

Checking prerequisites...
        Checking mgmt port [8089]: open
        Checking kvstore port [8191]: open
        Checking configuration... Done.


STDERR:

ERROR: pid 5825 terminated with signal 11 (core dumped)
Validating databases (splunkd validatedb) failed with code '-1'.  If you cannot resolve the issue(s) above after consulting documentation, please file a case online at http://www.splunk.com/page/submit_issue


MSG:

non-zero return code
Thursday 25 January 2024  15:06:11 +0000 (0:22:44.302)       0:23:46.336 ******
Thursday 25 January 2024  15:06:11 +0000 (0:00:00.000)       0:23:46.336 ******
Thursday 25 January 2024  15:06:11 +0000 (0:00:00.000)       0:23:46.337 ******

PLAY RECAP *********************************************************************
localhost                  : ok=106  changed=20   unreachable=0    failed=1    skipped=67   rescued=0    ignored=0

Thursday 25 January 2024  15:06:11 +0000 (0:00:00.003)       0:23:46.341 ******
===============================================================================
splunk_common : Restart the splunkd service - Via CLI ---------------- 1364.30s
splunk_common : Restart the splunkd service - Via CLI ------------------ 18.39s
splunk_common : Set options in saml ------------------------------------- 6.26s
splunk_common : Set options in roleMap_SAML ----------------------------- 6.04s
splunk_common : Get Splunk status --------------------------------------- 1.43s
splunk_common : Set node as license slave ------------------------------- 1.17s
splunk_indexer : Update HEC token configuration ------------------------- 1.17s
Gathering Facts --------------------------------------------------------- 1.14s
splunk_indexer : Set current node as indexer cluster peer --------------- 1.12s
splunk_common : Update /opt/splunk/etc ---------------------------------- 0.97s
splunk_indexer : Setup Peers with Associated Site ----------------------- 0.97s
splunk_common : Set options in authentication --------------------------- 0.88s
splunk_common : Test basic https endpoint ------------------------------- 0.79s
splunk_indexer : Setup global HEC --------------------------------------- 0.70s
splunk_indexer : Check for required restarts ---------------------------- 0.68s
Check for required restarts --------------------------------------------- 0.67s
splunk_indexer : Get existing HEC token --------------------------------- 0.67s
splunk_indexer : Check Splunk instance is running ----------------------- 0.67s
splunk_indexer : Check Splunk instance is running ----------------------- 0.66s
splunk_common : Check Splunk instance is running ------------------------ 0.66s

@yaroslav-nakonechnikov
Author

That one looks like it is fixed in 9.2.*, but I am still testing.

@fabiusgoh

Still hitting the same error on 9.2.0 and Splunk Operator 2.5.0.

@yaroslav-nakonechnikov
Author

@fabiusgoh have you raised a ticket with Splunk Support? May I ask for its number?

@fabiusgoh

I have not raised a support ticket yet; I am in the midst of testing it on 9.1.3, as that is the officially supported version for the operator.

@yaroslav-nakonechnikov
Author

I can confirm that 9.2 and 9.2.0.1 start with our config, which was not working with 9.1.2 and 9.1.3.

@vivekr-splunk
Collaborator

@yaroslav-nakonechnikov, As we discussed in our meeting, we now understand the issue. This problem arose due to the upgrade path we followed in the 2.5.0 release. Previously, we expected the search head clusters to be running before starting the indexers (if both indexers and SHC are pointing to the same CM). However, since the SHC had trouble starting, the indexers were never created.
As agreed, we will modify the logic to start the indexers in parallel with the search head. We'll keep you updated on our progress with these changes.

@yaroslav-nakonechnikov
Author

@vivekr-splunk yep, I agree, it was an informative meeting. But this ticket is different, as it is about Splunk logic itself (or splunk-ansible), which was fixed in the Splunk container starting from 9.2.0.

We were discussing #1293.

@yaroslav-nakonechnikov
Author

Also, today I rechecked 9.1.4; it is not working either.
So 9.1.1 is both the last working version and the last supported version.

All others are broken or not supported.
