
SDK 1.7.4 does not stream all events in one chunk in a distributed search #541

Open
FritzWittwer opened this issue Sep 5, 2023 · 6 comments

Comments


FritzWittwer commented Sep 5, 2023

Describe the bug
A distributed streaming custom command does not receive all events of the search in one invocation; it gets the events in several chunks.

We have a distributed streaming custom command that searches for URL matches against a list of IOCs. It has to process about 10 million events with a URL and compare them against a list of about a million IOCs. The custom command builds an optimized cache data structure to be able to process this load. This cache is stored in a local file on the indexer and loaded each time the command is invoked on an indexer.

This works well in our current environment consisting of 96 indexers with Splunk 8.2.9, using Python 2 and splunklib 1.6.12. On average, each indexer processes about 100'000 events each time the custom command is called.
Switching to Python 3 and splunklib 1.7.4 in preparation for our planned migration to Splunk 9.x, we found the command failing. Further diagnosis showed that the command is called several times on each indexer, with just a small batch of events streamed on each invocation. This makes the whole caching mechanism useless, as it takes about 120 seconds to load the cache, while processing the 100'000 events takes only a few seconds.
Further investigation shows that the problem lies in splunklib 1.7.4; 1.6.12 works as expected as long as the size of the streamed data is not too big, otherwise SDK 1.6.12 just hangs.
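
A possible mitigation, assuming the chunked protocol calls stream() once per chunk but inside one long-lived process per invocation (the timestamps further below suggest this), would be to load the cache lazily at module level, so the expensive load happens once per process instead of once per stream() call. A minimal sketch only; the path and JSON format are hypothetical placeholders for the real cache file:

import json

_IOC_CACHE = None  # process-level cache, shared across stream() calls

def get_ioc_cache(path="/tmp/ioc_cache.json"):  # hypothetical path and format
    global _IOC_CACHE
    if _IOC_CACHE is None:
        # The expensive load (~120 s in our case) now runs once per process.
        with open(path) as f:
            _IOC_CACHE = set(json.load(f))
    return _IOC_CACHE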

To Reproduce
Create an app containing a custom command p3s7lookupurldebugempty:
in ./bin, the python file p3s7lookupurldebugempty.py

import sys
import time
from splunklib.searchcommands import \
    dispatch, StreamingCommand, Configuration, Option, validators


@Configuration()
class p3s7lookupurldebugempty(StreamingCommand):

    def stream(self, records):

        cnt = 0
        file = "/tmp/debug.log"
        for record in records:
            cnt = cnt + 1

        yield {'_serial': 73+cnt, '_time': time.time(), '_raw': "Counter %d" % cnt}
        with open(file, 'a') as logfile:
            logfile.write("%f build=202308210731p3s7 Finished with: %d events\n" % (time.time(), cnt))

    _globals = {'__builtins__': __builtins__}


if __name__ == "__main__":
    dispatch(p3s7lookupurldebugempty, sys.argv, sys.stdin, sys.stdout, __name__)

Add it to commands.conf:

[p3s7testlookupurldebugempty]
filename = p3s7lookupurldebugempty.py
python.version=python3
chunked = true

Add splunklib as ./bin/splunklib.

Prepare test data:

| makeresults count=50000
| streamstats count as nr
| eval batch="50000Short"
| eval _time=time()-86400+nr
| table *
| collect index=main

Run the search:

index=main batch="50000Short" | p3s7testlookupurldebugempty

See the results in /tmp/debug.log on the indexers (only one will show it, as only one indexer will have to process the events):

1692596001.205690 build=202308210731p3s7 Finished with: 50 events
1692596001.240201 build=202308210731p3s7 Finished with: 449 events
1692596001.431701 build=202308210731p3s7 Finished with: 2500 events
1692596001.633610 build=202308210731p3s7 Finished with: 2490 events
1692596001.832549 build=202308210731p3s7 Finished with: 2500 events
1692596002.020129 build=202308210731p3s7 Finished with: 2500 events
1692596002.211033 build=202308210731p3s7 Finished with: 2500 events
1692596002.396127 build=202308210731p3s7 Finished with: 2500 events
1692596002.593029 build=202308210731p3s7 Finished with: 2500 events
1692596002.780966 build=202308210731p3s7 Finished with: 2500 events
1692596002.966760 build=202308210731p3s7 Finished with: 2500 events
1692596003.164372 build=202308210731p3s7 Finished with: 2500 events
1692596003.551674 build=202308210731p3s7 Finished with: 4999 events
1692596004.515754 build=202308210731p3s7 Finished with: 12500 events
1692596005.037160 build=202308210731p3s7 Finished with: 7012 events
1692596005.038402 build=202308210731p3s7 Finished with: 0 events
1692596005.038932 build=202308210731p3s7 Finished with: 0 events
1692596005.039385 build=202308210731p3s7 Finished with: 0 events

Expected behavior
Expected result would be:

1692595462.334300 build=202308180849p2 Finished with: 50000 events

Splunk:

  • Version: 8.2.0
  • OS: Red Hat Enterprise Linux release 8.3 (Ootpa) (Oracle 5.4.17-2102.201.3.el8uek.x86_64)
  • Deployment: Distributed, see below

SDK (please complete the following information):

  • Version: 1.7.4
  • Language Runtime Version: Python 3.7.10
  • OS: Red Hat Enterprise Linux release 8.3 (Ootpa) (Oracle 5.4.17-2102.201.3.el8uek.x86_64)

Additional context
This test has been executed on the following environment:

  • single search head
  • 1 master node
  • 4 node indexer cluster

All Splunk 8.2.0 (the same error also occurs on Splunk 8.2.9)

@akaila-splunk (Contributor)

Hi @FritzWittwer, we are trying to reproduce the issue but are not sure about the indentation, so we request that you provide the script with the proper indentation.


FritzWittwer commented Sep 18, 2023

Sorry for the delay, I was on vacation and did not see your comments. Below is the code with the indentation as it is used (slightly newer build number, as I did some more tests).

import sys
import time
from splunklib.searchcommands import \
    dispatch, StreamingCommand, Configuration, Option, validators


@Configuration()
class p3s7lookupurldebugempty(StreamingCommand):

    def stream(self, records):

        cnt = 0
        file = "/tmp/debug.log"
        for record in records:
            cnt = cnt + 1

        yield {'_serial': 73+cnt+int(time.time()), '_time': time.time(), '_raw': "Counter %d" % cnt}
        with open(file, 'a') as logfile:
            logfile.write("%f build=202309181412p3s7 Finished with: %d events\n" % (time.time(), cnt))

    _globals = {'__builtins__': __builtins__}


if __name__ == "__main__":
    dispatch(p3s7lookupurldebugempty, sys.argv, sys.stdin, sys.stdout, __name__)

@ashah-splunk (Contributor)

Hi @FritzWittwer, sorry for the delay in response. We are looking into this and will provide an update soon.
Thanks for your patience.

@FritzWittwer (Author)

...an update: we assumed we could use splunklib 1.6.12, but as we started to use it with production data, we learned that it also fails as soon as the result set streamed to it reaches a certain size. This size is larger than the 50'000 events we used to test the behavior, so it went unnoticed. Large in this context seems to mean the number of bytes, not the number of events. Splunklib 1.6.12 doesn't make chunks the way splunklib 1.7.2 does; it just hangs. I suspect this was the reason something was changed, so that 1.7.2 now makes chunks of the streamed events.

@ncanumalla-splunk changed the title from “SDK 1.7.4 does not stream all events in one junk in a distitibuted search” to “SDK 1.7.4 does not stream all events in one chunk in a distributed search” on Oct 26, 2023
@ncanumalla-splunk (Contributor)

@FritzWittwer We appreciate the update with additional information. The team is investigating this issue.
Can you please clarify a few points? Thank you for your patience.

  1. Which version of the SDK do you intend to use? Is it 1.7.4 where you see the problem?
  2. Which is the most recent version where you are seeing the expected results? Is it 1.7.2?
  3. Is your team blocked by this issue and is there a timeline you need this resolved?


FritzWittwer commented Oct 26, 2023

  1. We want to use SDK 1.7.4 (well, any working version would be fine).
  2. This is SDK 1.6.12, Python 2 on Splunk 8.2.9.
  3. Yes, it is blocking, as the platform has to be upgraded to Splunk 9.x. We are now trying to use Bloom filters in the hope of reducing the startup time by factors, so that the large number of calls with small chunks does not have such a devastating impact (see the sketch below). We are under pressure to have a quick fix, as the upgrade will happen in a week.
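
For reference, a Bloom filter as mentioned in point 3 gives a "definitely not in the IOC set" answer in constant time from a compact bit array, so only the rare possible hits need the full cache lookup. A minimal illustrative sketch, not the production code; bit-array size and hash count are placeholder values:

import hashlib

class BloomFilter:
    def __init__(self, size_bits=8 * 1024 * 1024, num_hashes=5):
        # Placeholder sizing: 1 MB of bits, 5 hash positions per item.
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from a single SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # False: definitely not an IOC. True: possible hit, check the full cache.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

The filter would be filled once per process from the IOC list; in stream(), each URL is checked against the filter first, and only positives fall through to the expensive exact lookup.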

We assumed we would be able to use SDK 1.6.12 with non-distributed searches, and we only learned last week that this also fails with Python 3 and our real data. The test data we had used was good enough to show the different behavior between SDK 1.6.12 on Python 2 and SDK 1.7.4 on Python 3, but it did not show the blocking behavior of SDK 1.6.12 on Python 3; see also my earlier comments.

Thinking back, I assume the problem with the SDK blocking might be quite old: more than two years ago I tried to use Python 3 and the custom command was just hanging. We did not investigate further, as there was no need to use Python 3 back then.
