
Introduce batching to MISP feed output bot #2473

Open · wants to merge 1 commit into base: develop
Conversation

arvchristos

Description

This PR implements a batch mode for the MISP feed output bot to resolve slow performance when the queue holds many events. The existing code is prone to performance issues:

The following code is executed for every event in the queue, making the bot extremely slow as events arrive and the feed grows larger:

feed_output = self.current_event.to_feed(with_meta=False)

with self.current_file.open('w') as f:  # File opened for every event
    json.dump(feed_output, f)

feed_meta_generator(self.output_dir)  # Metadata updated on every event
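To make the cost concrete, the per-event open/dump/meta-update can be collapsed into a single write per batch. This is a minimal sketch of the idea, not the PR's actual code; `flush_batch` and the file layout are hypothetical:

```python
import json
import tempfile
from pathlib import Path

def flush_batch(event_batch, output_file):
    """Hypothetical helper: write a whole batch of feed dicts in one
    open/write instead of reopening the file once per event."""
    with Path(output_file).open('w') as f:
        json.dump(event_batch, f)

# demo: three events, one file open instead of three
events = [{"ioc": f"10.0.0.{i}"} for i in range(3)]
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "feed.json"
    flush_batch(events, out)
    print(len(json.loads(out.read_text())))  # 3
```

The metadata regeneration (`feed_meta_generator`) would likewise run once per flush rather than once per event.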

Motivation

We are trying to create feeds based on AlienVault OTX pulses that include thousands of IOCs per day. This is effectively impossible with the current MISP feed output bot performance.

Fix

With this PR, batched feed creation is introduced. The user can now configure the batch size via the batch_size parameter. The batch functionality is based on the bot's internal source queue.
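For illustration, the new parameter could be set in runtime.yaml roughly like this (a sketch: the bot id, paths, and the value 100 are made up for this example; only batch_size is the parameter added by this PR):

```yaml
misp-feed-output:
  module: intelmq.bots.outputs.misp.output_feed
  parameters:
    output_dir: /opt/intelmq/var/lib/bots/misp-feed
    batch_size: 100   # events accumulated before a single write
```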

Benchmark

On an average server, before this improvement a feed of 8k events took several hours to create; it now takes less than 5 minutes (depending on the available resources).

Contributor

@kamil-certat kamil-certat left a comment


Hey, this is a very good idea. I've recently started using the Feed Output bot, and I have similar thoughts that the way it currently works may not be the best.

However, I have some concerns about the PR:

  1. Have you tested your bot with a feed producing data very slowly? Or very quickly? Line 123 checks the size of the source queue, which holds events incoming to the bot, not how many events have been accumulated. If I haven't missed anything, if your bot never gets 100 events in the source queue, it will never save data. And if your bot quickly gets 100 events in the queue (e.g. it wasn't running for some time), it will do the same as the original code: save on every event.
  2. You acknowledge messages and then keep events in a list (event_batch). This means they exist in memory only, and any unexpected restart or crash will cause data loss.
  3. You haven't implemented the shutdown method, so any restart or reload will not flush the batch and will therefore cause data loss.
  4. You have made the assumption that the source queue is always Redis-compatible. While that is the most common case, it can theoretically be e.g. AMQP (see pipeline.py).
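On point 3, flushing in shutdown() closes that gap. A minimal sketch (the class and attribute names are illustrative, not the PR's code; it only assumes the usual IntelMQ pattern where the framework calls shutdown() on stop or reload):

```python
class MISPFeedOutputBatchSketch:
    """Illustrative only: keep an in-memory batch but flush it on
    shutdown so a restart or reload does not drop accumulated events."""

    def __init__(self):
        self.event_batch = []
        self.flushed = []       # stand-in for the feed file on disk

    def process(self, event):
        self.event_batch.append(event)

    def _flush(self):
        if self.event_batch:
            self.flushed.extend(self.event_batch)  # would write the feed here
            self.event_batch = []

    def shutdown(self):
        self._flush()           # called by the framework on stop/reload

bot = MISPFeedOutputBatchSketch()
bot.process({"ioc": "1.2.3.4"})
bot.shutdown()
print(bot.flushed)  # [{'ioc': '1.2.3.4'}]
```

This only narrows the window for data loss; a crash between acknowledgement and flush still loses events, which is why moving the batch out of process memory (suggestion 2 below) is the stronger fix.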

I have a few suggestions:

  1. This is in fact a kind of caching. Let's re-use CacheMixin from intelmq/lib/mixins/cache.py to connect to Redis. It doesn't support storing events in a queue, but you can then use the client by getting it from cache_get_redis_instance.
  2. Store events in a separate Redis queue. This moves the responsibility for keeping data away from the bot.
  3. I would suggest changing the logic a little: use interval_event and batch_size as two conditions, and when either of them is set and its threshold is reached, create a new event and save. That would feel a little more logical to me.
  • E.g. for very slow feeds, the interval would hit first, and we would save the few collected events e.g. every hour.
  • For quick feeds, the batch limit would hit first, and to avoid creating very big MISP events, we should also create a new MISP event for every batch.
  • Removing the interval_event value should trigger a new MISP event only when the batch size is reached.
  • Removing batch_size (= setting it to 0) should preserve the current behaviour: save immediately, but create a new event only when the interval is reached.
  • Setting batch_size to 1 should act as saving every event immediately in a new MISP event.
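
The combined condition sketched above ("flush when either threshold is reached") could look roughly like this. All names here are illustrative, not IntelMQ's API; the clock is injected so the logic is testable:

```python
import time

class BatchFlusher:
    """Sketch of the suggested logic: flush when either batch_size or
    the interval threshold is reached, whichever comes first."""

    def __init__(self, batch_size=0, interval_seconds=3600.0, now=time.monotonic):
        self.batch_size = batch_size      # 0 = disabled: save immediately
        self.interval = interval_seconds  # None = disabled
        self._now = now
        self._last_flush = now()
        self.batch = []

    def add(self, event):
        """Collect an event; return the batch to write when a threshold hits."""
        self.batch.append(event)
        if self.should_flush():
            flushed, self.batch = self.batch, []
            self._last_flush = self._now()
            return flushed                # caller writes these to the feed
        return None

    def should_flush(self):
        if self.batch_size and len(self.batch) >= self.batch_size:
            return True                   # quick feeds: batch limit hits first
        if self.interval is not None and self._now() - self._last_flush >= self.interval:
            return True                   # slow feeds: interval hits first
        return not self.batch_size        # batch_size unset/0: save immediately

# quick check with a fake clock
clock = [0.0]
f = BatchFlusher(batch_size=3, interval_seconds=100, now=lambda: clock[0])
print(f.add("a"), f.add("b"))  # None None — below both thresholds
print(f.add("c"))              # ['a', 'b', 'c'] — batch size reached
```

With batch_size=1 every add() flushes immediately, and with batch_size=0 the immediate-save branch preserves the current behaviour, matching the last two bullets above.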

What do you think?

@sebix
Member

sebix commented Apr 18, 2024

Store events in a separate Redis queue. This moves the responsibility for keeping data away from the bot.

And prevent data loss too.
