Control Redis memory usage: limit stix_ids array length in redis stream entries to 1000 ids #6774
base: master
Conversation
413d7f1 to 8d82aae
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6774      +/-   ##
==========================================
- Coverage   68.01%   67.58%   -0.43%
==========================================
  Files         538      545       +7
  Lines       65711    66477     +766
  Branches     5568     5583      +15
==========================================
+ Hits        44691    44931     +240
- Misses      21020    21546     +526

☔ View full report in Codecov by Sentry.
Very interesting.
Maybe, it is certainly possible - both types you describe are entity types. In my proposed fix, I'm assuming that the newest item(s) will be at the end of the list, and that any instance following the data stream will have consumed the earlier records (with earlier parts of the STIX id alias list) first, and will be retaining the old, ignoring the duplicates, and adding the new, as part of the stream ingest process.
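To make that assumption concrete, here's a rough consumer-side sketch (hypothetical helper, not the platform's actual ingest code) of how a stream follower would retain the old aliases, ignore duplicates, and add the new ones:

```ts
// Rough sketch of the consumer-side merge of stix_ids aliases while following the stream:
// keep what is already stored, drop duplicates, append any newly seen ids.
const mergeStixIds = (existing: string[], incoming: string[]): string[] => {
  const merged = new Set(existing);
  for (const id of incoming) {
    merged.add(id); // Set membership makes duplicates a no-op
  }
  return [...merged];
};

// earlier stream records delivered the older aliases first; later records
// only contribute ids that have not been seen yet
const stored = ['malware--aaa', 'malware--bbb'];
const fromStream = ['malware--bbb', 'malware--ccc'];
console.log(mergeStixIds(stored, fromStream)); // [ 'malware--aaa', 'malware--bbb', 'malware--ccc' ]
```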
Also, just another observation: I am curious if this behavior might be contributing to the pattern that some have observed of ingest speed (bundles/second) gradually falling over the course of long ingests of large data sets, as it also reflects data that's being sent to and read from the elasticsearch back-end.
Thanks for your explanation. I've requested our team to create only v5 UUIDs, so at least our own bundles will not add to the problem. I suppose 3rd parties will still create a random UUID for the same objects, so they will accumulate and create the situation you explained. On your last observation, I think merging bundles is a time-consuming operation, so you are probably right: if you receive the same object many times, and it is merged because its UUID is different each time, it most probably makes ingestion slower.
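For anyone following along, the difference matters because a v5 (name-based) UUID is derived from the object's identifying properties, so re-ingesting the same object reproduces the same STIX id and no new alias is added, whereas a random v4 UUID is different every time. A minimal sketch with the `uuid` npm package (the namespace constant below is a placeholder for the example, not necessarily the one OpenCTI or the STIX spec prescribes):

```ts
import { v4 as uuidv4, v5 as uuidv5 } from 'uuid';

// placeholder namespace for illustration; real generators use a fixed, shared namespace
const EXAMPLE_NAMESPACE = '0aba3e36-6c0f-4dbb-bd25-c0f2b3bd2b7a';

// deterministic: the same name always yields the same STIX id
const deterministicMalwareId = (name: string): string =>
  `malware--${uuidv5(name.toLowerCase(), EXAMPLE_NAMESPACE)}`;

// random: a brand-new id for the same object on every ingest
const randomMalwareId = (): string => `malware--${uuidv4()}`;

console.log(deterministicMalwareId('Emotet') === deterministicMalwareId('Emotet')); // true
console.log(randomMalwareId() === randomMalwareId()); // false
```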
@AlexSanchezN Another issue is that tons and tons of updates to the same object(s) in the queue mean that those STIX objects all get locked, forcing the platform to single-thread the updates regardless of the number of platform or worker instances running. The other workers have to wait until the worker holding the lock completes the time-consuming update operation of writing that long list of ids to the entity.
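Purely as a toy model of that serialization effect (this is not OpenCTI's actual distributed lock implementation): once updates key on the same object id, extra workers just queue behind the one holding the lock.

```ts
// Toy per-object lock: work items for the same id run one after another,
// no matter how many workers submit them; different ids are unaffected.
const chains = new Map<string, Promise<void>>();

const withObjectLock = (id: string, work: () => Promise<void>): Promise<void> => {
  const previous = chains.get(id) ?? Promise.resolve();
  const next = previous.then(work);
  chains.set(id, next.catch(() => {})); // keep the chain usable even if a work item fails
  return next;
};

// four "workers" updating the same entity still execute sequentially
const slowUpdate = (worker: number) => async (): Promise<void> => {
  console.log(`worker ${worker} writing the long stix_ids list...`);
  await new Promise((resolve) => setTimeout(resolve, 100));
};

for (let w = 1; w <= 4; w += 1) {
  void withObjectLock('malware--same-id', slowUpdate(w));
}
```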
I wonder if some sort of background consolidation process would be possible. I don't know if I've explained myself! :-)
I think this pull request cannot be integrated like this. We have different use cases where stix_ids must be correctly maintained.

With these 3 use cases I have no idea how we can distinguish between them. My 2 cents about what could be a good solution.
Thanks. I like this idea, and then the current behavior can stay opt-in for any connectors which are incompatible with overwriting the external STIX IDs, and which can then be evaluated to ensure they aren't blowing up the STIX ID aliases. One thing to keep in mind for cases where the ID replacement occurs: any relationships or other references also need to be updated with the new STIX ID, as sketched below. It makes sense that there are use cases out there where the STIX ID we receive from an external source is unique and persistent, and we want to keep track of it in the platform too, so that it may be used for future reference in the external system (which may not be using the same STIX ID generation algorithm we are).
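A small sketch of that reference-updating concern (hypothetical helper working on plain STIX objects, not existing platform code): every `*_ref` / `*_refs` that still points at the replaced id needs to be rewritten too.

```ts
// Illustrative: after replacing oldId with newId on an entity, rewrite all
// embedded references (created_by_ref, object_refs, source_ref, ...) that still use oldId.
type StixObject = { id: string; [key: string]: unknown };

const remapReferences = (objects: StixObject[], oldId: string, newId: string): StixObject[] =>
  objects.map((obj) => {
    const updated: StixObject = { ...obj, id: obj.id === oldId ? newId : obj.id };
    for (const [key, value] of Object.entries(updated)) {
      if (key.endsWith('_ref') && value === oldId) {
        updated[key] = newId;
      } else if (key.endsWith('_refs') && Array.isArray(value)) {
        updated[key] = value.map((ref) => (ref === oldId ? newId : ref));
      }
    }
    return updated;
  });
```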
@SouadHadjiat fixed
Proposed changes
When redis `stream.opencti` entries have an `extension-definition--ea279b3e-5c71-4632-ac08-831c66a786ba` modification which changes the `stix_ids` field, this field appears to sometimes grow extremely large during ingestion for common entity types which don't have a deterministic formula for producing a STIX id. This can result in a long sequence of entries in redis during re-ingestion of the same entity (such as a `malware` STIX type) under new randomly-assigned STIX ids. My observation is that these get added to the stream once for each new STIX id encountered for the same item, and the list of `stix_ids` in the added/updated entity can grow infinitely. This is one of the reasons why `redis` memory consumption can grow so high under some circumstances, despite implementing a low TRIMMING size (such as `100000`).
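To make the growth pattern concrete, a single update entry in the stream ends up carrying every alias seen so far; the shape below is a simplified, hypothetical sketch rather than the exact OpenCTI message schema:

```ts
// Simplified sketch of a stream entry for a malware re-ingested under new random STIX ids:
// the canonical platform id stays stable while the extension's stix_ids alias list keeps growing.
const streamEntrySketch = {
  event: 'update',
  data: {
    id: 'malware--1b2c3d4e-0000-4000-8000-000000000001', // canonical id kept by the platform
    name: 'SomeMalwareFamily',
    extensions: {
      'extension-definition--ea279b3e-5c71-4632-ac08-831c66a786ba': {
        stix_ids: [
          'malware--6f1a2b3c-aaaa-4bbb-8ccc-000000000001', // id from ingest #1
          'malware--9d8e7f6a-bbbb-4ccc-8ddd-000000000002', // id from ingest #2
          // ...one more per re-ingestion, with no upper bound
        ],
      },
    },
  },
};
```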
This change proposes to identify when this list exceeds an upper limit (set to `1000` in this case) and to cut the list down so that it contains only the trailing `1000` items from the provided list, in redis stream inserts only. This appears to fix the Out-Of-Memory conditions I was experiencing while ingesting from the ESET connector (issue linked below), and it also seems to continue to work elsewhere that I have tested, from what I can tell. Additionally, with TRIMMING set to `100000`, this change appears to keep `redis` memory usage around 200-500MB where I have tested, rather than consuming multiple GB.
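A minimal sketch of the idea, assuming a hypothetical helper applied right before the entity is serialized into the stream entry (the names and the constant below are illustrative, not the exact code in this PR):

```ts
// Illustrative only: cap the stix_ids alias list to its trailing N entries
// at stream-insert time, leaving the entity stored in the platform untouched.
const MAX_STREAM_STIX_IDS = 1000; // upper limit proposed in this PR

const capStixIdsForStream = (stixIds: string[]): string[] =>
  stixIds.length <= MAX_STREAM_STIX_IDS
    ? stixIds
    : stixIds.slice(-MAX_STREAM_STIX_IDS); // keep only the newest ids (end of the list)

// example: a pathological entity with a huge accumulated alias list
const entity = {
  name: 'SomeMalwareFamily',
  stix_ids: Array.from({ length: 250000 }, (_, i) => `malware--${i}`),
};

const entityForStream = { ...entity, stix_ids: capStixIdsForStream(entity.stix_ids) };
console.log(entityForStream.stix_ids.length); // 1000
```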
Definitely recognize that this might not be an ideal solution, due to the potential for data loss in `stix_ids`, so proposing this change for some broader testing & feedback, particularly from people who leverage the streams features, as well as from those who better understand how the local system uses its own stream.

The problem we are running into, and that this tries to address, is that redis memory consumption can easily grow to 10-20GB, even with a very low TRIMMING value, due to the way `stix_ids` is reported in the data stream, per the situation above. This creates an unpredictable availability bug in the system, as well as putting extremely high RAM resource pressure on the system (which can be expensive to run) for lots of highly-duplicative information going into the stream.

If the community/maintainers don't agree with this change, we would appreciate some alternative recommendations to address this. Allocating more RAM for `redis` is not a viable mitigation in this case.

Related issues
Checklist