[BUG] ft.aggregate slowdown with high frequency updates #4508
Comments
We have also encountered a memory issue: replicas slowly consume the entire configured memory, which was not the case with plain Redis (no modules, no RediSearch indexing). We have therefore opted for a custom Lua Redis Function to do the aggregation: it calls KEYS prefix*, loads the hashes with HGETALL, performs the aggregation, and returns an array reply whose items each consist of the tag and 4 counters. The service performs at a stable 100 ms response time, which is 2-3 times worse than FT.AGGREGATE would be if it did not slow down as max_doc_id increases. I have also looked at the BITFIELD type, which would be another way to implement this, but that would raise additional problems around consistency and reading data back in pipelined workflows, in a use case where everything changes every second.
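For reference, the Lua Function described above can be sketched along these lines. This is a minimal illustration only: the library name, function name, key prefix `viewer:`, and the field names `tag`/`c1`..`c4` are placeholders, not the actual production definitions.

```shell
# Write a sketch of the aggregation function (Redis 7+ Functions API).
cat > agg.lua <<'EOF'
#!lua name=agglib
-- Aggregate four per-tag counters across all hashes matching a prefix.
-- args[1] = key prefix, e.g. "viewer:". Field names here are illustrative.
redis.register_function('agg_by_tag', function(keys, args)
  local reply = {}
  local by_tag = {}
  for _, k in ipairs(redis.call('KEYS', args[1] .. '*')) do
    local h = redis.call('HGETALL', k)
    -- HGETALL returns a flat array: field1, value1, field2, value2, ...
    local row = {}
    for i = 1, #h, 2 do row[h[i]] = h[i + 1] end
    local tag = row['tag']
    if tag then
      if not by_tag[tag] then
        by_tag[tag] = {tag, 0, 0, 0, 0}
        reply[#reply + 1] = by_tag[tag]
      end
      local acc = by_tag[tag]
      acc[2] = acc[2] + (tonumber(row['c1']) or 0)
      acc[3] = acc[3] + (tonumber(row['c2']) or 0)
      acc[4] = acc[4] + (tonumber(row['c3']) or 0)
      acc[5] = acc[5] + (tonumber(row['c4']) or 0)
    end
  end
  return reply
end)
EOF
# Load and invoke against a running Redis 7+ instance:
#   redis-cli -x FUNCTION LOAD REPLACE < agg.lua
#   redis-cli FCALL agg_by_tag 0 viewer:
echo "wrote agg.lua"
```

Note that KEYS inside a function is only reasonable here because the keyspace is small (tens of thousands of keys) and fully scanned on every call anyway.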
I had the same problem with the latest version (redis/redis-stack:7.2.0-v9): (1) queries are very fast at first.
We're using redis-stack server with only the RediSearch module enabled (ver 20813). We're observing similar behavior, where we have an index with a lot of
Describe the bug
Our use case is aggregating origin-edge stream viewer counts: a small number of documents changing every second. Even with all NO* flags set on the schema and NOINDEX on all fields, FT.AGGREGATE keeps slowing down. In real-world use our HTTP service starts with a response time of 60 ms, then slows down by about 100 ms per hour. It does not even stop at the 500 ms TIMEOUT: it plateaued there for around 10 minutes, then slowed down further, and after a day we were at 2 seconds of response time. The HTTP service only executes the aggregate command and returns the result.
To Reproduce
Copy-paste the following bash script and run it on a Docker-enabled host with access to Docker Hub.
As the script comments explain, it creates 20k hashes with 1k variations of the GROUPBY field and random values for the aggregated numeric count, loads them as a pipeline, runs the aggregation, and shows the first timing. It then starts a cycle of 10k pipeline runs, executing the aggregate command on every 50th run (i.e. after every 1M key updates), which shows a clear increase in response time.
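A condensed sketch of such a reproduction follows. The index name `idx`, key prefix `doc:`, field names `tag`/`count`, and the container image tag are illustrative stand-ins, not the original script's values.

```shell
#!/usr/bin/env bash
# Generate 20k HSET commands: 1k distinct tag values, random numeric counts.
gen_updates() {
  for i in $(seq 1 20000); do
    echo "HSET doc:$i tag t$((RANDOM % 1000)) count $RANDOM"
  done
}
gen_updates > updates.txt
echo "generated $(wc -l < updates.txt) update commands"

# Against a redis-stack container, the timing loop would then look like:
#   docker run -d --name repro -p 6379:6379 redis/redis-stack-server:latest
#   redis-cli FT.CREATE idx ON HASH PREFIX 1 doc: SCHEMA tag TAG count NUMERIC
#   for n in $(seq 1 10000); do
#     redis-cli --pipe < updates.txt >/dev/null        # 20k updates per round
#     if [ $((n % 50)) -eq 0 ]; then                   # i.e. every 1M updates
#       echo -n "Running aggregation after $((n / 50))M updates: "
#       time redis-cli FT.AGGREGATE idx '*' \
#         GROUPBY 1 @tag REDUCE SUM 1 @count AS total >/dev/null
#     fi
#   done
```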
Sample:
Running aggregation after 1M updates: 0m0.046s
Running aggregation after 2M updates: 0m0.033s
Running aggregation after 3M updates: 0m0.046s
Running aggregation after 4M updates: 0m0.054s
Running aggregation after 5M updates: 0m0.079s
Expected behavior
The aggregation should have a stable response time below 50 ms for 20k indexed keys, regardless of the update frequency.
Screenshots
The service is called once a minute; when the response time goes above 300 ms we recreate the index, because we have not found any other way to keep the aggregation stable.
On the left side you can see the response time climbing toward 500 ms; then the index-recreation workaround was released, after which recreations happen sooner or later depending on actual traffic.
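The recreation workaround amounts to dropping and redefining the index without touching the data. A sketch (index name, prefix, and schema are illustrative, not our production definitions):

```shell
# Write the recreation step to a script, to be run against the Redis host.
# FT.DROPINDEX *without* the DD option drops only the index and keeps the
# hashes; RediSearch then reindexes the matching documents in the background,
# which is quick for ~20k small documents, and timings reset to the baseline.
cat > recreate_index.sh <<'EOF'
#!/usr/bin/env bash
redis-cli FT.DROPINDEX idx
redis-cli FT.CREATE idx ON HASH PREFIX 1 doc: \
  SCHEMA tag TAG SORTABLE NOINDEX count NUMERIC SORTABLE NOINDEX
EOF
chmod +x recreate_index.sh
echo "wrote recreate_index.sh"
```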
Environment (please complete the following information):
Additional context
We have tried changing the FORK_GC configuration variables and MAXDOCTABLESIZE (1 makes it much worse; 100M behaves the same as the default 1M). After adding NOINDEX the GC wasn't collecting any bytes, so we also tried NOGC, with no effect, as expected. We also tried TAG SORTABLE and DIALECT 1 through 4; nothing helps.
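For concreteness, the knobs mentioned above look roughly like the following (the values shown are illustrative). MAXDOCTABLESIZE is a load-time module argument, while the FORK_GC settings can be changed at runtime via FT.CONFIG SET; none of them affected the slowdown for us.

```shell
# Load-time module argument (cannot be changed at runtime):
#   redis-server --loadmodule redisearch.so MAXDOCTABLESIZE 1000000
# Runtime GC tuning:
redis-cli FT.CONFIG SET FORK_GC_RUN_INTERVAL 1     # run the fork GC every second
redis-cli FT.CONFIG SET FORK_GC_CLEAN_THRESHOLD 0  # clean regardless of accumulated garbage
```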
Before adding NOINDEX, num_records was increasing to far more than it should be. That could be offset by having the GC run every second, but load testing with continuous updates, as the script above does, eventually makes it jump into BIGINT territory and keep increasing.
FT.EXPLAIN only shows a wildcard, so no information from there.
FT.PROFILE shows that the aggregation slows down only in the indexer type, as max_doc_id increases.
Usually a Redis command is in sub-millisecond territory; due to the single-threaded nature of Redis, this aggregation blocks all other commands. The normal 20-30 ms response time seen in the testing above would be acceptable, but blocking all other commands for hundreds of milliseconds is an issue; we may need to move this workload to a separate cluster to work around the blocking it causes.
The bypass of the 500 ms TIMEOUT setting should also be addressed; the test script above reproduces that too. Sample:
Running aggregation after 45M updates: 0m0.503s
Running aggregation after 46M updates: 0m0.513s
Running aggregation after 47M updates: 0m0.519s
Running aggregation after 48M updates: 0m0.533s
Running aggregation after 49M updates: 0m0.561s
...
Running aggregation after 60M updates: 0m0.646s
Running aggregation after 61M updates: 0m0.703s
Running aggregation after 62M updates: 0m0.685s