Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NEW] Handle spamming of custom LUA error messages in the INFO ERRORSTATS section with continued tracking of non LUA errors stats #487

Open
KarthikSubbarao opened this issue May 11, 2024 · 9 comments

Comments

@KarthikSubbarao
Copy link

KarthikSubbarao commented May 11, 2024

The problem/use-case that the feature addresses

When using LUA to reply with custom error messages, the prefix of the messages be included in the INFO ERRORSTATS section as a unique entry. This can spam the INFO ALL command or any INFO command variant that includes the ERRORSTATS section.

redis/redis#13141 address this spamming of the INFO command's ERRORSTATS section and the memory usage of the errors RAX. However, it has the following drawbacks:

  1. To re-enable error stats tracking, we need to use CONFIG RESETSTAT which will clear command stat counters and error stat counters. So, we will lose visibility in both of these.
  2. When misuse is detected (> 128 error types), we prevent tracking of regular (non LUA) error types. e.g. MOVED, CLUSTERDOWN, etc.
127.0.0.1:6379> EVAL "return server.error_reply('3692896699 a');" 0
(error) 3692896699 a 
127.0.0.1:6379> EVAL "return server.error_reply('4164681858 b');" 0
(error) 4164681858 b
127.0.0.1:6379> EVAL "return server.error_reply('5558209445 c');" 0
(error) 5558209445 c
127.0.0.1:6379> EVAL "return server.error_reply('8877681120 d');" 0
(error) 8877681120 d
127.0.0.1:6379> info errorstats
# Errorstats
errorstat_3692896699:count=1
errorstat_4164681858:count=1
errorstat_5558209445:count=1
errorstat_8877681120:count=1

Description of the feature

The proposal in this issue is to prevent tracking new error messages from LUA if the number of LUA error messages is greater than 128. Instead, we will track any additional LUA error type in a new counter: errorstat_LUA and if a non-LUA error (e.g. MOVED / CLUSTERDOWN) occurs, they will continue to be tracked as usual.

This will address the issue of spammed error messages / memory usage of the errors RAX. Additionally, we will not have to execute CONFIG RESETSTAT to restore functionality - as normal error messages continue to be tracked.

Based on feedback and thoughts from this discussion, I can propose a PR to support this feature.

Alternatives you've considered

NA

@KarthikSubbarao KarthikSubbarao changed the title [NEW] Config to disable including custom error messages in the INFO ERRORSTATS section [NEW] Config to disable including custom error messages (LUA) in the INFO ERRORSTATS section May 11, 2024
@enjoy-binbin
Copy link
Member

does this redis/redis#13141 can somehow solve your problem?

@madolson
Copy link
Member

madolson commented May 12, 2024

does this redis/redis#13141 can somehow solve your problem?

I really don't understand why Oran marked that as not a breaking change, it's possible someone had more than 128 errors. I feel like he was just trying to merge stuff before the license change.

I think this should solve the issue @KarthikSubbarao brought up though.

@enjoy-binbin
Copy link
Member

I really don't understand why Oran marked that as not a breaking change, it's possible someone had more than 128 errors.

yes, that is indeed a potential breaking change. I have thought about add a new configuration item and defaulting to infinity to avoid the breaking change, but Oran does't seem to mention it and probable doesn't want to add new a configuration item for this.

I feel like he was just trying to merge stuff before the license change.

Looking at it now, it may indeed be.

@KarthikSubbarao
Copy link
Author

KarthikSubbarao commented May 13, 2024

does this redis/redis#13141 can somehow solve your problem?

@enjoy-binbin - (Just read the top comment) Yes, this will solve the issue of spamming INFO ERRORSTATS and reduce additional data stored on error stats.

For completeness - with redis/redis#13141, ERRORSTATS will be disabled when misuse is detected. This means non LUA errors (which do not spam) will also be disabled until we use CONFIG RESETSTAT.

Whereas, the idea proposed in this issue was to have a config to disable custom LUA error messages while continuing to allow regular Error stats to function normally.

But the main issue we wanted to address is misuse (spamming), so this should work. Thank you

Whereas, the idea proposed in this issue was to have a config to disable custom LUA error messages while continuing to allow regular Error stats to function normally.

If we think this is useful, we can consider making this improvement - by adding onto the existing logic

@srgsanky
Copy link

srgsanky commented May 14, 2024

Resetting all the error count will also lose visibility into other errors like MOVED etc that can help with debugging issues in the cluster.

Is it possible to record all custom errors from lua in a separate RAX (based on scriptIsRunning() and if the error originated from luaRedisErrorReplyCommand) and show the custom errors only when using INFO ERRORSTATS explicitly?

@KarthikSubbarao
Copy link
Author

KarthikSubbarao commented May 14, 2024

Resetting all the error count will also lose visibility into other errors like MOVED etc that can help with debugging issues in the cluster.

Yes, looks like with the current implementation (from redis/redis#13141):

  1. To re-enable error stats tracking, we need to use CONFIG RESETSTAT which will clear command stat counters and error stat counters. So, we will lose visibility in both of these.
  2. When misuse is detected (> 128 error types), we prevent tracking of regular (non LUA) error types. e.g. MOVED, CLUSTERDOWN, etc.

These issues aside, the PR addresses the concerns we have perfectly.

Is it possible to record all custom errors from lua in a separate RAX (based on scriptIsRunning() and if the error originated from luaRedisErrorReplyCommand) and show the custom errors only when using INFO ERRORSTATS explicitly?

The proposal in this issue is to prevent tracking new error messages from LUA if the number of LUA error messages is greater than 128. Instead, we will track any additional LUA error type in a new counter: errorstat_LUA and if a non-LUA error (MOVED / CLUSTERDOWN) occurs, they will continue to be tracked as usual.

This will address the issue of spammed error messages / memory usage of the errors RAX. Additionally, we will not have to execute CONFIG RESETSTAT to restore functionality - as normal error messages continue to be tracked.

@madolson
Copy link
Member

I like the idea of tracking lua errors in a separate RAX, however it can't just be based off of scriptIsRunning, since the script could generate normal error messages. I can't think of anything off the top of my head to tell if it was a user generated error message or not.

@KarthikSubbarao KarthikSubbarao changed the title [NEW] Config to disable including custom error messages (LUA) in the INFO ERRORSTATS section [NEW] Handle spamming of custom error messages (LUA) in the INFO ERRORSTATS section with continued functioning of non LUA errors May 15, 2024
@KarthikSubbarao KarthikSubbarao changed the title [NEW] Handle spamming of custom error messages (LUA) in the INFO ERRORSTATS section with continued functioning of non LUA errors [NEW] Handle spamming of custom LUA error messages in the INFO ERRORSTATS section with continued tracking of non LUA errors stats May 15, 2024
@ranshid
Copy link
Contributor

ranshid commented May 15, 2024

@KarthikSubbarao Thank you for driving this! I am reading through the PR and will provide technical comments there (if I have any) but wanted to ask here:
Why is that so different than the modules case? I think this will provide solution for LUA, but modules can cause the same kind of missuse right? I feel this error stats mechanism lacks some better structure than just be based on string parsing. Maybe we should consider improving that in the future?

@KarthikSubbarao
Copy link
Author

KarthikSubbarao commented May 15, 2024

@ranshid Thank you. Modules are similar in that they can result in custom error messages - but I would place Modules closer to the Valkey server in that it is the responsibility of those running the server to ensure that the modules they load do not lead to spamming the ERRORSTATS section. EVAL / EVALSHA commands, however, are completely customer driven. This is why the Issue here started off focusing only on LUA.

But (if required) we can indeed add support for preventing misuse by Modules by passing a flag in the VM_ReplyWithError and VM_ReplyWithErrorFormat APIs. I can update the PR if so

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants