RFC: User-defined limits for AI Gateway. Feedback requested! #9456

Open
dbczumar opened this issue Aug 25, 2023 · 1 comment

dbczumar commented Aug 25, 2023

User-defined limits for AI Gateway

Motivation

Important now: DevOps / IT professionals need to set quotas to prevent runaway SaaS LLM workloads (e.g. a UDF that calls an LLM per row being inadvertently invoked on a huge dataframe) from exhausting a project's budget during R&D. They don't want to manage separate API keys across multiple vendor portals to accomplish this.

Important soon: As organizations begin to roll out production applications based on SaaS and OSS LLMs, they'll need to:

  • Ensure that production applications relying on hosted OSS LLMs remain available and that access is shared fairly.

  • Control costs for production applications that rely on SaaS LLMs, i.e. limit spend from end-user traffic.

Proposal

We propose to extend the MLflow AI Gateway API so that DevOps / IT professionals can set one or more limits on their AI Gateway Routes:

  • Setting limits will be optional, but AI Gateway docs will encourage it
  • Limits can be set on Routes for SaaS LLMs and OSS LLMs powered by MLflow Model Serving
  • Limits are defined / applied per-route
    • In the future, this can be extended so that DevOps / IT professionals can define limits on a per-user basis
  • Limits can be enforced on the number of requests
    • In the future, this can be extended to limits on the number of tokens (as defined by the LLM)
  • Limits are reset on a per-minute basis
    • In the future, this can be extended so that DevOps / IT professionals can choose a different renewal period (per second, per hour, etc.)
  • When a Route is queried, all of its defined limits are enforced. If any limit is exceeded, the request is rejected with a 429 response code (see the enforcement sketch below).
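
To make the enforcement semantics concrete, below is a minimal sketch of fixed-window request counting, assuming per-minute renewal and per-route counters. The RouteLimiter class and its methods are hypothetical illustrations of the behavior described above, not part of the proposed API.

import time
from collections import defaultdict

class RouteLimiter:
    """Hypothetical sketch: fixed-window, per-route request counting."""

    def __init__(self, limits_by_route):
        # e.g. {"dev-gpt-3.5-completions-route": [{"calls": 200, "renewal_period": "minute"}]}
        self.limits_by_route = limits_by_route
        # (route, window) -> number of requests served in that window
        self.counters = defaultdict(int)

    def check(self, route):
        """Return True if the request may proceed; False means reply with HTTP 429."""
        window = int(time.time() // 60)  # counters renew every minute
        key = (route, window)
        # All defined limits are enforced; exceeding any one rejects the request
        for limit in self.limits_by_route.get(route, []):
            if self.counters[key] >= limit["calls"]:
                return False
        self.counters[key] += 1
        return True

A production implementation would additionally evict counters for expired windows and share counter state across gateway replicas.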

Object & API definitions

We will introduce a LimitsConfiguration, which is a set of Limits, on each AI Gateway Route. We will provide SetLimits and GetLimits REST APIs for creating, updating, deleting, and retrieving these limits.

We prefer a separate GetLimits API, rather than making the limits a property of the Route, because we may want to require elevated permissions for retrieving the limits.
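
For reference, the REST surface for these APIs might look like the following sketch; the endpoint paths and payload shapes are assumptions for illustration, not the final design.

import requests

GATEWAY_URI = "http://localhost:5000"  # hypothetical gateway location

# Hypothetical sketch of the SetLimits REST call
resp = requests.post(
    f"{GATEWAY_URI}/api/2.0/gateway/limits/set",
    json={
        "route": "dev-gpt-3.5-completions-route",
        "limits": [{"calls": 200, "renewal_period": "minute"}],
    },
)
resp.raise_for_status()

# Hypothetical sketch of the GetLimits REST call
resp = requests.get(
    f"{GATEWAY_URI}/api/2.0/gateway/limits/get",
    params={"route": "dev-gpt-3.5-completions-route"},
)
print(resp.json())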

Limit & Limits Configuration definition (proto syntax)

message LimitsConfiguration {
    repeated Limit limits = 1;
}

message Limit {
    // The limit value, currently expressed as a number of calls
    oneof value {
        int64 calls = 1;
        // Later on, we can limit by number of tokens, etc.
    } [(validate_required = true)];
    required LimitRenewalPeriod renewal_period = 2;
}

enum LimitRenewalPeriod {
    // Renew the limit counter every minute
    MINUTE = 1;
    // <We can add more renewal options later>
}

The Limits Configuration can be created / updated / deleted via a SetLimits API call, for example:

Example: Limit creation with the MLflow Python client

import mlflow.gateway

mlflow.gateway.set_limits(
    route="dev-gpt-3.5-completions-route",
    limits=[
        {
            # Make at most 200 requests (i.e. spend at most ~$2 based on
            # average request size) to GPT-3.5 per minute on this Route
            "calls": 200,
            "renewal_period": "minute",
        }
    ],
)

(The Limits Configuration can also be specified as part of the existing CreateRoute API call)
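
For example, such a call might look like the sketch below; the create_route signature, the model payload, and the placement of the limits argument are assumptions for illustration rather than a confirmed interface.

import mlflow.gateway

mlflow.gateway.create_route(
    name="dev-gpt-3.5-completions-route",
    route_type="llm/v1/completions",
    model={
        # Illustrative provider configuration; exact keys may differ
        "name": "gpt-3.5-turbo",
        "provider": "openai",
        "config": {"openai_api_key": "$OPENAI_API_KEY"},
    },
    # Hypothetical parameter reusing the SetLimits payload shape
    limits=[{"calls": 200, "renewal_period": "minute"}],
)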

The Limits Configuration can be fetched via a GetLimits API call, for example:

Example: Getting a limit with the MLflow Python client

limits = mlflow.gateway.get_limits(
    route="dev-gpt-3.5-completions-route",
)

assert limits == [
    {
        "calls": 200,
        "renewal_period": "minute",
    }
]

Why limit on requests-per-minute (RPM)?

RPM limits have some nice properties for controlling R&D costs:

  • RPM limits are unlikely to break / interrupt workloads: When a limit is reached, it's common for applications to retry for up to 60 seconds with exponential backoff, by which point the limit will have been renewed (see the retry sketch after this list). Limits over longer horizons are more likely to lead to errors and user / workload lockouts.

  • Requests are intuitive: Requests are easier for data scientists / analysts to reason about than tokens. The number of requests is an integer multiple of the number of records being processed. The number of tokens varies widely based on the LLM and the size of the records.

  • RPM limits support all OSS LLMs: Many OSS LLM deployments don't produce token usage information for requests, making it difficult to enforce token-based limits. In contrast, the number of requests can always be measured, so RPM limits can always be enforced.
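
To illustrate the first point, here is a minimal sketch of a client-side retry loop that backs off on 429 responses until a per-minute limit renews. The query_route callable is a hypothetical stand-in for whatever client issues requests to a Route and returns a response object with a status_code attribute.

import time

def query_with_backoff(query_route, payload, max_wait_seconds=60):
    """Retry on HTTP 429 with exponential backoff for up to ~60 seconds.

    Because RPM limits renew every minute, backing off for up to a minute
    lets a workload ride out the limit instead of failing outright.
    """
    delay, waited = 1.0, 0.0
    while True:
        response = query_route(payload)  # hypothetical client call
        if response.status_code != 429:
            return response
        if waited >= max_wait_seconds:
            raise RuntimeError("Limit still exceeded after backing off")
        time.sleep(delay)
        waited += delay
        delay = min(delay * 2, 16.0)  # cap individual sleep intervals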

Why not distinguish between "quotas" and "rate limits"?

  • Quotas are "longer-term" (minutes, hours, days, months, lifetime) limits on total # of calls meant to restrict project costs
  • Rate limits are "shorter-term" (seconds, minutes) limits on # of calls meant to protect application availability against traffic bursts (e.g. DDoS) and promote fair sharing of resources

The fields required to specify a quota and a rate limit are nearly identical; the only difference is that rate limits are typically enforced second-by-second or minute-by-minute, whereas quotas are enforced minute-by-minute, hour-by-hour, and beyond. So a single "Limits" concept that covers both seems appropriate.
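
For instance, under the unified Limit shape, the two differ only in their renewal period. (The "hour" value below is a hypothetical future renewal period; only "minute" is proposed initially.)

# A "rate limit": protect availability against short traffic bursts
rate_limit = {"calls": 200, "renewal_period": "minute"}

# A "quota": restrict longer-term project costs
# ("hour" is a hypothetical future renewal period, not in the initial proposal)
quota = {"calls": 5000, "renewal_period": "hour"}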

Future work

Given the immediate importance of setting quotas during R&D to prevent runaway costs, we propose to begin with request-per-minute limits on a per-route basis. This will provide a great foundation for future extensions geared towards production applications, such as:

  • Token rate limits
  • Per-user rate limits
  • Rate limits over additional time horizons (second, hour, day, etc.)

We also acknowledge that the following capabilities may become important in the future:

  • Setting long-term (e.g. monthly) budgets for R&D
    • Alerting customers when a budget limit is reached
    • Querying how much of a quota has been used / how much is remaining
  • QoS networking for Routes, e.g. high priority and low priority traffic

Finally, usage tracking / reporting is another active topic of conversation that deserves its own RFC. We're actively investigating the requirements for this capability.

@mlflow-automation commented:

@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.
