feat: load balancing Google Vertex AI API across US/CA regions #2795

Draft
wants to merge 1 commit into main

@msg7086 msg7086 commented May 19, 2024

Summary

The Google Vertex AI API provided by Google Cloud currently has a request quota of 1 request per minute per region. If you are having a conversation with Gemini 1.5 Pro / Flash and reply faster than the quota allows, you hit the quota limit and have to wait. Load balancing across multiple regions works around this problem, because each region has its own quota.

It also spreads the load on Google's side, so the us-central1 region is not flooded with requests from the same app.

The code change is minimal, so it should not affect the user experience. For now the list only includes US/CA regions, because they are close to the previously used region, us-central1. Users close to US central should not see any performance impact. Users with a poor connection to US central may see a performance improvement.
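For reviewers who want the idea in isolation, here is a minimal sketch of per-request region selection. It is written in Python purely for illustration and is not the code in this PR. The names `REGIONS`, `pick_region` and `endpoint_host` are hypothetical, and apart from us-central1 the listed regions are assumptions that should be checked against Google's documentation for Gemini availability.

```python
import random
from collections import Counter
from typing import Optional

# Hypothetical region list. Only us-central1 is named in this PR; the other
# entries are illustrative and must be verified before use.
REGIONS = [
    "us-central1",
    "us-east4",
    "us-west1",
    "northamerica-northeast1",
]


def pick_region(rng: Optional[random.Random] = None) -> str:
    """Return one region chosen uniformly at random for the next request."""
    return (rng or random).choice(REGIONS)


def endpoint_host(region: str) -> str:
    """Build the regional Vertex AI API hostname for the given region."""
    return f"{region}-aiplatform.googleapis.com"


if __name__ == "__main__":
    # Simulate 1000 requests and show how they spread across regions.
    rng = random.Random(0)
    counts = Counter(pick_region(rng) for _ in range(1000))
    for region, n in sorted(counts.items()):
        print(f"{endpoint_host(region)}: {n} requests")
```

Random selection keeps the picker stateless, so nothing has to be shared between requests or server instances. The trade-off is that two consecutive requests can still land on the same region, so this reduces quota hits rather than ruling them out.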

This is a preliminary implementation to mitigate #2723.

Change Type

  • New feature (non-breaking change which adds functionality)

Testing

TBD

Checklist

  • My code adheres to this project's style guidelines
  • I have performed a self-review of my own code
  • I have commented in any complex areas of my code
  • My changes do not introduce new warnings
  • Local unit tests pass with my changes
