Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[serve] deflake autoscaling tests #45358

Merged
merged 1 commit into from
May 22, 2024
Merged

[serve] deflake autoscaling tests #45358

merged 1 commit into from
May 22, 2024

Conversation

zcin
Copy link
Contributor

@zcin zcin commented May 15, 2024

[serve] deflake autoscaling tests

Deflake test_autoscaling_policy.py which contains e2e autoscaling tests.

  1. While debugging, we saw that test_handle_deleted_on_crashed_replica would sometimes spit out an error (after the test passes) because the restarted Router replica fails to initialize because the controller already started shutting down all deployments. This PR adds a wait condition to make sure router replica is restarted before test ends.
  2. SignalActor, with the same actor name, is used in a few of the tests. If Ray doesn't garbage collect fast enough, then the actor won't be killed fast enough and consecutive test runs will error out saying an actor of the same name already exists. This PR makes sure to kill the signal actor at the end of tests that use it.

Signed-off-by: Cindy Zhang cindyzyx9@gmail.com

@zcin zcin added the go Trigger full test run on premerge label May 15, 2024
@@ -748,6 +756,9 @@ def test_downscaling_with_fractional_scaling_factor(
# replica when the signal actor goes out of scope and gets killed
ray.get(signal.send.remote())

# Delete signal actor so there is no conflict between tests
ray.kill(signal)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use fixture pls

@zcin zcin removed the go Trigger full test run on premerge label May 15, 2024
@zcin zcin force-pushed the pr45358 branch 4 times, most recently from 714986d to e4c5136 Compare May 21, 2024 21:10
Deflake `test_autoscaling_policy.py` which contains e2e autoscaling tests.

1. While debugging, we saw that `test_handle_deleted_on_crashed_replica` would sometimes spit out an error (after the test passes) because the restarted Router replica fails to initialize because the controller already started shutting down all deployments. This PR adds a wait condition to make sure router replica is restarted before test ends.
2. `SignalActor`, with the same actor name, is used in a few of the tests. If Ray doesn't garbage collect fast enough, then the actor won't be killed fast enough and consecutive test runs will error out saying an actor of the same name already exists. This PR makes sure to kill the signal actor at the end of tests that use it.

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
@zcin zcin marked this pull request as ready for review May 21, 2024 21:32
@zcin zcin requested a review from edoakes May 22, 2024 15:53
@edoakes edoakes enabled auto-merge (squash) May 22, 2024 18:05
@github-actions github-actions bot added the go Trigger full test run on premerge label May 22, 2024
@edoakes edoakes merged commit 42d9e05 into ray-project:master May 22, 2024
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go Trigger full test run on premerge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants