
samples.snippets.create_training_pipeline_tabular_regression_sample_test: test_ucaip_generated_create_training_pipeline_sample failed #580

Closed
flaky-bot bot opened this issue Jul 30, 2021 · 5 comments
Assignee: nicain
Labels: api: aiplatform · flakybot: flaky · flakybot: issue · priority: p1 · 🚨 This issue needs some love. · samples · type: bug

Comments


flaky-bot bot commented Jul 30, 2021

Note: #413 was also for this test, but it was closed more than 10 days ago. So, I didn't mark it flaky.


commit: 6a99b12
buildURL: Build Status, Sponge
status: failed

Test output
shared_state = {'training_pipeline_name': 'projects/580378083368/locations/us-central1/trainingPipelines/8703078736544661504'}
pipeline_client = <google.cloud.aiplatform_v1.services.pipeline_service.client.PipelineServiceClient object at 0x7f5245813a90>
@pytest.fixture()
def teardown_training_pipeline(shared_state, pipeline_client):
    yield

    pipeline_client.cancel_training_pipeline(
        name=shared_state["training_pipeline_name"]
    )

    # Waiting for training pipeline to be in CANCELLED state
    helpers.wait_for_job_state(
        get_job_method=pipeline_client.get_training_pipeline,
        name=shared_state["training_pipeline_name"],
    )

conftest.py:168:


get_job_method = <bound method PipelineServiceClient.get_training_pipeline of <google.cloud.aiplatform_v1.services.pipeline_service.client.PipelineServiceClient object at 0x7f5245813a90>>
name = 'projects/580378083368/locations/us-central1/trainingPipelines/8703078736544661504'
expected_state = 'CANCELLED', timeout = 90, freq = 1.5

def wait_for_job_state(
    get_job_method: Callable[[str], "proto.Message"],  # noqa: F821
    name: str,
    expected_state: str = "CANCELLED",
    timeout: int = 90,
    freq: float = 1.5,
) -> None:
    """ Waits until the Job state of provided resource name is a particular state.

    Args:
        get_job_method: Callable[[str], "proto.Message"]
            Required. The GAPIC getter method to poll. Takes 'name' parameter
            and has a 'state' attribute in its response.
        name (str):
            Required. Complete uCAIP resource name to pass to get_job_method
        expected_state (str):
            The state at which this method will stop waiting.
            Default is "CANCELLED".
        timeout (int):
            Maximum number of seconds to wait for expected_state. If the job
            state is not expected_state within timeout, a TimeoutError will be raised.
            Default is 90 seconds.
        freq (float):
            Number of seconds between calls to get_job_method.
            Default is 1.5 seconds.
    """

    for _ in range(int(timeout / freq)):
        response = get_job_method(name=name)
        if expected_state in str(response.state):
            return None
        time.sleep(freq)

    raise TimeoutError(
        f"Job state did not become {expected_state} within {timeout} seconds"
        "\nTry increasing the timeout in sample test"
        f"\nLast recorded state: {response.state}"
    )

E TimeoutError: Job state did not become CANCELLED within 90 seconds
E Try increasing the timeout in sample test
E Last recorded state: 6

helpers.py:55: TimeoutError
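
The "Last recorded state: 6" line is worth decoding. Assuming the aiplatform_v1 PipelineState enum (the enum itself is not quoted in this issue, so verify against your installed client), 6 is PIPELINE_STATE_CANCELLING: the cancel request was accepted but had not reached PIPELINE_STATE_CANCELLED within the 90-second poll budget. A minimal sketch of the lookup:

from google.cloud.aiplatform_v1.types import PipelineState

# Decode the numeric state from the error output above.
# 6 should map to PIPELINE_STATE_CANCELLING: cancellation was in
# progress but had not yet reached PIPELINE_STATE_CANCELLED (7).
print(PipelineState(6).name)  # PIPELINE_STATE_CANCELLING
print(PipelineState(7).name)  # PIPELINE_STATE_CANCELLED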

@flaky-bot flaky-bot bot added flakybot: issue An issue filed by the Flaky Bot. Should not be added manually. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Jul 30, 2021
@product-auto-label product-auto-label bot added api: aiplatform Issues related to the AI Platform API. samples Issues that are directly related to samples. labels Jul 30, 2021

flaky-bot bot commented Jul 31, 2021

commit: c24251f
buildURL: Build Status, Sponge
status: failed

Test output
shared_state = {'training_pipeline_name': 'projects/580378083368/locations/us-central1/trainingPipelines/7281841210388381696'}
pipeline_client = <google.cloud.aiplatform_v1.services.pipeline_service.client.PipelineServiceClient object at 0x7f01b0661310>
(The remainder of the traceback is identical to the previous report: teardown_training_pipeline in conftest.py:168 times out in helpers.wait_for_job_state.)

E TimeoutError: Job state did not become CANCELLED within 90 seconds
E Try increasing the timeout in sample test
E Last recorded state: 6

helpers.py:55: TimeoutError


flaky-bot bot commented Aug 1, 2021

Looks like this issue is flaky. 😟

I'm going to leave this open and stop commenting.

A human should fix and close this.


When run at the same commit (c24251f), this test passed in one build (Build Status, Sponge) and failed in another build (Build Status, Sponge).

@flaky-bot flaky-bot bot added the flakybot: flaky Tells the Flaky Bot not to close or comment on this issue. label Aug 1, 2021
@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Aug 7, 2021
@nicain nicain self-assigned this Sep 27, 2021

nicain commented Sep 28, 2021

The flakiness of this test appears to be caused by the service, and not by the client library. Using the following snippet to evaluate flakiness, I see:

Result: FAILURE, mean=104.65, max=108.74, count=21
Result: SUCCESS, mean=4.42, max=4.89, count=29
import collections
from timeit import default_timer as timer

import pytest

file_name = 'create_training_pipeline_tabular_regression_sample_test.py'
test_name = 'test_ucaip_generated_create_training_pipeline_sample'

# Run the test 50 times and bucket wall-clock durations by outcome.
delta_dict = collections.defaultdict(list)
for ri in range(50):
    start = timer()
    result = pytest.main([f'{file_name}::{test_name}'])
    end = timer()
    delta = end - start
    if result == pytest.ExitCode.OK:
        delta_dict['SUCCESS'].append(delta)
    else:
        delta_dict['FAILURE'].append(delta)

# Report mean and max duration (seconds) per outcome.
for key, delta_list in delta_dict.items():
    mean_time = sum(delta_list) / len(delta_list)
    max_time = max(delta_list)
    report_string = f'Result: {key}, mean={mean_time:3.2f}, max={max_time:3.2f}, count={len(delta_list)}'
    print(report_string)
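
Note the bimodal timing: failing runs cluster just above the 90-second timeout in wait_for_job_state (mean 104.65 s, presumably the full poll budget plus test overhead), while passing runs finish in under 5 s. That gap is consistent with cancellation being slow on the service side rather than anything in the test body.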

@nicain nicain closed this as completed Sep 28, 2021
@nicain nicain reopened this Sep 29, 2021

nicain commented Sep 29, 2021

Further diagnosis revealed that the test fails intermittently because of a timeout while waiting for the training pipeline cancellation to complete. Sometimes cancellation finishes within 5 seconds, but it can take as long as 215 seconds (the max over 20 runs).
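
A minimal sketch of one mitigation, assuming the teardown fixture quoted in the tracebacks above; the timeout and freq values here are illustrative, not necessarily what the eventual fix chose:

# Hypothetical adjustment to the teardown in conftest.py: give the
# service more headroom to finish cancelling before giving up.
helpers.wait_for_job_state(
    get_job_method=pipeline_client.get_training_pipeline,
    name=shared_state["training_pipeline_name"],
    timeout=300,  # illustrative; should exceed the ~215 s worst case observed
    freq=10,      # poll less often over the longer window
)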

vinnysenthil commented

Closing this issue; it was fixed by #734.
