
Cannot submit a custom training job to a VAI Persistent Resource #6285

Open
adriangay opened this issue Sep 14, 2023 · 13 comments

@adriangay

adriangay commented Sep 14, 2023

System information

  • TFX Version (you are using): Applies to all versions up to latest 1.14.0

  • Environment in which you plan to use the feature: Google Cloud

  • Are you willing to contribute it (Yes/No): Yes

Do you have a workaround or are completely blocked by this? : blocked, can't get workaround to work

Name of your Organization (Optional): Sky/NBCU

Describe the feature and the current behavior/state.

TFX's tfx.extensions.google_cloud_ai_platform.training_clients.py uses google.cloud.aiplatform_v1. Update TFX to use google.cloud.aiplatform_v1beta1 or later.

Google Vertex AI has a new feature in preview - VAI Persistent Resource. It allows customers to reserve a cluster with GPUs and appropriate CPUs and use it for model training. Using this feature is highly desirable because of the ongoing global GPU shortage, which causes very frequent 'stockout' errors ("resources insufficient in region"); these cause custom training pipeline jobs to fail, resulting in stale models. Creating the cluster works fine; submitting custom training jobs to it from TFX Trainer does not.

The reason for this is that, in order for a job to be submitted to a VAI Persistent Resource, a new field, persistent_resource_id, must be added to the CustomJobSpec provided on job submission. This field was introduced in google.cloud.aiplatform_v1beta1 and is defined here:

https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform_v1beta1.types.CustomJobSpec

It must be added to the TFX Trainer ai_platform_training_args like this:

  "ai_platform_training_args": {
    "persistent_resource_id": "persistent-preview-test",
    "project": "<redacted>",
    "worker_pool_specs": [
      {
        "container_spec": {
          "image_uri": "<redacted>"
        },
        "machine_spec": {
          "accelerator_count": 1,
          "accelerator_type": "NVIDIA_TESLA_A100",
          "machine_type": "a2-highgpu-1g"
        },
        "replica_count": 1
      }
    ]
  },

On job submission, this results in ValueError: Protocol message CustomJobSpec has no "persistent_resource_id" field:

.
.
ERROR 2023-09-07T08:12:38.490196998Z [resource.labels.taskName: workerpool0-0] File "/usr/local/lib/python3.9/site-packages/google/cloud/aiplatform_v1/services/job_service/client.py", line 850, in create_custom_job
ERROR 2023-09-07T08:12:38.491669020Z [resource.labels.taskName: workerpool0-0] request.custom_job = custom_job
ERROR 2023-09-07T08:12:38.491745037Z [resource.labels.taskName: workerpool0-0] File "/usr/local/lib/python3.9/site-packages/proto/message.py", line 776, in __setattr__
ERROR 2023-09-07T08:12:38.492876096Z [resource.labels.taskName: workerpool0-0] pb_value = marshal.to_proto(pb_type, value)
ERROR 2023-09-07T08:12:38.492892700Z [resource.labels.taskName: workerpool0-0] File "/usr/local/lib/python3.9/site-packages/proto/marshal/marshal.py", line 217, in to_proto
ERROR 2023-09-07T08:12:38.493612261Z [resource.labels.taskName: workerpool0-0] pb_value = rule.to_proto(value)
ERROR 2023-09-07T08:12:38.493631935Z [resource.labels.taskName: workerpool0-0] File "/usr/local/lib/python3.9/site-packages/proto/marshal/rules/message.py", line 36, in to_proto
ERROR 2023-09-07T08:12:38.494092928Z [resource.labels.taskName: workerpool0-0] return self._descriptor(**value)
ERROR 2023-09-07T08:12:38.494105950Z [resource.labels.taskName: workerpool0-0] ValueError: Protocol message CustomJobSpec has no "persistent_resource_id" field.

This is because TFX uses the v1 API and the v1 CustomJobSpec.
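To make the version mismatch concrete, here is a minimal repro sketch, independent of TFX, that marshals the same kind of spec through both sets of types (the container image URI is a placeholder, and this assumes a client library release in which persistent_resource_id is still v1beta1-only, as was the case when this issue was filed):

    # Repro sketch only, not TFX code. The image URI is a placeholder; the rest
    # mirrors the ai_platform_training_args shown above.
    from google.cloud import aiplatform_v1, aiplatform_v1beta1

    job_spec = {
        "persistent_resource_id": "persistent-preview-test",
        "worker_pool_specs": [{
            "container_spec": {"image_uri": "gcr.io/my-project/trainer:latest"},
            "machine_spec": {
                "machine_type": "a2-highgpu-1g",
                "accelerator_type": "NVIDIA_TESLA_A100",
                "accelerator_count": 1,
            },
            "replica_count": 1,
        }],
    }

    # The v1beta1 types accept the new field.
    aiplatform_v1beta1.CustomJob(display_name="demo", job_spec=job_spec)

    # The v1 types (the path TFX takes today) reject it with the error above.
    try:
        aiplatform_v1.CustomJob(display_name="demo", job_spec=job_spec)
    except ValueError as err:
        print(err)  # Protocol message CustomJobSpec has no "persistent_resource_id" field.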

In an attempt to patch packages in a TFX container to work around this, we reverse-engineered the code path and modified the TFX container we build, replacing imports referencing CustomJob and CustomJobSpec in various places with:

from google.cloud.aiplatform_v1beta1.types.custom_job import CustomJob
from google.cloud.aiplatform_v1beta1.types.job_state import JobState

While this fixes the ValueError and job submission now succeeds, the job is not routed to the persistent resource cluster. We think that the issue is that TFX training_clients.py is still 'calling' the google.cloud.aiplatform_v1 API, so the Google service is just ignoring the extra field of the v1beta1 CustomJobSpec we are passing?

We can see that gapic is the API surface, and we see references to gapic_version being set, but we don't really understand how the API version is selected or how it could be patched, if that is now the issue. If it is, we would appreciate some advice and guidance on what further patching of the TFX container would be required to enable training_clients.py to 'call' the v1beta1 API.
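For reference, a minimal sketch of what 'calling the v1beta1 API' would mean here: the API version is determined by which GAPIC client class is instantiated, so the job service client itself has to come from the v1beta1 package (the region and endpoint suffix follow the region + "-aiplatform.googleapis.com" pattern used in training_clients.py; this is an illustration, not the actual TFX code):

    # Sketch only: a v1beta1 JobServiceClient pointed at the regional Vertex AI
    # endpoint. Requests made through this client go to the v1beta1 API surface,
    # which understands persistent_resource_id.
    from google.cloud.aiplatform_v1beta1 import JobServiceClient

    region = "us-central1"
    client = JobServiceClient(
        client_options={"api_endpoint": region + "-aiplatform.googleapis.com"}
    )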

@singhniraj08
Contributor

@adriangay,

I think you need to make a few changes in training_clients.py to use google.cloud.aiplatform_v1beta1.

The client creation and create_custom_job need to be updated as shown in the create_custom_job example code. Also, get_custom_job needs to be updated as shown here.

Let us know if this works for you. Thank you!

@adriangay
Author

@singhniraj08 Thank you for responding quickly. It's unlikely I would have stumbled upon the create_custom_job example code, so I really appreciate that! tl;dr: I made the changes, but the symptom is still the same, i.e. the job still does not appear to be submitted to the VAI Persistent Resource cluster. The changes I made were in class VertexJobClient only - this may be the issue, i.e. should I have done the same for class CAIPJobClient also, since it's a CAIP job that submits the VAI Trainer job?

In training_clients.py:

from google.cloud.aiplatform_v1beta1 import JobServiceClient, CreateCustomJobRequest, GetCustomJobRequest
.
.
from google.cloud.aiplatform_v1beta1.types.custom_job import CustomJob
from google.cloud.aiplatform_v1beta1.types.job_state import JobState

Note that existing TFX code did:

  self._client = gapic.JobServiceClient(
        client_options=dict(api_endpoint=self._region +
                            _VERTEX_ENDPOINT_SUFFIX))

I changed this to use the additional v1beta1 import shown above as per the example code:

   self._client = JobServiceClient(
        client_options=dict(api_endpoint=self._region +
                            _VERTEX_ENDPOINT_SUFFIX))

In launch_job():

    request = CreateCustomJobRequest(
        parent=parent,
        custom_job=training_job,
    )
    response = self._client.create_custom_job(request)

in get_job():

    request = GetCustomJobRequest(name=self._job_name)
    return self._client.get_custom_job(request)

I could not see anywhere else that needed to be changed.

@adriangay
Author

@singhniraj08 hi, not sure why this issue was closed - I did not knowingly close it. Thx for re-opening!

@briron
Member

briron commented Sep 21, 2023

I should have done the same for class CAIPJobClient also
-> If you set enable_vertex=True when calling training_clients.get_job_client, you don't have to.
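A minimal sketch of that selection (only get_job_client and the enable_vertex flag are taken from this thread; other parameters of the real function, if any, are omitted):

    # Sketch only: with enable_vertex=True the factory returns the Vertex job
    # client, so CAIPJobClient is never involved in the submission path.
    from tfx.extensions.google_cloud_ai_platform import training_clients

    job_client = training_clients.get_job_client(enable_vertex=True)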

If you already changed JobServiceClient, it seems to be enough. But, the job is not routed to the persistent resource cluster, right?

I've found some configurations [1] that need to be done when using a persistent resource like

- Specify the persistent_resource_id parameter and set the value to the ID of the persistent resource (PERSISTENT_RESOURCE_ID) that you want to use.
- Specify the worker_pool_specs parameter such that the values of machine_spec and disk_spec for each resource pool match exactly with a corresponding resource pool from the persistent resource. Specify one machine_spec for single-node training and multiple for distributed training (a matching example is sketched after the reference link below).
- Specify a replica_count less than or equal to the replica_count or max_replica_count of the corresponding resource pool, excluding the replica count of any other jobs running on that resource pool.

It seems that you already specified the persistent_resource_id, but I have no idea whether machine_spec and disk_spec in worker_pool_specs match exactly with a corresponding resource pool from the persistent resource.

Could you please check this?

[1] https://cloud.google.com/vertex-ai/docs/training/persistent-resource-train#create_a_training_job_that_runs_on_a_persistent_resource
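For illustration, a worker_pool_specs entry that satisfies these constraints for an a2-highgpu-1g / A100 resource pool with a 100 GB pd-ssd boot disk might look like the fragment below (a sketch only; the image URI is a placeholder and every value must mirror your own resource pool):

    # Sketch of a matching worker pool spec for ai_platform_training_args.
    # Values must mirror the resource pool provisioned in the persistent
    # resource; the image URI is a placeholder.
    worker_pool_specs = [{
        "container_spec": {"image_uri": "gcr.io/my-project/trainer:latest"},
        "machine_spec": {
            "machine_type": "a2-highgpu-1g",
            "accelerator_type": "NVIDIA_TESLA_A100",
            "accelerator_count": 1,
        },
        "disk_spec": {
            "boot_disk_size_gb": 100,
            "boot_disk_type": "pd-ssd",
        },
        "replica_count": 1,
    }]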

@adriangay
Author

@briron hi, thanks for the reply. yes, have added persistent_resource_id. This is accepted on CustomJobSpec with the v1beta1 changes. The worker_pool_specs we provide are the same as before, and the persistent cluster was provisioned with the same machine type and GPU. The persistent cluster has a replica_count of 1 and maxReplicaCount of 2 (so we can see if scaling works), and the replica_count in the worker_pool_specs is 1. The relevant parts of the custom_config passed by TFX Trainer are:

  "ai_platform_training_args": {
    "persistent_resource_id": "persistent-preview-test",
    "project": "<redacted>",
    "worker_pool_specs": [
      {
        "container_spec": {
          "image_uri": "<redacted>"
        },
        "machine_spec": {
          "accelerator_count": 1,
          "accelerator_type": "NVIDIA_TESLA_A100",
          "machine_type": "a2-highgpu-1g"
        },
        "replica_count": 1
      }
    ]
  },

which I think aligns with the VAI Persistent Resource cluster we provisioned:

$ gcloud beta ai persistent-resources list --project nbcu-disco-int-nft-003 --region us-central1
Using endpoint [https://us-central1-aiplatform.googleapis.com/]
---
createTime: '2023-08-30T10:09:35.302158Z'
displayName: persistent-preview-test
name: projects/<redacted>/locations/us-central1/persistentResources/persistent-preview-test
resourcePools:
- autoscalingSpec:
    maxReplicaCount: '2'
    minReplicaCount: '1'
  diskSpec:
    bootDiskSizeGb: 100
    bootDiskType: pd-ssd
  id: a2-highgpu-1g-nvidia-tesla-a100-1
  machineSpec:
    acceleratorCount: 1
    acceleratorType: NVIDIA_TESLA_A100
    machineType: a2-highgpu-1g
  replicaCount: '1'
startTime: '2023-08-30T10:14:34.743355734Z'
state: RUNNING
updateTime: '2023-08-30T10:14:36.212384Z'

@briron
Member

briron commented Sep 21, 2023

@adriangay
Thanks for the detailed information. How about disk_spec?
custom_config doesn't include that. Have you set this before?

@adriangay
Author

@briron No, we don't normally set it. You think this is required? I can try that 😸

@briron
Member

briron commented Sep 21, 2023

@adriangay Let's try. I'll investigate more apart from that.

@adriangay
Author

adriangay commented Sep 22, 2023

@briron added:

"disk_spec":  {
    "boot_disk_size_gb": 100, 
    "boot_disk_type": "pd-ssd"
}

to worker_pool_specs; the job was submitted, and I saw this logged:

INFO 2023-09-21T21:31:21.613955595Z [resource.labels.taskName: service] Waiting for job to be provisioned.
ERROR 2023-09-21T21:31:29.469867460Z [resource.labels.taskName: service] Resources are insufficient in region: us-central1. Please try a different region. If you use K80, please consider using P100 or V100 instead.
INFO 2023-09-21T21:32:00.155728709Z [resource.labels.taskName: service] Job failed.
INFO 2023-09-21T21:32:00.174426138Z [resource.labels.taskName: service] Waiting for job to be provisioned.
ERROR 2023-09-21T21:34:18.042487875Z [resource.labels.taskName: workerpool0-0] 2023-09-21 21:34:18.042273: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
ERROR 2023-09-21T21:34:27.701628499Z [resource.labels.taskName: workerpool0-0] I0921 21:34:27.701316 139906954585920 run_executor.py:139] Executor tfx.components.trainer.executor.GenericExecutor do: inputs: {'base_model': [Artifact(artifact: id: 8957739397649224294
.
.
INFO 2023-09-21T21:39:54.742265861Z [resource.labels.taskName: service] Job completed successfully.

"Resources are insufficient..." job failure, then resource was acquired after a retry and training started. So I'm assuming I got lucky on the retry and this is not running on the persistent cluster. I have no direct way of observing where execution occurred other than the labels for the log messages:

{
  insertId: "1v5dcqofdf6b25"
  jsonPayload: {
    attrs: {
      tag: "workerpool0-0"
    }
    levelname: "ERROR"
    message: "2023-09-21 21:34:18.042273: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`."
  }
  labels: {
    compute.googleapis.com/resource_id: "741144473188239551"
    compute.googleapis.com/resource_name: "cmle-training-17744171291569385386"
    compute.googleapis.com/zone: "us-central1-f"
    ml.googleapis.com/job_id/log_area: "root"
    ml.googleapis.com/tpu_worker_id: ""
    ml.googleapis.com/trial_id: ""
    ml.googleapis.com/trial_type: ""
  }
  logName: "projects/nbcu-disco-int-nft-003/logs/workerpool0-0"
  receiveTimestamp: "2023-09-21T21:34:41.154586863Z"
  resource: {
    labels: {
      job_id: "3598193926936199168"
      project_id: "nbcu-disco-int-nft-003"
      task_name: "workerpool0-0"
    }
    type: "ml_job"
  }
  severity: "ERROR"
  timestamp: "2023-09-21T21:34:18.042487875Z"
}
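One way to narrow this down might be to fetch the submitted CustomJob back from the service and check whether it recorded a persistent_resource_id on its job_spec. A diagnostic sketch, assuming the v1beta1 client changes described earlier in this thread; the job resource name is the one runner.py reports for this run:

    # Diagnostic sketch: read the submitted job back and check whether the
    # service kept the persistent_resource_id on its job_spec.
    from google.cloud.aiplatform_v1beta1 import JobServiceClient

    client = JobServiceClient(
        client_options={"api_endpoint": "us-central1-aiplatform.googleapis.com"}
    )
    job = client.get_custom_job(
        name="projects/636088981528/locations/us-central1/customJobs/3598193926936199168"
    )
    print(job.job_spec.persistent_resource_id)  # empty if the field was dropped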

@adriangay
Author

adriangay commented Sep 22, 2023

@briron The logging to the VAI pipeline console UI does not show any of the logging I see in Stackdriver logs. All I see in VAI UI is:

2023-09-21 22:31:20.655 BST
I0921 21:31:20.655704 140564913690432 training_clients.py:419] Submitting custom job='tfx_20230921213120_fa456afa', parent='projects/nbcu-disco-int-nft-003/locations/us-central1' to Vertex AI Training.
2023-09-21 22:37:53.336 BST
I0921 21:37:53.336246 140564913690432 runner.py:123] Job 'projects/636088981528/locations/us-central1/customJobs/3598193926936199168' successful.
2023-09-21 22:37:59.688 BST
Tearing down training program.

The messages about insufficient resources and the retry are not there. But retries may be happening on other, successful jobs too, and I wouldn't see them there regardless of where the job ran?

@adriangay
Author

@briron I've uploaded my modified training_clients.py module. Maybe you can check whether I've made the changes correctly? Thank you.
training_clients.py.zip

@briron
Member

briron commented Sep 25, 2023

If you're using VertexClient, it looks right. The job seems to be submitted well, but I have no idea how VAI works on its side.

@adriangay
Author

@briron ok, thank you for investigating
