
TFX 1.13.0 Issues #5887

Open
rtg0795 opened this issue May 4, 2023 · 5 comments
rtg0795 commented May 4, 2023

Please comment or link any issues you find with TFX 1.13.0.

Thanks.

EdwardCuiPeacock commented Jul 26, 2023

I want to document an issue we found recently, though it should be resolved once TFX 1.14.0 is released. A couple of months ago, Google made L4 GPU types generally available. We attempted to configure training to use L4 GPUs this morning and saw the following error message:

[Screenshot of the error message, taken 2023-07-26 11:22 AM]

We confirmed that we have configured the GPU machine types correctly, based on https://cloud.google.com/ai-platform/training/docs/reference/rest/v1/AcceleratorType.
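For reference, a Vertex AI worker pool spec requesting an L4 GPU might look like the sketch below. The machine and accelerator type names follow the AcceleratorType reference linked above; the container image URI is a hypothetical placeholder, not taken from the original report.

```python
# Sketch of a Vertex AI CustomJob worker pool spec requesting an L4 GPU.
# G2 machines are the family that carries NVIDIA L4 accelerators.
worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": "g2-standard-4",
            "accelerator_type": "NVIDIA_L4",
            "accelerator_count": 1,
        },
        "replica_count": 1,
        "container_spec": {
            # Hypothetical training image; replace with your own.
            "image_uri": "gcr.io/my-project/my-trainer:latest",
        },
    }
]

print(worker_pool_specs[0]["machine_spec"]["accelerator_type"])
```

With an aiplatform client older than 1.25.0, the `NVIDIA_L4` name is not in the AcceleratorType enum, which is the validation error described above.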

The error traces back to google.cloud.aiplatform. The latest version that both TFX 1.12.0 and TFX 1.13.0 support is 1.17.1, per the dependency constraint:

'google-cloud-aiplatform>=1.6.2,<1.18',

Looking at the source code of google.cloud.aiplatform, we traced it back to the file aiplatform_v1/types/accelerator_type.py:

https://github.com/googleapis/python-aiplatform/blame/main/google/cloud/aiplatform_v1/types/accelerator_type.py#L69

showing that the L4 machine type was added on May 2, 2023, and first released in version 1.25.0. This prevents any existing released version of TFX from using the new L4 GPU types.
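The effect of the pin can be illustrated with a small version check. This is a sketch: the `pin_allows` helper below is ours, not part of TFX or aiplatform, and it implements only the simple `major.minor.patch` comparison needed here.

```python
def pin_allows(version: str, lo=(1, 6, 2), hi=(1, 18, 0)) -> bool:
    """Return True if `version` satisfies TFX's pin '>=1.6.2,<1.18'."""
    parsed = tuple(int(part) for part in version.split("."))
    return lo <= parsed < hi

# 1.17.1 is the newest release series the pin admits...
print(pin_allows("1.17.1"))  # True
# ...while 1.25.0, the first release with NVIDIA_L4, is excluded.
print(pin_allows("1.25.0"))  # False
```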

The reason this issue should be resolved is that, in the current main branch of this repository, the dependency constraint on google.cloud.aiplatform has been updated to:

'google-cloud-aiplatform>=1.6.2,<2',

This means future releases of TFX pipelines can install a newer version of google.cloud.aiplatform and start using L4 machines.

We will confirm if this is the case once 1.14.0 is released. Really looking forward to it.

axeltidemann commented
The Transform component running on Dataflow stopped working after upgrading from TFX 1.8.0. No changes were made; the same machine type (n1-standard-4) and accelerator (nvidia-tesla-t4) were used. The Dataflow job is killed after 1 hour due to lack of progress. The following screenshots show that it is silently OOM-killed.

The memory usage goes to max, then down, and up again.

[Screenshot: worker memory usage over time]

In a similar pattern, new processes are started.

[Screenshot: worker process count over time]

Somehow, the memory usage has gone up tremendously.

However, when switching to a machine type with more memory, the same pattern is observed, which makes me believe something is systematically wrong.

Here, the memory grows and falls again, but without exceeding the limit.

[Screenshot: memory usage on the larger machine type]

The same pattern is observed: processes start in tandem with the memory growth seen in the previous screenshot.

[Screenshot: process count on the larger machine type]

singhniraj08 commented

@axeltidemann,

We are already working on the memory leak issue transform/#143. One suggested workaround is reimplementing beam.io.WriteToTFRecord as a beam.ParDo that writes with tf.io.TFRecordWriter instead, which is reported to keep memory usage low on large datasets. Could you try these workarounds while we work on the issue? Thank you!

@singhniraj08 singhniraj08 self-assigned this Jul 28, 2023
axeltidemann commented

@singhniraj08 I'd rather not make any changes to the TFX Transform component at all; I just instantiate it as part of a TFX pipeline. Maybe I am misunderstanding you, though. A code example would help clarify.

singhniraj08 commented

@axeltidemann, apologies, the workarounds above apply when using tf.Transform directly for preprocessing data. Since you are working with a TFX pipeline, you can follow the similar issue #5777.
In TFX 1.13 we introduced a new batching mode that tries to deserialize data in batches of roughly 100 MB. It can be enabled with the tfxio_use_byte_size_batching flag, but this is experimental and not exposed to the Transform component. Please follow the similar issue for updates. Thank you!
