TFX components in GCP does not display component logs in GCP Vertex AI #6539

Open
crbl1122 opened this issue Dec 22, 2023 · 13 comments

@crbl1122

crbl1122 commented Dec 22, 2023

If the bug is related to a specific library below, please raise an issue in the
respective repo directly: TFX

TensorFlow Data Validation Repo

TensorFlow Model Analysis Repo

TensorFlow Transform Repo

TensorFlow Serving Repo

System information

  • Have I specified the code to reproduce the issue (Yes, No):
  • Environment in which the code is executed (e.g., Local(Linux/MacOS/Windows),
    Interactive Notebook, Google Cloud, etc):
  • TensorFlow version: 2.13.1
  • TFX version: 1.14.0
  • KFP version: 1.8.22
  • Python version: 3.8.18
  • Python dependencies (from pip freeze output):
  • google-api-core 2.15.0
    google-api-python-client 1.12.11
    google-apitools 0.5.31
    google-auth 2.25.2
    google-auth-httplib2 0.1.1
    google-auth-oauthlib 1.0.0
    google-cloud-aiplatform 1.37.0
    google-cloud-appengine-logging 1.4.0
    google-cloud-audit-log 0.2.5
    google-cloud-bigquery 2.34.4
    google-cloud-bigquery-storage 2.23.0
    google-cloud-bigtable 2.21.0
    google-cloud-core 2.4.1
    google-cloud-datastore 2.18.0
    google-cloud-dlp 3.14.0
    google-cloud-language 2.12.0
    google-cloud-logging 3.9.0
    google-cloud-pubsub 2.19.0
    google-cloud-pubsublite 1.8.3
    google-cloud-recommendations-ai 0.10.6
    google-cloud-resource-manager 1.11.0
    google-cloud-spanner 3.40.1
    google-cloud-storage 2.13.0
    google-cloud-videointelligence 2.12.0
    google-cloud-vision 3.5.0
    google-crc32c 1.5.0
    google-pasta 0.2.0
    google-resumable-media 2.6.0
    googleapis-common-protos 1.62.0
    grpc-google-iam-v1 0.13.0

Describe the current behavior

I am running Kubeflow pipelines with TFX components on GCP Vertex AI. No component logs are displayed in the Vertex interface (for neither the main job nor the pipeline job), and Logs Explorer shows only framework messages. This happens regardless of the component type (ExampleGen, Trainer, Transform, etc.) and makes debugging TFX components blind and very difficult. I submit the pipelines using a service account that has Logs Writer/Reader privileges.

image

Describe the expected behavior
Be able to view the component logs for code debugging.

Standalone code to reproduce the issue

Providing a bare minimum test case or step(s) to reproduce the problem will
greatly help us to debug the issue. If possible, please share a link to
Colab/Jupyter/any notebook.

Name of your Organization (Optional)

Other info / logs

Include any logs or source code that would be helpful to diagnose the problem.
If including tracebacks, please include the full traceback. Large logs and files
should be attached.

@singhniraj08
Contributor

@lego0901,

This issue was raised about viewing component-level logs in Logs Explorer while running TFX pipelines in Vertex AI. I was unable to find any setting that enables the container logs. Please let me know if I am missing anything. Thank you!

@ImmanuelXIV

+1. Logs are also not displayed when using PyTorch + Kubeflow pipelines. Please fix it, this seems to be a general issue. It not only makes debugging tricky, but I also can't tell whether the specified GPU and memory are utilized when training.

@singhniraj08
Contributor

@ImmanuelXIV, this repo is for issues you face while implementing TFX pipelines. I would ask you to open an issue with the Cloud support team. You can follow Get Support to raise an issue. Thank you!

@crbl1122
Author

> +1. Logs are also not displayed when using PyTorch + Kubeflow pipelines. Please fix it, this seems to be a general issue. It not only makes debugging tricky, but I also can't tell whether the specified GPU and memory are utilized when training.

Strange that this is a general issue.

@adriangay

We are experiencing the same issue in VAI while trying to migrate our training pipelines to 1.14. I have raised a Google Support Case. Has anyone else who is experiencing this issue raised a case? It would be good to compare notes.

@lego0901
Member

lego0901 commented Feb 1, 2024

Hello, we also ran several VAI pipelines manually and were able to see the component logs, whether or not a component run failed. This is very odd. @crbl1122, could you confirm that the logs are missing for all components, regardless of whether they failed?

But, I can give you a general way to debug.

  1. We usually can't see the component logs if the orchestrator fails to launch a component.

  2. If that's the case, check the orchestrator's log, which you can find in Error Reporting. Please look there for a relevant error.

  3. Otherwise, you can follow Get Support to raise an issue.
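For anyone who prefers to check from a script rather than the console, here is a minimal sketch of querying Cloud Logging for a job's entries. It is only an illustration under assumptions: the helper names `build_log_filter` and `fetch_entries` are hypothetical, and the resource type `ml_job` with the label `job_id` is my assumption about how Vertex AI custom job logs are indexed (pipeline-level entries may use a different resource type). The only library call used is `Client.list_entries` from `google-cloud-logging`, which needs valid credentials to run.

```python
from typing import Dict, Optional


def build_log_filter(resource_type: str,
                     labels: Dict[str, str],
                     min_severity: Optional[str] = None) -> str:
    """Builds a Cloud Logging filter string from a resource type and labels."""
    clauses = [f'resource.type="{resource_type}"']
    # Sort labels so the generated filter is deterministic.
    for key, value in sorted(labels.items()):
        clauses.append(f'resource.labels.{key}="{value}"')
    if min_severity:
        clauses.append(f"severity>={min_severity}")
    return " AND ".join(clauses)


def fetch_entries(project: str, log_filter: str, limit: int = 50):
    """Returns up to `limit` matching entries (requires GCP credentials)."""
    # Imported lazily so the filter builder works without the client library.
    from google.cloud import logging as cloud_logging

    client = cloud_logging.Client(project=project)
    return list(client.list_entries(filter_=log_filter, max_results=limit))


if __name__ == "__main__":
    # Assumption: "ml_job" / "job_id" identify a Vertex AI custom job;
    # the job ID below is a placeholder.
    print(build_log_filter("ml_job", {"job_id": "1234567890"}, "ERROR"))
```

If `fetch_entries` returns nothing for a job that clearly ran, that would suggest the entries never reached Cloud Logging, rather than being hidden by the console view.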

@crbl1122
Author

crbl1122 commented Feb 5, 2024

@lego0901 I confirm that no logs or errors are seen, either for components that run successfully or for ones that crash during execution.

@lego0901
Member

lego0901 commented Feb 6, 2024

I would like to express my gratitude for your confirmation.

May I request further information from you so that we can conduct a more thorough investigation into this matter? Since we are unable to reproduce the issue on our end (despite the fact that numerous users are encountering the same problem), we require additional input regarding your specific situation.

Could you kindly provide responses to the following questions:

  1. Did this phenomenon occur prior to TFX version 1.14.0?
    If not, we can confirm that this is an issue with the TFX codebase, which will allow us to narrow down our investigation.

  2. Could you please provide more detailed information about your running environment?
    I would like to have the output of the pip freeze command in its entirety so that I can attempt to reproduce the issue in my own environment.

  3. In the scenario you described, would it be possible for you to provide me with a simple example code that reproduces the error?
    Even a very brief pipeline with a single component would be sufficient.

Thank you very much for your assistance.
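As a starting point for such a reproduction, the body of a single component could be as small as the sketch below. It is plain Python with no TFX imports, and the function name `emit_probe` and the marker format are hypothetical. The idea is to write one uniquely tagged line through each common output path (stdout, stderr, and the `logging` module) so that a search for the marker in Logs Explorer shows which paths, if any, reach Cloud Logging.

```python
import logging
import sys
import uuid


def emit_probe(marker=None):
    """Writes a tagged line via print and logging; returns the marker used."""
    marker = marker or f"log-probe-{uuid.uuid4().hex[:8]}"

    # Path 1 and 2: raw writes to the container's stdout and stderr.
    print(f"{marker} via print/stdout", flush=True)
    print(f"{marker} via print/stderr", file=sys.stderr, flush=True)

    # Path 3: the standard logging module, with an explicit stderr handler
    # so the result does not depend on how the root logger was configured.
    logger = logging.getLogger("log_probe")
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    try:
        logger.info("%s via logging/INFO", marker)
        logger.error("%s via logging/ERROR", marker)
    finally:
        logger.removeHandler(handler)
    return marker
```

Calling `emit_probe()` at the start of a component and then searching for the returned marker would help separate "nothing is captured at all" from "only one stream or severity is missing".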

@adriangay

adriangay commented Feb 9, 2024

We also do not see anything in Error Reporting.

We did not see this before TFX 1.14.

We manage dependencies using Poetry. GitHub does not support uploading lock files, but here is the output of pip freeze:

absl-py==1.4.0
anyio==4.2.0
apache-beam==2.50.0
appnope==0.1.3
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
astunparse==1.6.3
attrs==21.4.0
backcall==0.2.0
beautifulsoup4==4.12.2
bleach==6.1.0
cachetools==5.3.2
certifi==2023.11.17
cffi==1.16.0
cfgv==3.4.0
charset-normalizer==3.3.2
click==8.1.7
cloudpickle==2.2.1
comm==0.2.1
crcmod==1.7
debugpy==1.8.0
decorator==5.1.1
defusedxml==0.7.1
Deprecated==1.2.14
dill==0.3.1.1
distlib==0.3.8
dnspython==2.4.2
docker==4.2.2
docopt==0.6.2
docstring-parser==0.15
dynaconf==3.2.4
entrypoints==0.4
exceptiongroup==1.2.0
fastavro==1.9.3
fasteners==0.19
fastjsonschema==2.19.1
filelock==3.13.1
fire==0.5.0
flake8==3.9.2
flatbuffers==23.5.26
fqdn==1.5.1
gast==0.4.0
gensim==4.3.2
google-api-core==2.15.0
google-api-python-client==1.12.11
google-apitools==0.5.31
google-auth==2.26.1
google-auth-httplib2==0.1.1
google-auth-oauthlib==1.0.0
google-cloud-aiplatform==1.39.0
google-cloud-bigquery==2.34.4
google-cloud-bigquery-storage==2.24.0
google-cloud-bigtable==2.22.0
google-cloud-core==2.4.1
google-cloud-datastore==2.19.0
google-cloud-dlp==3.14.0
google-cloud-language==2.12.0
google-cloud-pubsub==2.19.0
google-cloud-pubsublite==1.9.0
google-cloud-recommendations-ai==0.10.6
google-cloud-resource-manager==1.11.0
google-cloud-spanner==3.40.1
google-cloud-storage==2.14.0
google-cloud-videointelligence==2.12.0
google-cloud-vision==3.5.0
google-crc32c==1.5.0
google-pasta==0.2.0
google-resumable-media==2.7.0
googleapis-common-protos==1.62.0
grpc-google-iam-v1==0.13.0
grpcio==1.60.0
grpcio-status==1.48.2
h5py==3.10.0
hdfs==2.7.3
httplib2==0.22.0
identify==2.5.33
idna==3.6
iniconfig==2.0.0
ipykernel==6.28.0
ipython==7.34.0
ipython-genutils==0.2.0
ipywidgets==7.8.1
isoduration==20.11.0
jedi==0.19.1
Jinja2==3.1.2
joblib==1.3.2
jsonpointer==2.4
jsonschema==4.17.3
jupyter-events==0.6.3
jupyter_client==7.4.9
jupyter_core==5.7.1
jupyter_server==2.10.0
jupyter_server_terminals==0.5.1
jupyterlab-widgets==1.1.7
jupyterlab_pygments==0.3.0
keras==2.13.1
keras-tuner==1.4.6
kfp==1.8.22
kfp-pipeline-spec==0.1.16
kfp-server-api==1.8.5
kt-legacy==1.0.5
kubernetes==12.0.1
libclang==16.0.6
llvmlite==0.41.1
Markdown==3.5.1
MarkupSafe==2.1.3
matplotlib-inline==0.1.6
mccabe==0.6.1
mistune==3.0.2
ml-metadata==1.14.0
ml-pipelines-sdk==1.14.0
mock==4.0.3
nbclassic==1.0.0
nbclient==0.9.0
nbconvert==7.14.0
nbformat==5.9.2
nest-asyncio==1.5.8
nodeenv==1.8.0
notebook==6.5.6
notebook_shim==0.2.3
nptyping==2.5.0
numba==0.58.1
numba-progress==1.1.0
numpy==1.24.3
oauth2client==4.1.3
oauthlib==3.2.2
objsize==0.6.1
opt-einsum==3.3.0
orjson==3.9.10
overrides==7.4.0
packaging==20.9
pandas==1.5.3
pandocfilters==1.5.0
parso==0.8.3
pecanpy==2.0.8
pexpect==4.9.0
pickleshare==0.7.5
pillow==10.2.0
platformdirs==4.1.0
pluggy==1.3.0
portpicker==1.6.0
pre-commit==2.13.0
prometheus-client==0.19.0
prompt-toolkit==3.0.43
proto-plus==1.23.0
protobuf==3.20.3
psutil==5.9.7
ptyprocess==0.7.0
pyarrow==10.0.1
pyasn1==0.5.1
pyasn1-modules==0.3.0
pycodestyle==2.7.0
pycparser==2.21
pydantic==1.10.13
pydot==1.4.2
pyfarmhash==0.3.2
pyflakes==2.3.1
Pygments==2.17.2
pymongo==4.6.1
pyparsing==3.1.1
pyrsistent==0.20.0
pytest==7.4.0
python-dateutil==2.8.2
python-json-logger==2.0.7
pytz==2023.3.post1
PyYAML==6.0.1
pyzmq==24.0.1
regex==2023.12.25
requests==2.31.0
requests-oauthlib==1.3.1
requests-toolbelt==0.10.1
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rsa==4.9
scikit-learn==1.3.2
scipy==1.11.4
Send2Trash==1.8.2
Shapely==1.8.5.post1
six==1.16.0
smart-open==6.4.0
sniffio==1.3.0
soupsieve==2.5
sqlparse==0.4.4
strip-hints==0.1.10
tabulate==0.9.0
tensorboard==2.13.0
tensorboard-data-server==0.7.2
tensorflow==2.13.1
tensorflow-addons==0.23.0
tensorflow-data-validation==1.14.0
tensorflow-estimator==2.13.0
tensorflow-hub==0.13.0
tensorflow-io-gcs-filesystem==0.35.0
tensorflow-metadata==1.14.0
tensorflow-model-analysis==0.45.0
tensorflow-serving-api==2.13.1
tensorflow-transform==1.14.0
termcolor==2.4.0
terminado==0.18.0
tfx==1.14.0
tfx-bsl==1.14.0
threadpoolctl==3.2.0
tinycss2==1.2.1
toml==0.10.2
tomli==2.0.1
tornado==6.4
tqdm==4.66.1
traitlets==5.14.1
typeguard==2.13.3
typer==0.9.0
types-python-dateutil==2.8.19.20240106
typing_extensions==4.5.0
uri-template==1.3.0
uritemplate==3.0.1
urllib3==1.26.18
virtualenv==20.25.0
wcwidth==0.2.13
webcolors==1.13
webencodings==0.5.1
websocket-client==1.7.0
Werkzeug==3.0.1
widgetsnbextension==3.6.6
wrapt==1.16.0
zstandard==0.22.0

@crbl1122
Author

crbl1122 commented Feb 9, 2024

> I would like to express my gratitude for your confirmation.
>
> May I request further information from you so that we can conduct a more thorough investigation into this matter? Since we are unable to reproduce the issue on our end (despite the fact that numerous users are encountering the same problem), we require additional input regarding your specific situation.
>
> Could you kindly provide responses to the following questions:
>
> 1. Did this phenomenon occur prior to TFX version 1.14.0?
>    If not, we can confirm that this is an issue with the TFX codebase, which will allow us to narrow down our investigation.
>
> 2. Could you please provide more detailed information about your running environment?
>    I would like to have the output of the `pip freeze` command in its entirety so that I can attempt to reproduce the issue in my own environment.
>
> 3. In the scenario you described, would it be possible for you to provide me with a simple example code that reproduces the error?
>    Even a very brief pipeline with a single component would be sufficient.
>
> Thank you very much for your assistance.

Hi,

TFX==1.12.0.
The problem occurs with any standard TFX component.

Output of pip freeze:
absl-py==1.4.0
aiohttp-cors==0.7.0
aiorwlock==1.3.0
ansiwrap==0.8.4
apache-beam==2.45.0
astunparse==1.6.3
asynctest==0.13.0
attrs==20.3.0
Babel==2.12.1
backoff==2.2.1
blessed==1.20.0
cachetools==4.2.4
certifi==2023.7.22
click==8.1.7
cloud-tpu-client==0.10
cloud-tpu-profiler==2.4.0
cloudpickle==2.2.1
colorama==0.4.6
colorful==0.5.5
comm==0.1.4
conda==22.9.0
crcmod==1.7
cycler==0.11.0
Cython==3.0.2
dacite==1.8.1
db-dtypes==1.1.1
Deprecated==1.2.14
dill==0.3.1.1
distlib==0.3.7
dm-tree==0.1.8
docker==4.4.4
docopt==0.6.2
docstring-parser==0.15
etils==0.9.0
explainable-ai-sdk==1.3.3
Farama-Notifications==0.0.4
fastapi==0.103.1
fastavro==1.8.0
fasteners==0.19
filelock==3.12.2
flatbuffers==2.0.7
fonttools==4.38.0
fsspec==2023.1.0
future==0.18.3
gast==0.3.3
gcsfs==2023.1.0
gitdb==4.0.10
GitPython==3.1.37
google-api-core==1.34.0
google-api-python-client==1.8.0
google-apitools==0.5.31
google-auth-httplib2==0.1.1
google-auth-oauthlib==0.4.6
google-cloud-aiplatform==1.17.1
google-cloud-artifact-registry==1.8.3
google-cloud-bigquery==2.34.4
google-cloud-bigquery-storage==2.16.2
google-cloud-bigtable==1.7.3
google-cloud-dlp==3.9.2
google-cloud-language==1.3.2
google-cloud-monitoring==2.15.1
google-cloud-pubsub==2.13.11
google-cloud-pubsublite==1.6.0
google-cloud-recommendations-ai==0.7.1
google-cloud-resource-manager==1.6.3
google-cloud-spanner==3.26.0
google-cloud-storage==2.11.0
google-cloud-videointelligence==1.16.3
google-cloud-vision==3.1.4
google-crc32c==1.5.0
google-pasta==0.2.0
google-resumable-media==2.6.0
gpustat==1.0.0
greenlet==2.0.2
grpc-google-iam-v1==0.12.6
grpcio==1.58.0
gviz-api==1.10.0
gymnasium==0.28.1
h11==0.14.0
h5py==2.10.0
hdfs==2.7.2
htmlmin==0.1.12
httplib2==0.20.4
ImageHash==4.3.1
imageio==2.31.2
importlib-resources==5.12.0
ipython-genutils==0.2.0
ipython-sql==0.5.0
ipywidgets==7.8.1
jaraco.classes==3.2.3
jax-jumpy==1.0.0
jeepney==0.8.0
Jinja2==2.11.3
joblib==1.3.2
json5==0.9.14
jupyter-http-over-ws==0.0.8
jupyter-server-mathjax==0.2.6
jupyter-server-proxy==3.2.2
jupyterlab==3.4.8
jupyterlab-widgets==1.1.7
jupyterlab_git==0.43.0
jupyterlab_server==2.24.0
jupytext==1.15.2
keras==2.11.0
keras-core==0.0.0
Keras-Preprocessing==1.1.2
keras-tuner==1.4.1
keyring==24.1.1
keyrings.google-artifactregistry-auth==1.1.2
kfp==2.6.0
kfp-pipeline-spec==0.3.0
kfp-server-api==2.0.5
kiwisolver==1.4.5
kt-legacy==1.0.5
kubernetes==11.0.0
libclang==16.0.6
llvmlite==0.39.1
lz4==4.3.2
Markdown==3.4.4
markdown-it-py==2.2.0
MarkupSafe==2.0.1
matplotlib==3.5.3
mdit-py-plugins==0.3.5
mdurl==0.1.2
mistune==0.8.4
ml-metadata==1.12.0
ml-pipelines-sdk==1.12.0
more-itertools==9.1.0
msgpack==1.0.5
multimethod==1.9.1
nbclient==0.5.13
nbconvert==6.4.5
nbdime==3.2.0
networkx==2.6.3
numba==0.56.4
numpy==1.21.6
nvidia-ml-py==11.495.46
oauth2client==4.1.3
oauthlib==3.2.2
objsize==0.6.1
opencensus==0.11.3
opencensus-context==0.1.3
opentelemetry-api==1.20.0
opentelemetry-exporter-otlp==1.20.0
opentelemetry-exporter-otlp-proto-common==1.20.0
opentelemetry-exporter-otlp-proto-grpc==1.20.0
opentelemetry-exporter-otlp-proto-http==1.20.0
opentelemetry-proto==1.20.0
opentelemetry-sdk==1.20.0
opentelemetry-semantic-conventions==0.41b0
opt-einsum==3.3.0
orjson==3.9.7
overrides==6.5.0
packaging==20.9
pandas==1.3.5
pandas-profiling==3.6.6
papermill==2.4.0
patsy==0.5.3
phik==0.12.3
Pillow==9.5.0
platformdirs==3.10.0
plotly==5.17.0
pluggy==1.2.0
portpicker==1.6.0
prettytable==3.7.0
promise==2.3
proto-plus==1.22.3
protobuf==3.20.1
py-spy==0.3.14
pyarrow==6.0.1
pydantic==1.10.12
pydot==1.4.2
pyfarmhash==0.3.2
PyJWT==2.8.0
pymongo==3.13.0
pyparsing==3.1.1
pytz==2023.3.post1
PyWavelets==1.3.0
PyYAML==5.4.1
ray==2.7.0
ray-cpp==2.7.0
regex==2023.8.8
requests-oauthlib==1.3.1
requests-toolbelt==0.10.1
retrying==1.3.3
rich==13.5.3
scikit-image==0.19.3
scikit-learn==1.0.2
scipy==1.7.3
seaborn==0.12.2
SecretStorage==3.3.3
simpervisor==0.4
smart-open==6.4.0
smmap==5.0.1
SQLAlchemy==2.0.21
sqlparse==0.4.4
starlette==0.27.0
statsmodels==0.13.5
tabulate==0.9.0
tangled-up-in-unicode==0.2.0
tenacity==8.2.3
tensorboard==2.11.2
tensorboard-data-server==0.6.1
tensorboard-plugin-profile==2.13.1
tensorboard-plugin-wit==1.8.1
tensorboardX==2.6
tensorflow==2.11.0
tensorflow-cloud==0.1.16
tensorflow-data-validation==1.12.0
tensorflow-datasets==4.8.2
tensorflow-estimator==2.11.0
tensorflow-hub==0.9.0
tensorflow-io==0.29.0
tensorflow-io-gcs-filesystem==0.29.0
tensorflow-metadata==1.12.0
tensorflow-model-analysis==0.43.0
tensorflow-probability==0.19.0
tensorflow-serving-api==2.11.0
tensorflow-transform==1.12.0
termcolor==2.3.0
testpath==0.6.0
textwrap3==0.9.2
tfx==1.12.0
tfx-bsl==1.12.0
threadpoolctl==3.1.0
tifffile==2021.11.2
toml==0.10.2
tomli==2.0.1
tqdm==4.66.1
typeguard==2.13.3
typer==0.9.0
uritemplate==3.0.1
uvicorn==0.22.0
virtualenv==20.21.0
visions==0.7.5
watchfiles==0.20.0
Werkzeug==2.1.2
widgetsnbextension==3.6.6
witwidget==1.8.1
wordcloud==1.9.2
wrapt==1.15.0
ydata-profiling==4.5.1
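Since two full `pip freeze` outputs are now posted in this thread, a quick way to compare them is to diff the pinned versions. The sketch below is only an illustration; the helper names `parse_freeze` and `diff_freezes` are hypothetical, and the sample pins are a handful of lines copied from the two lists above.

```python
def parse_freeze(text):
    """Parses `pip freeze` output into a {normalized_name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not a plain `name==version` pin.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower().replace("_", "-")] = version.strip()
    return pins


def diff_freezes(first, second):
    """Returns {name: (version_in_first, version_in_second)} for differing pins."""
    a, b = parse_freeze(first), parse_freeze(second)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


if __name__ == "__main__":
    env_a = "tfx==1.14.0\nkfp==1.8.22\ngoogle-cloud-aiplatform==1.39.0\ndill==0.3.1.1"
    env_b = "tfx==1.12.0\nkfp==2.6.0\ngoogle-cloud-aiplatform==1.17.1\ndill==0.3.1.1"
    for name, (va, vb) in diff_freezes(env_a, env_b).items():
        print(f"{name}: {va} vs {vb}")
```

A package missing from one environment shows up with `None` on that side, which makes it easy to spot libraries installed in only one of the two setups.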

@lego0901
Member

Thanks for providing your environments!

However, I was not able to reproduce the phenomenon with either configuration, using the VAI example running locally:

  • From @adriangay: The component logs were displayed below the screen.
  • From @crbl1122: I was not able to install the dependencies; see the error message below.

I suspect that some configuration outside TFX is outdated, which is why the logs are not displayed. Let me contact a Vertex AI team engineer internally to figure out the problem. Thank you.

Screenshot 2024-02-16 at 4 52 17 PM
ERROR: Ignored the following yanked versions: 3.0.6, 3.5.0, 3.7.0, 3.17.0, 4.0.0, 4.0.1, 4.0.2, 4.0.3, 4.0.4, 4.0.5, 4.0.7, 4.0.8, 4.0.9, 4.1.2, 4.1.6, 4.2.6, 4.2.7, 4.3.13, 4.3.16
ERROR: Ignored the following versions that require a different python version: 2.10.0 Requires-Python >=2.7,<3.0; 2.3.0 Requires-Python >=2.7,<3.0; 2.4.0 Requires-Python >=2.7,<3.0; 2.5.0 Requires-Python >=2.7,<3.0; 2.6.0 Requires-Python >=2.7,<3.0; 2.7.0 Requires-Python >=2.7,<3.0; 2.8.0 Requires-Python >=2.7,<3.0; 2.9.0 Requires-Python >=2.7,<3.0
ERROR: Could not find a version that satisfies the requirement conda==22.9.0 (from versions: none)
ERROR: No matching distribution found for conda==22.9.0
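The error above says pip found no installable versions of `conda` (`from versions: none`), so the posted list cannot be installed as-is. One possible workaround, sketched here as an assumption rather than something tried in this thread, is to drop such pins before running `pip install -r`. The helper name `drop_pins` is hypothetical.

```python
def drop_pins(requirement_lines, excluded_names):
    """Returns the requirement lines whose package name is not excluded."""
    excluded = {name.lower().replace("_", "-") for name in excluded_names}
    kept = []
    for line in requirement_lines:
        stripped = line.strip()
        if not stripped:
            continue
        # The package name is everything before the `==` pin.
        name = stripped.split("==", 1)[0].strip().lower().replace("_", "-")
        if name not in excluded:
            kept.append(stripped)
    return kept


if __name__ == "__main__":
    lines = ["comm==0.1.4", "conda==22.9.0", "crcmod==1.7"]
    print("\n".join(drop_pins(lines, ["conda"])))
```

The filtered lines can be written to a new requirements file, leaving the original `pip freeze` output untouched for reference.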

@adriangay

@lego0901 Hi, thank you for pursuing this. I see you cannot reproduce with the Penguin Example. I will try to reproduce with a simple pipeline to further aid problem determination...

@crbl1122
Author

@lego0901 Hi, I want to add that the same problem occurs for Apache Beam jobs in Dataflow. No logs are displayed. So far, apart from Kubeflow, none of the pipeline types I tested (TFX, Dataflow/Beam) produces any logs.
