No reloads when using S3 compatible storage #6712

Open
AlonKellner opened this issue Dec 30, 2023 · 5 comments
Labels
core:notf Things related to No TensorFlow mode.

Comments

AlonKellner commented Dec 30, 2023

In addition to this issue, I posted a question on Stack Overflow.

Environment information

Diagnostics

Diagnostics output
--- check: autoidentify
INFO: diagnose_tensorboard.py version df7af2c6fc0e4c4a5b47aeae078bc7ad95777ffa

--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=10, micro=13, releaselevel='final', serial=0)
INFO: os.name: posix
INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='c5ff1db54ce4', release='5.15.133.1-microsoft-standard-WSL2', version='#1 SMP Thu Oct 5 21:02:42 UTC 2023', machine='x86_64')
INFO: sys.getwindowsversion(): N/A

--- check: package_management
INFO: has conda-meta: False
INFO: $VIRTUAL_ENV: None

--- check: installed_packages
INFO: installed: tensorboard==2.15.1
WARNING: no installation among: ['tensorflow', 'tensorflow-gpu', 'tf-nightly', 'tf-nightly-2.0-preview', 'tf-nightly-gpu', 'tf-nightly-gpu-2.0-preview']
WARNING: no installation among: ['tensorflow-estimator', 'tensorflow-estimator-2.0-preview', 'tf-estimator-nightly']
INFO: installed: tensorboard-data-server==0.7.2

--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '2.15.1'

--- check: tensorflow_python_version
Traceback (most recent call last):
  File "//diagnose_tensorboard.py", line 511, in main
    suggestions.extend(check())
  File "//diagnose_tensorboard.py", line 81, in wrapper
    result = fn()
  File "//diagnose_tensorboard.py", line 267, in tensorflow_python_version
    import tensorflow as tf
ModuleNotFoundError: No module named 'tensorflow'

--- check: tensorboard_data_server_version
INFO: data server binary: '/usr/local/lib/python3.10/site-packages/tensorboard_data_server/bin/server'
INFO: data server binary version: b'rustboard 0.7.2'

--- check: tensorboard_binary_path
INFO: which tensorboard: b'/usr/local/bin/tensorboard\n'

--- check: addrinfos
socket.has_ipv6 = True
socket.AF_UNSPEC = <AddressFamily.AF_UNSPEC: 0>
socket.SOCK_STREAM = <SocketKind.SOCK_STREAM: 1>
socket.AI_ADDRCONFIG = <AddressInfo.AI_ADDRCONFIG: 32>
socket.AI_PASSIVE = <AddressInfo.AI_PASSIVE: 1>
Loopback flags: <AddressInfo.AI_ADDRCONFIG: 32>
Loopback infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('127.0.0.1', 0))]
Wildcard flags: <AddressInfo.AI_PASSIVE: 1>
Wildcard infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('0.0.0.0', 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('::', 0, 0, 0))]

--- check: readable_fqdn
INFO: socket.getfqdn(): 'c5ff1db54ce4'

--- check: stat_tensorboardinfo
INFO: directory: /tmp/.tensorboard-info
INFO: .tensorboard-info directory does not exist

--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['/usr/local/lib/python3.10/site-packages']; bad_roots (0): []

--- check: full_pip_freeze
INFO: pip freeze --all:
absl-py==2.0.0
aiobotocore==2.9.0
aiohttp==3.9.1
aioitertools==0.11.0
aiosignal==1.3.1
async-timeout==4.0.3
attrs==23.1.0
botocore==1.33.13
cachetools==5.3.2
certifi==2023.11.17
charset-normalizer==3.3.2
filelock==3.13.1
frozenlist==1.4.1
fsspec==2023.12.2
google-auth==2.25.2
google-auth-oauthlib==1.2.0
grpcio==1.60.0
idna==3.6
Jinja2==3.1.2
jmespath==1.0.1
lightning==2.1.3
lightning-utilities==0.10.0
Markdown==3.5.1
MarkupSafe==2.1.3
mpmath==1.3.0
multidict==6.0.4
networkx==3.2.1
numpy==1.26.2
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.3.101
nvidia-nvtx-cu12==12.1.105
oauthlib==3.2.2
packaging==23.2
pip==23.0.1
protobuf==4.23.4
pyasn1==0.5.1
pyasn1-modules==0.3.0
python-dateutil==2.8.2
pytorch-lightning==2.1.3
PyYAML==6.0.1
requests==2.31.0
requests-oauthlib==1.3.1
rsa==4.9
s3fs==2023.12.2
setuptools==65.5.1
six==1.16.0
sympy==1.12
tensorboard==2.15.1
tensorboard-data-server==0.7.2
tensorflow-io==0.35.0
tensorflow-io-gcs-filesystem==0.35.0
torch==2.1.2
torchmetrics==1.2.1
tqdm==4.66.1
triton==2.1.0
typing_extensions==4.9.0
urllib3==2.0.7
Werkzeug==3.0.1
wheel==0.42.0
wrapt==1.16.0
yarl==1.9.4

Issue description

Here is a repository with a full reproduction of my issue:
https://github.com/AlonKellner/s3-tensorboard-issue-reproduction

When using TensorBoard with S3-compatible storage, only the first experiments that the server comes across are shown in the UI.
All experiments that are present during startup are shown fully. If no experiment is present during startup, the first detected experiment is shown only partially.
After an experiment is detected and shown, no further steps or experiments are reloaded and shown.
When using the --reload_task process option, no experiment is shown at all.

I have personally reproduced this unexpected behavior with both Ceph (an on-prem instance) and MinIO (a local Docker image, see the reproduction repo).

The expected behavior is that any new experiment that is written to the s3 compatible storage should be reloaded in the UI when pressing the reload button, as well as new steps in that new experiment.
Also, I expect this behavior to work correctly with the --reload_multifile=true option.

Workarounds are also welcome, thanks :)
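
The expected reload behavior described above can be illustrated with a small sketch. It uses a local temporary directory as a stand-in for the bucket and is not TensorBoard's actual loader; the helper names `discover_runs`, `reload`, and `demo` are hypothetical. The only TensorBoard convention it relies on is that event files have "tfevents" in their file name.

```python
import os
import tempfile


def discover_runs(logdir):
    """Return the run names (paths relative to logdir) that hold an event file."""
    runs = set()
    for dirpath, _dirnames, filenames in os.walk(logdir):
        if any("tfevents" in name for name in filenames):
            runs.add(os.path.relpath(dirpath, logdir))
    return runs


def reload(known_runs, logdir):
    """Re-list the logdir and return (newly_found_runs, all_current_runs)."""
    current = discover_runs(logdir)
    return current - known_runs, current


def demo():
    """Write one run, reload, write a second run, reload again."""
    with tempfile.TemporaryDirectory() as logdir:
        def write_run(name):
            run_dir = os.path.join(logdir, name)
            os.makedirs(run_dir)
            # Empty placeholder file; a real writer would append event records.
            open(os.path.join(run_dir, "events.out.tfevents.0"), "w").close()

        write_run("version_0")
        _, known = reload(set(), logdir)
        write_run("version_1")  # appears after the first listing
        added, known = reload(known, logdir)
        return added, known
```

In this model, a run written after startup is picked up by the next reload because the directory is listed again each time. The report above is that this second listing does not seem to take effect with S3-compatible storage.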

Member

arcra commented Jan 4, 2024

Can you clarify what you mean by "all experiments"? Are you referring to the runs? Could you share a screenshot, to make that clearer?

Do you see any errors in the console logs?

Could it be something related or similar to what is reported in #6713?

@AlonKellner
Author

> Can you clarify what you mean by "all experiments"? Are you referring to the runs? Could you share a screenshot, to make that clearer?

Yes, sorry for using the wrong terminology. When I wrote "experiments" I was referring to "runs".
As for screenshots, I've added them to the reproduction repo. There are 6 of them that together explain the full problem, so I won't share them all here, but here is the one from step-3, where there are 2 runs in MinIO but only 1 run is visible in TensorBoard:
[Screenshot: step-3]

> Do you see any errors in the console logs?

No, the only console logs are:

TensorFlow installation not found - running with reduced feature set.
TensorBoard 2.15.1 at http://2666274e9da3:6006/ (Press CTRL+C to quit)

> Could it be something related or similar to what is reported in #6713?

It does not seem like it, since I do not see any errors.

Member

arcra commented Jan 8, 2024

> Yes, sorry for using the wrong terminology, when I wrote "experiments" I was referring to "runs".

No worries, I just wanted to make sure I understand what the issue is correctly.

I believe it's (similarly to #6713) an issue with our "no-TF compatibility" implementation of the GFile interface, particularly the support for the S3 files. I believe a workaround might be to install tensorflow, so it would use the TF implementation. If you do, please confirm whether that solves the problem for you.

Unfortunately, we don't have the bandwidth to investigate this in more detail. Our compat support for the S3 filesystem is provided on a best-effort basis.

arcra added the core:notf label (Things related to No TensorFlow mode.) on Jan 8, 2024

@AlonKellner
Author

I tried installing tensorflow, but then I get this error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/lightning/fabric/loggers/tensorboard.py", line 208, in log_metrics
    self.experiment.add_scalar(k, v, step)
  File "/usr/local/lib/python3.10/site-packages/lightning/fabric/loggers/logger.py", line 118, in experiment
    return fn(self)
  File "/usr/local/lib/python3.10/site-packages/lightning/fabric/loggers/tensorboard.py", line 191, in experiment
    self._experiment = SummaryWriter(log_dir=self.log_dir, **self._kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py", line 243, in __init__
    self._get_file_writer()
  File "/usr/local/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py", line 273, in _get_file_writer
    self.file_writer = FileWriter(
  File "/usr/local/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py", line 72, in __init__
    self.event_writer = EventFileWriter(
  File "/usr/local/lib/python3.10/site-packages/tensorboard/summary/writer/event_file_writer.py", line 72, in __init__
    tf.io.gfile.makedirs(logdir)
  File "/usr/local/lib/python3.10/site-packages/tensorflow/python/lib/io/file_io.py", line 513, in recursive_create_dir_v2
    _pywrap_file_io.RecursivelyCreateDir(compat.path_to_bytes(path))
tensorflow.python.framework.errors_impl.UnimplementedError: File system scheme 's3' not implemented (file: 's3://tensorboard/test-test/version_0')

My workaround is a bad one: I wrote a simple bash script that restarts TensorBoard every minute. That way it reloads all runs every minute, which works for my use case.
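
For anyone who wants to try the same restart workaround, here is a minimal sketch of it in Python rather than bash. It is not the original script, which was not posted. The command line in `TENSORBOARD_CMD` is an assumption: the bucket name is taken from the traceback above, and the helper names `stop` and `run_with_periodic_restart` are hypothetical.

```python
import subprocess
import sys
import time

# Assumed command line: the bucket name comes from the traceback above;
# adjust --logdir (and any S3 endpoint environment variables) to your setup.
TENSORBOARD_CMD = [
    sys.executable, "-m", "tensorboard.main",
    "--logdir", "s3://tensorboard",
    "--bind_all",
]


def stop(proc, grace_s=10):
    """Ask the process to exit, then force-kill it if it ignores the request."""
    proc.terminate()
    try:
        proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()


def run_with_periodic_restart(cmd, interval_s=60, max_starts=None):
    """Run `cmd`, stopping and restarting it every `interval_s` seconds.

    Loops forever unless `max_starts` is given. Returns the number of
    times the command was started.
    """
    starts = 0
    while max_starts is None or starts < max_starts:
        proc = subprocess.Popen(cmd)
        starts += 1
        try:
            time.sleep(interval_s)
        finally:
            # Also runs on Ctrl+C, so no orphaned server is left behind.
            stop(proc)
    return starts
```

Calling `run_with_periodic_restart(TENSORBOARD_CMD, interval_s=60)` would approximate the one-minute loop described above. The UI is unavailable for a moment during each restart, so this remains a stopgap rather than a fix.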

Member

arcra commented Jan 23, 2024

Looks like there's a separate package that might provide support for that filesystem. Can you try installing tensorflow-io and see if that solves your problem?

Sources:
https://discuss.tensorflow.org/t/access-s3-on-tensorflow/8633
https://blog.ukjae.io/posts/enabling-s3-filesystem-support-for-tensorflow-serving/
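
Before trying the suggestion above, it may help to check which of the relevant packages the TensorBoard environment can actually see. The sketch below only tests whether modules are importable and does not touch S3. The module list is drawn from the packages named in this thread, and the wording of the summaries is an assumption based on the comments above, not on TensorBoard's source. The helper names are hypothetical.

```python
import importlib.util

# Top-level import names, not pip names
# (pip's "tensorflow-io" is imported as "tensorflow_io").
CANDIDATE_MODULES = ("tensorflow", "tensorflow_io", "fsspec", "s3fs")


def is_importable(module_name):
    """Return True if `module_name` can be located without importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


def report_s3_backends(modules=CANDIDATE_MODULES):
    """Map each candidate module name to whether it is installed."""
    return {name: is_importable(name) for name in modules}


def summarize(report):
    """Turn the report into a one-line hint (assumptions, see lead-in)."""
    if not report.get("tensorflow"):
        return "tensorflow missing: TensorBoard runs with its reduced feature set"
    if not report.get("tensorflow_io"):
        return "tensorflow present, tensorflow_io missing: s3:// may be unimplemented"
    return "tensorflow and tensorflow_io present: try importing tensorflow_io first"
```

Running `print(summarize(report_s3_backends()))` inside the same container as TensorBoard and the training job shows which situation applies. Note that the check has to run in the same Python environment as the process that opens the `s3://` path.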
