
Out of Memory while learning on Cifar with 100 clients #3238

Open
MilowB opened this issue Apr 8, 2024 · 1 comment
Labels
bug Something isn't working

Comments


MilowB commented Apr 8, 2024

Describe the bug

I launch 100 clients that are supposed to learn to classify images from the CIFAR-100 dataset.
I have 2 GPUs and 6 CPUs, and I enable GPU memory growth. Each client has access to 1 CPU and a whole GPU (2 x 32 GB of VRAM). I expect this to be enough GPU memory for the task!

During the first round, a device is created with ~31 GB of memory: 2024-04-08 14:49:08.296728: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 31141 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:1a:00.0, compute capability: 7.0

But later, at the second round, a new one is created with only 494 MB: (DefaultActor pid=460472) 2024-04-08 14:49:32.679854: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /device:GPU:0 with 494 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:1a:00.0, compute capability: 7.0

The OOM seems to come from the lack of memory on this 494 MB device.
Why isn't the memory released after the first round?

Steps/Code to Reproduce

client_resources = {
    "num_cpus": 1,
    "num_gpus": 1.0,
}

# Start the simulation
result = fl.simulation.start_simulation(
    client_fn=client_training_fn,
    num_clients=min_available_clients,
    config=fl.server.ServerConfig(num_rounds=FLAGS.num_rounds),
    strategy=strategy,
    ray_init_args=ray_server_config,
    client_resources=client_resources,
    actor_kwargs={
        "on_actor_init_fn": enable_tf_gpu_growth,  # Enable GPU growth upon actor init
    },
)
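
For context, the snippet above relies on a few names defined elsewhere in my script. A minimal sketch of those pieces is shown below; the enable_tf_gpu_growth import path follows the Flower simulation examples and may differ across Flower versions, and the ray_server_config values simply mirror the hardware described above (they are illustrative, not my exact config).

import flwr as fl
# Helper from the Flower simulation examples (import path may vary by Flower version)
from flwr.simulation.ray_transport.utils import enable_tf_gpu_growth

# Ray initialization arguments mirroring the resources described above (illustrative)
ray_server_config = {
    "num_cpus": 6,
    "num_gpus": 2,
}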

Expected Results

I expect the GPU memory to be released after each round so that the next round can run on the same GPU.

Actual Results

2024-04-08 14:48:23.651346: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x154d2c8b5aa0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-04-08 14:48:23.651385: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-32GB, Compute Capability 7.0
2024-04-08 14:48:23.725837: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
2024-04-08 14:49:08.293991: I tensorflow/core/grappler/devices.cc:66] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 2
2024-04-08 14:49:08.294123: I tensorflow/core/grappler/clusters/single_machine.cc:361] Starting new session
2024-04-08 14:49:08.296728: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 31141 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:1a:00.0, compute capability: 7.0
2024-04-08 14:49:08.297003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 31141 MB memory:  -> device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:1c:00.0, compute capability: 7.0
2024-04-08 14:49:09.506876: I tensorflow/core/grappler/devices.cc:66] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 2
2024-04-08 14:49:09.507014: I tensorflow/core/grappler/clusters/single_machine.cc:361] Starting new session
2024-04-08 14:49:09.509594: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 31141 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:1a:00.0, compute capability: 7.0
2024-04-08 14:49:09.509865: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 31141 MB memory:  -> device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:1c:00.0, compute capability: 7.0
2024-04-08 14:49:10.052274: W tensorflow/compiler/tf2tensorrt/convert/trt_optimization_pass.cc:186] Calibration with FP32 or FP16 is not implemented. Falling back to use_calibration = False.Note that the default value of use_calibration is True.
2024-04-08 14:49:10.192426: W tensorflow/compiler/tf2tensorrt/segment/segment.cc:970] 

TensorRT unsupported/non-converted OP Report:
	- NoOp -> 2x
	- Cast -> 1x
	- Identity -> 1x
	- Placeholder -> 1x
	- Total nonconverted OPs: 5
	- Total nonconverted OP Types: 4
For more information see https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#supported-ops.

2024-04-08 14:49:10.192981: W tensorflow/compiler/tf2tensorrt/segment/segment.cc:1298] The environment variable TF_TRT_MAX_ALLOWED_ENGINES=20 has no effect since there are only 1 TRT Engines with  at least minimum_segment_size=3 nodes.
2024-04-08 14:49:10.193039: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:799] Number of TensorRT candidate segments: 1
2024-04-08 14:49:10.300789: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:913] Replaced segment 0 consisting of 79 nodes by TRTEngineOp_000_000.
2024-04-08 14:49:13.086942: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:378] Ignored output_format.
2024-04-08 14:49:13.086983: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:381] Ignored drop_control_dependency.
2024-04-08 14:49:13.087426: I tensorflow/cc/saved_model/reader.cc:83] Reading SavedModel from: /linkhome/rech/genwej01/uov71di/experiments/examples/federated/cifar100/my_trials_2024-04-08--14:47:50_federatedListicCFL_strategy_minCl100_minFit100_learningRate0.001_nbEpoch1_procID0_dropout0.2_dataConfig1_isFLserverTrue_num_experiments1/expe_0/exported_models/1
2024-04-08 14:49:13.093707: I tensorflow/cc/saved_model/reader.cc:51] Reading meta graph with tags { serve }
2024-04-08 14:49:13.107495: I tensorflow/cc/saved_model/loader.cc:233] Restoring SavedModel bundle.
2024-04-08 14:49:13.317161: I tensorflow/cc/saved_model/loader.cc:316] SavedModel load for tags { serve }; Status: success: OK. Took 229736 microseconds.

INFO flwr 2024-04-08 14:49:27,263 | server.py:104 | FL starting
INFO:flwr:FL starting
DEBUG flwr 2024-04-08 14:49:30,687 | server.py:222 | fit_round 1: strategy sampled 100 clients (out of 100)
DEBUG:flwr:fit_round 1: strategy sampled 100 clients (out of 100)

(DefaultActor pid=460472)   if distutils.version.LooseVersion(
(DefaultActor pid=460472) /usr/local/lib/python3.11/dist-packages/tensorflow_probability/python/__init__.py:57: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
(DefaultActor pid=460472)   if (distutils.version.LooseVersion(tf.__version__) <
(DefaultActor pid=460472) /usr/local/lib/python3.11/dist-packages/tf_agents/utils/common.py:91: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
(DefaultActor pid=460472)   distutils.version.LooseVersion(tf.__version__)
(DefaultActor pid=460472) /gpfsssd/jobscratch/uov71di_1419691/session_2024-04-08_14-48-11_378165_459197/runtime_resources/py_modules_files/_ray_pkg_aa459d48e12ddf8a/deeplearningtools/tools/gpu.py:78: DeprecationWarning: invalid escape sequence '\ '
(DefaultActor pid=460472)   f.write('\n    -> /!\ Layer not tensorcore compliant (index, name, input, output):'+str((i,layer.name,layer.input_shape, layer.output_shape)))
(DefaultActor pid=460472) /usr/local/lib/python3.11/dist-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning: 
(DefaultActor pid=460472) 
(DefaultActor pid=460472) TensorFlow Addons (TFA) has ended development and introduction of new features.
(DefaultActor pid=460472) TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
(DefaultActor pid=460472) Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 
(DefaultActor pid=460472) 
(DefaultActor pid=460472) For more information see: https://github.com/tensorflow/addons/issues/2807 
(DefaultActor pid=460472) 
(DefaultActor pid=460472)   warnings.warn(
(DefaultActor pid=460472) /usr/local/lib/python3.11/dist-packages/tensorflow_model_optimization/__init__.py:65: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
(DefaultActor pid=460472)   if (distutils.version.LooseVersion(tf.version.VERSION) <

(DefaultActor pid=460472) 2024-04-08 14:49:32.679854: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /device:GPU:0 with 494 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:1a:00.0, compute capability: 7.0
(DefaultActor pid=460472) 2024-04-08 14:49:32.681863: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 494 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:1a:00.0, compute capability: 7.0
(DefaultActor pid=460472) 2024-04-08 14:49:32.683875: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /device:GPU:0 with 494 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:1a:00.0, compute capability: 7.0
(DefaultActor pid=460472) 2024-04-08 14:49:32.685186: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /device:GPU:0 with 494 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:1a:00.0, compute capability: 7.0

(DefaultActor pid=460472) 2024-04-08 14:49:37.205480: I tensorflow/tsl/profiler/lib/profiler_session.cc:104] Profiler session initializing.
(DefaultActor pid=460472) 2024-04-08 14:49:37.205519: I tensorflow/tsl/profiler/lib/profiler_session.cc:119] Profiler session started.
(DefaultActor pid=460472) 2024-04-08 14:49:37.205559: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_tracer.cc:1694] Profiler found 1 GPUs
(DefaultActor pid=460472) 2024-04-08 14:49:37.242056: I tensorflow/tsl/profiler/lib/profiler_session.cc:131] Profiler session tear down.
(DefaultActor pid=460472) 2024-04-08 14:49:37.242199: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_tracer.cc:1828] CUPTI activity buffer flushed
(DefaultActor pid=460472) 2024-04-08 14:49:37.288573: I tensorflow/tsl/profiler/lib/profiler_session.cc:104] Profiler session initializing.
(DefaultActor pid=460472) 2024-04-08 14:49:37.288606: I tensorflow/tsl/profiler/lib/profiler_session.cc:119] Profiler session started.
(DefaultActor pid=460472) 2024-04-08 14:49:37.324481: I tensorflow/tsl/profiler/lib/profiler_session.cc:131] Profiler session tear down.
(DefaultActor pid=460472) 2024-04-08 14:49:37.324669: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_tracer.cc:1828] CUPTI activity buffer flushed
(DefaultActor pid=460472) /usr/local/lib/python3.11/dist-packages/flwr/simulation/ray_transport/ray_actor.py:72: DeprecationWarning:  Ensure your client is of type `flwr.client.Client`. Please convert it using the `.to_client()` method before returning it in the `client_fn` you pass to `start_simulation`. We have applied this conversion on your behalf. Not returning a `Client` might trigger an error in future versions of Flower.
(DefaultActor pid=460472)   client = check_clientfn_returns_client(client_fn(cid))
(DefaultActor pid=460472) 2024-04-08 14:49:37.446197: I tensorflow/tsl/profiler/lib/profiler_session.cc:104] Profiler session initializing.
(DefaultActor pid=460472) 2024-04-08 14:49:37.446238: I tensorflow/tsl/profiler/lib/profiler_session.cc:119] Profiler session started.
(DefaultActor pid=460472) 2024-04-08 14:49:37.482799: I tensorflow/tsl/profiler/lib/profiler_session.cc:131] Profiler session tear down.
(DefaultActor pid=460472) 2024-04-08 14:49:37.482979: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_tracer.cc:1828] CUPTI activity buffer flushed
(DefaultActor pid=460472) 2024-04-08 14:49:39.050371: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:961] layout failed: INVALID_ARGUMENT: Size of values 0 does not match size of permutation 4 @ fanin shape insequential/dropout/dropout/SelectV2-2-TransposeNHWCToNCHW-LayoutOptimizer
(DefaultActor pid=460472) 2024-04-08 14:49:39.376274: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:442] Loaded cuDNN version 8600
(DefaultActor pid=460472) 2024-04-08 14:49:39.650979: W tensorflow/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 563.19MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
(DefaultActor pid=460472) 2024-04-08 14:49:39.663984: W tensorflow/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 602.12MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
(DefaultActor pid=460472) 2024-04-08 14:49:39.664051: W tensorflow/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 602.12MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
(DefaultActor pid=460472) 2024-04-08 14:49:39.667988: W tensorflow/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.08GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
(DefaultActor pid=460472) 2024-04-08 14:49:39.678127: W tensorflow/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 602.12MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
(DefaultActor pid=460472) 2024-04-08 14:49:39.678184: W tensorflow/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 602.12MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
(DefaultActor pid=460472) 2024-04-08 14:49:39.682286: W tensorflow/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.08GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
(DefaultActor pid=460472) 2024-04-08 14:49:40.488444: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x15761a00 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
(DefaultActor pid=460472) 2024-04-08 14:49:40.488503: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-32GB, Compute Capability 7.0
(DefaultActor pid=460472) 2024-04-08 14:49:40.562673: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
(DefaultActor pid=460472) 2024-04-08 14:49:40.667952: W tensorflow/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 602.12MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
(DefaultActor pid=460472) 2024-04-08 14:49:40.683735: W tensorflow/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.08GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
(DefaultActor pid=460472) 2024-04-08 14:49:40.683801: W tensorflow/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.08GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
MilowB added the bug label on Apr 8, 2024
@jafermarq
Contributor

Hi @MilowB, I wonder if this is because you are also making use of the model in the main thread (i.e. where the strategy/server runs), which causes that process to use the whole GPU (TensorFlow's default behaviour). Note that in the examples/simulation-tensorflow example we also call enable_tf_gpu_growth() outside start_simulation(). See this line. I hope this fixes your issue!
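
A minimal sketch of that suggestion, assuming the same enable_tf_gpu_growth helper used in examples/simulation-tensorflow (the exact import path may differ across Flower versions):

import flwr as fl
from flwr.simulation.ray_transport.utils import enable_tf_gpu_growth

# Enable memory growth in the main process as well, so the server/strategy side
# does not let TensorFlow pre-allocate (almost) all of the GPU memory by default.
enable_tf_gpu_growth()

history = fl.simulation.start_simulation(
    client_fn=client_training_fn,   # same arguments as in the report above
    num_clients=min_available_clients,
    config=fl.server.ServerConfig(num_rounds=FLAGS.num_rounds),
    strategy=strategy,
    ray_init_args=ray_server_config,
    client_resources=client_resources,
    actor_kwargs={"on_actor_init_fn": enable_tf_gpu_growth},  # ...and in every Ray actor
)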
