
Out of Memory while learning on Cifar with 100 clients #3238

Open
MilowB opened this issue Apr 8, 2024 · 1 comment
Labels
bug Something isn't working

Comments


MilowB commented Apr 8, 2024

Describe the bug

I launch 100 clients that are supposed to learn to classify images from the CIFAR-100 dataset.
I have 2 GPUs and 6 CPUs, and I enable GPU memory growth. Each client has access to 1 CPU and a whole GPU (2 x 32 GB of VRAM). I expect this to be enough GPU memory for the task!

During the first round, a device is created with ~31 GB of memory: 2024-04-08 14:49:08.296728: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 31141 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:1a:00.0, compute capability: 7.0

But later, at the second round, a new one is created with only 494 MB: (DefaultActor pid=460472) 2024-04-08 14:49:32.679854: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /device:GPU:0 with 494 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:1a:00.0, compute capability: 7.0

The OOM seems to come from the lack of memory on this 494 MB device.
Why isn't the memory released after the first round?

Steps/Code to Reproduce

client_resources = {
    "num_cpus": 1,
    "num_gpus": 1.0,
}

# Start the simulation
result = fl.simulation.start_simulation(
    client_fn=client_training_fn,
    num_clients=min_available_clients,
    config=fl.server.ServerConfig(num_rounds=FLAGS.num_rounds),
    strategy=strategy,
    ray_init_args=ray_server_config,
    client_resources=client_resources,
    actor_kwargs={
        "on_actor_init_fn": enable_tf_gpu_growth,  # Enable GPU growth upon actor init
    },
)
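
For context, the snippet above relies on a few names defined elsewhere in my script. A minimal sketch of those pieces is shown below; the enable_tf_gpu_growth import path follows the Flower simulation examples and may differ across Flower versions, and the ray_server_config values simply mirror the hardware described above (they are illustrative, not my exact config).

import flwr as fl
# Helper from the Flower simulation examples (import path may vary by Flower version)
from flwr.simulation.ray_transport.utils import enable_tf_gpu_growth

# Ray initialization arguments mirroring the resources described above (illustrative)
ray_server_config = {
    "num_cpus": 6,
    "num_gpus": 2,
}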

Expected Results

I expect the GPU memory to be released after each round so that the next round can run on the same GPU.

Actual Results

2024-04-08 14:48:23.651346: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x154d2c8b5aa0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-04-08 14:48:23.651385: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-32GB, Compute Capability 7.0
2024-04-08 14:48:23.725837: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
2024-04-08 14:49:08.293991: I tensorflow/core/grappler/devices.cc:66] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 2
2024-04-08 14:49:08.294123: I tensorflow/core/grappler/clusters/single_machine.cc:361] Starting new session
2024-04-08 14:49:08.296728: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 31141 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:1a:00.0, compute capability: 7.0
2024-04-08 14:49:08.297003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 31141 MB memory:  -> device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:1c:00.0, compute capability: 7.0
2024-04-08 14:49:09.506876: I tensorflow/core/grappler/devices.cc:66] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 2
2024-04-08 14:49:09.507014: I tensorflow/core/grappler/clusters/single_machine.cc:361] Starting new session
2024-04-08 14:49:09.509594: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 31141 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:1a:00.0, compute capability: 7.0
2024-04-08 14:49:09.509865: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 31141 MB memory:  -> device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:1c:00.0, compute capability: 7.0
2024-04-08 14:49:10.052274: W tensorflow/compiler/tf2tensorrt/convert/trt_optimization_pass.cc:186] Calibration with FP32 or FP16 is not implemented. Falling back to use_calibration = False.Note that the default value of use_calibration is True.
2024-04-08 14:49:10.192426: W tensorflow/compiler/tf2tensorrt/segment/segment.cc:970] 

TensorRT unsupported/non-converted OP Report:
	- NoOp -> 2x
	- Cast -> 1x
	- Identity -> 1x
	- Placeholder -> 1x
	- Total nonconverted OPs: 5
	- Total nonconverted OP Types: 4
For more information see https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#supported-ops.

2024-04-08 14:49:10.192981: W tensorflow/compiler/tf2tensorrt/segment/segment.cc:1298] The environment variable TF_TRT_MAX_ALLOWED_ENGINES=20 has no effect since there are only 1 TRT Engines with  at least minimum_segment_size=3 nodes.
2024-04-08 14:49:10.193039: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:799] Number of TensorRT candidate segments: 1
2024-04-08 14:49:10.300789: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:913] Replaced segment 0 consisting of 79 nodes by TRTEngineOp_000_000.
2024-04-08 14:49:13.086942: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:378] Ignored output_format.
2024-04-08 14:49:13.086983: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:381] Ignored drop_control_dependency.
2024-04-08 14:49:13.087426: I tensorflow/cc/saved_model/reader.cc:83] Reading SavedModel from: /linkhome/rech/genwej01/uov71di/experiments/examples/federated/cifar100/my_trials_2024-04-08--14:47:50_federatedListicCFL_strategy_minCl100_minFit100_learningRate0.001_nbEpoch1_procID0_dropout0.2_dataConfig1_isFLserverTrue_num_experiments1/expe_0/exported_models/1
2024-04-08 14:49:13.093707: I tensorflow/cc/saved_model/reader.cc:51] Reading meta graph with tags { serve }
2024-04-08 14:49:13.107495: I tensorflow/cc/saved_model/loader.cc:233] Restoring SavedModel bundle.
2024-04-08 14:49:13.317161: I tensorflow/cc/saved_model/loader.cc:316] SavedModel load for tags { serve }; Status: success: OK. Took 229736 microseconds.

INFO flwr 2024-04-08 14:49:27,263 | server.py:104 | FL starting
INFO:flwr:FL starting
DEBUG flwr 2024-04-08 14:49:30,687 | server.py:222 | fit_round 1: strategy sampled 100 clients (out of 100)
DEBUG:flwr:fit_round 1: strategy sampled 100 clients (out of 100)

(DefaultActor pid=460472)   if distutils.version.LooseVersion(
(DefaultActor pid=460472) /usr/local/lib/python3.11/dist-packages/tensorflow_probability/python/__init__.py:57: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
(DefaultActor pid=460472)   if (distutils.version.LooseVersion(tf.__version__) <
(DefaultActor pid=460472) /usr/local/lib/python3.11/dist-packages/tf_agents/utils/common.py:91: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
(DefaultActor pid=460472)   distutils.version.LooseVersion(tf.__version__)
(DefaultActor pid=460472) /gpfsssd/jobscratch/uov71di_1419691/session_2024-04-08_14-48-11_378165_459197/runtime_resources/py_modules_files/_ray_pkg_aa459d48e12ddf8a/deeplearningtools/tools/gpu.py:78: DeprecationWarning: invalid escape sequence '\ '
(DefaultActor pid=460472)   f.write('\n    -> /!\ Layer not tensorcore compliant (index, name, input, output):'+str((i,layer.name,layer.input_shape, layer.output_shape)))
(DefaultActor pid=460472) /usr/local/lib/python3.11/dist-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning: 
(DefaultActor pid=460472) 
(DefaultActor pid=460472) TensorFlow Addons (TFA) has ended development and introduction of new features.
(DefaultActor pid=460472) TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
(DefaultActor pid=460472) Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 
(DefaultActor pid=460472) 
(DefaultActor pid=460472) For more information see: https://github.com/tensorflow/addons/issues/2807 
(DefaultActor pid=460472) 
(DefaultActor pid=460472)   warnings.warn(
(DefaultActor pid=460472) /usr/local/lib/python3.11/dist-packages/tensorflow_model_optimization/__init__.py:65: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
(DefaultActor pid=460472)   if (distutils.version.LooseVersion(tf.version.VERSION) <

(DefaultActor pid=460472) 2024-04-08 14:49:32.679854: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /device:GPU:0 with 494 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:1a:00.0, compute capability: 7.0
(DefaultActor pid=460472) 2024-04-08 14:49:32.681863: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 494 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:1a:00.0, compute capability: 7.0
(DefaultActor pid=460472) 2024-04-08 14:49:32.683875: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /device:GPU:0 with 494 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:1a:00.0, compute capability: 7.0
(DefaultActor pid=460472) 2024-04-08 14:49:32.685186: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /device:GPU:0 with 494 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:1a:00.0, compute capability: 7.0

(DefaultActor pid=460472) 2024-04-08 14:49:37.205480: I tensorflow/tsl/profiler/lib/profiler_session.cc:104] Profiler session initializing.
(DefaultActor pid=460472) 2024-04-08 14:49:37.205519: I tensorflow/tsl/profiler/lib/profiler_session.cc:119] Profiler session started.
(DefaultActor pid=460472) 2024-04-08 14:49:37.205559: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_tracer.cc:1694] Profiler found 1 GPUs
(DefaultActor pid=460472) 2024-04-08 14:49:37.242056: I tensorflow/tsl/profiler/lib/profiler_session.cc:131] Profiler session tear down.
(DefaultActor pid=460472) 2024-04-08 14:49:37.242199: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_tracer.cc:1828] CUPTI activity buffer flushed
(DefaultActor pid=460472) 2024-04-08 14:49:37.288573: I tensorflow/tsl/profiler/lib/profiler_session.cc:104] Profiler session initializing.
(DefaultActor pid=460472) 2024-04-08 14:49:37.288606: I tensorflow/tsl/profiler/lib/profiler_session.cc:119] Profiler session started.
(DefaultActor pid=460472) 2024-04-08 14:49:37.324481: I tensorflow/tsl/profiler/lib/profiler_session.cc:131] Profiler session tear down.
(DefaultActor pid=460472) 2024-04-08 14:49:37.324669: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_tracer.cc:1828] CUPTI activity buffer flushed
(DefaultActor pid=460472) /usr/local/lib/python3.11/dist-packages/flwr/simulation/ray_transport/ray_actor.py:72: DeprecationWarning:  Ensure your client is of type `flwr.client.Client`. Please convert it using the `.to_client()` method before returning it in the `client_fn` you pass to `start_simulation`. We have applied this conversion on your behalf. Not returning a `Client` might trigger an error in future versions of Flower.
(DefaultActor pid=460472)   client = check_clientfn_returns_client(client_fn(cid))
(DefaultActor pid=460472) 2024-04-08 14:49:37.446197: I tensorflow/tsl/profiler/lib/profiler_session.cc:104] Profiler session initializing.
(DefaultActor pid=460472) 2024-04-08 14:49:37.446238: I tensorflow/tsl/profiler/lib/profiler_session.cc:119] Profiler session started.
(DefaultActor pid=460472) 2024-04-08 14:49:37.482799: I tensorflow/tsl/profiler/lib/profiler_session.cc:131] Profiler session tear down.
(DefaultActor pid=460472) 2024-04-08 14:49:37.482979: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_tracer.cc:1828] CUPTI activity buffer flushed
(DefaultActor pid=460472) 2024-04-08 14:49:39.050371: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:961] layout failed: INVALID_ARGUMENT: Size of values 0 does not match size of permutation 4 @ fanin shape insequential/dropout/dropout/SelectV2-2-TransposeNHWCToNCHW-LayoutOptimizer
(DefaultActor pid=460472) 2024-04-08 14:49:39.376274: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:442] Loaded cuDNN version 8600
(DefaultActor pid=460472) 2024-04-08 14:49:39.650979: W tensorflow/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 563.19MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
(DefaultActor pid=460472) 2024-04-08 14:49:39.663984: W tensorflow/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 602.12MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
(DefaultActor pid=460472) 2024-04-08 14:49:39.664051: W tensorflow/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 602.12MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
(DefaultActor pid=460472) 2024-04-08 14:49:39.667988: W tensorflow/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.08GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
(DefaultActor pid=460472) 2024-04-08 14:49:39.678127: W tensorflow/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 602.12MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
(DefaultActor pid=460472) 2024-04-08 14:49:39.678184: W tensorflow/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 602.12MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
(DefaultActor pid=460472) 2024-04-08 14:49:39.682286: W tensorflow/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.08GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
(DefaultActor pid=460472) 2024-04-08 14:49:40.488444: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x15761a00 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
(DefaultActor pid=460472) 2024-04-08 14:49:40.488503: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-32GB, Compute Capability 7.0
(DefaultActor pid=460472) 2024-04-08 14:49:40.562673: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
(DefaultActor pid=460472) 2024-04-08 14:49:40.667952: W tensorflow/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 602.12MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
(DefaultActor pid=460472) 2024-04-08 14:49:40.683735: W tensorflow/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.08GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
(DefaultActor pid=460472) 2024-04-08 14:49:40.683801: W tensorflow/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.08GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
MilowB added the bug label on Apr 8, 2024
@jafermarq
Contributor

Hi @MilowB, I wonder if this is because you are also making use of the model in the main thread (i.e. where the strategy/server runs), which causes that process to use the whole GPU (TensorFlow's default behaviour). Note that in the examples/simulation-tensorflow example we also call enable_tf_gpu_growth() outside start_simulation(). See this line. I hope this fixes your issue!
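
A minimal sketch of that suggestion, assuming the same enable_tf_gpu_growth helper used in examples/simulation-tensorflow (the exact import path may differ across Flower versions):

import flwr as fl
from flwr.simulation.ray_transport.utils import enable_tf_gpu_growth

# Enable memory growth in the main process as well, so the server/strategy side
# does not let TensorFlow pre-allocate (almost) all of the GPU memory by default.
enable_tf_gpu_growth()

history = fl.simulation.start_simulation(
    client_fn=client_training_fn,   # same arguments as in the report above
    num_clients=min_available_clients,
    config=fl.server.ServerConfig(num_rounds=FLAGS.num_rounds),
    strategy=strategy,
    ray_init_args=ray_server_config,
    client_resources=client_resources,
    actor_kwargs={"on_actor_init_fn": enable_tf_gpu_growth},  # ...and in every Ray actor
)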
