MacBook M2: model.train fails with 'GPU' #238

Open
blancfrederic opened this issue Jul 7, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@blancfrederic

Describe the bug

I'm trying to train a 3D model on my experimental 3D images. It works with 'CPU' but fails with 'GPU'.

With 'GPU', model.train runs only 1 step of the first epoch (very slowly) and then freezes the console.

Strangely, it works (but slowly) when backbone = 'resnet' instead of 'unet'.

Note that if I want to train with 'CPU', I need to

cpu = tf.config.list_logical_devices('CPU')[0]  # assumption: cpu obtained analogously to the GPU loop further down
with tf.device(cpu.name):
    model.train(X_trn, Y_trn, validation_data=(X_val, Y_val), augmenter=augmenter,
                epochs=10)

To reproduce
From the Jupyter notebook 2_training.ipynb:

#%% Configuration
# A `StarDist3D` model is specified via a `Config3D` object.
extents = calculate_extents(Y)
anisotropy = tuple(np.max(extents) / extents)
print('empirical anisotropy of labeled objects = %s' % str(anisotropy))

# 96 is a good default choice (see 1_data.ipynb)
n_rays = 96

# Use OpenCL-based computations for data generator during training (requires 'gputools')
use_gpu = True and gputools_available()

# Predict on subsampled grid for increased efficiency and larger field of view
grid = tuple(1 if a > 1.5 else 2 for a in anisotropy)

# Use rays on a Fibonacci lattice adjusted for measured anisotropy of the training data
rays = Rays_GoldenSpiral(n_rays, anisotropy=anisotropy)

#backbone 'unet' or 'resnet'
backbone = 'unet'

conf = Config3D(
    rays             = rays,
    grid             = grid,
    anisotropy       = anisotropy,
    use_gpu          = use_gpu,
    n_channel_in     = n_channel,
    # adjust for your data below (make patch size as large as possible)
    train_patch_size = (48,96,96),
    train_batch_size = 2,
    backbone         = backbone
)
print(conf)
vars(conf)

if use_gpu:
    from csbdeep.utils.tf import limit_gpu_memory
    # adjust as necessary: limit GPU memory to be used by TensorFlow to leave some to OpenCL-based computations
    #limit_gpu_memory(0.8)
    # alternatively, try this:
    limit_gpu_memory(None, allow_growth=True)

# **Note:** The trained `StarDist3D` model will *not* predict completed shapes for partially visible objects at the image boundary.

model = StarDist3D(conf, name='stardist', basedir='models')

# Check if the neural network has a large enough field of view to see up to the boundary of most objects.
median_size = calculate_extents(Y, np.median)
fov = np.array(model._axes_tile_overlap('ZYX'))
print(f"median object size:      {median_size}")
print(f"network field of view :  {fov}")
if any(median_size > fov):
    print("WARNING: median object size larger than field of view of the neural network.")

#%% Train the model
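# train under an explicit device scope for each detected logical GPU
# (on this machine there is just one: the Apple M2 Max Metal device)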
gpus = tf.config.list_logical_devices('GPU')
for gpu in gpus:
    with tf.device(gpu.name):        
        model.train(X_trn, Y_trn, validation_data=(X_val,Y_val), augmenter=augmenter,
                    epochs=10)

Expected behavior
That it works ;o)

Data and screenshots

runcell('Train the model', '/Users/Fred/Documents/Recherche/THESE_ERIC/pour Fred/2_training.py')
WARNING:absl | At this time, the v2.11+ optimizer `tf.keras.optimizers.Adam` runs slowly on M1/M2 Macs, please use the legacy Keras optimizer instead, located at `tf.keras.optimizers.legacy.Adam`.
WARNING:absl | There is a known slowdown when using v2.11+ Keras optimizers on M1/M2 Macs. Falling back to the legacy Keras optimizer, i.e., `tf.keras.optimizers.legacy.Adam`.
Epoch 1/10
WARNING:tensorflow:AutoGraph could not transform <function _gcd_import at 0x1025103a0> and will run it as-is.
Cause: Unable to locate the source code of <function _gcd_import at 0x1025103a0>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
  1/100 [..............................] - ETA: 39:40 - loss: 2.1242 - prob_loss: 0.6624 - dist_loss: 7.3086 - prob_kld: 0.5013 - dist_relevant_mae: 7.3086 - dist_relevant_mse: 69.2684 - dist_dist_iou_metric: 0.0067

Environment (please complete the following information):

  • os: macOS-13.3.1-arm64-arm-64bit
  • stardist: 0.8.3
  • csbdeep: 0.7.3
  • tensorflow: 2.12.0
  • tensorflow GPU: True
  • Metal device set to: Apple M2 Max
  • systemMemory: 64,00 GB
  • maxCacheSize: 24,00 GB
@blancfrederic blancfrederic added the bug Something isn't working label Jul 7, 2023
@uschmidt83
Member

Hi @blancfrederic,

> I'm trying to train a 3D model on my experimental 3D images. It works with 'CPU' but fails with 'GPU'.
> With 'GPU', model.train runs only 1 step of the first epoch (very slowly) and then freezes the console.

You're training with relatively big patches, so it could be that training freezes because there is no more GPU memory available. Try setting train_batch_size = 1 and see if the problem goes away.
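
For example, a minimal sketch of that change, reusing the configuration from the report above (only train_batch_size differs):

conf = Config3D(
    rays             = rays,
    grid             = grid,
    anisotropy       = anisotropy,
    use_gpu          = use_gpu,
    n_channel_in     = n_channel,
    train_patch_size = (48, 96, 96),
    train_batch_size = 1,  # reduced from 2 to lower GPU memory pressure
    backbone         = backbone,
)
model = StarDist3D(conf, name='stardist', basedir='models')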

> Strangely, it works (but slowly) when backbone = 'resnet' instead of 'unet'.

It could be that the UNet uses slightly more memory and therefore causes the problem.
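
If lowering the batch size is not enough, shrinking the U-Net itself should also reduce memory use. A sketch, assuming Config3D's standard unet_n_filter_base parameter (this option was not suggested in the thread):

conf = Config3D(
    rays               = rays,
    grid               = grid,
    anisotropy         = anisotropy,
    use_gpu            = use_gpu,
    n_channel_in       = n_channel,
    train_patch_size   = (48, 96, 96),
    train_batch_size   = 1,
    backbone           = 'unet',
    unet_n_filter_base = 16,  # assumption: halve the default (32) to cut activation memory
)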

@blancfrederic
Author

Hi @uschmidt83

Thank you very much for your answer. It works like a charm!
