MacBook M2: model.train fails with 'GPU' #238

Open
blancfrederic opened this issue Jul 7, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@blancfrederic

Describe the bug

I'm trying to train a 3D model on my experimental 3D images. It works with 'CPU' but fails with 'GPU'.

With 'GPU', model.train runs only 1 step of the first epoch (very slowly) and then freezes the console.

Strangely, it works (but slowly) when backbone = 'resnet' instead of 'unet'.

Note that if I want to train with 'CPU', I need to

cpu = tf.config.list_logical_devices('CPU')[0]  # assumption: cpu obtained analogously to the GPU loop further down
with tf.device(cpu.name):
    model.train(X_trn, Y_trn, validation_data=(X_val, Y_val), augmenter=augmenter,
                epochs=10)

To reproduce
From the Jupyter notebook 2_training.ipynb:

#%% Configuration
# A `StarDist3D` model is specified via a `Config3D` object.
extents = calculate_extents(Y)
anisotropy = tuple(np.max(extents) / extents)
print('empirical anisotropy of labeled objects = %s' % str(anisotropy))

# 96 is a good default choice (see 1_data.ipynb)
n_rays = 96

# Use OpenCL-based computations for data generator during training (requires 'gputools')
use_gpu = True and gputools_available()

# Predict on subsampled grid for increased efficiency and larger field of view
grid = tuple(1 if a > 1.5 else 2 for a in anisotropy)

# Use rays on a Fibonacci lattice adjusted for measured anisotropy of the training data
rays = Rays_GoldenSpiral(n_rays, anisotropy=anisotropy)

#backbone 'unet' or 'resnet'
backbone = 'unet'

conf = Config3D(
    rays             = rays,
    grid             = grid,
    anisotropy       = anisotropy,
    use_gpu          = use_gpu,
    n_channel_in     = n_channel,
    # adjust for your data below (make patch size as large as possible)
    train_patch_size = (48,96,96),
    train_batch_size = 2,
    backbone         = backbone
)
print(conf)
vars(conf)

if use_gpu:
    from csbdeep.utils.tf import limit_gpu_memory
    # adjust as necessary: limit GPU memory to be used by TensorFlow to leave some to OpenCL-based computations
    #limit_gpu_memory(0.8)
    # alternatively, try this:
    limit_gpu_memory(None, allow_growth=True)

# **Note:** The trained `StarDist3D` model will *not* predict completed shapes for partially visible objects at the image boundary.

model = StarDist3D(conf, name='stardist', basedir='models')

# Check if the neural network has a large enough field of view to see up to the boundary of most objects.
median_size = calculate_extents(Y, np.median)
fov = np.array(model._axes_tile_overlap('ZYX'))
print(f"median object size:      {median_size}")
print(f"network field of view :  {fov}")
if any(median_size > fov):
    print("WARNING: median object size larger than field of view of the neural network.")

#%% Train the model
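# train under an explicit device scope for each detected logical GPU
# (on this machine there is just one: the Apple M2 Max Metal device)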
gpus = tf.config.list_logical_devices('GPU')
for gpu in gpus:
    with tf.device(gpu.name):        
        model.train(X_trn, Y_trn, validation_data=(X_val,Y_val), augmenter=augmenter,
                    epochs=10)

Expected behavior
That it works ;o)

Data and screenshots

runcell('Train the model', '/Users/Fred/Documents/Recherche/THESE_ERIC/pour Fred/2_training.py')
WARNING:absl | At this time, the v2.11+ optimizer `tf.keras.optimizers.Adam` runs slowly on M1/M2 Macs, please use the legacy Keras optimizer instead, located at `tf.keras.optimizers.legacy.Adam`.
WARNING:absl | There is a known slowdown when using v2.11+ Keras optimizers on M1/M2 Macs. Falling back to the legacy Keras optimizer, i.e., `tf.keras.optimizers.legacy.Adam`.
Epoch 1/10
WARNING:tensorflow:AutoGraph could not transform <function _gcd_import at 0x1025103a0> and will run it as-is.
Cause: Unable to locate the source code of <function _gcd_import at 0x1025103a0>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
  1/100 [..............................] - ETA: 39:40 - loss: 2.1242 - prob_loss: 0.6624 - dist_loss: 7.3086 - prob_kld: 0.5013 - dist_relevant_mae: 7.3086 - dist_relevant_mse: 69.2684 - dist_dist_iou_metric: 0.0067

Environment (please complete the following information):

  • os: macOS-13.3.1-arm64-arm-64bit
  • stardist: 0.8.3
  • csbdeep: 0.7.3
  • tensorflow: 2.12.0
  • tensorflow GPU: True
  • Metal device set to: Apple M2 Max
  • systemMemory: 64,00 GB
  • maxCacheSize: 24,00 GB
@blancfrederic blancfrederic added the bug Something isn't working label Jul 7, 2023
@uschmidt83
Member

Hi @blancfrederic,

> I'm trying to train a 3D model on my experimental 3D images. It works with 'CPU' but fails with 'GPU'.
> With 'GPU', model.train runs only 1 step of the first epoch (very slowly) and then freezes the console.

You're training with relatively big patches, so it could be that training freezes because there is no more GPU memory available. Try setting train_batch_size = 1 and see if the problem goes away.
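
For example, a minimal sketch of that change, reusing the configuration from the report above (only train_batch_size differs):

conf = Config3D(
    rays             = rays,
    grid             = grid,
    anisotropy       = anisotropy,
    use_gpu          = use_gpu,
    n_channel_in     = n_channel,
    train_patch_size = (48, 96, 96),
    train_batch_size = 1,  # reduced from 2 to lower GPU memory pressure
    backbone         = backbone,
)
model = StarDist3D(conf, name='stardist', basedir='models')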

> Strangely, it works (but slowly) when backbone = 'resnet' instead of 'unet'.

It could be that the UNet uses slightly more memory and therefore causes the problem.
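
If lowering the batch size is not enough, shrinking the U-Net itself should also reduce memory use. A sketch, assuming Config3D's standard unet_n_filter_base parameter (this option was not suggested in the thread):

conf = Config3D(
    rays               = rays,
    grid               = grid,
    anisotropy         = anisotropy,
    use_gpu            = use_gpu,
    n_channel_in       = n_channel,
    train_patch_size   = (48, 96, 96),
    train_batch_size   = 1,
    backbone           = 'unet',
    unet_n_filter_base = 16,  # assumption: halve the default (32) to cut activation memory
)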

@blancfrederic
Author

Hi @uschmidt83

Thank you very much for your answer. It works like a charm!
