Error running code cnn-train code chunk in 02-cats-vs-dogs.Rmd #10

Open
dpastling opened this issue Jan 28, 2020 · 4 comments

When running the cnn-train code chunk below from a fresh session, I get the following error:

history <- model %>% fit_generator(
  train_generator,
  steps_per_epoch = 100,
  epochs = 30,
  validation_data = validation_generator,
  validation_steps = 50,
  callbacks = callback_early_stopping(patience = 5)
)
WARNING:tensorflow:sample_weight modes were coerced from
  ...
    to  
  ['...']
2020-01-28 00:41:21.650047: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-01-28 00:41:21.887585: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-28 00:41:22.620623: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 16.00MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

Error in py_call_impl(callable, dots$args, dots$keywords) :
  ResourceExhaustedError: OOM when allocating tensor with shape[6272,512] and type float
  on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[node MatMul_3 (defined at /util/deprecation.py:324) ]]
  Hint: If you want to see a list of allocated tensors when OOM happens, add
  report_tensor_allocations_upon_oom to RunOptions for current allocation info.
  [Op:__inference_distributed_function_1290]
  Function call stack: distributed_function
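
A common workaround for this kind of OOM, especially on a shared GPU, is to let TensorFlow allocate GPU memory on demand instead of reserving the whole device up front. A minimal sketch via the R tensorflow package, assuming the TF 2.x experimental config API; it has to run before the model is built and is not part of the original chunk:

library(tensorflow)

# Enable on-demand ("memory growth") allocation for the first GPU.
# Must be called before any tensors or models are created in the session.
gpus <- tf$config$experimental$list_physical_devices("GPU")
if (length(gpus) > 0) {
  tf$config$experimental$set_memory_growth(gpus[[1]], TRUE)
}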

OmaymaS commented Jan 28, 2020

Just adding a note that this issue relates to running the code on RStudio Server (with a GPU).

dougmet commented Jan 28, 2020

I can reproduce this problem. Investigating.

I think I'm running out of GPU memory, which wasn't a problem before. I do have two sessions running, but I'm not sure whether that's relevant.

dougmet commented Jan 28, 2020

Dropping the batch size to 5 has got it moving again.
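
For anyone following along, the batch size is set where the generators are created, not in fit_generator(). A rough sketch assuming the generators come from flow_images_from_directory() as in the chapter (the directory path here is a placeholder):

library(keras)

train_datagen <- image_data_generator(rescale = 1/255)

# A smaller batch_size lowers the peak GPU memory used per training step.
train_generator <- flow_images_from_directory(
  "data/cats_and_dogs_small/train",  # placeholder path
  train_datagen,
  target_size = c(150, 150),
  batch_size = 5,
  class_mode = "binary"
)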

dpastling commented Jan 29, 2020

I've tried dropping the batch size to 5, but I'm still getting errors. The code now progresses through all 20 epochs, whereas with a larger batch size it was stopping at the first:

> history <- 
+   model %>% 
+   fit_generator(
+     train_generator,
+     steps_per_epoch = 100,
+     epochs = 30,
+     validation_data = validation_generator,
+     validation_steps = 50,
+     callbacks = callback_early_stopping(patience = 5)
+   )
WARNING:tensorflow:sample_weight modes were coerced from
  ...
    to  
  ['...']
WARNING:tensorflow:sample_weight modes were coerced from
  ...
    to  
  ['...']
2020-01-29 00:01:20.648842: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-01-29 00:01:20.827311: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-29 00:01:21.513376: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 98.12MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

... snip ...

Found 2000 images belonging to 2 classes.
Found 1000 images belonging to 2 classes.
Found 2000 images belonging to 2 classes.
Found 1000 images belonging to 2 classes.
Train for 100 steps, validate for 50 steps
Epoch 1/30
100/100 [==============================] - 7s 73ms/step - loss: 0.6969 - accuracy: 0.5080 - val_loss: 0.6818 - val_accuracy: 0.5480
2020-01-29 00:01:27.273011: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
Epoch 2/30
100/100 [==============================] - 3s 32ms/step - loss: 0.6927 - accuracy: 0.5180 - val_loss: 0.6750 - val_accuracy: 0.5480
W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
2020-01-29 00:01:30.518277: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
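
The "Operation was cancelled" messages when finalizing the GeneratorDataset iterator appear to be the noisy per-epoch warning from fit_generator in TF 2.0/2.1 rather than a real failure; the allocator warning is the more likely culprit. Since two sessions may be sharing one GPU, another option worth trying (beyond memory growth) is to cap how much GPU memory each R session can claim. A sketch using the TF 2.1 experimental virtual-device API via the R tensorflow package; the 2048 MB limit is just an example value:

library(tensorflow)

gpus <- tf$config$experimental$list_physical_devices("GPU")
if (length(gpus) > 0) {
  # Limit this process to a fixed slice of GPU memory (2 GB here),
  # leaving headroom for other sessions on the same device.
  tf$config$experimental$set_virtual_device_configuration(
    gpus[[1]],
    list(tf$config$experimental$VirtualDeviceConfiguration(memory_limit = 2048L))
  )
}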
