Running training in a loop (M1 chip) #1345

cadyyuheng opened this issue Jul 25, 2022 · 3 comments

I'm trying to repeat my training and prediction in a loop 20 times. My code worked fine on an Intel-based MacBook. However, I recently switched to an M1-based MacBook, and the loop now runs into trouble: although I get no errors, the program never finishes the 5th repeat of training. If I reduce the number of repeats to 3, the loop finishes without any issue. I wonder whether some memory quota has been reached and, if so, whether there is a way to raise it. I'd really appreciate any help.

> reticulate::py_config()
python:         /Users/yfy6677/Library/r-miniconda-arm64/envs/r-reticulate/bin/python
libpython:      /Users/yfy6677/Library/r-miniconda-arm64/envs/r-reticulate/lib/libpython3.8.dylib
pythonhome:     /Users/yfy6677/Library/r-miniconda-arm64/envs/r-reticulate:/Users/yfy6677/Library/r-miniconda-arm64/envs/r-reticulate
version:        3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:05:16)  [Clang 12.0.1 ]
numpy:          /Users/yfy6677/Library/r-miniconda-arm64/envs/r-reticulate/lib/python3.8/site-packages/numpy
numpy_version:  1.22.4
> tensorflow::tf_config()
TensorFlow v2.9.2 ()
Python v3.8 (~/Library/r-miniconda-arm64/envs/r-reticulate/bin/python)
> reticulate::import("tensorflow")
> reticulate::py_last_error()
> sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.5

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.0.9        sp_1.5-0           SeuratObject_4.1.0 Seurat_4.1.1      

loaded via a namespace (and not attached):
  [1] Rtsne_0.16            colorspace_2.0-3      ggsignif_0.6.3        deldir_1.0-6          ellipsis_0.3.2       
  [6] ggridges_0.5.3        rprojroot_2.0.3       base64enc_0.1-3       rstudioapi_0.13       spatstat.data_2.2-0  
 [11] farver_2.1.0          matchingR_1.3.3       ggpubr_0.4.0          leiden_0.4.2          listenv_0.8.0        
 [16] bit64_4.0.5           ggrepel_0.9.1         RSpectra_0.16-1       fansi_1.0.3           codetools_0.2-18     
 [21] splines_4.2.1         knitr_1.39            zeallot_0.1.0         polyclip_1.10-0       jsonlite_1.8.0       
 [26] broom_1.0.0           ica_1.0-3             cluster_2.1.3         tfruns_1.5.0          png_0.1-7            
 [31] rgeos_0.5-9           uwot_0.1.11           shiny_1.7.2           sctransform_0.3.3     spatstat.sparse_2.1-1
 [36] compiler_4.2.1        httr_1.4.3            backports_1.4.1       Matrix_1.4-1          fastmap_1.1.0        
 [41] lazyeval_0.2.2        cli_3.3.0             later_1.3.0           htmltools_0.5.2       tools_4.2.1          
 [46] igraph_1.3.2          gtable_0.3.0          glue_1.6.2            RANN_2.6.1            reshape2_1.4.4       
 [51] Rcpp_1.0.9            carData_3.0-5         scattermore_0.8       vctrs_0.4.1           nlme_3.1-157         
 [56] progressr_0.10.1      lmtest_0.9-40         spatstat.random_2.2-0 xfun_0.31             stringr_1.4.0        
 [61] globals_0.15.1        mime_0.12             miniUI_0.1.1.1        lifecycle_1.0.1       irlba_2.3.5          
 [66] rstatix_0.7.0         goftest_1.2-3         future_1.26.1         MASS_7.3-57           zoo_1.8-10           
 [71] scales_1.2.0          spatstat.core_2.4-4   promises_1.2.0.1      spatstat.utils_2.3-1  parallel_4.2.1       
 [76] RColorBrewer_1.1-3    yaml_2.3.5            reticulate_1.25-9000  pbapply_1.5-0         gridExtra_2.3        
 [81] ggplot2_3.3.6         keras_2.9.0.9000      rpart_4.1.16          stringi_1.7.6         tensorflow_2.9.0.9000
 [86] rlang_1.0.4           pkgconfig_2.0.3       matrixStats_0.62.0    pracma_2.3.8          evaluate_0.15        
 [91] lattice_0.20-45       ROCR_1.0-11           purrr_0.3.4           tensor_1.5            labeling_0.4.2       
 [96] patchwork_1.1.1       htmlwidgets_1.5.4     bit_4.0.4             cowplot_1.1.1         tidyselect_1.1.2     
[101] here_1.0.1            parallelly_1.32.0     RcppAnnoy_0.0.19      plyr_1.8.7            magrittr_2.0.3       
[106] R6_2.5.1              generics_0.1.3        whisker_0.4           mgcv_1.8-40           pillar_1.8.0         
[111] fitdistrplus_1.1-8    survival_3.3-1        abind_1.4-5           tibble_3.1.7          future.apply_1.9.0   
[116] hdf5r_1.3.5           car_3.1-0             KernSmooth_2.23-20    utf8_1.2.2            spatstat.geom_2.4-0  
[121] plotly_4.10.0         rmarkdown_2.14        grid_4.2.1            data.table_1.14.2     digest_0.6.29        
[126] xtable_1.8-4          tidyr_1.2.0           httpuv_1.6.5          munsell_0.5.0         viridisLite_0.4.0    
[1] "1th repeat run"

Metal device set to: Apple M1 Pro

systemMemory: 32.00 GB
maxCacheSize: 10.67 GB

2022-07-25 13:25:02.019748: I tensorflow/core/common_runtime/pluggable_device/] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-07-25 13:25:02.019862: I tensorflow/core/common_runtime/pluggable_device/] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2022-07-25 13:25:02.214002: W tensorflow/core/platform/profile_utils/] Failed to get CPU frequency: 0 Hz
2022-07-25 13:25:02.714319: I tensorflow/core/grappler/optimizers/] Plugin optimizer for device_type GPU is enabled.
2022-07-25 13:25:03.477795: I tensorflow/core/grappler/optimizers/] Plugin optimizer for device_type GPU is enabled.
2022-07-25 13:26:11.688963: I tensorflow/core/grappler/optimizers/] Plugin optimizer for device_type GPU is enabled.
21/21 [==============================] - 0s 4ms/step
171/171 [==============================] - 0s 2ms/step
[1] "2th repeat run"
2022-07-25 13:26:16.122695: I tensorflow/core/grappler/optimizers/] Plugin optimizer for device_type GPU is enabled.
2022-07-25 13:26:16.885002: I tensorflow/core/grappler/optimizers/] Plugin optimizer for device_type GPU is enabled.
2022-07-25 13:27:24.244544: I tensorflow/core/grappler/optimizers/] Plugin optimizer for device_type GPU is enabled.
21/21 [==============================] - 0s 3ms/step
171/171 [==============================] - 0s 2ms/step
[1] "3th repeat run"
2022-07-25 13:27:26.854203: I tensorflow/core/grappler/optimizers/] Plugin optimizer for device_type GPU is enabled.
2022-07-25 13:27:27.810199: I tensorflow/core/grappler/optimizers/] Plugin optimizer for device_type GPU is enabled.
2022-07-25 13:28:35.355860: I tensorflow/core/grappler/optimizers/] Plugin optimizer for device_type GPU is enabled.
21/21 [==============================] - 0s 4ms/step
171/171 [==============================] - 0s 2ms/step
[1] "4th repeat run"
2022-07-25 13:28:37.895150: I tensorflow/core/grappler/optimizers/] Plugin optimizer for device_type GPU is enabled.
2022-07-25 13:28:38.820674: I tensorflow/core/grappler/optimizers/] Plugin optimizer for device_type GPU is enabled.
2022-07-25 13:29:45.818672: I tensorflow/core/grappler/optimizers/] Plugin optimizer for device_type GPU is enabled.
21/21 [==============================] - 0s 4ms/step
171/171 [==============================] - 0s 2ms/step
[1] "5th repeat run"
2022-07-25 13:29:48.424173: I tensorflow/core/grappler/optimizers/] Plugin optimizer for device_type GPU is enabled.
2022-07-25 13:29:49.824081: I tensorflow/core/grappler/optimizers/] Plugin optimizer for device_type GPU is enabled.

Hi, I haven't encountered this yet. The fact that your code worked fine on an Intel Mac suggests that it's likely an issue with the 'tensorflow-macos' and 'tensorflow-metal' packages provided by Apple.

If you can provide a reprex and I can reproduce on my side, I can take a look to see if it's an issue with the upstream package or with something related to R or reticulate.

@t-kalinowski,

Thank you so much for your reply. Just one more question: I want to test whether the CPU or the GPU of the M1 is causing the issue. Is there a quick way to disable the M1 GPU for training? Would something like

Sys.setenv("CUDA_VISIBLE_DEVICES" = -1)  

work for M1?

As far as I know, visibility of the M1 GPU cannot be controlled through an environment variable. The way to hide it is directly in the TensorFlow session:

tf$config$get_visible_devices("CPU") |>
  tf$config$set_visible_devices()  # run before TensorFlow initializes any device
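
A minimal sketch of hiding the GPU and checking the result, assuming the `tensorflow` R package is installed. `use_cpu_only` is a hypothetical helper name, not part of any package. It relies on `tf$config$set_visible_devices()`, which has to be called before TensorFlow initializes its devices, so it should run at the start of a fresh R session:

```r
# Hypothetical helper: restrict TensorFlow to CPU devices and report what remains visible.
# The default argument is only evaluated when no `tf` object is passed in.
use_cpu_only <- function(tf = tensorflow::tf) {
  # Ask TensorFlow for the CPU devices it can see.
  cpus <- tf$config$get_visible_devices("CPU")
  # Make those the only visible devices, which hides the Metal GPU.
  tf$config$set_visible_devices(cpus)
  # Return the devices TensorFlow now reports as visible, as a sanity check.
  tf$config$get_visible_devices()
}

# Usage (assumes a working tensorflow installation, fresh R session):
# visible <- use_cpu_only()
# print(visible)  # should list CPU devices only
```

If the loop then runs all 20 repeats on the CPU, that would point toward the GPU plugin rather than R or reticulate.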

