GPU & TPU & IPU unavailable and failure to re-train the model on Windows #4

Closed
PPierre22 opened this issue Oct 12, 2021 · 5 comments

@PPierre22

Hi,

I'm having some difficulty reproducing your tutorial on a Windows 10 system. In particular, I cannot train the model (R throws an error).

Overall, everything seems to work smoothly up until the "Train the model" subsection of your tutorial. There is, however, one error when I load the model:

> model = df_model()
ERROR 1: PROJ: proj_create_from_database: Cannot find proj.db
PROJ: proj_create_from_database: Cannot find proj.db
Reading config file: C:\Users...\AppData\Local\r-miniconda\envs\r-reticulate\lib\site-packages\deepforest\data\deepforest_config.yml

Despite this error message, I can load the model, predict, and visualize on the example *.png and *.tif data you provide, and everything seems to work just fine. However, when I train the model (using the data provided in your tutorial), I get what looks like a warning here:

> model$create_trainer()
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs

Then, the following command prints the messages copied below, and nothing else happens: the progress bar stays at 0%:

model$trainer$fit(model)

  | Name  | Type      | Params
0 | model | RetinaNet | 32.1 M

31.9 M Trainable params
222 K Non-trainable params
32.1 M Total params
128.592 Total estimated model params size (MB)
C:\Users...\AppData\Local\r-miniconda\envs\r-reticulate\lib\site-packages\pytorch_lightning\trainer\data_loading.py:382: UserWarning: One of given dataloaders is None and it will be skipped.
rank_zero_warn("One of given dataloaders is None and it will be skipped.")
...oader, does not have many workers which may be a bottleneck. Consider increasing the value of the num_workers argument (try 8 which is the number of cpus on this machine) in the DataLoader init to improve performance.
f"The dataloader, {name}, does not have many workers which may be a bottleneck."
Epoch 0: 0%| | 0/1 [00:00<?, ?it/s]

I'm under the impression that the program does not have access to my CPUs. I looked for the "num_workers" argument that the message suggests increasing, but without success: I only found a "workers" argument in model$config, and changing it to as.integer(8) does not change anything.

Any clue on what might be going on?

Thanks

@henrykironde
Contributor

Thanks for reporting this, @PPierre22. We are still trying to find out what could be causing this error.

@ethanwhite
Member

Thanks again for reporting @PPierre22. We can replicate the behavior you are seeing and we're not sure exactly what's going on at the moment. The CPU should be getting used if no GPUs are present and changing the number of workers the way you did is the correct approach.
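
The "GPU available: False" lines printed by create_trainer() appear to be status messages rather than errors. As a minimal sketch for confirming what the Python side actually sees, the snippet below reports the CPU count and, if torch can be imported, whether a CUDA GPU is visible. The helper name describe_hardware is hypothetical, and the check assumes it is run in the same Python environment that reticulate uses.

```python
import os


def describe_hardware():
    """Report the CPU count and whether PyTorch can see a CUDA GPU.

    Hypothetical helper for illustration; not part of DeepForest or deepforestr.
    """
    info = {"cpu_count": os.cpu_count() or 1, "cuda_available": None}
    try:
        import torch  # only present where PyTorch is installed
    except (ImportError, OSError):
        # torch is missing or failed to load; leave the CUDA field as None
        return info
    info["cuda_available"] = bool(torch.cuda.is_available())
    return info


print(describe_hardware())
```

A result with cuda_available set to False and a cpu_count above zero is consistent with training falling back to the CPU.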

Everything works fine in Python on the same Windows machines I've tested on and deepforestr runs fine on Linux, but we are seeing a possibly related issue on macOS (#5).

If you need to get something working, the best short-term solution is probably to run things directly in Python (see https://deepforest.readthedocs.io/en/latest/) and then export anything you need to use in R.

We'll keep looking to see if we can fix things in the R package, but this is likely an issue with reticulate, pytorch, and Windows, and so is probably upstream of us at the moment.
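
For anyone taking the Python route, here is a rough sketch of the same training smoke test using the DeepForest Python package directly. It assumes a DeepForest 1.x install, where the model class is main.deepforest and the config is a nested dictionary; the function names train_release_model and deepforest_installed are hypothetical. Setting workers to 0 keeps data loading in the main process, which avoids worker subprocesses on Windows, though this thread does not establish that workers were the cause of the hang.

```python
import importlib.util
import os


def deepforest_installed():
    """Return True if the deepforest package can be found in this environment."""
    return importlib.util.find_spec("deepforest") is not None


def train_release_model(workers=0):
    """Fine-tune the release model on the sample annotations shipped with DeepForest.

    Hypothetical wrapper; assumes the DeepForest 1.x Python API.
    """
    # Imported lazily so this file can be loaded without deepforest installed.
    from deepforest import main, get_data

    model = main.deepforest()
    model.use_release()  # load the prebuilt release weights

    csv_file = get_data("testfile_deepforest.csv")
    model.config["workers"] = workers  # 0 = load data in the main process
    model.config["train"]["csv_file"] = csv_file
    model.config["train"]["root_dir"] = os.path.dirname(csv_file)

    model.create_trainer()
    model.trainer.fit(model)
    return model
```

Calling train_release_model() in an environment where deepforest_installed() returns True should start a short training run on the packaged sample data.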

@PPierre22
Author

Hello @ethanwhite and @henrykironde,
Thank you for your assistance. Please let me know if you end up fixing this issue in the R package.
Best

@ethanwhite
Member

I currently have things running on Windows again for training with no changes to the code base (but some small ones to the instructions), suggesting that the upstream issues have been addressed.

@PPierre22 - if you still have time/interest, would you mind doing a fresh install (to get the underlying DeepForest package updated) and then checking whether the following code runs?

library(deepforestr)

model = df_model()
model$use_release()

annotations_file = get_data("testfile_deepforest.csv")

model$config$train$csv_file = annotations_file
model$config$train$root_dir = get_data(".")

model$create_trainer()
model$trainer$fit(model)

If that works, then I can walk you through how to do a proper retraining. For some reason I'm still investigating, it looks like we need to use the config file instead of setting some config options from R.

@ethanwhite
Member

We now have Windows testing including training on GH Actions (see #10). Everything is working so closing this, but if anyone runs into this issue on their local system feel free to reopen.

3 participants