Skip to content
This repository has been archived by the owner on Jun 26, 2021. It is now read-only.

Trainer refactoring #66

Merged
merged 92 commits into from Jun 5, 2019
Merged

Trainer refactoring #66

merged 92 commits into from Jun 5, 2019

Conversation

justusschock
Copy link
Member

@justusschock justusschock commented Feb 25, 2019

This is a first draft to refactor trainer (combine code where possible) and introduce a predictor.

The metric logging was moved from the networks closure to the trainer.

@ORippler : maybe we could also merge experiments and rename the AbstractTrainer to BaseTrainer since it is not abstract anymore?

Also some tf tests fail due to shape missmatch. Have you any idea why?

@mibaumgartner : does this match your idea of a predictor? Do you have any improvements compared to your own prototype?

fix #39
fix #46 ?

EDIT: Docstrings are still missing

@justusschock justusschock added this to the Release 0.4.0 milestone Feb 25, 2019
@justusschock justusschock self-assigned this Feb 25, 2019
@justusschock justusschock added this to ToDo in PyTorch via automation Feb 25, 2019
@justusschock justusschock added this to ToDo in Tensorflow via automation Feb 25, 2019
@justusschock justusschock marked this pull request as ready for review February 25, 2019 10:50
PyTorch automation moved this from ToDo to Doing Mar 8, 2019
Copy link
Collaborator

@ORippler ORippler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will investigate shape errors tomorrow. Furthermore, experiment.stratified_kfold_predict has to be adapted (might also be related to the shape errors)

delira/training/predictor.py Outdated Show resolved Hide resolved
delira/training/predictor.py Outdated Show resolved Hide resolved
Tensorflow automation moved this from ToDo to Doing Mar 8, 2019
@justusschock
Copy link
Member Author

justusschock commented Jun 3, 2019

I just added two things to the Predictor:

  • The conversion of the batchdict from native tensors to numpy (since the batchdict may be converted inplace during predict)
  • A as_generator option, which yields the results of each batch instead of adding them to the list and returning the whole list afterwards. This should be useful if we only want to predict, since for huge datasets (especially if 3D) tasks like segmentation might cause OutOfMemoryErrors even if we only store predictions in memory and this way they can be processed/saved on a per-batch basis

EDIT:
I failed with the generator part, but fixed it now... The problem with generators is, that every function, that contains a yield in it's body returns a generator name (even if the actual yield is never triggered). This prevents us from using a solution which either returns a generator or the complete predictions and metrics. Because I think, we should definitely support this kind of behavior, b510ffb makes the predictor.predict_data_mgr always return a generator object, which is of size 1 if the lazy_gen flag is False.
13d3adb updates the experiment.test function accordingly

@justusschock justusschock mentioned this pull request Jun 3, 2019
4 tasks
@justusschock justusschock changed the base branch from master to parallel_master June 4, 2019 08:04
@justusschock
Copy link
Member Author

I think we're almost good to go. The only remaining issue is the one with tensorflow not finding some resources in our tests. According to this issue it is most likely due to a thread that is spawned anywhere... Any Ideas on why this happens now and didn't happen before?

justusschock and others added 3 commits June 5, 2019 11:27
Make shallow copy of batchdict to retain keys, which might get popped in `prepare_batch`
@justusschock
Copy link
Member Author

The trainer is now good to be merged. The only failing test is python 3.7 which is due to trixi dependencies but completely unrelated to this PR

Copy link
Member

@mibaumgartner mibaumgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation looks good 👍
We should include the Predictor inside the training.rst so it is displayed correctly in the documentation.
I think that should be a quick fix, so i approve the changes.

@justusschock
Copy link
Member Author

The file already existed, I just forgot to include it into the root file. Done now

@justusschock justusschock merged commit 69080a8 into parallel_master Jun 5, 2019
PyTorch automation moved this from Doing to Done Jun 5, 2019
Tensorflow automation moved this from Doing to Done Jun 5, 2019
@justusschock justusschock deleted the trainer_refactoring branch June 5, 2019 19:07
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
PyTorch
  
Done
Tensorflow
  
Done
Development

Successfully merging this pull request may close these issues.

None yet

3 participants