juliandewit/kaggle_ndsb2

This is the source code for the 3rd place solution to the Second National Data Science Bowl hosted by Kaggle.com. For documentation about the approach, look here.

Dependencies & data

I used the default Anaconda distribution with all the libraries that come with it. In addition I used OpenCV (cv2), pydicom and MXNet (version 20151228, but a later version will most probably be fine). For more detailed Windows 64-bit installation instructions, look here.
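As a quick sanity check that the dependencies are in place, you can try importing them (a minimal sketch; note that pydicom releases from that era were imported as dicom, while newer releases use import pydicom):

    # Sanity check: verify the third-party dependencies import correctly.
    import cv2      # OpenCV
    import dicom    # pydicom (old releases import as "dicom"; newer ones as "pydicom")
    import mxnet
    print("OpenCV:", cv2.__version__)
    print("MXNet:", mxnet.__version__)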

The DICOM data needs to be downloaded from Kaggle and extracted into the data_kaggle/train, data_kaggle/validate and data_kaggle/test folders.
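A small check like the following (a sketch; the folder names are taken from the description above) confirms the data landed in the right place:

    # Verify the expected data folders exist before running the pipeline.
    import os
    for folder in ("data_kaggle/train", "data_kaggle/validate", "data_kaggle/test"):
        print(folder, "OK" if os.path.isdir(folder) else "MISSING")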

Adjust settings

In settings.py you can adjust some parameters. The most important one is the special "quick mode". This makes training the model 5x faster at the expense of some data-science rigor: instead of training different folds to calibrate upon (to prevent overfitting), only one fold is trained. This overfits a bit in steps 3 and 4 but still results in a solid 0.0105 score, which is enough for 3rd place on the LB. Not choosing quick mode takes much longer to train but results in less overfit and gives 0.0101 on the LB, which is almost 2nd place and, with some luck, might even get there.
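For illustration, the switch might look something like this in settings.py (a hypothetical sketch; the actual variable name in the repository may differ):

    # Hypothetical name: the real flag in settings.py may be spelled differently.
    QUICK_MODE = True   # True:  train a single fold (~5x faster, slight overfit)
                        # False: train all folds for the better-calibrated result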

Run the solution

  1. python step0_preprocess.py
    As a result the /data_preprocessed_images folder will contain ~329,000 preprocessed images and some extra csv files will be generated in the root folder.
  2. python step1_train_segmenter.py
    As a result you will have one or more trained models in the root folder. Depending on the fold, RMSE should be around 0.049 (train) and 0.052 (validate).
  3. python step2_predict_volumes.py
    As a result you will have a csv containing raw predictions for all 1140 patients. The data_patient_predictions folder will also contain all generated overlays and csv data per patient for debugging. In the logs the average error should be around 10 ml.
  4. python step3_calibrate.py
    As a result you will have a csv file containing all the calibrated predictions. In the logs the average error should go down by roughly 1 ml.
  5. python step4_submission.py
    As a result the /data_submission_files folder will contain a submission file. In the logs the CRPS should be around 0.010.
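To run the whole pipeline unattended, a minimal driver along these lines works (a sketch; it simply shells out to the five step scripts listed above):

    # Run the five pipeline steps in order, stopping on the first failure.
    import subprocess
    import sys

    STEPS = [
        "step0_preprocess.py",
        "step1_train_segmenter.py",
        "step2_predict_volumes.py",
        "step3_calibrate.py",
        "step4_submission.py",
    ]

    for script in STEPS:
        print("Running", script)
        subprocess.check_call([sys.executable, script])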

Hardware

The solution should be gentle on the GPU because of the small batch size. Any recent GPU supported by MXNet should do the job, I figure. The lowest-end card I tried (and that worked) was a GT740.
