Lung Cancer Detection

Table of Contents

  1. About
  2. Project Structure
  3. Usage
  4. License

About

Lung Cancer Detection is a project made as part of the Engineer's Thesis "Applications of artificial intelligence in oncology on a computed tomography dataset" by Jakub Owczarek, under the guidance of thesis advisor dr hab. inz. Mariusz Mlynarczuk, prof. AGH.

The goal of this project is to preprocess the LIDC-IDRI dataset and evaluate the performance of deep learning models pre-trained on ImageNet, leveraging transfer learning.
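As an illustration of the approach, a minimal transfer-learning setup in Keras might look as follows. This is a hedged sketch: the ResNet50 backbone, input shape, and binary classification head are assumptions, not necessarily the configurations evaluated in the thesis.

```python
# Minimal ImageNet transfer-learning sketch in Keras; choices here are illustrative.
import tensorflow as tf

base = tf.keras.applications.ResNet50(
    weights="imagenet",         # ImageNet pre-trained weights
    include_top=False,          # drop the original 1000-class head
    input_shape=(224, 224, 3),
)
base.trainable = False          # freeze the backbone for feature extraction

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # assumed binary nodule label
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```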

Project Structure

This repository contains the following directories:

  • docs - contains markdown files with more specific descriptions of the project components

  • notebooks - contains Jupyter Notebooks that were used for experiments, analysis, visualizations, etc.

  • scripts - this directory is the actual workhorse and contains two notable subdirectories:

    • azure - contains scripts for Azure Virtual Machine and Azure Machine Learning
    • local - contains scripts that were used for local development
  • src - contains the main components of the project:

    • azure - contains utilities specific to Azure services
    • dataset - contains the DatasetLoader component used to feed data during model training
    • model - contains the model builder and director classes (a hypothetical sketch follows this list)
    • preprocessing - contains classes used for LIDC-IDRI dataset preprocessing
    • config.py - constants used throughout the project
  • tests - contains a few tests for the project components
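The builder and director in src/model suggest the classic builder pattern: the builder assembles a model piece by piece, while the director encodes a fixed construction recipe. The sketch below is hypothetical; class and method names are illustrative, not the project's actual API.

```python
# Hypothetical builder/director sketch; names do not mirror src/model exactly.
import tensorflow as tf

class ModelBuilder:
    """Accumulates layers on top of a frozen ImageNet backbone."""

    def __init__(self, backbone: tf.keras.Model):
        backbone.trainable = False  # keep pre-trained weights fixed
        self._layers = [backbone, tf.keras.layers.GlobalAveragePooling2D()]

    def add_dense(self, units: int) -> "ModelBuilder":
        self._layers.append(tf.keras.layers.Dense(units, activation="relu"))
        return self  # fluent interface, so calls can be chained

    def build(self) -> tf.keras.Model:
        self._layers.append(tf.keras.layers.Dense(1, activation="sigmoid"))
        return tf.keras.Sequential(self._layers)

class ModelDirector:
    """Encodes one standard construction recipe."""

    @staticmethod
    def make_classifier(builder: ModelBuilder) -> tf.keras.Model:
        return builder.add_dense(128).add_dense(64).build()
```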

Usage

This project was created with Azure in mind, so the main scripts are meant to be run on Azure.

1. Preprocessing

  1. The first step is to download the LIDC-IDRI dataset onto an Azure Virtual Machine. The azure/virtual_machine/download_dataset.sh script is meant for this task.
  2. Next, the dataset is preprocessed into a format suitable for supervised deep learning model training with the azure/virtual_machine/process_dataset.py script. The same directory also contains train_test_split.py, which should be used to split the processed data (a sketch of this step follows the list).
  3. Finally, the preprocessed dataset can be uploaded to Azure Blob Storage with the upload_dataset_2.sh script. There is also an upload_dataset.sh script, but it does not use the azcopy utility and is considerably slower.
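For intuition, LIDC-IDRI preprocessing is commonly done with the pylidc library. The sketch below only illustrates the general shape of step 2; process_dataset.py may use different tooling, and the malignancy binarization rule here is an assumption.

```python
# Hedged sketch of LIDC-IDRI preprocessing with pylidc; not the project's script.
import numpy as np
import pylidc as pl

scan = pl.query(pl.Scan).first()  # one CT scan from the local DICOM store
volume = scan.to_volume()         # 3D numpy array of Hounsfield units

for nodule_anns in scan.cluster_annotations():        # annotations grouped per nodule
    scores = [ann.malignancy for ann in nodule_anns]  # radiologist ratings, 1-5
    label = int(np.median(scores) > 3)                # assumed benign/malignant cutoff
    print(f"nodule with {len(nodule_anns)} annotations -> label {label}")
```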

2. Model training

  1. With the preprocessed dataset on Azure Blob Storage, the Virtual Machine is no longer necessary. An Azure Machine Learning data asset can be created from this dataset and used during model training.
  2. The actual model training is launched with the run_training_job.py script under scripts/azure/machine_learning. The script creates a job on AML that builds, compiles, and trains the desired model (see the sketch below).
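What run_training_job.py does can be approximated with the Azure ML v2 Python SDK (azure-ai-ml). This is a hedged sketch: the entry point, data asset name, environment, and compute target are assumptions, not values taken from the project.

```python
# Hedged sketch of submitting an AML training job; names below are assumptions.
from azure.ai.ml import Input, MLClient, command
from azure.identity import DefaultAzureCredential

# Connect to the AML workspace (placeholders must be filled in).
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Define a command job that trains on the registered data asset.
job = command(
    code="./src",                                        # project sources uploaded with the job
    command="python train.py --data ${{inputs.data}}",   # assumed entry point
    inputs={
        "data": Input(type="uri_folder", path="azureml:lidc-processed:1"),  # assumed asset name
    },
    environment="AzureML-tensorflow-2.12-cuda11@latest",  # assumed curated environment
    compute="gpu-cluster",                                # assumed compute target
)

ml_client.jobs.create_or_update(job)  # submits the job to AML
```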

License

This project is licensed under the MIT License - see the LICENSE.md file for details.