MTB++

Introduction

MTB++ is a software package that uses machine learning to predict antimicrobial resistance in Mycobacterium tuberculosis (MTB) for 13 antibiotics (Amikacin, Bedaquiline, Clofazimine, Delamanid, Ethambutol, Ethionamide, Isoniazid, Kanamycin, Levofloxacin, Linezolid, Moxifloxacin, Rifampicin, and Rifabutin) and 3 antibiotic families (Rifampin, Aminoglycosides, and Fluoroquinolones). This README contains instructions on how to run the trained classifier or rebuild the classifier from raw data.

Rebuilding is an advanced use case. We expect most users to only run the trained classifier. This software is maintained by Ali Serajian (ma.serajian@gmail.com). Please open an Issue on GitHub if you encounter any problems with these instructions.

Citation

This software is under the GNU license. If you use the software, please cite the following paper:

Installation

MTB++ offers two installation methods: Automatic Installation and Manual Installation. If your system supports the "module load" environment, you can use the Automatic Installation; otherwise, Manual Installation is recommended.

Regardless of the installation method used, the following dependencies should be installed first.

Dependencies

  • Python 3.0+ (3.6+ recommended)
  • CMake (tested on v3.26.4)
  • GCC (9.3.3 recommended)

Automatic Installation

Installation Instructions

To simplify the installation process, the provided setup.sh script automates the setup by utilizing the "module load" environment. The script loads essential modules such as GCC and CMake (these modules must be available on your system), compiles SBWT, and verifies the version of scikit-learn. To use the script, follow these steps:

  1. Clone MTB++ and install it:
git clone https://github.com/M-Serajian/MTB-plus-plus.git
cd MTB-plus-plus
sh setup.sh

Manual Installation

If the setup script is not applicable to your system (for example, if your system does not support the "module load" environment), follow these manual installation steps:

  1. Clone MTB++ and enter the project directory:
git clone https://github.com/M-Serajian/MTB-plus-plus.git
cd MTB-plus-plus
  2. Compile and install SBWT_Kmer_Counters as follows:
cd src
git clone https://github.com/M-Serajian/SBWT-kmer-counters.git
cd SBWT-kmer-counters
git submodule update --init --recursive
cd SBWT/build
cmake ..
make -j
  3. Install scikit-learn version 1.1.2:
pip3 install scikit-learn==1.1.2
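
You can verify the installed version with:

python3 -c "import sklearn; print(sklearn.__version__)"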

Now, MTB++ is ready to be used.

Usage

Mtb++.py is located in the MTB-plus-plus directory (the root of the cloned repository).

python Mtb++.py -f FASTAfile -o Output.csv

Example

python Mtb++.py -f data/sample_genomes/ERR8665561.fasta -o ERR8665561.csv
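
To process many isolates, the same command can be looped over a directory of FASTA files. The Python sketch below is one minimal way to do this; the directory names are illustrative, and it assumes the documented Mtb++.py interface:

import pathlib
import subprocess

# Run Mtb++.py once per isolate; each run writes its own CSV report.
fasta_dir = pathlib.Path("data/sample_genomes")   # illustrative input directory
report_dir = pathlib.Path("reports")              # illustrative output directory
report_dir.mkdir(exist_ok=True)

for fasta in sorted(fasta_dir.glob("*.fasta")):
    out_csv = report_dir / (fasta.stem + ".csv")
    subprocess.run(["python", "Mtb++.py", "-f", str(fasta), "-o", str(out_csv)], check=True)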

MTB++ Report Consolidation

When MTB++ is run on a substantial number of isolates, an individual .csv report is generated for each isolate in the location specified by the -o flag of the Mtb++.py script. To streamline and unify this data into comprehensive reports, the MTB++_Report_Consolidation.rb script has been developed.

Purpose

The primary purpose of MTB++_Report_Consolidation.rb is to process the individual CSV files and create two finalized reports:

  1. Logistic Regression Prediction Report: This report consolidates predictions made by the Logistic Regression classifier for each genome.

  2. Random Forest Prediction Report: This report aggregates predictions based on the Random Forest classifier for each genome.

Usage

To use MTB++_Report_Consolidation.rb effectively, run it after executing MTB++ for the isolates. The script consolidates the individual reports found in the directory specified by the -d or --data-directory flag. It identifies all CSV files in that directory and creates two distinct CSV files that offer a comprehensive overview of the predictions made by MTB++ for each isolate.

How to Run

ruby MTB++_Report_Consolidation.rb -d [DATA_DIRECTORY] -o [OUTPUT_DIRECTORY]
  • -d or --data-directory: Specify the directory where all the individual MTB++ reports (CSV files) for each isolate are stored. This parameter is mandatory.

  • -o or --output-directory: (Optional) Specify the directory where you want the unified reports for Logistic Regression and Random Forest predictions to be saved. If not provided, the default is the current directory.
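
The consolidation pattern itself is simple: gather every per-isolate CSV and route each classifier's rows into its own combined file. The Python sketch below illustrates the idea only; it hypothetically assumes each report names the classifier in its first column, which may not match the actual layout produced by Mtb++.py:

import csv
import pathlib

data_dir = pathlib.Path("reports")   # directory of per-isolate CSV reports (illustrative)
consolidated = {"Logistic Regression": [], "Random Forest": []}

for report in sorted(data_dir.glob("*.csv")):
    with open(report, newline="") as handle:
        for row in csv.reader(handle):
            if row and row[0] in consolidated:   # hypothetical first-column classifier label
                consolidated[row[0]].append([report.stem] + row[1:])

for classifier, rows in consolidated.items():
    out_name = classifier.replace(" ", "_") + "_predictions.csv"
    with open(out_name, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)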

MTB++ 31mer Analysis Multi-thread Tool

Purpose

This code reports the number of occurrences of the 31-mers associated with each class of antibiotic.

Usage

  • -h, --help HELP: Show the help message and exit.
  • -i, --I INPUT_FILE: (required) Input resistant_genome_IDs.csv; the header of the CSV file should list the antibiotics (Amikacin, ...).
  • -o, --O OUTPUT_DIR: (required) The output directory where the results will be saved.
  • -b, --B BASE_DIRECTORY: (required) The base directory containing the FASTA files.
  • -f, --F FASTA_EXTENSION: (required) The extension of the FASTA files. Valid FASTA extensions are: fasta, fa, fas, fna, ffn.
  • -t, --temporary-directory TEMPORARY_DIRECTORY: (required) A directory for temporary files. The free space required grows with the number of genomes processed.

How to Run

perl 31mer_analysis -i PATH/to/resistant_genome_IDs.csv -o PATH/to/output_dir -b Base_directory_of_FASTA_Files -f FASTA_extension -t Temporary_directory
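
The core operation behind the tool is plain 31-mer counting. The Python sketch below shows that idea in isolation, independent of the multi-threaded Perl implementation; the function names and the notion of a pre-defined set of resistance-associated 31-mers are illustrative:

K = 31  # the tool works with 31-mers

def read_sequences(fasta_path):
    # Yield one concatenated sequence per FASTA record.
    with open(fasta_path) as handle:
        seq = []
        for line in handle:
            if line.startswith(">"):
                if seq:
                    yield "".join(seq)
                seq = []
            else:
                seq.append(line.strip().upper())
        if seq:
            yield "".join(seq)

def count_kmers(fasta_path, kmers_of_interest):
    # Count occurrences of each 31-mer of interest in one genome.
    counts = dict.fromkeys(kmers_of_interest, 0)
    for seq in read_sequences(fasta_path):
        for i in range(len(seq) - K + 1):
            window = seq[i:i + K]
            if window in counts:
                counts[window] += 1
    return counts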

Classifying Data using MTB++

Below are the instructions to use the classifier. Here, we assume that the data to be classified is available as a set of paired-end sequence reads. In our example, these are reads1.fastq and reads2.fastq.

Dependencies for training classifiers from scratch

Pipeline

The following image demonstrates the data analysis pipeline used in MTB++ model development.

[Figure: MTB++ data analysis pipeline]

Assemble the data into contigs

Use SPAdes to assemble the data. SPAdes takes an output directory via -o and writes the assembled contigs to contigs.fasta inside it:

spades.py -1 reads1.fastq -2 reads2.fastq -o spades_output

Classify the data using the models

Take the assembled contigs (spades_output/contigs.fasta) and make a prediction using the trained models:

python Mtb++.py -f spades_output/contigs.fasta -o prediction.csv

Building the Classifier

Below are the instructions to rebuild the classifier and reproduce our results. If you would just like to use the trained classifier, see above.

Download the raw data

The first step is to download the FASTQ data using the European Nucleotide Archive (ENA) Browser Tools.
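
As an illustration of where the files live (independent of the Browser Tools themselves), the FASTQ URLs for a run accession can be listed through the ENA Portal API. The sketch below is a minimal example using the sample accession from earlier in this README:

import urllib.request

def fastq_urls(run_accession):
    # Query the ENA Portal API for the FASTQ FTP paths of one run.
    query = ("https://www.ebi.ac.uk/ena/portal/api/filereport"
             f"?accession={run_accession}&result=read_run&fields=fastq_ftp")
    with urllib.request.urlopen(query) as response:
        header, row = response.read().decode().splitlines()[:2]
    column = header.split("\t").index("fastq_ftp")
    # fastq_ftp is a ';'-separated list of host paths without a URL scheme.
    return ["https://" + path for path in row.split("\t")[column].split(";") if path]

print(fastq_urls("ERR8665561"))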

Assemble the data into contigs

Use SPAdes to assemble the data:

spades.py -1 reads1.fastq -2 reads2.fastq -o spades_output

Extract and match phenotypic data

Extract the phenotypes from the ENA data and match the identifier numbers here

command line here

Create feature matrix

The first step is to extract the k-mers using the SBWT. fasta_filenames.txt is a text file listing the names of all the FASTA files, one per line.

./sbwt build --in-file fasta_filenames.txt -k 31 -o index.sbwt -t 8 -m 10 --temp-dir temp
./counters index.sbwt fasta_filenames.txt > index_file.txt

From the above commands, you should have an index file output by SBWT (index_file.txt). Next, we transform this index file into a feature matrix that can be used for training.

# The first two variables are illustrative placeholders; set them for your data.
file_number=0                                  # index of the k-mer count file to process
Color_matrix_address="PATH/to/counter_output/" # directory holding the SBWT counter output
Npy_files_address="/blue/boucher/share/Deep_TB_Ali/Final_TB/NPY_Binary_Files_with_index/"
Number_of_Samples=6224
Number_of_kmers_in_file=30000000
min_filter_kmers_occurring_less_than=10   # k-mers occurring fewer than 10 times are ignored
max_filter_kmers_occurring_more_than=3000 # k-mers occurring more than 3000 times are ignored

mkdir -p $Npy_files_address

python projects/MTB-plus-plus/src/Ascii_to_Feature_Matrix/Ascii_to_Matrix.py $file_number $Number_of_Samples \
          $Color_matrix_address $Npy_files_address \
          $Number_of_kmers_in_file \
          $min_filter_kmers_occurring_less_than \
          $max_filter_kmers_occurring_more_than

These commands are also available in a script; see ascii_to_feature.sh. The output is a set of .npy files that we will use in the next step.
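
Conceptually, this step turns the per-genome k-mer counts into a feature matrix and drops k-mers whose occurrence falls outside the filter band. The NumPy sketch below shows that idea on synthetic data; the exact semantics of the real filters (per-genome presence versus total counts) are an assumption here:

import numpy as np

# Synthetic stand-in: rows = k-mers, columns = genomes; counts[i, j] is the
# number of times k-mer i occurs in genome j.
rng = np.random.default_rng(0)
per_kmer_rate = rng.uniform(0.0, 1.0, size=(1000, 1))
counts = rng.poisson(per_kmer_rate, size=(1000, 6224))

presence = (counts > 0).astype(np.uint8)
occurrence = presence.sum(axis=1)      # in how many genomes each k-mer appears

# Keep k-mers occurring in at least 10 and at most 3000 genomes.
keep = (occurrence >= 10) & (occurrence <= 3000)
feature_matrix = presence[keep].T      # genomes x retained k-mers

np.save("feature_matrix.npy", feature_matrix)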

Feature selection

Create five folds of the data to be used for the Chi-squared test and classification.

./mypython.py something.npy > output

Next, we perform the Chi-squared test to rank the features based on their significance here.

./mypython.py something.npy > output

Lastly, we select the top features for each resistance class for training the classifiers here.

./mypython.py something.npy > output
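
With scikit-learn, the folding, ranking, and selection steps can be expressed compactly. The sketch below uses synthetic data and an illustrative number of retained features (k=1000); it shows the technique, not the exact MTB++ scripts:

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-ins: X is a binary k-mer feature matrix (genomes x k-mers),
# y the resistance labels for one antibiotic class.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5000))
y = rng.integers(0, 2, size=200)

# Five stratified folds, reused for both the Chi-squared test and classification.
folds = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y))
train_idx, test_idx = folds[0]

# Rank features on the training split only, then keep the top 1000.
selector = SelectKBest(chi2, k=1000).fit(X[train_idx], y[train_idx])
X_train, X_test = selector.transform(X[train_idx]), selector.transform(X[test_idx])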

Train the Classifiers

The last step is to train the classifiers, both the Logistic Regression and Random Forest classifiers, here.

./mypython.py something.npy > output
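
A minimal scikit-learn sketch of this step, on synthetic data standing in for the selected k-mer features (the hyperparameters are illustrative, not the values used by MTB++):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for the selected features and one class's labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 1000))
y = rng.integers(0, 2, size=200)

logreg = LogisticRegression(max_iter=1000).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Each trained model can then predict resistance for new genomes.
print(logreg.predict(X[:5]), forest.predict(X[:5]))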