
PSU CSE 566 Final Project

This repository was created for the PSU CSE 566 final project.

Project Title:

Assessing some recently published metagenomic profilers for viral and fungal samples

Project Motivation:

Many metagenomic profiling tools (e.g. MetaBinG2, MiCoP and Metalign) have been published in the last few years. Each claims to be the best among existing metagenomic profilers for a specific kind of analysis. However, most of them were compared only on bacterial communities, so the comparisons might be biased against viruses and fungi. To better guide the choice among these tools for analyzing viral or fungal samples, an independent and comprehensive comparison is needed.

Project Methods:

In this evaluation, two more metagenomic profilers (DIAMOND+MEGAN and MetaPhlAn3) are included in addition to the three mentioned above. DIAMOND+MEGAN is included because its accuracy was reasonably good in the comparison reported in the Metalign paper, and MetaPhlAn3 because it is the new version of MetaPhlAn and has not been compared with other profilers before.

In order to have an unbiased comparison and show the features of each tool, I first took the viral and fungal databases from MiCoP and rebuilt the corresponding databases for Metalign and MetaBinG2. I could not build the corresponding database for MetaPhlAn3 because its database is based on marker genes, which are difficult to build in a short time. DIAMOND, in turn, is designed mainly for the NCBI database. In addition, since in practice we cannot design the database for a given metagenomic sample, I also compared the tools using their own databases. Two kinds of data with known abundances were used: simulated data (generated with the CAMISIM software) and mock community data (real data whose candidate microbes and approximate proportions are known). The profilers are assessed on the following five parts:

1. Accuracy of taxon identification at different ranks (Phylum, Class, Order, Family, Genus and Species)
2. Accuracy of abundance estimation
3. Robustness under the influence of unknown organisms
4. Speed
5. Memory usage

Parts 1 and 2 were evaluated with OPAL, a software tool from the CAMI competition.
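
To make Parts 1 and 2 concrete, here is a minimal sketch of the kind of metrics involved: precision and recall for taxon identification, and the L1 norm error for abundance estimation, all at a single rank. It is a simplified illustration and not OPAL's implementation. The function names, the dictionary-based profile format and the taxon names are assumptions made for this example.

```python
def precision_recall(gold, pred):
    """Presence/absence accuracy at one rank.

    gold, pred: dicts mapping taxon name -> relative abundance
    (hypothetical format chosen for this sketch).
    """
    gold_taxa = {t for t, a in gold.items() if a > 0}
    pred_taxa = {t for t, a in pred.items() if a > 0}
    true_pos = len(gold_taxa & pred_taxa)
    precision = true_pos / len(pred_taxa) if pred_taxa else 0.0
    recall = true_pos / len(gold_taxa) if gold_taxa else 0.0
    return precision, recall


def l1_error(gold, pred):
    """Sum of absolute abundance differences over all taxa seen in either profile."""
    taxa = set(gold) | set(pred)
    return sum(abs(gold.get(t, 0.0) - pred.get(t, 0.0)) for t in taxa)


# Made-up profiles: taxon_D is a false positive, taxon_C a false negative.
gold = {"taxon_A": 0.5, "taxon_B": 0.3, "taxon_C": 0.2}
pred = {"taxon_A": 0.6, "taxon_B": 0.2, "taxon_D": 0.2}

precision, recall = precision_recall(gold, pred)
error = l1_error(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} L1={error:.3f}")
```

When both profiles sum to 1, the L1 norm error ranges from 0 (identical profiles) to 2 (no shared taxa), so lower is better.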

How to run?

This project can be executed by running the shell commands in the bash script main_run.sh. Because CAMISIM needs Python 2.7 while the other software runs with Python 3.7, the script cannot be run as a whole. Instead, follow the instructions to run it step by step.

Results:

All result files are stored in data/CAMI_OPAL/results. The log files for checking speed and memory usage can be found in data/run_data.
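
How the speed and memory logs were produced is not described here. As a rough sketch of one way to record such numbers on a Unix system, a command can be run as a child process while its wall-clock time and peak resident memory are collected. The helper name `run_and_measure` is hypothetical, and the example command is a placeholder rather than one of the profilers.

```python
import resource
import subprocess
import sys
import time


def run_and_measure(cmd):
    """Run cmd and return (wall_seconds, peak_rss).

    peak_rss is the largest resident set size among all child processes
    waited for so far by this Python process. It is reported in kilobytes
    on Linux and in bytes on macOS. Unix only (uses the resource module).
    """
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    elapsed = time.perf_counter() - start
    peak_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, peak_rss


if __name__ == "__main__":
    # Placeholder command: a trivial Python child process.
    seconds, peak = run_and_measure([sys.executable, "-c", "print('done')"])
    print(f"wall time: {seconds:.2f} s, peak RSS: {peak}")
```

Because `RUSAGE_CHILDREN` reports a running maximum across all children, each tool should be measured from a fresh Python process so the memory figures stay comparable.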
