Hierarchical Clustering of DNA Sequences

Sparc_upcxx is a re-implementation that uses UPC++. It performs similarly to, or even better than, sparc_mpi. However, running a UPC++ program can be tricky because the data transfer is hidden from the user. I have occasionally seen failures when many RPC calls were issued asynchronously.

When running the programs, make sure there are enough nodes to hold all the data in memory. Some programs can store temporary data on disk, but doing so makes them much slower.

Example Runs

Sbatch scripts for sample runs on LAWRENCIUM are in the misc/example folder.

Installation

First, clone the code:

    git clone https://github.com/Lizhen0909/sparc-mpi.git
    cd sparc-mpi && git submodule update --init --recursive

All the versions are independent of each other, so you may build only the version you are interested in.

Usages

Similar to Sparc, given a sequence file, the analysis flow consists of 4 steps. For each program, use the "-h" or "--help" option to get help.

Kmer Counting (optional)

Find the kmer counting profile of the data so that we can decide how to filter out "bad" kmers. In this step we run kmer_counting_$SURFIX, where $SURFIX is one of mrmpi, mimir, mpi or upcxx.

For example, for the mpi version:

    $./kmer_counting_mpi -h
    
    -h, --help
    shows this help message
    -i, --input
    input folder which contains read sequences
    -p, --port
    port number
    -z, --zip
    zip output files
    -k, --kmer-length
    length of kmer
    -o, --output
    output folder
    --without-canonical-kmer
    do not use canonical kmer
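
To make the idea of a kmer counting profile concrete, here is a minimal single-process Python sketch. It is not the code of kmer_counting_mpi; the function names and the toy reads are hypothetical. It assumes that a canonical kmer is the lexicographically smaller of a kmer and its reverse complement, which is the usual convention but is not confirmed by this README.

```python
from collections import Counter

# Complement table for DNA bases (N is kept as N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def canonical(kmer):
    """Return the smaller of a kmer and its reverse complement (assumed convention)."""
    revcomp = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, revcomp)


def count_kmers(reads, k, use_canonical=True):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            counts[canonical(kmer) if use_canonical else kmer] += 1
    return counts


def count_profile(counts):
    """Histogram: how many distinct kmers occur exactly n times."""
    return Counter(counts.values())


reads = ["ACGTAC", "GTACGT"]  # hypothetical toy reads
counts = count_kmers(reads, k=3)
profile = count_profile(counts)
```

A profile like this is what one would inspect to pick count thresholds: kmers that occur very rarely are often sequencing errors, and kmers that occur extremely often are often repeats.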

Kmer-Reads-Mapping

Find the reads that share each kmer using kmer_read_mapping_$SURFIX, where $SURFIX is one of mrmpi, mimir, mpi or upcxx.

For example, for the mpi version:

    $./kmer_read_mapping_mpi -h
    
    -h, --help
    shows this help message
    -i, --input
    input folder which contains read sequences
    -p, --port
    port number
    -z, --zip
    zip output files
    -k, --kmer-length
    length of kmer
    -o, --output
    output folder
    --without-canonical-kmer
    do not use canonical kmer
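
The mapping step can be pictured as building an inverted index from kmers to the reads that contain them. The following Python sketch illustrates that idea on one machine; it is an assumption about what the step computes, not the implementation or output format of kmer_read_mapping_mpi, and all names and reads in it are hypothetical.

```python
from collections import defaultdict

# Complement table for DNA bases (N is kept as N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def canonical(kmer):
    """Smaller of a kmer and its reverse complement (assumed convention)."""
    return min(kmer, kmer.translate(_COMPLEMENT)[::-1])


def map_kmers_to_reads(reads, k):
    """reads: dict of read id -> sequence.

    Returns a dict of canonical kmer -> sorted list of ids of reads containing it.
    """
    mapping = defaultdict(set)
    for read_id, seq in reads.items():
        for i in range(len(seq) - k + 1):
            mapping[canonical(seq[i:i + k])].add(read_id)
    return {kmer: sorted(ids) for kmer, ids in mapping.items()}


reads = {0: "ACGTT", 1: "TTACG", 2: "GGGCC"}  # hypothetical toy reads
kmer_reads = map_kmers_to_reads(reads, k=3)
```

Here reads 0 and 1 both contain the kmer ACG, so they end up in the same list, while read 2 shares no kmer with the others.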

Edge Generating

Generate graph edges using edge_generating_$SURFIX, where $SURFIX is one of mrmpi, mimir, mpi or upcxx.

For example, for the mpi version:

    $./edge_generating_mpi -h
    
    -h, --help
    shows this help message
    -i, --input
    input folder which contains read sequences
    -p, --port
    port number
    -z, --zip
    zip output files
    -o, --output
    output folder
    --max-degree
    max_degree of a node; max_degree should be greater than 1
    --min-shared-kmers
    minimum number of kmers that two reads share. (note: this option does not work)
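
As a rough illustration of how a read graph can be derived from a kmer-to-reads mapping, the Python sketch below connects two reads whenever they share a kmer and uses the number of shared kmers as the edge weight. This is an assumed reading of the step, not the code of edge_generating_mpi. The `min_shared_kmers` parameter only illustrates the intended meaning of the --min-shared-kmers option, and --max-degree is not modeled because the README does not say how it is applied.

```python
from collections import Counter
from itertools import combinations


def generate_edges(kmer_reads, min_shared_kmers=1):
    """kmer_reads: dict of kmer -> list of read ids containing that kmer.

    Returns a dict of (read_a, read_b) -> number of shared kmers, with read_a < read_b.
    """
    weights = Counter()
    for read_ids in kmer_reads.values():
        # Every pair of reads that contains this kmer shares one more kmer.
        for a, b in combinations(sorted(set(read_ids)), 2):
            weights[(a, b)] += 1
    # Keep only pairs that share enough kmers (hypothetical filter).
    return {edge: w for edge, w in weights.items() if w >= min_shared_kmers}


# Hypothetical toy mapping, in the shape a kmer-reads-mapping step might produce.
kmer_reads = {"ACG": [0, 1], "GTA": [0, 1, 2], "CCC": [2]}
edges = generate_edges(kmer_reads)
strong_edges = generate_edges(kmer_reads, min_shared_kmers=2)
```

Note that a kmer present in n reads contributes n(n-1)/2 pairs, which is one reason very frequent kmers are worth filtering out before this step.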

Clustering

Use lpav1_upcxx to do graph clustering.
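
The README does not describe the algorithm behind lpav1_upcxx. Assuming the name refers to a label propagation algorithm (an assumption, not something stated here), the following single-process Python sketch shows the general technique on a weighted read graph. The update order and the tie-breaking rule are illustrative choices, and every name in the sketch is hypothetical.

```python
from collections import defaultdict


def label_propagation(edges, max_iters=20):
    """edges: dict of (a, b) -> weight. Returns a dict of node -> cluster label."""
    neighbors = defaultdict(dict)
    for (a, b), w in edges.items():
        neighbors[a][b] = w
        neighbors[b][a] = w

    # Every node starts in its own cluster.
    labels = {node: node for node in neighbors}

    for _ in range(max_iters):
        changed = False
        for node in sorted(neighbors):  # fixed order keeps the sketch deterministic
            score = defaultdict(int)
            for nbr, w in neighbors[node].items():
                score[labels[nbr]] += w
            # Adopt the heaviest neighbor label; ties go to the smallest label.
            best = min(score, key=lambda lab: (-score[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


# Hypothetical toy graph: a triangle of reads 0, 1, 2 and a separate pair 3, 4.
edges = {(0, 1): 2, (0, 2): 1, (1, 2): 1, (3, 4): 3}
clusters = label_propagation(edges)
```

On this toy graph the triangle collapses into one cluster and the separate pair into another, because no edge connects the two groups.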
