rdpstaff/Xander-HMMgs

Using HMMgs:
    See detailed step-by-step instructions in Xander_assembler repository (https://github.com/rdpstaff/Xander_assembler)

Build - Build a de Bruijn graph from a set of reads
	java -jar hmmgs.jar build <read_file> <bloom_out> <kmerSize> <bloomSizeLog2> [cutoff = 2] [# hashCount = 4] [bitsetSizeLog2 = 30]
        read_file
             fasta or fastq files containing the reads to build the graph from 
        bloom_out
             file to write the bloom filter to 
        kmerSize
            should be a multiple of 3 (recommended 45, minimum 30, maximum 63)
        bloomSizeLog2
            the size of the bloom filter (or memory needed) is 2^bloomSizeLog2 bits; increase it if the predicted false positive rate is greater than 1%
        cutoff
            minimum number of times a kmer has to be observed in read_file to be included in the final bloom filter
        hashCount
            number of hash functions, recommend 4
        bitsetSizeLog2
            the size of one bitSet is 2^bitsetSizeLog2 (recommended 30)

    Bloom filter stats, such as the predicted false positive rate, are written to stdout.
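
As a rough sizing aid, the arithmetic behind bloomSizeLog2 can be sketched in shell. The function names below are hypothetical helpers, not part of hmmgs. The false positive estimate uses the textbook Bloom filter approximation (1 - e^(-h*n/m))^h and assumes you already have an estimate of the number of distinct kmers; the figure hmmgs prints to stdout may be computed differently and should be treated as authoritative.

```shell
# Bytes needed for a filter of 2^bloomSizeLog2 bits (8 bits per byte).
# Usage: bloom_bytes <bloomSizeLog2>
bloom_bytes() {
    echo $(( 1 << ($1 - 3) ))
}

# The same figure expressed in MiB.
bloom_mib() {
    echo $(( $(bloom_bytes "$1") / 1024 / 1024 ))
}

# Textbook false positive estimate (an assumption, not hmmgs's own code).
# Usage: bloom_fpr <distinct_kmers> <bloomSizeLog2> <hashCount>
bloom_fpr() {
    awk -v n="$1" -v b="$2" -v h="$3" \
        'BEGIN { m = 2 ^ b; printf "%.6f\n", (1 - exp(-h * n / m)) ^ h }'
}

bloom_mib 30    # prints 128
bloom_mib 36    # prints 8192
bloom_fpr 1000000000 36 4
```

If the estimate (or the rate reported by hmmgs build) is above 0.01, raising bloomSizeLog2 by one doubles the filter size.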

Search - Perform local assembly starting at the given start points in a given de Bruijn graph
	Output files are <kmers>_nucl.fasta and <kmers>_prot.fasta; search stats are written to stdout
    java -jar hmmgs.jar search [-h] [-u] [-p <n_nodes>] <k> <limit_in_seconds> <bloom_filter> <for_hmm> <rev_hmm> <kmers>
        -u
            don't normalize the hmm input
        -p  n_nodes 
            prune the search if the score does not improve after n_nodes (default 20, set to 0 to disable pruning)
        k
            number of best local assemblies to return for each kmer
        limit_in_seconds
            time limit for individual searches (conservative suggestion = 100)
        bloom_filter
            bloom filter built using hmmgs build
        for_hmm, rev_hmm
            hidden markov models, HMMER3 format
        kmers
            starting points (can use KmerFilter's fast_kmer_filter to identify starting points)
        [#threads] experimental, suggested 1 (not thoroughly tested)
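
To show how the positional arguments of build and search line up, here is a minimal dry-run sketch. All file names and the bloomSizeLog2 value of 36 are hypothetical placeholders; only the argument order is taken from the usage lines above. Note that the k passed to search is the number of assemblies to return per starting kmer, not the kmer size used in build.

```shell
# Hypothetical file names; substitute your own.
READS=reads.fasta
BLOOM=reads.bloom
FOR_HMM=nifh_for.hmm
REV_HMM=nifh_rev.hmm
STARTS=nifh_starts.txt

# Print the build and search commands in order without running them.
# Remove the leading "echo" on each line to execute for real.
hmmgs_dry_run() {
    # build: kmerSize 45, bloomSizeLog2 36, optional arguments left at defaults
    echo java -jar hmmgs.jar build "$READS" "$BLOOM" 45 36
    # search: prune after 20 nodes, k = 1 assembly per kmer, 100 second limit
    echo java -jar hmmgs.jar search -p 20 1 100 "$BLOOM" "$FOR_HMM" "$REV_HMM" "$STARTS"
}

hmmgs_dry_run
```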

Merge - Merge the left and right contigs generated by hmmgs search
	java -jar hmmgs.jar merge [options] <hmm> <hmmgs_file> <nucl_contig>
        -a,--all                Generate all combinations for multiple paths for each starting kmer, instead of just the best
        -b,--min-bits <arg>     Minimum bits score
        -l,--min-length <arg>   Minimum length
        -o,--out <arg>          Write output to file instead of stdout

KmerFilter:
	fast_kmer_filter - search a set of reads against a set of reference sequences to identify starting points for assembly
	java -jar KmerFilter.jar fast_kmer_filter <kmerSize> <query_file> [name=]<ref_file> ...
        -a,--aligned              Build trie from aligned sequences
        -o,--out <arg>            Redirect output to file
        -T,--transl-table <arg>   Translation table to use when translating
                                  nucleotide to protein sequences
        -t,--threads <arg>        #Threads to use

         <kmerSize> kmer length, should be a multiple of 3 (recommended 45, minimum 30, maximum 63)
         <query_file> read file to search for starting points in (use the same fasta file used to build the De Bruijn Graph)
         <ref_file> one or more aligned reference files (aligned using the same HMM that will be used to search), each with an optional reference name (e.g. nifh=my_nifh_refs_aligned.fasta)
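
Because the same kmerSize has to satisfy the constraints listed for both build and fast_kmer_filter, it can help to check it once before starting a long run. The helper below is a hypothetical sketch that encodes only the limits stated in this README (multiple of 3, between 30 and 63); it is not part of hmmgs or KmerFilter.

```shell
# Return 0 if the kmer size meets the documented constraints, 1 otherwise.
# Usage: check_kmer_size <kmerSize>
check_kmer_size() {
    k=$1
    if [ "$k" -lt 30 ] || [ "$k" -gt 63 ]; then
        echo "kmerSize $k is outside the documented range 30-63" >&2
        return 1
    fi
    if [ $(( k % 3 )) -ne 0 ]; then
        echo "kmerSize $k is not a multiple of 3" >&2
        return 1
    fi
    return 0
}

check_kmer_size 45 && echo "kmerSize 45 is valid"
```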


Other uses:
     HMMgs can also be used to extract subgraphs from starting points, instead of contigs, for further analysis (see edu.msu.cme.rdp.graph.GraphSearch)
     HMMgs can also be used to compute base coverage for contigs (generated by hmmgs or other programs) (see edu.msu.cme.rdp.graph.abundance.ReadKmerMapper and base_coverage.py)

NOTES:
     When using fast_kmer_filter to identify start points, there are two things to be aware of:
       1. While the Bloom Filter Builder allows any k-size (hmmgs requires a k divisible by 3, however), fast_kmer_filter requires k <= 63.
       2. fast_kmer_filter can search for the starting points of multiple genes at the same time (each gene would otherwise require its own scan over the read file, so it is faster to do every gene at once). The output file is therefore multiplexed and must be demultiplexed before it is used in hmmgs search. This can be done with the following command:
          grep 'gene_name' <multiplexed_starts_file> | cut -f2- > <demultiplexed_gene_start_points>
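
When several genes were searched at once, the grep and cut command above can be wrapped in a loop. This is a minimal sketch: the function name and the <gene>_starts.txt output naming are assumptions made for illustration, and, like the original command, it matches the gene name anywhere on the line and relies on tab-separated fields.

```shell
# Split a multiplexed fast_kmer_filter output into one start-point file per gene.
# Usage: demultiplex_starts <multiplexed_starts_file> <gene_name>...
demultiplex_starts() {
    multiplexed=$1
    shift
    for gene in "$@"; do
        # Same pipeline as the documented command: keep lines mentioning the
        # gene, then drop the first tab-separated field.
        grep "$gene" "$multiplexed" | cut -f2- > "${gene}_starts.txt"
    done
}

# Example call (hypothetical file and gene names):
#   demultiplex_starts all_starts.txt nifh rplb
```

Each resulting file can then be passed as the <kmers> argument to hmmgs search.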