Improve the help message, remove duplicate option descriptions, and clarify options with the same flag #18

Open
nh13 opened this issue Aug 14, 2020 · 2 comments

Comments

@nh13

nh13 commented Aug 14, 2020

Four suggestions based off of 8c24fb1:

  1. The "-h" and "--help" options produce a strange error. If they aren't implemented, shouldn't the tool complain that "no such option exists" rather than print the following?

The index directory -h does not seem to exist

  2. The help message has duplicate entries, for example:
        -i, --index <index>
                    directory where the pufferfish index is stored
  3. The help message lists two options with the same flag:
        -t, --threads <num threads>
                    Specify the number of threads (default=8)
...
        -t, --type <statType>
                    statType (options:ctab)
  4. Can you add the default/current value to each option? For example, I assume that allowSoftclip is off by default (end-to-end alignment), but it would help if the option's help text said so explicitly:
      --allowSoftclip
                    Allow soft-clipping at start and end of alignments (default=off)
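
For what it's worth, the duplicate entries and the reused "-t" flag can be found mechanically. Below is a minimal Python sketch, not pufferfish code and not clipp's API. The `OPTIONS` table is a hypothetical, hand-copied subset of the help output in this issue, and the subcommand assigned to each entry is my inference from the order of the listing. The function names are made up for illustration.

```python
from collections import defaultdict

# (subcommand, short flag, long flag, description).
# Hand-copied subset of the help output; the subcommand column is an
# assumption inferred from the order in which the options are printed.
OPTIONS = [
    ("validate", "-i", "--index", "directory where the pufferfish index is stored"),
    ("lookup", "-i", "--index", "directory where the pufferfish index is stored"),
    ("align", "-t", "--threads", "Specify the number of threads (default=8)"),
    ("stat", "-t", "--type", "statType (options:ctab)"),
    ("stat", "-i", "--index", "directory where the pufferfish index is stored"),
]


def duplicate_entries(options):
    """Map each identical (short, long, description) entry to the
    subcommands that print it, keeping only entries printed more than once."""
    seen = defaultdict(list)
    for subcommand, short, long_name, description in options:
        seen[(short, long_name, description)].append(subcommand)
    return {entry: subs for entry, subs in seen.items() if len(subs) > 1}


def conflicting_short_flags(options):
    """Map each short flag to its long names when it has more than one."""
    long_names = defaultdict(set)
    for _subcommand, short, long_name, _description in options:
        long_names[short].add(long_name)
    return {
        short: sorted(names)
        for short, names in long_names.items()
        if len(names) > 1
    }


if __name__ == "__main__":
    print(duplicate_entries(OPTIONS))
    print(conflicting_short_flags(OPTIONS))
```

A merged help page could print each duplicated entry once, and label each conflicting flag with the subcommand it belongs to.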

Full output of the help message:

pufferfish/build/src/pufferfish align -h


Parsing command line failed with exception: The index directory -h does not seem to exist.


SYNOPSIS
        pufferfish index -r <ref_file>... -o <output_dir> [--headerSep <sep_strs>] [--keepFixedFasta] [--keepDuplicates] [-d <decoy_list>] [-f <filt_size>] [--tmpdir <twopaco_tmp_dir>] [-k <kmer_length>] [-p <threads>] [-l] [-q] [-s] [-e <extension_size>] [-v]
        pufferfish index -r <ref_file>... -o <output_dir> [--headerSep <sep_strs>] [--keepFixedFasta] [--keepDuplicates] [-d <decoy_list>] [-f <filt_size>] [--tmpdir <twopaco_tmp_dir>] [-k <kmer_length>] [-p <threads>] [-l] [-q] [-x <lossy_rate>] [-v]
        pufferfish validate -i <index> [-v]
        pufferfish lookup -i <index> -r <ref> [-v]
        pufferfish align -i <index> --mate1 <mate 1> --mate2 <mate 2> [-b] [--coverageScoreRatio <score ratio>] [-t <num threads>] [-m] (--noOutput | (-o <output file>)) [--allowSoftclip] [--allowOverhangSoftclip] [--maxSpliceGap <max splice gap>] [--maxFragmentLength <max frag length>] [--noOrphans] [--orphanRecovery] [--noDiscordant] [--noDovetail] [-z] [-k|-p] [--verbose] [--fullAlignment] [--heuristicChaining] [--bestStrata] [--genomicReads] [--primaryAlignment] [--filterGenomics <genes names file>] [--filterBestScoreMicrobiome <genes ID file>] [--filterMicrobiome <genes ID file>] [--bt2DefaultThreshold] [--minScoreFraction <minScoreFraction>] [--consensusFraction <consensus fraction>] [--noAlignmentCache] [-v]
        pufferfish align -i <index> --read <reads> [-b] [--coverageScoreRatio <score ratio>] [-t <num threads>] [-m] (--noOutput | (-o <output file>)) [--allowSoftclip] [--allowOverhangSoftclip] [--maxSpliceGap <max splice gap>] [--maxFragmentLength <max frag length>] [--noOrphans] [--orphanRecovery] [--noDiscordant] [--noDovetail] [-z] [-k|-p] [--verbose] [--fullAlignment] [--heuristicChaining] [--bestStrata] [--genomicReads] [--primaryAlignment] [--filterGenomics <genes names file>] [--filterBestScoreMicrobiome <genes ID file>] [--filterMicrobiome <genes ID file>] [--bt2DefaultThreshold] [--minScoreFraction <minScoreFraction>] [--consensusFraction <consensus fraction>] [--noAlignmentCache] [-v]
        pufferfish examine -i <index> [--dump-fasta <fasta_out>] [--dump-kmer-freq <kmer_freq_out>] [-v]
        pufferfish stat [-t <statType>] -i <index> [-v]
        pufferfish help [-v]

OPTIONS
        -r, --ref <ref_file>
                    path to the reference fasta file

        -o, --output <output_dir>
                    directory where index is written

        --headerSep <sep_strs>
                    Instead of a space or tab, break the header at the first occurrence of this string, and name the transcript as the token before the first separator (default = space & tab)

        --keepFixedFasta
                    Retain the fixed fasta file (without short transcripts and duplicates, clipped, etc.) generated during indexing

        --keepDuplicates
                    Retain duplicate references in the input

        -d, --decoys <decoy_list>
                    Treat these sequences as decoys that may be sequence-similar to some known indexed reference

        -f, --filt-size <filt_size>
                    filter size to pass to TwoPaCo when building the reference dBG

        --tmpdir <twopaco_tmp_dir>
                    temporary work directory to pass to TwoPaCo when building the reference dBG

        -k, --klen <kmer_length>
                    length of the k-mer with which the dBG was built (default = 31)

        -p, --threads <threads>
                    total number of threads to use for building MPHF (default = 16)

        -l, --build-edges
                    build and record explicit edge table for the contaigs of the ccdBG (default = false)

        -q, --build-eqclses
                    build and record equivalence classes (default = false)

        -s, --sparse
                    use the sparse pufferfish index (less space, but slower lookup)

        -e, --extension <extension_size>
                    length of the extension to store in the sparse index (default = 4)

        <lossy_rate>
                    use the lossy sampling index with a sampling rate of x (less space and fast, but lower sensitivity)

        -i, --index <index>
                    directory where the pufferfish index is stored

        -i, --index <index>
                    directory where the pufferfish index is stored

        -r, --ref <ref>
                    fasta file with reference sequences

        -i, --index <index>
                    Directory where the Pufferfish index is stored

        --mate1, -1 <mate 1>
                    Path to the left end of the read files

        --mate2, -2 <mate 2>
                    Path to the right end of the read files

        --read <reads>
                    Path to single-end read files

        -b, --batchOfReads
                    Is each input a file containing the list of reads? (default=false)

        --coverageScoreRatio <score ratio>
                    Discard mappings with a coverage score < scoreRatio * OPT (default=0.6)

        -t, --threads <num threads>
                    Specify the number of threads (default=8)

        -m, --just-mapping
                    don't attempt alignment validation; just do mapping

        --noOutput  Run without writing SAM file

        -o, --outdir <output file>
                    Output file where the alignment results will be stored

        --allowSoftclip
                    Allow soft-clipping at start and end of alignments

        --allowOverhangSoftclip
                    Allow soft-clipping part of a read that overhangs the reference (the regular --allowSoftclip flag overrides this one)

        --maxSpliceGap <max splice gap>
                    Specify maximum splice gap that two uni-MEMs should have

        --maxFragmentLength <max frag length>
                    Specify the maximum distance between the last uni-MEM of the left and first uni-MEM of the right end of the read pairs (default:1000)

        --noOrphans Write Orphans flag

        --orphanRecovery
                    Recover mappings for the other end of orphans using alignment

        --noDiscordant
                    Write Orphans flag

        --noDovetail
                    Disallow dovetail alignment for paired end reads

        -z, --compressedOutput
                    Compress (gzip) the output file

        -k, --krakOut
                    Write output in the format required for krakMap

        -p, --pam   Write output in the format required for salmon
        --verbose   Print out auxilary information to trace program's flow

        --fullAlignment
                    Perform full alignment instead of gapped alignment

        --heuristicChaining
                    Whether or not perform only 2 rounds of chaining

        --bestStrata
                    Keep only the alignments with the best score for each read

        --genomicReads
                    Align genomic dna-seq reads instead of RNA-seq reads

        --primaryAlignment
                    Report at most one alignment per read

        --filterGenomics <genes names file>
                    Path to the file containing gene IDs. Filters alignments to the IDs listed in the file. Used to filter genomic reads while aligning to both genome and transcriptome.A read will be reported with only the valid gene ID alignments and will be discarded if the best alignment is to an invalid IDThe IDs are the same as the IDs in the fasta file provided for the index construction phase

        --filterBestScoreMicrobiome <genes ID file>
                    Path to the file containing gene IDs. Same as option "filterGenomics" except that a read will be discarded if aligned equally best to a valid and invalid gene ID.

        --filterMicrobiome <genes ID file>
                    Path to the file containing gene IDs. Same as option "filterGenomics" except that a read will be discarded if an invalid gene ID is in the list of alignments.

        --bt2DefaultThreshold
                    mimic the default threshold function of Bowtie2 which is t = -0.6 -0.6 * read_len

        --minScoreFraction <minScoreFraction>
                    Discard alignments with alignment score < minScoreFraction * max_alignment_score for that read (default=0.65)

        --consensusFraction <consensus fraction>
                    The fraction of mems, relative to the reference with the maximum number of mems, that a reference must contain in order to move forward with computing an optimal chain score (default=0.65)

        --noAlignmentCache
                    Do not use the alignment cache during the alignment.

        -i, --index <index>
                    pufferfish index directory

        --dump-fasta <fasta_out>
                    dump the reference sequences in the index in the provided fasta file

        --dump-kmer-freq <kmer_freq_out>
                    dump the frequency histogram of k-mers

        -t, --type <statType>
                    statType (options:ctab)

        -i, --index <index>
                    directory where the pufferfish index is stored

        -v, --version

@fataltes
Collaborator

Dear Nils (@nh13),

Thank you for your quick test of PuffAligner.

The items you point out in this issue definitely require our immediate attention so that the tool's help text is easy for users to interpret. The biggest problem is that the help is generated automatically by clipp, the argument parsing library we use, and its default help output exhibits the issues you raise regarding duplicate options. We're looking into whether there is a way to fix this within clipp, and may otherwise consider changing the argument parser we use.

That being said, we hope to have this resolved one way or the other in a few days. We are currently merging the cigar-strings branch, where PuffAligner lives, with the develop branch (which we use as an external library for selective alignment in salmon). We anticipate it will take a few days to test and confirm that neither the specific optimizations in the cigar-strings branch nor any merge conflicts change performance or accuracy. After that, we will continue resolving these issues on the (merged) develop branch. We will ping back here when the develop branch is updated with the improved help messages and options.

Thank you again!

@hermidalc

I would also suggest that running, e.g., pufferfish index --help show only the index help and its relevant options, as other command-line programs already do for subcommand help. This would probably also reduce the number of duplicate options seen in the help.
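
To illustrate the behavior being asked for, here is a minimal sketch using Python's standard argparse module. It is not how pufferfish or clipp works. The program name "pufferfish-sketch" is hypothetical, the option set is a small subset borrowed from the help text above, and the "ctab" default for --type is an assumption (the issue only lists it as an option). Each subcommand owns its options, so "-t" can mean different things in align and stat without clashing, and defaults appear in the help text.

```python
import argparse


def build_parser():
    """Build a toy CLI with two subcommands, each owning its own options."""
    parser = argparse.ArgumentParser(prog="pufferfish-sketch")
    commands = parser.add_subparsers(dest="command", required=True)

    align = commands.add_parser("align", help="align reads to an index")
    align.add_argument("-i", "--index", required=True,
                       help="directory where the index is stored")
    # %(default)s is expanded by argparse from the default given here, so the
    # help text and the actual default cannot drift apart.
    align.add_argument("-t", "--threads", type=int, default=8,
                       help="number of threads (default=%(default)s)")
    align.add_argument("--allowSoftclip", action="store_true",
                       help="allow soft-clipping at start and end of "
                            "alignments (default=off)")

    stat = commands.add_parser("stat", help="report index statistics")
    stat.add_argument("-i", "--index", required=True,
                      help="directory where the index is stored")
    # -t is reused without ambiguity because it belongs to another subcommand.
    # The "ctab" default is an assumption made for this sketch.
    stat.add_argument("-t", "--type", choices=["ctab"], default="ctab",
                      help="statistic to report (default=%(default)s)")

    # Return the subparsers too, so each one's help can be rendered on its own.
    return parser, {"align": align, "stat": stat}


if __name__ == "__main__":
    parser, subparsers = build_parser()
    print(subparsers["align"].format_help())
```

With this layout, "pufferfish-sketch align -h" prints only the align options and exits cleanly, and an unknown option is reported as an error rather than being treated as an index path.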
