ponnhide/flashpy

flashpy

flashpy is a Python module for merging paired-end reads generated by high-throughput DNA sequencing systems such as Illumina MiSeq, HiSeq, and NovaSeq. It reimplements the algorithm of FLASH (https://github.com/ebiggers/flash) in Cython, so it runs fast: merging 10,000 sequence pairs takes about 1 second.

Installation

  1. Run python setup.py build_ext --inplace in the cloned repository.
  2. Add the directory where you cloned the repository to PYTHONPATH.

Usage

flashpy provides only two functions: merge and flash. The merge function merges a single pair of reads. The flash function applies merge to every read pair in the given pair of FASTQ files.

  • merge(seq1=None, seq2=None, score1=None, score2=None, min_overlap=50, max_overlap=300, allow_outies=True, min_identity=0.5, max_identity=1.0)
    Merge a single sequence pair, seq1 and seq2.

    • seq1: str
      The DNA sequence.
    • seq2: str
      The DNA sequence paired with the seq1.
    • score1: list of int
      The quality values for seq1. The values must already be decoded from the FASTQ ASCII characters into integers, and the list must contain one value per base of seq1.
    • score2: list of int
      The quality values for seq2. The values must already be decoded from the FASTQ ASCII characters into integers, and the list must contain one value per base of seq2.
    • min_overlap: int
      The minimum overlap length between two sequences, seq1 and seq2.
    • max_overlap: int
      The maximum overlap length between two sequences, seq1 and seq2.
    • allow_outies: bool
      If True, also try to combine seq1 and seq2 in the "outie" orientation.
    • min_identity: float
      The minimum allowed sequence identity between the overlapping regions of seq1 and seq2.
    • max_identity: float
      If the identity of an overlapping region exceeds max_identity, the function stops searching and returns the result based on that region, even if better overlapping regions remain at other positions.

    return
    merged_sequence (str), merged_score (list of int), identity (float), overlap_length (int), overlap_direction ("innie" or "outie")
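
    Because score1 and score2 must be integer lists rather than raw FASTQ quality strings, a small decoding step is needed before calling merge. The sketch below assumes the common Phred+33 encoding (an assumption; older Illumina data may use an offset of 64). decode_quality is a hypothetical helper, not part of flashpy, and the commented-out import path is also an assumption about how the built module is exposed.

```python
def decode_quality(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of integer scores.

    offset=33 assumes Phred+33 encoding; this is an assumption about the
    input data, not something flashpy prescribes.
    """
    return [ord(char) - offset for char in quality_string]


# Toy read: the sequence and its quality string must have the same length.
seq1 = "ACGTACGT"
qual1 = "IIIIHHGG"
score1 = decode_quality(qual1)
assert len(score1) == len(seq1)

# With flashpy built and on PYTHONPATH, the decoded lists would be passed
# positionally in the order given by the signature above (import path assumed):
# from flashpy import merge
# result = merge(seq1, seq2, score1, decode_quality(qual2))
```

    Keeping the decoding outside of merge makes the offset explicit, which helps when a data set mixes files from different instrument generations.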

  • flash(read1=None, read2=None, min_overlap=50, max_overlap=300, allow_outies=True, min_identity=0.5, max_identity=1.0, show_progress=True, key_check=True)
    Merge all read pairs in a pair of FASTQ files.

    • read1: str
      Path to the FASTQ file.
    • read2: str
      Path to the FASTQ file paired with read1.
    • min_overlap: int
      Same as min_overlap of merge. The value is applied to all sequence pairs.
    • max_overlap: int
      Same as max_overlap of merge. The value is applied to all sequence pairs.
    • allow_outies: bool
      Same as allow_outies of merge. The value is applied to all sequence pairs.
    • min_identity: float
      Same as min_identity of merge. The value is applied to all sequence pairs.
    • max_identity: float
      Same as max_identity of merge. The value is applied to all sequence pairs.
    • show_progress: bool
      If True, display a progress bar for the operation.
    • key_check: bool
      If True, for each sequence key in read2, the function checks whether the same sequence key exists in read1.

    return
    merged_reads, overlap_distributions

    • merged_reads: dict

       {*key1* (common sequence key of *r1_key* and *r2_key*): 
       	{"r1_key"  : Original sequence key in read1, 
       	 "r2_key"  : Original sequence key in read2 paired with *r1_key*, 
       	 "seq"     : Merged sequence,
       	 "quality" : Merged score,
       	 "identity": Sequence identity of the overlapping region}
        *key2*: ...,
        ...
        }
      
    • overlap_distributions: dict

      {*key1* (("innie" or "outie", *overlap_length*)): Number of paired sequences that share the overlapping region of length *overlap_length*, 
       *key2* : ...,
       ...
       } 
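
    The two dictionaries described above can be post-processed with plain Python. The sketch below uses hand-written sample data in the documented shapes instead of calling flash. write_fastq and summarize_overlaps are hypothetical helpers, not part of flashpy, and the sketch assumes that "quality" holds a list of int and that Phred+33 is the desired output encoding.

```python
import io
from collections import Counter


def write_fastq(merged_reads, handle, offset=33):
    """Write merged reads as FASTQ records (assumes integer quality lists)."""
    for key, record in merged_reads.items():
        quality = "".join(chr(q + offset) for q in record["quality"])
        handle.write(f"@{key}\n{record['seq']}\n+\n{quality}\n")


def summarize_overlaps(overlap_distributions):
    """Return {direction: (pair_count, mean_overlap_length)}."""
    totals = Counter()
    weighted = Counter()
    for (direction, length), count in overlap_distributions.items():
        totals[direction] += count
        weighted[direction] += length * count
    return {d: (totals[d], weighted[d] / totals[d]) for d in totals if totals[d] > 0}


# Sample data in the documented shapes (values are made up for illustration).
merged_reads = {
    "read_1": {"r1_key": "read_1/1", "r2_key": "read_1/2",
               "seq": "ACGT", "quality": [40, 40, 39, 38], "identity": 1.0},
}
overlap_distributions = {("innie", 100): 3, ("innie", 120): 1, ("outie", 60): 2}

buffer = io.StringIO()
write_fastq(merged_reads, buffer)
fastq_text = buffer.getvalue()
summary = summarize_overlaps(overlap_distributions)
```

    Writing to any file-like handle keeps the helper usable with both regular files and in-memory buffers.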
      
