Paired read processing #133

Colelyman · 2021-08-19T20:11:42Z

The algorithm now passes the unit tests. I imagine that it needs to be integrated into the core so that the newly aligned reads are actually used, right? If so, could you point me to where this would need to be done?

The most common alleles for each pooled target are output if the flag '--compile_postrun_references' is provided. This writes alleles with the frequency cutoff defined by the parameter '--compile_postrun_reference_allele_cutoff'. The file can be manually edited to remove noisy alleles and then used to run CRISPRessoPooled again, providing alternate alleles to each CRISPResso run via the parameter '--alternate_alleles'. This is particularly useful in cases where control experiments are available. The running pattern would be:

1) CRISPRessoPooled --compile_postrun_references {control}
2) CRISPRessoPooled --alternate_alleles {produced in step 1} {control}
   CRISPRessoPooled --alternate_alleles {produced in step 1} {experiment}
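The running pattern can be sketched as a small Python script that only assembles the three command lines. The three long flags come from the notes above; the `-r1`, `-f` and `-n` arguments, the file names, the cutoff value and the location of the references file inside the output folder are assumptions for illustration.

```python
import shlex

# Hypothetical inputs: file names and the cutoff value are placeholders.
AMPLICONS = "amplicons.txt"
CONTROL_FASTQ = "control.fastq.gz"
EXPERIMENT_FASTQ = "experiment.fastq.gz"
# File name taken from the commit notes; the folder it lives in is assumed.
POSTRUN_REFS = "CRISPRessoPooled_on_control/CRISPResso2Pooled_postrun_references.txt"


def pooled_cmd(fastq, name, extra):
    """Build one CRISPRessoPooled command line as an argument list."""
    return ["CRISPRessoPooled", "-r1", fastq, "-f", AMPLICONS, "-n", name] + extra


commands = [
    # Step 1: compile the most common alleles from the control sample.
    pooled_cmd(CONTROL_FASTQ, "control", [
        "--compile_postrun_references",
        "--compile_postrun_reference_allele_cutoff", "0.01",
    ]),
    # Step 2: rerun control and experiment with the (optionally hand-edited) alleles.
    pooled_cmd(CONTROL_FASTQ, "control_alt", ["--alternate_alleles", POSTRUN_REFS]),
    pooled_cmd(EXPERIMENT_FASTQ, "experiment_alt", ["--alternate_alleles", POSTRUN_REFS]),
]

for cmd in commands:
    print(shlex.join(cmd))
```

The script prints the commands rather than running them, so the references file can be edited by hand between step 1 and step 2.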

- Special bonus for y'all to keep you company during covid - axis ticks on most plots!
- added parameter --plot_histogram_outliers to plot all insertion sizes in histogram
- all insertion sizes are reported in .hist output files #64
- add HDR reference plot (may change this later to set ref1 to the longer reference of WT/HDR but for now it is always WT..)

Allow reverse complement of extension seq if PE sequence is specified.

Frameshift plots don't show 0-bp changes (these dwarf all other changes). The number of reads not shown is added to the legend.
Addressed cloning quantification windows when bases are inserted in the clone-ee (previously these cloned bases would be ignored).
Force HDR to clone all quantification windows from Ref 1.
Fix #60 and #59

In CRISPRessoWGS, the region file contains a 'chr_id' column which is sometimes mis-recognized as ints when read by pandas if using the chromosome notation without 'chr' (e.g. 1,2,3 instead of chr1,chr2,chr3). This bug fix forces chr_ids to be read as strs.
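A minimal sketch of the behaviour, assuming a tab-separated region file; the `bpstart` and `bpend` columns and the coordinates are made up for illustration, and this is not the actual CRISPRessoWGS code. It uses the `dtype` argument of `pandas.read_csv` to pin the column type.

```python
import io

import pandas as pd

# Hypothetical region file using chromosome names without the 'chr' prefix.
region_file = io.StringIO("chr_id\tbpstart\tbpend\n1\t100\t200\n2\t300\t400\n")

# Default type inference turns the bare chromosome names into integers.
inferred = pd.read_csv(region_file, sep="\t")

region_file.seek(0)
# Forcing the dtype keeps chr_id values as strings.
forced = pd.read_csv(region_file, sep="\t", dtype={"chr_id": str})

print(inferred["chr_id"].tolist())
print(forced["chr_id"].tolist())
```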

Starting in version 2.1.0, insertion quantification has been changed to only include insertions completely contained by the quantification window. To use the legacy quantification method (i.e. include insertions directly adjacent to the quantification window) please use the parameter --use_legacy_insertion_quantification
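The difference between the two quantification methods can be sketched as follows. This assumes an insertion is described by the two reference positions that flank it, which is an illustrative simplification and not necessarily how CRISPResso represents insertions internally; the function name is hypothetical.

```python
def insertion_is_counted(left_flank, right_flank, window, legacy=False):
    """Decide whether an insertion between two reference positions is quantified.

    left_flank and right_flank are the reference indices immediately before
    and after the inserted bases; window is the set of reference indices in
    the quantification window.
    """
    if legacy:
        # Legacy method: insertions directly adjacent to the window also count.
        return left_flank in window or right_flank in window
    # Current method: the insertion must be completely contained by the window.
    return left_flank in window and right_flank in window


window = set(range(10, 20))  # reference positions 10 through 19

print(insertion_is_counted(14, 15, window))              # inside the window
print(insertion_is_counted(9, 10, window))               # on the window edge
print(insertion_is_counted(9, 10, window, legacy=True))  # edge, legacy method
```

Only the edge case changes between the two methods: an insertion sitting on the window boundary is counted by the legacy method and skipped by the current one.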

Prime editing input parameters are forced to be in the RNA 3'->5' direction. This makes sure that the scaffold incorporation happens on the correct side of the extension sequence. Errors are thrown if improper directionality is detected. Fastq_out now includes alignment scores and details for every run (it may be time to upgrade that SSD to hold these new fastq output files, but it makes debugging particular reads a lot easier!) Update to linked data for plot 3b in report

Multiple amplicon names are resolved before adding the HDR amplicon -- unnamed amplicons are named Amplicon{i} for each amplicon.
Plot 4g data (nuc pct table, mod pct table for all reads aligned to the first reference) is output and linked to from the plot display.
Ambiguous reads don't contribute to plot 4g data (which would otherwise lead to double counting and pct values > 1).

These changes implement separating reads to their corresponding amplicons via Python instead of through awk. This is to get around the maximum number of open files that is limited on many operating systems. Co-authored-by: Kendell Clement <k.clement.dev@gmail.com>
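One way to stay under the open-file limit is to buffer reads in memory and append to one output file at a time. The sketch below illustrates that idea only; it is not the implementation in this commit, and the function name, file naming and `flush_every` parameter are hypothetical.

```python
import os
import tempfile
from collections import defaultdict


def split_reads_by_amplicon(records, out_dir, flush_every=1000):
    """Write (amplicon_name, read_text) records to one file per amplicon.

    Reads are buffered in memory and appended to their file one amplicon at
    a time, so at most one output file is open at any moment.
    """
    buffers = defaultdict(list)
    buffered = 0

    def flush():
        for name, reads in buffers.items():
            # Append mode, so earlier flushes for this amplicon are kept.
            with open(os.path.join(out_dir, name + ".fastq"), "a") as handle:
                handle.writelines(reads)
        buffers.clear()

    for name, read_text in records:
        buffers[name].append(read_text)
        buffered += 1
        if buffered >= flush_every:
            flush()
            buffered = 0
    flush()  # write whatever is left in the buffers


demo_dir = tempfile.mkdtemp()
demo_records = [
    ("AmpliconA", "@read1\nACGT\n+\nIIII\n"),
    ("AmpliconB", "@read2\nTTTT\n+\nIIII\n"),
    ("AmpliconA", "@read3\nGGGG\n+\nIIII\n"),
]
split_reads_by_amplicon(demo_records, demo_dir, flush_every=2)
print(sorted(os.listdir(demo_dir)))
```

Because the files are opened in append mode, the output directory should start empty, otherwise reads from an earlier run would be mixed in.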

* Error out if HDR amplicon matches existing amplicon
* Add check for amplicon sequence uniqueness
* Fix bug with bam_input not having bam_output
* Test for no returned lines in auto mode, version bump to 2.2.11
* Fix pandas deprecation of df.append
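A uniqueness check of this kind might look like the sketch below. The function name, the error type and the case-insensitive comparison are assumptions, not the code added in these commits.

```python
def check_unique_amplicons(amplicons):
    """Raise ValueError if two amplicon names share the same sequence.

    amplicons maps an amplicon name to its sequence; the comparison here is
    case-insensitive (an assumption of this sketch).
    """
    seen = {}
    for name, seq in amplicons.items():
        key = seq.upper()
        if key in seen:
            raise ValueError(
                f"Amplicon '{name}' has the same sequence as '{seen[key]}'"
            )
        seen[key] = name


check_unique_amplicons({"Reference": "ACGTACGT", "HDR": "ACGTTCGT"})  # passes

try:
    check_unique_amplicons({"Reference": "ACGTACGT", "HDR": "acgtacgt"})
except ValueError as error:
    print(error)
```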

CRISPResso checks that prime editing guides are provided in the proper orientation (e.g. pegRNA 3'->5', spacer sequence 5'->3') and checks these orientations by alignment. Sometimes, the alignment can be better in the opposite direction, and this parameter allows these checks to be overridden. Otherwise, these checks would halt the program and produce the output 'The prime editing pegRNA spacer sequence appears to be given in the 3\'->5\' order. The prime editing pegRNA spacer sequence (--prime_editing_pegRNA_spacer_seq) must be given in the RNA 5\'->3\' order.'

If the user specifies prime_editing_override_prime_edited_ref_seq, it may not contain the extension seq (if they don't provide the extension seq in the appropriate orientation), so check that here. The extension sequence should be provided as the reverse complement of the prime edited sequence.
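The check described here can be sketched with an exact substring match on the reverse complement. This is an illustrative simplification with hypothetical function names and made-up sequences; the real check may tolerate mismatches or use alignment.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def extension_matches_reference(extension_seq, prime_edited_ref_seq):
    """True if the reverse complement of the extension occurs in the reference."""
    return reverse_complement(extension_seq.upper()) in prime_edited_ref_seq.upper()


prime_edited_ref = "GATTACAGGCTTCA"  # made-up prime edited sequence
good_extension = "AGCCTGTA"          # reverse complement of TACAGGCT
bad_extension = "TACAGGCT"           # same strand as the reference

print(extension_matches_reference(good_extension, prime_edited_ref))
print(extension_matches_reference(bad_extension, prime_edited_ref))
```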

* Fixing documentation to match pooled headers
* Header removal bug fix change documentation to guide_seq
* Update documentation and help feature for CRISPRessoPooled
* Remove extra newlines from CRISPRessoPooled -h
* Make variable names as clear as my firstborn child's name
* Update one more variable name

Co-authored-by: Samuel Nichols <Snic9004@gmail.com>

* Implement logging handler to overwrite the latest log status to file
* Add StatusHandler to CRISPRessoCORE log

  This will take the latest log output and write it to a file (`status.txt`), the catch being that with each log the file is overwritten so that one can easily tell where CRISPResso currently is and what the error is (if any). These changes include some slight refactoring in order to accommodate any potential parameter exceptions.
* Add StatusHandler to CRISPRessoBatch and refactor `logger.warn` to `warn`
* Add StatusHandler to CRISPRessoPooled and a little refactoring
* Implement `percent_complete` to the status log
* Add StatusHandler to CRISPRessoAggregate log
* Add StatusHandler to CRISPRessoCompare log
* Add StatusHandler to CRISPRessoPooledWGSCompare log
* Add StatusHandler to CRISPRessoWGS log
* Rename `status.txt` to `CRISPResso_status.txt`
* Modify status log names to match the tool they are generated from
* Add percent_complete stages to CRISPRessoCORE

  These also include log statements of each plot that is being generated as well as fixing some variable name collisions with `ind`.
* Format the percentage in the log to be 2 decimal places
* Change all plotting logs from `info` to `debug` and simplify progress

  This refactors how the progress of the plots is calculated, making it much simpler. Before this change we would have had to keep track of the number of times `percent_complete` was output, but now it simply updates the percent complete after each amplicon is finished processing. Hopefully this will make things easier to maintain even though it will be a little less "accurate" (not sure how accurate the original implementation was...).
* Implemented shared console log handler across all CRISPResso* calls

  This allows for easy changes to logging formatting, which was inspired by having to change the default logging level. The default logging level needs to be set at `logging.DEBUG` in order for the debug log statements to not be ignored for the running and status logs.
* Add ability to set the verbosity level to each CRISPResso* tool

  This allows users to set a verbosity level between 1 and 4 using the `-v`/`--verbosity` CLI parameter. If the `--debug` flag is present, then the level will default to 4, being the most verbose.
* Implement showing the last seen `percent_complete` when none is provided
* Keep track of and log when multiple parallel runs are completed

  These changes modify `CRISPRessoMultiProcessing.run_crispresso_cmds` such that we can now display when a run is completed. This potentially breaks how signals and interrupts are handled with multiple runs happening, but this needs to be reviewed.
* Add debug and percentage complete to CRISPRessoBatch
* Add percent complete to CRISPRessoPooled
* Add debug and percent_complete message to CRISPRessoAggregate
* Add `percent_complete` to CRISPRessoCompare
* Add `percent_complete` to CRISPRessoPooledWGSCompare
* Add status and `percent_complete` to CRISPRessoMeta
* Add `verbosity` arguments to CRISPRessoCompare and CRISPRessoPooledWGSCompare
* Fixing documentation to match pooled headers
* Header removal bug fix change documentation to guide_seq
* Update documentation and help feature for CRISPRessoPooled
* Remove extra newlines from CRISPRessoPooled -h
* Make variable names as clear as my firstborn child's name
* Update one more variable name
* Fix bug to flow CRISPRessoPooled options to sub command
* Make amplicon file args variable name clear
* Update how parameters are set and retrieved from parameter object

  The refactor in the previous commit changed the type of the arguments to a dictionary which doesn't have the parameters as attributes, and this commit fixes that error.
* Add note in output header for change in default CRISPRessoPooled

  In the next release (2.3.0) the `--demultiplex_only_at_amplicons` will be the default when running in mixed-mode. This is to allow for inexact alignments of the reads and the amplicons to the genome. For more context, see this issue pinellolab#276
* Clarify the verbosity parameter help message
* Separate out parameters to `normalize_name` in CRISPRessoCORE
* Separate out parameters to `normalize_name` in CRISPRessoWGS
* Separate out parameters to `normalize_name` in CRISPRessoPooled
* Separate out parameters to `normalize_name` in CRISPRessoCompare
* Fix bug in CRISPRessoPooled by replacing `database_id` with `normalize_name`
* Refactor `run_crispresso_cmds` to not require a `logger`

  This commit implements the functionality to make the `logger` object optional by seeing which module called the `run_crispresso_cmds` function and obtaining the correct object from that module name. The function also immediately returns when no commands are passed to it.
* Add amplicon name to plotting debug statements in CRISPRessoCORE

---------

Co-authored-by: Cole Lyman <cole@coles-mbp-2.lan>
Co-authored-by: Cole Lyman <cole@Coles-MacBook-Pro-2.local>
Co-authored-by: Cole Lyman <cole@colelyman.com>
Co-authored-by: Samuel Nichols <Snic9004@gmail.com>
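The overwrite-on-every-log behaviour described in these commits can be sketched with the standard `logging` module. The class below borrows the `StatusHandler` name from the commit message but is an independent illustration, not the CRISPResso implementation; passing `percent_complete` through `extra` and the exact line format are assumptions.

```python
import logging
import os
import tempfile


class StatusHandler(logging.Handler):
    """Keep only the most recent log message in a status file."""

    def __init__(self, filename):
        super().__init__()
        self.filename = filename
        self.last_percent_complete = 0.0

    def emit(self, record):
        # A record may carry percent_complete (passed via `extra`);
        # when it does not, the last seen value is reused.
        percent = getattr(record, "percent_complete", None)
        if percent is None:
            percent = self.last_percent_complete
        self.last_percent_complete = percent
        # Mode "w" truncates the file, so each log replaces the previous one.
        with open(self.filename, "w") as handle:
            handle.write(f"{percent:.2f}% {self.format(record)}\n")


status_path = os.path.join(tempfile.mkdtemp(), "CRISPResso_status.txt")
logger = logging.getLogger("status_sketch")
logger.setLevel(logging.DEBUG)  # DEBUG so debug-level plot messages are not dropped
logger.propagate = False
logger.addHandler(StatusHandler(status_path))

logger.info("Aligning reads", extra={"percent_complete": 25})
logger.debug("Plotting amplicon Amplicon1")  # no percent given, 25 is reused

with open(status_path) as handle:
    print(handle.read(), end="")
```

After both calls the file holds a single line for the second message, which is the point of the design: a monitoring process only has to read one line to learn where the run currently is.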

kclem and others added 30 commits September 23, 2020 23:32

Update postrun reference output file

1938f91

Rename postrun references file to be more standardized with other output files. Output is now "CRISPResso2Pooled_postrun_references.txt"

Fix bug in mode to write fastq out

50b702c

Fix bug in read counting for interleaved fastqs

04c924c

Update new parameters, fix docker build problem

3a390f5

Version bump to v2.0.42

b39954e

delete merging intermediate files

d180416

Fix a bug when generating compare plot

251ec0a

Related to issue #61 This happens when N_ROWS < 1 which I assume has something to do with no results -> negative control

Merge pull request #62 from matandro/patch-1

baccafb

Fix a bug when generating compare plot

Standardize pie plot appearances

356fd66

Fixes for when no reads align #63

40d2168

Update histogram x-limits, caption, and data

2070143

Fix plot window cloning from Ref1 to HDR

971f1b1

Plot window for sgRNA will be the same length after cloning even if the window is shorter or longer after comparing between ref1 and HDR.

Add scripts folder for one-off analyses

e704bfd

Output alignment details for unaligned reads in fastq_out or bam_out

49964ae

Introduction of CRISPRessoAggregate to aggregate stats across runs

e1324f2

Adds CRISPRessoAggregate Adds start/end time to CRISPRessoBatch info Started removing pickle dependencies from Pooled and Report

Update README.md

8005d2a

Update links in readme to https - fix pinellolab#79

7c278c5

fix missing import path on NaN

9927868

Fix #72 bam_input error

e684569

Alt pooled processing implementation

29093de

Keep old awk command for speed for samples with <50 amplicons

9771a28

CRISPRessoPooled - close active file in demux

d961c2e

kclem and others added 12 commits October 6, 2022 16:32

Fix CRISPRessoBatch plot pool bug when plots are suppressed

cbf71f1

Fix typo of CRISPResssoPlot when plotting nucleotide quilt (pinellola…

e133f3c

…b#250)

Allow spaces in read names for CRISPRessoWGS

73ac5d8

Add script to filter input based on sequence presence

53e8f95

Update filterReadsOnSequencePresence.py

056369f

Version bump to 2.2.11a

7c4dad2

Clarify default CRISPRessoPooled settings for use_legacy_bowtie2_opti…

51a1969

…ons_string

Format filterReadsOnSequencePresence script

4149d8e

Add deprecation notice (pinellolab#260)

9674593

* Add FLASh and Trimmomatic deprecation notice to CLI output * Add Edilytics email address to CLI output

Colelyman marked this pull request as draft December 8, 2022 20:49

kclem and others added 16 commits December 19, 2022 13:28

Fix pinellolab#235 - Cigar string is * if read unaligned

66b65fc

Previously, the bam would set the cigar string to 0 if the read was unaligned. This breaks the sam->bam conversion and causes the errors in pinellolab#235.
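For reference, an unaligned read written the way the SAM format expects can be sketched as below. The helper name is hypothetical and this is not the code changed in the commit; it only shows `*` in the CIGAR column (field 6) together with flag 4 for an unmapped read, with MAPQ set to 0 as a placeholder.

```python
def unaligned_sam_line(read_name, seq, qual):
    """Format an unmapped read as a tab-separated SAM record.

    FLAG 4 marks the read as unmapped; RNAME, CIGAR and RNEXT are '*' when
    unavailable; POS, PNEXT and TLEN are 0.
    """
    fields = [
        read_name,  # QNAME
        "4",        # FLAG: segment unmapped
        "*",        # RNAME
        "0",        # POS
        "0",        # MAPQ (placeholder)
        "*",        # CIGAR: '*' rather than '0'
        "*",        # RNEXT
        "0",        # PNEXT
        "0",        # TLEN
        seq,        # SEQ
        qual,       # QUAL
    ]
    return "\t".join(fields)


print(unaligned_sam_line("read1", "ACGT", "IIII"))
```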

Clarify input param help for pooled bam

d026acd

Delete vscode settings

ed77f73

Fix bug when pooled bam is input (pinellolab#265)

6b25660

This change checks to see if a bam file was input, and if so it doesn't try to remove any intermediate files because there aren't any. Co-authored-by: Cole Lyman <cole@coles-mbp-2.lan>

Add snippet about installing CRISPResso2 via bioconda on Apple silicon (

7c5fa7d

pinellolab#274) I have suffered enough trying to debug my installation, so hopefully this helps someone else. Co-authored-by: Cole Lyman <cole@coles-mbp-2.lan>

Fix deprecated numpy type names (fixes pinellolab#269) (pinellolab#270)

bc55076

In the most recent version of numpy (1.24) some of the types have been deprecated. This commit fixes these errors.

Version bump to 2.2.12

4c1266b

Fix print statement in CORE

5805b3f

Case-insensitive headers accepted in CRISPRessoPooled

ec44bb8

Merge branch 'master' into upstream-paired-processing

b379a6d

New tests, replace assert statement with check.equals so all will be r…

65b493d

…un if one fails. Use `conda install -c conda-forge pytest-check` to install the dependencies

Extra tests

bb27ba1

Reverse R2 seq and qual

d3430e8

Reverse quality string in force_merge_pairs

41bfcaa

Colelyman force-pushed the upstream-paired-processing branch from b074699 to 41bfcaa Compare September 21, 2023 22:29
