Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possibility of filtering reads based on phred score of selected basepairs #179

Open
caspl1 opened this issue Dec 8, 2021 · 5 comments
Open
Labels
enhancement New feature or request

Comments

@caspl1
Copy link

caspl1 commented Dec 8, 2021

Hi.
Thanks for a great pipeline!
I have a population of cells where cells with two different knock-in single nucleotide substitutions are present amongst unedited cells. I would like to quantify the number of reads with each of these substitutions using CRISPResso2. I would like to apply stringent phred score quality filtering, not to the whole read, but only to these two basepair positions (or if that is not possible the window containing both of them, since they are located quite close to each other).

If i filter using the min_average_read_quality parameter, i will not be sure that these positions contain high quality base calls. If i filter using min_single_bp_quality, i will discard too many reads. And if i filter using min_bp_quality_or_N, the total number of reads with each single nucleotide substitution is not easily accessible in the allele plot since the N's are located different places in the reads and the reads are then not identical and counted together.

Is there a way to do this or would it be possible for you to make it?
Thanks a lot
/Casper

@Colelyman
Copy link
Contributor

Colelyman commented Dec 8, 2021

Thanks for using CRISPResso2! I don't think this is currently implemented, but seems like it would be a neat feature. @kclem may have more ideas on how to currently do this.

If I understand correctly, the feature would take minimum phred scores for specific basepair positions and then filter the reads that have at least those scores, right?

@caspl1
Copy link
Author

caspl1 commented Dec 8, 2021

Thanks for the reply. Yes, that is exactly what i mean!

@kclem
Copy link
Member

kclem commented Dec 11, 2021

Briefly, I think this would have to be done post-alignment and post-processing via the --fastq_output parameter which creates a fastq file including quality scores and read alignment positions for downstream filtering.

I don't think this is a common use case, so probably just writing a post-processing script would be most useful. What outputs are you looking for? Number of edited vs unedited reads after filtering for quality at those positions?

@kclem kclem added the enhancement New feature or request label Dec 11, 2021
@caspl1
Copy link
Author

caspl1 commented Dec 11, 2021

Thanks for the reply! I am looking to compare the ratio of the two single base pair substitutions (i.e. number of reads of one compared to the other). They are both present at low frequencies and therefore i want to add stringent requirements for quality of those exact base pairs to minimize effect of sequencing noise.

@leonorrib
Copy link

leonorrib commented Mar 15, 2022

Dear @kclem and @Colelyman,

Thank you for your answers. I'm working together with @caspl1 on his project. We are implementing a post-processing script and would like to make sure that we are in the right direction, because so far we can't manage to obtain the same number of fragments that are given by CRISPResso2.

The input data are fastq files from a KI experiment, where there are two single base pair substitutions introduced (ie. two amplicons of interest, each with one single base pair substitution).

We would like to use the output fastq file to:

  • First, identify the fragments with each of the single base pair substitutions introduced. The total number of these fragments is usually provided in the output files called "9.XXX.Alleles_frequency_table_around_sgRNA_XXXXXXXXXXXX.png"
  • Then, on the list of fragments of interest identified above, we would like to filter out those where base pair with the substitution has low phred score.

But, so far, we can't manage to reproduce the same numbers that CRISPResso2 outouts in the first part above. Perhaps, since you know better the CRIPSPResso2 algorithm, you can help us with how to identify the fragments of interest.

So far, the steps we follow in our script are:

For each sequenced fragment:

  1. Check whether the fragment aligns to some of the amplicons of interest. If not, then discard it.
  2. Check whether the alignment is an ambiguous alignment (ie. to multiple amplicons). If so, then discard the fragment.
  3. If the fragment aligns to TYPE1 ("ALN" field in the output fastq file) & there are no alignment gaps in the reference sequence or aligned sequence ("ALN_DETAILS" field in the output fastq file), then count the fragment as "type1".
  4. If the alignment is to TYPE2 ("ALN" field in the output fastq file) & there are no gaps in the reference sequence or aligned sequence ("ALN_DETAILS" field in the output fastq file), then count the fragment as "type2".

Best regards,
Leonor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants