Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sRNA spike-ins #116

Open
mschilli87 opened this issue Nov 9, 2023 · 9 comments
Open

sRNA spike-ins #116

mschilli87 opened this issue Nov 9, 2023 · 9 comments
Assignees
Labels

Comments

@mschilli87
Copy link
Member

mschilli87 commented Nov 9, 2023

Dear @Drmirdeep,

More and more sRNA-seq projects utilize RNA spike-ins as controls.

What is the recommended workflow to quantify these along miRNAs using miRDeep2's quantifier module?
I added the corresponding sequences to the mature miRNA input FASTA but they don't get picked up. I am guessing I need to at least make up some 'hairpins' for them as well (How? Can I just use the mature sequence or do I need to actually craft legit hairpin-structures to bypass some internal filtering? And what do I do with spike-ins that are longer than your typical mature miRNA but well within the read length - say 35nt RNA spike-in @ 51 cycles single-end Illumina sequencing?).
But will I need to make up a fake chromsome as well that hosts those fake hairpins and re-build an new genome bowtie index just to quantify the spike-ins along endogenous miRNAs?

Any advise would be highly appreciated.

Also, would you be up to adding a new optional input FASTA with 'mature' spike-in sequences that are quantied as-is by the quantifier module?

@Drmirdeep
Copy link
Collaborator

Drmirdeep commented Nov 9, 2023

You will just need your spikein sequences in your mature file and precursor file definition (length doesn't matter)

cat mature.fa
>spikein1
CTACGATCGATGCTAGCTAGCAGGCGATCAGCATCGATGATCGATTA

cat pres.fa
>spikein1
CTACGATCGATGCTAGCTAGCAGGCGATCAGCATCGATGATCGATTA

just run it in its most simple form and give the mature.fa as the precursor input as well

perl quantifier.pl -r reads_collapsed.fa -m mature.fa -p mature.fa

so just add your spikins to your mature file and to the precursor file

@mschilli87
Copy link
Member Author

Thank you. I can confirm that copying the 'mature' sequences to the 'precursor' FASTA results in miRDeep2 picking up the RNA spike-ins for quantification.

@mschilli87
Copy link
Member Author

mschilli87 commented Feb 6, 2024

Sorry for re-opening this: I noticed that only reads matching the spike-in exactly are quantified using this approach. I noticed because the results did not add up and by a simple grep 'analysis' I found that most reads in my data that contain an exact match of a specific spike-in contain a final T after pre-processing. I do not know if this could be an actual problem with the spike-ins or is just an artifact from insufficient adapter-removal but I found it weird that the exact 'mature' sequence plus a single (non-templated) nucleotide would not be picked up mir miRDeep2. Now I am wondering if that is due to the processed read being longer than the 'precursor' sequence of that spike-in. Any advice on how to adjust my input or tweak the parameters to pick those up?


edit: Just a quick update: It looks like indeed I missed a leading T in the 3' adapter sequence during pre-processing. I'll try to see if this is enough to fix the issue. Still, I am curious if this is expected as untemplated bases added to miRNAs post-transcriptionally have been shown to be biologically relevant so it would be good to know if miRDeep2 misses those by default or if this is related to the precursor=mature hack I used for the spike-ins.

@mschilli87 mschilli87 reopened this Feb 6, 2024
@Drmirdeep
Copy link
Collaborator

Then just put a few A‘s in the end of the precursor. Mismatches can be allowed up to 2 if I remember correctly. If the mature is longer than the precursor then bowtie will fail to map I guess

@mschilli87
Copy link
Member Author

mschilli87 commented Feb 6, 2024

That's what I am currently guessing as well. Do you know if N's would be allowed as well? I would prefer not to bias the mapper towards any specific untemplated base. Thank you for the fast response.

@mschilli87
Copy link
Member Author

I can confirm the spike-in counts changed by orders of magnitude while miRNA counts remains almost unchanged. So the extreme sensitivity to non-templated bases appears to be an issue for the spike-ins only, likely due to what was discussed here. I will leave this issue open until as a reminder to update it with my final conclusions in case anyone else searches this tracker looking for advice how to include spike-ins.

@mschilli87 mschilli87 self-assigned this Feb 6, 2024
@mschilli87
Copy link
Member Author

mschilli87 commented Feb 6, 2024

When padding the 'mature' spike-ins with ten N's on either side for the 'hairpins' I do not get any spike-in quantification. IIRC, miRDeep2 discards sequences with non ACTG(U?) characters (at least by default). @Drmirdeep: Would you expect A's to perform better and if so, do you share my concerns regarding the bias introduced by those?


edit: As expected, replacing the N's with A's did yield spike-in results again, with slightly more reads mapping to them as before. I feel like some padding should be done to remove the bias agains spike-ins (compared to endogenous miRNAs) in case of imperfect pre-processing but I am concerned that picking a specific nucleotides introduces another bias. @Drmirdeep: Could you provide your point of view?

@Drmirdeep
Copy link
Collaborator

If the preprocessing is the problem then this should be solved on the preprocessing side not on a tool that expects the input data to follow certain conventions

@mschilli87
Copy link
Member Author

mschilli87 commented Feb 7, 2024

I guess I was not clear enough: The pre-processing issue was the reason the quantification was off orders of magnitude. That is fixed and not my issue. However, that made me realise that a read that, for whatever reason, has an extra base maps just fine for actual miRNAs but not for spike-ins. The whole point of spike-ins is to serve as controls. So they should be treated as close to identitcal as possible. So in a way the problem is that my pre-processing mistake affected spike-ins but not miRNAs (or at least much less). Padding the spike ins with A's helps, because now they are treated more similar as miRNAs. But as along as I fix that padding, there is still a bias that is spike-in specific. And ideally, nothing should be specific for the thing you normalize to. 😉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants