Remora re-squiggle (refine_signal_map.SigMapRefiner) on RNA data #166

Closed
mem3nto0 opened this issue Mar 27, 2024 · 4 comments

Comments

@mem3nto0

Dear ONT team,

I am trying to re-squiggle my RNA data using Remora after basecalling it with Dorado. Comparing the analyses from Tombo and Remora on the same dataset, I obtain different results. The Remora notebook page says that the Remora setup is tested and adjusted for the DNA Kit 14 9-mer model, but it gives no specification for RNA data. Should it work the same?

Additionally, the analysis of synthetic RNA data (where I know the modification position) shows consistent results with Tombo, while with Remora the results don't converge at the modification position.

My questions are:
Is there any setup for Remora that is more suitable for RNA data?
If the answer to the first question is no, is there any pipeline for basecalling with Dorado and then using Tombo?

thank you for your time and attention.
Kind regards

@marcus1487
Collaborator

The k-mer models for RNA need to be used. These can be found in the kmer_models repository as noted in the README. The reverse_signal flag should also be set where applicable for RNA reads, because the signal proceeds from the 3' end to the 5' end of the RNA. Finally, the latest Dorado release (v0.5.3) should be used, as previous Dorado versions had some bugs related to RNA trimming/splitting that affected the move table and thus Remora analyses. Some additional bug fixes are coming in the next Remora release to handle some edge cases in move table parsing, but the vast majority of reads should be handled correctly with the latest Dorado and Remora releases.
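
To make the 3'-to-5' point concrete, here is a minimal sketch of what reversing a signal implies for the base boundaries. It is an illustration only and not Remora's internal code; the function name and the toy values are hypothetical.

```python
def flip_to_five_prime_first(signal, boundaries):
    """Flip a signal and its base boundaries from 3'->5' to 5'->3' order.

    `boundaries` has one more entry than there are bases:
    base i owns signal[boundaries[i]:boundaries[i + 1]].
    """
    n = len(signal)
    if boundaries[0] != 0 or boundaries[-1] != n:
        raise ValueError("boundaries must span the whole signal")
    flipped_signal = signal[::-1]
    # A boundary at sample b lands at n - b once the signal is reversed.
    flipped_boundaries = [n - b for b in reversed(boundaries)]
    return flipped_signal, flipped_boundaries


# Toy read with three bases, stored in sequencing (3'->5') order.
signal = [10, 11, 20, 21, 22, 30]
boundaries = [0, 2, 5, 6]
flipped_signal, flipped_boundaries = flip_to_five_prime_first(signal, boundaries)
```

After the flip, the base that was sequenced last (the 5'-most base) comes first and keeps the same samples, just in reversed order. A k-mer level table written in 5'-to-3' orientation can only be compared with the signal once both are in the same orientation.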

@mzdravkov

Hi @marcus1487, we talked together with you and Logan a few days ago.

Sorry for hijacking the issue (please let me know if I should open a new one), but I just stumbled on a bug related to the move table when resquiggling RNA004 data and wanted to make sure it is the same thing that you're preparing a fix for.

We re-basecalled our data with Dorado v0.5.3+d9af343 and I just tried resquiggling it again with Remora, but I'm getting a:

remora.RemoraError: Move table discordant with signal

So I have a few questions:

  1. Is this likely to be resolved with the upcoming fixes in the next release?
  2. When do you expect the release to be published?
  3. Can you suggest a workaround that we can do in the meantime? I tried changing the code directly so that I pass missing_ok=True to the get_io_reads function call here and it seems to resolve the issue (I guess we're just losing some problematic reads). Do you think that this approach is okay?

Thanks,
Mihail

@mem3nto0
Author

mem3nto0 commented Apr 1, 2024

@marcus1487 Thank you for the reply,

When I analyzed the data, I had already set reverse_signal=True and chosen the suggested k-mer model for RNA (rna_r9.4_180mv_70bps). I saw that my Dorado is not up to date with the latest release, so I will check again with the new version.

But I would still like to ask about a few elements in Remora. The software allows changing the "sd_params", which are parameters designed for the re-squiggle. The Remora README says that the preset values are tested for DNA. Can they also be used for RNA?

Additionally, changing the do_rough_rescale, scale_iters, and do_fix_guage settings of "refine_signal_map.SigMapRefiner" changes the re-squiggle significantly. Are there specific settings to use for sd_params, do_rough_rescale, scale_iters, and do_fix_guage when analyzing RNA data?

Thank you for your time and attention.
kind regards.

@marcus1487
Collaborator

@mzdravkov

  1. Yes, this is likely to be resolved in the next release.
  2. I do not have a concrete timeline for this release, but hope to have it out in the next couple of weeks after some stress testing of other features. Could possibly look at a pre-release pushed to GitHub, but not tagged/published as a release.
  3. No good workaround. There were some incorrect assumptions concerning some of the tags around signal trimming and splitting which have been resolved. This problem seems to be larger for RNA runs, and quite rare in DNA runs.
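
For readers unfamiliar with how a move table relates to the signal, the following is a rough sketch under stated assumptions. It assumes the Dorado BAM convention as I understand it: the mv tag holds the stride followed by 0/1 moves, and the ts tag gives the number of samples trimmed from the start. The concordance condition below is a simplified guess at what "discordant" can mean, not the check Remora actually performs, and both function names are hypothetical.

```python
def base_start_samples(mv_tag, trimmed_start=0):
    """Return the signal index at which each basecalled base starts.

    mv_tag: [stride, m0, m1, ...] where a 1 marks the start of a new base.
    trimmed_start: samples trimmed before the first move (ts tag, assumed).
    """
    stride, moves = mv_tag[0], mv_tag[1:]
    return [trimmed_start + i * stride for i, m in enumerate(moves) if m == 1]


def move_table_fits_signal(mv_tag, num_samples, trimmed_start=0):
    """Simplified check: the samples covered by the moves must fit in the signal."""
    stride, moves = mv_tag[0], mv_tag[1:]
    return trimmed_start + len(moves) * stride <= num_samples


mv = [5, 1, 0, 1, 1, 0]  # stride 5, three bases over five strides
starts = base_start_samples(mv, trimmed_start=10)
fits = move_table_fits_signal(mv, num_samples=40, trimmed_start=10)
too_short = move_table_fits_signal(mv, num_samples=30, trimmed_start=10)
```

If the trimming or splitting offsets are interpreted incorrectly, the base starts shift relative to the signal or run past its end, which is the kind of mismatch a check like this would flag.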

@mem3nto0 The sd_params generally work quite well for RNA in our hands. These parameters control the short dwell penalty. Specifying a longer array will increase runtime (potentially significantly for much larger values), but may provide some marginal benefits for slower speed runs (especially in RNA). We have not extensively tested this and thus recommend the default values for this parameter, which do work quite well.

I have indeed found that increasing the scale_iters value to >0 can cause some interesting edge cases (e.g. scaling signal down to 0). I would strongly suggest leaving this value set to 0 for almost all signal metric extraction settings. I've considered making this parameter boolean (essentially between -1 and 0), but it seems that the functionality may be useful at some point, so I have left it for now. For signal plotting/signal metric extraction I would suggest using do_rough_rescale=True and do_fix_guage=True in almost all cases.
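
The recommended settings can be collected in one place, as in the minimal sketch below. The keyword names do_rough_rescale, scale_iters, and do_fix_guage (spelled as in this thread) come from the discussion above; kmer_model_filename is my recollection of the Remora README and should be checked against the installed version. The level table path is a placeholder and build_refiner is a hypothetical helper. sd_params is left at its default, as suggested.

```python
from pathlib import Path

# Placeholder: point this at the RNA level table from the kmer_models repository.
KMER_TABLE = Path("path/to/rna_kmer_levels.txt")

REFINER_KWARGS = {
    "kmer_model_filename": str(KMER_TABLE),  # assumed keyword name
    "do_rough_rescale": True,
    "scale_iters": 0,  # values > 0 can hit edge cases, per the reply above
    "do_fix_guage": True,
}


def build_refiner(kwargs=REFINER_KWARGS):
    """Return a SigMapRefiner, or None if Remora or the level table is missing."""
    if not Path(kwargs["kmer_model_filename"]).is_file():
        return None
    try:
        from remora import refine_signal_map
    except ImportError:
        return None
    return refine_signal_map.SigMapRefiner(**kwargs)


refiner = build_refiner()
```

Keeping the settings in a single dictionary makes it easy to record exactly which values produced a given re-squiggle when comparing runs.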
