remora 3.0: error when training on different canonical bases #140

Mathias-Boulanger · 2023-11-24T13:59:51Z

Hi,

I got an error (see below) when running remora dataset prepare with multiple focus bases. I already trained models in the same spirit using remora 2.0, which is why I don't know whether this is expected behavior.

If this is expected, how can I train a model on a specific modified base while taking into account that other bases/contexts can also be methylated?

Also, a more general question: what is the best practice for training remora models? Should I hold out 10-15% of my training data for validation (and train on the rest), or should I train on everything and run inference on the same dataset?

Thank you for your help

Remora command:

remora dataset prepare \
	--output-path ${wd}data/0_unmeth/prepData/mock_5_CpG_6mA \
	--refine-kmer-level-table ${wd}data/ONT/9mer_levels_v1.txt \
	--refine-rough-rescale \
	--motif CG 0 --motif A 0 \
	--mod-base-control \
	--max-chunks-per-read 20 \
	--num-extract-alignment-workers 24 \
	--num-extract-chunks-workers 24 \
	${wd}data/0_unmeth/0_unmeth.pod5 \
	${wd}data/0_unmeth/0_unmeth.pass.bam

Error log:

[14:37:43.988] Extracting read IDs from POD5
[14:37:49.204] Found 1,242,986 valid BAM records. Found signal in POD5 for 100.00% of BAM records.
[14:37:49.302] Making reference-anchored training data
[14:37:49.302] Opening dataset for output
Traceback (most recent call last):
  File "/miniconda3/envs/remora_train/bin/remora", line 8, in <module>
    sys.exit(run())
  File "/miniconda3/envs/remora_train/lib/python3.10/site-packages/remora/main.py", line 71, in run
    cmd_func(args)
  File "/miniconda3/envs/remora_train/lib/python3.10/site-packages/remora/parsers.py", line 302, in run_dataset_prepare
    extract_chunk_dataset(
  File "/miniconda3/envs/remora_train/lib/python3.10/site-packages/remora/prepare_train_data.py", line 165, in extract_chunk_dataset
    metadata=DatasetMetadata(
  File "<string>", line 23, in __init__
  File "/miniconda3/envs/remora_train/lib/python3.10/site-packages/remora/data_chunks.py", line 847, in __post_init__
    self.check_motifs()
  File "/miniconda3/envs/remora_train/lib/python3.10/site-packages/remora/data_chunks.py", line 824, in check_motifs
    raise RemoraError(
remora.RemoraError: Cannot create dataset with multiple motif focus bases: {'A', 'C'}

Remora version:

> remora -v
Remora version: 3.0.0

marcus1487 · 2023-11-25T06:58:03Z

The error here is the intended behavior. Remora models (and thus datasets) are linked to a single canonical base. Multiple alternatives to the canonical base are described by one model, but alternatives to multiple canonical bases should be separated into separate models (and datasets). These models can be run simultaneously in dorado so there should be no penalty at inference time for the models being separated. Hopefully this helps clear up the intentions, but please do post if you have any further questions.
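
As a rough illustration of the one-canonical-base-per-dataset rule described above, the sketch below groups motifs by their focus base and builds one remora dataset prepare argument list per group. The flags are copied from the command at the top of this thread (the refine and worker options are left out for brevity). The helper names, the "GC 1" motif, the output-path suffix and the file paths are assumptions made for illustration. The commands are only constructed, not executed.

```python
from collections import defaultdict


def group_motifs_by_focus_base(motifs):
    """Group (motif, focus_offset) pairs by the canonical base at the focus offset."""
    groups = defaultdict(list)
    for motif, offset in motifs:
        groups[motif[offset]].append((motif, offset))
    return dict(groups)


def build_prepare_commands(motifs, pod5, bam, out_prefix):
    """Build one `remora dataset prepare` argument list per canonical base.

    Only flags that appear in the original command are used; the
    "<out_prefix>_<base>" naming scheme is a hypothetical convention.
    """
    commands = {}
    for base, base_motifs in group_motifs_by_focus_base(motifs).items():
        cmd = ["remora", "dataset", "prepare",
               "--output-path", f"{out_prefix}_{base}"]
        for motif, offset in base_motifs:
            cmd += ["--motif", motif, str(offset)]
        cmd += ["--mod-base-control", pod5, bam]
        commands[base] = cmd
    return commands


# Hypothetical paths; CG and GC share canonical base C, A gets its own dataset.
cmds = build_prepare_commands(
    [("CG", 0), ("GC", 1), ("A", 0)],
    "0_unmeth.pod5", "0_unmeth.pass.bam", "prepData/mock",
)
for base, cmd in cmds.items():
    print(base, " ".join(cmd))
```

Each resulting dataset would then be used to train its own model, in line with the separation by canonical base described above.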

Mathias-Boulanger · 2023-11-27T23:09:22Z

That is indeed clearer. I will train a separate model for each canonical base and then use the 2 models simultaneously in dorado.

However, I don't understand why I cannot export the pytorch model to dorado format when the model was trained for the same canonical base but on different motifs (CG and GC, for example).
I got this:

remora model export train_results_CpG_GpC/model_best.pt dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1           
[23:10:57.468] Loading model                                                                                                                                            
[23:10:57.726] Loaded a torchscript model                                                                                                                               
[23:10:57.727] Exporting model to dorado format                                                                                                                         
[23:10:57.921] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/sig_conv1.weight.tensor                            
[23:10:57.928] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/sig_conv1.bias.tensor                              
[23:10:57.936] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/sig_conv2.weight.tensor                            
[23:10:57.942] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/sig_conv2.bias.tensor                              
[23:10:57.949] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/sig_conv3.weight.tensor                            
[23:10:57.954] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/sig_conv3.bias.tensor                              
[23:10:57.961] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/seq_conv1.weight.tensor                            
[23:10:57.966] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/seq_conv1.bias.tensor                              
[23:10:57.973] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/seq_conv2.weight.tensor                            
[23:10:57.979] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/seq_conv2.bias.tensor                              
[23:10:57.998] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/merge_conv1.weight.tensor                          
[23:10:58.004] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/merge_conv1.bias.tensor                            
[23:10:58.012] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/lstm1.weight_ih_l0.tensor                          
[23:10:58.018] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/lstm1.weight_hh_l0.tensor                          
[23:10:58.024] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/lstm1.bias_ih_l0.tensor                            
[23:10:58.030] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/lstm1.bias_hh_l0.tensor                            
[23:10:58.036] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/lstm2.weight_ih_l0.tensor                          
[23:10:58.042] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/lstm2.weight_hh_l0.tensor                          
[23:10:58.048] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/lstm2.bias_ih_l0.tensor                            
[23:10:58.054] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/lstm2.bias_hh_l0.tensor                            
[23:10:58.060] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/fc.weight.tensor                                   
[23:10:58.091] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/fc.bias.tensor                                     
[23:10:58.103] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/refine_kmer_levels.tensor                          
Traceback (most recent call last):                                                                                                                                      
  File "/miniconda3/envs/remora_train/bin/remora", line 8, in <module>                                                                                
    sys.exit(run())                                                                                                                                                     
  File "/miniconda3/envs/remora_train/lib/python3.10/site-packages/remora/main.py", line 71, in run                                                   
    cmd_func(args)                                                                                                                                                      
  File "miniconda3/envs/remora_train/lib/python3.10/site-packages/remora/parsers.py", line 939, in run_model_export                                  
    export_model_dorado(ckpt, model, args.output_path)                                                                                                                  
  File "/miniconda3/envs/remora_train/lib/python3.10/site-packages/remora/model_util.py", line 213, in export_model_dorado                            
    raise RemoraError("Dorado only supports models with a single motif")                                                                                                
remora.RemoraError: Dorado only supports models with a single motif

Does that mean I should also train two separate models? If so, how will the methylation be encoded in the BAM: in two separate MM/ML tags, or merged (since this is the same modified base, 'm')?

Do you have any insights on best practices for training Remora models?

Thank you for your help!

marcus1487 · 2023-11-29T01:16:14Z

This is a current limitation of Dorado, not of Remora models. You can train a single model on both the CG and GC contexts, but Dorado only supports a single motif for each Remora model. I think Dorado also does not support multiple models against the same canonical base. These issues would be best directed to the Dorado repository if they are required for your research.

Given the current state of the software, your options are to:

1. Run remora infer with this model.
   - Remora does not currently support multiple modified-base models, though, so you would have to run the A-mods model separately.
   - This does not require re-basecalling, but running remora on most contexts may not be much faster than basecalling alone.

2. "Manually" convert your CG+GC-context model to an "all-context" model.
   - Since calls will be made in all contexts, you may want to filter calls to those matching a basecalled motif. See the modkit repo for this type of command.
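
The motif filtering mentioned above would normally be done with existing tooling (the modkit repo is the suggested place to look). Purely to illustrate the idea, here is a minimal Python sketch that keeps only calls on a C sitting in a CG or GC context of the basecalled sequence. The function names and the input format (0-based positions of called cytosines on a read) are hypothetical and not part of Remora, Dorado, or modkit.

```python
def in_motif_context(seq, pos, motifs=(("CG", 0), ("GC", 1))):
    """Return True if seq[pos] is the focus base of any (motif, offset) pair."""
    for motif, offset in motifs:
        start = pos - offset
        # A slice running off the end of the read is shorter than the motif
        # and therefore never matches.
        if start >= 0 and seq[start:start + len(motif)] == motif:
            return True
    return False


def filter_calls_by_motif(seq, call_positions, motifs=(("CG", 0), ("GC", 1))):
    """Keep only call positions whose sequence context matches a motif."""
    return [p for p in call_positions if in_motif_context(seq, p, motifs)]


# Toy read with cytosines at positions 1, 4, 6 and 9 (0-based).
read = "ACGTCACTGC"
all_context_calls = [1, 4, 6, 9]
kept = filter_calls_by_motif(read, all_context_calls)
print(kept)  # prints [1, 9]
```

Positions 4 and 6 are dropped because those cytosines are in neither a CG nor a GC context, which is where the model was not trained and calls are expected to be unreliable.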

I hope this helps a bit or at least points in the right direction.

Mathias-Boulanger · 2023-11-29T09:36:41Z

Thank you for your useful insights!
I transferred the issue to the Dorado repository here: nanoporetech/dorado#490

I think I will try to convert the model to an all-context one and then filter for my motifs of interest. To do so, do I just need to manually change the metadata of the model?

Using remora infer will work as well, but it will significantly increase the execution time of my pipeline. Still, it's worth a try :)

I'll keep you posted on which approach works best.

Thank you again

marcus1487 · 2024-03-07T21:01:03Z

Converting to an all-context model would indeed be a workaround here. Note that performance outside of the trained contexts would likely be quite poor.

Have you been able to make progress with this workaround?

Mathias-Boulanger · 2024-03-08T16:08:01Z

Hi Marcus,

Unfortunately, this is still on the to-do list because of other priorities in the project. But this is actually good timing, as I should get to it soon.

This is frustrating because the model as it's trained today could be used with Bonito as well. (We are already using Bonito to call CG and GC methylation with a custom model trained with Remora 2.0 on 4 kHz LSK114 datasets.) However, Bonito 0.7.3 does not support Remora > 3.0 models...

Anyway, I'll keep you posted.

marcus1487 · 2024-03-08T18:58:20Z

I am testing upgrading the remora dep in bonito right now, but as a workaround you should just be able to bump the remora version in the bonito requirements.txt file and reinstall. I do not think there are any breaking changes in remora 3.0 in terms of the interface used by bonito.

Mathias-Boulanger · 2024-03-12T10:29:31Z

I just updated the requirements.txt file in the bonito repo with:

ont-remora==3.1.0
pod5==0.3.6

And it's working like a charm! Thank you.
I am currently testing the behavior of the model on a large dataset, compared with the allC 5mC pretrained model.

In parallel, I finally succeeded in training the same model with the 'allC' setting and converting it for dorado usage. It's currently running. I'll keep you posted.

Thank you for your help in letting us move forward.

Mathias-Boulanger · 2024-03-14T16:53:26Z

The model is working very nicely!
You can see that it corrects the GpC bias that we identified in the 5mC allC 5kHz model. Overall, the custom model reduces the number of methylated outliers (compared to bisulfite sequencing, BS) in the bottom-right corner of the scatter plots.

Here is the model performance:

I am looking forward to converting this model into a Dorado-usable one once the two-context feature is supported!

Thank you again

Mathias-Boulanger mentioned this issue Nov 29, 2023

Dorado ability to use Remora model trained on multiple motifs of the same canonical base nanoporetech/dorado#490

Open

marcus1487 closed this as completed May 22, 2024
