
100% training accuracy - unable to navigate mistake #165

Closed
moa4020 opened this issue Mar 11, 2024 · 1 comment


moa4020 commented Mar 11, 2024

Hi Marcus,

I managed to make a dataset that has separate 10+ nucleotide handles on either side of the unmod G and mod G randomers with the structure: LeftHandle1-NNNNGNNNN-RightHandle1 and LeftHandle2-NNNNGNNNN-RightHandle2.

I'm trying to extract chunks from this dataset, but I don't think I'm doing it right. Here are my commands for extracting chunks:

remora dataset prepare pod5Dir/merged.pod5 bamDir/v02.2_trimmed_aligned_moves.bam --output-path $outDir/controlchunks --refine-kmer-level-table $outDir/9mer_levels_v2.txt --refine-rough-rescale --motif TNNNNGNNNNG 5 --mod-base-control --num-extract-chunks-workers 2

remora dataset prepare pod5Dir/merged.pod5 bamDir/v02.2_trimmed_aligned_moves.bam --output-path $outDir/modchunks --refine-kmer-level-table $outDir/9mer_levels_v2.txt --refine-rough-rescale --motif CNNNNGNNNNT 5 --mod-base o 8oxoG --num-extract-chunks-workers 2
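For what it's worth, the focus position passed to `--motif` is a zero-based index into the motif string, so an offset of 5 should point at the central G of both motifs used above. A quick sanity check in Python (motif strings copied from the commands above):

```python
# Sanity check: `--motif <seq> <pos>` takes a zero-based focus offset,
# so offset 5 should land on the central G in both motifs.
for motif in ("TNNNNGNNNNG", "CNNNNGNNNNT"):
    focus = 5
    print(motif, "-> focus base:", motif[focus])  # both print 'G'
```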

Then I made a config and trained the model:

remora \
  dataset make_config \
  train_dataset.jsn \
  controlchunks \
  modchunks \
  --dataset-weights 1 1 \
  --log-filename train_dataset.log
remora \
  model train \
  train_dataset.jsn \
  --model models/ConvLSTM_w_ref.py \
  --device 0 \
  --chunk-context 50 50 \
  --output-path train_results

And this is where things start to feel fishy: the training output shows that the central position is 7, but it shouldn't be. The central position/focus base should be the G in the middle, which is the 6th position (index 5, in Python).

Output:

(base) [moa42@cayuga-login err]$ cat remora_train_295err
[56] Seed selected is 92422
[5242] Loading dataset from Remora dataset config
[5] Dataset summary
                     size  26,59
     modified_base_labels  True
                mod_bases  ['o']
           mod_long_names  ['oxoG']
       kmer_context_bases  (4, 4)
            chunk_context  (5, 5)
                   motifs  [('CNNNNGNNNNT', 5), ('TNNNNGNNNNG', 5)]
           reverse_signal  False
 chunk_extract_base_start  False
     chunk_extract_offset  
          sig_map_refiner  Loaded 9-mer table with central position. Rough re-scaling will be executed

[5] Loading model
[646] Model structure
network(
  (sig_conv): Conv1d(1, 4, kernel_size=(5,), stride=(1,))
  (sig_bn): BatchNorm1d(4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (sig_conv2): Conv1d(4, 16, kernel_size=(5,), stride=(1,))
  (sig_bn2): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (sig_conv3): Conv1d(16, 64, kernel_size=(9,), stride=(1,))
  (sig_bn3): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (seq_conv): Conv1d(16, 16, kernel_size=(5,), stride=(1,))
  (seq_bn): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (seq_conv2): Conv1d(16, 64, kernel_size=(1,), stride=(1,))
  (seq_bn2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (merge_conv): Conv1d(128, 64, kernel_size=(5,), stride=(1,))
  (merge_bn): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (lstm): LSTM(64, 64)
  (lstm2): LSTM(64, 64)
  (fc): Linear(in_features=64, out_features=2, bias=True)
  (dropout): Dropout(p=, inplace=False)
)
[442] Params (k) 4 | MACs (M) 245
[442] Preparing training settings
[54] Dataset loaded with labels control9,; oxoG42,24
[549] Train labels control4,; oxoG,24
[549] Held-out validation labels control5,; oxoG5,
[549] Training set validation labels control5,; oxoG5,
[549] Running initial validation
Batches 5it [,  2s/it]
Batches 5it [, 45it/s]
[9645] Start training
Epochs  2%|█▏        | 2/ [959<456, 9996s/it, acc_train=, acc_val=9996, loss_train=, loss_val=4]
Epoch Progress %|██████████| 4/4
[4962] No validation accuracy improvement after  epochs. Training stopped early
[4962] Saving final model checkpoint
[49522] Done
marcus1487 (Collaborator) commented:

There are indeed a number of issues with this setup. To start, Remora is not intended to process randomer strands directly; randomer datasets are processed into Remora datasets using the Betta program. I would urge you to join the Betta program if the randomer approach is essential to your project.

To address the processing issues directly: the --motif TNNNNGNNNNG 5 argument sets the motif for the resulting dataset. Without other arguments, only locations (in both basecalls and reference) matching this motif are included in the dataset. This means the resulting model will only make calls at TNNNNGNNNNG basecall sequences, and that every site where the reference and basecalls match this motif will be included in the dataset. Overall, it does not sound as though this is the intended target for this dataset.
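To illustrate the motif semantics described here — this is a simplified sketch, not Remora's actual implementation, and the helper name and IUPAC table are assumptions — a motif with N wildcards selects every position in a sequence where the pattern matches, and the zero-based focus offset picks the called base within each match:

```python
import re

# Simplified sketch of IUPAC-motif site selection (not Remora's code).
# Each N matches any base; the focus offset is zero-based within the motif.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "[ACGT]"}

def motif_focus_positions(seq, motif, focus):
    """Return zero-based coordinates of the focus base at every motif hit."""
    pattern = "".join(IUPAC[b] for b in motif)
    # Lookahead assertion so overlapping motif hits are all reported.
    return [m.start() + focus for m in re.finditer(f"(?={pattern})", seq)]

seq = "AATACGTGACGTGCC"
print(motif_focus_positions(seq, "TNNNNGNNNNG", 5))  # [7]; seq[7] == 'G'
```

Every coordinate returned here would become a candidate training chunk, which is why a loose motif pulls in every matching site, not just the designed construct.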

The primary target for Remora-only (without access to Betta) data preparation and models is to annotate reference locations with canonical or modified bases. The motif argument is intended to limit the model to motifs and not necessarily as the selection criteria for the training chunks.

It also looks like you are acquiring very few training chunks (some of the numbers in the log appear to contain copying errors, but the counts look low). We generally recommend at least 1 million chunks for training; training with fewer examples will likely lead to overfitting to the examples provided.
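As a rough back-of-envelope (the numbers here are illustrative assumptions, not from this thread): with one focus site per randomer strand, reaching the ~1 million-chunk recommendation requires reads on the same order:

```python
# Back-of-envelope: reads needed to reach the recommended chunk count,
# assuming one motif hit (one chunk) per read. Numbers are illustrative.
target_chunks = 1_000_000
chunks_per_read = 1  # one focus G per randomer strand
reads_needed = target_chunks // chunks_per_read
print(f"{reads_needed:,} reads needed")  # 1,000,000 reads needed
```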

I hope this helps you along the track to processing your samples. Please post here if any further assistance is needed.
