There are indeed a number of issues with this setup. I'll start with the fact that Remora is not intended for the processing of randomer strands directly. Randomer datasets are processed into Remora datasets using the Betta program. I would urge you to join the Betta program if the randomer approach is essential to your project.
To address the processing issues directly: the --motif TNNNNGNNNNG 5 argument sets the motif for the resulting dataset. Without other arguments, only locations (basecalls and reference) matching this motif are included in the resulting dataset, and every site where the reference and basecalls match this motif is included. It also means that the resulting model will only make calls within the TNNNNGNNNNG sequence of basecalls. Overall, it does not sound as though this is the intended target for this dataset.
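To make the motif and offset concrete, here is a minimal sketch of how a motif with a 0-based focus offset selects sites in a sequence. This is not Remora's implementation; the function name motif_focus_sites, the simplified IUPAC table, and the example sequence are all made up for illustration.

```python
# Illustrative only: NOT Remora's code. All names here are hypothetical.
# Simplified IUPAC table covering just the codes used in this thread.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "ACGT"}


def motif_focus_sites(seq, motif, focus_offset):
    """Return the 0-based positions in seq of the focus base for each motif match."""
    sites = []
    for start in range(len(seq) - len(motif) + 1):
        window = seq[start:start + len(motif)]
        # A window matches when every base is allowed by the motif code at that position.
        if all(base in IUPAC[code] for base, code in zip(window, motif)):
            sites.append(start + focus_offset)
    return sites


motif = "TNNNNGNNNNG"
focus_offset = 5  # 0-based, so this points at the central G (the 6th base)
assert motif[focus_offset] == "G"

# Made-up sequence with one motif match starting at index 2.
example_seq = "AATACGTGCATCGAA"
print(motif_focus_sites(example_seq, motif, focus_offset))
```

Under these assumptions, only positions that sit inside a full motif match are reported, which is why a motif this specific restricts the dataset to a narrow set of sites.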
The primary target for Remora-only (without access to Betta) data preparation and models is to annotate reference locations with canonical or modified bases. The motif argument is intended to limit the model to calling within given motifs, and not necessarily to serve as the selection criterion for the training chunks.
It also looks like you are acquiring very few training chunks. There appear to be some copying errors in what you posted, but the number of training chunks looks very small. We generally recommend at least 1 million chunks for training. Training with fewer examples will likely lead to overfitting to the examples provided.
I hope this helps you along the track to processing your samples. Please post here if any further assistance is needed.
Hi Marcus,
I managed to make a dataset that has separate 10+ nucleotide handles on either side of the unmod G and mod G randomers with the structure:
LeftHandle1-NNNNGNNNN-RightHandle1
and
LeftHandle2-NNNNGNNNN-RightHandle2
And I'm trying to extract chunks from this dataset but I don't think I'm doing it right. Here is my code for extracting chunks:
remora dataset prepare pod5Dir/merged.pod5 bamDir/v02.2_trimmed_aligned_moves.bam --output-path $outDir/controlchunks --refine-kmer-level-table $outDir/9mer_levels_v2.txt --refine-rough-rescale --motif TNNNNGNNNNG 5 --mod-base-control --num-extract-chunks-workers 2
remora dataset prepare pod5Dir/merged.pod5 bamDir/v02.2_trimmed_aligned_moves.bam --output-path $outDir/modchunks --refine-kmer-level-table $outDir/9mer_levels_v2.txt --refine-rough-rescale --motif CNNNNGNNNNT 5 --mod-base o 8oxoG --num-extract-chunks-workers 2
Then I made a config and trained the model:
And this is where things start feeling fishy, because the output from the training shows that the central position is 7. But it shouldn't be. The central position/focus base should be the G in the middle, which is at the 6th position (index 5 in Python's 0-based counting).
Output: