ref to signal mapping did not consider splicing #118

chilampoon · 2023-10-03T22:23:49Z

Hi @marcus1487, I was using remora API on real direct RNA-seq data, but ran into errors from this command:

  io_read = Read.from_pod5_and_alignment(
      pod5_read_record=pod5_reads[bam_read.query_name],
      alignment_record=bam_read,
      reverse_signal=True,
  )

---------------------------------------------------------------------------

File [/opt/miniconda3/lib/python3.11/site-packages/remora/io.py:1545](https://file+.vscode-resource.vscode-cdn.net/opt/miniconda3/lib/python3.11/site-packages/remora/io.py:1545), in Read.add_alignment(self, alignment_record, parse_ref_align, reverse_signal)
 1540 self.ref_to_signal = DC.compute_ref_to_signal(
 1541     query_to_signal=self.query_to_signal,
 1542     cigar=self.cigar,
 1543 )
 1544 # +1 because knots include the end position of the last base
-> 1545 assert self.ref_to_signal.size == len(self.ref_seq) + 1, (
 1546     "discordant ref seq lengths: move+cigar:"
 1547     f"{self.ref_to_signal.size} ref_seq:{len(self.ref_seq)}"
 1548 )
 1549 self.ref_reg.end = self.ref_reg.start + self.ref_to_signal.size - 1

AssertionError: discordant ref seq lengths: move+cigar:16978 ref_seq:2542

Apparently ref_seq:2542 is the number of bases of this gene's exonic regions (i.e. seq length of this read), and move+cigar:16978 ~= ref_end - ref_start on the genomic region (i.e. the range of all exons and introns).

I then tried to resolve it, for here if changes to

knots = np.interp(np.arange(query_knots[-1] + 1), query_knots, ref_knots)

will get the correct lengths as it uses reference knots and query knots to generate reference to read knots. Then for map_ref_to_signal() I guess can just

return np.floor(
        ref_to_query_knots
    ).astype(int)

then it gives ref_to_signal coordinate mapping.
I haven't tested it further, but I can provide a toy dataset for this issue, thank you and looking forward to your comments.

The text was updated successfully, but these errors were encountered:

marcus1487 · 2023-10-05T15:56:57Z

Remora is not currently designed to work with spliced RNA alignments. This is not currently on the roadmap, but is certainly possible. The fix would be to this function primarily. I'd be happy to accept a PR for this, but need to ensure that it would not have any impact on standard genomic alignments.

chilampoon · 2023-10-10T17:39:57Z

Remora is not currently designed to work with spliced RNA alignments. This is not currently on the roadmap, but is certainly possible. The fix would be to this function primarily. I'd be happy to accept a PR for this, but need to ensure that it would not have any impact on standard genomic alignments.

Okay, I misunderstood something before but updated the understanding later on, I will open a PR soon.

I wrote a function to get the query to reference mapping, and modified something to get the query to ref coordinates mappings -> this leads to a few differences from the original ref_to_signal values (for genomic reads) but they are more or less similar... for example if I just plt.plot(ref_to_signal). This read's cigar string is 3S10M1D24M1I5M1I7M1D34M1D16M2I15M1I7M1D6M1I12M1I1M2I11M2I5M3I20M15S

it works for my spliced read mentioned above:

please review my PR, thank you very much!

marcus1487 · 2023-10-10T18:57:43Z

I'm on paternity leave at the moment, so I won't be able to take a very close look at this for a little bit, but at a high level unless there is a clear reason I think any differences in produced genomic alignments will not be acceptable. If differences in genomic alignments are required to support the skip operation could you detail those here?

chilampoon · 2023-10-10T19:24:07Z

Got it, thanks @marcus1487. The differences come from map_ref_to_signal()

def map_ref_to_signal(*, query_to_signal, ref_to_query_knots):
    return np.floor(
        np.interp(
            ref_to_query_knots,
            np.arange(query_to_signal.size),
            query_to_signal,
        )
    ).astype(int)

if stick to it, the ref_to_signal array with knots from my new function get_ref_coords() is exactly the same as before for genomic reads.

However for spliced reads, it doesn't generate the correct mapping as the xp in numpy.interp() is not continuous but with gaps - that's why I added

query_to_ref_coords = np.interp(
        np.arange(query_to_signal.size), 
        np.arange(ref_to_query_knots.size), 
        ref_to_query_knots
    ).astype(int)

to replace np.arange(query_to_signal.size).

One way I could think of is to first determine if a read has intron or not, then based on that choose to use np.arange(query_to_signal.size) or the other mapping..

chilampoon · 2023-10-11T20:44:37Z

@marcus1487 I found out I don't need another interpolation in map_ref_to_signal() at all 🤦‍♀️

the spliced read's ref_to_signal is now like

where the last 73 bases are soft clippings.

I then tested the genomic/unspliced reads I have (I wrapped the steps into get_manual_mappins() - please ignore the type)

matches = []
num_processed_reads = 0
for bam_read in bam_reads_reg:
    try:
        mapping = get_manual_mappins(bam_read, pod5_reads[bam_read.query_name])
        num_processed_reads += 1
    except:
        continue

    io_read = Read.from_pod5_and_alignment(
        pod5_read_record=pod5_reads[bam_read.query_name],
        alignment_record=bam_read,
        reverse_signal=True,
    )
    matches.append((mapping == io_read.ref_to_signal).all())

print(num_processed_reads)
np.unique(matches, return_counts=True)

66686
(array([ True]), array([66686]))

the results are all the same! sorry for the mess... all the stuff is a bit complicated. I'm still in awe that people like you can create such complex software, I am going to update the PR now

chilampoon mentioned this issue Oct 10, 2023

Add query to ref coord mapping #120

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ref to signal mapping did not consider splicing #118

ref to signal mapping did not consider splicing #118

chilampoon commented Oct 3, 2023 •

edited

marcus1487 commented Oct 5, 2023

chilampoon commented Oct 10, 2023

marcus1487 commented Oct 10, 2023

chilampoon commented Oct 10, 2023 •

edited

chilampoon commented Oct 11, 2023

ref to signal mapping did not consider splicing #118

ref to signal mapping did not consider splicing #118

Comments

chilampoon commented Oct 3, 2023 • edited

marcus1487 commented Oct 5, 2023

chilampoon commented Oct 10, 2023

marcus1487 commented Oct 10, 2023

chilampoon commented Oct 10, 2023 • edited

chilampoon commented Oct 11, 2023

chilampoon commented Oct 3, 2023 •

edited

chilampoon commented Oct 10, 2023 •

edited