Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ref to signal mapping did not consider splicing #118

Open
chilampoon opened this issue Oct 3, 2023 · 5 comments
Open

ref to signal mapping did not consider splicing #118

chilampoon opened this issue Oct 3, 2023 · 5 comments

Comments

@chilampoon
Copy link

chilampoon commented Oct 3, 2023

Hi @marcus1487, I was using remora API on real direct RNA-seq data, but ran into errors from this command:

  io_read = Read.from_pod5_and_alignment(
      pod5_read_record=pod5_reads[bam_read.query_name],
      alignment_record=bam_read,
      reverse_signal=True,
  )
---------------------------------------------------------------------------

File [/opt/miniconda3/lib/python3.11/site-packages/remora/io.py:1545](https://file+.vscode-resource.vscode-cdn.net/opt/miniconda3/lib/python3.11/site-packages/remora/io.py:1545), in Read.add_alignment(self, alignment_record, parse_ref_align, reverse_signal)
 1540 self.ref_to_signal = DC.compute_ref_to_signal(
 1541     query_to_signal=self.query_to_signal,
 1542     cigar=self.cigar,
 1543 )
 1544 # +1 because knots include the end position of the last base
-> 1545 assert self.ref_to_signal.size == len(self.ref_seq) + 1, (
 1546     "discordant ref seq lengths: move+cigar:"
 1547     f"{self.ref_to_signal.size} ref_seq:{len(self.ref_seq)}"
 1548 )
 1549 self.ref_reg.end = self.ref_reg.start + self.ref_to_signal.size - 1

AssertionError: discordant ref seq lengths: move+cigar:16978 ref_seq:2542

Apparently ref_seq:2542 is the number of bases of this gene's exonic regions (i.e. seq length of this read), and move+cigar:16978 ~= ref_end - ref_start on the genomic region (i.e. the range of all exons and introns).

I then tried to resolve it, for here if changes to

knots = np.interp(np.arange(query_knots[-1] + 1), query_knots, ref_knots)

will get the correct lengths as it uses reference knots and query knots to generate reference to read knots. Then for map_ref_to_signal() I guess can just

return np.floor(
        ref_to_query_knots
    ).astype(int)

then it gives ref_to_signal coordinate mapping.
I haven't tested it further, but I can provide a toy dataset for this issue, thank you and looking forward to your comments.

@marcus1487
Copy link
Collaborator

Remora is not currently designed to work with spliced RNA alignments. This is not currently on the roadmap, but is certainly possible. The fix would be to this function primarily. I'd be happy to accept a PR for this, but need to ensure that it would not have any impact on standard genomic alignments.

@chilampoon
Copy link
Author

Remora is not currently designed to work with spliced RNA alignments. This is not currently on the roadmap, but is certainly possible. The fix would be to this function primarily. I'd be happy to accept a PR for this, but need to ensure that it would not have any impact on standard genomic alignments.

Okay, I misunderstood something before but updated the understanding later on, I will open a PR soon.

I wrote a function to get the query to reference mapping, and modified something to get the query to ref coordinates mappings -> this leads to a few differences from the original ref_to_signal values (for genomic reads) but they are more or less similar... for example if I just plt.plot(ref_to_signal). This read's cigar string is 3S10M1D24M1I5M1I7M1D34M1D16M2I15M1I7M1D6M1I12M1I1M2I11M2I5M3I20M15S

image image

it works for my spliced read mentioned above:
image

please review my PR, thank you very much!

@marcus1487
Copy link
Collaborator

I'm on paternity leave at the moment, so I won't be able to take a very close look at this for a little bit, but at a high level unless there is a clear reason I think any differences in produced genomic alignments will not be acceptable. If differences in genomic alignments are required to support the skip operation could you detail those here?

@chilampoon
Copy link
Author

chilampoon commented Oct 10, 2023

Got it, thanks @marcus1487. The differences come from map_ref_to_signal()

def map_ref_to_signal(*, query_to_signal, ref_to_query_knots):
    return np.floor(
        np.interp(
            ref_to_query_knots,
            np.arange(query_to_signal.size),
            query_to_signal,
        )
    ).astype(int)

if stick to it, the ref_to_signal array with knots from my new function get_ref_coords() is exactly the same as before for genomic reads.

However for spliced reads, it doesn't generate the correct mapping as the xp in numpy.interp() is not continuous but with gaps - that's why I added

query_to_ref_coords = np.interp(
        np.arange(query_to_signal.size), 
        np.arange(ref_to_query_knots.size), 
        ref_to_query_knots
    ).astype(int)

to replace np.arange(query_to_signal.size).

One way I could think of is to first determine if a read has intron or not, then based on that choose to use np.arange(query_to_signal.size) or the other mapping..

@chilampoon
Copy link
Author

@marcus1487 I found out I don't need another interpolation in map_ref_to_signal() at all 🤦‍♀️

the spliced read's ref_to_signal is now like
image
where the last 73 bases are soft clippings.

I then tested the genomic/unspliced reads I have (I wrapped the steps into get_manual_mappins() - please ignore the type)

matches = []
num_processed_reads = 0
for bam_read in bam_reads_reg:
    try:
        mapping = get_manual_mappins(bam_read, pod5_reads[bam_read.query_name])
        num_processed_reads += 1
    except:
        continue

    io_read = Read.from_pod5_and_alignment(
        pod5_read_record=pod5_reads[bam_read.query_name],
        alignment_record=bam_read,
        reverse_signal=True,
    )
    matches.append((mapping == io_read.ref_to_signal).all())
print(num_processed_reads)
np.unique(matches, return_counts=True)

66686
(array([ True]), array([66686]))

the results are all the same! sorry for the mess... all the stuff is a bit complicated. I'm still in awe that people like you can create such complex software, I am going to update the PR now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants