Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

simplify SeqLike.__getitem__() to ignore the AA sequence when accessing/slicing the NT sequence #72

Open
ndousis opened this issue Feb 1, 2023 · 2 comments
Assignees
Labels
enhancement New feature or request

Comments

@ndousis
Copy link
Contributor

ndousis commented Feb 1, 2023

As currently implemented, slicing a SeqLike that contains AA and NT records yields:

  1. a simple behavior when the AA record is primary, and
  2. a slow and less robust behavior when the NT record is primary (e.g., if the slice is not a triplet, SeqLike returns a "Partial codon" warning)

In particular, the computational overhead of translating the NT sequence during each __getitem__() call can be very tedious (e.g., when applying a long sliding window over an entire sequence).

I propose that we eliminate the second behavior (attempting to slice the AA record when NT is primary)-- i.e., when slicing the NT SeqLike, all bets are off with respect to the AA record, and the user must explicitly translate the SeqLike to yield a new AA record.

Alternatively, we could add a flag to optionally retain the AA record when True (default False; similar to how BioPython already addresses this for SeqRecord.reverse_complement()), though I don't know how to implement this properly in __getitem__().

@ndousis ndousis added the enhancement New feature or request label Feb 1, 2023
@ndousis ndousis self-assigned this Feb 1, 2023
@andrewgiessel
Copy link
Contributor

Thanks for opening @ndousis !

Keeping the AA and NT records in sync (if we can) is a core design tenant of SeqLike and I am very hesitant to change that. I do appreciate how it adds a lot of overhead, so I've explored a couple of options below.

Your flag idea could potentially work. Optional arguments to __getitem__ are possible, but you can't pass them in unless you call the method directly rather than use the [] syntax. i.e.:

s = SeqLike('aaacccgggttt', seq_type="NT")

# the following are the same
s[0:3]
s.__getitem__(slice(0, 3, None))

# assuming the following signature,
#     def __getitem__(self, index, safe=True) -> "SeqLike":
# then the following statement works and `safe` is False within `__getitem__`

s.__getitem__(slice(0, 3, None), safe=False)

I've implemented this on my machine (but with behavior preserving defaults) as well as not translating if the NT slice is not a triplet. The latter check is a good suggestion regardless. But you can see that the code is a bit verbose because you have to manually make a slice object.

That said, I do not think the flag is worth implementing unless there is a compelling reason you must use SeqLikes for these sections of your workflow. I would suggest instead to explore using the underlying NT SeqRecord and then remaking any SeqLikes if needed afterwards. This would be especially do-able if you don't need the resulting sequence but only a calculated value on sliding windows. I feel like this is your best option here.

Another alternative would be to subclass SeqLike for your specific use case and override __getitem__ there.

Let me know your thoughts

@ndousis
Copy link
Contributor Author

ndousis commented Feb 3, 2023

If the intended behavior is to keep AA and NT in sync, then perhaps we should implement that behavior more directly rather than relying on a re-translation step. As __getitem__() is implemented currently, a sliding NT window with a single nucleotide stride would yield many out-of-frame AA sequences-- so the AA sequence is practically no longer synchronized. Instead, what if __getitem()__ applied a modulo operation to maintain the codon-AA correspondence using "X" or similar at the 5' and 3' flanks?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants