RFC: How to handle sequence casing and other transformations #94

Open
reece opened this issue Mar 3, 2021 · 3 comments
Labels
keep alive exempt issue from staleness checks

Comments

reece commented Mar 3, 2021

Problem Summary

There are several flavors of GRCh38. All are coordinate-compatible but have distinct sequences. "Official" GRCh38 sequences are uppercase and contain ambiguity characters. Ensembl replaces ambiguity characters with N. hg38 from UCSC represents repeat regions in lowercase.

SeqRepo needs a way to preserve the original sequence verbatim, but also to support commonly used transformations, and to make this choice apparent to users.
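One way to make the choice explicit is to store the sequence verbatim and have callers name any transformation at read time. The sketch below is illustrative only: `Transform`, `fetch`, and the transform names are hypothetical and are not part of the SeqRepo API.

```python
from enum import Enum

# IUPAC nucleotide ambiguity codes other than N.
AMBIGUITY_CODES = "RYSWKMBDHV"
_TO_N = str.maketrans(
    AMBIGUITY_CODES + AMBIGUITY_CODES.lower(),
    "N" * len(AMBIGUITY_CODES) + "n" * len(AMBIGUITY_CODES),
)


class Transform(Enum):
    """Hypothetical names for the transformations discussed in this issue."""
    VERBATIM = "verbatim"          # return exactly what was stored
    UPPER = "upper"                # squash case (drops soft-masking)
    AMBIGUITY_TO_N = "ambig_to_n"  # replace ambiguity codes with N


def fetch(stored: str, start: int, end: int, transforms=(Transform.VERBATIM,)) -> str:
    """Return stored[start:end] (0-based, half-open), applying transforms in order.

    The default is verbatim, so a transformed sequence is only returned
    when the caller asks for it by name.
    """
    seq = stored[start:end]
    for t in transforms:
        if t is Transform.UPPER:
            seq = seq.upper()
        elif t is Transform.AMBIGUITY_TO_N:
            seq = seq.translate(_TO_N)
    return seq
```

With a toy stored sequence `"ACGTacgtRYN"`, `fetch(stored, 2, 10)` returns the slice untouched, while passing `(Transform.UPPER, Transform.AMBIGUITY_TO_N)` returns the same coordinates with case squashed and ambiguity codes replaced.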

Background

The GRC defines official genomic references, which include the assembly name, member accessions, nucleotide sequences, alternate assemblies, etc. For an example, see the GCF_000001405.26 assembly report.

In GRCh38, the sequence referred to by both GRCh38:1 and refseq:NC_000001.11 is a (masked) sequence with ambiguity characters. It is unacceptable to hijack these identifiers to mean another sequence. However, these sequences are not very usable as-is because, for example, no one expects lowercase in a genomic sequence. (Embedding annotations like masking into sequences is a mistake.)

Because the GRC sequences are inconvenient to use as-is, UCSC and Ensembl transform the sequences to be more useful. The transformations preserve coordinates but change the sequence content, for example by changing case or by replacing ambiguity characters with N. Thus, we have at least two versions of each sequence for a given assembly: the original and one or more transformed variants.
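To illustrate "same coordinates, different content", here is a minimal sketch that compares two flavors of a sequence position by position. The function name and the toy sequences are invented for this example; they are not real GRCh38 data.

```python
from typing import List


def differing_positions(a: str, b: str) -> List[int]:
    """Return 0-based positions where two coordinate-compatible sequences differ.

    Raises ValueError when the lengths differ, since the sequences then
    cannot share a coordinate system.
    """
    if len(a) != len(b):
        raise ValueError("sequences are not coordinate-compatible (lengths differ)")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Toy data (invented): one "original" and two transformed flavors.
original = "ACGTRYACGT"     # uppercase, with ambiguity codes R and Y
soft_masked = "ACgtRYacGT"  # same bases, repeats shown in lowercase
n_masked = "ACGTNNACGT"     # ambiguity codes replaced with N
```

Here `differing_positions(original, soft_masked)` reports only the lowercased positions and `differing_positions(original, n_masked)` reports only the ambiguity positions, yet every position in one flavor refers to the same locus in the others. This is why the flavors need distinct identifiers even though they share coordinates.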

While supporting case-squashing and disambiguating sequences, it should also be possible to support reverse complement and circular sequences and coordinates.
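As a rough sketch of what reverse complement and circular coordinates could look like, assuming 0-based, half-open intervals (an assumption; the issue does not specify a coordinate convention). The helper names are hypothetical.

```python
# Complement table covering IUPAC nucleotide codes, preserving case.
_COMPLEMENT = str.maketrans(
    "ACGTRYSWKMBDHVNacgtryswkmbdhvn",
    "TGCAYRSWMKVHDBNtgcayrswmkvhdbn",
)


def reverse_complement(seq: str) -> str:
    """Complement each base (including ambiguity codes), then reverse."""
    return seq.translate(_COMPLEMENT)[::-1]


def circular_slice(seq: str, start: int, end: int) -> str:
    """Slice a circular sequence using 0-based, half-open coordinates.

    When start > end the interval wraps past the origin, so the result is
    the tail of the sequence followed by its head.
    """
    n = len(seq)
    if not (0 <= start <= n and 0 <= end <= n):
        raise ValueError("coordinates out of range")
    if start <= end:
        return seq[start:end]
    return seq[start:] + seq[:end]
```

Under these assumptions, `circular_slice("ABCDEFGH", 6, 2)` wraps around the origin, and `reverse_complement` keeps case so it composes with a separate case-squashing step rather than implying one.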

@github-actions

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the stale Issue is stale and subject to automatic closing label Sep 19, 2023
@github-actions

This issue was closed because it has been stalled for 7 days with no activity.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 27, 2023
@reece reece added stale closed Issue was closed automatically due to inactivity and removed stale closed Issue was closed automatically due to inactivity labels Nov 27, 2023
@reece reece reopened this Dec 8, 2023
@reece reece removed the stale Issue is stale and subject to automatic closing label Dec 8, 2023

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the stale Issue is stale and subject to automatic closing label May 25, 2024
@jsstevenson jsstevenson added keep alive exempt issue from staleness checks and removed stale Issue is stale and subject to automatic closing labels May 25, 2024
2 participants