Exon boundary / intron validation - need for genome build specific validation? #700

Open
davmlaw opened this issue Sep 28, 2023 · 4 comments
Labels
data provider schema change enhancement New feature or request keep alive exempt issue from staleness checks

davmlaw commented Sep 28, 2023

Biocommons HGVS currently does not validate intronic coordinates: it checks neither that the base position is a real exon boundary nor that the offset lands inside an intron.

The trouble is that this validation requires the transcript's strandedness, which HGVS does not have access to until it knows the genome build. This means the validation can't be done in the obvious place, which is probably ExtrinsicValidator on the "var_n" (sequence variant of type "n").

For instance, you could provide a wrong exon boundary. The HGVS spec on numbering says:

nucleotides at the 5’ end of an intron are numbered relative to the last nucleotide of the directly upstream exon, followed by a “+” (plus) and their position in to the intron, like c.87+1, c.87+2, c.87+3, …
nucleotides at the 3’ end of an intron are numbered relative to the first nucleotide of the directly downstream exon, followed by a “-” (minus) and their position out of the intron, like …, c.88-3, c.88-2, c.88-1.

If the offset is positive, the base position should be one of the exon ends (taking strand into account).
If the offset is negative, the base position should be one of the exon starts (taking strand into account).

Example 1 (no error w/Biocommons HGVS): the correct boundary is 228, but I provide the wrong exon boundary:

| Example | Variant Validator | ClinGen Allele Registry |
| --- | --- | --- |
| NM_152587.3:c.228+1G>T | valid | valid |
| NM_152587.3:c.227+1A>T | ExonBoundaryError: Position c.227+1 does not correspond with an exon boundary for transcript NM_152587.3 | InternalServerError - intronic position inside exon |

ClinGen gives the same error message whenever the exon boundary is wrong, even if the position would still fall inside the intron (e.g. NM_152587.3:c.227+5A>T).

Example 2 (no error w/Biocommons HGVS): I reverse the offset sign (from "-" to "+"), leaving the boundary as is:

| Example | Variant Validator | ClinGen Allele Registry |
| --- | --- | --- |
| NM_152587.3:c.175-1G>C | valid | valid |
| NM_152587.3:c.175+1C>G | ExonBoundaryError: Position c.175+1 does not correspond with an exon boundary for transcript NM_152587.3 | InternalServerError - intronic position inside exon |

VariantValidator checks the exon boundary that is correct for the strandedness (i.e. starts or ends), so even a valid exon boundary with the offset sign reversed gives the same error.

Notes on validation implementation

Key issue: to know the correct exon starts/ends, you need to know the transcript's strandedness.

It's often easier to work with sequence variants of type "n", as their boundaries correspond to transcript exon starts/ends, e.g.:

| c.HGVS | n.HGVS | exon boundary | where to look |
| --- | --- | --- | --- |
| NM_152587.3:c.228+1G>T | NM_152587.3:n.298+1G>T | 298 | '+' so upstream |
| NM_152587.3:c.175-1G>C | NM_152587.3:n.245-1G>C | 245 | '-' so downstream |
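
For positions inside the CDS, the c. to n. correspondence in the table is a constant shift. A minimal sketch, assuming a CDS start offset of 70 (inferred from c.228 vs n.298 above, not read from a data provider); `c_to_n_base` is a hypothetical helper, not part of Biocommons HGVS:

```python
CDS_START_OFFSET = 70  # assumption: inferred from the table (298 - 228)


def c_to_n_base(c_base: int, cds_start_offset: int = CDS_START_OFFSET) -> int:
    """Shift a positive c. base position to the matching n. base position."""
    if c_base < 1:
        # 5' UTR (c.-N) and 3' UTR (c.*N) positions need different handling
        raise ValueError("sketch only handles positions inside the CDS (c. >= 1)")
    return c_base + cds_start_offset


print(c_to_n_base(228))  # 298
print(c_to_n_base(175))  # 245
```

The intronic offset (+1, -1) is carried over unchanged; only the base position shifts.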

But how do you map upstream/downstream to exon starts/ends? You need to know the strand, and this is NOT provided by any data provider method that doesn't take a genome build / contig.

import hgvs.dataproviders.uta

hdp = hgvs.dataproviders.uta.connect()
tx_info = hdp.get_tx_identity_info("NM_152587.3")

# To know how starts/ends map, you need to know the strandedness of the transcript (here "-")
# Build 1-based, inclusive exon starts/ends in transcript coordinates from the exon lengths
exon_starts = []
exon_ends = []
total = 0
for length in tx_info["lengths"]:
    exon_starts.append(total + 1)
    total += length
    exon_ends.append(total)

In [115]: exon_starts
Out[115]: [1, 62, 152, 245, 299, 500, 631, 802, 858]

In [116]: exon_ends
Out[116]: [61, 151, 244, 298, 499, 630, 801, 857, 1033]
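
With lists like these, the sign rule can be checked in n. coordinates. This is a minimal sketch, assuming the exon lengths are already in transcript (5' to 3') order; whether that holds without strand information is exactly the open question in this issue. The lengths are reconstructed from the output above, and `check_intronic_boundary` is a hypothetical helper, not an existing Biocommons HGVS function:

```python
from itertools import accumulate

# Exon lengths reconstructed from the exon_starts/exon_ends output above
LENGTHS = [61, 90, 93, 54, 201, 131, 171, 56, 176]


def exon_bounds(lengths):
    """Return 1-based, inclusive (starts, ends) in transcript coordinates."""
    ends = list(accumulate(lengths))
    starts = [1] + [end + 1 for end in ends[:-1]]
    return starts, ends


def check_intronic_boundary(base, offset, lengths=LENGTHS):
    """True if n. position `base` is a valid anchor for the given intronic offset."""
    starts, ends = exon_bounds(lengths)
    if offset > 0:
        # '+' offsets hang off the last base of an exon; no intron follows the final exon
        return base in ends[:-1]
    if offset < 0:
        # '-' offsets hang off the first base of an exon; no intron precedes the first exon
        return base in starts[1:]
    return False  # an offset of 0 is not an intronic position


print(check_intronic_boundary(298, +1))  # True   (c.228+1)
print(check_intronic_boundary(297, +1))  # False  (c.227+1)
print(check_intronic_boundary(245, -1))  # True   (c.175-1)
print(check_intronic_boundary(245, +1))  # False  (c.175+1)
```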

Valid exon boundaries that map outside the transcript

  • Offsets that are so big they map off the transcript
  • Offsets that are so big they map past the intron into the next exon

To work out either of these, you need to know how big the introns are, which you can only get via data provider methods that take a contig/genome build.
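
Once an intron length is known, the check itself is a simple comparison. A minimal sketch with a made-up intron length (a real value would have to come from genome-build-specific exon alignments); `offset_fits_intron` is a hypothetical helper:

```python
def offset_fits_intron(offset: int, intron_length: int) -> bool:
    """True if a non-zero offset stays within an intron of the given length."""
    if offset == 0:
        return False
    return abs(offset) <= intron_length


INTRON_LENGTH = 1200  # assumption: illustrative value only, not real data

print(offset_fits_intron(+5, INTRON_LENGTH))     # True
print(offset_fits_intron(-1201, INTRON_LENGTH))  # False: runs past the intron
```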

Offsets of 0 are prohibited

This is a low priority issue, and probably doesn't hurt much to leave it.

But technically, NM_152587.3:c.228-0= is invalid.

VariantValidator doesn't throw an error, but the ClinGen Allele Registry throws "HgvsParsingError - Cannot parse definition of mutation".
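
Unlike the boundary checks, a zero offset can be detected from the string alone, without strand or genome build. A minimal regex-based sketch; it does not use the Biocommons HGVS parser, and `has_zero_offset` is a hypothetical helper:

```python
import re

# A base position followed by a signed offset, e.g. "228+1" or "175-1"
_OFFSET_RE = re.compile(r"(?P<base>[-*]?\d+)(?P<offset>[+-]\d+)")


def has_zero_offset(hgvs_string: str) -> bool:
    """True if any position in the variant carries an explicit +0 or -0 offset."""
    posedit = hgvs_string.split(":", 1)[-1]     # drop the accession
    posedit = re.sub(r"^[cnr]\.", "", posedit)  # drop the coordinate type prefix
    return any(int(m.group("offset")) == 0 for m in _OFFSET_RE.finditer(posedit))


print(has_zero_offset("NM_152587.3:c.228-0="))    # True
print(has_zero_offset("NM_152587.3:c.228+1G>T"))  # False
```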

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the stale Issue is stale and subject to automatic closing label Nov 29, 2023
github-actions bot commented Dec 6, 2023

This issue was closed because it has been stalled for 7 days with no activity.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Dec 6, 2023
@reece reece removed stale Issue is stale and subject to automatic closing closed-by-stale labels Dec 8, 2023
@reece reece reopened this Dec 8, 2023
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the stale Issue is stale and subject to automatic closing label Mar 11, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Mar 19, 2024
@jsstevenson jsstevenson added enhancement New feature or request keep alive exempt issue from staleness checks and removed stale Issue is stale and subject to automatic closing closed-by-stale labels Mar 19, 2024
@jsstevenson jsstevenson reopened this Mar 19, 2024