Don't align transcripts with different numbers of exons #195

Open
reece opened this issue Sep 28, 2015 · 4 comments
Labels
enhancement New feature or request

@reece
Member

reece commented Sep 28, 2015

Originally reported by Reece Hart (Bitbucket: reece, GitHub: reece) in biocommons/uta #195
Migrated by bitbucket-issue-migration on 2016-09-09 15:15:07


UTA has historically aligned transcript and genomic exons even when the two exon sets contain different numbers of exons. This practice masks real issues in the underlying data and should be discontinued.
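
A minimal sketch of the kind of guard this issue asks for, assuming each exon set is available as a list of (start, end) pairs. The names `pair_exons` and `ExonCountMismatch` are hypothetical and are not part of the UTA loader.

```python
class ExonCountMismatch(ValueError):
    """Raised when two exon sets cannot be paired one-to-one."""


def pair_exons(tx_exons, alt_exons):
    """Pair transcript exons with genomic exons by ordinal position.

    Hypothetical helper: refuses to pair exon sets of different sizes
    instead of aligning whichever exons happen to line up.
    """
    if len(tx_exons) != len(alt_exons):
        raise ExonCountMismatch(
            f"transcript has {len(tx_exons)} exons, "
            f"alignment has {len(alt_exons)}"
        )
    return list(zip(tx_exons, alt_exons))
```

Raising an error, rather than truncating to the shorter exon set, keeps the underlying data problem visible to whoever loads the transcript.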

@reece added the major and enhancement labels Sep 9, 2016
@reece added this to the 0.3.0 milestone Sep 9, 2016
@gostachowiak

I have discovered an issue with transcript NM_001278433.1 (gene PRKAR1A), which I believe is an example of this issue. If my understanding is incorrect, please let me know.

Exon sets for the transcript:

SET search_path=uta_20180821;
SELECT * FROM exon_set WHERE tx_ac='NM_001278433.1';

267741	NM_001278433.1	AC_000149.1	1	splign	2014-02-11 01:22:19.920492
332948	NM_001278433.1	NC_000017.10	1	blat	2014-02-11 02:40:24.121284
267727	NM_001278433.1	NC_000017.10	1	splign	2014-02-11 01:22:19.920492
763376	NM_001278433.1	NC_000017.11	1	splign	2016-08-27 17:40:37.616249
267735	NM_001278433.1	NC_018928.2	1	splign	2014-02-11 01:22:19.920492
738588	NM_001278433.1	NM_001278433.1	1	transcript	2016-08-27 10:28:27.974572
88837	NM_001278433.1	NM_001278433.1	1	transcript/8ecabff0	2014-02-11 00:00:18.455632
344311	NM_001278433.1	NM_001278433.1	1	transcript/92190059	2015-08-25 22:44:41.311184

The GRCh37 splign chromosomal alignment has 10 exons:

SET search_path=uta_20180821;
SELECT * FROM exon WHERE exon_set_id='267727';

The "self" alignment has 11 exons:

SET search_path=uta_20180821;
SELECT * FROM exon WHERE exon_set_id='738588';

Judging by exon lengths, the discrepancy is in exon 1, so g-to-c calculations using hgvs give bad results for variants along the entire transcript.
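
The exon-length comparison can be sketched roughly as follows. This is only an illustration: `first_length_mismatch` is a hypothetical helper, it assumes both exon lists are already in transcript order, and a length difference can also come from a small indel in the alignment, so a mismatch is a hint rather than proof of a structural problem.

```python
def first_length_mismatch(tx_exons, alt_exons):
    """Return the 1-based ordinal of the first exon pair whose lengths differ.

    Exons are (start, end) pairs in interbase coordinates, listed in
    transcript order. Only the exons present in both lists are compared.
    Returns None if every compared pair has the same length.
    """
    pairs = zip(tx_exons, alt_exons)
    for ordinal, ((t_start, t_end), (a_start, a_end)) in enumerate(pairs, start=1):
        if t_end - t_start != a_end - a_start:
            return ordinal
    return None
```

For example, with made-up coordinates, `first_length_mismatch([(0, 50), (50, 120)], [(1000, 1120)])` reports exon 1, because the first transcript exon is 50 bases long while the first genomic exon is 120.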

My assumption was that "transcript" is the relevant self-alignment, and not "transcript/8ecabff0" or "transcript/92190059"

@reece
Member Author

reece commented Sep 9, 2020

First, I'm impressed that you dove this far into UTA internals!

I don't know the story for this transcript specifically, and these data are 4-6 years old, perhaps from the time before NCBI released gff files. So, this might be hard to reproduce now from sources.

When alt_aln_method contains /, it means that the UTA loader encountered a case where the definition provided by NCBI changed over time. When this happens, UTA deprecates the existing exon set by renaming its alignment method. (The hash after the / is a truncated md5 made by serializing the start,end coordinates and CDS start,end.)
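
As a rough illustration of the renaming scheme, a truncated md5 could be computed like this. The serialization format and both function names are assumptions made for the sketch; the loader's actual format is not shown in this thread, so digests produced here should not be expected to match those stored in UTA.

```python
import hashlib


def exon_set_digest(exons, cds_start, cds_end, length=8):
    """Return a truncated md5 hex digest for an exon structure.

    Assumed serialization: 'start,end' pairs joined by ';', followed by
    the CDS start and end. The real UTA format may differ.
    """
    exon_part = ";".join(f"{start},{end}" for start, end in exons)
    serialized = f"{exon_part};{cds_start},{cds_end}"
    return hashlib.md5(serialized.encode("ascii")).hexdigest()[:length]


def deprecated_method_name(method, digest):
    """Build a name such as 'transcript/8ecabff0' from a method and digest."""
    return f"{method}/{digest}"
```

Because the digest depends only on the exon and CDS coordinates, the same deprecated definition always maps to the same suffix.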

The presence of / nearly always means that the assembly and/or alignments are problematic. So, proceed with caution.

In uta_20190926, I see this:

anonymous@uta/uta=> set search_path  = uta_20190926 ;
anonymous@uta/uta=> select alt_ac, alt_aln_method, n_exons from tx_exon_set_summary_mv where tx_ac = 'NM_001278433.1' order by 2;
┌────────────────┬─────────────────────┬─────────┐
│     alt_ac     │   alt_aln_method    │ n_exons │
├────────────────┼─────────────────────┼─────────┤
│ NC_000017.10   │ blat                │      11 │
│ NC_018928.2    │ splign              │      10 │
│ AC_000149.1    │ splign              │      10 │
│ NC_000017.10   │ splign              │      11 │
│ NC_000017.11   │ splign              │      11 │
│ NC_000017.10   │ splign/04e3c837     │      10 │
│ NM_001278433.1 │ transcript          │      11 │
│ NM_001278433.1 │ transcript/8ecabff0 │      11 │
│ NM_001278433.1 │ transcript/92190059 │      10 │
└────────────────┴─────────────────────┴─────────┘
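
One way to read this table programmatically is sketched below. The rows are copied from the query output above; `usable_alignments` is a hypothetical helper that drops deprecated methods (those containing /) and keeps only alignments whose exon count matches the current transcript exon set.

```python
# (alt_ac, alt_aln_method, n_exons) rows copied from the query output above
rows = [
    ("NC_000017.10", "blat", 11),
    ("NC_018928.2", "splign", 10),
    ("AC_000149.1", "splign", 10),
    ("NC_000017.10", "splign", 11),
    ("NC_000017.11", "splign", 11),
    ("NC_000017.10", "splign/04e3c837", 10),
    ("NM_001278433.1", "transcript", 11),
    ("NM_001278433.1", "transcript/8ecabff0", 11),
    ("NM_001278433.1", "transcript/92190059", 10),
]


def usable_alignments(rows, tx_ac):
    """Return current alignments whose exon count matches the transcript's."""
    # Methods containing '/' are deprecated exon sets; skip them.
    current = [r for r in rows if "/" not in r[1]]
    # The self-alignment has alt_ac == tx_ac and method 'transcript'.
    tx_counts = [n for ac, method, n in current
                 if ac == tx_ac and method == "transcript"]
    if len(tx_counts) != 1:
        raise ValueError(f"expected one current transcript exon set for {tx_ac}")
    n_tx = tx_counts[0]
    return [(ac, method) for ac, method, n in current
            if ac != tx_ac and n == n_tx]


print(usable_alignments(rows, "NM_001278433.1"))
```

For these rows the result is the blat and splign alignments to NC_000017.10 plus the splign alignment to NC_000017.11, each with 11 exons.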

So, it looks to me as though you should upgrade to uta_20190926, in which NM_001278433.1 aligns to NC_000017.10 and NC_000017.11 without issues.

Please close if that answers your question.

@gostachowiak

Reece:

Thank you very much for your time-- that was helpful.

I don't see uta_20190926 as a tag on the dockerhub page, so I wasn't sure if it was advisable to use:
https://hub.docker.com/r/biocommons/uta/tags

Is this version an "official" release that was built/validated to the same standards as the uta_20180821 version?

Also, if we did update to the 2019 uta, which versions of hgvs and seqrepo would you recommend moving up to?

We currently use:

  • uta: uta_20180821
  • seqrepo: 2018-08-21
  • hgvs: 1.3.0

Thanks again.

Matt

@reece
Member Author

reece commented Sep 12, 2020

uta_20190926 currently has an issue (#228) that prevents us from building docker images. A change was made to materialize a very large view, and the materialization ran for more than 12 hours before I killed it. We'll need to unwind that change before distributing docker images.

You should be able to use any version of hgvs. The change log may help you figure out whether any of the changes since 1.3.0 are relevant to you.

Unfortunately, you'll have to wait on the uta fixes. No ETA yet.

@github-actions bot removed the major label Nov 27, 2023