Announcing cdot: a data provider to load lots of transcripts fast #634

davmlaw opened this issue Feb 3, 2022 · 20 comments

davmlaw commented Feb 3, 2022

Hi, thanks for the HGVS project, it's great.

I’ve written a data provider that has 5.5 times as many transcripts as UTA, and a REST service (>10x faster than public UTA and works behind firewalls)

pip install cdot
from cdot.hgvs.dataproviders import RESTDataProvider

hdp = RESTDataProvider()

It takes a different approach from UTA - converting existing RefSeq/Ensembl GTFs to JSON.

This allows you to use flat files instead of Postgres, or a trivial (300-line) REST service. It aims to be fast and simple, and doesn't implement all of the data provider functionality yet (or maybe ever - i.e. just transcript/genome conversions).
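
For example, a minimal sketch of plugging this into hgvs (standard hgvs mapper and parser calls; the variant is just an illustration):

from cdot.hgvs.dataproviders import RESTDataProvider
from hgvs.assemblymapper import AssemblyMapper
from hgvs.parser import Parser

# cdot's REST service stands in for the UTA data provider
hdp = RESTDataProvider()
am = AssemblyMapper(hdp,
                    assembly_name='GRCh37',
                    alt_aln_method='splign',
                    replace_reference=True)
hp = Parser()

var_c = hp.parse_hgvs_variant('NM_001637.3:c.100A>T')
print(am.c_to_g(var_c))  # transcript (c.) -> genomic (g.) coordinates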

See project: https://github.com/SACGF/cdot

This is a glue project, and everything is MIT-licensed, so I'd be happy to merge it into the core HGVS code or for it to become a BioCommons project if you'll have it - see https://github.com/SACGF/cdot/wiki/Project-goals---directions

Thanks!

reece commented Feb 3, 2022

OMG, I am so thrilled that this exists! I had hoped that something like this would arise to solve the clear issues with UTA currency, and also to make it much easier to support custom transcripts. I haven't tried cdot yet, but the direction is spot-on.

What does it take to load from RefSeq? I have ~6 years of NCBI GFF3 archives; those might be useful for generating historical transcript alignments.

Kudos! And thanks for the contribution.

Let's do talk about working together on this... are you on the biocommons slack workspace? Try this: https://join.slack.com/t/biocommons/shared_invite/zt-12xzu037o-abkp_8a8Ta1Sc8JaBkpyqA (expires 2022-03-02).

davmlaw commented Feb 3, 2022

Hi, as an example, here's the script to download and generate RefSeq data for GRCh37:

https://github.com/SACGF/cdot/blob/main/generate_transcript_data/refseq_transcripts_grch37.sh

There are scripts for GRCh38 and Ensembl in the same directory.

Will join up on Slack at work tomorrow. I'm also keen to define a JSON transcript format - with the super-long-term goal of a GA4GH standard and of pushing Ensembl and RefSeq to provide an API for every transcript they've ever made - but I'm happy to build whatever glue is needed in the meantime.

phiweger commented Apr 5, 2022

@davmlaw I successfully used cdot because it provides a flat file for transcript mapping, and thus in principle needs no internet connection to map coordinates from, say, coding to genomic. However, the following fails without a connection:

from cdot.hgvs.dataproviders import JSONDataProvider
from hgvs.assemblymapper import AssemblyMapper
from hgvs.parser import Parser

hdp = JSONDataProvider(['cdot-0.2.1.refseq.grch37_grch38.json.gz'])
am = AssemblyMapper(
    hdp,
    assembly_name='GRCh37',
    alt_aln_method='splign',  # blat, splign, genewise
    replace_reference=True)
hp = Parser()
var_c = hp.parse_hgvs_variant('NM_001637.3:c.100A>T')
am.c_to_g(var_c)

# HGVSDataNotAvailableError: Failed to fetch NM_001637.3 from bioutils.seqfetcher (network fetching) (Failed to fetch NM_001637.3 (HTTPSConnectionPool(host='eutils.ncbi.nlm.nih.gov', port=...): Max retries exceeded with url: /entrez/eutils/efetch.fcgi?db=nucleotide&id=NM_001637.3&rettype=fasta&seq_start=501&seq_stop=501&tool=bioutils&email=biocommons-dev@googlegroups.com (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at ...>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))))

You mentioned

This allows you to use flat files instead of Postgres [...]

so what am I doing wrong with the above snippet? To clarify, my aim is to map coding to genomic coords without access to an internet connection. Thanks for your help!

reece commented Apr 8, 2022

I haven't used cdot yet, so I can't provide complete guidance.

The error you received is from the seqfetcher, which fetches sequences during normalization. This is distinct from the transcript alignments that come from cdot.

It appears that seqfetcher failed because the call was rejected by NCBI. That could be due to a sporadic outage at NCBI, failure of those credentials, too many recent attempts from your IP, local network issues, or something else.

See https://hgvs.readthedocs.io/en/stable/installation.html for details. You should think of cdot as replacing uta. Consider installing seqrepo to obviate remote sequence lookups.

phiweger commented Apr 9, 2022

Thank you for your answer @reece !

Consider installing seqrepo to obviate remote sequence lookups.

How does that work? Like, how can I use seqrepo to not lookup sequences remotely?

davmlaw commented Apr 11, 2022

Yes - exactly as Reece said - HGVS requires a way to get sequences (SeqRepo) and transcripts (UTA/cdot).

You can install them all locally; for SeqRepo, copy/paste the 3 command-line snippets from the instructions here, then set the HGVS_SEQREPO_DIR environment variable before you call HGVS.
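
For example, a minimal offline sketch (the SeqRepo path and cdot filename are illustrative - use your own snapshot directory and whichever cdot JSON file you downloaded):

import os

# Must be set before hgvs builds its sequence fetcher, so lookups stay local
os.environ["HGVS_SEQREPO_DIR"] = "/usr/local/share/seqrepo/latest"

from cdot.hgvs.dataproviders import JSONDataProvider
from hgvs.assemblymapper import AssemblyMapper
from hgvs.parser import Parser

hdp = JSONDataProvider(["cdot-0.2.1.refseq.grch37_grch38.json.gz"])
am = AssemblyMapper(hdp,
                    assembly_name='GRCh37',
                    alt_aln_method='splign',
                    replace_reference=True)
hp = Parser()

var_c = hp.parse_hgvs_variant('NM_001637.3:c.100A>T')
print(am.c_to_g(var_c))  # no network access needed with local SeqRepo + cdot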

@phiweger

This worked, thanks @reece @davmlaw!

@roysomak4

Hi @davmlaw, I share the same excitement as @reece! It simplifies DevOps by avoiding the need to provision and maintain a relational database inside Kubernetes. I tried out the cdot data provider for hgvs (https://github.com/SACGF/cdot). The two main gaps I see are missing functionality for getting the predicted amino acid change in HGVS nomenclature (critical for clinical reporting) and for getting all possible transcripts for a given variant position. Are you planning to add these to the cdot data provider? I am happy to open these up as issues in your GitHub repo if that helps. Thanks!!

davmlaw commented Aug 8, 2022

Hi, I've added the issues to the cdot repo (linked above); please add any further info if needed. I hope to find time within a few weeks and will ping you when done.

davmlaw commented Aug 29, 2022

Hi @roysomak4 - I've released a new version of cdot (0.2.8) that handles protein identifiers (so c_to_p works) and also returns all of the possible transcripts for a given position (this only works on local JSON files, not yet on the REST server).

If you use local JSON files, you'll need to download the latest ones.

Let me know if there are any issues.

@roysomak4

Thanks @davmlaw! I will run my tests on the local JSON file and share the results shortly.

roysomak4 commented Aug 31, 2022

Hi @davmlaw, the c_to_p function is working fine. I checked with one EGFR and one TP53 variant and the correct p. nomenclature was generated for both of them.
The function that should return all possible transcripts for a given variant is returning an empty list (example below) regardless of the variant.

>>> import hgvs
>>> from hgvs.assemblymapper import AssemblyMapper
>>> from cdot.hgvs.dataproviders import JSONDataProvider
>>> hdp = JSONDataProvider()
>>> hdp = JSONDataProvider(["./cdot-0.2.8.refseq.grch38.json.gz"])
>>> am = AssemblyMapper(hdp,assembly_name='GRCh38',alt_aln_method='splign',replace_reference=True)
>>> hp = hgvs.parser.Parser()

>>> hgvs_g = 'NC_000017.11:g.7676154G>C'
>>> var_g = hp.parse_hgvs_variant(hgvs_g)
>>> transcripts = am.relevant_transcripts(var_g)
>>> transcripts
[]

>>> hgvs_g = 'NC_000007.14:g.55181370G>A'
>>> var_g = hp.parse_hgvs_variant(hgvs_g)
>>> transcripts = am.relevant_transcripts(var_g)
>>> transcripts
[]

I am using GRCh38

Thanks,
Somak

davmlaw commented Sep 1, 2022

Hi, this was due to an off-by-one error (I assumed var_g.posedit.pos.start.base would be 1 past the end).

I've fixed it and upgraded cdot; the data files haven't changed, just the library.
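
For anyone curious, the convention involved (an illustrative sketch, not cdot's actual code): hgvs g. positions are 1-based and inclusive, so end.base is the last affected base, and a half-open interval query needs end + 1.

from hgvs.parser import Parser
from intervaltree import IntervalTree  # third-party 'intervaltree' package

var_g = Parser().parse_hgvs_variant('NC_000017.11:g.7676154G>C')
start = var_g.posedit.pos.start.base  # 7676154
end = var_g.posedit.pos.end.base      # 7676154 (same base for a SNV)

# Hypothetical transcript span stored as a half-open interval
tree = IntervalTree()
tree.addi(7_668_000, 7_688_000, 'NM_EXAMPLE.1')

# The half-open query needs end + 1, otherwise a variant at the very last
# base of a span would be missed
print([iv.data for iv in tree[start:end + 1]])  # ['NM_EXAMPLE.1']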

@roysomak4

Thanks for the quick fix! I can confirm that the transcript function is working now (see log below).

In terms of speed, I observed that starting with v0.2.8, loading the JSON file takes much longer than it did in the prior versions I tested. I am loading the cdot-0.2.8.refseq.grch38.json.gz database. It used to be almost instantaneous in versions prior to 0.2.8; with v0.2.9 it takes about 10-15 seconds. I tried loading the data from an NVMe SSD and there seems to be no change in the load time.

>>> hgvs_g = 'NC_000017.11:g.7676154G>C'
>>> hp = hgvs.parser.Parser()
>>> var_g = hp.parse_hgvs_variant(hgvs_g)
>>> str(var_g)
'NC_000017.11:g.7676154G>C'
>>> var_c = am.g_to_c(var_g, 'NM_000546.6')
>>> str(var_c)
'NM_000546.6:c.215C>G'
>>> var_p = am.c_to_p(var_c)
>>> str(var_p)
'NP_000537.3:p.(Pro72Arg)'
>>> transcripts = am.relevant_transcripts(var_g)
>>> transcripts
['NM_000546.5', 'NM_001126114.2', 'NM_001126118.1', 'NM_001126112.2', 'NM_001126113.2', 'NM_001276695.1', 'NM_001276696.1', 'NM_001276761.1', 'NM_001276760.1', 'NM_001126113.3', 'NM_001126114.3', 'NM_001126118.2', 'NM_001276695.3', 'NM_001126112.3', 'NM_001276696.3', 'NM_001276760.3', 'NM_001276761.3', 'NM_000546.6', 'NM_001276696.2', 'NM_001276695.2', 'NM_001276760.2', 'NM_001276761.2']

davmlaw commented Sep 5, 2022

Hi, yeah, it takes a while to build the interval tree, but it only does so lazily, when you need it.

If you don't use that functionality, load time should be the same as before.

You can't store the interval tree as JSON, but you can pickle it, so I may look into that as an optional local cache.
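
As a rough illustration of that idea (a sketch only, not cdot's implementation - the helper name, cache path, and input shape are made up):

import os
import pickle

from intervaltree import IntervalTree  # third-party 'intervaltree' package

def get_interval_tree(transcript_spans, cache_path="transcripts.intervaltree.pickle"):
    """Build an IntervalTree of (transcript_id, start, end) spans lazily,
    caching the result with pickle so later runs skip the slow build."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)  # fast path: reuse the cached tree

    tree = IntervalTree()
    for transcript_id, start, end in transcript_spans:
        tree.addi(start, end + 1, transcript_id)  # store as half-open intervals

    with open(cache_path, "wb") as f:
        pickle.dump(tree, f)
    return tree

# Made-up span for illustration; a second run reuses the pickled cache
spans = [("NM_EXAMPLE.1", 7_668_000, 7_688_000)]
tree = get_interval_tree(spans, cache_path="example.tree.pickle")
print([iv.data for iv in tree[7_676_154:7_676_155]])  # ['NM_EXAMPLE.1']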


github-actions bot commented Dec 9, 2023

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label Dec 9, 2023

This issue was closed because it has been stalled for 7 days with no activity.

github-actions bot closed this as not planned Dec 17, 2023
reece reopened this Feb 19, 2024
reece added the resurrected label and removed the stale label Feb 19, 2024
reece commented Feb 19, 2024

This issue was closed by stalebot. It has been reopened to give more time for community review. See biocommons coding guidelines for stale issue and pull request policies. This resurrection is expected to be a one-time event.


This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label May 20, 2024
davmlaw commented May 20, 2024

Hi, cdot has been used by a few other people now, and I think I've ironed out the bugs.

I might make a pull request that adds some (commented-out) data provider options to the documentation examples.

github-actions bot removed the stale label May 25, 2024