Announcing cdot: a data provider to load lots of transcripts fast #634

davmlaw opened this issue Feb 3, 2022 · 20 comments

davmlaw commented Feb 3, 2022

Hi, thanks for the HGVS project, it's great.

I’ve written a data provider that has 5.5 times as many transcripts as UTA, and a REST service (>10x faster than public UTA and works behind firewalls)

pip install cdot
from cdot.hgvs.dataproviders import RESTDataProvider

hdp = RESTDataProvider()

It takes a different approach from UTA - converting existing RefSeq/Ensembl GTFs to JSON.

This allows you to use flat files instead of Postgres, or a trivial (300-line) REST service. It aims to be fast and simple, and doesn't implement all of the data provider functionality yet (or maybe ever - i.e. just transcript/genome conversions).
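
For example, a minimal sketch of plugging this into hgvs (standard hgvs mapper and parser calls; the variant is just an illustration):

from cdot.hgvs.dataproviders import RESTDataProvider
from hgvs.assemblymapper import AssemblyMapper
from hgvs.parser import Parser

# cdot's REST service stands in for the UTA data provider
hdp = RESTDataProvider()
am = AssemblyMapper(hdp,
                    assembly_name='GRCh37',
                    alt_aln_method='splign',
                    replace_reference=True)
hp = Parser()

var_c = hp.parse_hgvs_variant('NM_001637.3:c.100A>T')
print(am.c_to_g(var_c))  # transcript (c.) -> genomic (g.) coordinates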

See project: https://github.com/SACGF/cdot

This is a glue project, and everything is MIT-licensed, so I'd be happy to merge it into the core HGVS code or for it to become a BioCommons project if you'll have it - see https://github.com/SACGF/cdot/wiki/Project-goals---directions

Thanks!

reece commented Feb 3, 2022

OMG, I am so thrilled that this exists! I had hoped that something like this would arise to solve the clear issues with UTA currency, and also to make it much easier to support custom transcripts. I haven't tried cdot yet, but the direction is spot-on.

What does it take to load from RefSeq? I have ~6 years of NCBI GFF3 archives; those might be useful for generating historical transcript alignments.

Kudos! And thanks for the contribution.

Let's do talk about working together on this... are you on the biocommons slack workspace? Try this: https://join.slack.com/t/biocommons/shared_invite/zt-12xzu037o-abkp_8a8Ta1Sc8JaBkpyqA (expires 2022-03-02).

davmlaw commented Feb 3, 2022

Hi, as an example, here's the script to download and generate RefSeq data for GRCh37:

https://github.com/SACGF/cdot/blob/main/generate_transcript_data/refseq_transcripts_grch37.sh

There are scripts for GRCh38 and Ensembl in the same directory.

Will join up on Slack at work tomorrow. I'm also keen to define a JSON transcript format - with the super-long-term goal of a GA4GH standard and of pushing Ensembl and RefSeq to provide an API for every transcript they've ever made - but I'm happy to build whatever glue is needed in the meantime.

phiweger commented Apr 5, 2022

@davmlaw I successfully used cdot because it provides a flat file for transcript mapping, and thus in principle needs no internet connection to map coordinates from, say, coding to genomic. However, the following fails without a connection:

from cdot.hgvs.dataproviders import JSONDataProvider
from hgvs.assemblymapper import AssemblyMapper
from hgvs.parser import Parser

hdp = JSONDataProvider(['cdot-0.2.1.refseq.grch37_grch38.json.gz'])
am = AssemblyMapper(
    hdp,
    assembly_name='GRCh37',
    alt_aln_method='splign',  # blat, splign, genewise
    replace_reference=True)
hp = Parser()
var_c = hp.parse_hgvs_variant('NM_001637.3:c.100A>T')
am.c_to_g(var_c)

# HGVSDataNotAvailableError: Failed to fetch NM_001637.3 from bioutils.seqfetcher (network fetching) (Failed to fetch NM_001637.3 (HTTPSConnectionPool(host='eutils.ncbi.nlm.nih.gov', port=...): Max retries exceeded with url: /entrez/eutils/efetch.fcgi?db=nucleotide&id=NM_001637.3&rettype=fasta&seq_start=501&seq_stop=501&tool=bioutils&email=biocommons-dev@googlegroups.com (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at ...>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))))

You mentioned

This allows you to use flat files instead of Postgres [...]

so what am I doing wrong with the above snippet? To clarify, my aim is to map coding to genomic coords without access to an internet connection. Thanks for your help!

reece commented Apr 8, 2022

I haven't used cdot yet, so I can't provide complete guidance.

The error you received is from the seqfetcher, which fetches sequences during normalization. This is distinct from the transcript alignments that come from cdot.

It appears that seqfetcher failed because the call was rejected by NCBI. That could be due to a sporadic outage at NCBI, failure of those credentials, too many recent attempts from your IP, local network issues, or something else.

See https://hgvs.readthedocs.io/en/stable/installation.html for details. You should think of cdot as replacing uta. Consider installing seqrepo to obviate remote sequence lookups.

phiweger commented Apr 9, 2022

Thank you for your answer @reece !

Consider installing seqrepo to obviate remote sequence lookups.

How does that work? Like, how can I use seqrepo to not lookup sequences remotely?

davmlaw commented Apr 11, 2022

Yes - exactly as Reece said - HGVS requires a way to get sequences (SeqRepo) and transcripts (UTA/cdot).

You can install them all locally; for SeqRepo, copy/paste the 3 command-line snippets from the instructions here, then set the HGVS_SEQREPO_DIR environment variable before you call HGVS.
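
For example, a minimal offline sketch (the SeqRepo path and cdot filename are illustrative - use your own snapshot directory and whichever cdot JSON file you downloaded):

import os

# Must be set before hgvs builds its sequence fetcher, so lookups stay local
os.environ["HGVS_SEQREPO_DIR"] = "/usr/local/share/seqrepo/latest"

from cdot.hgvs.dataproviders import JSONDataProvider
from hgvs.assemblymapper import AssemblyMapper
from hgvs.parser import Parser

hdp = JSONDataProvider(["cdot-0.2.1.refseq.grch37_grch38.json.gz"])
am = AssemblyMapper(hdp,
                    assembly_name='GRCh37',
                    alt_aln_method='splign',
                    replace_reference=True)
hp = Parser()

var_c = hp.parse_hgvs_variant('NM_001637.3:c.100A>T')
print(am.c_to_g(var_c))  # no network access needed with local SeqRepo + cdot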

@phiweger

This worked, thanks @reece @davmlaw!

@roysomak4

Hi @davmlaw, I share the same excitement as @reece! It simplifies DevOps by avoiding the need to provision and maintain a relational database inside Kubernetes. I tried out the cdot data provider for hgvs (https://github.com/SACGF/cdot). The two main gaps I see are missing functionality for getting the predicted amino acid change in HGVS nomenclature (critical for clinical reporting) and for getting all possible transcripts for a given variant position. Are you planning to add these to the cdot data provider? I am happy to open these up as issues in your GitHub repo if that helps. Thanks!!

davmlaw commented Aug 8, 2022

Hi, I've added the issues to the cdot repo (linked above); please add any further info if needed. I hope to find time within a few weeks and will ping you when done.

davmlaw commented Aug 29, 2022

Hi @roysomak4 - I've released a new version of cdot (0.2.8) that handles protein identifiers (so c_to_p works) and also returns all of the possible transcripts for a given position (this only works on local JSON files, not yet on the REST server).

If you use local JSON files, you'll need to download the latest ones.

Let me know if there are any issues.

@roysomak4

Thanks @davmlaw! I will run my tests on the local JSON file and share the results shortly.

roysomak4 commented Aug 31, 2022

Hi @davmlaw, the c_to_p function is working fine. I checked with one EGFR and one TP53 variant and the correct p. nomenclature was generated for both of them.
The function that should return all possible transcripts for a given variant is returning an empty list (example below) regardless of the variant.

>>> import hgvs
>>> from hgvs.assemblymapper import AssemblyMapper
>>> from cdot.hgvs.dataproviders import JSONDataProvider
>>> hdp = JSONDataProvider()
>>> hdp = JSONDataProvider(["./cdot-0.2.8.refseq.grch38.json.gz"])
>>> am = AssemblyMapper(hdp,assembly_name='GRCh38',alt_aln_method='splign',replace_reference=True)
>>> hp = hgvs.parser.Parser()

>>> hgvs_g = 'NC_000017.11:g.7676154G>C'
>>> var_g = hp.parse_hgvs_variant(hgvs_g)
>>> transcripts = am.relevant_transcripts(var_g)
>>> transcripts
[]

>>> hgvs_g = 'NC_000007.14:g.55181370G>A'
>>> var_g = hp.parse_hgvs_variant(hgvs_g)
>>> transcripts = am.relevant_transcripts(var_g)
>>> transcripts
[]

I am using GRCh38

Thanks,
Somak

davmlaw commented Sep 1, 2022

Hi, this was due to an off-by-one error (I assumed var_g.posedit.pos.start.base would be 1 past the end).

I've fixed it and upgraded cdot; the data files haven't changed, just the library.
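
For anyone curious, the convention involved (an illustrative sketch, not cdot's actual code): hgvs g. positions are 1-based and inclusive, so end.base is the last affected base, and a half-open interval query needs end + 1.

from hgvs.parser import Parser
from intervaltree import IntervalTree  # third-party 'intervaltree' package

var_g = Parser().parse_hgvs_variant('NC_000017.11:g.7676154G>C')
start = var_g.posedit.pos.start.base  # 7676154
end = var_g.posedit.pos.end.base      # 7676154 (same base for a SNV)

# Hypothetical transcript span stored as a half-open interval
tree = IntervalTree()
tree.addi(7_668_000, 7_688_000, 'NM_EXAMPLE.1')

# The half-open query needs end + 1, otherwise a variant at the very last
# base of a span would be missed
print([iv.data for iv in tree[start:end + 1]])  # ['NM_EXAMPLE.1']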

@roysomak4

Thanks for the quick fix! I can confirm that the transcript function is working now (see log below).

In terms of speed, I observed that starting with v0.2.8, loading the JSON file takes much longer than it did in the prior versions I tested. I am loading the cdot-0.2.8.refseq.grch38.json.gz database. It used to be almost instantaneous in versions prior to 0.2.8; with v0.2.9 it takes about 10-15 seconds. I tried loading the data from an NVMe SSD and there seems to be no change in the load time.

>>> hgvs_g = 'NC_000017.11:g.7676154G>C'
>>> hp = hgvs.parser.Parser()
>>> var_g = hp.parse_hgvs_variant(hgvs_g)
>>> str(var_g)
'NC_000017.11:g.7676154G>C'
>>> var_c = am.g_to_c(var_g, 'NM_000546.6')
>>> str(var_c)
'NM_000546.6:c.215C>G'
>>> var_p = am.c_to_p(var_c)
>>> str(var_p)
'NP_000537.3:p.(Pro72Arg)'
>>> transcripts = am.relevant_transcripts(var_g)
>>> transcripts
['NM_000546.5', 'NM_001126114.2', 'NM_001126118.1', 'NM_001126112.2', 'NM_001126113.2', 'NM_001276695.1', 'NM_001276696.1', 'NM_001276761.1', 'NM_001276760.1', 'NM_001126113.3', 'NM_001126114.3', 'NM_001126118.2', 'NM_001276695.3', 'NM_001126112.3', 'NM_001276696.3', 'NM_001276760.3', 'NM_001276761.3', 'NM_000546.6', 'NM_001276696.2', 'NM_001276695.2', 'NM_001276760.2', 'NM_001276761.2']

davmlaw commented Sep 5, 2022

Hi, yeah, it takes a while to build the interval tree, but it only does so lazily, when you need it.

If you don't use that functionality, load time should be the same as before.

You can't store the interval tree as JSON, but you can pickle it, so I may look into that as an optional local cache.
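
As a rough illustration of that idea (a sketch only, not cdot's implementation - the helper name, cache path, and input shape are made up):

import os
import pickle

from intervaltree import IntervalTree  # third-party 'intervaltree' package

def get_interval_tree(transcript_spans, cache_path="transcripts.intervaltree.pickle"):
    """Build an IntervalTree of (transcript_id, start, end) spans lazily,
    caching the result with pickle so later runs skip the slow build."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)  # fast path: reuse the cached tree

    tree = IntervalTree()
    for transcript_id, start, end in transcript_spans:
        tree.addi(start, end + 1, transcript_id)  # store as half-open intervals

    with open(cache_path, "wb") as f:
        pickle.dump(tree, f)
    return tree

# Made-up span for illustration; a second run reuses the pickled cache
spans = [("NM_EXAMPLE.1", 7_668_000, 7_688_000)]
tree = get_interval_tree(spans, cache_path="example.tree.pickle")
print([iv.data for iv in tree[7_676_154:7_676_155]])  # ['NM_EXAMPLE.1']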


github-actions bot commented Dec 9, 2023

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label Dec 9, 2023

This issue was closed because it has been stalled for 7 days with no activity.

github-actions bot closed this as not planned Dec 17, 2023
reece reopened this Feb 19, 2024
reece added the resurrected label and removed the stale label Feb 19, 2024
reece commented Feb 19, 2024

This issue was closed by stalebot. It has been reopened to give more time for community review. See biocommons coding guidelines for stale issue and pull request policies. This resurrection is expected to be a one-time event.


This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label May 20, 2024
davmlaw commented May 20, 2024

Hi, cdot has been used by a few other people now, and I think I've ironed out the bugs.

I might make a pull request that adds some (commented-out) data provider options to the documentation examples.

github-actions bot removed the stale label May 25, 2024