queries fail for some uniprot accessions #128

Open
ftwkoopmans opened this issue Jul 7, 2022 · 1 comment

Comments

@ftwkoopmans

Some UniProt accessions cannot be queried and do not appear in the output of the "uniprot" field/scope. To illustrate, I've included two examples: one accession that works (P63044) and one that fails (P23819).

this works via https://mygene.info/v3/api#/query/get_query ;
"q" input: P63044
"fields" input: symbol,name,taxid,entrezgene,uniprot

returns:

{
  "took": 16,
  "total": 1,
  "max_score": 17.406927,
  "hits": [
    {
      "_id": "22318",
      "_score": 17.406927,
      "entrezgene": "22318",
      "name": "vesicle-associated membrane protein 2",
      "symbol": "Vamp2",
      "taxid": 10090,
      "uniprot": {
        "Swiss-Prot": "P63044",
        "TrEMBL": "Q8CHR4"
      }
    }
  ]
}

this query also returns a hit via https://mygene.info/v3/api#/query/get_query ;
"q" input: P23819
"fields" input: symbol,name,taxid,entrezgene,uniprot

returns:

{
  "took": 13,
  "total": 1,
  "max_score": 7.8478303,
  "hits": [
    {
      "_id": "14800",
      "_score": 7.8478303,
      "entrezgene": "14800",
      "name": "glutamate receptor, ionotropic, AMPA2 (alpha 2)",
      "symbol": "Gria2",
      "taxid": 10090,
      "uniprot": {
        "TrEMBL": "Q4LG64"
      }
    }
  ]
}

However, note that for the latter query, the UniProt accession I queried (a Swiss-Prot record) is not included in the "uniprot" output field! So there seems to be a problem with the mygene.info database: possibly a subset of UniProt accessions/IDs is not stored or linked under "uniprot". Affected accessions include P23819, Q61941 and Q8VHW2.
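
For anyone who wants to check other hits for the same symptom, here is a minimal Python sketch. The function name `accession_missing` is made up for illustration, the two dictionaries are abbreviated copies of the hits shown above, and the handling of list values is an assumption (a "uniprot" sub-field may hold one accession or several).

```python
def accession_missing(hit, accession):
    """Return True if `accession` does not appear anywhere in the hit's "uniprot" field."""
    uniprot = hit.get("uniprot", {})
    found = []
    for value in uniprot.values():
        # Assumption: a sub-field may be a single string or a list of strings.
        found.extend(value if isinstance(value, list) else [value])
    return accession not in found


# Abbreviated copies of the two GET hits from this report.
vamp2 = {"_id": "22318", "uniprot": {"Swiss-Prot": "P63044", "TrEMBL": "Q8CHR4"}}
gria2 = {"_id": "14800", "uniprot": {"TrEMBL": "Q4LG64"}}

print(accession_missing(vamp2, "P63044"))  # False
print(accession_missing(gria2, "P23819"))  # True
```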

Furthermore, POST queries against these accessions fail even though they should not (probably the same root cause).

this works via https://mygene.info/v3/api#/query/post_query ;
{ "q": "P63044", "scopes": "uniprot" }
returns:

[
  {
    "query": "P63044",
    "_id": "22318",
    "_score": 16.7524,
    "entrezgene": "22318",
    "name": "vesicle-associated membrane protein 2",
    "symbol": "Vamp2",
    "taxid": 10090
  }
]

this query fails, but it should not, as this is a valid UniProt accession that is in the mygene.info dataset (see GET query above) ;
{ "q": "P23819", "scopes": "uniprot" }
returns:

[
  {
    "query": "P23819",
    "notfound": true
  }
]
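
A rough Python sketch of how several accessions could be checked in one POST and the "notfound" entries collected. The endpoint URL `https://mygene.info/v3/query` and the form-encoded body are assumptions based on the API page linked above; `build_batch_request` and `unresolved` are hypothetical helper names. The sample list merges the two POST responses above (abbreviated), so the snippet runs without network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint, taken from the API documentation linked in this report.
QUERY_URL = "https://mygene.info/v3/query"


def build_batch_request(accessions, scopes="uniprot"):
    """Build (but do not send) a form-encoded POST request for several accessions."""
    body = urlencode({"q": ",".join(accessions), "scopes": scopes}).encode()
    return Request(QUERY_URL, data=body, method="POST")


def unresolved(results):
    """Return the query terms that came back flagged as "notfound"."""
    return [r["query"] for r in results if r.get("notfound")]


# The two POST responses above, merged and abbreviated.
sample = json.loads(
    '[{"query": "P63044", "_id": "22318"}, {"query": "P23819", "notfound": true}]'
)
print(unresolved(sample))  # ['P23819']
```

To run it against the live service, the request can be passed to `urllib.request.urlopen` and the response body decoded with `json.load`.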
@andrewsu
Member

Just to add a tiny bit more info. I suspect the difference in behavior between P63044 and P23819 is due to the lack of an Entrez Gene mapping in the UniProt file for P23819.

The source file for the uniprot data plugin appears to be https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/idmapping_selected.tab.gz.

From the README, the column headings for this file are as follows:

1. UniProtKB-AC
2. UniProtKB-ID
3. GeneID (EntrezGene)
4. RefSeq
5. GI
6. PDB
7. GO
8. UniRef100
9. UniRef90
10. UniRef50
11. UniParc
12. PIR
13. NCBI-taxon
14. MIM
15. UniGene
16. PubMed
17. EMBL
18. EMBL-CDS
19. Ensembl
20. Ensembl_TRS
21. Ensembl_PRO
22. Additional PubMed

Note the difference in the records below in column 3 which should have a mapping to Entrez Gene.

$ gzip -cd idmapping_selected.tab.gz | awk '$1=="P63044"' | tr "\t" "\n" | cat -n | head
     1  P63044
     2  VAMP2_MOUSE
     3  22318
     4  NP_033523.1
     5  51704193; 6678551
     6
     7  GO:0030136; GO:0060203; GO:0005737; GO:0031410; GO:0030659; GO:0030285; GO:0043231; GO:0043229; GO:0016020; GO:0043005; GO:0044306; GO:0048471; GO:0005886; GO:0030141; GO:0030667; GO:0031201; GO:0000322; GO:0045202; GO:0008021; GO:0030672; GO:0070044; GO:0070032; GO:0070033; GO:0005802; GO:0031982; GO:0042589; GO:0048306; GO:0005516; GO:0042802; GO:0017022; GO:0005543; GO:0008022; GO:0044877; GO:0005484; GO:0000149; GO:0019905; GO:0017075; GO:0044325; GO:0017156; GO:0032869; GO:0043308; GO:0098967; GO:0043001; GO:0046879; GO:0060291; GO:0061025; GO:0090316; GO:0015031; GO:0065003; GO:0045055; GO:0017158; GO:1902259; GO:0017157; GO:1903421; GO:0060627; GO:0009749; GO:0035493; GO:0016081; GO:0048488; GO:0016079; GO:0006906; GO:0016192
     8  UniRef100_P63044
     9  UniRef90_P63044
    10  UniRef50_P63044
$ gzip -cd idmapping_selected.tab.gz | awk '$1=="P23819"' | tr "\t" "\n" | cat -n | head
     1  P23819
     2  GRIA2_MOUSE
     3
     4
     5  496139; 22096313; 26335713; 496140; 12852206
     6  7LDD:B; 7LDD:D; 7LDE:B; 7LDE:D; 7LEP:B; 7LEP:D
     7  GO:0032281; GO:0032279; GO:0009986; GO:0030425; GO:0032839; GO:0043198; GO:0043197; GO:0005783; GO:0005789; GO:0098978; GO:0030426; GO:0005887; GO:0099061; GO:0099055; GO:0099056; GO:0016020; GO:0043005; GO:0043025; GO:0043204; GO:0099544; GO:0005886; GO:0014069; GO:0098839; GO:0045211; GO:0042734; GO:0032991; GO:0098685; GO:0036477; GO:0045202; GO:0097060; GO:0008021; GO:0030672; GO:0043195; GO:0004971; GO:0001540; GO:0051117; GO:0008092; GO:0005234; GO:0035254; GO:0042802; GO:0019865; GO:0004970; GO:0015277; GO:0015276; GO:0030165; GO:0019901; GO:0038023; GO:0000149; GO:1904315; GO:0007268; GO:0045184; GO:0035235; GO:0050806; GO:0051262; GO:0031623; GO:0001919; GO:0051966
     8  UniRef100_P23819
     9  UniRef90_P19491-3
    10  UniRef50_P19491
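
The same check could be scripted in Python to list which accessions lack a GeneID in column 3. This is only a sketch: the column names come from the README list above, the function names are made up for illustration, and the two sample rows are truncated to the first four columns of the records shown.

```python
# Column headings as listed in the README excerpt above.
COLUMNS = [
    "UniProtKB-AC", "UniProtKB-ID", "GeneID (EntrezGene)", "RefSeq", "GI", "PDB",
    "GO", "UniRef100", "UniRef90", "UniRef50", "UniParc", "PIR", "NCBI-taxon",
    "MIM", "UniGene", "PubMed", "EMBL", "EMBL-CDS", "Ensembl", "Ensembl_TRS",
    "Ensembl_PRO", "Additional PubMed",
]


def parse_idmapping_line(line):
    """Split one tab-separated line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    # Pad short lines so every column name has a (possibly empty) value.
    fields += [""] * (len(COLUMNS) - len(fields))
    return dict(zip(COLUMNS, fields))


def accessions_without_geneid(lines, wanted):
    """Return the accessions in `wanted` whose GeneID column is empty."""
    missing = []
    for line in lines:
        record = parse_idmapping_line(line)
        if record["UniProtKB-AC"] in wanted and not record["GeneID (EntrezGene)"].strip():
            missing.append(record["UniProtKB-AC"])
    return missing


# First four columns of the two records shown above (remaining columns omitted).
rows = [
    "P63044\tVAMP2_MOUSE\t22318\tNP_033523.1",
    "P23819\tGRIA2_MOUSE\t\t",
]
print(accessions_without_geneid(rows, {"P63044", "P23819"}))  # ['P23819']
```

For the real file, `gzip.open("idmapping_selected.tab.gz", "rt")` can be passed as `lines` so the data is streamed rather than loaded into memory.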

This difference can also be seen on the corresponding UniProt web pages.

Having said that, the reciprocal links do exist in NCBI Gene (likely through a mapping to Refseq Protein):
