Include taxon id with taxon label in facet count of entity search endpoint #386

Open
vincerubinetti opened this issue Feb 22, 2022 · 5 comments


@vincerubinetti

vincerubinetti commented Feb 22, 2022

I'm developing the 3.0 version of the Monarch UI/website, and I've run into a limitation. @putmantime

Here is an example response from the /search/entity/{term} endpoint, searching "ssh":

{
  "numFound": 177,
  "docs": [
    {
      "id": "FlyBase:FBgn0029157",
      "id_std": "FlyBase:FBgn0029157",
      "id_eng": "FlyBase:FBgn0029157",
      "id_kw": "FlyBase:FBgn0029157",
      "prefix": "FlyBase",
      "label": ["ssh"],
      "label_std": ["ssh"],
      "label_eng": ["ssh"],
      "label_kw": ["ssh"],
      "edges": 319,
      "taxon": "NCBITaxon:7227",
      "taxon_std": "NCBITaxon:7227",
      "taxon_eng": "NCBITaxon:7227",
      "taxon_kw": "NCBITaxon:7227",
      "taxon_label": "Drosophila melanogaster",
      "taxon_label_std": "Drosophila melanogaster",
      "taxon_label_eng": "Drosophila melanogaster",
      "taxon_label_kw": "Drosophila melanogaster",
      "taxon_label_synonym": ["fruit fly", "Sophophora melanogaster"],
      "taxon_label_synonym_std": ["fruit fly", "Sophophora melanogaster"],
      "taxon_label_synonym_eng": ["fruit fly", "Sophophora melanogaster"],
      "taxon_label_synonym_kw": ["fruit fly", "Sophophora melanogaster"],
      "has_phenotype": false,
      "category": ["gene", "sequence feature"],
      "category_std": ["gene", "sequence feature"],
      "category_eng": ["gene", "sequence feature"],
      "category_kw": ["gene", "sequence feature"],
      "synonym": [
        "slingshot",
        "Dmel\\CG6238",
        "SSH",
        "Ssh",
        "MKP-like",
        "Slingshot",
        "CG6238-PA",
        "Mkph",
        "CG6238-PB",
        "CG6238",
        "MKP",
        "CG6238-PC",
        "CG6238-PD",
        "ssh-PB",
        "ssh-PA",
        "ssh-PD",
        "ssh-PC",
        "l(3)01207",
        "MAP-kinase-phosphatase"
      ],
      "synonym_std": [
        "slingshot",
        "Dmel\\CG6238",
        "SSH",
        "Ssh",
        "MKP-like",
        "Slingshot",
        "CG6238-PA",
        "Mkph",
        "CG6238-PB",
        "CG6238",
        "MKP",
        "CG6238-PC",
        "CG6238-PD",
        "ssh-PB",
        "ssh-PA",
        "ssh-PD",
        "ssh-PC",
        "l(3)01207",
        "MAP-kinase-phosphatase"
      ],
      "synonym_eng": [
        "slingshot",
        "Dmel\\CG6238",
        "SSH",
        "Ssh",
        "MKP-like",
        "Slingshot",
        "CG6238-PA",
        "Mkph",
        "CG6238-PB",
        "CG6238",
        "MKP",
        "CG6238-PC",
        "CG6238-PD",
        "ssh-PB",
        "ssh-PA",
        "ssh-PD",
        "ssh-PC",
        "l(3)01207",
        "MAP-kinase-phosphatase"
      ],
      "synonym_kw": [
        "slingshot",
        "Dmel\\CG6238",
        "SSH",
        "Ssh",
        "MKP-like",
        "Slingshot",
        "CG6238-PA",
        "Mkph",
        "CG6238-PB",
        "CG6238",
        "MKP",
        "CG6238-PC",
        "CG6238-PD",
        "ssh-PB",
        "ssh-PA",
        "ssh-PD",
        "ssh-PC",
        "l(3)01207",
        "MAP-kinase-phosphatase"
      ],
      "equivalent_curie": [
        "FB:FBgn0029157",
        "NCBIGene:42986",
        "NCBI-Gene:42986",
        "NCBI.Gene:42986",
        "Entrez:42986",
        "Entrez.Gene:42986",
        "EntrezGene:42986",
        "Entrez-Gene:42986",
        "Gene:42986",
        "ENSEMBL:FBgn0029157"
      ],
      "equivalent_curie_std": [
        "FB:FBgn0029157",
        "NCBIGene:42986",
        "NCBI-Gene:42986",
        "NCBI.Gene:42986",
        "Entrez:42986",
        "Entrez.Gene:42986",
        "EntrezGene:42986",
        "Entrez-Gene:42986",
        "Gene:42986",
        "ENSEMBL:FBgn0029157"
      ],
      "equivalent_curie_eng": [
        "FB:FBgn0029157",
        "NCBIGene:42986",
        "NCBI-Gene:42986",
        "NCBI.Gene:42986",
        "Entrez:42986",
        "Entrez.Gene:42986",
        "EntrezGene:42986",
        "Entrez-Gene:42986",
        "Gene:42986",
        "ENSEMBL:FBgn0029157"
      ],
      "equivalent_curie_kw": [
        "FB:FBgn0029157",
        "NCBIGene:42986",
        "NCBI-Gene:42986",
        "NCBI.Gene:42986",
        "Entrez:42986",
        "Entrez.Gene:42986",
        "EntrezGene:42986",
        "Entrez-Gene:42986",
        "Gene:42986",
        "ENSEMBL:FBgn0029157"
      ],
      "leaf": true,
      "_version_": 1696524917734899700,
      "score": 117.35552
    }
  ],
  "facet_counts": {
    "category": {
    },
    "taxon_label": {
      "Sus scrofa": 25,
      "Drosophila melanogaster": 21,
      "Homo sapiens": 18,
      "Mus musculus": 16,
      "Bos taurus": 6,
      "Saccharomyces cerevisiae S288C": 6,
      "Xenopus tropicalis": 6,
      "Danio rerio": 5,
      "Gallus gallus": 4,
      "Anolis carolinensis": 3,
      "Canis lupus familiaris": 3,
      "Felis catus": 3,
      "Macaca mulatta": 3,
      "Monodelphis domestica": 3,
      "Ornithorhynchus anatinus": 3,
      "Pan troglodytes": 3,
      "Rattus norvegicus": 3,
      "Takifugu rubripes": 3,
      "Equus caballus": 2
    }
  },
  "highlighting": {}
}

Notice that taxon_label is being returned for facets, instead of taxon (id). This is nice for displaying a list of taxon facets, but not for actually filtering by them, because the endpoint only supports filtering by taxon (id), not taxon_label.

This requires the frontend to maintain a hard-coded label-to-id mapping for taxa. That duplicates information we already have in biolink, is brittle, and is likely to get out of sync.

And yes, I can look up taxon from docs by finding the corresponding taxon_label field. However, then I would need to make sure all results are in docs so I have all the mappings, and that might go beyond the max rows [per page] param.
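To illustrate the limitation, here is a minimal Python sketch of deriving a label-to-id map from `docs` alone. The helper name `label_to_id_from_docs` and the trimmed-down sample response are made up for illustration; only the field names (`docs`, `taxon`, `taxon_label`, `facet_counts`) come from the response above.

```python
def label_to_id_from_docs(response: dict) -> dict:
    """Map taxon_label -> taxon id using only the docs on the current page."""
    mapping = {}
    for doc in response.get("docs", []):
        label = doc.get("taxon_label")
        taxon = doc.get("taxon")
        if label and taxon:
            mapping[label] = taxon
    return mapping


# Hypothetical, trimmed-down response: one doc on the page, several facets.
response = {
    "docs": [
        {"taxon": "NCBITaxon:7227", "taxon_label": "Drosophila melanogaster"},
    ],
    "facet_counts": {
        "taxon_label": {
            "Sus scrofa": 25,
            "Drosophila melanogaster": 21,
            "Homo sapiens": 18,
        }
    },
}

mapping = label_to_id_from_docs(response)

# Facet labels that cannot be turned into a taxon filter from this page alone.
unresolved = [
    label
    for label in response["facet_counts"]["taxon_label"]
    if label not in mapping
]
print(unresolved)
```

Any facet whose taxon has no doc on the current page ends up in `unresolved`, which is why this workaround depends on fetching every row.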


Possible solutions:

  • Support a taxon_label filter parameter (in addition to the taxon parameter) in the search endpoint. I guess this would be most useful if it were an exact match rather than a fuzzy match. If multiple taxon ids map to the same exact taxon label, then this option wouldn't be viable.

  • Return an additional taxon field in facet_counts with all the information I need: id, label, and count. This would leave the taxon_label facet untouched so current applications using biolink don't suddenly break.

  • Have some kind of taxon_map field at the top level of the response so I can go from label to id easily. Though, I think this is pretty ugly... don't want to add a top level thing for a special exception for just one type of facet.

@falquaddoomi
Collaborator

It's not exactly what you're asking for, but would a facet structure like this work?

"facet_counts": {
    "category": {
        "disease": 27,
        "publication": 9,
        "anatomical entity": 5,
        "cell": 5,
        "gene": 2,
        "sequence feature": 2,
        "phenotype": 1,
        "quality": 1
    },
    "taxon": {
        "NCBITaxon:9031": 1,
        "NCBITaxon:9606": 1
    },
    "taxon_label": {
        "Gallus gallus": 1,
        "Homo sapiens": 1
    },
    "_taxon_map": {
        "NCBITaxon:9031": {
            "Gallus gallus": 1
        },
        "NCBITaxon:9606": {
            "Homo sapiens": 1
        }
    }
}

Two things are different here: 1) there's a new taxon facet that groups results by taxon ID, and 2) there's a _taxon_map entry in facet_counts that groups first by taxon ID, then by taxon label, with the value being the count of results that have both that ID and that label. AFAIK there should be a one-to-one mapping between ID and label, so there'll always be just one child of the ID node, but just in case there isn't, this structure will still work.
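As a rough sketch of how a client might consume this structure, the nested `_taxon_map` can be inverted into a label-to-id lookup. The function name `label_to_ids` is hypothetical, and the sample data is copied from the facet structure above. Each label maps to a list of ids so that a non-one-to-one mapping would still be represented.

```python
def label_to_ids(taxon_map: dict) -> dict:
    """Invert {taxon id: {label: count}} into {label: [taxon ids]}."""
    lookup = {}
    for taxon_id, labels in taxon_map.items():
        for label in labels:
            lookup.setdefault(label, []).append(taxon_id)
    return lookup


# Sample taken from the proposed facet_counts structure.
facet_counts = {
    "taxon_label": {"Gallus gallus": 1, "Homo sapiens": 1},
    "_taxon_map": {
        "NCBITaxon:9031": {"Gallus gallus": 1},
        "NCBITaxon:9606": {"Homo sapiens": 1},
    },
}

lookup = label_to_ids(facet_counts["_taxon_map"])

# The ids to filter by when a user clicks the "Homo sapiens" facet.
selected_ids = lookup["Homo sapiens"]
print(selected_ids)
```

With this in hand, the display side can keep using `taxon_label` while the filter side sends the ids.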

If so, I have this implemented in my fork of the ontobio library -- here's where the _taxon_map key is injected into the facet counts: https://github.com/falquaddoomi/ontobio/blob/92231d447a/ontobio/golr/golr_query.py#L603. I assume we'll have to figure out who downstream might be affected by this...maybe the best way is to submit a PR?

@vincerubinetti
Author

That's fine with me. If this is easier to implement or more consistent with how other things and data structures in biolink are implemented, I'd say go for it.

@putmantime
Contributor

Is the main reason you chose that structure that it supports one-to-many id-to-label mappings, Faisal?
I don't believe that will be the case, as we have chosen the NCBI id/label pair for a taxon.
If what I say is true, I think the most explicit and easily readable structure would be an object for each taxon with clear attributes:
"_taxon_map": [{ "label": "Gallus gallus", "id": "NCBITaxon:9031", "count": 1 }]

But is a list of objects going to cause even more issues in this case, @vincerubinetti?
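For comparison, the two shapes carry the same information, and one can be derived from the other. The sketch below flattens the nested map into the list-of-objects form; `flatten_taxon_map` is a hypothetical helper, not part of biolink or ontobio, and the sort order is an arbitrary choice made here for stable output.

```python
def flatten_taxon_map(taxon_map: dict) -> list:
    """Turn {taxon id: {label: count}} into [{label, id, count}, ...]."""
    rows = []
    for taxon_id, labels in taxon_map.items():
        for label, count in labels.items():
            rows.append({"label": label, "id": taxon_id, "count": count})
    # Highest count first, ties broken alphabetically by label.
    rows.sort(key=lambda row: (-row["count"], row["label"]))
    return rows


nested = {
    "NCBITaxon:9031": {"Gallus gallus": 1},
    "NCBITaxon:9606": {"Homo sapiens": 1},
}

rows = flatten_taxon_map(nested)
print(rows)
```

If an id ever had more than one label, the flat form would simply contain one row per (id, label) pair.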

@falquaddoomi
Collaborator

I formatted it that way partly because I wasn't sure whether more than one label might match a given taxon ID, and partly because that structure more closely matches how facet pivots are returned from Solr. If IDs and labels are in fact one-to-one, I agree that the structure you proposed is more readable, and it's a trivial change on my end.
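To make the resemblance to Solr concrete, here is a minimal sketch that converts a pivot facet list into the nested map. The sample payload is hand-written in the general shape of Solr's `facet.pivot` output (entries with `field`, `value`, `count`, and a nested `pivot` list); it was not captured from the Monarch index, and `pivot_to_taxon_map` is a hypothetical name, not necessarily how the ontobio fork does it.

```python
def pivot_to_taxon_map(pivot_entries: list) -> dict:
    """Convert a taxon -> taxon_label pivot list into {id: {label: count}}."""
    taxon_map = {}
    for entry in pivot_entries:
        # Each child of a taxon entry is one taxon_label bucket.
        children = entry.get("pivot", [])
        taxon_map[entry["value"]] = {
            child["value"]: child["count"] for child in children
        }
    return taxon_map


# Hand-written sample in the shape of a "taxon,taxon_label" pivot (assumed).
pivot = [
    {
        "field": "taxon",
        "value": "NCBITaxon:9031",
        "count": 1,
        "pivot": [
            {"field": "taxon_label", "value": "Gallus gallus", "count": 1}
        ],
    },
    {
        "field": "taxon",
        "value": "NCBITaxon:9606",
        "count": 1,
        "pivot": [
            {"field": "taxon_label", "value": "Homo sapiens", "count": 1}
        ],
    },
]

taxon_map = pivot_to_taxon_map(pivot)
print(taxon_map)
```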

@putmantime
Contributor

Let me do some research and see if I can confirm the mapping is one-to-one.
I wasn't sure what Solr typically returns, and standardizing on that might be of more value than the clarity of my proposed structure.
