
Global options for missing values #239

Open
deeenes opened this issue Apr 13, 2022 · 1 comment


deeenes commented Apr 13, 2022

Hi,

Thank you for developing this great library!

I have a question about dealing with missing values. See the example below:

import json
from urllib import request
from glom import glom, Coalesce

url = 'https://www.ebi.ac.uk/ols/api/ontologies/efo/terms?size=200'

with request.urlopen(url) as r:
    data = json.loads(r.read())

# 1: no missing values, result is a dict of lists, each 200 long
spec = {
    'label': ('_embedded.terms', ['label']),
    'obo_id': ('_embedded.terms', ['obo_id']),
}

result = glom(data, spec)

# 2: a few missing values in "children"; the whole result is a single None
spec = {
    'label': ('_embedded.terms', ['label']),
    'obo_id': ('_embedded.terms', ['obo_id']),
    'parents': ('_embedded.terms', ['_links.parents.href']),
    'children': ('_embedded.terms', ['_links.children.href']),
}

result = glom(data, spec, default = None)

# 3: the desired result: the missing values in "children" are replaced by None's
spec = {
    'label': ('_embedded.terms', ['label']),
    'obo_id': ('_embedded.terms', ['obo_id']),
    'parents': ('_embedded.terms', ['_links.parents.href']),
    'children': (
        '_embedded.terms',
        [Coalesce('_links.children.href', default = None)]
    ),
}

result = glom(data, spec)

The third version above is a solution for me: all lists in the result have the same length, no records are dropped, and None is used in place of the missing values. However, this interface is quite inconvenient, as I would need to wrap every path in Coalesce(..., default = None). Is there a better solution, where a single parameter sets the missing-value handling globally?

kurtbrose (Collaborator) commented

Sorry for the slow reply!

One general readability thing -- you can move the cursor down to '_embedded.terms' once outside the dict rather than as part of deriving each value:

spec = (
  '_embedded.terms',
  {
    'label': ['label'],
    'obo_id': ['obo_id'],
    'parents': ['_links.parents.href'],
    'children': [Coalesce('_links.children.href', default = None)],
  }
)

One approach you could take is to stay explicit, and save typing on Coalesce by using Or, which has the same defaulting behavior.

from glom import Coalesce, Or

# Optional shorthand for the Coalesce form; the spec below spells out Or instead.
def _or_none(path):
    return Coalesce(path, default=None)

spec = (
    '_embedded.terms',
    {
        'label': [Or('label', default=None)],
        'obo_id': [Or('obo_id', default=None)],
        'parents': [Or('_links.parents.href', default=None)],
        'children': [Or('_links.children.href', default=None)],
    }
)

Another approach you could take is to embrace that specs are basic python data structures, and write a helper function to do the "boring stuff".

from glom import Or

def get_paths_in_list(path_dict, default=None):
    '''Given a dict of {key: path}, return a spec that fetches each path, with a default, from every child.'''
    return {key: [Or(val, default=default)] for key, val in path_dict.items()}


spec = (
  '_embedded.terms',
  get_paths_in_list({
    'label': 'label',
    'obo_id': 'obo_id',
    'parents': '_links.parents.href',
    'children': '_links.children.href',
  })
)
