Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hash computation fails for recipe file patterns containing aiohttp objects #443

Open
derekocallaghan opened this issue Nov 18, 2022 · 5 comments · May be fixed by #444
Open

Hash computation fails for recipe file patterns containing aiohttp objects #443

derekocallaghan opened this issue Nov 18, 2022 · 5 comments · May be fixed by #444

Comments

@derekocallaghan
Copy link
Contributor

derekocallaghan commented Nov 18, 2022

Recipes that use aiohttp objects in file_pattern.fsspec_open_kwargs["client_kwargs"] are currently failing as the hash can't be computed, e.g.:

Although this has been mentioned in the recent hash computation PR #429, it's unrelated to the associated changes that now compute the recipe hash at construction. Instead, the issue lies with pangeo-forge-recipes.serialization.either_encode_or_hash() when recipe file patterns contain aiohttp objects in fsspec_open_kwargs["client_kwargs"].

def either_encode_or_hash(obj: Any):
"""For objects which are not serializable with ``json.dumps``, this function defines
type-specific handlers which extract either a serializable value or a hash from the object.
:param obj: Any object which is not serializable to ``json``.
"""
if isinstance(obj, Enum): # custom serializer for FileType, CombineOp, etc.
return obj.value
elif hasattr(obj, "sha256"):
return obj.sha256.hex()
elif inspect.isfunction(obj):
return inspect.getsource(obj)
elif isinstance(obj, bytes):
return obj.hex()
raise TypeError(f"object of type {type(obj).__name__} not serializable")

Example recipe aiohttp usage (CCMP) here:

# need to override the aiohttp timeout
#  # https://github.com/pangeo-forge/eooffshore_ics_ccmp_v02_1_nrt_wind-feedstock/issues/2
pattern = FilePattern(
    make_url,
    ConcatDim(name="time", keys=dates, nitems_per_file=NITEMS_PER_FILE),
    fsspec_open_kwargs={"client_kwargs": {"timeout": aiohttp.ClientTimeout(total=1800)}},
)

The following code snippet will reproduce the problems encountered in the CCMP and Daymet recipes linked above:

In [50]: from aiohttp import BasicAuth, ClientTimeout

In [51]: either_encode_or_hash(aiohttp.BasicAuth(login="test", password="test"))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [51], in <cell line: 1>()
----> 1 either_encode_or_hash(BasicAuth(login="test", password="test"))

Input In [5], in either_encode_or_hash(obj)
     13 elif isinstance(obj, bytes):
     14     return obj.hex()
---> 15 raise TypeError(f"object of type {type(obj).__name__} not serializable")

TypeError: object of type BasicAuth not serializable

In [52]: either_encode_or_hash(ClientTimeout(total=1800))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [52], in <cell line: 1>()
----> 1 either_encode_or_hash(ClientTimeout(total=1800))

Input In [5], in either_encode_or_hash(obj)
     13 elif isinstance(obj, bytes):
     14     return obj.hex()
---> 15 raise TypeError(f"object of type {type(obj).__name__} not serializable")

TypeError: object of type ClientTimeout not serializable

FilePattern has its own sha256() implementation, and although fsspec_open_kwargs (and client_kwargs) is currently included, perhaps this could be excluded (similar to the pattern format_function and combine dims currently excluded), or particular kwargs (e.g. any aiohttp objects) could be excluded from fsspec_open_kwargs prior to serialization. I guess timeouts/credentials values are something that aren't necessary for a recipe hash?

# we exclude the format function and combine dims from ``root`` because they determine the
# index:filepath pairs yielded by iterating over ``.items()``. if these pairs are generated in
# a different way in the future, we ultimately don't care.
root = {
"fsspec_open_kwargs": pattern.fsspec_open_kwargs,
"query_string_secrets": pattern.query_string_secrets,
"file_type": pattern.file_type,
"nitems_per_file": {
op.name: op.nitems_per_file # type: ignore
for op in pattern.combine_dims
if op.name in pattern.concat_dims
},
}

In the case of aiohttp.BasicAuth, it includes encode(), which could potentially be used e.g. adding something like elif hasattr(obj, "encode"): to either_encode_or_hash(). However, nothing similar exists for aiohttp.ClientTimeout (I assume this is true of other aiohttp classes, although I haven't checked), so it may simply be easiest to exclude all aiohttp objects from the file pattern hash.

@cisaacstern
Copy link
Member

cisaacstern commented Nov 23, 2022

@derekocallaghan, thanks for this detailed description, which made it a lot easier for me to understand this issue and the possible solutions.

or particular kwargs (e.g. any aiohttp objects) could be excluded from fsspec_open_kwargs prior to serialization. I guess timeouts/credentials values are something that aren't necessary for a recipe hash?

I would agree with this. The most expedient fix here seems like excluding client_kwargs from the file_pattern hash by changing

root = {
"fsspec_open_kwargs": pattern.fsspec_open_kwargs,

to something like

  root = { 
    "fsspec_open_kwargs": {
      k: v
      for k, v in pattern.fsspec_open_kwargs.items()
      # Let's put a comment here describing why we exclude only
      # `client_kwargs` (because it can contain hard-to-hash objects)
      # so that we remember why we did this.
      if k != "client_kwargs"
    },
    ...
  }

My suggestion of taking this "shortcut" (while memorializing the decision in a comment) is motivated by:

  1. The fact that, as you've observed above, in most (all?) cases these client_kwargs won't affect the data itself, and therefore don't need to be included in a hash.
  2. The hashes are not actually used in production currently. Eventually we'd like them to help us know whether or not to rerun recipes, but that has not been implemented yet.
  3. This bug is actually blocking real recipes (linked at the top of your comment above) today, so we shouldn't get bogged down in fine-grained edge-casing of this as-yet-unused feature right now, when we have a plausible path to quickly unblock these recipes.

If, once hashes are used in Pangeo Forge Cloud, we then realize that client_kwargs contains something we want included in the hash, we can circle back and work on the edge casing at that point.

Make sense, @rabernat? If so, I'd invite @derekocallaghan to make a PR for this (if you're interested!).

@derekocallaghan
Copy link
Contributor Author

If, once hashes are used in Pangeo Forge Cloud, we then realize that client_kwargs contains something we want included in the hash, we can circle back and work on the edge casing at that point.

Make sense, @rabernat? If so, I'd invite @derekocallaghan to make a PR for this (if you're interested!).

Thanks @cisaacstern, I'll create a PR for this next week.

@derekocallaghan
Copy link
Contributor Author

I'm currently working on the PR for this including corresponding test cases, and it seems that aiohttp.BasicAuth isn't an issue as it can be serialized to json without calling either_encode_or_hash(), as is performed in pangeo_forge_recipes.serialization.dict_to_sha256():

b = dumps(
dictionary,
default=either_encode_or_hash,
ensure_ascii=False,
sort_keys=True,
indent=None,
separators=(",", ":"),
)

In [55]: dictionary
Out[55]: 
{'fsspec_open_kwargs': {'client_kwargs': {'auth': BasicAuth(login='test', password='test', encoding='latin1')}},
 'file_type': <FileType.netcdf4: 'netcdf4'>,
 'nitems_per_file': {'time': 1}}

In [56]: dumps(
    ...:         dictionary,
    ...:         default=either_encode_or_hash,
    ...:         ensure_ascii=False,
    ...:         sort_keys=True,
    ...:         indent=None,
    ...:         separators=(",", ":"),
    ...:     )
Out[56]: '{"file_type":"netcdf4","fsspec_open_kwargs":{"client_kwargs":{"auth":["test","test","latin1"]}},"nitems_per_file":{"time":1}}'

I'll still exclude client_kwargs from the hash anyway.

@derekocallaghan
Copy link
Contributor Author

I'll take a look to see whether this is still an issue following the beam-refactor release

@cisaacstern
Copy link
Member

Thanks @derekocallaghan 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants