
feat: decorator to validate and annotate dependencies #2967

Open · wants to merge 17 commits into main from agoose77/feat-validate-dependencies

Conversation

@agoose77 (Collaborator) commented Jan 19, 2024

Closes #2953

This PR adds a new dependencies argument to high_level_function. This lets us fail at runtime if dependency specifications are not met, and also annotate our documentation to inform the user ahead of time.

We will still need to keep these values in sync with our new pyproject.toml groups, but we can make a helper for that if need be.
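For illustration, the declaration on a function like ak.to_parquet could look roughly like this (a sketch only; see the diff for the real signatures):

@high_level_function(dependencies=["pyarrow>=7.0.0", "fsspec"])
def to_parquet(destination, array):
    ...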

Examples: (screenshots of the annotated documentation and runtime error message omitted)


codecov bot commented Jan 19, 2024

Codecov Report

Attention: Patch coverage is 81.60622%, with 71 lines in your changes missing coverage. Please review.

Project coverage is 81.94%. Comparing base (9d93512) to head (0986f75).
Report is 62 commits behind head on main.

Additional details and impacted files

| Files | Coverage | Δ |
|---|---|---|
| src/awkward/jax.py | 87.93% <100.00%> | +3.44% ⬆️ |
| src/awkward/operations/ak_from_arrow.py | 100.00% <100.00%> | ø |
| src/awkward/operations/ak_from_arrow_schema.py | 76.92% <100.00%> | +4.19% ⬆️ |
| src/awkward/operations/ak_from_cupy.py | 88.88% <100.00%> | +3.17% ⬆️ |
| src/awkward/operations/ak_from_feather.py | 100.00% <100.00%> | ø |
| src/awkward/operations/ak_from_jax.py | 100.00% <100.00%> | ø |
| src/awkward/operations/ak_from_rdataframe.py | 40.74% <100.00%> | +4.74% ⬆️ |
| src/awkward/operations/ak_to_arrow.py | 100.00% <100.00%> | ø |
| src/awkward/operations/ak_to_arrow_table.py | 100.00% <100.00%> | ø |
| src/awkward/operations/ak_to_cupy.py | 62.50% <100.00%> | +5.35% ⬆️ |
| ... and 63 more | | |

... and 1 file with indirect coverage changes

@jpivarski (Member) left a comment


This is really nice!

But what is meant by "extras" in this case? Is it pip install awkward[extras]? That would likely be obscure to most users.

I think it's useful for both the documentation and the exception message to include sample pip-install and conda-install text that can be copy-pasted into a terminal to actually install the necessary packages. There are situations in which I'll deliberately do something wrong if I know that the error message has what I need to do in a copy-pasteable form—it beats typing it manually or remembering its spelling.

I don't think any of these libraries will need a mapping between PyPI name and conda name, but if such a thing becomes necessary in the future, it can be a dict. (Uproot optionally requires xxhash, which is python-xxhash in conda-forge for disambiguation.)
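If such a mapping ever is needed, a module-level dict would suffice; hypothetically:

# PyPI name -> conda-forge name, only for packages whose names differ (hypothetical)
PYPI_TO_CONDA = {
    "xxhash": "python-xxhash",
}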


I made a new environment without the optional dependencies to test it out.

>>> ak.to_arrow(ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]]))
ImportError: to use to_arrow, you must install pyarrow:

    pip install pyarrow

or

    conda install -c conda-forge pyarrow


This error occurred while calling

    ak.to_arrow(
        <Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>
    )
>>> ak.to_jax(ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]]))
ImportError: This function has the following dependency requirements that are not met by your current environment:

    * jax — you do not have this package
    * jaxlib — you do not have this package



This error occurred while calling

    ak.to_jax(
        <Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>
    )
>>> ak.to_parquet("/tmp/whatever.parquet", ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]]))
ImportError: This function has the following dependency requirements that are not met by your current environment:

    * pyarrow>=7.0.0 — you do not have this package
    * fsspec — you do not have this package

You can fix this error by installing these packages directly, or install awkward with all of the following extras:

    * arrow

This error occurred while calling

    ak.to_parquet(
        '/tmp/whatever.parquet'
        <Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>
    )
>>> ak.to_backend(ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]]), "cuda")
ModuleNotFoundError: to use Awkward Arrays with CUDA, you must install cupy:

    pip install cupy

or

    conda install -c conda-forge cupy


This error occurred while calling

    ak.to_backend(
        <Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>
        'cuda'
    )

It might be better to exclude the "This error occurred while calling" for failures due to missing libraries, so that the message about the missing libraries is last. Otherwise, someone might spend a lot of time looking at the arguments they passed to the function, looking for errors, but the more relevant message is right above.

I also tried installing the wrong version of Arrow, to see what the message would look like:

>>> ak.to_arrow(ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]]))
ImportError: pyarrow 7.0.0 or later required for to_arrow

This error occurred while calling

    ak.to_arrow(
        <Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>
    )

It doesn't seem to be reaching

f"    * {req} — you have {ver} installed"

for that case.

@agoose77 (Collaborator, Author)

@jpivarski I did another pass! I can't reproduce your arrow bug; perhaps you could check what importlib.metadata.version reports for pyarrow?
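Something as simple as this should show it:

>>> import importlib.metadata
>>> importlib.metadata.version("pyarrow")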

Let me know what you think of the docs deployment.

@agoose77 (Collaborator, Author)

I've added pyarrow as a dependency for all string functions. We can now go finer-grained than just "pyarrow>=7.0.0", but this is a start.

I've made the docs note directive collapsed by default.

We need to unblock this and update to the latest Sphinx stack. This will just get things working *for now* (TODO)
@agoose77 (Collaborator, Author)

I noticed that the docs environment isn't solving properly; I suspect that the Sphinx dependencies are not pinned. The proper fix is to update our Sphinx usage, but for now let's try just pinning everything hard.

Also, I need to update the decorators of more functions, as per your recent comment in the issue tracker.

@agoose77 (Collaborator, Author) commented Jan 22, 2024

The new errors are caused by code paths that don't technically require e.g. fsspec still hitting this new decorator.

As I see it, we have two options:

  • enforce dependencies at the function scope (e.g. there is no way to call to_json without needing fsspec)
  • set dependencies at the function scope, enforce requirement in branches (see below)
@high_level_function(name="ak.to_json", dependencies=["fsspec>1.0.0"])
def to_json(...):
    if FOO:
        # Ensures that we get fsspec>1.0.0
        fsspec = require("fsspec")
    else:
        # no fsspec
        pass

@jpivarski any preferences?

@jpivarski (Member)

As I see it, we have two options:

  1. enforce dependencies at the function scope (e.g. there is no way to call to_json without needing fsspec)
  2. set dependencies at the function scope, enforce requirement in branches (see below)

Option 2. The documentation can say, without qualification, that to_json needs fsspec to run; if given a remote URL, to_json can insist on fsspec, but if given a local file, to_json can just do it whether it has fsspec or not. It's a little lie, but a lie in a convenient direction. Explaining the truth in detail would only cloud the issue: this situation is similar to the "easier to ask forgiveness than permission" situation... sort of.

@jpivarski (Member)

Will the packaging tests pass if we update the branch? (Not including pypy; that one is fixed by just running it multiple times, but it's not required.)

@agoose77 (Collaborator, Author)

@jpivarski probably not. I imagine it's because the dependencies are now enforced at the function scope, not at the branch level. The changes we've agreed on (2) will fix this!

@jpivarski (Member)

In an environment that has all of these things installed (i.e. my JAX works, I can make a JAX array), I get

>>> ak.to_jax(ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]]))
Traceback (most recent call last):
  File "/home/jpivarski/irishep/awkward/src/awkward/_errors.py", line 461, in wrapper
    return func(*args, **kwargs)
  File "/home/jpivarski/irishep/awkward/src/awkward/_dispatch.py", line 211, in dispatch
    on_dispatch_internal()
  File "/home/jpivarski/irishep/awkward/src/awkward/_dispatch.py", line 175, in validate_runtime_dependencies
    raise exception
ImportError: This function has the following dependency requirements that are not met by your current environment:

    * jaxlib — you have 0.4.23.dev20231222 installed

If you use pip, you can install these packages with `python -m pip install jaxlib`.
Otherwise, if you use Conda, install the corresponding packages for the correct versions. 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/irishep/awkward/src/awkward/_errors.py", line 460, in wrapper
    with OperationErrorContext(name, args, kwargs):
  File "/home/jpivarski/irishep/awkward/src/awkward/_errors.py", line 85, in __exit__
    self.handle_exception(exception_type, exception_value)
  File "/home/jpivarski/irishep/awkward/src/awkward/_errors.py", line 95, in handle_exception
    raise self.decorate_exception(cls, exception)
ImportError: This function has the following dependency requirements that are not met by your current environment:

    * jaxlib — you have 0.4.23.dev20231222 installed

If you use pip, you can install these packages with `python -m pip install jaxlib`.
Otherwise, if you use Conda, install the corresponding packages for the correct versions. 

This error occurred while calling

    ak.to_jax(
        <Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>
    )

In ak_to_jax.py, there's no version requirement on JAX. I have noticed that the CUDA part of my JAX installation has broken:

>>> import jax
>>> a = jax.numpy.arange(10)
An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu.
>>> 

That looks like a print statement, not a warning, so I don't think Awkward is detecting it.

Here's my JAX version and source:

% mamba list jax
# packages in environment at /home/jpivarski/mambaforge:
#
# Name                    Version                   Build  Channel
jax                       0.4.23             pyhd8ed1ab_0    conda-forge
jaxlib                    0.4.23          cpu_py310h9ed8a0c_0    conda-forge

(That is, my JAX is not 0.4.23.dev20231222 but 0.4.23.)

All of the other optional library checks seem to be working fine. (I'm going to test an environment that doesn't have them installed now.)

@jpivarski (Member)

In an environment without dependencies, this is what I'm seeing:

>>> ak.to_arrow(ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]]))
Traceback (most recent call last):
  File "/home/jpivarski/mambaforge/envs/testy/lib/python3.10/site-packages/importlib_metadata/__init__.py", line 411, in from_name
    return next(iter(cls.discover(name=name)))
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jpivarski/irishep/awkward/src/awkward/_dispatch.py", line 74, in iter_missing_dependencies
    _version = importlib_metadata.version(requirement.name)
  File "/home/jpivarski/mambaforge/envs/testy/lib/python3.10/site-packages/importlib_metadata/__init__.py", line 924, in version
    return distribution(distribution_name).version
  File "/home/jpivarski/mambaforge/envs/testy/lib/python3.10/site-packages/importlib_metadata/__init__.py", line 897, in distribution
    return Distribution.from_name(distribution_name)
  File "/home/jpivarski/mambaforge/envs/testy/lib/python3.10/site-packages/importlib_metadata/__init__.py", line 413, in from_name
    raise PackageNotFoundError(name)
importlib_metadata.PackageNotFoundError: No package metadata was found for pyarrow

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jpivarski/irishep/awkward/src/awkward/_errors.py", line 461, in wrapper
    return func(*args, **kwargs)
  File "/home/jpivarski/irishep/awkward/src/awkward/_dispatch.py", line 211, in dispatch
    on_dispatch_internal()
  File "/home/jpivarski/irishep/awkward/src/awkward/_dispatch.py", line 171, in validate_runtime_dependencies
    exception = build_runtime_dependency_validation_error(dependency_spec)
  File "/home/jpivarski/irishep/awkward/src/awkward/_dispatch.py", line 115, in build_runtime_dependency_validation_error
    extra_missing_dependencies = [
  File "/home/jpivarski/irishep/awkward/src/awkward/_dispatch.py", line 77, in iter_missing_dependencies
    mod = importlib.import_module(requirement.name)
  File "/home/jpivarski/mambaforge/envs/testy/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1004, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'pyarrow'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/irishep/awkward/src/awkward/_errors.py", line 460, in wrapper
    with OperationErrorContext(name, args, kwargs):
  File "/home/jpivarski/irishep/awkward/src/awkward/_errors.py", line 85, in __exit__
    self.handle_exception(exception_type, exception_value)
  File "/home/jpivarski/irishep/awkward/src/awkward/_errors.py", line 95, in handle_exception
    raise self.decorate_exception(cls, exception)
ModuleNotFoundError: No module named 'pyarrow'

This error occurred while calling

    ak.to_arrow(
        <Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>
    )

I.e. no message about how this function requires pyarrow and how to install it.

Same for JAX: no message about the fact that ak.to_jax requires it and how to install JAX.

But the one for CuPy works:

>>> ak.to_backend(ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]]), "cuda")
...
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/irishep/awkward/src/awkward/_errors.py", line 460, in wrapper
    with OperationErrorContext(name, args, kwargs):
  File "/home/jpivarski/irishep/awkward/src/awkward/_errors.py", line 85, in __exit__
    self.handle_exception(exception_type, exception_value)
  File "/home/jpivarski/irishep/awkward/src/awkward/_errors.py", line 95, in handle_exception
    raise self.decorate_exception(cls, exception)
ModuleNotFoundError: to use Awkward Arrays with CUDA, you must install cupy:

    pip install cupy

or

    conda install -c conda-forge cupy


This error occurred while calling

    ak.to_backend(
        <Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>
        'cuda'
    )

Also, I noticed that this is following Option 1: calling ak.to_json (to write a local file) without fsspec raises an error. Interestingly, that error doesn't indicate that fsspec is required and how to install it, so that might be helpful in figuring out which path this is taking.

>>> ak.to_json(ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]]), "/tmp/whatever.json")
...
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/irishep/awkward/src/awkward/_errors.py", line 460, in wrapper
    with OperationErrorContext(name, args, kwargs):
  File "/home/jpivarski/irishep/awkward/src/awkward/_errors.py", line 85, in __exit__
    self.handle_exception(exception_type, exception_value)
  File "/home/jpivarski/irishep/awkward/src/awkward/_errors.py", line 95, in handle_exception
    raise self.decorate_exception(cls, exception)
ModuleNotFoundError: No module named 'fsspec'

This error occurred while calling

    ak.to_json(
        <Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>
        '/tmp/whatever.json'
    )

It doesn't even get to the point of checking the arguments; I can use no arguments and get this error message; it's checking before anything else.

@agoose77 force-pushed the agoose77/feat-validate-dependencies branch from d6d2e52 to 0986f75 on January 26, 2024 at 14:12
@agoose77 (Collaborator, Author)

I've just pushed a WIP that tries this new approach — functions establish a dependency context, which is evaluated when the function implementation calls import_required_module.

I've not yet finished (or fixed the docs component to reflect the new declaration style).
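Roughly, the shape of the idea is something like this (a sketch only; apart from import_required_module, the names are illustrative and the real implementation differs in detail):

import contextvars
import importlib

# Each decorated function records its declared dependencies here before running (sketch).
_dependency_context = contextvars.ContextVar("dependency_context", default=())

def import_required_module(name):
    # Import `name`; on failure, report it together with the declared requirements
    # of the calling high-level function.
    try:
        return importlib.import_module(name)
    except ModuleNotFoundError as err:
        declared = ", ".join(_dependency_context.get()) or name
        raise ImportError(
            f"{name!r} is required here; declared dependencies: {declared}"
        ) from err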

Some points:

  1. I'm not sure whether the repeated dependency specs are good or bad. We could write some of this in our pyproject.toml for all functions.
    • I don't think this is worth it — we'd still want to declare which of these dependencies each function uses. Inspecting for import_required_module would be too fragile imo.
  2. Now we're using a requires decorator to enable multiple argument passing. I don't know if I prefer
    @high_level_function(
        dependencies=[
            requires("fsspec", group="arrow"),
            requires("pyarrow>=7.0.0", group="arrow", module_name="arrow"),
        ]
    )
    or
    @requires("pyarrow>=7.0.0", group="arrow", module_name="arrow")
    @requires("fsspec", group="arrow")
    @high_level_function()

@jpivarski (Member)

I'm confused about why the interface needs all of this structure. A high-level function (possibly) needs libraries x, y, and z to run; these are the libraries it might encounter in some if-branch of the implementation, but not necessarily. (Also, we only list non-strict dependencies. fsspec will become a strict dependency in 2.6.0, which is due on February 1.)

If there are dependencies, it tries to run the high-level function, catches any ImportError (including ModuleNotFoundError), and re-raises it with a note about how to install the set of dependencies.
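Concretely, I'm imagining nothing more elaborate than this sketch (the decorator name and message format are illustrative):

import functools

def with_dependency_note(*dependencies):
    # Wrap a high-level function; if it raises ImportError (including
    # ModuleNotFoundError), re-raise with copy-pasteable install instructions.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except ImportError as err:
                packages = " ".join(dependencies)
                raise ImportError(
                    f"{func.__name__} needs optional packages:\n\n"
                    f"    pip install {packages}\n\nor\n\n"
                    f"    conda install -c conda-forge {packages}"
                ) from err
        return wrapper
    return decorator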

In principle, the PyPI or conda install name of the package might be different from the import name of the package, but we haven't encountered any like that. If there is such a distinction, then it would be global: the install-import mapping does not depend on which high-level function we're running, so it can be a private dict somewhere.

That's my understanding of the problem; the solution so far is a lot more complicated than that.

@agoose77 (Collaborator, Author) commented Jan 26, 2024

If there are dependencies, it tries to run the high-level function, catches any ImportError (including ModuleNotFoundError), and re-raises it with a note about how to install the set of dependencies.

The short answer is versions matter. We also want to error if the user has e.g. cppyy but the wrong version. Rather than having two sets of dependency information scattered through the codebase, the intention here is that we define it in a single place for each function.

In principle, the PyPI or conda install name of the package might be different from the import name of the package, but we haven't encountered any like that. If there is such a distinction, then it would be global: the install-import mapping does not depend on which high-level function we're running, so it can be a private dict somewhere.

We need to embed this information somewhere: already, ROOT isn't on PyPI and has a different (lower-case) name, and Arrow is "pyarrow" rather than "arrow". I chose to embed this alongside each requirement so that we don't have to handle the information in multiple related places; it's simply repeated at each declaration. This makes the docs generator simpler (stateless), too.

We could define the "distribution-level" metadata somewhere such that most definitions can use the short-hand, i.e. a table mapping distribution to conda package and python import name (we only need distribution → python name for the case that it's not installed, otherwise importlib.metadata has that table). Maybe this table would also define the base version, e.g.

@requires_global("arrow", "fsspec", group="arrow")
@high_level_function(requires)
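Such a table might look something like this (values purely illustrative):

# short name -> (conda-forge name, import name, minimum version); illustrative only
DISTRIBUTIONS = {
    "pyarrow": ("pyarrow", "pyarrow", "7.0.0"),
    "fsspec": ("fsspec", "fsspec", None),
    "ROOT": ("root", "ROOT", None),
}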

There's more to think about here — I'll come back to it :)

I feel that there's an opportunity here to clean up / centralise our "extras" handling (including how ak._connect works), but we don't want too much complexity.

@agoose77 (Collaborator, Author) commented Feb 7, 2024

Let's see if:

  1. we can move to extras as a means of defining dependencies for function classes (e.g. str/arrow, dataframe, etc.)
  2. we can use "scoped extras" to define per-function dependencies, e.g. if one func needs pyarrow>13
  3. we can tell users the pip command
  4. we can detect the user's conda binary and suggest a possible command, with an appropriate warning about packages (see the sketch at the end of this comment)
  5. we can give users detailed info about the required distributions
  6. we can list the PyPI page in the documentation
  7. extras are scoped to the function, not the branch

The requirement() function can be simple; import errors just tell users to install the extra / meet the dependencies. Version errors can be scoped to the package itself.
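For point 4, detection could be as simple as this sketch (the set of binaries checked is illustrative):

import os
import shutil

def find_conda_like():
    # Return the name of an available conda-like installer, or None.
    for candidate in ("mamba", "micromamba", "conda"):
        if shutil.which(candidate):
            return candidate
    if os.environ.get("CONDA_EXE"):
        return "conda"
    return None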

@jpivarski added the pr-inactive label (a pull request that hasn't been touched in a long time) on Mar 20, 2024