Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add nx.config dict for configuring dispatching and backends #7225

Merged
merged 20 commits into from Mar 15, 2024

Conversation

eriknw
Copy link
Contributor

@eriknw eriknw commented Jan 13, 2024

This adds a global config dict for dispatching and backends. Is this what we discussed in the last networkx dispatching meeting? It allows backends to provide default configuration in "default_config" in the dict returned by networkx.backend_info entry-point. It saves this object to nx.backend_config[backend_name]. This config is initialized using environment variables if they are defined.

I think we should consider the fields of this config and whether we like the names, because this is now becoming public. Do we like fallback_to_nx: bool and automatic_backends: list?

Also, how should we document this? Should we subclass dict just to provide a docstring, such as:

class BackendConfigDict(dict):
    """blah blah blah"""

CC @MridulS @rlratzel @Schefflera-Arboricola @jim22k

(backend, info["default_config"])
for backend, info in backend_info.items()
if "default_config" in info
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here, it might be more optimal to have a separate function(like get_info) that returns only the backend's configurations, like get_config(config_dict=None)(if config_dict is not None then the configuration dictionary is updated and returned otherwise the default configurations are returned) because config dictionary would probably be updated quite frequently and we probably shouldn't load all the backend_info every time. Please let me know if I'm missing something here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I completely understand your comment. get_info is called exactly once at import time for each backend (before this PR and in this PR). Among other things, this helps the dispatch machinery know which backends implement which functions.

By optionally including "default_config" in the dict returned by get_info, we are able to initialize the default configuration for each backend at import time. This happens only once. This may be nice for users so they can do things like:

import networkx as nx
nx.backend_config["parallel"]["n_job"] = 5

nx.backend_config is how we expect users to update configuration.

We could make a separate entry point for the default config, but I don't know if it would be worth doing since it seems to me that the "backend_info" entry point works well enough.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I thought this kind of thing was possible :

import networkx as nx
nx.backend_config["parallel"]["n_job"] = 5
# defining G
print(nx.betweenness_centrality(G, backend="parallel"))

nx.backend_config["parallel"]["n_job"] = -1
nx.backend_config["parallel"]["backend"] = 'threading'
print(nx.betweenness_centrality(G, backend="parallel"))

so I thought we would be extracting all the configurations and adding them in the joblib.Parallel() in all the functions, so having a separate get_config function made sense to me.

But, I think, based on the clarification above, that all the configuration code lines will be executed while importing networkx so the final configuration for both the function calls would be {"n_jobs" : -1, "backend" : "threading"}, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coming back to this question:

  • I think the code you write above would work with this configuration system.
  • the default values will be obtained from the backend_info get_info function.
  • the user overrides those values as desired for each function call.
  • the function dispatches to nx-parallel which looks at the config to get the values to send to joblib.

I think your get_config function would live in nx-parallel and look at nx.config.backends.parallel.<name> for each option. Or, we could use the whole config object like: joblib.Parallel(**nx.config.backends.parallel). But the user would be able to change them between each function call.

== "true",
"automatic_backends": [
x.strip()
for x in os.environ.get("NETWORKX_AUTOMATIC_BACKENDS", "").split(",")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does NETWORKX_AUTOMATIC_BACKENDS mean here? like what's its functionality? is it the default backend?
Sorry, I don't think I understand all of the backend functionality fully, so sorry if I am being repetitive here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, thanks for asking!

Previously, configuring backends was only possible via environment variables. NETWORKX_AUTOMATIC_BACKENDS environment variable is a comma separated list of backends that indicates which backends we want to automatically convert to and run when e.g. the input graph is a NetworkX graph. This is a prioritized list, so backends listed first are tried first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, thank you :)

@eriknw
Copy link
Contributor Author

eriknw commented Jan 29, 2024

As discussed in the weekly dispatching meeting, "fallback_to_nx" probably doesn't make sense (yet) as a user-facing configuration option, so I removed it from nx.config. It is meant to be used when testing backends. We may soon have more generalized behavior for backend-to-backend conversions, which will allow "falling back" to other backends, not just networkx.

@eriknw eriknw changed the title Add nx.backend_config dict for configuring dispatching and backends Add nx.config dict for configuring dispatching and backends Jan 30, 2024
@rossbar rossbar added the Dispatching Related to dispatching and backend support label Jan 31, 2024
Copy link
Member

@dschult dschult left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good -- just a couple of comments.

  • I would envision config will eventually be defined/created in the primary __init__.py module. No need to do that now since backends are the only part of nx using it. But if/when it becomes part of visualization or something else....
  • I don't think we should subclass dict to get doc_strings. But that's only because I've had bad experience subclassing dict before. (The performance hit is quite large even though the subclassed dict doesn't do anything. Better to make the dict an attribute and add methods for the dict operations which just call the operation on the dict attribute.) What kind of documentation are we looking for? doc_string for a developer reading the code? doc_string for the html documentation? inspection related doc_strings which appear with ipython ? or ?? operation? I think the html version can be it's own rst page. The others are more tricky. Let's talk more about this. Are we looking for nx.config? to return something nice?

@eriknw
Copy link
Contributor Author

eriknw commented Feb 7, 2024

Are we looking for nx.config? to return something nice?

That's the main thing I was thinking, so something like:

class Config(dict):
    """Something nice; go look online for more."""

config = Config()

Does this actually perform worse in recent Python versions? I'm fine with regular dict too.

@dschult
Copy link
Member

dschult commented Feb 7, 2024

As I hinted above, my knee-jerk reaction about subclass-ing dict is based solely on my experience with performance issues. That experience may not apply here. For example, maybe performance doesn't matter for the config dict.

Let's see... Are there guidelines for when to use an attribute and when to use a method? It seems like an attribute is documented in the class documentation while a method has its own documentation. Why is that? Maybe something like attributes tend to be data structures/values, while methods tend to provide functionality. Do you have some experience with that kind of rule-of-thumb? My intuition is to have a dict attribute and document that in the class documentation. But I could easily be swayed here.

@eriknw
Copy link
Contributor Author

eriknw commented Feb 7, 2024

I'd be fine using a regular dict (and eventually adding online docs) so we can avoid these issues for now.

Performance-wise, I think subclassing from dict would be faster than subclassing from collections.UserDict or using a dict attribute. It's true that subclassing dict makes some operations such as __getitem__ and __contains__ slower than for regular dicts (CPython uses some tricks with hard-coded C slots and that get lost when building the descriptors for these slots for subclasses; in other words, my understanding is that dicts are fast, and dict subclasses are "regular speed").

You're not supposed to subclass from dict directly if you want to change methods, but we're not doing that here.

Comment on lines 130 to 142
config = {
# Get default configuration from environment variables at import time
"automatic_backends": [
x.strip()
for x in os.environ.get("NETWORKX_AUTOMATIC_BACKENDS", "").split(",")
if x.strip()
],
# Initialize default configuration for backends
"backends": {
backend: info.get("default_config", {})
for backend, info in backend_info.items()
},
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be for a separate PR but I think the config entry "automatic backends" could be named better (same for the corresponding env var), especially when seen next to the config entry here for "backends".

We should discuss this a bit more, but I think "backend priority" (config["backend_priority"], NETWORKX_BACKEND_PRIORITY) would be a better name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a good thing to discuss 👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think maybe we can think a bit more about the schema for the config dictionary, but for now let's go with this!

Copy link
Member

@MridulS MridulS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some big picture questions, not a blocker here:

It's pretty easy to just mess up the config dictionary right now, there is no specific protection here. I can just delete the config or rewrite it. Maybe we don't even need to worry about this.

In [1]: import networkx as nx

In [2]: nx.config
Out[2]: {'automatic_backends': [], 'backends': {'graphblas': {}}}

In [3]: nx.config = 'something'

In [4]: nx.config
Out[4]: 'something'

Currently it won't be able to persist the config between 2 runs, it's declared at runtime. I guess that's fine as these configs should just stay in some config file, but we would still need a way to write the config to disk (it could just be a toml/yaml export)

@eriknw
Copy link
Contributor Author

eriknw commented Feb 29, 2024

I fixed the merge conflict, and I updated the name to "backend_priority". Anything else?

Copy link
Contributor

@rossbar rossbar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little leery about having the name config exposed in the main namespace, especially since the configuration info contained here is related specifically to backends.

How would folks think about instead keeping the config dict exposure linked to the backends module? Does this interfere with it's behavior/implementation? From a user/API perspective, I think nx.backends.config is both more accurate and clear than nx.config.

@dschult
Copy link
Member

dschult commented Mar 6, 2024

@rossbar what do you think about having nx.config but having the backend related keys in a dict nx.config['backends']? What are the tradeoffs between using subpackages and dict hierarchies. NetworkX has leaned toward flat namespaces with "dicts all the way down" for storing/organizing data. I can imagine config options for visualization, for backends, and for some of the more complex algorithms, but probably only one of two of those. So there shouldn't be many config options overall. Are you worried about namespace collisions, ease of finding the options, documentation, or "all of the above"? :)

@eriknw
Copy link
Contributor Author

eriknw commented Mar 11, 2024

I just created a Config class. It has a docstring that y'all should read. There are even tests (before you comment--yes, I do try to write tests, and it's nice to add something that is easy to write tests for!).

The UX of the config is based on conversations and feedback from several people. I actually experimented with ~4 approaches that all basically do the same thing, and I was surprised that I settled on the one that uses dataclass. I think the end result is nice for the user and nice for backend developers who can use Config to make their own configuration objects.

After talking with @rlratzel multiple times about config, I believe he prefers attribute access (e.g. cfg.x). Dataclasses support this natively.

@dschult is sensitive to performance. Dataclasses are wicked-fast for attribute access. They are not as fast as some options for bracket syntax (e.g., cfg["x"]), but they're not bad (and some implementations are much slower).

I debated with myself whether the config should actually be a collections.abc.Mapping, and I think they should. They are very mapping-like. I also really like .get for writing portable code. I don't think we need all the other methods from MutableMapping.

I believe @Schefflera-Arboricola first argued for making this a class during a call (but I may not remember correctly). Please take a look.

@MridulS above complaineid how easy it was to mess up the config. There were no protections. Now there are.

Things we are not yet doing for config, and I would say are out of scope for this PR unless any would fundamentally change our approach:

  • thread-local storage
  • contexts
  • persistence
  • searching paths for config files
  • automagically getting config from environment variables (this is done manually)

For fun, here is another approach I tried. Its performance is very reasonable. The thing I like most about it is that you can provide documentation for every single configuration. for example, doing e.g. config.val? in IPython or Jupyter will show you the value and docstring.

KrazyKonfig
class KrazyKonfig:
    """Config class doc"""
    __slots__ = ()

    def __new__(cls, config_data, config_docs=None, doc=None):
        if config_docs is None:
            config_docs = {}
        else:
            config_docs = dict(config_docs)
        config_data = {key: config_data[key] for key in sorted(config_data)}
        class_dict = {
            "__slots__": (),  # config.newkey = val  # RAISE
            "_data": config_data,  # config._data
            "_docs": config_docs,  # config._docs
            "__doc__": doc,  # help(config)
            "__getitem__": config_data.__getitem__, # config[key]
            "__contains__": config_data.__contains__,  # key in config
            "__len__": config_data.__len__,  # len(config)
            "__iter__": config_data.__iter__,  # iter(config)
            "__reversed__": config_data.__reversed__,  # reversed(config)
            "__hash__": None,  # hash(config)  # RAISE
            "__dir__": config_data.keys,  # config.<TAB>
            "_ipython_key_completions_": config_data.keys,  # config["<TAB>
            # Make this a `collections.abc.Mapping`
            "get": config_data.get,
            "items": config_data.items,
            "keys": config_data.keys,
            "values": config_data.values,
        }
        for key in config_data:
            # Create properties for all config items for speed and docs
            val = property(op.itemgetter(key))  # config.x
            val = val.setter(partial(cls._setitem, key))  # config.x = y
            val.__doc__ = config_docs.get(key)  # config.x?  # IPython/Jupyter help
            class_dict[key] = val
        new_cls = type(cls.__name__, (cls,), class_dict)
        instance = object.__new__(new_cls)
        return instance

    @staticmethod
    def _setitem(key, self, value):
        if key not in self._data:
            raise AttributeError(f"Invalid config name: {key!r}")
        # Do checks...
        self._data[key] = value

    def __setitem__(self, key, value):
        try:
            self._setitem(key, self, value)
        except AttributeError as err:
            raise KeyError(*err.args) from None

    def __delitem__(self, key):
        raise TypeError(f"Configuration items can't be deleted (can't delete {key!r}).")

    __delattr__ = __delitem__

    def __repr__(self):
        return f'{type(self).__name__}({self._data!r})'

    def __reduce__(self):
        return type(self).mro()[1], (self._data, self._docs, self.__doc__)

    def __eq__(self, other):
        if not isinstance(other, KrazyKonfig):
            return NotImplemented
        return self.__reduce__() == other.__reduce__()

KrazyKonfig has a little more magic going on, so I don't think it's particularly accessible or maintainable (fwiw, I don't think it's too bad, and I think it's a good learning experience for the magic it uses). I think the config class currently in this PR is much more clear and maintainable (and if I ought to write more code comments, please let me know!).

"""Write me!"""

backend_priority: list[str]
backends: Config[str, Config]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mypy reminds me again that I still know very little about type annotations. mypy doesn't like Config[str, Config].

My "solution" will be to delete __class_getitem__ so this syntax isn't even possible. The working solution probably includes Generic and TypeVar and who knows what. Oh well... I thought this was worth trying.

@eriknw
Copy link
Contributor Author

eriknw commented Mar 13, 2024

The latest commit added the ability to define a subclass with strict=False to allow configs to be added and deleted. It works, but it does make things a little more complicated.

@eriknw
Copy link
Contributor Author

eriknw commented Mar 14, 2024

btw, I'm currently dyslexic/confused when it comes to single ` vs double `` when writing docs. Please check the usage of them in this PR and guide me in the right direction, thanks.

Copy link
Member

@dschult dschult left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked for single/double backticks issues and found none. But that doesn't mean they aren't there. :) I also looked through the recent changes and I think this is ready to go in. We should keep an eye on it especially early on to see if anything should be fixed or tweaked to make it better.

We've got two approvals. I'll give @rossbar some time to look if he gets a chance. But will likely merge later today. Ross can make a separate PR with corrections and so forth if needed.

Copy link
Contributor

@rossbar rossbar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think this would be better served by living in the nx.backends namespace rather than being available in nx, but I don't feel strongly enough that it should be a blocker! If others are happy then +1 from me!

@dschult dschult merged commit f336cf2 into networkx:main Mar 15, 2024
41 checks passed
@jarrodmillman jarrodmillman added this to the 3.3 milestone Mar 15, 2024
cvanelteren pushed a commit to cvanelteren/networkx that referenced this pull request Apr 22, 2024
…kx#7225)

* Add `nx.backend_config` dict for configuring dispatching and backends

* Rename `nx.backend_config` to `nx.config`

* "fallback_to_nx" is for testing, not for user config

* Move config of backends to e.g. `nx.config["backends"]["cugraph"]`

* How do you like this mypy?!

* Rename `automatic_backends` to `backend_priority` (and env variables)

* Create a class to handle configuration

* Oops thanks mypy

* Fix to work with more strict config

* Support (and test) default values

* Remove `__class_getitem__` and add docstring

* Allow `strict=False` when defining subclasses.

This allows configs to be added and deleted.

* Move `__init_subclass__`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dispatching Related to dispatching and backend support type: Maintenance
Development

Successfully merging this pull request may close these issues.

None yet

7 participants