[WIP] Alternative method to set PySRRegressor parameters #216

MilesCranmer · 2022-11-02T21:21:53Z

PySRRegressor currently has 74 parameters. That is way too many for a single method. It can be hard to figure out what a single parameter from the long list actually does, especially for new users - who are likely overwhelmed by the large number of options.

What this work-in-progress PR does is create an alternative way to initialize a PySRRegressor object. Rather than passing all parameters at initialization (which would still be allowed), you could instead call a series of different functions to configure various parts of the regressor.

Here's an example of what this might look like:

model = (
    PySRRegressor()
        .set_space(binary_operators=["+", "-"], unary_operators=["cos"], maxsize=30)
        .set_size(niterations=100, populations=50)
        .set_parallelism(procs=8)
)

Each of these functions would define only a subset of parameters according to the groups shown in the new API page: https://astroautomata.com/PySR/api/.

This makes for much more readable code, and hopefully a model that is easier for new users to explore, rather than having all parameters dumped in a flat list. It also presents an opportunity for better documentation: these set_... functions can have more specific explanations of, e.g., what an "equation space" is (for set_space), rather than giving a general overview in the __init__ function.

There are a couple issues with this:

Re-used documentation strings. You would have the docstring both at __init__ and at the set_... functions. Or, we could remove the docstring from __init__ altogether, and simply point the user to the various set_... functions they can use.
Some default parameters are None. How do we also make parameters optional in the set_... functions? My solution to this was to define a global NONE="none" constant that is sort of like an alternative definition of None, so we can differentiate between the user setting a variable to None or not.
A bit different than normal Scikit-learn initializations. However, I think most scikit-learn style APIs only have max ~15 parameters, so we are in a different regime.

Thanks @dfm for the idea! Would be interested in hearing your comments when you have time.

@patrick-kidger @tttc3 @JayWadekar I am eager to hear what you think as well.

patrick-kidger · 2022-11-02T22:03:24Z

This sounds like a good idea to me.

Another option here is to define intermediate helper objects:

@dataclass
class Sizes:
    niterations: int
    populations: int

@dataclass
class Space:
    binary_operations: Sequence[Callable]
    unary_operations: Sequence[Callable]
    maxsize: int

class PySRRegressor:
    def __init__(self, sizes: Sizes, space: Space, ...)

Which can likewise help to split things up by grouping less-frequently-used parameters.

This is what I do for diffrax.diffeqsolve, for example. This has a very similar feel: differential equation solvers also take a lot of parameters, many of which are only used in particular use cases (nonlinear solvers only used for implicit problems; delays only used for DDEs; tolerances only used if you're stepping adaptively/etc.etc.)

MilesCranmer · 2022-11-06T00:21:43Z

Thanks for the ideas! I am on the fence about which way to go: passing a hierarchy of objects, or declaratively setting parameters.

One one hand, I think that if a user wants to figure out the API just by using hover-help and tab completion in their IDE, then the functional method is better, and the typing method would require they keep flipping back to the docs page, and importing various objects.

But on the other hand, the typing method seems more readable. I can definitely see the advantage of a hierarchy of types when only certain parameter combinations are compatible with eachother. And it's also more modular in case a user wants to add their own configuration, or share optimal parameter configs. (Though perhaps this would be better suited for multiple-dispatch)

Maybe you could rule out certain param combos with the declarative method too, and have the functions return different subclasses of the main regressor object. Sort of like how ggplot or [altair](https://altair-viz.github.io/) creates plots. e.g., maybe .sequence_regressor() would return a subclass that could take sequences as input, rather than tabular data. And similar for tensor operations in the future. But for that perhaps it's just better to have a separate class suited for the task... Not sure!

tttc3 · 2022-11-13T11:46:12Z

I don't have any other ideas, but personally, I think the hierarchy of types is more straightforward. It avoids the duplicated docstrings, and I think is easier to understand/follow if the user hasn't read the docs beforehand. Whichever option you decide, I think it will be a nice improvement!

MilesCranmer · 2022-11-14T02:10:22Z

Thanks for the feedback @tttc3!

I’m also wondering if it might be possible to get tab completion to work by exposing a “config” module that acts to store references to each type. Then you could write, for example, pysr.config.<tab> and have it show you what types are available. @patrick-kidger what is your experience with tab completions/parameter discoverability with the type-based API?

Maybe I should just mock up a few designs and conduct a Twitter poll…

patrick-kidger · 2022-11-14T06:04:47Z

I've not tried using a config namespace to hold them - but otherwise yup, tab completion works well.

MilesCranmer · 2022-11-14T06:37:23Z

Thanks! How do you use tab completion? i.e., what string of text are you actually hitting tab on?

patrick-kidger · 2022-11-14T07:18:55Z

Typically:

-type-> diffrax.Kv
-tab-> diffrax.Kvaerno5
-type-> diffrax.Kvaerno5(no
-tab-> diffrax.Kvaerno5(nonlinear_solver=

As I have it set to prompt after two characters. (FWIW my current editors are Helix and NvChad.)

If you're thinking in terms of discoverability then that should work just the same, I think. Probably easier if you configure prompting to happen after zero characters though.

MilesCranmer · 2022-11-14T07:49:17Z

Thanks. I guess if I do <tab> after typing diffrax, there are too many options for me to know what to select, right? i.e., if I don't know to type Kv, then one might have to go to the docs?

I wonder if it would be enough to have a prefix on all config parameters, like ConfigSpace, ConfigSize, etc. Then when you hit Config<tab> , it would only show you the small set of available configs, so you could step through more easily (since Config is easy enough to memorize). Or just have a pysr.config submodule...

Create set_space, set_size, and set_objective

d85da67

MilesCranmer mentioned this pull request Nov 2, 2022

[Code cleanup] make options more hierarchical #213

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Alternative method to set PySRRegressor parameters #216

[WIP] Alternative method to set PySRRegressor parameters #216

MilesCranmer commented Nov 2, 2022

patrick-kidger commented Nov 2, 2022 •

edited

MilesCranmer commented Nov 6, 2022

tttc3 commented Nov 13, 2022

MilesCranmer commented Nov 14, 2022

patrick-kidger commented Nov 14, 2022

MilesCranmer commented Nov 14, 2022

patrick-kidger commented Nov 14, 2022

MilesCranmer commented Nov 14, 2022

[WIP] Alternative method to set PySRRegressor parameters #216

Are you sure you want to change the base?

[WIP] Alternative method to set PySRRegressor parameters #216

Conversation

MilesCranmer commented Nov 2, 2022

patrick-kidger commented Nov 2, 2022 • edited

MilesCranmer commented Nov 6, 2022

tttc3 commented Nov 13, 2022

MilesCranmer commented Nov 14, 2022

patrick-kidger commented Nov 14, 2022

MilesCranmer commented Nov 14, 2022

patrick-kidger commented Nov 14, 2022

MilesCranmer commented Nov 14, 2022

patrick-kidger commented Nov 2, 2022 •

edited