String representation of an expression changes after running a second regression #306

Open
daloic opened this issue Apr 27, 2024 · 5 comments
@daloic

daloic commented Apr 27, 2024

What happened?

The first equation should be only a function of X1 and X2.
To reproduce the bug, simply run the bug() function. Maybe I am doing something wrong.

using DataFrames

import SymbolicRegression: SRRegressor
import MLJ: machine, fit!, report

function bug()
    df1 = DataFrame()
    df2 = DataFrame()
    df1[!, :X1] = rand(Float64, 1000)
    df1[!, :X2] = rand(Float64, 1000)
    df2[!, :X3] = rand(Float64, 1000)
    df2[!, :X4] = rand(Float64, 1000)

    y1 = @. 2*cos(df1[:, :X1]) + df1[:, :X2]^2 - 2
    y2 = @. 2*cos(df2[:, :X3]) + df2[:, :X4]^2 - 2

    model1 = SRRegressor(
        binary_operators=[+, -, *, /],
        unary_operators=[cos],
        niterations=30
    )
    mach1 = machine(model1, df1, y1)
    fit!(mach1)
    model2 = SRRegressor(
        binary_operators=[+, -, *, /],
        unary_operators=[cos],
        niterations=30
    )
    mach2 = machine(model2, df2, y2)
    fit!(mach2)
    r1 = report(mach1)
    r2 = report(mach2)
    println("")
    println("")
    println("")
    println("Now, the two equations, from two models and dataframes, are both function of X3 and X4:")
    println(r1.equations[r1.best_idx])
    println(r2.equations[r2.best_idx])
end

Version

0.24.3

Operating System

Linux

Interface

Script (i.e., python my_script.py)

Relevant log output

bug()
[ Info: Training machine(SRRegressor(binary_operators = Function[+, -, *, /], …), …).
[ Info: Started!
99.1%┣████████████████████▎ ┫ 446/450 [00:09<00:00, 48it/s]
Expressions evaluated per second: 8.22e+04. Head worker occupation: 9.7%
Press 'q' and then <enter> to stop execution early.
Hall of Fame:
---------------------------------------------------------------------------
Complexity  Loss       Score      Equation
1           1.599e-01  3.604e+01  y = 0.036886
3           1.024e-02  1.374e+00  y = X2 - X1
5           9.575e-03  3.376e-02  y = X2 - (X1 * 0.9543)
7           6.926e-04  1.313e+00  y = (X2 - X1) * (X2 + X1)
9           2.558e-04  4.979e-01  y = (X2 - (0.98286 * X1)) * (X2 + X1)
10          8.191e-22  3.603e+01  y = (cos(X1) * 2) + ((X2 * X2) + -2)
12          1.715e-25  4.236e+00  y = (((X2 * X2) + -1.9841) + (cos(X1) * 2)) - 0.015947
---------------------------------------------------------------------------
[ Info: Training machine(SRRegressor(binary_operators = Function[+, -, *, /], …), …).
[ Info: Started!
99.6%┣████████████████████▏┫ 448/450 [00:07<00:00, 66it/s]
Expressions evaluated per second: 1.07e+05. Head worker occupation: 13.7%
Press 'q' and then <enter> to stop execution early.
Hall of Fame:
---------------------------------------------------------------------------
Complexity  Loss       Score      Equation
1           1.677e-01  3.604e+01  y = 0.019402
3           1.024e-02  1.398e+00  y = X4 - X3
5           9.802e-03  2.163e-02  y = X4 - (X3 * 0.96411)
7           7.387e-04  1.293e+00  y = (X4 - X3) * (X3 + X4)
9           5.741e-05  1.277e+00  y = (X4 * X4) + ((X3 * X3) * -0.94205)
11          1.393e-05  7.081e-01  y = (((X3 * -0.88411) + -0.04641) * X3) + (X4 * X4)
13          7.600e-07  1.454e+00  y = ((X3 * ((X3 * 0.12018) + -1.0422)) * X3) + (X4 * X4)
16          2.218e-07  4.105e-01  y = (X4 * X4) + ((X3 * (-0.38203 - cos(X3 * -0.46609))) * (X3 * 0.72096))
18          1.126e-07  3.390e-01  y = ((X4 * 1.0003) * X4) + (X3 * ((cos(cos(cos(X3)) * -0.60652) * -1.0599) * X3))
---------------------------------------------------------------------------


Now, the two equations, from two models and dataframes, are both function of X3 and X4:
(((X4 * X4) + -1.9840531681086326) + (cos(X3) * 1.9999999999969573)) - 0.015946831888805472
((X4 * 1.000341390109433) * X4) + (X3 * ((cos(cos(cos(X3)) * -0.6065187759560643) * -1.0598517223541926) * X3))

Extra Info

No response

daloic added the "bug" label Apr 27, 2024
@MilesCranmer
Owner

MilesCranmer commented Apr 27, 2024

So, the equations themselves are actually not changing here. The default println(equation) is a convenience function that uses the last variable names given to fit!, which is why you are seeing both equations printed with the variable names X3 and X4.

For fine-grained control, including specifying which variable name is printed for each feature, see https://astroautomata.com/SymbolicRegression.jl/dev/api/#Printing

For example:

sr_options1 = mach1.fitresult.options
sr_options2 = mach2.fitresult.options

println(string_tree(r1.equations[r1.best_idx], sr_options1; variable_names=["x1", "x2"]))
println(string_tree(r2.equations[r2.best_idx], sr_options2; variable_names=["x3", "x4"]))

which will fix your observed behavior.

Basically what happens by default is

string_tree(equation, options=(#=last options=#), variable_names=(#=last variable names=#))

which is purely for convenience (so you can quickly print an expression), but doesn't actually change how evaluation of the expression is done.

The reason for this behavior is so that the Node type (which holds an expression) does not need to carry around metadata such as the variable names or operators: all the Node type does is store integers corresponding to the index of a feature or the index of an operator. This keeps evaluation fast and the memory footprint low.
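To make the mechanism concrete, here is a minimal, self-contained sketch of a tree that stores only integer indices and receives its names at print time. It is a toy: ToyNode, toy_string, and the helper constructors are hypothetical names invented for this illustration and are not the types or functions used by SymbolicRegression.jl.

```julia
# Toy illustration only: `ToyNode` and `toy_string` are hypothetical names,
# not part of SymbolicRegression.jl or DynamicExpressions.jl.

struct ToyNode
    kind::Symbol                # :feature, :constant, or :binary
    feature::Int                # column index (used when kind == :feature)
    value::Float64              # constant value (used when kind == :constant)
    op::Int                     # index into an operator list (used when kind == :binary)
    children::Vector{ToyNode}   # operands (used when kind == :binary)
end

feature_node(i::Int) = ToyNode(:feature, i, 0.0, 0, ToyNode[])
constant_node(v::Real) = ToyNode(:constant, 0, Float64(v), 0, ToyNode[])
binary_node(op::Int, l::ToyNode, r::ToyNode) = ToyNode(:binary, 0, 0.0, op, [l, r])

# Names and operator symbols are supplied by the caller; the tree holds neither.
function toy_string(n::ToyNode, operators::Vector{String}, variable_names::Vector{String})
    if n.kind == :feature
        return variable_names[n.feature]
    elseif n.kind == :constant
        return string(n.value)
    else
        l = toy_string(n.children[1], operators, variable_names)
        r = toy_string(n.children[2], operators, variable_names)
        return "(" * l * " " * operators[n.op] * " " * r * ")"
    end
end

operators = ["+", "-", "*", "/"]

# (feature 2 * feature 2) - feature 1, stored purely as indices
tree = binary_node(2, binary_node(3, feature_node(2), feature_node(2)), feature_node(1))

println(toy_string(tree, operators, ["X1", "X2"]))  # ((X2 * X2) - X1)
println(toy_string(tree, operators, ["X3", "X4"]))  # ((X4 * X4) - X3)
```

The same tree prints differently depending on the names passed in, while the stored indices (and therefore any evaluation on the data columns) are unchanged. This mirrors the effect reported above, where only the printed names differ.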

However, I have been thinking about having an Expression struct that does store this metadata, so users don't run into unexpected behavior like this, and perhaps returning that to the user (it would always print correctly) rather than the lower-level Node type. Thoughts?
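As a rough illustration of the idea of a wrapper that carries its own metadata, the sketch below pairs an index-only tree with the variable names it was fitted on. IndexTree, NamedExpr, leaf, call, and render are hypothetical names made up for this example. It is not the design proposed for the package, only a toy showing why printing becomes unambiguous once the names travel with the tree.

```julia
# Hypothetical sketch: `IndexTree` and `NamedExpr` are illustration names only,
# not types from SymbolicRegression.jl or DynamicExpressions.jl.

# A bare tree that refers to inputs by column index.
struct IndexTree
    op::String                 # "" for a leaf, otherwise the operator symbol
    feature::Int               # column index (used for a leaf)
    args::Vector{IndexTree}    # operands (used for an operator node)
end

leaf(i::Int) = IndexTree("", i, IndexTree[])
call(op::String, args::IndexTree...) = IndexTree(op, 0, IndexTree[args...])

# A wrapper that keeps the variable names next to the tree they belong to.
struct NamedExpr
    tree::IndexTree
    variable_names::Vector{String}
end

function render(t::IndexTree, names::Vector{String})
    if isempty(t.op)
        return names[t.feature]
    elseif length(t.args) == 1
        return t.op * "(" * render(t.args[1], names) * ")"
    else
        return "(" * render(t.args[1], names) * " " * t.op * " " * render(t.args[2], names) * ")"
    end
end

# Printing uses the names stored in the wrapper, not any global "last fit" state.
Base.show(io::IO, e::NamedExpr) = print(io, render(e.tree, e.variable_names))

tree = call("+", call("cos", leaf(1)), call("*", leaf(2), leaf(2)))
e1 = NamedExpr(tree, ["X1", "X2"])
e2 = NamedExpr(tree, ["X3", "X4"])   # e.g. a later fit on differently named columns

println(e1)  # (cos(X1) + (X2 * X2))
println(e2)  # (cos(X3) + (X4 * X4))
```

The trade-off is the one described above: the wrapper costs a little extra memory per expression, so it makes sense at the user-facing boundary while the inner search keeps working on the bare index-only tree.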

MilesCranmer changed the title from "[BUG]: Running two regressions after each other, the 2nd overwrite the first" to "[BUG]: String representation of an expression changes after running a second regression" Apr 27, 2024
@daloic
Author

daloic commented Apr 27, 2024

Now I understand the behaviour. This is not intuitive as I really had the feeling I was running two different "lines" of calculations with totally different variables.

If you create an Expression struct, this would be really nice, as the behaviour would be more intuitive, but adding a small notice in the examples and in the API documentation of string_tree would suffice at first. Maybe it is a nice-to-have for the future, if you think about restructuring your code a bit. Feel free to close the bug or convert it to a low-priority feature.

Note that with your answer, I also learnt how to get access to the options of a fit, I missed it before!

MilesCranmer added the "code cleanup" label and removed the "bug" label Apr 28, 2024
MilesCranmer changed the title from "[BUG]: String representation of an expression changes after running a second regression" to "String representation of an expression changes after running a second regression" Apr 28, 2024
@MilesCranmer
Owner

Good point, thanks. I don't think this would be too bad to implement. I think there are multiple benefits to having some user-facing expression types: not only making it explicit which operators and variables are used in an expression (which, as you point out, certainly seems like a bit of a foot gun right now), but also enabling us to store other metadata in expressions without slowing down the search algorithm.

@daloic
Author

daloic commented Apr 28, 2024

Thank you Miles for your kind handling of my report and the incredibly useful package you created with SymbolicRegression.

@MilesCranmer
Owner

Thanks @daloic.

Today I worked on an implementation which you can find in SymbolicML/DynamicExpressions.jl#73. Let me know what you think of the proposal. I think it's a much more robust strategy than before, so thanks for pointing this out.
