Classification on MNIST #141

Open · zsz00 opened this issue Oct 13, 2022 · 3 comments

zsz00 commented Oct 13, 2022

I tried to use SR for MNIST classification training, but the results were not good.
I hope you can help me see where I need to improve.
MNIST consists of 28×28 images with 10 class labels.

using SymbolicRegression
using SymbolicUtils
import MLDatasets: MNIST
import MLUtils: splitobs

function loadmnist(train_split)
    ## Load MNIST (traintensor/trainlabels are deprecated in MLDatasets >= 0.7,
    ## which uses MNIST(split=:train).features / .targets instead)
    N = 60000  # 5000
    imgs = MNIST.traintensor(1:N)
    labels_raw = MNIST.trainlabels(1:N)

    ## Reshape images into a 28x28x1xN Float32 array
    x_data = Float32.(reshape(imgs, size(imgs, 1), size(imgs, 2), 1, size(imgs, 3)))
    y_data = labels_raw   # onehot(labels_raw)
    (x_train, y_train), (x_test, y_test) = splitobs((x_data, y_data); at=train_split)
    return (x_train, y_train), (x_test, y_test)
end

function train()
    train_split = 0.9
    (x_train, y_train), (x_test, y_test) = loadmnist(train_split)

    println(size(x_train))
    options = SymbolicRegression.Options(;
        binary_operators=(+, *, /, -),
        unary_operators=(cos, sin, exp),
        npopulations=50,
        batching=true,
        batchSize=100,
        # loss=LogitMarginLoss()
    )
    ## Flatten each image into a 784-element feature column
    x_train = reshape(x_train, 784, size(x_train)[end])
    y_train = convert(Vector{Float32}, y_train)

    hall_of_fame = EquationSearch(x_train, y_train; niterations=50, options=options, numprocs=8)

    dominating = calculate_pareto_frontier(x_train, y_train, hall_of_fame, options)
    eqn = node_to_symbolic(dominating[end].tree, options)
    println(simplify(eqn))  # transform/simplify the recovered equation
end

train()

With N = 5000 and batching=false, the output is:

Complexity  Loss       Score      Equation
18          4.715e+00  8.670e-03  ((((sin(x264 + x320) - -1.9941229) * (1.5529515 - (x484 - sin(sin(x437))))) + x355) - x599)
19          4.703e+00  2.455e-03  ((((sin(x264 + x320) - -1.9941229) * (1.5095575 - (sin(x484) - sin(sin(x437))))) + x355) - x599)
20          4.665e+00  8.302e-03  (((((sin(x264 + x320) - -1.9941229) - x509) * (1.5529515 - (x484 - sin(sin(x437))))) + x355) - x599)

With N = 60000 and batching=true, it is very slow and the results are worse.

MilesCranmer (Owner) commented:

MNIST is a high-dimensional dataset, where pure symbolic regression is going to do quite poorly due to the combinatorial scaling. What you can try instead is something like the approach described in https://arxiv.org/abs/2006.11287 (see the interactive example at the end of https://colab.research.google.com/github/MilesCranmer/PySR/blob/master/examples/pysr_demo.ipynb).

Basically, write down a neural network like $$\text{classification} = MLP_1\left(\sum_{i} MLP_2(\text{patch}_i)\right),$$

where $\text{patch}_i$ is a patch of pixels (maybe give it 9 pixels each?). Once you have trained this, fit SR to $MLP_2$ and $MLP_1$ independently. Finally, arrange the recovered equations in the same functional form.
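
A minimal sketch of that architecture, assuming Flux (the layer widths, the 2-dimensional latent, the non-overlapping 3×3 patch grid, and the names model/mlp1/mlp2 are all illustrative choices, not prescriptions from the paper):

using Flux

patch = 3       # 3x3 pixel patches
latent = 2      # keep MLP_2's output low-dimensional so SR stays tractable
nclasses = 10

## MLP_2: one flattened patch -> small latent vector
mlp2 = Chain(Dense(patch^2 => 32, relu), Dense(32 => latent))

## MLP_1: summed latent vector -> class logits
mlp1 = Chain(Dense(latent => 32, relu), Dense(32 => nclasses))

## x is a 28x28xB batch; patch starts are 1, 4, ..., 25, so the 28th
## row/column of pixels is dropped (28 is not divisible by 3)
function model(x)
    s = zeros(Float32, latent, size(x, 3))
    for i in 1:patch:26, j in 1:patch:26
        s = s .+ mlp2(reshape(x[i:i+patch-1, j:j+patch-1, :], patch^2, :))
    end
    return mlp1(s)
end

loss(x, y) = Flux.logitcrossentropy(model(x), Flux.onehotbatch(y, 0:9))

After training this, fit SR to the input/output pairs of mlp2 and mlp1 separately.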

MilesCranmer (Owner) commented:

For example, maybe you'll get something like: $$MLP_2 \approx (\text{pixel}_1 - \text{pixel}_2,\ \ \text{pixel}_3 \times \text{pixel}_4)$$ and $$MLP_1 \approx y_1 \times y_2^2$$

Thus, your final equation would be: $$classification = \text{sigmoid}((\sum_{i} \text{pixel}_1 - \text{pixel}_2) \times (\sum_i \text{pixel}_3 \times \text{pixel}_4)^2 )$$

where the sum is over small patches of $3\times 3$ pixels.
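
To make that assembly concrete, the final equation could be evaluated on a single 28×28 image x as below; note that the in-patch pixel indices and the sigmoid come from the made-up example above, not from a fitted result:

sigmoid(z) = 1 / (1 + exp(-z))

function predict(x)
    y1 = 0.0f0   # accumulates the first recovered MLP_2 output over patches
    y2 = 0.0f0   # accumulates the second recovered MLP_2 output over patches
    for i in 1:3:26, j in 1:3:26
        p = x[i:i+2, j:j+2]   # one 3x3 patch (column-major linear indexing)
        y1 += p[1] - p[2]
        y2 += p[3] * p[4]
    end
    return sigmoid(y1 * y2^2)   # recovered MLP_1 applied to the patch sums
end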

MilesCranmer changed the title from “Train classification on MNIST use SR” to “Classification on MNIST” on Oct 13, 2022

tecosaur commented Jul 3, 2023

Regarding applying SymbolicRegression to high-dimensional data sets in general, I imagine the recommendation would be to start with a feature-selection approach, and once a small number of highly relevant features has been selected, apply SymbolicRegression?
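
For instance, a minimal sketch of that workflow, reusing x_train, y_train, and options from the script above (the correlation ranking and the select_features helper are placeholders of my own, not an API of this package; any feature-importance score, e.g. from a tree ensemble, could stand in):

using Statistics: cor

## Rank features by absolute correlation with the target and keep the top k
function select_features(X, y; k=5)
    scores = [abs(cor(X[i, :], y)) for i in axes(X, 1)]
    replace!(scores, NaN => 0.0)  # constant features (e.g. border pixels) yield NaN
    return partialsortperm(scores, 1:k; rev=true)
end

idx = select_features(x_train, y_train; k=5)
hall_of_fame = EquationSearch(x_train[idx, :], y_train; niterations=50, options=options)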
