Proposal for New Supervised Learning Data Simulation Classes in C++ for MLPACK Library #3559

coatless · 2023-11-11T06:10:41Z

What is the desired addition or change?

This RFC proposes the addition of new supervised learning data simulation classes to the MLPACK library in C++. The objective is to extend the library's capabilities by introducing classes specifically designed for generating synthetic datasets suitable for testing linear regression and logistic regression models. These simulation classes will offer flexibility in configuring key parameters, including:

Number of observations
Number of parameters
Type of error term
Amount of sparsity
Amount of outliers (contamination)

What is the motivation for this feature?

The ability to generate synthetic datasets tailored for supervised learning scenarios is essential for robust model testing. Linear regression and logistic regression are fundamental techniques in this domain, and having dedicated simulation classes will enhance MLPACK's utility for researchers and practitioners.

If applicable, describe how this feature would be implemented.

Two distinct simulation classes, one for linear regression and one for logistic regression, will be implemented in C++ and integrated into the MLPACK library. Users will be able to instantiate these classes and set specific parameters to generate synthetic datasets for testing their models.
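To make the parameters above concrete, here is a minimal sketch of what the linear regression generator might do internally. It uses only the C++ standard library rather than Armadillo or mlpack types, and every name in it (SimulatedRegression, SimulateLinearRegression, the fixed outlier shift of 10 standard deviations) is a hypothetical choice for illustration, not part of the proposal or of mlpack.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical container for a simulated dataset; not an mlpack type.
struct SimulatedRegression
{
  std::vector<std::vector<double>> design;  // One inner vector per observation.
  std::vector<double> responses;            // One response per observation.
  std::vector<double> coefficients;         // True coefficients (some zeroed).
};

// Draw y = x . beta + noise. Each coefficient is zeroed with probability
// `sparsity`, and each response is shifted by a large constant with
// probability `contamination` to create outliers.
inline SimulatedRegression SimulateLinearRegression(
    const std::size_t observations,
    const std::size_t parameters,
    const double noiseStdDev,
    const double sparsity,
    const double contamination,
    const unsigned int seed)
{
  assert(noiseStdDev > 0.0);
  assert(sparsity >= 0.0 && sparsity <= 1.0);
  assert(contamination >= 0.0 && contamination <= 1.0);

  std::mt19937 rng(seed);
  std::normal_distribution<double> standardNormal(0.0, 1.0);
  std::normal_distribution<double> noise(0.0, noiseStdDev);
  std::bernoulli_distribution isZero(sparsity);
  std::bernoulli_distribution isOutlier(contamination);

  SimulatedRegression out;
  out.coefficients.resize(parameters);
  for (double& beta : out.coefficients)
    beta = isZero(rng) ? 0.0 : standardNormal(rng);

  out.design.assign(observations, std::vector<double>(parameters));
  out.responses.resize(observations);
  for (std::size_t i = 0; i < observations; ++i)
  {
    double mean = 0.0;
    for (std::size_t j = 0; j < parameters; ++j)
    {
      out.design[i][j] = standardNormal(rng);
      mean += out.design[i][j] * out.coefficients[j];
    }
    out.responses[i] = mean + noise(rng);
    // Assumption for illustration: outliers are shifted by 10 noise
    // standard deviations.
    if (isOutlier(rng))
      out.responses[i] += 10.0 * noiseStdDev;
  }
  return out;
}
```

Calling SimulateLinearRegression(1000, 5, 2.0, 0.1, 0.05, 42) would give 1000 observations of 5 features. A real implementation would presumably fill an arma::mat and arma::rowvec instead, and would let the caller choose the error distribution.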

Example Usage

Linear Regression Simulation:

#include <mlpack/core.hpp>
#include <mlpack/methods/simulate_data/linear_regression_simulation.hpp>

// Create a linear regression simulation object
mlpack::simulate::LinearRegressionSimulation linearSim;

// Set simulation parameters
linearSim.Observations(1000);
linearSim.Parameters(5);
linearSim.Intercept(true);
linearSim.ErrorTerm(mlpack::simulate::Gaussian(10, 2)); 
linearSim.Sparsity(0.1);
linearSim.Contamination(0.05);

// Generate synthetic linear regression dataset
arma::mat design;
arma::rowvec responses; // continuous responses for regression
linearSim.Generate(design, responses);

// Use the generated dataset for model testing

Logistic Regression Simulation:

#include <mlpack/core.hpp>
#include <mlpack/methods/simulate_data/logistic_regression_simulation.hpp>

// Create a logistic regression simulation object
mlpack::simulate::LogisticRegressionSimulation logisticSim;

// Set simulation parameters
logisticSim.Observations(1000);
logisticSim.Intercept(true);
logisticSim.Parameters(5);
logisticSim.ErrorTerm(mlpack::simulate::Binomial());
logisticSim.Sparsity(0.2);
logisticSim.Contamination(0.1);

// Generate synthetic logistic regression dataset
arma::mat design;
arma::Row<size_t> responses;
logisticSim.Generate(design, responses);

// Use the generated dataset for model testing
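For the logistic case, the responses are class labels rather than real values, which is why arma::Row<size_t> fits here. A rough sketch of how the labels could be drawn is below. It is standard-library C++ only, and SimulateLogisticLabels, its flipRate parameter, and the choice to model contamination as label flipping are all assumptions made for illustration; the proposal does not specify them.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Logistic (sigmoid) function: maps a linear predictor to a probability.
inline double Sigmoid(const double z)
{
  return 1.0 / (1.0 + std::exp(-z));
}

// Hypothetical helper: draw one 0/1 label per observation with
// P(label = 1) = Sigmoid(intercept + x . beta). Each label is then flipped
// with probability `flipRate` to mimic contamination.
inline std::vector<std::size_t> SimulateLogisticLabels(
    const std::vector<std::vector<double>>& design,
    const std::vector<double>& coefficients,
    const double intercept,
    const double flipRate,
    std::mt19937& rng)
{
  assert(flipRate >= 0.0 && flipRate <= 1.0);
  std::bernoulli_distribution flip(flipRate);

  std::vector<std::size_t> labels;
  labels.reserve(design.size());
  for (const std::vector<double>& x : design)
  {
    assert(x.size() == coefficients.size());
    double z = intercept;
    for (std::size_t j = 0; j < x.size(); ++j)
      z += x[j] * coefficients[j];

    // Bernoulli draw with success probability given by the sigmoid.
    std::bernoulli_distribution label(Sigmoid(z));
    std::size_t y = label(rng) ? 1 : 0;
    if (flip(rng))
      y = 1 - y;
    labels.push_back(y);
  }
  return labels;
}
```

The design matrix and coefficients could come from the same kind of generator as in the linear case; only the response step differs.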

Open Questions

Are there additional parameters or features that should be considered for these simulation classes?
How can we ensure that the simulation classes are flexible enough for various use cases?

Sample data generators


rcurtin · 2023-11-24T16:58:00Z

Nice, I think this could make for more compelling examples than "generate random uniform data"! 👍

It's worth pointing out that mlpack already has a number of distribution-like classes: GaussianDistribution, GammaDistribution, LaplaceDistribution, DiscreteDistribution, and so forth. (See src/mlpack/core/dists/.) Now, it would be cool to generate data directly from one of these distribution classes, but there are some issues. Those distribution classes are typically aimed at (1) generating random samples via Random(), and (2) evaluating probabilities via Probability(). That second function is totally irrelevant here, since we just want to generate datasets. Even the signature of (1) is not quite right, as for existing distributions it just generates a single point.

So, certainly some additional infrastructure is necessary to generate labeled synthetic datasets, but I do think that whatever we write should be "aware" of the distribution code and make use of it when possible in the implementation (and add new distributions as needed).
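One way the single-point limitation could be bridged is a small adapter that calls Random() repeatedly. The sketch below is a guess at the shape of such a helper, not existing mlpack code: ToyGaussian is a stand-in written with the standard library that only mimics the "Random() returns one point" interface, and RandomBatch is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Stand-in for a distribution class whose Random() returns a single point.
// This is NOT mlpack's GaussianDistribution; it only mimics that interface.
class ToyGaussian
{
 public:
  ToyGaussian(const double mean, const double stdDev, const unsigned int seed)
      : rng(seed)
  {
    assert(stdDev > 0.0);
    dist = std::normal_distribution<double>(mean, stdDev);
  }

  // Draw one sample.
  double Random() { return dist(rng); }

 private:
  std::mt19937 rng;
  std::normal_distribution<double> dist;
};

// Hypothetical adapter: build an n-point sample from any type that exposes a
// single-point Random() returning something convertible to double.
template<typename DistributionType>
std::vector<double> RandomBatch(DistributionType& d, const std::size_t n)
{
  std::vector<double> points;
  points.reserve(n);
  for (std::size_t i = 0; i < n; ++i)
    points.push_back(d.Random());
  return points;
}
```

Because the adapter is templated on the distribution type, the same loop would work for any class with a compatible Random(), which is one way the new code could stay "aware" of the existing distributions without changing them.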

A minor pedantic thought is that after #3269, pretty much everything in mlpack is directly in the mlpack:: namespace for convenience (with the exception of a couple things in util:: and a couple things in data::). So, I'd personally prefer to avoid a simulate:: namespace.

At least personally I wouldn't worry about Open Question (2) too much; I think if we provide something relatively barebones at first, it will get immediately used in the documentation, and that's probably good enough for now.

mlpack-bot · 2023-12-24T17:12:11Z

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍

coatless · 2023-12-24T23:45:42Z

Sounds like an expansion for distributions is in order to handle multi-point generation. With respect to Random(), is this using a poorly spec'd PRNG?

On the note of namespaces, maybe this should go under util:: or where the train/test split is found?

rcurtin · 2024-01-03T15:02:12Z

> Sounds like an expansion for distributions is in order to handle multi-point generation.

Possibly, it would be great to keep things unified, but if it doesn't make sense (or if the amount of work for adapting older distributions is not feasible), in my view it's okay to keep them different.

> With respect to random() is this using a poorly spec'd PRNG?

It uses std::mt19937, not sure if that qualifies as "poor" (I am not an RNG expert).
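For what it's worth, the raw output of std::mt19937 is fully specified by the C++ standard, so a fixed seed gives the same stream everywhere. A small sketch of checking that is below; RawDraws is a hypothetical helper name, and nothing here is mlpack code.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Return the first n raw outputs of std::mt19937 for a given seed.
inline std::vector<std::mt19937::result_type> RawDraws(
    const std::mt19937::result_type seed,
    const std::size_t n)
{
  assert(n > 0);
  std::mt19937 rng(seed);
  std::vector<std::mt19937::result_type> out(n);
  for (std::mt19937::result_type& value : out)
    value = rng();
  return out;
}
```

Two calls with the same seed return identical vectors, and the standard requires the 10000th output of a default-seeded std::mt19937 to be 4123659995. The distribution adaptors such as std::normal_distribution are not pinned down the same way, so simulated datasets built on them may differ between standard library implementations even with the same seed.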

> On the note of namespaces, maybe this should go under util:: or where the train/test split is found?

I really think a flat namespace is fine, since there aren't really going to be any naming conflicts, but Split() is in the data:: namespace (as is Load() and Save()), and I suppose we could use that too. util:: is primarily for internal mlpack tooling, but this would be user-facing.

arthiondaena · 2024-01-30T15:55:37Z

@rcurtin I am trying to find a good beginner's issue. Do you think this feature request could be implemented by a beginner as a way to learn about mlpack?

zoq · 2024-02-22T17:16:33Z

Some of this can be a great way to jump into the codebase.

mlpack-bot · 2024-03-23T17:45:36Z

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍

coatless · 2024-03-30T19:06:19Z

There is an active PR, #3647, for the regression case.

mlpack-bot · 2024-04-29T19:45:43Z

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍

mlpack-bot bot added s: unanswered s: unlabeled labels Nov 11, 2023

rcurtin added t: feature request and removed s: unlabeled labels Nov 13, 2023

mlpack-bot bot added the s: stale label Dec 24, 2023

coatless added s: answered and removed s: stale s: unanswered labels Dec 24, 2023

Ali-Hossam mentioned this issue Mar 1, 2024

Synthetic Regression Dataset Generator #3647

Closed

mlpack-bot bot added the s: stale label Mar 23, 2024

mlpack-bot bot closed this as completed Mar 30, 2024

coatless reopened this Mar 30, 2024

mlpack-bot bot removed the s: stale label Mar 30, 2024

mlpack-bot bot added the s: stale label Apr 29, 2024

mlpack-bot bot closed this as completed May 6, 2024
