Proposal for New Supervised Learning Data Simulation Classes in C++ for MLPACK Library #3559

Closed
coatless opened this issue Nov 11, 2023 · 9 comments

Comments

@coatless
Contributor

What is the desired addition or change?

This RFC proposes the addition of new supervised learning data simulation classes to the MLPACK library in C++. The objective is to extend the library's capabilities by introducing classes specifically designed for generating synthetic datasets suitable for testing linear regression and logistic regression models. These simulation classes will offer flexibility in configuring key parameters, including:

  • Number of observations
  • Number of parameters
  • Type of error term
  • Amount of sparsity
  • Amount of outliers (contamination)

What is the motivation for this feature?

The ability to generate synthetic datasets tailored for supervised learning scenarios is essential for robust model testing. Linear regression and logistic regression are fundamental techniques in this domain, and having dedicated simulation classes will enhance MLPACK's utility for researchers and practitioners.

If applicable, describe how this feature would be implemented.

Two distinct simulation classes, one for linear regression and one for logistic regression, will be implemented in C++ and integrated into the MLPACK library. Users will be able to instantiate these classes and set specific parameters to generate synthetic datasets for testing their models.
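To make the generative model behind the linear regression class concrete, here is a minimal standard-library sketch. It is not mlpack code and is not the proposed class. `LinearSimResult` and `SimulateLinear` are hypothetical names. It also assumes that "sparsity" means the fraction of true coefficients set to zero, and that "contamination" means the fraction of responses shifted by a large fixed offset.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical result type (not part of mlpack).
struct LinearSimResult
{
  std::vector<std::vector<double>> design;  // one inner vector per observation
  std::vector<double> responses;            // continuous targets
  std::vector<double> coefficients;         // true coefficients, zeros included
  std::vector<bool> isOutlier;              // which responses were contaminated
};

// Simulate y = intercept + x . beta + noise.
// Assumptions: `sparsity` is the fraction of zero coefficients and
// `contamination` is the fraction of responses shifted by a large offset.
// `noiseStdDev` must be positive.
inline LinearSimResult SimulateLinear(std::size_t observations,
                                      std::size_t parameters,
                                      double sparsity,
                                      double contamination,
                                      double noiseStdDev,
                                      unsigned seed)
{
  std::mt19937 rng(seed);
  std::normal_distribution<double> feature(0.0, 1.0);
  std::normal_distribution<double> noise(0.0, noiseStdDev);
  std::uniform_real_distribution<double> magnitude(1.0, 2.0);

  LinearSimResult out;

  // Zero out a fixed share of coefficients, then shuffle their positions.
  const std::size_t zeros = std::min(parameters,
      static_cast<std::size_t>(std::llround(sparsity * parameters)));
  out.coefficients.resize(parameters);
  for (std::size_t j = 0; j < parameters; ++j)
    out.coefficients[j] = (j < zeros) ? 0.0 : magnitude(rng);
  std::shuffle(out.coefficients.begin(), out.coefficients.end(), rng);

  // Pick which observations become outliers.
  const std::size_t outliers = std::min(observations,
      static_cast<std::size_t>(std::llround(contamination * observations)));
  std::vector<std::size_t> order(observations);
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::shuffle(order.begin(), order.end(), rng);
  out.isOutlier.assign(observations, false);
  for (std::size_t k = 0; k < outliers; ++k)
    out.isOutlier[order[k]] = true;

  // Build the design matrix and the responses.
  const double intercept = 1.0;
  out.design.assign(observations, std::vector<double>(parameters));
  out.responses.resize(observations);
  for (std::size_t i = 0; i < observations; ++i)
  {
    double y = intercept;
    for (std::size_t j = 0; j < parameters; ++j)
    {
      out.design[i][j] = feature(rng);
      y += out.design[i][j] * out.coefficients[j];
    }
    y += noise(rng);
    if (out.isOutlier[i])
      y += 20.0 * noiseStdDev;  // arbitrary large shift for contaminated rows
    out.responses[i] = y;
  }
  return out;
}
```

A call such as `SimulateLinear(1000, 5, 0.2, 0.05, 2.0, 42u)` would yield 1000 observations with one zero coefficient out of five and 50 shifted responses. A real implementation would presumably return Armadillo types instead of nested vectors.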

Example Usage

Linear Regression Simulation:

#include <mlpack/core.hpp>
#include <mlpack/methods/simulate_data/linear_regression_simulation.hpp>

// Create a linear regression simulation object
mlpack::simulate::LinearRegressionSimulation linearSim;

// Set simulation parameters
linearSim.Observations(1000);
linearSim.Parameters(5);
linearSim.Intercept(true);
linearSim.ErrorTerm(mlpack::simulate::Gaussian(10, 2)); 
linearSim.Sparsity(0.1);
linearSim.Contamination(0.05);

// Generate synthetic linear regression dataset
arma::mat design;
arma::rowvec responses; // continuous responses for linear regression
linearSim.Generate(design, responses);

// Use the generated dataset for model testing

Logistic Regression Simulation:

#include <mlpack/core.hpp>
#include <mlpack/methods/simulate_data/logistic_regression_simulation.hpp>

// Create a logistic regression simulation object
mlpack::simulate::LogisticRegressionSimulation logisticSim;

// Set simulation parameters
logisticSim.Observations(1000);
logisticSim.Intercept(true);
logisticSim.Parameters(5);
logisticSim.ErrorTerm(mlpack::simulate::Binomial); 
logisticSim.Sparsity(0.2);
logisticSim.Contamination(0.1);

// Generate synthetic logistic regression dataset
arma::mat design;
arma::Row<size_t> responses;
logisticSim.Generate(design, responses);

// Use the generated dataset for model testing
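For the logistic case, the underlying data-generating process can be sketched with the standard library alone. As before, this is not mlpack code. `LogisticSimResult` and `SimulateLogistic` are hypothetical names, and treating "contamination" as the probability of flipping a label is an assumption made only for this sketch.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Logistic function mapping a linear predictor to a probability.
inline double Sigmoid(double z)
{
  return 1.0 / (1.0 + std::exp(-z));
}

// Hypothetical result type (not part of mlpack).
struct LogisticSimResult
{
  std::vector<std::vector<double>> design;  // one inner vector per observation
  std::vector<std::size_t> responses;       // class labels, 0 or 1
};

// Draw labels y ~ Bernoulli(Sigmoid(intercept + x . beta)).
// Assumption: `contamination` is the probability that a label is flipped.
inline LogisticSimResult SimulateLogistic(
    std::size_t observations,
    const std::vector<double>& coefficients,
    double intercept,
    double contamination,
    unsigned seed)
{
  std::mt19937 rng(seed);
  std::normal_distribution<double> feature(0.0, 1.0);
  std::bernoulli_distribution flip(contamination);

  LogisticSimResult out;
  out.design.assign(observations,
                    std::vector<double>(coefficients.size()));
  out.responses.resize(observations);

  for (std::size_t i = 0; i < observations; ++i)
  {
    double z = intercept;
    for (std::size_t j = 0; j < coefficients.size(); ++j)
    {
      out.design[i][j] = feature(rng);
      z += out.design[i][j] * coefficients[j];
    }

    // The label is a Bernoulli draw with success probability Sigmoid(z).
    std::bernoulli_distribution label(Sigmoid(z));
    std::size_t y = label(rng) ? 1 : 0;
    if (flip(rng))
      y = 1 - y;  // contaminated observation: flip the label
    out.responses[i] = y;
  }
  return out;
}
```

The randomness in the labels comes from the Bernoulli draw itself, so no separate additive error term appears in this sketch, unlike the linear case.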

Open Questions

  1. Are there additional parameters or features that should be considered for these simulation classes?
  2. How can we ensure that the simulation classes are flexible enough for various use cases?

Sample data generators

@rcurtin
Member

rcurtin commented Nov 24, 2023

Nice, I think this could make for more compelling examples than "generate random uniform data"! 👍

It's worth pointing out that mlpack already has a number of distribution-like classes: GaussianDistribution, GammaDistribution, LaplaceDistribution, DiscreteDistribution, and so forth. (See src/mlpack/core/dists/.) Now, it would be cool to generate data directly from one of these distribution classes, but there are some issues: those distribution classes are typically aimed at (1) generating random samples via Random(), and (2) evaluating probabilities via Probability(), but that second function is totally irrelevant here---we just want to generate datasets. Even the signature of (1) is not quite right, as for existing distributions it just generates a single point.

So, certainly some additional infrastructure is necessary to generate labeled synthetic datasets, but I do think that whatever we write should be "aware" of the distribution code and make use of it when possible in the implementation (and add new distributions as needed).
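One way to bridge a single-point `Random()` and dataset generation is a thin helper that calls it repeatedly. The sketch below is an illustration only. `StandInGaussian` is a hypothetical stand-in, not mlpack's `GaussianDistribution`, and `SampleMany` is a hypothetical helper name. The stand-in owns its own generator, so its `Random()` is non-const, which may differ from the real classes.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical stand-in for a distribution class with a single-point Random().
class StandInGaussian
{
 public:
  StandInGaussian(std::size_t dim, unsigned seed)
    : dimension(dim), rng(seed), normal(0.0, 1.0) { }

  // Draw one point with independent standard normal coordinates.
  std::vector<double> Random()
  {
    std::vector<double> point(dimension);
    for (double& v : point)
      v = normal(rng);
    return point;
  }

 private:
  std::size_t dimension;
  std::mt19937 rng;
  std::normal_distribution<double> normal;
};

// Hypothetical helper: collect n points from any type exposing Random().
template<typename DistributionType>
auto SampleMany(DistributionType& dist, std::size_t n)
    -> std::vector<decltype(dist.Random())>
{
  std::vector<decltype(dist.Random())> points;
  points.reserve(n);
  for (std::size_t i = 0; i < n; ++i)
    points.push_back(dist.Random());
  return points;
}
```

Because the helper only requires a `Random()` member, the same wrapper could in principle sit on top of any existing distribution without changing its interface. Whether that is preferable to adding a multi-point overload to each distribution is a design question left open here.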

A minor pedantic thought is that after #3269, pretty much everything in mlpack is directly in the mlpack:: namespace for convenience (with the exception of a couple things in util:: and a couple things in data::). So, I'd personally prefer to avoid a simulate:: namespace.

At least personally I wouldn't worry about Open Question (2) too much; I think if we provide something relatively barebones at first, it will get immediately used in the documentation, and that's probably good enough for now.

mlpack-bot bot commented Dec 24, 2023

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍

@coatless
Contributor Author

Sounds like an expansion for distributions is in order to handle multi-point generation. With respect to random() is this using a poorly spec'd PRNG?

On the note of namespaces, maybe this should go under util:: or where the train/test split is found?

@rcurtin
Member

rcurtin commented Jan 3, 2024

> Sounds like an expansion for distributions is in order to handle multi-point generation.

Possibly, it would be great to keep things unified, but if it doesn't make sense (or if the amount of work for adapting older distributions is not feasible), in my view it's okay to keep them different.

> With respect to random() is this using a poorly spec'd PRNG?

It uses std::mt19937, not sure if that qualifies as "poor" (I am not an RNG expert).
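For context on how well specified `std::mt19937` is: the C++ standard fixes the engine's output sequence for a given seed, so raw draws are reproducible across conforming implementations. The algorithms behind distribution adaptors such as `std::normal_distribution` are not fixed, so transformed samples may differ between standard libraries. A small sketch of the engine-level guarantee, where `RawDraws` is a hypothetical helper name:

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Return the first n raw 32-bit outputs of std::mt19937 for a given seed.
inline std::vector<std::uint32_t> RawDraws(std::uint32_t seed, std::size_t n)
{
  std::mt19937 engine(seed);
  std::vector<std::uint32_t> draws(n);
  for (std::uint32_t& d : draws)
    d = static_cast<std::uint32_t>(engine());
  return draws;
}
```

The standard requires that the 10000th output of a default-constructed `std::mt19937` (default seed 5489) is 4123659995, so `RawDraws(5489u, 10000).back()` should return that value on any conforming implementation.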

> On the note of namespaces, maybe this should go under util:: or where the train/test split is found?

I really think a flat namespace is fine, since there aren't really going to be any naming conflicts, but Split() is in the data:: namespace (as is Load() and Save()), and I suppose we could use that too. util:: is primarily for internal mlpack tooling, but this would be user-facing.

@arthiondaena

@rcurtin I am trying to find a good beginner's issue. Do you think this feature request could be implemented by a beginner as a way to learn about mlpack?

@zoq
Member

zoq commented Feb 22, 2024

Some of this can be a great way to jump into the codebase.

mlpack-bot bot commented Mar 23, 2024

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍

@mlpack-bot mlpack-bot bot added the s: stale label Mar 23, 2024
@mlpack-bot mlpack-bot bot closed this as completed Mar 30, 2024
@coatless coatless reopened this Mar 30, 2024
@mlpack-bot mlpack-bot bot removed the s: stale label Mar 30, 2024
@coatless
Contributor Author

There is an active PR, #3647, for the regression case.

mlpack-bot bot commented Apr 29, 2024

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍

@mlpack-bot mlpack-bot bot added the s: stale label Apr 29, 2024
@mlpack-bot mlpack-bot bot closed this as completed May 6, 2024