Synthetic Regression Dataset Generator #3647

Ali-Hossam · 2024-03-01T20:11:50Z

Description

This pull request introduces a new feature to mlpack: a synthetic regression dataset generator. The generator allows users to create datasets with customizable parameters.

The implementation is inspired by scikit-learn's make_regression functionality, but includes additional enhancements such as the ability to introduce outliers and support for multiple error distribution types, including gamma and normal distributions.

Changes Made

Added RegressionDataGenerator class with the following parameters:
- nSamples: Number of samples.
- nFeatures: Number of features.
- ErrorParams: Error distribution parameters (support for both normal and gamma distributions).
- nTargets: Number of target variables (multiple targets are supported).
- intercept: Bias term.
- sparsity: Sparsity level of coefficients.
- outliersFraction: Fraction of samples with outliers.
- outliersScale: Scale of outliers.
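
To make the roles of these parameters concrete, here is a minimal sketch of how such a generator could combine them. It is not the code from this PR: it uses only the standard library instead of Armadillo, handles a single target with zero-mean normal noise, and the names `RegressionSketch` and `MakeRegressionSketch` are hypothetical. How `sparsity` and the outlier parameters are actually applied is not stated in the description, so both interpretations below are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical result type for this sketch; the PR writes into matrices.
struct RegressionSketch
{
  std::vector<double> X; // nSamples x nFeatures, row-major.
  std::vector<double> w; // One coefficient per feature.
  std::vector<double> y; // One target per sample.
};

// Illustrative only: one target, zero-mean normal noise, standard-library RNG.
inline RegressionSketch MakeRegressionSketch(const std::size_t nSamples,
                                             const std::size_t nFeatures,
                                             const double intercept,
                                             const double sparsity,
                                             const double outliersFraction,
                                             const double outliersScale,
                                             const double noiseStd,
                                             const unsigned int seed)
{
  assert(nSamples > 0 && nFeatures > 0);
  assert(sparsity >= 0.0 && sparsity < 1.0);
  assert(outliersFraction >= 0.0 && outliersFraction < 1.0);

  std::mt19937 rng(seed);
  std::normal_distribution<double> stdNormal(0.0, 1.0);

  RegressionSketch d;
  d.X.resize(nSamples * nFeatures);
  d.w.resize(nFeatures);
  d.y.resize(nSamples);

  // Random features and random coefficients.
  for (double& x : d.X)
    x = stdNormal(rng);
  for (double& c : d.w)
    c = stdNormal(rng);

  // Assumption: `sparsity` is the fraction of coefficients forced to zero.
  const std::size_t nZero = static_cast<std::size_t>(sparsity * nFeatures);
  for (std::size_t j = 0; j < nZero; ++j)
    d.w[j] = 0.0;

  // y_i = intercept + x_i . w + noise_i.
  for (std::size_t i = 0; i < nSamples; ++i)
  {
    double value = intercept;
    for (std::size_t j = 0; j < nFeatures; ++j)
      value += d.X[i * nFeatures + j] * d.w[j];
    d.y[i] = value + noiseStd * stdNormal(rng);
  }

  // Assumption: outliers are targets multiplied by `outliersScale`.
  const std::size_t nOutliers =
      static_cast<std::size_t>(outliersFraction * nSamples);
  for (std::size_t i = 0; i < nOutliers; ++i)
    d.y[i] *= outliersScale;

  return d;
}
```

Calling `MakeRegressionSketch(100, 1, 0.0, 0.0, 0.1, 10.0, 4.0, 42u)` would give 100 one-dimensional samples in which the first 10 targets are scaled by 10. Zeroing the leading coefficients and scaling the leading samples keeps the sketch deterministic; a real generator would more likely pick those indices at random.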

Related Issue

#3559

Usage Example

// Example usage of RegressionDataGenerator
#include <mlpack/core/data/random_regression_generator.hpp>
using namespace mlpack::data;

int main()
{
  arma::mat X;
  arma::mat y;
  const int nSamples = 100;
  const int nFeatures = 1;

  // Normal error distribution with mean 3.0 and standard deviation 4.0.
  ErrorParams normalError(ErrorType::NormalDist);
  normalError.normalParams = NormalDistParams(3.0, 4.0);

  RegressionDataGenerator generator(nSamples, nFeatures, normalError);
  generator.GenerateData(X, y);

  // ... rest of your code ...
}

Screenshots

Here is an example of the generated data (plotted in Python).

mlpack-bot · 2024-03-01T20:11:53Z

Thanks for opening your first pull request in this repository! Someone will review it when they have a chance. In the meantime, please be sure that you've handled the following things, to make the review process quicker and easier:

All code should follow the style guide
Documentation added for any new functionality
Tests added for any new functionality
Tests that are added follow the testing guide
Headers and license information added to the top of any new code files
HISTORY.md updated if the changes are big or user-facing
All CI checks should be passing

Thank you again for your contributions! 👍

shrit · 2024-03-05T12:01:19Z

src/mlpack/core/data/random_regression_generator.hpp

+    /// Generate noise based on the specified distribution
+    if (errParams.type == ErrorType::NormalDist) {
+        error = errParams.normalParams.mu + 
+                  errParams.normalParams.std * arma::randn(y.n_rows, y.n_cols);
+    }
+    else if (errParams.type == ErrorType::GammaDist) {
+      error = arma::randg(y.n_rows, y.n_cols, arma::distr_param(
+                errParams.gammaParams.alpha, errParams.gammaParams.beta));
+    }
+    else
+    {
+        throw std::invalid_argument("Invalid error distribution type. "


Please fix the style so that it exactly matches what mlpack uses.

Check other source files for the style; we also have a style guide in the GitHub wiki.

Please do the same for the test files.

Also add tests for float numbers (arma::fmat); check the other tests to see how this is done.
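
As a rough illustration of what a float-aware check could look like, the sketch below draws normal noise in a templated element type and compares the sample mean with the requested mean. It uses the standard library rather than Armadillo, and `NormalNoise` and `SampleMean` are hypothetical helpers, not part of the PR. In mlpack's Catch2 tests the same idea would presumably be written as a test templated over arma::mat and arma::fmat (for example with Catch2's TEMPLATE_TEST_CASE).

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <type_traits>
#include <vector>

// Hypothetical helper: draws `n` values from a normal distribution with mean
// `mu` and standard deviation `sigma`, in the requested element type.
template<typename ElemType>
std::vector<ElemType> NormalNoise(const std::size_t n,
                                  const ElemType mu,
                                  const ElemType sigma,
                                  const unsigned int seed)
{
  static_assert(std::is_floating_point<ElemType>::value,
      "ElemType must be a floating-point type");

  std::mt19937 rng(seed);
  std::normal_distribution<ElemType> dist(mu, sigma);

  std::vector<ElemType> noise(n);
  for (ElemType& v : noise)
    v = dist(rng);
  return noise;
}

// Sample mean, accumulated in double so the float case keeps its precision.
template<typename ElemType>
double SampleMean(const std::vector<ElemType>& values)
{
  assert(!values.empty());

  double sum = 0.0;
  for (const ElemType v : values)
    sum += static_cast<double>(v);
  return sum / static_cast<double>(values.size());
}
```

A test would then call `NormalNoise<float>(...)` and `NormalNoise<double>(...)` with the same parameters and require both sample means to fall within a tolerance of `mu` that is generous relative to `sigma / sqrt(n)`.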

shrit · 2024-03-05T12:03:46Z

src/mlpack/core/data/random_regression_generator.hpp

+class RegressionDataGenerator 
+{
+ public:
+  /**


Use a generic matrix type instead of arma::mat.
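
A sketch of what "generic matrix type" could mean here: the generator is templated on `MatType` and takes its element type from `MatType::elem_type`, so the same code serves double and float matrices. To keep the example free of external dependencies, `TinyMat` is a hypothetical stand-in that only mimics the Armadillo members used (`elem_type`, `n_rows`, `n_cols`, `zeros(rows, cols)` and `(row, col)` access); `GenerateDataSketch` is likewise a made-up name. The one-sample-per-column layout follows mlpack's usual convention and may not match the layout used in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical stand-in for an Armadillo matrix: column-major storage plus
// the few members the function below relies on.
template<typename T>
struct TinyMat
{
  typedef T elem_type;

  std::size_t n_rows = 0;
  std::size_t n_cols = 0;
  std::vector<T> mem;

  void zeros(const std::size_t rows, const std::size_t cols)
  {
    n_rows = rows;
    n_cols = cols;
    mem.assign(rows * cols, T(0));
  }

  T& operator()(const std::size_t r, const std::size_t c)
  {
    assert(r < n_rows && c < n_cols);
    return mem[c * n_rows + r];
  }

  const T& operator()(const std::size_t r, const std::size_t c) const
  {
    assert(r < n_rows && c < n_cols);
    return mem[c * n_rows + r];
  }
};

// Generic over MatType: the element type comes from the matrix itself rather
// than being hard-coded to double.
template<typename MatType>
void GenerateDataSketch(MatType& X,
                        MatType& y,
                        const std::size_t nSamples,
                        const std::size_t nFeatures,
                        const unsigned int seed)
{
  typedef typename MatType::elem_type ElemType;

  std::mt19937 rng(seed);
  std::normal_distribution<ElemType> dist(ElemType(0), ElemType(1));

  // One sample per column (assumed layout).
  X.zeros(nFeatures, nSamples);
  y.zeros(1, nSamples);

  for (std::size_t i = 0; i < nSamples; ++i)
  {
    for (std::size_t j = 0; j < nFeatures; ++j)
    {
      X(j, i) = dist(rng);
      // All coefficients are fixed to 1 to keep the sketch short.
      y(0, i) += X(j, i);
    }
  }
}
```

With this shape, `GenerateDataSketch(Xf, yf, 50, 3, 7u)` compiles for `TinyMat<float>` and `TinyMat<double>` alike, which is the property the float tests requested above would exercise.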

shrit · 2024-03-05T12:06:57Z

src/mlpack/core/data/random_regression_generator.hpp

+ */
+enum class ErrorType
+{
+  NormalDist, /**< Normal distribution.  */
+  GammaDist   /**< Gamma distribution. */
+};
+
+struct NormalDistParams
+{
+  double mu;  /**< Mean. */
+  double std; /**< Standard deviation. */
+
+  NormalDistParams(double mu, double std) : mu(mu), std(std) {}
+};
+
+struct GammaDistParams
+{
+  double alpha;
+  double beta; 
+
+  GammaDistParams(double alpha, double beta) : alpha(alpha), beta(beta) {}
+};
+
+struct ErrorParams
+{
+  ErrorType type; /**< Type of error distribution. */
+
+  union
+  {
+    NormalDistParams normalParams; /**< Parameters for normal distribution. */
+    GammaDistParams gammaParams;   /**< Parameters for gamma distribution. */
+  };
+
+  ErrorParams(ErrorType type) : type(type) {};
+};


None of these are required, since Armadillo already handles all of this. Why create structs that do not do anything?

Ali-Hossam · 2024-03-05

Hi @shrit,
Thank you for your feedback.
I thought about adding more error distribution options in the future, such as Laplace and discrete distributions. However, apart from the current method, I couldn't find a straightforward way for users to pass only the parameters needed for their chosen distribution (mu and std for the normal distribution, alpha and beta for the gamma distribution).

I also considered letting users pass an array with parameters for each distribution, using an enumeration for clarity:

enum ErrorType
{
  Normal,
  Gamma,
  // ...
};

int main()
{
  // Normal distribution: mu -> params[0], std -> params[1].
  double normalParams[] = {2.0, 3.0};
  RegressionDataGenerator normalGenerator(100, 1, ErrorType::Normal,
      normalParams);

  // Gamma distribution: alpha -> params[0], beta -> params[1].
  double gammaParams[] = {2.0, 3.0};
  RegressionDataGenerator gammaGenerator(100, 1, ErrorType::Gamma,
      gammaParams);

  // Additional distributions
  // ...
}

But I felt that the first method is more user-friendly. What do you think?

shrit · 2024-03-21

You can just use template parameters and overload the constructors. Look at other parts of the codebase; the tree classes, for example, are a good model for what you are trying to achieve.

Ali-Hossam · 2024-03-31

Kindly recheck when possible.
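
For readers following this thread, the suggestion of template parameters plus overloaded constructors could look roughly like the sketch below. It is an assumption about the intended shape, not the code that was later pushed: the noise distribution is a template parameter (standard-library distributions stand in for mlpack's own), and `NoiseSourceSketch` is a hypothetical name. No enum, union or per-distribution parameter struct is needed, because each distribution object already carries its own parameters.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical class: the noise distribution is a template parameter.
template<typename NoiseDist = std::normal_distribution<double>>
class NoiseSourceSketch
{
 public:
  typedef typename NoiseDist::result_type ElemType;

  // Overload 1: use the distribution's default parameters.
  explicit NoiseSourceSketch(const unsigned int seed) :
      rng(seed), noise()
  { }

  // Overload 2: the caller passes an already-configured distribution.
  NoiseSourceSketch(const unsigned int seed, const NoiseDist& noise) :
      rng(seed), noise(noise)
  { }

  // Draw `n` noise values from the configured distribution.
  std::vector<ElemType> Sample(const std::size_t n)
  {
    std::vector<ElemType> out(n);
    for (ElemType& v : out)
      v = noise(rng);
    return out;
  }

 private:
  std::mt19937 rng;
  NoiseDist noise;
};
```

Under these assumptions, gamma noise with shape 2 and scale 3 is requested as `NoiseSourceSketch<std::gamma_distribution<double>> source(3u, std::gamma_distribution<double>(2.0, 3.0));`, and adding a new distribution requires no change to the class itself.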

shrit · 2024-03-05T12:07:50Z

src/mlpack/core/data/random_regression_generator.hpp

+  void ValidateParameters() const 
+  {
+      // Validation logic for input parameters
+      if (nSamples <= 0 || nFeatures <= 0 || nTargets <= 0) 
+      {
+          throw std::invalid_argument("Invalid input: nSamples, nFeatures,"
+               "and nTargets must be positive.");
+      }
+
+      if (sparsity < 0 || sparsity >= 1) 
+      {
+          throw std::invalid_argument("Invalid input: sparsity must be in" 
+                "the range [0, 1).");
+      }
+
+      if (outliersFraction < 0 || outliersFraction >= 1) 
+      {
+          throw std::invalid_argument("Invalid input: outliersFraction must be"
+                 "in the range [0, 1).");
+      }
+  }


Why is this here? There are also style issues.
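
Independently of whether the validation belongs in this class, note that in the hunk as quoted the adjacent string literals are concatenated without a separating space, so the messages would read "nFeatures,and", "inthe range" and "must bein the range". A minimal corrected sketch follows. It is written as a free function so that it stands alone; `ValidateParametersSketch` and `ValidationError` are hypothetical names, and the parameter types are assumptions.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical free-function version of the checks in the hunk above, with
// the spaces between adjacent string literals restored.
inline void ValidateParametersSketch(const long nSamples,
                                     const long nFeatures,
                                     const long nTargets,
                                     const double sparsity,
                                     const double outliersFraction)
{
  if (nSamples <= 0 || nFeatures <= 0 || nTargets <= 0)
  {
    throw std::invalid_argument("Invalid input: nSamples, nFeatures, "
        "and nTargets must be positive.");
  }

  if (sparsity < 0.0 || sparsity >= 1.0)
  {
    throw std::invalid_argument("Invalid input: sparsity must be in "
        "the range [0, 1).");
  }

  if (outliersFraction < 0.0 || outliersFraction >= 1.0)
  {
    throw std::invalid_argument("Invalid input: outliersFraction must be "
        "in the range [0, 1).");
  }
}

// Returns the error message for the given parameters, or an empty string if
// they pass validation. Convenient for checking the exact wording in tests.
inline std::string ValidationError(const long nSamples,
                                   const long nFeatures,
                                   const long nTargets,
                                   const double sparsity,
                                   const double outliersFraction)
{
  try
  {
    ValidateParametersSketch(nSamples, nFeatures, nTargets, sparsity,
        outliersFraction);
  }
  catch (const std::invalid_argument& e)
  {
    return e.what();
  }
  return std::string();
}
```

Keeping the trailing space at the end of the first literal of each pair makes a missing space easy to spot in review.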

shrit · 2024-03-05T12:08:00Z

src/mlpack/tests/random_regression_generator_test.cpp

+TEST_CASE("Generate linear regression data - Valid Inputs",
+  "[regressionGenerator]") {
+  arma::mat X, y;


Style issues here as well.

shrit · 2024-03-05T12:09:38Z

src/mlpack/core/data/random_regression_generator.hpp

+ * Implementation of a regression data generator with random features and error 
+ * distribution.
+ *


I am not sure about this. It is not a method; it is more of a simulation, so I don't know whether we need to add it to mlpack. It is nice to have, but I do not know where it would go. @rcurtin would be a better judge than me of whether this class is worth including.

mlpack-bot · 2024-04-30T01:45:39Z

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍

Ali-Hossam added 2 commits March 1, 2024 16:31

Added synthetic regression dataset generator function that can be use…

d17709f

…d to test lienar regression models.

added define

0954b29

mlpack-bot bot added the s: needs review, s: unanswered, and s: unlabeled labels Mar 1, 2024

Ali-Hossam marked this pull request as ready for review March 1, 2024 20:14

shrit reviewed Mar 5, 2024

Ali-Hossam and others added 2 commits March 7, 2024 16:46

FIxed style, added generic matrix type and test cases.

a70d04a

Merge branch 'mlpack:master' into generate-synthetic-regression-dataset

a84d855

coatless mentioned this pull request Mar 30, 2024

Proposal for New Supervised Learning Data Simulation Classes in C++ for MLPACK Library #3559

Closed

Ali-Hossam added 3 commits March 31, 2024 03:18

Integerated with mlpack/core/dists, and add Noise distribution Template

64dfe6c

Update test cases

c1fa1d2

Add new module to data.hpp

6ef54e9

mlpack-bot bot added the s: stale label Apr 30, 2024

mlpack-bot bot closed this May 7, 2024
