Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Synthetic Regression Dataset Generator #3647

Conversation

Ali-Hossam
Copy link

@Ali-Hossam Ali-Hossam commented Mar 1, 2024

Description

This pull request introduces a new feature to mlpack: a synthetic regression dataset generator. The generator allows users to create datasets with customizable parameters.

The implementation is inspired by scikit-learn's make_regression functionality, but includes additional enhancements such as the ability to introduce outliers and support for multiple error distribution types, including gamma and normal distributions.

Changes Made

  • Added SyntheticRegressionGenerator class with the following parameters:
    • nSamples: Number of samples.
    • nFeatures: Number of features.
    • ErrorParams: Error distribution parameters (support for both normal and gamma distributions).
    • nTargets: Number of target variables (includes multi-target).
    • intercept: Bias term.
    • sparsity: Sparsity level of coefficients.
    • outliersFraction: Fraction of samples with outliers.
    • outliersScale: Scale of outliers.

Related Issue

#3559

Usage Example

// Example usage of SyntheticRegressionGenerator
#include <mlpack/core/data/random_regression_generator.hpp>
using namespace mlpack::data;

int main()
{
  arma::mat X;
  arma::mat y;
  int nSamples = 100;
  int nFeatures = 1;
  ErrorParams normalError(ErrorType::NormalDist);
  normalError.normalParams = NormalDistParams(3.0, 4.0);

  RegressionDataGenerator generator(100, 1, normalError);
  generator.GenerateData(X, y);

  // ... rest of your code ...
}

Screenshots

Here is an example of the generated data. (plotted in python)
regression6

Copy link

mlpack-bot bot commented Mar 1, 2024

Thanks for opening your first pull request in this repository! Someone will review it when they have a chance. In the mean time, please be sure that you've handled the following things, to make the review process quicker and easier:

  • All code should follow the style guide
  • Documentation added for any new functionality
  • Tests added for any new functionality
  • Tests that are added follow the testing guide
  • Headers and license information added to the top of any new code files
  • HISTORY.md updated if the changes are big or user-facing
  • All CI checks should be passing

Thank you again for your contributions! 👍

@Ali-Hossam Ali-Hossam marked this pull request as ready for review March 1, 2024 20:14
Comment on lines 133 to 144
/// Generate noise based on the specified distribution
if (errParams.type == ErrorType::NormalDist) {
error = errParams.normalParams.mu +
errParams.normalParams.std * arma::randn(y.n_rows, y.n_cols);
}
else if (errParams.type == ErrorType::GammaDist) {
error = arma::randg(y.n_rows, y.n_cols, arma::distr_param(
errParams.gammaParams.alpha, errParams.gammaParams.beta));
}
else
{
throw std::invalid_argument("Invalid error distribution type. "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Please fx the style so it is exactly the same as what mlpack is using.
  2. Check other source code files for the styles, also we have a style guide somewhere in the github wiki.
  3. Please also do the same on the test files,
  4. Also add tests for float numbers, arma::fmat, and check other tests for this.

Comment on lines +67 to +70
class RegressionDataGenerator
{
public:
/**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use a generic matrix type instead of arma::mat.

Comment on lines 25 to 59
*/
enum class ErrorType
{
NormalDist, /**< Normal distribution. */
GammaDist /**< Gamma distribution. */
};

struct NormalDistParams
{
double mu; /**< Mean. */
double std; /**< Standard deviation. */

NormalDistParams(double mu, double std) : mu(mu), std(std) {}
};

struct GammaDistParams
{
double alpha;
double beta;

GammaDistParams(double alpha, double beta) : alpha(alpha), beta(beta) {}
};

struct ErrorParams
{
ErrorType type; /**< Type of error distribution. */

union
{
NormalDistParams normalParams; /**< Parameters for normal distribution. */
GammaDistParams gammaParams; /**< Parameters for gamma distribution. */
};

ErrorParams(ErrorType type) : type(type) {};
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of these are not required, since armadillo handles all of these. Why then create empty structs that do not do anything?

Copy link
Author

@Ali-Hossam Ali-Hossam Mar 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @shrit
Thank you for your feedback,
I thought about adding more error distribution options in the future, like Laplace and Discrete distributions. However, I couldn't find a straightforward way for users to input only the needed parameters for their chosen distribution (like mu and std for normal distribution, alpha and beta for gamma distribution), apart from the current method.

I also considered letting users pass an array with parameters for each distribution, using an enumeration for clarity:

enum ErrorType {
    Normal, 
    Gamma,
    // ...
};

int main() {
    // Normal distribution - mu -> params[0], std -> params[1]
    double params[] = {2.0, 3.0};
    RegressionDataGenerator generator(100, 1, ErrorType::Normal, params);

    // Gamma distribution - alpha -> params[0], beta -> params[1]
    double params[] = {2.0, 3.0};
    RegressionDataGenerator generator(100, 1, ErrorType::Gamma, params);

    // Additional distributions
    // ...
}

But I felt that the first method is more user-friendly, what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just use template parameters and overload the constructors, look at other functions, for example tree classes can be a good example to inspire what are you trying to achieve

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kindly recheck when possible

Comment on lines 184 to 204
void ValidateParameters() const
{
// Validation logic for input parameters
if (nSamples <= 0 || nFeatures <= 0 || nTargets <= 0)
{
throw std::invalid_argument("Invalid input: nSamples, nFeatures,"
"and nTargets must be positive.");
}

if (sparsity < 0 || sparsity >= 1)
{
throw std::invalid_argument("Invalid input: sparsity must be in"
"the range [0, 1).");
}

if (outliersFraction < 0 || outliersFraction >= 1)
{
throw std::invalid_argument("Invalid input: outliersFraction must be"
"in the range [0, 1).");
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this one is here, also style issues

Comment on lines 7 to 9
TEST_CASE("Generate linear regression data - Valid Inputs",
"[regressionGenerator]") {
arma::mat X, y;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

styles issues as well

Comment on lines +5 to +7
* Implementation of a regression data generator with random features and error
* distribution.
*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure, this is not a method, it is more of a simulation, so I have no idea if we need to add this to mlpack or not, it is nice to have, but I do not know where we can add this, @rcurtin would know better than me to judge this class in itself and whether if it is worth it or not

Copy link

mlpack-bot bot commented Apr 30, 2024

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍

@mlpack-bot mlpack-bot bot added the s: stale label Apr 30, 2024
@mlpack-bot mlpack-bot bot closed this May 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants