New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Synthetic Regression Dataset Generator #3647
Synthetic Regression Dataset Generator #3647
Conversation
…d to test lienar regression models.
Thanks for opening your first pull request in this repository! Someone will review it when they have a chance. In the mean time, please be sure that you've handled the following things, to make the review process quicker and easier:
Thank you again for your contributions! 👍 |
/// Generate noise based on the specified distribution | ||
if (errParams.type == ErrorType::NormalDist) { | ||
error = errParams.normalParams.mu + | ||
errParams.normalParams.std * arma::randn(y.n_rows, y.n_cols); | ||
} | ||
else if (errParams.type == ErrorType::GammaDist) { | ||
error = arma::randg(y.n_rows, y.n_cols, arma::distr_param( | ||
errParams.gammaParams.alpha, errParams.gammaParams.beta)); | ||
} | ||
else | ||
{ | ||
throw std::invalid_argument("Invalid error distribution type. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Please fx the style so it is exactly the same as what mlpack is using.
- Check other source code files for the styles, also we have a style guide somewhere in the github wiki.
- Please also do the same on the test files,
- Also add tests for float numbers,
arma::fmat
, and check other tests for this.
class RegressionDataGenerator | ||
{ | ||
public: | ||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use a generic matrix type instead of arma::mat
.
*/ | ||
enum class ErrorType | ||
{ | ||
NormalDist, /**< Normal distribution. */ | ||
GammaDist /**< Gamma distribution. */ | ||
}; | ||
|
||
struct NormalDistParams | ||
{ | ||
double mu; /**< Mean. */ | ||
double std; /**< Standard deviation. */ | ||
|
||
NormalDistParams(double mu, double std) : mu(mu), std(std) {} | ||
}; | ||
|
||
struct GammaDistParams | ||
{ | ||
double alpha; | ||
double beta; | ||
|
||
GammaDistParams(double alpha, double beta) : alpha(alpha), beta(beta) {} | ||
}; | ||
|
||
struct ErrorParams | ||
{ | ||
ErrorType type; /**< Type of error distribution. */ | ||
|
||
union | ||
{ | ||
NormalDistParams normalParams; /**< Parameters for normal distribution. */ | ||
GammaDistParams gammaParams; /**< Parameters for gamma distribution. */ | ||
}; | ||
|
||
ErrorParams(ErrorType type) : type(type) {}; | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of these are not required, since armadillo handles all of these. Why then create empty structs that do not do anything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @shrit
Thank you for your feedback,
I thought about adding more error distribution options in the future, like Laplace and Discrete distributions. However, I couldn't find a straightforward way for users to input only the needed parameters for their chosen distribution (like mu and std for normal distribution, alpha and beta for gamma distribution), apart from the current method.
I also considered letting users pass an array with parameters for each distribution, using an enumeration for clarity:
enum ErrorType {
Normal,
Gamma,
// ...
};
int main() {
// Normal distribution - mu -> params[0], std -> params[1]
double params[] = {2.0, 3.0};
RegressionDataGenerator generator(100, 1, ErrorType::Normal, params);
// Gamma distribution - alpha -> params[0], beta -> params[1]
double params[] = {2.0, 3.0};
RegressionDataGenerator generator(100, 1, ErrorType::Gamma, params);
// Additional distributions
// ...
}
But I felt that the first method is more user-friendly, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can just use template parameters and overload the constructors, look at other functions, for example tree classes can be a good example to inspire what are you trying to achieve
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kindly recheck when possible
void ValidateParameters() const | ||
{ | ||
// Validation logic for input parameters | ||
if (nSamples <= 0 || nFeatures <= 0 || nTargets <= 0) | ||
{ | ||
throw std::invalid_argument("Invalid input: nSamples, nFeatures," | ||
"and nTargets must be positive."); | ||
} | ||
|
||
if (sparsity < 0 || sparsity >= 1) | ||
{ | ||
throw std::invalid_argument("Invalid input: sparsity must be in" | ||
"the range [0, 1)."); | ||
} | ||
|
||
if (outliersFraction < 0 || outliersFraction >= 1) | ||
{ | ||
throw std::invalid_argument("Invalid input: outliersFraction must be" | ||
"in the range [0, 1)."); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this one is here, also style issues
TEST_CASE("Generate linear regression data - Valid Inputs", | ||
"[regressionGenerator]") { | ||
arma::mat X, y; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
styles issues as well
* Implementation of a regression data generator with random features and error | ||
* distribution. | ||
* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure, this is not a method, it is more of a simulation, so I have no idea if we need to add this to mlpack or not, it is nice to have, but I do not know where we can add this, @rcurtin would know better than me to judge this class in itself and whether if it is worth it or not
This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍 |
Description
This pull request introduces a new feature to mlpack: a synthetic regression dataset generator. The generator allows users to create datasets with customizable parameters.
The implementation is inspired by scikit-learn's
make_regression
functionality, but includes additional enhancements such as the ability to introduce outliers and support for multiple error distribution types, including gamma and normal distributions.Changes Made
SyntheticRegressionGenerator
class with the following parameters:nSamples
: Number of samples.nFeatures
: Number of features.ErrorParams
: Error distribution parameters (support for both normal and gamma distributions).nTargets
: Number of target variables (includes multi-target).intercept
: Bias term.sparsity
: Sparsity level of coefficients.outliersFraction
: Fraction of samples with outliers.outliersScale
: Scale of outliers.Related Issue
#3559
Usage Example
Screenshots
Here is an example of the generated data. (plotted in python)