
Implement automatic ridge regression #124

Open · wants to merge 9 commits into `main`
Conversation

@witherscp witherscp commented Feb 3, 2023

PR Description

Closes #123

Adds an `auto_reg` parameter to `vector_auto_regression`. This will regularize the input matrix using scikit-learn's `RidgeCV`, which searches for the optimal alpha value. The `auto_reg` parameter is automatically set to `True` for ill-conditioned matrices.
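A rough sketch of the idea for reviewers skimming the thread (the helper name, alpha grid, and condition-number threshold below are illustrative assumptions, not the PR's actual code):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

def fit_var_coefs(X, Y, auto_reg=False, cond_thresh=1e6):
    """Fit Y ~ X, switching to RidgeCV when X is ill-conditioned."""
    if np.linalg.cond(X) > cond_thresh:
        auto_reg = True  # force regularization for ill-conditioned input
    if auto_reg:
        # RidgeCV searches an alpha grid via efficient leave-one-out CV
        model = RidgeCV(alphas=np.logspace(-3, 3, 7), fit_intercept=False)
        model.fit(X, Y)
        return model.coef_.T, model.alpha_
    # well-conditioned and auto_reg=False: plain least squares
    coefs, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return coefs, 0.0

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
Y = X @ np.array([[1.0], [0.5], [-0.2]]) + 0.01 * rng.randn(100, 1)
coefs, alpha = fit_var_coefs(X, Y)
```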

Merge checklist

Maintainer, please confirm the following before merging:

  • All comments resolved
  • This is not your own PR
  • All CIs are happy
  • PR title starts with [MRG]
  • whats_new.rst is updated
  • PR description includes phrase "closes <#issue-number>"

@adam2392 adam2392 self-requested a review February 3, 2023 16:30
```
auto_reg : bool, optional
    Whether to perform automatic regularization of X matrix using RidgeCV,
    by default False. If matrix is not full rank, this will be adjusted to
    True.
```

@adam2392 (Member) suggested change:

```diff
-    True.
+    True. If ``l2_reg`` is non-zero, then this must be set to ``False``.
```

```
@@ -31,6 +32,10 @@ def vector_auto_regression(
    Autoregressive model order, by default 1.
    l2_reg : float, optional
        Ridge penalty (l2-regularization) parameter, by default 0.0.
```

@adam2392 (Member) suggested change:

```diff
     Ridge penalty (l2-regularization) parameter, by default 0.0.
+    If ``auto_reg`` is ``True``, then this must be set to 0.0.
```

Comment on lines 129 to 130:

```python
elif auto_reg and l2_reg:
    raise ValueError("If l2_reg is set, then auto_reg must be set to False")
```

@adam2392 (Member): Can you add a unit-test for this case? Lmk if you need help w/ that.

```
@@ -31,6 +32,10 @@ def vector_auto_regression(
    Autoregressive model order, by default 1.
    l2_reg : float, optional
        Ridge penalty (l2-regularization) parameter, by default 0.0.
    auto_reg : bool, optional
        Whether to perform automatic regularization of X matrix using RidgeCV,
        by default False. If matrix is not full rank, this will be adjusted to
```

@adam2392 (Member) suggested change:

```diff
-    by default False. If matrix is not full rank, this will be adjusted to
+    by default False. If the data matrix has condition number less than 1e6, then this will be adjusted to
```

@adam2392 (Member) left a comment

Taking a quick look, I have another thought to perhaps make the API easier and more intuitive.

Instead of adding an `auto_reg` parameter, what are your thoughts on making `l2_reg = 'auto'` by default? So `l2_reg` can be `'auto'`, a float, a list of floats, or `None`:

  • 'auto': l2_reg will compute the optimal alpha using a pre-defined grid of alphas, like you have
  • float: this alpha value will just be applied
  • list of floats: same as 'auto', but you specify the pre-defined grid of alphas
  • None, or l2_reg is 0: no regularization

WDYT? This might make your/users' lives easier because there is only one documented parameter to worry about, instead of figuring out how two parameters interact.
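A sketch of how that single-parameter dispatch could look (`resolve_alphas`, the default grid, and the condition-number threshold are placeholders for illustration, not library API):

```python
import numpy as np

def resolve_alphas(l2_reg, cond, cond_thresh=1e6,
                   default_grid=(0.001, 0.01, 0.1, 1.0, 10.0)):
    """Map the proposed l2_reg values to (alpha_grid_for_cv, fixed_alpha)."""
    if isinstance(l2_reg, str) and l2_reg == 'auto':
        if cond > cond_thresh:            # ill-conditioned: run CV
            return list(default_grid), None
        return None, 0.0                  # well-conditioned: skip CV
    if l2_reg is None:
        return None, 0.0                  # no regularization
    if np.isscalar(l2_reg):
        return None, float(l2_reg)        # fixed penalty (0 means none)
    return list(l2_reg), None             # user-supplied grid for CV
```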

@witherscp (Author)

I agree that using only the l2_reg argument will make things much simpler. Do you want me to implement this and then commit the changes?

@adam2392 (Member) commented Feb 3, 2023

> I agree that using only the l2_reg argument will make things much simpler. Do you want me to implement this and then commit the changes?

That would be great and then I can help review the code!

@witherscp (Author) commented Feb 3, 2023

I opted to keep the underlying functions unchanged w/r/t the `l2_reg` parameter, but I had to overwrite the `l2_reg` value at line 185 to avoid problems. This isn't the cleanest code, but it won't affect the user experience. I also thought it was helpful to include the cross-validated alpha values in the displayed model parameters, so that the user can view them if `verbose` is set to `True`.

@adam2392 (Member) left a comment

Looking good! Just some minor touchups. The next steps would be to:

  1. add a unit-test for the new case
  2. add an example, or augment an existing example demonstrating this usage and when it might be necessary

I think the CIs are also showing some unit-test errors: https://github.com/mne-tools/mne-connectivity/actions/runs/4087951173/jobs/7049413823

You can follow the CONTRIBUTING guide to install the relevant test and docs-building packages, so you can run the relevant checks locally since they are faster. You can also check coding style by running `make pep`. Lmk if you run into any issues.

@witherscp (Author)

I've never worked with unit tests before. What would be a good unit test to add for the new features, and how do I add it? As for the current tests that are failing, I am getting the same problem across 10 tests: `numpy.core._exceptions._UFuncNoLoopError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U4'), dtype('float64')) -> None`. I'm not sure of the cause of this error, but it may be related to the new default `l2_reg` value, which is a str instead of a float.
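That traceback is consistent with the `'auto'` string default reaching NumPy arithmetic somewhere; a minimal reproduction of the dtype mismatch:

```python
import numpy as np

# A string leaking into math: NumPy casts 'auto' to dtype('<U4'), and there
# is no multiply loop for (str, float), hence the TypeError subclass above.
l2_reg = 'auto'
try:
    np.multiply(l2_reg, 1.0)
    raised = False
except TypeError:
    raised = True
```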

@adam2392 (Member) commented Feb 6, 2023

Re adding unit tests:

  • You can experiment w/ creating a dataset that tests the functionality you have added. For example, the following creates a noisy dataset, which is used to make sure that certain functions are robust to noise:

    ```python
    import numpy as np

    def create_noisy_data(
        add_noise,
        sigma=1e-4,
        m=100,
        random_state=12345,
    ):
        """Create noisy test data.

        Generate a 2x2 linear system, and perturb observations of the
        state variables with Gaussian noise.

        Parameters
        ----------
        add_noise : bool
            Whether to add noise or not.
        sigma : float
            Noise standard deviation.
        m : int
            The number of samples.
        random_state : None | int | instance of ~numpy.random.RandomState
            If ``random_state`` is an :class:`int`, it will be used as a seed
            for :class:`~numpy.random.RandomState`. If ``None``, the seed will
            be obtained from the operating system (see
            :class:`~numpy.random.RandomState` for details). Default is
            ``None``.

        Returns
        -------
        sample_data : ndarray, shape (n_channels, n_samples)
            Observed sample data, possibly with noise.
        sample_eigs : np.ndarray
            The true eigenvalues of the system.
        sample_A : np.ndarray
            The true linear system matrix.
        """
        rng = np.random.RandomState(random_state)
        mu = 0.0
        noise = rng.normal(mu, sigma, m)  # gaussian noise
        A = np.array([[1.0, 1.0], [-1.0, 2.0]])
        A /= np.sqrt(3)
        # compute true eigenvalues
        true_eigvals = np.linalg.eigvals(A)
        n = 2
        X = np.zeros((n, m))
        X[:, 0] = np.array([0.5, 1.0])
        # evolve the system and perturb the data with noise
        for k in range(1, m):
            X[:, k] = A.dot(X[:, k - 1])
            if add_noise:
                X[:, k - 1] += noise[k - 1]
        return X, true_eigvals, A
    ```

    You can add a function here to generate the dataset. You probably want to create a dataset that is "ill-conditioned": augment a dataset that is not ill-conditioned with a few sensors/rows that are "almost" copies of a few other rows. Copy the rows, then add some small noise. This should result in ill-conditioned data since the new rows are almost exact copies.
  • Then you want to add the unit test itself. Unit tests are named `def test_*`. Examples are in https://github.com/mne-tools/mne-connectivity/blob/main/mne_connectivity/vector_ar/tests/test_var.py.
  • We use pytest, which is used heavily in all fields. You might find it helpful/educational to peruse their docs: https://docs.pytest.org/en/7.2.x/. It's a good tool for helping you develop robust scientific software.
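The "almost-copied rows" idea above can be sketched like this (the 1e6 cutoff matches the condition-number threshold discussed elsewhere in this thread; all sizes are illustrative):

```python
import numpy as np

rng = np.random.RandomState(12345)
data = rng.randn(4, 100)                                # 4 well-conditioned channels
noisy_copy = data[-1:] + rng.normal(0, 1e-8, (1, 100))  # near-copy of the last row
ill = np.vstack([data, noisy_copy])                     # 5th row ~ duplicates the 4th

cond_ok = np.linalg.cond(data)   # modest condition number
cond_ill = np.linalg.cond(ill)   # blows up: the matrix is nearly rank-deficient
```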

Lmk if you have any questions!

@adam2392 (Member) commented Feb 6, 2023

> I've never worked with unit tests before. What would be a good unit test to add for the new features and how do I add that? As for the current tests that are failing, I am getting the same problem across 10 tests: `numpy.core._exceptions._UFuncNoLoopError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U4'), dtype('float64')) -> None`. I'm not sure the cause of this error, but it may be related to the new default l2_reg value which is a str instead of float.

Re the errors you see, I think you are right. The issue is that `l2_reg = 'auto'` should be converted into a number. IIUC there are two cases for `l2_reg == 'auto'`:

  1. The condition number is below our threshold; therefore we will do CV to determine an optimal l2_reg value.
  2. The condition number does not fall below our threshold. Looking at the docstring and code, I don't think this case is covered, which is causing the errors. In this setting, we can either still run CV to determine the optimal l2_reg value, or we set l2_reg = 0.

So in summary:

  1. l2_reg = None/0: no CV is done regardless of condition number. Maybe a warning is emitted.
  2. l2_reg = 'auto' (the new default): CV is done only if the condition number is high.
  3. l2_reg = array of floats: CV is done using that array of floats regardless of condition number.
  4. l2_reg = number > 0: no CV is done, and the l2_reg is applied in all cases.

Does this seem right to you? If so, perhaps you can walk through the code and see where there might be issues? Once you write the unit-tests, those will also help in debugging the behavior in the different cases.
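Restated in code, the four branches could be driven by a small strategy function that the new unit tests loop over (a sketch only; the names and the "CV when the condition number is high" rule are assumptions from this thread):

```python
import numpy as np

def choose_strategy(l2_reg, cond, thresh=1e6):
    if isinstance(l2_reg, str):                       # 'auto' (new default)
        return 'cv' if cond > thresh else 'none'
    if isinstance(l2_reg, (list, tuple, np.ndarray)):
        return 'cv'                                   # CV over the given grid
    if l2_reg is None or l2_reg == 0:
        return 'none'                                 # maybe warn if cond is high
    return 'fixed'                                    # single positive penalty

cases = {
    (None, 1e9): 'none',     # case 1
    ('auto', 1e9): 'cv',     # case 2, ill-conditioned
    ('auto', 10.0): 'none',  # case 2, well-conditioned
    (0.3, 10.0): 'fixed',    # case 4
}
assert all(choose_strategy(*args) == want for args, want in cases.items())
```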

@witherscp (Author)

Thanks for the debugging help in the last comment! I was able to fix the problem where I did not account for the case when l2_reg='auto' and data is not ill-conditioned.

I'm now having some trouble making the unit-tests. I noticed how you have a correct A matrix in your create_noisy_data() function which you can use to test whether the VAR-created A matrix is correct. I can create an ill-conditioned matrix using the below code, but how do I know the actual A matrix for this dataset?

```python
rng = np.random.RandomState(0)
n_epochs, n_signals, n_times = 2, 4, 64
data = rng.randn(n_epochs, n_signals, n_times)
times = np.arange(n_times)
final_row = np.reshape(data[:, -1, :], (2, 1, 64))
illconditioned = np.concatenate((data, final_row), axis=1)
illconditioned[:, -1, :] += rng.normal(0, 1e-6, 128).reshape(2, -1)
```

Right now, my unit test would simply iterate through a set of parameters for l2_reg ('auto', None, 0, 0.3, np.array([0.1, 1]), [0.1, 1]) to make sure that no errors are encountered. But how can I know if the result is "correct"?

@adam2392 (Member) commented Feb 8, 2023

> I'm now having some trouble making the unit-tests. I noticed how you have a correct A matrix in your create_noisy_data() function which you can use to test whether the VAR-created A matrix is correct. I can create an ill-conditioned matrix using the below code, but how do I know the actual A matrix for this dataset?


Hmm good question. Maybe you can try the following:

  • Generate an A matrix that is ill-conditioned, either by: (i) creating the eigenvalue matrix, setting one of the eigenvalues to something like 1e-6, and generating a random eigenvector matrix; or (ii) creating a lower/upper-triangular random matrix with diagonal entries <= 1 (to ensure stability) and artificially setting one of the diagonal entries to something like 1e-6. Since the matrix is triangular, its eigenvalues are exactly the diagonal entries.
  • generate a random x0 initial conditions vector and set 2/3 of the entries to be the same (i.e. 2/3 of the "artificial sensors" have the same starting condition).
  • generate forward in time
  • double check the condition number of the output test dataset (the generated time-series)
  • verify that your procedure does "well" (either compared to the non-regularized method, or just overall close to the ground-truth matrix)

Lmk if steps 1-4 give you what you need in terms of the test dataset, or if you have any coding questions.
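Steps 1-4 might look like this sketch (real orthogonal eigenvectors via `scipy.stats.ortho_group`, which makes `A` symmetric; the eigenvalues, sizes, and the "large condition number" check are illustrative assumptions):

```python
import numpy as np
from scipy.stats import ortho_group

rng = np.random.RandomState(0)
n, m = 12, 100
# nine moderate eigenvalues plus three tiny ones for ill-conditioning (step 1)
eigs = np.r_[np.linspace(0.9, 0.5, 9), 1e-6, 2e-6, 3e-6]
Q = ortho_group.rvs(dim=n, random_state=rng)   # random orthogonal eigenvectors
A = Q @ np.diag(eigs) @ Q.T                    # eigenvalues of A are exactly `eigs`

x0 = rng.randn(n)
x0[1] = x0[2] = x0[0]        # make some initial conditions identical (step 2)
X = np.zeros((n, m))
X[:, 0] = x0
for k in range(1, m):        # evolve the system forward in time (step 3)
    X[:, k] = A @ X[:, k - 1]

cond = np.linalg.cond(X)     # step 4: the generated time series is ill-conditioned
```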

> Right now, my unit test would simply iterate through a set of parameters for l2_reg -- 'auto', None, 0, 0.3, np.array(0.1,1), [0.1,1] -- to make sure that no errors are encountered. But how can I know if the result is "correct"?

We'll want to test that this at least works on some simulated data. You also want to create a set of tests verifying that the expected behavior is carried out in each case of `l2_reg`. At the very minimum, you can verify that the estimated matrix is "close" to the ground truth and/or that the model params passed to the connectivity data structure are what you might expect.
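A sketch of that minimal check on simulated data: generate a trajectory from a known stable `A`, refit it with plain least squares, and assert the estimate is close (all numbers illustrative; a regularized fit would be compared the same way):

```python
import numpy as np

rng = np.random.RandomState(0)
A = np.array([[0.9, 0.1], [0.0, 0.8]])   # known stable ground-truth system
m = 500
X = np.zeros((2, m))
X[:, 0] = [1.0, 1.0]
for k in range(1, m):
    X[:, k] = A @ X[:, k - 1] + rng.normal(0, 1e-3, 2)

# one-step least-squares estimate of A from the observed transitions
A_hat = X[:, 1:] @ np.linalg.pinv(X[:, :-1])
err = np.linalg.norm(A_hat - A, ord='fro')
```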

Unrelated note: I saw you work with Sara Inati? She was a collaborator when I was in my PhD. Say hi :).

@witherscp (Author) commented Feb 13, 2023

OK, I tried my best to follow your suggestions. I only needed to set one eigenvalue to 1e-6 to obtain an ill-conditioned matrix (as long as there are 12 channels in the dataset), so I did not need to set 2/3 of the starting conditions to the same value as well. Using the ill-conditioned data matrix, I was able to test out the l2_reg parameter with and without regularization.

Now the problem is that RidgeCV does not seem to outperform OLS in this sample dataset. In addition, many of the eigenvalues are dissimilar from the actual sample eigenvalues. What is your suggestion for next steps in debugging?

And yes, I do work with Sara Inati! Glad to see you have collaborated with us in the past. I'm thankful for your help on getting this improvement finished.

Comment on lines +204 to +208:

```python
with warnings.catch_warnings():
    warnings.filterwarnings(
        action='ignore',
        message="Ill-conditioned matrix"
    )
```
@adam2392 (Member): Why do we need this?

@witherscp (Author):

RidgeCV tests out an array of alpha values, and some of them do not regularize the matrix enough to avoid an ill-conditioned matrix warning. If the user sees many of these messages pop up, they may think something is going wrong, when in fact the function is behaving as expected. RidgeCV will choose the best alpha value, and that value will come from a fit where this warning was not thrown.

@adam2392 (Member)

> Now the problem is that RidgeCV does not seem to outperform OLS in this sample dataset. In addition, many of the eigenvalues are dissimilar from the actual sample eigenvalues. What is your suggestion for next steps in debugging?

Hmm I'll list some things to try systematically. Before you do so, it might be helpful to track the performance changes relative to your current attempt. The two metrics I have in mind are i) how closely you approximate the true A (so some type of matrix norm of the difference), and ii) how closely you approximate the set of true eigenvalues. Then you can see if any of these changes are at least moving in the right direction (i.e. RidgeCV better than OLS).
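Those two metrics might be implemented as small helpers that each experiment reports (names are illustrative):

```python
import numpy as np

def matrix_error(A_true, A_hat):
    """Frobenius norm of the estimation error, ||A_hat - A_true||_F."""
    return np.linalg.norm(A_hat - A_true, ord='fro')

def eig_error(A_true, A_hat):
    """Distance between the sorted eigenvalue sets."""
    e_true = np.sort_complex(np.linalg.eigvals(A_true))
    e_hat = np.sort_complex(np.linalg.eigvals(A_hat))
    return float(np.linalg.norm(e_hat - e_true))

A = np.array([[0.9, 0.1], [0.0, 0.8]])
```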

  1. Instead of simulating data with A being upper-triangular, it might help to have A not be upper-triangular. To generate a random matrix with a fixed set of eigenvalues, you just need to generate a random eigenvector matrix. For normal (e.g. symmetric) matrices, the eigenvector matrix is unitary (all of its eigenvalues have modulus 1, among other special properties). Scipy has a function for sampling such matrices: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.unitary_group.html. Also see https://mathematica.stackexchange.com/questions/228560/generate-random-matrix-with-specific-eigenvalues about generating matrices with a specific set of eigenvalues. Intuitively, this will allow the data to mix more. Perhaps right now the problem is so easy that OLS is just the better thing to do.
  2. I would also try adding more ill-conditioning, i.e. a few more very small eigenvalues.
  3. Another thing to try is to make some of the initial conditions exactly the same. Intuitively, A is mixing x0 (the initial conditions).
  4. Another thing to try is to change the dimensionality (i.e. the number of sensors). Usually regularization helps in higher dimensions. Right now you have 12 sensors and 100 time points; you can play around w/ increasing n_sensors and n_timepoints. My suspicion is that n_sensors plays the greater role.

Lmk if that helps, or if you are stuck anywhere.

@adam2392 (Member) commented Mar 6, 2023

Hey @witherscp just checking in on this to see if you need any assistance?

If there is trouble figuring out a rigorous scenario where this is tested, we can always disable this by default, and instead you can add a short example on some simulated or real data (see existing examples) where this shows "improvement".

@witherscp (Author)

> Hey @witherscp just checking in on this to see if you need any assistance?
>
> If there is trouble figuring out a rigorous scenario where this is tested, we can always disable this by default and instead you can add a short example on some simulated, or real data (see existing examples), where this shows "improvement".

Hey, sorry for the delay; I got sidetracked by other tasks that I have been working on. I will try to test out a few scenarios where RidgeCV outperforms OLS before opting to disable the default. If this proves to be too difficult, though, it will be easy to show an example where it is necessary (as in the case of common average reference).

Can you provide a little more help with the example you suggested of using a non-upper-triangular matrix? I am not that well-versed in linear algebra, so I am starting to get confused by some of these ideas. What lines of code would create a non-upper-triangular matrix with multiple small eigenvalues?
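One possible route, hedged as a sketch rather than the suggested approach: a companion matrix has eigenvalues equal to the roots of the polynomial it is built from, and it is not triangular (a dense first row plus a subdiagonal of ones):

```python
import numpy as np
from scipy.linalg import companion

# Desired eigenvalues, including several very small ones
roots = np.array([0.9, 0.7, 0.5, 1e-6, 2e-6])
A = companion(np.poly(roots))   # eigenvalues of A are exactly `roots`
eigs = np.linalg.eigvals(A)
```

The similarity-transform route from the earlier comment (random orthogonal `Q`, then `A = Q @ np.diag(eigs) @ Q.T`) also works and gives a dense symmetric matrix.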

Successfully merging this pull request may close these issues.

Implement rank deficiency check in vector_auto_regression