
Better documentation of missing value imputation #484

Open
Kiord opened this issue Jan 23, 2023 · 9 comments

@Kiord

Kiord commented Jan 23, 2023

Describe the bug

I am trying to impute missing values in a matrix with svd_interface, but the mask doesn't seem to affect the result.
I am using n_eigenvecs=matrix.shape[0] (full rank, so no compression / no data loss)

Steps or Code to Reproduce

import numpy as np
from tensorly.tenalg.svd import svd_interface

m, n = 30, 50
matrix = np.random.rand(m, n)
mask = np.ones_like(matrix)
mask[m // 2:, ::2] = 0  # mask == 0 marks the entries to impute
print(f'{np.count_nonzero(mask == 0) / mask.size * 100:.1f}% imputed')

U, S, V = svd_interface(matrix, n_eigenvecs=m, mask=mask)
recon = U @ np.diag(S) @ V
print(np.allclose(matrix, recon))  # True, i.e. the mask had no effect

Expected behavior

The values where mask == 0 should differ between matrix and recon

Actual result

The imputation code in svd_interface is as follows:

    U, S, V = svd_fun(matrix, n_eigenvecs=n_eigenvecs, **kwargs)

    if mask is not None:
        for _ in range(n_iter_mask_imputation):
            matrix = matrix * mask + (U @ tl.diag(S) @ V) * (1 - mask)
            U, S, V = svd_fun(matrix, n_eigenvecs=n_eigenvecs, **kwargs)

I think the result U, S, V does not depend on mask:

  • iteration 0 (first line): matrix is factored into SVD form (no data loss, since the rank is full)
  • iteration 1:
    • since matrix equals U @ tl.diag(S) @ V exactly, the update is equivalent to matrix = matrix
    • matrix is again factored into U, S, V (no data loss)
  • iteration 2:
    • same as iteration 1
  • ...

In the end, mask has no effect on the result.
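A minimal numpy-only check of this reasoning (a sketch using np.linalg.svd directly rather than tensorly's svd_fun, which dispatches to a backend SVD) shows why the full-rank update is a no-op:

```python
import numpy as np

# Full-rank SVD of a 30x50 matrix: all min(m, n) = 30 singular values
# are kept, so the factorization is exact up to floating point.
rng = np.random.default_rng(0)
m, n = 30, 50
matrix = rng.random((m, n))

U, S, Vt = np.linalg.svd(matrix, full_matrices=False)
recon = U @ np.diag(S) @ Vt

# The update `matrix * mask + recon * (1 - mask)` is then a no-op,
# because recon equals matrix everywhere, masked or not.
assert np.allclose(matrix, recon)
```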

Versions

Windows-10-10.0.22000-SP0
Python 3.10.4 | packaged by conda-forge | (main, Mar 30 2022, 08:38:02) [MSC v.1916 64 bit (AMD64)]
NumPy 1.21.5
SciPy 1.8.0

@aarmey aarmey assigned aarmey and unassigned aarmey Jan 24, 2023
@aarmey
Contributor

aarmey commented Jan 24, 2023

Hi @Kiord—When the rank is full, SVD is able to fully represent the matrix. As a consequence of this, it can represent whatever values are filled in as missing values, and there is no way for SVD to impute those values. Imputation occurs as a consequence of SVD being an approximation of the matrix, which can only happen when it is an incomplete representation. You should find that the imputed values change if the rank is less than full, though.
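To illustrate this, here is a numpy-only sketch that mimics svd_interface's masking loop with a hypothetical truncated_svd helper (not tensorly's actual implementation): once the rank is strictly below min(m, n), the masked entries really are replaced.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, rank = 30, 50, 5          # rank strictly below min(m, n) = 30
original = rng.random((m, n))
mask = np.ones((m, n))
mask[m // 2:, ::2] = 0          # mask == 0 marks "missing" entries

def truncated_svd(a, rank):
    # Hypothetical stand-in for a rank-truncated backend SVD.
    U, S, Vt = np.linalg.svd(a, full_matrices=False)
    return U[:, :rank], S[:rank], Vt[:rank]

# Same update rule as the masking loop quoted above, but with a
# truncated (approximate) SVD instead of a full-rank one.
matrix = original.copy()
U, S, Vt = truncated_svd(matrix, rank)
for _ in range(5):
    matrix = matrix * mask + (U @ np.diag(S) @ Vt) * (1 - mask)
    U, S, Vt = truncated_svd(matrix, rank)

# Observed entries are untouched; masked entries were replaced
# by the low-rank estimate.
assert np.allclose(matrix[mask == 1], original[mask == 1])
assert not np.allclose(matrix[mask == 0], original[mask == 0])
```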

@Kiord
Author

Kiord commented Jan 24, 2023

Hi, thank you. I was unsure whether this behavior was intended.

@Kiord Kiord closed this as completed Jan 24, 2023
@aarmey
Contributor

aarmey commented Jan 24, 2023

No worries. Happy to help!

@JeanKossaifi
Member

Thanks @aarmey - do you think it would make sense to add a Notes section on imputation in the docstring?

@aarmey
Contributor

aarmey commented Jan 30, 2023

I am not sure if this should be user-facing. If a user wants to impute by SVD, there are better options in fancyimpute. However, I do agree that there could be better documentation of handling missing values and, potentially, using the tensor methods to impute. Do you think that would go in the notes section of the methods, or in a section of the user guide?

@JeanKossaifi
Member

I agree -- maybe just a short note section in the docstring about what we mean by mask etc. We can also add separately a short section in the user guide about missing data with our tensor methods in general -- I guess there is no such thing as too exhaustive a documentation :)

@aarmey aarmey reopened this Jan 31, 2023
@aarmey aarmey changed the title svd_interface not imputing missing values (when rank is full) Better documentation of missing value imputation Jan 31, 2023
@JeanKossaifi
Member

@Kiord would you be able to make a small PR with the changes you have in mind and ping me and @aarmey to review it?

@Kiord
Author

Kiord commented Jun 8, 2023

Hello @JeanKossaifi, I can try to do that. My understanding is that svd_interface is not designed to impute values the way sklearn.impute.SimpleImputer is, for instance, but an optional feature (masking) lets the user impute some data, though only when the rank is reduced. Correct?

@aarmey
Contributor

aarmey commented Jun 9, 2023

Hi @Kiord, that is right. There is nothing wrong with using svd_interface for imputation; we just have not added a lot of the basic functionality one might expect. For example, it only lets you set a constant number of iterations, rather than running until some convergence condition is met. We also do not check the inputs very carefully to ensure they are reasonable (such as the requested rank being lower than that of the data).


3 participants