Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistency in sign. #24

Open
eliocamp opened this issue Oct 4, 2017 · 6 comments
Open

Consistency in sign. #24

eliocamp opened this issue Oct 4, 2017 · 6 comments

Comments

@eliocamp
Copy link

eliocamp commented Oct 4, 2017

I know that principal value decomposition is defined up to sign, but I think that for reproducibility, irlba() should be consistent. Ideally, it should follow the same convention as the base::svd() function. The workaround right now seems to set the seed each time irlba is run.

Reproducible example:

data("volcano")
sapply(1:10, FUN = function(x) sign(irlba::irlba(volcano, nv = 1)$u[1]))
#>  [1] -1  1  1 -1 -1 -1  1  1  1  1

sapply(1:10, FUN = function(x) sign(svd(volcano, nv = 1)$u[1]))
#>  [1] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

sapply(1:10, FUN = function(x) {
    set.seed(42) 
    sign(irlba::irlba(volcano, nv = 1)$u[1])})
#>  [1] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

(42 turns out to be the answer to the ultimate question of life, the universe, everything and consistent singular value decompositions)

@bwlewis
Copy link
Owner

bwlewis commented Oct 6, 2017

Thanks, this is a good suggestion. I will work in a solution to this, probably using the sign of the 1st element of the first u vector.


@eliocamp
Copy link
Author

Hey, I'm sorry to keep harping about this issue, but I found another source of inconsistency. The same data represented in slightly different way returns different signs even after setting the seed. Here's a reproducible example.

sign_test <- function(x, method) {
    # Synthetic data representing an scalar field in the
    # southern hemisphere in a "tidy" format
    data <- expand.grid(lon = seq(0, 360, by = 2),
                        lat = seq(-90, 0, by = 2),
                        date = 1:10)
    set.seed(x)
    data$z <- rnorm(nrow(data))
    
    # The same data represented as a matrix differing only  
    # in the order of the columns
    matrix1 <- as.matrix(reshape2::dcast(data, date ~ lon + lat, value.var = "z"))
    matrix2 <- as.matrix(reshape2::dcast(data, date ~ lat + lon, value.var = "z"))
    
    # Base svd has no problem 
    if (method == "base") {
        r1 <- base::svd(matrix1, nv = 1, nu = 1)$v
        r2 <- base::svd(matrix2, nv = 1, nu = 1)$v
    } else {
        set.seed(42)
        r1 <- irlba::irlba(matrix1, nu = 1, nv = 1)$v
        set.seed(42)
        r2 <- irlba::irlba(matrix2, nu = 1, nv = 1)$v
    }
    sign(r1[1]) == sign(r2[1]) 
}

N <- 20
sapply(1:N, sign_test, method = "base")    # all TRUE! :)
#>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [15] TRUE TRUE TRUE TRUE TRUE TRUE
sapply(1:N, sign_test, method = "irlba")   # some FALSE! :(
#>  [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
#> [12] FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE

@bwlewis
Copy link
Owner

bwlewis commented Apr 10, 2018

Thanks, this is in fact a different, more serious, problem than the sign issue above--it's a bug! EDIT: Whoops -- not a bug after all...see next long comment,

@bwlewis
Copy link
Owner

bwlewis commented Apr 17, 2018

Sorry, this was not a bug, but it does show another example of the inconsistent sign issue that you already surmised abouve.

You'll be happy that a partial fix for the inconsistent sign issue is ready, with a full version to follow soon (more on that below). But first, some notes on your example above.

I had to think a bit about it. I don't think that the example illustrates things as clearly as it could, hence my initial confusion, because r1$v and r2$v should not really expected to exhibit the same signs. However, if P is a column permutaion matrix, then matrix2 = matrix1 %*% P = r1$u %*% diag(d) %*% t(r1$v) %*% P = r1$u %*% diag(d) %*% t(t(P) %*% r1$v). That means we expect r2$v = t(P) %*% r1$v in general (they are permuted versions of each other).

Here is a restatement of your example that builds a matrix with many more columns than rows using an explicitly known SVD. We then compute the SVD using R's version and irlba. It's really the same as your example, without the lat/long stuff and just focusing in on the specific issue...

m = 10
n = 1000
set.seed(1)
# explicitly construct a matrix x1 with known SVD
u = qr.Q(qr(matrix(rnorm(m * m), m)))
v = qr.Q(qr(matrix(rnorm(m * n), n)))
d = seq(m, 1)
x1 = u %*% (d * t(v))

# Now, permute its columns
i  = sample(n)
P  = diag(1, n, n)[, i]  # permutation matrix
x2 = x1 %*% P
#  = u %*% (d * t(v)) %*% P
#  = u %*% (d * t(t(P) %*% v))   # an explicit SVD of x2
# this shows that svd(x2)$v should equal t(P) %*% v.

s2 = svd(x2)

So at this point, s2$v should equal t(P) %*% v, right? But not so! We see that the non-uniqueness of the SVD affects R's native svd sometimes too:

round(apply(s2$v - t(P) %*% v, 2, crossprod))
# [1] 0 0 0 0 4 0 0 0 0 4

However, at least R's native svd function is self-consistent with respect to sign:

s1 = svd(x1)
round(apply(s2$v - t(P) %*% s1$v, 2, crossprod))
# [1] 0 0 0 0 0 0 0 0 0 0

This is really the crux of your issue report here in issue #24 -- irlba is unfortunately not even self-consistent :( shown next:

library(irlba)
set.seed(1)
l1 = irlba(x1, 3)
set.seed(1)
l2 = irlba(x2, 3)
round(apply(l2$v - t(P) %*% l1$v, 2, crossprod))
# [1] 0 4 0

I agree that this is somewhat unsatisfactory, even though technically still correct.

After thinking about this a lot, I am working on a surprisingly simple fix, that unfortunately is less simple than it seems and breaks some tests at the moment. I added a (for now, hidden) option to set the internally used RNG in the function. If we choose an RNG with uniform sign, then your problems above are mitigated:

set.seed(1)
l1u = irlba(x1, 3, rng=runif)
set.seed(1)
l2u = irlba(x2, 3, rng=runif)
round(apply(l2u$v - t(P) %*% l1u$v, 2, crossprod))
# [1] 0 0 0

Cazart! So this is a first draft/experimental of a fix for this issue. Unfortunately, it's not yet finished and is breaking a few things here and there. Try it out (it will be in the GitHub repository presently). For now, it's a hidden option with no documentation, etc.

I am interested in thoughts on this approach, any help appreciated, including more of your great examples which have helped so far.

Best,

Bryan

bwlewis added a commit that referenced this issue Apr 17, 2018
tests are breaking though, something not quite right yet...(in process)
@bwlewis
Copy link
Owner

bwlewis commented Apr 17, 2018

Sorry, the test failure is due to an unrelated issue that I forgot about.

I think this new rng=runif option will fix your issue. If you agree, we can make it an official option with doc.

@eliocamp
Copy link
Author

eliocamp commented Apr 24, 2018

I sat down willing to test this and I now I understand your explanation of why my previous code was not correct. I was testing the sign of the first element of v, but that element does not represent the same point in space when the matrix is permuted (I come from the physical sciences, so my background in matrix algebra is rather shallow and I needed the physical dimension to understand things). After running proper tests, I see now that it is not a problem. Silly me 🤦‍♂️ .

If I understand your fix correctly, it fixes the issue in my first comment without the need to specify the seed, is that correct? That's wonderful for reproducibility. While setting the seed is always a good idea, it's not ideal because the result is still dependent on the chosen seed. Is there any drawback of using runif instead of rnorm? If not, maybe it's worth changing it to the default value of rng?

Works like a charm

data("volcano")
sapply(1:10, FUN = function(x) sign(irlba::irlba(volcano, nv = 1, rng = runif)$u[1]))
#>  [1] 1 1 1 1 1 1 1 1 1 1

Even across seeds. 🎆

sapply(1:10, FUN = function(x) {
    set.seed(x) 
    sign(irlba::irlba(volcano, nv = 1, rng = runif)$u[1])})
#>  [1] 1 1 1 1 1 1 1 1 1 1

The only potential issue is that the sign is not consistent with base::svd

sign(svd(volcano, nv = 1)$u[1]) == sign(irlba::irlba(volcano, nv = 1, rng = runif)$u[1])
#> [1] FALSE

In this case, it's solved by changing the sign of the rng.

sign(svd(volcano, nv = 1)$u[1]) == sign(irlba::irlba(volcano, nv = 1, rng = function(x) -runif(x))$u[1])
#> [1] TRUE

But it seems to be data-dependent :(

sign_test <- function(x) {
    # Synthetic data representing an scalar field in the
    # southern hemisphere in a "tidy" format
    data <- expand.grid(lon = seq(0, 360, by = 2),
                        lat = seq(-90, 0, by = 2),
                        date = 1:10)
    set.seed(x)
    data$z <- rnorm(nrow(data))
    matrix1 <- as.matrix(reshape2::dcast(data, date ~ lon + lat, value.var = "z"))
    
    r1 <- base::svd(matrix1, nv = 1, nu = 1)$u
    r2 <- irlba::irlba(matrix1, nu = 1, nv = 1, rng = function(x) -runif(x))$u
    sign(r1[1]) == sign(r2[1]) 
}

N <- 20
sapply(1:N, sign_test) 
#>  [1]  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
#> [12]  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE

I don't know if the above inconsistency can be considered a problem or not. It seems to me that different results returned from different algorithms is a natural and unavoidable consequence of the non-uniqueness of SVD. If you think this is not worth changing, then I think this issue is safely closed. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants