Consistency in sign. #24

eliocamp · 2017-10-04T22:29:45Z

I know that principal value decomposition is defined up to sign, but I think that for reproducibility, irlba() should be consistent. Ideally, it should follow the same convention as the base::svd() function. The workaround right now seems to set the seed each time irlba is run.

Reproducible example:

data("volcano")
sapply(1:10, FUN = function(x) sign(irlba::irlba(volcano, nv = 1)$u[1]))
#>  [1] -1  1  1 -1 -1 -1  1  1  1  1

sapply(1:10, FUN = function(x) sign(svd(volcano, nv = 1)$u[1]))
#>  [1] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

sapply(1:10, FUN = function(x) {
    set.seed(42) 
    sign(irlba::irlba(volcano, nv = 1)$u[1])})
#>  [1] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

(42 turns out to be the answer to the ultimate question of life, the universe, everything and consistent singular value decompositions)

The text was updated successfully, but these errors were encountered:

bwlewis · 2017-10-06T10:21:43Z

Thanks, this is a good suggestion. I will work in a solution to this, probably using the sign of the 1st element of the first u vector.

eliocamp · 2018-03-26T19:04:17Z

Hey, I'm sorry to keep harping about this issue, but I found another source of inconsistency. The same data represented in slightly different way returns different signs even after setting the seed. Here's a reproducible example.

sign_test <- function(x, method) {
    # Synthetic data representing an scalar field in the
    # southern hemisphere in a "tidy" format
    data <- expand.grid(lon = seq(0, 360, by = 2),
                        lat = seq(-90, 0, by = 2),
                        date = 1:10)
    set.seed(x)
    data$z <- rnorm(nrow(data))
    
    # The same data represented as a matrix differing only  
    # in the order of the columns
    matrix1 <- as.matrix(reshape2::dcast(data, date ~ lon + lat, value.var = "z"))
    matrix2 <- as.matrix(reshape2::dcast(data, date ~ lat + lon, value.var = "z"))
    
    # Base svd has no problem 
    if (method == "base") {
        r1 <- base::svd(matrix1, nv = 1, nu = 1)$v
        r2 <- base::svd(matrix2, nv = 1, nu = 1)$v
    } else {
        set.seed(42)
        r1 <- irlba::irlba(matrix1, nu = 1, nv = 1)$v
        set.seed(42)
        r2 <- irlba::irlba(matrix2, nu = 1, nv = 1)$v
    }
    sign(r1[1]) == sign(r2[1]) 
}

N <- 20
sapply(1:N, sign_test, method = "base")    # all TRUE! :)
#>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [15] TRUE TRUE TRUE TRUE TRUE TRUE
sapply(1:N, sign_test, method = "irlba")   # some FALSE! :(
#>  [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
#> [12] FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE

bwlewis · 2018-04-10T22:23:35Z

Thanks, this is in fact a different, more serious, problem than the sign issue above--~~it's a bug!~~ EDIT: Whoops -- not a bug after all...see next long comment,

bwlewis · 2018-04-17T03:25:04Z

Sorry, this was not a bug, but it does show another example of the inconsistent sign issue that you already surmised abouve.

You'll be happy that a partial fix for the inconsistent sign issue is ready, with a full version to follow soon (more on that below). But first, some notes on your example above.

I had to think a bit about it. I don't think that the example illustrates things as clearly as it could, hence my initial confusion, because r1$v and r2$v should not really expected to exhibit the same signs. However, if P is a column permutaion matrix, then matrix2 = matrix1 %*% P = r1$u %*% diag(d) %*% t(r1$v) %*% P = r1$u %*% diag(d) %*% t(t(P) %*% r1$v). That means we expect r2$v = t(P) %*% r1$v in general (they are permuted versions of each other).

Here is a restatement of your example that builds a matrix with many more columns than rows using an explicitly known SVD. We then compute the SVD using R's version and irlba. It's really the same as your example, without the lat/long stuff and just focusing in on the specific issue...

m = 10
n = 1000
set.seed(1)
# explicitly construct a matrix x1 with known SVD
u = qr.Q(qr(matrix(rnorm(m * m), m)))
v = qr.Q(qr(matrix(rnorm(m * n), n)))
d = seq(m, 1)
x1 = u %*% (d * t(v))

# Now, permute its columns
i  = sample(n)
P  = diag(1, n, n)[, i]  # permutation matrix
x2 = x1 %*% P
#  = u %*% (d * t(v)) %*% P
#  = u %*% (d * t(t(P) %*% v))   # an explicit SVD of x2
# this shows that svd(x2)$v should equal t(P) %*% v.

s2 = svd(x2)

So at this point, s2$v should equal t(P) %*% v, right? But not so! We see that the non-uniqueness of the SVD affects R's native svd sometimes too:

round(apply(s2$v - t(P) %*% v, 2, crossprod))
# [1] 0 0 0 0 4 0 0 0 0 4

However, at least R's native svd function is self-consistent with respect to sign:

s1 = svd(x1)
round(apply(s2$v - t(P) %*% s1$v, 2, crossprod))
# [1] 0 0 0 0 0 0 0 0 0 0

This is really the crux of your issue report here in issue #24 -- irlba is unfortunately not even self-consistent :( shown next:

library(irlba)
set.seed(1)
l1 = irlba(x1, 3)
set.seed(1)
l2 = irlba(x2, 3)
round(apply(l2$v - t(P) %*% l1$v, 2, crossprod))
# [1] 0 4 0

I agree that this is somewhat unsatisfactory, even though technically still correct.

After thinking about this a lot, I am working on a surprisingly simple fix, that unfortunately is less simple than it seems and breaks some tests at the moment. I added a (for now, hidden) option to set the internally used RNG in the function. If we choose an RNG with uniform sign, then your problems above are mitigated:

set.seed(1)
l1u = irlba(x1, 3, rng=runif)
set.seed(1)
l2u = irlba(x2, 3, rng=runif)
round(apply(l2u$v - t(P) %*% l1u$v, 2, crossprod))
# [1] 0 0 0

Cazart! So this is a first draft/experimental of a fix for this issue. Unfortunately, it's not yet finished and is breaking a few things here and there. Try it out (it will be in the GitHub repository presently). For now, it's a hidden option with no documentation, etc.

I am interested in thoughts on this approach, any help appreciated, including more of your great examples which have helped so far.

Best,

Bryan

tests are breaking though, something not quite right yet...(in process)

bwlewis · 2018-04-17T04:08:38Z

Sorry, the test failure is due to an unrelated issue that I forgot about.

I think this new rng=runif option will fix your issue. If you agree, we can make it an official option with doc.

eliocamp · 2018-04-24T14:43:20Z

I sat down willing to test this and I now I understand your explanation of why my previous code was not correct. I was testing the sign of the first element of v, but that element does not represent the same point in space when the matrix is permuted (I come from the physical sciences, so my background in matrix algebra is rather shallow and I needed the physical dimension to understand things). After running proper tests, I see now that it is not a problem. Silly me 🤦‍♂️ .

If I understand your fix correctly, it fixes the issue in my first comment without the need to specify the seed, is that correct? That's wonderful for reproducibility. While setting the seed is always a good idea, it's not ideal because the result is still dependent on the chosen seed. Is there any drawback of using runif instead of rnorm? If not, maybe it's worth changing it to the default value of rng?

Works like a charm

data("volcano")
sapply(1:10, FUN = function(x) sign(irlba::irlba(volcano, nv = 1, rng = runif)$u[1]))
#>  [1] 1 1 1 1 1 1 1 1 1 1

Even across seeds. 🎆

sapply(1:10, FUN = function(x) {
    set.seed(x) 
    sign(irlba::irlba(volcano, nv = 1, rng = runif)$u[1])})
#>  [1] 1 1 1 1 1 1 1 1 1 1

The only potential issue is that the sign is not consistent with base::svd

sign(svd(volcano, nv = 1)$u[1]) == sign(irlba::irlba(volcano, nv = 1, rng = runif)$u[1])
#> [1] FALSE

In this case, it's solved by changing the sign of the rng.

sign(svd(volcano, nv = 1)$u[1]) == sign(irlba::irlba(volcano, nv = 1, rng = function(x) -runif(x))$u[1])
#> [1] TRUE

But it seems to be data-dependent :(

sign_test <- function(x) {
    # Synthetic data representing an scalar field in the
    # southern hemisphere in a "tidy" format
    data <- expand.grid(lon = seq(0, 360, by = 2),
                        lat = seq(-90, 0, by = 2),
                        date = 1:10)
    set.seed(x)
    data$z <- rnorm(nrow(data))
    matrix1 <- as.matrix(reshape2::dcast(data, date ~ lon + lat, value.var = "z"))
    
    r1 <- base::svd(matrix1, nv = 1, nu = 1)$u
    r2 <- irlba::irlba(matrix1, nu = 1, nv = 1, rng = function(x) -runif(x))$u
    sign(r1[1]) == sign(r2[1]) 
}

N <- 20
sapply(1:N, sign_test) 
#>  [1]  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
#> [12]  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE

I don't know if the above inconsistency can be considered a problem or not. It seems to me that different results returned from different algorithms is a natural and unavoidable consequence of the non-uniqueness of SVD. If you think this is not worth changing, then I think this issue is safely closed. Thanks!

bwlewis added a commit that referenced this issue Apr 17, 2018

Experimental partial fix for issue #24

7b40564

tests are breaking though, something not quite right yet...(in process)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consistency in sign. #24

Consistency in sign. #24

eliocamp commented Oct 4, 2017

bwlewis commented Oct 6, 2017

eliocamp commented Mar 26, 2018

bwlewis commented Apr 10, 2018 •

edited

bwlewis commented Apr 17, 2018

bwlewis commented Apr 17, 2018

eliocamp commented Apr 24, 2018 •

edited

Consistency in sign. #24

Consistency in sign. #24

Comments

eliocamp commented Oct 4, 2017

bwlewis commented Oct 6, 2017

eliocamp commented Mar 26, 2018

bwlewis commented Apr 10, 2018 • edited

bwlewis commented Apr 17, 2018

bwlewis commented Apr 17, 2018

eliocamp commented Apr 24, 2018 • edited

bwlewis commented Apr 10, 2018 •

edited

eliocamp commented Apr 24, 2018 •

edited