Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor puzzling loop in attributes.R #1327

Open
maelle opened this issue Mar 26, 2024 · 20 comments · May be fixed by #1330
Open

refactor puzzling loop in attributes.R #1327

maelle opened this issue Mar 26, 2024 · 20 comments · May be fixed by #1330
Assignees
Labels
upkeep maintenance, infrastructure, and similar

Comments

@maelle
Copy link
Contributor

maelle commented Mar 26, 2024

rigraph/R/attributes.R

Lines 557 to 564 in 36ad3c9

for (i in seq_along(value)) {
tmp <- value[[i]]
length(tmp) <- 0
length(tmp) <- length(vs)
tmp[index] <- value[[i]]
value[[i]] <- tmp
}
}

@maelle maelle added the upkeep maintenance, infrastructure, and similar label Mar 26, 2024
@maelle
Copy link
Contributor Author

maelle commented Mar 26, 2024

L559 can be deleted, the rest is more difficult.

@maelle
Copy link
Contributor Author

maelle commented Mar 26, 2024

value is a named list.

the code above is for the case where we don't have as many values as vertices. we create a list as long as the number of vertices. then the indices are used to fill the list elements. the other elements remain NULL.

surely there's an elegant way to do this?!

@maelle
Copy link
Contributor Author

maelle commented Mar 26, 2024

elegant and readable

@maelle
Copy link
Contributor Author

maelle commented Mar 29, 2024

first fill with NA, then make NA NULL? 🤔 no, they are NA in the current version.

@maelle
Copy link
Contributor Author

maelle commented Mar 29, 2024

a good thing is that the lines are covered by tests

@maelle maelle closed this as not planned Won't fix, can't repro, duplicate, stale Mar 29, 2024
@maelle maelle reopened this Mar 29, 2024
@maelle
Copy link
Contributor Author

maelle commented Mar 29, 2024

first make the loop easier then use purrr::reduce?

@maelle
Copy link
Contributor Author

maelle commented Mar 29, 2024

      tmp <- rep(NA, length(vs))
      tmp[index] <- value[[i]]
      value[[i]] <- tmp

@maelle
Copy link
Contributor Author

maelle commented Mar 29, 2024

oooh for value a vector, we get NA as fillers but for value a named list, we get NULL as fillers.

@clpippel
Copy link
Contributor

clpippel commented Apr 2, 2024

My attempt to isolate the essence of the coding:

vs           <- c(1,2,3,4)
index        <- c(1,4)
value        <- list(list(a= 10, d = 14))
names(value) <- "color"

i            <- 1
tmp          <- value[[i]]         # Copy structure of value.
length(tmp)  <- 0                  # Set list to empty, while removing names.
length(tmp)  <- length(vs)         # Extend to all vertices with NULL.
tmp[index]   <- value[[i]]         # Copy indexed elements,
value[[1]]   <- tmp                # into value.
names(value[[i]])

The effect of length(tmp) <- 0 is to remove the names initially present in value.
When leaving out the statement the names in value in this example are: [1] "a" "d" "" "" .

In this example using the old logic the expression
>value[[1]]["a"] will give

$<NA>
NULL

thereby blocking indexing by name. In the new logic the result will be:

$a
[1] 10

Removing the names could be a matter of defensive coding.
All the code need is some explanation.

@clpippel
Copy link
Contributor

clpippel commented Apr 3, 2024

Create an empty named list.

L558, L559 are equivalent to:
tmp <- setNames(list(), character(0)) # Easy to read.
  or alternatively
tmp <- list(dummy = 1)[NULL] # Faster.
  or
tmp <- value[[i]][NULL] # As current coding.

To replace lines L558, L559, L560 by a single line:
tmp <- 'length<-'(list(dummy = 1)[NULL], length(vs))

@maelle
Copy link
Contributor Author

maelle commented Apr 4, 2024

thanks! @krlmlr helped me shorten the code, I'll soon finish up the related PR.

@clpippel
Copy link
Contributor

clpippel commented Apr 7, 2024

My thoughts.
The proposed solution is indeed concise. It uses the match() function, which can be costly if the graph has many vertices.
A similar solution can be achieved in base R by lapply() instead of map().
A base R solution using indexing by a vector is more verbose, but faster.

# -------------------------------
library(purrr)
library(microbenchmark)
n <- 1E5; m <- 1E2;
vs                <- seq(n)
index             <- seq(m)
names(index)      <- index
value             <- list(rev = as.list(rev(index)), nchar = as.list(nchar(index)) )

microbenchmark(
"This"  = { value_nw1 <- value
           for (i in seq_along(value)) { 
             tmp            <- value[[i]] 
             length(tmp)    <- 0 
             length(tmp)    <- length(vs) 
             tmp[index]     <- value[[i]] 
             value_nw1[[i]] <- tmp 
           } 
          },
"Purrr"  = { value_nw2 <- purrr::map(value, ~.x[match(seq_along(vs), index)]) },
"Rmatch" = { value_nw3 <- lapply(value, function(x) { x[match(seq_along(vs), index)]} ) },
"Rbase"  = { value_nw4 <-
  lapply(
    value,
    function(x)
      { tmp <- x[NULL]; length(tmp) <- length(vs); tmp[index] <- x; tmp}
  )
},
times= 100,
unit="relative"
)
# Unit: relative
#    expr      min       lq     mean   median       uq       max neval
#    This 9.187689 6.649874 4.097758 6.472774 5.603021 1.0953775   100
#   Purrr 4.552082 3.618245 2.618367 3.620971 3.551356 0.9894297   100
#  Rmatch 3.655005 3.473635 2.954367 3.458746 3.226840 1.0147469   100
#   Rbase 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000   100
identical(value_nw2, value_nw3)
# [1] TRUE
identical(value_nw1, value_nw4)
# [1] TRUE

@maelle
Copy link
Contributor Author

maelle commented Apr 9, 2024

note to self: use rlang's "purrr" standalone file

@krlmlr
Copy link
Contributor

krlmlr commented Apr 9, 2024

Thanks, good catch. Using match() indeed causes an overhead. Let's go with lapply() (or map() from https://github.com/r-lib/rlang/blob/main/R/standalone-purrr.R) plus the Rbase variant plus comments why we chose this solution: don't want to slow down code that is used for many operations.

For continuous benchmarking, https://github.com/lorenzwalthert/touchstone does a decent job, but we'd need a good set of fast test cases -- can be small initially and grow over time.

@krlmlr
Copy link
Contributor

krlmlr commented Apr 9, 2024

For reference, my own analysis, with two other variants.

n <- 1E5
m <- 1E2
vs <- seq(n)
index <- seq(m)
names(index) <- index
value <- list(rev = as.list(rev(index)), nchar = as.list(nchar(index)))

set_value_at <- function(value, idx, length_out) {
  out <- value[NULL]
  length(out) <- length_out
  out[idx] <- value
  unname(out)
}

bench::mark(
  "This" = {
    value_nw1 <- value
    for (i in seq_along(value)) {
      tmp <- value[[i]]
      length(tmp) <- 0
      length(tmp) <- length(vs)
      tmp[index] <- value[[i]]
      value_nw1[[i]] <- tmp
    }
    lapply(value_nw1, unname)
  },
  "Rmatch" = {
    value_nw2 <- lapply(value, function(x) {
      unname(x[match(seq_along(vs), index)])
    })
  },
  "vec_match" = {
    value_nw2 <- lapply(value, function(x) {
      unname(x[vctrs::vec_match(seq_along(vs), index)])
    })
  },
  "value_at" = {
    value_nw3 <- lapply(value, function(x) {
      set_value_at(x, index, length(vs))
    })
  },
  "Rbase" = {
    value_nw4 <- lapply(
      value,
      function(x) {
        tmp <- x[NULL]
        length(tmp) <- length(vs)
        tmp[index] <- x
        tmp
      }
    )
    lapply(value_nw4, unname)
  }
)
#> # A tibble: 5 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 This         2.79ms   3.01ms      316.    8.79MB     68.1
#> 2 Rmatch       1.48ms   1.76ms      543.    4.42MB    125. 
#> 3 vec_match    2.36ms   2.62ms      321.    4.99MB     72.9
#> 4 value_at   733.41µs   1.12ms      681.    4.61MB    176. 
#> 5 Rbase         774µs   1.15ms      759.    4.59MB    206.

Created on 2024-04-09 with reprex v2.1.0

@krlmlr
Copy link
Contributor

krlmlr commented Apr 9, 2024

Looking at the reprex results, the self-contained function doesn't seem to be that bad. With a larger input:

n <- 1E6
m <- 1E3
vs <- seq(n)
index <- seq(m)
names(index) <- index
value <- list(rev = as.list(rev(index)), nchar = as.list(nchar(index)))

set_value_at <- function(value, idx, length_out) {
  out <- value[NULL]
  length(out) <- length_out
  out[idx] <- value
  unname(out)
}

bench::mark(
  "This" = {
    value_nw1 <- value
    for (i in seq_along(value)) {
      tmp <- value[[i]]
      length(tmp) <- 0
      length(tmp) <- length(vs)
      tmp[index] <- value[[i]]
      value_nw1[[i]] <- tmp
    }
    lapply(value_nw1, unname)
  },
  "Rmatch" = {
    value_nw2 <- lapply(value, function(x) {
      unname(x[match(seq_along(vs), index)])
    })
  },
  "vec_match" = {
    value_nw2 <- lapply(value, function(x) {
      unname(x[vctrs::vec_match(seq_along(vs), index)])
    })
  },
  "value_at" = {
    value_nw3 <- lapply(value, function(x) {
      set_value_at(x, index, length(vs))
    })
  },
  "Rbase" = {
    value_nw4 <- lapply(
      value,
      function(x) {
        tmp <- x[NULL]
        length(tmp) <- length(vs)
        tmp[index] <- x
        tmp
      }
    )
    lapply(value_nw4, unname)
  }
)
#> # A tibble: 5 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 This        17.47ms   18.5ms      51.2      50MB     79.6
#> 2 Rmatch      21.03ms   22.5ms      44.3    42.2MB     44.3
#> 3 vec_match   29.86ms   33.8ms      29.9    49.6MB     39.8
#> 4 value_at     7.85ms    9.5ms     104.     45.8MB    173. 
#> 5 Rbase        9.34ms   11.5ms      89.6    45.8MB    122.

Created on 2024-04-09 with reprex v2.1.0

@clpippel
Copy link
Contributor

clpippel commented Apr 9, 2024

If no named list is needed, set_value_at() can be simplified to

set_value_at2 <- function(value, idx, length_out) {
  out      <- vector(mode='list', length = length_out)
  out[idx] <- value
  out
}

@clpippel
Copy link
Contributor

clpippel commented Apr 9, 2024

Another idea: Using magrittr s pipe %>%, assuming dependency is not a drawback.
Update: not faster.

#-------------------------------------------------------------------------------
library(purrr)
library(bench)
 
n <- 1E6
m <- 1E3
vs <- seq(n)
index <- seq(m)
names(index) <- index
value <- list(rev = as.list(rev(index)), nchar = as.list(nchar(index)))

set_value_at <- function(value, idx, length_out) {
  out <- value[NULL]
  length(out) <- length_out
  out[idx] <- value
  unname(out)
}

bench::mark(
  "This" = {
    value_nw1 <- value
    for (i in seq_along(value)) {
      tmp <- value[[i]]
      length(tmp) <- 0
      length(tmp) <- length(vs)
      tmp[index] <- value[[i]]
      value_nw1[[i]] <- tmp
    }
    lapply(value_nw1, unname)
  },
  "Rmatch" = {
    value_nw2a <- lapply(value, function(x) {
      unname(x[match(seq_along(vs), index)])
    })
  },
  "vec_match" = {
    value_nw2b <- lapply(value, function(x) {
      unname(x[vctrs::vec_match(seq_along(vs), index)])
    })
  },
  "value_at" = {
    value_nw3a <- lapply(value, function(x) {
      set_value_at(x, index, length(vs))
    })
  },
  "pipe_at" = {
    value_nw3b <- lapply(value, function(x) { 
      vector(mode='list', length = length(vs)) %>% {.[index] <- x; .}
      # set_value_at(x, index, length(vs))
    })
  },
  "Rbase" = {
    value_nw4 <- lapply(
      value,
      function(x) {
        tmp <- x[NULL]
        length(tmp) <- length(vs)
        tmp[index] <- x
        tmp
      }
    )
    lapply(value_nw4, unname)
  }
)
# n = 1E6, m = 1E3
# A tibble: 6 × 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result           memory              time             gc                
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>           <list>              <list>           <list>            
# 1 This         23.9ms  24.37ms      40.1    45.8MB     45.8     7     8      175ms <named list [2]> <Rprofmem [91 × 3]> <bench_tm [15]>  <tibble [15 × 3]> 
# 2 Rmatch       22.6ms  22.75ms      43.8      42MB     73.0     6    10      137ms <named list [2]> <Rprofmem [49 × 3]> <bench_tm [16]>  <tibble [16 × 3]> 
# 3 vec_match    44.5ms  45.06ms      22.1    49.6MB     14.7     6     4      271ms <named list [2]> <Rprofmem [48 × 3]> <bench_tm [10]>  <tibble [10 × 3]> 
# 4 value_at     13.3ms   13.6ms      70.7    45.8MB     70.7    12    12      170ms <named list [2]> <Rprofmem [76 × 3]> <bench_tm [24]>  <tibble [24 × 3]> 
# 5 pipe_at       1.8ms   1.93ms     502.     15.3MB     82.9   115    19      229ms <named list [2]> <Rprofmem [42 × 3]> <bench_tm [134]> <tibble [134 × 3]>
# 6 Rbase          13ms   13.5ms      73.8    45.8MB     56.8    13    10      176ms <named list [2]> <Rprofmem [51 × 3]> <bench_tm [23]>  <tibble [23 × 3]> 

@clpippel
Copy link
Contributor

Magrittr s pipe %>% is not faster as I thought before.
Using no names seems to be the difference.

library(purrr)
library(bench)
 
n <- 1E6; m <- 1E3
vs <- seq(n)
index <- seq(m)
names(index) <- index
value <- list(rev = as.list(rev(index)), nchar = as.list(nchar(index)))

set_value_at <- function(value, idx, length_out) {
  out <- value[NULL]
  length(out) <- length_out
  out[idx] <- value
  unname(out)
}

set_value_at2 <- function(value, idx, length_out) {
  out      <- vector(mode='list', length = length_out)
  out[idx] <- value
  out
}

bench::mark(
  "at_unname()" = {
    value_nw3a <- lapply(value, function(x) {
      set_value_at(x, index, length(vs))
    })
  },
  "at_noname()" = {
    value_nw3a <- lapply(value, function(x) {
      set_value_at2(x, index, length(vs))
    })
  },
  "at_noname" = {
    value_nw3a <- lapply(value, function(x) {
      tmp <- vector(mode='list', length = length(vs)); tmp[index] <- x; tmp
    })
  },
  "at_pipe" = {
    value_nw3b <- lapply(value, function(x) { 
      vector(mode='list', length = length(vs)) %>% {.[index] <- x; .}
    })
  }
)

# A tibble: 4 × 13
#   expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#   <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
# 1 at_unname… 13.31ms  13.7ms      73.3    45.8MB     178.     7    17     95.5ms
# 2 at_noname…  1.79ms  2.68ms     393.     15.3MB     174.    68    30    172.9ms
# 3 at_noname   1.79ms  2.98ms     365.     15.3MB     176.    60    29    164.6ms
# 4 at_pipe     1.81ms  2.87ms     396.     15.3MB     219.    56    31    141.5ms
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>

@maelle
Copy link
Contributor Author

maelle commented May 7, 2024

see current state of #1330

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
upkeep maintenance, infrastructure, and similar
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants