multicore options #226

Closed
dfalster opened this issue May 6, 2024 · 6 comments

dfalster commented May 6, 2024

hi! @wcornwell was wondering about multicore options.

Rather than making the package itself handle multicore processing, which is hard, it's relatively easy to put a wrapper around the package functions and run them on multiple cores that way. The code below demonstrates this. It gives a speed-up from 8.511s to 3.34s on a list of 226 names.

I had the most success with the parallel package (mclapply), but this won't work on Windows because it relies on forking.

I had little success with futures (via furrr), which should work across platforms.

WDYT @wcornwell? There's less need for this now that we've sped things up a lot. Could this demo be added as a vignette for the limited number of users who might need it?

library(APCalign)
library(dplyr)
library(purrr)
library(furrr)

resources <- load_taxonomic_resources()

taxon_list <-
  "https://raw.githubusercontent.com/traitecoevo/APCalign/develop/tests/testthat/benchmarks/test_matches_alignments_updates.csv" %>%
  # "tests/testthat/benchmarks/test_matches_alignments_updates.csv" %>%
   readr::read_csv(show_col_types = FALSE) %>%
   # assign each row to one of 10 chunks, cycling through 1..10
   mutate(id = rep(1:10, length.out = n()))
  
f <- function(taxon_list) create_taxonomic_update_lookup(
     taxa = taxon_list$original_name,
     resources = resources,
     quiet = TRUE
   )

# Single CPU - 8.511s (on Daniel's MBP)
system.time(
  out <- f(taxon_list)
)

# using map over the 10 chunks (slower) - 13.9s
system.time(
  out1 <- 
    taxon_list %>%
    split(~id) %>%
    map_dfr(f)
)

# parallel package (fast) - 3.34s
system.time(
  out2 <- 
    taxon_list %>%
    split(~id) %>%
    parallel::mclapply(f,  mc.cores = 6) %>%
    bind_rows()
)

# futures via furrr (not so fast) - 12.45s
# https://future.futureverse.org/#controlling-how-futures-are-resolved

options(mc.cores = 6)
plan(multisession, workers = 6)
# stored as out3 so the mclapply result in out2 is not overwritten
system.time(
  out3 <- 
    taxon_list %>%
    split(~id) %>%
    future_map_dfr(f)
)
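For anyone on Windows, where forking via mclapply is not available, a socket (PSOCK) cluster from the same parallel package is one possible alternative. The following is only a minimal sketch: `f_demo` and `taxa_demo` are hypothetical stand-ins for the wrapper `f` and `taxon_list` above, so that it runs without APCalign or its resources. It has not been timed against the numbers above.

```r
library(parallel)

# Hypothetical stand-in for the wrapper `f` above: takes one chunk
# (a data frame) and returns a data frame with one row per name.
f_demo <- function(chunk) {
  data.frame(
    original_name = chunk$original_name,
    n_char = nchar(chunk$original_name)
  )
}

# Made-up input, already assigned to two chunks via `id`
taxa_demo <- data.frame(
  original_name = c("Acacia aneura", "Banksia serrata",
                    "Eucalyptus regnans", "Hakea sericea"),
  id = c(1, 2, 1, 2)
)

# Socket cluster: starts fresh R sessions, so it also works on Windows
cl <- makeCluster(2)

out_demo <- tryCatch(
  {
    chunks <- split(taxa_demo, taxa_demo$id)
    # run f_demo on each chunk in a worker, then stack the results
    do.call(rbind, parLapply(cl, chunks, f_demo))
  },
  finally = stopCluster(cl)  # always shut the workers down
)
```

With the real `f`, each worker is a fresh R session, so it would presumably also need `clusterEvalQ(cl, library(APCalign))` and `clusterExport(cl, "resources")` before the `parLapply()` call. Copying `resources` to every worker may eat into any speed-up.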

wcornwell commented May 6, 2024

Yeah, a bit complicated. I have something very similar implemented here: https://github.com/traitecoevo/APCalign/blob/multicore/R/multicore.R

My impression is that multicore is a bit in flux for the open source community. The older parallel-based stuff currently works on Mac and Linux but may stop working on Mac in the near future because of the direction Apple's M1/M2 chips (i.e. the same ARM-based design as iPhone chips) are going.

There is a "new" open source way of doing parallel processing on Linux called OpenMP that's apparently a lot more efficient on newer chips. Windows 11 decided to support OpenMP, but Mac doesn't currently, because Apple's default compiler toolchain doesn't ship with it.

So the newest R packages are all going that way, including the stringdist package that we're using. So we've already got the new way working in our package, but that doesn't actually help on Mac currently. But we should already be blindingly fast on a Linux machine. So yeah, complicated.
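As a small illustration of the OpenMP point: stringdist exposes its thread count through the `nthread` argument. The sketch below only shows stringdist itself; whether and how APCalign passes `nthread` through is not shown in this thread, so treat that link as an assumption. The names being compared are made up.

```r
library(stringdist)

query <- "Acacia aneura"
candidates <- c("Acacia anuera", "Banksia serrata", "Acacia aneura")

# Same distance calculation with one thread and with two threads.
# On builds without OpenMP the extra threads are simply not used.
d_single <- stringdist(query, candidates, method = "dl", nthread = 1)
d_multi  <- stringdist(query, candidates, method = "dl", nthread = 2)

# The thread count should change only the speed, not the distances
identical(d_single, d_multi)
```

On a handful of strings like this there is nothing to gain from extra threads; any benefit would only show up on long vectors of names.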

wcornwell commented May 6, 2024

I think the key advance is the "shared memory" part (https://en.wikipedia.org/wiki/Shared_memory), which is also key for machine learning computing.

dfalster commented May 6, 2024

Hi @wcornwell, That's very interesting. I didn't know about OpenMP, sounds great. (nice slide deck here: https://www.bu.edu/tech/files/2017/09/OpenMP_2017Fall.pdf). I looked into the code and it's surprisingly easy to implement. I could easily use this in plant for a big speed up.

BUT!!!! The tension with macOS is discouraging. (Some info here: https://mac.r-project.org/openmp/ -> "Warning! Everything described on this page is strictly experimental and not officially supported by CRAN, R-core or R Foundation. It may break at any time. The information is provided in the hope of being useful to some tech-savvy people. It is not intended for the regular R user.")

So yes, let's leave this for now. Code is here in the issue for anyone who wants it.

wcornwell commented May 6, 2024

A bit worrying for macOS as a scientific computing platform going forward... maybe we should think more about Linux...

@wcornwell

closing for now
