multicore options #226

dfalster · 2024-05-06T03:21:42Z

hi! @wcornwell was wondering about multicore options.

Rather than making the package itself handle multicore processing, which is hard, it's relatively easy to put a wrapper around the package functions and thereby run them on multiple cores. The code below demonstrates this. It leads to a speed-up from 8.511s to 3.3s on a list of 226 names.

I had most success with the parallel package, but this won't work on Windows, since mclapply relies on forking.

I had little success with futures (via furrr), which should work across platforms.

WDYT @wcornwell ? There's less need for this now that we've sped things up a lot. This demo could be added as a vignette for the limited number of possible users who need this?

library(APCalign)
library(dplyr)
library(purrr)
library(furrr)

resources <- load_taxonomic_resources()

taxon_list <-
  "https://raw.githubusercontent.com/traitecoevo/APCalign/develop/tests/testthat/benchmarks/test_matches_alignments_updates.csv" %>%
  # "tests/testthat/benchmarks/test_matches_alignments_updates.csv" %>%
  readr::read_csv(show_col_types = FALSE) %>%
  # assign each row to one of 10 chunks, used by split(~id) below
  mutate(id = rep_len(1:10, n()))
  
f <- function(taxon_list) create_taxonomic_update_lookup(
     taxa = taxon_list$original_name,
     resources = resources,
     quiet = TRUE
   )

# Single CPU - 8.511s (on Daniel's MBP)
system.time(
  out <- f(taxon_list)
)

# using map (slower) - 13.9s
system.time(
  out1 <- 
    taxon_list %>%
    split(~id) %>%
    map_dfr(f)
)

# parallel package (fast) - 3.34s
system.time(
  out2 <- 
    taxon_list %>%
    split(~id) %>%
    parallel::mclapply(f, mc.cores = 6) %>%
    bind_rows()
)

# futures (not so fast) - 12.45s
# https://future.futureverse.org/#controlling-how-futures-are-resolved

options(mc.cores = 6)
plan(multisession, workers = 6)
system.time(
  out3 <- 
    taxon_list %>%
    split(~id) %>%
    future_map_dfr(f)
)
# shut down the background workers again
plan(sequential)
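
Since mclapply relies on forking, a wrapper that also runs on Windows would need a different backend there. Below is a minimal sketch of that idea using only the base parallel package. The helper name `chunked_lapply` and its arguments are hypothetical (not part of APCalign), and `toupper` is just a toy stand-in for the real worker function.

```r
library(parallel)

# Hypothetical helper (not part of APCalign): apply `f` to chunks of `x`
# in parallel and return the list of per-chunk results.
chunked_lapply <- function(x, f, n_chunks = 4, cores = 2) {
  # Assign elements to roughly equal chunks, keeping the original order
  chunk_id <- sort(rep_len(seq_len(n_chunks), length(x)))
  chunks <- split(x, chunk_id)

  if (.Platform$OS.type == "windows") {
    # No forking on Windows: start separate worker processes instead
    cl <- makeCluster(cores)
    on.exit(stopCluster(cl), add = TRUE)
    parLapply(cl, chunks, f)
  } else {
    # Fork-based workers on macOS and Linux
    mclapply(chunks, f, mc.cores = cores)
  }
}

# Toy stand-in for the real worker function
res <- chunked_lapply(c("b", "a", "d", "c", "e"), toupper, n_chunks = 2)
unlist(res, use.names = FALSE)
```

One caveat with the Windows branch: the worker processes start with empty sessions, so a function like `f` above, which relies on the global `resources` object, would presumably also need that object sent to the workers (for example with `parallel::clusterExport()`) and the package loaded on each of them.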


wcornwell · 2024-05-06T04:38:26Z

Yeah, a bit complicated. I have something very similar implemented here: https://github.com/traitecoevo/APCalign/blob/multicore/R/multicore.R

My impression is that multicore is a bit in flux for the open source community. The older parallel-based stuff currently works on Mac and Linux but may stop working on Mac in the near future because of the direction Apple's M1/M2 chips (i.e. iPhone-style chips) are going.

There is a "new" open source way of doing parallel processing on Linux called OpenMP that's apparently a lot more efficient on newer chips. Windows 11 decided to support OpenMP, but Mac currently doesn't out of the box, since Apple's compiler toolchain ships without it.

So the newest R packages are all going that way, including the stringdist function that we're using. So we've already got the new way working in our package, but that doesn't actually help on Mac currently. But we should already be blindingly fast on a Linux machine. So yeah complicated.

wcornwell · 2024-05-06T04:41:24Z

more info: https://en.wikipedia.org/wiki/OpenMP

and

https://search.r-project.org/CRAN/refmans/stringdist/html/stringdist-parallelization.html
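
For anyone wanting to try the OpenMP route from R, here is a minimal sketch of how the thread count is controlled in stringdist. The distance functions take an `nthread` argument whose default is read from the `sd_num_thread` option. The example names and the choice of `method = "dl"` are assumptions for illustration only and are not necessarily what APCalign uses internally.

```r
library(stringdist)

# Illustrative names only; the third query has a deliberate typo
queries   <- c("Eucalyptus globulus", "Acacia dealbata", "Banksia serata")
reference <- c("Eucalyptus globulus", "Acacia dealbata", "Banksia serrata")

# Default thread count that stringdist picked when it was loaded
getOption("sd_num_thread")

# Same computation with one thread and with two threads
d1 <- stringdistmatrix(queries, reference, method = "dl", nthread = 1)
d2 <- stringdistmatrix(queries, reference, method = "dl", nthread = 2)

# The results should not depend on the number of threads
identical(d1, d2)

# Closest reference name for each query
reference[apply(d1, 1, which.min)]
```

On a build without OpenMP support the `nthread` setting should simply have no effect on speed, so timing the two calls on a large input is one way to check whether a given machine benefits.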

wcornwell · 2024-05-06T04:55:08Z

I think the key advance is the "shared memory" part: https://en.wikipedia.org/wiki/Shared_memory

Which is also key for machine learning computing

dfalster · 2024-05-06T21:44:08Z

Hi @wcornwell, That's very interesting. I didn't know about OpenMP, sounds great. (nice slide deck here: https://www.bu.edu/tech/files/2017/09/OpenMP_2017Fall.pdf). I looked into the code and it's surprisingly easy to implement. I could easily use this in plant for a big speed up.

BUT!!!! The tension with osx is discouraging. (Some info here: https://mac.r-project.org/openmp/ -> "Warning! Everything described on this page is strictly experimental and not officially supported by CRAN, R-core or R Foundation. It may break at any time. The information is provided in the hope of being useful to some tech-savvy people. It is not intended for the regular R user.")

So yes, let's leave this for now. Code is here in the issue for anyone who wants it.

wcornwell · 2024-05-06T23:37:02Z

A bit worrying about MacOS for scientific computing going forward...maybe should think more about Linux...

wcornwell · 2024-05-29T02:56:31Z

closing for now

wcornwell closed this as completed May 29, 2024
