
read_tsv() unexpectedly slow for tables with a large number of columns #1538

Open
khughitt opened this issue Apr 13, 2024 · 1 comment

Greetings!

I noticed some unexpectedly slow performance for readr when trying to load a table with few rows but many columns.

I found some earlier issues about slow read times, but those seemed to stem from different causes.

I thought it might be an issue with the column type guessing, but the performance is similar even when the column types are specified explicitly.

I did not check to see how the performance scales with the number of rows, but the code snippet could be modified to check this.
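As a rough sketch (my addition, not from the original report), the repro code below could be modified to check row scaling like this; the sizes here are smaller than in the benchmark, since each readr call at the full 20,000 columns takes on the order of a minute:

```r
# Hypothetical row-scaling check: fix the column count and time
# read_tsv() for increasing row counts. n is reduced here so the
# sketch runs in a reasonable time.
library(readr)

set.seed(1)
n <- 5000

for (m in c(10, 40, 160)) {
  dat <- matrix(rnorm(m * n), m, n)
  write.table(dat, file = "scale.tsv", row.names = FALSE, sep = "\t")
  t0 <- Sys.time()
  invisible(readr::read_tsv("scale.tsv", show_col_types = FALSE))
  print(Sys.time() - t0)
}
```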

Timings

Representative times it took to load a 40 x 20,000 table of random floats:

  • read.delim: 13.7s
  • read_tsv: 48.7s
  • read_tsv + coltypes: 49.2s

For comparison, the corresponding times for pandas/CSV.jl are:

  • pd.read_csv(): 0.68s
  • DataFrame(CSV.File()): 0.93s

These are just rough estimates intended to give a sense of the scale of performance discrepancies.

Transposing the data results in a ~100x speed-up for this particular example (0.39s).

To reproduce:

library(readr)

set.seed(1)

# create test data
m <- 40
n <- 20000

dat <- matrix(rnorm(m * n), m, n)
write.table(dat, file="test.tsv", row.names=FALSE, sep="\t")
write.table(t(dat), file="test2.tsv", row.names=FALSE, sep="\t")

# 1) read.delim
t0 <- Sys.time()
dat <- read.delim("test.tsv", sep="\t")
t1 <- Sys.time()
t1 - t0

# Time difference of 13.68305 secs

# 2) read_tsv
t0 <- Sys.time()
dat <- readr::read_tsv("test.tsv")
t1 <- Sys.time()
t1 - t0

# Time difference of 48.74033 secs

# 3) read_tsv / coltypes indicated
t0 <- Sys.time()
# col_types expects a compact string spec, so build "ddd...d" with strrep()
dat <- readr::read_tsv("test.tsv", col_types=strrep("d", 20000))
t1 <- Sys.time()
t1 - t0
# Time difference of 49.16811 secs

# 4) transposed version of data
t0 <- Sys.time()
dat <- readr::read_tsv("test2.tsv")
t1 <- Sys.time()
t1 - t0
# Time difference of 0.3927958 secs
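To make the workaround concrete (my addition, not part of the original report), here is a hedged sketch that reads the fast, transposed file and transposes back in memory. It assumes plain numeric data as in the example, since column names are discarded by the conversion to a matrix:

```r
# 5) Hypothetical workaround sketch: read the transposed file quickly,
# then transpose back so the result has the original 40 x 20000 shape.
t0 <- Sys.time()
tmp <- readr::read_tsv("test2.tsv", show_col_types = FALSE)
dat <- t(as.matrix(tmp))
t1 <- Sys.time()
t1 - t0
```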

Session Info:

R version 4.3.3 (2024-02-29)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Arch Linux

Matrix products: default
BLAS/LAPACK: /xx/miniforge3/envs/tmp-readr-bug/lib/libopenblasp-r0.3.27.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: US/Eastern
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] readr_2.1.5

loaded via a namespace (and not attached):
 [1] utf8_1.2.4       R6_2.5.1         tidyselect_1.2.0 bit_4.0.5
 [5] tzdb_0.4.0       magrittr_2.0.3   glue_1.7.0       tibble_3.2.1
 [9] parallel_4.3.3   pkgconfig_2.0.3  bit64_4.0.5      lifecycle_1.0.4
[13] cli_3.6.2        fansi_1.0.6      vctrs_0.6.5      withr_3.0.0
[17] compiler_4.3.3   tools_4.3.3      hms_1.1.3        pillar_1.9.0
[21] crayon_1.5.2     rlang_1.1.3      vroom_1.6.5

Thanks for all of your work on readr!

@khughitt (Author)

Just to be clear... this is definitely an edge case and something I can easily work around; I do not want anyone to spend a ton of time on this! I'm sure you have better things to work on...

While 20k features and p >> n are not uncommon in genomics data, the historical convention there has been to store the data in transposed order (which avoids this issue) or to use a different file format.

I just thought I would report it here to raise awareness, and in case it points to an underlying issue with wider ramifications.
