
read_tsv() unexpectedly slow for tables with a large number of columns #1538

Open
khughitt opened this issue Apr 13, 2024 · 1 comment

Greetings!

I noticed some unexpectedly slow performance for readr when trying to load a table with few rows but many columns.

I found some earlier issues about slow read times, but those seemed to stem from different causes.

I thought it might be an issue with the column type guessing, but the performance is similar even when the column types are specified explicitly.

I did not check to see how the performance scales with the number of rows, but the code snippet could be modified to check this.
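As a rough sketch (my addition, not from the original report), the repro code below could be modified to check row scaling like this; the sizes here are smaller than in the benchmark, since each readr call at the full 20,000 columns takes on the order of a minute:

```r
# Hypothetical row-scaling check: fix the column count and time
# read_tsv() for increasing row counts. n is reduced here so the
# sketch runs in a reasonable time.
library(readr)

set.seed(1)
n <- 5000

for (m in c(10, 40, 160)) {
  dat <- matrix(rnorm(m * n), m, n)
  write.table(dat, file = "scale.tsv", row.names = FALSE, sep = "\t")
  t0 <- Sys.time()
  invisible(readr::read_tsv("scale.tsv", show_col_types = FALSE))
  print(Sys.time() - t0)
}
```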

Timings

Representative times it took to load a 40 x 20,000 table of random floats:

  • read.delim: 13.7s
  • read_tsv: 48.7s
  • read_tsv + coltypes: 49.2s

For comparison, the corresponding times for pandas/CSV.jl are:

  • pd.read_csv(): 0.68s
  • DataFrame(CSV.File()): 0.93s

These are just rough estimates intended to give a sense of the scale of performance discrepancies.

Transposing the data results in a ~100x speed-up for this particular example (0.39s).

To reproduce:

library(readr)

set.seed(1)

# create test data
m <- 40
n <- 20000

dat <- matrix(rnorm(m * n), m, n)
write.table(dat, file="test.tsv", row.names=FALSE, sep="\t")
write.table(t(dat), file="test2.tsv", row.names=FALSE, sep="\t")

# 1) read.delim
t0 <- Sys.time()
dat <- read.delim("test.tsv", sep="\t")
t1 <- Sys.time()
t1 - t0

# Time difference of 13.68305 secs

# 2) read_tsv
t0 <- Sys.time()
dat <- readr::read_tsv("test.tsv")
t1 <- Sys.time()
t1 - t0

# Time difference of 48.74033 secs

# 3) read_tsv / coltypes indicated
t0 <- Sys.time()
# col_types expects a compact string spec, so build "ddd...d" with strrep()
dat <- readr::read_tsv("test.tsv", col_types=strrep("d", 20000))
t1 <- Sys.time()
t1 - t0
# Time difference of 49.16811 secs

# 4) transposed version of data
t0 <- Sys.time()
dat <- readr::read_tsv("test2.tsv")
t1 <- Sys.time()
t1 - t0
# Time difference of 0.3927958 secs
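To make the workaround concrete (my addition, not part of the original report), here is a hedged sketch that reads the fast, transposed file and transposes back in memory. It assumes plain numeric data as in the example, since column names are discarded by the conversion to a matrix:

```r
# 5) Hypothetical workaround sketch: read the transposed file quickly,
# then transpose back so the result has the original 40 x 20000 shape.
t0 <- Sys.time()
tmp <- readr::read_tsv("test2.tsv", show_col_types = FALSE)
dat <- t(as.matrix(tmp))
t1 <- Sys.time()
t1 - t0
```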

Session Info:

R version 4.3.3 (2024-02-29)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Arch Linux

Matrix products: default
BLAS/LAPACK: /xx/miniforge3/envs/tmp-readr-bug/lib/libopenblasp-r0.3.27.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: US/Eastern
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] readr_2.1.5

loaded via a namespace (and not attached):
 [1] utf8_1.2.4       R6_2.5.1         tidyselect_1.2.0 bit_4.0.5
 [5] tzdb_0.4.0       magrittr_2.0.3   glue_1.7.0       tibble_3.2.1
 [9] parallel_4.3.3   pkgconfig_2.0.3  bit64_4.0.5      lifecycle_1.0.4
[13] cli_3.6.2        fansi_1.0.6      vctrs_0.6.5      withr_3.0.0
[17] compiler_4.3.3   tools_4.3.3      hms_1.1.3        pillar_1.9.0
[21] crayon_1.5.2     rlang_1.1.3      vroom_1.6.5

Thanks for all of your work on readr!

@khughitt (Author)

Just to be clear... this is definitely an edge case and something I can easily work around; I do not want anyone to spend a ton of time on this! I'm sure you have better things to work on...

While 20k features and p >> n are not uncommon in genomics data, the historical convention there has been to store the data in transposed order (which avoids this issue) or to use a different file format.

I just thought I would report it here to raise awareness, and in case it points to an underlying issue with wider ramifications.
