This repository has been archived by the owner on Aug 12, 2020. It is now read-only.

union_all() fills up disk when passed many tables #197

Open
nalimilan opened this issue Oct 24, 2017 · 9 comments

@nalimilan

nalimilan commented Oct 24, 2017

I need to concatenate about 160 tables stored in a MonetDBLite database into a single table. I used to do this with the SQLite backend to dplyr, like this: Reduce(union_all, tables_list). Unfortunately, this doesn't work with MonetDBLite: the database grows from 3.5GB to 70GB, and the merge fails once there is no free disk space left.
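For context, the failing approach looks roughly like this (just a sketch; the database path and table names are hypothetical, and indl stands for the list of lazy table references used below):

library(dplyr)
library(MonetDBLite)

db <- src_monetdblite("/path/to/db")                  # hypothetical path
indl <- lapply(paste0("indiv_", 1:160),               # hypothetical table names
               function(nm) tbl(db, nm))              # lazy references, no data pulled yet
indiv <- compute(Reduce(union_all, indl))             # this single step fills up the disk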

I've eventually found a workaround by splitting the operation in smaller parts:

# Materialize each chunk of ~50 unions into its own table, then union the four parts
indiv1 <- compute(Reduce(union_all, indl[1:50]))
indiv2 <- compute(Reduce(union_all, indl[51:100]))
indiv3 <- compute(Reduce(union_all, indl[101:150]))
indiv4 <- compute(Reduce(union_all, indl[151:length(indl)]))
indiv <- Reduce(union_all, list(indiv1, indiv2, indiv3, indiv4))

In the end, the database only takes 8.5GB (including original small tables and the big concatenated table).
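The same workaround in a more general form, splitting the list into fixed-size chunks (a sketch; the chunk size of 50 is arbitrary):

chunks <- split(indl, ceiling(seq_along(indl) / 50))                  # batches of at most 50 tables
parts  <- lapply(chunks, function(ch) compute(Reduce(union_all, ch))) # materialize each batch
indiv  <- Reduce(union_all, parts)                                    # union the materialized parts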

Is this expected? FWIW, I've checked that the SQL commands generated by dplyr are very clean, i.e. a series of (SELECT * FROM TABLE1) UNION ALL .... I was wondering whether some temporary files are not being freed as they should be in the middle of the operation.
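For reference, one way to inspect the SQL that dplyr/dbplyr generates (a sketch, assuming indl is the list of lazy tables as above):

query <- Reduce(union_all, indl)
show_query(query)                 # prints the chained (SELECT ...) UNION ALL (SELECT ...) query
sql <- dbplyr::sql_render(query)  # or capture the same query as a string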

@hannes
Owner

hannes commented Oct 24, 2017

Could you perhaps provide me with a reproducible example? It's possible there is an issue here.

@hannes hannes self-assigned this Oct 24, 2017
@nalimilan
Author

Unfortunately, I'm not allowed to share the dataset. As I said, it's made of about 160 individual files, each with 10,000 to 600,000 rows, for a total of 19M rows. There are 50 variables, most of them strings with a small number of unique values. I can try to generate random data to reproduce the problem, but maybe you already have toy datasets for that?

@hannes
Owner

hannes commented Oct 24, 2017

I don't have a toy dataset, so generated files that provoke the issue would be welcome!

@nalimilan
Author

nalimilan commented Oct 24, 2017

Here's a simple example where the database takes 500MB before calling union_all and 1.1GB after, but up to 28GB during the operation:

library(dplyr)
library(dbplyr)
library(MonetDBLite)

# Build a data frame with 100,000 rows and 51 identical string columns
x <- sample(LETTERS[1:10], 100000, replace=TRUE)
df <- data.frame(x=x)
for(i in 1:50)
    df[paste0("x", i)] <- x

db <- src_monetdblite("test.monetdb", create=TRUE)

# Copy the data frame into the database 100 times as permanent tables
tables <- list()
for(i in 1:100)
    tables[[i]] <- copy_to(db, df, paste0("table", i), temporary=FALSE)

# Chain all 100 tables with UNION ALL and materialize the result
total <- compute(Reduce(union_all, tables))
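To watch the on-disk footprint while the query runs, the size of the database directory can be measured from another R session (a rough sketch, assuming the database lives in ./test.monetdb as above):

files <- list.files("test.monetdb", recursive=TRUE, full.names=TRUE)
sum(file.size(files)) / 1024^3    # total size in GB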

@hannes
Owner

hannes commented Oct 24, 2017

I'm getting

> total <- compute(Reduce(union_all, tables))
Error in UseMethod("compute") : 
  no applicable method for 'compute' applied to an object of class "list"

@nalimilan
Author

Hmm, weird. Is that with the latest released dplyr/dbplyr? I'm using dplyr_0.7.4 and dbplyr_1.1.0. What's the traceback()?

@hannes
Owner

hannes commented Oct 24, 2017

> traceback()
1: compute(Reduce(union_all, tables))
> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] MonetDBLite_0.5.0 dbplyr_1.1.0      dplyr_0.7.4      

@nalimilan
Author

Actually, that was just because of a silly mistake: I should have used tables[[i]]. Not sure why I posted different code from the one I actually ran. Please try with the updated code above.
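(For reference, the usual R gotcha involved here: [ on a list returns a sub-list, while [[ extracts the element itself.)

l <- list(a=1, b=2)
class(l[1])    # "list": [ keeps the list wrapper
class(l[[1]])  # "numeric": [[ extracts the element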

@hannes
Owner

hannes commented Oct 26, 2017

Confirmed, there is something really fishy going on here. Thanks for creating the example!

@hannes hannes added this to the 0.5.0 milestone Oct 26, 2017
@hannes hannes modified the milestones: 0.5.0, 0.6.0 Nov 14, 2017