
High memory consumption when using sentiment() function #39

Open
contefranz opened this issue May 10, 2017 · 10 comments
@contefranz

I am running some polarity computations through the function sentiment(). What I am experiencing is, even for small pieces of text, a huge amount of allocated RAM. Sometimes I also get the following error:

Error in `[.data.table`(word_dat, , .(non_pol = unlist(non_pol)), by = c("id", :
  negative length vectors are not allowed
Calls: assign -> compute_tone -> sentiment -> [ -> [.data.table
Execution halted

A character vector of 669 kB (as computed through object_size() from the pryr package) leads to a peak allocation of 3.590 GB of RAM, which is impressive. This is causing some problems, as you can imagine, when texts get longer.

I know you have developed everything using the data.table package (I did the same for my own package), so this sounds strange to me.

Do you have any hints or are you aware of this issue?
I am not including a minimal example since this analysis can easily be reproduced through the profiling tool in RStudio.

Thanks

@trinker
Owner

trinker commented May 10, 2017

Can you make both parts of this reproducible? The stringi package has tools to generate random text that you can use to mimic the data you're talking about.

@contefranz
Author

Thank you for the hint. Below you can find the minimal example.

# minimal example
rm( list = ls() )
gc( reset = T )

library( pryr )
library( stringi )
library( data.table )
library( sentimentr )

# generate some paragraphs of random text and flatten them into one string
set.seed( 2017 )
text = stri_flatten( stri_rand_lipsum( 50000 ), " " )

object_size( text )
object.size( text )

# computing tone
tone = sentiment( text )

object_size( tone )
object.size( tone )

The profiler run through profvis::profvis() says that memory went up to 4.153 GB despite an initial object (text) of just 6 MB. Unfortunately, I can't upload the screenshot.
Could you please run this and see what is happening? My problem is even worse since some texts are above 50 MB, and when I compute the tone the RAM usage can reach 600 GB. This forces the job to be killed right away even though the workstation is really powerful.

Below you can find my session info.
Thank you again.

R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.4

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] sentimentr_1.0.0  data.table_1.10.4 stringi_1.1.5     pryr_0.1.2       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.10     codetools_0.2-15 digest_0.6.12    jsonlite_1.4     magrittr_1.5     syuzhet_1.0.1    textclean_0.3.1  tools_3.4.0     
 [9] stringr_1.2.0    htmlwidgets_0.8  yaml_2.1.14      compiler_3.4.0   lexicon_0.3.1    htmltools_0.3.6  profvis_0.3.3   

@trinker
Owner

trinker commented May 10, 2017

sentimentr works at the sentence level, so in the example you provide, splitting into sentences produces ~500K sentences. This runs for me but will certainly consume a fair bit of memory. There may be ways to improve sentimentr's memory consumption, but I have not found one. If someone sees this and sees a way to make sentimentr more memory efficient, a PR is welcome. I used data.table for speed reasons, not memory. I'm guessing there are ways to improve my code in this respect.

Until then my suggestion is to chunk the text and loop through the chunks with a manual memory release (gc()) after each iteration, along the lines of the sketch below.
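
A rough sketch of that workaround, assuming a single long string as input; the sentence splitting, chunk size, and function name are illustrative, not part of sentimentr's API:

# rough sketch: chunk the sentences and release memory after each chunk
# (score_in_chunks and chunk_size are illustrative, not sentimentr API)
library( sentimentr )
library( data.table )

score_in_chunks = function( txt, chunk_size = 5000 ) {
  # naive sentence split; sentiment() re-splits within each chunk anyway
  sents = unlist( strsplit( txt, "(?<=[.!?])\\s+", perl = TRUE ) )
  chunks = split( sents, ceiling( seq_along( sents ) / chunk_size ) )
  out = vector( "list", length( chunks ) )
  for ( i in seq_along( chunks ) ) {
    out[[ i ]] = sentiment( chunks[[ i ]] )
    out[[ i ]][ , chunk := i ]  # record which chunk each score came from
    gc()                        # manual memory release after each iteration
  }
  rbindlist( out )
}

# tone_chunked = score_in_chunks( text )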

My second thought is that perhaps sentimentr isn't the right tool for this job. I state in the README that the tool is designed to balance the trade-off between accuracy and speed. I don't address memory, but if you're chugging through that much text you're going to have to balance your own trade-offs.

I evaluate a number of sentiment tools in the package README. One of them is Drew Schmidt's meanr (https://github.com/wrathematics/meanr), which is written in low-level C and is very fast; it should be memory efficient as well. His work is excellent and specifically targeted at the type of analysis you seem to be doing, so it might be the better choice. Both of our packages have READMEs that explain the package philosophies/goals very well. I would start there and ask whether you care enough about the added accuracy of sentimentr to chunk your text and loop through it. If not, it's not the tool for this task.

That being said, I want to leave this issue open; if any community members want to look through the code and optimize memory usage, the improvement would be welcomed.

@contefranz
Author

Thank you for the valuable answer. You give good reasons, so I'll check the meanr package as you suggest. What puzzles me, though, is that data.table was conceived not only for speed but also for memory efficiency: its "by-reference" paradigm aims specifically at minimising the internal copies that are so common in R.
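
For what it's worth, a toy illustration of that by-reference behaviour (nothing sentimentr-specific here):

library( data.table )
dt = data.table( x = rnorm( 1e6 ) )
tracemem( dt )       # the next line triggers no copy message
dt[ , y := x > 0 ]   # column added in place, by reference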

Anyway, I suspect you are right. My texts contain many sentences, sometimes even more than they should because of HTML tags and other markup.

I will try to keep this updated, and when I have time it could be worthwhile to take a look at the internals of sentimentr.

Thank you again.

@trinker
Owner

trinker commented May 10, 2017

What puzzles me, though, is that data.table was conceived not only for speed but also for memory efficiency: its "by-reference" paradigm aims specifically at minimising the internal copies that are so common in R.

I suspect a true data.table whizz would see how to optimize this (@mattdowle would likely feel sick if he saw how I've used data.table). So I'm saying: let's assume the issue is my misuse of data.table, not data.table itself.

@contefranz
Author

data.table works like magic. No doubt about this. Full stop.
The only suggestion I can give you is to carefully profile your function, which I saw calls many other internal functions. For someone who did not develop the code it is hard to spot issues, but for you it should be much easier.

@MarcoDVisser

Referred here from another forum by Trinker.

Profiling should show you what is consuming the most memory; here is a quick guide: https://github.com/MarcoDVisser/aprof#memory-statisics

(This is on the condition that you aren't working in a lower-level language.)
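
For example, something along these lines should work (file names are placeholders; see the guide above for the memory-specific plots):

# sketch of line-level profiling with aprof; "profile_me.R" is a placeholder
# script containing the sentiment() call to be profiled
library( aprof )

Rprof( "profile_me.out", line.profiling = TRUE, memory.profiling = TRUE )
source( "profile_me.R" )
Rprof( NULL )

prof = aprof( "profile_me.R", "profile_me.out" )
plot( prof )     # per-line execution profile
summary( prof )  # projected gains from optimising the slowest lines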

I'll be happy to help think through what is causing the "high consumption".

M

@trinker
Owner

trinker commented May 13, 2017

Not surprisingly...the comma_reducer is causing huge memory use.

Per Marco's aprof:

(screenshot of the aprof memory profile omitted)

@MarcoDVisser

MarcoDVisser commented May 23, 2017

Hi trinker,

Looking at https://github.com/trinker/sentimentr/blob/master/R/utils.R

I see a bunch of potential problems (e.g. the potential use of non-vectorized ifelse statements), which may in fact not be problems at all. It all depends on how these functions are used and how they are "fed" data. Hence, we would need more detailed profiling.
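
To illustrate the kind of pattern I mean (a toy example, not code from sentimentr): ifelse() materialises several full-length temporaries, whereas a direct subset assignment touches only one explicit copy.

x = runif( 1e7 )

# allocates several vectors of length(x) behind the scenes
y1 = ifelse( x > 0.5, x, 0 )

# one explicit copy, then modified by subset assignment
y2 = x
y2[ y2 <= 0.5 ] = 0

identical( y1, y2 )  # TRUE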

Would you mind running the targetedSummary function on line 262?
https://www.rdocumentation.org/packages/aprof/versions/0.3.2/topics/targetedSummary

As you appear to use data.table, I'll be interested to see which functions are consuming so much memory.

M.

@trinker
Owner

trinker commented Jul 26, 2017

#46 may reduce some memory consumption.
