add cores arg to methSeg #120

katwre · 2018-06-28T16:37:41Z

Hi guys,
I added parallel execution to methSeg() for both methylDB (tabix-based) objects and non-tabix objects.
I separated the code for running fastseg and mclust into two auxiliary functions, .run.fastseg and .run.mclust. The parallelization happens in the fastseg step, and the results of the individual fastseg runs are concatenated. For that, I added "GRanges" as a return.type in applyTbxByChr so that GRanges can be concatenated.
I wrote some tests, but I think I should probably add more.
Let me know what you think.
Kasia
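
For readers unfamiliar with the pattern, the idea of running the segmentation per chromosome and concatenating the results can be sketched roughly as below. This is only an illustration in base R plus the `parallel` package, not the code from this PR: `segment_one()` is a hypothetical stand-in for the fastseg call, the toy data frame stands in for a methylKit object, and the pieces are combined with `rbind` instead of concatenating GRanges.

```r
library(parallel)

# Hypothetical stand-in for the per-chromosome fastseg call:
# collapses consecutive positions with the same rounded methylation
# value into one segment.
segment_one <- function(df) {
  runs   <- rle(round(df$meth))
  ends   <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  data.frame(chr      = df$chr[1],
             start    = df$pos[starts],
             end      = df$pos[ends],
             seg.mean = runs$values,
             stringsAsFactors = FALSE)
}

# Split by chromosome, segment each piece (optionally in parallel),
# then concatenate the per-chromosome results.
segment_by_chr <- function(df, mc.cores = 1) {
  pieces <- split(df, df$chr)
  res    <- mclapply(pieces, segment_one, mc.cores = mc.cores)
  out    <- do.call(rbind, res)
  rownames(out) <- NULL
  out
}

# Toy input: two chromosomes with six positions each.
toy <- data.frame(
  chr  = rep(c("chr1", "chr2"), each = 6),
  pos  = rep(seq(100, 600, by = 100), times = 2),
  meth = c(0.1, 0.2, 0.1, 0.9, 0.8, 0.9,
           0.9, 0.9, 0.1, 0.1, 0.1, 0.9),
  stringsAsFactors = FALSE
)

segs <- segment_by_chr(toy, mc.cores = 1)
```

Note that `mclapply` relies on forking, so values of `mc.cores` above 1 are not supported on Windows; the sketch therefore defaults to a single core.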

- changed parameter name from `estimate.params.density` to `initialize.on.subset`
- allow for values higher than 1, which directly yield the number of samples
- prioritize the Mclust argument `initialization`
- check for too few samples, 9 seems to be the magic number
- add tests
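
One possible reading of the `initialize.on.subset` semantics is sketched below. The helper `resolve_subset_size()` is hypothetical and not part of methylKit. Two points are assumptions rather than facts from this thread: that values in (0, 1] are treated as a fraction of all segments, and that the "magic number" check rejects subsets with fewer than 9 samples.

```r
# Hypothetical helper, not methylKit code.
resolve_subset_size <- function(n.segments, initialize.on.subset = 1,
                                min.samples = 9) {
  stopifnot(initialize.on.subset > 0)
  if (initialize.on.subset <= 1) {
    # Assumption: values in (0, 1] are a fraction of all segments.
    n.sub <- floor(n.segments * initialize.on.subset)
  } else {
    # From the commit message: values above 1 give the number of samples
    # directly (capped here at the number of available segments).
    n.sub <- min(floor(initialize.on.subset), n.segments)
  }
  # Assumption: "9 seems to be the magic number" means fewer than 9 is too few.
  if (n.sub < min.samples) {
    stop("too few samples to initialize on: ", n.sub)
  }
  n.sub
}

n.all      <- resolve_subset_size(1000)        # default: use all segments
n.fraction <- resolve_subset_size(1000, 0.25)  # a 25% subset
n.count    <- resolve_subset_size(1000, 50)    # exactly 50 segments
```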

Remove duplicate check for sample size.

alexg9010

Good Job! @katwre

alexg9010 · 2018-06-28T18:39:06Z

tests/testthat/test-8-methSeg.r

+  expect_equal(a,b)
+})
+
+test_that("check if methSeg with cores > 1 is the same as cores=1 (non-tabix file)" ,{


This seems to be the same test as before.

alexg9010 · 2018-06-28T18:39:57Z

tests/testthat/test-8-methSeg.r

+  expect_equal(a,b)
+})
+
+methylRawDB.obj <- methRead(


Is this really a tabix based object?
I think you have to set save.db=TRUE.

if I am correct it's dbtype="tabix"

alexg9010 · 2018-06-28T18:42:14Z

R/methSeg.R

+          gr0 = gr0[,"meth"]
+        }else if("meth.diff" %in% names(mcols(gr0))){
+          gr0 = gr0[,"meth.diff"]
+        }else if (class(obj) != "GRanges"){


This is unnecessary since you are already checking above for the class.

you are right!

al2na · 2018-06-28T23:48:19Z

R/methSeg.R

-  # match argument names to fastseg arguments
-  args.fastseg=dots[names(dots) %in% names(formals(fastseg)[-1] ) ]  
+                  initialize.on.subset=1,
+                  cores=1, ...){


cores should be mc.cores to keep argument names consistent with other methylKit functions

al2na · 2018-06-28T23:54:50Z

R/methSeg.R

+  seg.res <- do.call("fastseg", args.fastseg)
+
+  # stop if segmentation produced only one range
+  if(length(seg.res)==1) {


shouldn't this if clause be called after fastseg and before calling mclust?

Yes, I think I know what you mean. This if clause is in the original code, where seg.res is returned if there is only 1 segment:

methylKit/R/methSeg.R

Line 127 in 0d007f2

return(seg.res)

But since I moved the lines of code for calling mclust into an auxiliary function, I need to make sure that this function (.run.mclust()) is not run if there is only 1 segment.
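
The control flow being discussed can be sketched as follows. This is a generic illustration and not the PR's code: `run_segmentation()` and `run_clustering()` are hypothetical stand-ins for the fastseg and mclust helpers, and a plain data frame replaces the GRanges result.

```r
# Hypothetical stand-in for the segmentation step.
run_segmentation <- function(x) {
  runs <- rle(x)
  data.frame(value = runs$values, length = runs$lengths)
}

# Hypothetical stand-in for the clustering step: adds a group label.
run_clustering <- function(seg) {
  seg$group <- as.integer(factor(seg$value))
  seg
}

segment_and_cluster <- function(x) {
  seg <- run_segmentation(x)
  # With a single segment there is nothing to cluster, so return early
  # and never call the clustering helper.
  if (nrow(seg) == 1) {
    return(seg)
  }
  run_clustering(seg)
}

single   <- segment_and_cluster(c(1, 1, 1))        # one segment, no clustering
multiple <- segment_and_cluster(c(1, 1, 2, 2, 1))  # three segments, clustered
```

Placing the check between the two helper calls keeps the guard in one spot, regardless of how the segmentation itself was run.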

al2na · 2018-06-29T00:11:09Z

R/methSeg.R

+        # methylKit naming convention
+        df2getcolnames = as.data.frame(gr0[1])
+        df2getcolnames$width = NULL 
+        methylKit:::.setMethylDBNames(df2getcolnames)


why do we need this line ?
methylKit:::.setMethylDBNames(df2getcolnames)

This function is used to predict the column names of the given data.frame.

But we do not need the methylKit:::!

sorry! my fault

OK, I got confused. I thought we were resetting names on the tabix files, but this doesn't do that, right?

No, actually the data.frame that we get from the tabix file does not have column names and with that function we retrieve them.
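
As a rough illustration of what retrieving the column names for an unnamed data.frame could look like, here is a small sketch. `name_tabix_columns()` is a hypothetical helper and does not reproduce how `.setMethylDBNames` works internally. The methylRaw column names follow the object printout later in this thread; the methylDiff column names are an assumption.

```r
# Hypothetical helper; methylKit's internal function may work differently.
name_tabix_columns <- function(df, class) {
  cols <- switch(class,
    methylRawDB  = c("chr", "start", "end", "strand",
                     "coverage", "numCs", "numTs"),
    # Assumed column layout for differential methylation results.
    methylDiffDB = c("chr", "start", "end", "strand",
                     "pvalue", "qvalue", "meth.diff"),
    stop("unsupported class: ", class))
  stopifnot(ncol(df) == length(cols))
  names(df) <- cols
  df
}

# One record as it might come back from a tabix query, without
# meaningful column names.
raw <- data.frame("chr21", 9411552L, 9411552L, "+", 45L, 12L, 33L,
                  stringsAsFactors = FALSE)
named <- name_tabix_columns(raw, "methylRawDB")
```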

alexg9010 · 2018-06-29T08:03:33Z

R/methSeg.R

+      ## Tabix files
+    } else if(class(obj)=="methylDiffDB" | class(obj)=="methylRawDB"){
+
+      .run.fastseg.tabix = function(gr0, ...){


I would suggest a function which actually takes the class of the object as argument:

.run.fastseg.tabix = function(gr0, class, ...) {
  ### and then you can directly set the colnames
  .setMethylDBNames(df2getcolnames, class)
}

katwre · 2018-06-29T13:26:48Z

I updated the code according to your suggestions, except for @al2na's suggestion about the if clause #120 (diff)

al2na · 2018-06-29T14:37:34Z

Could we also comment the code wherever possible? Please think about the people who will maintain this in the future, or your future selves. Things that seem trivial now will not be trivial after 3 months of not looking at the code.

katwre · 2018-07-02T10:14:32Z

@al2na I added more comments, hope it's better now

katwre · 2018-07-02T14:51:25Z

There is something wrong when join.neighbours=TRUE and initialize.on.subset!=1. I am checking it.

katwre · 2018-07-04T08:55:24Z

I checked whether methSeg with a methylRawDB object and multiple cores is faster than with a methylRaw object, using example data with ~350K Cs (two chromosomes), and methylRaw is faster. I don't know why. Maybe it depends on the size of the input; I will check that.

b <- benchmark(methylRaw.cores.1 =methSeg(obj.methylraw, diagnostic.plot = F, join.neighbours = FALSE),
               methylRaw.cores.2 =methSeg(obj.methylraw, diagnostic.plot = F, join.neighbours = FALSE, cores=2),
               methylRawDB.cores.1 = methSeg(obj, diagnostic.plot = F, join.neighbours = FALSE),
               methylRawDB.cores.2 = methSeg(obj, diagnostic.plot = F, join.neighbours = FALSE, cores=2),
               replications=5,
               columns=c('test', 'replications', 'elapsed'))
> print(b)
                test replications elapsed
1   methylRaw.cores.1            5  39.026
2   methylRaw.cores.2            5  38.495
3 methylRawDB.cores.1            5  46.146
4 methylRawDB.cores.2            5  45.640

al2na · 2018-07-04T09:35:08Z

please check datasets that have multiple chromosomes lets say at least 5 chromosomes, compare also memory consumption.


katwre · 2018-07-16T09:06:28Z

I checked it using 5 chromosomes and it's not better.

> myRaw
methylRaw object with 3784497 rows
--------------
  chr   start     end strand coverage numCs numTs
1 chr21 9411552 9411552      +       45    12    33
2 chr21 9411553 9411553      -       70    27    43
3 chr21 9411784 9411784      +       31     4    27
4 chr21 9411785 9411785      -       46    12    34
5 chr21 9412099 9412099      +       26    15    11
6 chr21 9412100 9412100      -       35    16    19
--------------
  sample.id: id 
assembly: assembly 
context: CpG 
resolution: base 

library(rbenchmark)

b <- benchmark(methylRaw.cores.1 =methSeg(myRaw, diagnostic.plot = F, join.neighbours = FALSE),
               methylRaw.cores.5 =methSeg(myRaw, diagnostic.plot = F, join.neighbours = FALSE, mc.cores=5),
               methylRawDB.cores.1 = methSeg(mymethylRawDB, diagnostic.plot = F, join.neighbours = FALSE),
               methylRawDB.cores.5 = methSeg(mymethylRawDB, diagnostic.plot = F, join.neighbours = FALSE, mc.cores=5),
               replications=3,
               columns=c('test', 'replications', 'elapsed'))
> print(b)
                 test replications elapsed
1   methylRaw.cores.1            3 257.613
2   methylRaw.cores.5            3 259.420
3 methylRawDB.cores.1            3 295.970
4 methylRawDB.cores.5            3 297.785

Thanks @alexg9010 for the suggestion to use profvis, but it didn't work for me: I got an error that I didn't know what to do with. I used profmem instead, and it showed that memory usage is lower with multiple cores than with a single core.

methylRaw.cores.1 = 47888 bytes 
methylRaw.cores.5 = 39656 bytes
methylRawDB.cores.1 = 151380128 bytes
methylRawDB.cores.5 = 121104112 bytes

katwre · 2018-08-13T14:22:17Z

@al2na @alexg9010 I didn't manage to show that this method is faster. Should we close this pull request?

katwre and others added 9 commits June 18, 2018 15:40

added estimate.params.density param to methSeg

c75ce35

remove if clause

5378e61

fix previous commit

eb85d01

Update methSeg.R

d3970a8

Remove duplicate check for sample size.

update documentation

b5df9d0

update NEWS and version bump

75b949f

Added cores to methSeg

4395aaf

modified man/methSeg.Rd

6e9dd28

alexg9010 requested changes Jun 28, 2018

al2na reviewed Jun 28, 2018

al2na reviewed Jun 29, 2018

alexg9010 reviewed Jun 29, 2018

mc.cores, tests methylRawDB and other improvem.

e628ede

Add more comments

aff6561

don't subsample segments after joining neighbouring segments

f8020c0

keep only seqlevels that are non-empty

f747c67

alexg9010 force-pushed the master branch from 75b949f to 77aa6c7 Compare September 12, 2018 22:03

alexg9010 added this to In progress in Issue Tracker Jan 11, 2019
