
Spark random forest issues #19

Open · szilard opened this issue Jul 23, 2015 · 21 comments
szilard (Owner) commented Jul 23, 2015

This issue is for collaborating on some problems with Spark RF that @jkbradley also addressed in the comments on this post: http://datascience.la/benchmarking-random-forest-implementations/. cc: @mengxr

Please see the “Absolute Minimal Benchmark” for random forests https://github.com/szilard/benchm-ml/tree/master/z-other-tools and let's use the 1M-row training set and the test set linked from there.

@jkbradley says: One-hot encoder: Spark 1.4 includes this, plus a lot more feature transformers. Preprocessing should become ever-easier, especially using DataFrames (Spark 1.3+).

Yes, indeed. Can you please provide code that reads in the original dataset (pre 1-hot encoding) and does the 1-hot encoding in Spark? Also, if the random forest API in 1.4 can use DataFrames, I guess we should use that for training. Can you please provide code for that too?
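
Roughly what I have in mind, as a sketch with the Spark 1.4 ml.feature transformers (untested; column names are from the airline data used in the benchmark):

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// Categorical columns in the airline data; DepTime and Distance stay numeric.
val catCols = Array("Month", "DayofMonth", "DayOfWeek", "UniqueCarrier", "Origin", "Dest")

// One StringIndexer + OneHotEncoder pair per categorical column.
val indexers = catCols.map(c => new StringIndexer().setInputCol(c).setOutputCol(c + "_idx"))
val encoders = catCols.map(c => new OneHotEncoder().setInputCol(c + "_idx").setOutputCol(c + "_vec"))

// Assemble the encoded vectors and the numeric columns into one feature vector.
val assembler = new VectorAssembler()
  .setInputCols(catCols.map(_ + "_vec") ++ Array("DepTime", "Distance"))
  .setOutputCol("features")

val stages: Array[PipelineStage] = indexers ++ encoders :+ assembler
val encoded = new Pipeline().setStages(stages).fit(raw).transform(raw) // raw: DataFrame read from the CSV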

@jkbradley says: AUC/accuracy: The AUC issue appears to be caused by MLlib tree ensembles aggregating votes, rather than class probabilities, as you suggested. I re-ran your test using class probabilities (which can be aggregated by hand), and then got the same AUC as other libraries. We’re planning on including this fix in Spark 1.5 (and thanks for providing some evidence of its importance!).

Fantastic. Can you please share code that already does that? I would be happy to check it out.
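
If I understand the by-hand aggregation correctly, it would look roughly like this with the mllib tree model (a sketch only, untested; the leaf's Predict carries the probability of its predicted label, so it must be flipped for label 0):

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.configuration.FeatureType
import org.apache.spark.mllib.tree.model.{Node, RandomForestModel}

// Walk one tree to its leaf and return P(label = 1) there.
def leafProb(node: Node, features: Vector): Double =
  if (node.isLeaf) {
    if (node.predict.predict == 1.0) node.predict.prob else 1.0 - node.predict.prob
  } else {
    val split = node.split.get
    val goLeft =
      if (split.featureType == FeatureType.Continuous) features(split.feature) <= split.threshold
      else split.categories.contains(features(split.feature))
    leafProb(if (goLeft) node.leftNode.get else node.rightNode.get, features)
  }

// Soft prediction: average the per-tree leaf probabilities instead of counting votes.
def softPredict(model: RandomForestModel, features: Vector): Double =
  model.trees.map(t => leafProb(t.topNode, features)).sum / model.trees.size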

jkbradley commented

Apologies for the slow reply! I've been bogged down with QA for the next Spark release. I wrote the code for one-hot encoding + running RFs and computing AUC. Let me know if you run into issues using it:

https://gist.github.com/jkbradley/1e3cc0b3116f2f615b3f

szilard (Owner, Author) commented Sep 9, 2015

Fantastic, thanks Joseph! I'll take a look asap and re-run the benchmarks with the new code. I'll let you know if I have more questions along the way...

szilard (Owner, Author) commented Sep 9, 2015

@jkbradley
Hi Joseph:

  1. The 1-hot encoding code you provided works great, thanks.
  2. Also, the new scoring by aggregating probabilities works way better than the original aggregation of votes. For n=1M and
val numTrees = 100
val featureSubsetStrategy = "sqrt"   
val impurity = "entropy"    //val impurity = "gini"
val maxDepth = 20           
val maxBins = 100           //val maxBins = 50

I can verify I get AUC 71.4 instead of 62.5.
3. If I run it on the 1.5.0 (Tungsten) runtime, the runtime and memory footprint are still the same.
4. However, something odd is still going on with accuracy: the AUC does not increase (enough) with increasing training set size, unlike with the other tools:
For n=0.1M: AUC 71.1
For n=1M: AUC 71.4
See also this graph for a different (the original) setup (500 trees etc.):
https://raw.githubusercontent.com/szilard/benchm-ml/master/2-rf/x-plot-auc.png

Any comments on 3 and especially on 4 above?
Thanks.

jkbradley commented

I'm glad 1 & 2 worked out!

For 3, I should have been more specific. Tungsten makes improvements on DataFrames, so it should improve the performance of simple ML Pipeline operations like feature transformation and prediction. However, to get the same benefits for model training, we'll need to rewrite the algorithms to use DataFrames and not RDDs. Future work...

For 4: Are you able to examine the tree structure? I'm wondering if the limit on MLlib tree depth is the difference. What happens when you set a max depth for the other libraries? I'll try to look into it more too.
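
For example, something like this should show whether the trees hit the depth cap (assuming the mllib RandomForestModel from the gist):

// Per-tree depth and overall size of the trained forest.
println(model.trees.map(_.depth).mkString(", "))
println(s"total nodes: ${model.totalNumNodes}")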

szilard (Owner, Author) commented Sep 10, 2015

Thanks again for 1 & 2.

For 3: Yes, that was my guess too. One more question: I re-ran the logistic regression https://github.com/szilard/benchm-ml/blob/master/1-linear/5-spark.txt with 1.5.0 as well and got the same training time as with 1.4.0. My guess was that RF would be the same, but I thought LR used DataFrames underneath and would therefore be faster with Tungsten. No?

szilard (Owner, Author) commented Sep 10, 2015

For 4: I'm not sure it's max_depth; H2O, for example, does not show similar behavior (with the same max_depth = 20). To make further testing simple, I zipped the 0.1M and 1M train and test datasets, 1-hot encoded and exported to Parquet format with your code, and you can get it all in one archive from here: https://s3.amazonaws.com/benchm-ml--spark2/spark-0.1m%2B1m-1hot-parquet.tgz
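
Loading them back is one line per file (Spark 1.4+, assuming a sqlContext):

// Read the 1-hot encoded benchmark data back from Parquet.
val trainDF = sqlContext.read.parquet("spark1hot-train-1m.parquet")
val testDF  = sqlContext.read.parquet("spark1hot-test-1m.parquet")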

As for the training/AUC part, I made some small changes; my code is here: https://github.com/szilard/benchm-ml/blob/master/z-other-tools/5xb-spark-trainpred.txt
The changes are mainly:

val trainDataPath = "spark1hot-train-1m.parquet"
val testDataPath = "spark1hot-test-1m.parquet"

that is, the file names (in the current dir), and

val impurity = "entropy"    
val maxDepth = 20        
val maxBins = 100

to match the "absolute minimal benchmark".

So with these I get:

n=0.1M: AUC 71.1
n=1M: AUC 71.4

as I mentioned above.

I think it would now be easier for you to take a look. There should be a larger increase in AUC from n=0.1M to 1M.

jkbradley commented

> For 3: Yes, that was my guess too. One more question: I re-ran the logistic regression https://github.com/szilard/benchm-ml/blob/master/1-linear/5-spark.txt with 1.5.0 as well and got the same training time as with 1.4.0. My guess was that RF would be the same, but I thought LR used DataFrames underneath and would therefore be faster with Tungsten. No?

No, unfortunately. It's using a different optimizer (OWLQN) for some models, but it's still using the RDD API.

I'll take a look at the AUC issue.

szilard (Owner, Author) commented Sep 10, 2015

Thanks for the answer to 3, and thanks for looking into 4 :)

jkbradley commented

I haven't found anything obvious for issue 4 yet, but there are still some items I want to investigate. (I haven't yet scanned the sklearn implementation carefully for comparison.) In the meantime, would you be able to update the blog post to include the soft predictions? It'd be nice now that they're easily available in Spark 1.5. Thanks!

szilard (Owner, Author) commented Sep 18, 2015

Thanks @jkbradley for working on this. I already updated the GitHub README a few days ago, and I'll update the http://datascience.la post soon as well.

ehiggs commented Feb 19, 2016

Hi there. It's been a few months, and I'm curious whether any progress has been made here. Is there a JIRA on the Apache tracker for this work in Spark?

Thanks

szilard (Owner, Author) commented Feb 19, 2016

@ehiggs thanks for asking, though I defer this to @jkbradley and team.

ehiggs commented Feb 22, 2016

I made an issue on the Apache Jira here.

szilard (Owner, Author) commented Feb 22, 2016

Great, thanks.

The memory use and speed are one issue, but there is also the issue of accuracy; see the point numbered 4 earlier in this thread.

In short, the AUC does not increase from 100K records to 1M the way it does for all the other tools:
https://raw.githubusercontent.com/szilard/benchm-ml/master/2-rf/x-plot-auc.png

It could be an implementation bug or an "architecture bug" (some approximation that does not work well). I'm curious whether anyone is using Spark RF in production on large datasets, and whether they really validate the results by comparing accuracy against other tools.

szilard (Owner, Author) commented Sep 24, 2016

I rewrote the RF code to use the pipeline/spark.ml API:
https://github.com/szilard/benchm-ml/blob/master/z-other-tools/5xb-spark-trainpred--sp20.txt
and ran it on Spark 2.0.
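
The core of it looks roughly like this (a condensed sketch; the linked file has the full code, and the label column is assumed to carry nominal metadata, e.g. from StringIndexer):

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Same settings as the minimal benchmark; trainDF/testDF have "features"
// (Vector) and "label" (Double) columns loaded from the Parquet files.
val rf = new RandomForestClassifier()
  .setNumTrees(100)
  .setMaxDepth(20)
  .setMaxBins(100)
  .setImpurity("entropy")
  .setFeatureSubsetStrategy("sqrt")

val model = rf.fit(trainDF)

// spark.ml exposes per-class probabilities directly, so AUC can be
// computed from the "probability" column.
val auc = new BinaryClassificationEvaluator()
  .setRawPredictionCol("probability")
  .setMetricName("areaUnderROC")
  .evaluate(model.transform(testDF))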

It is slower than before (despite Tungsten and the new DataFrame API); 100 trees, 20 deep:

MLlib 1.5 - 250 sec
ML    2.0 - 400 sec

more details here: https://github.com/szilard/benchm-ml/tree/master/z-other-tools#how-to-benchmark-your-tool-of-choice-with-minimal-work

In addition, Spark RF is still less accurate than its peers. Also, it still shows the weird learning curve:
https://github.com/szilard/benchm-ml/raw/master/2-rf/x-plot-auc.png

So basically issues 3 & 4 above did not change for the better.

ehiggs commented Apr 11, 2017

Progress seems to be taking place here. Thanks to @smurching.

riya2216 commented

@szilard @jkbradley I am able to reproduce the AUC gain on Spark 1.4 using @jkbradley's code shared in the gist (https://gist.github.com/jkbradley/1e3cc0b3116f2f615b3f#file-benchm-ml-spark-L6) at the start of this thread.
However, we are facing issues reproducing the same AUC gain on Spark 2.1.1 and 2.2. We tried a couple of things:

  1. Refactored the Scala code from the gist to run against Spark 2.1.1.
    We get a Scala match error:
    scala.MatchError: [0.0,(689,[0,1,2,34,50,53,75,384],[1934.0,732.0,1.0,1.0,1.0,1.0,1.0,1.0])] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)

  2. Rewrote the same code in Java and ran it on Spark 2.1.1.
    We got this error (a known Spark issue, fixed in version 2.2):
    java.lang.UnsupportedOperationException: empty.maxBy
    at scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:236)
    at scala.collection.SeqViewLike$AbstractTransformed.maxBy(SeqViewLike.scala:37)
    at org.apache.spark.ml.tree.impl.RandomForest$.binsToBestSplit(RandomForest.scala:831)
    at org.apache.spark.ml.tree.impl.RandomForest$$anonfun$14.apply(RandomForest.scala:561)
    at org.apache.spark.ml.tree.impl.RandomForest$$anonfun$14.apply(RandomForest.scala:553)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
    at scala.collection.AbstractIterator.to(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)

  3. Ran the Java code on Spark 2.2; the above issue is fixed, but the AUC numbers are not the same as on Spark 1.4:

Area Under ROC without using soft predictions: 0.5102859778512581
Area Under ROC with soft predictions: 0.5

Please provide any insight into what the issue could be here.
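
One possible cause for the MatchError in item 1 (not verified): Spark 2.x DataFrames hold vectors of the new org.apache.spark.ml.linalg.Vector type, so a pattern match written against the old org.apache.spark.mllib.linalg.Vector from the 1.4-era gist fails at runtime with exactly this kind of error. A sketch of the distinction:

import org.apache.spark.sql.{DataFrame, Row}
// On Spark 2.x this must be ml.linalg; importing mllib.linalg.Vector here
// instead makes the match below fail with this kind of scala.MatchError.
import org.apache.spark.ml.linalg.Vector

// Extract (label, features) pairs from a Spark 2.x DataFrame.
def toPairs(df: DataFrame) =
  df.select("label", "features").rdd.map {
    case Row(label: Double, features: Vector) => (label, features)
  }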

JavaCode.txt
ScalaCode_Spark2.1Ready.txt

szilard (Owner, Author) commented Mar 28, 2018

@riya2216 The only insight I can give after doing all these benchmarks is: forget Spark for machine learning. It's slow, inaccurate, buggy, and uses tons of RAM. There is no reason to waste your time on clearly inferior products. Just use e.g. h2o.
PS: If you have all your ETL in Spark, you can use Sparkling Water (which lets you call h2o from Spark).

jakesherman commented

@szilard Thanks for all of your work in putting this together and noticing these issues with Spark ML's Random Forest implementation. I have a couple of quick questions; please pardon my ignorance if you've answered these elsewhere:

  • Have you tried repeating these benchmarks using the same hyperparameter values across packages rather than the defaults? I'm wondering whether Spark ML's implementation is just using default hyperparameters that tend to perform worse than other implementations'. For example, in Spark 2.2 the maximum tree depth defaults to 5, whereas in scikit-learn the default is that "nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples". I can't easily figure out what the default value is in Spark 2.3.

  • Have you repeated this benchmark using Spark 2.3?

Thanks!

137alpha commented

Have you tried increasing the maxBins parameter? It could be due to the binning approximation that Spark uses.

szilard (Owner, Author) commented Aug 30, 2018

@137alpha No, I gave up on Spark for ML a long time ago. There are much better tools for machine learning than Spark MLlib (e.g. h2o); just use those.
