
Spark random forest issues #19

Open · szilard opened this issue Jul 23, 2015 · 21 comments
szilard (Owner) commented Jul 23, 2015

This issue is for collaborating on some problems with Spark RF that @jkbradley also addressed in the comments on this post: http://datascience.la/benchmarking-random-forest-implementations/. cc: @mengxr

Please see the “Absolute Minimal Benchmark” for random forests https://github.com/szilard/benchm-ml/tree/master/z-other-tools and let's use the 1M-row training set and the test set linked from there.

@jkbradley says: One-hot encoder: Spark 1.4 includes this, plus a lot more feature transformers. Preprocessing should become ever-easier, especially using DataFrames (Spark 1.3+).

Yes, indeed. Can you please provide code that reads in the original dataset (pre 1-hot encoding) and does the 1-hot encoding in Spark? Also, if the random forest API in 1.4 can use DataFrames, I guess we should use that for training. Can you please provide code for that too?
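
Roughly what I have in mind, as a sketch with the Spark 1.4 ml.feature transformers (untested; column names are from the airline data used in the benchmark):

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// Categorical columns in the airline data; DepTime and Distance stay numeric.
val catCols = Array("Month", "DayofMonth", "DayOfWeek", "UniqueCarrier", "Origin", "Dest")

// One StringIndexer + OneHotEncoder pair per categorical column.
val indexers = catCols.map(c => new StringIndexer().setInputCol(c).setOutputCol(c + "_idx"))
val encoders = catCols.map(c => new OneHotEncoder().setInputCol(c + "_idx").setOutputCol(c + "_vec"))

// Assemble the encoded vectors and the numeric columns into one feature vector.
val assembler = new VectorAssembler()
  .setInputCols(catCols.map(_ + "_vec") ++ Array("DepTime", "Distance"))
  .setOutputCol("features")

val stages: Array[PipelineStage] = indexers ++ encoders :+ assembler
val encoded = new Pipeline().setStages(stages).fit(raw).transform(raw) // raw: DataFrame read from the CSV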

@jkbradley says: AUC/accuracy: The AUC issue appears to be caused by MLlib tree ensembles aggregating votes, rather than class probabilities, as you suggested. I re-ran your test using class probabilities (which can be aggregated by hand), and then got the same AUC as other libraries. We’re planning on including this fix in Spark 1.5 (and thanks for providing some evidence of its importance!).

Fantastic. Can you please share code that already does that? I would be happy to check it out.
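
If I understand the by-hand aggregation correctly, it would look roughly like this with the mllib tree model (a sketch only, untested; the leaf's Predict carries the probability of its predicted label, so it must be flipped for label 0):

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.configuration.FeatureType
import org.apache.spark.mllib.tree.model.{Node, RandomForestModel}

// Walk one tree to its leaf and return P(label = 1) there.
def leafProb(node: Node, features: Vector): Double =
  if (node.isLeaf) {
    if (node.predict.predict == 1.0) node.predict.prob else 1.0 - node.predict.prob
  } else {
    val split = node.split.get
    val goLeft =
      if (split.featureType == FeatureType.Continuous) features(split.feature) <= split.threshold
      else split.categories.contains(features(split.feature))
    leafProb(if (goLeft) node.leftNode.get else node.rightNode.get, features)
  }

// Soft prediction: average the per-tree leaf probabilities instead of counting votes.
def softPredict(model: RandomForestModel, features: Vector): Double =
  model.trees.map(t => leafProb(t.topNode, features)).sum / model.trees.size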

jkbradley commented

Apologies for the slow reply! I've been bogged down with QA for the next Spark release. I wrote the code for one-hot encoding + running RFs and computing AUC. Let me know if you run into issues using it:

https://gist.github.com/jkbradley/1e3cc0b3116f2f615b3f

szilard (Owner, Author) commented Sep 9, 2015

Fantastic, thanks Joseph! I'll take a look asap and re-run the benchmarks with the new code. I'll let you know if I have more questions along the way...

szilard (Owner, Author) commented Sep 9, 2015

@jkbradley
Hi Joseph:

  1. The 1-hot encoding code you provided works great, thanks.
  2. Also, the new scoring by aggregating probabilities works way better than the original aggregation of votes. For n=1M and
val numTrees = 100
val featureSubsetStrategy = "sqrt"   
val impurity = "entropy"    //val impurity = "gini"
val maxDepth = 20           
val maxBins = 100           //val maxBins = 50

I can verify I get AUC 71.4 instead of 62.5.
3. If I run it on the 1.5.0 (Tungsten) runtime, the runtime and memory footprint are still the same.
4. However, something odd is still going on with accuracy: the AUC does not increase (enough) with increasing training set size, unlike with the other tools:
For n=0.1M: AUC 71.1
For n=1M: AUC 71.4
See also this graph for a different (the original) setup (500 trees etc.):
https://raw.githubusercontent.com/szilard/benchm-ml/master/2-rf/x-plot-auc.png

Any comments on 3 and especially on 4 above?
Thanks.

jkbradley commented

I'm glad 1 & 2 worked out!

For 3, I should have been more specific. Tungsten makes improvements on DataFrames, so it should improve the performance of simple ML Pipeline operations like feature transformation and prediction. However, to get the same benefits for model training, we'll need to rewrite the algorithms to use DataFrames and not RDDs. Future work...

For 4: Are you able to examine the tree structure? I'm wondering if the limit on MLlib tree depth is the difference. What happens when you set a max depth for the other libraries? I'll try to look into it more too.
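
For example, something like this should show whether the trees hit the depth cap (assuming the mllib RandomForestModel from the gist):

// Per-tree depth and overall size of the trained forest.
println(model.trees.map(_.depth).mkString(", "))
println(s"total nodes: ${model.totalNumNodes}")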

szilard (Owner, Author) commented Sep 10, 2015

Thanks again for 1 & 2.

For 3: Yes, that was my guess too. One more question: I re-ran the logistic regression https://github.com/szilard/benchm-ml/blob/master/1-linear/5-spark.txt with 1.5.0 as well and got the same training time as with 1.4.0. My guess was that RF would be the same, but I thought LR used DataFrames underneath and would therefore be faster with Tungsten. No?

szilard (Owner, Author) commented Sep 10, 2015

For 4: I'm not sure it's max_depth; H2O, for example, does not show similar behavior (with the same max_depth = 20). To make further testing simple, I zipped the 0.1M and 1M train and test datasets, 1-hot encoded and exported to Parquet format with your code, and you can get it all in one archive from here: https://s3.amazonaws.com/benchm-ml--spark2/spark-0.1m%2B1m-1hot-parquet.tgz
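
Loading them back is one line per file (Spark 1.4+, assuming a sqlContext):

// Read the 1-hot encoded benchmark data back from Parquet.
val trainDF = sqlContext.read.parquet("spark1hot-train-1m.parquet")
val testDF  = sqlContext.read.parquet("spark1hot-test-1m.parquet")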

As for the training/AUC part, I made some small changes; my code is here: https://github.com/szilard/benchm-ml/blob/master/z-other-tools/5xb-spark-trainpred.txt
The changes are mainly:

val trainDataPath = "spark1hot-train-1m.parquet"
val testDataPath = "spark1hot-test-1m.parquet"

that is, the file names (in the current dir), and

val impurity = "entropy"    
val maxDepth = 20        
val maxBins = 100

to match the "absolute minimal benchmark".

So with these I get:

n=0.1M: AUC 71.1
n=1M: AUC 71.4

as I mentioned above.

I think it would now be easier for you to take a look. There should be a larger increase in AUC from n=0.1M to 1M.

jkbradley commented

> For 3: Yes, that was my guess too. One more question: I re-ran the logistic regression https://github.com/szilard/benchm-ml/blob/master/1-linear/5-spark.txt with 1.5.0 as well and got the same training time as with 1.4.0. My guess was that RF would be the same, but I thought LR used DataFrames underneath and would therefore be faster with Tungsten. No?

No, unfortunately. It's using a different optimizer (OWLQN) for some models, but it's still using the RDD API.

I'll take a look at the AUC issue.

szilard (Owner, Author) commented Sep 10, 2015

Thanks for the answer to 3, and thanks for looking into 4 :)

jkbradley commented

I haven't found anything obvious for issue 4 yet, but there are still some items I want to investigate. (I haven't yet scanned the sklearn implementation carefully for comparison.) In the meantime, would you be able to update the blog post to include the soft predictions? It'd be nice now that they're easily available in Spark 1.5. Thanks!

szilard (Owner, Author) commented Sep 18, 2015

Thanks @jkbradley for working on this. I already updated the GitHub README a few days ago, and I'll update the http://datascience.la post soon as well.

ehiggs commented Feb 19, 2016

Hi there. It's been a few months, and I'm curious whether any progress has been made here. Is there a JIRA on the Apache tracker for this work in Spark?

Thanks

szilard (Owner, Author) commented Feb 19, 2016

@ehiggs thanks for asking, though I defer this to @jkbradley and team.

ehiggs commented Feb 22, 2016

I made an issue on the Apache Jira here.

szilard (Owner, Author) commented Feb 22, 2016

Great, thanks.

The memory use and speed are one issue, but there is also the issue of accuracy; see the point numbered 4 earlier in this thread.

In short, the AUC does not increase from 100K records to 1M the way it does for all the other tools:
https://raw.githubusercontent.com/szilard/benchm-ml/master/2-rf/x-plot-auc.png

It could be an implementation bug or an "architecture bug" (some approximation that does not work well). I'm curious whether anyone is using Spark RF in production on large datasets, and whether they really validate the results by comparing accuracy against other tools.

szilard (Owner, Author) commented Sep 24, 2016

I rewrote the RF code to use the pipeline/spark.ml API:
https://github.com/szilard/benchm-ml/blob/master/z-other-tools/5xb-spark-trainpred--sp20.txt
and ran it on Spark 2.0.
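
The core of it looks roughly like this (a condensed sketch; the linked file has the full code, and the label column is assumed to carry nominal metadata, e.g. from StringIndexer):

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Same settings as the minimal benchmark; trainDF/testDF have "features"
// (Vector) and "label" (Double) columns loaded from the Parquet files.
val rf = new RandomForestClassifier()
  .setNumTrees(100)
  .setMaxDepth(20)
  .setMaxBins(100)
  .setImpurity("entropy")
  .setFeatureSubsetStrategy("sqrt")

val model = rf.fit(trainDF)

// spark.ml exposes per-class probabilities directly, so AUC can be
// computed from the "probability" column.
val auc = new BinaryClassificationEvaluator()
  .setRawPredictionCol("probability")
  .setMetricName("areaUnderROC")
  .evaluate(model.transform(testDF))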

It is slower than before (despite Tungsten and the new DataFrame API); 100 trees, 20 deep:

MLlib 1.5 - 250 sec
ML    2.0 - 400 sec

more details here: https://github.com/szilard/benchm-ml/tree/master/z-other-tools#how-to-benchmark-your-tool-of-choice-with-minimal-work

In addition, Spark RF is still less accurate than its peers. Also, it still shows the weird learning curve:
https://github.com/szilard/benchm-ml/raw/master/2-rf/x-plot-auc.png

So basically issues 3 & 4 above did not change for the better.

ehiggs commented Apr 11, 2017

Progress seems to be taking place here. Thanks to @smurching.

riya2216 commented

@szilard @jkbradley I am able to reproduce the AUC gain on Spark 1.4 using @jkbradley's code shared in the gist (https://gist.github.com/jkbradley/1e3cc0b3116f2f615b3f#file-benchm-ml-spark-L6) at the start of this thread.
However, we are facing issues reproducing the same AUC gain on Spark 2.1.1 and 2.2. We tried a couple of things:

  1. Refactored the Scala code from the gist to run against Spark 2.1.1.
    We get a Scala match error:
    scala.MatchError: [0.0,(689,[0,1,2,34,50,53,75,384],[1934.0,732.0,1.0,1.0,1.0,1.0,1.0,1.0])] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)

  2. Rewrote the same code in Java and ran it on Spark 2.1.1.
    We got this error (a known Spark issue, fixed in version 2.2):
    java.lang.UnsupportedOperationException: empty.maxBy
    at scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:236)
    at scala.collection.SeqViewLike$AbstractTransformed.maxBy(SeqViewLike.scala:37)
    at org.apache.spark.ml.tree.impl.RandomForest$.binsToBestSplit(RandomForest.scala:831)
    at org.apache.spark.ml.tree.impl.RandomForest$$anonfun$14.apply(RandomForest.scala:561)
    at org.apache.spark.ml.tree.impl.RandomForest$$anonfun$14.apply(RandomForest.scala:553)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
    at scala.collection.AbstractIterator.to(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)

  3. Ran the Java code on Spark 2.2; the above issue is fixed, but the AUC numbers are not the same as on Spark 1.4:

Area Under ROC without using soft predictions: 0.5102859778512581
Area Under ROC with soft predictions: 0.5

Please provide any insight into what the issue could be here.
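
One possible cause for the MatchError in item 1 (not verified): Spark 2.x DataFrames hold vectors of the new org.apache.spark.ml.linalg.Vector type, so a pattern match written against the old org.apache.spark.mllib.linalg.Vector from the 1.4-era gist fails at runtime with exactly this kind of error. A sketch of the distinction:

import org.apache.spark.sql.{DataFrame, Row}
// On Spark 2.x this must be ml.linalg; importing mllib.linalg.Vector here
// instead makes the match below fail with this kind of scala.MatchError.
import org.apache.spark.ml.linalg.Vector

// Extract (label, features) pairs from a Spark 2.x DataFrame.
def toPairs(df: DataFrame) =
  df.select("label", "features").rdd.map {
    case Row(label: Double, features: Vector) => (label, features)
  }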

JavaCode.txt
ScalaCode_Spark2.1Ready.txt

szilard (Owner, Author) commented Mar 28, 2018

@riya2216 The only insight I can give after doing all these benchmarks is: forget Spark for machine learning. It's slow, inaccurate, buggy, and uses tons of RAM. There is no reason to waste your time on clearly inferior products. Just use e.g. h2o.
PS: If you have all your ETL in Spark, you can use Sparkling Water (which lets you call h2o from Spark).

jakesherman commented

@szilard Thanks for all of your work in putting this together and noticing these issues with Spark ML's Random Forest implementation. I have a couple of quick questions; please pardon my ignorance if you've answered these elsewhere:

  • Have you tried repeating these benchmarks using the same hyperparameter values across packages rather than the defaults? I'm wondering whether Spark ML's implementation is just using default hyperparameters that tend to perform worse than other implementations'. For example, in Spark 2.2 the maximum tree depth defaults to 5, whereas in scikit-learn the default is that "nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples". I can't easily figure out what the default value is in Spark 2.3.

  • Have you repeated this benchmark using Spark 2.3?

Thanks!

137alpha commented

Have you tried increasing the maxBins parameter? It could be due to the binning approximation that Spark uses.

szilard (Owner, Author) commented Aug 30, 2018

@137alpha No, I gave up on Spark for ML a long time ago. There are much better tools for machine learning than Spark MLlib (e.g. h2o); just use those.
