
Implement I/O for datasets of LabeledPoints #158

Open
ecurtin opened this issue Feb 19, 2018 · 1 comment

Comments

@ecurtin (Contributor) commented Feb 19, 2018

Many ML workloads, such as LogisticRegression, generate and require as input datasets of the form RDD[LabeledPoint]. Converting back and forth between a weakly typed DataFrame and an RDD of LabeledPoint is straightforward. The problem is writing datasets of LabeledPoints out to disk from the generators, and reading them back in, in a variety of formats.

The legacy version of Spark-Bench wrote datasets of LabeledPoints out as text files, so each row was a string that had to be parsed by the workload. This string parsing is a major performance hit, particularly when formats like Parquet could instead be used to drastically cut down on storage space and transport time.

Spark-Bench needs a way to:

  • generically write DataFrames of LabeledPoint out to disk in a variety of formats
  • generically read datasets of LabeledPoint from disk in a variety of formats and convert them to DataFrames for use in workloads.
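A minimal sketch of what that round trip could look like. The object and method names below are hypothetical (not existing Spark-Bench API); it assumes Spark's `org.apache.spark.ml.feature.LabeledPoint`, whose `Vector` features column serializes cleanly to Parquet via its UDT, but would not survive plain-text formats like CSV without extra encoding:

```scala
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical helper, not part of Spark-Bench.
object LabeledPointIO {

  // Write an RDD of LabeledPoints to disk in the given format (e.g. "parquet").
  def write(spark: SparkSession,
            rdd: RDD[LabeledPoint],
            path: String,
            format: String = "parquet"): Unit = {
    import spark.implicits._
    // LabeledPoint is a case class, so toDF works directly;
    // the features Vector is stored via Spark's VectorUDT.
    rdd.toDF("label", "features").write.format(format).save(path)
  }

  // Read the dataset back and rebuild the RDD[LabeledPoint].
  def read(spark: SparkSession,
           path: String,
           format: String = "parquet"): RDD[LabeledPoint] = {
    spark.read.format(format).load(path)
      .rdd
      .map(row => LabeledPoint(
        row.getAs[Double]("label"),
        row.getAs[Vector]("features")))
  }
}
```

The DataFrame step in the middle is what makes the format pluggable: anything `DataFrameWriter`/`DataFrameReader` supports (Parquet, ORC, JSON, ...) comes for free, as long as the format can represent the vector column.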
@ecurtin (Contributor, Author) commented Feb 19, 2018