
Implement I/O for datasets of LabeledPoints #158

Open
ecurtin opened this issue Feb 19, 2018 · 1 comment

Comments

@ecurtin (Contributor) commented Feb 19, 2018

Many ML workloads, such as LogisticRegression, generate and require as input datasets of the form RDD[LabeledPoint]. Converting back and forth between a weakly typed DataFrame and an RDD of LabeledPoint is straightforward. The problem is writing datasets of LabeledPoints out to disk from the generators, and reading them back in, in a variety of formats.

The legacy version of Spark-Bench wrote datasets of LabeledPoints out as text files, so each row was a string that had to be parsed by the workload. This string parsing is a major performance hit, particularly when formats like Parquet could instead be used to drastically cut down on storage space and transport time.

Spark-Bench needs a way to:

  • generically write DataFrames of LabeledPoint out to disk in a variety of formats
  • generically read datasets of LabeledPoint from disk in a variety of formats and convert them to DataFrames for use in workloads.
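A minimal sketch of what that round trip could look like. The object and method names below are hypothetical (not existing Spark-Bench API); it assumes Spark's `org.apache.spark.ml.feature.LabeledPoint`, whose `Vector` features column serializes cleanly to Parquet via its UDT, but would not survive plain-text formats like CSV without extra encoding:

```scala
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical helper, not part of Spark-Bench.
object LabeledPointIO {

  // Write an RDD of LabeledPoints to disk in the given format (e.g. "parquet").
  def write(spark: SparkSession,
            rdd: RDD[LabeledPoint],
            path: String,
            format: String = "parquet"): Unit = {
    import spark.implicits._
    // LabeledPoint is a case class, so toDF works directly;
    // the features Vector is stored via Spark's VectorUDT.
    rdd.toDF("label", "features").write.format(format).save(path)
  }

  // Read the dataset back and rebuild the RDD[LabeledPoint].
  def read(spark: SparkSession,
           path: String,
           format: String = "parquet"): RDD[LabeledPoint] = {
    spark.read.format(format).load(path)
      .rdd
      .map(row => LabeledPoint(
        row.getAs[Double]("label"),
        row.getAs[Vector]("features")))
  }
}
```

The DataFrame step in the middle is what makes the format pluggable: anything `DataFrameWriter`/`DataFrameReader` supports (Parquet, ORC, JSON, ...) comes for free, as long as the format can represent the vector column.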
@ecurtin (Contributor, Author) commented Feb 19, 2018