gpcodervn/learn-apache-spark

learn-apache-spark

Learn Apache Spark with Java

This repo demonstrates some features of Apache Spark, including RDD, SQL, and Streaming.


Apache Spark

Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

Apache Spark features

  • Batch/streaming data: Unify the processing of your data in batches and real-time streaming, using your preferred language: Python, SQL, Scala, Java or R.
  • SQL analytics: Execute fast, distributed ANSI SQL queries for dashboarding and ad-hoc reporting. Runs faster than most data warehouses. Apache Spark™ is built on an advanced distributed SQL engine for large-scale data.
  • Machine learning: Train machine learning algorithms on a laptop and use the same code to scale to fault-tolerant clusters of thousands of machines.
  • Data science at scale: Perform Exploratory Data Analysis (EDA) on petabyte-scale data without having to resort to downsampling.

Demo

RDD

Spark SQL

Streaming

References
