
Project | Big Data Analytics w/ Hadoop and Apache Spark

Background

The project demonstrates the combined capabilities of Hadoop and Apache Spark for analytics on a student-score dataset. Combining the strengths of these two frameworks (i.e., Hadoop HDFS for storage + Apache Spark for computation) is a practice highly regarded by data teams today.

What is Hadoop?

The Apache Hadoop Project is an open-source project consisting of four main modules:

  • HDFS. The Hadoop Distributed File System.
  • MapReduce. The processing component of the Hadoop ecosystem. It scales horizontally but is relatively slow because it persists intermediate results to disk; over the last decade, practitioners have largely replaced it with Apache Spark or Flink.
  • YARN. Yet Another Resource Negotiator. It manages computing resources and job scheduling.
  • Hadoop Common. Also called Hadoop Core, it provides the libraries and utilities that support all other Hadoop components.

Among these modules, this project focuses on Hadoop HDFS, the file system that manages the storage of large datasets. It handles both structured and unstructured data, persisting it to disk across the nodes of a cluster.
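
To make the storage side concrete, here is a minimal PySpark sketch of writing a dataset to an HDFS path and reading it back. The namenode host and port are placeholders for an actual Hadoop cluster, and the sample row is invented for illustration.

```python
# Minimal sketch: persist a tiny dataset to HDFS and read it back.
# "namenode:9000" is a placeholder for a real cluster address.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-demo").getOrCreate()

df = spark.createDataFrame(
    [("Alice", "Physics", 88, 92)],
    ["Student", "Subject", "Class Score", "Test Score"],
)

# To Spark, HDFS is just another path scheme; HDFS itself splits the
# file into blocks and replicates them across the data nodes.
df.write.mode("overwrite").csv("hdfs://namenode:9000/data/scores")
df_back = spark.read.csv("hdfs://namenode:9000/data/scores")
```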

What is Apache Spark?

Apache Spark is an open-source project. It uses RAM for caching and processing data and is designed for fast performance. Its core data structure is the Resilient Distributed Dataset (RDD). Spark consists of five main components:

  • Apache Spark Core. The basis of the project, covering scheduling, task dispatching, input/output operations, etc.
  • Spark Streaming. Processes live data streams from sources such as Kafka, Kinesis, and Flume.
  • Spark SQL. Provides support for structured data; Spark uses it to gather information about the structure of the data and of the computation being performed.
  • Machine Learning Library (MLlib). A library of scalable machine learning algorithms.
  • GraphX. Facilitates graph-analytics tasks.

In this notebook, we use Apache Spark Core functions, which keep data in memory to speed up computations.
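
As a minimal sketch of what "using memory" buys us, the snippet below builds an RDD, caches it, and reuses it; the subject/score pairs are invented for illustration.

```python
# Sketch of Spark Core's in-memory model: build an RDD, cache it in RAM,
# and reuse it without recomputing the lineage.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()
sc = spark.sparkContext

scores = sc.parallelize([("physics", 88), ("maths", 95), ("physics", 79)])

# cache() keeps the partitions in executor memory after the first action,
# so the second action is served from RAM instead of being recomputed.
scores.cache()
print(scores.count())                     # triggers computation, fills the cache
print(scores.reduceByKey(max).collect())  # reads from the cached partitions
```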

Dataset

The dataset consists of student scores in different subjects, in CSV format. There are 40 rows of data plus one header row. It is deliberately small for the sake of practicing the two frameworks; in the field of big data, datasets can exceed petabytes.

The four columns are: Student | Subject | Class Score | Test Score
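
An explicit schema for these columns might look like the sketch below; treating the two score columns as integers is an assumption about the CSV contents.

```python
# Hypothetical explicit schema matching the four columns above.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

score_schema = StructType([
    StructField("Student", StringType(), nullable=False),
    StructField("Subject", StringType(), nullable=False),
    StructField("Class Score", IntegerType(), nullable=True),
    StructField("Test Score", IntegerType(), nullable=True),
])
```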

Goals

This notebook walks you through (1) loading the data into HDFS and (2) processing it with Spark. Here's the outline (a PySpark sketch of these steps follows it):

  • Import functions
  • Data load
    • Parquet file + gzip codec
      We prefer Parquet on HDFS because it is columnar (reads column by column), carries its own schema, and is both compressible and splittable, which makes it ideal for analytics.
      We use the gzip codec because it provides a good compression ratio, at the cost of not being splittable and offering only moderate read/write performance; it is nonetheless common for analytical workloads.
    • Schema optimization
  • Data processing
    • Computing total score
    • Printing total score for physics
    • Computing avg total score
    • Finding student with highest score
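
The sketch below walks through the outline end to end in PySpark. The file paths and the exact column names are assumptions based on the dataset description; the notebook itself is the authoritative version.

```python
# Sketch of the outline: load CSV, store as gzip-compressed Parquet,
# then run the four processing steps. Paths and column names are assumed.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("student-scores").getOrCreate()

# Data load: CSV -> Parquet with the gzip codec.
raw = spark.read.csv("/FileStore/student_scores.csv",
                     header=True, inferSchema=True)
raw.write.mode("overwrite") \
   .option("compression", "gzip") \
   .parquet("/FileStore/student_scores_parquet")
df = spark.read.parquet("/FileStore/student_scores_parquet")

# Computing total score per row.
df = df.withColumn("Total Score", F.col("Class Score") + F.col("Test Score"))

# Printing total scores for physics.
df.filter(F.lower(F.col("Subject")) == "physics") \
  .select("Student", "Total Score").show()

# Computing the average total score.
df.agg(F.avg("Total Score").alias("avg_total_score")).show()

# Finding the student with the highest total score.
df.orderBy(F.col("Total Score").desc()) \
  .select("Student", "Subject", "Total Score").limit(1).show()
```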

Technologies

The project was created with the Community Edition of Databricks, the free edition that only requires signing up; no installation is necessary to run this repo. The selected cluster runtime for this project is 11.3 LTS, which includes

  • Apache Spark 3.3.0
  • Scala 2.12

The driver type provides 15.3 GB of memory, 2 cores, and 1 DBU; this is the default resource allocation for the Databricks Community Edition.

License

Distributed under the MIT License. See LICENSE.txt for more information.

Author

levist7

