
S530484/Spark_wordcount


Introduction

Spark is built around a distributed collection of items called an RDD (Resilient Distributed Dataset). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs.
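For example, in the spark-shell (where sc is the preconfigured SparkContext), an RDD can be created from an in-memory collection or from a text file. A minimal sketch:

>val nums = sc.parallelize(Seq(1, 2, 3, 4))
>val lines = sc.textFile("C:/44564/scala_sample/winemag_data.txt")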

Prerequisites

  1. Have Spark installed on your system.
  2. Make sure you have changed the permissions so you can modify the data files.

Aggregate:

Create a new RDD from the text of the "winemag_data" file in the directory. To read the content from the file:

>val inputFile = sc.textFile("C:/44564/scala_sample/winemag_data.txt")

The filter transformation checks whether the second column is present or not. It returns a new RDD formed by selecting those elements of the source for which the predicate returns true. Since split never produces null elements, the check is for a non-empty value:

>val filcountry = inputFile.filter(rec => rec.split(",").length > 1 && rec.split(",")(1).nonEmpty)

The line below defines "tmp1" as the result of a map transformation that splits each row on commas and extracts the second column:

>val tmp1 = filcountry.map(_.split(",")).map(p => p(1))

The following command calculates the aggregate count per country (the records are already one per line, so no further splitting is needed):

>val fin = tmp1.map(word => (word, 1)).reduceByKey(_ + _)

The following command writes the elements of the dataset as a text file (or set of text files) in the "show-spark" directory on the local filesystem. Spark calls toString on each element to convert it to a line of text in the file.

>fin.saveAsTextFile("c:/tmp/show-spark/output")

If you want to check the first 100 results, you can use the code below:

>filcountry.map(_.split(",")).map(p => p(1)).take(100).foreach(println)

The command below repartitions the RDD into exactly 5 partitions of roughly equal size.

>val repart = fin.repartition(5)
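You can verify the result with getNumPartitions, which returns the number of partitions of an RDD:

>repart.getNumPartitions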

Wordcount:

The following commands count the words in a text file; here each line is split on commas, so the counts are per comma-separated token.

>val inputFile = sc.textFile("C:/44564/scala_sample/winemag_data.txt")
>val counts = inputFile.flatMap(line => line.split(",")).map(word => (word, 1)).reduceByKey(_ + _)
>counts.toDebugString
>counts.cache
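toDebugString prints the RDD's lineage, i.e. the chain of transformations Spark will run to compute it, and cache marks the RDD to be kept in memory after the first action computes it, so later actions can reuse it instead of recomputing.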

To store the values as partitioned text files:

>counts.saveAsTextFile("c:/tmp/show-spark/output")
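Spark writes one part-NNNNN file per partition of the RDD, so the number of files in the output directory reflects the partition count.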

Other commands

spark.read.csv("C:/44564/scala_sample/winemag_data.txt").show()

spark.read.option("header",true).csv("C:/44564/scala_sample/winemag_data.txt")
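With the header option set to true, Spark treats the first row of the CSV as column names instead of data.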

val country = wineVal.filter(rec => (rec.split(",")(4) == "15"))

val counts1 = country.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

counts.saveAsTextFile("/tmp/show-spark/output1")

import java.io.PrintWriter

new PrintWriter("countries.txt") {
  fin.collect().foreach { case (k, v) => write(k + ":" + v + "\n") }
  close()
}
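Note that collect() brings the entire result to the driver, so writing out a file this way is only appropriate when the aggregated result is small.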

Data Chart for the obtained partitions

(See screenshots 253 and 254 for the chart of the obtained partitions.)

References:

  1. https://spark.apache.org/docs/2.1.0/programming-guide.html#resilient-distributed-datasets-rdds

  2. https://www.youtube.com/watch?v=MkdPchZDlNE

  3. https://www.youtube.com/watch?v=M8m_HOXX2Uo&t=124s

  4. http://vishnuviswanath.com/spark_rdd.html
