Skip to content

Commit

Permalink
new stuff
Browse files Browse the repository at this point in the history
  • Loading branch information
stevelowenthal committed Sep 10, 2015
1 parent 132e07a commit fa11543
Show file tree
Hide file tree
Showing 3 changed files with 691 additions and 338 deletions.
324 changes: 324 additions & 0 deletions jupyter/Collect data as lists.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,324 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Lets \"pivot\" data and nest it into a list.\n",
"We currently have our tracks_by_album table modeled as such:\n",
"\n",
"````\n",
"CREATE TABLE music.tracks_by_album (\n",
" album_title text,\n",
" album_year int,\n",
" track_number int,\n",
" album_genre text static,\n",
" performer text static,\n",
" track_title text,\n",
" PRIMARY KEY ((album_title, album_year), track_number)\n",
"````\n",
"\n",
"Let's model it as one partition per album, and the track titles in a list\n"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%Cql create table if not exists music.albums (album_title text,\n",
"album_genre text,\n",
"album_year int,\n",
"performer text,\n",
"tracks list<text>, \n",
"primary key ((album_title, album_year)) )"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<table><tr><tr></table>"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%Cql truncate music.albums"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Now a handy case class"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"case class Track(album_title: String,\n",
"album_year:Int, number:Int,\n",
"album_genre: String,\n",
"performer: String,\n",
"track_title: String)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### And an RDD to pick up the original data"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"val tracks = sc.cassandraTable(\"music\",\"tracks_by_album\").as(Track)"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Track(Duos For Violin and Cello,2000,1,Classical,Nigel Kennedy,Sonata for Violin and Cello - Allegro)"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tracks.first"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create a Pair RDD with key, track name. The key in this case (album_title, year). So we want tuples ((album_title, year), track_title). Also filter out tracks that are null"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"val track_pairs = tracks.filter(t => t.track_title != null).map(t => ((t.album_title, t.album_year), t.track_title))"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"track_pairs.first.toString"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Use groupByBey, to convert it to (key, collection of track names)"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"val albums_with_tracks = track_pairs.groupByKey"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"albums_with_tracks.first.toString\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Flatten it out so we have (album_title, year, collection of tracks)"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"val albums_with_tracks2 = albums_with_tracks.map{ case ((album, year), tracks) => (album, year, tracks) }"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(Bramble Rose,2002,CompactBuffer(Trouble Over Me, Virginia, No One Can Warn You, Neighborhood, Bird Of Freedom, Bramble Rose, I Know Him Too, Sunday, Supposed To Make You Happy, Diamond Shoes, Are You Still In Love With Me?, When I Cross Over))"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"albums_with_tracks2.first.toString"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"albums_with_tracks2.saveToCassandra(\"music\",\"albums\", SomeColumns(\"album_title\", \"album_year\", \"tracks\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ... And check it out"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<table><tr><th>album_title</th><th>album_year</th><th>album_genre</th><th>performer</th><th>tracks</th><tr><tr><td>Duos For Violin and Cello</td><td>2000</td><td>&lt;null&gt;</td><td>&lt;null&gt;</td><td>[Sonata for Violin and Cello - Allegro, Sonata for Violin and Cello - Tres vif, Sonata for Violin and Cello - Lent, Sonata for Violin and Cello - Vif, avec entrain, Passacaglia, Duo for Violin and Cello Op. 7 - Allegro serioso, non troppo, Duo for Violin and Cello Op. 7 - Adagio-Andante-Tempo I, Duo for Violin and Cello Op. 7 - Maestoso e largamente, ma non troppo lento-Presto, Two-Part Intervention No. 6 in E]</td></tr><tr><td>Golden Boy Elvis</td><td>1981</td><td>&lt;null&gt;</td><td>&lt;null&gt;</td><td>[She's Not You, Return To Sender, Joshua Fit The Battle, Don't, Tutti Frutti, It's Now Or Never, Surrender, Do The Clam, Kiss Me Quick, Such A Night, Blueberry Hill, Don't Be Cruel, Heartbreak Hotel, Fun In Acapulco, Hound Dog, Wooden Heart]</td></tr><tr><td>Home</td><td>1995</td><td>&lt;null&gt;</td><td>&lt;null&gt;</td><td>[I Believe, Let Me Be The One, All Along, Oh Virginia, Nora, Would You Be There, Home, End Of The World, Heaven, Forever For Tonight, Lucky To Be Here]</td></tr><tr><td>Little Sweet Delirium</td><td>1996</td><td>&lt;null&gt;</td><td>&lt;null&gt;</td><td>[Chase, Revolution, 829, Reaction, Hands Like Wings, God Shaped Hole, Fall, Point of You]</td></tr><tr><td>Merry Christmas From Elvis Presley, A</td><td>1982</td><td>&lt;null&gt;</td><td>&lt;null&gt;</td><td>[O Come All Ye Faithful, The First Noel, On A Snowy Christmas Night, Winter Wonderland, The Wonderful World Of Christmas, It Won't Seem Like Christmas, I'll Be Home On Christmas Day, If I Get Home On Christmas Day, Holly Leaves And Christmas Trees, Merry Christmas Baby, Silver Bells]</td></tr></table>"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%Cql select * from music.albums limit 5;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Now we need to fill in genre and year. Let's just upsert them in! This will also insert any albums with no tracks"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"tracks.saveToCassandra(\"music\",\"albums\",SomeColumns(\"album_title\",\"album_year\",\"album_genre\",\"performer\"))"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<table><tr><th>album_title</th><th>album_year</th><th>album_genre</th><th>performer</th><th>tracks</th><tr><tr><td>Duos For Violin and Cello</td><td>2000</td><td>Classical</td><td>Nigel Kennedy</td><td>[Sonata for Violin and Cello - Allegro, Sonata for Violin and Cello - Tres vif, Sonata for Violin and Cello - Lent, Sonata for Violin and Cello - Vif, avec entrain, Passacaglia, Duo for Violin and Cello Op. 7 - Allegro serioso, non troppo, Duo for Violin and Cello Op. 7 - Adagio-Andante-Tempo I, Duo for Violin and Cello Op. 7 - Maestoso e largamente, ma non troppo lento-Presto, Two-Part Intervention No. 6 in E]</td></tr><tr><td>Golden Boy Elvis</td><td>1981</td><td>Rock</td><td>Elvis Presley</td><td>[She's Not You, Return To Sender, Joshua Fit The Battle, Don't, Tutti Frutti, It's Now Or Never, Surrender, Do The Clam, Kiss Me Quick, Such A Night, Blueberry Hill, Don't Be Cruel, Heartbreak Hotel, Fun In Acapulco, Hound Dog, Wooden Heart]</td></tr><tr><td>Home</td><td>1995</td><td>Rock</td><td>Blessid Union of Souls</td><td>[I Believe, Let Me Be The One, All Along, Oh Virginia, Nora, Would You Be There, Home, End Of The World, Heaven, Forever For Tonight, Lucky To Be Here]</td></tr><tr><td>Little Sweet Delirium</td><td>1996</td><td>Unknown</td><td>Creeker</td><td>[Chase, Revolution, 829, Reaction, Hands Like Wings, God Shaped Hole, Fall, Point of You]</td></tr><tr><td>Merry Christmas From Elvis Presley, A</td><td>1982</td><td>Rock</td><td>Elvis Presley</td><td>[O Come All Ye Faithful, The First Noel, On A Snowy Christmas Night, Winter Wonderland, The Wonderful World Of Christmas, It Won't Seem Like Christmas, I'll Be Home On Christmas Day, If I Get Home On Christmas Day, Holly Leaves And Christmas Trees, Merry Christmas Baby, Silver Bells]</td></tr></table>"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%Cql select * from music.albums limit 5;"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Spark 1.2.1 (Scala 2.10.4)",
"language": "scala",
"name": "spark"
},
"language_info": {
"name": "scala"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

0 comments on commit fa11543

Please sign in to comment.