Skip to content

Scalding 0.13.1, the most convenient scalding we’ve ever released!

Compare
Choose a tag to compare
@isnotinvain isnotinvain released this 11 Feb 01:39
· 1493 commits to develop since this release

Scala 2.11 Support is here!

We’re now publishing scalding for scala 2.11! Get it while it’s hot!

Easier aggregation via the latest Algebird

Algebird now comes with some very powerful aggregators that make it easy to compose aggregations and apply them in a single pass.

For example, to find each customer's order with the max quantity, as well as the order with the min price, in a single pass:

val maxOp = maxBy[Order, Long](_.orderQuantity).andThenPresent(_.orderQuantity)
val minOp = minBy[Order, Long](_.orderPrice).andThenPresent(_.orderPrice)
TypedPipe.from(orders)
      .groupBy(_.customerName)
      .aggregate(maxOp.join(minOp))

For more examples and documentation see: Aggregation using Algebird Aggregators
And for a hands on walkthrough in the REPL, see Alice In Aggregator Land

Read-Eval-Print-Love

We’ve made some improvements that make day to day use of the REPL more convenient:

Easily switch between local and hdfs mode

#1113 Makes it easy to switch between local and hdfs mode in the REPL, without losing your session.
So you can iterate locally on some small data, and once that’s working, run a hadoop job on your real data, all from within the same REPL session. You can also sample some data down to fit into memory, then switch to local mode where you can really quickly get the answers you’re looking for.

For example:

$ ./sbt assembly
$ ./scripts/scald.rb --repl --hdfs --host <host to ssh to and launch jobs from>
scalding> useLocalMode()
scalding> def helper(x: Int) = (x * x) / 2
helper: (x: Int)Int
scalding> val dummyData = TypedPipe.from(Seq(10, 11, 12))
scalding> dummyData.map(helper).dump
50
60
72
scalding> useHdfsMode()
scalding> val realData = TypedPipe.from(MySource(“/logs/some/real/data”)
scalding> realData.map(helper).dump

Easily save TypedPipes of case classes to disk

#1129 Lets you save any TypedPipe to disk from the REPL, regardless of format, so you can load it back up again later from another session. This is useful for saving an intermediate TypedPipe[MyCaseClass] without figuring out how to map it to a TSV or some other format. This works by serializing the objects to json behind the scenes.
For example:

$ ./scripts/scald.rb --json --repl --local
scalding> import com.twitter.scalding.TypedJson
import com.twitter.scalding.TypedJson

scalding> case class Bio(text: String, language: String)
defined class Bio

scalding> case class User(id: Long, bio: Bio)
defined class User

// in a real use case, getUsers might load a few sources, do some projections + joins, and then return
// a TypedPipe[User]
scalding> def getUsers() = TypedPipe.from(Seq( User(7, Bio("hello", "en")), User(8, Bio("hola", "es")) ))
getUsers: ()com.twitter.scalding.typed.TypedPipe[User]

scalding> getUsers().filter(_.bio.language == "en").save(TypedJson("/tmp/en-users"))
res0: com.twitter.scalding.TypedPipe[User] = com.twitter.scalding.typed.TypedPipeFactory@7cccf31c

scalding> exit
$ cat /tmp/en-users 
{"id":7,"bio":{"text":"hello","language":"en"}}

$ ./scripts/scald.rb --json --repl --local
scalding> import com.twitter.scalding.TypedJson
import com.twitter.scalding.TypedJson

scalding> case class Bio(text: String, language: String)
defined class Bio

scalding> case class User(id: Long, bio: Bio)
defined class User

scalding> val filteredUsers = TypedPipe.from(TypedJson[User]("/tmp/en-users"))
filteredUsers: com.twitter.scalding.typed.TypedPipe[User] = com.twitter.scalding.typed.TypedPipeFactory@44bb1922

scalding> filteredUsers.dump
User(7,Bio(hello,en))

ValuePipe.dump

#1157 Adds dump to ValuePipe, so now you can not only print the contents of TypedPipes but on ValuePipes as well (see above for examples of using dump in the REPL).

Execution Improvements

The scaladoc for Execution is complete, but some additional exposition was added to the wiki: Calling Scalding from inside your application. We added two helper methods to object Execution: Execution.failed creates an Execution from a Throwable (like Future.failed), and Execution.unit which creates a successful Execution[Unit], which is handy in some branching loops.

Bugfixes

The final bugs were finally removed from scalding*. Including #1190, a bug that effected the hashCode for Args instances and issue #1184 that made Stats unreliable for some users.
*some humor is used in scalding notes.

See CHANGES.md for a full change log.

Thanks to @avibryant, @danielhfrank, @DanielleSucher, @miguno, and the rest of the algebird contributors for the new aggregations, as well as all the scalding contributors