
Releases: twitter/scalding

Configure to require SUCCESS

22 Dec 20:23

This release is binary compatible with 0.17.3, so it should be safe to use. One behavior change: skipping null counters is now opt-in (a default we regretted when shipping 0.17.3). See: #1716

  • add DateRange.prepend: #1748
  • TextLine is now a TypedSink[String]: #1752
  • check for _SUCCESS file in any FileSource based on a config flag: #1758
  • add a setting to skip null counters: #1759
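
The `_SUCCESS` check added to FileSource builds on Hadoop's convention of writing an empty `_SUCCESS` marker into an output directory when a job finishes; requiring it before reading guards against consuming partial output. A minimal plain-Scala illustration of that underlying check (this sketches the idea only, not scalding's FileSource API; the names here are ours):

```scala
import java.nio.file.{Files, Path}

// Hadoop writes an empty _SUCCESS marker into an output directory when a
// job completes successfully; requiring it before reading guards against
// consuming partial output from a failed or in-flight job.
def hasSuccessFile(dir: Path): Boolean =
  Files.isDirectory(dir) && Files.exists(dir.resolve("_SUCCESS"))

// Demo: one "complete" output directory and one "partial" one.
val complete = Files.createTempDirectory("complete-output")
Files.createFile(complete.resolve("part-00000"))
Files.createFile(complete.resolve("_SUCCESS"))

val partial = Files.createTempDirectory("partial-output")
Files.createFile(partial.resolve("part-00000"))
```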

Workaround null counters

30 Sep 19:10

This is a minor bugfix release that works around Hadoop handing us a null counter reporter; we work around it by ignoring counters. This may not be the best solution, but it unblocks some users. We don't yet know why Hadoop sometimes gives users a null counter reporter.

See #1726

Scalding 0.17.2

14 Jul 21:33

This version is basically the same as 0.17.1 but backward compatible with 0.17.0.

  • Revert memory estimator changes on 0.17.x branch: #1704
  • Turn on mima checks on 0.17.x branch: #1706

Scalding 0.17.1 with 2.11 and 2.12 support

11 Jul 00:00

Changes in this release:

Scalding 0.17.0 with 2.12 support!

11 Apr 17:40

This is the first Scalding release that publishes artifacts for Scala 2.12! Here are some of the changes that are part of this release:

  • 2.12 related updates: #1663, #1646
  • Use reflection over Jobs to find serialized classes: #1654, #1662
  • Simplify match statement and use collection.breakOut: #1661
  • Add explicit types to implicit methods and values: #1660
  • Reducer estimation size fixes: #1652, #1650, #1645, #1644
  • Use Combined*SequenceFile for VKVS, WritableSequenceFileScheme, SequenceFileScheme: #1647
  • Improve Vertica support in scalding-db: #1655
  • Add andThen to Mappable: #1656
  • Expand libjars globs in ScaldingShell to match the behavior of Tool: #1651
  • Use Batched in Sketch production: #1648
  • Pick up Algebird 0.13.0: #1640
  • Added API for Execution/Config to work with DistributedCache: #1635
  • Bump chill version to 0.8.3: #1634
  • Fixes a bug in how we use this stack: #1632
  • Upgrade build to sbt 0.13.13: #1629
  • Generate Scalding microsite via sbt-microsites: #1623
  • FileSource support for empty directories: #1622, #1618, #1613, #1611, #1591
  • Clean up temporary files created by forceToDiskExecution: #1621
  • Moving the repl in wonderland to a dedicated md file: #1614
  • Update Scala and sbt version: #1610
  • REFACTOR: Fixed some compilation warnings: #1604
  • REFACTOR: Rename parameter to reflect expectation: #1601
  • Add partitioned sources for Parquet thrift / scrooge: #1590
  • Add a test for sortBy: #1594
  • Create COMMITTERS.md: #1589
  • Use ExecutionContext in Execution.from/fromTry: #1587
  • Support custom parquet field name strategies: #1580
  • Deprecate reflection-based JobTest apply method: #1578
  • Use Caching for FlowDefExecution: #1581
  • [parquet tuple macros] listType was deprecated in favor of listOfElements: #1579
  • Use Batched to speed up CMS summing on mappers: #1575
  • Remove a TypedPipeFactory wrapper which seems unneeded: #1576
  • Make Writeable sources Mappable to get toIterator: #1573
  • case class implicit children: #1569

Scalding 0.16.0 Released!

04 May 01:21

28 Contributors to this release:

@Gabriel439, @JiJiTang, @MansurAshraf, @QuantumBear, @afsalthaj, @benpence, @danosipov, @epishkin, @gerashegalov, @ianoc, @isnotinvain, @jnievelt, @johnynek, @joshualande, @megaserg, @nevillelyh, @oeddyo, @piyushnarang, @reconditesea, @richwhitjr, @rubanm, @sid-kap, @sriramkrishnan, @stuhood, @tdyas, @tglstory, @vikasgorur, @zaneli

Release Notes

This release is a performance and correctness improvement release. The biggest improvements are to the Execution API and to OrderedSerialization.

Execution allows a reliable way to compose jobs and use scalding as a library, rather than running subclasses of Job in a framework style. In this release we have improved the performance and added some methods for more control of Executions (such as .withNewCache for cases where caching in the whole flow is not desired).

OrderedSerialization is a way to easily leverage binary comparators: comparators that act directly on serialized data, so they don't need to allocate nearly as much when the data is partitioned by key. These were discussed in a presentation at the Hadoop Summit [slides]. They are generated by macros, so most simple types (case classes, scala collections, primitives, and recursive combinations of these) are easy to use with a single import (see this note).
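
As a toy illustration of the binary-comparator idea (plain Scala, not the macro-generated OrderedSerialization itself): encode values so that byte-wise comparison of the serialized form agrees with the natural ordering, then compare without deserializing. The sign-bit encoding trick and all names below are ours:

```scala
import java.nio.ByteBuffer

// Encode an Int so that unsigned lexicographic byte order matches numeric
// order: flipping the sign bit moves negative values below positives, then
// big-endian layout makes high-order differences compare first.
def encode(v: Int): Array[Byte] =
  ByteBuffer.allocate(4).putInt(v ^ Int.MinValue).array()

// Compare two encoded values byte-by-byte, never deserializing either one.
def compareBinary(a: Array[Byte], b: Array[Byte]): Int = {
  var i = 0
  while (i < a.length && i < b.length) {
    val cmp = java.lang.Integer.compare(a(i) & 0xff, b(i) & 0xff)
    if (cmp != 0) return cmp
    i += 1
  }
  java.lang.Integer.compare(a.length, b.length)
}
```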

Here’s a list of some of the features we’ve added to Scalding in this release.

New Features

  • OrderedSerialization (fast binary comparators for grouping and joining, plus macros to create them) is production ready. To use it, and other macros, [see this note](https://github.com/twitter/scalding/wiki/Automatic-Orderings,-Monoids-and-Arbitraries). Related updates: #1307, #1316, #1320, #1321, #1329, #1338, #1457
  • Add TypedParquet macros for parquet read/write support (note: this might not be ready for production use as it doesn't support schema evolution): #1303
  • Naming of Executions is supported: #1334
  • Add line numbers at .group and .toPipe boundaries: #1335
  • Make some repl components extensible to allow setting custom implicits and config to load at boot time: #1342
  • Implement flatMapValues method: #1348
  • Add NullSink, which can be used with .onComplete to drive a side-effecting (but idempotent) job: #1378
  • Add monoid and semigroup for Execution: #1379
  • Support nesting Options in TypeDescriptor: #1387
  • Add .groupWith method to TypedPipe: #1406
  • Add counter verification logic: #1409
  • Scalding viz options: #1426
  • Add TypedPipeChecker to assist in testing a TypedPipe: #1478
  • Add withConfig api to allow running an execution with a transformed config, to override hadoop or source level options in subsections: #1489
  • Add a liftToTry function to Execution: #1499
  • Utility methods for running Executions in parallel: #1507
  • Add support for OrderedSerialization on sealed abstract classes: #1518
  • Support for more formats to work with RichDate: #1522

Important Bug Fixes

  • Add InvalidSourceTap to catch all cases for no good path: #1458
  • SuccessFileSource: correctness for multi-dir globs: #1470
  • Fix a serialization error we were seeing in repl usage: #1376
  • Fix lack of Externalizer in joins: #1421
  • Require a DateRange's "end" to be after its "start": #1425
  • Fix map-only jobs to accommodate both an lzo source and sink binary converter: #1431
  • Fix bug with sketch joins and single keys: #1451
  • Fix FileSystem.get issue: #1487
  • Fix scrooge + OrderedSerialization for field names starting with `_`: #1534
  • Add before() and after() to RichDate: #1538

Performance Improvements

  • Change defaults for Scalding reducer estimator: #1333
  • Add runtime-based reducer estimators: #1358
  • When using WriteExecution and forceToDisk we can share the same flowdef closer in construction: #1414
  • Cache the zipped-up write executions: #1415
  • Cache counters for stat updates rather than doing a lookup for every increment: #1495
  • Cache boxed classes: #1501
  • Typed Mapside Reduce: #1508
  • Add auto forceToDisk support to hashJoin in TypedPipe: #1529
  • Fix performance bug in TypedPipeDiff: #1300
  • Make sure Execution.zip fails fast: #1412
  • Fix rounding bug in RatioBasedEstimator: #1542

Full change list is here

v0.16.0-RC6

06 Apr 00:33

This is the candidate that we are considering for the 0.16.0 release. We will be testing this RC internally at Twitter; if it looks good and other folks are on board, it can be promoted to 0.16.0.

LzoGenericScheme/Source, Typed Parquet Tuple and Better Performance with a new Elephant-Bird API

21 May 23:48

  • Typed Parquet Tuple #1198
  • LzoGenericScheme and Source #1268
  • Move OrderedSerialization into zero-dep scalding-serialization module #1289
  • bump elephantbird to 4.8 #1292
  • Fix OrderedSerialization for some forked graphs #1293
  • Add serialization modules to aggregate list #1298

OrderedSerialization is work-in-progress and is not ready to be used.

Scalding 0.14.0 Released!

18 May 21:06

ExecutionApp tutorial

A new tutorial for ExecutionApp is added in #1196. You can check out ExecutionTutorial.scala for the source.

Simple HDFS local mode REPL

#1244 adds an easy-to-use useHdfsLocalMode method to the REPL for running hadoop locally. useHdfsMode reverts the behavior.

TypedPipe conditional execution via #make

TypedPipe now exposes the make method for fallback computation/production of an optional store in an Execution. If the store already exists, the computation is skipped. Otherwise, the computation is performed and the store is created before proceeding with execution.
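
The compute-or-reuse semantics can be sketched in plain Scala (this mirrors the behavior described above; it is not the actual TypedPipe#make signature, and the names are ours):

```scala
import java.nio.file.{Files, Path}

// Sketch of make-style fallback semantics: read the store if it already
// exists, otherwise run the computation, persist it, and return the result.
def makeStore(store: Path)(compute: => List[String]): List[String] =
  if (Files.exists(store)) {
    // Store present: skip the computation entirely.
    scala.io.Source.fromFile(store.toFile).getLines().toList
  } else {
    // Store absent: compute, write, then proceed with the fresh data.
    val data = compute
    Files.write(store, data.mkString("\n").getBytes)
    data
  }

var computeCount = 0
val store = Files.createTempDirectory("make-demo").resolve("store.txt")

def expensive(): List[String] = { computeCount += 1; List("a", "b") }

val first  = makeStore(store)(expensive())  // store absent: computes and writes
val second = makeStore(store)(expensive())  // store present: computation skipped
```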

TypedPipeDiff

#1266 adds TypedPipeDiff and helper enrichments for comparing the contents of two pipes.

RichPipe#skewJoinWithSmaller now works

A data bug with the fields API method skewJoinWithSmaller was discovered and fixed. The API should be functionally equivalent to joinWithSmaller now.

See CHANGES.md for the full list of changes.

Scalding 0.13.1, the most convenient scalding we’ve ever released!

11 Feb 01:39

Scala 2.11 Support is here!

We’re now publishing scalding for scala 2.11! Get it while it’s hot!

Easier aggregation via the latest Algebird

Algebird now comes with some very powerful aggregators that make it easy to compose aggregations and apply them in a single pass.

For example, to find each customer's order with the max quantity, as well as the order with the min price, in a single pass:

import com.twitter.algebird.Aggregator.{maxBy, minBy}

// Pick the order with the max quantity (presented as the quantity itself)
// and the order with the min price (presented as the price), in one pass:
val maxOp = maxBy[Order, Long](_.orderQuantity).andThenPresent(_.orderQuantity)
val minOp = minBy[Order, Long](_.orderPrice).andThenPresent(_.orderPrice)

TypedPipe.from(orders)
  .groupBy(_.customerName)
  .aggregate(maxOp.join(minOp))

For more examples and documentation, see: Aggregation using Algebird Aggregators.
And for a hands-on walkthrough in the REPL, see Alice In Aggregator Land.

Read-Eval-Print-Love

We’ve made some improvements that make day to day use of the REPL more convenient:

Easily switch between local and hdfs mode

#1113 Makes it easy to switch between local and hdfs mode in the REPL, without losing your session.
So you can iterate locally on some small data, and once that’s working, run a hadoop job on your real data, all from within the same REPL session. You can also sample some data down to fit into memory, then switch to local mode where you can really quickly get the answers you’re looking for.

For example:

$ ./sbt assembly
$ ./scripts/scald.rb --repl --hdfs --host <host to ssh to and launch jobs from>
scalding> useLocalMode()
scalding> def helper(x: Int) = (x * x) / 2
helper: (x: Int)Int
scalding> val dummyData = TypedPipe.from(Seq(10, 11, 12))
scalding> dummyData.map(helper).dump
50
60
72
scalding> useHdfsMode()
scalding> val realData = TypedPipe.from(MySource("/logs/some/real/data"))
scalding> realData.map(helper).dump

Easily save TypedPipes of case classes to disk

#1129 Lets you save any TypedPipe to disk from the REPL, regardless of format, so you can load it back up again later from another session. This is useful for saving an intermediate TypedPipe[MyCaseClass] without figuring out how to map it to a TSV or some other format. This works by serializing the objects to json behind the scenes.
For example:

$ ./scripts/scald.rb --json --repl --local
scalding> import com.twitter.scalding.TypedJson
import com.twitter.scalding.TypedJson

scalding> case class Bio(text: String, language: String)
defined class Bio

scalding> case class User(id: Long, bio: Bio)
defined class User

// in a real use case, getUsers might load a few sources, do some projections + joins, and then return
// a TypedPipe[User]
scalding> def getUsers() = TypedPipe.from(Seq( User(7, Bio("hello", "en")), User(8, Bio("hola", "es")) ))
getUsers: ()com.twitter.scalding.typed.TypedPipe[User]

scalding> getUsers().filter(_.bio.language == "en").save(TypedJson("/tmp/en-users"))
res0: com.twitter.scalding.TypedPipe[User] = com.twitter.scalding.typed.TypedPipeFactory@7cccf31c

scalding> exit
$ cat /tmp/en-users 
{"id":7,"bio":{"text":"hello","language":"en"}}

$ ./scripts/scald.rb --json --repl --local
scalding> import com.twitter.scalding.TypedJson
import com.twitter.scalding.TypedJson

scalding> case class Bio(text: String, language: String)
defined class Bio

scalding> case class User(id: Long, bio: Bio)
defined class User

scalding> val filteredUsers = TypedPipe.from(TypedJson[User]("/tmp/en-users"))
filteredUsers: com.twitter.scalding.typed.TypedPipe[User] = com.twitter.scalding.typed.TypedPipeFactory@44bb1922

scalding> filteredUsers.dump
User(7,Bio(hello,en))

ValuePipe.dump

#1157 adds dump to ValuePipe, so you can now print the contents of ValuePipes as well as TypedPipes (see above for examples of using dump in the REPL).

Execution Improvements

The scaladoc for Execution is complete, but some additional exposition was added to the wiki: Calling Scalding from inside your application. We added two helper methods to object Execution: Execution.failed, which creates an Execution from a Throwable (like Future.failed), and Execution.unit, which creates a successful Execution[Unit], handy in some branching loops.
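
Both helpers mirror the corresponding scala.concurrent.Future constructors; a quick standard-library refresher on that half of the analogy (plain Scala, no scalding needed):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.util.Try

// Future.failed wraps a Throwable in an already-failed computation, and
// Future.unit is an already-successful Future[Unit] -- the same base cases
// that Execution.failed and Execution.unit provide for Executions.
val failed: Future[Nothing] = Future.failed(new RuntimeException("boom"))
val ready: Future[Unit]     = Future.unit

// Await.result rethrows the failure and returns the success value.
val failedOutcome = Try(Await.result(failed, 1.second))
val readyOutcome  = Try(Await.result(ready, 1.second))
```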

Bugfixes

The final bugs were finally removed from scalding*, including #1190, a bug that affected the hashCode for Args instances, and issue #1184, which made Stats unreliable for some users.
*some humor is used in scalding notes.

See CHANGES.md for a full change log.

Thanks to @avibryant, @danielhfrank, @DanielleSucher, @miguno, and the rest of the algebird contributors for the new aggregations, as well as all the scalding contributors.