
Scalding 0.16.0 Released!

@piyushnarang piyushnarang released this 04 May 01:21
· 432 commits to develop since this release

28 Contributors to this release:

@Gabriel439, @JiJiTang, @MansurAshraf, @QuantumBear, @afsalthaj, @benpence, @danosipov, @epishkin, @gerashegalov, @ianoc, @isnotinvain, @jnievelt, @johnynek, @joshualande, @megaserg, @nevillelyh, @oeddyo, @piyushnarang, @reconditesea, @richwhitjr, @rubanm, @sid-kap, @sriramkrishnan, @stuhood, @tdyas, @tglstory, @vikasgorur, @zaneli

Release Notes

This release focuses on performance and correctness. The biggest improvements are to the Execution API and to OrderedSerialization.

Execution provides a reliable way to compose jobs and use Scalding as a library, rather than running subclasses of Job in a framework style. In this release we have improved its performance and added methods that give more control over Executions (such as .withNewCache, for cases where caching across the whole flow is not desired).
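As a rough illustration of the library style described above, the sketch below composes two Executions and runs them in local mode. It assumes scalding-core is on the classpath; the object name `ExecutionSketch` and the sample data are made up for this example and are not part of the release.

```scala
import com.twitter.scalding.{Config, Execution, Local, TypedPipe}
import scala.util.Try

// Hypothetical example object; names and data are illustrative only.
object ExecutionSketch {
  // An Execution that materializes a small in-memory pipe.
  val evens: Execution[Iterable[Int]] =
    TypedPipe.from(1 to 10).filter(_ % 2 == 0).toIterableExecution

  // A constant Execution, useful for threading plain values through.
  val label: Execution[String] = Execution.from("evens")

  // zip combines two independent Executions; map transforms the result.
  val combined: Execution[(String, Int)] =
    label.zip(evens).map { case (name, xs) => (name, xs.sum) }

  // Nothing runs until the Execution is handed a Config and a Mode.
  def run(): Try[(String, Int)] =
    combined.waitFor(Config.default, Local(true))
}
```

Because an Execution is only a description of work, it can be built and combined inside ordinary application code and run at the point the caller chooses, which is what makes the library style possible.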

OrderedSerialization makes it easy to use binary comparators: comparators that act directly on serialized data, so they allocate far less when the data is partitioned by key. These were discussed in a presentation at the Hadoop Summit [slides]. The comparators are generated by macros, so most simple types (case classes, Scala collections, primitives, and recursive combinations of these) work with a single import (see this note).
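To make the idea of a binary comparator concrete, here is a minimal sketch in plain Scala with no Scalding dependency. The encoding (a 4-byte big-endian Int with the sign bit flipped) and the names `serialize` and `compareBytes` are assumptions chosen for illustration; they are not the format or API that the OrderedSerialization macros generate.

```scala
import java.nio.ByteBuffer

// Hypothetical illustration of comparing keys in serialized form.
object BinaryCompareSketch {
  // Assumed encoding: big-endian Int with the sign bit flipped, so that
  // unsigned byte-wise order matches the numeric order of the Ints.
  def serialize(key: Int): Array[Byte] =
    ByteBuffer.allocate(4).putInt(key ^ Int.MinValue).array()

  // Compares two serialized keys byte by byte, treating bytes as unsigned.
  // No key object is deserialized or allocated during the comparison.
  def compareBytes(a: Array[Byte], b: Array[Byte]): Int = {
    val n = math.min(a.length, b.length)
    var i = 0
    var result = 0
    while (result == 0 && i < n) {
      result = (a(i) & 0xff) - (b(i) & 0xff)
      i += 1
    }
    if (result != 0) result else a.length - b.length
  }
}
```

A sort or shuffle that only needs ordering can call `compareBytes` on the raw buffers and skip building key objects, which is where the allocation savings come from.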

Here’s a list of some of the features we’ve added to Scalding in this release.

New Features

- OrderedSerialization (fast binary comparators for grouping and joining + macros to create them) is production ready. To use it, and other macros, [see this note](https://github.com/twitter/scalding/wiki/Automatic-Orderings,-Monoids-and-Arbitraries). Updates related to OrderedSerialization - #1307, #1316, #1320, #1321, #1329, #1338, #1457
- Add TypedParquet macros for parquet read / write support (Note: this might not be ready for production use as it doesn’t support schema evolution) - #1303
- Naming of Executions is supported - #1334
- Add line numbers at .group and .toPipe boundaries - #1335
- Make some repl components extensible to allow setting custom implicits and config to load at boot time - #1342
- Implement flatMapValues method - #1348
- Add NullSink, can be used with .onComplete to drive a side-effecting (but idempotent) job - #1378
- Add monoid and semigroup for Execution - #1379
- Support nesting Options in TypeDescriptor - #1387
- Add .groupWith method to TypedPipe - #1406
- Add counter verification logic - #1409
- Scalding viz options - #1426
- Add TypedPipeChecker to assist in testing a TypedPipe - #1478
- Add withConfig api to allow running an execution with a transformed config to override hadoop or source level options in subsections - #1489
- Add a liftToTry function to Execution - #1499
- Utility methods for running Executions in parallel - #1507
- Adds support for OrderedSerialization on sealed abstract classes - #1518
- Support for more formats to work with RichDate - #1522

Important Bug Fixes

- Add InvalidSourceTap to catch all cases for no good path - #1458
- SuccessFileSource: correctness for multi-dir globs - #1470
- A serialization error we were seeing in repl usage - #1376
- Fix lack of Externalizer in joins - #1421
- Requires a DateRange's "end" to be after its "start" - #1425
- Fixes map-only jobs to accommodate both an lzo source and sink binary converter - #1431
- Fix bug with sketch joins and single keys - #1451
- Fix FileSystem.get issue - #1487
- Fix scrooge + OrderedSerialization for field names starting with `_` - #1534
- Add before() and after() to RichDate - #1538

Performance Improvements

- Change defaults for Scalding reducer estimator - #1333
- Add runtime-based reducer estimators - #1358
- When using WriteExecution and forceToDisk we can share the same flowdef closer in construction - #1414
- Cache the zipped up write executions - #1415
- Cache counters for stat updates rather than doing a lookup for every increment - #1495
- Cache boxed classes - #1501
- Typed Mapside Reduce - #1508
- Add auto forceToDisk support to hashJoin in TypedPipe - #1529
- Fix performance bug in TypedPipeDiff - #1300
- Make sure Execution.zip fails fast - #1412
- Fix Rounding Bug in RatioBasedEstimator - #1542

Full change list is here