
Releases: twitter/scalding

Configure to require SUCCESS

22 Dec 20:23

This release is binary compatible with 0.17.3, so it should be safe to use. One behavior change: skipping null counters is now opt-in (a default we regretted when shipping 0.17.3). See: #1716

  • add DateRange.prepend: #1748
  • TextLine is now a TypedSink[String]: #1752
  • check for _SUCCESS file in any FileSource based on a config flag: #1758
  • add a setting to skip null counters: #1759
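
The `_SUCCESS` check added to FileSource builds on Hadoop's convention of writing an empty `_SUCCESS` marker into an output directory when a job finishes; requiring it before reading guards against consuming partial output. A minimal plain-Scala illustration of that underlying check (this sketches the idea only, not scalding's FileSource API; the names here are ours):

```scala
import java.nio.file.{Files, Path}

// Hadoop writes an empty _SUCCESS marker into an output directory when a
// job completes successfully; requiring it before reading guards against
// consuming partial output from a failed or in-flight job.
def hasSuccessFile(dir: Path): Boolean =
  Files.isDirectory(dir) && Files.exists(dir.resolve("_SUCCESS"))

// Demo: one "complete" output directory and one "partial" one.
val complete = Files.createTempDirectory("complete-output")
Files.createFile(complete.resolve("part-00000"))
Files.createFile(complete.resolve("_SUCCESS"))

val partial = Files.createTempDirectory("partial-output")
Files.createFile(partial.resolve("part-00000"))
```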

Workaround null counters

30 Sep 19:10

This is a minor bugfix release that works around Hadoop handing us a null counter reporter; we work around it by ignoring counters. This may not be the best solution, but it unblocks some users. We don't yet know why Hadoop sometimes gives users a null counter reporter.

See #1726

Scalding 0.17.2

14 Jul 21:33

This version is basically the same as 0.17.1 but backward compatible with 0.17.0.

  • Revert memory estimator changes on 0.17.x branch: #1704
  • Turn on mima checks on 0.17.x branch: #1706

Scalding 0.17.1 with 2.11 and 2.12 support

11 Jul 00:00

Changes in this release:

Scalding 0.17.0 with 2.12 support!

11 Apr 17:40

This is the first Scalding release that publishes artifacts for Scala 2.12! Here are some of the changes that are part of this release:

  • 2.12 related updates: #1663, #1646
  • Use reflection over Jobs to find serialized classes: #1654, #1662
  • Simplify match statement and use collection.breakOut: #1661
  • Add explicit types to implicit methods and values: #1660
  • Reducer estimation size fixes: #1652, #1650, #1645, #1644
  • Use Combined*SequenceFile for VKVS, WritableSequenceFileScheme, SequenceFileScheme: #1647
  • Improve Vertica support in scalding-db: #1655
  • Add andThen to Mappable: #1656
  • Expand libjars globs in ScaldingShell to match the behavior of Tool: #1651
  • Use Batched in Sketch production: #1648
  • Pick up Algebird 0.13.0: #1640
  • Added API for Execution/Config to work with DistributedCache: #1635
  • Bump chill version to 0.8.3: #1634
  • Fixes a bug in how we use this stack: #1632
  • Upgrade build to sbt 0.13.13: #1629
  • Generate Scalding microsite via sbt-microsites: #1623
  • FileSource support for empty directories: #1622, #1618, #1613, #1611, #1591
  • Clean up temporary files created by forceToDiskExecution: #1621
  • Moving the repl in wonderland to a dedicated md file: #1614
  • Update Scala and sbt version: #1610
  • REFACTOR: Fixed some compilation warnings: #1604
  • REFACTOR: Rename parameter to reflect expectation: #1601
  • Add partitioned sources for Parquet thrift / scrooge: #1590
  • Add a test for sortBy: #1594
  • Create COMMITTERS.md: #1589
  • Use ExecutionContext in Execution.from/fromTry: #1587
  • Support custom parquet field name strategies: #1580
  • Deprecate reflection-based JobTest apply method: #1578
  • Use Caching for FlowDefExecution: #1581
  • [parquet tuple macros] listType was deprecated in favor of listOfElements: #1579
  • Use Batched to speed up CMS summing on mappers: #1575
  • Remove a TypedPipeFactory wrapper which seems unneeded: #1576
  • Make Writeable sources Mappable to get toIterator: #1573
  • case class implicit children: #1569

Scalding 0.16.0 Released!

04 May 01:21

28 Contributors to this release:

@Gabriel439, @JiJiTang, @MansurAshraf, @QuantumBear, @afsalthaj, @benpence, @danosipov, @epishkin, @gerashegalov, @ianoc, @isnotinvain, @jnievelt, @johnynek, @joshualande, @megaserg, @nevillelyh, @oeddyo, @piyushnarang, @reconditesea, @richwhitjr, @rubanm, @sid-kap, @sriramkrishnan, @stuhood, @tdyas, @tglstory, @vikasgorur, @zaneli

Release Notes

This release is a performance and correctness improvement release. The biggest improvements are to the Execution API and to OrderedSerialization.

Execution allows a reliable way to compose jobs and use scalding as a library, rather than running subclasses of Job in a framework style. In this release we have improved the performance and added some methods for more control of Executions (such as .withNewCache for cases where caching in the whole flow is not desired).

OrderedSerialization is a way to easily leverage binary comparators: comparators that act directly on serialized data, so they don't need to allocate nearly as much when the data is partitioned by key. These were discussed in a presentation at the Hadoop Summit [slides]. They are generated by macros, so most simple types (case classes, scala collections, primitives, and recursive combinations of these) are easy to use with a single import (see this note).
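
As a toy illustration of the binary-comparator idea (plain Scala, not the macro-generated OrderedSerialization itself): encode values so that byte-wise comparison of the serialized form agrees with the natural ordering, then compare without deserializing. The sign-bit encoding trick and all names below are ours:

```scala
import java.nio.ByteBuffer

// Encode an Int so that unsigned lexicographic byte order matches numeric
// order: flipping the sign bit moves negative values below positives, then
// big-endian layout makes high-order differences compare first.
def encode(v: Int): Array[Byte] =
  ByteBuffer.allocate(4).putInt(v ^ Int.MinValue).array()

// Compare two encoded values byte-by-byte, never deserializing either one.
def compareBinary(a: Array[Byte], b: Array[Byte]): Int = {
  var i = 0
  while (i < a.length && i < b.length) {
    val cmp = java.lang.Integer.compare(a(i) & 0xff, b(i) & 0xff)
    if (cmp != 0) return cmp
    i += 1
  }
  java.lang.Integer.compare(a.length, b.length)
}
```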

Here’s a list of some of the features we’ve added to Scalding in this release.

New Features

  • OrderedSerialization (fast binary comparators for grouping and joining, plus macros to create them) is production ready. To use it, and other macros, [see this note](https://github.com/twitter/scalding/wiki/Automatic-Orderings,-Monoids-and-Arbitraries). Related updates: #1307, #1316, #1320, #1321, #1329, #1338, #1457
  • Add TypedParquet macros for parquet read/write support (note: this might not be ready for production use as it doesn't support schema evolution): #1303
  • Naming of Executions is supported: #1334
  • Add line numbers at .group and .toPipe boundaries: #1335
  • Make some repl components extensible to allow setting custom implicits and config to load at boot time: #1342
  • Implement flatMapValues method: #1348
  • Add NullSink, which can be used with .onComplete to drive a side-effecting (but idempotent) job: #1378
  • Add monoid and semigroup for Execution: #1379
  • Support nesting Options in TypeDescriptor: #1387
  • Add .groupWith method to TypedPipe: #1406
  • Add counter verification logic: #1409
  • Scalding viz options: #1426
  • Add TypedPipeChecker to assist in testing a TypedPipe: #1478
  • Add withConfig api to allow running an execution with a transformed config, to override hadoop or source level options in subsections: #1489
  • Add a liftToTry function to Execution: #1499
  • Utility methods for running Executions in parallel: #1507
  • Add support for OrderedSerialization on sealed abstract classes: #1518
  • Support for more formats to work with RichDate: #1522

Important Bug Fixes

  • Add InvalidSourceTap to catch all cases for no good path: #1458
  • SuccessFileSource: correctness for multi-dir globs: #1470
  • Fix a serialization error we were seeing in repl usage: #1376
  • Fix lack of Externalizer in joins: #1421
  • Require a DateRange's "end" to be after its "start": #1425
  • Fix map-only jobs to accommodate both an lzo source and sink binary converter: #1431
  • Fix bug with sketch joins and single keys: #1451
  • Fix FileSystem.get issue: #1487
  • Fix scrooge + OrderedSerialization for field names starting with `_`: #1534
  • Add before() and after() to RichDate: #1538

Performance Improvements

  • Change defaults for Scalding reducer estimator: #1333
  • Add runtime-based reducer estimators: #1358
  • When using WriteExecution and forceToDisk we can share the same flowdef closer in construction: #1414
  • Cache the zipped-up write executions: #1415
  • Cache counters for stat updates rather than doing a lookup for every increment: #1495
  • Cache boxed classes: #1501
  • Typed Mapside Reduce: #1508
  • Add auto forceToDisk support to hashJoin in TypedPipe: #1529
  • Fix performance bug in TypedPipeDiff: #1300
  • Make sure Execution.zip fails fast: #1412
  • Fix rounding bug in RatioBasedEstimator: #1542

Full change list is here

v0.16.0-RC6

06 Apr 00:33

This is the candidate that we are considering for the 0.16.0 release. We will be testing this RC internally at Twitter; if it looks good and other folks are on board, it can be promoted to 0.16.0.

LzoGenericScheme/Source, Typed Parquet Tuple and Better Performance with a new Elephant-Bird API

21 May 23:48

  • Typed Parquet Tuple #1198
  • LzoGenericScheme and Source #1268
  • Move OrderedSerialization into zero-dep scalding-serialization module #1289
  • bump elephantbird to 4.8 #1292
  • Fix OrderedSerialization for some forked graphs #1293
  • Add serialization modules to aggregate list #1298

OrderedSerialization is work-in-progress and is not ready to be used.

Scalding 0.14.0 Released!

18 May 21:06

ExecutionApp tutorial

A new tutorial for ExecutionApp is added in #1196. You can check out ExecutionTutorial.scala for the source.

Simple HDFS local mode REPL

#1244 adds an easy-to-use useHdfsLocalMode method to the REPL for running hadoop locally. useHdfsMode reverts the behavior.

TypedPipe conditional execution via #make

TypedPipe now exposes the make method for fallback computation/production of an optional store in an Execution. If the store already exists, the computation is skipped. Otherwise, the computation is performed and the store is created before proceeding with execution.
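
The compute-or-reuse semantics can be sketched in plain Scala (this mirrors the behavior described above; it is not the actual TypedPipe#make signature, and the names are ours):

```scala
import java.nio.file.{Files, Path}

// Sketch of make-style fallback semantics: read the store if it already
// exists, otherwise run the computation, persist it, and return the result.
def makeStore(store: Path)(compute: => List[String]): List[String] =
  if (Files.exists(store)) {
    // Store present: skip the computation entirely.
    scala.io.Source.fromFile(store.toFile).getLines().toList
  } else {
    // Store absent: compute, write, then proceed with the fresh data.
    val data = compute
    Files.write(store, data.mkString("\n").getBytes)
    data
  }

var computeCount = 0
val store = Files.createTempDirectory("make-demo").resolve("store.txt")

def expensive(): List[String] = { computeCount += 1; List("a", "b") }

val first  = makeStore(store)(expensive())  // store absent: computes and writes
val second = makeStore(store)(expensive())  // store present: computation skipped
```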

TypedPipeDiff

#1266 adds TypedPipeDiff and helper enrichments for comparing the contents of two pipes.

RichPipe#skewJoinWithSmaller now works

A data bug with the fields API method skewJoinWithSmaller was discovered and fixed. The API should be functionally equivalent to joinWithSmaller now.

See CHANGES.md for the full list of changes.

Scalding 0.13.1, the most convenient scalding we’ve ever released!

11 Feb 01:39

Scala 2.11 Support is here!

We’re now publishing scalding for scala 2.11! Get it while it’s hot!

Easier aggregation via the latest Algebird

Algebird now comes with some very powerful aggregators that make it easy to compose aggregations and apply them in a single pass.

For example, to find each customer's order with the max quantity, as well as the order with the min price, in a single pass:

import com.twitter.algebird.Aggregator.{maxBy, minBy}

// Pick the order with the max quantity (presented as the quantity itself)
// and the order with the min price (presented as the price), in one pass:
val maxOp = maxBy[Order, Long](_.orderQuantity).andThenPresent(_.orderQuantity)
val minOp = minBy[Order, Long](_.orderPrice).andThenPresent(_.orderPrice)

TypedPipe.from(orders)
  .groupBy(_.customerName)
  .aggregate(maxOp.join(minOp))

For more examples and documentation, see: Aggregation using Algebird Aggregators.
And for a hands-on walkthrough in the REPL, see Alice In Aggregator Land.

Read-Eval-Print-Love

We’ve made some improvements that make day to day use of the REPL more convenient:

Easily switch between local and hdfs mode

#1113 Makes it easy to switch between local and hdfs mode in the REPL, without losing your session.
So you can iterate locally on some small data, and once that’s working, run a hadoop job on your real data, all from within the same REPL session. You can also sample some data down to fit into memory, then switch to local mode where you can really quickly get the answers you’re looking for.

For example:

$ ./sbt assembly
$ ./scripts/scald.rb --repl --hdfs --host <host to ssh to and launch jobs from>
scalding> useLocalMode()
scalding> def helper(x: Int) = (x * x) / 2
helper: (x: Int)Int
scalding> val dummyData = TypedPipe.from(Seq(10, 11, 12))
scalding> dummyData.map(helper).dump
50
60
72
scalding> useHdfsMode()
scalding> val realData = TypedPipe.from(MySource("/logs/some/real/data"))
scalding> realData.map(helper).dump

Easily save TypedPipes of case classes to disk

#1129 Lets you save any TypedPipe to disk from the REPL, regardless of format, so you can load it back up again later from another session. This is useful for saving an intermediate TypedPipe[MyCaseClass] without figuring out how to map it to a TSV or some other format. This works by serializing the objects to json behind the scenes.
For example:

$ ./scripts/scald.rb --json --repl --local
scalding> import com.twitter.scalding.TypedJson
import com.twitter.scalding.TypedJson

scalding> case class Bio(text: String, language: String)
defined class Bio

scalding> case class User(id: Long, bio: Bio)
defined class User

// in a real use case, getUsers might load a few sources, do some projections + joins, and then return
// a TypedPipe[User]
scalding> def getUsers() = TypedPipe.from(Seq( User(7, Bio("hello", "en")), User(8, Bio("hola", "es")) ))
getUsers: ()com.twitter.scalding.typed.TypedPipe[User]

scalding> getUsers().filter(_.bio.language == "en").save(TypedJson("/tmp/en-users"))
res0: com.twitter.scalding.TypedPipe[User] = com.twitter.scalding.typed.TypedPipeFactory@7cccf31c

scalding> exit
$ cat /tmp/en-users 
{"id":7,"bio":{"text":"hello","language":"en"}}

$ ./scripts/scald.rb --json --repl --local
scalding> import com.twitter.scalding.TypedJson
import com.twitter.scalding.TypedJson

scalding> case class Bio(text: String, language: String)
defined class Bio

scalding> case class User(id: Long, bio: Bio)
defined class User

scalding> val filteredUsers = TypedPipe.from(TypedJson[User]("/tmp/en-users"))
filteredUsers: com.twitter.scalding.typed.TypedPipe[User] = com.twitter.scalding.typed.TypedPipeFactory@44bb1922

scalding> filteredUsers.dump
User(7,Bio(hello,en))

ValuePipe.dump

#1157 adds dump to ValuePipe, so you can now print the contents of ValuePipes as well as TypedPipes (see above for examples of using dump in the REPL).

Execution Improvements

The scaladoc for Execution is complete, but some additional exposition was added to the wiki: Calling Scalding from inside your application. We added two helper methods to object Execution: Execution.failed, which creates an Execution from a Throwable (like Future.failed), and Execution.unit, which creates a successful Execution[Unit], handy in some branching loops.
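
Both helpers mirror the corresponding scala.concurrent.Future constructors; a quick standard-library refresher on that half of the analogy (plain Scala, no scalding needed):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.util.Try

// Future.failed wraps a Throwable in an already-failed computation, and
// Future.unit is an already-successful Future[Unit] -- the same base cases
// that Execution.failed and Execution.unit provide for Executions.
val failed: Future[Nothing] = Future.failed(new RuntimeException("boom"))
val ready: Future[Unit]     = Future.unit

// Await.result rethrows the failure and returns the success value.
val failedOutcome = Try(Await.result(failed, 1.second))
val readyOutcome  = Try(Await.result(ready, 1.second))
```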

Bugfixes

The final bugs were finally removed from scalding*, including #1190, a bug that affected the hashCode for Args instances, and issue #1184, which made Stats unreliable for some users.
*some humor is used in scalding notes.

See CHANGES.md for a full change log.

Thanks to @avibryant, @danielhfrank, @DanielleSucher, @miguno, and the rest of the algebird contributors for the new aggregations, as well as all the scalding contributors.