Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DO NOT MERGE] Trying cascading3 #1446

Closed
wants to merge 60 commits into from

Conversation

cchepelov
Copy link
Contributor

A quick-n-dirty attempt at using Cascading3 in scalding, at the source level.

At this point, this will choke quickly based on a few API that either disappeared or moved within Cascading; specifically, FlowStep#getSources and FlowStep#getGraph.

Really only a quick RFC at this point, will need to investigate how to best to support the "missing" APIs
(suspecting this is really a matter of adapting to the new API made just to help building ExecutionContext.getDesc in a nicer way)

@@ -101,15 +103,21 @@ trait Config extends Serializable {
def setMapSpillThreshold(count: Int): Config =
this + (SpillableProps.MAP_THRESHOLD -> count.toString)

@deprecated("deprecated in Cascading 2.7 and dropped in Cascading 3.0, use setMapSideAggregationCapacity", "cascading 2.7")
def setMapSideAggregationThreshold(count: Int): Config = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't really correct for scalding, the comment mentions the method is realted to the associated operations which wires down to https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/Operations.scala#L126 which uses the same key as cascading. Not sure we shouldn't just be keeping the method/key as it was just defining the constant in scalding instead

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to what @ianoc said. I'd rather us take the max value of the two keys in Operations.scala, and here set the non-deprecated internally, but not deprecate our API (instead explain what it means to the user, not necessarily in cascading terms).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not call the capacity variant?

@cchepelov
Copy link
Contributor Author

Thanks for the prompt feedback, @ianoc and @johnynek ! Will address the various points now, then proceed with the rest

@cchepelov
Copy link
Contributor Author

… Right now, busy updating quite a few dependencies to Cascading3 (cascading-parquet, cascading-avro, elephant-bird-cascading3, etc.
To be continued!

@cchepelov
Copy link
Contributor Author

@ianoc @johnynek Done for now. Trouble is, this requires that code from a few dependencies' pull requests be available at compile time, and I'm not quite sure how best to convince Travis to do this:

For that reason, the code currently fails to run on Travis.

@cchepelov
Copy link
Contributor Author

Now working off binaries dumped onto conjars.org for patched dependencies.

There is an outstanding problem at test stage: some Test Jobs seem to deadlock, apparently within the Cascading test driver. Investigating this week (hopefully). Already tried to run the tests single core, no behaviour change.

@cchepelov
Copy link
Contributor Author

synced with 0.15.1-RC1+ at 55f03f7

@cchepelov
Copy link
Contributor Author

The deadlock in SkewJoinPipeTest should hopefully be fixed by cwensel/cascading#47 (unless that one breaks more things elsewhere ☺).

@cchepelov
Copy link
Contributor Author

Update: some tests fixed with respect to .WithDescription("hi there")
(9896ed9 will not build on Travis for now, awaiting for successful integration of cwensel/cascading#47 )

cc @simondumas


val printDependencyClasspath = taskKey[Unit]("Prints location of the dependencies")

val sharedSettings = Project.defaultSettings ++ assemblySettings ++ scalariformSettings ++ Seq(
organization := "com.twitter",

scalaVersion := "2.10.6",
scalaVersion := "2.10.5",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reasons for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, oversight. reverting.

@cchepelov cchepelov force-pushed the try_cascading3 branch 7 times, most recently from fc006e0 to d33b76d Compare January 21, 2016 20:20
Cyrille Chépélov (TP12) and others added 25 commits January 26, 2016 14:55
dfs-datastores' semantics have shifted since 1.3.4, and now it is of the consumer's responsibility to create
the empty version directory which will contain the version's data.
In this whitebox test, nobody was creating the empty directory anymore, so the test failed.

Added a "hint" and an explanation (not sure this tests much now, but keeps the original dance mostly in)
	+ cancel accidental downgrade of scala 2.10
@johnynek
Copy link
Collaborator

I think #1586 may actually be getting close to merge... I'm closing this in favor of that.

If that is not right. Reopen. Thanks a ton @cchepelov for kicking this all off. I'm sorry it has taken so long.

@johnynek johnynek closed this Jan 25, 2018
@cchepelov
Copy link
Contributor Author

cchepelov commented Jan 25, 2018 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants