Support sortedTake in beam runner #1949
Conversation
We flatten all the input PriorityQueues and construct a new PQ using the monoid provided. TESTS: Updated failing unit test
@@ -59,6 +61,25 @@ object BeamOp extends Serializable {
  )(implicit ordK: Ordering[K], kryoCoder: KryoCoder): PCollection[KV[K, java.lang.Iterable[U]]] = {
    reduceFn match {
      case ComposedMapGroup(f, g) => planMapGroup(planMapGroup(pcoll, f), g)
      case EmptyGuard(MapValueStream(SumAll(pqm: PriorityQueueMonoid[V]))) =>
I don't think this should even compile. I probably wrote some bad code in my example. The `SumAll` type is `TraversableOnce[A] => Iterator[A]`, where in this case `A = V`. So when you pattern match on `PriorityQueueMonoid[T]`, that means `V = PriorityQueue[T]`; so it shouldn't be `PriorityQueueMonoid[V]`. A little-known fact is that you can use a lower-case identifier to bind an unknown type: `pqm: PriorityQueueMonoid[v]`, and then use `v` lower down.
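To illustrate the lower-case type-variable trick, here is a minimal, self-contained sketch; the `PriorityQueueMonoid` below is a simplified stand-in, not the real algebird class:

```scala
import java.util.PriorityQueue

// Simplified stand-in for algebird's PriorityQueueMonoid.
class PriorityQueueMonoid[T](max: Int)(implicit ord: Ordering[T]) {
  def zero: PriorityQueue[T] = new PriorityQueue[T](ord)
}

def describe(m: Any): String =
  m match {
    // `v` (lower case) binds the otherwise-unknown element type so we can
    // refer to it on the right-hand side; an upper-case name would be read
    // as an existing type already in scope.
    case pqm: PriorityQueueMonoid[v] =>
      val q: PriorityQueue[v] = pqm.zero
      s"PQ monoid, empty zero: ${q.isEmpty}"
    case _ => "not a PQ monoid"
  }
```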
// We are not using plus method defined in PriorityQueueMonoid as it is mutating
// input Priority Queues. We create a new PQ from the individual ones.
// We didn't use Top PTransformation in beam as it is not needed, also
// we cannot access `max` defined in PQ monoid.
why not extract the comparator, `val cmp = pqm.zero.comparator()`, and then use `Top.of` as @tlazaro suggested here:
`Top` requires the max number of elements in its constructor, but we cannot access it from the monoid.
ahh... man, that `PriorityQueueMonoid` is really bad. :(
ahh, but there is a way out! You could subclass it:

class ScaldingPriorityQueueMonoid[K](val count: Int)(implicit val ordering: Ordering[K]) extends PriorityQueueMonoid(count)(ordering)

and then you can access the count and the ordering.
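A self-contained sketch of why the subclass helps (again using a simplified stand-in for the algebird class, which hides its size bound):

```scala
import java.util.{Comparator, PriorityQueue}

// Simplified stand-in: the real algebird PriorityQueueMonoid takes a max
// size at construction but does not expose it as a field.
class PriorityQueueMonoid[K](max: Int)(implicit ord: Ordering[K]) {
  def zero: PriorityQueue[K] = new PriorityQueue[K](ord)
}

// The proposed subclass: `val` parameters re-expose the count and the
// ordering so callers (e.g. code building a Beam Top transform) can read them.
class ScaldingPriorityQueueMonoid[K](val count: Int)(implicit val ordering: Ordering[K])
    extends PriorityQueueMonoid[K](count)(ordering)

val m = new ScaldingPriorityQueueMonoid[Int](3)
val cmp: Comparator[Int] = m.ordering // scala Ordering extends java's Comparator
```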
I made this quick PR: twitter/algebird#1008
// input Priority Queues. We create a new PQ from the individual ones.
// We didn't use Top PTransformation in beam as it is not needed, also
// we cannot access `max` defined in PQ monoid.
val flattenedValues = input.getValue.asScala.flatMap { value =>
I think this is going to materialize everything in memory. I think you should do `input.getValue.asScala.toStream.flatMap ...` to make sure this is a lazy Iterable.
`input.getValue.asScala` gives only an Iterator, and `flatMap` also works lazily. Which call would materialize it?
`getValue` returns an `Iterable[_]`, no? So `.asScala` on that is going to give you an `Iterable`, if I'm not mistaken. You could call `.iterator` on it before you call `.asScala`, and that will certainly be lazy, which is better than `toStream`, since we would rather not materialize the whole set.
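A small sketch of that suggestion; the plain `java.util.List` below stands in for the real `input.getValue` (an assumption for illustration):

```scala
import java.util.Arrays
import scala.collection.JavaConverters._

val javaValues: java.lang.Iterable[Int] = Arrays.asList(1, 2, 3)

// Take the Java iterator first, then convert: the resulting scala Iterator
// is lazy, and flatMap on an Iterator stays lazy too, so already-consumed
// elements can be garbage collected.
val flattened: Iterator[Int] =
  javaValues.iterator().asScala.flatMap(x => Iterator(x, x * 10))
```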
actually... `build` requires an `Iterable`, and since `flattenedValues` is a `val` it will pin things in memory. To take this suggestion, I think you need `@inline def flattenedValues = input.getValue.asScala.toStream.flatMap...`
I think I understand all the pieces, but I would really appreciate it if you could clarify how to make it properly lazy.
In the first part, are you referring to `Iterable` keeping references to all elements, as opposed to `Iterator`, which wouldn't?
Then, when you mention using `@inline`, is it to avoid a closure over that variable keeping the reference alive longer, or resulting in more expensive code generation? Or is it more about escape analysis or similar?
In Scala, most `Iterable` methods (like `flatMap`) are not lazy and will materialize the entire result in many cases (basically the only exceptions are `Stream`, or `LazyList` in Scala 2.13).
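A quick, self-contained check of that claim, counting how often the mapped function actually runs:

```scala
// flatMap on a strict collection runs eagerly for every element...
var strictCalls = 0
List(1, 2, 3).flatMap { x => strictCalls += 1; List(x) }
// strictCalls is already 3 here.

// ...while flatMap on an Iterator defers work until elements are pulled.
var lazyCalls = 0
val it = Iterator(1, 2, 3).flatMap { x => lazyCalls += 1; Iterator(x) }
// lazyCalls is still 0; pulling one element forces only the first step.
it.next()
// lazyCalls is now 1.
```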
An Iterator generally drops references to the items it has iterated past, although you could imagine making one that doesn't. Consider a List iterator:
class ListIterator[A](lst: List[A]) extends Iterator[A] {
private[this] var current: List[A] = lst
def hasNext: Boolean = current.nonEmpty
def next(): A = {
val result = current.head
current = current.tail
result
}
}
The way the Scala compiler works, `lst` is only referenced by the constructor, so you won't keep a reference to it beyond construction, which just uses it to initialize `current`.
Finally, the `@inline` is just a hint, but the `def` means there is nothing keeping the reference around, so when the GC runs it can collect anything not pointed to by the `Stream` at that moment.
BUT, all of this is a bit moot, because looking inside `PriorityQueueMonoid` we see that `build` traverses the `Iterable` twice... so if it is a large `Stream` you will materialize the whole thing anyway...
I think the best idea would be to just subclass `PriorityQueueMonoid` so you can access the max size and the ordering, and use those to call `Top.of`.
pcoll.apply(MapElements.via(
  new SimpleFunction[KV[K, java.lang.Iterable[V]], KV[K, java.lang.Iterable[U]]]() {
    override def apply(input: KV[K, lang.Iterable[V]]): KV[K, java.lang.Iterable[U]] = {
      // We are not using plus method defined in PriorityQueueMonoid as it is mutating
I think you also need to match on:

scalding/scalding-beam/src/main/scala/com/twitter/scalding/beam_backend/BeamOp.scala
Line 271 in e198811

def mapSideAggregator(

since I guess this restriction is also enforced on the map-side operations... You'll need a slightly different approach there (basically a custom cache). But maybe they don't do their mutation detection on the mappers?
Tried with a unit test and it failed because of mutation. Instead of overriding the `put` method in the `SummingCache`, I was hoping to pattern match on the Semigroup and override the `plus` method.
// needs: import scala.collection.JavaConverters._
case class ImmutablePQMonoid[T](
  size: Int
)(implicit ord: Ordering[T]) extends PriorityQueueMonoid[T](size) {
  override def plus(left: PriorityQueue[T], right: PriorityQueue[T]): PriorityQueue[T] =
    super.build(left.iterator().asScala.toIterable ++ right.iterator().asScala.toIterable)
}
But the same issue is there: I cannot access the size field of the original monoid.
Or maybe this?
case class ImmutablePQMonoid[T](
pqm: PriorityQueueMonoid[T]
)(implicit ord: Ordering[T]) extends Monoid[PriorityQueue[T]] {
override def zero: PriorityQueue[T] = pqm.zero
override def plus(
l: PriorityQueue[T],
r: PriorityQueue[T]
): PriorityQueue[T] = ???
}
why not just pattern match on this monoid and not use it at all? As a short-term fix, you could make `sumByLocalKeys` a no-op for this particular monoid; for a better fix, you could implement a summing cache for this particular monoid.
I think making a full copy of the priority queue each time will be a perf killer, and I doubt it will be better than just not using a map-side cache at all...
The more we talk about this, the more it seems like maybe copying the immutable heap in from cats-collections is the right move (or send a PR to algebird and we can publish a new version and use it from there, either way...).
Seems like we keep getting paper cuts here.
  .groupAll
  .sortedReverseTake(3),
Seq(5, 4, 3)
test("sortedTake"){
note that `bufferedTake` also uses the problematic monoid internally. Worth testing:

implicit val mon: PriorityQueueMonoid[V1] = new PriorityQueueMonoid[V1](n)(fakeOrdering)

Good lesson here: mutation, NOT EVEN ONCE!
Added `ScaldingPriorityQueueMonoid`, which exposes `count`, which we later use in `TopCombineFn`. Added a unit test for `bufferedTake`. Disabled map-side aggregation when using `ScaldingPriorityQueueMonoid`.
TypedPipe
  .from(1 to 50)
  .groupAll
  .bufferedTake(100)
just a note: this is going to be really bad in a real job without map-side aggregation. The key is `Unit`, so there is only one key; with map-side aggregation, each mapper would send 100 and the reducers would pick 100 of those, but with no map-side aggregation, all the data will be sent to the reducers, and they will throw away all but 100. But we can add an issue and come back and address this.
Opened a ticket for this: #1952
Compare: 6279624 to 49af384
I'm fine with landing this, how do we usually do it? 'Squash and merge'?
yes, squash and merge if you think it is ready. Looks good to me too.