
[WIP] PullByteBufferOut as a default ordering #1728

Open
wants to merge 6 commits into base: develop

Conversation

ianoc-stripe (Contributor)

First extraction.

Right now I don't think we need the macros, and we can probably drop them again.

Naming/code organization and how this should all get imported are open to opinions.

We could have the defaults show up in the ordered serialization package object itself?

Thoughts? @johnynek @travisbrown

@@ -0,0 +1,18 @@
/*
Copyright 2015 Twitter, Inc.
Collaborator

Can we skip the bogus copyright?

  1. The year is wrong.
  2. Stripe is maybe paying you to write this code.

Collaborator

Yeah, I was going to ask you whether you thought we need to put the license and/or copyright in these at all.

*/
package com.twitter.scalding.serialization

case class Exported[T](instance: T) extends AnyVal
Collaborator

I don't understand this pattern. Can you document it or link to an explanation? I know @travisbrown likes it.

Collaborator

The general gist is that by supplying instances of Exported, you can import an object/package that provides them and inject them at a lower priority than a normal import.

So here, if you import the default object, it just supplies low-priority implicits, so it won't override a user-supplied one.

https://github.com/milessabin/export-hook

(I can add a link in the code too.)
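
Roughly, the low-priority trick looks like this; a minimal self-contained sketch using a stand-in Show type class (none of these names are the actual scalding code):

  // Stand-in type class so the sketch compiles without any scalding dependency.
  trait Show[T] { def show(t: T): String }

  // Same shape as the Exported wrapper in this PR.
  case class Exported[T](instance: T) extends AnyVal

  // Exported instances are only reachable through this generic def, which both
  // requires an Exported value in scope and loses to a more specific,
  // directly supplied instance.
  trait LowPriorityShow {
    implicit def fromExported[T](implicit e: Exported[Show[T]]): Show[T] = e.instance
  }

  object DefaultShow extends LowPriorityShow {
    implicit val intExport: Exported[Show[Int]] =
      Exported(new Show[Int] { def show(t: Int) = s"default: $t" })
  }

  object Demo {
    import DefaultShow._
    // A user-supplied instance in scope still wins over the exported default.
    implicit val userInt: Show[Int] = new Show[Int] { def show(t: Int) = s"user: $t" }
    def main(args: Array[String]): Unit =
      println(implicitly[Show[Int]].show(42)) // prints "user: 42"
  }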

import com.twitter.scalding.serialization.{OrderedSerialization, DefaultOrderedSerialization}
import scala.reflect.macros.whitebox

class ExportMacros(val c: whitebox.Context) {
Collaborator

are we actually calling this anywhere?

Collaborator

Nope, I think it's killable; not sure if it adds something. I was discussing that with @travisbrown towards EOD here.

Contributor

Right now the only exported instance has a concrete type, but once you have generically derived ones the macro here will be necessary (although it turns out it doesn't need to be whitebox, I believe). I've got some notes on this from when I was prepping this talk—I'll try to find them.

if (staticSize.isEmpty)
in.readPosVarInt

_root_.scala.util.Success(unsafeRead(in))
Contributor

Is the _root_ just an artifact of a previous macro context?

Contributor Author

Yep, I've cleaned these up, I think, so it should be more sane now.

@@ -510,7 +512,7 @@ class MacroOrderingProperties
}

test("Test out ByteBuffer") {
BinaryOrdering.ordSer[ByteBuffer]
implicitly[OrderedSerialization[ByteBuffer]]
Contributor

It'd be nice to have a test confirming that an explicit instance doesn't get overridden by the DefaultOrderedSerialization import.

Contributor Author

Done; I added a test where a companion-object-defined ordered serialization is not overridden.
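
Roughly the shape of that test (TestKey and the exact import here are illustrative, not the actual PR code):

  // Illustrative only: a key type whose companion supplies its own instance.
  case class TestKey(x: Int, y: Long)

  object TestKey {
    // Companion-defined instance, built via the existing macro for brevity.
    implicit val ordSer: OrderedSerialization[TestKey] = BinaryOrdering.ordSer[TestKey]
  }

  test("companion instance is not overridden by the default import") {
    import com.twitter.scalding.serialization.DefaultOrderedSerialization._
    // No Exported instance exists for TestKey, so the imported defaults cannot
    // apply and resolution falls back to the companion-defined instance.
    assert(implicitly[OrderedSerialization[TestKey]] eq TestKey.ordSer)
  }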

ianoc (Collaborator) commented Sep 27, 2017

OK, I think this is looking a bit more cleaned up and may be worth looking at again design-wise. I'd be slow to merge this as-is if we think develop should always be ready to be released, since I think for a release we would want to get everything from the macro out and split up under provided. It would be binary breaking, since today a single method can be used to do everything. (Though we could possibly inject a different/new whitebox macro to smooth over most of this... but I'm not sure it's worth that vs. just moving on to a nicer split-up world.)

There is also a pre-release question here of whether the old-school macros should be under the same package/import as, say, our ByteBuffer implementations. That would make it harder to shop around, i.e. to use the ByteBuffer implementation but then also use Shapeless.

private[this] def noLengthWrite(element: T, outerOutputStream: OutputStream): Unit = {
// Start with pretty big buffers because reallocation will be expensive
val baos = new ByteArrayOutputStream(512)
unsafeWrite(baos, element)
Collaborator

this implies that unsafeWrite means no size. Can we add that to the comments below?

Contributor Author

It implies no outer size; it can be a bit wasteful. I'm adding a comment to the definition of this method that I hope will help here a little:

  // This will write out the interior data as a blob with no prepended length
  // This means binary compare cannot skip on this data.
  // However the contract remains that one should be able to _read_ the data
  // back out again.
  def unsafeWrite(out: java.io.OutputStream, t: T): Unit
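
To illustrate the relationship, a sized write can then be layered on top of unsafeWrite by buffering the unsized bytes and prepending their length (a sketch only; writePosVarInt is assumed to be the existing varint enrichment on OutputStream):

  // Sketch: a length-prefixed write built from the unsized unsafeWrite.
  def sizedWrite(out: OutputStream, t: T): Unit = {
    val baos = new ByteArrayOutputStream(512)
    unsafeWrite(baos, t)          // interior blob, no prepended length
    out.writePosVarInt(baos.size) // prepend the length so binary compare can skip
    baos.writeTo(out)
  }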


trait HasUnsafeCompareBinary[T] extends OrderedSerialization[T] {
def unsafeCompareBinary(inputStreamA: InputStream, inputStreamB: InputStream): Int
def unsafeWrite(out: java.io.OutputStream, t: T): Unit
Collaborator

can we drop java.io? Also, I think this has to be the unsized output if there is ever a size header added, right? It seems confusing above since it is used that way, but it is not clear.

It may or may not be sized? Can you add some laws about how to reason about these things?

def unsafeCompareBinary(inputStreamA: InputStream, inputStreamB: InputStream): Int
def unsafeWrite(out: java.io.OutputStream, t: T): Unit
def unsafeRead(in: java.io.InputStream): T
def unsafeSize(t: T): Option[Int]
Collaborator

what is the contract here? Again similar concerns as above.

// Members declared in com.twitter.scalding.serialization.Serialization
def read(in: java.io.InputStream): scala.util.Try[T] = o.read(in)
def staticSize: Option[Int] = o.staticSize
def unsafeWrite(out: java.io.OutputStream, t: T): Unit = o.write(out, t).get
Collaborator

here, unsafeWrite could have a size if the original did, but then I guess you could add two sizes, couldn't you (since above we might call noLengthWrite)?

}

def unsafeRead(inputStream: java.io.InputStream): ByteBuffer = {
val lenA = inputStream.readPosVarInt
Collaborator

do we write an additional length header on this thing currently?


johnynek (Collaborator) left a comment

What about this: can you update the PR to include 3 hand-written combinators:

Tuple2OrderedSerialization, EitherOrderedSerialization, and ListOrderedSerialization. I think if we can do those three (static-sized product, static-sized sum/union, dynamic-sized product), we will see what methods we need to have to enable them.

You can tell something is wrong with OrderedSerialization because in our current tuple2 we have no way to avoid deserializing the second part.

I think if we exercise your code in the same PR that does those three, we will be able to see more clearly whether we have improved the API or not, or whether we are still missing something.
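
To make the tuple2 point concrete, this is roughly the shape of the problem with the current API (illustrative only, not the actual generated code; ordA and ordB stand for the component serializations of a pair (A, B)):

  def compareBinary(a: InputStream, b: InputStream): OrderedSerialization.Result = {
    val first = ordA.compareBinary(a, b)
    // compareBinary must leave both streams at the end of the record, so the
    // second components are read (fully deserialized) even when `first` has
    // already decided the ordering, which is the waste being discussed here.
    val secondA = ordB.read(a).get
    val secondB = ordB.read(b).get
    first match {
      case OrderedSerialization.Equal =>
        val c = ordB.compare(secondA, secondB)
        if (c < 0) OrderedSerialization.Less
        else if (c > 0) OrderedSerialization.Greater
        else OrderedSerialization.Equal
      case decided => decided
    }
  }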

@@ -53,6 +54,11 @@ trait Serialization[T] extends Equiv[T] with Hashing[T] with Serializable {
* otherwise the caller should just serialize into an ByteArrayOutputStream
*/
def dynamicSize(t: T): Option[Int]

// Override this to provide a more efficient implementation.
def skip(in: InputStream): Try[Unit] = {
Collaborator

I think we may want def skip(count: Int, in: InputStream): Try[Unit] so that in, say, List[T] we can skip the rest of the collection. If count <= 0, do nothing; otherwise, in the worst case, just read the records and throw them away as you do below.
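
Something like this as the default, with read-and-discard as the worst-case fallback (a sketch; read is the existing Serialization read):

  // Skip `count` records; when no cheaper skip is available, just read and
  // discard them one at a time.
  def skip(count: Int, in: InputStream): Try[Unit] = Try {
    var remaining = count
    while (remaining > 0) {
      read(in).get // read one record and throw it away
      remaining -= 1
    }
  }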

* This compares two InputStreams. After this call, the position in
* the InputStreams may or may not be at the end of the record.
*/
def compareBinaryNoConsume(a: InputStream, b: InputStream): OrderedSerialization.Result = {
Collaborator

I get really worried about how to compose methods like this that lack a strong contract. Also, I don't see that we ever call this.

Failure(e)
}

override def compareBinaryNoConsume(inputStreamA: InputStream, inputStreamB: InputStream): OrderedSerialization.Result =
Collaborator

should this be final?

OrderedSerialization.CompareFailure(e)
}

override def compareBinary(inputStreamA: InputStream, inputStreamB: InputStream): OrderedSerialization.Result =
Collaborator

can this be final?

CLAassistant commented Nov 16, 2019

CLA assistant check
All committers have signed the CLA.


5 participants