
Optimize traverse #4498

Open · wants to merge 31 commits into base: main
Conversation

@TimWSpence (Member) commented Aug 23, 2023

This addresses the StackSafeMonad part of #4408. I thought that there were enough changes in this to merit a separate PR for the Defer part.

Open questions (will edit these out of the description later):

  • The TraverseFilter laws hard-code Option. How do we test that the new branching implementation is lawful?

The following benchmarks were run on an AWS t3.xlarge:

Baseline (b08196e)

Benchmark                           (length)   Mode  Cnt     Score     Error  Units
TraverseBench.filterList               10000  thrpt   20  3247.486 ± 270.817  ops/s
TraverseBench.filterVector             10000  thrpt   20  3351.585 ± 255.883  ops/s
TraverseBench.mapList                  10000  thrpt   20  3198.905 ± 235.865  ops/s
TraverseBench.mapVector                10000  thrpt   20  3906.339 ± 225.508  ops/s
TraverseBench.traverseChain            10000  thrpt   20   556.706 ±  31.814  ops/s
TraverseBench.traverseChainError       10000  thrpt   20  1278.954 ±  74.242  ops/s
TraverseBench.traverseFilterChain      10000  thrpt   20   536.841 ±  38.696  ops/s
TraverseBench.traverseFilterList       10000  thrpt   20   532.103 ±  30.473  ops/s
TraverseBench.traverseFilterVector     10000  thrpt   20   590.312 ±  29.534  ops/s
TraverseBench.traverseList             10000  thrpt   20   537.023 ±  33.287  ops/s
TraverseBench.traverseListError        10000  thrpt   20  1674.930 ±  48.771  ops/s
TraverseBench.traverseVector           10000  thrpt   20   581.844 ±  32.074  ops/s
TraverseBench.traverseVectorError      10000  thrpt   20  1778.334 ± 135.859  ops/s
TraverseBench.traverse_Chain           10000  thrpt   20   490.822 ±  14.145  ops/s
TraverseBench.traverse_List            10000  thrpt   20   474.568 ±  28.995  ops/s
TraverseBench.traverse_Vector          10000  thrpt   20   622.924 ±  38.002  ops/s

Chain (2d5f4d7)

Benchmark                           (length)   Mode  Cnt     Score     Error  Units
TraverseBench.filterList               10000  thrpt   20  3170.974 ± 146.628  ops/s
TraverseBench.filterVector             10000  thrpt   20  3217.881 ± 181.087  ops/s
TraverseBench.mapList                  10000  thrpt   20  3443.969 ± 104.117  ops/s
TraverseBench.mapVector                10000  thrpt   20  3827.401 ± 148.074  ops/s
TraverseBench.traverseChain            10000  thrpt   20   784.845 ±  25.990  ops/s
TraverseBench.traverseChainError       10000  thrpt   20  1302.782 ±  62.342  ops/s
TraverseBench.traverseFilterChain      10000  thrpt   20   748.986 ±  62.151  ops/s
TraverseBench.traverseFilterList       10000  thrpt   20   669.817 ±  41.798  ops/s
TraverseBench.traverseFilterVector     10000  thrpt   20   618.644 ±  28.558  ops/s
TraverseBench.traverseList             10000  thrpt   20   583.413 ±  27.079  ops/s
TraverseBench.traverseListError        10000  thrpt   20  1253.510 ±  76.328  ops/s
TraverseBench.traverseVector           10000  thrpt   20   581.975 ±  27.619  ops/s
TraverseBench.traverseVectorError      10000  thrpt   20  1302.362 ±  66.015  ops/s
TraverseBench.traverse_Chain           10000  thrpt   20  3598.033 ± 108.420  ops/s
TraverseBench.traverse_List            10000  thrpt   20  4048.287 ± 158.090  ops/s
TraverseBench.traverse_Vector          10000  thrpt   20  4094.024 ± 101.115  ops/s

Vector (702ab8b)

Benchmark                           (length)   Mode  Cnt     Score     Error  Units
TraverseBench.filterList               10000  thrpt   20  3305.299 ± 233.483  ops/s
TraverseBench.filterVector             10000  thrpt   20  3253.531 ± 133.993  ops/s
TraverseBench.mapList                  10000  thrpt   20  3308.961 ± 138.697  ops/s
TraverseBench.mapVector                10000  thrpt   20  3911.187 ± 142.582  ops/s
TraverseBench.traverseChain            10000  thrpt   20   594.535 ±  29.777  ops/s
TraverseBench.traverseChainError       10000  thrpt   20  1084.548 ±  75.686  ops/s
TraverseBench.traverseFilterChain      10000  thrpt   20   642.895 ±  55.621  ops/s
TraverseBench.traverseFilterList       10000  thrpt   20   682.965 ±  41.234  ops/s
TraverseBench.traverseFilterVector     10000  thrpt   20   696.235 ±  59.945  ops/s
TraverseBench.traverseList             10000  thrpt   20   582.275 ±  35.877  ops/s
TraverseBench.traverseListError        10000  thrpt   20  1114.959 ±  52.641  ops/s
TraverseBench.traverseVector           10000  thrpt   20   638.235 ±  42.533  ops/s
TraverseBench.traverseVectorError      10000  thrpt   20  1098.038 ±  62.946  ops/s
TraverseBench.traverse_Chain           10000  thrpt   20  3480.086 ± 178.654  ops/s
TraverseBench.traverse_List            10000  thrpt   20  3950.650 ± 171.441  ops/s
TraverseBench.traverse_Vector          10000  thrpt   20  3695.495 ± 182.401  ops/s

@armanbilge (Member) commented Aug 23, 2023

  • The TraverseFilter laws hard-code Option. How do we test that the new branching implementation is lawful?

Can we add some new laws in terms of OptionT[Eval, _]? Edit: I don't think this will work unless we make sure that OptionT offers a StackSafeMonad if F is a StackSafeMonad.
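
(For illustration, such an instance might look like the sketch below. This is hypothetical and not something cats currently provides, as johnynek notes later in this thread; it also assumes the kind-projector plugin for the * type-lambda syntax.)

import cats.StackSafeMonad
import cats.data.OptionT

// Hypothetical instance: OptionT inherits stack safety from an underlying
// StackSafeMonad because its flatMap simply delegates to F's flatMap.
implicit def optionTStackSafeMonad[F[_]](implicit F: StackSafeMonad[F]): StackSafeMonad[OptionT[F, *]] =
  new StackSafeMonad[OptionT[F, *]] {
    def pure[A](a: A): OptionT[F, A] = OptionT.pure[F](a)
    def flatMap[A, B](fa: OptionT[F, A])(f: A => OptionT[F, B]): OptionT[F, B] =
      fa.flatMap(f) // one F.flatMap per step, so no extra stack frames accumulate
  }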

Should I open a separate PR for the new benchmarks first?

Nah.

@johnynek (Contributor):

Can you post the results of the benchmarks? Apologies if I missed them.

val iter = fa.iterator
if (iter.hasNext) {
val first = iter.next()
G.map(iter.foldLeft(f(first)) { case (g, a) =>
Contributor:

I think this can be G.void(...) so that it has a chance of using an optimized implementation.

Member Author:

Yes, good catch!

G.map(fa.iterator.foldLeft(G.pure(builder)) { case (bldrG, a) =>
G.flatMap(bldrG) { bldr =>
G.map(f(a)) {
case Some(b) => bldr += b
Contributor:

I can't see why we know this is safe or lawful.

We have a mutable data structure that could potentially be in a multithreaded situation with G.

Also, consider cases like IO where you have a long computation that finally fails, then you recover some part of it to succeed. It feels like this mutable builder could remember things from failed branches.

I think using an immutable builder (for instance just building up a List, Chain, or Vector) would make it much easier to verify that this is lawful.

Member:

We have a mutable data structure that could potentially be in a multithreaded situation with G.

It can't be: every append to the builder happens in flatMap so by law it must be sequential, not concurrent/parallel.

Also, consider cases like IO where you have a long computation that finally fails, then you recover some part of it to succeed. It feels like this mutable builder could remember things from failed branches.

Hmm. I'm not entirely sure how this applies to traverse. It doesn't have a notion of "recovery". For sure, each individual step that the traverse runs may have a notion of recovery, but that's just it: it will either succeed or it will fail. But there's no way to "recover" an intermediate structure from the traverse itself.

I think using an immutable builder (like for instance just building up a List, Chain or Vector) would be much easier to verify it is lawful.

Won't disagree here. We could do a benchmark to see how much performance we are leaving on the table with that strategy.

@TimWSpence (Member Author):

Can you post the results of the benchmarks? Apologies if I missed them.

Yes, will do. I had originally asked whether I should add the new ones in a separate PR first, which is why I didn't run them straight away.

@TimWSpence (Member Author):

@armanbilge @johnynek apologies it's taken me so long to get round to this. I've updated with benchmarks for the mutable builder approach. Are you still worried about lawfulness? Shall I try to do it with immutable data structures?

Those traverse_ improvements though 😆

Commit: Is this necessary? The monad laws should ensure that it's safe to use mutable builders. Nonetheless it will be good to confirm the performance delta for using immutable data structures.
@johnynek (Contributor) commented Nov 8, 2023

Thanks for working on this. What I don't see is how a Monad such as fs2.Stream[IO, A] would interact here. In that case you can have laziness, concurrency, and more than one output per input.

Can you give an explanation as to why we shouldn't worry about mutability in such a case?

@armanbilge (Member):

Can you give an explanation as to why we shouldn't worry about mutability in such a case?

We can't have concurrency: we are sequencing the effects with flatMap, which by law must be sequential.

private[cats] def traverseDirectly[G[_], A, B](
fa: IterableOnce[A]
)(f: A => G[B])(implicit G: StackSafeMonad[G]): G[Vector[B]] = {
fa.iterator.foldLeft(G.pure(Vector.empty[B])) { case (accG, a) =>
Member:

I think we can do even better here by using a Vector.newBuilder; we can do even even better by using a dedicated builder for the desired collection (e.g. ArraySeq or List); and we can do even even even better by providing a size hint to the builder.
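
(A minimal sketch of the size-hint idea, as a hypothetical helper rather than the PR's code; on Scala 2.13, IterableOnce.knownSize returns -1 when the size is unknown.)

import scala.collection.mutable.Builder

// Hypothetical helper: pre-size a Vector builder when the input size is known,
// avoiding intermediate array growth and copying.
def presizedVectorBuilder[B](fa: IterableOnce[_]): Builder[B, Vector[B]] = {
  val bldr = Vector.newBuilder[B]
  val hint = fa.knownSize // -1 when unknown
  if (hint >= 0) bldr.sizeHint(hint)
  bldr
}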

Reply:

I might have misunderstood something, but it looks like that's how things were done before the commit 8cc742e was introduced, which added the immutable data structures. 🙂

Member:

You're right :) let's just say this PR has been open a while and I lost track of the old stuff 😇

@johnynek (Contributor) commented Nov 9, 2023

We can't have concurrency: we are sequencing the effects with flatMap, which by law must be sequential.

This is definitely a colloquial understanding of Monads, that they represent a sequential computation, but I don't see the law that allows us to make the claim you want.

For instance, in the current PR we are using immutable items inside the monadic type constructor. I think the current version will work. But if we changed those to be mutable, I don't see a proof that it will work, even though it may.

For instance, consider this type:

import cats.Functor
import cats.syntax.functor._

sealed trait Tree[F[_], A]
case class Item[F[_], A](fa: A) extends Tree[F, A]
case class InsideF[F[_], A](treeF: F[Tree[F, A]]) extends Tree[F, A]
case class Branch[F[_], A](left: Tree[F, A], right: Tree[F, A]) extends Tree[F, A]

object Tree {
  def pure[F[_], A](a: A): Tree[F, A] = Item(a)
  def flatMap[F[_]: Functor, A, B](fa: Tree[F, A])(fn: A => Tree[F, B]): Tree[F, B] =
    fa match {
      case Item(a)        => fn(a)
      case InsideF(treeF) => InsideF(treeF.map(flatMap(_)(fn)))
      case Branch(left, right) =>
        Branch(flatMap(left)(fn), flatMap(right)(fn))
    }
}

Now, I haven't checked whether this follows the monad laws as we have expressed them, but I think it, or something like it, could. Note that we are recursively calling flatMap both inside the F context and outside of it. So, if F is lazy, we return before we have fully recursed.

Couldn't something like this cause a problem?

I think saying "monads are sequential" is a bit too hand-wavey to bolt into the default implementation for every user of cats. I think we need a stronger argument than that.
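
(To make the worry concrete, here is a hypothetical demonstration using the Tree definition above with F = Id. With a branching monad, flatMap runs its continuation once per branch, so a mutable builder captured by that continuation receives appends from both branches.)

import cats._

// Hypothetical demonstration, not cats code: a builder shared by the flatMap
// continuation sees appends from both branches of the Tree.
val builder = List.newBuilder[Int]
val branched: Tree[Id, Int] = Branch[Id, Int](Item(1), Item(2))
Tree.flatMap(branched) { a =>
  builder += a // executed once for each branch
  Item[Id, Int](a)
}
builder.result() // List(1, 2): each branch's appends are visible to the other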

Commit: To see if this is more performant than an immutable vector
if (iter.hasNext) {
val first = iter.next()
G.void(iter.foldLeft(f(first)) { case (g, a) =>
G.flatMap(g) { _ =>
Contributor:

why not G.productR(g, f(a)) here? Is it to avoid calling f when g may have already failed? I think a comment would be helpful.

Member Author:

There was no good reason 😅 Thanks, I'll change it.

Member:

Is it to avoid calling f when g may have already failed?

Maybe this is something we should consider though?

Edit: ahh, I see your comment on f95b087. Fair enough.

fa: IterableOnce[A]
)(f: A => G[Option[B]])(implicit G: StackSafeMonad[G]): G[Vector[B]] = {
fa.iterator.foldLeft(G.pure(Vector.empty[B])) { case (bldrG, a) =>
G.flatMap(bldrG) { acc =>
Contributor:

add a comment on why we didn't write:

G.map2(bldrG, f(a)) {
  case (acc, Some(b)) => acc :+ b
  case (acc, _) => acc
}

Member Author:

Thanks, good point 👍

fa: IterableOnce[A]
)(f: A => G[B])(implicit G: StackSafeMonad[G]): G[List[B]] = {
G.map(fa.iterator.foldLeft(G.pure(List.empty[B])) { case (accG, a) =>
G.flatMap(accG) { acc =>
Contributor:

add a comment why not G.map2(accG, f(a))(_ :: _)

Since we know G is lazy (a StackSafeMonad has to be, I think), I'm not sure of the downside here. One answer would be that, for a short-circuiting monad, we don't have to call f if we have already failed, but we are still iterating the whole list, so we are doing O(N) work regardless. The extra allocation from the map2 call doesn't seem like a big problem, since we already have to allocate the function we pass to flatMap in the current code.

By calling map2 we are at least communicating to G what we are doing, and in principle some monads could optimize this (e.g. a Parser can implement a more efficient map2 than flatMap, and it can still be stack-safe since it runs lazily, only when input is passed to the resulting parser).
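
(Written out, the map2-based shape suggested above might look like the sketch below. This is an illustration, not the PR's final code; the operands are ordered so that earlier elements' effects still run first, prepending and then reversing once at the end so that :: typechecks.)

import cats.StackSafeMonad

// Sketch of the suggested map2-based accumulation.
def traverseDirectlyViaMap2[G[_], A, B](
  fa: IterableOnce[A]
)(f: A => G[B])(implicit G: StackSafeMonad[G]): G[List[B]] =
  G.map(fa.iterator.foldLeft(G.pure(List.empty[B])) { (accG, a) =>
    // map2 tells G what we are doing, giving it a chance to optimize
    G.map2(accG, f(a))((acc, b) => b :: acc)
  })(_.reverse)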

@johnynek (Contributor) commented Mar 4, 2024

I would go where the benchmarks suggest if we have confidence in them.

@TimWSpence (Member Author):

@valencik you foolishly volunteered yourself on Discord to run the benchmarks as well 😆 If you have the chance, I would massively appreciate another set of data to confirm that what I'm seeing is correct

@valencik (Member) left a comment:

There's a lingering mutable builder I think we want to remove, and a few more G.pure(())s.

@TimWSpence I can benchmark later today :)

G match {
case x: StackSafeMonad[G] => Traverse.traverse_Directly(fa)(f)(x)
case _ =>
foldRight(fa, Always(G.pure(()))) { (a, acc) =>
Member:

Suggested change:
- foldRight(fa, Always(G.pure(()))) { (a, acc) =>
+ foldRight(fa, Eval.now(G.unit)) { (a, acc) =>

})(_.iterator.toMap)
G match {
case x: StackSafeMonad[G] =>
x.map(fa.foldLeft(G.pure(Map.newBuilder[K, B])) { case (accG, (k, a)) =>
Member:

This is still using the mutable builder, I think we want something like:

case x: StackSafeMonad[G] =>
  fa.iterator.foldLeft(G.pure(Map.empty[K, B])) { case (accG, (k, a)) =>
    x.map2(accG, f(a)) { case (acc, b) => acc + (k -> b) }
  }

G match {
case x: StackSafeMonad[G] => Traverse.traverse_Directly(fa)(f)(x)
case _ =>
foldRight(fa, Always(G.pure(()))) { (a, acc) =>
Member:

Suggested change:
- foldRight(fa, Always(G.pure(()))) { (a, acc) =>
+ foldRight(fa, Eval.now(G.unit)) { (a, acc) =>

Member Author:

😆 the Map instance. Sorry, the conversation has gotten so long that it's hard to follow. I think we agreed to benchmark only Chain, Vector, and List, so I said I'd update the Map instance once we've decided which implementation to use for all the other collections.

@valencik (Member) commented Mar 6, 2024

Mostly 803107d* (sorry, I actually did apply the suggested change in #4498 (comment))

[info] Benchmark                           (length)   Mode  Cnt     Score     Error  Units
[info] TraverseBench.filterList               10000  thrpt   20  5683.590 ±  73.700  ops/s
[info] TraverseBench.filterVector             10000  thrpt   20  5665.475 ±  34.436  ops/s
[info] TraverseBench.mapList                  10000  thrpt   20  5833.001 ±  45.470  ops/s
[info] TraverseBench.mapVector                10000  thrpt   20  6206.558 ±  45.317  ops/s
[info] TraverseBench.traverseChain            10000  thrpt   20  1157.471 ±   8.626  ops/s
[info] TraverseBench.traverseChainError       10000  thrpt   20  2080.678 ±  15.341  ops/s
[info] TraverseBench.traverseFilterChain      10000  thrpt   20  1201.729 ±   6.121  ops/s
[info] TraverseBench.traverseFilterList       10000  thrpt   20  1143.251 ±   8.666  ops/s
[info] TraverseBench.traverseFilterVector     10000  thrpt   20  1232.990 ±   7.090  ops/s
[info] TraverseBench.traverseList             10000  thrpt   20  1085.565 ±   8.561  ops/s
[info] TraverseBench.traverseListError        10000  thrpt   20  2192.903 ±   9.450  ops/s
[info] TraverseBench.traverseVector           10000  thrpt   20  1162.033 ±  22.101  ops/s
[info] TraverseBench.traverseVectorError      10000  thrpt   20  2178.812 ±  22.956  ops/s
[info] TraverseBench.traverse_Chain           10000  thrpt   20  7644.606 ± 151.222  ops/s
[info] TraverseBench.traverse_List            10000  thrpt   20  8946.374 ± 124.209  ops/s
[info] TraverseBench.traverse_Vector          10000  thrpt   20  9036.810 ±  37.314  ops/s

and against main (d323fac) but with TraverseBench.scala from 803107d:

[info] Benchmark                           (length)   Mode  Cnt     Score     Error  Units
[info] TraverseBench.filterList               10000  thrpt   20  5781.907 ±  31.753  ops/s
[info] TraverseBench.filterVector             10000  thrpt   20  5667.024 ±  53.352  ops/s
[info] TraverseBench.mapList                  10000  thrpt   20  5864.307 ±  45.807  ops/s
[info] TraverseBench.mapVector                10000  thrpt   20  6245.999 ±  46.899  ops/s
[info] TraverseBench.traverseChain            10000  thrpt   20  1072.360 ±  25.156  ops/s
[info] TraverseBench.traverseChainError       10000  thrpt   20  2716.055 ± 100.545  ops/s
[info] TraverseBench.traverseFilterChain      10000  thrpt   20  1029.492 ±  51.776  ops/s
[info] TraverseBench.traverseFilterList       10000  thrpt   20   928.622 ±   3.589  ops/s
[info] TraverseBench.traverseFilterVector     10000  thrpt   20  1002.081 ±  44.357  ops/s
[info] TraverseBench.traverseList             10000  thrpt   20   948.305 ±  26.980  ops/s
[info] TraverseBench.traverseListError        10000  thrpt   20  2722.420 ±  77.210  ops/s
[info] TraverseBench.traverseVector           10000  thrpt   20  1011.813 ±  19.254  ops/s
[info] TraverseBench.traverseVectorError      10000  thrpt   20  3366.467 ±  35.241  ops/s
[info] TraverseBench.traverse_Chain           10000  thrpt   20   876.282 ± 100.385  ops/s
[info] TraverseBench.traverse_List            10000  thrpt   20   779.133 ±  14.965  ops/s
[info] TraverseBench.traverse_Vector          10000  thrpt   20  1093.288 ±  45.168  ops/s

By the looks of things, filterList, filterVector, mapList, and mapVector are all within error when comparing this PR against latest main.
We have small improvements for traverseChain, traverseFilterChain, traverseFilterList, traverseFilterVector, traverseList, and traverseVector.
Medium regressions for traverseChainError, traverseListError, traverseVectorError.
And then very large improvements for traverse_Chain, traverse_List, and traverse_Vector.

Summarized differently:

🏎️ traverse_
🏃 traverse
🏃 traverseFilter
🆗 filter
🆗 map
🐢 traverseError

@valencik (Member) commented Mar 6, 2024

Wanted to confirm things, so I ran again and came up with the same results.

I ran these benchmarks (both previously and here) with the following in sbt:
bench/jmh:run -i 10 -wi 10 -f 2 -t 1 cats.bench.TraverseBench

And with the following jvmopts: -Xms2G -Xmx12G

Details

on exactly 803107d

[info] Benchmark                           (length)   Mode  Cnt     Score    Error  Units
[info] TraverseBench.filterList               10000  thrpt   20  5715.626 ±  29.321  ops/s
[info] TraverseBench.filterVector             10000  thrpt   20  5546.558 ±  19.276  ops/s
[info] TraverseBench.mapList                  10000  thrpt   20  5767.038 ±  34.619  ops/s
[info] TraverseBench.mapVector                10000  thrpt   20  6173.065 ±  40.277  ops/s
[info] TraverseBench.traverseChain            10000  thrpt   20  1123.427 ±   4.987  ops/s
[info] TraverseBench.traverseChainError       10000  thrpt   20  2127.662 ±  13.144  ops/s
[info] TraverseBench.traverseFilterChain      10000  thrpt   20  1180.463 ±   5.585  ops/s
[info] TraverseBench.traverseFilterList       10000  thrpt   20  1206.215 ±   9.615  ops/s
[info] TraverseBench.traverseFilterVector     10000  thrpt   20  1244.894 ±  25.955  ops/s
[info] TraverseBench.traverseList             10000  thrpt   20  1069.802 ±  13.869  ops/s
[info] TraverseBench.traverseListError        10000  thrpt   20  2167.488 ±  21.981  ops/s
[info] TraverseBench.traverseVector           10000  thrpt   20  1152.973 ±  25.078  ops/s
[info] TraverseBench.traverseVectorError      10000  thrpt   20  2207.517 ±  33.643  ops/s
[info] TraverseBench.traverse_Chain           10000  thrpt   20  7912.984 ± 107.098  ops/s
[info] TraverseBench.traverse_List            10000  thrpt   20  8934.534 ±  46.916  ops/s
[info] TraverseBench.traverse_Vector          10000  thrpt   20  9092.209 ±  36.689  ops/s

and on main:

[info] Benchmark                           (length)   Mode  Cnt     Score    Error  Units
[info] TraverseBench.filterList               10000  thrpt   20  5710.935 ± 65.052  ops/s
[info] TraverseBench.filterVector             10000  thrpt   20  5586.684 ± 53.060  ops/s
[info] TraverseBench.mapList                  10000  thrpt   20  5794.559 ± 31.269  ops/s
[info] TraverseBench.mapVector                10000  thrpt   20  6156.259 ± 29.844  ops/s
[info] TraverseBench.traverseChain            10000  thrpt   20  1092.330 ±  7.171  ops/s
[info] TraverseBench.traverseChainError       10000  thrpt   20  2713.055 ± 41.359  ops/s
[info] TraverseBench.traverseFilterChain      10000  thrpt   20   982.057 ±  6.179  ops/s
[info] TraverseBench.traverseFilterList       10000  thrpt   20   873.508 ± 11.078  ops/s
[info] TraverseBench.traverseFilterVector     10000  thrpt   20   986.796 ± 41.753  ops/s
[info] TraverseBench.traverseList             10000  thrpt   20   918.397 ± 39.014  ops/s
[info] TraverseBench.traverseListError        10000  thrpt   20  2781.320 ± 15.372  ops/s
[info] TraverseBench.traverseVector           10000  thrpt   20   997.688 ±  4.798  ops/s
[info] TraverseBench.traverseVectorError      10000  thrpt   20  3081.201 ± 31.311  ops/s
[info] TraverseBench.traverse_Chain           10000  thrpt   20   939.052 ± 20.897  ops/s
[info] TraverseBench.traverse_List            10000  thrpt   20   808.780 ±  8.011  ops/s
[info] TraverseBench.traverse_Vector          10000  thrpt   20  1111.320 ±  9.188  ops/s

@TimWSpence (Member Author):

@valencik absolute legend ❤️ Couple of points:

  1. I'm dumb and not thinking straight 😅 I was thinking the Map implementation would depend on either Chain or Vector, so I was waiting for the outcome of that debate. I'll fix it now
  2. map and filter should be unchanged by this, so all of the results apart from traverseError are expected, right? It's certainly not obvious to me why traverseError is worse now...
  3. Do you think it's worth having separate implementations for Vector and Chain? From my results they seemed noticeably better when we're not doing a conversion from one to the other

@TimWSpence (Member Author):

@valencik sorry to bother you. Just tagging you in case you missed my previous comment

@valencik (Member):

2. map and filter should be unchanged by this so all of the results apart from traverseError are expected, right? It's certainly not obvious to me why traverseError is worse now...

I agree, and similarly, I also do not know why traverseError would be worse now.

3. Do you think it's worth having separate implementations for Vector and Chain? From my results they seemed to be noticeably better when we're not doing a conversion from one to the other

This almost seems like followup work to me? I think, if we are fine with the traverseError regression, we should merge this work (or fix traverseError and then merge). This already is only one part of #4408, we can have incremental improvements follow it.

@TimWSpence (Member Author):

  2. map and filter should be unchanged by this so all of the results apart from traverseError are expected, right? It's certainly not obvious to me why traverseError is worse now...

I agree, and similarly, I also do not know why traverseError would be worse now.

  3. Do you think it's worth having separate implementations for Vector and Chain? From my results they seemed to be noticeably better when we're not doing a conversion from one to the other

This almost seems like followup work to me? I think, if we are fine with the traverseError regression, we should merge this work (or fix traverseError and then merge). This already is only one part of #4408, we can have incremental improvements follow it.

Yeah I'm more than happy to draw a line under this PR and follow up separately - it's been open far too long 😆 I'll move it out of draft now

@TimWSpence marked this pull request as ready for review March 21, 2024 16:49
@TimWSpence (Member Author) commented Mar 22, 2024

Argh @valencik I've just realized the current HEAD uses Vector rather than Chain. From what I saw, Chain seemed to be better overall. Shall I switch it back? I think that was @johnynek's view as well?

@valencik (Member) commented Mar 22, 2024

Argh @valencik I've just realized the current HEAD uses Vector rather than Chain. From what I saw, Chain seemed to be better overall. Shall I switch it back? I think that was @johnynek's view as well?

Just to double check I understand, do you mean the implementation of traverseDirectly?

Yeah, let's switch that to Chain and I can rerun benchmarks.

@valencik (Member):

Currently running benchmarks again on a tweak which uses Chain in traverseDirectly and traverseFilterDirectly
https://github.com/typelevel/cats/compare/803107de2f728f648800513ef5cff242a1172fbf...valencik:cats:more-chain?expand=1

@valencik (Member):

Ok, I ran the benchmarks, I think Chain is better, and put up a PR to your branch/fork @TimWSpence
TimWSpence#2

Feel free to leave it for a day or two.
There's a ton of numbers here.
I wouldn't be against taking another look over things.

I also ran a test benchmark run with a traverseDirectlyVector variant, and I think that shows some mild improvements.
But again, maybe we should save that for a follow up.

@TimWSpence (Member Author):

Ok, I ran the benchmarks, I think Chain is better, and put up a PR to your branch/fork @TimWSpence TimWSpence#2

Feel free to leave it for a day or two. There's a ton of numbers here. I wouldn't be against taking another look over things.

I also ran a test benchmark run with a traverseDirectlyVector variant, and I think that shows some mild improvements. But again, maybe we should save that for a follow up.

Thanks so much again! Presumably that PR is the same as the Chain version that I've had in place multiple times in the history of this PR? 😆

How is traverseDirectlyVector different sorry?

Commit: Use `Chain` in `traverseDirectly` helpers
@valencik (Member):

Before we get into the weeds, my position remains that we should merge this PR if we are ok with the small regression to traverseError benchmarks.

@djspiewak Pinging you as the creator of these benchmarks, what do you think here? We've improved or held stable on all fronts except the traverseError benchmarks.
Which, as a reminder, look like:

  @Benchmark
  def traverseVectorError(bh: Blackhole) = {
    val result = vectorT.traverse(vector) { i =>
      Eval.later {
        Blackhole.consumeCPU(Work)

        if (i == length * 0.3) {
          throw Failure
        }

        i * 2
      }
    }

    try {
      bh.consume(result.value)
    } catch {
      case Failure => ()
    }
  }
[Visual summary of benchmark results, main vs. this PR: JMH Visualizer screenshot]

My gut says these improvements are worth it and we should merge.


How is traverseDirectlyVector different sorry?

It was an experiment where I added additional traverseDirectly and traverseFilterDirectly implementations but using Vector. So there was one for Chain and one for Vector. These then got used in the instances for Vector.
This mildly improved some vector benchmarks but made the traverseVectorError worse. So I'm not sure it's worth digging into in this PR.

The commit: efa7ec0

[Visual summary of benchmark results, this PR vs. traverseDirectlyVector: JMH Visualizer screenshot]

@TimWSpence (Member Author) commented Mar 26, 2024

It was an experiment where I added additional traverseDirectly and traverseFilterDirectly implementations but using Vector. So there was one for Chain and one for Vector. These then got used in the instances for Vector.
This mildly improved some vector benchmarks but made the traverseVectorError worse. So I'm not sure it's worth digging into in this PR.

Ah cool, thanks. Yeah I had asked about that before and I think we decided to leave that for potential follow-up

@TimWSpence (Member Author):

@valencik sorry, I've lost the plot again 😅 Is there anything else I need to do to this right now?

@valencik (Member) commented Apr 6, 2024

@valencik sorry, I've lost the plot again 😅 Is there anything else I need to do to this right now?

We need to make a call on whether or not we're ok with the small regression in the traverseError micro benchmarks.
Otherwise, I think we're good.

val (leftL, rightL) = fa.splitAt(leftSize)
runHalf(leftSize, leftL)
.flatMap { left =>
val right = runHalf(rightSize, rightL)
Contributor:

an interesting benchmark would be to do Eval.defer(runHalf(rightSize, rightL)), which would short-circuit for applicatives G that can short-circuit.

I wonder how much that would hurt the non-short circuiting case... this is not for this PR, just something we may come back to in a later PR.

val rightSize = size - leftSize
runHalf(leftSize, idx)
.flatMap { left =>
val right = runHalf(rightSize, idx + leftSize)
Contributor:

same comment about the possibility of short-circuiting the right with Eval.defer(runHalf(rightSize, idx + leftSize)), but that would add cost to the cases where we do need to evaluate everything, which is probably the most common case.

@johnynek (Contributor) commented Apr 26, 2024

So my read, based on #4498 (comment), is that for a benchmark on Vector with a stack-safe monad (Eval) that short-circuits 30% of the way through the Vector (by throwing, which is highly suspect to me, because that is an effect that Eval isn't designed to deal with), we are currently 12% slower.

I think the benchmark isn't one we should care about, but instead maybe OptionT[Eval, ?] would be a short-circuiting StackSafeMonad we could implement (although I don't think OptionT or EitherT preserve StackSafe-ness currently).

As to why this benchmark is currently slower: I think, as the benchmark is written, in the new PR we fully iterate over the entire Vector, building a chain of map2, which in Eval is just implemented the naive way with flatMap. So we iterate the entire vector for sure. Then we call .value and evaluate the Eval. That process will start running the Eval and throw when it finally gets to the index that throws.

In the old code (the code that builds the tree up), we build a tree by cutting the length (which we get in an O(1) call) in half recursively. Importantly, this process does not iterate over the Vector; it is just splitting the index, basically making a binary tree of indexes to call inside an Eval later. Then, using an extra layer of Eval (to use map2Eval), we start evaluating the function calls left to right. So, in this implementation we only access the first 30% of the indices before we hit the call to f that throws.

Put that way, I think what this means is that for Vector, short-circuiting via the tree is actually faster than directly using map2. This is possibly because iterating the vector may have to fetch memory that is not in cache: we have to do those fetches in the new code, but in the old code we only fetch an index (possibly incurring a cache miss) when we are sure we will actually need the function to be evaluated.

I think perhaps the best call is to simply not do this optimization for traverse for Vector and keep the current code for that particular function. That seems like a quick change, and then I would be happy to merge.
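
(For future readers, the tree strategy described above can be sketched roughly as follows. This is a simplified illustration of the index-splitting idea, not cats' actual traverseViaChain: recurse on index ranges without iterating the Vector, and combine the halves with map2Eval so that a short-circuiting G never forces the right half.)

import cats.{Applicative, Eval}

// Simplified sketch of the binary index-splitting traverse.
def traverseTree[G[_], A, B](fa: Vector[A])(f: A => G[B])(implicit G: Applicative[G]): G[Vector[B]] = {
  def go(lo: Int, hi: Int): Eval[G[Vector[B]]] =
    (hi - lo) match {
      case 0 => Eval.now(G.pure(Vector.empty[B]))
      case 1 => Eval.later(G.map(f(fa(lo)))(Vector(_)))
      case n =>
        val mid = lo + n / 2
        go(lo, mid).flatMap { left =>
          // the right half is only built if G forces the second argument
          G.map2Eval(left, Eval.defer(go(mid, hi)))(_ ++ _)
        }
    }
  go(0, fa.length).value
}

(With G = Option, for example, a None in the left half means the right half's indices are never visited, because Option's map2Eval does not force its second argument on None.)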

@johnynek (Contributor):

btw: I was the person who implemented the tree-style traverse in cats (having noticed it by starting with the TreeList data structure I added to cats-collections: https://github.com/typelevel/cats-collections/blob/master/core/src/main/scala/cats/collections/TreeList.scala)...

I now wonder if the implementation of two strategies in traverseViaChain is actually helping... I can't see why at the moment. There is a tree portion and then a linear portion; the linear portion is currently set at 128 wide. The only motivation I can see for the linear portion is that it may be cheaper to build a List than a Chain, but I don't recall if I benchmarked that.

For any future readers, benchmarking that code with a different width setting (say 32, 64, 128 (current), and 256) might be interesting, along with a version that is a tree all the way down (similar to the implementation of traverse_ in Vector).

If it is very close, it would be worth simplifying the code to only use the binary index tree algorithm for sure.
