
Optimize traverse #4498

Open · wants to merge 31 commits into base: main
Conversation

@TimWSpence (Member) commented Aug 23, 2023

This addresses the StackSafeMonad part of #4408. I thought that there were enough changes in this to merit a separate PR for the Defer part.

Open questions (will edit these out of the description later):

  • The TraverseFilter laws hard-code Option. How do we test that the new branching implementation is lawful?

The following benchmarks were run on an AWS t3.xlarge:

Baseline (b08196e)

Benchmark                           (length)   Mode  Cnt     Score     Error  Units
TraverseBench.filterList               10000  thrpt   20  3247.486 ± 270.817  ops/s
TraverseBench.filterVector             10000  thrpt   20  3351.585 ± 255.883  ops/s
TraverseBench.mapList                  10000  thrpt   20  3198.905 ± 235.865  ops/s
TraverseBench.mapVector                10000  thrpt   20  3906.339 ± 225.508  ops/s
TraverseBench.traverseChain            10000  thrpt   20   556.706 ±  31.814  ops/s
TraverseBench.traverseChainError       10000  thrpt   20  1278.954 ±  74.242  ops/s
TraverseBench.traverseFilterChain      10000  thrpt   20   536.841 ±  38.696  ops/s
TraverseBench.traverseFilterList       10000  thrpt   20   532.103 ±  30.473  ops/s
TraverseBench.traverseFilterVector     10000  thrpt   20   590.312 ±  29.534  ops/s
TraverseBench.traverseList             10000  thrpt   20   537.023 ±  33.287  ops/s
TraverseBench.traverseListError        10000  thrpt   20  1674.930 ±  48.771  ops/s
TraverseBench.traverseVector           10000  thrpt   20   581.844 ±  32.074  ops/s
TraverseBench.traverseVectorError      10000  thrpt   20  1778.334 ± 135.859  ops/s
TraverseBench.traverse_Chain           10000  thrpt   20   490.822 ±  14.145  ops/s
TraverseBench.traverse_List            10000  thrpt   20   474.568 ±  28.995  ops/s
TraverseBench.traverse_Vector          10000  thrpt   20   622.924 ±  38.002  ops/s

Chain (2d5f4d7)

Benchmark                           (length)   Mode  Cnt     Score     Error  Units
TraverseBench.filterList               10000  thrpt   20  3170.974 ± 146.628  ops/s
TraverseBench.filterVector             10000  thrpt   20  3217.881 ± 181.087  ops/s
TraverseBench.mapList                  10000  thrpt   20  3443.969 ± 104.117  ops/s
TraverseBench.mapVector                10000  thrpt   20  3827.401 ± 148.074  ops/s
TraverseBench.traverseChain            10000  thrpt   20   784.845 ±  25.990  ops/s
TraverseBench.traverseChainError       10000  thrpt   20  1302.782 ±  62.342  ops/s
TraverseBench.traverseFilterChain      10000  thrpt   20   748.986 ±  62.151  ops/s
TraverseBench.traverseFilterList       10000  thrpt   20   669.817 ±  41.798  ops/s
TraverseBench.traverseFilterVector     10000  thrpt   20   618.644 ±  28.558  ops/s
TraverseBench.traverseList             10000  thrpt   20   583.413 ±  27.079  ops/s
TraverseBench.traverseListError        10000  thrpt   20  1253.510 ±  76.328  ops/s
TraverseBench.traverseVector           10000  thrpt   20   581.975 ±  27.619  ops/s
TraverseBench.traverseVectorError      10000  thrpt   20  1302.362 ±  66.015  ops/s
TraverseBench.traverse_Chain           10000  thrpt   20  3598.033 ± 108.420  ops/s
TraverseBench.traverse_List            10000  thrpt   20  4048.287 ± 158.090  ops/s
TraverseBench.traverse_Vector          10000  thrpt   20  4094.024 ± 101.115  ops/s

Vector (702ab8b)

Benchmark                           (length)   Mode  Cnt     Score     Error  Units
TraverseBench.filterList               10000  thrpt   20  3305.299 ± 233.483  ops/s
TraverseBench.filterVector             10000  thrpt   20  3253.531 ± 133.993  ops/s
TraverseBench.mapList                  10000  thrpt   20  3308.961 ± 138.697  ops/s
TraverseBench.mapVector                10000  thrpt   20  3911.187 ± 142.582  ops/s
TraverseBench.traverseChain            10000  thrpt   20   594.535 ±  29.777  ops/s
TraverseBench.traverseChainError       10000  thrpt   20  1084.548 ±  75.686  ops/s
TraverseBench.traverseFilterChain      10000  thrpt   20   642.895 ±  55.621  ops/s
TraverseBench.traverseFilterList       10000  thrpt   20   682.965 ±  41.234  ops/s
TraverseBench.traverseFilterVector     10000  thrpt   20   696.235 ±  59.945  ops/s
TraverseBench.traverseList             10000  thrpt   20   582.275 ±  35.877  ops/s
TraverseBench.traverseListError        10000  thrpt   20  1114.959 ±  52.641  ops/s
TraverseBench.traverseVector           10000  thrpt   20   638.235 ±  42.533  ops/s
TraverseBench.traverseVectorError      10000  thrpt   20  1098.038 ±  62.946  ops/s
TraverseBench.traverse_Chain           10000  thrpt   20  3480.086 ± 178.654  ops/s
TraverseBench.traverse_List            10000  thrpt   20  3950.650 ± 171.441  ops/s
TraverseBench.traverse_Vector          10000  thrpt   20  3695.495 ± 182.401  ops/s

@armanbilge (Member) commented Aug 23, 2023

  • The TraverseFilter laws hard-code Option. How do we test that the new branching implementation is lawful?

Can we add some new laws in terms of OptionT[Eval, _]? Edit: I don't think this will work unless we make sure that OptionT offers a StackSafeMonad if F is a StackSafeMonad.
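
(For illustration, such an instance might look like the sketch below. This is hypothetical and not something cats currently provides, as johnynek notes later in this thread; it also assumes the kind-projector plugin for the * type-lambda syntax.)

import cats.StackSafeMonad
import cats.data.OptionT

// Hypothetical instance: OptionT inherits stack safety from an underlying
// StackSafeMonad because its flatMap simply delegates to F's flatMap.
implicit def optionTStackSafeMonad[F[_]](implicit F: StackSafeMonad[F]): StackSafeMonad[OptionT[F, *]] =
  new StackSafeMonad[OptionT[F, *]] {
    def pure[A](a: A): OptionT[F, A] = OptionT.pure[F](a)
    def flatMap[A, B](fa: OptionT[F, A])(f: A => OptionT[F, B]): OptionT[F, B] =
      fa.flatMap(f) // one F.flatMap per step, so no extra stack frames accumulate
  }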

Should I open a separate PR for the new benchmarks first?

Nah.

@johnynek (Contributor):

Can you post the results of the benchmarks? Apologies if I missed them.

val iter = fa.iterator
if (iter.hasNext) {
val first = iter.next()
G.map(iter.foldLeft(f(first)) { case (g, a) =>
Contributor:

I think this can be G.void(...) so that it has a chance of using an optimized implementation.

Member Author:

Yes, good catch!

G.map(fa.iterator.foldLeft(G.pure(builder)) { case (bldrG, a) =>
G.flatMap(bldrG) { bldr =>
G.map(f(a)) {
case Some(b) => bldr += b
Contributor:

I can't see why we know this is safe or lawful.

We have a mutable data structure that could potentially be in a multithreaded situation with G.

Also, consider cases like IO where you have a long computation that finally fails, then you recover some part of it to succeed. It feels like this mutable builder could remember things from failed branches.

I think using an immutable builder (for instance just building up a List, Chain, or Vector) would make it much easier to verify that this is lawful.

Member:

We have a mutable data structure that could potentially be in a multithreaded situation with G.

It can't be: every append to the builder happens in flatMap so by law it must be sequential, not concurrent/parallel.

Also, consider cases like IO where you have a long computation that finally fails, then you recover some part of it to succeed. It feels like this mutable builder could remember things from failed branches.

Hmm. I'm not entirely sure how this applies to traverse. It doesn't have a notion of "recovery". For sure, each individual step that the traverse runs may have a notion of recovery, but that's just it: it will either succeed or it will fail. But there's no way to "recover" an intermediate structure from the traverse itself.

I think using an immutable builder (like for instance just building up a List, Chain or Vector) would be much easier to verify it is lawful.

Won't disagree here. We could do a benchmark to see how much performance we are leaving on the table with that strategy.

@TimWSpence (Member Author):

Can you post the results of the benchmarks? Apologies if I missed them.

Yes, will do. I had originally asked whether I should add the new ones in a separate PR first, which is why I didn't run them straight away.

@TimWSpence (Member Author):

@armanbilge @johnynek apologies it's taken me so long to get round to this. I've updated with benchmarks for the mutable builder approach. Are you still worried about lawfulness? Shall I try to do it with immutable data structures?

Those traverse_ improvements though 😆

Commit: Is this necessary? The monad laws should ensure that it's safe to use mutable builders. Nonetheless it will be good to confirm the performance delta for using immutable data structures.
@johnynek (Contributor) commented Nov 8, 2023

Thanks for working on this. What I don't see is how a Monad such as fs2.Stream[IO, A] would interact here. In that case you can have laziness, concurrency, and more than one output per input.

Can you give an explanation as to why we shouldn't worry about mutability in such a case?

@armanbilge (Member):

Can you give an explanation as to why we shouldn't worry about mutability in such a case?

We can't have concurrency: we are sequencing the effects with flatMap, which by law must be sequential.

private[cats] def traverseDirectly[G[_], A, B](
fa: IterableOnce[A]
)(f: A => G[B])(implicit G: StackSafeMonad[G]): G[Vector[B]] = {
fa.iterator.foldLeft(G.pure(Vector.empty[B])) { case (accG, a) =>
Member:

I think we can do even better here by using a Vector.newBuilder; we can do even even better by using a dedicated builder for the desired collection (e.g. ArraySeq or List); and we can do even even even better by providing a size hint to the builder.
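
(A minimal sketch of the size-hint idea, as a hypothetical helper rather than the PR's code; on Scala 2.13, IterableOnce.knownSize returns -1 when the size is unknown.)

import scala.collection.mutable.Builder

// Hypothetical helper: pre-size a Vector builder when the input size is known,
// avoiding intermediate array growth and copying.
def presizedVectorBuilder[B](fa: IterableOnce[_]): Builder[B, Vector[B]] = {
  val bldr = Vector.newBuilder[B]
  val hint = fa.knownSize // -1 when unknown
  if (hint >= 0) bldr.sizeHint(hint)
  bldr
}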

Reply:

I might have misunderstood something, but it looks like that's how things were done before the commit 8cc742e was introduced, which added the immutable data structures. 🙂

Member:

You're right :) let's just say this PR has been open a while and I lost track of the old stuff 😇

@johnynek (Contributor) commented Nov 9, 2023

We can't have concurrency: we are sequencing the effects with flatMap, which by law must be sequential.

This is definitely a colloquial understanding of Monads, that they represent a sequential computation, but I don't see the law that allows us to make the claim you want.

For instance, in the current PR we are using immutable items inside the monadic type constructor. I think the current version will work. But if we changed those to be mutable, I don't see a proof that it will work, even though it may.

For instance, consider this type:

import cats.Functor
import cats.syntax.functor._

sealed trait Tree[F[_], A]
case class Item[F[_], A](fa: A) extends Tree[F, A]
case class InsideF[F[_], A](treeF: F[Tree[F, A]]) extends Tree[F, A]
case class Branch[F[_], A](left: Tree[F, A], right: Tree[F, A]) extends Tree[F, A]

object Tree {
  def pure[F[_], A](a: A): Tree[F, A] = Item(a)
  def flatMap[F[_]: Functor, A, B](fa: Tree[F, A])(fn: A => Tree[F, B]): Tree[F, B] =
    fa match {
      case Item(a)        => fn(a)
      case InsideF(treeF) => InsideF(treeF.map(flatMap(_)(fn)))
      case Branch(left, right) =>
        Branch(flatMap(left)(fn), flatMap(right)(fn))
    }
}

Now, I haven't checked whether this follows the monad laws as we have expressed them, but I think it, or something like it, could. Note that we are recursively calling flatMap both inside the F context and outside of it. So, if F is lazy, we return before we have fully recursed.

Couldn't something like this cause a problem?

I think saying "monads are sequential" is a bit too hand-wavey to bolt into the default implementation for every user of cats. I think we need a stronger argument than that.
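
(To make the worry concrete, here is a hypothetical demonstration using the Tree definition above with F = Id. With a branching monad, flatMap runs its continuation once per branch, so a mutable builder captured by that continuation receives appends from both branches.)

import cats._

// Hypothetical demonstration, not cats code: a builder shared by the flatMap
// continuation sees appends from both branches of the Tree.
val builder = List.newBuilder[Int]
val branched: Tree[Id, Int] = Branch[Id, Int](Item(1), Item(2))
Tree.flatMap(branched) { a =>
  builder += a // executed once for each branch
  Item[Id, Int](a)
}
builder.result() // List(1, 2): each branch's appends are visible to the other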

Commit: To see if this is more performant than an immutable vector
if (iter.hasNext) {
val first = iter.next()
G.void(iter.foldLeft(f(first)) { case (g, a) =>
G.flatMap(g) { _ =>
Contributor:

why not G.productR(g, f(a)) here? Is it to avoid calling f when g may have already failed? I think a comment would be helpful.

Member Author:

There was no good reason 😅 Thanks, I'll change it.

Member:

Is it to avoid calling f when g may have already failed?

Maybe this is something we should consider though?

Edit: ahh, I see your comment on f95b087. Fair enough.

fa: IterableOnce[A]
)(f: A => G[Option[B]])(implicit G: StackSafeMonad[G]): G[Vector[B]] = {
fa.iterator.foldLeft(G.pure(Vector.empty[B])) { case (bldrG, a) =>
G.flatMap(bldrG) { acc =>
Contributor:

add a comment on why we didn't write:

G.map2(bldrG, f(a)) {
  case (acc, Some(b)) => acc :+ b
  case (acc, _) => acc
}

Member Author:

Thanks, good point 👍

fa: IterableOnce[A]
)(f: A => G[B])(implicit G: StackSafeMonad[G]): G[List[B]] = {
G.map(fa.iterator.foldLeft(G.pure(List.empty[B])) { case (accG, a) =>
G.flatMap(accG) { acc =>
Contributor:

add a comment why not G.map2(accG, f(a))(_ :: _)

Since we know G is lazy (a StackSafeMonad has to be, I think), I'm not sure of the downside here. One answer would be that, for a short-circuiting monad, we don't have to call f if we have already failed, but we are still iterating the whole list, so we are doing O(N) work regardless. The extra allocation from the map2 call doesn't seem like a big problem, since we already have to allocate the function we pass to flatMap in the current code.

By calling map2 we are at least communicating to G what we are doing, and in principle some monads could optimize this (e.g. a Parser can implement a more efficient map2 than flatMap, and it can still be stack-safe since it runs lazily, only when input is passed to the resulting parser).
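
(Written out, the map2-based shape suggested above might look like the sketch below. This is an illustration, not the PR's final code; the operands are ordered so that earlier elements' effects still run first, prepending and then reversing once at the end so that :: typechecks.)

import cats.StackSafeMonad

// Sketch of the suggested map2-based accumulation.
def traverseDirectlyViaMap2[G[_], A, B](
  fa: IterableOnce[A]
)(f: A => G[B])(implicit G: StackSafeMonad[G]): G[List[B]] =
  G.map(fa.iterator.foldLeft(G.pure(List.empty[B])) { (accG, a) =>
    // map2 tells G what we are doing, giving it a chance to optimize
    G.map2(accG, f(a))((acc, b) => b :: acc)
  })(_.reverse)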

@johnynek (Contributor) commented Mar 4, 2024

I would go where the benchmarks suggest if we have confidence in them.

@TimWSpence (Member Author):

@valencik you foolishly volunteered yourself on Discord to run the benchmarks as well 😆 If you have the chance, I would massively appreciate another set of data to confirm that what I'm seeing is correct

@valencik (Member) left a comment:

There's a lingering mutable builder I think we want to remove, and a few more G.pure(())s.

@TimWSpence I can benchmark later today :)

G match {
case x: StackSafeMonad[G] => Traverse.traverse_Directly(fa)(f)(x)
case _ =>
foldRight(fa, Always(G.pure(()))) { (a, acc) =>
Member:

Suggested change:
- foldRight(fa, Always(G.pure(()))) { (a, acc) =>
+ foldRight(fa, Eval.now(G.unit)) { (a, acc) =>

})(_.iterator.toMap)
G match {
case x: StackSafeMonad[G] =>
x.map(fa.foldLeft(G.pure(Map.newBuilder[K, B])) { case (accG, (k, a)) =>
Member:

This is still using the mutable builder, I think we want something like:

case x: StackSafeMonad[G] =>
  fa.iterator.foldLeft(G.pure(Map.empty[K, B])) { case (accG, (k, a)) =>
    x.map2(accG, f(a)) { case (acc, b) => acc + (k -> b) }
  }

G match {
case x: StackSafeMonad[G] => Traverse.traverse_Directly(fa)(f)(x)
case _ =>
foldRight(fa, Always(G.pure(()))) { (a, acc) =>
Member:

Suggested change:
- foldRight(fa, Always(G.pure(()))) { (a, acc) =>
+ foldRight(fa, Eval.now(G.unit)) { (a, acc) =>

Member Author:

😆 the Map instance. Sorry, the conversation has gotten so long that it's hard to follow. I think we agreed to benchmark only Chain, Vector, and List, so I said I'd update the Map instance once we've decided which implementation to use for all the other collections.

@valencik (Member) commented Mar 6, 2024

Mostly 803107d* (sorry, I actually did apply the suggested change in #4498 (comment))

[info] Benchmark                           (length)   Mode  Cnt     Score     Error  Units
[info] TraverseBench.filterList               10000  thrpt   20  5683.590 ±  73.700  ops/s
[info] TraverseBench.filterVector             10000  thrpt   20  5665.475 ±  34.436  ops/s
[info] TraverseBench.mapList                  10000  thrpt   20  5833.001 ±  45.470  ops/s
[info] TraverseBench.mapVector                10000  thrpt   20  6206.558 ±  45.317  ops/s
[info] TraverseBench.traverseChain            10000  thrpt   20  1157.471 ±   8.626  ops/s
[info] TraverseBench.traverseChainError       10000  thrpt   20  2080.678 ±  15.341  ops/s
[info] TraverseBench.traverseFilterChain      10000  thrpt   20  1201.729 ±   6.121  ops/s
[info] TraverseBench.traverseFilterList       10000  thrpt   20  1143.251 ±   8.666  ops/s
[info] TraverseBench.traverseFilterVector     10000  thrpt   20  1232.990 ±   7.090  ops/s
[info] TraverseBench.traverseList             10000  thrpt   20  1085.565 ±   8.561  ops/s
[info] TraverseBench.traverseListError        10000  thrpt   20  2192.903 ±   9.450  ops/s
[info] TraverseBench.traverseVector           10000  thrpt   20  1162.033 ±  22.101  ops/s
[info] TraverseBench.traverseVectorError      10000  thrpt   20  2178.812 ±  22.956  ops/s
[info] TraverseBench.traverse_Chain           10000  thrpt   20  7644.606 ± 151.222  ops/s
[info] TraverseBench.traverse_List            10000  thrpt   20  8946.374 ± 124.209  ops/s
[info] TraverseBench.traverse_Vector          10000  thrpt   20  9036.810 ±  37.314  ops/s

and against main (d323fac) but with TraverseBench.scala from 803107d:

[info] Benchmark                           (length)   Mode  Cnt     Score     Error  Units
[info] TraverseBench.filterList               10000  thrpt   20  5781.907 ±  31.753  ops/s
[info] TraverseBench.filterVector             10000  thrpt   20  5667.024 ±  53.352  ops/s
[info] TraverseBench.mapList                  10000  thrpt   20  5864.307 ±  45.807  ops/s
[info] TraverseBench.mapVector                10000  thrpt   20  6245.999 ±  46.899  ops/s
[info] TraverseBench.traverseChain            10000  thrpt   20  1072.360 ±  25.156  ops/s
[info] TraverseBench.traverseChainError       10000  thrpt   20  2716.055 ± 100.545  ops/s
[info] TraverseBench.traverseFilterChain      10000  thrpt   20  1029.492 ±  51.776  ops/s
[info] TraverseBench.traverseFilterList       10000  thrpt   20   928.622 ±   3.589  ops/s
[info] TraverseBench.traverseFilterVector     10000  thrpt   20  1002.081 ±  44.357  ops/s
[info] TraverseBench.traverseList             10000  thrpt   20   948.305 ±  26.980  ops/s
[info] TraverseBench.traverseListError        10000  thrpt   20  2722.420 ±  77.210  ops/s
[info] TraverseBench.traverseVector           10000  thrpt   20  1011.813 ±  19.254  ops/s
[info] TraverseBench.traverseVectorError      10000  thrpt   20  3366.467 ±  35.241  ops/s
[info] TraverseBench.traverse_Chain           10000  thrpt   20   876.282 ± 100.385  ops/s
[info] TraverseBench.traverse_List            10000  thrpt   20   779.133 ±  14.965  ops/s
[info] TraverseBench.traverse_Vector          10000  thrpt   20  1093.288 ±  45.168  ops/s

By the looks of things, filterList, filterVector, mapList, and mapVector are all within error when comparing this PR against latest main.
We have small improvements for traverseChain, traverseFilterChain, traverseFilterList, traverseFilterVector, traverseList, and traverseVector.
Medium regressions for traverseChainError, traverseListError, traverseVectorError.
And then very large improvements for traverse_Chain, traverse_List, and traverse_Vector.

Summarized differently:

🏎️ traverse_
🏃 traverse
🏃 traverseFilter
🆗 filter
🆗 map
🐢 traverseError

@valencik (Member) commented Mar 6, 2024

Wanted to confirm things, so I ran again and came up with the same results.

I ran these benchmarks (both previously and here) with the following in sbt:
bench/jmh:run -i 10 -wi 10 -f 2 -t 1 cats.bench.TraverseBench

And with the following jvmopts: -Xms2G -Xmx12G

Details

on exactly 803107d

[info] Benchmark                           (length)   Mode  Cnt     Score    Error  Units
[info] TraverseBench.filterList               10000  thrpt   20  5715.626 ±  29.321  ops/s
[info] TraverseBench.filterVector             10000  thrpt   20  5546.558 ±  19.276  ops/s
[info] TraverseBench.mapList                  10000  thrpt   20  5767.038 ±  34.619  ops/s
[info] TraverseBench.mapVector                10000  thrpt   20  6173.065 ±  40.277  ops/s
[info] TraverseBench.traverseChain            10000  thrpt   20  1123.427 ±   4.987  ops/s
[info] TraverseBench.traverseChainError       10000  thrpt   20  2127.662 ±  13.144  ops/s
[info] TraverseBench.traverseFilterChain      10000  thrpt   20  1180.463 ±   5.585  ops/s
[info] TraverseBench.traverseFilterList       10000  thrpt   20  1206.215 ±   9.615  ops/s
[info] TraverseBench.traverseFilterVector     10000  thrpt   20  1244.894 ±  25.955  ops/s
[info] TraverseBench.traverseList             10000  thrpt   20  1069.802 ±  13.869  ops/s
[info] TraverseBench.traverseListError        10000  thrpt   20  2167.488 ±  21.981  ops/s
[info] TraverseBench.traverseVector           10000  thrpt   20  1152.973 ±  25.078  ops/s
[info] TraverseBench.traverseVectorError      10000  thrpt   20  2207.517 ±  33.643  ops/s
[info] TraverseBench.traverse_Chain           10000  thrpt   20  7912.984 ± 107.098  ops/s
[info] TraverseBench.traverse_List            10000  thrpt   20  8934.534 ±  46.916  ops/s
[info] TraverseBench.traverse_Vector          10000  thrpt   20  9092.209 ±  36.689  ops/s

and on main:

[info] Benchmark                           (length)   Mode  Cnt     Score    Error  Units
[info] TraverseBench.filterList               10000  thrpt   20  5710.935 ± 65.052  ops/s
[info] TraverseBench.filterVector             10000  thrpt   20  5586.684 ± 53.060  ops/s
[info] TraverseBench.mapList                  10000  thrpt   20  5794.559 ± 31.269  ops/s
[info] TraverseBench.mapVector                10000  thrpt   20  6156.259 ± 29.844  ops/s
[info] TraverseBench.traverseChain            10000  thrpt   20  1092.330 ±  7.171  ops/s
[info] TraverseBench.traverseChainError       10000  thrpt   20  2713.055 ± 41.359  ops/s
[info] TraverseBench.traverseFilterChain      10000  thrpt   20   982.057 ±  6.179  ops/s
[info] TraverseBench.traverseFilterList       10000  thrpt   20   873.508 ± 11.078  ops/s
[info] TraverseBench.traverseFilterVector     10000  thrpt   20   986.796 ± 41.753  ops/s
[info] TraverseBench.traverseList             10000  thrpt   20   918.397 ± 39.014  ops/s
[info] TraverseBench.traverseListError        10000  thrpt   20  2781.320 ± 15.372  ops/s
[info] TraverseBench.traverseVector           10000  thrpt   20   997.688 ±  4.798  ops/s
[info] TraverseBench.traverseVectorError      10000  thrpt   20  3081.201 ± 31.311  ops/s
[info] TraverseBench.traverse_Chain           10000  thrpt   20   939.052 ± 20.897  ops/s
[info] TraverseBench.traverse_List            10000  thrpt   20   808.780 ±  8.011  ops/s
[info] TraverseBench.traverse_Vector          10000  thrpt   20  1111.320 ±  9.188  ops/s

@TimWSpence (Member Author):

@valencik absolute legend ❤️ Couple of points:

  1. I'm dumb and not thinking straight 😅 I was thinking the Map implementation would depend on either Chain or Vector, so I was waiting for the outcome of that debate. I'll fix it now
  2. map and filter should be unchanged by this, so all of the results apart from traverseError are expected, right? It's certainly not obvious to me why traverseError is worse now...
  3. Do you think it's worth having separate implementations for Vector and Chain? From my results they seemed noticeably better when we're not doing a conversion from one to the other

@TimWSpence (Member Author):

@valencik sorry to bother you. Just tagging you in case you missed my previous comment

@valencik (Member):

2. map and filter should be unchanged by this so all of the results apart from traverseError are expected, right? It's certainly not obvious to me why traverseError is worse now...

I agree, and similarly, I also do not know why traverseError would be worse now.

3. Do you think it's worth having separate implementations for Vector and Chain? From my results they seemed to be noticeably better when we're not doing a conversion from one to the other

This almost seems like followup work to me? I think, if we are fine with the traverseError regression, we should merge this work (or fix traverseError and then merge). This already is only one part of #4408, we can have incremental improvements follow it.

@TimWSpence (Member Author):

  2. map and filter should be unchanged by this so all of the results apart from traverseError are expected, right? It's certainly not obvious to me why traverseError is worse now...

I agree, and similarly, I also do not know why traverseError would be worse now.

  3. Do you think it's worth having separate implementations for Vector and Chain? From my results they seemed to be noticeably better when we're not doing a conversion from one to the other

This almost seems like followup work to me? I think, if we are fine with the traverseError regression, we should merge this work (or fix traverseError and then merge). This already is only one part of #4408, we can have incremental improvements follow it.

Yeah I'm more than happy to draw a line under this PR and follow up separately - it's been open far too long 😆 I'll move it out of draft now

@TimWSpence marked this pull request as ready for review March 21, 2024 16:49
@TimWSpence (Member Author) commented Mar 22, 2024

Argh @valencik I've just realized the current HEAD uses Vector rather than Chain. From what I saw, Chain seemed to be better overall. Shall I switch it back? I think that was @johnynek's view as well?

@valencik (Member) commented Mar 22, 2024

Argh @valencik I've just realized the current HEAD uses Vector rather than Chain. From what I saw, Chain seemed to be better overall. Shall I switch it back? I think that was @johnynek's view as well?

Just to double check I understand, do you mean the implementation of traverseDirectly?

Yeah, let's switch that to Chain and I can rerun benchmarks.

@valencik (Member):

Currently running benchmarks again on a tweak which uses Chain in traverseDirectly and traverseFilterDirectly
https://github.com/typelevel/cats/compare/803107de2f728f648800513ef5cff242a1172fbf...valencik:cats:more-chain?expand=1

@valencik (Member):

Ok, I ran the benchmarks, I think Chain is better, and put up a PR to your branch/fork @TimWSpence
TimWSpence#2

Feel free to leave it for a day or two.
There's a ton of numbers here.
I wouldn't be against taking another look over things.

I also ran a test benchmark run with a traverseDirectlyVector variant, and I think that shows some mild improvements.
But again, maybe we should save that for a follow up.

@TimWSpence (Member Author):

Ok, I ran the benchmarks, I think Chain is better, and put up a PR to your branch/fork @TimWSpence TimWSpence#2

Feel free to leave it for a day or two. There's a ton of numbers here. I wouldn't be against taking another look over things.

I also ran a test benchmark run with a traverseDirectlyVector variant, and I think that shows some mild improvements. But again, maybe we should save that for a follow up.

Thanks so much again! Presumably that PR is the same as the Chain version that I've had in place multiple times in the history of this PR? 😆

How is traverseDirectlyVector different sorry?

Commit: Use `Chain` in `traverseDirectly` helpers
@valencik (Member):

Before we get into the weeds, my position remains that we should merge this PR if we are ok with the small regression to traverseError benchmarks.

@djspiewak Pinging you as the creator of these benchmarks, what do you think here? We've improved or held stable on all fronts except the traverseError benchmarks.
Which, as a reminder, look like:

  @Benchmark
  def traverseVectorError(bh: Blackhole) = {
    val result = vectorT.traverse(vector) { i =>
      Eval.later {
        Blackhole.consumeCPU(Work)

        if (i == length * 0.3) {
          throw Failure
        }

        i * 2
      }
    }

    try {
      bh.consume(result.value)
    } catch {
      case Failure => ()
    }
  }
[Visual summary of benchmark results, main vs. this PR: JMH Visualizer screenshot]

My gut says these improvements are worth it and we should merge.


How is traverseDirectlyVector different sorry?

It was an experiment where I added additional traverseDirectly and traverseFilterDirectly implementations but using Vector. So there was one for Chain and one for Vector. These then got used in the instances for Vector.
This mildly improved some vector benchmarks but made the traverseVectorError worse. So I'm not sure it's worth digging into in this PR.

The commit: efa7ec0

[Visual summary of benchmark results, this PR vs. traverseDirectlyVector: JMH Visualizer screenshot]

@TimWSpence (Member Author) commented Mar 26, 2024

It was an experiment where I added additional traverseDirectly and traverseFilterDirectly implementations but using Vector. So there was one for Chain and one for Vector. These then got used in the instances for Vector.
This mildly improved some vector benchmarks but made the traverseVectorError worse. So I'm not sure it's worth digging into in this PR.

Ah cool, thanks. Yeah I had asked about that before and I think we decided to leave that for potential follow-up

@TimWSpence (Member Author):

@valencik sorry, I've lost the plot again 😅 Is there anything else I need to do to this right now?

@valencik (Member) commented Apr 6, 2024

@valencik sorry, I've lost the plot again 😅 Is there anything else I need to do to this right now?

We need to make a call on whether or not we're ok with the small regression in the traverseError micro benchmarks.
Otherwise, I think we're good.

val (leftL, rightL) = fa.splitAt(leftSize)
runHalf(leftSize, leftL)
.flatMap { left =>
val right = runHalf(rightSize, rightL)
Contributor:

an interesting benchmark would be to do Eval.defer(runHalf(rightSize, rightL)), which would short-circuit for applicatives G that can short-circuit.

I wonder how much that would hurt the non-short circuiting case... this is not for this PR, just something we may come back to in a later PR.

val rightSize = size - leftSize
runHalf(leftSize, idx)
.flatMap { left =>
val right = runHalf(rightSize, idx + leftSize)
Contributor:

same comment about the possibility of short-circuiting the right with Eval.defer(runHalf(rightSize, idx + leftSize)), but that would add cost to the cases where we do need to evaluate everything, which is probably the most common case.

@johnynek (Contributor) commented Apr 26, 2024

So my read, based on #4498 (comment), is that for a benchmark on Vector with a stack-safe monad (Eval) that short-circuits 30% of the way through the Vector (by throwing, which is highly suspect to me, because that is an effect that Eval isn't designed to deal with), we are currently 12% slower.

I think the benchmark isn't one we should care about, but instead maybe OptionT[Eval, ?] would be a short-circuiting StackSafeMonad we could implement (although I don't think OptionT or EitherT preserve StackSafe-ness currently).

As to why this benchmark is currently slower: I think, as the benchmark is written, in the new PR we fully iterate over the entire Vector, building a chain of map2, which in Eval is just implemented the naive way with flatMap. So we iterate the entire vector for sure. Then we call .value and evaluate the Eval. That process will start running the Eval and throw when it finally gets to the index that throws.

In the old code (the code that builds the tree up), we build a tree by cutting the length (which we get in an O(1) call) in half recursively. Importantly, this process does not iterate over the Vector; it is just splitting the index, basically making a binary tree of indexes to call inside an Eval later. Then, using an extra layer of Eval (to use map2Eval), we start evaluating the function calls left to right. So, in this implementation we only access the first 30% of the indices before we hit the call to f that throws.

Put that way, I think what this means is that for Vector, short-circuiting via the tree is actually faster than directly using map2. This is possibly because iterating the vector may have to fetch memory that is not in cache: we have to do those fetches in the new code, but in the old code we only fetch an index (possibly incurring a cache miss) when we are sure we will actually need the function to be evaluated.

I think perhaps the best call is to simply not do this optimization for traverse for Vector and keep the current code for that particular function. That seems like a quick change, and then I would be happy to merge.
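
(For future readers, the tree strategy described above can be sketched roughly as follows. This is a simplified illustration of the index-splitting idea, not cats' actual traverseViaChain: recurse on index ranges without iterating the Vector, and combine the halves with map2Eval so that a short-circuiting G never forces the right half.)

import cats.{Applicative, Eval}

// Simplified sketch of the binary index-splitting traverse.
def traverseTree[G[_], A, B](fa: Vector[A])(f: A => G[B])(implicit G: Applicative[G]): G[Vector[B]] = {
  def go(lo: Int, hi: Int): Eval[G[Vector[B]]] =
    (hi - lo) match {
      case 0 => Eval.now(G.pure(Vector.empty[B]))
      case 1 => Eval.later(G.map(f(fa(lo)))(Vector(_)))
      case n =>
        val mid = lo + n / 2
        go(lo, mid).flatMap { left =>
          // the right half is only built if G forces the second argument
          G.map2Eval(left, Eval.defer(go(mid, hi)))(_ ++ _)
        }
    }
  go(0, fa.length).value
}

(With G = Option, for example, a None in the left half means the right half's indices are never visited, because Option's map2Eval does not force its second argument on None.)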

@johnynek (Contributor):

btw: I was the person who implemented the tree-style traverse in cats (having noticed it by starting with the TreeList data structure I added to cats-collections: https://github.com/typelevel/cats-collections/blob/master/core/src/main/scala/cats/collections/TreeList.scala)...

I now wonder if the implementation of two strategies in traverseViaChain is actually helping... I can't see why at the moment. There is a tree portion and then a linear portion; the linear portion is currently set at 128 wide. The only motivation I can see for the linear portion is that it may be cheaper to build a List than a Chain, but I don't recall if I benchmarked that.

For any future readers, benchmarking that code with a different width setting (say 32, 64, 128 (current), and 256) might be interesting, along with a version that is a tree all the way down (similar to the implementation of traverse_ in Vector).

If it is very close, it would be worth simplifying the code to only use the binary index tree algorithm for sure.
