Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve GroupAdjacent #1032

Open
jods4 opened this issue Nov 1, 2023 · 9 comments
Open

Improve GroupAdjacent #1032

jods4 opened this issue Nov 1, 2023 · 9 comments

Comments

@jods4
Copy link

jods4 commented Nov 1, 2023

Problem

GroupAdjacent doesn't let you cheaply compare a small key (e.g. single field) to define groups, then use different selectors for the group key and elements.

For example, consider an IEnumerable with the following fields:

aKey
a1
a2
a3
b1
b2
b3

You want to efficiently group by aKey, then use x => A(x.AKey, x.a1, x.a2, x.a3) as group key and x => B(x.b1, x.b2, x.b3) as group elements.

One option that works today is GroupAdjacent(x => A(...), x => B(...)) and have a comparer or override Equals on A.
The drawback is that one A instance is generated for every element in the sequence, when you only need a single one per group.

Proposed solution 1

A new overload?
GroupAdjacent(F<X, A> keySelector, F<X, B> elementSelector, F<X, Y> equalitySelector)

Proposed solution 2

GroupAdjacent doesn't stream its results, it actually buffers each group in a list. Internally these are IList but only exposed as IEnumerable.
Exposing the IList would allow the following solution:

seq.GroupAdjacent(
  keySelector: x => x.aKey,
  resultSelector: (key, IList elements) => (A(elements[0]), elements.Select(B))
)  

Today, this is doable with a cast from IEnumerable to IList but it can break with any MoreLINQ release as it relies on an implementation detail.
Alternatively, using elements.First() instead of elements[0] will always work but can be quite dangerous if the implementation of GroupAdjacent becomes non-buffering, as elements is enumerated twice.

What about non-buffering?

When working with long groups, it'd be interesting to not buffer the groups.
Maybe a new API StreamGroupAdjacent that has an added constraint that group elements must be consumed in order and only once (but not necessarily fully).

Predicate vs IEqualityComparer

It'd also be nice to have overloads accepting Func<X, X, bool> instead of IEqualityComparer.
GroupAdjacent is relies on adjacency, so uses only equality and never hashes.

@atifaziz
Copy link
Member

atifaziz commented Nov 1, 2023

Sorry, but I am having trouble to understand the problem and what you're asking for.

If this is the input:

aKey
a1
a2
a3
b1
b2
b3

What do you expect the output to be?

@jods4
Copy link
Author

jods4 commented Nov 1, 2023

Basically, I added this to my project. It's a slightly less efficient way of doing Proposed Solution 1:

IEnumerable<IGrouping<G, E>> GroupAdjacent<T, K, G, E>(
  this IEnumerable<T> source,
  Func<T, K> key,
  Func<T, G> group,
  Func<T, E> element)
{
  K cachedKey = default;
  G cachedGroup = default;
  return source.GroupAdjacent(
    x => 
    {
      var nextKey = key(x);
      if (nextKey != cachedKey || cachedGroup is null)
        (cachedKey, cachedGroup) = (nextKey, group(x));
      return cachedGroup;
    },
    element
}

The key point is that I don't want to create a G instance for every item in the source enumeration, only once per group.

@atifaziz
Copy link
Member

atifaziz commented Nov 2, 2023

Thanks for sharing the code. It helps to understand the mechanics a bit. So if the grouped elements were exposed as a list instead of a sequence, like your proposed solution 2, then it would solve your problem?

GroupAdjacent signatures were modeled after GroupBy, which is why the group elements are typed as IEnumerable<>.

@jods4
Copy link
Author

jods4 commented Nov 2, 2023

Yes, if GroupAdjacent exposed IList we could efficiently build the groups by peeking elements[0]. (Proposed Solution 2).

I like Proposed Solution 1 a little more because it's simpler on consumer side. With Solution 2 consumer has to take care of the IGrouping construction.

Independently of Solution 1 or 2: if buffering is not an implementation detail that will change, MoreLinq might as well expose IList. It opens up more flexibility to efficiently consume GroupAdjacent results on the consumer side.

I think it's fair to commit to buffering, because streaming is a valuable alternative but it comes with strict constraints on the way you consume it.
For instance, you can't enumerate one group elements anymore after you've moved to the next group. And you can only enumerate once.
For these reasons, a non-buffering implementation would IMHO have to be offered by a new API.

@atifaziz
Copy link
Member

atifaziz commented Nov 2, 2023

@jods4
Copy link
Author

jods4 commented Nov 2, 2023

ah, I had forgotten this :)

What I want can be implemented on top of AggregateAdjacent, if that's what you mean:

IEnumerable<IGrouping<G, E>> GroupByAdjacent<T, K, G, E>(
  this IEnumerable<T> source,
  Func<T, K> key,
  Func<T, G> group,
  Func<T, E> element)
{
  return source.AggregateAdjacent(
    key,
    (k, x) => new Grouping(group(x)),
    (k, g, x) => 
    { 
      g.Add(element(x)); 
      return g; 
    },
    (k, g) => g
  );
}

@atifaziz
Copy link
Member

atifaziz commented Nov 2, 2023

@jods4 Yes, that's it!

@viceroypenguin
Copy link
Contributor

viceroypenguin commented Nov 3, 2023

What about the following with the existing api:

seq.GroupAdjacent(
  x => (x.aKey, x.a1, x.a2, x.a3),
  x => new B(x.b1, x.b2, x.b3),
  (k, g) => (new A(k.aKey, k.a1, k.a2, k.a3), g));

Unless a1, a2, or a3 are non-comparable, this will work fine.

@jods4
Copy link
Author

jods4 commented Nov 3, 2023

@viceroypenguin
Minor point: I have more a_n properties than I want to copy each iteration (ValueTuple is a struct)...
Major point: MoreLinq would then compare all a_n (default struct comparison) for each element, I want to compare aKey only.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants