Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Proposal] SortedDistinctBy #836

Open
jods4 opened this issue Mar 11, 2022 · 17 comments
Open

[Proposal] SortedDistinctBy #836

jods4 opened this issue Mar 11, 2022 · 17 comments

Comments

@jods4
Copy link

jods4 commented Mar 11, 2022

Same API as DistinctBy, but assumes a sorted input (by the distinct key), so it doesn't have to keep a full hashmap of previously seen values.

@Shelim
Copy link

Shelim commented Mar 31, 2022

This is needed! There is no Linq equivalent, and using Distinct on sorted data is so much worse it terms of O's complexity...
Especially if you need to keep them sorted - as official docs clearly states that result sequence (of Distinct) is unordered. Therefore you may need to sort it again, even more skyrocking the computation cost

@atifaziz
Copy link
Member

atifaziz commented Apr 1, 2022

I believe GroupAdjacent should give you what you're looking for. See also Exploring MoreLINQ Part 18 - GroupAdjacent and Segment by @markheath that includes a video.

Following is an example showing GroupAdjacent being used to average values per day where data is assumed to be sorted by day that's the key:

var data = new[]
{
    new { Date = new DateTime(2010, 1, 1), Value = 1 },
    new { Date = new DateTime(2010, 1, 1), Value = 2 },
    new { Date = new DateTime(2010, 1, 1), Value = 3 },
    new { Date = new DateTime(2010, 1, 2), Value = 4 },
    new { Date = new DateTime(2010, 1, 2), Value = 5 },
    new { Date = new DateTime(2010, 1, 2), Value = 6 },
};

var q =
    from g in data.GroupAdjacent(e => e.Date)
    select new { Date = g.Key, Value = g.Average(e => e.Value) };

foreach (var e in q)
    Console.WriteLine(e.ToString());

Prints:

{ Date = 01/01/2010, Value = 2 }
{ Date = 02/01/2010, Value = 5 }

I'll close this assuming GroupAdjacent does the trick and happy to re-open and re-consider if I misunderstood the problem.

@atifaziz atifaziz closed this as completed Apr 1, 2022
@jods4
Copy link
Author

jods4 commented Apr 1, 2022

@atifaziz GroupAdjacent is kind of similar but it returns a Key and enumerable of values, when all you want is the first value for each key.

You can get that as values.GroupAdjacent(x => x.Date).Select(x => x.First()) but it's a bit clumsy and not as efficient as straightforward implementation of OrderedDistinctBy(x => x.Date).

@Shelim
Copy link

Shelim commented Apr 1, 2022

@atifaziz Close enough, but I believe it is still a bit more expensive than directly implemented OrderedDistinctBy through sharing the same O's. And way harder to find in docs actually, which leaded to my post above :)

@atifaziz
Copy link
Member

atifaziz commented Apr 1, 2022

when all you want is the first value for each key.

What if you want the last value? What if you want a count of items under the key? Who decides? Assuming the first would be very limiting.

You can get that as values.GroupAdjacent(x => x.Date).Select(x => x.First()) but it's a bit clumsy and not as efficient as straightforward implementation of OrderedDistinctBy(x => x.Date).

It's not that it's clumsy, but perhaps wasteful to collect all values for a group only to throw away all but the first.

And way harder to find in docs actually, which leaded to my post above :)

@Shelim That's a hard problem to solve.

I think what's being asked here is a variation of GroupAdjacent that would be called AggregateAdjacent. Instead of collecting values of a key, it would let the caller decide how to accumulate/aggregate/fold into a single result. Here's a prototype of what that could potentially look like:

public static IEnumerable<TResult> AggregateAdjacent<TSource, TKey, TAccumulator, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TKey, TSource, TAccumulator> seedSelector,
    Func<TKey, TAccumulator, TSource, TAccumulator> aggregator,
    Func<TKey, TAccumulator, TResult> resultSelector,
    IEqualityComparer<TKey>? comparer = null)
{
    comparer ??= EqualityComparer<TKey>.Default;

    using var item = source.GetEnumerator();
    
    if (!item.MoveNext())
        yield break;

    var key = keySelector(item.Current);
    var runKey = key;
    var accumulator = seedSelector(key, item.Current);

    while (item.MoveNext())
    {
        key = keySelector(item.Current);
        if (comparer.Equals(runKey, key))
        {
            accumulator = aggregator(runKey, accumulator, item.Current);
            continue;
        }
        else
        {
            yield return resultSelector(runKey, accumulator);
            runKey = key;
            accumulator = seedSelector(key, item.Current);
        }
    }

    yield return resultSelector(runKey, accumulator);
}

Here are some examples of uses:

var data = new[]
{
    new { Date = new DateTime(2010, 1, 1), Value = 1 },
    new { Date = new DateTime(2010, 1, 1), Value = 2 },
    new { Date = new DateTime(2010, 1, 2), Value = 4 },
    new { Date = new DateTime(2010, 1, 2), Value = 5 },
    new { Date = new DateTime(2010, 1, 2), Value = 6 },
    new { Date = new DateTime(2010, 1, 3), Value = 7 },
};

Console.WriteLine("First of each date:");
foreach (var item in data.AggregateAdjacent(e => e.Date, (_, e) => e, (_, a, _) => a, (_, a) => a))
    Console.WriteLine(item.ToString());

Console.WriteLine("Last of each date:");
foreach (var item in data.AggregateAdjacent(e => e.Date, (_, e) => e, (_, _, e) => e, (_, e) => e))
    Console.WriteLine(item.ToString());

Console.WriteLine("Count per date:");
foreach (var item in data.AggregateAdjacent(e => e.Date, (_, e) => 1, (_, a, _) => a + 1,
                                            (d, a) => new { Date = d, Count = a }))
    Console.WriteLine(item.ToString());

Prints:

First of each date:
{ Date = 01/01/2010 00:00:00, Value = 1 }
{ Date = 02/01/2010 00:00:00, Value = 4 }
{ Date = 03/01/2010 00:00:00, Value = 7 }
Last of each date:
{ Date = 01/01/2010 00:00:00, Value = 2 }
{ Date = 02/01/2010 00:00:00, Value = 6 }
{ Date = 03/01/2010 00:00:00, Value = 7 }
Count per date:
{ Date = 01/01/2010 00:00:00, Count = 2 }
{ Date = 02/01/2010 00:00:00, Count = 3 }
{ Date = 03/01/2010 00:00:00, Count = 1 }

To permit multiple aggregations efficiently in a single iteration, one could then use the same strategy as we did with Aggregate in #326.

@atifaziz atifaziz reopened this Apr 1, 2022
@jods4
Copy link
Author

jods4 commented Apr 1, 2022

What if you want the last value?

The semantics of the existing DistinctBy method is to return the first row for each distinct key.
It seems natural that OrderedDistinctBy has the same contract but is optimized for sorted sequences (DistinctBy needs to buffer all seen keys into a hash table to avoid yielding the same key twice, if the sequence is sorted your state only needs the current key, it's O(N) vs O(1)).

This is intuitive and quite useful.

If someone needs something else they should use another primitive like the AggregateAdjacent as you've mentionned.
Or maybe file an issue to add LastDistinctBy and OrderedLastDistinctBy to morelinq.

What if you want a count of items under the key?

This is clearly not a question for DistinctBy. The Distinct family returns a sequence of unique items, not groups. If you want counts, averages, sums or whatnot, then you should turn to GroupBy and/or Aggregate sets of apis. Typically the AggregateAdjacent examples you've made above.


I feel like you're expanding this into harder territory than needs be.
morelinq has several apis that have optimized counterparts when the source sequence is assumed to be sorted.

DistinctBy is a useful method and this issue is only about adding an optimized OrderedDistinctBy that performs the same operation with less local state. It's not complicated and it's useful.

Having new higher-level apis that perform optimized aggregations over sorted sequences is a great idea but it's a different, more complex thing that surely should have its own issue.

@Shelim
Copy link

Shelim commented Apr 4, 2022

Exactly as stated

The base-pure LINQ Distinct is also replaceable by GroupBy and then Select -> First. But it is optimized for this specific - and not so rare! - case of choosing distinct values. And that is the reason it exists in the first place.

@Orace
Copy link
Contributor

Orace commented May 11, 2022

What if the input is not sorted?
Does the fact that the input is sorted is relevant?
As an example, consider: [0, 0, 2, 2, 1, 1]

I propose:

  1. A DistinctBy that take an IOrderedEnumerable and that implement the optimized algorithm.
    It can be an overload or a try cast in the current implementation.
  2. An equivalent of the RX distinctUntilChanged method. I propose DistinctAdjacent and DistinctAdjacentBy.

@jods4
Copy link
Author

jods4 commented May 11, 2022

Adjacent could be another name for sure. 👍
It's important that the source is "sorted" in the sense that identical values are adjacent, otherwise the result will not be correct.

"Ordered" is a vague concept anyway when a comparator is not given... for all we know they could be sorted in descending order, or with their bits reversed, or anything really...

@Orace
Copy link
Contributor

Orace commented May 11, 2022

It's important that the source is "sorted" in the sense that identical values are adjacent, otherwise the result will not be correct.

It's important for a use case of yours.
Again, should the method have to check for a sorted / ordered input? With the limitations that this check imply (A hashset of already encountered values / comparable values or value comparer).
I think that the method should have a single purpose: collapse repeated adjacent values, whatever the input is.
The fact that it then matchs your use case on a sorted input is just an happy incident.

@jods4
Copy link
Author

jods4 commented May 11, 2022

Yes, I agree.
The method shouldn't validate that the input is sorted, and in fact as I mentioned before it's impossible without taking an extra comparator parameter because you can't assume how it's sorted.

The expected result is totally deterministic and if you have repeated but not adjacent values, this API would yield two items which have non-distinct key (albeit spaced by at least one other item).

@jods4
Copy link
Author

jods4 commented May 11, 2022

For reference here's the code we use for that in my project.
Feel free to copy, for a public API maybe you'll want:

  • Non-null parameters validation (internally we have nullable refs enabled so we don't care).
  • An overload that takes a custom equality comparer (we didn't need one).
  • For max perf you could drop the first variable and work with IEnumerator directly but it's more involved and needs a using in case it's disposable. Not sure it makes much difference.
    public static IEnumerable<T> OrderedDistinctBy<T, K>(this IEnumerable<T> source, Func<T, K> selector)
    {
      var comparer = EqualityComparer<K>.Default;
      var first = true;
      var previous = default(K);
      foreach (var x in source)
      {
        if (first)
        {
          previous = selector(x);
          first = false;
          yield return x;
          continue;
        }

        var current = selector(x);
        if (!comparer.Equals(previous!, current))
        {
          previous = current;
          yield return x;
        }
      }
    }

@viceroypenguin
Copy link
Contributor

@jods4 - FYI: This is already implemented in the System.Interactive package as .DistinctUntilChanged().

@atifaziz
Copy link
Member

@jods4 - FYI: This is already implemented in the System.Interactive package as .DistinctUntilChanged().

@jods4, @Shelim: I don't see the point in duplicating DistinctUntilChanged here since System.Interactive already offers what you're looking for and the more generic version didn't appeal, so perhaps we can close this?

@Shelim
Copy link

Shelim commented Nov 13, 2022

@atifaziz
I see there is a very strong opposition to add it in MoreLINQ directly (and I still do not get why, but whatever...), so I believe so. But please, at least mention it somewhere in the docs - issues does not appear high in search and I personally searched LINQ Extended when looking for DistinctUntilChanged, and I was not even aware of Rx async components. They did not appear within first three pages on Google while MoreLINQ was the first hit...

@jods4
Copy link
Author

jods4 commented Nov 14, 2022

@atifaziz
If you don't want to add it to MoreLINQ, then you can close this, of course!
I thought it was an excellent fit for MoreLINQ, but it's your call 😉

I don't want to bring in System.Interactive just for a single function that is less than 20 LoC, so I'm gonna stick with my local implementation.

@viceroypenguin
Copy link
Contributor

viceroypenguin commented Nov 14, 2022

Actually, this is not something that needs a separate operator- there is already a solution that exists using the existing operators:

var distinct = source.Lag((curr, lag) => (curr, lag)).Where(x => !comparer.Equals(x.lag, x.curr)).Select(x => x.curr);

This has similar memory performance as your implementation, but doesn't require a full operator to be implemented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants