Spill head projection to remaining sequence #752

atifaziz · 2020-08-06T11:05:55Z

I propose to add an extension that takes one or more elements at the head of a sequence and spills a projection to remaining elements of the sequence.

Sometimes you have a sequence where the initial element(s) contains information about the processing of the rest of the sequence. A typical example is a table (think CSV) where the table is composed of rows; where the first row is a header and the remaining the data rows. Processing such a sequence should only generate a projection of the data rows.

The signature would be as follows:

public static IEnumerable<R>
    SpillSpan<T, M, A, H, R>(
        this IEnumerable<T> source,
        Func<T, (bool, M)> chooser,
        A empty,
        Func<M, A> seeder,
        Func<A, M, A> accumulator,
        Func<A, H> headerSelector,
        Func<H, T, R> resultSelector)

The chooser identifies header elements. Data rows commence as soon as it returns (false, _) for a T. The header elements are accumulated via accumulator and the seeder is used to seed the accumulation with the initial header element. The empty value is used for the headless case. The headerSelector function is used to create a single projection out of the accumulated header elements and which is subsequently paired with or spilled to remaining elements.

I propose to add overloads for simpler cases.

SpillSpan should never throw an exception. If the user wants to ban the headless case, he/she can throw in headerSelector upon receiving the empty value.

Example

const string csv = @"
    a,c,b
    1,3,2
    4,6,5
    7,9,8";

data result =
    from rows in new[]
    {
        from line in Regex.Split(csv.Trim(), @"\r?\n")
        select line.Split(',').Select(f => f.Trim()).ToArray()
    }
    from row in
        rows.Index()
            .SpillSpan(
                r => r.Key == 0,
                null,
                r => r.Value,
                (r, _) => r,
                h => MoreEnumerable.Return(h.Index().ToDictionary(e => e.Value, e => e.Key)) // name => index
                                   .SelectMany(_ => new[] { "a", "b", "c" }, (m, n) => m[n]) // lookup index of name
                                   .ToArray(),
                (bs, r) => bs.Select(i => int.Parse(r.Value[i], CultureInfo.InvariantCulture))
                             .Fold((a, b, c) => new { A = a, B = b, C = c }))
    select row;

foreach (var row in data)
    Console.WriteLine(row);

Output:

{ A = 1, B = 2, C = 3 }
{ A = 4, B = 5, C = 6 }
{ A = 7, B = 8, C = 9 }

The text was updated successfully, but these errors were encountered:

atifaziz · 2020-08-07T17:49:26Z

Based on work so far in PR #753, I feel it's best to rename the extension to SpillHead. The simplest overload will spill the first element of the sequence to the second and beyond so in the case of one element representing the head, span seems confusing. However, head works well whether you have a single element or several making up the “head” of the sequence.

morelinq#752 (comment)

declard · 2023-06-08T00:22:00Z

Is it essentially this, but working in constant space?

public static IEnumerable<R>
    SpillSpan<T, P, R>(
        this IEnumerable<T> source,
        Func<T, bool> predicate,
        Func<T, P> prefixMapping,
        Func<P, T, R> resultSelector)
{
    var (prefix, rest) = source.Span(predicate);
    var mappedPrefix = prefixMapping(prefix);
    return rest.Select(e => resultSelector(mappedPrefix, e));
}

// applied to a predicate `predicate` and an enumerable `source`, returns a tuple where first element is longest prefix (possibly empty) of `source` of elements that satisfy `predicate` and second element is the remainder of the enumerable
// >>> span (< 3) [1,2,3,4,1,2,3,4]
// ([1,2],[3,4,1,2,3,4])
public static (IReadOnlyCollection<T> Prefix, IReadOnlyCollection<T> Rest) Span<T>(this IEnumerable<T> source, Func<T, bool> predicate);

The A empty+Func<M, A> seeder+Func<A, M, A> accumulator+Func<A, H> headerSelector logic can be made external and encapsulated into Func<T, P> prefixMapping. chooser handles both predication and mapping, so the mapping part can be moved into prefixMapping too.

atifaziz · 2023-06-08T18:03:22Z

Not quite like span/Parition, because here the proposal is to consume in a streaming fashion. So while the head/header is collected and projected to help process the remainder, the remainder of the sequence (assuming they are the data rows) is streamed. If you have billions of rows, this operator will lazily only consume what's needed. It can be combined with other streaming operators to avoid committing the entire source to memory.

atifaziz linked a pull request Aug 6, 2020 that will close this issue

Add SpillHead #753

Open

atifaziz added a commit to atifaziz/MoreLINQ that referenced this issue Aug 7, 2020

Rename to SpillHead

bb32281

morelinq#752 (comment)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Spill head projection to remaining sequence #752

Spill head projection to remaining sequence #752

atifaziz commented Aug 6, 2020

atifaziz commented Aug 7, 2020

declard commented Jun 8, 2023

atifaziz commented Jun 8, 2023

Spill head projection to remaining sequence #752

Spill head projection to remaining sequence #752

Comments

atifaziz commented Aug 6, 2020

Example

atifaziz commented Aug 7, 2020

declard commented Jun 8, 2023

atifaziz commented Jun 8, 2023