Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spill head projection to remaining sequence #752

Open
atifaziz opened this issue Aug 6, 2020 · 3 comments · May be fixed by #753
Open

Spill head projection to remaining sequence #752

atifaziz opened this issue Aug 6, 2020 · 3 comments · May be fixed by #753

Comments

@atifaziz
Copy link
Member

atifaziz commented Aug 6, 2020

I propose to add an extension that takes one or more elements at the head of a sequence and spills a projection to remaining elements of the sequence.

Sometimes you have a sequence where the initial element(s) contains information about the processing of the rest of the sequence. A typical example is a table (think CSV) where the table is composed of rows; where the first row is a header and the remaining the data rows. Processing such a sequence should only generate a projection of the data rows.

The signature would be as follows:

public static IEnumerable<R>
    SpillSpan<T, M, A, H, R>(
        this IEnumerable<T> source,
        Func<T, (bool, M)> chooser,
        A empty,
        Func<M, A> seeder,
        Func<A, M, A> accumulator,
        Func<A, H> headerSelector,
        Func<H, T, R> resultSelector)

The chooser identifies header elements. Data rows commence as soon as it returns (false, _) for a T. The header elements are accumulated via accumulator and the seeder is used to seed the accumulation with the initial header element. The empty value is used for the headless case. The headerSelector function is used to create a single projection out of the accumulated header elements and which is subsequently paired with or spilled to remaining elements.

I propose to add overloads for simpler cases.

SpillSpan should never throw an exception. If the user wants to ban the headless case, he/she can throw in headerSelector upon receiving the empty value.

Example

const string csv = @"
    a,c,b
    1,3,2
    4,6,5
    7,9,8";

data result =
    from rows in new[]
    {
        from line in Regex.Split(csv.Trim(), @"\r?\n")
        select line.Split(',').Select(f => f.Trim()).ToArray()
    }
    from row in
        rows.Index()
            .SpillSpan(
                r => r.Key == 0,
                null,
                r => r.Value,
                (r, _) => r,
                h => MoreEnumerable.Return(h.Index().ToDictionary(e => e.Value, e => e.Key)) // name => index
                                   .SelectMany(_ => new[] { "a", "b", "c" }, (m, n) => m[n]) // lookup index of name
                                   .ToArray(),
                (bs, r) => bs.Select(i => int.Parse(r.Value[i], CultureInfo.InvariantCulture))
                             .Fold((a, b, c) => new { A = a, B = b, C = c }))
    select row;

foreach (var row in data)
    Console.WriteLine(row);

Output:

{ A = 1, B = 2, C = 3 }
{ A = 4, B = 5, C = 6 }
{ A = 7, B = 8, C = 9 }
@atifaziz atifaziz linked a pull request Aug 6, 2020 that will close this issue
@atifaziz atifaziz linked a pull request Aug 6, 2020 that will close this issue
@atifaziz
Copy link
Member Author

atifaziz commented Aug 7, 2020

Based on work so far in PR #753, I feel it's best to rename the extension to SpillHead. The simplest overload will spill the first element of the sequence to the second and beyond so in the case of one element representing the head, span seems confusing. However, head works well whether you have a single element or several making up the “head” of the sequence.

atifaziz added a commit to atifaziz/MoreLINQ that referenced this issue Aug 7, 2020
@declard
Copy link

declard commented Jun 8, 2023

Is it essentially this, but working in constant space?

public static IEnumerable<R>
    SpillSpan<T, P, R>(
        this IEnumerable<T> source,
        Func<T, bool> predicate,
        Func<T, P> prefixMapping,
        Func<P, T, R> resultSelector)
{
    var (prefix, rest) = source.Span(predicate);
    var mappedPrefix = prefixMapping(prefix);
    return rest.Select(e => resultSelector(mappedPrefix, e));
}

// applied to a predicate `predicate` and an enumerable `source`, returns a tuple where first element is longest prefix (possibly empty) of `source` of elements that satisfy `predicate` and second element is the remainder of the enumerable
// >>> span (< 3) [1,2,3,4,1,2,3,4]
// ([1,2],[3,4,1,2,3,4])
public static (IReadOnlyCollection<T> Prefix, IReadOnlyCollection<T> Rest) Span<T>(this IEnumerable<T> source, Func<T, bool> predicate);

The A empty+Func<M, A> seeder+Func<A, M, A> accumulator+Func<A, H> headerSelector logic can be made external and encapsulated into Func<T, P> prefixMapping. chooser handles both predication and mapping, so the mapping part can be moved into prefixMapping too.

@atifaziz
Copy link
Member Author

atifaziz commented Jun 8, 2023

Not quite like span/Parition, because here the proposal is to consume in a streaming fashion. So while the head/header is collected and projected to help process the remainder, the remainder of the sequence (assuming they are the data rows) is streamed. If you have billions of rows, this operator will lazily only consume what's needed. It can be combined with other streaming operators to avoid committing the entire source to memory.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants