
Support intermediate pipe handlers to transform output from one command into input for another #206

Open
jeroen-mostert opened this issue May 2, 2023 · 11 comments

Comments

@jeroen-mostert

You can write cmd | Console.WriteLine (for example) to pipe output to a delegate, and cmd1 | cmd2 | cmd3 to pipe between commands, but you can't write cmd1 | twiddle | cmd2, with twiddle a Func&lt;string, string&gt;, to transform between output and input. Any chance this could be implemented?

Syntax niceties aside, I haven't found any easy, clean way of expressing this logic with the existing interfaces either (assuming you don't want to buffer the input in its entirety). While handling output can be done with a delegate, there appears to be no dual for handling input (through an IEnumerable<string> or other iterator/observer mechanism). Only stream-based inputs are supported, and while you can write your very own stream type for that purpose, that seems very uncomfortable. As the existing infrastructure seems to be entirely based on Stream, that may also be the reason this isn't supported, I guess. :P System.IO.Pipelines may be of some use there, though it makes my head hurt every time I look at it.
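For reference, a hand-rolled stream of that kind might look roughly like this. This is a minimal, illustrative sketch; the class name and newline framing are invented for the example, and nothing here is CliWrap API:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// A read-only Stream over a sequence of lines -- the sort of custom
// stream type the comment above alludes to having to write by hand.
class LineEnumerableStream : Stream
{
    private readonly IEnumerator<string> _lines;
    private byte[] _current = Array.Empty<byte>();
    private int _position;

    public LineEnumerableStream(IEnumerable<string> lines) =>
        _lines = lines.GetEnumerator();

    public override int Read(byte[] buffer, int offset, int count)
    {
        if (_position >= _current.Length)
        {
            if (!_lines.MoveNext())
                return 0; // end of stream
            // Re-encode the next line, appending a newline as framing
            _current = Encoding.UTF8.GetBytes(_lines.Current + "\n");
            _position = 0;
        }
        var n = Math.Min(count, _current.Length - _position);
        Array.Copy(_current, _position, buffer, offset, n);
        _position += n;
        return n;
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}
```

It works, but the amount of boilerplate for what is conceptually "an iterator of lines" illustrates the discomfort described above.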

Dropping down to Process and writing a good old-fashioned loop around Output.ReadLine and Input.WriteLine works, of course, but you need to take care of reading the other streams to their end and cleaning up and all those other niggles that CliWrap is otherwise neat at handling, so it's not at all cool.

@Tyrrrz
Owner

Tyrrrz commented May 2, 2023

As the existing infrastructure seems to be entirely based on Stream, that may also be the reason this isn't supported.

Pretty much, since the underlying stdin/stdout/stderr streams are, in fact, streams. The pipe abstraction is just an inversion of the stream abstraction – where you normally read from a stream, you write to the pipe, and vice versa.

Now, in theory, your suggestion can be implemented without buffering the entire input, but by a process of:

  1. Decode a chunk of stream
  2. Yield a line (until break) or accumulate and return to 1
  3. Transform to get the result
  4. Re-encode the string and write it to the stdin stream of another command

We already do 1 and 2 for existing string-based pipe targets. The rest, however, would require new abstractions alongside PipeSource/PipeTarget that would act as intermediate data handlers.
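The four steps above can be sketched as a standalone copy loop. This is illustrative only; CopyLinesAsync is a hypothetical helper written for this example, not a CliWrap API:

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

static class LineTransform
{
    // Copies `source` to `destination` one line at a time: decode a
    // chunk (1), yield a line (2), transform it (3), re-encode and
    // write it out (4). Only one line is ever buffered.
    public static async Task CopyLinesAsync(
        Stream source,
        Stream destination,
        Func<string, string> transform,
        CancellationToken cancellationToken = default)
    {
        using var reader = new StreamReader(source, Encoding.UTF8);
        // BOM-less encoding so nothing extra is injected into the
        // downstream process's stdin
        await using var writer = new StreamWriter(destination, new UTF8Encoding(false)) { AutoFlush = true };

        while (await reader.ReadLineAsync() is { } line)
        {
            cancellationToken.ThrowIfCancellationRequested();
            await writer.WriteLineAsync(transform(line));
        }
    }
}
```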

Perhaps an easier solution would be to implement an IAsyncEnumerable&lt;string&gt;-based PipeSource? That way you should be able to take the ListenAsync() output of an existing command, use System.Reactive extensions to transform it into the required IAsyncEnumerable&lt;string&gt; form, and then feed it into another command.

@jeroen-mostert
Author

Although pulling in Rx is easier in the sense of requiring less code, it sure sounds like a whole bunch of overhead to overcome what is essentially the inability (or perhaps more accurately unwillingness, in design terms) of CliWrap to customize the copy loop from stream to stream. Using Rx to implement what is essentially a fully synchronous buffer (with hidden characteristics) is probably one of the worst ways to use it (although this is admittedly just me speaking off the cuff without having measured the actual performance impact, which is likely still dwarfed by the IPC itself).

A stream abstraction over a delegate is still overhead, but at least the overhead there is easier to reduce to a minimum. Even so it feels clunky that it would be necessary at all, although I do understand the alternative would probably involve revising a bunch of internal plumbing, which isn't necessarily worth it either. I'll have to think about whether this is worth looking into further or if I'm better off just doing my own little wrapping, since the rest of my scenario isn't very exciting.

@Tyrrrz
Owner

Tyrrrz commented May 2, 2023

It would help if you could also provide some code samples that illustrate the usage scenario in more depth.

@jeroen-mostert
Author

jeroen-mostert commented May 2, 2023

Simple enough, let me show you what I'm actually doing (well, more or less):

while (inputProcess.StandardOutput.ReadLine() is string line) {
    using JsonDocument document = JsonDocument.Parse(line);
    using (Utf8JsonWriter writer = new(outputProcess.StandardInput.BaseStream)) {
        // Do a bunch of calls on `writer` using what's in `document`.
        // We only make forward motion here, no buffering
    }
    outputProcess.StandardInput.WriteLine();
}

The additional complication here is that I also need to read the StandardError of both processes, but these are just piped through to my own StandardError without further transformation. This is what CliWrap can take care of transparently using .WithStandardErrorPipe().

This is all a lot of ceremony and boilerplate for what could be ProcessOne | transformJsonDocument | ProcessTwo (albeit with the wrinkle that the naive version of this could not use Utf8JsonWriter on the output stream directly and would just produce a string; but let's assume you go for the simplest thing that could possibly work here before optimizing it -- it's much more important that the whole input is not buffered than that individual lines are not buffered).

It's also notable that no asynchronous code is involved, mainly because for this particular scenario the added overhead isn't worth it -- it just increases the total time while we don't need the scalability benefits, if there are any. (Also, as I understand it, there is no portable asynchronous console I/O, so you'll usually get async-over-sync anyway.)

@Tyrrrz
Owner

Tyrrrz commented May 2, 2023

Okay, thanks for the example. Good thing I asked, because in your scenario you don't actually need to encode/decode strings, since JsonDocument and Utf8JsonWriter can both work directly on streams. That makes things much simpler.

@jeroen-mostert
Author

That's true, although the input does need to be split on newlines, which JsonDocument will not do (though this doesn't require decoding the string). That said, I was also thinking of more mundane scenarios where the input does conceptually need to be processed as strings, which is practically the poster child of pipeline processing (grep, awk, sed, etc.). If the processing could be done entirely in terms of streams it would be easier, but this is often not the case. On the other hand, splitting on newlines is probably the most that is required for most text-based processing.
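As a sketch of that point, newline splitting can indeed be done at the byte level, without decoding the payload into strings, before handing each record to JsonDocument. The NdjsonReader name and shape are invented for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

static class NdjsonReader
{
    // Splits a stream of newline-delimited JSON on '\n' at the byte
    // level and parses each record with JsonDocument -- no string
    // decoding of the payload is required.
    public static IEnumerable<JsonDocument> ReadDocuments(Stream source)
    {
        using var buffer = new MemoryStream();
        int b;
        while ((b = source.ReadByte()) != -1)
        {
            if (b == '\n')
            {
                if (buffer.Length > 0)
                    yield return JsonDocument.Parse(buffer.ToArray());
                buffer.SetLength(0);
            }
            else if (b != '\r') // tolerate CRLF framing
            {
                buffer.WriteByte((byte)b);
            }
        }
        // Final record without a trailing newline
        if (buffer.Length > 0)
            yield return JsonDocument.Parse(buffer.ToArray());
    }
}
```

(Byte-at-a-time reads are slow; a real implementation would read in chunks, which is exactly the kind of plumbing being discussed.)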

@Tyrrrz
Owner

Tyrrrz commented May 3, 2023

Yeah. I definitely think there's merit to the idea in general, just need to make sure not to overshoot with the design.

Having some sort of PipeHandler concept, in addition to PipeSource and PipeTarget, sounds reasonable. I wonder what the best way to integrate it into the existing pipeline would be.

However, if the use case primarily revolves around transforming input/output between two commands, then maybe it's instead more fitting to expand the functionality of PipeSource.FromCommand(...).

@jeroen-mostert
Author

[..] expand the functionality of PipeSource.FromCommand(...).

You mean something like adding an overload that accepts a delegate to transform the output, and/or maybe a separator or splitter function of some sort? That could work. It would support a scenario like cmd1 | one | two | three | cmd2 if you do the functional composition of one | two | three yourself. It would not support arbitrary pipelines like cmd1 | foo | cmd2 | bar | cmd3, but that does seem like a rather less probable setup -- even regular pipelines with multiple commands would be unusual to kick off from another process (as opposed to just shelling out).
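For what it's worth, the functional composition of one | two | three is a one-liner; a hypothetical helper (not part of CliWrap) might look like:

```csharp
using System;

static class FuncComposition
{
    // Composes two per-line transforms left to right, so a chain of
    // transforms collapses into a single delegate that one hook accepts.
    public static Func<string, string> Then(
        this Func<string, string> first,
        Func<string, string> second) =>
        s => second(first(s));
}
```

With this, one.Then(two).Then(three) is a single Func&lt;string, string&gt;, reducing the multi-transform case to the single-transform one.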

@Tyrrrz
Owner

Tyrrrz commented May 3, 2023

You mean something like adding an overload that accepts a delegate to transform the output, and/or maybe a separator or splitter function of some sort? That could work.

Yeah, something like that. A stream-to-stream transform at the base level.

It would not support arbitrary pipelines like cmd1 | foo | cmd2 | bar | cmd3, but that does seem like a rather less probable setup -- even regular pipelines with multiple commands would be unusual to kick off from another process (as opposed to just shelling out).

Why exactly would that not work?

var cmdA = PipeSource.FromCommand(Cli.Wrap("cmd1"), x => ...) | Cli.Wrap("cmd2");
var cmdB = PipeSource.FromCommand(cmdA, x => ...) | Cli.Wrap("cmd3");

@jeroen-mostert
Author

Right. For some reason I was fixated on PipeSource not directly returning a Command. The syntax does get a little awkward this way because transformations are not on the same level as commands, but with this an operator |(Command, Func<...>) overload should still be possible where they are.

@Tyrrrz Tyrrrz changed the title Support operator | on Func<string, string> (or on-demand input in general) Support intermediate pipe handlers to transform output from one command into input for another May 4, 2023
@Tyrrrz
Owner

Tyrrrz commented Aug 7, 2023

I'm still unsure how to proceed with this feature, but I'm leaning towards mentally filing this under the "future major version" category. On the surface, the simplest solution (i.e., what I described above) can be implemented like so:

public static PipeSource FromCommand(
    Command command,
    Func<Stream, Stream, CancellationToken, Task> transformAsync) =>
    Create(async (destination, cancellationToken) =>
        await command
            .WithStandardOutputPipe(
                PipeTarget.Create(async (source, innerCancellationToken) =>
                    await transformAsync(source, destination, innerCancellationToken).ConfigureAwait(false))
            )
            .ExecuteAsync(cancellationToken)
            .ConfigureAwait(false)
    );

This allows the user to manually copy data from the stdout of one command (source) to the stdin of another command (destination). In doing so, they can perform any binary transformations they want.
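If that overload existed, usage might look like the following sketch (hypothetical: this FromCommand overload is the proposal above, not a shipped CliWrap API, and the pass-through transform is just a placeholder):

```csharp
var pipeline =
    PipeSource.FromCommand(
        Cli.Wrap("cmd1"),
        async (source, destination, cancellationToken) =>
            // Placeholder: copies bytes through unchanged. Any binary
            // rewriting of the data could happen here instead.
            await source.CopyToAsync(destination, cancellationToken))
    | Cli.Wrap("cmd2");

await pipeline.ExecuteAsync();
```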

However, this is a pretty low-level API and the library should provide some higher level overloads as well, such as those accepting Func<string, CancellationToken, Task<string>> for per-line transformations:

public static PipeSource FromCommand(
    Command command,
    Func<string, CancellationToken, Task<string>> transformAsync) =>
    FromCommand(command, async (source, destination, cancellationToken) =>
    {
        using var reader = new StreamReader(source);
        using var writer = new StreamWriter(destination);
        
        // Note: ReadAllLinesAsync is a line-splitting helper extension,
        // not part of the BCL's StreamReader API
        await foreach (var line in reader.ReadAllLinesAsync(cancellationToken))
        {
            var transformedLine = await transformAsync(line, cancellationToken).ConfigureAwait(false);
            await writer.WriteLineAsync(transformedLine).ConfigureAwait(false);
        }
    });

At this point, the delegates become unwieldy, and we're returning to the original suggestion of implementing some sort of PipeTransform or PipeHandler, which I feel pushes the library towards a larger redesign.

I think it's best that we park this issue until I can gather enough comments from the community to understand how the design should be shaped around this feature. As an interim solution, I recommend using the code snippets I shared above.
