Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a switch to Select-String that makes it return just strings for convenience and performance #7713

Closed
mklement0 opened this issue Sep 5, 2018 · 31 comments · Fixed by #9901
Labels
Committee-Reviewed PS-Committee has reviewed this and made a decision First-Time-Issue Easy issues first time contributors can work on to learn about this project Hacktoberfest Potential candidate to participate in Hacktoberfest Issue-Discussion the issue may not have a clear classification yet. The issue may generate an RFC or may be reclassif Resolution-Fixed The issue is fixed. Up-for-Grabs Up-for-grabs issues are not high priorities, and may be opportunities for external contributors WG-Cmdlets-Utility cmdlets in the Microsoft.PowerShell.Utility module

Comments

@mklement0
Copy link
Contributor

mklement0 commented Sep 5, 2018

Related: #7712

Sometimes, all you're interested in is the matching input lines as strings rather than full-blown [Microsoft.PowerShell.Commands.MatchInfo] instances.

Not having to extract the .Line property in subsequent processing can also significantly improve performance.

Update: Performance was originally discussed only with respect to how expensive wrapping the matched strings in MatchInfo instances is to begin with: nothing to worry about, apparently - see comments for a discussion.

A new switch named, say, -Bare switch could instruct Select-String to output (undecorated) strings only.

$res = 'line1', 'lineother', 'line3' | Select-String '\d' -Bare; $res; '---'; $res[0].GetType().Name
line1
line3
---
String

Note: I'm suggesting the somewhat abstract name -Bare, because the abstract logic of this proposal - namely to output "bare" objects that are undecorated (have no ETS properties added to them) / are not wrapped in instances of a helper type - applies to other cmdlets as well, such as in #7537, and its conceivable that other cmdlets may benefit from -Bare too, such as ConvertTo-Json in order to solve #5797; other cmdlets could benefit from such a switch too, such as Compare-Object.
Update: The case for -Bare as a general pattern has since been made in #7855.

Environment data

Written as of:

PowerShell Core 6.1.0-preview.4
@iSazonov
Copy link
Collaborator

iSazonov commented Sep 6, 2018

instruct Select-Object

Typo?

We could consider -AsString parameter - looks more understandably.

@iSazonov iSazonov added Issue-Discussion the issue may not have a clear classification yet. The issue may generate an RFC or may be reclassif WG-Cmdlets-Utility cmdlets in the Microsoft.PowerShell.Utility module labels Sep 6, 2018
@mklement0
Copy link
Contributor Author

@iSazonov: Yes, a typo, thanks - fixed.

How about we use both parameter names?

I agree that -AsString is more descriptive in this case, but what appeals me to about -Bare is that it can become part of a general pattern, across different cmdlets, where it requests output of undecorated / unwrapped objects.

@iSazonov
Copy link
Collaborator

iSazonov commented Sep 6, 2018

-Bare is that it can become part of a general pattern

In the case it is better to enumerate in the Issue all cases where the parameter should be .

@mklement0
Copy link
Contributor Author

@iSazonov:

I've added another example to the OP, in addition to the previous Get-Content (#7537 ) one.

If you can think of additional ones, do tell us.

Other than that, I think the updated OP now sufficiently establishes the case for using the name -Bare.

@BrucePay
Copy link
Collaborator

BrucePay commented Sep 6, 2018

@mklement0 We already have -Raw as the pattern for undecorated output (though we only have one instance of it) so there is no need to add a random new parameter pattern here. But I also don't think -Raw is appropriate as MatchInfo is concrete type, not a decorated object that can be cored (i.e. not a PSObject with note properties.) @iSazonovI I like -AsString since it's more informative and we do things like -AsHashtable elsewhere. And adding multiples aliases for no reason is undesirable - it's just more stuff to learn with no benefit.

Regarding performance, I doubt creating MatchInfo objects has much to do with Select-String performance. Again, it's a simple concrete type, not NoteProperties on a PSObject so the Get-Content analogy is not appropriate. Compared to the other operations in the cmdlet, I suspect that MatchObject construction would barely register. (Note: This PR will help with overall performance.) In the issue linked above, it seems that the specific issue has to do with rendering the MatchInfo objects rather than creating them.

@mklement0
Copy link
Contributor Author

@BrucePay:

As for the parameter name:

If you're referring to Get-Content -Raw: it does not provide undecorated output, it just reads the file as a whole - the resulting single string still has ETS properties.

The only other instance of -Raw I could find is on Format-Hex, which is currently undocumented, and it's unclear to me what it does.

I therefore consider -Raw "skunked" and -Bare a viable alternative, not least because raw, as stated, has connotations of uninterpreted / undecoded, which do not apply.


As fo the semantics of the proposed -Bare:

The OP now states:

the abstract logic of this proposal - namely to output "bare" objects that are undecorated (have no ETS properties added to them) / are not wrapped in instances of another type

That is, -Bare would be an abstraction that would apply to both this case (objects of interest wrapped in another type) and the Get-Content case (objects decorated with ETS properties).

The informal gist of this abstraction is:

"Give me just the objects I'm interested, nothing else - do not attach properties I don't care about and don't wrap the object in another type providing metadata."


As for performance:

The primary reason for creating this issue was to forgo the wrappers around the matching lines, because just wanting the lines themselves is a common use case.

Any performance gain - if any - would just be beneficial side effect.

It sounds like you're saying that performance gain will primarily come from bypassing ETS properties rather than straight .NET-type wrappers.


The issue you meant to link to is #7673 (your link URL has an extraneous char. at the end).

@BrucePay
Copy link
Collaborator

BrucePay commented Sep 6, 2018

@mklement0 Thanks for pointing out the link issue, it should be fixed now.

It sounds like you're saying that performance gain will primarily come from bypassing ETS properties rather than straight .NET-type wrappers.

There are no wrappers for MatchInfo, that is, no PSObject extended properties to bypass. MatchInfo is the result type of Select-String just like RegularExpressions.Match is the result type of [regex]::Match(). The match object does contain the string, but it also contains all of the context information.

In terms of actual string matching performance, Select-String and (git's) grep.exe are not that far apart:

PSCore (2:213) >  time { grep.exe try C:\netpop\WarAndPeace.txt} | % totalmilliseconds
61.3222
PSCore (2:214) >  time { sls try C:\netpop\WarAndPeace.txt} | % totalmilliseconds
82.9409

Hopefully @powercode 's changes will make them even closer. Where the performance diverges significantly is when you render the output to string:

PSCore (2:215) >  time { grep.exe try C:\netpop\WarAndPeace.txt | out-string} | % totalmilliseconds
66.2149
PSCore (2:216) >  time { sls try C:\netpop\WarAndPeace.txt | out-string} | % totalmilliseconds
330.4404

Now PowerShell is much slower. So speeding up rendering is the place to look for performance issues. In fact, only emitting the string does substantially "improve" performance, at the cost of losing all context information:

PSCore (2:230) >  time { (sls try C:\netpop\WarAndPeace.txt).line | out-string} | % totalmilliseconds
96.3773

Also note that rendering performance is only an issue if you have a lot of matches.

The primary reason for creating this issue was to forgo the wrappers around the matching lines, because just wanting the lines themselves is a common use case.

Which is certainly easy enough to accomplish now as shown above. It's not a use case I find especially compelling in PowerShell - if I simply want to filter strings I'll use the -match operator. I use Select-String when I want to know in which file and on what line the pattern is matched.

@mklement0
Copy link
Contributor Author

mklement0 commented Sep 7, 2018

@BrucePay:

There are no wrappers for MatchInfo, that is, no PSObject extended properties to bypass

What I meant by wrapper in this context: the core data of interest - the matching lines - are wrapped in a helper type (MatchInfo) whose purpose is to provide context.

Often there is no need for that wrapper, hence this proposal.


Which is certainly easy enough to accomplish now as shown above. It's not a use case I find especially compelling in PowerShell - if I simply want to filter strings I'll use the -match operator. I use Select-String when I want to know in which file and on what line the pattern is matched.

A primary benefit of the pipeline is memory-throttling.

You forfeit that benefit if you use the -match operator and - to a lesser degree, depending on the number of matches - if you use (...).Line

Especially with large files, Select-String is the tool of choice and -match is then not a suitable alternative (as it would require loading the entire file into memory up-front).

In general, it's not a good idea to frame operators vs. cmdlets as something you can choose freely between.

@powercode
Copy link
Collaborator

I have done some profiling on Select-String, and the massive hit is creating strings for all lines of every file. This also limits the gains of parallelizing Select-String, since it dies the GC death. (spending half it's time on GC).

@BrucePay is correct that the allocation of MatchInfos is a microscopic part of the runtime.

I've opened an issue on .net core, regarding span based alternatives for regex, and work is being done there.

Once we have a way of quickly scanning lines without allocating strings, and have a RegEx class that doesn't allocate a lot of strings internally, we can take a second pass on the performance of Select-String.

I have a parallel impementation of Select-String, but it doesn't perform much better because of the GC issues.

@iSazonov
Copy link
Collaborator

iSazonov commented Sep 7, 2018

I've opened an issue on .net core, regarding span based alternatives for regex, and work is being done there.

Please add reference on the issue.

@iSazonov
Copy link
Collaborator

iSazonov commented Sep 7, 2018

I think we can exclude "performance" from considiration in the Issue and look only on "convenience and performance".

So suggestion is to add new switch parameter to output string results not MatchInfo objects. Name can be -Raw or -AsString. It is a question for PowerShell Committee.

@iSazonov iSazonov added the Review - Committee The PR/Issue needs a review from the PowerShell Committee label Sep 7, 2018
@mklement0 mklement0 changed the title Add a switch to Select-String that makes it return just strings, for both convenience and performance Add a switch to Select-String that makes it return just strings for convenience Sep 11, 2018
@mklement0
Copy link
Contributor Author

Thanks for the great info re performance, @powercode.

Good point, @iSazonov - I've updated the OP to remove the performance aspect.

For the reasons discussed, I think -Raw is not an option, as the only reason for its use would be to establish a general pattern, which is at odds with applying that pattern to Get-Content (as requested in #7537), where -Raw already has a different meaning.

To me, the options are, in descending order of personal preference:

  • use just -Bare to align with the envisioned cross-cmdlet pattern
  • use -Bare and make -AsString an alias
  • use just -AsString

While -AsString is arguably more descriptive taken by itself, its use here doesn't quite align with its existing use in other cmdlets:

  • In Get-Unique, it determines how to treat the input.

  • In Group-Object, it determines a specific aspect of the output objects produced, not the type of the output objects per se.


Taking a step back:

The very name Select-*String* would lead one to expect string-typed output by default, yet you get instances of a helper type.

From that perspective, -Bare makes more sense: it requests returning just what the cmdlet is arguably designed to return, without metadata - irrespective of whether that metadata takes the form of a helper wrapper type or the form of ETS properties directly attached to the objects.

@rkeithhill
Copy link
Collaborator

Speaking from an "aesthetic" point of view, this is kind of blech:

Get-ChildItem . -r *.h | Select-String -AsString typedef

The Select-String -AsString bit seems oddly redundant. I don't really like that. I'd prefer -Bare or -Raw.

@SteveL-MSFT
Copy link
Member

SteveL-MSFT commented Oct 3, 2018

@PowerShell/powershell-committee reviewed this, we believe -match or using | % propname is sufficient to achieve this scenario. If you have more details of scenarios requiring the proposed behavior, please let us know.

@SteveL-MSFT SteveL-MSFT added Committee-Reviewed PS-Committee has reviewed this and made a decision and removed Review - Committee The PR/Issue needs a review from the PowerShell Committee labels Oct 3, 2018
@mklement0
Copy link
Contributor Author

mklement0 commented Oct 4, 2018

@SteveL-MSFT:

A joint bulletin from the Spilt-Milk and 20-20 Departments:

As a general rule, expression-mode solutions and pipeline solutions aren't interchangeable, for reasons of performance (yay for expressions) and memory consumption (yay for pipelines), so recommending the use of -match is not helpful as an alternative to Select-String.

As for | % propname (I assume you meant: | % Line[1]):

I can't help but notice the irony of a cmdlet named Select-*String* not only not emitting strings by default, but not even offering an option to do so.
(An irony also alluded to by @rkeithhill when questioning naming the parameter to request string output -AsString).

If you have more details of scenarios requiring the proposed behavior

Let me ask the opposite question: Do you think it is typically more useful to emit MatchInfo objects rather than the strings matched?
I know that changing the default behavior is no longer an option, but the next best thing is to opt into the behavior that should have been the default to begin with - by way of a switch.


[1] As a further aside: Given that Select-Object typically, but not necessarily operates on lines, the naming of that property is also problematic.

@SteveL-MSFT
Copy link
Member

@mklement0 the discussion within @PowerShell/powershell-committee is that PowerShell is not replicating the sed or grep experience so having a rich object makes sense. MatchInfo provides metadata that may be useful more than just the string. We agreed that there are cases where just the string is what is desired, but believe the currently mechanisms are sufficient.

@mklement0
Copy link
Contributor Author

Thanks for the feedback, @SteveL-MSFT.

To be clear: the rich match information is a wonderful thing to have if needed - and often it is not.

Arguably, it should have been opt-in to begin with, with string output as default, but that ship has obviously sailed, hence the suggestion to reverse the logic and make string output opt-in.

Forcing the extra step of accessing the .Line property on each output object is not just inconvenient, but also a performance issue:

Consider the following results, searching through a 100,000-lines file (with lines containing just sequence numbers: 1..1e5 > t.txt):

Command                                             FriendlySecs (10-run avg.) Factor
-------                                             -------------------------- ------
sls '\d' t.txt                                      0.173                      1.00
(sls '\d' t.txt).Line                               0.424                      2.45
sls '\d' t.txt | Select-Object -ExpandProperty Line 3.732                      21.52

As you can see, even member enumeration is a notable slowdown, but if you must use the pipeline, the slowdown is dramatic: the Select-Object -ExpandProperty makes the command 20 times slower.

@SteveL-MSFT
Copy link
Member

@mklement0 perf wasn't something we discussed so I appreciate you taking the time to post the numbers. Seeing the data, I would agree that for large files, there is a significant perf difference.

@SteveL-MSFT SteveL-MSFT added Review - Committee The PR/Issue needs a review from the PowerShell Committee and removed Committee-Reviewed PS-Committee has reviewed this and made a decision labels Oct 4, 2018
@mklement0
Copy link
Contributor Author

Thanks, @SteveL-MSFT. Performance was initially discussed, but only in the context of how expensive it is to construct the match-info objects around the matched strings (and the answer was: nothing to worry about) - it hadn't occurred to me until now that the performance penalty comes from the need to "unwrap" the match-info objects in order to get at the strings during later processing.

@iSazonov
Copy link
Collaborator

iSazonov commented Oct 5, 2018

Perhaps it must be another solition #4767 (comment)

@mklement0
Copy link
Contributor Author

Thanks, @iSazonov, but it's obviously preferable to improve Select-String rather than relying on a third-party solutions such as ripgrep.

And I think that there's hope: The combination of alleviating the GC issues that @powercode mentions above, combined with outputting strings only, may make Select-String fast enough.


As an aside re @lzybkr's other statement in the linked comment (it's a related, but separate issue):

I do think we need to detect where the output is going - and not just for the width. If the output is to a file, we might also want to strip ansi escape sequences (with some sort of option to not do so).

While I think having the option to strip ANSI escape sequences would be great, I don't think it should be done by default - both for reasons of performance and for consistency with grep implementations, which do not perform stripping (and apparently don't even offer the option and only offer coloring output themselves to reflect matching parts).

Perhaps a dedicated, general-purpose Remove-AnsiEscapeSequences cmdlet is (also) called for (surprisingly, the Unix world seems to have no such utility).

@lzybkr
Copy link
Member

lzybkr commented Oct 5, 2018

I just tried ls and grep on my default Ubuntu install (via WSL). I get colored output on the console, but no colors when redirecting to a file. In addition, ls prints 1 file per line when redirected to a file.

I'm sure they aren't removing escape sequences, but they aren't generating them in the first place.

But my point wasn't about specific implementation details - it was that smarts are needed for a good experience - interacting with a console (possibly via a pty like in tmux) or writing to a file. PowerShell doesn't have those smarts.

@mklement0
Copy link
Contributor Author

mklement0 commented Oct 5, 2018

Understood re smarts, @lzybkr, but the Unix smarts relate to the utility producing the colored strings, not to stripping on consumption with general-purpose filtering utilities such as grep.

That is, if you feed grep colored strings, it will happily and invariably pass the escape codes through.

A Remove-AnsiEscapeSequences cmdlet and/or a -StripAnsiCodes switch for Select-String would cover the consumption side.


To address your specific examples:

  • Yes, ls itself adjusts its behavior based on whether you're redirecting to a file / pipe (unless you specify --color or --color=always), in which case the escape sequences are stripped (or never added).

  • Ditto for grep, but only with respect to the color codes it adds in order to highlight matches, not the ones in the input.

@mklement0
Copy link
Contributor Author

mklement0 commented Oct 5, 2018

To put it differently: In an ideal world where all output-producing utilities exhibit the smarts you mention and therefore suppress inclusion of escape sequences when not outputting to a terminal, there'd be no need for Remove-AnsiEscapeSequences and/or -StripAnsiCodes - but in practice you do need to deal with these situations.

@iSazonov
Copy link
Collaborator

iSazonov commented Oct 6, 2018

I wonder that we switched to discuss stripping escapes here but I should say I'd expect that it is formatting system area to do coloring. I think Unix utilities get the parameter (--color) because there are no objects there and user have to set the parameter explicitly in a last command of the pipe. (I found examples of coloring applications that parse the output of other apps/ bash commands and then make a colored output).
In contrast, we know the type of the outgoing object and can delegate the coloring to the formatting system.
From this point of view, I do not see the need to filter escape sequences explicitly - by default formatting system must only mask them (escape escapes) so as not to break the formatting (and possibly have an option to bypass them(ex., if we get them from external app)).

As for Select-String, I want to try a couple of ideas in the next few weeks to reduce allocations. Ping me if I don't do it.

@mklement0
Copy link
Contributor Author

mklement0 commented Oct 8, 2018

Thanks for being willing to tackle performance improvements, @iSazonov

As for stripping escape codes:

Sorry - that was a tangent that I started, based on what I now presume to be at least a partial misunderstanding of the comment by @lzybkr that you linked to.

Let me try to close the tangent:

I think Unix utilities get the parameter (--color) because there are no objects there and user have to set the parameter explicitly in a last command of the pipe.

No, coloring is more typically applied via aliases (e.g., alias ls='ls --color=auto' on Linux) or environment variables, where supported (e.g., export CLICOLOR on macOS).

Generally, however, the expectation is that the producers of optimized-for-display output (coloring, padding/multi-column output) suppress these optimizations situationally, namely if stdout isn't connected to a terminal (unless keeping colors is explicitly requested); both ls and grep are examples of utilities that exhibit this behavior.

You're right that in PowerShell we typically don't have this problem, because PowerShell's output-formatting system does not come into play:

  • when piping objects to PowerShell-native commands, because the receiving commands process the objects as such.
  • when piping objects to external program, because it is the .ToString() representation that is then used (and while that representation is often unhelpful, it is untroubled by issues of assuming a fixed window width or applying output coloring).

However, problems do arise with > / Out-File, where the default output formatting is applied, and can cause problems:

That is, > / Out-File should be enhanced to modify its behavior on whether output is going to a terminal or a file (analogous to Unix utilities - the smarts PowerShell currently lacks, as pointed out by @lzybkr) - in the latter case, either no max. line width should be assumed, or a reasonably high default value should be used, and coloring should be stripped, at least by default.


Therefore, the need for explicitly stripping escape codes on receiving strings is definitely atypical, and as such Remove-AnsiEscapeSequences cmdlet and/or a -StripAnsiCodes switch for Select-String are definitely low-priority - albeit nice to have in principle ("rogue" utilities / scripts could be producing output with escape sequences unconditionally; that this does happen in practice is demonstrated by this SO query).

@SteveL-MSFT
Copy link
Member

@PowerShell/powershell-committee reviewed this again. We are fine with adding a -Raw switch in a new parameter set to emit strings.

@SteveL-MSFT SteveL-MSFT added Committee-Reviewed PS-Committee has reviewed this and made a decision Up-for-Grabs Up-for-grabs issues are not high priorities, and may be opportunities for external contributors and removed Review - Committee The PR/Issue needs a review from the PowerShell Committee labels Nov 7, 2018
@SteveL-MSFT SteveL-MSFT added the First-Time-Issue Easy issues first time contributors can work on to learn about this project label Jan 1, 2019
@TylerLeonhardt TylerLeonhardt added the Hacktoberfest Potential candidate to participate in Hacktoberfest label Feb 19, 2019
@mklement0 mklement0 changed the title Add a switch to Select-String that makes it return just strings for convenience Add a switch to Select-String that makes it return just strings for convenience and performance Feb 26, 2019
@Jawz84
Copy link
Contributor

Jawz84 commented Jun 14, 2019

I'd like to pick this one up.

@vexx32
Copy link
Collaborator

vexx32 commented Jun 14, 2019

Have at it! 💖

@Jawz84
Copy link
Contributor

Jawz84 commented Jun 14, 2019

Hey Joel @vexx32 tnx! I've put in a PR as you can see.

@iSazonov iSazonov added the Resolution-Fixed The issue is fixed. label Jul 22, 2019
@ghost
Copy link

ghost commented Aug 20, 2019

🎉This issue was addressed in #9901, which has now been successfully released as v7.0.0-preview.3.:tada:

Handy links:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Committee-Reviewed PS-Committee has reviewed this and made a decision First-Time-Issue Easy issues first time contributors can work on to learn about this project Hacktoberfest Potential candidate to participate in Hacktoberfest Issue-Discussion the issue may not have a clear classification yet. The issue may generate an RFC or may be reclassif Resolution-Fixed The issue is fixed. Up-for-Grabs Up-for-grabs issues are not high priorities, and may be opportunities for external contributors WG-Cmdlets-Utility cmdlets in the Microsoft.PowerShell.Utility module
Projects
None yet
10 participants