Add a switch to Select-String that makes it return just strings for convenience and performance #7713
Comments
Typo? We could consider |
@iSazonov: Yes, a typo, thanks - fixed. How about we use both parameter names? I agree that |
In the case it is better to enumerate in the Issue all cases where the parameter should be . |
@mklement0 We already have Regarding performance, I doubt creating |
As for the parameter name: If you're referring to The only other instance of I therefore consider As for the semantics of the proposed The OP now states:
That is, The informal gist of this abstraction is: "Give me just the objects I'm interested in, nothing else - do not attach properties I don't care about and don't wrap the object in another type providing metadata." As for performance: The primary reason for creating this issue was to forgo the wrappers around the matching lines, because just wanting the lines themselves is a common use case. A performance gain - if any - would just be a beneficial side effect. It sounds like you're saying that performance gain will primarily come from bypassing ETS properties rather than straight .NET-type wrappers. The issue you meant to link to is #7673 (your link URL has an extraneous char. at the end). |
@mklement0 Thanks for pointing out the link issue, it should be fixed now.
There are no wrappers for In terms of actual string matching performance,
PSCore (2:213) > time { grep.exe try C:\netpop\WarAndPeace.txt} | % totalmilliseconds
61.3222
PSCore (2:214) > time { sls try C:\netpop\WarAndPeace.txt} | % totalmilliseconds
82.9409
Hopefully @powercode's changes will make them even closer. Where the performance diverges significantly is when you render the output to string:
PSCore (2:215) > time { grep.exe try C:\netpop\WarAndPeace.txt | out-string} | % totalmilliseconds
66.2149
PSCore (2:216) > time { sls try C:\netpop\WarAndPeace.txt | out-string} | % totalmilliseconds
330.4404
Now PowerShell is much slower. So speeding up rendering is the place to look for performance issues. In fact, only emitting the string does substantially "improve" performance, at the cost of losing all context information:
PSCore (2:230) > time { (sls try C:\netpop\WarAndPeace.txt).line | out-string} | % totalmilliseconds
96.3773
Also note that rendering performance is only an issue if you have a lot of matches.
Which is certainly easy enough to accomplish now as shown above. It's not a use case I find especially compelling in PowerShell - if I simply want to filter strings I'll use the |
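For anyone who wants to reproduce the kind of comparison quoted above, here is a minimal sketch. It assumes the `time` command in the transcript behaves like `Measure-Command`, and it uses a generated temp file (the name `sls-sample.txt` and its contents are made up for illustration) instead of the original text file. Absolute numbers will differ by machine and PowerShell version.

```powershell
# Build a throwaway sample file (hypothetical stand-in for the file used in the thread).
$file = Join-Path ([System.IO.Path]::GetTempPath()) 'sls-sample.txt'
1..20000 | ForEach-Object { "line $_ try something" } | Set-Content -LiteralPath $file

# Matching only: MatchInfo objects are produced but never rendered.
$matchOnly = (Measure-Command {
    Select-String -Pattern 'try' -LiteralPath $file
}).TotalMilliseconds

# Matching plus rendering the MatchInfo objects through the formatting system.
$rendered = (Measure-Command {
    Select-String -Pattern 'try' -LiteralPath $file | Out-String
}).TotalMilliseconds

# Matching, then rendering only the matched lines (context information is lost).
$linesOnly = (Measure-Command {
    (Select-String -Pattern 'try' -LiteralPath $file).Line | Out-String
}).TotalMilliseconds

# Show the three timings side by side.
[pscustomobject]@{
    MatchOnlyMs = $matchOnly
    RenderedMs  = $rendered
    LinesOnlyMs = $linesOnly
}

Remove-Item -LiteralPath $file
```

A single run like this is noisy; averaging several runs gives a fairer picture.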
What I meant by wrapper in this context: the core data of interest - the matching lines - are wrapped in a helper type ( Often there is no need for that wrapper, hence this proposal.
A primary benefit of the pipeline is memory-throttling. You forfeit that benefit if you use the Especially with large files, In general, it's not a good idea to frame operators vs. cmdlets as something you can choose freely between. |
I have done some profiling on Select-String, and the massive hit is creating strings for all lines of every file. This also limits the gains of parallelizing Select-String, since it dies the GC death (spending half its time on GC). @BrucePay is correct that the allocation of I've opened an issue on .net core, regarding span based alternatives for regex, and work is being done there. Once we have a way of quickly scanning lines without allocating strings, and have a RegEx class that doesn't allocate a lot of strings internally, we can take a second pass on the performance of Select-String. I have a parallel implementation of Select-String, but it doesn't perform much better because of the GC issues. |
Please add reference on the issue. |
I think we can exclude "performance" from consideration in the Issue and look only at "convenience and performance". So the suggestion is to add a new switch parameter to output string results, not MatchInfo objects. Name can be |
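To make the suggestion concrete, the behavior such a switch would provide can be approximated today with a small wrapper. This is only a sketch: the function name `Select-StringLine` and the sample file are hypothetical, and nothing here reflects how a built-in switch would actually be implemented.

```powershell
# Hypothetical helper: behaves like Select-String but emits only the matching lines.
function Select-StringLine {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory, Position = 0)]
        [string] $Pattern,

        [Parameter(Mandatory, Position = 1)]
        [string] $LiteralPath
    )
    # Stream each MatchInfo through the pipeline and emit only its Line property.
    Select-String -Pattern $Pattern -LiteralPath $LiteralPath |
        ForEach-Object { $_.Line }
}

# Usage on a throwaway file (made-up contents).
$file = Join-Path ([System.IO.Path]::GetTempPath()) 'sls-lines.txt'
'alpha', 'beta', 'gamma', 'delta' | Set-Content -LiteralPath $file

$lines = Select-StringLine 'a$' $file   # lines ending in "a"
$lines

Remove-Item -LiteralPath $file
```

Because the wrapper still goes through the pipeline, it offers the convenience but none of the speedup a native switch could bring.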
Thanks for the great info re performance, @powercode. Good point, @iSazonov - I've updated the OP to remove the performance aspect. For the reasons discussed, I think To me, the options are, in descending order of personal preference:
While
Taking a step back: The very name From that perspective, |
Speaking from an "aesthetic" point of view, this is kind of blech:
The |
@PowerShell/powershell-committee reviewed this, we believe |
A joint bulletin from the Spilt-Milk and 20-20 Departments: As a general rule, expression-mode solutions and pipeline solutions aren't interchangeable, for reasons of performance (yay for expressions) and memory consumption (yay for pipelines), so recommending the use of As for I can't help but notice the irony of a cmdlet named
Let me ask the opposite question: Do you think it is typically more useful to emit [1] As a further aside: Given that |
@mklement0 the discussion within @PowerShell/powershell-committee is that PowerShell is not replicating the |
Thanks for the feedback, @SteveL-MSFT. To be clear: the rich match information is a wonderful thing to have if needed - and often it is not. Arguably, it should have been opt-in to begin with, with string output as default, but that ship has obviously sailed, hence the suggestion to reverse the logic and make string output opt-in. Forcing the extra step of accessing the Consider the following results, searching through a 100,000-line file (with lines containing just sequence numbers):
Command                                              FriendlySecs (10-run avg.)  Factor
-------                                              --------------------------  ------
sls '\d' t.txt                                       0.173                       1.00
(sls '\d' t.txt).Line                                0.424                       2.45
sls '\d' t.txt | Select-Object -ExpandProperty Line  3.732                       21.52
As you can see, even member enumeration is a notable slowdown, but if you must use the pipeline, the slowdown is dramatic: the |
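The three commands in the table differ only in how the strings are extracted, not in what they finally yield. A small sketch of that, using a much smaller generated file than the one measured above (the file name `t-sample.txt` is made up):

```powershell
# Sample file with one sequence number per line (1,000 lines instead of 100,000).
$file = Join-Path ([System.IO.Path]::GetTempPath()) 't-sample.txt'
1..1000 | Set-Content -LiteralPath $file

# 1. Default output: MatchInfo wrappers around each matching line.
$wrapped = Select-String -Pattern '\d' -LiteralPath $file

# 2. Member enumeration: collect everything first, then read .Line from each element.
$viaEnum = (Select-String -Pattern '\d' -LiteralPath $file).Line

# 3. Pipeline: unwrap each MatchInfo as it streams by.
$viaPipe = Select-String -Pattern '\d' -LiteralPath $file |
    Select-Object -ExpandProperty Line

$wrapped[0].GetType().FullName   # Microsoft.PowerShell.Commands.MatchInfo
$viaEnum[0].GetType().FullName   # System.String
$viaPipe[0].GetType().FullName   # System.String

Remove-Item -LiteralPath $file
```

Variants 2 and 3 produce the same strings; the trade-off is that member enumeration must hold all matches in memory, while the pipeline variant streams but pays per-object overhead.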
@mklement0 perf wasn't something we discussed so I appreciate you taking the time to post the numbers. Seeing the data, I would agree that for large files, there is a significant perf difference. |
Thanks, @SteveL-MSFT. Performance was initially discussed, but only in the context of how expensive it is to construct the match-info objects around the matched strings (and the answer was: nothing to worry about) - it hadn't occurred to me until now that the performance penalty comes from the need to "unwrap" the match-info objects in order to get at the strings during later processing. |
Perhaps there must be another solution: #4767 (comment) |
Thanks, @iSazonov, but it's obviously preferable to improve And I think that there's hope: The combination of alleviating the GC issues that @powercode mentions above, combined with outputting strings only, may make As an aside re @lzybkr's other statement in the linked comment (it's a related, but separate issue):
While I think having the option to strip ANSI escape sequences would be great, I don't think it should be done by default - both for reasons of performance and for consistency with Perhaps a dedicated, general-purpose |
I just tried I'm sure they aren't removing escape sequences, but they aren't generating them in the first place. But my point wasn't about specific implementation details - it was that smarts are needed for a good experience - interacting with a console (possibly via a pty like in tmux) or writing to a file. PowerShell doesn't have those smarts. |
Understood re smarts, @lzybkr, but the Unix smarts relate to the utility producing the colored strings, not to stripping on consumption with general-purpose filtering utilities such as That is, if you feed A To address your specific examples:
|
To put it differently: In an ideal world where all output-producing utilities exhibit the smarts you mention and therefore suppress inclusion of escape sequences when not outputting to a terminal, there'd be no need for |
I'm surprised that we switched to discussing stripping escapes here, but I should say I'd expect coloring to be the formatting system's job. I think Unix utilities get the parameter ( As for Select-String, I want to try a couple of ideas in the next few weeks to reduce allocations. Ping me if I don't do it. |
Thanks for being willing to tackle performance improvements, @iSazonov As for stripping escape codes: Sorry - that was a tangent that I started, based on what I now presume to be at least a partial misunderstanding of the comment by @lzybkr that you linked to. Let me try to close the tangent:
No, coloring is more typically applied via aliases (e.g., Generally, however, the expectation is that the producers of optimized-for-display output (coloring, padding/multi-column output) suppress these optimizations situationally, namely if stdout isn't connected to a terminal (unless keeping colors is explicitly requested); both You're right that in PowerShell we typically don't have this problem, because PowerShell's output-formatting system does not come into play:
However, problems do arise with
That is, Therefore, the need for explicitly stripping escape codes on receiving strings is definitely atypical, and as such |
@PowerShell/powershell-committee reviewed this again. We are fine with adding a |
I'd like to pick this one up. |
Have at it! 💖 |
Hey Joel @vexx32 tnx! I've put in a PR as you can see. |
🎉This issue was addressed in #9901, which has now been successfully released as Handy links: |
Related: #7712
Sometimes, all you're interested in is the matching input lines as strings rather than full-blown [Microsoft.PowerShell.Commands.MatchInfo] instances.
Not having to extract the .Line property in subsequent processing can also significantly improve performance.
Update: Performance was originally discussed only with respect to how expensive wrapping the matched strings in MatchInfo instances is to begin with: nothing to worry about, apparently - see comments for a discussion.
A new switch named, say, -Bare could instruct Select-String to output (undecorated) strings only.
Note: I'm suggesting the somewhat abstract name -Bare, because the abstract logic of this proposal - namely to output "bare" objects that are undecorated (have no ETS properties added to them) / are not wrapped in instances of a helper type - applies to other cmdlets as well, such as in #7537, and it's conceivable that other cmdlets may benefit from -Bare too, such as ConvertTo-Json in order to solve #5797, or Compare-Object.
Update: The case for -Bare as a general pattern has since been made in #7855.
Environment data
Written as of: