[System.Convert]::FromBase64String causes memory leak with large strings #21473

Open
5 tasks done
chopinrlz opened this issue Apr 15, 2024 · 61 comments
Labels
Needs-Triage The issue is new and needs to be triaged by a work group. WG-Engine-Performance core PowerShell engine, interpreter, and runtime performance

Comments

@chopinrlz

Prerequisites

Steps to reproduce

This was tested on PowerShell 7.4.2

  1. Download the .NET 8.0 installer for Windows x64 to use as the test file, the direct link to this is here: https://dotnet.microsoft.com/en-us/download/dotnet/thank-you/sdk-8.0.204-windows-x64-installer
  2. Save this file into a folder on your computer somewhere
  3. Create a PowerShell script with the following contents in the same folder as the .NET 8.0 installer:
Get-Date
Write-Host "Creating Path to dotnet.exe test file"
$file = Join-Path -Path $PSScriptRoot -ChildPath "dotnet.exe"
Write-Host "Reading all file bytes into memory"
$bytes = [System.IO.File]::ReadAllBytes( $file )
Write-Host "Converting file bytes to base64 string"
$base64 = [System.Convert]::ToBase64String( $bytes )
Write-Host "Converting base64 string back to file bytes"
$bytes = [System.Convert]::FromBase64String( $base64 )
Write-Host "Test complete"
Get-Date

NOTE: If you test this with a newer version of the .NET 8.0 installer, you may have to modify the test script to pick the correct file, since the filename is hard-coded on line 3.

  4. Open a PowerShell window in the folder with the script and test file
  5. Run the PowerShell script
  6. Open Task Manager and observe the memory usage of PowerShell
  7. Note the time required to complete the conversion from Base64 and the usage of upwards of 3.5 GB of RAM to do so

The .NET 8.0 installer for Windows x64 is approximately 222 MB in size. Reading it into memory, converting it to base64, and then converting it back should require about 790 MB of RAM, assuming all variables remain in scope during the process and no garbage collection or object disposal happens. The observed behavior appears to be memory-leak related, as the amount of memory used once the conversion eventually completes is about 3.4 GB of RAM. These data points can be seen in the attached screenshots.

Expected behavior

When you run the same script in PowerShell 7 and Windows PowerShell 5.1, you see two very different behaviors:

In PowerShell 7.4.2, the time to complete is 82 seconds and memory used is 3.4 GB
In PowerShell 5.1, the time to complete is 7 seconds and memory used is 1.0 GB

This suggests there is an error in the PowerShell 7.4.2 / .NET 8.0 implementation.

Actual behavior

In PowerShell 7.4.2, the time to complete is 82 seconds and memory used is 3.4 GB.

Error details

No response

Environment data

Name                           Value
----                           -----
PSVersion                      7.4.2
PSEdition                      Core
GitCommitId                    7.4.2
OS                             Microsoft Windows 10.0.22631
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0

Visuals

Testing in PowerShell 7.4.2

ps7-test

Testing in PowerShell 5.1

ps5-test

PowerShell 7.4.2 Memory Usage

ps7-mem

PowerShell 5.1 Memory Usage

ps5-mem
@chopinrlz chopinrlz added the Needs-Triage The issue is new and needs to be triaged by a work group. label Apr 15, 2024
@rhubarb-geek-nz

Hi,

Given that all of your file and data operations are performed directly with the .NET API, I suggest this is not a PowerShell problem; it looks like a .NET issue.

Is your issue that PowerShell is not running the garbage collector? Does running

[System.GC]::Collect()

make any difference?

Also be aware that managed memory runtimes often take memory from the OS in order to allocate objects but never return it. So the objects may have been released/disposed/freed so the actual CLR heap is free but the memory has not been returned to the OS. This is completely normal as managed runtimes (CLR, JVM etc) assume that if they needed it once they are likely to need it again so no point giving it back to the OS.

You would need to look for tools to examine the state of the CLR heap within the process rather than external process monitoring tools.
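As a rough first step before reaching for a dedicated heap tool, the managed heap size reported by the GC can be compared with the process-level numbers that Task Manager approximates. This is only a sketch: `Get-MemorySnapshot` is a hypothetical helper name, and the figures it reports are coarse.

```powershell
# Hypothetical helper: compare what the GC reports with what the OS sees.
function Get-MemorySnapshot {
    $process = [System.Diagnostics.Process]::GetCurrentProcess()
    try {
        [pscustomobject]@{
            # Bytes the GC currently believes are allocated on the managed heap
            ManagedHeapMB  = [math]::Round([System.GC]::GetTotalMemory($false) / 1MB, 1)
            # Physical memory the OS attributes to the process
            WorkingSetMB   = [math]::Round($process.WorkingSet64 / 1MB, 1)
            # Private memory committed by the process
            PrivateBytesMB = [math]::Round($process.PrivateMemorySize64 / 1MB, 1)
        }
    }
    finally {
        $process.Dispose()
    }
}

$before = Get-MemorySnapshot
[System.GC]::Collect()
$after = Get-MemorySnapshot
$before, $after | Format-Table
```

If the managed heap figure drops after the collection while the working set stays high, the memory is most likely being held by the runtime or by native code rather than by live .NET objects.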

When you have an operation that you know is memory intensive, two other options are available:

  1. run the operation in a separate process, in PowerShell all you need to do is run it in a Job and PowerShell will do the rest.
  2. stream the operation and read and write chunks of data rather than holding it all in memory.
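A minimal sketch of the first option, assuming a small random payload in place of the real file: the round trip runs in a job (a child process), and only a small summary object comes back, so whatever memory the conversion needed is gone once the job's process exits. The property names are made up for illustration.

```powershell
# Run the memory-heavy round trip in a separate process via a job.
$job = Start-Job -ScriptBlock {
    param([int]$Size)
    $bytes = [byte[]]::new($Size)
    [System.Random]::new().NextBytes($bytes)
    $base64 = [System.Convert]::ToBase64String($bytes)
    $decoded = [System.Convert]::FromBase64String($base64)
    # Return only a small summary, not the payload itself
    [pscustomobject]@{
        InputBytes   = $bytes.Length
        Base64Chars  = $base64.Length
        DecodedBytes = $decoded.Length
    }
} -ArgumentList 3MB

# Wait for the job, collect its output, and remove it
$summary = $job | Receive-Job -Wait -AutoRemoveJob
$summary | Select-Object InputBytes, Base64Chars, DecodedBytes
```

This does not make the conversion itself any cheaper; it only keeps the cost out of the interactive session.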

I hope that helps.

@iSazonov iSazonov added the CL-Performance Indicates that a PR should be marked as a performance improvement in the Change Log label Apr 15, 2024
@jborean93
Collaborator

jborean93 commented Apr 15, 2024

I cannot replicate the slowness you see, as 7.4.2 takes less than 5 seconds for me and is in fact faster than WinPS, but I do notice the large memory usage. My guess is that the process is not only storing the two byte arrays (the raw file data and the decoded base64 data) but also the base64 string, all allocated on the heap as part of the operation. Potentially WinPS/.NET Framework is more aggressive in reusing the array values, but as per the above the CLR could be allocating the memory and just never freeing it so that it can reuse the memory more efficiently in the future.

Putting aside the above comment you can more efficiently base64 encode bytes by streaming it rather than reading all the input bytes into memory.

Function ConvertTo-Base64String {
    [OutputType([string])]
    [CmdletBinding()]
    param (
        [Parameter(Mandatory)]
        [string]$Path
    )

    $fs = $cryptoStream = $sr = $null
    try {
        $fs = [System.IO.File]::OpenRead($Path)
        $cryptoStream = [System.Security.Cryptography.CryptoStream]::new(
            $fs,
            [System.Security.Cryptography.ToBase64Transform]::new(),
            [System.Security.Cryptography.CryptoStreamMode]::Read)
        $sr = [System.IO.StreamReader]::new($cryptoStream, [System.Text.Encoding]::ASCII)
        $sr.ReadToEnd()
    }
    finally {
        ${sr}?.Dispose()
        ${cryptoStream}?.Dispose()
        ${fs}?.Dispose()
    }
}

This will stream the raw bytes from the source file stream and produce the final output string. If you are storing this string into a file then you could optimize it further by streaming the output base64 CryptoStream to a file avoiding having to store all the data in PowerShell.
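The file-to-file variant mentioned above could look roughly like this. It is a sketch only: `Convert-FileToBase64File` and its parameter names are hypothetical, and the demo at the end uses a small random temp file rather than the installer.

```powershell
# Hypothetical helper: base64-encode a file straight into another file,
# so that neither the raw bytes nor the base64 text is held in memory as a whole.
function Convert-FileToBase64File {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory)]
        [string]$Path,
        [Parameter(Mandatory)]
        [string]$Destination
    )

    $in = $out = $crypto = $null
    try {
        $in = [System.IO.File]::OpenRead($Path)
        $out = [System.IO.File]::Create($Destination)
        # In Write mode the CryptoStream encodes whatever is written to it
        # and passes the base64 (ASCII) bytes on to the output file.
        $crypto = [System.Security.Cryptography.CryptoStream]::new(
            $out,
            [System.Security.Cryptography.ToBase64Transform]::new(),
            [System.Security.Cryptography.CryptoStreamMode]::Write)
        $in.CopyTo($crypto)
        # Emits the final, possibly padded, block
        $crypto.FlushFinalBlock()
    }
    finally {
        if ($crypto) { $crypto.Dispose() }
        if ($out) { $out.Dispose() }
        if ($in) { $in.Dispose() }
    }
}

# Demo with a small random payload; full paths are used because
# [System.IO.File] resolves relative paths against the process directory.
$source = Join-Path ([System.IO.Path]::GetTempPath()) 'base64-demo-input.bin'
$target = Join-Path ([System.IO.Path]::GetTempPath()) 'base64-demo-output.txt'
$payload = [byte[]]::new(1000)
[System.Random]::new().NextBytes($payload)
[System.IO.File]::WriteAllBytes($source, $payload)
Convert-FileToBase64File -Path $source -Destination $target
```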

If you do need to store the base64 string as an object in PowerShell, keep in mind that you not only have to store the inflated size that base64 uses (($length / 3) * 4), but each char of the string also takes two bytes, so at a minimum you are looking at around 600MB for the dotnet installer. Using the above function I see the memory usage sits around 1.2GB, which is less than WinPS (about 1.4GB). While this is still more than the ~600MB, there could be other factors in play here, like the CLR allocating more memory than it strictly needs in anticipation of future needs, or some other reason.
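The arithmetic behind that estimate can be written out as a small helper. `Get-Base64Footprint` is a hypothetical name, and 222MB is only the approximate installer size quoted in this issue, so the result is a lower bound that ignores runtime overhead.

```powershell
# Hypothetical helper: estimate the minimum memory for a base64 round trip.
function Get-Base64Footprint {
    param (
        [Parameter(Mandatory)]
        [long]$ByteCount
    )

    # Every 3 input bytes become 4 output characters; the last group is padded with '='
    $chars = 4 * [long][math]::Ceiling($ByteCount / 3.0)

    [pscustomobject]@{
        RawBytes       = $ByteCount
        Base64Chars    = $chars
        # .NET strings are UTF-16, so each character costs 2 bytes
        StringBytes    = 2 * $chars
        # Raw bytes, base64 string and decoded copy all alive at the same time
        RoundTripBytes = $ByteCount + (2 * $chars) + $ByteCount
    }
}

# Approximate size of the installer used in this issue
Get-Base64Footprint -ByteCount 222MB
```

For a 222 MB input the string alone comes to roughly 592 MB, which matches the "around 600MB" figure.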

@chopinrlz
Author

chopinrlz commented Apr 16, 2024

Thank you both for your responses and suggestions for optimization. I did reply to the Issue cross-posted in the dotnet/runtime project which you can see here dotnet/runtime#101061 (comment)

I ran [GC]::Collect() after running the test script in both WinPS and PS7 and noticed that in PS7 almost none of the used memory was released back to the operating system, which seemed suspicious as in WinPS almost all memory is released after the test.

UPDATE: Confirmed below that the FromBase64String delay and excessive memory usage only occurs through PowerShell 7, but not through a .NET console app. This definitely appears to be a .NET run-time issue and not a PowerShell issue, however I wanted to post it here as well as over there for visibility in case others encounter this oddity while building PowerShell modules (which is how I stumbled upon it).

I am aware that storing a 200 MB file in memory as base64 text is wildly inefficient, and is not the intention of how my PowerPass module should be used (which is how I discovered this in the first place), but since I stumbled upon this unexpected behavior I thought it prudent to at least report it.

But again, I appreciate all of the comments and feedback here, especially the suggestions for optimization techniques.

@237dmitry

In PowerShell 7.4.2, the time to complete is 82 seconds and memory used is 3.4 GB.

Yes, the same memory usage, but time to complete is about 1.2 seconds

@chopinrlz
Author

chopinrlz commented Apr 16, 2024

I retested the following updated script on my desktop PC, a Ryzen 5800X with 128 GB of RAM and PCIe Gen4 NVMe storage. The test ran much faster, as expected, but the memory usage still remains high.

Even after invoking [GC]::Collect() around 3.2 GB of RAM still remains utilized by the pwsh process.

Reading the 222 MB file into memory takes 0.06 seconds and converting it to base64 takes 0.22 seconds and uses 845 MB of RAM across both operations as expected. The last operation [System.Convert]::FromBase64String uses 2.6 GB of RAM alone and takes 17 seconds.

ps7-retest-timings

I'll cross-post this in the dotnet/runtime issue. Thank you all for the feedback.

Updated test script:

$name = "random.bin"

$start = Get-Date

Write-Host "Creating Path to $name test file: " -NoNewline
$now = Get-Date
$file = Join-Path -Path $PSScriptRoot -ChildPath $name
Write-Host " $(((Get-Date) - $now).TotalMilliseconds) ms"

Write-Host "Reading all file bytes into memory: " -NoNewline
$now = Get-Date
$bytes = [System.IO.File]::ReadAllBytes( $file )
Write-Host " $(((Get-Date) - $now).TotalMilliseconds) ms"

Write-Host "Converting file bytes to base64 string: " -NoNewline
$now = Get-Date
$base64 = [System.Convert]::ToBase64String( $bytes )
Write-Host " $(((Get-Date) - $now).TotalMilliseconds) ms"

Write-Host "Converting base64 string back to file bytes: " -NoNewline
$now = Get-Date
$bytes = [System.Convert]::FromBase64String( $base64 )
Write-Host " $(((Get-Date) - $now).TotalMilliseconds) ms"

Write-Host "Test complete"

Write-Host "Total duration: $(((Get-Date) - $start).TotalMilliseconds) ms"

@chopinrlz
Author

I just tested this same implementation using a C# console application for the dotnet/runtime team, and the issue does NOT occur when running in a console application against .NET 8 on the latest SDK on Windows 11 Professional. My test results and C# code are here: dotnet/runtime#101061 (comment)

It seems that this is actually a memory leak in the PowerShell runtime for some reason. The dotnet/runtime crew was asking if the [System.Convert]::FromBase64String function was being replaced with something else by PowerShell, but since I haven't tweaked my PowerShell installation, I can't imagine what could be doing that. Plus, this issue happens on multiple PCs, my desktop and my laptop, and only within PowerShell 7.4.2.

I'm assuming PowerShell 7.4.2 is using .NET 8.0 under the hood. Does it ship with its own .NET run-time or does it rely on the run-time installed on the system?

@chopinrlz
Author

For reference, the random.bin test file I am using is exactly 233,420,544 bytes in length. I generated it with this script:

param(
	[int]
	$Size
)
$blockSize = 256
$rand = [System.Random]::new()
$total = 0
[byte[]]$data = [System.Array]::CreateInstance( [byte], $blockSize )
$path = Join-Path -Path $PSScriptRoot -ChildPath "random.bin"
if( Test-Path $path ) {
	Remove-Item -Path $path -Force
}
$file = [System.IO.File]::OpenWrite( $path )
while( $total -lt $Size ) {
	$rand.NextBytes( $data )
	$file.Write( $data, 0, $data.Length )
	$total += $blockSize
}
$file.Flush()
$file.Close()

@KalleOlaviNiemitalo

I'm assuming PowerShell 7.4.2 is using .NET 8.0 under the hood. Does it ship with its own .NET run-time or does it rely on the run-time installed on the system?

With PowerShell 7.4.2 installed from Microsoft Store, Get-ChildItem $PSHOME shows files like coreclr.dll; I think that means PowerShell has its own copy of the .NET Runtime.

[System.Runtime.InteropServices.RuntimeInformation]::FrameworkDescription is ".NET 8.0.4".
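For anyone who wants to repeat that check, something along these lines should do. It is a sketch: the `*coreclr*` filter is an assumption that covers coreclr.dll on Windows and the libcoreclr variants elsewhere, and the output property names are made up.

```powershell
# Which .NET runtime is this PowerShell process running on?
$description = [System.Runtime.InteropServices.RuntimeInformation]::FrameworkDescription

# Directory the runtime was actually loaded from
$runtimeDir = [System.Runtime.InteropServices.RuntimeEnvironment]::GetRuntimeDirectory()

# Runtime binaries that sit next to pwsh itself, if any
$bundled = Get-ChildItem -Path $PSHOME -Filter '*coreclr*' |
    Select-Object -ExpandProperty Name

[pscustomobject]@{
    FrameworkDescription = $description
    RuntimeDirectory     = $runtimeDir
    PSHome               = $PSHOME
    BundledRuntimeFiles  = $bundled -join ', '
}
```

If RuntimeDirectory points into $PSHOME, the process is using the copy of the runtime that ships with PowerShell rather than a machine-wide installation.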

@mklement0
Contributor

A couple more data points regarding the unexpected slowdown:

  • I see the slowdown only on Windows (neither on macOS nor on Linux), both in 7.4.2 (.NET 8.0.4) and v7.5.0-preview.2 (.NET 9.0.0-preview.1.24080.9).

  • On Windows, I see the slowdown in 7.3.10 (.NET 7.0.14) too, albeit in less severe form (about twice as fast as 7.4.2 / v7.5.0-preview.2, but still way too slow); WinPS is fine.

@KalleOlaviNiemitalo

KalleOlaviNiemitalo commented Apr 16, 2024

In src/System.Management.Automation/engine/runtime/CompiledScriptBlock.cs, there is class SuspiciousContentChecker, which attempts to detect "suspicious strings" such as FromBase64String. That then apparently causes PowerShell to log some ETW event. I wonder if Windows Defender monitors those events and then spends time investigating the process.

@iSazonov
Collaborator

I would like to remind everyone that earlier we observed slow pwsh operations with files due to antivirus.

@chopinrlz
Author

I'm assuming PowerShell 7.4.2 is using .NET 8.0 under the hood. Does it ship with its own .NET run-time or does it rely on the run-time installed on the system?

With PowerShell 7.4.2 installed from Microsoft Store, Get-ChildItem $PSHOME shows files like coreclr.dll; I think that means PowerShell has its own copy of the .NET Runtime.

[System.Runtime.InteropServices.RuntimeInformation]::FrameworkDescription is ".NET 8.0.4".

I am also seeing .NET 8.0.4 for the FrameworkDescription on my PowerShell 7 install which I setup from the MSI downloaded from Github. Assembly file versions are 8.0.424.16909.

@chopinrlz
Author

chopinrlz commented Apr 16, 2024

I would like to remind everyone that earlier we observed slow pwsh operations with files due to antivirus.

The slow operation is the call to [System.Convert]::FromBase64String which occurs after the file is loaded from disk into memory. On my desktop, this operation takes 17 seconds. The file load operation from disk takes 0.06 seconds.

Also, this only happens when doing this in PowerShell 7.4.2. Running this same test in a C# console application takes under 1 second.

console-app-test

@chopinrlz
Author

In src/System.Management.Automation/engine/runtime/CompiledScriptBlock.cs, there is class SuspiciousContentChecker, which attempts to detect "suspicious strings" such as FromBase64String. That then apparently causes PowerShell to log some ETW event. I wonder if Windows Defender monitors those events and then spends time investigating the process.

I added a Program Setting for pwsh.exe into Exploit Protection under App & Browser Control and disabled all protections. Rerunning the test resulted in the same duration and memory usage: 17 seconds and 3.4 GB of RAM. Is there another place I can check in Windows Security to prevent Defender from watching pwsh.exe?

@KalleOlaviNiemitalo

The excess memory consumption of the pwsh process suggests that there is extra code running within the process, rather than antivirus software examining it from the outside (from another process or from a kernel-mode driver).

During the slow FromBase64String operation, is the pwsh process consuming a lot of processor time (one thread's worth)?

Can you attach an unmanaged-code debugger to the process during FromBase64String and get a stack trace of the thread with the most CPU time? (!runaway, k, Thread Syntax)

@chopinrlz
Author

Here is a stack trace of the thread with the most CPU time. I used WinDbg and broke process execution in the middle of the FromBase64String operation. You can see some fun stuff at the top.

0:009> ~21 k
 # Child-SP          RetAddr               Call Site
00 0000007b`a638e6d8 00007ffc`5b790a10     MPCLIENT!MpUpdateServicePingRpc+0x7b676
01 0000007b`a638e6e0 00007ffc`5b7cfde4     MPCLIENT!MpTelemetryUpdateUserConsent+0x190
02 0000007b`a638e720 00007ffc`5b7c5db0     MPCLIENT!MpConveyUserChoiceForSampleList+0x704
03 0000007b`a638e800 00007ffc`68253f92     MPCLIENT!MpAmsiNotify+0x140
04 0000007b`a638e8d0 00007ffc`682eb5fd     MpOav!DllRegisterServer+0x1142
05 0000007b`a638e930 00007ffc`682e81cb     amsi!CAmsiAntimalware::Notify+0xcd
06 0000007b`a638e9c0 00007ffb`95ec173b     amsi!AmsiNotifyOperation+0xab
07 0000007b`a638ea10 00007ffb`f1b36cfb     0x00007ffb`95ec173b
08 0000007b`a638eae0 00007ffb`f1acfc4b     System_Management_Automation!System.Management.Automation.AmsiUtils.WinReportContent+0xeb
09 0000007b`a638eb60 00007ffb`95964a7c     System_Management_Automation!System.Management.Automation.MemberInvocationLoggingOps.LogMemberInvocation+0x27b
0a 0000007b`a638ec90 00007ffb`fa945cf6     0x00007ffb`95964a7c
0b 0000007b`a638ecf0 00007ffb`f1f61b9f     System_Linq_Expressions!System.Dynamic.UpdateDelegates.UpdateAndExecute2<System.Type,object,object>+0x1f6 [/_/src/libraries/System.Linq.Expressions/src/System/Dynamic/UpdateDelegates.Generated.cs @ 268] 
0c 0000007b`a638ed80 00007ffb`f1bac64e     System_Management_Automation!System.Management.Automation.Interpreter.DynamicInstruction<System.Type,object,object>.Run+0xff
0d 0000007b`a638ee10 00007ffb`f1bac64e     System_Management_Automation!System.Management.Automation.Interpreter.EnterTryCatchFinallyInstruction.Run+0x7e
0e 0000007b`a638ee90 00007ffb`f1bb20d3     System_Management_Automation!System.Management.Automation.Interpreter.EnterTryCatchFinallyInstruction.Run+0x7e
0f 0000007b`a638ef10 00007ffb`f1fa3e06     System_Management_Automation!System.Management.Automation.Interpreter.Interpreter.Run+0x33
10 0000007b`a638ef60 00007ffb`f1ad841d     System_Management_Automation!System.Management.Automation.Interpreter.LightLambda.RunVoid1<System.Management.Automation.Language.FunctionContext>+0xc6
11 0000007b`a638efe0 00007ffb`f1ad7e0d     System_Management_Automation!System.Management.Automation.DlrScriptCommandProcessor.RunClause+0x28d
12 0000007b`a638f070 00007ffb`f19fff15     System_Management_Automation!System.Management.Automation.DlrScriptCommandProcessor.Complete+0x11d
13 0000007b`a638f0e0 00007ffb`f1cbd0ed     System_Management_Automation!System.Management.Automation.CommandProcessorBase.DoComplete+0x85
14 0000007b`a638f130 00007ffb`f1cbcdc9     System_Management_Automation!System.Management.Automation.Internal.PipelineProcessor.DoCompleteCore+0x9d
15 0000007b`a638f1b0 00007ffb`f1ac7eab     System_Management_Automation!System.Management.Automation.Internal.PipelineProcessor.SynchronousExecuteEnumerate+0xc9
16 0000007b`a638f230 00007ffb`9525b18e     System_Management_Automation!System.Management.Automation.PipelineOps.InvokePipeline+0x33b
17 0000007b`a638f2d0 00007ffb`f1bac64e     System_Management_Automation!System.Management.Automation.Interpreter.ActionCallInstruction<object,bool,System.Management.Automation.CommandParameterInternal[][],System.Management.Automation.Language.CommandBaseAst[],System.Management.Automation.CommandRedirection[][],System.Management.Automation.Language.FunctionContext>.Run+0x21e
18 0000007b`a638f380 00007ffb`f1bac64e     System_Management_Automation!System.Management.Automation.Interpreter.EnterTryCatchFinallyInstruction.Run+0x7e
19 0000007b`a638f400 00007ffb`f1bb20d3     System_Management_Automation!System.Management.Automation.Interpreter.EnterTryCatchFinallyInstruction.Run+0x7e
1a 0000007b`a638f480 00007ffb`f1fa3e06     System_Management_Automation!System.Management.Automation.Interpreter.Interpreter.Run+0x33
1b 0000007b`a638f4d0 00007ffb`f1ad841d     System_Management_Automation!System.Management.Automation.Interpreter.LightLambda.RunVoid1<System.Management.Automation.Language.FunctionContext>+0xc6
1c 0000007b`a638f550 00007ffb`f1ad7e0d     System_Management_Automation!System.Management.Automation.DlrScriptCommandProcessor.RunClause+0x28d
1d 0000007b`a638f5e0 00007ffb`f19fff15     System_Management_Automation!System.Management.Automation.DlrScriptCommandProcessor.Complete+0x11d
1e 0000007b`a638f650 00007ffb`f1cbd0ed     System_Management_Automation!System.Management.Automation.CommandProcessorBase.DoComplete+0x85
1f 0000007b`a638f6a0 00007ffb`f1cbcdc9     System_Management_Automation!System.Management.Automation.Internal.PipelineProcessor.DoCompleteCore+0x9d
20 0000007b`a638f720 00007ffb`f1bd1117     System_Management_Automation!System.Management.Automation.Internal.PipelineProcessor.SynchronousExecuteEnumerate+0xc9
21 0000007b`a638f7a0 00007ffb`f1bd1923     System_Management_Automation!System.Management.Automation.Runspaces.LocalPipeline.InvokeHelper+0x507
22 0000007b`a638f850 00007ffb`f1bd2a7f     System_Management_Automation!System.Management.Automation.Runspaces.LocalPipeline.InvokeThreadProc+0x113
23 0000007b`a638f8b0 00007ffb`f2c763cd     System_Management_Automation!System.Management.Automation.Runspaces.PipelineThread.WorkerProc+0x2f
24 0000007b`a638f8e0 00007ffb`f4d6b8d3     System_Private_CoreLib!System.Threading.ExecutionContext.RunInternal+0x7d [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ExecutionContext.cs @ 179] 
25 0000007b`a638f950 00007ffb`f4c3ebac     coreclr!CallDescrWorkerInternal+0x83 [D:\a\_work\1\s\src\coreclr\vm\amd64\CallDescrWorkerAMD64.asm @ 100] 
26 0000007b`a638f990 00007ffb`f4d57b93     coreclr!DispatchCallSimple+0x60 [D:\a\_work\1\s\src\coreclr\vm\callhelpers.cpp @ 221] 
27 0000007b`a638fa20 00007ffb`f4cc4abd     coreclr!ThreadNative::KickOffThread_Worker+0x63 [D:\a\_work\1\s\src\coreclr\vm\comsynchronizable.cpp @ 158] 
28 (Inline Function) --------`--------     coreclr!ManagedThreadBase_DispatchInner+0xd [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 7222] 
29 0000007b`a638fa80 00007ffb`f4cc49d3     coreclr!ManagedThreadBase_DispatchMiddle+0x85 [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 7266] 
2a 0000007b`a638fb60 00007ffb`f4cc4b6e     coreclr!ManagedThreadBase_DispatchOuter+0xab [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 7425] 
2b (Inline Function) --------`--------     coreclr!ManagedThreadBase_FullTransition+0x28 [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 7470] 
2c (Inline Function) --------`--------     coreclr!ManagedThreadBase::KickOff+0x28 [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 7505] 
2d 0000007b`a638fc00 00007ffc`738e257d     coreclr!ThreadNative::KickOffThread+0x7e [D:\a\_work\1\s\src\coreclr\vm\comsynchronizable.cpp @ 230] 
2e 0000007b`a638fc60 00007ffc`7508aa48     KERNEL32!BaseThreadInitThunk+0x1d
2f 0000007b`a638fc90 00000000`00000000     ntdll!RtlUserThreadStart+0x28

@KalleOlaviNiemitalo

09 0000007ba638eb60 00007ffb95964a7c System_Management_Automation!System.Management.Automation.MemberInvocationLoggingOps.LogMemberInvocation+0x27b

Oh, LogMemberInvocation calls ArgumentToString here:

string value = ArgumentToString(args[i]);

So does that mean the multi-megabyte base64 string goes via Anti-Malware Scan Interface to Windows Defender…? I guess that would be a sensible design. And then perhaps the Defender implementation of AMSI makes a few more copies of the string.

This PowerShell code would apparently log the whole AMSI scan request to the console if you set __PSDumpAMSILogContent=1 in the environment before you start PowerShell.

Why does the AMSI scan take that long, though… does it do useful work all that time, or does it get stuck somehow and give up after a timeout? Perhaps you could try with files of different sizes and graph how the file size affects the FromBase64String duration. If the duration stays the same, then that suggests there is a timeout.
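A measurement along those lines could be sketched as follows, assuming random in-memory payloads instead of files. `Measure-FromBase64Duration` is a hypothetical helper, and the sizes here are kept small so the sketch finishes quickly; larger values would be needed to see the effect clearly.

```powershell
# Hypothetical helper: time FromBase64String for a series of payload sizes.
function Measure-FromBase64Duration {
    param (
        [Parameter(Mandatory)]
        [int[]]$SizeInBytes
    )

    $random = [System.Random]::new()
    foreach ($size in $SizeInBytes) {
        $bytes = [byte[]]::new($size)
        $random.NextBytes($bytes)
        $base64 = [System.Convert]::ToBase64String($bytes)

        # Time only the decoding step
        $watch = [System.Diagnostics.Stopwatch]::StartNew()
        $decoded = [System.Convert]::FromBase64String($base64)
        $watch.Stop()

        [pscustomobject]@{
            SizeMiB      = [math]::Round([double]$size / 1MB, 2)
            DecodedBytes = $decoded.Length
            Milliseconds = [math]::Round($watch.Elapsed.TotalMilliseconds, 1)
        }
    }
}

$results = Measure-FromBase64Duration -SizeInBytes 1MB, 2MB, 4MB
$results | Format-Table
```

A roughly constant Milliseconds column would point to a timeout, while values that grow with SizeMiB would point to work proportional to the input.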

@chopinrlz
Author

I ran a test using variable size byte arrays with random payloads starting at 2 MiB in size and going up to 116 MiB in size. You can see that the duration required is linear, and also extremely slow. It takes 2 seconds to convert 32 MiB back to a byte array from a base64 string.

data-size-chart

The same test conducted at 16 MiB intervals up to 256 MiB also shows a linear trend.

data-size-chart

One final test at 32 MiB intervals up to 384 MiB shows a linear trend as well suggesting that there may be no upper boundary or timeout no matter how much data you ask PowerShell to convert from base64.

data-size-chart

@rhubarb-geek-nz

rhubarb-geek-nz commented Apr 17, 2024

So does that mean the multi-megabyte base64 string goes via Anti-Malware Scan Interface to Windows Defender…? I guess that would be a sensible design.

Or perhaps you might be completely horrified by the idea of deep packet inspection of all arguments and no knowledge of whether that will be sent to 3rd parties. ( or any other party at all, to be honest )

@KalleOlaviNiemitalo

I hoped the graph might show a lower boundary, because it could indicate a configuration error that could then be fixed to speed up the operation; for example, if the AMSI code running in-process were unable to contact the Defender service and spent some constant amount of time attempting that. Alas, the linear graph doesn't look like that's the case.

There may be ways to change the PowerShell script so that, even though it still triggers the suspicious content detector and causes an AMSI scan, the argument list being scanned does not include the base64 data and the scan finishes faster. But if such a workaround becomes commonly used, I suspect a future version of PowerShell will be changed to scan the data anyway.

@rhubarb-geek-nz

If I am understanding this nonsense with AMSI correctly, then a solution would be to perform the Base64 translation in a compiled C# cmdlet. Given that we are talking about PowerShell, it should be implemented using a pipeline with System.Security.Cryptography.ToBase64Transform rather than dealing with massive strings. In this case the biggest string would be 64 characters.

@KalleOlaviNiemitalo

AMSI logging of method invocations was added as an experimental feature in #16496 and changed to non-experimental in #18041. I'm not sure it even uses the suspicious content detector; perhaps the difference between ToBase64String and FromBase64String is that ArgumentToString does not format the elements of a byte[] argument for AMSI, but passes a string argument through.

A slowdown was previously reported in #19431.

@rhubarb-geek-nz

I am not seeing similar times

$bytes = new-object byte[] -ArgumentList @(,200554320)
$random = new-object Random
$random.NextBytes($bytes)
$now = Get-Date
$base64 = [System.Convert]::ToBase64String( $bytes )
Write-Host " $(((Get-Date) - $now).TotalMilliseconds) ms"

Took 442.478 ms on a little Intel(R) Core(TM) i3-10100Y CPU @ 1.30GHz 1.61 GHz running Windows 11 Pro

@rhubarb-geek-nz

rhubarb-geek-nz commented Apr 17, 2024

AMSI logging of method invocations was added as an experimental feature in #16496 and changed to non-experimental in #18041.

Where can I find documentation on what this actually does? I am personally horrified by the idea that anyone thinks they have the rights to log data that was private to a process without their knowledge.

When I say POWERSHELL_TELEMETRY_OPTOUT=1 I mean it!

@KalleOlaviNiemitalo

KalleOlaviNiemitalo commented Apr 17, 2024

I am not seeing similar times

@rhubarb-geek-nz, your script uses ToBase64String, not FromBase64String.

Where can I find documentation on what this actually does?

The best may be the documentation of the PSAMSIMethodInvocationLogging experimental feature in this old version: https://github.com/MicrosoftDocs/PowerShell-Docs/blob/793ed5c687e6c7b64565d1751c532eb1d7d84209/reference/docs-conceptual/learn/experimental-features.md#psamsimethodinvocationlogging

The "How AMSI helps" link in that documentation doesn't work on GitHub; use https://learn.microsoft.com/windows/win32/amsi/how-amsi-helps instead.

AMSI doesn't necessarily involve telemetry that would send the data off the machine. I don't know whether Windows Defender has telemetry for AMSI scans.


@KalleOlaviNiemitalo

There may be ways to change the PowerShell script so that, even though it still triggers the suspicious content detector and causes an AMSI scan, the argument list being scanned does not include the base64 data and the scan finishes faster.

Because ArgumentToString does not recognise the char[] type and returns only the type name, I think a [System.Convert]::FromBase64CharArray call should be much faster for AMSI to scan than [System.Convert]::FromBase64String. But who knows how long that will remain so.
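A sketch of what that would look like, using a small random payload. Whether it actually shortens the scan depends on the ArgumentToString behavior described above, which I have not measured here and which may change in a later release.

```powershell
# Build a base64 string to decode
$bytes = [byte[]]::new(1MB)
[System.Random]::new().NextBytes($bytes)
$base64 = [System.Convert]::ToBase64String($bytes)

# Hand the decoder a char[] instead of a string.
# Note that ToCharArray() creates another full copy of the text in memory.
$chars = $base64.ToCharArray()
$decoded = [System.Convert]::FromBase64CharArray($chars, 0, $chars.Length)

"Decoded $($decoded.Length) bytes"
```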

@rhubarb-geek-nz

rhubarb-geek-nz commented Apr 17, 2024

My workaround is to use two cmdlets to do the Base64 conversions, thereby bypassing the PSAMSIMethodInvocationLogging nonsense.

Original timings

 218.4235 ms
 64152.6876 ms

now

 222.2945 ms
 301.4766 ms

Code

#!/usr/bin/env pwsh

$env:__PSDumpAMSILogContent='1'

trap
{
	throw $PSItem
}

$ErrorActionPreference = 'Stop'

$code = @"
using System;
using System.Management.Automation;

	[Cmdlet("ConvertFrom", "Base64String")]
	public class ConvertFromBase64String : PSCmdlet
	{
		[Parameter(Mandatory=true,ValueFromPipeline=true)]
		public String InputString;

		protected override void ProcessRecord()
		{
			WriteObject(System.Convert.FromBase64String(InputString));
		}
	}

	[Cmdlet("ConvertTo", "Base64String")]
	public class ConvertToBase64String : PSCmdlet
	{
		[Parameter(Mandatory=true,ValueFromPipeline=true)]
		public byte[] InputObject;

		protected override void ProcessRecord()
		{
			WriteObject(System.Convert.ToBase64String(InputObject));
		}
	}
"@

Add-Type $code -PassThru | ForEach-Object { Import-Module $_.Assembly }

Get-Command -Noun 'Base64String'

$bytes = new-object byte[] -ArgumentList @(,200554320)

$bytes.Length

$random = new-object Random

$random.NextBytes($bytes)

$now = Get-Date

$base64 = @(,$bytes) | ConvertTo-Base64String

Write-Host " $(((Get-Date) - $now).TotalMilliseconds) ms"

$base64.Length

$bytes = $null

$now = Get-Date

$bytes = $base64 | ConvertFrom-Base64String

Write-Host " $(((Get-Date) - $now).TotalMilliseconds) ms"

$bytes.Length

@chopinrlz
Author

My workaround is to use two cmdlets to do the Base64 conversions, thereby bypassing the PSAMSIMethodInvocationLogging nonsense.

Thank you, this is very helpful. I was fiddling with using C# and Add-Type, but I noticed that making the call directly in C# via a static function still invokes the AMSI context. I will incorporate this into my PowerPass module to avoid the performance issue and excessive memory usage.

@mklement0
Contributor

mklement0 commented Apr 17, 2024

@chopinrlz, it isn't calls from C# that trigger an AMSI scan, it is .NET method calls from PowerShell code.

@rhubarb-geek-nz's workaround based on using C# code to implement cmdlets avoids this by letting PowerShell itself mediate the method calls.


Two asides:

@rhubarb-geek-nz

  • $env:__PSDumpAMSILogContent='1' isn't effective in-session; the env. var. must be set before calling PowerShell.

Not actually true: it merely needs to be set before the first access to DumpLogAMSIContent, because the environment variable is read through a lazy load. E.g. #21492

        private static readonly Lazy<bool> DumpLogAMSIContent = new Lazy<bool>(
            () => {
                object result = Environment.GetEnvironmentVariable("__PSDumpAMSILogContent");
                if (result != null && LanguagePrimitives.TryConvertTo(result, out int value))
                {
                    return value == 1;
                }
                return false;
            }
        );
  • $bytes = new-object byte[] -ArgumentList @(,200554320) can be simplified to
    $bytes = [byte[]]::new(200554320), which also avoids an (invisible) [psobject] wrapper that would cause an (unrelated) problem ....

Of course, how could we not have invisible, non-obvious problems in the simplest of code.

@mklement0
Contributor

mklement0 commented Apr 18, 2024

@rhubarb-geek-nz:

Not actually true, because for predictable diagnostic output you indeed do need to set the environment variable first, as evidenced by the following:

$null, 1 | % {
  Write-Host ---
  $env:__PSDumpAMSILogContent = $_
  pwsh -noprofile { [byte[]]::new(0) } 
}
$env:__PSDumpAMSILogContent = $null

Of course, how could we not have invisible, non-obvious problems in the simplest of code.

I assume this is pure sarcasm (which I do not endorse, but I empathize with the frustration I presume to underlie it); if there's an actual argument in there (beyond what #21496 expresses), please tell us.

@rhubarb-geek-nz

rhubarb-geek-nz commented Apr 18, 2024

Not actually true, because for predictable diagnostic output you indeed do need to set the environment variable first, as evidenced by the following:

The script was predictable because it had the "#!/usr/bin/env pwsh" at the start, had the executable bit set, and was designed to run directly from bash. It sets the environment variable after PowerShell has started but before the first reflection invocation.

if there's an actual argument in there, please tell us.

The frustration is because every time you think you have found the solution with PowerShell, there is always another reason, case, exception or scenario where it breaks. As a user you don't have the tools to see all these problems because the very objects themselves play stupid games, pretending to be something they are not or changing from what you thought they should have been. I can only assume I am not the target audience for this tool despite it supposedly being for system administrators, developers and IT professionals.

@mklement0
Contributor

The script was predictable

No, it isn't predictable. My previous example stands. If the calling process doesn't have the environment variable __PSDumpAMSILogContent set before invocation, a call to [byte[]]::new(0) will not dump the diagnostic AMSI information; if the variable is already set, it will.

there is always another reason, case, exception or scenario where it breaks.

Again I empathize.
But simply venting your frustration isn't the way forward.

In the case at hand I've (indirectly) pointed to the (ultimate) root cause of the underlying problem - #5579.
I suggest channeling your frustration into constructive feedback - while being cognizant that such feedback may or may not be heard.

@rhubarb-geek-nz

rhubarb-geek-nz commented Apr 18, 2024

The script was predictable

No, it isn't predictable. My previous example stands. If the calling process doesn't have the environment variable __PSDumpAMSILogContent set before invocation, a call to [byte[]]::new(0) will not dump the diagnostic AMSI information; if the variable is already set, it will.

Yes, you are absolutely right. It wasn't predictable because you might have been using PowerShell as your default shell to launch scripts. Whereas all other UNIX shells start a script with the executable bit set in a new process, we are talking about PowerShell here. Sigh.

Perhaps a recommendation for running test scripts is "In a new process....", not "In whatever process with whatever indeterminate state you happen to have....."

@mklement0
Contributor

mklement0 commented Apr 18, 2024

It wasn't predictable because you might have been using PowerShell as your default shell

I was, but that is irrelevant: the only thing that matters is whether the calling process had a __PSDumpAMSILogContent variable defined or not, so this equally applies to POSIX-compatible shells.
See below.

@mklement0
Contributor

mklement0 commented Apr 18, 2024

P.S.: @rhubarb-geek-nz:

  • I haven't looked into why the apparent attempt at honoring an in-process definition of the variable (private static readonly Lazy<bool> DumpLogAMSIContent) doesn't work reliably.

    • Update: The reason is that the very first .NET method call from PowerShell code in a session locks in the value of DumpLogAMSIContent.Value based on whether env. var. __PSDumpAMSILogContent is defined (and set to 1) then. In an interactive session, it is invariably the PSReadLine module that is the first to make such a call ([Microsoft.PowerShell.PSConsoleReadLine]::ReadLine()), so that subsequent in-process attempts to set __PSDumpAMSILogContent are ineffective.
  • However, generally speaking, in cases where PowerShell honors environment variables, they are expected to be set before PowerShell is launched.

    • Update: There is at least one exception: $env:PSModulePath is honored dynamically, on every access; the lazy once-per session initialization of DumpLogAMSIContent is an unfortunate hybrid between static and dynamic behavior.
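The "locked in on first access" behavior described above is a general property of Lazy<T> and can be reproduced in isolation. The following is only a minimal sketch: DEMO_LAZY_FLAG is a hypothetical variable name chosen for this illustration, and the script block stands in for the factory shown earlier in the thread; it does not touch PowerShell's own DumpLogAMSIContent field.

```powershell
# Hypothetical variable name, used only for this illustration.
$env:DEMO_LAZY_FLAG = $null

# The factory script block runs once, on the first read of .Value.
$flag = [Lazy[bool]]::new([Func[bool]] { $env:DEMO_LAZY_FLAG -eq '1' })

$first = $flag.Value          # factory runs now, while the variable is unset
$env:DEMO_LAZY_FLAG = '1'     # too late: the result is already cached
$second = $flag.Value         # returns the cached result, factory is not re-run

"first=$first second=$second" # prints: first=False second=False

# Clean up the demo variable.
$env:DEMO_LAZY_FLAG = $null
```

Setting the variable before the first read of .Value would instead cache True, which matches the observation that the in-session assignment only works if nothing has read the value yet.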

@rhubarb-geek-nz

rhubarb-geek-nz commented Apr 18, 2024

the only thing that matters is whether the calling process had a __PSDumpAMSILogContent variable defined or not, so this equally applies to POSIX-compatible shells.

I beg to offer a different opinion...

$ ls -ld new.ps1
-rwxr-xr-x 1 me users 139 Apr 18 00:57 new.ps1
$ cat new.ps1
#!/usr/bin/env pwsh
$env:__PSDumpAMSILogContent='1'
$bytes1024 = new-object byte[] -ArgumentList @(,1024)
$bytes2048 = [byte[]]::new(2048)

Scenario A - The environment variable is not set in the calling process

$ echo $__PSDumpAMSILogContent

$ ./new.ps1

=== Amsi notification report content ===
<System.Byte[]>.new(<2048>)
=== Amsi notification report success: False ===

Scenario B - it is set to 0 in the calling process

$ __PSDumpAMSILogContent=0
$ echo $__PSDumpAMSILogContent
0
$ ./new.ps1

=== Amsi notification report content ===
<System.Byte[]>.new(<2048>)
=== Amsi notification report success: False ===

@mklement0
Contributor

mklement0 commented Apr 18, 2024

You're right - via the CLI (as implicitly used via a shebang-based executable shell script), the in-process setting is honored, if:

  • the CLI call uses either (possibly implied) -File or -Command for execute-and-exit functionality.
  • and $env:__PSDumpAMSILogContent = 1 is set before any in-session .NET method calls occur from PowerShell code.

A simpler demonstration: Start a pristine POSIX-compatible shell and run the following:

export -n __PSDumpAMSILogContent # ensure that the env. var. isn't defined.

# AMSI log output via env. var. defined BEFORE 
__PSDumpAMSILogContent=1 pwsh -noprofile -c '$null = [byte[]]::new(2048)'

# !! Produces AMSI output too, because the environment variable - despite being set in-session - is
# !! set *before the first method call*.
pwsh -noprofile -c '$env:__PSDumpAMSILogContent = 1; $null = [byte[]]::new(2048)'

Note:

  • From inside PowerShell, an executable shell script with extension .ps1 is still executed in-process - and using a filename extension with an executable shell script is generally ill-advised.

  • The bigger picture here is: To make PowerShell configuration environment variables work predictably, set them before invoking PowerShell, irrespective of the invocation mechanism.

@rhubarb-geek-nz

Rather than making a cmdlet for every .NET method you wish to call, you can simply put reflection in a single cmdlet.

$bytes = [byte[]]@(1,2,3)
$base64 = [string](Invoke-Reflection -Method ToBase64String -Type ([System.Convert]) -ArgumentList @(,$bytes))
Invoke-Reflection -Method FromBase64String -Type ([System.Convert]) -ArgumentList @(,$base64) | Format-Hex

@rhubarb-geek-nz

rhubarb-geek-nz commented Apr 19, 2024

and using a filename extension with an executable shell script is generally ill-advised.

Really? That is one I have not heard of.... eg

$ ls -ld *.sh
-rwxr-xr-x 1 github users  772 May 18  2023 debug.sh
-rw-r--r-- 1 github users  107 May 18  2023 download.sh
-rwxr-xr-x 1 github users 2094 May 18  2023 generate-icns.sh
-rwxr-xr-x 1 github users 7418 May 18  2023 install-powershell.sh
-rwxr-xr-x 1 github users 7307 May 18  2023 installpsh-amazonlinux.sh
-rwxr-xr-x 1 github users 9229 May 18  2023 installpsh-debian.sh
-rwxr-xr-x 1 github users 7791 May 18  2023 installpsh-gentoo.sh
-rw-r--r-- 1 github users 7533 May 18  2023 installpsh-mariner.sh
-rwxr-xr-x 1 github users 6483 May 18  2023 installpsh-osx.sh
-rwxr-xr-x 1 github users 6425 May 18  2023 installpsh-redhat.sh
-rwxr-xr-x 1 github users 9081 May 18  2023 installpsh-suse.sh

If you mean executable PowerShell scripts without the ps1 extension, we know how that ends up.

@rhubarb-geek-nz

Another alternative is to do the reflection directly in PowerShell itself

ToBase64String is

$method = ([System.Convert]).GetMethod('ToBase64String',[type[]]@(,([byte[]])))

$base64 = [string]($method.Invoke($null,@(,$bytes)))

FromBase64String is

$method = ([System.Convert]).GetMethod('FromBase64String',[type[]]@(,([string])))

$bytes = $method.Invoke($null,@(,$base64))

Then the AMSI logging just looks like

=== Amsi notification report content ===
<System.Random>.NextBytes(<System.Byte[]>)
=== Amsi notification report success: False ===

=== Amsi notification report content ===
<System.RuntimeType>.GetMethod(<ToBase64String>, <System.Type[]>)
=== Amsi notification report success: False ===

=== Amsi notification report content ===
<System.Reflection.RuntimeMethodInfo>.Invoke(<null>, <System.Object[]>)
=== Amsi notification report success: False ===

=== Amsi notification report content ===
<System.RuntimeType>.GetMethod(<FromBase64String>, <System.Type[]>)
=== Amsi notification report success: False ===

=== Amsi notification report content ===
<System.Reflection.RuntimeMethodInfo>.Invoke(<null>, <System.Object[]>)
=== Amsi notification report success: False ===

Where the arguments are not dumped because all it prints is System.Object[]

@mklement0
Contributor

Really? That is one I have not heard of...

Unfortunately, many ill-advised practices are common.
An executable shell script (using a shebang line) is an executable like any other, and there is no benefit to signaling to a caller that a given executable happens to be a shell script, which is (a) an implementation detail and (b) may lead users to believe that sh <script>.sh should be used for invocation, which can fail if the script uses Bashisms, for instance.

With PowerShell, specifically, things get tricky (leaving the bug you mention aside), because, unlike analogous shell scripts for POSIX-compatible shells, an executable, shebang line-based .ps1 file is still executed in-process, with the potential to alter the session state. An executable, shebang line-based .ps1 file must therefore be designed with this in mind.

One without this extension consistently runs in a child process - albeit more slowly and at the expense of not having rich type support in the in- and output and the inability to pass array arguments and arguments that have no string-literal representations - but a PowerShell script that is designed to (also) run as a standalone executable should not rely on these features anyway.

I presume it is the latter limitations that explain why - at least in my perception - shebang line-based PowerShell scripts haven't really caught on and why bugs such as #21402 are still not fixed.

@rhubarb-geek-nz

Really? That is one I have not heard of...

Unfortunately, many ill-advised practices are common.

It depends on the context. If you mean a program that is found via the PATH then I might agree, but in general when you are managing large numbers of scripts to perform tasks then keeping the .sh extension is very useful. UNIX exec() does not care about file extensions for executables; the concept of file extensions does not exist within the POSIX C API. You are free to name executable files how you like. One major advantage of maintaining the .sh extension is when you manage them in a source code repository and you are storing text, not a compiled binary. Keeping the extension makes that absolutely obvious.

It is a Windows-ism to step through extensions (com, bat, exe, cmd) while looking for commands on the path or in the local directory, and PowerShell similarly tries appending .ps1 when looking for a command.

@mklement0
Contributor

@rhubarb-geek-nz , we're getting far afield, but let me attempt a summary of the issue at hand first, which implies that there's likely nothing actionable here:

  • I presume that there's no actual memory leak here, only a memory "grab" by the CLR that isn't released, at least not instantly (perhaps on demand?).

  • The behavior is currently by design, and the only pathological case is an attempt to pass a very large string as an argument to a .NET method. Workarounds have been offered:

    • Per @KalleOlaviNiemitalo's comment, using [System.Convert]::FromBase64CharArray() bypasses the problem.

    • Per your own comment, reflection can be used to bypass the problem.

  • A fundamental solution would be to allow opt-out of AMSI calls - with obvious security implications - which you've asked for in How can I disable PSAMSIMethodInvocationLogging #21491

  • Also, given the currently unnecessary overhead on Unix-like platforms - where no AMSI equivalent exists - runtime performance could be improved: AMSI logging implemented on Linux #21492 (comment)
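As a rough sketch of the FromBase64CharArray() workaround listed above: the idea is to hand the decoder a char[] rather than a string. The AMSI dumps quoted earlier in this thread show array arguments logged by type name only (e.g. <System.Byte[]>); that a char[] argument is treated the same way is an assumption here, not something verified in this sketch. The small sample data is illustrative only.

```powershell
# Small sample payload; the issue concerns inputs of hundreds of megabytes.
$bytes  = [byte[]](1..32)
$base64 = [System.Convert]::ToBase64String($bytes)

# Convert the string to char[] once, then decode from the array.
# Assumption: the array argument is logged by type name rather than by content.
$chars   = $base64.ToCharArray()
$decoded = [System.Convert]::FromBase64CharArray($chars, 0, $chars.Length)

# Verify the round trip by comparing the joined element lists.
($decoded -join ',') -eq ($bytes -join ',')
```

Whether the ToCharArray() call itself avoids the cost for very large strings would need to be measured with __PSDumpAMSILogContent enabled.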


Returning to the tangent:

If you mean a program that is found via the PATH then I might agree

When it comes to naming a stand-alone executable, it seems to me that the end-user experience should be the driver, trumping any design-time / implementation considerations:

  • On Unix-like platforms, this means: Do not use filename extensions when naming such executables.

  • On Windows, this means: Given that .ps1 files aren't directly executable from outside PowerShell, create companion .cmd files that are - both with and without specifying .cmd - using @"%~dpn0.ps1" %*

@KalleOlaviNiemitalo

Per your own comment, reflection can be used to bypass the problem.

I'm half expecting you to make PowerShell recognise MethodInfo.Invoke calls and log each element of object?[]? parameters to AMSI as if the method had been called directly.

@mklement0
Contributor

@KalleOlaviNiemitalo, fair point: Both of the aforementioned workarounds amount to bypassing the intended AMSI calls - I merely summarized them, speaking as someone who's neither a security expert nor speaking in any official capacity.

@rhubarb-geek-nz

Let's go back to the original problem.

Reading into memory and converting to base64 then converting back should require about 790 MB of RAM with all variables remaining in scope during the process and no garbage collection happening or object disposal happening. The observed behavior appears to be memory-leak related as the amount of memory used once the conversion eventually completes is about 3.4 GB of RAM.

Since the early days of computers we have been able to deal with files larger than the available memory of the computer. This is still the case.

The first thing to realise is

(a) PowerShell is not a UNIX shell and it is really really bad at dealing with streams of bytes. That is not a problem of the PowerShell engine itself, but the existing cmdlets, scripts, patterns and expectations. PowerShell deals with pipelines of typed objects, not text or byte streams.

(b) UNIX does this kind of thing in its sleep, literally. A pipe is a byte stream first and foremost. Deciding to treat it as text is an afterthought.

So if we were doing this in UNIX we would simply do

$ openssl base64 < file.in | openssl base64 -d > file.out

The file went through the memory as it was being processed and then out to the final file.

Now let's do the same thing with PowerShell, $file is the sdk exe, $copy is a 2nd copy we are making

Split-Content -LiteralPath $file -AsByteStream | ConvertTo-Base64 | ConvertFrom-Base64 | Set-Content -LiteralPath $copy -AsByteStream

When you put that pipeline together it takes only about 50MB working set in order to process dotnet-sdk-8.0.204-win-x64.exe and write a copy of the output.

Validate it and compare with the SHA512 from the original download site

Get-FileHash -LiteralPath $file,$copy -Algorithm SHA512

So how does that work?

Split-Content reads a file and writes arrays of 4096 bytes to the success pipeline

ConvertTo-Base64 reads the byte arrays and writes out lines of Base64 encoding of just 64 characters each, same as openssl base64.

ConvertFrom-Base64 reads the strings and converts them to byte arrays.

Set-Content writes the byte arrays to the final file.

It only took about 27MB to read, encode and then decode the base64, without writing to a file.

$total = 0
	
Split-Content -LiteralPath $file -AsByteStream | ConvertTo-Base64 | ConvertFrom-Base64 | ForEach-Object { $total += $_.Length }

$total

So from 3.4GB to 27MB with no change to PowerShell itself is not a bad effort.

It was a trade-off of space versus time. It takes about 7 seconds or so to run the read, encode and decode pipeline.
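For readers without the Split-Content, ConvertTo-Base64 and ConvertFrom-Base64 modules installed, the same chunk-at-a-time idea can be sketched with plain .NET streams. This is not the implementation of those cmdlets: Copy-ViaBase64Chunks is a hypothetical helper name, the 3072-byte chunk size is an arbitrary choice, and both paths are assumed to be full paths because .NET file APIs do not follow PowerShell's current location. Each chunk is encoded and decoded independently, so only one small buffer and one short string are alive at a time.

```powershell
# Hypothetical helper: round-trips a file through Base64 one chunk at a time.
function Copy-ViaBase64Chunks {
    param(
        [Parameter(Mandatory)] [string] $Source,       # full path to the input file
        [Parameter(Mandatory)] [string] $Destination,  # full path to the output file
        [int] $ChunkSize = 3072                        # arbitrary small buffer size
    )
    $reader = [System.IO.File]::OpenRead($Source)
    try {
        $writer = [System.IO.File]::Create($Destination)
        try {
            $buffer = [byte[]]::new($ChunkSize)
            # Read returns 0 at end of file; it may return fewer bytes than requested.
            while (($read = $reader.Read($buffer, 0, $buffer.Length)) -gt 0) {
                # Encode only the bytes actually read, then decode that same chunk.
                $text    = [System.Convert]::ToBase64String($buffer, 0, $read)
                $decoded = [System.Convert]::FromBase64String($text)
                $writer.Write($decoded, 0, $decoded.Length)
            }
        }
        finally { $writer.Dispose() }
    }
    finally { $reader.Dispose() }
}

# Usage with a small throwaway file in the temp directory.
$src = Join-Path ([System.IO.Path]::GetTempPath()) 'chunk-demo.in'
$dst = Join-Path ([System.IO.Path]::GetTempPath()) 'chunk-demo.out'
[System.IO.File]::WriteAllBytes($src, [byte[]](0..255 * 40))
Copy-ViaBase64Chunks -Source $src -Destination $dst
$same = (Get-FileHash -LiteralPath $src).Hash -eq (Get-FileHash -LiteralPath $dst).Hash
$same
Remove-Item -LiteralPath $src, $dst
```

Note that every .NET call in the loop is still a method call from PowerShell code, so the invocation logging discussed in this issue still applies; the difference is that each logged string argument is only a few kilobytes long.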

@mklement0
Contributor

mklement0 commented Apr 20, 2024

Yes, prior to PS 7.4 raw byte handling in pipelines wasn't supported, but in 7.4+ it now is, between external (native) programs, so the following works as intended from PowerShell (also on Windows, if you install openssl.exe there); note the use of -in to specify the input file:

# OK in PS 7.4+
openssl base64 -in file.in | openssl base64 -d > file.out

I haven't looked into the implementation, but I assume (hope) that on Unix-like platforms the usual system-level data buffering applies, which is 64KB these days.

While < isn't available in PowerShell to byte-stream data to an external program, in v7.4+ you can feed [byte] or - much more efficiently - [byte[]] data output from PowerShell commands to external programs:

The - slow - solution is therefore (byte-by-byte processing on the PowerShell side):

Get-Content file.in -AsByteStream | openssl base64 | openssl base64 -d > file.out

The - much faster - solution, which, however, reads the input file in full, due to -Raw:

Get-Content file.in -Raw -AsByteStream |  openssl base64 | openssl base64 -d > file.out

The - more memory-efficient - solution that emulates Unix pipeline buffering is:

Get-Content file.in -ReadCount 64kb -AsByteStream | 
  % { , [byte[]] $_ } |
  openssl base64 | openssl base64 -d > file.out

Note the - unfortunate in terms of both verbosity and performance - need for an intermediate % (ForEach-Object) call that strongly types the [object[]]-typed arrays that -ReadCount produces as [byte[]], as that is the prerequisite for sending raw byte data to an external program.

Arguably, Get-Content's -ReadCount parameter should instead output:

  • [string[]] arrays by default

  • [byte[]] arrays in combination with -AsByteStream

This would obviate the need for the inefficient and awkward ForEach-Object helper call.
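To see the array typing described above for yourself, a quick check along these lines can help. It is only a sketch: the file name is a throwaway chosen for this illustration, and the output is deliberately not stated here because it depends on the PowerShell version in use.

```powershell
# Create a small throwaway file containing 10 bytes.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'readcount-demo.bin'
[System.IO.File]::WriteAllBytes($path, [byte[]](1..10))

# Report the type of each chunk as emitted, and after the [byte[]] cast.
$report = Get-Content -LiteralPath $path -AsByteStream -ReadCount 4 | ForEach-Object {
    $typed = [byte[]] $_
    '{0} -> {1} ({2} bytes)' -f $_.GetType().Name, $typed.GetType().Name, $typed.Length
}
$report

Remove-Item -LiteralPath $path
```

If the first type name printed on each line is Object[], the intermediate cast is still needed before piping to an external program.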

@rhubarb-geek-nz

rhubarb-geek-nz commented Apr 20, 2024

The - more memory-efficient - solution that emulates Unix pipeline buffering is:

Get-Content file.in -ReadCount 64kb -AsByteStream | 
  % { , [byte[]] $_ } 

I did not have much success with Get-Content with ReadCount even in binary mode; I did not think of the array conversion in a ForEach-Object.

Hence I wrote the Split-Content which reads directly into a byte array and put that straight in the output pipeline. No need to convert any arrays.

I am not convinced that large buffers like 64K help in the PowerShell pipeline, because it has to fill the entire 64KB first before it passes onto the pipeline. The buffering in UNIX works the other way round, things can keep writing until the pipe buffer is full then they block until the reader has made some room.

A UNIX pipeline has a record size of 1. The PowerShell pipeline above has a record size of 64K, so nothing can move until the record is full. In UNIX if a network stream is slow then even the few hundred bytes at a time would still dribble through.

It would certainly be better if Get-Content always wrote AsByteStream as a byte array but I think it is too late to change that.

@mklement0
Contributor

because it has to fill the entire 64KB first before it passes onto the pipeline

Yes, it's an imperfect emulation of the native Unix pipeline, but with file input (where there's no "dribbling"), it works well.

That said, it's rare for Unix-heritage utilities to accept input via stdin (the pipeline) only and not also via file-path operands; thus, with a file as the data source, passing the file's path as an argument to an external program is the simpler and better solution (such as in the openssl case, using the - syntactically unusual - -in parameter).

It would certainly be better if Get-Content always wrote AsByteStream as a byte array but I think it is too late to change that.

Hopefully not: Let's see what becomes of the feature request you've since created:

@chopinrlz
Author

chopinrlz commented May 2, 2024

Split-Content -LiteralPath $file -AsByteStream | ConvertTo-Base64 | ConvertFrom-Base64 | Set-Content -LiteralPath $copy -AsByteStream

Thank you for showing me this technique. So what I understand is happening with the PowerShell pipeline is that Split-Content instantiates a small memory buffer and fills it with bytes from the file, then the pipeline hands it to ConvertTo-Base64 and so forth down the pipeline, such that the same small memory buffer is reused for each read operation on the $file and passed through the pipeline with each read iteration.

@rhubarb-geek-nz do you have your cmdlet source on Github?

@rhubarb-geek-nz

rhubarb-geek-nz commented May 2, 2024

@rhubarb-geek-nz do you have your cmdlet source on Github?

Yes, they are on PSGallery and each entry has a Project link which takes you to github, likewise, the releases pages on github have a link to PSGallery

PSGallery rhubarb-geek-nz.SplitContent/1.0.0
PSGallery rhubarb-geek-nz.Base64/1.0.0
PSGallery rhubarb-geek-nz.Joinery/1.0.0

github rhubarb-geek-nz/SplitContent
github rhubarb-geek-nz/Base64
github rhubarb-geek-nz/Joinery

@rhubarb-geek-nz

the same small memory buffer is reused for each read operation on the $file

Yes, but a new byte array is written to the output pipeline. So the same total amount of memory is allocated, just not all at the same time.
