[System.Convert]::FromBase64String causes memory leak with large strings #21473
Hi, given that all of your file and data operations are performed directly with the .NET API, I suggest this is not a PowerShell problem; it looks like a .NET issue. Is your issue that PowerShell is not running the garbage collector? Does running [System.GC]::Collect() make any difference?
Also be aware that managed-memory runtimes often take memory from the OS in order to allocate objects but never return it. So the objects may have been released/disposed/freed, and the CLR heap itself is free, but the memory has not been returned to the OS. This is completely normal: managed runtimes (CLR, JVM, etc.) assume that if they needed the memory once they are likely to need it again, so there is no point giving it back to the OS. You would need tools that examine the state of the CLR heap within the process rather than external process-monitoring tools. When you have an operation that you know is memory intensive, there are other options available as well.
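One quick check along those lines: force a full collection, then ask the CLR how much managed memory is actually live. These are standard GC APIs; the sketch below distinguishes "still referenced" from "freed but not yet returned to the OS":

```
# Force a full, blocking garbage collection (run finalizers in between)
[System.GC]::Collect()
[System.GC]::WaitForPendingFinalizers()
[System.GC]::Collect()

# Report the live managed heap size. If this number is small while Task
# Manager still shows a large working set, the memory is merely being
# retained by the runtime, not leaked.
'{0:N0} bytes of live managed memory' -f [System.GC]::GetTotalMemory($true)
```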
I hope that helps.
I cannot replicate the slowness you see, as 7.4.2 takes less than 5 seconds for me and is in fact faster than WinPS, but I do notice the large memory usage. My guess is that you are now storing not only the two byte arrays (the raw file data and the decoded base64 data) but also the base64 string itself, all allocated on the heap as part of the operation. Potentially WinPS/.NET Framework is more aggressive about reusing the array values, but as per the above, the CLR could be allocating the memory and simply never freeing it so that it can more efficiently reuse it in the future. Putting that aside, you can base64-encode bytes more efficiently by streaming them rather than reading all the input bytes into memory: Function ConvertTo-Base64String {
[OutputType([string])]
[CmdletBinding()]
param (
[Parameter(Mandatory)]
[string]$Path
)
$fs = $cryptoStream = $sr = $null
try {
$fs = [System.IO.File]::OpenRead($Path)
$cryptoStream = [System.Security.Cryptography.CryptoStream]::new(
$fs,
[System.Security.Cryptography.ToBase64Transform]::new(),
[System.Security.Cryptography.CryptoStreamMode]::Read)
$sr = [System.IO.StreamReader]::new($cryptoStream, [System.Text.Encoding]::ASCII)
$sr.ReadToEnd()
}
finally {
${sr}?.Dispose()
${cryptoStream}?.Dispose()
${fs}?.Dispose()
}
}

This will stream the raw bytes from the source file and produce the final output string. If you are storing this string into a file, you could optimize it further by streaming the output base64 CryptoStream to a file, avoiding having to hold all the data in PowerShell. If you do need to store the base64 string as an object in PowerShell, keep in mind this means you have to store the inflated size that base64 uses (roughly 4/3 of the raw byte count).
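For completeness, here is a minimal sketch of the file-to-file variant mentioned above. The cmdlet name ConvertTo-Base64File and the parameter names are hypothetical; it chains the same ToBase64Transform CryptoStream straight into an output FileStream, so the full base64 string is never materialized in memory:

```
Function ConvertTo-Base64File {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory)][string]$Path,
        [Parameter(Mandatory)][string]$Destination
    )
    $fs = $cryptoStream = $out = $null
    try {
        $fs = [System.IO.File]::OpenRead($Path)
        $cryptoStream = [System.Security.Cryptography.CryptoStream]::new(
            $fs,
            [System.Security.Cryptography.ToBase64Transform]::new(),
            [System.Security.Cryptography.CryptoStreamMode]::Read)
        $out = [System.IO.File]::Create($Destination)
        # Copy the base64-encoded bytes straight to the output file;
        # only a small internal buffer is in memory at any one time.
        $cryptoStream.CopyTo($out)
    }
    finally {
        ${out}?.Dispose()
        ${cryptoStream}?.Dispose()
        ${fs}?.Dispose()
    }
}
```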
Thank you both for your responses and suggestions for optimization. I did reply to the issue cross-posted in the dotnet/runtime project, which you can see here: dotnet/runtime#101061 (comment). UPDATE: Confirmed below that the FromBase64String delay and excessive memory usage only occur through PowerShell 7, not through a .NET console app. I am aware that storing a 200 MB file in memory as base64 text is wildly inefficient and is not how my PowerPass module is intended to be used (which is how I discovered this in the first place), but since I stumbled upon this unexpected behavior I thought it prudent to at least report it. Again, I appreciate all of the comments and feedback here, especially the suggestions for optimization techniques.
Yes, the same memory usage, but time to complete is about 1.2 seconds.
I retested the following updated script on my desktop PC: a Ryzen 5800X with 128 GB of RAM and PCIe Gen4 NVMe storage. The test ran much faster as expected, but the memory usage still remains high, even after invoking a garbage collection. Reading the 222 MB file into memory takes 0.06 seconds, and converting it to base64 takes 0.22 seconds, using 845 MB of RAM across both operations as expected. The last operation, FromBase64String, is the one that misbehaves. I'll cross-post this in the dotnet/runtime issue. Thank you all for the feedback. Updated test script:
$name = "random.bin"
$start = Get-Date
Write-Host "Creating Path to $name test file: " -NoNewline
$now = Get-Date
$file = Join-Path -Path $PSScriptRoot -ChildPath $name
Write-Host " $(((Get-Date) - $now).TotalMilliseconds) ms"
Write-Host "Reading all file bytes into memory: " -NoNewline
$now = Get-Date
$bytes = [System.IO.File]::ReadAllBytes( $file )
Write-Host " $(((Get-Date) - $now).TotalMilliseconds) ms"
Write-Host "Converting file bytes to base64 string: " -NoNewline
$now = Get-Date
$base64 = [System.Convert]::ToBase64String( $bytes )
Write-Host " $(((Get-Date) - $now).TotalMilliseconds) ms"
Write-Host "Converting base64 string back to file bytes: " -NoNewline
$now = Get-Date
$bytes = [System.Convert]::FromBase64String( $base64 )
Write-Host " $(((Get-Date) - $now).TotalMilliseconds) ms"
Write-Host "Test complete"
Write-Host "Total duration: $(((Get-Date) - $start).TotalMilliseconds) ms"
I just tested this same implementation using a C# console application for the dotnet/runtime team, and the issue does NOT occur when running in a console application against .NET 8 on the latest SDK on Windows 11 Professional. My test results and C# code are here: dotnet/runtime#101061 (comment). It seems that this is actually a memory leak in the PowerShell runtime for some reason. The dotnet/runtime crew was asking which runtime is in use. I'm assuming PowerShell 7.4.2 is using .NET 8.0 under the hood. Does it ship with its own .NET runtime, or does it rely on the runtime installed on the system?
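As an aside, the runtime a given pwsh session is actually using can be checked from within the session itself with standard .NET APIs:

```
# PowerShell engine version
$PSVersionTable.PSVersion

# The CLR that is hosting this process, e.g. ".NET 8.0.4"
[System.Runtime.InteropServices.RuntimeInformation]::FrameworkDescription
```

To my knowledge, PowerShell 7 ships self-contained with its own copy of the .NET runtime rather than picking up whatever is installed system-wide, but the snippet above is the authoritative check for any given install.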
For reference, here is the script used to generate the random test file:
param(
[int]
$Size
)
$blockSize = 256
$rand = [System.Random]::new()
$total = 0
[byte[]]$data = [System.Array]::CreateInstance( [byte], $blockSize ) # element type is [byte], not [byte[]]
$path = Join-Path -Path $PSScriptRoot -ChildPath "random.bin"
if( Test-Path $path ) {
Remove-Item -Path $path -Force
}
$file = [System.IO.File]::OpenWrite( $path )
while( $total -lt $Size ) {
$rand.NextBytes( $data )
$file.Write( $data, 0, $data.Length )
$total += $blockSize
}
$file.Flush()
$file.Close()
With PowerShell 7.4.2 installed from Microsoft Store,
A couple more data points regarding the unexpected slowdown:
I would like to remind everyone that we have previously observed slow pwsh file operations caused by antivirus.
I am also seeing .NET 8.0.4 for the
The slow operation is the call to FromBase64String. Also, this only happens when doing this in PowerShell 7.4.2. Running this same test in a C# console application takes under 1 second.
I added a Program Setting for
Putting the excess memory consumption aside: during the slow FromBase64String operation, is the CPU busy? Can you attach an unmanaged-code debugger to the process during FromBase64String and get a stack trace of the thread with the most CPU time?
Here is a stack trace of the thread with the most CPU time. I used WinDbg and broke process execution in the middle of the FromBase64String operation. You can see some fun stuff at the top.
Oh, LogMemberInvocation calls ArgumentToString here: PowerShell/src/System.Management.Automation/engine/runtime/Operations/MiscOps.cs Line 3660 in 8ea1598
So does that mean the multi-megabyte base64 string goes via the Anti-Malware Scan Interface to Windows Defender? I guess that would be a sensible design. And then perhaps the Defender implementation of AMSI makes a few more copies of the string. This PowerShell code would apparently log the whole AMSI scan request to the console if you set the __PSDumpAMSILogContent environment variable. Why does the AMSI scan take that long, though? Does it do useful work all that time, or does it get stuck somehow and give up after a timeout? Perhaps you could try with files of different sizes and graph how the file size affects the FromBase64String duration. If the duration stays the same, that suggests there is a timeout.
I ran a test using variable size byte arrays with random payloads starting at 2 MiB in size and going up to 116 MiB in size. You can see that the duration required is linear, and also extremely slow. It takes 2 seconds to convert 32 MiB back to a byte array from a base64 string. The same test conducted at 16 MiB intervals up to 256 MiB also shows a linear trend. One final test at 32 MiB intervals up to 384 MiB shows a linear trend as well suggesting that there may be no upper boundary or timeout no matter how much data you ask PowerShell to convert from base64. |
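The author's actual measurement script wasn't preserved in this copy of the thread, but the loop was presumably of roughly this shape (sizes, names and formatting are illustrative only):

```
# Time FromBase64String for increasing payload sizes
foreach ($mib in 2, 4, 8, 16, 32) {
    $data = [byte[]]::new($mib * 1MB)
    [System.Random]::new().NextBytes($data)
    $b64 = [System.Convert]::ToBase64String($data)

    $sw = [System.Diagnostics.Stopwatch]::StartNew()
    $null = [System.Convert]::FromBase64String($b64)
    $sw.Stop()

    '{0,4} MiB -> {1,10:N1} ms' -f $mib, $sw.Elapsed.TotalMilliseconds
}
```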
Or perhaps you might be completely horrified by the idea of deep packet inspection of all arguments, with no knowledge of whether that will be sent to 3rd parties (or any other party at all, to be honest).
I hoped the graph might show a lower boundary, because it could indicate a configuration error that could then be fixed to speed up the operation; for example, if the AMSI code running in-process were unable to contact the Defender service and spent some constant amount of time attempting that. Alas, the linear graph doesn't look like that's the case. There may be ways to change the PowerShell script so that, even though it still triggers the suspicious content detector and causes an AMSI scan, the argument list being scanned does not include the base64 data and the scan finishes faster. But if such a workaround becomes commonly used, I suspect a future version of PowerShell will be changed to scan the data anyway. |
If I am understanding this nonsense with AMSI correctly, then a solution would be to perform the Base64 translation in a compiled C# cmdlet. Given we are talking PowerShell, it should be implemented using a pipeline.
AMSI logging of method invocations was added as an experimental feature in #16496 and made non-experimental in #18041. I'm not sure it even uses the suspicious content detector; perhaps the difference between ToBase64String and FromBase64String is that ArgumentToString does not format the elements of a byte[] argument for AMSI, but passes a string argument through. A slowdown was previously reported in #19431.
I am not seeing similar times
Took 442.478 ms on a little Intel(R) Core(TM) i3-10100Y CPU @ 1.30GHz running Windows 11 Pro.
Where can I find documentation on what this actually does? I am personally horrified by the idea that anyone thinks they have the right to log data that was private to a process without its knowledge.
@rhubarb-geek-nz, your script uses ToBase64String, not FromBase64String.
The best may be the documentation of the PSAMSIMethodInvocationLogging experimental feature in this old version: https://github.com/MicrosoftDocs/PowerShell-Docs/blob/793ed5c687e6c7b64565d1751c532eb1d7d84209/reference/docs-conceptual/learn/experimental-features.md#psamsimethodinvocationlogging The "How AMSI helps" link in that documentation doesn't work on GitHub; use https://learn.microsoft.com/windows/win32/amsi/how-amsi-helps instead. AMSI doesn't necessarily involve telemetry that would send the data off the machine. I don't know whether Windows Defender has telemetry for AMSI scans.
Because ArgumentToString does not recognise the char[] type and returns only the type name, I think a [System.Convert]::FromBase64CharArray call should be much faster for AMSI to scan than [System.Convert]::FromBase64String. But who knows how long that will remain so. |
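A minimal sketch of that idea ($base64 here stands for the string to decode; whether this keeps dodging the scan in future versions is, as noted, anyone's guess):

```
# Pass a char[] instead of a string; ArgumentToString then reports only
# the type name to AMSI rather than the multi-megabyte payload.
$chars = $base64.ToCharArray()
$bytes = [System.Convert]::FromBase64CharArray($chars, 0, $chars.Length)
```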
My workaround is to use two cmdlets to do the Base64 conversions, hence bypassing the AMSI argument scan. Original timings:

Now:

Code:
Thank you, this is very helpful. I was fiddling with using C# and Add-Type, but I noticed that making the call directly in C# via a static function still invokes the AMSI context. I will incorporate this into my PowerPass module to avoid the performance issue and excessive memory usage. |
@chopinrlz, it isn't calls from C# that trigger an AMSI scan; it is .NET method calls made from PowerShell code. @rhubarb-geek-nz's workaround of implementing cmdlets in C# avoids this because the .NET method calls happen inside compiled cmdlet code instead of being dispatched (and logged) by PowerShell. Two asides:
Not actually true; it merely needs to be set before the first call to be logged.
Of course, how could we not have invisible, non-obvious problems in the simplest of code. |
Not actually true, because for predictable diagnostic output you indeed do need to set the environment variable first, as evidenced by the following:
$null, 1 | % {
Write-Host ---
$env:__PSDumpAMSILogContent = $_
pwsh -noprofile { [byte[]]::new(0) }
}
$env:__PSDumpAMSILogContent = $null
I assume this is pure sarcasm (which I do not endorse, but I empathize with the frustration I presume to underlie it); if there's an actual argument in there (beyond what #21496 expresses), please tell us. |
The script was predictable because it had "#!/usr/bin/env pwsh" at the start, had the executable bit set, and was designed to run directly from bash. It sets the environment variable after PowerShell has started but before the first reflection invocation.
The frustration is because every time you think you have found the solution with PowerShell, there is always another reason, case, exception or scenario where it breaks. As a user you don't have the tools to see all these problems, because the very objects themselves play stupid games, trying to pretend to be something they are not, or changing from what you thought they should have been. I can only assume I am not the target audience for this tool, despite it supposedly being for system administrators, developers and IT professionals.
No, it isn't predictable. My previous example stands. If the calling process doesn't have the environment variable set, the outcome differs.
Again I empathize. In the case at hand I've (indirectly) pointed to the (ultimate) root cause of the underlying problem - #5579. |
Yes, you are absolutely right. It wasn't predictable, because you might have been using PowerShell as your default shell to launch scripts. Whereas every other UNIX shell runs a script with the executable bit set in a new process, we are talking about PowerShell here. Sigh. Perhaps a recommendation for running test scripts is "In a new process...", not "In whatever process, with whatever indeterminate state, you happen to have...".
P.S.: @rhubarb-geek-nz:
I beg to offer a different opinion...
Scenario A - The environment variable is not set in the calling process
Scenario B - it is set to 0 in the calling process
You're right - via the CLI (as implicitly used via a shebang-based executable shell script), the in-process setting is honored, if:
A simpler demonstration: Start a pristine POSIX-compatible shell and run the following: export -n __PSDumpAMSILogContent # ensure that the env. var. isn't defined.
# AMSI log output via env. var. defined BEFORE
__PSDumpAMSILogContent=1 pwsh -noprofile -c '$null = [byte[]]::new(2048)'
# !! Produces AMSI output too, because the environment variable - despite being set in-session - is
# !! set *before the first method call*.
pwsh -noprofile -c '$env:__PSDumpAMSILogContent = 1; $null = [byte[]]::new(2048)' Note:
Rather than making a cmdlet for every .NET method you wish to call, you can simply put reflection in a single cmdlet.
Really? That is one I have not heard of.... eg
If you mean executable PowerShell scripts without the ps1 extension, we know how that ends up. |
Another alternative is to do the reflection directly in PowerShell itself
Then the AMSI logging just looks like
Where the arguments are not dumped, because all it prints is System.Object[].
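The original snippet wasn't preserved in this copy of the thread, but the call shape being described is presumably along these lines (a sketch; $base64 is the string to decode):

```
# Resolve the MethodInfo once, then invoke via reflection. The AMSI log
# only ever sees "System.Object[]" for the argument list, so the base64
# payload itself is never formatted into the scan request.
$fromBase64 = [System.Convert].GetMethod('FromBase64String', [type[]]@([string]))
$bytes = $fromBase64.Invoke($null, @($base64))
```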
Unfortunately, many ill-advised practices are common. With PowerShell, specifically, things get tricky (leaving the bug you mention aside) because, unlike analogous shell scripts for POSIX-compatible shells, an executable, shebang-line-based script with the .ps1 extension runs in-process when invoked from PowerShell. One without this extension consistently runs in a child process, albeit more slowly and at the expense of rich type support in the in- and output, the ability to pass array arguments, and arguments that have no string-literal representation; but a PowerShell script that is designed to (also) run as a standalone executable should not rely on these features anyway. I presume it is the latter limitations that explain why, at least in my perception, shebang-line-based PowerShell scripts haven't really caught on, and why bugs such as #21402 are still not fixed.
It depends on the context. If you mean a program that is found via the PATH then I might agree, but in general, when you are managing large numbers of scripts to perform tasks, keeping the .sh extension is very useful. UNIX exec() does not care about file extensions for executables; the concept of file extensions does not exist within the POSIX C API. You are free to name executable files however you like. One major advantage of maintaining the .sh extension is when you manage the scripts in a source-code repository: you are storing text, not a compiled binary, and keeping the extension makes that absolutely obvious. It is Windows-isms that step through extensions (com, exe, bat, cmd) while looking for commands on the path or in the local directory; similarly, PowerShell will try to append .ps1 when looking for a command.
@rhubarb-geek-nz , we're getting far afield, but let me attempt a summary of the issue at hand first, which implies that there's likely nothing actionable here:
Returning to the tangent:
When it comes to naming a stand-alone executable, it seems to me that the end-user experience should be the driver, trumping any design-time / implementation considerations:
I'm half expecting you to make PowerShell recognise MethodInfo.Invoke calls and log each element of the arguments array.
@KalleOlaviNiemitalo, fair point: Both of the aforementioned workarounds amount to bypassing the intended AMSI calls - I merely summarized them, speaking as someone who's neither a security expert nor speaking in any official capacity. |
Let's go back to the original problem.
Since the early days of computers we have been able to deal with files larger than the available memory of the computer. This is still the case. The first thing to realise is (a) PowerShell is not a UNIX shell and it is really really bad at dealing with streams of bytes. That is not a problem of the PowerShell engine itself, but the existing cmdlets, scripts, patterns and expectations. PowerShell deals with pipelines of typed objects, not text or byte streams. (b) UNIX does this kind of thing in its sleep, literally. A pipe is a byte stream first and foremost. Deciding to treat it as text is an afterthought. So if we were doing this in UNIX we would simply do
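The command itself was dropped from this copy of the thread, but on a typical Linux system the streaming round trip presumably looked like this (file names are illustrative):

```shell
# Create a sample input, then encode and immediately decode it,
# streaming end to end; memory usage stays constant regardless of size.
head -c 1048576 /dev/urandom > input.bin
base64 < input.bin | base64 -d > output.bin
```

Each stage only ever holds a pipe buffer's worth of data, which is why the UNIX version does this "in its sleep".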
The file went through the memory as it was being processed and then out to the final file. Now let's do the same thing with PowerShell,
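The pipeline was also lost from this copy of the thread, but based on the description below it was presumably shaped like this, using the author's Split-Content, ConvertTo-Base64 and ConvertFrom-Base64 cmdlets from PSGallery (the exact parameter spelling is my guess):

```
# Stream 4 KB chunks through base64 encode and decode and back out to
# disk; only a chunk's worth of data is in flight at any moment.
Split-Content dotnet-sdk-8.0.204-win-x64.exe |
    ConvertTo-Base64 |
    ConvertFrom-Base64 |
    Set-Content copy.exe -AsByteStream
```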
When you put that pipeline together it takes only about 50MB working set in order to process dotnet-sdk-8.0.204-win-x64.exe and write a copy of the output. Validate it and compare with the SHA512 from the original download site
So how does that work? Split-Content reads a file and writes arrays of 4096 bytes to the success pipeline. ConvertTo-Base64 reads the byte arrays and writes out lines of Base64 encoding of just 64 characters each, the same line length the classic UNIX tools produce. ConvertFrom-Base64 reads the strings and converts them back to byte arrays. Set-Content writes the byte arrays to the final file. It only took about 27MB to read, encode and decode the base64 without writing to a file.
So going from 3.4GB down to 27MB with no change to PowerShell itself is not a bad effort. It was a trade-off of space versus time: it takes about 7 seconds or so to run the read, encode and decode pipeline.
Yes, prior to PS 7.4 raw byte handling in pipelines wasn't supported, but in 7.4+ it now is, between external (native) programs, so the following works as intended from PowerShell (also on Windows, if you install OpenSSL):

# OK in PS 7.4+
openssl base64 -in file.in | openssl base64 -d > file.out

I haven't looked into the implementation, but I assume (hope) that on Unix-like platforms the usual system-level data buffering applies, which is 64KB these days.

The slow solution (byte-by-byte processing on the PowerShell side) is therefore:

Get-Content file.in -AsByteStream | openssl base64 | openssl base64 -d > file.out

The much faster solution, which, however, reads the input file in full due to -Raw:

Get-Content file.in -Raw -AsByteStream | openssl base64 | openssl base64 -d > file.out

The more memory-efficient solution that emulates Unix pipeline buffering is:

Get-Content file.in -ReadCount 64kb -AsByteStream |
% { , [byte[]] $_ } |
openssl base64 | openssl base64 -d > file.out

Note the, unfortunately verbose and costly, need for the intermediate % { , [byte[]] $_ } stage to re-wrap each chunk as a single byte array. Arguably, Get-Content -AsByteStream combined with -ReadCount should emit byte arrays directly; this would obviate the need for the inefficient and awkward intermediate stage.
I did not have much success with Get-Content with -ReadCount even in binary mode, and I did not think of the array conversion in a ForEach-Object. Hence I wrote Split-Content, which reads directly into a byte array and puts that straight into the output pipeline; no need to convert any arrays. I am not convinced that large buffers like 64K help in the PowerShell pipeline, because the entire 64KB has to fill before anything passes down the pipeline. The buffering in UNIX works the other way around: writers can keep writing until the pipe buffer is full, then they block until the reader has made some room. A UNIX pipeline has a record size of 1; the PowerShell pipeline above has a record size of 64K, so nothing can move until the record is full. In UNIX, if a network stream is slow, even a few hundred bytes at a time would still dribble through. It would certainly be better if Get-Content -AsByteStream always wrote a byte array, but I think it is too late to change that.
Yes, it's an imperfect emulation of the native Unix pipeline, but with file input (where there's no "dribbling") it works well. That said, it's rare for Unix-heritage utilities to accept input only via stdin (the pipeline) and not also via file-path operands; thus, with a file as the data source, passing the file's path as an argument to the external program is the simpler and better solution (such as the openssl base64 -in call shown earlier).
Hopefully not: Let's see what becomes of the feature request you've since created: |
Thank you for showing me this technique. So, what I understand is happening with the PowerShell pipeline is that the data moves through it one chunk at a time rather than all at once. @rhubarb-geek-nz, do you have your cmdlet source on GitHub?
Yes, they are on PSGallery, and each entry has a Project link which takes you to GitHub; likewise, the releases pages on GitHub have a link back to PSGallery.
PSGallery: rhubarb-geek-nz.SplitContent/1.0.0
GitHub: rhubarb-geek-nz/SplitContent
Yes, but a new byte array is written to the output pipeline. So the same total amount of memory is allocated, just not all at the same time.
Prerequisites
Steps to reproduce
This was tested on PowerShell 7.4.2
NOTE: If you test this with a newer version of the .NET 8.0 installer, you may have to modify the test script to pick the correct file, since the filename is hard-coded on line 3.
The .NET 8.0 installer for Windows x64 is approximately 222 MB in size. Reading it into memory, converting it to base64, then converting it back should require about 790 MB of RAM, with all variables remaining in scope during the process and no garbage collection or object disposal happening. The observed behavior appears to be memory-leak related, as the amount of memory used once the conversion eventually completes is about 3.4 GB. These data points can be seen in the attached screenshots.
Expected behavior
Actual behavior
In PowerShell 7.4.2, the time to complete is 82 seconds and memory used is 3.4 GB.
Error details
No response
Environment data
Visuals
Testing in PowerShell 7.4.2
Testing in PowerShell 5.1
PowerShell 7.4.2 Memory Usage
PowerShell 5.1 Memory Usage