Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search for the diagnostics socket in the mount NS of the target process #3539

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Jongy
Copy link

@Jongy Jongy commented Dec 5, 2022

Closes: #3480

This is a PoC for #3480 .

I tested the PoC by:

  • running a dotnet app in a container (based on mcr.microsoft.com/dotnet/sdk:6.0-focal).
  • built dotnet-trace from main branch and ran it - I get:
root@4ee5552303fe:/code# ./artifacts/bin/dotnet-trace/Debug/netcoreapp3.1/dotnet-trace collect --process-id 1249318
No profile or providers specified, defaulting to trace profile 'cpu-sampling'

Provider Name                           Keywords            Level               Enabled By
Microsoft-DotNETCore-SampleProfiler     0x0000F00000000000  Informational(4)    --profile 
Microsoft-Windows-DotNETRuntime         0x00000014C14FCCBD  Informational(4)    --profile 

Unable to start a tracing session: Microsoft.Diagnostics.NETCore.Client.ServerNotAvailableException: Process 1249318 not running compatible .NET runtime.
   at Microsoft.Diagnostics.NETCore.Client.PidIpcEndpoint.GetDefaultAddress() in /code/src/Microsoft.Diagnostics.NETCore.Client/DiagnosticsIpc/IpcTransport.cs:line 282
   at Microsoft.Diagnostics.NETCore.Client.PidIpcEndpoint.Connect(TimeSpan timeout) in /code/src/Microsoft.Diagnostics.NETCore.Client/DiagnosticsIpc/IpcTransport.cs:line 243
   at Microsoft.Diagnostics.NETCore.Client.IpcClient.SendMessageGetContinuation(IpcEndpoint endpoint, IpcMessage message) in /code/src/Microsoft.Diagnostics.NETCore.Client/DiagnosticsIpc/IpcClient.cs:line 40
   at Microsoft.Diagnostics.NETCore.Client.EventPipeSession.Start(IpcEndpoint endpoint, IEnumerable`1 providers, Boolean requestRundown, Int32 circularBufferMB) in /code/src/Microsoft.Diagnostics.NETCore.Client/DiagnosticsClient/EventPipeSession.cs:line 34
   at Microsoft.Diagnostics.NETCore.Client.DiagnosticsClient.StartEventPipeSession(IEnumerable`1 providers, Boolean requestRundown, Int32 circularBufferMB) in /code/src/Microsoft.Diagnostics.NETCore.Client/DiagnosticsClient/DiagnosticsClient.cs:line 70
   at Microsoft.Diagnostics.Tools.Trace.CollectCommandHandler.Collect(CancellationToken ct, IConsole console, Int32 processId, FileInfo output, UInt32 buffersize, String providers, String profile, TraceFileFormat format, TimeSpan duration, String clrevents, String clreventlevel, String name, String diagnosticPort, Boolean showchildio, Boolean resumeRuntime) in /code/src/Tools/dotnet-trace/CommandLine/Commands/CollectCommand.cs:line 228
  • updated the code as seen in the PR and re-built, ran dotnet-trace again and now it works:
root@4ee5552303fe:/code# ./artifacts/bin/dotnet-trace/Debug/netcoreapp3.1/dotnet-trace collect --process-id 1249318
No profile or providers specified, defaulting to trace profile 'cpu-sampling'

Provider Name                           Keywords            Level               Enabled By
Microsoft-DotNETCore-SampleProfiler     0x0000F00000000000  Informational(4)    --profile 
Microsoft-Windows-DotNETRuntime         0x00000014C14FCCBD  Informational(4)    --profile 

Process        : /usr/share/dotnet/dotnet
Output File    : /code/dotnet_20221205_220957.nettrace
[00:00:00:03]	Recording trace 427.354  (KB)
Press <Enter> or <Ctrl+C> to exit...092  (KB)
Stopping the trace. This may take several minutes depending on the application being traced.

Trace completed.

Sanity passes.

I had to fix 2 sites:

  1. Need to look up in the tmp directory of the target process - I do that by accessing it via /proc/pid/root/...
  2. Need to look up a socket belonging to the NSpid of the process - the PID as the process sees it, not as us see it. I parse /proc/pid/status for that.

TODOs

  1. I use /proc/pid/root by default. This requires privileges (I think it requires to be able to ptrace the process?) and dotnet-trace itself doesn't require root (I actually don't know which permission checks are implemented in the CLR before handling requests on the diagnostics socket. I know that in Java, HotSpot will compare the UID GID of the connecting process to the UID GID of itself). What makes sense, in order to maintain compatibility with the old versions of dotnet-trace, is to fallback to use IpcRootPath if we couldn't use /proc/pid/root. We can also check /proc/pid/ns/mnt vs /proc/self/ns/mnt to understand if we actually need to move mount namespace, but that requires ptrace permissions as well, so the first check ("can we use /proc/pid/root") is enough IMO.
  2. I directly use /proc/pid/root/... which can fail if the added path contains absolute symlinks. The problem is explained thoroughly and a solution is given in Python here, this will affect profiled containers if their /tmp is e.g a link to /var/tmp. We can resolve it now or in a follow-up PR.
  3. I get the NSpid via /proc/pid/status. This is available from Linux 4.1. If we want to support this feature on older kernels, there is a different method to get the NSpid, if we want to support that.
  4. I didn't fix other sites (e.g what dotnet-trace ps uses). I don't think there's a way to properly "ps" all dotnet processes by their sockets, because this will require iterating over all mount namespaces? Or over all mount namespaces of all dotnet process? IDK. I suggest we skip that.
  5. Current code uses /tmp for the temporary directory. Previous code used Path.GetTempPath which gets TMPDIR from env, or defaults to /tmp. Since that's the same path the diagnostics socket will be created in - what makes sense is to implement the behavior of GetTempPath for the remote process, i.e read its environment variables and use /proc/pid/root/$TMPDIR if TMPDIR is present, and otherwise fallback to /tmp. Can probably be a follow-up PR but I'm just stating it. Btw dotnet-trace collect --process-id as of now wouldn't have worked on it processes on the same mount namespace if they change their TMPDIR

Which of these TODOs do you think I should get done for merging it? I think only the first one is needed for MVP, to avoid regressions, and all others are features on top of it. If you agree I can implement it and we can move forward with merging this PR :)

@Jongy Jongy changed the title Search for the diagnostics docker in the mount NS of the target process Search for the diagnostics socket in the mount NS of the target process Dec 6, 2022
@@ -297,7 +298,18 @@ private static bool TryGetDefaultAddress(int pid, out string defaultAddress)
{
try
{
defaultAddress = Directory.GetFiles(IpcRootPath, $"dotnet-diagnostic-{pid}-*-socket") // Try best match.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Kubernetes scenarios with sharedProcessNamespace disabled, wouldn't this preclude the other solution from working? I.e. shared /tmp volume would not work and /proc will not be visible between containers.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, good point. Since dotnet-trace never actually required the PID to be visibile in /proc, we could always just share the /tmp and use --process-id with the NSpid of the process.

I think what will work is - if this path fails (i.e we don't have a valid socket in the /proc/pid/root/tmp/...nspid... path) we will fall back to the old behavior. Makes sense?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems like it would work but be aware that in the above environment if two containers are both running managed apps as lead processes, they will both have a pid of 1. This conflict also exists for tmp mounting though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dotnet-trace on containers
2 participants