Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Group GPU child zones by name. #262

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

benvanik
Copy link
Sponsor Contributor

This gives a more intuitive grouping in this UI pane than the uniqued source locations which aren't fully displayed - people were thinking the UI pane was broken as identically named zones weren't grouping when the checkbox was ticked.

Previous:
image

Now:
image

@wolfpld
Copy link
Owner

wolfpld commented Sep 15, 2021

What is an unique source location in your use case? Is this duplication some artifact of optimization/inlining?

@benvanik
Copy link
Sponsor Contributor Author

These are GPU locations that use this approach: https://github.com/wolfpld/tracy/blob/master/TracyVulkan.hpp#L401
Because that allocs a new source location per use each srcloc ends up as a unique ID in the UI - so every single zone is uniqued. All of them have the same exact inputs (line/source/function/name) - I think a better fix would be if the tracy profiler deduplicated them instead (as ​then all parts of the UI would work correctly), but this is broken today and this felt like a good babystep to unblock being able to use tracy for GPU rollups :)

@benvanik
Copy link
Sponsor Contributor Author

(more to your question: we are running dynamic content that can't be compiled into the application and use the static source locations that are uniqued in the application using the static SourceLocationData)

@wolfpld
Copy link
Owner

wolfpld commented Sep 16, 2021

Because that allocs a new source location per use each srcloc ends up as a unique ID in the UI - so every single zone is uniqued. All of them have the same exact inputs (line/source/function/name) - I think a better fix would be if the tracy profiler deduplicated them instead (as ​then all parts of the UI would work correctly)

This should already be deduplicated. The GpuZoneBeginAllocSrcLocSerial message is being processed at https://github.com/wolfpld/tracy/blob/058e8901/client/TracyProfiler.cpp#L2364-L2371, where the source location payload is transferred within the SendSourceLocationPayload function (https://github.com/wolfpld/tracy/blob/058e8901/client/TracyProfiler.cpp#L2522-L2541). This data is eventually processed at https://github.com/wolfpld/tracy/blob/058e8901/server/TracyWorker.cpp#L3731-L3784, where a temporary source location structure is created, which is then used to search if an entry in the sourceLocationPayloadMap map already exists. If it does, the already existing source location is used. (The StoreString functions also deduplicate.)

It would be interesting to find out why and where this breaks.

@benvanik
Copy link
Sponsor Contributor Author

Awesome! I'll debug through that!

@benvanik
Copy link
Sponsor Contributor Author

Debugging through loading the file, I see TracyWorker.cpp 744 loading 1303 entries into sourceLocationPayload with 81 deduplicated entries in sourceLocationPayloadMap. So it looks like it is deduplicating there correctly on load, but the fact that there are 1303 payloads makes me think it's not deduplicating on capture correctly.

I believe the issue causing the behavior I'm seeing above in the zone view is that though that deduplication happens on load the GpuEvents cpuStart_srcloc is still pointing at the raw sourceLocationPayload index from the file - this means that the view code that's strictly using that index for its cmap will still be acting in the unduplicated domain.

The source locations I'm seeing are negative, probably coming from m_pendingSourceLocationPayload on the capture side. Debugging into that will be trickier (ugh, Android), but maybe the above triggers something in your mind? In particular, I'm wondering if namehash may be causing issues with the AddSourceLocationPayload deduplicating.

One particularly interesting thing I've observed is that this trace I'm debugging through was captured using the capture tool against a process running on an Android device, but if I capture a trace by directly attaching to a process on my Windows machine things work fine! Debugging through the same sourceLocationPayload loading code in a saved trace I see that both sourceLocationPayload and sourceLocationPayloadMap have the same number of entries (31 in my particular run) - indicating that the worker did deduplicate the payloads correctly.

Any thoughts as to why Android + capture tool would not deduplicate while Windows + profiler UI would? (I'm not sure where best to look at next)

Attached is the trace showing the unduplicated sourceLocationPayloads in case it's helpful:
MobileNetV3Small-[fp32,imagenet]-(TensorFlow)-full-inference-with-IREE-Vulkan-@-SM-G980F-(GPU-Mali-G77).zip

(and thanks for the help!)

@wolfpld
Copy link
Owner

wolfpld commented Sep 16, 2021

So it looks like it is deduplicating there correctly on load

There is no deduplication at load time. Everything is done at capture time.

the GpuEvents cpuStart_srcloc is still pointing at the raw sourceLocationPayload index from the file - this means that the view code that's strictly using that index for its cmap will still be acting in the unduplicated domain.

This is the intended behavior.

The source locations I'm seeing are negative, probably coming from m_pendingSourceLocationPayload on the capture side.

This is because there are two ways of providing source location data, as you can see in the source location retrieval code:

const SourceLocation& Worker::GetSourceLocation( int16_t srcloc ) const
{
    if( srcloc < 0 )
    {
        return *m_data.sourceLocationPayload[-srcloc-1];
    }
    else
    {
        return m_data.sourceLocation.find( m_data.sourceLocationExpand[srcloc] )->second;
    }
}

Positive srcloc entries are decompressed (int16 -> uint64 LUT) to a pointer to the source location data structure, as was present in the profiled program. This pointer is then used to lookup the retrieved data stored in sourceLocation map.

Negative srcloc entries do not have a corresponding pointer in the profiled program memory, as they are allocated dynamically (and deduplicated by the methods referenced earlier), and have to be handled in a different manner.

You can see that the loading code just populates the sourceLocation, sourceLocationExpand and sourceLocationPayload lookup tables, without any further processing, as it's not needed.

In particular, I'm wondering if namehash may be causing issues with the AddSourceLocationPayload deduplicating.

namehash is used only as a hacky cache (mutable in a const struct...) for name string hash, used to provide zone color in certain configurations (View::GetRawSrcLocColor). The SourceLocationHasher and SourceLocationComparator functors, used for deduplication, operate on SourceLocationBase, which does not include the namehash field.

Any thoughts as to why Android + capture tool would not deduplicate while Windows + profiler UI would? (I'm not sure where best to look at next)

I don't think Android is a factor here, the allocated source location transfer format is rather simple, and not platform specific:

    // Allocated source location data layout:
    //  2b  payload size
    //  4b  color
    //  4b  source line
    //  fsz function name
    //  1b  null terminator
    //  ssz source file name
    //  1b  null terminator
    //  nsz zone name (optional)

Furthermore, both _36 source locations do have identical data inside:

obraz
obraz

Profiler UI and the capture utility differ mainly by using or not the TRACY_NO_STATISTICS macro, which disables calculation of additional derivative data which may be useful in the UI, but which would just waste CPU cycles otherwise.

@benvanik
Copy link
Sponsor Contributor Author

Ok cool - that all makes sense. I apologize for being a bit dense, but why would if they have duplicate data and everything is working as intended are they not deduplicated and shown as unique entries in the GPU zone listing? (I don't know where to look next :)

@wolfpld
Copy link
Owner

wolfpld commented Sep 16, 2021

I have no idea. The best course of action would be to try to do a repro case and then debug the server part to find out what exactly happens in the deduplication code and why it is failing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants