
CP013: Create a motivational example (P1795) #129

Open

AerialMantis opened this issue May 24, 2020 · 4 comments
Labels: REM20 ISO C++ meeting (remote) 2020

@AerialMantis (Contributor)

In our last discussion, we decided that we should create a motivational example of how a developer could use the topology discovery design proposed in P1795 to optimise an algorithm such as matrix multiply for different system architectures.

@AerialMantis AerialMantis added the REM20 ISO C++ meeting (remote) 2020 label May 24, 2020
@AerialMantis AerialMantis self-assigned this May 24, 2020
@mhoemmen (Collaborator)

Do we plan on being able to look up cache sizes at different levels of the memory hierarchy? We could use Strassen for a simple example, and have it use the lowest-level cache size to decide when to stop recursing.
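
A minimal sketch of that cutoff, assuming a hypothetical `lowest_level_cache_size()` query (P1795 does not define that name), and using the plain eight-product recursion in place of Strassen's seven products for brevity; the cutoff logic is the same either way:

```cpp
#include <cstddef>

// Hypothetical query: P1795 does not define this name; it stands in for
// whatever the topology interface ends up exposing for the capacity of
// the cache level closest to the cores.
std::size_t lowest_level_cache_size() { return 32 * 1024; /* placeholder */ }

// A square block of a row-major matrix; n is assumed to be a power of two.
struct View {
  double* data;        // top-left element of this block
  std::size_t n;       // block dimension
  std::size_t stride;  // row stride of the underlying full matrix
  double& at(std::size_t i, std::size_t j) const { return data[i * stride + j]; }
  View quad(std::size_t qi, std::size_t qj) const {  // n/2 quadrant (qi, qj)
    return {data + qi * (n / 2) * stride + qj * (n / 2), n / 2, stride};
  }
};

// C += A * B on cache-resident blocks.
void naive_multiply(View a, View b, View c) {
  for (std::size_t i = 0; i < a.n; ++i)
    for (std::size_t k = 0; k < a.n; ++k)
      for (std::size_t j = 0; j < a.n; ++j)
        c.at(i, j) += a.at(i, k) * b.at(k, j);
}

// Recursive multiply, C += A * B.  Strassen would form seven products
// from sums of quadrants at the recursive step; the eight-product
// recursion is used here only to keep the sketch short.
void multiply(View a, View b, View c) {
  // Stop recursing once all three operand blocks fit in the cache together.
  if (3 * a.n * a.n * sizeof(double) <= lowest_level_cache_size()) {
    naive_multiply(a, b, c);
    return;
  }
  for (std::size_t i = 0; i < 2; ++i)
    for (std::size_t j = 0; j < 2; ++j)
      for (std::size_t k = 0; k < 2; ++k)
        multiply(a.quad(i, k), b.quad(k, j), c.quad(i, j));
}
```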

@AerialMantis (Contributor, Author)

I would like to have a property which reflects the various cache levels and their sizes. I'm not entirely sure how best to represent those in a generic way yet; perhaps through a hierarchy of managed memory resources which expose their constructive/destructive interference sizes.
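
For instance, the shape of such a property might look something like this; every name below is hypothetical, nothing here is proposed wording:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of a per-level descriptor.
struct memory_level {
  std::size_t size_bytes;               // capacity of this level
  std::size_t destructive_interference; // cf. std::hardware_destructive_interference_size
  bool        user_managed;             // scratch-pad memory vs. OS-managed cache
};

// Levels reachable from a given execution resource, ordered from the
// closest (lowest latency) outward; hard-coded here as a placeholder for
// the real topology query.
std::vector<memory_level> memory_hierarchy(/* const execution_resource& */) {
  return {{32 * 1024, 64, false},          // L1
          {1024 * 1024, 64, false},        // L2
          {32 * 1024 * 1024, 64, false}};  // shared L3
}
```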

Yeah, I like that idea; it would be a good example of using the topology information. So we would recursively divide the matrices into blocks until they fit into the lowest-level cache and then compute one block at a time, per group of threads sharing the cache.

It would be interesting to then further generalize this so that larger matrices could be subdivided across NUMA regions as well.
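
A rough sketch of that outer level, with a hypothetical `numa_region_count()` query standing in for the real topology interface:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical query standing in for asking the topology how many NUMA
// regions the system exposes.
std::size_t numa_region_count() { return 2; /* placeholder */ }

// Outer level of the generalization: split C's rows across NUMA regions
// and let each region's threads run the cache-blocked multiply on their
// slice.  Real code would also bind each thread, and the memory backing
// its slice, to the region (e.g. with libnuma); a plain triple loop
// stands in for the per-region blocked multiply to keep this short.
void multiply_across_numa(const double* a, const double* b, double* c,
                          std::size_t n) {
  std::size_t regions = numa_region_count();
  std::vector<std::thread> workers;
  for (std::size_t r = 0; r < regions; ++r) {
    std::size_t row0 = r * n / regions;
    std::size_t row1 = (r + 1) * n / regions;
    workers.emplace_back([=] {
      for (std::size_t i = row0; i < row1; ++i)
        for (std::size_t k = 0; k < n; ++k)
          for (std::size_t j = 0; j < n; ++j)
            c[i * n + j] += a[i * n + k] * b[k * n + j];
    });
  }
  for (auto& t : workers) t.join();
}
```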

@AerialMantis (Contributor, Author)

I am working on a pseudo-generic algorithm for this, incorporating the various architecture-agnostic pieces of information we will need to be able to query. This is a summary of what I have so far (a code sketch pulling the queries together follows the list):

  • Establish where the matrices should be stored initially:
    • Is there a global memory region representing a single unified address space that all high-level execution resources can share?
    • Are there distinct memory regions for various execution resources, and if so, how can memory be migrated between them (copy, map, network communication)?
    • What is the latency in moving between memory regions?
    • Are the kinds of memory regions and their latencies uniform, and can we hide latency through double buffering?
  • Establish the matrix subdivision size:
    • What are the lowest-latency memory resources available across the execution resources?
    • What kind of memory regions are these: are they OS/kernel-managed, such as a cache, or user-managed, such as scratch-pad memory?
    • How much memory is available to these memory resources?
    • Is this uniform or do we need to look for the lowest common denominator?
  • Establish how much work to allocate to the various execution resources:
    • What is the total available concurrency of the execution resources which share these memory resources?
    • What is the peak performance of the execution resources which share these memory resources, say in FLOP/s?
    • Is this uniform or does this need to be more dynamically assigned?
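
Here is a rough sketch of how the second and third steps might consume those queries; all of the structures and the `discover_resources()` query below are hypothetical placeholders, not proposed interfaces:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct memory_region {
  std::size_t size_bytes;    // capacity of the resource's fastest region
  bool        user_managed;  // scratch-pad memory vs. OS/kernel-managed cache
};
struct execution_resource_info {
  std::size_t   concurrency;  // total available concurrency
  double        peak_flops;   // peak performance in FLOP/s
  memory_region fastest;      // lowest-latency memory reachable by it
};

// Placeholder standing in for the real topology discovery query.
std::vector<execution_resource_info> discover_resources() {
  return {{8, 4.0e11, {32 * 1024, false}},
          {8, 4.0e11, {32 * 1024, false}}};
}

struct plan {
  std::size_t block_dim;           // tile dimension from the second step
  std::vector<double> work_share;  // per-resource fraction from the third step
};

plan make_plan() {
  auto res = discover_resources();
  // Second step: take the smallest fast region across resources (the lowest
  // common denominator when the hierarchy is not uniform) and pick the
  // largest power-of-two tile such that three b*b double tiles (A, B, C)
  // fit in it.
  std::size_t cap = SIZE_MAX;
  for (const auto& r : res) cap = std::min(cap, r.fastest.size_bytes);
  std::size_t b = 1;
  while (3 * (2 * b) * (2 * b) * sizeof(double) <= cap) b *= 2;
  // Third step: apportion tiles in proportion to each resource's peak FLOP/s.
  double total = 0;
  for (const auto& r : res) total += r.peak_flops;
  plan p{b, {}};
  for (const auto& r : res) p.work_share.push_back(r.peak_flops / total);
  return p;
}
```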

@AerialMantis (Contributor, Author)

AerialMantis commented Jun 12, 2020

From the last heterogeneous C++ call:

  • We should focus first on generalizing for the CPU.
  • Another part of this algorithm should be to compare the computational complexity of the operation (here, matrix-matrix multiply) and the performance of each execution resource against the cost of moving data to that execution resource; this gives us the trade-off that decides whether or not to offload to an accelerator such as a GPU (see the sketch after this list).
  • Another factor to consider in the future is the overhead of invoking a callable on various execution resources, such as launching kernels on accelerators.
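
As a first approximation, that trade-off for an n×n multiply is O(n³) compute against O(n²) data movement. A minimal sketch, where all the throughput numbers are assumed to come from the topology interface:

```cpp
#include <cstddef>

// Rough offload heuristic from the notes above; the formula is the only
// point of the sketch.
bool worth_offloading(std::size_t n,           // matrix dimension
                      double host_flops,       // host peak, FLOP/s
                      double device_flops,     // accelerator peak, FLOP/s
                      double link_bytes_per_s  // host<->device bandwidth
) {
  double flops = 2.0 * n * n * n;               // matrix-matrix multiply cost
  double bytes = 3.0 * n * n * sizeof(double);  // move A and B over, C back
  double host_time   = flops / host_flops;
  // A kernel-launch overhead term (the last bullet above) would be added
  // to device_time once there is a way to query it.
  double device_time = flops / device_flops + bytes / link_bytes_per_s;
  return device_time < host_time;
}
```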
