
TriSolver (dist): move sorting permutation from CPU to GPU #1118

Draft
wants to merge 20 commits into master
Conversation


@albestro albestro commented Apr 9, 2024

This PR aims to drop the custom permuteJustLocal by transforming the permutation indices so that the use case becomes manageable with the existing local permutation implementation, which is available for both backends.

  • Cleanup implementation
  • It might be possible to drop i5 (for distributed implementation)
  • What to do about the permute API? Should we separate the "distributed" use case (at least formally), or is it enough to review the assumptions?
  • Evaluate if it is worth switching to MatrixRef (just for the code changed)
  • Make it work on GPU
  • Add a unit test for the new use-case with distributed matrices

Notes

Since PR #967, each rank sorts eigenvalues by type (upper, dense, lower, deflated) independently of the other ranks. At the time of that PR, for convenience, we opted to perform the sort with a custom permutation procedure, permuteJustLocal, which was able to deal with global indices while applying the permutation only to the local part. In addition, permuteJustLocal was implemented only on CPU, because implementing it on GPU would have required a major effort that was not worthwhile given the inherently GPU-inefficient nature of the operations involved.
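The index transformation at the heart of this PR can be illustrated with a small standalone sketch: given a global permutation (sorted position -> initial global index), keep only the indices owned by this rank and convert them to local indices, producing a purely local permutation that an existing local permutation kernel can apply. The function name, and the assumption of a 1D block-cyclic distribution with block size `nb` over `nranks` ranks, are illustrative only and not the actual DLA-Future API:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch (not DLA-Future code): turn a global permutation
// `index_sorted` (sorted position -> initial global index) into the purely
// local permutation (sorted local position -> initial local index) for
// `rank`, assuming a 1D block-cyclic distribution with block size `nb`
// over `nranks` ranks.
std::vector<std::size_t> globalToLocalPermutation(const std::vector<std::size_t>& index_sorted,
                                                  std::size_t rank, std::size_t nranks,
                                                  std::size_t nb) {
  std::vector<std::size_t> i5_lc;
  for (std::size_t ig : index_sorted) {
    // Keep only the global indices owned by this rank...
    if ((ig / nb) % nranks == rank) {
      // ...and convert each to its local index.
      const std::size_t local_block = ig / nb / nranks;
      i5_lc.push_back(local_block * nb + ig % nb);
    }
  }
  return i5_lc;
}
```

With this shape of index, the order of the local elements after the sort is preserved while all cross-rank bookkeeping disappears, which is what makes the existing local permutation implementation (on both backends) applicable.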

@albestro

albestro commented Apr 9, 2024

cscs-ci run

@albestro albestro marked this pull request as draft April 9, 2024 12:24
@albestro

albestro commented Apr 9, 2024

cscs-ci run

2 similar comments
@albestro

albestro commented Apr 9, 2024

cscs-ci run

@albestro

cscs-ci run

@albestro

cscs-ci run

2 similar comments
@albestro

cscs-ci run

@albestro

cscs-ci run

// @param perm_sorted array[n] current -> initial (i.e. evals[i] -> types[perm_sorted[i]])
// @param index_sorted array[n] global(sort(non-deflated)|sort(deflated))) -> initial
// @param index_sorted_coltype array[n] local(sort(upper)|sort(dense)|sort(lower)|sort(deflated))) -> initial
// @param i5_lc array[n_lc] local(sort(upper)|sort(dense)|sort(lower)|sort(deflated))) -> initial
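As a reading aid for the convention documented above (each index array maps a position in the *sorted* order back to an *initial* position), here is a minimal illustrative sketch of applying such a permutation; the helper name is hypothetical, not DLA-Future code:

```cpp
#include <cstddef>
#include <vector>

// Illustrative helper (not DLA-Future code): apply a permutation `perm`
// that maps sorted position -> initial position, i.e. out[i] = in[perm[i]].
template <typename T>
std::vector<T> applySortedToInitial(const std::vector<T>& in,
                                    const std::vector<std::size_t>& perm) {
  std::vector<T> out(perm.size());
  for (std::size_t i = 0; i < perm.size(); ++i)
    out[i] = in[perm[i]];  // sorted position i takes the element at initial position perm[i]
  return out;
}
```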

note-to-self: specify that they are local indices, while in index_sorted_coltype they are global indices

@albestro

cscs-ci run

@albestro albestro marked this pull request as ready for review May 27, 2024 10:35

@rasolca rasolca left a comment


Looks good.

Comment on lines 55 to 59
// Note:
// These are not implementation constraints, but rather logical constraints. Indeed, these ensure that
// the range [i_begin, i_end] is square in terms of elements (it would not make sense to have it square
// in terms of number of tiles). Moreover, by requiring mat_in and mat_out matrices to have the same
// shape, it is ensured that range [i_begin, i_end] is actually the same on both sides.

Constraints should be revised (in a different PR).

std::move(setup_permute_fn)) |
ex::unpack() | ex::bulk(subm_dist.size().get<C>(), permute_fn));
ex::start_detached(std::move(sender) | ex::transfer(di::getBackendScheduler<Backend::MC>()) |
ex::bulk(nperms, std::move(permute_fn)));

Number of tasks created by bulk might be huge.
I suggest addressing in a new PR.


The number might be large, but not larger than num_threads. That may of course still be too much, but just keep in mind that the thread_pool_scheduler specialization of bulk will not blindly create nperms tasks.
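For intuition on why the task count stays bounded: a thread-pool bulk implementation typically splits the n iterations into at most num_threads contiguous chunks rather than spawning one task per iteration. A generic sketch of such chunking (illustrative only, not pika's actual implementation):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative sketch (not pika code): split [0, n) into at most
// `nthreads` contiguous [begin, end) chunks whose sizes differ by at
// most one, as a thread-pool bulk implementation may do internally.
std::vector<std::pair<std::size_t, std::size_t>> chunkRanges(std::size_t n,
                                                             std::size_t nthreads) {
  std::vector<std::pair<std::size_t, std::size_t>> chunks;
  const std::size_t nchunks = std::min(n, nthreads);
  std::size_t begin = 0;
  for (std::size_t c = 0; c < nchunks; ++c) {
    // Spread the remainder over the first n % nchunks chunks.
    const std::size_t size = n / nchunks + (c < n % nchunks ? 1 : 0);
    chunks.emplace_back(begin, begin + size);
    begin += size;
  }
  return chunks;
}
```

So nperms iterations on a pool of num_threads workers yields at most num_threads tasks, each looping over its own contiguous range.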

@rasolca

rasolca commented May 28, 2024

cscs-ci run


@msimberg msimberg left a comment


Can't comment on algorithmic changes. Looks good otherwise.

@rasolca rasolca marked this pull request as draft May 29, 2024 10:06
@rasolca

rasolca commented May 29, 2024

Back to draft due to frequent hangs on santis

@rasolca rasolca removed this from the v0.5.0 milestone May 29, 2024
@rasolca rasolca modified the milestones: Optimizations, v0.5.1 May 29, 2024
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

Tridiagonal Solver (dist): Migrate permutation of local eigenvectors to GPU
3 participants