
[Distributed][auto-parallel] Sharding Specification and rule discovery #336

Merged
merged 54 commits into hidet-org:auto-parallel on Aug 4, 2023

Conversation

soodoshll (Collaborator)

This PR tries to implement steps 1) and 3) mentioned in #335 (comment)

It contains an example of automatically discovering the sharding rules of all operators in the resnet model, which involves a dynamic batch size. You can run it with:

python -m hidet.distributed.partition.rule

Currently we only support discovering 1-D partitions.

The rule discovery process basically enumerates all possible input-sharding specifications and checks their validity. The check takes two steps:

  1. Shape check: we generate a sharded version of the input tensors and re-forward the op to get new output tensors. We then align the new outputs with the original whole output tensors to find which output dimensions are sharded. For example, if the input $A$ is an 8x8 matrix, the op is $sin$, and we shard the first dimension of $A$ into 2 partitions, the sharded input has shape 4x8 and the output has shape 4x8, so we know that the first dimension of the output is also sharded. (A small sketch of this check follows the list.)
  2. Data dependency check: passing the shape check does not guarantee correctness. For example, if we have an 8x8 matrix $A$ and the op is $softmax$, sharding along the second dimension gives the correct shape but wrong results, because the denominator of $softmax$ depends on the complete second dimension. So we analyze the range of data the output depends on; the sharding is valid only if each output shard depends only on its corresponding input shard. For example, any shard of $A+1$ depends only on the same shard of $A$, so it is valid. We trace the indexing expressions (TensorElement) from the fcompute of all outputs to get the range of input indices that are accessed, and then use the Z3 SMT solver to check whether any accessed input index falls outside the shard boundary. If not, the sharding is valid.
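
For illustration, here is a minimal sketch of the shape check in step 1. shard_tensor is a hypothetical helper (not part of this PR), while op.reforward mirrors the method used later in the diff; treat this as a sketch of the idea, not the actual implementation.

def infer_output_shard_dims(op, inputs, outputs, shard_dim, num_shards=2):
    # Shard the chosen dimension of every input, then re-run the op on the shards.
    sharded_inputs = [shard_tensor(x, shard_dim, num_shards) for x in inputs]
    try:
        sharded_outputs = op.reforward(sharded_inputs)
    except (ValueError, RuntimeError):
        return None  # the op rejects the sharded shapes, so this sharding is invalid
    out_shard_dims = []
    for whole, shard in zip(outputs, sharded_outputs):
        # The output dimension that shrank by the sharding factor is the sharded one,
        # e.g. sin on an 8x8 input sharded to 4x8 yields a 4x8 output: output dim 0 is sharded.
        dims = [i for i, (w, s) in enumerate(zip(whole.shape, shard.shape)) if w == s * num_shards]
        out_shard_dims.append(dims[0] if len(dims) == 1 else None)
    return out_shard_dims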

Note that even the data dependency check does not completely guarantee correctness (if the op plays tricks), but I find it covers most of the ops I have encountered. We still need to test it more thoroughly, though.

I can add more details about it to the RFC.

cc @yaoyaoding @xinli-git

yaoyaoding and others added 13 commits July 28, 2023 21:00
Fix llama2 and add test for num_heads != num_key_value_heads.

---------

Co-authored-by: Allan Lin <allan.lin@centml.ai>
@yaoyaoding (Member)

Hi @soodoshll,

Can I know the main purpose of the z3 solver?

We also have a bound analyzer as well as an arithmetic simplifier in hidet; is there any limitation of our internal analyzer and simplifier?

We need to be cautious when introducing new dependencies, so more justification is needed before adding a new one.

@soodoshll (Collaborator, Author)

soodoshll commented Jul 31, 2023

Yes, we can replace it with the internal analyzer, except for dynamic shapes, where symbols might appear in the boundary expression.
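
To make this concrete, here is a minimal sketch (illustrative only, not the PR's code, and Z3 was eventually dropped) of the kind of out-of-shard-boundary query described above, with a symbolic batch size n standing in for a dynamic shape:

import z3

n = z3.Int('n')            # dynamic-shape symbol, e.g. a symbolic batch size
shard_rows = z3.Int('m')   # rows owned by each of the 2 shards
i = z3.Int('i')            # row index of an output element within one shard

solver = z3.Solver()
solver.add(n == 2 * shard_rows, shard_rows > 0)   # 1-D partition of the rows into 2 shards
solver.add(i >= 0, i < shard_rows)                # i ranges over one output shard

# For an elementwise op such as A + 1, output element (i, j) reads input element (i, j),
# so the accessed input row is simply i. Ask whether any access can leave the shard:
accessed_row = i
solver.add(z3.Or(accessed_row < 0, accessed_row >= shard_rows))

print(solver.check())  # unsat: no accessed index escapes the shard, so this sharding is valid

A bound analyzer can answer the same query when all extents are concrete integers; the symbolic n is what made an SMT solver attractive here.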

I'm not sure whether we need to support dynamic shapes for auto-sharding, since if we later use ILP to find the optimal sharding strategy, all shape symbols need to be concretized; it is difficult to solve an optimization problem involving dynamic shapes.

Therefore, if our ultimate goal is a function like hidet.distributed.partition(graph), the shape symbols need to be concretized at some point, and we partition this concretized version. Nevertheless, since the ops of the concretized and dynamic graphs are paired one-to-one, we can project the partitioning of the concretized graph back onto the dynamic-shaped graph, which is runnable but might have suboptimal performance.

So the whole pipeline might be like:

build flowgraph(dynamic) ---> concretization (concrete) ---> auto parallelization --> partition plan (concrete)
          |                                                                                      |
          +<-------------------------------------------------------------------------------------+
          | (projecting)
          v
partitioned graph(dynamic) -> optimize/compile

In short, if auto-partitioning only happens for concretized shapes, then we do not need Z3.

@yaoyaoding @xinli-git Do you think the pipeline above makes sense?

@soodoshll (Collaborator, Author)

z3 has been removed

@yaoyaoding (Member)

Makes sense. For different sequence lengths, the best partition strategy might be different. Thus, we do need to optimize for some specific input size and then use the found strategy for all sequence lengths.

@yaoyaoding (Member)

Hi @soodoshll, ping me when the PR is ready to be reviewed. Thanks!

soodoshll and others added 2 commits August 1, 2023 00:55
Assuming `U1` and `U2` are unary elementwise operators, and `B` is a
binary elementwise operator.
Introduce a new fused operator called Composite Elementwise Operation,
where `comp_elemwise(x, U1, U2, B) = B(U1(x), U2(x))`.
Also add a subgraph rewrite rule. This allows for more fusion opportunities.
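
As a tiny, self-contained illustration of that fused pattern (NumPy lambdas stand in for hidet operators; the SiLU decomposition is just an example, not taken from the commit):

import numpy as np

def comp_elemwise(x, u1, u2, b):
    # Evaluate B(U1(x), U2(x)) in one place, mirroring the fused operator's definition.
    return b(u1(x), u2(x))

x = np.linspace(-1.0, 1.0, 8)
# Example: SiLU(x) = x * sigmoid(x), i.e. U1 = identity, U2 = sigmoid, B = multiply.
y = comp_elemwise(x, lambda t: t, lambda t: 1.0 / (1.0 + np.exp(-t)), np.multiply)
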
@soodoshll (Collaborator, Author)

soodoshll commented Aug 1, 2023

Hi @yaoyaoding. Update: I found that the engineering effort for the end-to-end partitioning pipeline is not as large as we thought. It is also hard to tell whether a design is good without seeing how it is used by downstream components and what is lacking. So I think I can first build a prototype of the whole auto-partition pipeline and then iteratively improve it.

I spent a few days building a prototype covering sharding rule discovery, partition scheme search (ILP), communication op injection, weight sharding, the launching script, etc.

I'm working on this branch; the code is still messy and needs to be refactored. It's runnable except for gathering the final outputs from different GPUs.

https://github.com/soodoshll/hidet/blob/auto-parallel-connect/examples/distributed/resnet.py

Once it's finished, we can have something like python -m hidet.distributed.launch [num_gpus] resnet.py [out_dir] that helps us auto-partition the graph and run inference.

…-org#339)

This PR clears the intermediate object files generated during tuning.

This should reduce the cache size by 2/3 without hurting any functionality.

hidet-org#338
@yaoyaoding (Member)

Hi @soodoshll, thanks for the update!

After you finish the whole pipeline, you can split it into several PRs and @xinli-git and I can help to review.

@xinli-git (Collaborator) left a comment

I am still going through the code to understand the sharding rules better but posting some comments first.

Feel free to merge without me once Yaoyao thinks it looks good :)


class IndexRewriter(ExprRewriter, ComputeRewriter):
    def __init__(self):
        super().__init__()
Collaborator

I believe this is implicitly just calling ExprRewriter.__init__?

Maybe a nitpick, but relying on the MRO might be a bit error-prone (if, say, the inheritance order changes in the future, or a different method with the same name is added to the first base class); it might be easier to just explicitly call Base.__init__(self)? I guess the same applies to super().visit, etc.

Member

Hi @xinli-git, the current writing looks fine. This function does not require any explicit ordering when calling its parent classes' __init__ methods. The default super().__init__() calls them in MRO order, and it works as we expect. Explicitly calling a particular parent class's __init__ might skip the other parent classes and is error-prone.
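
A small self-contained illustration of this point (the class names are placeholders, not the actual hidet classes): with cooperative super().__init__(), every base initializer runs exactly once, in MRO order.

class ExprRewriterDemo:
    def __init__(self):
        print("ExprRewriterDemo.__init__")
        super().__init__()

class ComputeRewriterDemo:
    def __init__(self):
        print("ComputeRewriterDemo.__init__")
        super().__init__()

class IndexRewriterDemo(ExprRewriterDemo, ComputeRewriterDemo):
    def __init__(self):
        super().__init__()  # follows the MRO: ExprRewriterDemo, then ComputeRewriterDemo, then object

IndexRewriterDemo()
# prints:
#   ExprRewriterDemo.__init__
#   ComputeRewriterDemo.__init__

If IndexRewriterDemo.__init__ instead bypassed super() and called only ExprRewriterDemo.__init__(self), whether ComputeRewriterDemo.__init__ ever runs would depend on ExprRewriterDemo continuing to call super(), which is the fragility being discussed.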

@@ -63,3 +63,61 @@ def test_llama2(device, opt):
    print(current_memory_pool("cuda"))
    print(current_memory_pool("cpu"))
    print(current_memory_pool("vcuda"))

Collaborator

Is this a GitHub bug? These changes are already merged in https://github.com/hidet-org/hidet/pull/333/files?

Collaborator

Oh, never mind. Can we rebase the auto-parallel branch onto main, so this PR only includes the related changes?

found_rules = []

# We do not allow dynamic shapes
assert len(op.task.symbols) == 0
Collaborator

Can we raise NotImplementedError instead?

python/hidet/distributed/partition/rule.py (outdated, resolved)
    if len(new_inputs) < len(inputs):
        continue
    outputs = op.reforward(new_inputs)
except (ValueError, RuntimeError, AssertionError, IndexError):
Collaborator

how do we arrive at these exceptions? I assume through trial and error?

Collaborator Author

Yes. For example, if we have an op that adds up two 4x4 matrices, and we try to shard one of them along the first dimension, then it becomes 2x4 + 4x4, which is illegal.
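
For instance, a NumPy stand-in for that failure mode (illustrative only; in the PR these exceptions are caught around op.reforward):

import numpy as np

a = np.ones((2, 4))  # the first 4x4 input sharded along its first dimension
b = np.ones((4, 4))  # the second input left whole
try:
    a + b            # shapes no longer match, so re-running the op raises
except ValueError as e:
    print("invalid sharding:", e)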

# Only works for 1-D partition
slice_list = []
for i in range(len(tensor.shape)):
    if i != shard_dim:
Collaborator

Regarding -1 vs None: I think the code would be cleaner with None meaning "not sharded", avoiding -1. For example, here we can just check whether shard_dim is None and be more explicit.

Collaborator Author

Yes, especially given that -1 can also stand for the last dimension :( I will switch to None.
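
For example, a small sketch of the None convention being suggested (shard_slices is a hypothetical helper, not the PR's exact code):

def shard_slices(shape, shard_dim, rank, num_shards):
    # shard_dim=None means "not sharded": select the whole tensor.
    if shard_dim is None:
        return tuple(slice(None) for _ in shape)
    slices = []
    for i, extent in enumerate(shape):
        if i != shard_dim:
            slices.append(slice(None))  # keep the full dimension
        else:
            step = extent // num_shards
            slices.append(slice(rank * step, (rank + 1) * step))
    return tuple(slices)

# shard_slices((8, 8), shard_dim=0, rank=1, num_shards=2) selects rows 4..7 and all columns.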

@yaoyaoding (Member) left a comment

It looks good to me now.

Feel free to merge after fixing the existing comments. Thanks @soodoshll !

def __init__(self):
    super().__init__()
    self.var2idx: Dict[Var, Expr] = {}
    self.input_accessed_indices: Dict[TensorInput, List[Expr]] = {}
Member

Dict[TensorInput, List[List[Expr]]] ?

Comment on lines 70 to 75
index_rewriter = IndexRewriter()
for o in self.op.task.outputs:
    if not isinstance(o, GridCompute):
        self.valid = False
        return  # we don't support scalar output
    index_rewriter.visit(o)
Member

Suggested change

-index_rewriter = IndexRewriter()
-for o in self.op.task.outputs:
-    if not isinstance(o, GridCompute):
-        self.valid = False
-        return  # we don't support scalar output
-    index_rewriter.visit(o)
+index_rewriter = IndexRewriter()
+index_rewriter.visit(self.op.task.outputs)
+self.valid = all(isinstance(out, GridCompute) for out in self.op.task.outputs)

@soodoshll merged commit 0e860be into hidet-org:auto-parallel on Aug 4, 2023
1 of 2 checks passed
soodoshll added a commit that referenced this pull request Aug 4, 2023
…discovery (#342)

Please refer to #336 for original
discussion.

I accidentally messed up the branch :( so I re-opened this PR.

I have reset upstream/auto-parallel to the latest commit of the main branch. This PR only contains auto-partition-related modifications.

---------

Co-authored-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>