
[Core][Distributed] add fast broadcast for tensor dict #4757

Open · wants to merge 31 commits into base: main

Conversation

youkaichao
Member

Part of the ongoing effort of #4440.

Reduces the number of broadcasts from 2 to 1.

Broadcast time (before): 0.38772106170654297ms
Broadcast time (after): 0.128173828125ms
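
For reference, numbers like these can be collected with a small helper along the following lines (a sketch, not the exact benchmark used for this PR; it assumes an initialized tensor-parallel group and that broadcast_tensor_dict is importable from vllm.distributed.communication_op, as on main at the time):

import time
import torch
from vllm.distributed.communication_op import broadcast_tensor_dict

def time_broadcast(tensor_dict, iters=100, src=0):
    broadcast_tensor_dict(tensor_dict, src=src)  # warm-up call
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        broadcast_tensor_dict(tensor_dict, src=src)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1000  # ms per broadcast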

TODO:

  • improve the broadcast for prepare input in the same way.

@youkaichao
Member Author

The current TensorMetadata class is not good for serialization:

from vllm.distributed.communication_op import TensorMetadata
import torch
d = TensorMetadata("cuda", torch.float32, torch.Size([]))
import pickle
s = pickle.dumps(d)
len(s) # 120
import pickletools
pickletools.dis(s)

output:

    0: \x80 PROTO      4
    2: \x95 FRAME      109
   11: \x8c SHORT_BINUNICODE 'vllm.distributed.communication_op'
   46: \x94 MEMOIZE    (as 0)
   47: \x8c SHORT_BINUNICODE 'TensorMetadata'
   63: \x94 MEMOIZE    (as 1)
   64: \x93 STACK_GLOBAL
   65: \x94 MEMOIZE    (as 2)
   66: \x8c SHORT_BINUNICODE 'cuda'
   72: \x94 MEMOIZE    (as 3)
   73: \x8c SHORT_BINUNICODE 'torch'
   80: \x94 MEMOIZE    (as 4)
   81: \x8c SHORT_BINUNICODE 'float32'
   90: \x94 MEMOIZE    (as 5)
   91: \x93 STACK_GLOBAL
   92: \x94 MEMOIZE    (as 6)
   93: \x8c SHORT_BINUNICODE 'torch'
  100: \x94 MEMOIZE    (as 7)
  101: \x8c SHORT_BINUNICODE 'Size'
  107: \x94 MEMOIZE    (as 8)
  108: \x93 STACK_GLOBAL
  109: \x94 MEMOIZE    (as 9)
  110: )    EMPTY_TUPLE
  111: \x85 TUPLE1
  112: \x94 MEMOIZE    (as 10)
  113: R    REDUCE
  114: \x94 MEMOIZE    (as 11)
  115: \x87 TUPLE3
  116: \x94 MEMOIZE    (as 12)
  117: \x81 NEWOBJ
  118: \x94 MEMOIZE    (as 13)
  119: .    STOP
highest protocol among opcodes = 4

Each TensorMetadata instance takes 120 bytes when pickled.

@youkaichao
Member Author

After a8d1d3a, the serialization size is reduced by more than half (from 120 bytes to 52 bytes):

from vllm import TensorMeta
import torch
d = TensorMeta("cuda", torch.float32, torch.Size([]))
import pickle
s = pickle.dumps(d)
len(s) # 52
import pickletools
pickletools.dis(s)

output:

    0: \x80 PROTO      4
    2: \x95 FRAME      41
   11: \x8c SHORT_BINUNICODE 'vllm'
   17: \x94 MEMOIZE    (as 0)
   18: \x8c SHORT_BINUNICODE 'TensorMeta'
   30: \x94 MEMOIZE    (as 1)
   31: \x93 STACK_GLOBAL
   32: \x94 MEMOIZE    (as 2)
   33: )    EMPTY_TUPLE
   34: \x81 NEWOBJ
   35: \x94 MEMOIZE    (as 3)
   36: ]    EMPTY_LIST
   37: \x94 MEMOIZE    (as 4)
   38: (    MARK
   39: \x8c     SHORT_BINUNICODE 'cuda'
   45: \x94     MEMOIZE    (as 5)
   46: K        BININT1    17
   48: )        EMPTY_TUPLE
   49: e        APPENDS    (MARK at 38)
   50: b    BUILD
   51: .    STOP

@youkaichao
Member Author

With all of the above optimizations, the bytes needed to broadcast BlockMetaData are reduced from 260 to 107.

This benefit will become more significant when we apply the technique to the prepare-input data structures.

@rkooo567 rkooo567 self-assigned this May 12, 2024
Collaborator

@rkooo567 rkooo567 left a comment


"improve the broadcast for prepare input in the same way." -> planning to do this in this PR? Also can you tell me the perf improvement from it?

Also, can you update this test to use tp > 1? I think it can verify the correctness of this change.

metadata_list = []
tensor_list = []
for key, value in tensor_dict.items():
used_keys = keys or tensor_dict.keys()
Collaborator

Should we assert len(keys) == len(tensor_dict)?

Member Author

Not necessary, though. The current code is more flexible without the assert.

@dataclasses.dataclass
class TensorMeta:
    """
    This class is placed here to reduce the size of qualified name,
Collaborator

What about just creating vllm/tensor_meta.py? Is that still too long?

Member Author

Yes, vllm/tensor_meta.py would lead to vllm.tensor_meta.TensorMeta, which is longer than vllm.TensorMeta.
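
For illustration, a small standalone snippet (plain Python, not vLLM code) showing that pickle embeds the defining module path and class name verbatim, which is why the shorter qualified name shrinks every serialized instance:

import pickle
import pickletools

class Demo:
    pass

pickletools.dis(pickle.dumps(Demo()))
# The opcode dump contains SHORT_BINUNICODE '__main__' and 'Demo'.
# With the class exposed as vllm.TensorMeta, those strings are just
# 'vllm' and 'TensorMeta', instead of the much longer
# 'vllm.distributed.communication_op' and 'TensorMetadata'.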

Collaborator

Does tensor_meta make a big difference? If it is only a small difference (double-digit microseconds), I would prefer to avoid it...

tensor_list.append(value)
else:
metadata_list.append((key, value))
if keys is not None:
Collaborator

nit: why don't we just check it at line 213?

if keys is not None:
    metadata_list.append((key, value))
else:
    metadata_list.append(value)

Member Author

I think control flow inside the loop is more expensive (N branch checks) than control flow outside of the loop (1 branch check).

Collaborator

In [12]: def control():
    ...:     b = True
    ...:     result = []
    ...:     a = [i for i in range(3000)]
    ...:     for i in a:
    ...:         if b:
    ...:             result.append((i, i))
    ...:         else:
    ...:             result.append(i)

In [16]: def copy():
    ...:     b = True
    ...:     result = []
    ...:     a = [i for i in range(3000)]
    ...:     for i in a:
    ...:         result.append((i, i))
    ...:     result = [value for key, value in result]

In [22]: timeit copy()
192 µs ± 686 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [23]: timeit control()
159 µs ± 487 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Hmm, I actually tried it, and it looks like keeping the control flow in the loop is faster. But I think the perf difference here is not very meaningful (it is premature optimization). I was asking because I thought the alternative is easier to understand, but it's not a strong opinion. I will leave it up to you.


This class represents a dictionary of tensors with bounded metadata.
The upperbound of the buffer size is known a priori. Therefore, we can
pre-allocate a buffer for the metadata, and invoke only one collective
Collaborator

Is this correct because we are now broadcasting using a CPU tensor, so we don't need to broadcast the object size (which is an implementation detail of broadcast_object_list)?

Member Author

The key idea is not the CPU tensor, but that we know the maximum size of the serialized data, so we don't need to broadcast the length first. That length broadcast is indeed an implementation detail of broadcast_object_list.
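
To make the idea concrete, here is a minimal sketch (a hypothetical helper, not the code in this PR) of broadcasting metadata with a single collective call when an upper bound on the pickled size is known on every rank:

import pickle
import torch
import torch.distributed as dist

MAX_METADATA_BYTES = 1024  # assumed upper bound, known a priori on all ranks

def broadcast_metadata(obj=None, src=0):
    # assumes an initialized process group that supports CPU tensors (e.g. gloo)
    buf = torch.zeros(MAX_METADATA_BYTES, dtype=torch.uint8)
    if dist.get_rank() == src:
        data = pickle.dumps(obj)
        assert len(data) <= MAX_METADATA_BYTES
        buf[:len(data)] = torch.tensor(bytearray(data), dtype=torch.uint8)
    dist.broadcast(buf, src=src)  # one collective, no separate length broadcast
    # pickle stops at its STOP opcode, so the trailing zero padding is ignored
    return pickle.loads(buf.numpy().tobytes())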

Collaborator

Can you add to the comment that it relies on that implementation detail?

dtypes). If `cls` is provided, we can know the length of the metadata
roughly and allocate a buffer for it, then broadcasting metadata requires
only one broadcast call. Otherwise, we need to broadcast the metadata
length first, then broadcast the metadata.
Collaborator

Can you add a simple example of how to use TensorDictWithBoundedMetadata in the docstring?

@youkaichao
Member Author

"improve the broadcast for prepare input in the same way."

It will require another PR.

Also can you tell me the perf improvement from it?

For broadcasting blocks to swap/copy, the benefit is:

Broadcast time (before): 0.38772106170654297ms
Broadcast time (after): 0.128173828125ms

I don't have an end-to-end benchmarking.

update test_swap

It would require quite a large modification to the test procedure (separating the test into distributed tests). Meanwhile, the correctness is already checked in https://github.com/vllm-project/vllm/pull/4757/files#diff-cba46ef2b8ff23834781fa3b43794a3f19ffc6b4f1ec2353a8d13d1cdc2d0588R110 .

@youkaichao
Member Author

@rkooo567 can you help take a look at https://buildkite.com/vllm/ci/builds/7258#018f732a-46ad-4e69-a35b-25f5200d0e19 ? The failure looks like a Ray issue: the function cannot access the name List, although it is imported at the top of the file.

@njhill njhill self-requested a review May 13, 2024 19:07
@njhill
Collaborator

njhill commented May 13, 2024

@youkaichao it would be good to check whether there's a non-negligible performance difference in end-to-end tests before introducing the additional complexity; it's not always easy to infer this from a microbenchmark. A simple before/after test generating a decent number of tokens with a TP deployment would be sufficient, I think?

Do you know how much of the latency benefit comes from compressing the number of bytes with the new TensorMeta class vs eliminating one of the broadcasts?

The two broadcast_tensor_dicts (this one and prepare_input_tensors) are done immediately after each other, and it does not look like the second depends on the first; could we combine them?

Especially if they're combined, I'm wondering whether we can avoid the quite convoluted abstractions for what is just a single case.

Implementation-wise, instead of requiring the custom classes, what do you think about this:

  • Within broadcast_tensor_dict, maintain a single static tensor buffer (instead of per class). Its initial size can be like 16 bytes.
  • Use the first byte of the broadcast tensor to indicate the kind of message that follows, either
    1. a regular pickled object
    2. a buffer resize, where the following 4 or 8 bytes are the encoded int size that the buffer should be increased to. After this, the broadcast is repeated.

Then there's no need to maintain special classes. I also don't think there's any need for special handling of the keys; we can just pass lists instead of dicts?
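
A rough sketch of that scheme (hypothetical code with made-up names, only to illustrate the proposal) could look like:

import pickle
import struct
import torch
import torch.distributed as dist

MSG_OBJECT = 0  # remaining bytes hold a regular pickled object
MSG_RESIZE = 1  # next 8 bytes encode the new buffer size; the broadcast is then repeated

_buffer = torch.zeros(16, dtype=torch.uint8)  # single static buffer, grown on demand

def broadcast_object(obj=None, src=0):
    # assumes an initialized process group that supports CPU tensors (e.g. gloo)
    global _buffer
    while True:
        if dist.get_rank() == src:
            data = pickle.dumps(obj)
            if 1 + len(data) > _buffer.numel():
                # tell all ranks to grow their buffer, then broadcast again
                payload = bytes([MSG_RESIZE]) + struct.pack("<Q", 1 + len(data))
            else:
                payload = bytes([MSG_OBJECT]) + data
            _buffer[:len(payload)] = torch.tensor(bytearray(payload), dtype=torch.uint8)
        dist.broadcast(_buffer, src=src)
        raw = _buffer.numpy().tobytes()
        if raw[0] == MSG_RESIZE:
            new_size = struct.unpack("<Q", raw[1:9])[0]
            _buffer = torch.zeros(new_size, dtype=torch.uint8)
            continue  # the object itself arrives in the repeated broadcast
        return pickle.loads(raw[1:])

In the common case where the buffer is already large enough, this is a single broadcast per call.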

@njhill
Copy link
Collaborator

njhill commented May 13, 2024

@youkaichao another reason the above approach might be better - IIUC the get_example_metadata_list approach won't work if the size varies much at runtime (not sure whether that might be the case for prepare_input_tensors)?

@youkaichao
Member Author

First of all, this PR is the first step toward later optimizations. By itself it is a pure benefit, because it reduces the broadcast from twice to once.

The follow-up applying the optimization to prepare input needs to come after the refactor #4681.

Within broadcast_tensor_dict, maintain a single static tensor buffer (instead of per class). Its initial size can be like 16 bytes.
Use the first byte of the broadcast tensor to indicate the kind of message that follows, either
a regular pickled object
a buffer resize, where the following 4 or 8 bytes are the encoded int size that the buffer should be increased to. After this, the broadcast is repeated.

This does not reduce the number of broadcasts. It still requires two broadcasts even if we don't have any tensor data to broadcast.

@rkooo567
Collaborator

rkooo567 commented May 14, 2024

I think the nice benchmarks to back this up are:

  • How much is the input broadcast overhead in e2e latency? -> In our internal benchmark, we found this overhead is "very big" (https://docs.google.com/spreadsheets/d/1GMyebF9XwlLJzpkpRxZrzUNcQTSibHlQ7zifldaDPtI/edit#gid=0), almost as big as the model forward pass at high TP.
  • I think there are 2 parts we can optimize: 1. reduce the overhead of broadcast_object_list; 2. reduce the number of tensor broadcasts (we do one broadcast per tensor). I think this PR tackles 1, and it'd be great to know how much 1 contributes to the e2e broadcasting overhead (I believe @youkaichao already has the number).

Collaborator

@rkooo567 rkooo567 left a comment

For this PR, I think it LGTM.

I prefer to avoid having TensorMeta inside __init__.py unless the perf diff is very big. I will approve it for now, but please resolve the discussion with @njhill before merging it!



@youkaichao
Member Author

@njhill has a proposal to cache the max length of metadata based on the call site; I will wait and see how it works.

@njhill
Collaborator

njhill commented May 16, 2024

@youkaichao I've opened #4844 to show the idea, PTAL!
