[JIT] memory planning base with naive strategy #64347
Conversation
This PR adds static memory planning for traces with shape info. The three strategies implemented so far are stacked on top. They are:

* Linear scan (a heuristic based on https://www.usenix.org/legacy/events/vee05/full_papers/p132-wimmer.pdf)
* Greedy by size (a heuristic that allocates the largest tensors first)
* Greedy by operator breadth (a heuristic that allocates the largest tensors for the operators with the largest breadths first)

The latter two are based on https://arxiv.org/pdf/2001.03288.pdf. Most of the action is under `torch/csrc/jit/passes/memory_planning/`. Currently this only supports CPU. In principle there's nothing blocking GPU support (it just requires adding a cudaAllocator to GetAllocator), but there are lots of implicit assumptions encoded in the allocation strategies that are probably false for GPU. Currently the tests don't do anything except exercise the code paths; as we find corner cases we will add tests to check for them.
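For intuition, here is a minimal sketch of the linear-scan idea, operating on closed live ranges with known byte sizes. All names and types below are illustrative, not the actual ones in `torch/csrc/jit/passes/memory_planning/`:

```cpp
#include <algorithm>
#include <cstddef>
#include <list>
#include <utility>
#include <vector>

// Hypothetical request: a closed live range [begin, end] over the
// trace's instruction indices, plus the bytes the value needs.
struct Req { size_t begin, end, size, offset; };

// Fills in each request's offset; returns the total slab size needed.
size_t linearScan(std::vector<Req>& reqs) {
  std::sort(reqs.begin(), reqs.end(),
            [](const Req& a, const Req& b) { return a.begin < b.begin; });
  std::list<Req*> active;                          // currently-live requests
  std::list<std::pair<size_t, size_t>> free_list;  // recycled (offset, size) gaps
  size_t top = 0;                                  // slab high-water mark
  for (auto& r : reqs) {
    // Expire requests that died before this one starts; recycle their regions.
    active.remove_if([&](Req* p) {
      if (p->end < r.begin) {
        free_list.emplace_back(p->offset, p->size);
        return true;
      }
      return false;
    });
    // First fit into a recycled gap, otherwise bump the top of the slab.
    auto it = std::find_if(free_list.begin(), free_list.end(),
                           [&](const auto& g) { return g.second >= r.size; });
    if (it != free_list.end()) {
      r.offset = it->first;
      if (it->second > r.size)
        *it = {it->first + r.size, it->second - r.size};  // keep the gap's tail
      else
        free_list.erase(it);
    } else {
      r.offset = top;
      top += r.size;
    }
    active.push_back(&r);
  }
  return top;
}
```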
Cool, looks great!!

Do you mind commenting a bit about `memory_observer` and what it's doing / why it's doing it? EDIT: JK, this is all in the PR description... I will read that and re-review tomorrow.
```cpp
        },
        aliasAnalysisSpecialCase()),
    Operator(
        "prim::ReleaseSlab(Storage slab, ...) -> ()",
```
Hmm, `slab` here could relate to any number of Tensors, and it would be kind of a pain to track all of the dependencies there. I think it probably makes sense just to mark `prim::ReleaseSlab` as an op with side effects for now, which means we can't move any node around it and it won't get DCE'd.

I'm not sure what a good alternative is. We don't want to set the `slab` as containing the tensors which point to it, because that would pessimistically extend their lifetimes. We could maybe set the slab to directly alias these tensors, but it might be a little weird to have Values of different types alias each other, and that will also run into the same lifetime-extension issue.
I'm not sure what the constraints here are, but just FYI I don't know the value (or the sensibility) of trying to do alias analysis on the slices of `slab` that get handed out.
Yea, I think you're right. We might want to just make it an invariant that memory planning is the last pass we run, and throw if we see any of these nodes in alias analysis. What do people think? cc @Krovatkin @desertfire

What I was trying to avoid is a future pass moving nodes around that have dependencies which are not reflected topologically. E.g., you can't move `prim::ReleaseSlab` around another node.
The more I think about it, all of the logic here depends on the topology being frozen. We should probably just throw in alias analysis if we see these nodes and enforce the invariant that this is the final pass.
> reflected topologically

Aren't `uses` relationships exactly topological relationships? All of the `AllocateTensor` nodes have a source -> node -> sink relationship with `AllocateSlab` and whichever op consumes the output of `AllocateTensor`?
Say you have:

```
graph(a, b, c, d):
    y = a + b
    z = c + d
    y1 = y + z
    y0 = a + c
    z0 = b + d
    z1 = y0 + z0
    return z1 + y1
```

There is no topological constraint about whether to compute `y`/`z` or `y0`/`z0` first; however, their lifetimes will be implicitly baked into the memory planning scheme. Let's say that `y`/`z` are at offsets 0 and 256, and then those same offsets/sizes get reused for `y0`/`z0`. If you were to move `y0 = a + c` above `z = c + d`, that would overwrite the value of `y`; however, there is nothing reflected topologically that says this is an invalid move.
@eellison this is a good (very good) observation, but if you move `y0 = a + c` above `z = c + d`, then you would also be required to move its `AllocateTensor` with it, right? In which case you could not mistakenly overwrite (unless I'm mistaken).
"prim::ReleaseSlab(Storage slab, ...) -> ()", | ||
[](Stack* stack) { | ||
auto num_inputs = pop(stack).toInt(); | ||
std::vector<at::Tensor> inputs(num_inputs - 1); |
What is the purpose of taking in Tensors here? Otherwise, as soon as a Tensor has its last use, its ref-count goes to zero and it gets deallocated. What does having the Tensors as inputs provide?
For one thing, there's a pass in `CodeImpl` that'll prune unused tensors. For another, I thought being perfectly explicit about free/destruction would be a good thing (this pertains to your comments further down about ref counts).
Why is a Tensor being pruned after all its uses a bad thing?
So despite what I have implemented here currently, I think the answer to this question is "because otherwise we would get a double free". My reasoning: if you

1. don't delete the `deleter` at the `Storage` allocation site of the temp tensor (as `AllocateTensor` does now; note I'm talking about the `Storage` abstraction rather than actual memory), and
2. let tensors go out of scope before the end of the run (and therefore let `Storage` perform a free),

then you will get a double free/the slab being overwritten by the system memory manager. I'm not 100% sure of this because I haven't closely studied `StorageImpl`, but intuitively this is how it should work.

So the right thing to do is either

1. delete the deleter and then only free the slab in `ReleaseSlab`, or
2. don't delete the deleter but "use" the tensor in `ReleaseSlab` so that it doesn't get freed before the slab gets freed.

In actuality I think the second implementation also leads to a double free (first the tensor, then the slab), and so the only correct thing is 1. Indeed, this is basically what static runtime does:
```cpp
void MemoryPlanner::allocate() {
  buffer_ = allocateBuffer(managed_bytes_);
  ...
  size_t offset = 0;
  uint8_t* start = static_cast<uint8_t*>(buffer_.get());
  ...
  void* src = static_cast<void*>(start + offset);
  ...
  for (auto* tensor : tensors) {
    // Swap in a DataPtr that points into the slab with a null deleter,
    // so the tensor's storage never frees slab memory on its own.
    tensor->storage().set_data_ptr_noswap(
        at::DataPtr(src, src, nullptr, tensor->device()));
    ...
  }
  ...
}

void MemoryPlanner::deallocate() {
  for (auto& ms : managed_tensors_) {
    const auto& tensors = ms.second;
    for (auto& tensor : tensors) {
      // Reset each managed storage so nothing dangles into the slab...
      tensor->storage().unsafeGetStorageImpl()->reset();
    }
  }
  // ...then release the slab itself in exactly one place.
  buffer_ = {};
}
```
Is the storage not also ref-counted? What do we think is going to be double free'd? Because I don't think the storage will be. We can sync more on this or clear it up in a follow-up as well.
```cpp
          pop(stack, slab);
          uint8_t* start = static_cast<uint8_t*>(slab.data());
          void* src = static_cast<void*>(start + offset);
          at::Tensor temp_tensor = at::from_blob(
```
Are there any assertions/invariants we want to add here? (Maybe not, I don't know.)
Initially I had some checks about exceeding max memory and stuff like that, but it's pretty superficial I think, since there's a validation pass in the planner itself.
```cpp
            temp_tensor.unsafeReleaseTensorImpl()->release_resources();
          }
          auto slab = pop(stack).toStorage();
          // slab.allocator()->raw_deallocate(slab.data());
```
Commented-out code.
This, like the free of the temp tensors, is up for debate (explicitness vs. concision).
The current state of the interpreter is that it handles the freeing of Tensors after their final use. I don't see a compelling reason to duplicate that logic.
Yea, that makes sense - just wasn't sure what would be "best practice". In that case the `ReleaseSlab` node is unnecessary in its entirety.
```cpp
namespace torch {
namespace jit {

int overlap(size_t a, size_t b, size_t c, size_t d) {
```
You are not returning a bool result because `overlap` is used for both LiveRange and MemRegion. Is it possible to unify both by making LiveRange in the `[,)` form?
Yea, you could, by adding 1 to the right endpoint, since [a,b] doesn't intersect [c,d] if [a,b+1) doesn't intersect [c,d+1) (since all endpoints are integers). I guess that's simpler.
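For reference, with half-open ranges the check collapses to one comparison per side; a minimal sketch (hypothetical helper, not the PR's actual code):

```cpp
#include <cstddef>

// Half-open intervals [a, b) and [c, d) intersect iff each one
// starts strictly before the other ends.
bool overlapHalfOpen(size_t a, size_t b, size_t c, size_t d) {
  return a < d && c < b;
}
```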
FYI, the debug name is not guaranteed to be stable (but it is guaranteed to be unique). We should not use this for any sort of mapping to a `Value*`.
A few more comments... more to come.
```cpp
namespace torch {
namespace jit {

int overlap(size_t a, size_t b, size_t c, size_t d) {
```
Nit: can we name the inputs better?
```cpp
  TORCH_INTERNAL_ASSERT(a <= b);
  TORCH_INTERNAL_ASSERT(c <= d);
  size_t outer = std::max(b, d) - std::min(a, c);
  size_t l1 = (b - a), l2 = (d - c);
```
Nit: rename `l1`, `l2` more descriptively.
```cpp
        continue;
      }
      auto size = computeStorageSize(*out_v);
      if (size > 0 && !isOptimizableContainerType(node, node_has_out_variant)) {
```
What does `isOptimizableContainerType` mean here? The name is a little vague... maybe add a comment.
```cpp
        managed_values.insert(
            {out_v, {{live_ranges[out_v], out_v->debugName()}, size.value()}});
      } else {
        leaked_values.insert(out_v);
```
Sorry, how is "leaked" different from "unmanaged"? Could we just insert them into unmanaged here? "Leaked" gives the impression of memory leakage, which is not what we're doing here.
Yea, fair. This is language from static runtime, but you're right, it's basically unmanaged memory. I'll change it.
This gets at the core of some of the fundamental issues - what is a unique and stable mapping from Value/tensor to a semantically meaningful identifier? At first I was sticking the entire stacktrace (up to the allocator call) in that id string, but not only is that slow, it's not even stable, because of control flow!
Awesome, looks great!!! 🚢 🚢 🚢

Follow-ups / nice-to-haves

Smaller:

- figure out why `prim::AllocateTensor` can't be schematized
- don't store `TypePtr` on `prim::AllocateTensor`

Larger:

- don't try to handle aliased tensors
- automatic inplacing (`add` -> `add_`)
- don't have a dependency on strides; propagate `is_dense` information instead
Stack from ghstack:
This PR adds static memory planning for traces with shape info. The three strategies implemented so far are stacked on top. They are:

* Linear scan (a heuristic based on https://www.usenix.org/legacy/events/vee05/full_papers/p132-wimmer.pdf)
* Greedy by size (a heuristic that allocates the largest tensors first)
* Greedy by operator breadth (a heuristic that allocates the largest tensors for the operators with the largest breadths first)

The latter two are based on https://arxiv.org/pdf/2001.03288.pdf.
Differential Revision: D30769100
This stack of PRs has gone through a lot of revisions (as you can see from the number of commits) and this module should've probably had a design Quip, but here we are, so I'll briefly summarize the design here.
Memory Planner
We first perform alias analysis to find all of the values/tensors we'll manage allocations for (and their lifetimes). This reuses functionality from static runtime. With these in hand we can delegate to one of several planning strategies (the interface for these planners is a sorted map from lifetime (i.e., `[a,b]`) to size of required memory). Note that these required memory sizes need to be procured by some other mechanism (e.g. shape analysis or LTC or runtime profiling). One technical detail is that we explicitly enforce unique lifetimes through the type

```cpp
struct UniqueLiveRange {
  LiveRange lvr;
  std::string id;
};
```

This is to anticipate when this will be extended to GPU, where it's possible that concurrent ops will have memory requirements with identical lifetimes. Currently the `id` is the `debugName` of the output tensor, but later on it will be something like a stack trace (i.e. uniquely identifying where the allocation request was made *semantically*).

The planner produces a sequence of assignments of (`offset`, `size`) allocations corresponding to lifetimes. Once a plan is constructed it is validated, and then we execute a graph pass to insert nodes into the graph that implement slab allocation, memory slicing, and slab freeing.
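As a concrete illustration of that interface — live ranges with sizes going in, `(offset, size)` assignments coming out — here is a minimal greedy-by-size sketch; the types and names are illustrative, not the PR's actual ones:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the planner's inputs: a closed live range
// [begin, end] plus the byte size the value needs.
struct LiveRange { size_t begin, end; };
struct Alloc { LiveRange lvr; size_t size; size_t offset = 0; };

bool rangesOverlap(const LiveRange& a, const LiveRange& b) {
  return a.begin <= b.end && b.begin <= a.end;
}

// Place the largest tensors first, each at the lowest offset that does
// not collide with an already-placed allocation whose live range overlaps.
void greedyBySize(std::vector<Alloc>& reqs) {
  std::sort(reqs.begin(), reqs.end(),
            [](const Alloc& a, const Alloc& b) { return a.size > b.size; });
  std::vector<const Alloc*> placed;
  for (auto& r : reqs) {
    size_t offset = 0;
    bool moved = true;
    while (moved) {  // bump past every conflicting placement until stable
      moved = false;
      for (const Alloc* p : placed) {
        bool regions_collide =
            offset < p->offset + p->size && p->offset < offset + r.size;
        if (rangesOverlap(p->lvr, r.lvr) && regions_collide) {
          offset = p->offset + p->size;  // slide past this region
          moved = true;
        }
      }
    }
    r.offset = offset;
    placed.push_back(&r);
  }
}
```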
Registering Ops
We add three new primitive ops: `prim::AllocateSlab`, `prim::AllocateTensor`, and `prim::ReleaseSlab`. One general uncertainty I have is around the right way to perform the allocation and the freeing, since there seem to be several equivalent ways (e.g. you can swap storage with `tensor->storage().set_data_ptr_noswap` like in static runtime).
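For intuition, a hypothetical sketch of what the planned graph could look like after the pass runs; the schemas, attributes, and the out-variant call below are illustrative guesses, not the PR's actual output:

```
graph(%a : Tensor, %b : Tensor):
  # one slab for the whole trace, sized by the chosen strategy
  %slab : Storage = prim::AllocateSlab()
  # a view into the slab at a planned (offset, size)
  %buf : Tensor = prim::AllocateTensor(%slab)
  # an op writes its output into the planned tensor
  %y : Tensor = aten::add(%a, %b, %buf)
  # release the slab once everything that lives in it is dead
  prim::ReleaseSlab(%slab, %y)
  return (%y)
```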
Memory Observer
This is functionality to collect data about allocations and frees at runtime. It's modeled on kineto but adds `FrameNodId`, a means of figuring out which node triggered which allocation (currently kineto only knows about dispatcher calls rather than graph nodes).

Lumped in with this PR it serves only to validate test cases and therefore might seem a tad over-engineered. But in reality it's a rebase that started life as a necessary component of "memorization"-based planning. Indeed, it's modeled on kineto because it started as additions to kineto that got too orthogonal from kineto's purpose.
Odds and ends
`valid_add` and `valid_sub` check for overflow since we're dealing with arithmetic on `size_t`. On GCC and Clang we have `__builtin_add_overflow` and `__builtin_sub_overflow`, but not on MSVC, so there we use the "dumber" checks `a + b >= a` and `a >= b` (respectively). It could be argued that we should just drop the more opaque builtins completely, but I'm not sure.
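A minimal sketch of what those checks amount to (the helper names come from the PR, but the exact signatures here are assumptions):

```cpp
#include <cstddef>

// Overflow-checked size_t addition: use the builtin where available,
// otherwise fall back to the unsigned-wraparound check a + b >= a.
bool valid_add(size_t a, size_t b, size_t& out) {
#if defined(__GNUC__) || defined(__clang__)
  return !__builtin_add_overflow(a, b, &out);
#else
  if (a + b < a) return false;  // wraparound means overflow
  out = a + b;
  return true;
#endif
}

// Underflow-checked size_t subtraction: valid iff a >= b.
bool valid_sub(size_t a, size_t b, size_t& out) {
  if (a < b) return false;
  out = a - b;
  return true;
}
```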
`PreprocessGraph` is made public because the interpreter preprocesses graphs before running (SSA names are changed, amongst other things), which makes verifying that the planner planned successfully impossible. With `PreprocessGraph` we can anticipate and reconcile those changes to the graph.