
[JIT] memory planning base with naive strategy #64347

Closed
wants to merge 20 commits

Conversation

makslevental
Contributor

@makslevental makslevental commented Sep 1, 2021

Stack from ghstack:

This PR adds static memory planning for traces with shape info. The three strategies implemented so far are stacked on top. They are:

* Linear scan (a heuristic based on https://www.usenix.org/legacy/events/vee05/full_papers/p132-wimmer.pdf)
* Greedy by size (a heuristic that allocates largest tensors first)
* Greedy by operator breadth (a heuristic that allocates largest tensors for operators with largest breadths first)

The latter 2 are based on https://arxiv.org/pdf/2001.03288.pdf.

Differential Revision: D30769100

This stack of PRs has gone through a lot of revisions (as you can see from the # of commits), and this module should've probably had a design quip, but here we are, so I'll briefly summarize the design here.

# Memory Planner

We first perform alias analysis to find all of the values/tensors we'll manage allocations for (and their lifetimes). This reuses functionality from static runtime. With these in hand we can delegate to one of several planning strategies (the interface for these planners is a sorted map from lifetime (i.e., `[a,b]`) to the size of the required memory). Note that these required memory sizes need to be procured by some other mechanism (e.g. shape analysis or LTC or runtime profiling). One technical detail is that we explicitly enforce unique lifetimes through the type

```cpp
struct UniqueLiveRange {
  LiveRange lvr;
  std::string id;
};
```

This is to anticipate when this will be extended to GPU where it's possible that concurrent ops will have memory requirements that have identical lifetimes. Currently the `id` is the `debugName` of the output tensor but later on it will be something like a stack trace (i.e. uniquely identifying where the allocation request was made *semantically*).

The planner produces a sequence of assignments of (`offset`, `size`) allocations corresponding to lifetimes.
Once a plan is constructed it is validated and then we execute a graph pass to insert nodes into the graph that implement slab allocation, memory slicing, and slab freeing.
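For concreteness, a strategy's contract looks roughly like the sketch below (the comparator, `SortedSizeMap`, `MemAllocation`, and `naiveStrategy` here are illustrative names, not necessarily what's in the diff):

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct LiveRange {
  size_t begin; // index of the first node at which the value is live
  size_t end;   // index of the last node at which the value is live (inclusive)
};

struct UniqueLiveRange {
  LiveRange lvr;
  std::string id; // currently the output Value's debugName
};

// Order lifetimes by start, then end, then id, so identical ranges stay distinct.
struct UniqueLiveRangeCmp {
  bool operator()(const UniqueLiveRange& a, const UniqueLiveRange& b) const {
    if (a.lvr.begin != b.lvr.begin) return a.lvr.begin < b.lvr.begin;
    if (a.lvr.end != b.lvr.end) return a.lvr.end < b.lvr.end;
    return a.id < b.id;
  }
};

// Strategy input: lifetime -> required bytes (procured by shape analysis, LTC, profiling, ...).
using SortedSizeMap = std::map<UniqueLiveRange, size_t, UniqueLiveRangeCmp>;

// Strategy output: an (offset, size) region within the slab for each lifetime.
struct MemAllocation {
  UniqueLiveRange ulvr;
  size_t offset;
  size_t size;
};

// A deliberately naive strategy: bump-allocate, never reusing memory across lifetimes.
std::vector<MemAllocation> naiveStrategy(const SortedSizeMap& required) {
  std::vector<MemAllocation> plan;
  size_t offset = 0;
  for (const auto& kv : required) {
    plan.push_back({kv.first, offset, kv.second});
    offset += kv.second;
  }
  return plan;
}
```

The real strategies (linear scan, greedy by size, greedy by operator breadth) differ in how they choose offsets so that values with non-overlapping lifetimes can reuse the same region of the slab.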

# Registering Ops

We add three new primitive ops `prim::AllocateSlab`, `prim::AllocateTensor`, and `prim::ReleaseSlab`. One general uncertainty I have is around the right way to perform the allocation and the freeing, since there seem to be several equivalent ways (e.g. you can swap storage with `tensor->storage().set_data_ptr_noswap` like in static runtime).
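Conceptually, `prim::AllocateTensor` just carves a view out of the slab at the planned offset; something like this standalone sketch (the helper name is made up; the real registrations live in `register_prim_ops_fulljit.cpp` and mirror the `at::from_blob` snippet quoted in the review below):

```cpp
#include <cstddef>

#include <ATen/ATen.h>

// Illustrative only: produce a tensor that views into the slab at a planned offset.
at::Tensor allocate_from_slab(
    const at::Storage& slab, // produced by prim::AllocateSlab
    size_t offset,           // byte offset assigned by the planner
    at::IntArrayRef sizes,
    at::IntArrayRef strides,
    at::ScalarType dtype) {
  auto* start = static_cast<uint8_t*>(slab.data());
  void* src = static_cast<void*>(start + offset);
  // No deleter is installed: the slab owns the memory, the tensor is just a view into it.
  return at::from_blob(src, sizes, strides, at::TensorOptions().dtype(dtype));
}
```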

# Memory Observer

This is functionality to collect data about allocations and frees at runtime. It's modeled on kineto but adds `FrameNodId`, a means of figuring out which node triggered which allocation (currently kineto only knows about dispatcher calls rather than graph nodes).

Lumped in with this PR it serves only to validate test cases and therefore might seem a tad over-engineered. But in reality it's a rebase of something that started life as a necessary component of "memoization"-based planning. Indeed it's modeled on kineto because it started as additions to kineto that got too orthogonal to kineto's purpose.

# Odds and ends

`valid_add` and `valid_sub` check for overflow since we're dealing with arithmetic on `size_t`. On GCC and Clang we have `__builtin_add_overflow` and `__builtin_sub_overflow`, but not on MSVC, so there we use the "dumber" checks `a + b >= a` and `a >= b` (respectively). It could be argued that we should just drop the more opaque variant completely, but I'm not sure.
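For reference, the shape of these helpers is roughly the following (the exact signatures in the diff may differ):

```cpp
#include <cstddef>

// Returns true and writes a + b to *out if the sum doesn't overflow size_t.
bool valid_add(size_t a, size_t b, size_t* out) {
#if defined(__GNUC__) || defined(__clang__)
  return !__builtin_add_overflow(a, b, out);
#else
  if (a + b < a) return false; // unsigned wraparound means overflow
  *out = a + b;
  return true;
#endif
}

// Returns true and writes a - b to *out if the difference doesn't underflow.
bool valid_sub(size_t a, size_t b, size_t* out) {
#if defined(__GNUC__) || defined(__clang__)
  return !__builtin_sub_overflow(a, b, out);
#else
  if (a < b) return false; // would wrap below zero
  *out = a - b;
  return true;
#endif
}
```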

`PreprocessGraph` is made public because the interpreter preprocesses graphs before running them, which makes verifying that the planner planned successfully impossible (SSA names are changed, amongst other things). With `PreprocessGraph` we can anticipate and reconcile those changes to the graph.


Most of the action is under `torch/csrc/jit/passes/memory_planning/`.
Currently this only supports CPU. In principle there's nothing blocking GPU support (it just requires adding a cudaAllocator to `GetAllocator`), but there are lots of implicit assumptions encoded in the allocation strategies that are probably false for GPU.

Currently the tests don't do anything except exercise the code paths. As we find corner cases we will add tests to check for them.

@facebook-github-bot added the cla signed and oncall: jit (Add this issue/PR to JIT oncall triage queue) labels on Sep 1, 2021
@facebook-github-bot
Contributor

facebook-github-bot commented Sep 1, 2021


💊 CI failures summary and remediations

As of commit c545281 (more details on the Dr. CI page):


  • 2/2 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_test (1/2)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Nov 03 21:08:57 FAIL [0.005s]: test_forward_mod...D_corrcoef_cpu_float64 (__main__.TestGradientsCPU)
Nov 03 21:08:57     result = test(self, **param_kwargs)
Nov 03 21:08:57   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 737, in test_wrapper
Nov 03 21:08:57     return test(*args, **kwargs)
Nov 03 21:08:57   File "test_ops.py", line 729, in test_forward_mode_AD
Nov 03 21:08:57     self._forward_grad_helper(device, dtype, op, op.get_op())
Nov 03 21:08:57   File "test_ops.py", line 723, in _forward_grad_helper
Nov 03 21:08:57     check_undefined_grad=False, check_batched_grad=False)
Nov 03 21:08:57 AssertionError: NotImplementedError not raised : Running forward AD for an OP that has does not support it did not raise any error. If your op supports forward AD, you should set supports_forward_ad=True
Nov 03 21:08:57 
Nov 03 21:08:57 ======================================================================
Nov 03 21:08:57 FAIL [0.005s]: test_forward_mode_AD_corrcoef_cpu_float64 (__main__.TestGradientsCPU)
Nov 03 21:08:57 ----------------------------------------------------------------------
Nov 03 21:08:57 Traceback (most recent call last):
Nov 03 21:08:57   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 371, in instantiated_test
Nov 03 21:08:57     result = test(self, **param_kwargs)
Nov 03 21:08:57   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 737, in test_wrapper
Nov 03 21:08:57     return test(*args, **kwargs)
Nov 03 21:08:57   File "test_ops.py", line 729, in test_forward_mode_AD
Nov 03 21:08:57     self._forward_grad_helper(device, dtype, op, op.get_op())
Nov 03 21:08:57   File "test_ops.py", line 723, in _forward_grad_helper
Nov 03 21:08:57     check_undefined_grad=False, check_batched_grad=False)

See GitHub Actions build Lint / quick-checks (2/2)

Step: "Ensure correct trailing newlines" (full log | diagnosis details | 🔁 rerun)

2021-11-03T19:08:04.6921898Z python: can't open..._launches.py': [Errno 2] No such file or directory
2021-11-03T19:08:04.6571982Z ##[group]Run set -eux
2021-11-03T19:08:04.6612376Z shell: /bin/bash -e {0}
2021-11-03T19:08:04.6612737Z env:
2021-11-03T19:08:04.6613300Z   pythonLocation: /opt/hostedtoolcache/Python/3.10.0/x64
2021-11-03T19:08:04.6614026Z   LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.10.0/x64/lib
2021-11-03T19:08:04.6614569Z ##[endgroup]
2021-11-03T19:08:04.6709650Z + python torch/testing/_check_kernel_launches.py
2021-11-03T19:08:04.6791912Z + tee /home/runner/work/pytorch/pytorch/cuda_kernel_launch_checks.txt
2021-11-03T19:08:04.6921898Z python: can't open file '/home/runner/work/pytorch/pytorch/torch/testing/_check_kernel_launches.py': [Errno 2] No such file or directory
2021-11-03T19:08:04.6999283Z ##[group]Run (! git --no-pager grep -I -no $'#include <cub/' --  ./aten  ':(exclude)aten/src/ATen/cuda/cub*.cuh' || (echo "The above files have direct cub include; please include ATen/cuda/cub.cuh instead and wrap your cub calls in at::native namespace if necessary"; false))
2021-11-03T19:08:04.7037684Z shell: /bin/bash -e {0}
2021-11-03T19:08:04.7038065Z env:
2021-11-03T19:08:04.7038930Z   pythonLocation: /opt/hostedtoolcache/Python/3.10.0/x64
2021-11-03T19:08:04.7039699Z   LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.10.0/x64/lib
2021-11-03T19:08:04.7040252Z ##[endgroup]
2021-11-03T19:08:04.7339742Z ##[group]Run (! git --no-pager grep -I -no $'cudaStreamSynchronize' --  ./aten ./c10 ':(exclude)aten/src/ATen/test' ':(exclude)c10/cuda/CUDAFunctions.h' || (echo "The above files call raw cuda APIs directly; please use at::cuda wrappers instead"; false))
2021-11-03T19:08:04.7378397Z shell: /bin/bash -e {0}

This comment was automatically generated by Dr. CI.

Maksim Levental and others added 8 commits September 1, 2021 04:00
@makslevental
Contributor Author

@makslevental has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Maksim Levental added 2 commits September 6, 2021 21:51
Maksim Levental added 3 commits September 12, 2021 17:32
Contributor

@eellison eellison left a comment

Cool, looks great !!

Do you mind commenting a bit about memory_observer and what it's doing / why it's doing it? EDIT: JK, this is all in the PR description... I will read that and re-review tomorrow

torch/csrc/jit/runtime/register_prim_ops_fulljit.cpp
},
aliasAnalysisSpecialCase()),
Operator(
"prim::ReleaseSlab(Storage slab, ...) -> ()",
Contributor

Hmm, slab here could relate to any number of Tensors, and it would be kind of a pain to track all of the dependencies there. I think it probably makes sense just to mark prim::ReleaseSlab as an op with side effects for now, which means we can't move any node around it and it won't get DCE'd.

I'm not sure what a good alternative is. We don't want to set the slab as containing the tensors which point to it because that would pessimistically extend their lifetimes. We could maybe set slab to directly alias with these tensors, but it might be a little weird to have Values of different types alias each other, and it will also run into the same lifetime-extension issue.

Contributor Author

I'm not sure what the constraints here are but just fyi I don't know the value (or the sensibility) of trying to do alias analysis on the slices of slab that get handed out.

Contributor

Yea, I think you're right. We might want to just make it an invariant that memory planning is the last pass we run, and throw if we see any of these nodes in alias analysis. What do people think cc @Krovatkin @desertfire ?

What I was trying to avoid is a future pass moving around nodes that have dependencies which are not reflected topologically.
E.g., you can't move prim::ReleaseSlab around another node.

Contributor

The more I think about it, all of the logic here is dependent on the topology being frozen. We should probably just throw in alias analysis if we see these nodes and enforce the invariant that this is the final pass.

Contributor Author

@eellison

> reflected topologically

aren't uses relationships exactly topological relationships? All of the AllocateTensor nodes have a source -> node -> sink relationship with AllocateSlab and whichever op consumes the output of AllocateTensor?

Contributor

say you have:

graph(a, b, c, d):
   y = a + b
   z = c + d
   y1 = y + z   

   y0 = a + c 
   z0 = b + d
   z1 = y0 + z0
    
   return z1 + y1

There is no topological constraint about whether to compute y/z or y0/z0 first; however, their lifetimes will be implicitly baked into the memory planning scheme. Let's say that y/z are at offsets 0 and 256, and then those same offsets/sizes get reused for y0/z0. If you were to move y0 = a + c above z = c + d, that would overwrite the value of y; however, there is nothing reflected topologically that says this is an invalid move.

Contributor Author

@eellison this is a good (very good) observation, but if you move y0 = a + c above z = c + d then you would also be required to move its AllocateTensor with it as well, right? In which case you could not mistakenly overwrite (unless I'm mistaken).

"prim::ReleaseSlab(Storage slab, ...) -> ()",
[](Stack* stack) {
auto num_inputs = pop(stack).toInt();
std::vector<at::Tensor> inputs(num_inputs - 1);
Contributor

What is the purpose of taking in Tensors here? Otherwise, as soon as a Tensor has its last use, its ref-count goes to zero and it gets deallocated. What does having the Tensors as inputs provide?

Contributor Author

For one thing there's a pass in CodeImpl that'll prune unused tensors

https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/runtime/interpreter/preprocess_graph.cpp#L211

For another I thought being perfectly explicit about free/destruction would be a good thing (this pertains to your comments further down about ref count)

Contributor

Why is a Tensor being pruned after all its uses a bad thing?

Contributor Author

So despite what I have implemented here currently I think the answer to this question is "because otherwise we would get a double free". My reasoning: if you

  1. don't delete the deleter of the temp tensor at the Storage allocation site (as in how AllocateTensor is now) (note I'm talking about the Storage abstraction rather than actual memory).
  2. let tensors go out of scope before the end of the run (therefore let Storage perform a free)

then you will get a double free/the slab being overwritten by the system memory manager. I'm not 100% sure of this because I haven't closely studied StorageImpl but intuitively this is how it should work.

So the right thing to do is either

  1. delete the deleter and then only free the slab in ReleaseSlab
  2. don't delete the deleter but "use" the tensor in ReleaseSlab so that it doesn't get freed before the slab gets freed.

In actuality I think the second implementation also leads to double free (first tensor then slab) and so the only correct thing is 1. Indeed this is basically what static runtime does:

void MemoryPlanner::allocate() {
  buffer_ = allocateBuffer(managed_bytes_);
  ...
  size_t offset = 0;
  uint8_t* start = static_cast<uint8_t*>(buffer_.get());
  ...
  void* src = static_cast<void*>(start + offset);
  ...
  for (auto* tensor : tensors) {
    tensor->storage().set_data_ptr_noswap(
        at::DataPtr(src, src, nullptr, tensor->device()));
    ...
  }
  ...
}

void MemoryPlanner::deallocate() {
  for (auto& ms : managed_tensors_) {
    const auto& tensors = ms.second;

    for (auto& tensor : tensors) {
      tensor->storage().unsafeGetStorageImpl()->reset();
    }
  }

  buffer_ = {};
}

Contributor

Is the storage not also ref-counted? What do we think is going to be double free'd? Because I don't think the storage will be. We can sync more on this or clear it up in a follow up as well.

pop(stack, slab);
uint8_t* start = static_cast<uint8_t*>(slab.data());
void* src = static_cast<void*>(start + offset);
at::Tensor temp_tensor = at::from_blob(
Contributor

are there any assertions/invariants we want to add here? (maybe not, I don't know)

Contributor Author

initially I had some checks about exceeding max memory and stuff like that, but it's pretty superficial I think, since there's a validation pass in the planner itself

temp_tensor.unsafeReleaseTensorImpl()->release_resources();
}
auto slab = pop(stack).toStorage();
// slab.allocator()->raw_deallocate(slab.data());
Contributor

commented out code

Contributor Author

this, like the free of the temp tensors, is up for debate (explicitness vs concision)

Contributor

The current state of the interpreter is that it handles the freeing of Tensors after their final use. I don't see a compelling reason to duplicate that logic.

Contributor Author

yea that makes sense - just wasn't sure what would be "best practices". In that case the ReleaseSlab node is unnecessary in its entirety.

namespace torch {
namespace jit {

int overlap(size_t a, size_t b, size_t c, size_t d) {
Contributor

You are not returning a bool result because overlap is used for both LiveRange and MemRegion. Is it possible to unify both by making LiveRange in the [,) form?

Contributor Author

yea, you could, by adding 1 to the right endpoint, since [a,b] doesn't intersect [c,d] if [a,b+1) doesn't intersect [c,d+1) (since all endpoints are integers). I guess that's simpler.
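i.e. something like this (illustrative only, not the actual `overlap` in this PR):

```cpp
#include <cstddef>

// Closed intervals [a, b] and [c, d] (with b >= a, d >= c) overlap iff a <= d && c <= b.
bool overlap_closed(size_t a, size_t b, size_t c, size_t d) {
  return a <= d && c <= b;
}

// The same test on the half-open forms [a, b + 1) and [c, d + 1):
// a < d + 1 && c < b + 1, which for integer endpoints is exactly a <= d && c <= b.
bool overlap_half_open(size_t lo1, size_t hi1, size_t lo2, size_t hi2) {
  return lo1 < hi2 && lo2 < hi1;
}
```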

@eellison
Contributor

> Currently the id is the debugName of the output tensor

FYI, the debug name is not guaranteed to be stable (but it is guaranteed to be unique). We should not use this for any sort of mapping to a Value *

Contributor

@eellison eellison left a comment

Few more comments... more to come

namespace torch {
namespace jit {

int overlap(size_t a, size_t b, size_t c, size_t d) {
Contributor

Nit: can we name the inputs better?

TORCH_INTERNAL_ASSERT(a <= b);
TORCH_INTERNAL_ASSERT(c <= d);
size_t outer = std::max(b, d) - std::min(a, c);
size_t l1 = (b - a), l2 = (d - c);
Contributor

Nit: rename l1, l2 more descriptively

continue;
}
auto size = computeStorageSize(*out_v);
if (size > 0 && !isOptimizableContainerType(node, node_has_out_variant)) {
Contributor

What does isOptimizableContainerType mean here? The name is a little vague... maybe add a comment.

managed_values.insert(
{out_v, {{live_ranges[out_v], out_v->debugName()}, size.value()}});
} else {
leaked_values.insert(out_v);
Contributor

Sorry, how is leaked different than unmanaged? Could we just insert them into unmanaged here? "Leaked" gives the impression of memory leakage, which is not what we're doing here.

Contributor Author

yea, fair. This is language from static runtime, but you're right, it's basically unmanaged memory. I'll change it.

@makslevental
Contributor Author

> Currently the id is the debugName of the output tensor

> FYI, the debug name is not guaranteed to be stable (but it is guaranteed to be unique). We should not use this for any sort of mapping to a Value *

this gets at the core of one of the fundamental issues - what is a unique and stable mapping from Value/tensor to a semantically meaningful identifier? At first I was sticking the entire stacktrace (to the allocator call) in that id string, but not only is that slow, it's not even stable because of control flow!

Contributor

@eellison eellison left a comment

cc @d1jang and @hlu1 - spying on the memory allocation logic here could be useful for uncovering aten operators which do internal allocations, like we discovered with layer_norm

@eellison
Contributor

> this gets at the core of one of the fundamental issues - what is a unique and stable mapping from Value/tensor to a semantically meaningful identifier? At first I was sticking the entire stacktrace (to the allocator call) in that id string, but not only is that slow, it's not even stable because of control flow!

Does Value* not suffice ?

This was referenced Sep 20, 2021
Contributor

@eellison eellison left a comment

Awesome, looks great !!! 🚢 🚢 🚢

Follow-ups / nice-to-haves
Smaller:

  • figure out why prim::AllocateTensor can't be schematized
  • Don't store TypePtr on prim::AllocateTensor

Larger:

  • Don't try to handle aliased tensors
  • Automatic inplacing add->add_
  • Don't have a dependency on strides, propagate is_dense information instead

This PR adds static memory planning for traces with shape info. The three strategies implemented so far are stacked on top. They are:

* Linear scan (a heuristic based on https://www.usenix.org/legacy/events/vee05/full_papers/p132-wimmer.pdf)
* Greedy by size (a heuristic that allocates largest tensors first)
* Greedy by operator breadth (a heuristic that allocates largest tensors for operators with largest breadths first)

The latter 2 are based on https://arxiv.org/pdf/2001.03288.pdf.

Differential Revision: [D30769100](https://our.internmc.facebook.com/intern/diff/D30769100)

This stack of PRs have through a lot of revisions (as you can see from the # of commits) and this module should've probably had a design quip but here we are so I'll briefly summarize the design here.

# Memory Planner

We first perform alias analysis to find all of the values/tensors we'll managed allocations for (and their lifetimes). This reuses functionality from static runtime. With these in hand we can delegate to one of several planning strategies (the interface for theses planners is sorted map from lifetime (i.e., `[a,b]`) to size of required memory). Note that these required memory sizes need to be procured by some other mechanism (e.g. shape analysis or LTC or runtime profiling). One technical detail is that we explicitly enforce unique lifetimes through the type

```cpp
struct UniqueLiveRange {
  LiveRange lvr;
  std::string id;
};
```

This is to anticipate when this will be extended to GPU where it's possible that concurrent ops will have memory requirements that have identical lifetimes. Currently the `id` is the `debugName` of the output tensor but later on it will be something like a stack trace (i.e. uniquely identifying where the allocation request was made *semantically*).

The planner produces a sequence of (`offset`, `size`) assignments, one per lifetime.
Once a plan is constructed it is validated, and then we run a graph pass that inserts nodes into the graph implementing slab allocation, memory slicing, and slab freeing.
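Validation here boils down to checking that no two allocations whose lifetimes overlap were assigned overlapping byte ranges in the slab. A hedged sketch (field names are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Illustrative representation of one planned allocation.
struct PlannedAlloc {
  size_t begin, end;     // lifetime: first/last node index using the buffer
  size_t offset, size;   // placement within the slab
};

// A plan is valid iff any two allocations that are live at the same time
// occupy disjoint byte ranges.
bool validatePlan(const std::vector<PlannedAlloc>& plan) {
  for (size_t i = 0; i < plan.size(); ++i) {
    for (size_t j = i + 1; j < plan.size(); ++j) {
      const auto& a = plan[i];
      const auto& b = plan[j];
      bool livesOverlap = a.begin <= b.end && b.begin <= a.end;
      bool memOverlaps =
          a.offset < b.offset + b.size && b.offset < a.offset + a.size;
      if (livesOverlap && memOverlaps) return false;
    }
  }
  return true;
}
```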

# Registering Ops

We add three new primitive ops: `prim::AllocateSlab`, `prim::AllocateTensor`, and `prim::ReleaseSlab`. One general uncertainty I have is around the right way to perform the allocation and the freeing, since there seem to be several equivalent ways (e.g. you can swap the storage's data pointer with `tensor->storage().set_data_ptr_noswap`, like static runtime does).
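As a sketch of one of those equivalent ways (not necessarily what this PR ends up doing), the runtime implementation of `prim::AllocateTensor` could hand out a non-owning view into the slab at the planned offset via `at::from_blob` with a no-op deleter; `allocate_from_slab` below is a hypothetical helper, and it ignores strides and alignment:

```cpp
#include <ATen/ATen.h>
#include <cstdint>

// Return a tensor whose storage aliases the pre-allocated slab at `offset`.
// The empty deleter means the slab (allocated by prim::AllocateSlab and freed
// by prim::ReleaseSlab) stays the sole owner of the memory.
at::Tensor allocate_from_slab(const at::Tensor& slab,   // 1-D uint8 slab
                              size_t offset,            // planned byte offset
                              at::IntArrayRef sizes,
                              at::ScalarType dtype) {
  auto* base = static_cast<uint8_t*>(slab.data_ptr());
  return at::from_blob(base + offset, sizes,
                       /*deleter=*/[](void*) {},
                       at::TensorOptions().dtype(dtype));
}
```

The alternative mentioned above is to create the tensor normally and then point its storage at the slab with `storage().set_data_ptr_noswap`, as static runtime does.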

@pytorch-probot

pytorch-probot bot commented Nov 3, 2021

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/c545281e1ca6b93d21f12c2bdfd8087a21b8795d/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

Workflows Labels (bold enabled) Status
Triggered Workflows
linux-bionic-py3.6-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/xla ✅ triggered
linux-vulkan-bionic-py3.6-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-dynamic ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile ✅ triggered
linux-xenial-py3.6-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers ✅ triggered
linux-xenial-py3.6-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx ✅ triggered
linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3.6-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3.6-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/win ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux 🚫 skipped
docker-builds ciflow/all 🚫 skipped
ios-12-5-1-arm64 ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-coreml ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-custom-ops ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-full-jit ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-metal ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-x86-64 ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-x86-64-coreml ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-x86-64-full-jit ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow 🚫 skipped
linux-xenial-py3-clang5-mobile-code-analysis ciflow/all, ciflow/linux, ciflow/mobile 🚫 skipped
parallelnative-linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux 🚫 skipped
periodic-libtorch-linux-xenial-cuda11.1-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.1-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.1-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped

You can add a comment to the PR and tag @pytorchbot with the following commands:
```
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow
```

For more information, please take a look at the CI Flow Wiki.

@makslevental
Contributor Author

@makslevental has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

Hi @makslevental!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@facebook-github-bot
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

1 similar comment

@github-actions

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label May 21, 2022
@github-actions github-actions bot closed this Jun 20, 2022
@facebook-github-bot facebook-github-bot deleted the gh/makslevental/28/head branch July 21, 2022 14:22
Labels: cla signed, oncall: jit (add this issue/PR to JIT oncall triage queue), open source, Stale