
[JIT] memory planning base with naive strategy #64347

Closed
wants to merge 20 commits

Conversation

makslevental
Contributor

@makslevental makslevental commented Sep 1, 2021

Stack from ghstack:

This PR adds static memory planning for traces with shape info. The three strategies implemented so far are stacked on top. They are:

* Linear scan (a heuristic based on https://www.usenix.org/legacy/events/vee05/full_papers/p132-wimmer.pdf)
* Greedy by size (a heuristic that allocates largest tensors first)
* Greedy by operator breadth (a heuristic that allocates largest tensors for operators with largest breadths first)

The latter 2 are based on https://arxiv.org/pdf/2001.03288.pdf.

Differential Revision: D30769100

This stack of PRs has gone through a lot of revisions (as you can see from the # of commits), and this module should've probably had a design quip, but here we are, so I'll briefly summarize the design here.

# Memory Planner

We first perform alias analysis to find all of the values/tensors we'll manage allocations for (and their lifetimes). This reuses functionality from static runtime. With these in hand we can delegate to one of several planning strategies (the interface for these planners is a sorted map from lifetime (i.e., `[a,b]`) to the size of the required memory). Note that these required memory sizes need to be procured by some other mechanism (e.g. shape analysis or LTC or runtime profiling). One technical detail is that we explicitly enforce unique lifetimes through the type

```cpp
struct UniqueLiveRange {
  LiveRange lvr;
  std::string id;
};
```

This is to anticipate when this will be extended to GPU where it's possible that concurrent ops will have memory requirements that have identical lifetimes. Currently the `id` is the `debugName` of the output tensor but later on it will be something like a stack trace (i.e. uniquely identifying where the allocation request was made *semantically*).

The planner produces a sequence of assignments of (`offset`, `size`) allocations corresponding to lifetimes.
Once a plan is constructed it is validated and then we execute a graph pass to insert nodes into the graph that implement slab allocation, memory slicing, and slab freeing.
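For concreteness, a strategy's contract looks roughly like the sketch below (the comparator, `SortedSizeMap`, `MemAllocation`, and `naiveStrategy` here are illustrative names, not necessarily what's in the diff):

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct LiveRange {
  size_t begin; // index of the first node at which the value is live
  size_t end;   // index of the last node at which the value is live (inclusive)
};

struct UniqueLiveRange {
  LiveRange lvr;
  std::string id; // currently the output Value's debugName
};

// Order lifetimes by start, then end, then id, so identical ranges stay distinct.
struct UniqueLiveRangeCmp {
  bool operator()(const UniqueLiveRange& a, const UniqueLiveRange& b) const {
    if (a.lvr.begin != b.lvr.begin) return a.lvr.begin < b.lvr.begin;
    if (a.lvr.end != b.lvr.end) return a.lvr.end < b.lvr.end;
    return a.id < b.id;
  }
};

// Strategy input: lifetime -> required bytes (procured by shape analysis, LTC, profiling, ...).
using SortedSizeMap = std::map<UniqueLiveRange, size_t, UniqueLiveRangeCmp>;

// Strategy output: an (offset, size) region within the slab for each lifetime.
struct MemAllocation {
  UniqueLiveRange ulvr;
  size_t offset;
  size_t size;
};

// A deliberately naive strategy: bump-allocate, never reusing memory across lifetimes.
std::vector<MemAllocation> naiveStrategy(const SortedSizeMap& required) {
  std::vector<MemAllocation> plan;
  size_t offset = 0;
  for (const auto& kv : required) {
    plan.push_back({kv.first, offset, kv.second});
    offset += kv.second;
  }
  return plan;
}
```

The real strategies (linear scan, greedy by size, greedy by operator breadth) differ in how they choose offsets so that values with non-overlapping lifetimes can reuse the same region of the slab.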

# Registering Ops

We add three new primitive ops `prim::AllocateSlab`, `prim::AllocateTensor`, and `prim::ReleaseSlab`. One general uncertainty I have is around the right way to perform the allocation and the freeing, since there seem to be several equivalent ways (e.g. you can swap storage with `tensor->storage().set_data_ptr_noswap` like in static runtime).
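Conceptually, `prim::AllocateTensor` just carves a view out of the slab at the planned offset; something like this standalone sketch (the helper name is made up; the real registrations live in `register_prim_ops_fulljit.cpp` and mirror the `at::from_blob` snippet quoted in the review below):

```cpp
#include <cstddef>

#include <ATen/ATen.h>

// Illustrative only: produce a tensor that views into the slab at a planned offset.
at::Tensor allocate_from_slab(
    const at::Storage& slab, // produced by prim::AllocateSlab
    size_t offset,           // byte offset assigned by the planner
    at::IntArrayRef sizes,
    at::IntArrayRef strides,
    at::ScalarType dtype) {
  auto* start = static_cast<uint8_t*>(slab.data());
  void* src = static_cast<void*>(start + offset);
  // No deleter is installed: the slab owns the memory, the tensor is just a view into it.
  return at::from_blob(src, sizes, strides, at::TensorOptions().dtype(dtype));
}
```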

# Memory Observer

This is functionality to collect data about allocations and frees at runtime. It's modeled on kineto but adds `FrameNodId`, a means of figuring out which node triggered which allocation (currently kineto only knows about dispatcher calls rather than graph nodes).

Lumped in with this PR it serves only to validate test cases and therefore might seem a tad over-engineered. But in reality it's a rebase of something that started life as a necessary component of "memoization"-based planning. Indeed it's modeled on kineto because it started as additions to kineto that got too orthogonal to kineto's purpose.

# Odds and ends

`valid_add` and `valid_sub` check for overflow since we're dealing with arithmetic on `size_t`. On GCC and Clang we have `__builtin_add_overflow` and `__builtin_sub_overflow`, but not on MSVC, so there we use the "dumber" checks `a + b >= a` and `a >= b` (respectively). It could be argued that we should just drop the more opaque variant completely, but I'm not sure.
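For reference, the shape of these helpers is roughly the following (the exact signatures in the diff may differ):

```cpp
#include <cstddef>

// Returns true and writes a + b to *out if the sum doesn't overflow size_t.
bool valid_add(size_t a, size_t b, size_t* out) {
#if defined(__GNUC__) || defined(__clang__)
  return !__builtin_add_overflow(a, b, out);
#else
  if (a + b < a) return false; // unsigned wraparound means overflow
  *out = a + b;
  return true;
#endif
}

// Returns true and writes a - b to *out if the difference doesn't underflow.
bool valid_sub(size_t a, size_t b, size_t* out) {
#if defined(__GNUC__) || defined(__clang__)
  return !__builtin_sub_overflow(a, b, out);
#else
  if (a < b) return false; // would wrap below zero
  *out = a - b;
  return true;
#endif
}
```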

`PreprocessGraph` is made public because the interpreter preprocesses graphs before running them, which makes verifying that the planner planned successfully impossible (SSA names are changed, amongst other things). With `PreprocessGraph` we can anticipate and reconcile those changes to the graph.


Most of the action is under `torch/csrc/jit/passes/memory_planning/`.
Currently this only supports CPU. In principle there's nothing blocking GPU support (it just requires adding a cudaAllocator to `GetAllocator`), but there are lots of implicit assumptions encoded in the allocation strategies that are probably false for GPU.

Currently the tests don't do anything except exercise the code paths. As we find corner cases we will add tests to check for them.

@facebook-github-bot added the cla signed and oncall: jit (Add this issue/PR to JIT oncall triage queue) labels on Sep 1, 2021
@facebook-github-bot
Contributor

facebook-github-bot commented Sep 1, 2021


💊 CI failures summary and remediations

As of commit c545281 (more details on the Dr. CI page):


  • 2/2 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_test (1/2)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Nov 03 21:08:57 FAIL [0.005s]: test_forward_mod...D_corrcoef_cpu_float64 (__main__.TestGradientsCPU)
Nov 03 21:08:57     result = test(self, **param_kwargs)
Nov 03 21:08:57   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 737, in test_wrapper
Nov 03 21:08:57     return test(*args, **kwargs)
Nov 03 21:08:57   File "test_ops.py", line 729, in test_forward_mode_AD
Nov 03 21:08:57     self._forward_grad_helper(device, dtype, op, op.get_op())
Nov 03 21:08:57   File "test_ops.py", line 723, in _forward_grad_helper
Nov 03 21:08:57     check_undefined_grad=False, check_batched_grad=False)
Nov 03 21:08:57 AssertionError: NotImplementedError not raised : Running forward AD for an OP that has does not support it did not raise any error. If your op supports forward AD, you should set supports_forward_ad=True
Nov 03 21:08:57 
Nov 03 21:08:57 ======================================================================
Nov 03 21:08:57 FAIL [0.005s]: test_forward_mode_AD_corrcoef_cpu_float64 (__main__.TestGradientsCPU)
Nov 03 21:08:57 ----------------------------------------------------------------------
Nov 03 21:08:57 Traceback (most recent call last):
Nov 03 21:08:57   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 371, in instantiated_test
Nov 03 21:08:57     result = test(self, **param_kwargs)
Nov 03 21:08:57   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 737, in test_wrapper
Nov 03 21:08:57     return test(*args, **kwargs)
Nov 03 21:08:57   File "test_ops.py", line 729, in test_forward_mode_AD
Nov 03 21:08:57     self._forward_grad_helper(device, dtype, op, op.get_op())
Nov 03 21:08:57   File "test_ops.py", line 723, in _forward_grad_helper
Nov 03 21:08:57     check_undefined_grad=False, check_batched_grad=False)

See GitHub Actions build Lint / quick-checks (2/2)

Step: "Ensure correct trailing newlines" (full log | diagnosis details | 🔁 rerun)

2021-11-03T19:08:04.6921898Z python: can't open..._launches.py': [Errno 2] No such file or directory
2021-11-03T19:08:04.6571982Z ##[group]Run set -eux
2021-11-03T19:08:04.6612376Z shell: /bin/bash -e {0}
2021-11-03T19:08:04.6612737Z env:
2021-11-03T19:08:04.6613300Z   pythonLocation: /opt/hostedtoolcache/Python/3.10.0/x64
2021-11-03T19:08:04.6614026Z   LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.10.0/x64/lib
2021-11-03T19:08:04.6614569Z ##[endgroup]
2021-11-03T19:08:04.6709650Z + python torch/testing/_check_kernel_launches.py
2021-11-03T19:08:04.6791912Z + tee /home/runner/work/pytorch/pytorch/cuda_kernel_launch_checks.txt
2021-11-03T19:08:04.6921898Z python: can't open file '/home/runner/work/pytorch/pytorch/torch/testing/_check_kernel_launches.py': [Errno 2] No such file or directory
2021-11-03T19:08:04.6999283Z ##[group]Run (! git --no-pager grep -I -no $'#include <cub/' --  ./aten  ':(exclude)aten/src/ATen/cuda/cub*.cuh' || (echo "The above files have direct cub include; please include ATen/cuda/cub.cuh instead and wrap your cub calls in at::native namespace if necessary"; false))
2021-11-03T19:08:04.7037684Z shell: /bin/bash -e {0}
2021-11-03T19:08:04.7038065Z env:
2021-11-03T19:08:04.7038930Z   pythonLocation: /opt/hostedtoolcache/Python/3.10.0/x64
2021-11-03T19:08:04.7039699Z   LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.10.0/x64/lib
2021-11-03T19:08:04.7040252Z ##[endgroup]
2021-11-03T19:08:04.7339742Z ##[group]Run (! git --no-pager grep -I -no $'cudaStreamSynchronize' --  ./aten ./c10 ':(exclude)aten/src/ATen/test' ':(exclude)c10/cuda/CUDAFunctions.h' || (echo "The above files call raw cuda APIs directly; please use at::cuda wrappers instead"; false))
2021-11-03T19:08:04.7378397Z shell: /bin/bash -e {0}

This comment was automatically generated by Dr. CI.

Maksim Levental and others added 8 commits September 1, 2021 04:00
@makslevental
Contributor Author

@makslevental has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Maksim Levental added 2 commits September 6, 2021 21:51
Maksim Levental added 3 commits September 12, 2021 17:32
Contributor

@eellison eellison left a comment

Cool, looks great !!

Do you mind commenting a bit about memory_observer and what it's doing / why it's doing it? EDIT: JK, this is all in the PR description... I will read that and re-review tomorrow

torch/csrc/jit/runtime/register_prim_ops_fulljit.cpp
},
aliasAnalysisSpecialCase()),
Operator(
"prim::ReleaseSlab(Storage slab, ...) -> ()",
Contributor

Hmm, slab here could relate to any number of Tensors, and it would be kind of a pain to track all of the dependencies there. I think it probably makes sense just to mark prim::ReleaseSlab as an op with side effects for now, which means we can't move any node around it and it won't get DCE'd.

I'm not sure what a good alternative is. We don't want to set the slab as containing the tensors which point to it because that would pessimistically extend their lifetimes. We could maybe set slab to directly alias with these tensors, but it might be a little weird to have Values of different types alias each other, and it will also run into the same lifetime-extension issue.

Contributor Author

I'm not sure what the constraints here are but just fyi I don't know the value (or the sensibility) of trying to do alias analysis on the slices of slab that get handed out.

Contributor

Yea, I think you're right. We might want to just make it an invariant that memory planning is the last pass we run, and throw if we see any of these nodes in alias analysis. What do people think cc @Krovatkin @desertfire ?

What I was trying to avoid is a future pass moving around nodes that have dependencies which are not reflected topologically.
E.g., you can't move prim::ReleaseSlab around another node.

Contributor

The more I think about it, all of the logic here is dependent on the topology being frozen. We should probably just throw in alias analysis if we see these nodes and enforce the invariant that this is the final pass.

Contributor Author

@eellison

> reflected topologically

aren't uses relationships exactly topological relationships? All of the AllocateTensor nodes have a source -> node -> sink relationship with AllocateSlab and whichever op consumes the output of AllocateTensor?

Contributor

say you have:

graph(a, b, c, d):
   y = a + b
   z = c + d
   y1 = y + z   

   y0 = a + c 
   z0 = b + d
   z1 = y0 + z0
    
   return z1 + y1

There is no topological constraint about whether to compute y/z or y0/z0 first; however, their lifetimes will be implicitly baked into the memory planning scheme. Let's say that y/z are at offsets 0 and 256, and then those same offsets/sizes get reused for y0/z0. If you were to move y0 = a + c above z = c + d, that would overwrite the value of y; however, there is nothing reflected topologically that says this is an invalid move.

Contributor Author

@eellison this is a good (very good) observation, but if you move y0 = a + c above z = c + d then you would also be required to move its AllocateTensor with it as well, right? In which case you could not mistakenly overwrite (unless I'm mistaken).

"prim::ReleaseSlab(Storage slab, ...) -> ()",
[](Stack* stack) {
auto num_inputs = pop(stack).toInt();
std::vector<at::Tensor> inputs(num_inputs - 1);
Contributor

What is the purpose of taking in Tensors here? Otherwise, as soon as a Tensor has its last use, its ref-count goes to zero and it gets deallocated. What does having the Tensors as inputs provide?

Contributor Author

For one thing there's a pass in CodeImpl that'll prune unused tensors

https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/runtime/interpreter/preprocess_graph.cpp#L211

For another I thought being perfectly explicit about free/destruction would be a good thing (this pertains to your comments further down about ref count)

Contributor

Why is a Tensor being pruned after all its uses a bad thing?

Contributor Author

So despite what I have implemented here currently I think the answer to this question is "because otherwise we would get a double free". My reasoning: if you

  1. don't delete the deleter of the temp tensor at the Storage allocation site (as in how AllocateTensor is now) (note I'm talking about the Storage abstraction rather than actual memory).
  2. let tensors go out of scope before the end of the run (therefore let Storage perform a free)

then you will get a double free/the slab being overwritten by the system memory manager. I'm not 100% sure of this because I haven't closely studied StorageImpl but intuitively this is how it should work.

So the right thing to do is either

  1. delete the deleter and then only free the slab in ReleaseSlab
  2. don't delete the deleter but "use" the tensor in ReleaseSlab so that it doesn't get freed before the slab gets freed.

In actuality I think the second implementation also leads to double free (first tensor then slab) and so the only correct thing is 1. Indeed this is basically what static runtime does:

void MemoryPlanner::allocate() {
  buffer_ = allocateBuffer(managed_bytes_);
  ...
  size_t offset = 0;
  uint8_t* start = static_cast<uint8_t*>(buffer_.get());
  ...
  void* src = static_cast<void*>(start + offset);
  ...
  for (auto* tensor : tensors) {
    tensor->storage().set_data_ptr_noswap(
        at::DataPtr(src, src, nullptr, tensor->device()));
    ...
  }
  ...
}

void MemoryPlanner::deallocate() {
  for (auto& ms : managed_tensors_) {
    const auto& tensors = ms.second;

    for (auto& tensor : tensors) {
      tensor->storage().unsafeGetStorageImpl()->reset();
    }
  }

  buffer_ = {};
}

Contributor

Is the storage not also ref-counted? What do we think is going to be double free'd? Because I don't think the storage will be. We can sync more on this or clear it up in a follow up as well.

pop(stack, slab);
uint8_t* start = static_cast<uint8_t*>(slab.data());
void* src = static_cast<void*>(start + offset);
at::Tensor temp_tensor = at::from_blob(
Contributor

are there any assertions/invariants we want to add here? (maybe not, I don't know)

Contributor Author

initially I had some checks about exceeding max memory and stuff like that, but it's pretty superficial I think, since there's a validation pass in the planner itself

temp_tensor.unsafeReleaseTensorImpl()->release_resources();
}
auto slab = pop(stack).toStorage();
// slab.allocator()->raw_deallocate(slab.data());
Contributor

commented out code

Contributor Author

this, like the free of the temp tensors, is up for debate (explicitness vs concision)

Contributor

The current state of the interpreter is that it handles the freeing of Tensors after their final use. I don't see a compelling reason to duplicate that logic.

Contributor Author

yea that makes sense - just wasn't sure what would be "best practices". In that case the ReleaseSlab node is unnecessary in its entirety.

namespace torch {
namespace jit {

int overlap(size_t a, size_t b, size_t c, size_t d) {
Contributor

You are not returning a bool result because overlap is used for both LiveRange and MemRegion. Is it possible to unify both by making LiveRange in the [,) form?

Contributor Author

yea, you could, by adding 1 to the right endpoint, since [a,b] doesn't intersect [c,d] if [a,b+1) doesn't intersect [c,d+1) (since all endpoints are integers). I guess that's simpler.
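i.e. something like this (illustrative only, not the actual `overlap` in this PR):

```cpp
#include <cstddef>

// Closed intervals [a, b] and [c, d] (with b >= a, d >= c) overlap iff a <= d && c <= b.
bool overlap_closed(size_t a, size_t b, size_t c, size_t d) {
  return a <= d && c <= b;
}

// The same test on the half-open forms [a, b + 1) and [c, d + 1):
// a < d + 1 && c < b + 1, which for integer endpoints is exactly a <= d && c <= b.
bool overlap_half_open(size_t lo1, size_t hi1, size_t lo2, size_t hi2) {
  return lo1 < hi2 && lo2 < hi1;
}
```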

@eellison
Contributor

> Currently the id is the debugName of the output tensor

FYI, the debug name is not guaranteed to be stable (but it is guaranteed to be unique). We should not use this for any sort of mapping to a Value *

Contributor

@eellison eellison left a comment

Few more comments... more to come

namespace torch {
namespace jit {

int overlap(size_t a, size_t b, size_t c, size_t d) {
Contributor

Nit: can we name the inputs better?

TORCH_INTERNAL_ASSERT(a <= b);
TORCH_INTERNAL_ASSERT(c <= d);
size_t outer = std::max(b, d) - std::min(a, c);
size_t l1 = (b - a), l2 = (d - c);
Contributor

Nit: rename l1, l2 more descriptively

continue;
}
auto size = computeStorageSize(*out_v);
if (size > 0 && !isOptimizableContainerType(node, node_has_out_variant)) {
Contributor

What does isOptimizableContainerType mean here? The name is a little vague... maybe add a comment.

managed_values.insert(
{out_v, {{live_ranges[out_v], out_v->debugName()}, size.value()}});
} else {
leaked_values.insert(out_v);
Contributor

Sorry, how is leaked different than unmanaged? Could we just insert them into unmanaged here? "Leaked" gives the impression of memory leakage, which is not what we're doing here.

Contributor Author

yea, fair. This is language from static runtime, but you're right, it's basically unmanaged memory. I'll change it.

@makslevental
Contributor Author

> Currently the id is the debugName of the output tensor

> FYI, the debug name is not guaranteed to be stable (but it is guaranteed to be unique). We should not use this for any sort of mapping to a Value *

this gets at the core of one of the fundamental issues - what is a unique and stable mapping from Value/tensor to a semantically meaningful identifier? At first I was sticking the entire stacktrace (to the allocator call) in that id string, but not only is that slow, it's not even stable because of control flow!

Contributor

@eellison eellison left a comment

cc @d1jang and @hlu1 - spying on the memory allocation logic here could be useful for uncovering aten operators which do internal allocations, like we discovered with layer_norm

@eellison
Contributor

> this gets at the core of one of the fundamental issues - what is a unique and stable mapping from Value/tensor to a semantically meaningful identifier? At first I was sticking the entire stacktrace (to the allocator call) in that id string, but not only is that slow, it's not even stable because of control flow!

Does Value* not suffice ?

This was referenced Sep 20, 2021
Contributor

@eellison eellison left a comment

Awesome, looks great !!! 🚢 🚢 🚢

Follow-ups / nice-to-haves
Smaller:

  • figure out why prim::AllocateTensor can't be schematized
  • Don't store TypePtr on prim::AllocateTensor

Larger:

  • Don't try to handle aliased tensors
  • Automatic inplacing add->add_
  • Don't have a dependency on strides, propagate is_dense information instead

This PR adds static memory planning for traces with shape info. The three strategies implemented so far are stacked on top. They are:

* Linear scan (a heuristic based on https://www.usenix.org/legacy/events/vee05/full_papers/p132-wimmer.pdf)
* Greedy by size (a heuristic that allocates largest tensors first)
* Greedy by operator breadth (a heuristic that allocates largest tensors for operators with largest breadths first)

The latter 2 are based on https://arxiv.org/pdf/2001.03288.pdf.

Differential Revision: [D30769100](https://our.internmc.facebook.com/intern/diff/D30769100)

This stack of PRs have through a lot of revisions (as you can see from the # of commits) and this module should've probably had a design quip but here we are so I'll briefly summarize the design here.

# Memory Planner

We first perform alias analysis to find all of the values/tensors we'll managed allocations for (and their lifetimes). This reuses functionality from static runtime. With these in hand we can delegate to one of several planning strategies (the interface for theses planners is sorted map from lifetime (i.e., `[a,b]`) to size of required memory). Note that these required memory sizes need to be procured by some other mechanism (e.g. shape analysis or LTC or runtime profiling). One technical detail is that we explicitly enforce unique lifetimes through the type

```cpp
struct UniqueLiveRange {
  LiveRange lvr;
  std::string id;
};
```

This is to anticipate when this will be extended to GPU where it's possible that concurrent ops will have memory requirements that have identical lifetimes. Currently the `id` is the `debugName` of the output tensor but later on it will be something like a stack trace (i.e. uniquely identifying where the allocation request was made *semantically*).

The planner produces a sequence of (`offset`, `size`) assignments, one per lifetime.
Once a plan is constructed it is validated, and then we run a graph pass that inserts nodes into the graph implementing slab allocation, memory slicing, and slab freeing.
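Validation here boils down to checking that no two allocations whose lifetimes overlap were assigned overlapping byte ranges in the slab. A hedged sketch (field names are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Illustrative representation of one planned allocation.
struct PlannedAlloc {
  size_t begin, end;     // lifetime: first/last node index using the buffer
  size_t offset, size;   // placement within the slab
};

// A plan is valid iff any two allocations that are live at the same time
// occupy disjoint byte ranges.
bool validatePlan(const std::vector<PlannedAlloc>& plan) {
  for (size_t i = 0; i < plan.size(); ++i) {
    for (size_t j = i + 1; j < plan.size(); ++j) {
      const auto& a = plan[i];
      const auto& b = plan[j];
      bool livesOverlap = a.begin <= b.end && b.begin <= a.end;
      bool memOverlaps =
          a.offset < b.offset + b.size && b.offset < a.offset + a.size;
      if (livesOverlap && memOverlaps) return false;
    }
  }
  return true;
}
```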

# Registering Ops

We add three new primitive ops: `prim::AllocateSlab`, `prim::AllocateTensor`, and `prim::ReleaseSlab`. One general uncertainty I have is around the right way to perform the allocation and the freeing, since there seem to be several equivalent ways (e.g. you can swap the storage's data pointer with `tensor->storage().set_data_ptr_noswap`, like static runtime does).
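As a sketch of one of those equivalent ways (not necessarily what this PR ends up doing), the runtime implementation of `prim::AllocateTensor` could hand out a non-owning view into the slab at the planned offset via `at::from_blob` with a no-op deleter; `allocate_from_slab` below is a hypothetical helper, and it ignores strides and alignment:

```cpp
#include <ATen/ATen.h>
#include <cstdint>

// Return a tensor whose storage aliases the pre-allocated slab at `offset`.
// The empty deleter means the slab (allocated by prim::AllocateSlab and freed
// by prim::ReleaseSlab) stays the sole owner of the memory.
at::Tensor allocate_from_slab(const at::Tensor& slab,   // 1-D uint8 slab
                              size_t offset,            // planned byte offset
                              at::IntArrayRef sizes,
                              at::ScalarType dtype) {
  auto* base = static_cast<uint8_t*>(slab.data_ptr());
  return at::from_blob(base + offset, sizes,
                       /*deleter=*/[](void*) {},
                       at::TensorOptions().dtype(dtype));
}
```

The alternative mentioned above is to create the tensor normally and then point its storage at the slab with `storage().set_data_ptr_noswap`, as static runtime does.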

@pytorch-probot

pytorch-probot bot commented Nov 3, 2021

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/c545281e1ca6b93d21f12c2bdfd8087a21b8795d/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

Workflows Labels (bold enabled) Status
Triggered Workflows
linux-bionic-py3.6-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/xla ✅ triggered
linux-vulkan-bionic-py3.6-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-dynamic ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile ✅ triggered
linux-xenial-py3.6-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers ✅ triggered
linux-xenial-py3.6-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx ✅ triggered
linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3.6-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3.6-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/win ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux 🚫 skipped
docker-builds ciflow/all 🚫 skipped
ios-12-5-1-arm64 ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-coreml ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-custom-ops ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-full-jit ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-metal ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-x86-64 ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-x86-64-coreml ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-x86-64-full-jit ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow 🚫 skipped
linux-xenial-py3-clang5-mobile-code-analysis ciflow/all, ciflow/linux, ciflow/mobile 🚫 skipped
parallelnative-linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux 🚫 skipped
periodic-libtorch-linux-xenial-cuda11.1-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.1-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.1-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped

You can add a comment to the PR and tag @pytorchbot with the following commands:
```
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow
```

For more information, please take a look at the CI Flow Wiki.

@makslevental
Contributor Author

@makslevental has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

Hi @makslevental!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@facebook-github-bot
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

1 similar comment

@github-actions

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label May 21, 2022
@github-actions github-actions bot closed this Jun 20, 2022
@facebook-github-bot facebook-github-bot deleted the gh/makslevental/28/head branch July 21, 2022 14:22
Labels: cla signed, oncall: jit (add this issue/PR to JIT oncall triage queue), open source, Stale