
Improve allocation caching #709

Open · wants to merge 23 commits into base: main

Conversation

@nkoppel nkoppel commented Apr 16, 2023

So far, this makes the TensorCache API cleaner by using the CacheStorage trait to allow buffer conversion logic to be called within the context of a TensorCache. It also adds CacheWrapper, which implements Drop and encapsulates all of the cache's unsafe operations, aside from returning uninitialized memory.
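For orientation, a rough sketch of the shape these two abstractions could take (illustrative signatures only; the PR's exact definitions may differ):

```rust
/// Sketch: storage that can reinterpret its element type in place, letting the
/// cache hold device buffers of many element types behind one interface.
trait CacheStorage: Sized {
    type Output<E>: CacheStorage;

    /// Reinterprets the buffer's elements as `E` without copying.
    ///
    /// # Safety
    /// The caller must ensure the buffer's size and alignment are valid for `E`.
    unsafe fn transmute_elements<E>(self) -> Self::Output<E>;
}

/// Sketch: an owning wrapper that frees its buffer correctly when dropped, so
/// unsafe cleanup lives in one place instead of being scattered through the cache.
struct CacheWrapper<S: CacheStorage> {
    storage: Option<S>,
    alignment: usize,
}

impl<S: CacheStorage> Drop for CacheWrapper<S> {
    fn drop(&mut self) {
        // The real wrapper would convert `storage` back to an element type with
        // the recorded `alignment` before freeing it; elided in this sketch.
        let _ = (self.storage.take(), self.alignment);
    }
}
```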

Todo:

  • More thoroughly document safety details
  • Add buffer removal strategy (FIFO until total size of the cache is less than some number of bytes)
  • Set a reasonable default maximum cache size, and allow users to configure cache size.
  • Write unit tests to ensure that cache shrinking works correctly
  • Fix unbounded drop_queue size when cache is large

@nkoppel nkoppel changed the title from Improve buffer caching to Improve allocation caching on Apr 16, 2023
Comment on lines 175 to 176
// TODO: default max size
max_size: RwLock::new(1_000_000),
nkoppel (Contributor, Author):

@coreylowman Thoughts on a default max_size? I think that 1 GB is reasonable in most cases, but maybe we could do something like a percentage of available system memory, or we could require the user to specify this value when enabling the cache.

nkoppel (Contributor, Author):

I've decided to require users to specify the maximum size of the cache when they enable it.

coreylowman (Owner):

I like it, great idea! Thoughts on using an enum to represent different options for this? I'm definitely going to forget what a plain usize would represent

dev.enable_cache(CacheSize::Unlimited)
dev.enable_cache(CacheSize::NumItems(1000))
dev.enable_cache(CacheSize::Bytes(10))
dev.enable_cache(CacheSize::MB(10))
dev.enable_cache(CacheSize::GB(10))
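A minimal sketch of what such an enum could look like, with a hypothetical helper that normalizes the byte-based variants (illustrative, not actual dfdx API):

```rust
/// Illustrative cache-size configuration enum following the examples above.
pub enum CacheSize {
    /// No limit on cached memory.
    Unlimited,
    /// Limit by the number of cached allocations.
    NumItems(usize),
    /// Limit by total bytes.
    Bytes(usize),
    /// Limit by total megabytes.
    MB(usize),
    /// Limit by total gigabytes.
    GB(usize),
}

impl CacheSize {
    /// Hypothetical helper: upper bound in bytes, if the limit is byte-based.
    pub fn num_bytes(&self) -> Option<usize> {
        match self {
            CacheSize::Unlimited | CacheSize::NumItems(_) => None,
            CacheSize::Bytes(b) => Some(*b),
            CacheSize::MB(mb) => Some(mb * 1_000_000),
            CacheSize::GB(gb) => Some(gb * 1_000_000_000),
        }
    }
}
```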

coreylowman (Owner):

Optional num bytes? Option<usize>

coreylowman (Owner) commented:

So are these the main benefits of this PR?

  1. Enable maximum cache size
  2. More unified transmutation/drop logic across devices


nkoppel commented Apr 20, 2023

Yes, and this should also be more memory safe.

@coreylowman coreylowman left a comment


I like the size limiting idea a lot, I don't even think that's configurable in pytorch so that'd be a great add!

Will need to comb over to determine if safety is still met. The new layers of abstraction here do make it a bit more difficult to reason about the safety at each part, which may be a downside of trying to abstract between cpu/cuda.

And something to think about: I think we may want to enable re-using allocations that are bigger than what is requested (e.g. I need 100MB, and the allocation cache gives me a 200MB buffer). These changes shouldn't make this harder or anything, but worth thinking about what changes would need to be made in both cases


nkoppel commented Apr 21, 2023

Will need to comb over to determine if safety is still met. The new layers of abstraction here do make it a bit more difficult to reason about the safety at each part, which may be a downside of trying to abstract between cpu/cuda.

I still think this is good for memory safety, because it reduces the surface area of memory unsafety to CacheWrapper and CacheStorage, rather than having cache operations use memory-unsafe operations directly.

And something to think about: I think we may want to enable re-using allocations that are bigger than what is requested (e.g. I need 100MB, and the allocation cache gives me a 200MB buffer). These changes shouldn't make this harder or anything, but worth thinking about what changes would need to be made in both cases

I was actually thinking about this, and implementing it could be quite easy thanks to the use of a BTreeMap, because BTreeMaps make it fast to find the next largest key after the one you requested. My main concern is that some operations may rely on the physical size of the buffer to get physical_numel or the number of threads to launch; we would need to implement a Tensor::physical_numel method that instead calculates this from shape and strides.
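As an illustration of that lookup, a minimal sketch using BTreeMap::range (the key and value types here are stand-ins, not the cache's actual AllocationKey):

```rust
use std::collections::BTreeMap;

/// Finds the smallest cached allocation size that can hold `requested` bytes.
/// `cache` maps allocation size in bytes to reusable buffers of that size.
fn next_fit(cache: &BTreeMap<usize, Vec<Vec<u8>>>, requested: usize) -> Option<usize> {
    cache
        .range(requested..)
        .find(|(_, buffers)| !buffers.is_empty())
        .map(|(&size, _)| size)
}
```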

Comment on lines +37 to +38
let src_layout = Layout::new::<T>().pad_to_align();
let dst_layout = Layout::new::<T2>().pad_to_align();
coreylowman (Owner):

I'm not sure if it's okay to use these with cuda - I don't think they follow the same layout/alignment rules as rust does.

It seems like they are just used to get number of bytes. Maybe just use std::mem::size_of::<>() * self.data.len()?
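A minimal sketch of that suggested computation, as a free function for illustration (the real code would use the wrapper's own length field):

```rust
/// Number of bytes occupied by `len` contiguous elements of type `T`.
/// For sized Rust types, size_of::<T>() already includes trailing padding,
/// so this matches the slice's total size in memory.
fn num_bytes<T>(len: usize) -> usize {
    std::mem::size_of::<T>() * len
}
```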

nkoppel (Contributor, Author):

I think that cuda formats memory in much the same way as the cpu, as we are able to dtoh and htod directly from one to the other, and my understanding is that this is a byte-for-byte copy of the host array. I believe that we should keep this in case we wish to use arrays of structs, as this is able to compute the padded size of a type as it would appear in an array.

coreylowman (Owner):

Yeah byte to byte I think is fine, I'm just not sure about the sizes/alignment.

pad_to_align says the following, which makes it sound like it can modify the length?

Creates a layout by rounding the size of this layout up to a multiple of the layout’s alignment.

As far as alignment it sounds like cuda is always aligned to at least 256, which is definitely not the case for rust:

Any address of a variable residing in global memory or returned by one of the memory allocation routines from the driver or runtime API is always aligned to at least 256 bytes.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses


/// Uses transmute_elements to convert to an element type with alignment `align` before dropping.
/// This **must** be a memory safe way to drop self, given the correct alignment
unsafe fn drop_with_alignment(self, align: usize) {
coreylowman (Owner):

Thoughts on removing default impl of this? This implementation is the CPU one, and I'm not sure we should be doing all of this for cuda.

nkoppel (Contributor, Author):

I think that this is correct for cuda, and while it may not be necessary, it doesn't have a significant performance impact as transmute_elements is very fast.
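For context, a rough sketch of the kind of alignment-matched drop being discussed here (illustrative only; it uses a raw pointer and plain Vecs in place of the cache's actual storage types):

```rust
/// Frees a cached allocation by rebuilding a Vec whose element alignment matches
/// the alignment the buffer was originally allocated with.
///
/// # Safety
/// `ptr` must have been allocated by a Vec with exactly `size_bytes` bytes and the
/// given `align`, and must not be used again after this call.
unsafe fn drop_with_alignment(ptr: *mut u8, size_bytes: usize, align: usize) {
    match align {
        1 => drop(Vec::from_raw_parts(ptr, size_bytes, size_bytes)),
        2 => drop(Vec::from_raw_parts(ptr as *mut u16, size_bytes / 2, size_bytes / 2)),
        4 => drop(Vec::from_raw_parts(ptr as *mut u32, size_bytes / 4, size_bytes / 4)),
        8 => drop(Vec::from_raw_parts(ptr as *mut u64, size_bytes / 8, size_bytes / 8)),
        _ => panic!("unsupported alignment"),
    }
}
```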

Comment on lines 63 to 66
// Tracks the number of matching 'AllocationKey's in drop_queue to ignore. This is used to
// "remove" the next instance of the matching AllocationKey in the drop_queue, without having
// to run an O(n) operation to actually remove the key.
ignore_drops: usize,
coreylowman (Owner):

Is all this just because we have the invariant of "if a key is in the tree, it must have > 0 values"? I wonder if it would be better to just remove that assumption, and just keep keys in the tree forever.

Also, have you noticed speedups when doing this? I'm wondering how much of an impact on speed the key removal actually has. My main concern is that this will be hard to understand/maintain whenever we revisit. I think the performance benefit would have to be pretty high for this to be worth it IMO.

Thoughts?

nkoppel (Contributor, Author):

"if a key is in the tree, it must have > 0 values"

Keys now remain in the map if they have either ignore_drops > 0 or allocations.len() > 0. This is to maintain the invariant that

(instances of key in drop_queue) = allocations[key].ignore_drops + allocations[key].allocations.len()

for all keys.
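To illustrate this bookkeeping, here is a simplified sketch of how popping from the drop queue could respect ignore_drops (hypothetical names and types, not the PR's actual code):

```rust
use std::collections::{BTreeMap, VecDeque};

/// Hypothetical, simplified cache state for illustration only.
struct Entry {
    /// Cached buffers for this key (stand-ins for real device allocations).
    allocations: Vec<Vec<u8>>,
    /// Number of upcoming occurrences of this key in `drop_queue` to skip,
    /// recorded when a queued allocation was handed back out instead of dropped.
    ignore_drops: usize,
}

struct Cache {
    entries: BTreeMap<usize, Entry>,
    /// FIFO queue of keys, one occurrence per allocation ever inserted.
    drop_queue: VecDeque<usize>,
}

impl Cache {
    /// Pops keys from the FIFO drop queue until one refers to a live allocation,
    /// then drops that allocation, preserving the invariant that the number of
    /// occurrences of a key in the queue equals ignore_drops + allocations.len().
    fn pop_and_drop(&mut self) {
        while let Some(key) = self.drop_queue.pop_front() {
            if let Some(entry) = self.entries.get_mut(&key) {
                if entry.ignore_drops > 0 {
                    // This queue entry was invalidated when its allocation was reused.
                    entry.ignore_drops -= 1;
                    continue;
                }
                entry.allocations.pop();
                if entry.allocations.is_empty() && entry.ignore_drops == 0 {
                    self.entries.remove(&key);
                }
                return;
            }
        }
    }
}
```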

Also, have you noticed speedups when doing this?

This exists mainly for correctness: failing to remove keys from the drop_queue when popping would cause preferential removal of frequently used keys over old keys. While this implementation works, it has a major flaw that I'd overlooked until now: if shrink is never called, drop_queue grows without bound every time insert is called.

}
std::mem::drop(cache);
self.shrink();
coreylowman (Owner):

Think it's worth only calling this if necessary?

nkoppel (Contributor, Author):

I've made a small modification to shrink that minimizes the number of RwLock operations it does when no shrinking needs to occur.
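Roughly, the idea is something like this simplified sketch (hypothetical field names, not the PR's actual code):

```rust
use std::collections::VecDeque;
use std::sync::RwLock;

/// Hypothetical, simplified cache state for illustration only.
struct CacheState {
    /// Sizes of cached allocations, oldest first (stand-ins for real buffers).
    queue: VecDeque<usize>,
    size_in_bytes: usize,
    max_size: usize,
}

struct TensorCache {
    state: RwLock<CacheState>,
}

impl TensorCache {
    fn shrink(&self) {
        // Cheap check under a read lock; only take the write lock when the
        // cache actually exceeds its configured maximum size.
        {
            let state = self.state.read().unwrap();
            if state.size_in_bytes <= state.max_size {
                return;
            }
        }
        let mut state = self.state.write().unwrap();
        // Drop the oldest allocations (FIFO) until the cache fits under the limit.
        while state.size_in_bytes > state.max_size {
            match state.queue.pop_front() {
                Some(bytes) => state.size_in_bytes -= bytes,
                None => break,
            }
        }
    }
}
```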
