
Improve allocation caching #709

Open · wants to merge 23 commits into base: main

Conversation

@nkoppel nkoppel commented Apr 16, 2023

So far, this makes the TensorCache API cleaner by using the CacheStorage trait to allow buffer conversion logic to be called within the context of a TensorCache. It also adds CacheWrapper, which implements Drop and encapsulates all of the cache's unsafe operations, aside from returning uninitialized memory.
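For orientation, a rough sketch of the shape these two abstractions could take (illustrative signatures only; the PR's exact definitions may differ):

```rust
/// Sketch: storage that can reinterpret its element type in place, letting the
/// cache hold device buffers of many element types behind one interface.
trait CacheStorage: Sized {
    type Output<E>: CacheStorage;

    /// Reinterprets the buffer's elements as `E` without copying.
    ///
    /// # Safety
    /// The caller must ensure the buffer's size and alignment are valid for `E`.
    unsafe fn transmute_elements<E>(self) -> Self::Output<E>;
}

/// Sketch: an owning wrapper that frees its buffer correctly when dropped, so
/// unsafe cleanup lives in one place instead of being scattered through the cache.
struct CacheWrapper<S: CacheStorage> {
    storage: Option<S>,
    alignment: usize,
}

impl<S: CacheStorage> Drop for CacheWrapper<S> {
    fn drop(&mut self) {
        // The real wrapper would convert `storage` back to an element type with
        // the recorded `alignment` before freeing it; elided in this sketch.
        let _ = (self.storage.take(), self.alignment);
    }
}
```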

Todo:

  • More thoroughly document safety details
  • Add buffer removal strategy (FIFO until total size of the cache is less than some number of bytes)
  • Set a reasonable default maximum cache size, and allow users to configure cache size.
  • Write unit tests to ensure that cache shrinking works correctly
  • Fix unbounded drop_queue size when cache is large

@nkoppel nkoppel changed the title from Improve buffer caching to Improve allocation caching on Apr 16, 2023
Comment on lines 175 to 176
// TODO: default max size
max_size: RwLock::new(1_000_000),
nkoppel (Contributor, Author):

@coreylowman Thoughts on a default max_size? I think that 1 GB is reasonable in most cases, but maybe we could do something like a percentage of available system memory, or we could require the user to specify this value when enabling the cache.

nkoppel (Contributor, Author):

I've decided to require users to specify the maximum size of the cache when they enable it.

coreylowman (Owner):

I like it, great idea! Thoughts on using an enum to represent different options for this? I'm definitely going to forget what a plain usize would represent

dev.enable_cache(CacheSize::Unlimited)
dev.enable_cache(CacheSize::NumItems(1000))
dev.enable_cache(CacheSize::Bytes(10))
dev.enable_cache(CacheSize::MB(10))
dev.enable_cache(CacheSize::GB(10))
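A minimal sketch of what such an enum could look like, with a hypothetical helper that normalizes the byte-based variants (illustrative, not actual dfdx API):

```rust
/// Illustrative cache-size configuration enum following the examples above.
pub enum CacheSize {
    /// No limit on cached memory.
    Unlimited,
    /// Limit by the number of cached allocations.
    NumItems(usize),
    /// Limit by total bytes.
    Bytes(usize),
    /// Limit by total megabytes.
    MB(usize),
    /// Limit by total gigabytes.
    GB(usize),
}

impl CacheSize {
    /// Hypothetical helper: upper bound in bytes, if the limit is byte-based.
    pub fn num_bytes(&self) -> Option<usize> {
        match self {
            CacheSize::Unlimited | CacheSize::NumItems(_) => None,
            CacheSize::Bytes(b) => Some(*b),
            CacheSize::MB(mb) => Some(mb * 1_000_000),
            CacheSize::GB(gb) => Some(gb * 1_000_000_000),
        }
    }
}
```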

coreylowman (Owner):

Optional num bytes? Option<usize>

coreylowman (Owner) commented:

So are these the main benefits of this PR?

  1. Enable maximum cache size
  2. More unified transmutation/drop logic across devices


nkoppel commented Apr 20, 2023

Yes, and this should also be more memory safe.

@coreylowman coreylowman left a comment


I like the size limiting idea a lot, I don't even think that's configurable in pytorch so that'd be a great add!

Will need to comb over to determine if safety is still met. The new layers of abstraction here do make it a bit more difficult to reason about the safety at each part, which may be a downside of trying to abstract between cpu/cuda.

And something to think about: I think we may want to enable re-using allocations that are bigger than what is requested (e.g. I need 100MB, and the allocation cache gives me a 200MB buffer). These changes shouldn't make this harder or anything, but worth thinking about what changes would need to be made in both cases


nkoppel commented Apr 21, 2023

Will need to comb over to determine if safety is still met. The new layers of abstraction here do make it a bit more difficult to reason about the safety at each part, which may be a downside of trying to abstract between cpu/cuda.

I still think this is good for memory safety, because it reduces the surface area of memory unsafety to CacheWrapper and CacheStorage, rather than having cache operations use memory-unsafe operations directly.

And something to think about: I think we may want to enable re-using allocations that are bigger than what is requested (e.g. I need 100MB, and the allocation cache gives me a 200MB buffer). These changes shouldn't make this harder or anything, but worth thinking about what changes would need to be made in both cases

I was actually thinking about this, and implementing it could be quite easy thanks to the use of a BTreeMap, because BTreeMaps make it fast to find the next largest key after the one you requested. My main concern is that some operations may rely on the physical size of the buffer to get physical_numel or the number of threads to launch; we would need to implement a Tensor::physical_numel method that instead calculates this from shape and strides.
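As an illustration of that lookup, a minimal sketch using BTreeMap::range (the key and value types here are stand-ins, not the cache's actual AllocationKey):

```rust
use std::collections::BTreeMap;

/// Finds the smallest cached allocation size that can hold `requested` bytes.
/// `cache` maps allocation size in bytes to reusable buffers of that size.
fn next_fit(cache: &BTreeMap<usize, Vec<Vec<u8>>>, requested: usize) -> Option<usize> {
    cache
        .range(requested..)
        .find(|(_, buffers)| !buffers.is_empty())
        .map(|(&size, _)| size)
}
```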

Comment on lines +37 to +38
let src_layout = Layout::new::<T>().pad_to_align();
let dst_layout = Layout::new::<T2>().pad_to_align();
coreylowman (Owner):

I'm not sure if it's okay to use these with cuda - I don't think they follow the same layout/alignment rules as rust does.

It seems like they are just used to get number of bytes. Maybe just use std::mem::size_of::<>() * self.data.len()?
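A minimal sketch of that suggested computation, as a free function for illustration (the real code would use the wrapper's own length field):

```rust
/// Number of bytes occupied by `len` contiguous elements of type `T`.
/// For sized Rust types, size_of::<T>() already includes trailing padding,
/// so this matches the slice's total size in memory.
fn num_bytes<T>(len: usize) -> usize {
    std::mem::size_of::<T>() * len
}
```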

nkoppel (Contributor, Author):

I think that cuda formats memory in much the same way as the cpu, as we are able to dtoh and htod directly from one to the other, and my understanding is that this is a byte-for-byte copy of the host array. I believe that we should keep this in case we wish to use arrays of structs, as this is able to compute the padded size of a type as it would appear in an array.

coreylowman (Owner):

Yeah byte to byte I think is fine, I'm just not sure about the sizes/alignment.

pad_to_align says the following, which makes it sound like it can modify the length?

Creates a layout by rounding the size of this layout up to a multiple of the layout’s alignment.

As far as alignment it sounds like cuda is always aligned to at least 256, which is definitely not the case for rust:

Any address of a variable residing in global memory or returned by one of the memory allocation routines from the driver or runtime API is always aligned to at least 256 bytes.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses


/// Uses transmute_elements to convert to an element type with alignment `align` before dropping.
/// This **must** be a memory safe way to drop self, given the correct alignment
unsafe fn drop_with_alignment(self, align: usize) {
coreylowman (Owner):

Thoughts on removing default impl of this? This implementation is the CPU one, and I'm not sure we should be doing all of this for cuda.

nkoppel (Contributor, Author):

I think that this is correct for cuda, and while it may not be necessary, it doesn't have a significant performance impact as transmute_elements is very fast.
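For context, a rough sketch of the kind of alignment-matched drop being discussed here (illustrative only; it uses a raw pointer and plain Vecs in place of the cache's actual storage types):

```rust
/// Frees a cached allocation by rebuilding a Vec whose element alignment matches
/// the alignment the buffer was originally allocated with.
///
/// # Safety
/// `ptr` must have been allocated by a Vec with exactly `size_bytes` bytes and the
/// given `align`, and must not be used again after this call.
unsafe fn drop_with_alignment(ptr: *mut u8, size_bytes: usize, align: usize) {
    match align {
        1 => drop(Vec::from_raw_parts(ptr, size_bytes, size_bytes)),
        2 => drop(Vec::from_raw_parts(ptr as *mut u16, size_bytes / 2, size_bytes / 2)),
        4 => drop(Vec::from_raw_parts(ptr as *mut u32, size_bytes / 4, size_bytes / 4)),
        8 => drop(Vec::from_raw_parts(ptr as *mut u64, size_bytes / 8, size_bytes / 8)),
        _ => panic!("unsupported alignment"),
    }
}
```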

Comment on lines 63 to 66
// Tracks the number of matching 'AllocationKey's in drop_queue to ignore. This is used to
// "remove" the next instance of the matching AllocationKey in the drop_queue, without having
// to run an O(n) operation to actually remove the key.
ignore_drops: usize,
coreylowman (Owner):

Is all this just because we have the invariant of "if a key is in the tree, it must have > 0 values"? I wonder if it would be better to just remove that assumption, and just keep keys in the tree forever.

Also, have you noticed speedups when doing this? I'm wondering how much of an impact on speed the key removal actually has. My main concern is that this will be hard to understand/maintain whenever we revisit. I think the performance benefit would have to be pretty high for this to be worth it IMO.

Thoughts?

nkoppel (Contributor, Author):

"if a key is in the tree, it must have > 0 values"

Keys now remain in the map if they have either ignore_drops > 0 or allocations.len() > 0. This is to maintain the invariant that

(instances of key in drop_queue) = allocations[key].ignore_drops + allocations[key].allocations.len()

for all keys.
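To illustrate this bookkeeping, here is a simplified sketch of how popping from the drop queue could respect ignore_drops (hypothetical names and types, not the PR's actual code):

```rust
use std::collections::{BTreeMap, VecDeque};

/// Hypothetical, simplified cache state for illustration only.
struct Entry {
    /// Cached buffers for this key (stand-ins for real device allocations).
    allocations: Vec<Vec<u8>>,
    /// Number of upcoming occurrences of this key in `drop_queue` to skip,
    /// recorded when a queued allocation was handed back out instead of dropped.
    ignore_drops: usize,
}

struct Cache {
    entries: BTreeMap<usize, Entry>,
    /// FIFO queue of keys, one occurrence per allocation ever inserted.
    drop_queue: VecDeque<usize>,
}

impl Cache {
    /// Pops keys from the FIFO drop queue until one refers to a live allocation,
    /// then drops that allocation, preserving the invariant that the number of
    /// occurrences of a key in the queue equals ignore_drops + allocations.len().
    fn pop_and_drop(&mut self) {
        while let Some(key) = self.drop_queue.pop_front() {
            if let Some(entry) = self.entries.get_mut(&key) {
                if entry.ignore_drops > 0 {
                    // This queue entry was invalidated when its allocation was reused.
                    entry.ignore_drops -= 1;
                    continue;
                }
                entry.allocations.pop();
                if entry.allocations.is_empty() && entry.ignore_drops == 0 {
                    self.entries.remove(&key);
                }
                return;
            }
        }
    }
}
```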

Also, have you noticed speedups when doing this?

This exists mainly for correctness: failing to remove keys from the drop_queue when popping would cause preferential removal of frequently used keys over old keys. While this implementation works, it has a major flaw that I'd overlooked until now: if shrink is never called, drop_queue grows without bound every time insert is called.

}
std::mem::drop(cache);
self.shrink();
coreylowman (Owner):

Think it's worth only calling this if necessary?

nkoppel (Contributor, Author):

I've made a small modification to shrink that minimizes the number of RwLock operations it does when no shrinking needs to occur.
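Roughly, the idea is something like this simplified sketch (hypothetical field names, not the PR's actual code):

```rust
use std::collections::VecDeque;
use std::sync::RwLock;

/// Hypothetical, simplified cache state for illustration only.
struct CacheState {
    /// Sizes of cached allocations, oldest first (stand-ins for real buffers).
    queue: VecDeque<usize>,
    size_in_bytes: usize,
    max_size: usize,
}

struct TensorCache {
    state: RwLock<CacheState>,
}

impl TensorCache {
    fn shrink(&self) {
        // Cheap check under a read lock; only take the write lock when the
        // cache actually exceeds its configured maximum size.
        {
            let state = self.state.read().unwrap();
            if state.size_in_bytes <= state.max_size {
                return;
            }
        }
        let mut state = self.state.write().unwrap();
        // Drop the oldest allocations (FIFO) until the cache fits under the limit.
        while state.size_in_bytes > state.max_size {
            match state.queue.pop_front() {
                Some(bytes) => state.size_in_bytes -= bytes,
                None => break,
            }
        }
    }
}
```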
