
Optimize RAM to VRAM transfer #6312

Merged — 11 commits merged into main from lstein/feat/cpu_to_vram_optimization on May 24, 2024

Conversation

@lstein (Collaborator) commented May 5, 2024

Summary

This PR speeds up the model manager’s system for moving models back and forth between RAM and VRAM. Instead of calling model.to() to accomplish the transfer, the model manager now stores a copy of the model’s state dict in RAM. When the model needs to be moved into VRAM for inference, the manager makes a VRAM copy of the state dict and assigns it to the model using load_state_dict(). When inference is done, the model is cleared from VRAM by calling load_state_dict() with the CPU copy of the state dict.
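
For illustration, here is a minimal sketch of that scheme (my own simplified code, not the PR's implementation; it assumes PyTorch >= 2.1 for the assign=True keyword of load_state_dict):

    import torch
    from torch import nn

    class CachedModel:
        """Toy model-cache entry that keeps a CPU copy of the weights."""

        def __init__(self, model: nn.Module):
            self.model = model
            # Snapshot the weights in RAM once; this copy is never mutated.
            self.cpu_state_dict = {k: v.to("cpu", copy=True) for k, v in model.state_dict().items()}

        def load_to_vram(self, device: str = "cuda:0") -> nn.Module:
            # Copy the cached weights to VRAM and assign them to the module's parameters.
            vram_state_dict = {k: v.to(device) for k, v in self.cpu_state_dict.items()}
            self.model.load_state_dict(vram_state_dict, assign=True)
            return self.model

        def unload_from_vram(self) -> None:
            # "Unloading" is just re-assigning the untouched CPU copy;
            # nothing needs to be copied out of VRAM.
            self.model.load_state_dict(self.cpu_state_dict, assign=True)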

Benchmarking an SDXL model shows an improvement from 3 seconds to 0.81 seconds for a model load/unload cycle. Most of the improvement comes from the unload step, as shown in the table below:


model          from    to      old(s) new(s)                                                                    
-----          ----    --      -----  -----                                                                     
unet           cpu     cuda:0  0.69   0.52                                                                      
text_encoder   cpu     cuda:0  0.15   0.09                                                                      
text_encoder_2 cpu     cuda:0  0.16   0.14                                                                      
vae            cpu     cuda:0  0.02   0.02                                                                      
          LOAD TO CUDA TOTAL   1.02   0.77                                                                      
                                                                                                                
unet           cuda:0  cpu     1.45   0.03                                                                      
vae            cuda:0  cpu     0.09   0.00                                                                      
text_encoder   cuda:0  cpu     0.07   0.00                                                                      
text_encoder_2 cuda:0  cpu     0.40   0.01                                                                      
        UNLOAD FROM CUDA TOTAL 2.01   0.04

Thanks to @RyanJDick for suggesting this load/unload scheme.

Related Issues / Discussions

QA Instructions

Change models a number of times. Monitor RAM and VRAM for memory leaks.

Merge Plan

Merge when approved

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • Documentation added / updated (if applicable)

@github-actions bot added the python (PRs that change python files) and backend (PRs that change backend files) labels on May 5, 2024
@hipsterusername (Member) commented:

This is a huge speed up!!! Awesome. Will wait for @RyanJDick to take a look

@RyanJDick (Collaborator) left a review comment:

This is awesome! I tested it out with some simple T2I workflows, and saw the speedup, as promised.

I left a few comments. Once those are addressed, I'll run it through its paces with a bunch of model types to make sure there aren't any weird edge cases.

@lstein lstein requested a review from RyanJDick May 7, 2024 04:44
@RyanJDick (Collaborator) left a review comment:

I ran into a bug during testing. I hadn't thought about this before, but this approach breaks if a model is moved between devices while a patch is applied. I can trigger this by using a TI. The series of events is:

  • The text encoder is loaded and registered with the model cache.
  • We apply the TI to the text encoder while the text encoder is on the CPU. This patch creates a new tensor of token embeddings with a different shape.
  • We attempt to move the text encoder to the GPU. This operation fails because the state_dict tensor sizes no longer match.

We could probably find a quick way to solve this particular problem, but it makes me worry about the risk of similar bugs. We need clear rules for how the model cache and model patching are intended to interact.

One approach would be to require that models are patched and unpatched during the span of a model cache lock. TIs are a little weird in that the patch is applied on the CPU before copying the model to GPU. We should look into whether we can just do all of this on the GPU. If not, we may have to consider splitting the concepts of model access locking and model device locking.
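
A toy illustration of the failure mode (my own example with made-up sizes, not the actual InvokeAI code):

    import torch
    from torch import nn

    # Stand-in for the text encoder's token-embedding table (toy sizes).
    text_encoder = nn.Embedding(num_embeddings=1000, embedding_dim=8)

    # The cache snapshots the state dict while the model is on the CPU.
    cached_state_dict = {k: v.clone() for k, v in text_encoder.state_dict().items()}

    # A TI patch replaces the embedding table with a larger one (extra trigger tokens).
    text_encoder.weight = nn.Parameter(torch.cat([text_encoder.weight.data, torch.zeros(2, 8)]))

    # Re-assigning the cached (pre-patch) state dict -- which is how the cache moves the
    # model to the GPU -- now raises:
    #   RuntimeError: ... size mismatch for weight: copying a param with shape
    #   torch.Size([1000, 8]) ... the shape in current model is torch.Size([1002, 8]).
    text_encoder.load_state_dict(cached_state_dict)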

@lstein (Collaborator, Author) commented May 13, 2024

> I ran into a bug during testing. I hadn't thought about this before, but this approach breaks if a model is moved between devices while a patch is applied. I can trigger this by using a TI. The series of events is:
>
>   • The text encoder is loaded and registered with the model cache.
>   • We apply the TI to the text encoder while the text encoder is on the CPU. This patch creates a new tensor of token embeddings with a different shape.
>   • We attempt to move the text encoder to the GPU. This operation fails because the state_dict tensor sizes no longer match.
>
> We could probably find a quick way to solve this particular problem, but it makes me worry about the risk of similar bugs. We need clear rules for how the model cache and model patching are intended to interact.
>
> One approach would be to require that models are patched and unpatched during the span of a model cache lock. TIs are a little weird in that the patch is applied on the CPU before copying the model to GPU. We should look into whether we can just do all of this on the GPU. If not, we may have to consider splitting the concepts of model access locking and model device locking.

@RyanJDick I'm not all that familiar with model patching. Is patching done prior to every generation and then reversed? If so, the trick would be to refresh the cached state_dict whenever patching is done on a CPU-based model.

@RyanJDick (Collaborator) commented:

Patching (for LoRA or TI) is managed using context managers (applied on entry, and reversed on exit).

Examples:

    with (
        ModelPatcher.apply_ti(tokenizer_model, text_encoder_model, ti_list) as (
            tokenizer,
            ti_manager,
        ),
        text_encoder_info as text_encoder,
        # Apply the LoRA after text_encoder has been moved to its target device for faster patching.
        ModelPatcher.apply_lora(text_encoder, _lora_loader(), lora_prefix),
        # Apply CLIP Skip after LoRA to prevent LoRA application from failing on skipped layers.
        ModelPatcher.apply_clip_skip(text_encoder_model, clip_field.skipped_layers),
    ):
        ...

    with (
        ExitStack() as exit_stack,
        ModelPatcher.apply_freeu(unet_info.model, self.unet.freeu_config),
        set_seamless(unet_info.model, self.unet.seamless_axes),  # FIXME
        unet_info as unet,
        # Apply the LoRA after unet has been moved to its target device for faster patching.
        ModelPatcher.apply_lora_unet(unet, _lora_loader()),
    ):
        ...

Now that the model cache has the power to modify a model's weights (restore them to a previous state), we need clearer ownership semantics (i.e., who can modify a model, when can they modify it, and what guarantees do they have to offer?).

Designing this well would take more thought / effort than I can spend on it right now.

We might be able to take a shortcut to get this working now though. I think this might be achievable with some combination of:

  • Store the model state_dict at the time that the model is moved to the device instead of at the time that the model is added to the cache.
  • Make TI patching work with on-device models and switch the order of the context managers.
  • Make this new optimized behavior configurable. I.e. something like with model_info.on_device(allow_copy=True) as model:

More investigation needed to figure out which of those makes the most sense.

@github-actions bot added the invocations (PRs that change invocations) label on May 18, 2024
@lstein (Collaborator, Author) commented May 18, 2024

@RyanJDick I finally got back to this after an interlude. It was a relatively minor fix to get all the model patching done after loading the model into the target device, and the code is cleaner too. I've tested LoRA, TI and clip skip, and they all seem to be working as expected. Seamless doesn't seem to do much of anything, either with this PR or on current main. Not sure what's up with that; I haven't used seamless in over a year.

@lstein lstein requested a review from RyanJDick May 18, 2024 04:27
@psychedelicious (Collaborator) left a review comment:

I understand the overall strategy, but I'm having trouble wrapping my head around the fix for models changing device. If I understand correctly, the solution is very simple - re-order the context managers. Can you ELI5 how changing the order of the context managers fixes this?

Also curious about this edge case - say we have two compel nodes:

  • We execute compel node 1. At this time, the models are in VRAM.
  • Time passes and we load other models, evicting the UNet and CLIP from VRAM. Maybe they are in RAM, maybe they aren't cached at all.
  • We execute compel node 2. Is this a problem?

Two review threads on invokeai/app/invocations/compel.py (resolved)
@psychedelicious (Collaborator) commented:

BTW - seamless working fine for me. Tested SD1.5 and SDXL, all permutations of axes. I wonder if there is some interaction with other settings you were using?

[screenshots of seamless test renders]

@psychedelicious (Collaborator) commented:

Note: I tested this PR to see if it fixed #6375. It does not.

@lstein (Collaborator, Author) commented May 18, 2024

> I understand the overall strategy, but I'm having trouble wrapping my head around the fix for models changing device. If I understand correctly, the solution is very simple - re-order the context managers. Can you ELI5 how changing the order of the context managers fixes this?
>
> Also curious about this edge case - say we have two compel nodes:
>
>   • We execute compel node 1. At this time, the models are in VRAM.
>   • Time passes and we load other models, evicting the UNet and CLIP from VRAM. Maybe they are in RAM, maybe they aren't cached at all.
>   • We execute compel node 2. Is this a problem?

The context managers were reordered so that the context manager calls that lock the model in VRAM are executed before the patches are applied, and it is the locked model that is passed to the patchers. I also switched the relative order of the TI and LoRA patchers, but only because it made the code formatting easier to read. I tested both orders and got identical images.

Here's the edge case:

  1. We execute compel node 1. The models may or may not be in VRAM (and may not be in RAM either). When the compel invocation runs, the models are moved into VRAM by the model manager's context manager and locked there for the duration of the context. Within the context the model is patched in VRAM.
  2. As soon as the compel context is finished, the model is unpatched. It is also likely removed from VRAM unless it happens to fit into the VRAM cache space.
  3. A new compel node is executed. If the model is no longer in VRAM, a fresh copy of the model weights is copied into VRAM and the process described in step 1 is repeated.

RAM->VRAM operations are about twice as fast as VRAM->RAM on my system. I am tempted to remove the VRAM cache entirely so that we are guaranteed to have a fresh copy of the model weights each time. However, if the patchers are unpatching correctly, this shouldn't be an issue.
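
For concreteness, a hedged sketch of the reordered context managers (the names echo the snippets quoted earlier in this thread; this illustrates the ordering only and is not the merged code):

    with (
        # Lock the text encoder in VRAM for the whole block; the cached RAM state dict is untouched.
        text_encoder_info as text_encoder,
        # Patch the locked, on-device model; each patcher reverts its changes on exit,
        # before the lock is released and the VRAM copy is discarded.
        ModelPatcher.apply_lora(text_encoder, _lora_loader(), lora_prefix),
        ModelPatcher.apply_ti(tokenizer_model, text_encoder, ti_list) as (tokenizer, ti_manager),
        ModelPatcher.apply_clip_skip(text_encoder, clip_field.skipped_layers),
    ):
        ...  # run the compel invocation against the patched, on-device model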

@lstein (Collaborator, Author) commented May 18, 2024

> Note: I tested this PR to see if it fixed #6375. It does not.

Rats. I was rather hoping it would. I'm digging into the LoRA loading issue now.

@lstein (Collaborator, Author) commented May 18, 2024

> BTW - seamless working fine for me. Tested SD1.5 and SDXL, all permutations of axes. I wonder if there is some interaction with other settings you were using?

It is working for me as well. I just had to adjust the image dimensions to see the effect. Seamless is not something I ever use.

@lstein (Collaborator, Author) commented May 19, 2024

@psychedelicious @RyanJDick I have included a fix for #6375 in this PR. There was some old model cache code originally written by Stalker that traversed the garbage collector and forcibly deleted local variables from unused stack frames. This code was written to work around a Python 3.9 GC bug, but it seems to wreak havoc on context managers. The low RAM cache setting simply triggered the problem. I suspect this may have caused rare failures in other contexts as well (pun intended).

I removed the code and tested for signs of memory leaks. I didn't see any, but please keep an eye out.

Going off on a tangent, while reviewing the patching code, I discovered that LoRA patching uses the following pattern:

  1. Load the LoRA as lora_info using the MM.
  2. Get the CPU copy of the model using lora_info.model.
  3. Iterate through the LoRA layers, moving each one into VRAM.
  4. Apply the LoRA layer's weights (saving the original model weights to restore later).
  5. Move the layer back to RAM.

I think it would be more performant to:

  1. Load the LoRA using the MM
  2. Enter the context that moves the LoRA weights into VRAM (using the new RAM->VRAM transfer)
  3. Apply the weights layer by layer
  4. Exit the context

The downside is that this will transiently use more VRAM because all the LoRA layers are loaded at once.
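
Roughly, the proposed flow might look like the pseudocode-style sketch below (my own sketch; model_cache.lock, modules_by_key, layer.get_weight, and lora_weight are hypothetical names, not the actual MM API):

    # Move the whole LoRA into VRAM once via the model manager's locking context
    # (which now uses the fast RAM->VRAM state-dict copy), then patch layer by layer.
    original_weights: dict[str, torch.Tensor] = {}
    with model_cache.lock(lora_info) as lora_model:      # hypothetical locker API
        for key, layer in lora_model.layers.items():
            module = modules_by_key[key]                 # hypothetical key -> module lookup
            # Save the unpatched weight on the CPU so it can be restored on unpatch.
            original_weights[key] = module.weight.detach().to("cpu", copy=True)
            module.weight.data += layer.get_weight(module.weight) * lora_weight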

Another potential optimization would be to stop saving the original model's weights on entry to the patcher context and restoring them on exit. Since we are now keeping a virgin copy of the state dictionary in the RAM cache, the patched model in VRAM is cleared out at the end of a node's invocation and will be replaced with a fresh copy the next time it is needed.

I gave both of these things a quick try and the system felt snappier, but I didn't do timing or any stress tests. If you think this is worth pursuing, I'll submit a new PR for them.

[EDIT] I can shave off ~2s of generation walltime (from 10.8s to 8.8s) by avoiding the unnecessary step of restoring weights to the VRAM copy of the model.

@github-actions bot added the services (PRs that change app services), python-tests (PRs that change python tests), and docs (PRs that change docs) labels on May 20, 2024
@lstein (Collaborator, Author) commented May 20, 2024

The latest commit implements an optimization that circumvents the LoRA unpatching step when working with a model that is resident in CUDA VRAM. This works because the new scheme never copies the model weights back from VRAM into RAM, but instead reinitializes the VRAM copy from a fresh RAM state dict the next time the model is needed. The behavior for CPU and MPS devices has not changed, since these operate on the RAM copy. When generating with SDXL models, this optimization saves roughly 1s per LoRA per generation, which I think makes the special casing worth it.
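
In effect (a simplified sketch of the idea, not the PR's code; original_weights maps full parameter names to their saved tensors), the unpatch step becomes a no-op when the patched copy lives in VRAM, because that copy will be rebuilt from the cached RAM state dict anyway:

    import torch
    from torch import nn

    def unpatch(model: nn.Module, original_weights: dict[str, torch.Tensor]) -> None:
        if next(model.parameters()).device.type == "cuda":
            # The VRAM copy is disposable: the next load re-creates it from the RAM state dict.
            return
        # On CPU/MPS the cached copy *is* the working copy, so the weights must be restored.
        with torch.no_grad():
            for name, weight in original_weights.items():
                model.get_parameter(name).copy_(weight)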

The other optimization I tried was to let the model manager load the LoRA into VRAM using its usual model locking mechanism rather than manually moving each layer into VRAM before patching. However, this did not give a performance gain and needed special casing for LoRAs in the model manager because LoRAs don't have load_state_dict.

Other changes in this commit:

  1. I have removed the VRAM cache along with the configuration variables that control its behavior. The cache is incompatible with the LoRA unpatching optimization, and I was planning to get rid of it anyway given the trouble users have with it.
  2. I have updated the config schema to 4.0.2 and added a migration script that removes the VRAM settings from the user's invokeai.yaml file.
  3. I have modified the test_lora.py test to accommodate the lack of unpatching when running on CUDA.

@RyanJDick (Collaborator) left a review comment:

Did you test the effect of removing the VRAM cache with a large VRAM cache size (e.g. large enough to hold all working models)? For this usage pattern, I'm afraid that there is going to be a significant speed regression from removing it.


The behavior of the apply_lora(...) and model locker context managers now changes significantly depending on the environment in which they run.

apply_lora():

  • Without CUDA GPU:
    • __enter__: apply lora weights
    • __exit__: revert lora weights
  • With CUDA GPU:
    • __enter__: apply lora weights
    • __exit__: do nothing

Model locker:

  • Without CUDA GPU:
    • __enter__: move model to target device
    • __exit__: move model to RAM
  • With CUDA GPU:
    • __enter__: move model to target device
    • __exit__: move model to RAM, and revert any changes made to the weights

Given these major differences in behavior depending on environment, the caller of these context managers needs to be deeply familiar with their implementation details to use them correctly. It might be better to force the caller to explicitly specify the desired behavior. For example:

with (
	ModelCache.model_on_device(model_info, target_device, copy_weights_to_device=True) as model,
	ModelPatcher.apply_lora(model, _lora_loader(), lora_prefix, revert_on_exit=False)
):
	...

What do you think? It would be a breaking change to the API, but we're making a major breaking change to the behaviour either way.

Separately, we may also want to consider making model_info.model a private attribute. I can't think of a good reason to access the model directly outside of a model locker context.


For my own reference, here's a rough checklist of the tests we should run once the code settles down to check for performance and behavior regressions:

  • Context managers:
    • Lora
    • TI
    • FreeU
    • Seamless
    • Clip Skip
  • HW
    • CUDA (multiple device types in case this impacts copy speeds)
    • CPU
    • MPS

Review threads on invokeai/backend/model_patcher.py and invokeai/app/services/config/config_default.py (outdated, resolved)
@RyanJDick (Collaborator) commented:

> Did you test the effect of removing the VRAM cache with a large VRAM cache size (e.g. large enough to hold all working models)? For this usage pattern, I'm afraid that there is going to be a significant speed regression from removing it.

I had a chance to do some testing of this today. As I suspected, removing the VRAM cache does result in a regression when the VRAM cache is large enough to hold the larger models (>1 sec per generation). The improved LoRA patching speed makes up for it in some cases, but not in others.

[benchmark results screenshot]

This PR has grown quite a bit in scope. It now covers:

  1. Keep copy on CPU to improve VRAM offload speed
  2. Remove the hacky garbage collection logic that is causing the lora patching bug
  3. Remove the VRAM cache altogether
  4. Optimize LoRA patch/unpatch time

How about we split these up so that we can properly evaluate and test each one? I feel like we definitely want 1 and 2 (ideally as separate PRs). 3 and 4 come with major tradeoffs, maybe we can find a way to get both the benefit of a VRAM cache and smarter LoRA patching.

@lstein (Collaborator, Author) commented May 24, 2024

> Did you test the effect of removing the VRAM cache with a large VRAM cache size (e.g. large enough to hold all working models)? For this usage pattern, I'm afraid that there is going to be a significant speed regression from removing it.
>
> I had a chance to do some testing of this today. As I suspected, removing the VRAM cache does result in a regression when the VRAM cache is large enough to hold the larger models (>1 sec per generation). The improved LoRA patching speed makes up for it in some cases, but not in others.
>
> [benchmark results screenshot]
>
> This PR has grown quite a bit in scope. It now covers:
>
>   1. Keep copy on CPU to improve VRAM offload speed
>   2. Remove the hacky garbage collection logic that is causing the lora patching bug
>   3. Remove the VRAM cache altogether
>   4. Optimize LoRA patch/unpatch time
>
> How about we split these up so that we can properly evaluate and test each one? I feel like we definitely want 1 and 2 (ideally as separate PRs). 3 and 4 come with major tradeoffs, maybe we can find a way to get both the benefit of a VRAM cache and smarter LoRA patching.

4 is dependent on 3. How about I just remove the code changes for 3 and 4 and we can consider them as a separate future PR? This is easier for me as I’ll just reset to an earlier commit.

@lstein force-pushed the lstein/feat/cpu_to_vram_optimization branch from 5d4b747 to c775b59 on May 24, 2024 02:06
@lstein (Collaborator, Author) commented May 24, 2024

@RyanJDick I’ve undone the model patching changes and the removal of the VRAM cache, and what’s left is the original cpu->vram optimization, the fix to the TI patching, and the fix for the weird context manager bug that was causing LoRAs not to patch. It is a fairly minimal PR now, so I hope we can get it merged. I’ll work on the LoRA patching optimization separately.

@psychedelicious psychedelicious self-requested a review May 24, 2024 02:17
@psychedelicious (Collaborator) left a review:

Approved so my requested changes aren't a blocker for this PR

@RyanJDick (Collaborator) left a review:

Awesome. Thanks for splitting up the PRs.

I did some quick manual regression testing - everything looked good. I tried:

  • Text-to-image, LoRA, TI
  • CPU-only
  • A bunch of model switching - no obvious signs of a memory leak.

I also ran some performance tests.
With vram: 0.25:

  • SDXL T2I, cold cache: 10.4s -> 9.6s
  • SDXL T2I, warm cache: 6.9s -> 6.1s
  • SDXL T2I + 2 LoRA, warm cache: 9.0 -> 8.6s

With vram: 16 (no significant change, as expected):

  • SDXL T2I, cold cache: 8.0s -> 8.0s
  • SDXL T2I, warm cache: 4.7s -> 4.6s
  • SDXL T2I + 2 LoRA, warm cache: 6.9s -> 6.9s

@RyanJDick (Collaborator) commented:

@lstein There are a few torch features that might stack nicely on this PR to give even more speedup for Host-to-Device copies:

  • torch.Tensor.pin_memory()
  • torch.Tensor.to(..., non_blocking=True)

Have you looked into these at all? I don't want to expand the scope of this PR, but these could be an easy follow-up if you're interested in trying them out (or I can do it).
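
Both are standard PyTorch APIs; a rough sketch of how they could combine with the cached CPU state dict (toy tensors, not InvokeAI code; requires a CUDA device):

    import torch

    # Toy stand-in for the cached CPU state dict.
    cpu_state_dict = {"weight": torch.randn(4096, 4096), "bias": torch.randn(4096)}

    # Pin the host tensors once; page-locked memory is required for truly asynchronous copies.
    cpu_state_dict = {k: v.pin_memory() for k, v in cpu_state_dict.items()}

    # Issue non-blocking host-to-device copies that can overlap with other GPU work.
    device = torch.device("cuda:0")
    vram_state_dict = {k: v.to(device, non_blocking=True) for k, v in cpu_state_dict.items()}
    torch.cuda.synchronize()  # make sure the copies have landed before the weights are used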

@lstein lstein enabled auto-merge (squash) May 24, 2024 15:19
@lstein (Collaborator, Author) commented May 24, 2024

I'm going to merge this in and then will start working on further optimizations including the lora loading/unloading.

@lstein (Collaborator, Author) commented May 24, 2024

> Awesome. Thanks for splitting up the PRs.
>
> I did some quick manual regression testing - everything looked good. I tried:
>
>   • Text-to-image, LoRA, TI
>   • CPU-only
>   • A bunch of model switching - no obvious signs of a memory leak.
>
> I also ran some performance tests. With vram: 0.25:
>
>   • SDXL T2I, cold cache: 10.4s -> 9.6s
>   • SDXL T2I, warm cache: 6.9s -> 6.1s
>   • SDXL T2I + 2 LoRA, warm cache: 9.0s -> 8.6s
>
> With vram: 16 (no significant change, as expected):
>
>   • SDXL T2I, cold cache: 8.0s -> 8.0s
>   • SDXL T2I, warm cache: 4.7s -> 4.6s
>   • SDXL T2I + 2 LoRA, warm cache: 6.9s -> 6.9s

Thanks for doing the timings. It's not as big a speedup as I saw, but probably very dependent on hardware.

@lstein lstein merged commit 532f82c into main May 24, 2024
14 checks passed
@lstein lstein deleted the lstein/feat/cpu_to_vram_optimization branch May 24, 2024 17:06