What to expect from Offloading in Remat? #19063

luyug · 2023-12-20T15:56:05Z

luyug
Dec 20, 2023

Hello! I am trying to optimize memory usage of a neural network. I read #17576 which describes that the activation offloading feature in remat.

I tried changing the dot_with_no_batch_dims_saveable:

def dot_with_no_batch_dims_saveable(prim, *_, **params):
  # This is a useful heuristic for transformers.
  if prim is lax_internal.dot_general_p:
    (_, _), (lhs_b, rhs_b) = params['dimension_numbers']
    if not lhs_b and not rhs_b:
      return Saveable
  return Recompute

policy = dot_with_no_batch_dims_saveable

into

def dot_with_no_batch_dims_offload(prim, *_, **params):
  # This is a useful heuristic for transformers.
  if prim is lax_internal.dot_general_p:
    (_, _), (lhs_b, rhs_b) = params['dimension_numbers']
    if not lhs_b and not rhs_b:
      return Offloadable(src="tpu_hbm", dst="unpinned_host")
  return Recompute

policy = dot_with_no_batch_dims_offload

when doing remat_call = jax.checkpoint(partial(model.__call__, train=True), policy=policy).

I got a very similar OOM error with identical peak tpu hbm usage from the compiler. This seems counter-intuitive to me; offloading activations supposedly will reduce the peak ram usage. Am I missing something here? Does the compiler account for the memory saving from offloading? Or are there other compiler stuff going on? Thanks!

mattjj · 2023-12-20T19:39:40Z

mattjj
Dec 20, 2023
Maintainer

Thanks for the question!

While #17576 added the JAX-side plumbing, I believe the XLA compiler doesn't actually yet support this feature.

@yashk2810 is that right?

0 replies

yashk2810 · 2023-12-20T20:31:20Z

yashk2810
Dec 20, 2023
Maintainer

Yes, that's correct.

We are a couple of PRs away from enabling it for TPUs. So it should be landing soon enough.

4 replies

mattjj Dec 20, 2023
Maintainer

Does that mean 2023 or 2024? :P

SandSnip3r Dec 20, 2023

I'm working on this. It won't land before 2024. We plan to land the initial activation offloading functionality in early January.

kurtisdavid Apr 13, 2024

I'm working on this. It won't land before 2024. We plan to land the initial activation offloading functionality in early January.

Awesome! How can we track the progress of this? I'm interested in using this for memory constrained models we're deploying on TPUs.

yashk2810 Apr 13, 2024
Maintainer

Hey! Yes, this works now!

You can see some tests here:

jax/tests/memories_test.py

Line 1156 in 70dca30

class ActivationOffloadingTest(jtu.JaxTestCase):

Usage in maxtext: https://github.com/google/maxtext/blob/ebd39aa64d670fa13a313b6f776e01ad9e450321/MaxText/layers/models.py#L236

kurtisdavid · 2024-04-21T03:39:18Z

kurtisdavid
Apr 21, 2024

Hey! Yes, this works now!

You can see some tests here:

jax/tests/memories_test.py

Line 1156 in 70dca30

class ActivationOffloadingTest(jtu.JaxTestCase):

Usage in maxtext: https://github.com/google/maxtext/blob/ebd39aa64d670fa13a313b6f776e01ad9e450321/MaxText/layers/models.py#L236

Awesome!! This is my first time trying offloading, what are some ways to know it's working? I've implemented it and then viewed the HBM usage but I didn't see a difference (I don't think I'm using it right just yet, are there other ways to check?)

1 reply

yashk2810 Apr 21, 2024
Maintainer

You can use the AOT APIs to see if the host stats show up:

f.lower(*args).compile().memory_analysis()

jax/tests/memories_test.py

Lines 1207 to 1210 in 70dca30

    
           compiled_stats = compiled_f.memory_analysis() 
        
           if compiled_stats is not None and jtu.test_device_matches(["tpu"]): 
        
             if xla_extension_version >= 240 and jtu.pjrt_c_api_version_at_least(0, 43): 
        
               self.assertGreater(compiled_stats.host_temp_size_in_bytes, 0)

Note that you will have to use the latest jax, jaxlib and libtpu nightlies for this!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

What to expect from Offloading in Remat? #19063

{{title}}

Replies: 3 comments 5 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

What to expect from Offloading in Remat? #19063

luyug Dec 20, 2023

Replies: 3 comments · 5 replies

mattjj Dec 20, 2023 Maintainer

yashk2810 Dec 20, 2023 Maintainer

mattjj Dec 20, 2023 Maintainer

SandSnip3r Dec 20, 2023

kurtisdavid Apr 13, 2024

yashk2810 Apr 13, 2024 Maintainer

kurtisdavid Apr 21, 2024

yashk2810 Apr 21, 2024 Maintainer

luyug
Dec 20, 2023

Replies: 3 comments 5 replies

mattjj
Dec 20, 2023
Maintainer

yashk2810
Dec 20, 2023
Maintainer

mattjj Dec 20, 2023
Maintainer

yashk2810 Apr 13, 2024
Maintainer

kurtisdavid
Apr 21, 2024

yashk2810 Apr 21, 2024
Maintainer