
tf.summary.histogram allocates a lot of memory when called with large data #6714

Open

hrsht opened this issue Jan 3, 2024 · 5 comments

hrsht commented Jan 3, 2024

Environment information (required)

Irrelevant

Issue description

I am calling tf.summary.histogram with large data (in my case, a 1-D tensor of ~16M int32 elements). I noticed that the call allocates around 3.6GB of memory, producing the following log line from TensorFlow:

2024-01-02 19:27:52.663452: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 3903564240 exceeds 10% of free system memory.
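
For reference, a minimal snippet along these lines reproduces the warning (the random data and log directory are assumptions; any sufficiently large 1-D tensor should do):

    import numpy as np
    import tensorflow as tf

    # ~16M int32 elements, matching the case described above.
    data = tf.constant(np.random.randint(0, 100, size=16_000_000), dtype=tf.int32)

    writer = tf.summary.create_file_writer("/tmp/hist_logs")
    with writer.as_default(step=0):
        # Triggers the large allocation inside the bucketing logic.
        tf.summary.histogram("large_data", data)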

Digging deeper, I noticed that this call to tf.one_hot in the method that computes the buckets is the culprit. The resulting tensor has bucket_count times the original number of elements, and since it is tf.float64, that explains the 3.6GB allocation in my case with the default bucket count of 30 (16M × 30 × 8 bytes ≈ 3.6GB).

It looks like this tensor is only used to compute the bucket_counts tensor, i.e. the number of elements falling in each bucket. Can this be done in a more memory-efficient way? Something like:

bucket_counts = tf.map_fn(
    lambda i: tf.reduce_sum(tf.cast(clamped_indices == i, tf.int32)),
    tf.range(bucket_count))
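
For illustration, here is a self-contained sketch (the toy clamped_indices values are made up) comparing the existing one-hot approach with the proposed one; both yield the same per-bucket counts:

    import tensorflow as tf

    bucket_count = 30
    clamped_indices = tf.constant([0, 0, 1, 29, 29])  # toy example

    # Existing approach: materializes an [N, bucket_count] float64 tensor.
    one_hot = tf.one_hot(clamped_indices, depth=bucket_count, dtype=tf.float64)
    counts_one_hot = tf.reduce_sum(one_hot, axis=0)

    # Proposed approach: one reduction per bucket, never more than O(N) live.
    counts_map_fn = tf.map_fn(
        lambda i: tf.reduce_sum(tf.cast(clamped_indices == i, tf.int32)),
        tf.range(bucket_count))

    tf.debugging.assert_equal(tf.cast(counts_one_hot, tf.int32), counts_map_fn)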
@hrsht hrsht changed the title tf.summary.histogram allocates a lot of memory with called with large data tf.summary.histogram allocates a lot of memory when called with large data Jan 3, 2024
arcra (Member) commented Jan 4, 2024

I suppose this seems reasonable. I'll ask the team if they're aware of a particular reason why it was done this way.

I don't know if/when we'll prioritize this, but contributions are welcome. I expect that as long as tests pass, the change should be fine.

One additional thing to validate is that the proposed solution doesn't perform considerably worse than the existing one, so it would be nice to run some benchmarks, though I expect it would be fine.

hrsht (Author) commented Jan 4, 2024

Thanks for your reply!

Can you let me know if there is a particular reason to do it this way? If there is no particular reason, I can work on this issue.

yatbear (Member) commented Jan 4, 2024

Hi @hrsht,

We intentionally cast the values to float64 to avoid accumulating floating-point errors with large counts: #5337 (see tensorflow/tensorflow#36128 for context).
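
For anyone curious, the precision issue is easy to demonstrate (a minimal sketch, independent of the histogram code): float32 has a 24-bit significand, so integer counts above 2**24 can no longer be represented exactly:

    import numpy as np

    x = np.float32(2**24)
    print(x + np.float32(1.0) == x)          # True: the increment is lost
    print(np.float64(2**24) + 1.0 == 2**24)  # False: float64 is still exact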

hrsht (Author) commented Jan 5, 2024

@yatbear That's not the issue here. The bucket counts can still be computed as float64 (or, IMO, int64 for all practical purposes).

The usage of tf.one_hot is the issue, as it ends up using on the order of bucket_count times more memory than the original data. With the default bucket count of 30, an input of 7e7 elements allocates a tensor of ~16GB (7e7 × 30 × 8 bytes), which can be troublesome, especially on a GPU (or, in my case, a machine with 16GB of RAM).

arcra (Member) commented Jan 5, 2024

Indeed, apologies to you both. I was seeking help from @yatbear about this issue, but I probably did not communicate my question clearly.

It's been a long time since anybody touched this code, so we're trying to reconstruct the reasoning. Talking to @yatbear and @nfelt, Nick suggested that these operations need to be "TPU-compatible" (i.e., able to run on the TPU rather than on the host machine, and perhaps on a GPU as well). That might be why the one-hot solution was implemented, since TPUs are very good at matrix operations, but we're not quite sure yet.
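
For intuition (just a sketch, not the actual implementation), the one-hot counting can be phrased as a single dense matrix product, which is the kind of operation TPUs excel at:

    import tensorflow as tf

    clamped_indices = tf.constant([0, 0, 1, 29])  # toy values, N = 4
    one_hot = tf.one_hot(clamped_indices, depth=30, dtype=tf.float64)  # [N, 30]
    # Summing the rows is a [1, N] x [N, 30] matmul in disguise.
    counts = tf.matmul(tf.ones([1, 4], tf.float64), one_hot)  # -> [1, 30]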

Additionally, as an FYI regarding the decision to use floats instead of integers, @nfelt mentioned that it was likely because "the output format of the histogram tensor is defined to be float64 anyway [...] and the 'wire' format is float64 because we allow the bucket boundaries to be floats and those are all serialized into a single tensor, so float64 is the common type".

So that explains why floats were used, and what @yatbear mentioned explains why they are 64-bit floats specifically; but, as you stated, there's still the question of why the one-hot tensor approach is used.

I'll let you know if we find out more about whether that operation (or a different, more memory-efficient one) can be run on a TPU.
