Fix memory leaks in xla::BufferDonationTest #59033

Conversation

rbaranchuk-capgemini
Contributor

Leaks were detected by AddressSanitizer

@google-ml-butler google-ml-butler bot added the size:S CL Change Size: Small label Dec 28, 2022
@google-ml-butler google-ml-butler bot requested a review from r4nt December 28, 2022 15:55
@google-ml-butler google-ml-butler bot added the awaiting review Pull request awaiting review label Dec 28, 2022
@gbaned gbaned added the comp:xla XLA label Dec 29, 2022
@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation Dec 29, 2022
@gbaned gbaned requested a review from cheshire December 29, 2022 04:10
mihaimaruseac
mihaimaruseac previously approved these changes Dec 31, 2022
PR Queue automation moved this from Assigned Reviewer to Approved by Reviewer Dec 31, 2022
@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Dec 31, 2022
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Dec 31, 2022
@cheshire
Member

cheshire commented Jan 1, 2023 via email


@cheshire cheshire left a comment

(please wait for review from code owners, sorry for the first approval)

PR Queue automation moved this from Approved by Reviewer to Reviewer Requested Changes Jan 1, 2023
@mihaimaruseac mihaimaruseac dismissed their stale review January 2, 2023 15:17

Not my place to review this, should have steered clear of XLA. Sorry

@cheshire
Member

cheshire commented Jan 8, 2023

Could you post a leak description and a stacktrace? I'm having a somewhat hard time connecting the code changes to a leak, and we do run tests (at least internally) under asan - perhaps we have missed something?

@rbaranchuk-capgemini
Contributor Author

Could you post a leak description and a stacktrace? I'm having a somewhat hard time connecting the code changes to a leak, and we do run tests (at least internally) under asan - perhaps we have missed something?

Sure, the summary from asan is

SUMMARY: AddressSanitizer: 40 byte(s) leaked in 7 allocation(s).

That comes from 3 tests; the first one is BufferDonationTest_SimpleWhileTupleTestCopyProtection_Test.

There are 3 buffers in the ScopedShapedBuffer generated by AllocateScopedShapedBuffer in the RunAndCheck method, of sizes 16, 4 and 4 bytes, and they are all leaked.

The 16-byte buffer is not aliased; the two 4-byte buffers are aliased. Donation is not specified for this test.

In the end, inside the ExecutionOutput object, the 16-byte buffer is marked as aliased (in the aliased_indices_ member), so when the object goes out of scope it is not deallocated by the destructor. I've added a call to the Commit method so that aliased_indices_ is cleared and these buffers are deallocated. The same applies to another test (which I'll describe in a separate comment).
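
A minimal sketch of what that looks like in the test, assuming the result is kept in an ExecutionOutput local (the variable names and surrounding lines are placeholders, not the exact PR diff):

// Hypothetical excerpt from BufferDonationTest::RunAndCheck; only the
// Commit() call reflects the change described above.
ExecutionOutput output = std::move(output_status).value();
// ... existing checks against the expected literal ...
// Indices recorded in aliased_indices_ are skipped by the destructor, so
// without this call the corresponding device memory is never freed.
output.Commit();  // clears aliased_indices_; destruction now frees all buffers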

For the 4-byte buffers, which are aliased but not donated, copies are made in the CpuExecutable::CreateResultShapedBuffer method and the references to the original buffers are lost, so they are never deallocated. I'm adding them to ownership_buffers so they are deallocated when this variable goes out of scope (please see the code comment "1. Aliasing is specified but the buffer is not donated").
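
As a simplified, self-contained illustration of that ownership pattern (a toy model, not the actual XLA code; ownership_buffers is the only name taken from the change itself):

#include <memory>
#include <vector>

// A raw `new int` stands in for a device allocation that is only freed if
// some owner eventually deletes it.
void BuildResultSketch() {
  std::vector<std::unique_ptr<int>> ownership_buffers;  // freed at scope exit

  int* original = new int(0);                    // aliased, but not donated
  auto copy = std::make_unique<int>(*original);  // the result gets a fresh copy

  // Without the next line nothing owns `original` any more once the copy has
  // replaced it in the result -- the analogue of the leaked 4-byte buffers.
  ownership_buffers.emplace_back(original);

  // `copy` models the buffer placed into the result; its owner frees it.
}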

Stacktrace for this case:

Direct leak of 24 byte(s) in 3 object(s) allocated from:
#0 0x5615d73cbce7 in __interceptor_posix_memalign (/tf/output_user_root/fbac33eb30dbfb6b11b15a7ff5ac830d/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/compiler/xla/tests/buffer_donation_test_cpu+0xbcce7) (BuildId: 93c31d7b9d14041de66592cfacf1cd32)
#1 0x7f2fa44a5c22 in tsl::port::AlignedMalloc(unsigned long, int) /proc/self/cwd/tensorflow/tsl/platform/default/port.cc:404:13
#2 0x7f2fa49ddc14 in stream_executor::host::HostExecutor::Allocate(unsigned long, long) /proc/self/cwd/tensorflow/compiler/xla/stream_executor/host/host_gpu_executor.cc:77:7
#3 0x7f2fa495b2d5 in stream_executor::StreamExecutor::Allocate(unsigned long, long) /proc/self/cwd/tensorflow/compiler/xla/stream_executor/stream_executor_pimpl.cc:536:43
#4 0x7f2fa4962b7f in stream_executor::DeviceMemory stream_executor::StreamExecutor::AllocateArray(unsigned long, long) /proc/self/cwd/./tensorflow/compiler/xla/stream_executor/stream_executor_pimpl.h:858:26
#5 0x7f2fa4962b7f in stream_executor::StreamExecutorMemoryAllocator::Allocate(int, unsigned long, bool, long) /proc/self/cwd/tensorflow/compiler/xla/stream_executor/stream_executor_pimpl.cc:947:17
#6 0x7f2fa81fb1d8 in xla::TransferManager::AllocateScopedShapedBuffer(xla::Shape const&, stream_executor::DeviceMemoryAllocator*, int, std::function<xla::Shape (xla::Shape const&)>) /proc/self/cwd/tensorflow/compiler/xla/service/transfer_manager.cc:341:5
#7 0x5615d7408d9d in xla::(anonymous namespace)::BufferDonationTest::RunAndCheck(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, absl::lts_20220623::Span<xla::Literal const>, absl::lts_20220623::Span, absl::lts_20220623::Span, xla::Literal const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>) /proc/self/cwd/tensorflow/compiler/xla/tests/buffer_donation_test.cc:102:7
#8 0x5615d741eb69 in xla::(anonymous namespace)::BufferDonationTest_SimpleWhileTupleTestCopyProtection_Test::TestBody() /proc/self/cwd/tensorflow/compiler/xla/tests/buffer_donation_test.cc:282:3

@rbaranchuk-capgemini
Contributor Author

Second test is BufferDonationTest_TestMustAliasNotDonated_Test

There are 2 buffers, 4 bytes each, and both are leaked. For this test a failure is expected, so the buffers are not deallocated when the ExecutionOutput object goes out of scope. I'm adding them to ownership_buffers so they are deallocated when this variable goes out of scope (that's the code comment "2. Failure is expected, so ExecutionOutput won't be committed").

Stacktrace:

Direct leak of 8 byte(s) in 2 object(s) allocated from:
#0 0x5615d73cbce7 in __interceptor_posix_memalign (/tf/output_user_root/fbac33eb30dbfb6b11b15a7ff5ac830d/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/compiler/xla/tests/buffer_donation_test_cpu+0xbcce7) (BuildId: 93c31d7b9d14041de66592cfacf1cd32)
#1 0x7f2fa44a5c22 in tsl::port::AlignedMalloc(unsigned long, int) /proc/self/cwd/tensorflow/tsl/platform/default/port.cc:404:13
#2 0x7f2fa49ddc14 in stream_executor::host::HostExecutor::Allocate(unsigned long, long) /proc/self/cwd/tensorflow/compiler/xla/stream_executor/host/host_gpu_executor.cc:77:7
#3 0x7f2fa495b2d5 in stream_executor::StreamExecutor::Allocate(unsigned long, long) /proc/self/cwd/tensorflow/compiler/xla/stream_executor/stream_executor_pimpl.cc:536:43
#4 0x7f2fa4962b7f in stream_executor::DeviceMemory stream_executor::StreamExecutor::AllocateArray(unsigned long, long) /proc/self/cwd/./tensorflow/compiler/xla/stream_executor/stream_executor_pimpl.h:858:26
#5 0x7f2fa4962b7f in stream_executor::StreamExecutorMemoryAllocator::Allocate(int, unsigned long, bool, long) /proc/self/cwd/tensorflow/compiler/xla/stream_executor/stream_executor_pimpl.cc:947:17
#6 0x7f2fa81fb1d8 in xla::TransferManager::AllocateScopedShapedBuffer(xla::Shape const&, stream_executor::DeviceMemoryAllocator*, int, std::function<xla::Shape (xla::Shape const&)>) /proc/self/cwd/tensorflow/compiler/xla/service/transfer_manager.cc:341:5
#7 0x5615d7408d9d in xla::(anonymous namespace)::BufferDonationTest::RunAndCheck(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, absl::lts_20220623::Span<xla::Literal const>, absl::lts_20220623::Span, absl::lts_20220623::Span, xla::Literal const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>) /proc/self/cwd/tensorflow/compiler/xla/tests/buffer_donation_test.cc:102:7
#8 0x5615d74235f8 in xla::(anonymous namespace)::BufferDonationTest_TestMustAliasNotDonated_Test::TestBody() /proc/self/cwd/tensorflow/compiler/xla/tests/buffer_donation_test.cc:342:3

@rbaranchuk-capgemini
Contributor Author

And the third one is BufferDonationTest_SimpleWhileTupleTest_Test

There are 3 buffers of 16, 4 and 4 bytes. The 16-byte buffer is not aliased; the 4-byte buffers are aliased. Donation is specified for this test.

Originally only the 4-byte buffers are leaked. It's the same as in the first test - they're in aliased_indices_, which is cleared by the Commit method - but then there's a new error: the 16-byte buffer is double-freed. The reason is that this buffer eventually ends up in two separate places inside the ExecutionOutput object: in the result_ member and in the to_be_released_ member. To me this looks wrong, and my assumption is that donation is only applicable to aliased buffers, i.e. one should not donate a buffer that is not aliased. That's why I've changed the condition to if (donate_argument && aliased) instead of just if (donate_argument).
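
Sketched out, the change amounts to the following (a paraphrase of the description above, not the actual lines from the executable code; donate_argument and aliased are the names used in this comment):

// Before: any donated argument buffer was handed to to_be_released_, even if
// it was never aliased into the result, so the same allocation ended up owned
// both by result_ and by to_be_released_ and was freed twice.
// After: donation only takes effect for buffers that are actually aliased.
if (donate_argument && aliased) {
  // transfer ownership of the argument buffer to to_be_released_
}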

Stacktrace:

Direct leak of 8 byte(s) in 2 object(s) allocated from:
#0 0x5615d73cbce7 in __interceptor_posix_memalign (/tf/output_user_root/fbac33eb30dbfb6b11b15a7ff5ac830d/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/compiler/xla/tests/buffer_donation_test_cpu+0xbcce7) (BuildId: 93c31d7b9d14041de66592cfacf1cd32)
#1 0x7f2fa44a5c22 in tsl::port::AlignedMalloc(unsigned long, int) /proc/self/cwd/tensorflow/tsl/platform/default/port.cc:404:13
#2 0x7f2fa49ddc14 in stream_executor::host::HostExecutor::Allocate(unsigned long, long) /proc/self/cwd/tensorflow/compiler/xla/stream_executor/host/host_gpu_executor.cc:77:7
#3 0x7f2fa495b2d5 in stream_executor::StreamExecutor::Allocate(unsigned long, long) /proc/self/cwd/tensorflow/compiler/xla/stream_executor/stream_executor_pimpl.cc:536:43
#4 0x7f2fa4962b7f in stream_executor::DeviceMemory stream_executor::StreamExecutor::AllocateArray(unsigned long, long) /proc/self/cwd/./tensorflow/compiler/xla/stream_executor/stream_executor_pimpl.h:858:26
#5 0x7f2fa4962b7f in stream_executor::StreamExecutorMemoryAllocator::Allocate(int, unsigned long, bool, long) /proc/self/cwd/tensorflow/compiler/xla/stream_executor/stream_executor_pimpl.cc:947:17
#6 0x7f2fa81fb1d8 in xla::TransferManager::AllocateScopedShapedBuffer(xla::Shape const&, stream_executor::DeviceMemoryAllocator*, int, std::function<xla::Shape (xla::Shape const&)>) /proc/self/cwd/tensorflow/compiler/xla/service/transfer_manager.cc:341:5
#7 0x5615d7408d9d in xla::(anonymous namespace)::BufferDonationTest::RunAndCheck(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, absl::lts_20220623::Span<xla::Literal const>, absl::lts_20220623::Span, absl::lts_20220623::Span, xla::Literal const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>) /proc/self/cwd/tensorflow/compiler/xla/tests/buffer_donation_test.cc:102:7
#8 0x5615d7401619 in xla::(anonymous namespace)::BufferDonationTest_SimpleWhileTupleTest_Test::TestBody() /proc/self/cwd/tensorflow/compiler/xla/tests/buffer_donation_test.cc:265:3

@rbaranchuk-capgemini
Contributor Author

And the last change, ShapeTree<se::DeviceMemoryBase> &input_buffers = shaped_buffer.buffers();, is just a small optimization; it is not related to the leaks. The buffers method returns a reference, so there is no real need to create a separate ShapeTree object there.
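
For clarity, the before/after of that line (the "before" form is inferred from the description, not quoted from the old code):

// Before: declares a ShapeTree by value, copying the whole tree.
ShapeTree<se::DeviceMemoryBase> input_buffers = shaped_buffer.buffers();
// After: binds a reference to the tree the ShapedBuffer already owns.
ShapeTree<se::DeviceMemoryBase>& input_buffers = shaped_buffer.buffers();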

@cheshire
Member

@reedwm Could you take a look?

@cheshire
Member

And the last change, ShapeTree<se::DeviceMemoryBase> &input_buffers = shaped_buffer.buffers();, is just a small optimization; it is not related to the leaks

Could we remove it from this PR then? Overall the PR looks way too large for something addressing a leak inside a test.

@cheshire cheshire requested a review from reedwm January 11, 2023 14:58
@google-ml-butler google-ml-butler bot removed the ready to pull PR ready for merge process label Jan 11, 2023
@rbaranchuk-capgemini
Contributor Author

And the last change, ShapeTree<se::DeviceMemoryBase> &input_buffers = shaped_buffer.buffers();, is just a small optimization; it is not related to the leaks

Could we remove it from this PR then? Overall the PR looks way too large for something addressing a leak inside a test.

Sure, this change has been reverted.

@gbaned gbaned requested a review from cheshire January 12, 2023 18:51
@gbaned gbaned removed the request for review from cheshire March 21, 2023 20:18
@gbaned gbaned requested a review from cheshire March 21, 2023 20:18
@gbaned
Contributor

gbaned commented May 5, 2023

Hi @cheshire Can you please review this PR? Thank you!

2 similar comments
@gbaned
Contributor

gbaned commented Jun 23, 2023

Hi @cheshire Can you please review this PR? Thank you!

@gbaned
Contributor

gbaned commented Aug 25, 2023

Hi @cheshire Can you please review this PR? Thank you!

@gbaned
Contributor

gbaned commented Sep 25, 2023

Hi @rbaranchuk-capgemini Can you please resolve conflicts? Thank you!

@gbaned gbaned added stat:awaiting response Status - Awaiting response from author and removed awaiting review Pull request awaiting review labels Sep 25, 2023
@github-actions

This PR is stale because it has been open for 14 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Oct 10, 2023
@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Oct 10, 2023
@gbaned
Contributor

gbaned commented Oct 11, 2023

Hi @rbaranchuk-capgemini Any update on this PR? Please. Thank you!

@google-ml-butler google-ml-butler bot removed the stale This label marks the issue/pr stale - to be closed automatically if no activity label Oct 11, 2023
@rbaranchuk-capgemini
Contributor Author

Hi, I'm afraid that at this point I don't even remember what this PR was about when I opened it. Most likely the code has changed since then to the extent that this PR could hardly have any relevance now. So I'm just going to close it.

PR Queue automation moved this from Reviewer Requested Changes to Closed/Rejected Oct 11, 2023
@rbaranchuk-capgemini rbaranchuk-capgemini deleted the xla-buffer-test branch October 11, 2023 06:54