Fix memory leaks in xla::BufferDonationTest #59033

Conversation

rbaranchuk-capgemini
Contributor

Leaks were detected by AddressSanitizer

@google-ml-butler google-ml-butler bot added the size:S CL Change Size: Small label Dec 28, 2022
@google-ml-butler google-ml-butler bot requested a review from r4nt December 28, 2022 15:55
@google-ml-butler google-ml-butler bot added the awaiting review Pull request awaiting review label Dec 28, 2022
@gbaned gbaned added the comp:xla XLA label Dec 29, 2022
@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation Dec 29, 2022
@gbaned gbaned requested a review from cheshire December 29, 2022 04:10
mihaimaruseac
mihaimaruseac previously approved these changes Dec 31, 2022
PR Queue automation moved this from Assigned Reviewer to Approved by Reviewer Dec 31, 2022
@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Dec 31, 2022
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Dec 31, 2022
@cheshire
Member

cheshire commented Jan 1, 2023 via email


@cheshire cheshire left a comment

(please wait for review from code owners, sorry for the first approval)

PR Queue automation moved this from Approved by Reviewer to Reviewer Requested Changes Jan 1, 2023
@mihaimaruseac mihaimaruseac dismissed their stale review January 2, 2023 15:17

Not my place to review this, should have steered clear of XLA. Sorry

@cheshire
Member

cheshire commented Jan 8, 2023

Could you post a leak description and a stacktrace? I'm having a somewhat hard time connecting the code changes to a leak, and we do run tests (at least internally) under asan - perhaps we have missed something?

@rbaranchuk-capgemini
Contributor Author

Could you post a leak description and a stacktrace? I'm having a somewhat hard time connecting the code changes to a leak, and we do run tests (at least internally) under asan - perhaps we have missed something?

Sure, the summary from asan is

SUMMARY: AddressSanitizer: 40 byte(s) leaked in 7 allocation(s).

That comes from 3 tests; the first one is BufferDonationTest_SimpleWhileTupleTestCopyProtection_Test.

There are 3 buffers in the ScopedShapedBuffer generated by AllocateScopedShapedBuffer in the RunAndCheck method, of sizes 16, 4 and 4 bytes, and they are all leaked.

The 16-byte buffer is not aliased; the two 4-byte buffers are aliased. Donation is not specified for this test.

In the end, inside the ExecutionOutput object, the 16-byte buffer is marked as aliased (in the aliased_indices_ member), so when the object goes out of scope it is not deallocated by the destructor. I've added a call to the Commit method so that aliased_indices_ is cleared and these buffers are deallocated. The same applies to another test (which I'll describe in a separate comment).
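
A minimal sketch of what that looks like in the test, assuming the result is kept in an ExecutionOutput local (the variable names and surrounding lines are placeholders, not the exact PR diff):

// Hypothetical excerpt from BufferDonationTest::RunAndCheck; only the
// Commit() call reflects the change described above.
ExecutionOutput output = std::move(output_status).value();
// ... existing checks against the expected literal ...
// Indices recorded in aliased_indices_ are skipped by the destructor, so
// without this call the corresponding device memory is never freed.
output.Commit();  // clears aliased_indices_; destruction now frees all buffers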

For the 4-byte buffers, which are aliased but not donated, copies are made in the CpuExecutable::CreateResultShapedBuffer method and the references to the original buffers are lost, so they are never deallocated. I'm adding them to ownership_buffers so they are deallocated when this variable goes out of scope (please see the code comment "1. Aliasing is specified but the buffer is not donated").
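
As a simplified, self-contained illustration of that ownership pattern (a toy model, not the actual XLA code; ownership_buffers is the only name taken from the change itself):

#include <memory>
#include <vector>

// A raw `new int` stands in for a device allocation that is only freed if
// some owner eventually deletes it.
void BuildResultSketch() {
  std::vector<std::unique_ptr<int>> ownership_buffers;  // freed at scope exit

  int* original = new int(0);                    // aliased, but not donated
  auto copy = std::make_unique<int>(*original);  // the result gets a fresh copy

  // Without the next line nothing owns `original` any more once the copy has
  // replaced it in the result -- the analogue of the leaked 4-byte buffers.
  ownership_buffers.emplace_back(original);

  // `copy` models the buffer placed into the result; its owner frees it.
}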

Stacktrace for this case:

Direct leak of 24 byte(s) in 3 object(s) allocated from:
#0 0x5615d73cbce7 in __interceptor_posix_memalign (/tf/output_user_root/fbac33eb30dbfb6b11b15a7ff5ac830d/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/compiler/xla/tests/buffer_donation_test_cpu+0xbcce7) (BuildId: 93c31d7b9d14041de66592cfacf1cd32)
#1 0x7f2fa44a5c22 in tsl::port::AlignedMalloc(unsigned long, int) /proc/self/cwd/tensorflow/tsl/platform/default/port.cc:404:13
#2 0x7f2fa49ddc14 in stream_executor::host::HostExecutor::Allocate(unsigned long, long) /proc/self/cwd/tensorflow/compiler/xla/stream_executor/host/host_gpu_executor.cc:77:7
#3 0x7f2fa495b2d5 in stream_executor::StreamExecutor::Allocate(unsigned long, long) /proc/self/cwd/tensorflow/compiler/xla/stream_executor/stream_executor_pimpl.cc:536:43
#4 0x7f2fa4962b7f in stream_executor::DeviceMemory stream_executor::StreamExecutor::AllocateArray(unsigned long, long) /proc/self/cwd/./tensorflow/compiler/xla/stream_executor/stream_executor_pimpl.h:858:26
#5 0x7f2fa4962b7f in stream_executor::StreamExecutorMemoryAllocator::Allocate(int, unsigned long, bool, long) /proc/self/cwd/tensorflow/compiler/xla/stream_executor/stream_executor_pimpl.cc:947:17
#6 0x7f2fa81fb1d8 in xla::TransferManager::AllocateScopedShapedBuffer(xla::Shape const&, stream_executor::DeviceMemoryAllocator*, int, std::function<xla::Shape (xla::Shape const&)>) /proc/self/cwd/tensorflow/compiler/xla/service/transfer_manager.cc:341:5
#7 0x5615d7408d9d in xla::(anonymous namespace)::BufferDonationTest::RunAndCheck(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, absl::lts_20220623::Span<xla::Literal const>, absl::lts_20220623::Span, absl::lts_20220623::Span, xla::Literal const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>) /proc/self/cwd/tensorflow/compiler/xla/tests/buffer_donation_test.cc:102:7
#8 0x5615d741eb69 in xla::(anonymous namespace)::BufferDonationTest_SimpleWhileTupleTestCopyProtection_Test::TestBody() /proc/self/cwd/tensorflow/compiler/xla/tests/buffer_donation_test.cc:282:3

@rbaranchuk-capgemini
Contributor Author

Second test is BufferDonationTest_TestMustAliasNotDonated_Test

There are 2 buffers, 4 bytes each, and both are leaked. For this test a failure is expected, so the buffers are not deallocated when the ExecutionOutput object goes out of scope. I'm adding them to ownership_buffers so they are deallocated when this variable goes out of scope (that's the code comment "2. Failure is expected, so ExecutionOutput won't be committed").

Stacktrace:

Direct leak of 8 byte(s) in 2 object(s) allocated from:
#0 0x5615d73cbce7 in __interceptor_posix_memalign (/tf/output_user_root/fbac33eb30dbfb6b11b15a7ff5ac830d/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/compiler/xla/tests/buffer_donation_test_cpu+0xbcce7) (BuildId: 93c31d7b9d14041de66592cfacf1cd32)
#1 0x7f2fa44a5c22 in tsl::port::AlignedMalloc(unsigned long, int) /proc/self/cwd/tensorflow/tsl/platform/default/port.cc:404:13
#2 0x7f2fa49ddc14 in stream_executor::host::HostExecutor::Allocate(unsigned long, long) /proc/self/cwd/tensorflow/compiler/xla/stream_executor/host/host_gpu_executor.cc:77:7
#3 0x7f2fa495b2d5 in stream_executor::StreamExecutor::Allocate(unsigned long, long) /proc/self/cwd/tensorflow/compiler/xla/stream_executor/stream_executor_pimpl.cc:536:43
#4 0x7f2fa4962b7f in stream_executor::DeviceMemory stream_executor::StreamExecutor::AllocateArray(unsigned long, long) /proc/self/cwd/./tensorflow/compiler/xla/stream_executor/stream_executor_pimpl.h:858:26
#5 0x7f2fa4962b7f in stream_executor::StreamExecutorMemoryAllocator::Allocate(int, unsigned long, bool, long) /proc/self/cwd/tensorflow/compiler/xla/stream_executor/stream_executor_pimpl.cc:947:17
#6 0x7f2fa81fb1d8 in xla::TransferManager::AllocateScopedShapedBuffer(xla::Shape const&, stream_executor::DeviceMemoryAllocator*, int, std::function<xla::Shape (xla::Shape const&)>) /proc/self/cwd/tensorflow/compiler/xla/service/transfer_manager.cc:341:5
#7 0x5615d7408d9d in xla::(anonymous namespace)::BufferDonationTest::RunAndCheck(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, absl::lts_20220623::Span<xla::Literal const>, absl::lts_20220623::Span, absl::lts_20220623::Span, xla::Literal const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>) /proc/self/cwd/tensorflow/compiler/xla/tests/buffer_donation_test.cc:102:7
#8 0x5615d74235f8 in xla::(anonymous namespace)::BufferDonationTest_TestMustAliasNotDonated_Test::TestBody() /proc/self/cwd/tensorflow/compiler/xla/tests/buffer_donation_test.cc:342:3

@rbaranchuk-capgemini
Contributor Author

And the third one is BufferDonationTest_SimpleWhileTupleTest_Test

There are 3 buffers of 16, 4 and 4 bytes. The 16-byte buffer is not aliased; the 4-byte buffers are aliased. Donation is specified for this test.

Originally only the 4-byte buffers are leaked. It's the same as in the first test - they're in aliased_indices_, which is cleared by the Commit method - but then there's a new error: the 16-byte buffer is double-freed. The reason is that this buffer eventually ends up in two separate places inside the ExecutionOutput object: in the result_ member and in the to_be_released_ member. To me this looks wrong, and my assumption is that donation is only applicable to aliased buffers, i.e. one should not donate a buffer that is not aliased. That's why I've changed the condition to if (donate_argument && aliased) instead of just if (donate_argument).
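
Sketched out, the change amounts to the following (a paraphrase of the description above, not the actual lines from the executable code; donate_argument and aliased are the names used in this comment):

// Before: any donated argument buffer was handed to to_be_released_, even if
// it was never aliased into the result, so the same allocation ended up owned
// both by result_ and by to_be_released_ and was freed twice.
// After: donation only takes effect for buffers that are actually aliased.
if (donate_argument && aliased) {
  // transfer ownership of the argument buffer to to_be_released_
}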

Stacktrace:

Direct leak of 8 byte(s) in 2 object(s) allocated from:
#0 0x5615d73cbce7 in __interceptor_posix_memalign (/tf/output_user_root/fbac33eb30dbfb6b11b15a7ff5ac830d/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/compiler/xla/tests/buffer_donation_test_cpu+0xbcce7) (BuildId: 93c31d7b9d14041de66592cfacf1cd32)
#1 0x7f2fa44a5c22 in tsl::port::AlignedMalloc(unsigned long, int) /proc/self/cwd/tensorflow/tsl/platform/default/port.cc:404:13
#2 0x7f2fa49ddc14 in stream_executor::host::HostExecutor::Allocate(unsigned long, long) /proc/self/cwd/tensorflow/compiler/xla/stream_executor/host/host_gpu_executor.cc:77:7
#3 0x7f2fa495b2d5 in stream_executor::StreamExecutor::Allocate(unsigned long, long) /proc/self/cwd/tensorflow/compiler/xla/stream_executor/stream_executor_pimpl.cc:536:43
#4 0x7f2fa4962b7f in stream_executor::DeviceMemory stream_executor::StreamExecutor::AllocateArray(unsigned long, long) /proc/self/cwd/./tensorflow/compiler/xla/stream_executor/stream_executor_pimpl.h:858:26
#5 0x7f2fa4962b7f in stream_executor::StreamExecutorMemoryAllocator::Allocate(int, unsigned long, bool, long) /proc/self/cwd/tensorflow/compiler/xla/stream_executor/stream_executor_pimpl.cc:947:17
#6 0x7f2fa81fb1d8 in xla::TransferManager::AllocateScopedShapedBuffer(xla::Shape const&, stream_executor::DeviceMemoryAllocator*, int, std::function<xla::Shape (xla::Shape const&)>) /proc/self/cwd/tensorflow/compiler/xla/service/transfer_manager.cc:341:5
#7 0x5615d7408d9d in xla::(anonymous namespace)::BufferDonationTest::RunAndCheck(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, absl::lts_20220623::Span<xla::Literal const>, absl::lts_20220623::Span, absl::lts_20220623::Span, xla::Literal const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>) /proc/self/cwd/tensorflow/compiler/xla/tests/buffer_donation_test.cc:102:7
#8 0x5615d7401619 in xla::(anonymous namespace)::BufferDonationTest_SimpleWhileTupleTest_Test::TestBody() /proc/self/cwd/tensorflow/compiler/xla/tests/buffer_donation_test.cc:265:3

@rbaranchuk-capgemini
Contributor Author

And the last change, ShapeTree<se::DeviceMemoryBase> &input_buffers = shaped_buffer.buffers();, is just a small optimization; it is not related to the leaks. The buffers method returns a reference, so there is no real need to create a separate ShapeTree object there.
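
For clarity, the before/after of that line (the "before" form is inferred from the description, not quoted from the old code):

// Before: declares a ShapeTree by value, copying the whole tree.
ShapeTree<se::DeviceMemoryBase> input_buffers = shaped_buffer.buffers();
// After: binds a reference to the tree the ShapedBuffer already owns.
ShapeTree<se::DeviceMemoryBase>& input_buffers = shaped_buffer.buffers();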

@cheshire
Member

@reedwm Could you take a look?

@cheshire
Member

And the last change, ShapeTree<se::DeviceMemoryBase> &input_buffers = shaped_buffer.buffers();, is just a small optimization; it is not related to the leaks

Could we remove it from this PR then? Overall the PR looks way too large for something addressing a leak inside a test.

@cheshire cheshire requested a review from reedwm January 11, 2023 14:58
@google-ml-butler google-ml-butler bot removed the ready to pull PR ready for merge process label Jan 11, 2023
@rbaranchuk-capgemini
Contributor Author

And the last change, ShapeTree<se::DeviceMemoryBase> &input_buffers = shaped_buffer.buffers();, is just a small optimization; it is not related to the leaks

Could we remove it from this PR then? Overall the PR looks way too large for something addressing a leak inside a test.

Sure, this change has been reverted.

@gbaned gbaned requested a review from cheshire January 12, 2023 18:51
@gbaned gbaned removed the request for review from cheshire March 21, 2023 20:18
@gbaned gbaned requested a review from cheshire March 21, 2023 20:18
@gbaned
Contributor

gbaned commented May 5, 2023

Hi @cheshire Can you please review this PR? Thank you!

2 similar comments
@gbaned
Contributor

gbaned commented Jun 23, 2023

Hi @cheshire Can you please review this PR? Thank you!

@gbaned
Contributor

gbaned commented Aug 25, 2023

Hi @cheshire Can you please review this PR? Thank you!

@gbaned
Contributor

gbaned commented Sep 25, 2023

Hi @rbaranchuk-capgemini Can you please resolve conflicts? Thank you!

@gbaned gbaned added stat:awaiting response Status - Awaiting response from author and removed awaiting review Pull request awaiting review labels Sep 25, 2023
@github-actions

This PR is stale because it has been open for 14 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Oct 10, 2023
@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Oct 10, 2023
@gbaned
Contributor

gbaned commented Oct 11, 2023

Hi @rbaranchuk-capgemini Any update on this PR? Please. Thank you!

@google-ml-butler google-ml-butler bot removed the stale This label marks the issue/pr stale - to be closed automatically if no activity label Oct 11, 2023
@rbaranchuk-capgemini
Contributor Author

Hi, I'm afraid that at this point I don't even remember what this PR was about when I opened it. Most likely the code has changed since then to the extent that this PR could hardly have any relevance now. So I'm just going to close it.

PR Queue automation moved this from Reviewer Requested Changes to Closed/Rejected Oct 11, 2023
@rbaranchuk-capgemini rbaranchuk-capgemini deleted the xla-buffer-test branch October 11, 2023 06:54