Segmentation fault when emitting trace during a snapshot operation (7.3.27) #11308

Closed
gm42 opened this issue Apr 16, 2024 · 10 comments

Comments

@gm42
Contributor

gm42 commented Apr 16, 2024

When using the snapshot command, I can reliably reproduce a crash with FoundationDB 7.3.27.

This crash cannot be reproduced with 7.1.15.

The corresponding server-side trace:

Backtrace: addr2line -e fdbserver.debug -p -C -f -i 0x545d6fd 0x545d9c3 0x5457bc4 0x54253fb 0x7fd58eeb7630 0x26a72cc 0x1cb54a8 0x1cb53cd 0x26abdb8 0x26abcb9 0x1d10de8 0x1d10444 0x1bc2278 0x1bc2197 0x522f43d 0x522ed13 0x53e3998 0x32202f5 0x7fd58eafc555
DateTime: 2024-04-16T10:30:29Z
ErrorKind: BugDetected
FDBSeverity: 40
ID: 0000000000000000
LogGroup: mycluster
Machine: 10.20.234.245:4501
Name: Segmentation fault
Roles: CS,DD,MS,RK
Signal: 11
ThreadID: 3611621951435064848
Time: 1713263429.797395
Trace: addr2line -e fdbserver.debug -p -C -f -i 0x7fd58eeb7630 0x26a72cc 0x1cb54a8 0x1cb53cd 0x26abdb8 0x26abcb9 0x1d10de8 0x1d10444 0x1bc2278 0x1bc2197 0x522f43d 0x522ed13 0x53e3998 0x32202f5 0x7fd58eafc555
Type: Crash

Symbolizing leads to this output (I had to Ctrl+C because it was hanging):

$ addr2line -e fdbserver.debug.x86_64 -p -C -f -i 0x545d6fd 0x545d9c3 0x5457bc4 0x54253fb 0x7fd58eeb7630 0x26a72cc 0x1cb54a8 0x1cb53cd 0x26abdb8 0x26abcb9 0x1d10de8 0x1d10444 0x1bc2278 0x1bc2197 0x522f43d 0x522ed13 0x53e3998 0x32202f5 0x7fd58eafc555
BaseTraceEvent::backtrace(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Trace.cpp:?
std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::__is_long[abi:v15006]() const at /usr/local/bin/../include/c++/v1/string:1499
 (inlined by) std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::~basic_string() at /usr/local/bin/../include/c++/v1/string:2333
 (inlined by) BaseTraceEvent::log() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Trace.cpp:1326
BaseTraceEvent::~BaseTraceEvent() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Trace.cpp:1369
crashHandler(int) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Platform.actor.cpp:3684
?? ??:0
^C
@jzhou77
Contributor

jzhou77 commented Apr 17, 2024

I am not familiar with this feature. However, the log shows "Roles: CS,DD,MS,RK", which are all stateless roles and shouldn't create disk snapshot data. "CS" is the new ConsistencyScan role, so the snapshot code may be erroneously failing to exclude the "CS" role.

@gm42
Contributor Author

gm42 commented Apr 19, 2024

Thanks @jzhou77, it was indeed a stateless process affected. Is there a way to get a better addr2line output? I am assuming that I am doing something wrong because it doesn't symbolize everything and hangs.

Also: is this CS role something I can disable, to verify the hypothesis?
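
A possible explanation for the partial output, offered as an assumption and not confirmed anywhere in this thread: addresses such as 0x7fd58eeb7630 and 0x7fd58eafc555 look like shared-library mappings rather than code inside fdbserver, so addr2line run against fdbserver.debug can only print `?? ??:0` for them. The sketch below separates the two groups so that only addresses likely to belong to the executable are symbolized. The cutoff constant and the names `SplitAddresses` and `splitAddresses` are hypothetical and chosen for illustration.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Assumed cutoff, not taken from the thread: on x86-64 Linux, shared objects
// are usually mapped near 0x7f..., far above a non-PIE executable's code.
constexpr std::uint64_t kAssumedSharedObjectFloor = 0x7f0000000000ULL;

struct SplitAddresses {
    std::vector<std::string> inExecutable; // candidates for addr2line -e fdbserver.debug
    std::vector<std::string> elsewhere;    // probably frames from shared libraries
};

// Partition hex address strings (with or without a "0x" prefix) by the cutoff.
SplitAddresses splitAddresses(const std::vector<std::string>& addresses) {
    SplitAddresses result;
    for (const std::string& text : addresses) {
        // Base 16 parsing accepts an optional "0x" prefix.
        const std::uint64_t value = std::stoull(text, nullptr, 16);
        if (value >= kAssumedSharedObjectFloor) {
            result.elsewhere.push_back(text);
        } else {
            result.inExecutable.push_back(text);
        }
    }
    return result;
}
```

Passing the `inExecutable` addresses to addr2line one at a time would also show which single address, if any, is the one that stalls. The cause of the hang itself is not established here.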

@gm42
Contributor Author

gm42 commented Apr 19, 2024

Another idea: perhaps the snapshot command could take a configurable "max concurrency" parameter, so that we avoid crashing all storage processes at once if an issue like this were to occur again (due to a bug or OOM)?
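
To make the "max concurrency" idea concrete, here is a minimal sketch of how a list of workers could be split into batches that are processed one batch at a time. This is not how the FoundationDB snapshot command works, and no such parameter is known to exist; `planSnapshotBatches` and its arguments are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: group worker addresses into batches holding at most
// maxConcurrency entries, preserving the input order.
std::vector<std::vector<std::string>> planSnapshotBatches(
    const std::vector<std::string>& workers, std::size_t maxConcurrency) {
    if (maxConcurrency == 0) {
        throw std::invalid_argument("maxConcurrency must be positive");
    }
    std::vector<std::vector<std::string>> batches;
    for (std::size_t start = 0; start < workers.size(); start += maxConcurrency) {
        const std::size_t end = std::min(start + maxConcurrency, workers.size());
        batches.emplace_back(workers.begin() + static_cast<std::ptrdiff_t>(start),
                             workers.begin() + static_cast<std::ptrdiff_t>(end));
    }
    return batches;
}
```

With five workers and a limit of two, this yields batches of two, two, and one, so a fault triggered by the snapshot request would reach at most two processes before the caller could stop. Whether batching is compatible with the consistency requirements of a disk snapshot is a separate question that this sketch does not address.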

jzhou77 pushed a commit that referenced this issue Apr 19, 2024
* Cherry pick #11308 Raise visibility of gray failure actions

* format change

---------

Co-authored-by: Dan Lambright <hlambright@apple.com>
@gm42
Contributor Author

gm42 commented Apr 24, 2024

I have been looking into this using logs; I suspect that this commit 29f98f3 might be involved in the problem.


The crash seems to be in an unspecified code line while trying to create requests for each of the stateful workers.

@sfc-gh-clin do you have any insight into this?

@sfc-gh-clin
Collaborator

Yeah, I just looked at 7.3.27, and it contains the bug here: https://github.com/apple/foundationdb/blob/7.3.27/fdbserver/DataDistribution.actor.cpp#L1446
The code should use g_network instead of g_simulator. This was fixed in PR #10984.
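
The general shape of this kind of bug can be sketched as follows, assuming the simulator global is only populated when running in simulation, so that reading through it in a real process dereferences an invalid pointer. The types and the function below are hypothetical stand-ins written for illustration; they are not the FoundationDB code at the linked line.

```cpp
// Hypothetical stand-in for the always-present network object.
struct Network {
    bool simulated;
    bool isSimulated() const { return simulated; }
};

// Hypothetical stand-in for a simulator object that exists only in simulation.
struct Simulator {
    int extraDelayMs;
};

// Ask the network object whether this is a simulation before touching the
// simulator. In a real process the simulator pointer may be null, so reading
// through it unconditionally would crash.
int snapshotDelayMs(const Network* network, const Simulator* simulator) {
    if (network->isSimulated() && simulator != nullptr) {
        return simulator->extraDelayMs;
    }
    return 0; // real cluster: no simulator state to consult
}
```

The design point is that the check goes through an object that exists in both modes, and the simulation-only object is consulted only after that check succeeds.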

@gm42
Contributor Author

gm42 commented Apr 25, 2024

Thanks @sfc-gh-clin! Any plans to backport this fix to a stable release? I am going to cherry pick the commit until then.

@sfc-gh-clin
Collaborator

No problem. Do you mind doing it yourself or maybe @jzhou77 can do it?
I am not sure which release is a stable one.

@jzhou77
Contributor

jzhou77 commented Apr 26, 2024

NP. I created #11341 to fix this.

@gm42
Contributor Author

gm42 commented May 1, 2024

Just noticed the follow-up, thanks!

I am unfamiliar with the release process. Will this at some point land in a new 7.3.x tagged release, marked as pre-release? In that case I think I will stick to 7.3.27 + cherry-pick to avoid potential issues.

@jzhou77
Contributor

jzhou77 commented May 1, 2024

Yes. This fix will be included in the next 7.3 release (marked as pre-release), so I will close this issue for now.

@jzhou77 jzhou77 closed this as completed May 1, 2024