[Rust] Worker panics when running with multiple workers #1320

Closed
PaulOlteanu opened this issue Jan 31, 2024 · 19 comments

@PaulOlteanu
Contributor

Bug Report

mediasoup version: 0.15

I'm trying to run mediasoup with 2 (eventually more) workers per server (on a 2-core server, so one worker per core). After switching to 2 workers, I've been getting fairly frequent worker panics:

thread 'mediasoup-worker-cd9eec89-8672-4293-b4eb-3a857fdd0341' panicked at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/mediasoup-0.15.0/src/router/data_consumer.rs:277:21:
Wrong message from worker: NotificationRef { handler_id: Ok("\u{c}\0\0\0\u{8}\0\u{e}\0\u{7}\0\u{8}\0\u{8}\0\0\0\0\0\0\u{4}\u{c}\0\0\0\0\0\u{6}\0\u{8}\0\u{4}\0\u{6}\0\0\0\u{4}\0\0\0q\0\0\0WRTC::WebRtcServer::OnStunDataReceived() | ignoring received STUN packet with unknown remote ICE usernameFragment\0\0\0"), event: Ok(DataconsumerBufferedAmountLow), body: Err(Error { source_location: ErrorLocation { type_: "Notification", method: "body", byte_offset: 36 }, error_kind: InvalidOffset }) }
thread 'mediasoup-worker-87b809a2-a4f9-4231-89a2-b4f5fe2cd40c' panicked at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/mediasoup-0.15.0/src/router/data_consumer.rs:277:21:
Wrong message from worker: NotificationRef { handler_id: Ok("\u{c}\0\0\0\u{8}\0\u{e}\0\u{7}\0\u{8}\0\u{8}\0\0\0\0\0\0\u{4}\u{c}\0\0\0\0\0\u{6}\0\u{8}\0\u{4}\0\u{6}\0\0\0\u{4}\0\0\0q\0\0\0WRTC::WebRtcServer::OnStunDataReceived() | ignoring received STUN packet with unknown remote ICE usernameFragment\0\0\0"), event: Ok(DataconsumerBufferedAmountLow), body: Err(Error { source_location: ErrorLocation { type_: "Notification", method: "body", byte_offset: 36 }, error_kind: InvalidOffset }) }
thread 'mediasoup-worker-bb92335a-9b5b-4b48-b4bc-127543ac7041' panicked at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/mediasoup-0.15.0/src/router/webrtc_transport.rs:479:21:
Wrong message from worker: NotificationRef { handler_id: Err(Error { source_location: ErrorLocation { type_: "Notification", method: "handler_id", byte_offset: 36 }, error_kind: InvalidUtf8 { source: Utf8Error { valid_up_to: 0, error_len: Some(1) } } }), event: Ok(TransportSctpStateChange), body: Err(Error { source_location: ErrorLocation { type_: "Notification", method: "body", byte_offset: 36 }, error_kind: InvalidOffset }) }
thread 'mediasoup-worker-6e6b7487-e501-42bf-9681-7450e79a0dfb' panicked at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/mediasoup-0.15.0/src/router/webrtc_transport.rs:465:21:
Wrong message from worker: NotificationRef { handler_id: Err(Error { source_location: ErrorLocation { type_: "Notification", method: "handler_id", byte_offset: 36 }, error_kind: InvalidOffset }), event: Ok(WebrtctransportDtlsStateChange) }

and

thread 'mediasoup-worker-28a9b972-22da-4324-8e10-4284fbab7089' panicked at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/mediasoup-0.15.0/src/worker/channel.rs:256:91:
called `Result::unwrap()` on an `Err` value: Error(Char { character: 'h', index: 29 })

The last one is weird because it seems unrelated to the others.

I've never seen any of these panics before switching to multiple workers, so I'm assuming it's related to that. I also only see this in prod, so it might be difficult to reproduce.

@nazar-pc
Collaborator

@jmillan is it possible that some non-thread-local statics were introduced recently?

@jmillan
Member

jmillan commented Feb 1, 2024

Seems related to flatbuffers. I'll be checking this in a few hours.

@PaulOlteanu, I'd suggest using a single worker if possible while we troubleshoot this.

@jmillan
Member

jmillan commented Feb 1, 2024

So far I don't see any static, non-thread-local variables.

Rust definitely cannot properly read the incoming buffer because it is malformed.

In the following case the payload is a log message (LogRef), but Rust is interpreting it as a notification (NotificationRef).

EDIT: The following message contains a Notification (DataconsumerBufferedAmountLow) and also the log WebRtcServer::OnStunDataReceived() | ignoring received STUN packet with unknown remote ICE usernameFragment, which clearly shows the corruption.

thread 'mediasoup-worker-cd9eec89-8672-4293-b4eb-3a857fdd0341' panicked at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/mediasoup-0.15.0/src/router/data_consumer.rs:277:21:
Wrong message from worker: NotificationRef { handler_id: Ok("\u{c}\0\0\0\u{8}\0\u{e}\0\u{7}\0\u{8}\0\u{8}\0\0\0\0\0\0\u{4}\u{c}\0\0\0\0\0\u{6}\0\u{8}\0\u{4}\0\u{6}\0\0\0\u{4}\0\0\0q\0\0\0WRTC::WebRtcServer::OnStunDataReceived() | ignoring received STUN packet with unknown remote ICE usernameFragment\0\0\0"), event: Ok(DataconsumerBufferedAmountLow), body: Err(Error { source_location: ErrorLocation { type_: "Notification", method: "body", byte_offset: 36 }, error_kind: InvalidOffset }) }

@jmillan
Member

jmillan commented Feb 1, 2024

@PaulOlteanu, can you confirm this only happens when using multiple workers?

@PaulOlteanu
Contributor Author

Yeah, I've been using mediasoup 0.15 since January 8 and never saw this issue in a single-worker setup. As soon as I tried multiple workers I saw many of these kinds of errors within 24 hours.

I did experiment with multiple workers for a week a few months ago and I also never saw this issue (I think I was using mediasoup 0.12 at the time which was before the flatbuffer change).

@jmillan
Member

jmillan commented Feb 2, 2024

@PaulOlteanu, I assume you're back to using a single worker. Please let us know if you get the crash in this scenario. So far we've been unable to reproduce the crash with multiple workers enabled.

@GEverding

I work with @PaulOlteanu. The majority of our traffic comes from iOS and Android. Could that be a factor in reproducing this crash? What other info would help you debug it?

@nazar-pc
Collaborator

nazar-pc commented Feb 2, 2024

Ideally we'd have a reliable reproduction. If you can take one of the example apps (videoroom is a good one), modify it (if necessary) to reproduce the issue, and share the code with us, that'd be ideal.

@PaulOlteanu
Contributor Author

PaulOlteanu commented Feb 2, 2024

One thing that comes to mind is that we're running on aarch64, which I believe has weaker memory-model guarantees than x86. So if the bug is an incorrect atomic usage somewhere, it might not reproduce on x86 because of that (though I have no idea why this issue would be related to atomics).
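
(For illustration only, not mediasoup code: this is the generic kind of ordering bug that x86's stronger hardware ordering tends to hide but aarch64's weaker memory model exposes.)

#include <atomic>
#include <cstdint>

std::atomic<uint32_t> data{0};
std::atomic<bool> ready{false};

// Writer thread: with relaxed ordering there is no happens-before edge,
// so on aarch64 the reader below may observe ready == true while data is
// still 0. On x86 the hardware's TSO ordering usually hides this (though
// the compiler may still reorder the stores).
void publish()
{
	data.store(42, std::memory_order_relaxed);
	ready.store(true, std::memory_order_relaxed); // should be std::memory_order_release
}

// Reader thread: should pair the load of `ready` with std::memory_order_acquire.
uint32_t consume()
{
	while (!ready.load(std::memory_order_relaxed))
	{
	}

	return data.load(std::memory_order_relaxed); // may still read 0 on aarch64
}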

I've also only seen this issue happen in production, where we have 1000-2000 users per server at the moment, so that also makes it difficult to reproduce.

@PaulOlteanu
Contributor Author

We tried running with multiple workers in prod again, with mediasoup logs set to debug level in the Rust log filter and the following in the worker settings:

    let mut worker_settings = WorkerSettings::default();
    worker_settings.log_level = WorkerLogLevel::Debug;
    worker_settings.log_tags = vec![
        WorkerLogTag::Info,
        WorkerLogTag::Ice,
        WorkerLogTag::Dtls,
        WorkerLogTag::Rtp,
        WorkerLogTag::Srtp,
        WorkerLogTag::Rtcp,
        WorkerLogTag::Rtx,
        WorkerLogTag::Bwe,
        WorkerLogTag::Score,
        WorkerLogTag::Simulcast,
        WorkerLogTag::Svc,
        WorkerLogTag::Sctp,
    ];

We didn't see the same issue as previously mentioned, but instead saw a segfault happen twice. I'll experiment more tomorrow and see if I can get a core dump. In one of the logs for the segfault, this is the last output:

2024-02-07T01:11:12.541426Z DEBUG mediasoup::worker::channel: request() [method:TransportCloseDataproducer, id:56474]    
2024-02-07T01:11:12.541457Z DEBUG mediasoup::worker::channel: request() [method:RouterCloseTransport, id:56475]    
double free or corruption (out)

@satoren
Contributor

satoren commented Mar 4, 2024

@PaulOlteanu
Are you using SCTP (DataProducer/DataConsumer)?
This has been an issue since versions before 0.15: there is an open report of crashes when SCTP is enabled, so maybe that is the cause.

https://mediasoup.discourse.group/t/rarely-crashes-on-sendsctpdata-when-enablesctp-is-enabled/5340/5

Since I stopped using DataProducer, the priority was lowered and I didn't investigate further. I'm sorry.

@PaulOlteanu
Contributor Author

PaulOlteanu commented Mar 4, 2024

Yes, we are using data producers & consumers, but we are phasing them out (less than 2% usage at this point). If the issue is with SCTP, I would be surprised that we didn't run into it the first time we tried multiple workers (pre-0.15).

Since then we've stopped pinning mediasoup worker threads to specific CPU cores, so unless the pinning was somehow masking that issue (I don't really see how that's possible), I'm not sure it's the same thing.

Edit: I may have misunderstood. If you're referring to the segfault, then I guess that's possible. I haven't really looked into that issue (or the original one), but I think I'll have time to focus on this again now.

@satoren
Contributor

satoren commented Mar 5, 2024

When I investigated, it appeared that SCTP was running on the thread of the first Worker that started, and the other Workers were sharing it. So I think it is possible that a data race between threads could corrupt the flatbuffers data being sent. With JSON (pre-0.15), the transmission method may have been less likely to cause conflicts.
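
(A conceptual sketch of the failure mode described above, not mediasoup source: if the usrsctp send callback runs on Worker A's thread but writes into a channel buffer that Worker B's thread also writes to, the two payloads can interleave, which would explain the log line mixed into the NotificationRef bytes seen in the panics above. The names below are hypothetical.)

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-worker output buffer shared with the thread driving usrsctp.
struct ChannelBuffer
{
	std::vector<uint8_t> bytes; // no mutex guarding it

	// In the failure scenario this is called concurrently from two threads:
	// the worker's own event loop and the thread running the usrsctp callback.
	void Append(const uint8_t* data, size_t len)
	{
		// Two concurrent Append() calls can interleave (or corrupt the vector
		// itself), producing a frame that the Rust side can no longer parse
		// as a valid LogRef or NotificationRef.
		bytes.insert(bytes.end(), data, data + len);
	}
};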

@ibc
Member

ibc commented Mar 5, 2024

Is this something that was not properly done when the Rust layer was created? In theory DepUsrSCTP.cpp should run a Checker singleton (which runs the usrsctp thread) per Worker, even in Rust, rather than a unique singleton for all threads when multiple Workers run in a Rust application. This was solved in many other areas using thread_local; perhaps it was not done here.

UPDATE: Looking at the code, it looks OK.

@ibc
Member

ibc commented Mar 5, 2024

@nazar-pc perhaps you see something wrong in https://github.com/versatica/mediasoup/blob/v3/worker/src/DepUsrSCTP.cpp when running multiple mediasoup Worker threads in Rust?

The constructor in Worker.cpp creates a DepUsrSCTP::Checker singleton, which is defined as thread_local static Checker* checker in DepUsrSCTP.hpp.

The problem (not sure) is that we start that singleton when the first SctpAssociation instance is created (it calls DepUsrSCTP::RegisterSctpAssociation()) and stop it when the last running SctpAssociation instance is closed (it calls DepUsrSCTP::DeregisterSctpAssociation()). However, these methods are protected by mutex locks:

void DepUsrSCTP::DeregisterSctpAssociation(RTC::SctpAssociation* sctpAssociation)
{
	MS_TRACE();

	const std::lock_guard<std::mutex> lock(GlobalSyncMutex);

	MS_ASSERT(DepUsrSCTP::checker != nullptr, "Checker not created");

	auto found = DepUsrSCTP::mapIdSctpAssociation.erase(sctpAssociation->id);

	MS_ASSERT(found > 0, "SctpAssociation not found");
	MS_ASSERT(DepUsrSCTP::numSctpAssociations > 0u, "numSctpAssociations was not higher than 0");

	if (--DepUsrSCTP::numSctpAssociations == 0u)
	{
		DepUsrSCTP::checker->Stop();
	}
}
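
(For reference, a sketch of the Register counterpart described above, reconstructed from this description rather than copied from DepUsrSCTP.cpp, and assuming a Checker::Start() that mirrors the Stop() above: it increments the counter under the same mutex and starts the Checker when the first association registers.)

// Sketch only, not the actual mediasoup source.
void DepUsrSCTP::RegisterSctpAssociation(RTC::SctpAssociation* sctpAssociation)
{
	MS_TRACE();

	const std::lock_guard<std::mutex> lock(GlobalSyncMutex);

	MS_ASSERT(DepUsrSCTP::checker != nullptr, "Checker not created");

	DepUsrSCTP::mapIdSctpAssociation[sctpAssociation->id] = sctpAssociation;

	if (++DepUsrSCTP::numSctpAssociations == 1u)
	{
		// Start the (thread_local) Checker when the first association appears.
		DepUsrSCTP::checker->Start();
	}
}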

Then, when a Worker is closed, lib.cpp "may" finish usrsctp since it calls DepUsrSCTP::ClassDestroy(), but IMHO we are also protected here:

void DepUsrSCTP::ClassDestroy()
{
	MS_TRACE();

	const std::lock_guard<std::mutex> lock(GlobalSyncMutex);
	--GlobalInstances;

	if (GlobalInstances == 0)
	{
		usrsctp_finish();

		numSctpAssociations   = 0u;
		nextSctpAssociationId = 0u;

		DepUsrSCTP::mapIdSctpAssociation.clear();
	}
}

@satoren do you see something here that may produce the crash when running multiple Workers in Rust within the same process?

@satoren
Contributor

satoren commented Mar 5, 2024

@ibc

usrsctp_init_nothreads(0, onSendSctpData, sctpDebug);

Although I cannot be sure because it has been some time since our investigation, it appeared that this callback (onSendSctpData) was being executed on the thread of the Worker where it was first created.

UPDATE:
My guess is that this concern is related.

However, usrsctp lacks documentation and the code is hard to read and appears impossible to debug.
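
(A generic illustration of the pattern being described, not usrsctp's actual internals: a single-threaded C library that stores its send callback in global state ends up invoking it on whichever thread drives the library, i.e. the thread of the Worker that initialized it first, regardless of which Worker owns the data being sent. All names below are hypothetical.)

#include <cstddef>

// Hypothetical single-threaded C-style library, initialized once per process.
using SendCallback = int (*)(void* ctx, const void* data, size_t len);

static SendCallback globalSendCb = nullptr; // global state shared by all Workers
static void* globalSendCtx       = nullptr;

void lib_init_nothreads(SendCallback cb, void* ctx)
{
	// Only the first caller's callback and context are kept; Workers that
	// start later still send through this one.
	if (globalSendCb == nullptr)
	{
		globalSendCb  = cb;
		globalSendCtx = ctx;
	}
}

void lib_send(const void* data, size_t len)
{
	// Runs on whatever thread calls into the library; with a single-threaded
	// API that tends to be the thread of the Worker that initialized it.
	globalSendCb(globalSendCtx, data, len);
}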

@ibc
Member

ibc commented Mar 5, 2024

It makes sense. IMHO the problem is not even related to usrsctp being multithreaded, since we use the single-thread API, so that shouldn't be the issue.

The problem is that usrsctp uses global state (such as the send callback passed to the usrsctp_init_nothreads() function shown above).

We set that callback within a Worker. When that Worker is closed (in this case a thread in a multi-worker Rust app), we don't explicitly finish usrsctp since there are more Workers running (see the counter in DepUsrSCTP.cpp). However, the thread is closed anyway and usrsctp probably keeps calling the callback. In this scenario I have no idea what is going on, but it doesn't look good.

We use the single-thread version of usrsctp in mediasoup because originally (in Node apps) there is indeed a single thread, and if the app creates N Workers, each runs in a separate process with its own, isolated usrsctp stack. However, in Rust apps we run N threads when we launch N Workers.

Given that, I wonder if we should instead use the multi-thread API of usrsctp in Rust, creating a dedicated thread for usrsctp (that is never closed) and passing the send callback within it. Or we keep using the single-thread API of usrsctp but create a separate thread (that we never close while there are running Workers) and pass the send callback to usrsctp within it.

@nazar-pc does it make sense?
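
(A rough conceptual sketch of the second option above, not a proposed patch: one long-lived thread owns all interaction with usrsctp, Workers hand it work through a queue, and the send callback therefore always fires on a thread that never goes away. The class and method names are made up for illustration; in this sketch the thread is never joined or destroyed.)

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical dedicated usrsctp thread: Workers enqueue closures that call
// into usrsctp, so the callback registered at init time always runs here.
class UsrSctpThread
{
public:
	UsrSctpThread() : thread([this]() { Run(); }) {}

	void Post(std::function<void()> task)
	{
		{
			const std::lock_guard<std::mutex> lock(mutex);
			tasks.push(std::move(task));
		}
		cond.notify_one();
	}

private:
	void Run()
	{
		// Created once per process and kept alive while any Worker runs,
		// so usrsctp's globally registered callback never outlives its thread.
		for (;;)
		{
			std::unique_lock<std::mutex> lock(mutex);
			cond.wait(lock, [this]() { return !tasks.empty(); });

			auto task = std::move(tasks.front());
			tasks.pop();
			lock.unlock();

			task();
		}
	}

	std::mutex mutex;
	std::condition_variable cond;
	std::queue<std::function<void()>> tasks;
	std::thread thread; // declared last so the other members exist before Run() starts
};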

@nazar-pc
Collaborator

nazar-pc commented Mar 5, 2024

If there is a multi-thread API in usrsctp, then we probably should use it.

@ibc
Member

ibc commented Mar 5, 2024

@jmillan and I have talked about this at length and we have identified the problem. I'm describing it in detail in a new issue: #1352

I'm closing this issue, so please let's move to the new one.

CC @satoren @PaulOlteanu @nazar-pc @GEverding

@ibc ibc closed this as completed Mar 5, 2024