[Rust] Worker panics when running with multiple workers #1320

Closed
PaulOlteanu opened this issue Jan 31, 2024 · 19 comments

@PaulOlteanu
Contributor

Bug Report

mediasoup version: 0.15

I'm trying to run mediasoup with 2 (eventually more) workers per server (on a 2-core server, so one worker per core). After switching to 2 workers, I've been getting fairly frequent worker panics:

thread 'mediasoup-worker-cd9eec89-8672-4293-b4eb-3a857fdd0341' panicked at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/mediasoup-0.15.0/src/router/data_consumer.rs:277:21:
Wrong message from worker: NotificationRef { handler_id: Ok("\u{c}\0\0\0\u{8}\0\u{e}\0\u{7}\0\u{8}\0\u{8}\0\0\0\0\0\0\u{4}\u{c}\0\0\0\0\0\u{6}\0\u{8}\0\u{4}\0\u{6}\0\0\0\u{4}\0\0\0q\0\0\0WRTC::WebRtcServer::OnStunDataReceived() | ignoring received STUN packet with unknown remote ICE usernameFragment\0\0\0"), event: Ok(DataconsumerBufferedAmountLow), body: Err(Error { source_location: ErrorLocation { type_: "Notification", method: "body", byte_offset: 36 }, error_kind: InvalidOffset }) }
thread 'mediasoup-worker-87b809a2-a4f9-4231-89a2-b4f5fe2cd40c' panicked at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/mediasoup-0.15.0/src/router/data_consumer.rs:277:21:
Wrong message from worker: NotificationRef { handler_id: Ok("\u{c}\0\0\0\u{8}\0\u{e}\0\u{7}\0\u{8}\0\u{8}\0\0\0\0\0\0\u{4}\u{c}\0\0\0\0\0\u{6}\0\u{8}\0\u{4}\0\u{6}\0\0\0\u{4}\0\0\0q\0\0\0WRTC::WebRtcServer::OnStunDataReceived() | ignoring received STUN packet with unknown remote ICE usernameFragment\0\0\0"), event: Ok(DataconsumerBufferedAmountLow), body: Err(Error { source_location: ErrorLocation { type_: "Notification", method: "body", byte_offset: 36 }, error_kind: InvalidOffset }) }
thread 'mediasoup-worker-bb92335a-9b5b-4b48-b4bc-127543ac7041' panicked at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/mediasoup-0.15.0/src/router/webrtc_transport.rs:479:21:
Wrong message from worker: NotificationRef { handler_id: Err(Error { source_location: ErrorLocation { type_: "Notification", method: "handler_id", byte_offset: 36 }, error_kind: InvalidUtf8 { source: Utf8Error { valid_up_to: 0, error_len: Some(1) } } }), event: Ok(TransportSctpStateChange), body: Err(Error { source_location: ErrorLocation { type_: "Notification", method: "body", byte_offset: 36 }, error_kind: InvalidOffset }) }
thread 'mediasoup-worker-6e6b7487-e501-42bf-9681-7450e79a0dfb' panicked at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/mediasoup-0.15.0/src/router/webrtc_transport.rs:465:21:
Wrong message from worker: NotificationRef { handler_id: Err(Error { source_location: ErrorLocation { type_: "Notification", method: "handler_id", byte_offset: 36 }, error_kind: InvalidOffset }), event: Ok(WebrtctransportDtlsStateChange) }

and

thread 'mediasoup-worker-28a9b972-22da-4324-8e10-4284fbab7089' panicked at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/mediasoup-0.15.0/src/worker/channel.rs:256:91:
called `Result::unwrap()` on an `Err` value: Error(Char { character: 'h', index: 29 })

The last one is weird because it seems unrelated to the others.

I've never seen any of these panics before switching to multiple workers, so I'm assuming it's related to that. I also only see this in prod, so it might be difficult to reproduce.

@nazar-pc
Collaborator

@jmillan is it possible that some non-thread-local statics were introduced recently?

@jmillan
Member

jmillan commented Feb 1, 2024

Seems related to flatbuffers. I'll be checking this in a few hours.

@PaulOlteanu, I'd suggest using a single worker if possible while we troubleshoot this.

@jmillan
Member

jmillan commented Feb 1, 2024

So far I don't see any static, non-thread-local variables.

Rust definitely cannot properly read the incoming buffer because it is malformed.

In the following case the payload is a log message (LogRef), but Rust is interpreting it as a notification (NotificationRef).

EDIT: The following message contains a Notification (DataconsumerBufferedAmountLow) and also the log WebRtcServer::OnStunDataReceived() | ignoring received STUN packet with unknown remote ICE usernameFragment, which clearly shows the corruption.

thread 'mediasoup-worker-cd9eec89-8672-4293-b4eb-3a857fdd0341' panicked at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/mediasoup-0.15.0/src/router/data_consumer.rs:277:21:
Wrong message from worker: NotificationRef { handler_id: Ok("\u{c}\0\0\0\u{8}\0\u{e}\0\u{7}\0\u{8}\0\u{8}\0\0\0\0\0\0\u{4}\u{c}\0\0\0\0\0\u{6}\0\u{8}\0\u{4}\0\u{6}\0\0\0\u{4}\0\0\0q\0\0\0WRTC::WebRtcServer::OnStunDataReceived() | ignoring received STUN packet with unknown remote ICE usernameFragment\0\0\0"), event: Ok(DataconsumerBufferedAmountLow), body: Err(Error { source_location: ErrorLocation { type_: "Notification", method: "body", byte_offset: 36 }, error_kind: InvalidOffset }) }

@jmillan
Member

jmillan commented Feb 1, 2024

@PaulOlteanu, can you confirm this only happens when using multiple workers?

@PaulOlteanu
Contributor Author

Yeah, I've been using mediasoup 0.15 since January 8 and never saw this issue in a single-worker setup. As soon as I tried multiple workers I saw many of these kinds of errors within 24 hours.

I did experiment with multiple workers for a week a few months ago and I also never saw this issue (I think I was using mediasoup 0.12 at the time which was before the flatbuffer change).

@jmillan
Member

jmillan commented Feb 2, 2024

@PaulOlteanu, I assume you're back to using a single worker. Please let us know if you get the crash in this scenario. So far we've been unable to reproduce the crash with multiple workers enabled.

@GEverding

I work with @PaulOlteanu. The majority of our traffic comes from iOS and Android. Could that be a factor in reproducing this crash? What other info would help you debug it?

@nazar-pc
Collaborator

nazar-pc commented Feb 2, 2024

Ideally we'd have a reliable reproduction. If you can take one of the example apps (videoroom is a good one), modify it (if necessary) to reproduce the issue, and share the code with us, that'd be ideal.

@PaulOlteanu
Contributor Author

PaulOlteanu commented Feb 2, 2024

One thing that comes to mind is that we're running on aarch64, which I believe has weaker memory-model guarantees than x86. So if the bug is an incorrect atomic usage somewhere, it might not reproduce on x86 because of that (though I have no idea why this issue would be related to atomics).
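
(For illustration only, not mediasoup code: this is the generic kind of ordering bug that x86's stronger hardware ordering tends to hide but aarch64's weaker memory model exposes.)

#include <atomic>
#include <cstdint>

std::atomic<uint32_t> data{0};
std::atomic<bool> ready{false};

// Writer thread: with relaxed ordering there is no happens-before edge,
// so on aarch64 the reader below may observe ready == true while data is
// still 0. On x86 the hardware's TSO ordering usually hides this (though
// the compiler may still reorder the stores).
void publish()
{
	data.store(42, std::memory_order_relaxed);
	ready.store(true, std::memory_order_relaxed); // should be std::memory_order_release
}

// Reader thread: should pair the load of `ready` with std::memory_order_acquire.
uint32_t consume()
{
	while (!ready.load(std::memory_order_relaxed))
	{
	}

	return data.load(std::memory_order_relaxed); // may still read 0 on aarch64
}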

I've also only seen this issue happen in production, where we have 1000-2000 users per server at the moment, so that also makes it difficult to reproduce.

@PaulOlteanu
Contributor Author

We tried running with multiple workers in prod again, with mediasoup logs set to debug level in the Rust log filter and the following in the worker settings:

    let mut worker_settings = WorkerSettings::default();
    worker_settings.log_level = WorkerLogLevel::Debug;
    worker_settings.log_tags = vec![
        WorkerLogTag::Info,
        WorkerLogTag::Ice,
        WorkerLogTag::Dtls,
        WorkerLogTag::Rtp,
        WorkerLogTag::Srtp,
        WorkerLogTag::Rtcp,
        WorkerLogTag::Rtx,
        WorkerLogTag::Bwe,
        WorkerLogTag::Score,
        WorkerLogTag::Simulcast,
        WorkerLogTag::Svc,
        WorkerLogTag::Sctp,
    ];

We didn't see the same issue as previously mentioned, but instead saw a segfault happen twice. I'll experiment more tomorrow and see if I can get a core dump. In one of the logs for the segfault, this is the last output:

2024-02-07T01:11:12.541426Z DEBUG mediasoup::worker::channel: request() [method:TransportCloseDataproducer, id:56474]    
2024-02-07T01:11:12.541457Z DEBUG mediasoup::worker::channel: request() [method:RouterCloseTransport, id:56475]    
double free or corruption (out)

@satoren
Contributor

satoren commented Mar 4, 2024

@PaulOlteanu
Are you using SCTP (DataProducer/DataConsumer)?
This has been an issue since versions before 0.15: there is an open report of crashes when SCTP is enabled, so maybe that is the cause.

https://mediasoup.discourse.group/t/rarely-crashes-on-sendsctpdata-when-enablesctp-is-enabled/5340/5

Since I stopped using DataProducer, the priority was lowered and I didn't investigate further. I'm sorry.

@PaulOlteanu
Contributor Author

PaulOlteanu commented Mar 4, 2024

Yes, we are using data producers & consumers, but we are phasing them out (less than 2% usage at this point). If the issue is with SCTP, I would be surprised that we didn't run into it the first time we tried multiple workers (pre-0.15).

Since then we've stopped pinning mediasoup worker threads to specific CPU cores, so unless the pinning was somehow masking that issue (I don't really see how that's possible), I'm not sure it's the same thing.

Edit: I may have misunderstood. If you're referring to the segfault, then I guess that's possible. I haven't really looked into that issue (or the original one), but I think I'll have time to focus on this again now.

@satoren
Contributor

satoren commented Mar 5, 2024

When I investigated, it appeared that SCTP was running on the thread of the first Worker that started, and the other Workers were sharing it. So I think it is possible that a data race between threads could corrupt the flatbuffers data being sent. With JSON (pre-0.15), the transmission method may have been less likely to cause conflicts.
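
(A conceptual sketch of the failure mode described above, not mediasoup source: if the usrsctp send callback runs on Worker A's thread but writes into a channel buffer that Worker B's thread also writes to, the two payloads can interleave, which would explain the log line mixed into the NotificationRef bytes seen in the panics above. The names below are hypothetical.)

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-worker output buffer shared with the thread driving usrsctp.
struct ChannelBuffer
{
	std::vector<uint8_t> bytes; // no mutex guarding it

	// In the failure scenario this is called concurrently from two threads:
	// the worker's own event loop and the thread running the usrsctp callback.
	void Append(const uint8_t* data, size_t len)
	{
		// Two concurrent Append() calls can interleave (or corrupt the vector
		// itself), producing a frame that the Rust side can no longer parse
		// as a valid LogRef or NotificationRef.
		bytes.insert(bytes.end(), data, data + len);
	}
};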

@ibc
Member

ibc commented Mar 5, 2024

Is this something that was not properly done when the Rust layer was created? In theory DepUsrSCTP.cpp should run a Checker singleton (which runs the usrsctp thread) per Worker, even in Rust, rather than a unique singleton for all threads when multiple Workers run in a Rust application. This was solved in many other areas using thread_local; perhaps it was not done here.

UPDATE: Looking at the code, it looks OK.

@ibc
Member

ibc commented Mar 5, 2024

@nazar-pc perhaps you see something wrong in https://github.com/versatica/mediasoup/blob/v3/worker/src/DepUsrSCTP.cpp when running multiple mediasoup Worker threads in Rust?

The constructor in Worker.cpp creates a DepUsrSCTP::Checker singleton, which is defined as thread_local static Checker* checker in DepUsrSCTP.hpp.

The problem (not sure) is that we start that singleton when the first SctpAssociation instance is created (it calls DepUsrSCTP::RegisterSctpAssociation()) and stop it when the last running SctpAssociation instance is closed (it calls DepUsrSCTP::DeregisterSctpAssociation()). However, these methods are protected by mutex locks:

void DepUsrSCTP::DeregisterSctpAssociation(RTC::SctpAssociation* sctpAssociation)
{
	MS_TRACE();

	const std::lock_guard<std::mutex> lock(GlobalSyncMutex);

	MS_ASSERT(DepUsrSCTP::checker != nullptr, "Checker not created");

	auto found = DepUsrSCTP::mapIdSctpAssociation.erase(sctpAssociation->id);

	MS_ASSERT(found > 0, "SctpAssociation not found");
	MS_ASSERT(DepUsrSCTP::numSctpAssociations > 0u, "numSctpAssociations was not higher than 0");

	if (--DepUsrSCTP::numSctpAssociations == 0u)
	{
		DepUsrSCTP::checker->Stop();
	}
}
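
(For reference, a sketch of the Register counterpart described above, reconstructed from this description rather than copied from DepUsrSCTP.cpp, and assuming a Checker::Start() that mirrors the Stop() above: it increments the counter under the same mutex and starts the Checker when the first association registers.)

// Sketch only, not the actual mediasoup source.
void DepUsrSCTP::RegisterSctpAssociation(RTC::SctpAssociation* sctpAssociation)
{
	MS_TRACE();

	const std::lock_guard<std::mutex> lock(GlobalSyncMutex);

	MS_ASSERT(DepUsrSCTP::checker != nullptr, "Checker not created");

	DepUsrSCTP::mapIdSctpAssociation[sctpAssociation->id] = sctpAssociation;

	if (++DepUsrSCTP::numSctpAssociations == 1u)
	{
		// Start the (thread_local) Checker when the first association appears.
		DepUsrSCTP::checker->Start();
	}
}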

Then, when a Worker is closed, lib.cpp "may" finish usrsctp since it calls DepUsrSCTP::ClassDestroy(), but IMHO we are also protected here:

void DepUsrSCTP::ClassDestroy()
{
	MS_TRACE();

	const std::lock_guard<std::mutex> lock(GlobalSyncMutex);
	--GlobalInstances;

	if (GlobalInstances == 0)
	{
		usrsctp_finish();

		numSctpAssociations   = 0u;
		nextSctpAssociationId = 0u;

		DepUsrSCTP::mapIdSctpAssociation.clear();
	}
}

@satoren do you see something here that may produce the crash when running multiple Workers in Rust within the same process?

@satoren
Contributor

satoren commented Mar 5, 2024

@ibc

usrsctp_init_nothreads(0, onSendSctpData, sctpDebug);

Although I cannot be sure because it has been some time since our investigation, it appeared that this callback (onSendSctpData) was being executed on the thread of the Worker where it was first created.

UPDATE:
My guess is that this concern is related.

However, usrsctp lacks documentation and the code is hard to read and appears impossible to debug.
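
(A generic illustration of the pattern being described, not usrsctp's actual internals: a single-threaded C library that stores its send callback in global state ends up invoking it on whichever thread drives the library, i.e. the thread of the Worker that initialized it first, regardless of which Worker owns the data being sent. All names below are hypothetical.)

#include <cstddef>

// Hypothetical single-threaded C-style library, initialized once per process.
using SendCallback = int (*)(void* ctx, const void* data, size_t len);

static SendCallback globalSendCb = nullptr; // global state shared by all Workers
static void* globalSendCtx       = nullptr;

void lib_init_nothreads(SendCallback cb, void* ctx)
{
	// Only the first caller's callback and context are kept; Workers that
	// start later still send through this one.
	if (globalSendCb == nullptr)
	{
		globalSendCb  = cb;
		globalSendCtx = ctx;
	}
}

void lib_send(const void* data, size_t len)
{
	// Runs on whatever thread calls into the library; with a single-threaded
	// API that tends to be the thread of the Worker that initialized it.
	globalSendCb(globalSendCtx, data, len);
}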

@ibc
Member

ibc commented Mar 5, 2024

It makes sense. IMHO the problem is not even related to usrsctp being multithreaded, since we use the single-thread API, so that shouldn't be the issue.

The problem is that usrsctp uses global state (such as the send callback passed to the usrsctp_init_nothreads() function shown above).

We set that callback within a Worker. When that Worker is closed (in this case a thread in a multi-worker Rust app), we don't explicitly finish usrsctp since there are more Workers running (see the counter in DepUsrSCTP.cpp). However, the thread is closed anyway and usrsctp probably keeps calling the callback. In this scenario I have no idea what is going on, but it doesn't look good.

We use the single-thread version of usrsctp in mediasoup because originally (in Node apps) there is indeed a single thread, and if the app creates N Workers, each runs in a separate process with its own, isolated usrsctp stack. However, in Rust apps we run N threads when we launch N Workers.

Given that, I wonder if we should instead use the multi-thread API of usrsctp in Rust, creating a dedicated thread for usrsctp (that is never closed) and passing the send callback within it. Or we keep using the single-thread API of usrsctp but create a separate thread (that we never close while there are running Workers) and pass the send callback to usrsctp within it.

@nazar-pc does it make sense?
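
(A rough conceptual sketch of the second option above, not a proposed patch: one long-lived thread owns all interaction with usrsctp, Workers hand it work through a queue, and the send callback therefore always fires on a thread that never goes away. The class and method names are made up for illustration; in this sketch the thread is never joined or destroyed.)

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical dedicated usrsctp thread: Workers enqueue closures that call
// into usrsctp, so the callback registered at init time always runs here.
class UsrSctpThread
{
public:
	UsrSctpThread() : thread([this]() { Run(); }) {}

	void Post(std::function<void()> task)
	{
		{
			const std::lock_guard<std::mutex> lock(mutex);
			tasks.push(std::move(task));
		}
		cond.notify_one();
	}

private:
	void Run()
	{
		// Created once per process and kept alive while any Worker runs,
		// so usrsctp's globally registered callback never outlives its thread.
		for (;;)
		{
			std::unique_lock<std::mutex> lock(mutex);
			cond.wait(lock, [this]() { return !tasks.empty(); });

			auto task = std::move(tasks.front());
			tasks.pop();
			lock.unlock();

			task();
		}
	}

	std::mutex mutex;
	std::condition_variable cond;
	std::queue<std::function<void()>> tasks;
	std::thread thread; // declared last so the other members exist before Run() starts
};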

@nazar-pc
Collaborator

nazar-pc commented Mar 5, 2024

If there is a multi-thread API in usrsctp, then we probably should use it.

@ibc
Member

ibc commented Mar 5, 2024

@jmillan and I have talked about this at length and we have identified the problem. I'm describing it in detail in a new issue: #1352

I'm closing this issue, so please let's move to the new one.

CC @satoren @PaulOlteanu @nazar-pc @GEverding

@ibc ibc closed this as completed Mar 5, 2024