Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

channels: Avoid deadlocks when locking ao2_container *channels #311

Draft
wants to merge 1 commit into
base: 18
Choose a base branch
from

Conversation

intellasoft
Copy link
Contributor

One situation where deadlocks occur is when:
-> res_pjsip_sdp_rtp is checking for rtp timeouts -> Which locks
ao2_container channels, and then waits for a lock on a channel
that's currently running IMPORT
At the same time
-> Dialplan asks for IMPORT(${some_channel},...) -> Which grabs a
channel lock and then waits on ao2_container channels

So in this situation, res_pjsip_sdp_rtp is waiting on a channel lock
that will never release, because res_pjsip_sdp_rtp is currently
holding ao2_container channels, that the IMPORT needs after having
already locked the channel that res_pjsip_sdp_rtp is processing next.

Resolves: #307

cherry-pick-to: 19
cherry-pick-to: 20
cherry-pick-to: master

@asterisk-org-access-app
Copy link

REMINDER: If this PR applies to other branches, please add a comment with the appropriate "cherry-pick-to" headers as per the Create a Pull Request process.

If you don't want it cherry-picked, please add a comment with cherry-pick-to: none so we don't keep asking.

If, after adding "cherry-pick-to" comments, you change your mind, please edit the comment to DELETE the header lines and add cherry-pick-to: none.

The currently active branches are now 18, 20, 21 and master.

Copy link
Contributor

@InterLinked1 InterLinked1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the cherry pick lines from the commit message and make a comment with those instead.

main/channel.c Outdated
@@ -712,12 +718,23 @@ static int does_id_conflict(const char *uniqueid)
{
struct ast_channel *conflict;
size_t length = 0;
struct ast_channel_callback_data callback_data = {.search = uniqueid, .final_match_result = 0};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we put each element on its own line? Not sure if that's explicitly in the coding guidelines but that's easier to read.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

main/channel.c Outdated

if (ast_strlen_zero(uniqueid)) {
return 0;
}

conflict = ast_channel_callback(ast_channel_by_uniqueid_cb, (char *) uniqueid, &length, OBJ_NOLOCK);
try_again:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think goto is necessary here, this could be a do while or for (;;) with break

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

main/channel.c Outdated

if (callback_data.final_match_result == CMP_RETRY_NEEDED) {
/* We ran into a possible deadlock, try again */
ast_debug(3, "ast_channel_iterator_by_exten_new() Avoid deadlock trying to find channel by extension@context: %s@%s\n", exten, context);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ast_debug includes the function name, no need to repeat it.

This could all be simplified to:

`ast_debug(3, "Avoided deadlock trying to find channel by %@%s\n", exten, context);

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

main/channel.c Outdated

if (ast_strlen_zero(l_name)) {
/* We didn't have a name to search for so quit. */
return NULL;
}

chan = ast_channel_callback(ast_channel_by_name_cb, l_name, &name_len,
try_name_again:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing here, you can use 2 loops here and goto isn't needed.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are goto's that bad? It reduces nesting. The only thing that adding a loop does is introduce an extra level of nesting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're not inherently bad, this just doesn't seem like a necessary or appropriate use for them, since a loop works perfectly fine. I don't think there's a problem with adding a level of nesting, it contains the logic in its own scope.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

main/channel.c Outdated

if (callback_data.final_match_result == CMP_RETRY_NEEDED) {
/* We ran into a possible deadlock, try again */
ast_debug(3, "ast_channel_get_by_name_prefix() Avoid deadlock trying to find channel by uniqueid: %s\n", name);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about log format

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@kobaz
Copy link

kobaz commented Sep 9, 2023

cherry-pick-to: 19
cherry-pick-to: 20
cherry-pick-to: master

@InterLinked1
Copy link
Contributor

cherry-pick-to: 19
cherry-pick-to: 20
cherry-pick-to: master

PRs should be raised against master, not 18 (unlike Gerrit, it makes a difference now).
And it should be cherry picked to 18, 20, and 21, not 19.

@kobaz
Copy link

kobaz commented Sep 9, 2023

master is slightly different. It will not cleanly cherry-pick to branches from master. Is the process still go to to master first? Make separate PRs for each branch?

@InterLinked1
Copy link
Contributor

InterLinked1 commented Sep 9, 2023

master is slightly different. It will not cleanly cherry-pick to branches from master. Is the process still go to to master first? Make separate PRs for each branch?

Ah, okay, then it's more complicated, then you might need to do multiple sets since GitHub can only cherry pick verbatim, might want to double check with George.

@kobaz
Copy link

kobaz commented Sep 10, 2023

master is slightly different. It will not cleanly cherry-pick to branches from master. Is the process still go to to master first? Make separate PRs for each branch?

Ah, okay, then it's more complicated, then you might need to do multiple sets since GitHub can only cherry pick verbatim, might want to double check with George.

master no longer has support for macros, so the checks in ast_channel_by_exten_cb() will not cherry-pick clean with the change as-is.

main/channel.c Outdated
/* We found a match or the search fully completed through all channels */
break;
}
while (1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while should go on previous line
Though really I think for(;;) is preferred in Asterisk for infinite loops.

main/channel.c Outdated
/* We found a match or the search fully completed through all channels */
break;
}
while (1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

main/channel.c Outdated
/* We found a match or the search fully completed through all channels */
break;
}
while (1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@kobaz
Copy link

kobaz commented Sep 21, 2023

Update: Been running this in the wild with mixed results.

Some days it's been completely solid with no deadlocks (better than before with 3-5 deadlocks per day or more).

Some days it actually almost immediately deadlocked after handling just a few calls.

@jcolp jcolp marked this pull request as draft September 29, 2023 13:43
@jcolp
Copy link
Member

jcolp commented Sep 29, 2023

Due to the last comment I am transitioning this to a draft as it does not appear to be in a state for review as it does not resolve the underlying issue. If this is untrue and it should actually be reviewed, then it can be published.

@kobaz
Copy link

kobaz commented Sep 30, 2023

Fixed previous issue where it did not fully fix the problem. Small refactor to avoid breaking assumptions about value of 'arg' with respect to find channel callbacks.

It's now passing my tests regarding the original deadlock situation.

I would nominate this ready for review now.

* Will be passed to cb_fn as its arg param
* \param data Arbitrary data to be passed to cb_fn as its data param
*
* \note This function SHOULD NOT be called unlessfully understanding the implications.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unlessfully?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha... yeah I noticed that after I sent the last push. Should be 'unless fully'

*
* \param arg Must be a char *, Typically either a channel name or a channel uniqueid.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typically should not be capitalized

struct ast_channel *ast_channel_callback(ao2_callback_data_fn *cb_fn, void *arg,
void *data, int ao2_flags)
{
ast_log(LOG_WARNING, "Using ast_channel_callback() directly may lead to deadlocks. Please update your module.\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an existing function, right? Won't that start spewing warnings all over the place? If this really was a proper fix, then everything else should be updated at the same time too. But I would hold off on that for now until that is determined.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a public api function. All internal usages across the core and modules have been updated in this patch. People who may have written 3rd party modules that iterate through channels with ao2_channel_callback would need to be rewritten to use the safe version (which needs to be trylock aware) so there's no 'drop-in' replacement for ast_channel_callback.

3rd party modules would get this warning. I was thinking about it that there should be an option to turn the warning off, but I believe avoiding deadlocks is important enough it should be pretty loud (with the ability to mute).

static int does_id_conflict(const char *uniqueid)
{
struct ast_channel *conflict;
char *l_uniqueid = (char *) uniqueid;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're pointlessly casting to the same thing here

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's dropping the const qualifier so that you don't have to cast it for the callback.

char *context = arg;
char *exten = data;
char *exten = (char *) callback_data->data_additional;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant cast

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same... dropping the const qualifier.


if (!(i = ast_calloc(1, sizeof(*i)))) {
return NULL;
}

i->active_iterator = (void *) ast_channel_callback(ast_channel_by_exten_cb,
l_context, l_exten, OBJ_MULTIPLE);
i->active_iterator = (void *) channel_find_by_exten(exten, context, OBJ_MULTIPLE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to cast to void

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the original code that had the cast... but yeah I agree.

@@ -1388,9 +1487,10 @@ struct ast_channel_iterator *ast_channel_iterator_by_name_new(const char *name,
return NULL;
}

i->active_iterator = (void *) ast_channel_callback(ast_channel_by_name_cb,
l_name, &name_len,
i->active_iterator = (void *) ast_channel_callback_safe(ast_channel_by_name_cb,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I was thinking about this. Yes the void * cast is not required, but it does add code clarity. Normally ast_channel_callback returns a channel, but in this case with OBJ_MULTIPLE it returns an iterator containing the list of matched channels so (void *) is more applicable here for syntactic clarity.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this some more, the void * is required.

if (target) {
return target;
}

/* Now try a search for uniqueid. */
pickup_args.name = part;
pickup_args.len = 0;
return ast_channel_callback(find_by_uniqueid, NULL, &pickup_args, 0);

return ast_channel_callback_safe(find_by_uniqueid, "channel uniqueid, where pickup is possible", pickup_args.name, NULL, &pickup_args, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description is identical to line 229. Shouldn't they be different if the goal is to be unique?
Additionally I think automatically include the filename and line number is more robust than hardcoding descriptions. In which case, there should be a macro that injects these automatically.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

One situation where deadlocks occur is when:
   -> res_pjsip_sdp_rtp is checking for rtp timeouts -> Which locks
      ao2_container channels, and then waits for a lock on a channel
      that's currently running IMPORT
 At the same time
   -> Dialplan asks for IMPORT(${some_channel},...) -> Which grabs a
      channel lock and then waits on ao2_container channels

So in this situation, res_pjsip_sdp_rtp is waiting on a channel lock
that will never release, because res_pjsip_sdp_rtp is currently
holding ao2_container channels, that the IMPORT needs after having
already locked the channel that res_pjsip_sdp_rtp is processing next.

UserNote: Module developers should discontinue use of ast_channel_callback()
and use ast_channel_callback_safe() instead.  (This will avoid causing deadlocks)

Resolves: asterisk#307
@kobaz
Copy link

kobaz commented Oct 8, 2023

Pushed a new one.. fixes channel reference leaks (introduced in the previous patch) when OBJ_MULTIPLE is used to return an iterator

@kobaz
Copy link

kobaz commented Oct 8, 2023

While trying to extract out components from our full platform into a standalone deadlock reproduction test, I've uncovered an additional deadlock related to ast_sip_push_task_wait() also related to IMPORT import_helper() (posted to #307). I realized this is not the same deadlock that this #311 pull is addressing. So I'll work on another test to trigger this specific one as well.

This is related, but probably should be split into a new issue since this other newly uncovered deadlock is not related to ao2_container *channels.

@gtjoseph
Copy link
Member

I'm converting this back to a draft again since there are still outstanding changes requested and the cherry-pick lines still aren't correct.

Speaking of which, are you saying this PR will cherry-pick cleanly from 18 to master but not the other way around?

@gtjoseph gtjoseph marked this pull request as draft January 23, 2024 16:42
@kobaz
Copy link

kobaz commented Jan 23, 2024

Correct, I don't think it goes the other way, but I can do some tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants