-
Notifications
You must be signed in to change notification settings - Fork 931
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
channels: Avoid deadlocks when locking ao2_container *channels #311
base: 18
Are you sure you want to change the base?
Conversation
REMINDER: If this PR applies to other branches, please add a comment with the appropriate "cherry-pick-to" headers as per the Create a Pull Request process. If you don't want it cherry-picked, please add a comment with If, after adding "cherry-pick-to" comments, you change your mind, please edit the comment to DELETE the header lines and add The currently active branches are now 18, 20, 21 and master. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the cherry pick lines from the commit message and make a comment with those instead.
main/channel.c
Outdated
@@ -712,12 +718,23 @@ static int does_id_conflict(const char *uniqueid) | |||
{ | |||
struct ast_channel *conflict; | |||
size_t length = 0; | |||
struct ast_channel_callback_data callback_data = {.search = uniqueid, .final_match_result = 0}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we put each element on its own line? Not sure if that's explicitly in the coding guidelines but that's easier to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
main/channel.c
Outdated
|
||
if (ast_strlen_zero(uniqueid)) { | ||
return 0; | ||
} | ||
|
||
conflict = ast_channel_callback(ast_channel_by_uniqueid_cb, (char *) uniqueid, &length, OBJ_NOLOCK); | ||
try_again: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think goto is necessary here, this could be a do while or for (;;) with break
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
main/channel.c
Outdated
|
||
if (callback_data.final_match_result == CMP_RETRY_NEEDED) { | ||
/* We ran into a possible deadlock, try again */ | ||
ast_debug(3, "ast_channel_iterator_by_exten_new() Avoid deadlock trying to find channel by extension@context: %s@%s\n", exten, context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ast_debug includes the function name, no need to repeat it.
This could all be simplified to:
`ast_debug(3, "Avoided deadlock trying to find channel by %@%s\n", exten, context);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
main/channel.c
Outdated
|
||
if (ast_strlen_zero(l_name)) { | ||
/* We didn't have a name to search for so quit. */ | ||
return NULL; | ||
} | ||
|
||
chan = ast_channel_callback(ast_channel_by_name_cb, l_name, &name_len, | ||
try_name_again: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same thing here, you can use 2 loops here and goto isn't needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are goto's that bad? It reduces nesting. The only thing that adding a loop does is introduce an extra level of nesting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They're not inherently bad, this just doesn't seem like a necessary or appropriate use for them, since a loop works perfectly fine. I don't think there's a problem with adding a level of nesting, it contains the logic in its own scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
main/channel.c
Outdated
|
||
if (callback_data.final_match_result == CMP_RETRY_NEEDED) { | ||
/* We ran into a possible deadlock, try again */ | ||
ast_debug(3, "ast_channel_get_by_name_prefix() Avoid deadlock trying to find channel by uniqueid: %s\n", name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment about log format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
8fa1b68
to
e200e31
Compare
|
PRs should be raised against master, not 18 (unlike Gerrit, it makes a difference now). |
master is slightly different. It will not cleanly cherry-pick to branches from master. Is the process still go to to master first? Make separate PRs for each branch? |
Ah, okay, then it's more complicated, then you might need to do multiple sets since GitHub can only cherry pick verbatim, might want to double check with George. |
e200e31
to
52184d3
Compare
master no longer has support for macros, so the checks in ast_channel_by_exten_cb() will not cherry-pick clean with the change as-is. |
main/channel.c
Outdated
/* We found a match or the search fully completed through all channels */ | ||
break; | ||
} | ||
while (1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while should go on previous line
Though really I think for(;;)
is preferred in Asterisk for infinite loops.
main/channel.c
Outdated
/* We found a match or the search fully completed through all channels */ | ||
break; | ||
} | ||
while (1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here.
main/channel.c
Outdated
/* We found a match or the search fully completed through all channels */ | ||
break; | ||
} | ||
while (1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
Update: Been running this in the wild with mixed results. Some days it's been completely solid with no deadlocks (better than before with 3-5 deadlocks per day or more). Some days it actually almost immediately deadlocked after handling just a few calls. |
Due to the last comment I am transitioning this to a draft as it does not appear to be in a state for review as it does not resolve the underlying issue. If this is untrue and it should actually be reviewed, then it can be published. |
52184d3
to
520b377
Compare
Fixed previous issue where it did not fully fix the problem. Small refactor to avoid breaking assumptions about value of 'arg' with respect to find channel callbacks. It's now passing my tests regarding the original deadlock situation. I would nominate this ready for review now. |
include/asterisk/channel.h
Outdated
* Will be passed to cb_fn as its arg param | ||
* \param data Arbitrary data to be passed to cb_fn as its data param | ||
* | ||
* \note This function SHOULD NOT be called unlessfully understanding the implications. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unlessfully?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha... yeah I noticed that after I sent the last push. Should be 'unless fully'
* | ||
* \param arg Must be a char *, Typically either a channel name or a channel uniqueid. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typically should not be capitalized
struct ast_channel *ast_channel_callback(ao2_callback_data_fn *cb_fn, void *arg, | ||
void *data, int ao2_flags) | ||
{ | ||
ast_log(LOG_WARNING, "Using ast_channel_callback() directly may lead to deadlocks. Please update your module.\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an existing function, right? Won't that start spewing warnings all over the place? If this really was a proper fix, then everything else should be updated at the same time too. But I would hold off on that for now until that is determined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a public api function. All internal usages across the core and modules have been updated in this patch. People who may have written 3rd party modules that iterate through channels with ao2_channel_callback would need to be rewritten to use the safe version (which needs to be trylock aware) so there's no 'drop-in' replacement for ast_channel_callback.
3rd party modules would get this warning. I was thinking about it that there should be an option to turn the warning off, but I believe avoiding deadlocks is important enough it should be pretty loud (with the ability to mute).
static int does_id_conflict(const char *uniqueid) | ||
{ | ||
struct ast_channel *conflict; | ||
char *l_uniqueid = (char *) uniqueid; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're pointlessly casting to the same thing here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's dropping the const qualifier so that you don't have to cast it for the callback.
char *context = arg; | ||
char *exten = data; | ||
char *exten = (char *) callback_data->data_additional; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Redundant cast
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same... dropping the const qualifier.
|
||
if (!(i = ast_calloc(1, sizeof(*i)))) { | ||
return NULL; | ||
} | ||
|
||
i->active_iterator = (void *) ast_channel_callback(ast_channel_by_exten_cb, | ||
l_context, l_exten, OBJ_MULTIPLE); | ||
i->active_iterator = (void *) channel_find_by_exten(exten, context, OBJ_MULTIPLE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to cast to void
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was the original code that had the cast... but yeah I agree.
@@ -1388,9 +1487,10 @@ struct ast_channel_iterator *ast_channel_iterator_by_name_new(const char *name, | |||
return NULL; | |||
} | |||
|
|||
i->active_iterator = (void *) ast_channel_callback(ast_channel_by_name_cb, | |||
l_name, &name_len, | |||
i->active_iterator = (void *) ast_channel_callback_safe(ast_channel_by_name_cb, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I was thinking about this. Yes the void * cast is not required, but it does add code clarity. Normally ast_channel_callback returns a channel, but in this case with OBJ_MULTIPLE it returns an iterator containing the list of matched channels so (void *) is more applicable here for syntactic clarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this some more, the void * is required.
apps/app_directed_pickup.c
Outdated
if (target) { | ||
return target; | ||
} | ||
|
||
/* Now try a search for uniqueid. */ | ||
pickup_args.name = part; | ||
pickup_args.len = 0; | ||
return ast_channel_callback(find_by_uniqueid, NULL, &pickup_args, 0); | ||
|
||
return ast_channel_callback_safe(find_by_uniqueid, "channel uniqueid, where pickup is possible", pickup_args.name, NULL, &pickup_args, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The description is identical to line 229. Shouldn't they be different if the goal is to be unique?
Additionally I think automatically include the filename and line number is more robust than hardcoding descriptions. In which case, there should be a macro that injects these automatically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
274087e
to
59b0d33
Compare
59b0d33
to
a000ddc
Compare
One situation where deadlocks occur is when: -> res_pjsip_sdp_rtp is checking for rtp timeouts -> Which locks ao2_container channels, and then waits for a lock on a channel that's currently running IMPORT At the same time -> Dialplan asks for IMPORT(${some_channel},...) -> Which grabs a channel lock and then waits on ao2_container channels So in this situation, res_pjsip_sdp_rtp is waiting on a channel lock that will never release, because res_pjsip_sdp_rtp is currently holding ao2_container channels, that the IMPORT needs after having already locked the channel that res_pjsip_sdp_rtp is processing next. UserNote: Module developers should discontinue use of ast_channel_callback() and use ast_channel_callback_safe() instead. (This will avoid causing deadlocks) Resolves: asterisk#307
a000ddc
to
3579cc5
Compare
Pushed a new one.. fixes channel reference leaks (introduced in the previous patch) when OBJ_MULTIPLE is used to return an iterator |
While trying to extract out components from our full platform into a standalone deadlock reproduction test, I've uncovered an additional deadlock related to ast_sip_push_task_wait() also related to IMPORT import_helper() (posted to #307). I realized this is not the same deadlock that this #311 pull is addressing. So I'll work on another test to trigger this specific one as well. This is related, but probably should be split into a new issue since this other newly uncovered deadlock is not related to ao2_container *channels. |
I'm converting this back to a draft again since there are still outstanding changes requested and the cherry-pick lines still aren't correct. Speaking of which, are you saying this PR will cherry-pick cleanly from 18 to master but not the other way around? |
Correct, I don't think it goes the other way, but I can do some tests. |
One situation where deadlocks occur is when:
-> res_pjsip_sdp_rtp is checking for rtp timeouts -> Which locks
ao2_container channels, and then waits for a lock on a channel
that's currently running IMPORT
At the same time
-> Dialplan asks for IMPORT(${some_channel},...) -> Which grabs a
channel lock and then waits on ao2_container channels
So in this situation, res_pjsip_sdp_rtp is waiting on a channel lock
that will never release, because res_pjsip_sdp_rtp is currently
holding ao2_container channels, that the IMPORT needs after having
already locked the channel that res_pjsip_sdp_rtp is processing next.
Resolves: #307
cherry-pick-to: 19
cherry-pick-to: 20
cherry-pick-to: master