
[bug]: Long running channel iterator (in ChanSpy) blocks CEL HANGUP events of unrelated channels #68

Open · 1 task done
wdoekes opened this issue May 8, 2023 · 1 comment · May be fixed by #114

Labels: bug, support-level-core (Functionality with core support level)
wdoekes commented May 8, 2023

Severity

Major

Versions

master,20,19,18,16

Components/Modules

chan_spy, channels

Operating Environment

Right now I am testing with branch 18, but this should occur on all versions.

Frequency of Occurrence

Frequent

Issue Description

The ChanSpy() channel iterator holds references to channels for the duration of the listening/spying. These referenced channels are only destroyed once the channel iteration resumes (i.e. when we are done listening to the current channel). This is a problem for the CEL HANGUP event: any channel still held by the iterator gets no HANGUP report until that resumption, so anyone relying on the CEL HANGUP time for call duration statistics will get wrong values.

A.k.a. CEL HANGUP times can be wrong (late) and appear "batched".


Components

  • ChanSpy -- iterates over all channels (using e.g. ast_channel_iterator_by_name_new) and keeps the iterator until the spying is done:

    asterisk/apps/app_chanspy.c, lines 956 to 964 in 1a7866b:

    ```c
    		iter = ast_channel_iterator_by_name_new(ast_channel_name(unique_chan), 0);
    		ast_channel_unref(unique_chan);
    	} else {
    		iter = ast_channel_iterator_by_name_new(spec, strlen(spec));
    	}
    } else if (!ast_strlen_zero(exten)) {
    	iter = ast_channel_iterator_by_exten_new(exten, context);
    } else {
    	iter = ast_channel_iterator_all_new();
    ```

    asterisk/apps/app_chanspy.c, lines 994 to 999 in 1a7866b:

    ```c
    for (autochan = next_channel(iter, chan);
    	autochan;
    	prev = autochan->chan,
    	ast_autochan_destroy(autochan),
    	autochan = next_autochan ?: next_channel(iter, chan),
    	next_autochan = NULL) {
    ```

    The actual spying happens in res = channel_spy(chan, autochan, &volfactor, fd, user_options, flags, exitcontext); and only afterwards is the iterator released with iter = ast_channel_iterator_destroy(iter);
  • CEL hangup events -- are generated because of a publish by the ast_channel destructor:

    asterisk/main/channel.c, lines 2206 to 2213 in 1a7866b:

    ```c
    ast_set_flag(ast_channel_flags(chan), AST_FLAG_DEAD);
    if (ast_channel_internal_is_finalized(chan)) {
    	/* A channel snapshot should not be in the process of being staged now. */
    	ast_assert(!ast_test_flag(ast_channel_flags(chan), AST_FLAG_SNAPSHOT_STAGE));
    	ast_channel_lock(chan);
    	ast_channel_publish_final_snapshot(chan);
    ```

    ```c
    ast_channel_snapshot_set(chan, NULL);
    stasis_publish(ast_channel_topic(chan), message);
    ```

    asterisk/main/cel.c, lines 910 to 921 in 1a7866b:

    ```c
    was_hungup = ast_test_flag(&old_snapshot->flags, AST_FLAG_DEAD) ? 1 : 0;
    is_hungup = ast_test_flag(&new_snapshot->flags, AST_FLAG_DEAD) ? 1 : 0;
    if (!was_hungup && is_hungup) {
    	struct ast_json *extra;
    	struct cel_dialstatus *dialstatus = get_dialstatus(new_snapshot->base->uniqueid);
    	extra = ast_json_pack("{s: i, s: s, s: s}",
    		"hangupcause", new_snapshot->hangup->cause,
    		"hangupsource", new_snapshot->hangup->source,
    		"dialstatus", dialstatus ? dialstatus->dialstatus : "");
    	cel_report_event(new_snapshot, AST_CEL_HANGUP, event_time, NULL, extra, NULL);
    ```

To get accurate HANGUP timings from CEL, channel destruction should never be delayed for more than a short while.

But because some components (ChanSpy) can hold references to channels that have hung up, destruction can be delayed for an indeterminate amount of time.

This results in inaccurate CEL HANGUP timestamps: anyone relying on the CEL HANGUP time for call duration statistics will get wrong values.


Scenario

At t=0, we might have three channels:

  • channelA, ref (1)
  • channelB, ref (1)
  • channelC, ref (1)

At t=1 we start spying on channels using ChanSpy. The channels (hash) container is filtered and turned into a (list) container (multi_container) holding a snapshot of the then-active channels that match the filter: the multi_iterator. This list/iterator is returned to the application (chanspy):

  • channelA, +1 ref (=2) (__ao2_link() -> rb_ao2_new_node() -> "Container node creation")
  • channelB, +1 ref (=2)
  • channelC, +1 ref (=2)

At t=1, we traverse the iterator. We might decide to skip channelA and start spying on channelB:

  • channelA: +1 ref /* Bump ref of returned object */ ("Next iterator object."), last_node = channelA (replacing NULL), return channelA
  • we ignore channelA: -1 ref, get the next one
  • channelB: +1 ref /* Bump ref of returned object */ ("Next iterator object."), last_node = channelB (replacing channelA: -1 ref)
  • start listening on channelB (channel_spy)

At this point, the references look like this:

  • channelA, +1-1-1 ref (=1)
  • channelB, +1 ref (=3)
  • channelC, +0 ref (=2)

Now, at t=2, channelA and channelC hang up:

  • channelA, -1 ref (=0) -> destroying
  • channelB, +0 ref (=3)
  • channelC, -1 ref (=1)

Only at t=3, when either the spying channel or the spyee hangs up, do we exit channel_spy and resume traversing the multi_iterator:

  • channelC, -1 ref (=0) -> destroying

This causes the channelC destroying event to fire at t=3 instead of t=2. And that means that the reported CEL time for the hangup event is wrong.


Solutions

  • Either the CEL HANGUP event should not be fired by the channel destructor,
  • or no component should be allowed to hold references to channels that are hung up (read: channel iterators should be short-lived).

Someone should make a design decision on this. I would say the least impactful option is to make the chanspy iterator short-lived, by copying the IDs of the channels and iterating over those channel IDs instead.

But if we can make the HANGUP event fire sooner by moving the AST_FLAG_DEAD+publish to something earlier, that works for me too.

Cheers,
Walter Doekes
OSSO B.V.


(*) The channel iteration:

ast_channel_iterator_by_name_new() -> ast_channel_callback() -> ao2_callback_data() -> __ao2_callback_data() -> internal_ao2_traverse()

  • internal_ao2_traverse():

    ```c
    if ((flags & (OBJ_MULTIPLE | OBJ_NODATA)) == OBJ_MULTIPLE) {
    	/* we need to return an ao2_iterator with the results,
    	 * as there could be more than one. the iterator will
    	 * hold the only reference to a container that has all the
    	 * matching objects linked into it, so when the iterator
    	 * is destroyed, the container will be automatically
    	 * destroyed as well.
    	 */
    	multi_container = ao2_t_container_alloc_list(AO2_ALLOC_OPT_LOCK_NOLOCK, 0, NULL,
    		NULL, "OBJ_MULTIPLE return container creation");
    	if (!multi_container) {
    		return NULL;
    	}
    	if (!(multi_iterator = ast_calloc(1, sizeof(*multi_iterator)))) {
    ```

  • __ao2_link(multi_container, node->obj, flags, tag, file, line, func);

(an ao2_container list is created to hold the filtered channels)

next_channel() -> ast_channel_iterator_next() -> ao2_iterator_next() -> __ao2_iterator_next():

  • __ao2_iterator_next():

    ```c
    /* Bump ref of returned object */
    __ao2_ref(ret, +1, tag ?: "Next iterator object.", file, line, func);
    ```

  • asterisk/apps/app_chanspy.c, lines 994 to 999 in 1a7866b:

    ```c
    for (autochan = next_channel(iter, chan);
    	autochan;
    	prev = autochan->chan,
    	ast_autochan_destroy(autochan),
    	autochan = next_autochan ?: next_channel(iter, chan),
    	next_autochan = NULL) {
    ```

(channels are popped from the ao2_container list, but not as immediately as we would like)

Relevant log output

Many charlies, some of whom are too late:

```
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fc9b8021a58 'SIP/charlie-0000004c' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fcaa01eb458 'SIP/charlie-0000004b' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fcaa027a608 'SIP/charlie-0000004e' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fcaa0263a48 'SIP/charlie-0000004d' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fc9c001f878 'SIP/charlie-0000004f' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fcaa022a6c8 'SIP/charlie-0000005b' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fcaa0222048 'SIP/charlie-0000005d' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fc9dc01e7c8 'SIP/charlie-0000005e' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fc9e8020238 'SIP/charlie-0000005f' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fcaa01a34c8 'SIP/charlie-00000041' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fc98000cc68 'SIP/charlie-00000042' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fc97c00c418 'SIP/charlie-00000044' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fc99c01f868 'SIP/charlie-00000047' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fcaa01cb7a8 'SIP/charlie-00000046' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fcaa01e3208 'SIP/charlie-00000049' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fc9bc021b18 'SIP/charlie-00000050' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fc9c8021078 'SIP/charlie-00000052' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fcaa02a91b8 'SIP/charlie-00000053' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fc9d00211c8 'SIP/charlie-00000056' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fc9e800a908 'SIP/charlie-0000000e' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fcaa009c888 'SIP/charlie-0000000c' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fcaa00ed368 'SIP/charlie-0000001f' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fcaa028ec78 'SIP/alice-00000060' destroying
[May  8 11:50:15] DEBUG[149070][C-0000001a] channel.c: Channel 0x7fcaa014ee18 'SIP/alice-00000031' destroying
[May  8 11:50:15] DEBUG[149081][C-0000001a] channel.c: Channel 0x7fc98800bad8 'SIP/alice-00000034' destroying
[May  8 11:50:15] DEBUG[149080][C-00000023] channel.c: Channel 0x7fcaa01e0d98 'SIP/bob-00000043' destroying
[May  8 11:50:15] DEBUG[149100][C-00000023] channel.c: Channel 0x7fc98400cb18 'SIP/bob-00000045' destroying
[May  8 11:50:15] DEBUG[149096][C-0000002a] channel.c: Channel 0x7fcaa0294298 'SIP/charlie-00000051' destroying
[May  8 11:50:15] DEBUG[149110][C-0000002a] channel.c: Channel 0x7fc9c4020f38 'SIP/charlie-00000054' destroying
```

Singling out a single Charlie:

```
[May  8 11:49:49] VERBOSE[149117][C-00000031] app_chanspy.c: Spying on channel SIP/alice-00000034
[May  8 11:50:07] DEBUG[149106][C-00000026] channel.c: Channel 0x7fc9b8021a58 'SIP/charlie-0000004c' hanging up.  Refs: 3
[May  8 11:50:15] VERBOSE[149117][C-00000031] app_chanspy.c: Done Spying on channel SIP/alice-00000034
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fc9b8021a58 'SIP/charlie-0000004c' destroying
```

As seen, the destroying is late.

Asterisk Issue Guidelines

  • Yes, I have read the Asterisk Issue Guidelines
jcolp added the support-level-core (Functionality with core support level) label and removed the triage label on May 8, 2023.
wdoekes added a commit to ossobv/asterisk that referenced this issue May 24, 2023
Because of upcoming planned changes/refactoring to app_chanspy, it is
convenient to pass some of the many arguments as a struct. This
changeset adds a channel_spy_context struct to pass around.

Related: asterisk#68
wdoekes added a commit to ossobv/asterisk that referenced this issue May 24, 2023
…tion

This moves the guts of common_exec into channel_spy_consume_iterator.
This makes refactoring/changing the code easier because there are fewer
function local variables to consider.

Related: asterisk#68
wdoekes added a commit to ossobv/asterisk that referenced this issue May 24, 2023
Refactor channel spying so it never holds on to a channel iterator.
Instead, we recreate the iterator when needed and skip channels that
we've seen already, creating the illusion of using an iterator.

This change was needed because the iterator caused the yet-unseen
channels in the iterator to be referenced by the iterator. This
reference ensured that the channel does not get destroyed. (Which is
good, because the iterator needs valid channels to work on.)

But, hanging on to a channel reference for longer than a short while
conflicts with CEL logging. The CEL hangup logging is activated by the
destruction of the channel. During chanspy activity, a bunch of channels
would stay in limbo. Only when the chanspying was done would those channels get their CEL hangup event logged.

The fix here is to hang on to channel iterators for only a short while.
An alternative fix that makes CEL hangup logging independent of channel
destruction was deemed more invasive.

This patch makes chanspy channel selection slightly more resource
intensive. But that's a small price to pay for correct CEL hangup
logging.

Fixes: asterisk#68
@slavonnet
Contributor

#313 (comment)

Is it maybe related to some code logic?
