
[bug]: Long running channel iterator (in ChanSpy) blocks CEL HANGUP events of unrelated channels #68

Open · 1 task done
wdoekes opened this issue May 8, 2023 · 1 comment · May be fixed by #114

Labels: bug, support-level-core (Functionality with core support level)
wdoekes commented May 8, 2023

Severity

Major

Versions

master,20,19,18,16

Components/Modules

chan_spy, channels

Operating Environment

Right now I am testing with branch 18, but this should occur on all versions.

Frequency of Occurrence

Frequent

Issue Description

The ChanSpy() channel iterator holds references to channels for the duration of the listening/spying. These referenced channels are only destroyed once the channel iteration resumes (i.e. when we are done listening to the current channel). This is a problem for the CEL HANGUP event: any channel still held by the iterator gets no HANGUP report until that resumption, so anyone relying on the CEL HANGUP time for call duration statistics will get wrong values.

A.k.a. CEL HANGUP times can be wrong (late) and appear "batched".


Components

  • ChanSpy -- iterates over all channels (using e.g. ast_channel_iterator_by_name_new) and keeps the iterator until the spying is done:

    asterisk/apps/app_chanspy.c, lines 956 to 964 in 1a7866b:

    ```c
    		iter = ast_channel_iterator_by_name_new(ast_channel_name(unique_chan), 0);
    		ast_channel_unref(unique_chan);
    	} else {
    		iter = ast_channel_iterator_by_name_new(spec, strlen(spec));
    	}
    } else if (!ast_strlen_zero(exten)) {
    	iter = ast_channel_iterator_by_exten_new(exten, context);
    } else {
    	iter = ast_channel_iterator_all_new();
    ```

    asterisk/apps/app_chanspy.c, lines 994 to 999 in 1a7866b:

    ```c
    for (autochan = next_channel(iter, chan);
    	autochan;
    	prev = autochan->chan,
    	ast_autochan_destroy(autochan),
    	autochan = next_autochan ?: next_channel(iter, chan),
    	next_autochan = NULL) {
    ```

    The actual spying happens in res = channel_spy(chan, autochan, &volfactor, fd, user_options, flags, exitcontext); and only afterwards is the iterator released with iter = ast_channel_iterator_destroy(iter);
  • CEL hangup events -- are generated because of a publish by the ast_channel destructor:

    asterisk/main/channel.c, lines 2206 to 2213 in 1a7866b:

    ```c
    ast_set_flag(ast_channel_flags(chan), AST_FLAG_DEAD);
    if (ast_channel_internal_is_finalized(chan)) {
    	/* A channel snapshot should not be in the process of being staged now. */
    	ast_assert(!ast_test_flag(ast_channel_flags(chan), AST_FLAG_SNAPSHOT_STAGE));
    	ast_channel_lock(chan);
    	ast_channel_publish_final_snapshot(chan);
    ```

    ```c
    ast_channel_snapshot_set(chan, NULL);
    stasis_publish(ast_channel_topic(chan), message);
    ```

    asterisk/main/cel.c, lines 910 to 921 in 1a7866b:

    ```c
    was_hungup = ast_test_flag(&old_snapshot->flags, AST_FLAG_DEAD) ? 1 : 0;
    is_hungup = ast_test_flag(&new_snapshot->flags, AST_FLAG_DEAD) ? 1 : 0;
    if (!was_hungup && is_hungup) {
    	struct ast_json *extra;
    	struct cel_dialstatus *dialstatus = get_dialstatus(new_snapshot->base->uniqueid);
    	extra = ast_json_pack("{s: i, s: s, s: s}",
    		"hangupcause", new_snapshot->hangup->cause,
    		"hangupsource", new_snapshot->hangup->source,
    		"dialstatus", dialstatus ? dialstatus->dialstatus : "");
    	cel_report_event(new_snapshot, AST_CEL_HANGUP, event_time, NULL, extra, NULL);
    ```

To get accurate HANGUP timings from CEL, channel destruction should never be delayed for more than a short while.

But because some components (ChanSpy) can hold references to channels that have hung up, destruction can be delayed for an indeterminate amount of time.

This results in inaccurate CEL HANGUP timestamps: anyone relying on the CEL HANGUP time for call duration statistics will get wrong values.


Scenario

At t=0, we might have three channels:

  • channelA, ref (1)
  • channelB, ref (1)
  • channelC, ref (1)

At t=1 we start spying on channels using ChanSpy. The channels (hash) container is filtered and turned into a (list) container (multi_container) holding a snapshot of the then-active channels that match the filter: the multi_iterator. This list/iterator is returned to the application (chanspy):

  • channelA, +1 ref (=2) (__ao2_link() -> rb_ao2_new_node() -> "Container node creation")
  • channelB, +1 ref (=2)
  • channelC, +1 ref (=2)

At t=1, we traverse the iterator. We might decide to skip channelA and start spying on channelB:

  • channelA: +1 ref /* Bump ref of returned object */ ("Next iterator object."), last_node = channelA (replacing NULL), return channelA
  • we ignore channelA: -1 ref, get the next one
  • channelB: +1 ref /* Bump ref of returned object */ ("Next iterator object."), last_node = channelB (replacing channelA: -1 ref)
  • start listening on channelB (channel_spy)

At this point, the references look like this:

  • channelA, +1-1-1 ref (=1)
  • channelB, +1 ref (=3)
  • channelC, +0 ref (=2)

Now, at t=2, channelA and channelC hang up:

  • channelA, -1 ref (=0) -> destroying
  • channelB, +0 ref (=3)
  • channelC, -1 ref (=1)

Only at t=3, when either the spying channel or the spyee hangs up, do we exit channel_spy and resume traversing the multi_iterator:

  • channelC, -1 ref (=0) -> destroying

This causes the channelC destroying event to fire at t=3 instead of t=2. And that means that the reported CEL time for the hangup event is wrong.


Solutions

  • Either the CEL HANGUP event should not be fired by the channel destructor,
  • or no component should be allowed to hold references to channels that are hung up (read: channel iterators should be short-lived).

Someone should make a design decision on this. I would say the least impactful option is to make the chanspy iterator short-lived, by copying the IDs of the channels and iterating over those channel IDs instead.

But if we can make the HANGUP event fire sooner by moving the AST_FLAG_DEAD+publish to something earlier, that works for me too.

Cheers,
Walter Doekes
OSSO B.V.


(*) The channel iteration:

ast_channel_iterator_by_name_new() -> ast_channel_callback() -> ao2_callback_data() -> __ao2_callback_data() -> internal_ao2_traverse()

  • internal_ao2_traverse():

    ```c
    if ((flags & (OBJ_MULTIPLE | OBJ_NODATA)) == OBJ_MULTIPLE) {
    	/* we need to return an ao2_iterator with the results,
    	 * as there could be more than one. the iterator will
    	 * hold the only reference to a container that has all the
    	 * matching objects linked into it, so when the iterator
    	 * is destroyed, the container will be automatically
    	 * destroyed as well.
    	 */
    	multi_container = ao2_t_container_alloc_list(AO2_ALLOC_OPT_LOCK_NOLOCK, 0, NULL,
    		NULL, "OBJ_MULTIPLE return container creation");
    	if (!multi_container) {
    		return NULL;
    	}
    	if (!(multi_iterator = ast_calloc(1, sizeof(*multi_iterator)))) {
    ```

  • __ao2_link(multi_container, node->obj, flags, tag, file, line, func);

(an ao2_container list is created to hold the filtered channels)

next_channel() -> ast_channel_iterator_next() -> ao2_iterator_next() -> __ao2_iterator_next():

  • __ao2_iterator_next():

    ```c
    /* Bump ref of returned object */
    __ao2_ref(ret, +1, tag ?: "Next iterator object.", file, line, func);
    ```

  • asterisk/apps/app_chanspy.c, lines 994 to 999 in 1a7866b:

    ```c
    for (autochan = next_channel(iter, chan);
    	autochan;
    	prev = autochan->chan,
    	ast_autochan_destroy(autochan),
    	autochan = next_autochan ?: next_channel(iter, chan),
    	next_autochan = NULL) {
    ```

(channels are popped from the ao2_container list, but not as immediately as we would like)

Relevant log output

Many charlies, some of whom are too late:

```
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fc9b8021a58 'SIP/charlie-0000004c' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fcaa01eb458 'SIP/charlie-0000004b' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fcaa027a608 'SIP/charlie-0000004e' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fcaa0263a48 'SIP/charlie-0000004d' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fc9c001f878 'SIP/charlie-0000004f' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fcaa022a6c8 'SIP/charlie-0000005b' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fcaa0222048 'SIP/charlie-0000005d' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fc9dc01e7c8 'SIP/charlie-0000005e' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fc9e8020238 'SIP/charlie-0000005f' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fcaa01a34c8 'SIP/charlie-00000041' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fc98000cc68 'SIP/charlie-00000042' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fc97c00c418 'SIP/charlie-00000044' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fc99c01f868 'SIP/charlie-00000047' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fcaa01cb7a8 'SIP/charlie-00000046' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fcaa01e3208 'SIP/charlie-00000049' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fc9bc021b18 'SIP/charlie-00000050' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fc9c8021078 'SIP/charlie-00000052' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fcaa02a91b8 'SIP/charlie-00000053' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fc9d00211c8 'SIP/charlie-00000056' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fc9e800a908 'SIP/charlie-0000000e' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fcaa009c888 'SIP/charlie-0000000c' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fcaa00ed368 'SIP/charlie-0000001f' destroying
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fcaa028ec78 'SIP/alice-00000060' destroying
[May  8 11:50:15] DEBUG[149070][C-0000001a] channel.c: Channel 0x7fcaa014ee18 'SIP/alice-00000031' destroying
[May  8 11:50:15] DEBUG[149081][C-0000001a] channel.c: Channel 0x7fc98800bad8 'SIP/alice-00000034' destroying
[May  8 11:50:15] DEBUG[149080][C-00000023] channel.c: Channel 0x7fcaa01e0d98 'SIP/bob-00000043' destroying
[May  8 11:50:15] DEBUG[149100][C-00000023] channel.c: Channel 0x7fc98400cb18 'SIP/bob-00000045' destroying
[May  8 11:50:15] DEBUG[149096][C-0000002a] channel.c: Channel 0x7fcaa0294298 'SIP/charlie-00000051' destroying
[May  8 11:50:15] DEBUG[149110][C-0000002a] channel.c: Channel 0x7fc9c4020f38 'SIP/charlie-00000054' destroying
```

Singling out a single Charlie:

```
[May  8 11:49:49] VERBOSE[149117][C-00000031] app_chanspy.c: Spying on channel SIP/alice-00000034
[May  8 11:50:07] DEBUG[149106][C-00000026] channel.c: Channel 0x7fc9b8021a58 'SIP/charlie-0000004c' hanging up.  Refs: 3
[May  8 11:50:15] VERBOSE[149117][C-00000031] app_chanspy.c: Done Spying on channel SIP/alice-00000034
[May  8 11:50:15] DEBUG[149117][C-00000031] channel.c: Channel 0x7fc9b8021a58 'SIP/charlie-0000004c' destroying
```

As seen, the destroying is late.

Asterisk Issue Guidelines

  • Yes, I have read the Asterisk Issue Guidelines
jcolp added the support-level-core (Functionality with core support level) label and removed the triage label on May 8, 2023.
wdoekes added a commit to ossobv/asterisk that referenced this issue May 24, 2023
Because of upcoming planned changes/refactoring to app_chanspy, it is
convenient to pass some of the many arguments as a struct. This
changeset adds a channel_spy_context struct to pass around.

Related: asterisk#68
wdoekes added a commit to ossobv/asterisk that referenced this issue May 24, 2023
…tion

This moves the guts of common_exec into channel_spy_consume_iterator.
This makes refactoring/changing the code easier because there are fewer
function local variables to consider.

Related: asterisk#68
wdoekes added a commit to ossobv/asterisk that referenced this issue May 24, 2023
Refactor channel spying so it never holds on to a channel iterator.
Instead, we recreate the iterator when needed and skip channels that
we've seen already, creating the illusion of using an iterator.

This change was needed because the iterator caused the yet-unseen
channels in the iterator to be referenced by the iterator. This
reference ensured that the channel does not get destroyed. (Which is
good, because the iterator needs valid channels to work on.)

But, hanging on to a channel reference for longer than a short while
conflicts with CEL logging. The CEL hangup logging is activated by the
destruction of the channel. During chanspy activity, a bunch of channels
would stay in limbo. Only when the chanspying was done would those channels get their CEL hangup event logged.

The fix here is to hang on to channel iterators for only a short while.
An alternative fix that makes CEL hangup logging independent of channel
destruction was deemed more invasive.

This patch makes chanspy channel selection slightly more resource
intensive. But that's a small price to pay for correct CEL hangup
logging.

Fixes: asterisk#68
@slavonnet
Contributor

#313 (comment)

Is it maybe related to some code logic?
