现象:近来,发现Asterisk SIP服务经常占用CPU 100%以上,而且都在服务启动1个小时后出现,但重启服务后就恢复正常

初步分析:通过tshark对网络发包测试,并未发现有明显变化,因此不是请求量变大引发的CPU紧张(从重启服务恢复正常也可以证实),因此可以确定是服务内部的问题,得益于 《Why does Asterisk consume 100% CPU?》 这篇文章,我对asterisk服务进行了深度剖析,过程如下

步骤1. 获取哪个线程(Light Weight Process)占用CPU最高

ps -LlFm -p pidof asterisk

F S UID        PID  PPID   LWP  C NLWP PRI  NI ADDR SZ WCHAN    RSS PSR STIME TTY          TIME CMD
4 - root     13090 13088     - 25   47   -   - - 522547 -     573636  - 08:43 pts/18   00:07:28 /usr/sbin/asterisk -f -vvvg -c
4 S root         -     - 13090  0    -  80   0 -     - poll_s     -   1 08:43 -        00:00:00 -
1 S root         -     - 13091  0    -  80   0 -     - futex_     -   1 08:43 -        00:00:00 -
1 S root         -     - 13092  0    -  80   0 -     - futex_     -   0 08:43 -        00:00:17 -
1 R root         -     - 13093  3    -  80   0 -     - -          -   0 08:43 -        00:01:01 -
1 S root         -     - 13094  1    -  80   0 -     - futex_     -   1 08:43 -        00:00:19 -
1 S root         -     - 13096  0    -  80   0 -     - poll_s     -   0 08:43 -        00:00:00 -
1 S root         -     - 13098  0    -  80   0 -     - futex_     -   0 08:43 -        00:00:02 -
1 S root         -     - 13100  0    -  80   0 -     - futex_     -   1 08:43 -        00:00:00 -
1 S root         -     - 13101  0    -  80   0 -     - futex_     -   0 08:43 -        00:00:00 -
1 S root         -     - 13102  0    -  80   0 -     - futex_     -   1 08:43 -        00:00:00 -
1 S root         -     - 13103  0    -  80   0 -     - futex_     -   1 08:43 -        00:00:00 -
1 S root         -     - 13104  2    -  80   0 -     - futex_     -   1 08:43 -        00:00:48 -
1 S root         -     - 13105  0    -  80   0 -     - futex_     -   1 08:43 -        00:00:00 -
1 S root         -     - 13106  0    -  80   0 -     - futex_     -   1 08:43 -        00:00:00 -
1 S root         -     - 13107  0    -  80   0 -     - futex_     -   1 08:43 -        00:00:00 -
1 S root         -     - 13108  0    -  80   0 -     - futex_     -   1 08:43 -        00:00:00 -
1 S root         -     - 13109  0    -  80   0 -     - futex_     -   1 08:43 -        00:00:00 -
1 S root         -     - 13110  0    -  80   0 -     - futex_     -   1 08:43 -        00:00:00 -
1 S root         -     - 13111  0    -  80   0 -     - futex_     -   0 08:43 -        00:00:00 -
1 S root         -     - 13112  0    -  80   0 -     - futex_     -   1 08:43 -        00:00:00 -
1 S root         -     - 13113  0    -  80   0 -     - futex_     -   1 08:43 -        00:00:00 -
1 S root         -     - 13114  0    -  80   0 -     - futex_     -   1 08:43 -        00:00:00 -
1 S root         -     - 13115  0    -  80   0 -     - inet_c     -   1 08:43 -        00:00:00 -
1 S root         -     - 13116  0    -  80   0 -     - futex_     -   1 08:43 -        00:00:00 -
1 S root         -     - 13117  0    -  80   0 -     - poll_s     -   0 08:43 -        00:00:00 -
1 S root         -     - 13118  0    -  80   0 -     - futex_     -   0 08:43 -        00:00:00 -
1 S root         -     - 13119  0    -  80   0 -     - poll_s     -   0 08:43 -        00:00:00 -
1 R root         -     - 13120 16    -  80   0 -     - -          -   0 08:43 -        00:04:52 -
1 S root         -     - 13121  0    -  80   0 -     - futex_     -   1 08:43 -        00:00:00 -
1 S root         -     - 13122  0    -  80   0 -     - futex_     -   0 08:43 -        00:00:00 -
1 S root         -     - 13123  0    -  80   0 -     - poll_s     -   1 08:43 -        00:00:00 -
1 S root         -     - 13124  0    -  80   0 -     - hrtime     -   0 08:43 -        00:00:00 -
1 S root         -     - 13125  0    -  80   0 -     - futex_     -   0 08:43 -        00:00:00 -
1 S root         -     - 13126  0    -  80   0 -     - poll_s     -   0 08:43 -        00:00:00 -
1 S root         -     - 13127  0    -  80   0 -     - futex_     -   0 08:43 -        00:00:00 -
1 S root         -     - 13128  0    -  80   0 -     - poll_s     -   1 08:43 -        00:00:00 -
1 S root         -     - 13129  0    -  80   0 -     - hrtime     -   1 08:43 -        00:00:00 -
1 S root         -     - 13130  0    -  80   0 -     - hrtime     -   0 08:43 -        00:00:00 -
1 S root         -     - 13131  0    -  80   0 -     - poll_s     -   0 08:43 -        00:00:00 -

其中C列表示线程占用CPU的百分比(这个值是自服务启动到目前的平均值,而top命令是实时,因此与top会不一致),从以上我们可以确定是13120线程占用最高,接下来我们要分析13120到底执行了什么

步骤2. 获取具体线程执行堆栈信息

pstack pidof asterisk

Thread 21 (Thread 0x7fecd94ea700 (LWP 13120)):
#0  0x00000000005aee3f in tps_cmp_cb ()
#1  0x00000000004580a5 in internal_ao2_traverse ()
#2  0x0000000000458b51 in __ao2_find ()
#3  0x00000000005aef0d in ast_taskprocessor_create_with_listener ()
#4  0x00000000005b4f1e in ast_threadpool_serializer ()
#5  0x000000000059acc1 in internal_stasis_subscribe.clone.1 ()
#6  0x00000000005a6168 in stasis_message_router_create_internal ()
#7  0x00000000004ef227 in endpoint_internal_create ()
#8  0x00007fecf0ec76c6 in build_peer () from /usr/lib/asterisk/modules/chan_sip.so
#9  0x00007fecf0ed5a48 in realtime_peer () from /usr/lib/asterisk/modules/chan_sip.so
#10 0x00007fecf0ed6646 in sip_find_peer_full () from /usr/lib/asterisk/modules/chan_sip.so
#11 0x00007fecf0ed66b7 in sip_find_peer () from /usr/lib/asterisk/modules/chan_sip.so
#12 0x00007fecf0f00aa7 in register_verify () from /usr/lib/asterisk/modules/chan_sip.so
#13 0x00007fecf0f02cb5 in handle_request_register () from /usr/lib/asterisk/modules/chan_sip.so
#14 0x00007fecf0f0d166 in handle_incoming () from /usr/lib/asterisk/modules/chan_sip.so
#15 0x00007fecf0f0ed84 in handle_request_do () from /usr/lib/asterisk/modules/chan_sip.so
#16 0x00007fecf0f10786 in sipsock_read () from /usr/lib/asterisk/modules/chan_sip.so
#17 0x0000000000514b52 in ast_io_wait ()
#18 0x00007fecf0ed4c6b in do_monitor () from /usr/lib/asterisk/modules/chan_sip.so

从以上我们可以发现与创建taskprocessors有关,于是我们看下了taskprocessors数量

asterisk -rx ‘core show taskprocessors’

5d07f1f8-56be-4456-ac64-08dd619b09a9                   2            0            1
88498730-b675-433d-b71c-eaa51fde68b6                   2            0            1
eaabf337-81d5-4411-a0dc-a05f20071c7c                   2            0            1
9dc17b55-f7cb-4ff4-bbcd-44033893eb4f                   2            0            1
afa4cfff-4cb9-451b-a5d1-7ace2e4a9040                   2            0            0
67729e31-8e78-423b-b5d5-dd746190f374                   2            0            0
b9b5e393-f016-4d09-b8fe-03d7633ef174                   2            0            0
    +---------------------+-----------------+------------+-------------+
    55585 taskprocessors

通过以上推测,应该是大量的taskprocessors占用的CPU资源,而且taskprocessors会不停增加

步骤3:结合代码,确认原因

struct ast_taskprocessor *ast_taskprocessor_create_with_listener(const char *name, struct ast_taskprocessor_listener *listener)
{
    struct ast_taskprocessor *p = ao2_find(tps_singletons, name, OBJ_KEY);

    if (p) {
        ast_taskprocessor_unreference(p);
        return NULL;
    }
    return __allocate_taskprocessor(name, listener);
}

ao2_find是根据name从任务队列中查找,一旦这个队列非常大,那CPU需要不停比较,从而会占用大量CPU资源,至此,问题基本确定了;接下来就要进入苦活阶段

步骤4:解决代码BUG

通过分析代码发现如下关系: peer->endpoint->route->subscription->mailbox

mailbox就是”struct ast_taskprocessor”即任务对象,因此可以推断应该是这个依赖链中有对象内存泄露了导致mailbox没有从任务队列中移除,通过一级一级分析,发现是peer内存泄露了,看如下代码

    } else if (!strcasecmp(curi, "*") || !expire) { /* Unregister this peer */
        /* This means remove all registrations and return OK */
        struct sip_peer* peer_db = sip_find_peer_by_db(peer->peername, NULL, TRUE, FINDPEERS, FALSE, 0);
        //ast_log(LOG_WARNING, "unregister found name%s\n",peer->peername);
        if(peer_db!=NULL && peer->uuid!=NULL && peer_db->uuid!=NULL)
        {
            //ast_log(LOG_WARNING, "db_uuid%s,peer_uuid%s\n", peer_db->uuid,peer->uuid);
            if(strcmp(peer->uuid,peer_db->uuid) == 0)
            {
                AST_SCHED_DEL_UNREF(sched, peer->expire,
                    sip_unref_peer(peer, "remove register expire ref"));
                ast_verb(3, "Unregistered SIP '%s'\n", peer->name);
                expire_register(sip_ref_peer(peer,"add ref for explicit expire_register"));
                return PARSE_REGISTER_UPDATE;
            }
            else
            {
                ast_log(LOG_WARNING, "Invalid unregister db_uuid%s,but peer_uuid%s\n", peer_db->uuid,peer->uuid);
            }
        }

peer_db是sip_find_peer_by_db得到的,而sip_find_peer_by_db是有’加引用’的,因此在没用peer_db时需要执行’去引用’,因此此处需要加sip_unref_peer(peer_db, “remove”)

当然代码中还有其他地方也是类似,这里不一一列出,改完所有泄露,重新编译运行发现taskprocessors不会不断增加,维持在30个以内,通过几个小时的运行,CPU没有增高,此问题得到的最终解决。

总结

【授之以鱼,不如授之以渔】以上只是一种CPU过高的原因,各位很可能遇到的问题不一样,但都可以借鉴上面步骤对问题进行排查。