* Hint: Please note that some of the outputs in this post have been recreated in a lab.

In this case, my task was to investigate why all incoming external calls failed to connect to the call-manager cluster whenever the primary call-routing call-manager server was unreachable. The cluster contained several other call-managers that could have handled calls coming from the PSTN or service provider, yet the calls always failed as soon as that one server was down. Here is a quick diagram of the topology.

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

Internal phone <----- Call Manager <--(SIP Trunk)--<-- 2811 Voice Gateway <--(ISDN Trunk)--<-- Telco or Service provider

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

As the voice gateway is the first point of entry into the network, I started my investigation from there. Here are the related configurations that I found on the voice-gateway.

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

 

dial-peer voice 1 pots

description  incoming dial-peer for pots leg
incoming called-number .
direct-inward-dial

:::::::::::::::::::::::::::::::::::::::::

dial-peer voice 5 voip
description primary dial-peer to cluster
destination-pattern ^1...$
session protocol sipv2
session target ipv4:192.168.0.99
incoming called-number .
dtmf-relay sip-kpml

:::::::::::::::::::::::::::::::::::::::::

dial-peer voice 3 voip
preference 1
description secondary dial-peer to cluster
destination-pattern ^1...$
session protocol sipv2
session target ipv4:192.168.0.55
incoming called-number .
dtmf-relay sip-kpml

:::::::::::::::::::::::::::::::::::::::::

sip-ua
timers trying 1000

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

Based on the above configuration, the intention was clear: a call arriving at the gateway from the PSTN would be routed by dial-peer 5 to the call-manager server at IP address 192.168.0.99, and if that server did not respond, the call would be re-routed by dial-peer 3 (preference 1), which points to another server in the cluster.
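The intended hunt order can be sketched in a few lines of Python. This is only a simplified model, not the gateway's actual selection logic: it assumes both dial-peers match the called number equally well, so the `preference` value alone decides the order (lower is tried first, and a dial-peer with no `preference` line uses the default of 0). The `DialPeer` class and `hunt_order` function are hypothetical names made up for this illustration.

```python
from dataclasses import dataclass


@dataclass
class DialPeer:
    tag: int
    target: str
    preference: int = 0  # default when no "preference" line is configured


def hunt_order(peers):
    """Return peers in the order they would be tried (lowest preference first).

    Simplified: assumes every peer matches the called number equally well.
    """
    return sorted(peers, key=lambda p: p.preference)


peers = [
    DialPeer(tag=3, target="192.168.0.55", preference=1),  # secondary
    DialPeer(tag=5, target="192.168.0.99"),                # primary, preference 0
]

order = hunt_order(peers)
for p in order:
    print(f"dial-peer {p.tag} -> {p.target} (preference {p.preference})")
```

In this model dial-peer 5 is tried first and dial-peer 3 second, which matches the descriptions on the two dial-peers.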

However, this was not working as expected, so I enabled SIP debugging (debug ccsip messages) and ISDN debugging (debug isdn q931). I then made sure that the primary call-manager was unreachable and placed a test call into the cluster. This is what I saw in the debugs.

 ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

An ISDN SETUP message is received from the service provider. The calling number is 5555 and the called number is 1007.

:::::::::::::::::::::::::::::::::::::::::::

*Jun 13 20:55:56.528: ISDN Se0/3/0:23 Q931: RX <- SETUP pd = 8 callref = 0x0081
Bearer Capability i = 0x8090A2
Standard = CCITT
Transfer Capability = Speech
Transfer Mode = Circuit
Transfer Rate = 64 kbit/s
Channel ID i = 0xA98381
Exclusive, Channel 1
Progress Ind i = 0x8183 - Origination address is non-ISDN
Calling Party Number i = 0x2180, '5555'
Plan:ISDN, Type:National
Called Party Number i = 0xA1, '1007'
Plan:ISDN, Type:National

 ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

The voice gateway then tries to connect the call to the call-manager at 192.168.0.99. Because I had made the primary server unreachable, there was no response from the call-manager.

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

INVITE sip:1007@192.168.0.99:5060 SIP/2.0
Via: SIP/2.0/UDP 10.10.10.3:5060;branch=
Remote-Party-ID: <sip:5555@10.10.10.3>;party=calling;screen=no;privacy=off
From: <sip:5555@10.10.10.3>;tag=1EBBA8-1ED
To: <sip:1007@192.168.0.99>
Date: Thu, 13 Jun 2013 20:55:57 GMT
Call-ID: 7A5AF3E6-D3A211E2-800D8BCB-56CC18E8@10.10.10.3
Supported: 100rel,timer,resource-priority,replaces,sdp-anat
Min-SE: 1800
Cisco-Guid: 2052580998-3550613986-2147680284-1483238696
User-Agent: Cisco-SIPGateway/IOS-12.x
Allow: INVITE, OPTIONS, BYE, CANCEL, ACK, PRACK, UPDATE, REFER, SUBSCRIBE, NOTIFY, INFO, REGISTER
CSeq: 101 INVITE
Max-Forwards: 70
Timestamp: 1371156957
Contact: <sip:5555@10.10.10.3:5060>
Expires: 180
Allow-Events: kpml, telephone-event
Content-Type: application/sdp
Content-Disposition: session;handling=required
Content-Length: 232

v=0
o=CiscoSystemsSIP-GW-UserAgent 9433 8425 IN IP4 10.10.10.3
s=SIP Call
c=IN IP4 10.10.10.3
t=0 0
m=audio 19182 RTP/AVP 18 19
c=IN IP4 10.10.10.3
a=rtpmap:18 G729/8000
a=fmtp:18 annexb=no
a=rtpmap:19 CN/8000
a=ptime:20

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

 

While collecting the debug output, I noticed that even though the gateway was getting no response from the primary call-manager, it never used the secondary dial-peer to send a SIP INVITE to the backup call-manager. It simply kept re-sending the same INVITE to the primary call-manager until the call failed. Each time the call failed, I saw the trace output below:

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

*Jun 13 20:56:06.564: ISDN Se0/3/0:23 Q931: RX <- DISCONNECT pd = 8 callref = 0x0081
Cause i = 0x82E6 - Recovery on timer expiry

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

 

 

From the above, it is clear that the call was dropped from the service provider side because a timer had expired on the ISDN circuit (cause "Recovery on timer expiry"), roughly 10 seconds after the SETUP message according to the debug timestamps. In other words, while the voice gateway was busy re-sending the same INVITE to the non-responsive call-manager, the call made no progress because the gateway could not connect the ISDN call leg to a SIP call leg, and the ISDN timers ran out.
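The two Q.931 timestamps above show how long the provider waited. Here is a small Python sketch that works out the gap between the SETUP and the DISCONNECT. The year 2013 is taken from the Date header of the INVITE, and `elapsed_seconds` is just an illustrative helper name.

```python
from datetime import datetime

# Timestamps copied from the two Q.931 debug lines (year from the SIP Date header).
setup_ts = "2013 Jun 13 20:55:56.528"
disconnect_ts = "2013 Jun 13 20:56:06.564"

FMT = "%Y %b %d %H:%M:%S.%f"


def elapsed_seconds(start, end):
    """Seconds between two debug timestamps written in the format above."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


window = elapsed_seconds(setup_ts, disconnect_ts)
print(f"Provider gave up after {window:.3f} s")  # prints: Provider gave up after 10.036 s
```

So in this capture the gateway had about 10 seconds to get the call moving before the provider cleared it.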

So I needed to stop the gateway from repeatedly sending INVITE messages to a server that was never going to respond. I started looking at the default SIP-related settings on the gateway and found this:

:::::::::::::::::::::::::::::::::::::::::::

Router#show sip-ua retry
SIP UA Retry Values
invite retry count = 6 response retry count = 6
bye retry count = 10 cancel retry count = 10
prack retry count = 10 update retry count = 6
reliable 1xx count = 6 notify retry count = 10
refer retry count = 10 register retry count = 6
info retry count = 6 subscribe retry count = 6
options retry count = 6

:::::::::::::::::::::::::::::::::::::::::::

As soon as I saw the output above, the problem was clear. The SIP user agent was configured to send 6 SIP INVITEs before giving up on a destination. The ISDN timers on the service provider side expired before the gateway had finished sending those 6 INVITEs, so the gateway never reached the point of trying the cluster through the secondary dial-peer.
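To get a rough feel for the numbers, here is a back-of-the-envelope model in Python. It is a sketch under assumptions, not a statement of how IOS schedules its retransmissions. It assumes the first wait equals the configured `timers trying` value (1000 ms) and that each later wait doubles, as RFC 3261 describes for INVITE retransmissions over UDP. The function name `time_on_unreachable_peer` and the `backoff` parameter are made up for this example, and the 10-second budget is the approximate SETUP-to-DISCONNECT gap from the Q.931 debug.

```python
def time_on_unreachable_peer(first_wait_s, invites, backoff=2.0):
    """Worst-case seconds spent before giving up on a silent server.

    Assumption: the wait after each INVITE grows by `backoff`
    (doubling by default); use backoff=1.0 for a fixed interval.
    """
    return sum(first_wait_s * backoff ** k for k in range(invites))


ISDN_WINDOW_S = 10.0  # approx. SETUP -> DISCONNECT gap seen in the Q.931 debug
TRYING_S = 1.0        # "timers trying 1000" from the sip-ua configuration

for invites in (6, 2):
    spent = time_on_unreachable_peer(TRYING_S, invites)
    if spent < ISDN_WINDOW_S:
        verdict = "fails over in time"
    else:
        verdict = "ISDN side times out first"
    print(f"{invites} INVITEs: {spent:.0f} s on primary -> {verdict}")
```

Under these assumptions, 6 INVITEs keep the gateway on the dead server for about 63 seconds, far longer than the 10 seconds or so the provider allows, while 2 INVITEs take about 3 seconds.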

To resolve the problem, I reduced the 'invite retry' value to 2, so that the gateway would send two SIP INVITEs to the primary server and, if it did not respond, forward the call to the secondary server using the secondary dial-peer.

:::::::::::::::::::::::::::::::::::::::::::

Router(config)#sip-ua

Router(config-sip-ua)#retry invite 2

:::::::::::::::::::::::::::::::::::::::::::

After the above configuration was added to the gateway, I tested everything again and the failover worked perfectly.

Hope you’ve enjoyed this entry.

Cheers