Oct 18 2013
 

In a previous post, I talked about documenting for co-workers how to set up the CS1000E. What prompted that documentation was the excessive amount of time I spent cleaning up after a field person who refused to install the system properly, and the subsequent complaints from customers about why the phone system went down during maintenance windows.

Having the ability to add system redundancy (or resiliency) does not necessarily mean that a customer requires it, and sometimes the lack of redundancy simply is not a factor in the customer's thinking. Sometimes, preventing something you know you can prevent is just not worth the associated money or time.

This is a different decision from arguing that something doesn't work a particular way, and today I ran into exactly that problem with a customer who did not have the necessary redundant network connections and experienced an outage as a result.

In this particular case, either the cable or the data switch port (or both) went bad. Had the customer installed the necessary redundancy, the failure of a single port would not have been noticed and the system would have kept on trucking.

As part of the post event discussion, I walked them through how the architecture supported additional redundancy and the extent to which that redundancy can be expanded. I decided to work up a diagram to more fully explain what I was talking about.

This diagram shows a CP-PM CS (Call Processor Pentium Mobile Call Server) connected to the passthrough port on the MGC. The passthrough port permits you to simulate increased CS redundancy on the ELAN network by passing traffic through whichever MGC ELAN interface is active.

The downside of this connection is that if the MGC undergoes maintenance, or the cable goes bad, you still have a single-point-of-failure.

I would do this primarily when the environment is also going to deploy redundant TLAN/ELAN data switches for increased network resiliency. Otherwise, connecting the CS directly to the ELAN network makes more architectural sense to me. (That way, if you're installing loadware on the MGC associated with the CS, you don't cause an outage to the entire system when the MGC is rebooted. There are architectural decisions that can work around some of that as well, but we're not going to cover every possible scenario in this article. Please feel free to comment below to engage in a discussion if you have questions or want to share your observations.)

The diagram also shows the redundant connections from the MGC (faceplate & rear-chassis) connected to a redundant data network. NOTE: I do not show the data switch connectivity with the rest of the network. That’s sort of beyond the scope of the CS1000 resiliency feature. You can, I’m sure, get the gist of it from this article.

Oct 03 2013
 

Recently, a number of customers have been experiencing a rash of heartbeat issues between CS1000 components. In this article, I’m going to walk through some of the troubleshooting I’ve done recently and match symptoms to cause.

CS1000 RUDP Heartbeat behavior

Thump, Thump: The Heartbeat

There are several heartbeats between different CS1000 components. Call Servers (CPUs) have their own heartbeat, and a highly available redundant system will have a heartbeat over the high-speed pipe (HSP). CSs using Geographical Redundancy (GR), Callpilot, and Avaya Aura Contact Center (AML implementations) each have a heartbeat between themselves and the Active CS. The CS and IPMGs (IP Media Gateways, i.e., Media Gateway Controllers aka MGCs, or Voice Gateway Media Cards aka VGMCs aka MC32s) also have a heartbeat.

Ports used

The RUDP Heartbeat uses port 15000 for source/destination. The 60 byte packet has a data payload of 6 bytes. While I haven’t worked out the meaning of the contents of all 6 bytes, I have worked out that the first 4 bytes are a sequence number that increments every successful round trip and the last 2 bytes are used as a send/receive flag. (i.e., 0x02ff for the originating side, 0x0100 for the responding side.)
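
For anyone who wants to poke at these payloads programmatically, here is a minimal Python sketch of decoding the 6-byte payload as described above. The big-endian layout (4-byte sequence number followed by the 2-byte flag) is my assumption; treat it as an illustration of the observed structure, not a specification.

    import struct

    def decode_heartbeat_payload(payload: bytes):
        """Decode the 6-byte RUDP heartbeat payload described above.

        Assumes big-endian: a 4-byte sequence number followed by a 2-byte
        send/receive flag. The byte order is an assumption, not documented.
        """
        if len(payload) != 6:
            raise ValueError("expected a 6-byte RUDP heartbeat payload")
        sequence, flag = struct.unpack(">IH", payload)
        side = {0x02FF: "originating side", 0x0100: "responding side"}.get(flag, "unknown")
        return sequence, flag, side

    # Hypothetical payload: sequence 0x000004d2 (1234) sent by the originating side
    print(decode_heartbeat_payload(bytes.fromhex("000004d202ff")))
    # -> (1234, 767, 'originating side')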

RUDP

Reliable User Datagram Protocol (RUDP) was developed to provide increased reliability for IP communication without the overhead of TCP.

Heartbeat Process

The CS sends an RUDP Heartbeat from port 15000 to port 15000 on the far-end device (e.g., an MGC). The far-end device repeats the pattern back to the originating device. This repeats every 1000 milliseconds.

The far-end device likewise sends its own RUDP Heartbeat from port 15000 to port 15000 on the CS, and the CS repeats the pattern back to the originating device. This also repeats every 1000 milliseconds.

Each RUDP Heartbeat uses its own sequence number, and each successful heartbeat causes the originating device to increment its sequence number by one. E.g., the CS sends its payload, the MGC replies, the CS increments its sequence number, and 1000 milliseconds later the CS sends another RUDP Heartbeat to the MGC. Meanwhile, on a separate 1000 millisecond timer and using a different sequence number, the MGC sends its payload, the CS replies, and the MGC increments its sequence number, sending another RUDP Heartbeat 1000 milliseconds after the previous one.

This means that every second there should be four 60-byte packets passed between any two devices engaged in an RUDP Heartbeat exchange: a heartbeat from each side, as well as a response from each side.
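
To make the two independent exchanges concrete, here is a rough Python model of one steady-state interval. This is not Avaya code; the starting sequence numbers are arbitrary placeholders, and treating the reply as an echo of the originator's sequence is an assumption. It also shows where the roughly 240 bytes per second of heartbeat traffic mentioned later comes from.

    PACKET_SIZE = 60      # bytes on the wire for each heartbeat or reply
    INTERVAL_MS = 1000    # each side runs its own 1000 ms timer

    cs_seq, mgc_seq = 100, 5000   # arbitrary starting values; the real initial value is unknown

    def one_interval(cs_seq: int, mgc_seq: int):
        """One steady-state interval: a heartbeat and a reply in each direction."""
        packets = [
            ("CS  -> MGC", "heartbeat", cs_seq),
            ("MGC -> CS ", "reply",     cs_seq),   # modeled as echoing the originator's sequence (assumption)
            ("MGC -> CS ", "heartbeat", mgc_seq),
            ("CS  -> MGC", "reply",     mgc_seq),
        ]
        # A successful round trip lets each originator bump its own counter by one.
        return packets, cs_seq + 1, mgc_seq + 1

    packets, cs_seq, mgc_seq = one_interval(cs_seq, mgc_seq)
    print(f"{len(packets)} packets, {len(packets) * PACKET_SIZE} bytes per {INTERVAL_MS} ms")
    # -> 4 packets, 240 bytes per 1000 ms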

NOTE: I don't have a lab system to test what number is used to start the sequence, but I suspect it is either randomly generated or derived from the current Unix time value. There is also probably additional handshaking during RUDP Heartbeat setup; I certainly see traffic on various ports during periods when I get SRPT0308 and ELAN009/ELAN014 messages indicating session closure/restart. Without documentation, or a lab and lots of time to decode the process, this will remain a mystery for the foreseeable future.

Sherlocking the problem (How to Troubleshoot)

HBdebug

HBdebug can be used to enable additional diagnostic information output to the PDT rd logs (rd, rdall). While the documentation claims it will output heartbeat diagnostics, I have seen SRPT0308 and ELAN009 events without a corresponding increase in rd log data.

However, I have also seen SRPT016 and ELAN009 events that do provide additional diagnostic information. Based on this difference in behavior, I've come to an educated guess about what it means when you get an ELAN009 with HBdebug diagnostic info and what it means when you don't. Without HBdebug diagnostic info, the most likely cause of the error message is a firewall, i.e., stateful packet inspection closing the RUDP port 15000 heartbeat session for "idle session" reasons, even though the RUDP Heartbeat transmits at least 120-240 bytes per second back and forth between the two devices.

Packet Captures

Packet captures covering the event are helpful to Avaya for troubleshooting root cause. When performing a packet capture, it is best to obtain them from a mirrored port of the CS and another mirrored port of the remote device experiencing the connectivity problems. While I’ve become proficient at reading the pcap log, it will take time for any new troubleshooter to become familiar with what’s in the logs and to be able to use them to self-diagnose the problem.

I like using the Wireshark filter string ((ip.src==10.10.10.10 || ip.src==10.10.20.20) && (ip.dst==10.10.10.10 || ip.dst==10.10.20.20) && udp.port==15000). If 10.10.10.10 is the CS and 10.10.20.20 is the remote MGC, this will find any packets coming from either of them that are also destined for either of them, on UDP port 15000. Adjust the IP addresses and the UDP port as needed (15000, 32779 or 32780).
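
If you would rather slice a capture in Python (for example, to look for gaps in the sequence numbers), a rough equivalent of that filter using Scapy might look like the sketch below. Scapy is an extra dependency, and the file name and addresses are placeholders; adjust them to your own capture.

    from scapy.all import rdpcap, IP, UDP   # pip install scapy

    CS_IP, MGC_IP = "10.10.10.10", "10.10.20.20"   # placeholder addresses
    HEARTBEAT_PORT = 15000                          # or 32779/32780 for the IPMG heartbeat

    for pkt in rdpcap("capture.pcap"):              # placeholder file name
        if IP not in pkt or UDP not in pkt:
            continue
        # Keep packets whose source and destination are both the CS or the MGC...
        if {pkt[IP].src, pkt[IP].dst} <= {CS_IP, MGC_IP}:
            # ...and that use the heartbeat port on either end.
            if HEARTBEAT_PORT in (pkt[UDP].sport, pkt[UDP].dport):
                print(pkt.time, pkt[IP].src, "->", pkt[IP].dst, "len", len(pkt))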

Network Analysis

Get a network topology and the relevant configs. Look at intermediate devices like firewalls and WAN links, and check for everything from physical layer issues up to network prioritization (QoS).

Why it happens (Root causes)

  • ELAN heartbeat traffic is supposed to be treated as real-time traffic. When you're bridging the ELAN across multiple physical locations, or routing it across a WAN, it's best to give this traffic Expedited Forwarding (DSCP 46) so that it is forwarded as one of the highest-priority packets in your network (see the marking sketch after this list). The Heartbeat is "network control" after all.
  • The WAN (carrier) may not support QoS. Packet loss, jitter and latency can all cause heartbeat packets to be treated as "lost."
  • Firewalls may not support QoS. For intra-site firewalls, the QoS tags may be dropped at the firewall; the network team needs to evaluate prioritization rules on ingress from the firewall (on either side) to re-tag the packets. Even if the firewall itself does not support QoS, as long as the equipment on both sides prioritizes the traffic sent to the firewall and re-tags it on egress, it doesn't make as much of a difference.
  • Firewalls may block traffic (Access Control Lists)– engage the network security team to make sure all necessary packets are allowed.
  • Speed/duplex mismatches (or autonegotiation failures): the physical layer trumps everything. Trace your cables and make sure everything is connected. Perform network testing to identify any potential physical-layer faults.
  • Congestion, high network utilization, broadcast storms, etc.– Make sure there is sufficient bandwidth to serve your applications. This is rarely an issue in LAN environments, but is certainly an issue for numerous WANs.
  • Outages: equipment goes down, cables are removed or cut, WANs suffer failures that reduce available bandwidth.
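
As a footnote to the DSCP 46 point above, the sketch below is purely illustrative: it shows how DSCP 46 maps onto the IP TOS byte and how you might mark a throwaway UDP probe while checking whether markings survive the path. The CS1000 components set their own markings; this is not their code, and the address and port are placeholders.

    import socket

    DSCP_EF = 46
    TOS_VALUE = DSCP_EF << 2   # DSCP sits in the top 6 bits of the TOS byte: 46 << 2 = 184 (0xB8)

    # Mark a test UDP probe with EF and send it toward the far-end ELAN address.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS_VALUE)   # may be ignored/restricted on some OSes
    sock.sendto(b"qos-probe", ("10.10.20.20", 15000))              # placeholder address and port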

Other survivability considerations

  • Geographically redundant systems use RUDP Heartbeat.
  • Highly Available systems use RUDP Heartbeat over HSP, and will fail over to ELAN if possible.
  • Each CS/CPU in a CS1000E system uses RUDP Heartbeat with every other device in the environment that participates in Heartbeat activity. Unlike legacy CS1000M systems, the redundant CPU is not inactive.
  • IPMGs (VGMCs & MGCs) use RUDP Heartbeat to each CS. IPMGs and CSs also participate in an IPMG heartbeat process (using UDP Port 32779 & 32780 with a different payload structure than the RUDP Heartbeat sent over Port 15000.)
  • Branch Office CS participate in RUDP Heartbeat with the Main Office CS.
  • Secondary NRS participate in ICMP Heartbeat with Primary NRS, as do different UCM and SS systems.
  • Callpilot & AACC communicate with the CS over AML.
  • Callpilot & AACC communicate with each other over an ACCESS port connection.
  • IP Phones participate in RUDP heartbeat with the SS (as noted below, this is somewhat configurable).

Symptoms to Cause

  • Alarm pattern: CS1000 SRPT0308, ELAN009 / No reboot of remote device – Firewall Stateful Packet Inspection timer closes RUDP Heartbeat between devices.
  • Alarm pattern: CS1000 ELAN009, SRPT016 IPMG DOWN / Remote device reboots – Use LastResetReason on the IPMG. Most likely caused by loss of network connectivity from packet loss, latency or jitter. Use HBdebug to diagnose further and/or obtain a packet capture from both ends. Network analysis may be required to resolve; a small triage sketch follows this list.
  • IP Phones reboot – Use usiQueryResetReason to obtain the last reset reason. If caused by RUDP Heartbeat retry exhaustion, evaluate the RUDP retry settings on linuxbase (UCM) and investigate the network cause.
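
If you triage these patterns often, a small lookup table like the hypothetical Python sketch below (it encodes nothing beyond the mapping above) can keep the first-pass diagnosis consistent.

    # Hypothetical triage helper; it only encodes the symptom-to-cause mapping above.
    SYMPTOM_TO_CAUSE = {
        (frozenset({"SRPT0308", "ELAN009"}), False):
            "Firewall stateful packet inspection idle timer closing the RUDP heartbeat session",
        (frozenset({"ELAN009", "SRPT016"}), True):
            "Loss of network connectivity (packet loss, latency, jitter); "
            "check LastResetReason, enable HBdebug, and capture packets at both ends",
    }

    def triage(alarms, remote_rebooted: bool) -> str:
        key = (frozenset(alarms), remote_rebooted)
        return SYMPTOM_TO_CAUSE.get(key, "Unknown pattern: gather captures and the network topology")

    print(triage(["SRPT0308", "ELAN009"], remote_rebooted=False))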

Useful commands & notes

  • IP Phone RUDP status display & toggling state – Mute Up Down Up Down Up * 2 – Current RUDP state appears and one softkey is available to switch state, another softkey is available to exit. (Not available on all UNIStim releases.)
  • Linux/VGMC pbxLinkShow – Show link state, including RUDP information
  • PDT/VGMC rudpShow – Show RUDP information
  • VGMC usiQueryResetReason – Show reason for last reboot of an IP Phone
  • IPMG LastResetReason – Show reason for last reboot of an MGC
  • SS/Linux usiGetPhoneRudpSettings (7.5 or later) – Show the Retry & Timeout settings for RUDP Heartbeat between IP Phones and TPS (Terminal Proxy Server, located on the SS)
  • VGMC/SS/IPMG TPS.INI contains a retry limit called rudpWindowSize

Reference

  • Troubleshooting Guide for Distributors contains a section entitled “IPMG Call Server heartbeat mechanism” which talks more fully about the heartbeat mechanism between the CS and IPMGs. It also provides several examples of outages and the alarm pattern.
  • Search for RUDP in the Documentation for other references

Apr 28 2013
 

While working for a Nortel distributor a few years ago, after answering repeat questions (sometimes the same question 4+ times from the same individual), I put together an internal wiki and documented things for all employees. This information ranged from corporate policy and forms all the way to technical procedures. It was a one-stop shop for information: any time anyone asked me a question, I put the answer on the wiki and then linked them to it. I called it, appropriately, RTFM. I could have used SharePoint, but we were operating on a budget, so I commandeered a directory on an IIS server and installed MySQL and MediaWiki. I'm sure there was more to it than that, but the point of this article isn't to talk about my MediaWiki installation experience.

Anyway…

One of the features introduced with the Nortel CP-PM (Call Processor, Pentium Mobile) and CS1000E chassis (I forget the exact chassis shown in the pictures below) was the redundant Network Interfaces for the Call Server in the CS slot.

The Call Server (the CS card in the middle of the picture above) supports an HSP (High Speed Pipe), a TLAN (Telephony LAN) and an ELAN (Equipment LAN) network interface, but the CS only uses the ELAN (and the HSP if you are running a redundant CPU configuration). The CS does not use or need the TLAN, since it does not process any VoIP traffic.

The MGC (Media Gateway Controller– the card on the bottom of the picture above) supports both the TLAN and ELAN interface, as well as pass-through for both TLAN and ELAN.

The back of the chassis has a redundant TLAN and ELAN interface for the MGC card.

 

This particular wiki page came about because I was sent to three or four sites over a one-month period to clean up installations done by our "lead installer." I say cleanup because in each case the problem was system instability due to customer maintenance of their data network.

You see, if you connect both the primary and redundant ELAN connections on the MGC, you can take either data switch down and the system will keep on chugging without impact. And by plugging the CS into the ELAN passthrough port, the CS can simulate an increase in network redundancy: as long as the MGC remains up, the CS retains communication with all other components over the ELAN through whichever ELAN port is still live and working on the MGC. Of course, if the MGC reboots, or if both NICs go down, become unplugged, or both data switches are rebooted, then the whole house of cards comes down, same as if you didn't have redundancy.

So after going out to each of these and being asked to install the redundancy, I went back to our “lead installer” and asked him to make it part of his installs in the future. His response: It doesn’t work that way.

So I went to a co-worker, verified my understanding of how things worked, and then asked my co-worker to approach him and try again. The lead installer's response: Yeah, the senior remote engineer said the same thing to me, but it doesn't work that way.

When asked how he thought it worked, he said that the redundant ports were not redundant and that the passthrough port didn’t work.

So I proceeded to document, with photos (the ones you see above), how it worked and put together a wiki page.

 

Then I involved management.

 

For a small distributor, this kind of inefficiency (the lead installer does a bad job, then someone else has to go out and clean up the work) was not only killing budgets on projects and eating "warranty" hours after the customer went live… it was also putting me on-site, which meant less time to focus on my primary responsibilities.

 

I took our lab system and set it up the proper way, then invited management to a demonstration where I showed, by unplugging and plugging in data cables, how the stability/redundancy worked. Then I showed them what happened when certain data cables were left unplugged or unused (the way the lead installer was doing it).

 

When management addressed the issue with the lead installer, he quit rather than amend his ways. Needless to say, I was shocked by this response.