Oct 03, 2013

Recently, a number of customers have been experiencing a rash of heartbeat issues between CS1000 components. In this article, I’m going to walk through some of the troubleshooting I’ve done recently and match symptoms to cause.

CS1000 RUDP Heartbeat behavior

Thump, Thump: The Heartbeat

There are several heartbeats between different CS1000 components. Call Servers (CPUs) have their own heartbeat; a Highly Available redundant system will have a heartbeat over the High-Speed Pipe (HSP). A CS using Geographic Redundancy (GR), Callpilot, and Avaya Aura Contact Center (AML implementations) each have a heartbeat between themselves and the Active CS. The CS and IPMGs (IP Media Gateways, i.e., Media Gateway Controllers aka MGCs, or Voice Gateway Media Cards aka VGMCs aka MC32s) also have a heartbeat.

Ports used

The RUDP Heartbeat uses port 15000 for both source and destination. The 60-byte packet has a data payload of 6 bytes. While I haven’t worked out the meaning of all 6 bytes, I have worked out that the first 4 bytes are a sequence number that increments on every successful round trip and the last 2 bytes are used as a send/receive flag (i.e., 0x02ff for the originating side, 0x0100 for the responding side).
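To make that layout concrete, here is a minimal Python sketch that decodes the 6-byte payload under my working assumptions. The big-endian byte order is my guess, and the field meanings are observations from captures, not documented facts:

```python
import struct

def parse_rudp_heartbeat(payload: bytes):
    """Decode the 6-byte RUDP Heartbeat payload.

    Assumed layout (observed, not documented): bytes 0-3 are a
    big-endian sequence number; bytes 4-5 are a send/receive flag
    (0x02ff on the originating side, 0x0100 on the responding side).
    """
    if len(payload) != 6:
        raise ValueError("expected a 6-byte heartbeat payload")
    seq, flag = struct.unpack(">IH", payload)
    role = {0x02FF: "originator", 0x0100: "responder"}.get(flag, "unknown")
    return seq, role

# Sequence number 16 with the originator flag:
print(parse_rudp_heartbeat(bytes.fromhex("0000001002ff")))  # (16, 'originator')
```

If the flag bytes in your own captures differ from these two values, that alone is worth noting when escalating to Avaya.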

RUDP

Reliable User Datagram Protocol (RUDP) was developed to add reliability to IP communication without the overhead of TCP.

Heartbeat Process

The CS sends an RUDP Heartbeat over Port 15000 to the far end device (e.g., an MGC) to port 15000. The far end device repeats the pattern back to the originating device. This repeats every 1000 milliseconds.

The far end device sends an RUDP Heartbeat over Port 15000 to the CS to port 15000. The CS repeats the pattern back to the originating device. This repeats every 1000 milliseconds.

Each RUDP Heartbeat uses its own sequence number, and each successful heartbeat causes the originating device to increment its sequence number by one. E.g., the CS sends its payload, the MGC replies, the CS increments its sequence number, and 1000 milliseconds later it sends another RUDP Heartbeat to the MGC. Meanwhile, on a separate 1000 millisecond timer and using a different sequence number, the MGC sends its payload, the CS replies, and the MGC increments its sequence number, sending another RUDP Heartbeat 1000 milliseconds after the previous one.

This means that every second there should be four (60-byte) packets passed between any two devices engaged in an RUDP Heartbeat exchange: a heartbeat from each side, as well as a response from each side.
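The bookkeeping above can be sketched in a few lines of Python. This is an illustration of the two independent counters, not CS1000 code; the starting values here are arbitrary, since the real starting value is unknown:

```python
class HeartbeatPeer:
    """One side of the RUDP Heartbeat exchange: a private sequence
    number that only advances after a successful round trip."""

    def __init__(self, name: str, start_seq: int):
        self.name = name
        self.seq = start_seq  # real starting value unknown; arbitrary here

    def send_heartbeat(self) -> tuple:
        # In a real capture this is the 60-byte packet on UDP port 15000.
        return (self.name, self.seq)

    def on_reply(self) -> None:
        self.seq += 1  # successful round trip: bump the sequence number

cs = HeartbeatPeer("CS", 100)
mgc = HeartbeatPeer("MGC", 5000)
for _ in range(3):              # three 1000 ms ticks on each side
    cs.send_heartbeat(); cs.on_reply()
    mgc.send_heartbeat(); mgc.on_reply()
print(cs.seq, mgc.seq)          # 103 5003; the two counters never interact
```

In a capture, seeing one side's sequence number advance while the other side's stalls tells you which direction of the exchange is failing.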

NOTE: I don’t have a lab system to test what number starts the sequence, but I suspect it is either randomly generated or derived from the current Unix time value. There is also probably additional handshaking during RUDP Heartbeat setup; I certainly see traffic on various ports during periods when I get SRPT0308 and ELAN009/ELAN014 messages indicating session closure/restart. Without documentation, or a lab and lots of time to decode the process, this will remain a mystery for the foreseeable future.

Sherlocking the problem (How to Troubleshoot)

HBdebug

HBdebug can be used to enable additional diagnostic information output to the PDT rd logs (rd, rdall). While the documentation claims it will output heartbeat diagnostics, I have seen SRPT0308 and ELAN009 events without a corresponding increase in rd log data.

However, I have also seen SRPT016 and ELAN009 events that do provide additional diagnostic information. Based on this difference in behavior, I’ve come to an educated guess about what it means when you get an ELAN009 with HBdebug diag info and what it means when you don’t. Without HBdebug diag info, the most likely cause of the error message is a firewall: Stateful Packet Inspection closing RUDP port 15000 heartbeat sessions for “idle session” reasons, even though the RUDP Heartbeat transmits at least 120-240 bytes per second back and forth between the two devices.

Packet Captures

Packet captures covering the event are helpful to Avaya for troubleshooting root cause. When performing a packet capture, it is best to obtain them from a mirrored port of the CS and another mirrored port of the remote device experiencing the connectivity problems. While I’ve become proficient at reading the pcap log, it will take time for any new troubleshooter to become familiar with what’s in the logs and to be able to use them to self-diagnose the problem.

I like using the Wireshark display filter ((ip.src==10.10.10.10 || ip.src==10.10.20.20) && (ip.dst==10.10.10.10 || ip.dst==10.10.20.20) && udp.port==15000). If 10.10.10.10 is the CS and 10.10.20.20 is the remote MGC, this will find any packets coming from either of them that are also destined for either of them on UDP port 15000. Adjust the IP addresses and UDP port as needed (15000, 32779 or 32780).
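For scripted analysis of an exported capture, the same filter can be expressed as a predicate over decoded packets. A minimal sketch; the dict fields here are a stand-in for whatever your capture library actually exposes:

```python
def matches_heartbeat_filter(pkt: dict,
                             a: str = "10.10.10.10",
                             b: str = "10.10.20.20",
                             port: int = 15000) -> bool:
    """Python equivalent of the Wireshark display filter above:
    both endpoints must be in {a, b}, and either UDP port must match."""
    endpoints = {a, b}
    return (pkt["src"] in endpoints and pkt["dst"] in endpoints
            and port in (pkt["sport"], pkt["dport"]))

hb = {"src": "10.10.10.10", "dst": "10.10.20.20", "sport": 15000, "dport": 15000}
print(matches_heartbeat_filter(hb))  # True
```

Counting matches per second against the expected four packets is a quick way to spot one-sided loss.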

Network Analysis

Get a network topology & configs. Look at intermediate devices like firewalls and WAN links, and check for everything from physical layer issues up to network prioritization (QoS).

Why it happens (Root causes)

  • ELAN heartbeat traffic is supposed to be treated as Real Time traffic. When you’re bridging the ELAN across multiple physical locations, or routing it across a WAN, it’s best to give this traffic Expedited Forwarding (DSCP 46) so that it is routed as one of the highest priority packets in your network. The Heartbeat is “network control” after all.
  • WAN (carrier) may not support QoS. Packet loss, jitter and latency can all cause heartbeat packets to be treated as “lost.”
  • Firewalls may not support QoS. For intra-site firewalls, the QoS tags may be dropped at the firewall; the network team needs to evaluate prioritization rules on ingress from the firewall (either side) to re-tag the packets. Even if the firewall itself doesn’t support QoS, as long as the equipment on both sides prioritizes the traffic sent to the firewall and re-tags it on egress, it doesn’t make as much of a difference.
  • Firewalls may block traffic (Access Control Lists); engage the network security team to make sure all necessary packets are allowed.
  • Speed/Duplex mismatches (or autonegotiation failures); physical layer trumps everything. Trace your cables and make sure everything is connected. Perform network testing to identify any potential issues.
  • Congestion, high network utilization, broadcast storms, etc.– Make sure there is sufficient bandwidth to serve your applications. This is rarely an issue in LAN environments, but is certainly an issue for numerous WANs.
  • Outage- equipment goes down, cables are removed/cut, WANs suffer failures reducing available bandwidth.
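On the DSCP 46 point above: EF marking can be set on any plain socket, which makes a handy way to generate test traffic with the same marking. This sketch only illustrates how DSCP 46 maps onto the TOS byte; the CS1000 marks its own traffic, so treat this as illustration, not configuration advice:

```python
import socket

# DSCP occupies the top six bits of the legacy TOS byte, so
# Expedited Forwarding (DSCP 46) becomes TOS 46 << 2 == 184 (0xB8).
EF_TOS = 46 << 2

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, EF_TOS)
# Datagrams from this socket now carry DSCP 46, provided the OS and
# intermediate network honor (and do not re-write) the marking.
sock.close()
print(hex(EF_TOS))  # 0xb8
```

Capturing such test traffic on the far side of a firewall or WAN quickly shows whether the marking survives the path.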

Other survivability considerations

  • Geographically redundant systems use RUDP Heartbeat.
  • Highly Available systems use RUDP Heartbeat over HSP, and will fail over to ELAN if possible.
  • Each CS/CPU in a CS1000E system uses RUDP Heartbeat with every other device in the environment that participates in Heartbeat activity. Unlike legacy CS1000M systems, the redundant CPU is not inactive.
  • IPMGs (VGMCs & MGCs) use RUDP Heartbeat to each CS. IPMGs and CSs also participate in an IPMG heartbeat process (using UDP Port 32779 & 32780 with a different payload structure than the RUDP Heartbeat sent over Port 15000.)
  • Branch Office CS participate in RUDP Heartbeat with the Main Office CS.
  • Secondary NRS participate in ICMP Heartbeat with Primary NRS, as do different UCM and SS systems.
  • Callpilot & AACC communicate with the CS over AML.
  • Callpilot & AACC communicate with each other over an ACCESS port connection.
  • IP Phones participate in RUDP heartbeat with the SS (as noted below, the retry/timeout settings are somewhat configurable).

Symptoms to Cause

  • Alarm pattern: CS1000 SRPT0308, ELAN009 / No reboot of remote device – Firewall Stateful Packet Inspection timer closes RUDP Heartbeat between devices.
  • Alarm pattern: CS1000 ELAN009, SRPT016 IPMG DOWN / Remote device reboots – Use LastResetReason on IPMG. Most likely due to loss of network connectivity due to packet loss, latency or jitter. Use HBdebug to perform further diagnoses and/or obtain a packet capture from both ends. Network analysis may be required to resolve.
  • IP Phones reboot – use usiQueryResetReason to obtain last reset reason. If caused by RUDP Heartbeat retry exhaust, evaluate RUDP Retry settings on linuxbase (UCM) and network cause.

Useful commands & notes

  • IP Phone RUDP status display & toggling state – Mute Up Down Up Down Up * 2 – Current RUDP state appears and one softkey is available to switch state, another softkey is available to exit. (Not available on all UNIStim releases.)
  • Linux/VGMC pbxLinkShow – Show link state, including RUDP information
  • PDT/VGMC rudpShow – Show RUDP information
  • VGMC usiQueryResetReason – Show reason for last reboot of an IP Phone
  • IPMG LastResetReason – Show reason for last reboot of an MGC
  • SS/Linux usiGetPhoneRudpSettings (7.5 or later) – Show the Retry & Timeout settings for RUDP Heartbeat between IP Phones and TPS (Terminal Proxy Server, located on the SS)
  • VGMC/SS/IPMG TPS.INI contains a retry limit called rudpWindowSize

Reference

  • Troubleshooting Guide for Distributors contains a section entitled “IPMG Call Server heartbeat mechanism” which talks more fully about the heartbeat mechanism between the CS and IPMGs. It also provides several examples of outages and the alarm pattern.
  • Search for RUDP in the Documentation for other references