Oct 18 2013
 

In a previous post, I talked about an experience I had documenting for co-workers how to set up the CS1000E. The root cause of that documentation was the excessive amount of time I spent cleaning up after a field person who refused to install the system properly, and the subsequent complaints from customers about why the phone system went down during maintenance windows.

Having the ability to add system redundancy (or resiliency) does not necessarily mean that a customer requires said redundancy; sometimes the lack of redundancy simply isn’t a factor in the customer’s thinking. Sometimes, knowing you can prevent something (if you pay the associated costs) just isn’t worth the money or time.

That is a different decision from arguing that something doesn’t work a particular way, and today I ran into this problem with a customer who did not have the necessary redundant network connections and experienced an outage as a result.

In this particular case, the cable, the data switch port, or both went bad. Had the customer installed the necessary redundancy, the failure of a single port would not have been noticed and the system would have kept on trucking.

As part of the post-event discussion, I walked them through how the architecture supports additional redundancy and the extent to which that redundancy can be expanded. I decided to work up a diagram to more fully explain what I was talking about.

This diagram shows a CP-PM CS (Call Processor Pentium Mobile, acting as the Call Server) connected to the passthrough port on the MGC. The passthrough port lets you simulate increased CS redundancy to the ELAN network by passing through to whichever MGC ELAN interface is active.

The downside of this connection is that if the MGC undergoes maintenance, or the cable goes bad, you still have a single point of failure.

I would only do this when the environment is also going to deploy redundant TLAN/ELAN data switches for increased network resiliency. Otherwise, connecting the CS directly to the ELAN network makes more architectural sense to me. That way, if you’re installing loadware on the MGC associated with the CS, you don’t cause an outage to the entire system when the MGC is rebooted. (There are architectural decisions that can work around some of that as well, but we’re not going to cover every possible scenario in this article. Please feel free to comment below to engage in a discussion if you have questions or want to share your observations.)
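
To make that trade-off a little more concrete, here’s a quick sketch in Python. This is purely my own illustration (the component names and the simplified reachability rules are made up, and it has nothing to do with the CS1000 software itself); it just walks through which single failures knock the CS off the ELAN under each wiring option:

```python
# Rough illustration only: compare which single component failures take the CS
# off the ELAN under the two wiring options discussed above. Component names
# and the reachability rules are simplified assumptions, not vendor behavior.

SINGLE_FAILURES = ["none", "elan_switch_1", "elan_switch_2", "mgc", "cs_cable"]

def cs_elan_up(option, failed):
    """Return True if the CS can still reach the ELAN after 'failed' goes down."""
    if failed == "cs_cable":
        return False  # the CS's own cable is a single point of failure either way
    if option == "passthrough":
        # CS rides through the MGC: it needs the MGC up, plus at least one of the
        # MGC's two ELAN uplinks (each homed to a different data switch).
        return failed != "mgc"
    if option == "direct":
        # CS is cabled straight to elan_switch_1, so the MGC no longer matters
        # to the CS, but that one switch does.
        return failed != "elan_switch_1"
    raise ValueError(option)

for option in ("passthrough", "direct"):
    downs = [f for f in SINGLE_FAILURES if not cs_elan_up(option, f)]
    print(f"{option:12s} -> CS loses the ELAN when: {downs}")
```

Run it and you get the same conclusion as above: the passthrough option survives the loss of either data switch but not the loss of the MGC, while the direct option shrugs off an MGC reboot but depends entirely on the single switch the CS is plugged into.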

The diagram also shows the redundant connections from the MGC (faceplate & rear-chassis) connected to a redundant data network. NOTE: I do not show the data switch connectivity with the rest of the network; that’s beyond the scope of the CS1000 resiliency feature, but I’m sure you can get the gist of it from this article.

Apr 28 2013
 

While working for a Nortel distributor a few years ago, after answering repeat questions (sometimes the same question 4+ times from the same individual), I put together an internal wiki and documented things for all employees. The information ranged from corporate policy and forms all the way to technical procedures. It was a one-stop shop for information: any time anyone asked me a question, I put the answer on the wiki and then linked them to it. I called it, appropriately, RTFM. I could have used SharePoint, but we were operating on a budget, so I commandeered a directory on an IIS server and installed MySQL and MediaWiki. I’m sure there was more to it than that, but the point of this article isn’t to talk about my MediaWiki installation experience.

Anyway…

One of the features introduced with the Nortel CP-PM (Call Processor, Pentium Mobile) and the CS1000E chassis (I forget the exact chassis shown in the pictures below) was redundant network interfaces for the Call Server in the CS slot.

The Call Server (CS card in the middle of the picture above) supports an HSP (High Speed Pipe), a TLAN (Telephony LAN) and an ELAN (Equipment LAN) network interface, but the CS only uses the ELAN (and possibly the HSP, if you’re running a redundant CPU configuration). The CS does not use or need the TLAN, since it does not process any VoIP traffic.

The MGC (Media Gateway Controller, the card at the bottom of the picture above) supports both the TLAN and ELAN interfaces, as well as pass-through ports for both TLAN and ELAN.

The back of the chassis has a redundant TLAN and ELAN interface for the MGC card.
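
If it helps to see it condensed, here’s the same interface breakdown as a small Python structure. This is just my own summary of the description above, not anything the system itself exposes:

```python
# My own condensed summary of the interfaces described above; illustrative only.
CARD_INTERFACES = {
    "CS (CP-PM Call Server)": {
        "has":  ["ELAN", "TLAN", "HSP"],
        "uses": ["ELAN", "HSP (redundant CPU configurations only)"],  # no VoIP, so no TLAN
    },
    "MGC (Media Gateway Controller)": {
        "has":  ["ELAN", "TLAN", "ELAN passthrough", "TLAN passthrough",
                 "rear-chassis ELAN", "rear-chassis TLAN"],
        "uses": ["ELAN", "TLAN"],
    },
}

for card, ifaces in CARD_INTERFACES.items():
    print(f"{card}: has {ifaces['has']}, uses {ifaces['uses']}")
```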

 

The reason this particular wiki page came about is that I was sent to three or four sites over the course of a month to do cleanup on sites that had been installed by our “lead installer.” I say cleanup because in each installation, the problem was system instability whenever the customer performed maintenance on their data network.

You see, if you connect both the primary and redundant ELAN connections on the MGC, you can take either data switch down and the system will keep on chugging without impact. And by plugging the CS into the ELAN passthrough port, the CS can simulate an increase in network redundancy: as long as the MGC remains up, the CS retains communication with all other components over the ELAN through whichever ELAN port is still live and working on the MGC. Of course, if the MGC reboots, or both NICs go down, become unplugged, or both data switches are rebooted at once, then the whole house of cards comes down, same as if you didn’t have redundancy.
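
If it’s easier to see as a set of scenarios, here’s a tiny Python sketch of that behavior. The names and the reachability rule are my own simplification of what I just described, not anything from the CS1000 software:

```python
# Toy model of the behavior described above; names and logic are illustrative.

def cs_reachable_via_passthrough(mgc_up, elan_port_a_up, elan_port_b_up):
    """A CS plugged into the MGC ELAN passthrough stays on the ELAN as long as
    the MGC itself is up and at least one of the MGC's two ELAN uplinks
    (primary faceplate port / redundant rear-chassis port) still works."""
    return mgc_up and (elan_port_a_up or elan_port_b_up)

scenarios = {
    "normal operation":           (True,  True,  True),
    "data switch A rebooted":     (True,  False, True),
    "data switch B rebooted":     (True,  True,  False),
    "both data switches down":    (True,  False, False),
    "MGC rebooted for loadware":  (False, True,  True),
}

for name, state in scenarios.items():
    status = "CS still on the ELAN" if cs_reachable_via_passthrough(*state) else "CS isolated"
    print(f"{name:26s} -> {status}")
```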

So after going out to each of these sites and being asked to install the redundancy, I went back to our “lead installer” and asked him to make it part of his installs in the future. His response: It doesn’t work that way.

So I went to a co-worker, verified my understanding of how things worked, and then asked my co-worker to approach him and try again. The lead installer’s response: Yeah, the senior remote engineer said the same thing to me, but it doesn’t work that way.

When asked how he thought it worked, he said that the redundant ports were not redundant and that the passthrough port didn’t work.

So I proceeded to document, with photos (the ones you see above), how it worked and put together a wiki page.

 

Then I involved management.

 

For a small distributor, this kind of inefficiency (the lead installer does a bad job, then someone else has to go out and clean up the work) was not only killing project budgets and eating “warranty” hours after the customer went live… it was also putting me on-site, which meant less time to focus on my primary responsibilities.

 

I took our lab system and set it up the proper way, then I invited management to a demonstration where I showed, by unplugging and re-plugging data cables, how the stability and redundancy worked. Then I showed them what happened when certain data cables were left unplugged or unused (the way the lead installer was doing it).
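
For what it’s worth, you can watch that kind of demonstration live with nothing fancier than a ping loop. Here’s a rough example; the component names and IP addresses are placeholders, and the flags assume a Linux-style ping, so adjust for your own environment:

```python
# Hypothetical helper for a redundancy demo: keep pinging the ELAN addresses of
# a few components and print whenever any of them changes state. The addresses
# are placeholders; '-c 1 -W 1' assumes a Linux-style ping.
import subprocess
import time

TARGETS = {
    "Call Server ELAN": "10.10.10.5",
    "MGC ELAN":         "10.10.10.10",
}

def is_alive(ip):
    """Single ping with a one-second timeout; True if the host answered."""
    result = subprocess.run(["ping", "-c", "1", "-W", "1", ip],
                            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

last_state = {}
while True:  # Ctrl-C to stop
    for name, ip in TARGETS.items():
        alive = is_alive(ip)
        if last_state.get(name) != alive:  # only print on a state change
            print(f"{time.strftime('%H:%M:%S')}  {name} ({ip}) is now "
                  f"{'UP' if alive else 'DOWN'}")
            last_state[name] = alive
    time.sleep(2)
```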

 

When management addressed the issue with the lead installer, he quit rather than amend his ways. Needless to say, I was shocked by this response.