digital.forest Technical Support
Breaking News

At approximately 12:53 PST today we experienced a network issue that slowed response times from servers here at digital.forest for about four minutes.

The issue originated with one of our customers, who has redundant connections to our network core and accidentally disabled one of the protocols that make this redundancy operate correctly. As a result, our core saw duplicate addressing and had to sort out the conflict. We never lost connectivity, but response times slowed considerably until the client involved recognized and corrected the issue.

We ask that clients contemplating maintenance or changes to their own network configurations contact us ahead of time, ideally several days in advance, so that we can review the proposed changes and note any potential problems they could cause. This helps us avoid events such as today's and makes life better for us all.


posted by Chuck G. at 05:41 PM on Monday, November 10, 2008
Categories: Network

On Wednesday, October 22nd, digital.forest experienced two electrical interruptions lasting between 6 and 8 milliseconds each. The first event occurred at 13:11 and the second at 19:30 Pacific Daylight Time. These interruptions were caused by a mechanical contact switch fault inside one of our UPS units. The fault occurred on a single phase of the three-phase power within that UPS and caused a voltage drop to be passed into the datacenter along that phase. Most computers connected to that electrical phase experienced the voltage drop as a brief interruption of power. Roughly 17% of the servers at digital.forest were affected. Discovering the root cause, repairing the UPS systems, and bringing the facility back to normal operations required 3 days of hard work by the digital.forest staff, VECA Electric, and MGE, the UPS manufacturer.

The events were triggered as we switched from Bypass mode (power routed around the UPS) to Protected mode (power routed through the UPS) following a scheduled preventative maintenance. This maintenance, which is performed by the UPS manufacturer twice each year, involves taking the UPS system offline, powering it down, inspecting all components, and checking each individual battery.

Following maintenance, the UPS system must be transferred from Bypass mode to Protected mode. This transfer is a near-zero-risk operation: switchover is handled in such a way that power is not interrupted, and failure during this operation is exceedingly rare. The MGE Service Manager noted that he has seen this operation fail only one other time in his career. digital.forest has performed this switch operation twice per year as a routine part of our maintenance procedures, and until this event it had always completed without incident.

Upon completing the preventative maintenance, our UPS vendor brought the system back online. During that process a mechanical contact switch inside one of the units, UPS 2, did not close completely and so did not provide continuous electrical flow. When we first performed the operation at 13:11, the UPS signaled a fault and power was briefly interrupted on a single phase. The UPS system automatically went offline again, properly reverting to Bypass mode. Unfortunately, the interruption on that single phase lasted long enough to affect some servers downstream.

At this point neither digital.forest nor its vendors knew that a component had failed, only that the switch to Protected mode had been unsuccessful. According to the experts on-site, there was no apparent logical reason for the failure. MGE advised that we make some changes to our electrical distribution as a precautionary measure in preparation for a second transfer operation. At 19:30 power was again routed through the UPS system, and we experienced a second interruption identical to that of 13:11. At this point digital.forest ordered a stop to any further switch attempts and commenced a complete evaluation of UPS 1 & 2. MGE immediately dispatched a senior UPS engineer to our facility. Over the next two days comprehensive diagnosis and testing were performed on both UPS units, and the problems within UPS 2 were identified and repaired. After an inverter and several control and communications cards were replaced, the root cause was traced to the fault in the contact switch.

You can view photographs of the faulty contact switch, and some of the damaged circuitry here:
An overall view of the contact switch mechanism.
A close-up view of the specific Phase-A contact that failed.
A close-up of a damaged communications circuit board in UPS 2.

UPS 2 is a relatively new unit, purchased in July of 2007. The physical failure of one of its contact switches is highly unusual. In fact, the manufacturer's specifications rate this component for ten million cycles, whereas we only engage it twice each year. The failed contact switch was inspected during every previous preventative maintenance, including the one performed earlier that same day, and showed no signs of trouble.
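To put that rating in perspective, here is a rough back-of-the-envelope sketch. It assumes the only inputs are the two figures quoted above (a ten-million-cycle rating and two engagements per year) and that usage stays constant. The function and constant names are ours, chosen for illustration, and are not taken from any MGE documentation.

```python
# Back-of-the-envelope check of the contact switch rating.
# Both figures come from the paragraph above; all names are illustrative.
RATED_CYCLES = 10_000_000   # manufacturer's rated cycle count
CYCLES_PER_YEAR = 2         # engaged once per semiannual maintenance


def years_to_reach_rating(rated_cycles: int, cycles_per_year: int) -> float:
    """Years of service needed to reach the rated cycle count at a constant rate."""
    return rated_cycles / cycles_per_year


if __name__ == "__main__":
    years = years_to_reach_rating(RATED_CYCLES, CYCLES_PER_YEAR)
    print(f"Years to reach rated cycle count: {years:,.0f}")
```

At this usage rate the switch would need on the order of five million years to reach its rated cycle count, which is why normal wear was not considered a likely cause.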

Following the installation of new parts, we again closely inspected and tested every contact switch (there are 6 total) in both UPS 1 & 2. We also re-inspected and tested every other connection and circuit board inside both of these UPS units. After this comprehensive inspection, we tested the UPS units with load banks at 100% power and tested the transfer operation under artificial load to validate the diagnosis and repair. At 22:10 on Friday, October 24th, the UPS system was successfully brought online, and the datacenter was restored to normal operating conditions.
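For readers who want the timeline in hours, the sketch below computes the elapsed time between the timestamps given in this report. It assumes all three times are Pacific Daylight Time in 2008, so no time zone conversion is needed, and it measures from the first fault rather than from the start of maintenance, which this report does not state. The constant and function names are illustrative.

```python
from datetime import datetime

# Timestamps as given in this report (all Pacific Daylight Time, 2008).
FIRST_FAULT = datetime(2008, 10, 22, 13, 11)
SECOND_FAULT = datetime(2008, 10, 22, 19, 30)
RESTORED = datetime(2008, 10, 24, 22, 10)


def elapsed_hours(start: datetime, end: datetime) -> float:
    """Elapsed time between two timestamps, in hours."""
    return (end - start).total_seconds() / 3600


if __name__ == "__main__":
    print(f"Between faults: {elapsed_hours(FIRST_FAULT, SECOND_FAULT):.1f} h")
    print(f"First fault to restoration: {elapsed_hours(FIRST_FAULT, RESTORED):.1f} h")
```

Measured this way, just under 57 hours passed between the first fault and the return to Protected mode.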

While this event was traced to a small component, many larger components of our facility, and our procedures, performed as intended:


  • By design, the bypass equipment properly and automatically re-routed power when the UPS system faulted. This action contained the interruption to a very short duration, and to a limited portion of the datacenter.

  • High-level experts were immediately dispatched by our UPS vendor when it became clear that something was out of the ordinary, and parts were quickly flown in, reducing our repair time by days.

  • The backup power generation equipment carried our full electrical load continuously and flawlessly for three days.

  • Our contracted diesel fuel vendor performed as we expected, making deliveries on demand with quality product. We topped off our fuel tank on 3 separate occasions during the event.

  • Most importantly, our staff remained on-site, responsive and available to assist you with your servers and to assist our vendors with restoring our facility to normal operations.


digital.forest remains committed to providing superior service and to continually examining and maintaining all of the systems upon which our customers rely. We deeply regret any inconvenience or interruption of service this event may have caused. We appreciate the patience of our customers and the close cooperation of our partners in working through this event, and we welcome any additional questions or comments you might have.

Kind Regards,

The digital.forest Executive Team

posted by Chuck G. at 07:11 PM on Thursday, October 30, 2008
Categories:

UPDATE: 11/06/08 12:19 AM PST

At approximately 12:10 AM PST today we experienced 2 short outages on this same upstream connection (45 seconds and 38 seconds). We still have not received an RFO (reason for outage) for the first outage and will be escalating both events for resolution. During this event, as with the first, our other upstreams handled all of our traffic.

10/30/08 12:03 PM PDT
At approximately 12:03 PM PDT today one of our upstream connections went down and came back up about 45 seconds later. During this event our other upstreams took over the traffic load, so the impact should have been minimal.

We are investigating the cause of this event with the upstream provider and will post an update as soon as we have more information.

posted by Kyle at 04:08 PM on Thursday, October 30, 2008
Categories: Network

At 10:10 PM PDT we threw the bypass switch and brought UPS 1 & 2 back online. The transfer went seamlessly and everything is working within normal parameters. The facility is back on grid power with fully functioning UPS protection.

Thank you again for your patience and understanding as we dealt with this emergency situation.

posted by Chuck G. at 01:25 AM on Saturday, October 25, 2008
Categories: Emergency Maintenance