[GRLUG] Linode, user expectations, problem solving

Michael Mol mikemol at gmail.com
Wed Oct 28 10:03:00 EDT 2009


I don't know how many of you have Linode accounts, or heard about the
major Linode outage that happened last night (and is still apparently
going on for some people), but I figured here would be as good a place
to discuss it as any, and I'm fascinated watching the technical and
social issues surrounding and following the outage.

First, the explanation:
http://www.linode.com/forums/viewtopic.php?t=4765

"During a shared library update distributed to our hosts, a number of
the hosts incorrectly have marked Linodes as being shut down. To
recover from this we may be issuing host reboots to upgrade their
software to our latest stack, and then bringing the Linodes to their
last state. We're working on this now and expect to have additional
updates shortly. We'll also be notifying those affected via our
support ticket system. Please stand by."

To be honest, that's all of their official news on the subject that
I've read.  They also have a Twitter feed (@linode), and they have IRC
channels (#linode) on irc.oftc.net (official; Linode even has a Mibbit
page for connecting to it) and irc.freenode.net (unofficial).  A live
Twitter search for "linode" has also been interesting to watch.

I've noticed a few things so far during all of this.  First, while
they have IRC, forums and Twitter, the only communications medium that
seemed to actually have a good deal of useful information was the IRC
channel on OFTC.

Second, they rolled out the same update (after successful testing) to
four data centers at once--and all four data centers went down.  In my
own experience, the data center's local network was still accessible,
but my actual node was not. (This makes sense, seeing as it was the
Xen VMs themselves which went down, not the physical machines or the
network infrastructure.)  Rolling it out everywhere at once was
probably not a wise move; a staged upgrade would have been better, but
hindsight is 20/20.
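
To illustrate what I mean by staged: push to one data center first,
let it soak for a while, and only continue if nothing regresses.  What
follows is just a rough Python sketch of that idea, not anything
Linode actually runs; deploy_to() and looks_healthy() are made-up
stand-ins for whatever push and monitoring tooling a provider really
has.

#!/usr/bin/env python
# Rough sketch of a staged rollout -- illustrative only, not Linode's tooling.
import time

DATA_CENTERS = ["dc-1", "dc-2", "dc-3", "dc-4"]  # four sites, as in this outage
SOAK_SECONDS = 60 * 60                           # let each site run for an hour

def deploy_to(dc):
    """Stand-in for pushing the shared-library update to every host in one DC."""
    print("deploying update to " + dc)

def looks_healthy(dc):
    """Stand-in for real monitoring: are the guests in this DC still running?"""
    return True

for dc in DATA_CENTERS:
    deploy_to(dc)
    time.sleep(SOAK_SECONDS)
    if not looks_healthy(dc):
        print("halting rollout; " + dc + " regressed after the update")
        break

The point isn't the code, it's the soak time between sites: with
something like that in place, only one data center would have eaten
the bad update.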

Third, while the New Jersey datacenter apparently has a reputation for
having issues, the rest of them have excellent reputations, and some
folks on Twitter remarked that this was the first major issue they'd
seen in a year or more of being customers.  At the same time, tons of
people were *rabidly* outraged, and were talking about moving to
Slicehost, Red Point or other VPS providers.  (My analogy on the
subject was "If your car was already quirky, you won't think twice
about a bad start. If it's been perfect, you'll be in shock...")

Fourth, to the best of my knowledge, all the VMs that went down (and
came back up, at least) underwent *commanded* shutdowns, meaning that
the host software signaled to the VM that it needed to shut down.  I
don't know if they do this via ACPI or what, but IME it does result in
a clean shutdown on a normally-configured guest.
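
For whatever it's worth, on a typical Linux guest the piece that turns
a power-button event from the host into a clean shutdown is acpid, and
the stock handler most distros ship is just a two-line event rule
along these lines (the exact path and action vary by distro, and I
genuinely don't know whether Linode's Xen stack goes through ACPI at
all rather than Xen's own shutdown mechanism):

# /etc/acpi/events/powerbtn -- typical location; path and contents
# vary by distro
event=button/power.*
action=/sbin/shutdown -h now

If whatever handles that on the guest is missing or misconfigured, a
"commanded" shutdown ends up looking a lot like a pulled plug.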

Fifth, though, and possibly most interesting, is that a number of
people's VMs still haven't come back up. I don't know how many of
those are down because Linode's software may still be broken, but at
least one
guy on Twitter yesterday noted "The @linode crash has highlighted that
mysql wasn't set to start on boot on one of my personal boxes - good
to know (and sort out)!"
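
That's an easy class of mistake to make: a service started by hand
keeps running until the next reboot, so a missing init link doesn't
show itself until the box actually goes down.  The fix is
distro-specific, but it amounts to something like this (service names
vary too -- mysqld vs. mysql, for instance):

# Red Hat / CentOS style:
chkconfig mysqld on
chkconfig --list mysqld

# Debian / Ubuntu style:
update-rc.d mysql defaults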

That leads me to wonder how many people on Linode are running VMs
they fubared themselves--VMs that wouldn't have come back up cleanly
even from a reboot they had issued on purpose. And while
Linode is partly to blame for the VMs getting shut down unexpectedly,
those who misconfigured their own systems are at least partially to
blame. Several on IRC and Twitter hadn't even checked the AJAX console
before saying things like "Linode still down". I'm guessing they
didn't know the difference between "fully-managed" and not...

-- 
:wq

