[GRLUG] Linode, user expectations, problem solving

Bill Littlejohn billl at mtd-inc.com
Wed Oct 28 11:08:04 EDT 2009


On Wed, Oct 28, 2009 at 10:03 AM, Michael Mol <mikemol at gmail.com> wrote:
> I don't know how many of you have Linode accounts, or heard about the
> major Linode outage that happened last night (and is still apparently
> going on for some people), but I figured here would be as good a place
> to discuss it as any, and I'm fascinated as I watch the technical
> and social issues surrounding and following the outage.
>
> First, the explanation:
> http://www.linode.com/forums/viewtopic.php?t=4765
>
> "During a shared library update distributed to our hosts, a number of
> the hosts incorrectly have marked Linodes as being shut down. To
> recover from this we may be issuing host reboots to upgrade their
> software to our latest stack, and then bringing the Linodes to their
> last state. We're working on this now and expect to have additional
> updates shortly. We'll also be notifying those affected via our
> support ticket system. Please stand by."
>
> To be honest, that's all of their official news on the subject that
> I've read.  They also have a Twitter feed (@linode), and they have IRC
> channels (#linode) on irc.oftc.net (official; Linode even has a Mibbit
> page for connecting to it.) and irc.freenode.net (unofficial).  A live
> Twitter search for "linode" has also been interesting to watch.
>
> I've noticed a few things so far during all of this.  First, while
> they have IRC, forums and Twitter, the only communications medium that
> seemed to actually have a good deal of useful information was the IRC
> channel on OFTC.
>
> Second, they rolled out the same update (after successful testing) to
> four data centers at once--and all four data centers went down.  In my
> own experience, the data center's local network was still accessible,
> but my actual node was not. (This makes sense, seeing as it was the
> Xen VMs themselves which went down, not the physical machines or the
> network infrastructure.)  Probably not a wise move; a staged upgrade
> would have been better, but hindsight is 20/20.
>
> Third, while the New Jersey datacenter apparently has a reputation for
> having issues, the rest of them have excellent reputations, and some
> folks on Twitter remarked that this was the first major issue they'd
> seen in a year or more of being customers.  At the same time, tons of
> people were *rabidly* outraged, and were talking about moving to
> Slicehost, Red Point or other VPS providers.  (My analogy on the
> subject was "If your car was already quirky, you won't think twice
> about a bad start. If it's been perfect, you'll be in shock...")
>
> Fourth, to the best of my knowledge, all the VMs that went down (and
> came back up, at least) underwent *commanded* shutdowns, meaning that
> the host software signaled to the VM that it needed to shut down.  I
> don't know if they do this via ACPI or what, but IME it does result in
> a clean shutdown on a normally-configured guest.
>
> Fifth, though, and possibly most interesting, is that a number of
> people's VMs still haven't come back up. I don't know how many of them
> are because Linode's software may still be broken, but at least one
> guy on Twitter yesterday noted "The @linode crash has highlighted that
> mysql wasn't set to start on boot on one of my personal boxes - good
> to know (and sort out)!"
>
> That leads me to wonder how many people on Linode are running VMs that
> they fubared on their own, and which wouldn't have come back up
> cleanly even if they had issued the reboot themselves. And while
> Linode is partly to blame for the VMs getting shut down unexpectedly,
> those who misconfigured their own systems are at least partially to
> blame. Several on IRC and Twitter hadn't even checked the AJAX console
> before saying things like "Linode still down". I'm guessing they
> didn't know the difference between "fully-managed" and not...
>
> --
> :wq

I'm really kinda surprised at the reaction on the boards. I would have
expected more Linode customers to do some proper checking before
whining about it. Although, there's no doubt Linode fubared this by
doing all datacenters at once. I hope they have a reasonable
explanation for that soon.
FYI- My 2 Linode machines rebooted cleanly at Oct 27 23:31:35 and at
Oct 27 23:42:33.
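
For what it's worth, the staged upgrade Michael mentions could look
roughly like the sketch below. It is only an illustration: the
datacenter labels and the apply_update and is_healthy callables are
hypothetical stand-ins, not anything Linode has described.

```python
from typing import Callable, List, Optional, Tuple


def staged_rollout(
    datacenters: List[str],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Update one datacenter at a time and stop at the first failure.

    Returns (datacenters updated successfully, the one that failed or None).
    All three arguments are hypothetical placeholders for illustration.
    """
    updated: List[str] = []
    for dc in datacenters:
        apply_update(dc)
        if not is_healthy(dc):
            # Halt here so the remaining datacenters never get the bad update.
            return updated, dc
        updated.append(dc)
    return updated, None


if __name__ == "__main__":
    # Pretend the update breaks the second of four datacenters.
    done, failed = staged_rollout(
        ["dc-a", "dc-b", "dc-c", "dc-d"],
        apply_update=lambda dc: None,
        is_healthy=lambda dc: dc != "dc-b",
    )
    print("updated:", done, "failed:", failed)
```

The point is just that a failure in the first or second datacenter
stops the rollout, so the others are never touched.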

