[GRLUG] Linode, user expectations, problem solving

Adam Tauno Williams awilliam at whitemice.org
Wed Oct 28 11:20:32 EDT 2009


On Wed, 2009-10-28 at 10:03 -0400, Michael Mol wrote:
> I don't know how many of you have Linode accounts, or heard about the
> major Linode outage that happened last night (and is still apparently
> going on for some people), but I figured here would be as good a place
> to discuss it as any, and I'm fascinated as I watch the technical
> and social issues surrounding and following the outage.
> First, the explanation:
> http://www.linode.com/forums/viewtopic.php?t=4765
> "During a shared library update distributed to our hosts, a number of
> the hosts incorrectly have marked Linodes as being shut down. To
> recover from this we may be issuing host reboots to upgrade their
> software to our latest stack, and then bringing the Linodes to their
> last state. We're working on this now and expect to have additional
> updates shortly. We'll also be notifying those affected via our
> support ticket system. Please stand by."

Well, VMware shut down VMs all over via an update a while back - and
prevented them from rebooting!   And then there is the fiasco that is
Gmail.  While this sucks, and superficially seems kind of dumb, I'm not
terribly surprised or upset - managing large systems is REALLY hard and
I expect a blip or two occasionally.

But, my VM didn't go down.

aleph:~ # uptime
 10:56:33 up 58 days,  4:55,  1 user,  load average: 0.44, 0.14, 0.09

> Second, they rolled out the same update (after successful testing) to
> four data centers at once--and all four data centers went down.  In my
> own experience, the data center's local network was still accessible,
> but my actual node was not. (This makes sense, seeing as it was the
> Xen VMs themselves which went down, not the physical machines or the
> network infrastructure.)  Probably not a wise move; A staged upgrade
> would have been better, but hindsight is 20/20.

Even if a staged update was possible, it can get thorny to have
hypervisors at different rev levels - which is a problem that, AFAIK,
none of the hypervisor vendors have 100% (or I should say 98%) solved
yet.

> Third, while the New Jersey datacenter apparently has a reputation for
> having issues, the rest of them have excellent reputations, and some
> folks on twitter remarked that this was the first major issue they'd
> seen in a year or more of being customers. 

I only moved to Linode after hearing from several people about their
very solid experience with them.

>  At the same time, tons of
> people were *rabidly* outraged,

Of course, many people enjoy being rabid.  I pay ~$20 a month - I can't
really justify getting rabid over such a pittance.  Although I have been
known to fly off the handle on occasion.

>  and were talking about moving to
> Slicehost, Red Point or other VPS providers.  (My analogy on the
> subject was "If your car was already quirky, you won't think twice
> about a bad start. If it's been perfect, you'll be in shock...")

Or some people are prone to overreacting and porting everything to
something they don't really know is any better.

It's the same impulse that drives Linux distribution hopping - little in
FOSS land wastes more time and energy than that.

> Fifth, though, and possibly most interesting, is that a number of
> people's VMs still haven't come back up. 

If I were down that long I would be annoyed, but still just $20-a-month
annoyed.  Given the quality of Linode's customer service I have no doubt
those whose servers were down for an extended period of time will be
offered commensurate compensation.

> I don't know how many of them
> are because Linode's software may still be broken, but at least one
> guy on Twitter yesterday noted "The @linode crash has highlighted that
> mysql wasn't set to start on boot on one of my personal boxes - good
> to know (and sort out)!"

Yeah, I've been bitten by that one before.  I now habitually
reboot-test - but it ruins my uptime numbers. :(

Or maybe not so much...

gourd-amber:~ # uptime
 11:11am  up 890 days 15:13,  1 user,  load average: 0.21, 0.19, 0.12

Yikes!
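
Of course, you can also ask the init system whether a service is
registered to start on boot without rebooting at all.  A minimal
sketch, assuming a chkconfig-based distro (Red Hat/SUSE style - the
service name "mysql" is whatever your distro actually calls it, and a
Debian-like system would use update-rc.d instead):

gourd-amber:~ # chkconfig --list mysql
mysql     0:off  1:off  2:off  3:off  4:off  5:off  6:off
gourd-amber:~ # chkconfig mysql on
gourd-amber:~ # chkconfig --list mysql
mysql     0:off  1:off  2:off  3:on   4:off  5:on   6:off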

> That leads me to wonder how many people on Linode are running VMs that
> they fubared on their own, and which wouldn't have successfully
> rebooted on their own had they commanded them themselves. And while
> Linode is partly to blame for the VMs getting shut down unexpectedly,
> those who misconfigured their own systems are at least partially to
> blame.

Yep.

> Several on IRC and Twitter hadn't even checked the AJAX console
> before saying things like "Linode still down". I'm guessing they
> didn't know the difference between "fully-managed" and not...

I'd be willing to wager on it, at long odds preferably.


