BAUT Outage



Fraser
2011-Apr-23, 10:09 PM
As many of you noticed, BAUT was unavailable for the last 3 days or so. This was caused by a massive failure by Amazon's EC2 hosting service - a system that shouldn't have failed (and the Titanic is unsinkable).

Anyway, we're back online now, and I suspect Amazon is going to be having some serious soul searching at this point. Their entire business rests on this mistake never happening again.

I'm sure a bunch of you are wondering if we're going to stick with Amazon after this disaster. And my answer right now is a provisional yes. BAUT had been working great on EC2 until a couple of days ago, and I'm able to provide the service for free on the Universe Today server.

Anyway, thanks for your support, and let me know if you see any more problems.

Nereid
2011-Apr-23, 10:11 PM
UT seems to be (still) down ...

Fraser
2011-Apr-23, 10:41 PM
Yeah, it was back and now it's gone again. We're working on it.

Trakar
2011-Apr-23, 10:41 PM
As many of you noticed, BAUT was unavailable for the last 3 days or so. This was caused by a massive failure by Amazon's EC2 hosting service - a system that shouldn't have failed (and the Titanic is unsinkable).

Anyway, we're back online now, and I suspect Amazon is going to be having some serious soul searching at this point. Their entire business rests on this mistake never happening again.

I'm sure a bunch of you are wondering if we're going to stick with Amazon after this disaster. And my answer right now is a provisional yes. BAUT had been working great on EC2 until a couple of days ago, and I'm able to provide the service for free on the Universe Today server.

Anyway, thanks for your support, and let me know if you see any more problems.

Actually, the service hasn't been up to the old standards in a long time, and it has been seriously degrading toward previous server standards for the last week or two, culminating in this final lockout of the last few days. I understand the difficulties of changing servers, but these types of delays and problems chase off regular posters and readers.

Jeff Root
2011-Apr-23, 10:57 PM
Where can I learn more about the Amazon problem?

I didn't see any problems with other websites, so I assumed it
was specific to BAUT and Universe Today. I Googled "BAUT"
and looked at Phil's blog on Discovery.com hoping to find news
elsewhere explaining the problem, but didn't find anything.

-- Jeff, in Minneapolis

slang
2011-Apr-23, 11:33 PM
Where can I learn more about the Amazon problem?

There are some links in this ApolloHoax forum thread (http://apollohoax.proboards.com/index.cgi?board=general&action=display&thread=3151).

KaiYeves
2011-Apr-23, 11:43 PM
By a very weird coincidence, this is the first time I've had time to go on BAUT since Wed., so I guess I missed the whole thing.

Van Rijn
2011-Apr-23, 11:47 PM
I'm glad to see BAUT back!


Where can I learn more about the Amazon problem?


Google "amazon cloud problems"

Also, this status page gives some of the technical details (though there is still a lot they haven't explained):

http://status.aws.amazon.com/

Fraser
2011-Apr-23, 11:47 PM
Hundreds of thousands of sites were affected:
http://bits.blogs.nytimes.com/2011/04/21/amazon-cloud-failure-takes-down-web-sites/?ref=amazoninc

Buttercup
2011-Apr-24, 01:48 AM
I'm just glad we're back!! :D

BAUT is a wonderful way for me to pass some time in a home office, wherein I work alone 40 hours a week.

p.s.: The "Universe Today" web page was also down during the outage. You probably know that. Combined, I was a bit worried!

Swift
2011-Apr-24, 01:50 AM
I've looked at clouds from both sides now,
From up and down, but still somehow,
It's clouds' illusions I recall.
I really don't know clouds... at all...
:D

Noclevername
2011-Apr-24, 01:51 AM
It's back! I'm doing my happy dance! *



* I do not have a happy dance. Stop picturing me doing one.

Buttercup
2011-Apr-24, 01:51 AM
I've looked at clouds from both sides now,
From up and down, but still somehow,
It's clouds' illusions I recall.
I really don't know clouds... at all...
:D

I want what Swift is having! :D

Swift
2011-Apr-24, 01:53 AM
It's back! I'm doing my happy dance! *



* I do not have a happy dance. Stop picturing me doing one.
http://freeemoticonsandsmileys.com/animated%20emoticons/Dancing%20Animated%20Emoticons/happy%20dance%20penguin.gif

Noclevername
2011-Apr-24, 01:59 AM
http://freeemoticonsandsmileys.com/animated%20emoticons/Dancing%20Animated%20Emoticons/happy%20dance%20penguin.gif

:clap::lol::naughty:

Fazor
2011-Apr-24, 02:03 AM
I was really astonished at the severity and duration of Amazon's outage -- there were many sites much more popular than BAUT that were also affected. It was annoying; but what can you do.

Now I have to go check to see if the PSN (Playstation Network) outage has also been resolved -- as it seems very odd to me that two such major networks would have gone down on the same day and yet be totally unrelated . . .

Solfe
2011-Apr-24, 02:09 AM
Welcome back BAUTForum!

Noclevername
2011-Apr-24, 02:34 AM
I just clicked off BAUT and then back onto it again, just for the sheer joy of knowing I could. :)

baric
2011-Apr-24, 02:44 AM
I dunno. It never seems like a good idea to place all of your IT infrastructure in the hands of someone else.

For example, if they unexpectedly go belly up then you could permanently lose access to your data. I guess I'm an old-school Internet 2.0 guy.

Extravoice
2011-Apr-24, 02:58 AM
This was caused by a massive failure by Amazon's EC2 hosting service - a system that shouldn't have failed (and the Titanic is unsinkable).

I wonder if this will be a major setback in the acceptance of cloud computing?
A three day loss of service is an annoyance for BAUT, but if your business depended on it, the results could be a whole lot worse.

On the bright side, think of all the "papers" this incident will generate for computer science conferences. :)

baric
2011-Apr-24, 03:12 AM
I wonder if this will be a major setback in the acceptance of cloud computing?

Doubtful. Cloud computing is just a fancy marketing term for outsourcing your IT services. If it really took a PR hit from this, they'd just invent a new term.

pepiboy32
2011-Apr-24, 04:19 AM
thought it was just me - done a system restore and paid several visits to major geeks....

thank goodness!

Van Rijn
2011-Apr-24, 04:26 AM
I wonder if this will be a major setback in the acceptance of cloud computing?


I think this has been an eye opener for a lot of people. At my job, we're looking at some major moves into using cloud computing, so I've been very interested in this. It would be a disaster for some of our systems to be out of service for this long. I'm sure it won't stop our use of cloud computing, but it will be a great example for some of the folks that have gotten a bit too enthusiastic about it.



A three day loss of service is an annoyance for BAUT, but if your business depended on it, the results could be a whole lot worse.


I have seen some scary stories related to this. In the Amazon developer forum there was a post by a company pleading for help, claiming that they used cloud computing for at-home EKG monitoring for people at risk of heart attacks, and they apparently hadn't planned on non-Amazon backup. Apparently they had assumed Amazon already provided sufficient backup, an extremely poor assumption, but it probably wasn't helped by marketing hype. Hopefully, this will stop others from making the same kind of mistake.

NickW
2011-Apr-24, 06:22 AM
Now I have to go check to see if the PSN (Playstation Network) outage has also been resolved

I can tell you that as of about 2 minutes ago, it isn't. They have the "undergoing maintenance" screen still up. Glad I haven't been in the mood to play online lately.

WaxRubiks
2011-Apr-24, 06:59 AM
It was the start of the Skynet attack.


Thursday marked Judgement Day.

The day, according to the Terminator franchise, that military computer Skynet would become self-aware and turn the machines against mankind. While killer robots do not, as yet, appear to be travelling back in time, 21 April was marked by a few computer glitches, including problems at Amazon's web hosting service. On the company's web forum someone asked if it was Skynet's fault. In an outbreak of humour one representative apologised for the issues, before adding: "Skynet did not have anything to do with the service event at this time."
http://www.independent.co.uk/news/business/news/business-diary-judgement-day-for-amazon-2273788.html

elizabeth25
2011-Apr-24, 07:48 AM
so glad BAUT is back, i was getting a bit frustrated, felt like i was getting withdrawals from it :shhh:

i like your penguin dance swift, very cool :D

jlhredshift
2011-Apr-24, 12:19 PM
I too wondered on Thursday if it was local. I felt cut off from the world, literally. I also realized how few email addresses I have for the members I would normally converse with. (Cold and alone in a cave by myself.) Well, I got some reading done.

Jim
2011-Apr-24, 03:03 PM
Last Wednesday I left a note for the other Mods saying I would be gone Thursday through Saturday.

It's not my fault! I did not mean to take BAUT with me!

slang
2011-Apr-24, 03:09 PM
BAUTages are terrible!

Tensor
2011-Apr-24, 03:14 PM
Last Wednesday I left a note for the other Mods saying I would be gone Thursday through Saturday.

It's not my fault! I did not mean to take BAUT with me!

If so, you shoulda left the keys by the door when you left.

I was wondering on Thursday, then figured BAUT was on Amazon when I heard about the outage Thursday evening.

grav
2011-Apr-24, 04:49 PM
I found this site (http://isthatsitedown.com/www.bautforum.com.html) when trying to determine if there was a problem with BAUT or my own computer. It said BAUT was down at the time, and of course other sites can be found out this way, so it may come in handy.

emmylou
2011-Apr-24, 05:21 PM
Pleased ya back :) was having withdrawal symptoms ;)

Swift
2011-Apr-24, 06:48 PM
I found this site (http://isthatsitedown.com/www.bautforum.com.html) when trying to determine if there was a problem with BAUT or my own computer. It said BAUT was down at the time, and of course other sites can be found out this way, so it may come in handy.
Nice find grav.

One of the first things I check when BAUT seems to be down is the Universe Today site. Since they are on the same server, they rise and fall together. Second, I found it helpful to check the Universe Today Twitter feed. Fraser posted several updates on there, and they are independent of the BAUT/UT server.

HenrikOlsen
2011-Apr-24, 07:23 PM
One of those cases that can make you think.
Which is preferable, 2 tech guys to fix anything and you're customer #1 or 200 tech guys to fix anything but you're customer #100,000?

Gillianren
2011-Apr-25, 01:21 AM
Just as well. I wasn't home anyway. I always miss so much when I'm gone for a few days.

Fazor
2011-Apr-25, 03:42 AM
I can tell you that as of about 2 minutes ago, it isn't. They have the "undergoing maintenance" screen still up. Glad I haven't been in the mood to play online lately.

Shortly after I posted that, Sony made an official statement saying that due to a detected security breach, they had taken down the PSN and are working on rebuilding a new network sans weak point. So, apparently, the outages were unrelated. And, from the sounds of it, it may be a bit longer before Playstationers get their networking back.

But that's unrelated to BAUT. Just glad I have this place back.

parallaxicality
2011-Apr-25, 07:53 AM
Do we know if the Amazon cloud and PSN outages are connected? If they aren't, that's a heck of a coincidence.

NickW
2011-Apr-25, 02:47 PM
Supposedly, they are not connected. PSN is saying they had some sort of intrusion in their system, and decided to rebuild their system while it was down.

Jeff Root
2011-Apr-29, 01:31 AM
Are BAUT and Universe Today on the same physical computer?

Are they even limited to one computer each, or are they distributed
over several computers? Before this incident, I had understood
(or, apparently MIS-understood) "cloud computing" to mean that the
database is distributed in a cloud of computers. But it now appears
that the term for that may be "distributed computing", while "cloud
computing" means nothing more than that the database is located
somewhere other than the database owner's equipment.

-- Jeff, in Minneapolis

HenrikOlsen
2011-Apr-29, 02:15 AM
"Cloud" is a very fuzzy concept, 87% of which is marketing hype.

What it basically boils down to in this case is that the physical server that runs things doesn't matter; it's a virtual server which can run on any of a large number of physical servers and can move around between them depending on resource need and which servers are running. The file system(s) related to the virtual server are run on other servers as well (actually they are presented as disk devices to the virtual server, which can then do with them what it wants).
The database may or may not be on one or more other servers.

BAUT and Universe Today are on the same virtual server (I don't know about the database), which means that whichever physical server is running one is also running the other.

What happened at this crash was that a network outage triggered a cascade of administrative traffic, which is what controls what's running where, which wedged the mirroring processes that are supposed to ensure that the file systems are safe, which blocked the virtual servers from accessing their disks, causing them to fail to start, triggering even more administrative traffic, taking everything down in a spectacular fail.

From outside, it looks like they've been running at marginal capacity for a while, without having the capacity needed to take the full brunt of everything thrashing at the same time.
This can work for a while and is a situation that can be nearly impossible to notice you're getting into when things are going well, because everything is working well, right up until the time when something triggers a response from everything at the same time. Then things lock up hard, because suddenly the queue of things the servers have to do is so large that it slows them down enough to trigger even more things to do, resulting in a queue that grows faster than there are resources to cope with.
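
A toy simulation can make that feedback loop concrete. This is only a sketch under made-up assumptions, not a model of Amazon's actual systems: the function name `simulate_queue`, the rule that a backlog past a threshold generates extra retry work, and all of the numbers are hypothetical and chosen purely for illustration.

```python
def simulate_queue(arrivals, capacity, retry_threshold, retry_factor,
                   ticks, burst_at=None, burst_size=0.0):
    """Return the backlog after each tick of a single work queue.

    Hypothetical rule: once the backlog exceeds retry_threshold, the
    excess generates extra work (timeouts, retries, re-mirroring) at
    retry_factor units of new work per unit of excess backlog.
    """
    backlog = 0.0
    history = []
    for tick in range(ticks):
        work = arrivals
        # A one-off event (e.g. a network outage) dumps extra work in.
        if burst_at is not None and tick == burst_at:
            work += burst_size
        # The feedback: a long backlog creates even more work.
        if backlog > retry_threshold:
            work += retry_factor * (backlog - retry_threshold)
        backlog = max(0.0, backlog + work - capacity)
        history.append(backlog)
    return history


# Running at 90% of capacity looks perfectly healthy in normal operation.
calm = simulate_queue(90, 100, 50, 0.5, ticks=30)

# A small burst is absorbed; the backlog drains back to zero.
small = simulate_queue(90, 100, 50, 0.5, ticks=30, burst_at=5, burst_size=40)

# A large burst pushes the backlog past the point of no return.
large = simulate_queue(90, 100, 50, 0.5, ticks=30, burst_at=5, burst_size=200)

# The same large burst with twice the capacity recovers on the next tick.
roomy = simulate_queue(90, 200, 50, 0.5, ticks=30, burst_at=5, burst_size=200)

print(max(calm), small[-1], large[-1] > large[5], roomy[-1])
```

With these invented numbers, the 90%-loaded system recovers from the small burst but never from the large one, while the system with spare capacity shrugs off both. Nothing in normal operation distinguishes the fragile setup from the robust one until the burst arrives.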

To be honest, I really don't think this will be the last time this happens.
The whole point of a cloud is that by distributing virtual servers over a lot of physical ones, then since they don't (normally) all use peak resources at the same time, this smoothing of max resource use makes it possible to use fewer physical servers, thus making it competitive price-wise with running things on your own server.
This will nearly by definition mean that if something again manages to trigger simultaneous peak use of a resource, things will break badly.

pzkpfw
2011-Apr-29, 03:21 AM
Yeah, virtualisation of servers is sometimes touted as a "green technology". I've read brochures where it's the "main point" over and above scaling, fail-over and other benefits.

Basically, instead of having two servers, both fed with power and air con, running at 45% each, you can have one physical server running at 91%. That's saving power.

But each of those 45% servers has 55% of capacity "spare" just in case. That 91% server doesn't have much capacity left - and it's running two systems...
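
To put rough numbers on that, here is a small sketch using the 45% and 91% figures from above. The helper name `spike_tolerance` is made up for illustration, and it assumes load scales linearly up to 100% of the machine, which real servers rarely do.

```python
def spike_tolerance(load_percent):
    """Fraction by which the current load can grow before reaching 100%."""
    return (100 - load_percent) / load_percent


# A dedicated server at 45% can absorb more than a doubling of its load.
print(f"45% server: load can grow by {spike_tolerance(45):.0%}")

# The consolidated server at 91% saturates after roughly a 10% increase.
print(f"91% server: load can grow by {spike_tolerance(91):.0%}")
```

So under that simplifying assumption, either system on its own box could more than double its traffic, while the shared box runs out of room if the two systems together grow by about a tenth.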

Jeff Root
2011-Apr-29, 04:16 AM
My sister just found out that her computer's hard drives were
configured as RAID-0 when she bought it five years ago. She
also just found out what "RAID-0" means...

-- Jeff, in Minneapolis