As you probably know, listnook was down or degraded for the last 36 hours. Right now we are still a bit degraded, but we have enough servers to handle the weekend traffic (we think). We hope to be at full capacity by Monday.
We want to tell you why listnook was down.
In short, Amazon had a failure of their EBS system, which is a data storage product they offer, at around 1:15am PDT. This may sound familiar, because it was the same type of failure that took us down a month ago. This time, however, the failure was more widespread and affected a much larger portion of our servers (and not just ours; many other companies were affected as well). Specifically, the outage disabled most of our database slaves. Even though we are spread across multiple availability zones (data centers), it did us no good in this case, because the outage hit multiple availability zones at once.
Since that last failure, we have been doing everything we can to move ourselves off of the EBS product. We're about halfway there. All of our Cassandra nodes are now using only local disk, and we hope to have all of postgres on local disk soon.
We will continue to use Amazon's other services as we have been. They have some work to do on the EBS product, and they are aware of that and working on it. The other services that we use are still performing as expected.
That being said, if you work for another hosting platform and believe you can make a compelling offering, please contact us at hosting@listnook.com, and we'll get back to you in a few days.
The team and I have been up the last two nights waiting for this issue to get fixed on the Amazon side so that we could bring the site up as soon as possible. Because of this, we probably won't be around much to answer questions in the comments here, but feel free to talk amongst yourselves. :)
As always, thank you all for your continued support. And to whoever sent us a pizza, thank you! It was much appreciated.
To end on a high note, I'd just like to mention that we are making excellent progress on the hiring front to bring on some new developers to help us implement long term fixes. We hope to have some exciting announcements in that area soon.
They were mirrored, which is why we were able to come up in read only mode after they cleared the outage in that zone.
Unfortunately we didn't have enough capacity to also allow writes.
110
MarderFahrerApr 23, 2011
+1
except for Gold members... right?
1
[deleted]Apr 23, 2011
+1
How long was it down before you could bring it back up in read-only mode? I took off from and landed in multiple airports while Listnook was down and wasn't able to access Listnook at any point in between.
1
csulokApr 23, 2011
+1
wouldn't it be possible to run listnook from multiple zones (as in east coast + west coast + europe)? that way no major issues like this would really affect it, only very major ones, like a complete system failure of aws
1
st_samplesApr 22, 2011
+49
Honestly /\ /\ /\ /\ this is why I love listnook. It's run by real people who will actually explain what's going on.
49
CrasyMikeApr 23, 2011
+3
^^^^^ didn't used to be a rare occurrence. There used to be many more technical details! I'm complaining, but I understand why it is that way. Not enough time to be writing long speeches about every downtime.
3
pchristophelApr 22, 2011
+982
What other site has users that would send the admins pizza during an outage?
982
oSandApr 23, 2011
+9
/b/. But not in a good way.
Also, it probably was a mistake to mention this. Next outage, you're going to get several thousand pizzas from well-meaning listnookors.
9
jedbergApr 22, 2011
+870
A bacon pizza at that!
870
brmjApr 23, 2011
+1095
I was the guy who sent the bacon pizza. [Proof.](http://imgur.com/dunO2) [Proof 2.](http://i.imgur.com/ZRfWw.png)
Glad you guys appreciated it. It was partially because with this situation I figured you guys could use it and partially because I had meant to do something nice for you guys to thank you for partnering with the FSF but never got around to it.
1095
jedbergApr 23, 2011
+63
Finally someone claims responsibility! Thank you for sending that, it was a much appreciated snack, and quite a surprise.
63
PepEyeApr 23, 2011
+13
I don't understand why you didn't know who sent it when it says his username on the additional instructions..
13
JoeThankYouApr 23, 2011
+8
I worked for a pizza place, and we occasionally got special requests like this. Usually, we did most of the things they requested, but small things we often said, "f*** this guy, he's not the boss of me!".
Or we just forgot.
8
Mutiny32Apr 23, 2011
+6
They were probably busy with the Amazon ass-chewing to really pay too much attention.
6
[deleted]Apr 23, 2011
+3
Need to give him some Listnook gold.
3
[deleted]Apr 23, 2011
+41
get that man a pizza trophy
41
MrValdezApr 23, 2011
+42
It was **YOU**! GET HIM!
42
[deleted]Apr 23, 2011
+474
As long as you understand that it's a one time thing, we can't have you 'flipping switches' every time you get hungry for bacon pizza.
474
snowydayApr 22, 2011
+40
Pardon my ignorance but, for future reference, to what address can I send food for you guys.
While I'd prefer to never have another outage I'd happily send some Thai or pizza your way.
40
GoneWildAccount1234Apr 23, 2011
+37
I think this would work, not sure.
listnook c/o Wired
520 Third St
Third Floor
San Francisco, CA
94107
37
hardeep1singhApr 23, 2011
+47
I have a feeling listnook will have a huge pile of pizzas from all over the world waiting at their door, the next time they go down.
47
TheyCalledMeMadApr 23, 2011
+143
Well, that is the gentlemanly thing to do when someone goes down on you.
143
jc4pApr 22, 2011
+217
Better than a narwhal pizza.
217
JazzbandrewApr 22, 2011
+228
How do you know that? ಠ_ಠ
228
doug3465Apr 22, 2011
+102
I have tasted narwhal pizza several times in my dreams
102
capgrassApr 22, 2011
+84
Narwhals taste like bacon in MY dreams.
84
ProbablyHittingOnYouApr 22, 2011
+108
4chan, but for them it would be meant as a bad thing.
108
[deleted]Apr 23, 2011
+89
The only difference is that the listnookor would send just one, and they'd pay for it.
89
[deleted]Apr 23, 2011
+19
I was making a cruel joke. I sent them an EBS pizza. (Extra Bacon, Sausage).
19
[deleted]Apr 22, 2011
+25
[deleted]
25
jedbergApr 22, 2011
+50
We enabled all gold users and 13.5% of general users to test our systems and get some fresh content.
50
delola3100Apr 23, 2011
+12
that's odd, I am a gold member, and I couldn't log on till this afternoon ಠ\_ಠ
12
[deleted]Apr 23, 2011
+24
All that glitters is not gold.
24
[deleted]Apr 23, 2011
+6
Why not consider a more fair metric like seniority (4-year club, then 3-year, etc?) or total karma? Why not reward those that contribute the most to the actual website?
6
superdugApr 23, 2011
+2
so, making the assumption that you wanted to test 15% of the user base ... that means 1.5% of the userbase is a gold subscriber ... which means conde doesn't want people to know you've about a quarter mill in listnook gold by now?
2
[deleted]Apr 23, 2011
+1
Out of curiosity, how could you tell someone was a gold user before the person logged in? During the down time, I didn't see the log in field.
1
formodeApr 23, 2011
+3
    if (user.hasGold()) {
        enableLogin();
    } else {
        hideLoginField();
    }
You do have a cookie with your account (likely) stored on your machine. They probably checked that.
3
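The admins say only that all gold users plus a random 13.5% of other users were let in; the thread does not say how that was implemented. As a rough sketch of one common way to do such a partial rollout, the Python below hashes the username into a stable bucket so the same user gets the same answer on every request. Everything here is hypothetical (`bucket`, `may_log_in`, the `permille` parameter, and the choice of SHA-256); none of it is taken from the site's code.

```python
import hashlib

# Hypothetical setting: 135 out of 1000 buckets, i.e. 13.5% of non-gold users.
ROLLOUT_PERMILLE = 135


def bucket(username: str) -> int:
    """Map a username to a stable bucket in 0..999 (tenths of a percent)."""
    digest = hashlib.sha256(username.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer and reduce to 0..999.
    return int.from_bytes(digest[:8], "big") % 1000


def may_log_in(username: str, has_gold: bool, permille: int = ROLLOUT_PERMILLE) -> bool:
    """Admit every gold user, plus a fixed pseudo-random slice of everyone else."""
    if has_gold:
        return True
    return bucket(username) < permille


if __name__ == "__main__":
    for name, gold in [("alice", True), ("bob", False), ("carol", False)]:
        print(name, may_log_in(name, gold))
```

Hashing rather than calling a random number generator on each request means a user who is let in stays in, and raising `permille` later only adds users without removing any.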
[deleted]Apr 23, 2011
+4
[removed]
4
[deleted]Apr 23, 2011
-1
Is this going to be the new model for the site? Pay to post? Or is that only when things go wrong?
-1
Id3sApr 23, 2011
+8
Don't Listnook gold members get to test new features?
I'm just going to count testing whether or not the site is going to implode as good. I mean sheesh, this is a rare occurrence. Calm down.
8
[deleted]Apr 22, 2011
+27
Appreciate all your hard work keeping the site up and running, don't appreciate being lied to with regards to listnook gold though. I'll just quote my comment from [here](http://www.listnook.com/r/listnook.com/comments/gv7jp/anyone_else/c1qjb08):
>What I find disappointing isn't so much the gold people acting like cockmongers today, but the actions of the listnook admins. They assured us that gold subscribers wouldn't be treated any differently before they pulled the trigger on implementing it, yet it's happened at least twice now that I've seen. Perhaps more, I don't know, but these were the ones that really stuck out at me. The first time I saw it was with [listnook mold](http://i.imgur.com/J8CTh.png) where we were told it would be random, then they admit after the fact it was a gold thing at first (links and context in pic). Then it happened again today with only allowing gold subscribers to login while [claiming at the top of their site that it was random](http://i.imgur.com/Ol5qQ.png). Note that the thread is about only gold users being able to login and Jedberg came in to deny it, then admits to it further down.
>**I clicked on a gazillion different usernames of people posting during their supposed "now allowing limited random logins" and not a single one was a regular user. They were all gold subscribers**. So if you're telling me it was random then you're a liar. If you're going to have user tiers, meaning you're going back on your promise to not treat donators/gold people different from way back when, then stop f****** lying when you're caught with your hand in the cookie jar and just f****** do it already. Otherwise just post that you can't allow access to everybody yet and you're using gold users to test things out. Just don't f****** lie and act all "cutesy" about it.
EDIT: I apologize for coming across so crass. Definitely could have made the case better, but I was a little pissed.
27
kremmyApr 22, 2011
+63
While put kind of angrily/crudely, I agree with this guy. I have no problem whatsoever with you saying "we can only support a small chunk of people right now so we're letting gold users back on" but have the balls to actually say it.
63
jedbergApr 22, 2011
+181
But we didn't just allow gold users. We allowed all gold users, plus a random 13.5%. We needed a set of users that we knew would exercise our systems.
I'm sorry if you felt like we were lying, but the fact of the matter is that gold users are more active than others, so they were the ones doing most of the posting.
181
HelloMaxwellApr 22, 2011
+24
I'm an active member without gold. I'd say I'm more active than most gold members. I may not submit often but I certainly comment enough. I don't "subscribe" to listnook because I don't believe a website with more than 1 **billion** page views a month should need donations from its users. I think listnook gold is a band-aid on a serious wound and that's why I don't subscribe.
**EDIT**: My initial apprehension has been replaced by supreme apathy. I have become the thing I detest: a listnookor who whines about listnook whilst on listnook.
24
[deleted]Apr 22, 2011
+3
Why 13.5%? Just for curiosity - no whining intended.
3
[deleted]Apr 22, 2011
+15
[deleted]
15
r2002Apr 22, 2011
+69
Here are my thoughts about this:
* The people who posted in [/r/suicidewatch](http://listnook.com/r/suicidewatch) should always get priority access regardless of membership status.
* If only a small subset of users can be serviced, I'm ok with using Gold membership as one of the criteria for selecting the group. But it would be nice if other things are taken into consideration as well, such as number of spammers reported, frequency of votes on new stories, or the average up/down ratio for submissions. People contribute to Listnook in many ways beyond just donating money.
* Even if you're not happy with how this was handled this time, you probably should give the admins a pass. They've had a tough couple of days.
* I want to thank the Gold members for donating money to keep Listnook running.
Full disclosure: I was a charter Gold member but haven't renewed my membership recently.
[Cross posted from this thread](http://www.listnook.com/r/listnook.com/comments/gvbyw/how_do_people_feel_about_giving_gold_members/)
69
MindStalkerApr 23, 2011
+20
Unfortunately if people know you get priority by posting to /r/suicidewatch then that sublistnook would get spammed by people trying to game the system. NOT what you want.
20
r2002Apr 23, 2011
+19
People who f*** around in suicide watch should be banned for life.
19
chengizApr 23, 2011
+5
On your first two items:
* Shortsighted. If suicidewatch remains open during downtime, listnookors will just post to it instead. "Halp I am suicide - ffffffffuuuuu".
* Who cares? If goldmembers are the only members who can post during a downtime, good for them - why, it's the bar of gold they have with them, naturally.
5
SicSemperHumanusApr 22, 2011
+38
>The people who posted in /r/suicidewatch should always get priority access regardless of membership status.
Thank you.
38
jwhardcastleApr 23, 2011
+7
The only problem is this: the people who need *help* on /r/suicidewatch are likely to be first-time posters, so you can't determine ahead of time who they are. The **awesome** listnookors who devote their time to helping people there should definitely get some love, though.
7
harrisonfireApr 23, 2011
+1
It doesn't really matter to me, but I can assure you that you did not allow all gold users.
1
[deleted]Apr 23, 2011
+6
> _But we didn't just allow gold users. We allowed all gold users, plus a random 13.5%. We needed a set of users that we knew would exercise our systems._
So allowing 100% of gold users and 13.5% of other users isn't considered treating them differently? How does that work? That's pretty much sealed my resolve to never renew my Listnook Gold ever and turn Adblock back on.
6
[deleted]Apr 23, 2011
+4
And, because of your actions I am now installing ad block and using it only on this site.
Why? Because you decide to play word games / treat different classes of users differently after saying you would not.
This makes you untrustworthy.
4
dkitchApr 22, 2011
+2
I've got Listnook Gold and I was still in read-only mode until just recently. Is this possibly due to expired/stale login cookies, or something else?
2
LinuxFreeOrDieApr 22, 2011
+5
Did you explicitly visit http://www.listnook.com/login?
Because the "sign in" button didn't show up for me either, but I could successfully log in if I went there.
5
damontooApr 23, 2011
Was the 13.5% of non-gold members allowed at the same time, or did you allow 100% of the gold users first, then after the fact add the non-gold? This makes a very, very big difference to me.
Just for the record, I don't care at all if gold members are allowed access first in these situations. I'd even go so far to say that they should. I just want honesty about it.
For hours and hours in the Listnook chat a small group of people were spouting crazy conspiracy theories about how Listnook is doing undercover promotion of ad content etc. I vigorously defended Listnook against those wackos and others bashing Listnook's choice in service providers. In return I see your comment in the other thread which I felt was a lie and apparently a lot of others agree.
That said, I'm not going to hold anything that happened or was said the last couple days against anyone. Sleep deprivation and stress can be a b****.
0
[deleted]Apr 22, 2011
+15
Like my post said I was looking for a non-gold person and couldn't find one. I apologize if they were allowed but I just never found one. I did look at a good 100+ profiles and every single one was gold. Based on your follow up comment one could presume it was for gold members only. With listnook mold we were told it would be random and it wasn't based on the two blog posts in my screenshots.
I do appreciate gold subscribers supporting the site and ultimately I don't have much of a problem with them getting certain perks for it. My main issue is that it seems you are all reluctant to admit that instead of just telling us what's up. In the case of listnook mold we were told one thing and something else happened.
15
fazonApr 22, 2011
+2
Server noob: isn't more traffic during an outage a bad thing?
2
Kowai03Apr 23, 2011
+14
>but the fact of the matter is that gold users are more active than others, so they were the ones doing most of the posting
Did anyone else notice how all this wonderful Listnook Gold "content" was boring as shit? There was barely anything up and most of it was them all congratulating themselves on being able to post/get to the front page. I didn't bother reading anything at all.
14
saintlawrenceApr 22, 2011
+114
That kinda thing belongs in /r/firstworldproblems.
At the end of the day, who the f*** cares? Listnook's back, the karma's a-flowin', the bacon is cooked and the women still don't exist.
114
[deleted]Apr 22, 2011
+35
For f***'s sake, next time just come out and say Listnook gold members will have full access, and do not display the message on the front page which said that it would be random. People felt like you were lying, jedberg, because you posted [shit like this](http://i.imgur.com/0sMWK.jpg).
35
[deleted]Apr 22, 2011
+53
Who the f*** cares? Seriously, of all the pedantic nonsense to give a shit about, why the hell does it matter who got let back on the site first?
Besides, it's not like it's a huge amount of money anyway, Listnook Gold members didn't exactly sell their children to be part of this elite club.
How's this: I'll buy you a month of Listnook Gold if you shut the hell up about this, simply because I don't want to hear about it for the next 6 months.
53
[deleted]Apr 22, 2011
+3
Thanks, but can you explain to a non-techie why you can't do whatever google, twitter et al do to keep a 99% uptime?
3
jedbergApr 22, 2011
+25
They spend a truckload of money to do that. We don't. :) Also, they have a lot of people, we have 3.
TLDR: $$$$$$$$$$$$$$$
25
[deleted]Apr 23, 2011
+5
Yet you consistently say money isn't a problem and more people buying listnook gold would not help.
5
[deleted]Apr 23, 2011
+1
[deleted]
1
teemApr 22, 2011
+9
and despite all that money, Google services (gmail) and twitter have also had major outages in the past.
9
timdorrApr 22, 2011
+181
> That being said, if you work for another hosting platform and believe you can make a compelling offering, please contact us at hosting@listnook.com, and we'll get back to you in a few days.
I really hope someone does convince you to get off Amazon. EC2 is great for getting off the ground and/or certain types of workloads, but it's generally very costly in overhead and performance when run at scale. They are best for either edge of the bell curve: When you're just starting off and need to get something going easily and quickly; Or when you're at Netflix or Amazon scale, need a *huge* number of systems, and can effectively architect around these issues based purely on raw size.
The problem is listnook is still in the middle of that bell curve, and you guys don't have a budget to "overscale" the service to maintain performance. So, the overhead of virtualization and contention of sharing resources is starting to creep into your day-to-day operation. Native, unshared hardware is really the way you should be heading. You'll get drastically more bang for your buck, and with larger providers like Rackspace or Softlayer/ThePlanet, you can get new systems online at a rate competitive with EC2 (especially for the price). Also, given the scale you'd purchase at, they'd be willing to drop the prices listed on their sites 30-40% easily.
Or even better: Buy your own hardware and colocate. It is *stupid* c**** for transit nowadays. You can find amazingly good systems builders that are building for basically the same price as a 6-9 month rental cost. You'll get far more bang for your buck. And, side bonus, you'll have something you own with actual equity. Win, win, and win.
Credentials check: I used to be the owner of [A Small Orange](http://www.asmallorange.com), which owned and colocated all our systems.
181
crlarkinApr 23, 2011
+29
I don't think listnook has the manpower or expertise to handle a colo situation. Overall I agree that dedicated hardware is a much better way to go in terms of reliability; the main issue will always be economy of scale. Dedicated hardware is never c****, and you still pay for it when it is not in use. I don't think the 30-40% markdown is entirely realistic, though; maybe 20%. Providers on the level of Rackspace and Softlayer don't often "drop their pants," as we say, when it comes to pricing; 10-20% off retail is much more feasible in my experience. My credentials: I am a Senior Hosting Consultant with SingleHop.com, http://www.inc.com/inc5000/profile/singlehop. We are a managed hosting provider on the service level, but not the size, of SoftLayer and Rackspace, and I sell complex application clusters like this on a regular basis.
29
[deleted]Apr 23, 2011
+13
Softlayer will "drop their pants" when it comes to buying large inventory. Look at 100TB.
You can use their "cloud", or, if you want, use local machines/disks and setup your own private network across Seattle, DC, Dallas, and soon to be San Jose (if I'm not mistaken) and some others across the world.
Utilize larger CDNs like Internap or Akamai that have much greater uptime.
I've been with a number of hosts, but all of our important stuff is at Softlayer, with a 1gbps (soon to be 2) private network between DC and Dallas. With their new POPs, you can build a fairly robust system.
I have never experienced an unplanned outage with Softlayer. They have brought down their iSCSI and some switches for planned upgrades.
Listnook is one of the biggest sites on the Internet, and in the end they should be hiring some very senior level architects to set this up.
13
Prometheus2k2Apr 23, 2011
+18
I work in Bandwidth and I'd be happy to get competitive bids from 70 providers who can hit your location, or I can connect you directly to the leasing companies who own (and occasionally manage) the big west coast datacenters.
I <3 Listnook and anything I can do to prevent the calamity that is downtime is at your disposal. I'm willing to help, just shoot me a PM.
18
[deleted]Apr 23, 2011
+23
i love a small orange :) it was my host of choice when i still had websites :D
23
ShinhanApr 23, 2011
+2
Btw, do note the difference between EC2 and EBS. EBS is a piece of c***. And not because of the downtime but because of unreliable and inconsistent performance. [Percona compared EBS to SSD](http://www.mysqlperformanceblog.com/2011/02/21/death-match-ebs-versus-ssd-price-performance-and-qos/) and here's the summary (although do read the whole article):
>* Server one in the datacenter is maybe a $10k machine with a $3000 disk array (say $4000 total per year plus colo costs, if you buy the server and rent a rack), responding to the database in generally sub-millisecond latencies, at a throughput of 30-40MB/s with quite a bit of headroom for more throughput.
>* Server two in the cloud costs about $17k to run per year, plus about $1500 per year in disk cost (up to $3000 per year now that they’ve added 10 more volumes), and is responding to the database in the tens and hundreds of milliseconds — highly variable from second to second and device to device — and causing horrible database pile-ups.
>* We’re comparing apples and oranges no matter what, but put simply, **price is in the same order of magnitude, but performance is two to three orders of magnitude different**.
I don't see any way to overscale EBS. You will still randomly get degraded performance for some requests because EBS is just that unreliable.
2
[deleted]Apr 23, 2011
+12
Since when does server hardware build equity? It's a depreciating asset.
12
[deleted]Apr 22, 2011
+5
[deleted]
5
liganicApr 22, 2011
+190
I found it a little bit disappointing that there was no update at all on the [listnookstatus](https://twitter.com/listnookstatus) Twitter feed. Some updates would have gone a long way.
190
hueypriestApr 22, 2011
+315
uhhh. that's my fault. sorry.
315
[deleted]Apr 22, 2011
+218
nah, we can blame this one on amazon too.
Hueypriest was also hosted on Amazon's EC2, so he was in read-only mode as well
218
liganicApr 22, 2011
+57
I like that explanation! It keeps my view of the admins as infallible demi-gods intact.
57
digitalpencilApr 22, 2011
+61
what?! you mean to say you were too busy fixing the site to bother updating the twitter feed to say "it's still down, quit pressing f5 you fuckers!"?
priorities hueypriest, priorities..
61
insomniasexxApr 22, 2011
+16
There were updates at the top of listnook.com, the site that the rest of us were f5-ing for almost 2 days.
16
FalloutApr 22, 2011
+402
Thanks guys.
Are you getting any compensation from Amazon? That was a hell of an outage and you must've lost quite some ad revenue..
402
busyasabeeApr 22, 2011
+157
My company had all our servers on ec2 in VA, we only just got back up and running completely. We provide a critical web service that can't go down and we went down, when all is said and done we might wind up giving our customers a free month of service, which will cost us $100k.
We don't expect any compensation from Amazon. Cloud computing isn't some magic black box, it's subject to uncertainties like any other solution. We fucked up by relying too heavily on Amazon and so did Listnook. This is a valuable lesson to all companies who rely on the cloud.
157
anonytrollApr 22, 2011
+143
wait a second. you provide a "critical web service that can't go down" and you relied on a company that does not advertise five 9 uptime? whose idea was that? you are aware that many other companies guarantee five 9 uptime, right? somebody on your end dropped the ball too.
143
busyasabeeApr 22, 2011
+77
No shit someone on our end dropped the ball, that's the point of my comment. We can cope fine with down time, what we can't cope with is down time on 100% of our servers. This was a black swan event and we weren't prepared for it.
77
[deleted]Apr 23, 2011
+54
I don't think you understand what a black swan is. That's an inconceivable scenario where you have to change a definition because you suddenly see something you never thought possible. Hence when swans were all thought to be white and then Europeans get to Australia and are all like "SWAN Y U NO WITE"
What you experienced is just called a typical violation of the 7 P's:
Proper prior planning prevents piss-poor performance.
54
HomerJuniorApr 23, 2011
+85
I like to think that's where it's so bad the techs just say "f*** it" and go search for clips of Natalie Portman and Mila Kunis making out till the system fixes itself.
85
SoensouApr 23, 2011
+28
That's what I do in response to all failures of any sort I experience.
28
IConradApr 23, 2011
+4
I know of at least one company which "allows" their techs to have CoD on a local share. For instances of just this nature.
4
NotSoFatThrowAwayApr 22, 2011
+63
Just for clarity, does five 9 uptime = 99.999%?
Thanks.
63
[deleted]Apr 22, 2011
+58
[deleted]
58
[deleted]Apr 23, 2011
+35
Hopefully all of it is on christmas morning around 0400
35
[deleted]Apr 23, 2011
+81
Unless it's a service that Santa relies on. Then it's the WORST time.
81
TenarethApr 22, 2011
+32
> We provide a critical web service that can't go down
If that is the case, don't rely on an external vendor.
32
[deleted]Apr 22, 2011
+35
Easier said than done. For example, to provide a critical web service and not depend on external vendor(s) would require them to build geographically distributed data centers themselves, and run them... and so on and so forth. This can get extremely cost/time/effort prohibitive.
35
player2Apr 23, 2011
+24
Don't rely on a *single* vendor. Have No Single Point of Failure. There's a reason you get two or more independent links to the Internet; get two geographically disparate colos, each with a hot spare.
But the obvious truth is, the less of your business you own, the more powerless you are to do anything when shit hits the fan.
And shit will hit the fan.
24
[deleted]Apr 23, 2011
+21
I don't disagree. That is why huge companies like Google, Facebook, Apple, Microsoft etc. run operations themselves, and control all/many components.
It just is not financially that viable at much lower scales. At the end of the day, it's a business decision between risk vs. reward and effort vs. value.
21
player2Apr 23, 2011
+17
I'd argue it's moreso a lowballing of the actual cost of doing business. This underappreciation of necessary risk mitigation techniques and their associated costs is a direct result of the deliberately misleading marketing put forth by cloud service providers.
Microsoft is one of the worst offenders here. Their marketing is full of "just poof all of your critical enterprise IT infrastructure up to our cloud and you will save teh moneys, lay off your redundant staff, and score that big fat bonus check."
Sure, cloud providers *may* follow best practices within their walls (and how are you to know? Does your service agreement include an auditing clause?) but their organization represents a single point of failure in and of itself.
Decisionmakers need to think this way: Would you hand over your entire accounting department to a consulting firm without so much as an independent auditor overseeing their work? No, and that's illegal for a good reason. Then what makes mission-critical IT any different?!
17
foreverinaneApr 23, 2011
+4
This is a good point. The "cloud" is best used as a backup/disaster recovery hotspare to otherwise self-managed systems.
At the very least, go with two separate providers/datacenters etc.
Even google can have an issue that affects some customers but not all on their gmail product, and that's pretty simple really.
I wouldn't trust gmail for anything critical, and if you were going to use it, it would be a good idea to have an email archiving service set up to capture all incoming and outgoing messages to different servers that you could access if one morning you come in and can't get to gmail for some reason out of your control.
4
busyasabeeApr 22, 2011
+18
Our mistake was relying on a single vendor. Nothing wrong with outsourcing critical operations. The lesson we learned is never depend on one company, or more generally don't allow single points of failure.
18
unclerummyApr 23, 2011
+4
The key takeaway being that, despite what cloud providers would have you believe, the providers themselves are single points of failure.
4
Robo-boogieApr 23, 2011
+3
Even having one vendor is still a single point of failure. companies can go bankrupt, the police can come down and confiscate all the hardware. or godaddy can disable a domain and knock a whole NOC down. this stuff has happened to other hosting providers. it may not be that but it could be a meteor or a power malfunction or even a cooling problem. shit just happens
3
MeritApr 22, 2011
+142
The Amazon EBS terms of service state that a customer will see a 10% reduction in their bill if the total yearly uptime falls below 99.5% of the year, I believe.
Edit: 99.95% maybe? Turns out I don't remember.
142
lectrickApr 23, 2011
+44
So basically they could provide 1% uptime and still collect 90% of the fees?
WHY DIDN'T **I** THINK OF THIS BRILLIANT BUSINESS PLAN?
44
[deleted]Apr 23, 2011
+16
Yes, you'd collect 90% of fees from your remaining customers.
16
wickedcoldApr 22, 2011
+245
99.95, a ship which sailed long ago.
245
schtumApr 22, 2011
+151
To anyone who doesn't want to do the math, that works out to about 4h20m downtime per year.
151
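For anyone who does want to do the math, here is a small Python sketch of the conversion from an uptime percentage to allowed downtime per year. It assumes a 365-day year; the function name `allowed_downtime_hours` is made up for illustration. The percentages are the ones mentioned in this thread (99.5%, 99.95%, and "five nines").

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(uptime_percent: float) -> float:
    """Hours per year a service may be down while still meeting the uptime figure."""
    return HOURS_PER_YEAR * (100 - uptime_percent) / 100


if __name__ == "__main__":
    for pct in (99.5, 99.95, 99.999):
        hours = allowed_downtime_hours(pct)
        print(f"{pct}% uptime -> {hours:.2f} h/year ({hours * 60:.1f} min)")
```

At 99.95% this comes to roughly 4.4 hours a year, in line with the "about 4h20m" estimate; at 99.999% it is a little over five minutes.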
wastelanderApr 23, 2011
+74
EBS = Extremely Bad Service?
74
bananaheadApr 22, 2011
+21
As if anybody cares about the fee while their *entire website is down*. Which is why SLAs are basically meaningless.
21
frownyfaceApr 23, 2011
+19
It's a strong incentive for the service provider to not f*** up in the first place.
19
idiotthethirdApr 23, 2011
+34
It would be a better incentive if the d******* was proportional to the outage. As soon as the limit of outage has occurred, the d******* no longer provides any incentive at all.
34
frownyfaceApr 23, 2011
+4
Well, it gets really bad and you start losing customers. Listnook has been defending Amazon for a long time; this event ended that, and they are now looking for alternatives.
4
MertsAApr 22, 2011
+18
Amazon's SLA guarantees 99.95% uptime; the only catch is that it doesn't apply to their Relational DB service or EBS. Scumbag lawyers...
18
GoofyBoyApr 22, 2011
+6
Exactly, you could still reach listnook.com, just not its database. It's amazing that business people agreed to this. They might as well not have an SLA for the entire cloud service and just have a plan to quickly move a static version of the site to another company's infrastructure.
6
ryckmonsterApr 22, 2011
+329
And I didn't touch digg once!
329
patssleApr 22, 2011
+58
I looked at it and noticed all the top stories have less than 20 comments each.
I laughed and closed the window. What an epic fail they pulled.
58
skookybirdApr 22, 2011
+19
Went down there to check on it. Their top story, submitted 16 minutes ago, is a blogspam version of Listnookor creation [Otomata](http://www.earslap.com/projectslab/otomata).
19
AtarioApr 22, 2011
+50
No amount of downtime would be enough to drive me back there. I'd start Googling random Angelfire sites before I did that.
50
SA_not_JanitorApr 22, 2011
+375
Holy c*** - I completely forgot about digg. Never even considered it.
375
[deleted]Apr 22, 2011
+147
I broke down and spent about 20 minutes on Digg. It's laughably dead.
147
[deleted]Apr 22, 2011
+274
[removed]
274
[deleted]Apr 23, 2011
+91
[deleted]
91
[deleted]Apr 23, 2011
+46
"Who else is smokin' weeeed today, man?"
46
[deleted]Apr 23, 2011
+59
[deleted]
59
[deleted]Apr 23, 2011
+15
Top threads had like 20 comments. So strange!
15
KangalooneyApr 22, 2011
+62
Never been into necrophilia.
62
GotTheHotsForMyAuntApr 22, 2011
+13
Shoot, I haven't been there since the Diaspora of 2010!
13
[deleted]Apr 22, 2011
+23
Why would you go to digg when you could go to `/.`? B-)
23
pianoconlatteApr 23, 2011
+4
Thanks to /. I found listnook a few years ago. It was innocent flirting in the first year but fast forward two years and I visit /. 3 times a month max. Same with El reg.
Yep, I guess it was my first love too.
4
[deleted]Apr 22, 2011
+239
[deleted]
239
freyrs3Apr 22, 2011
+77
Apparently Amazon only guarantees 99.95% uptime, I don't think [they've quite reached](http://www.awsdowntime.com/) that yet.
Edit: actually they have
77
oditogreApr 22, 2011
+35
>This calculation discounts the recent outage from a theoretical 365-day window of uptime.
**(i.e. assuming no other downtime has occurred, or will occur in the future).**
But other downtime *has* occurred, earlier this year, at least for Listnook. They may not be under 99.5%\* for anybody else, but they probably are for Listnook.
\*I assume 99.5 is what you meant, not 99.95; otherwise, your own link, currently showing 99.56, shows you wrong. :P
35
north0Apr 22, 2011
+22
99.95% is the guarantee. After it degrades below that, Amazon offers a 10% discount or something.
22
nothing_cleverApr 22, 2011
+24
Can somebody help me out here? I don't quite understand that counter. For starters, it's at 99.5%, which is less than 99.95%. Secondly, 0.05% of downtime in a year is about [four and a half hours](http://www.wolframalpha.com/input/?i=.05%25+of+365+days) right?
24
radekyApr 22, 2011
+40
Freyrs3 is incorrect. You are correct.
[Percentage Calc](http://en.wikipedia.org/wiki/High_availability#Percentage_calculation) It's about 4.38 hours, technically.
Assuming the full Amazon downtime of 1 day, 14 hours from the uptime calc, they owe 34 hours of downtime, pro-rated.
However, Amazon is going to claim that as soon as the site was able to get "up", the downtime clock stops, even if not every volume was accessible, etc. There are such loopholes in these contracts/SLAs. Listnook would be lucky to be compensated for half of that. (Reading their SLA, it's worse than I thought.)
> *If the Annual Uptime Percentage for a customer drops below 99.95% for the Service Year, that customer is eligible to receive a Service Credit equal to 10% of their bill (excluding one-time payments made for Reserved Instances) for the Eligible Credit Period.*
It appears the max refund for any month is 10% of that month's service? Someone please tell me this isn't true. This is why I love Rackspace Hosting:
*Network: Five percent (5%) of the fees for each 30 minutes of network downtime, up to 100% of the fees;
Data Center Infrastructure: Five percent (5%) of fees for each 30 minutes of infrastructure downtime, up to 100% of the fees;
Cloud Server Hosts: Five percent (5%) of the fees for each additional hour of downtime, up to 100% of the fees;
Migration: Five percent (5%) of the fees for each additional hour of downtime, up to 100% of the fees.*
From: [SLA](http://www.rackspace.com/cloud/legal/sla/)
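The numbers being thrown around here can be checked with a rough sketch. It assumes a 365-day year, takes the outage as the 38 hours (1 day, 14 hours) mentioned above, and models only the quoted network schedule of 5% per 30 minutes capped at 100%. The function names are made up for illustration and are not part of any provider's tooling.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def allowed_downtime_hours(uptime_pct: float) -> float:
    """Hours of downtime per year that an uptime guarantee permits."""
    return HOURS_PER_YEAR * (100.0 - uptime_pct) / 100.0


def per_interval_credit_pct(outage_hours: float,
                            pct_per_interval: float = 5.0,
                            interval_hours: float = 0.5,
                            cap_pct: float = 100.0) -> float:
    """Credit under a '5% per 30 minutes, up to 100%' style schedule."""
    # Only whole intervals of downtime earn credit in this sketch.
    intervals = int(outage_hours // interval_hours)
    return min(cap_pct, intervals * pct_per_interval)


outage_hours = 38.0  # 1 day, 14 hours, as assumed in the comment above
budget = allowed_downtime_hours(99.95)
excess = outage_hours - budget

print(f"Allowed downtime at 99.95%: {budget:.2f} hours/year")
print(f"Downtime beyond the guarantee: {excess:.2f} hours")
print(f"Per-interval credit for this outage: {per_interval_credit_pct(outage_hours):.0f}%")
```

Under these assumptions the yearly budget comes out at about 4.38 hours, the outage overshoots it by roughly 34 hours, and a per-interval schedule hits its 100% cap long before the outage ends, whereas the flat credit quoted from Amazon's SLA stays at 10%.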
40
advanced4Apr 22, 2011
+13
Uh, that would mean they only allow a little over 4 hours of downtime. This was well over that.
13
otterdamApr 22, 2011
+118
Pretty sure 99.5% is less than 99.95%
118
JazzbandrewApr 22, 2011
+109
We need to do more tests to make sure, though.
109
yellow-mellowApr 22, 2011
+19
If it's bananas we're talking about they're exactly the same value.
19
[deleted]Apr 22, 2011
+10
0.05% of (1 year) = 4.38290639 hours. They fucked this one up by a long shot. F****** ridiculous, especially since it has not even been 4 months yet. Thanks for the link.
10
bobbo1701Apr 22, 2011
+142
Am I the only one that thought the headline was "On listnook's outrage?"
My outrage has yet to be addressed!
142
KinderSpiritApr 22, 2011
+212
Thank you. It's always nice to have an official explanation.
And thank you for your work every day.
212
YunjeongApr 22, 2011
+65
And a prompt one, at that.
Thank you, admins!
65
jedbergApr 22, 2011
+253
I'll tell you a secret. I wrote it yesterday while I was waiting for things to happen. I just changed it a little today. ;)
253
krispykrackersApr 22, 2011
+123
You're not very good at keeping secrets.
123
jedbergApr 22, 2011
+91
It depends on the secret. ;)
91
kevingoodsellApr 22, 2011
+2
Remind me, which secrets are you good at keeping?
2
davidreiss666Apr 22, 2011
+20
How many licks does it take to get to the center of a Tootsie Pop?
20
tupleApr 22, 2011
+22
"As you probably know, listnook was down or degraded for the last [insert double digit number] hours."
22
IConradApr 23, 2011
+1
I just wanted to say -- after three months of being unemployed, I started a new job yesterday. Listnook's being down made me, personally, extra awesome.
So... everything went better than expected? (For me, that is. I'm sure it's little benefit to you guys.)
1
[deleted]Apr 22, 2011
+13
TL;DR - Fact is I love you guys to death for
1. Keeping the site up in read-only mode (even if I'm still subconsciously trying to upvote)
2. Giving us links to enjoy while it's down.
LIKE HONESTLY? mad <3
13
maxdApr 22, 2011
+20
I'm more shocked that a product like that can have a 36 hour outage than I am outraged that listnook was down. Sucks that you're so attached to them, and I'm sure many other companies are; that would probably drive away most of their customers otherwise.
20
StickApr 22, 2011
+59
You should find a hosting reseller that can give you unlimited bandwidth and disk space for $1 a year. I fail to see how it could go wrong.
59
[deleted]Apr 23, 2011
+21
[deleted]
21
kjcdudeApr 22, 2011
+11
Have you guys looked at Rackspace? They're the closest competitor to Amazon and arguably much more stable.
For those interested, here's a chart of the total EC2 East downtime - http://www.cloudclimate.com/ec2-us/
11
alienthApr 23, 2011
+10
My last job was at Rackspace Cloud. We are well aware of their offering :)
10
STEVEHOLT27Apr 22, 2011
+30
If listnook had worked, it would have been my listnook birthday yesterday.
ಠ_ಠ
EDIT: It could be a couple days from now, I don't know yet.
EDIT 2: Technical people say I'm a month off. Take back the belated karma!
30
AnEnglishGentlemanApr 22, 2011
+184
We don't need explanations yet, we're busy guillotining the listnook gold members...
184
MindStalkerApr 22, 2011
+20
Funny thing is, this will increase gold membership if people think gold buys them more uptime.
20
squatlyApr 22, 2011
+33
Luckily we're just using our body doubles to bear the brunt of the rebellion, whilst we watch on from the Lounge, sipping on our Champagne.
33
[deleted]Apr 22, 2011
+229
Thank you, so very much. I couldn't drink another ounce of piss.
229
voice_of_experienceApr 23, 2011
+5
As another system administrator who got hit (and continues to get hit) by this outage... where are you planning to put your databases now? You mention "local storage" - you don't mean the instance's root device, do you? Because IIRC the special 100GB /mnt directory is actually an EBS too, just one that's less exposed to the admin tools.
I've been looking at glusterfs or an equivalent, spread across multiple availability zones. Unfortunately, bandwidth between zones is not cheap, so I'm still looking at other options.
5
facetiouslyApr 22, 2011
+6
Thanks for the 411 and for all of your hard work. I am, and always will be, your friend, jedberg.
That goes for all the Listnook admins. You may not be many, but you are the best of the best.
And to whoever sent the pizza, you're doing it right.
6
BornAgainGropagaApr 22, 2011
+87
Don't worry: in spite of the outage, you didn't lose any users to Digg.
87
[deleted]Apr 22, 2011
+82
I visited digg yesterday, just to see. It reminded me of one of those "what you need, when you need it" placeholder-cybersquatting sites.
82
[deleted]Apr 22, 2011
+32
It's like watching p*** on an iPod. It's not the best, but it does the job.
32
superhyphyApr 22, 2011
+17
As a former digg user, not once did I even consider visiting digg while listnook was down.
17
[deleted]Apr 22, 2011
+19
[deleted]
19
[deleted]Apr 22, 2011
+5
One thing I don't understand is isn't all this cloud stuff meant to be distributed, no single point of failure, always available etc etc? So how come listnook ends up with a face full of wing-wong every time a single node goes down? Isn't this just the same as having it at one data centre anyway? Is there a plan to get it onto something actually resilient soon?
5
roguebluejayApr 22, 2011
+17
I started and finished an essay. Thank you Listnook.
17
[deleted]Apr 23, 2011
+30
[deleted]
30
1RedOneApr 23, 2011
+11
Can you explain this like I'm a seven year old?
* What does being degraded mean?
* What do Cassandra's do? Are they a type of server hardware/appliance?
* Why does an EBS affect listnook?
I'm not asking because I feel entitled or like you owe me anything, I am just curious so please teach me things.
11
redditacctApr 23, 2011
+37
Degraded is when you have a distributed system and some percentage of the parts are down. So if EBS is a distributed filesystem and uses 200 servers but 120 of them are "re-mirroring" then the whole system can serve maybe 12% of its normal total capacity - ie "not down, but not 100% up".
Cassandra is a way to store data that is not a traditional database, it is written in java so it uses a ton of CPU and memory when run by people who hate java but uses a negative amount of CPU and memory when run by people who love java.
Elastic Block Storage is a fancy name for networked disk space, it is just a way for amazon to have some machines with lots of big disks and share that space over the network with many customer apps using different chunks of the total space.
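To put numbers on "degraded", here is a minimal sketch using the 200-server example above. It only counts servers that are still serving, so it gives a ceiling of 40%. The "maybe 12%" guess above is lower, presumably because re-mirroring traffic also eats into the healthy servers' capacity, which this sketch does not model. The function names are hypothetical.

```python
def healthy_fraction(total_servers: int, unavailable: int) -> float:
    """Fraction of servers still serving requests (a naive capacity ceiling)."""
    if total_servers <= 0:
        raise ValueError("total_servers must be positive")
    if not 0 <= unavailable <= total_servers:
        raise ValueError("unavailable must be between 0 and total_servers")
    return (total_servers - unavailable) / total_servers


def status(fraction: float) -> str:
    """Label the system as up, degraded, or down."""
    if fraction == 0.0:
        return "down"
    if fraction < 1.0:
        return "degraded"
    return "up"


# The example from the comment: 200 servers, 120 of them re-mirroring.
fraction = healthy_fraction(200, 120)
print(f"{fraction:.0%} of servers healthy -> {status(fraction)}")
```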
37
[deleted]Apr 23, 2011
+11
> "Cassandra is a way to store data that is not a traditional database, it is written in java so it uses a ton of CPU and memory when run by people who hate java but uses a negative amount of CPU and memory when run by people who love java."
I lol'd. <3
11
[deleted]Apr 23, 2011
+3
Just curious, how are devs gonna fix the hosting issue (oh, you said long-term issues?). You do realize that any data center under the sun that has been around for at least 5 years and has some decent stats will be better than what you guys are living with. Is it really about cost anymore anyways? I hate to be so brash, but, WTF? You won't get to code if you're stressin' that the site is down.
3
kelekellApr 23, 2011
+8
I have an old laptop laying around, you could store some shit on that if you want.
8
dghughesApr 22, 2011
+8
>it was the same type of failure that took us down a month ago. This time however the failure was more widespread and affected a much larger portion of our servers (and not just ours, many other companies were affected as well).
Listnook we went down before it was cool, the hipster website.
8
krispykrackersApr 22, 2011
+60
Thanks for all your hard work. This wasn't really your fault, no need for apologizing :)
60
ProbablyHittingOnYouApr 22, 2011
+27
Aren't you an admin now? Or something like that?
27
mrlrApr 23, 2011
+6
Oh well. It is traditional for something to crash then come back in a few days at Easter.
6
heytherejesusApr 22, 2011
+739
Thanks, admins. <3
739
BallsOfDisapprovalApr 22, 2011
+1073
ಠ_ಠ - i almost went outside today. that's some bullshit.
<|>
/ω\
1073
pingvenoApr 22, 2011
+122
Were you going to put on clothing before doing so?
Edit: If your immediate thought was "I would if I were him/her", your presence is requested at your local World Naked Bike Ride. Even better, make your way to [Portland's World Naked Bike Ride]!
122
happybadgerApr 22, 2011
+91
[In picture form.](http://i.imgur.com/CB9Z8.jpg)
91
The_Book_Of_RedditApr 22, 2011
+313
**“For it was during the great period of uncertainty that the Listnooks searched to the edges of the Internet for another to join unto the Listnooks and ensure that they would be accessible unto all, and it was unto this end that the Listnooks did strike an accord with the Amazon who proclaimed they did knoweth that which the Listnook sought, for they had the AWS, EC2 and the RDS and they shall become the keepers of the Listnooks and ensure that they would be accessible unto all.**
**Yet it was shown that the Amazons promises were hollow and that the mighty Listnooks were subject to the whims of the Amazon. And there was much lamentation for it was unjust that the mighty Listnooks were to be at the whim of any who should try to slow it.**
**And as the many were forced from communion with the Listnooks, the abandoned ones did cluster in the Freenodes, and there was much despair as they waited for the Listnooks to be returned unto them for it was felt as if millions of voices suddenly cried out in terror and were suddenly silenced**
**And lo, Nevercomment did despair that within the Freenodes it was troublesome, as if herding cats. Then there was much discussion on this for cats are good.**
**And there were also lamentations for those of the Listnooks who were forced to perform the menial duties at their places of toil for which they were employed and that there were no longer images of Brave Wolves, buxom women and four and score to comfort them through the long hours.**
**And so it was that all was as it is usually and the Listnooks continued on its course to its destiny uninterrupted”**
--The Book of Listnook Chp 21 pg 722 “The Broken Covenant of the Amazon”
313
saintlawrenceApr 22, 2011
+21
We know it takes a lot of time to construct additional pylons. Thank you for your efforts.
21
InteriorAlligatorApr 22, 2011
+7
Glad you're finally back up and running. I was running out of lambs to sacrifice.
7
ByeujiApr 22, 2011
+9
You guys should [switch to Azure!](http://www.listnook.com/r/listnook.com/comments/gv7nh/microsoft_has_a_solution_to_the_listnook_downtime/). It's obviously a superior service! >.>
Glad you guys are moving back to local storage for Cassandra and so forth. Keep up the good work!
9
k34m0nApr 23, 2011
+4
Downvote me for this if you feel the need or recommend whoever else you feel necessary too.
But I say Listnook should migrate over to one of our hosting platforms at Rackspace Managed Hosting. I could go on a massive tirade about the reasons why, but I can honestly say we already have a pretty straight and to-the-point explanation as to why you should choose us!
[Rackspace Managed Hosting](http://www.rackspace.com/whyrackspace/support/index.php)
DO IT LIVE!
4
[deleted]Apr 23, 2011
+6
This has got to be some of the worst PR Amazon could hope for.
Good.
6
inodeApr 22, 2011
+6
That was the longest two days of my life...good work Admins. We appreciate the work you guys put in!
6
ithunkApr 23, 2011
+3
Imgur is hosted by voxel.
Duplicate at least a part of the site there and see how it goes.
We use voxel at work, but our site doesn't get the number of hits that listnook does, so I really can't give you assurances that voxel will be able to handle your traffic, but we've had good service from them in the past 5 years.
3
316nutsApr 22, 2011
+2
I don't want this to sound too much like the typical witch hunt rant, but I'll try anyway. Keep in mind I know zero about behind the scenes @ listnook, or the nature of webhosting a billion-hit-a-month website.
Amazon is huge, no? How do they stay in business? They tout their robust servers for specific high demand moments like this specifically to avoid holiday season outages that cost online retailers millions of dollars. How does Amazon proper never seem to fail, but listnook always does? What is so special about Amazon's service/price/etc that makes them the "only game in town"? Surely you've given other hosting services serious consideration during this. After a 36 hour down time, I don't know how Amazon can look you straight in the eye. Good luck and best wishes moving forward.
I've been in business meetings where "shit has hit the fan" like this. It's not fun for anyone at the table because everyone is at risk.
196 Comments