ListNook

Hey folks,

As you may have noticed, the site is back up and running. There are still a few things moving pretty slowly, but for the most part the site functionality should be back to normal.

For those curious, here are some of the nitty-gritty details on what happened:

This morning around 8am PST, the entire site suddenly ground to a halt. Every request was resulting in an error indicating that there was an issue with our memcached infrastructure. We performed some manual diagnostics, and couldn't actually find anything wrong. With no clues on what was causing the issue, we attempted to manually restart the application layer. The restart worked for a period of time, but then quickly spiraled back down into nothing working.

As we continued to dig and troubleshoot, one of our memcached instances spontaneously rebooted. Perplexed, we attempted to fail around the instance and move forward. Shortly thereafter, a second memcached instance spontaneously became unreachable.

Last night, our hosting provider had applied some patches to our instances which were eventually going to require a reboot. They notified us about this, and we had planned a maintenance window to perform the reboots far before the time that was necessary. A postmortem followup seems to indicate that these patches were not at fault, but unfortunately at the time we had no way to quickly confirm this. With that in mind, we made the decision to restart each of our memcached instances. We couldn't be certain that the instance issues were going to continue, but we felt we couldn't chance memcached instances potentially rebooting throughout the day.

Memcached stores its entire dataset in memory, which makes it extremely fast, but also makes it completely disappear on restart. After restarting the memcached instances, our caches were completely empty. This meant that every single query on the site had to be retrieved from our slower permanent data stores, namely Postgres and Cassandra.

Since the entire site now relied on our slower data stores, it was far from able to handle the capacity of a normal Wednesday morn. This meant we had to turn the site back on very slowly. We first threw everything into read-only mode, as it is considerably easier on the databases. We then turned things on piece by piece, in very small increments. Around 4pm, we finally had all of the pieces turned on. Some things are still moving rather slowly, but it is all there.

We still have a lot of investigation to do on this incident. Several unknown factors remain, such as why memcached failed in the first place, and if the instance reboot and the initial failure were in any way linked.

In the end, the infrastructure is the way we built it, and the responsibility to keep it running rests solely on our shoulders. While stability over the past year has greatly improved, we still have a long way to go. We're very sorry for the downtime, and we are working hard to ensure that it doesn't happen again.

cheers,

alienth

tl;dr Bad things happened to our cache infrastructure, requiring us to restart it completely and start with an empty cache. The site then had to be turned on very slowly while the caches warmed back up. It sucked, we're very sorry that it happened, and we're working to prevent it from happening again.

Oh, and [thanks for the bananas](/r/downtimebananas).

196 Comments

maxd Dec 8, 2011 +63

Software engineer here, although not one who is at all good at databases. Could you have a redundant memcached instance which instead of serving pages to the internet serves data to a disk backup, the idea being that when you spin back up the main memcached instances there is something to recover them from instead of having to start them from scratch? Or would that be no better than recovering it from Postgres and Cassandra? I don't envy your problem; as a video game engineer I have a difficult job but it's one I understand very well. :)

alienth Dec 8, 2011 +79

So, in the end, a big part of the solution is to move a lot of this to Cassandra, which periodically saves a copy of its cache to a disk. Cassandra should be plenty fast for the data as well, once we can get everything upgraded to 1.0. We have a bunch of junk that is stuck on an 0.7 ring, which is quite slow. Unfortunately we're in the process of migrating things around our Cassandra ring, so we're stuck for a bit :/ Edit: I should also note, we're using memcache for locking. Once we move locking elsewhere, we can be much more flexible with adjusting the memcache infra.

[deleted] Dec 8, 2011 +24

That was the solution 6 months ago. And 6 months before that. You've been moving to Cassandra for YEARS now.

alienth Dec 8, 2011 +27

Unfortunately we ran into several brick walls on the pre-1.0 releases of Cassandra, thus the delay. We already host a lot of stuff on Cassandra, but we can't move much more to it until we roll out 1.0.

JonLim Dec 8, 2011 +2

I'm not too well versed on the subject, but what made you guys choose Cassandra over some of the other alternatives like Redis and Hadoop? Just curious, and I want to learn!

alienth Dec 8, 2011 +6

Cassandra is very handy in terms of availability. We can define the replication level of our data, and we can define the consistency level we want to read/write our data at. For example, our replication factor (RF) is set to 3, meaning that every piece of data is replicated to 3 machines. When we write out data, we ask for QUORUM level consistency, meaning that the data is written to at least RF/2 + 1 nodes before the write command is returned. Additionally, Cassandra supports more complex replication placement strategies. If we were to split our Cassandra cluster into two separate, geographically distant locations, we can define a placement strategy that ensures data integrity without bumping into latency heavily. In this case, we can write out using LOCAL_QUORUM, meaning that the write ensures that it has quorum before it returns, but only in the *local* datacenter. I should note that even though the writes are set to QUORUM, Cassandra ensures that they are eventually replicated everywhere. QUORUM write just defines what Cassandra will guarantee before returning a request.
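For anyone who wants the RF/2 + 1 arithmetic spelled out, here is a small sketch. It only does the math described above (with integer division, so RF=3 gives 2); the function names are made up for illustration and are not part of any Cassandra client.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must acknowledge a QUORUM request: floor(RF / 2) + 1."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def tolerable_failures(replication_factor: int) -> int:
    """Replicas that can be unreachable while QUORUM can still be met."""
    return replication_factor - quorum(replication_factor)


# With RF=3, QUORUM needs 2 acknowledgements, so one replica can be down.
for rf in (1, 3, 5):
    print(f"RF={rf}: QUORUM={quorum(rf)}, can lose {tolerable_failures(rf)}")
```

This is why an RF of 3 is a common choice: it is the smallest replication factor where a quorum write still succeeds with one machine missing.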

gman2093 Dec 8, 2011 +2

So is that to say Cassandra was chosen for scalability more so than its sequential-read big O (read: max time)? Edit: clarity

[deleted] Dec 8, 2011 +1

[deleted]

alienth Dec 8, 2011 +8

Londiste statement-based replication.

coolmanmax2000 Dec 8, 2011 +5

...Not a computer scientist, but I think you just made that up

maxd Dec 8, 2011 +23

Thanks for the reply. I'm working on an MMO so I get to see an inkling of network and db engineering but I'm an AI engineer so I'm nowhere near that whole layer. Suffice to say I find it interesting and awesome. :)

274Below Dec 8, 2011 +17

memcached sits in between the database layer and the rest of the app. The app sends the request to memcached, which returns the results from memory (hence the term "memcached") on a hit; on a miss, the app queries the database, stores the result in memcached, and then carries on with it. memcached is "thin" enough that it doesn't even have any authentication or similar -- you can either hit the port, or you can't. I don't believe that it has any facilities to write to the disk and recover from the disk either. Given the purpose and function, though, it may not be a huge help given the read-only mode (which would almost instantly build the data back). Of course, I don't run the website, so who knows! edit: or alienth can reply and say that yeah, it'd help. Answers that.
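A rough sketch of that lookup pattern, and of why an empty cache hurts so much. In the usual arrangement it is the application, not memcached itself, that falls back to the database on a miss. Everything here is a stand-in: the dict plays the role of memcached, `load_from_db` plays the role of Postgres or Cassandra, and the class name is invented for the example. It is not the site's actual code.

```python
from typing import Callable, Dict


class CacheAsideStore:
    """Toy cache-aside lookup: check the cache first, fall back to the DB."""

    def __init__(self, load_from_db: Callable[[str], str]) -> None:
        self._cache: Dict[str, str] = {}  # stand-in for a memcached instance
        self._load_from_db = load_from_db  # stand-in for the slow data store
        self.db_reads = 0

    def get(self, key: str) -> str:
        if key in self._cache:
            return self._cache[key]  # hit: served from memory
        self.db_reads += 1  # miss: the app goes to the database
        value = self._load_from_db(key)
        self._cache[key] = value  # populate the cache for the next reader
        return value

    def restart_cache(self) -> None:
        """Simulate a memcached restart: everything in memory is gone."""
        self._cache.clear()


store = CacheAsideStore(lambda key: f"row-for-{key}")
for _ in range(3):
    store.get("front_page")
print(store.db_reads)  # 1: only the first read reached the database

store.restart_cache()
store.get("front_page")
print(store.db_reads)  # 2: after the restart the database is hit again
```

Multiply that last miss by every key on the site at once and you get the situation described in the post: all traffic lands on the slower stores until the cache warms back up.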

marcman84 Dec 8, 2011 +644

Reading that explanation, all I could think of was the scene from Jurassic Park where Ellie had to turn on all the fences manually. Was it like that? Please say yes.

A_Doctor_ Dec 8, 2011 +77

You can't throw the main switch by hand. You've got to pump up the primer handle in order to get the charge. It's large, flat and gray.

[deleted] Dec 8, 2011 +22

Now I'll worry some admin is being eaten by a raptor every time the site goes down.

alienth Dec 8, 2011 +759

Sure. Why not. It's Unix, I know this.

[deleted] Dec 8, 2011 +181

thanks_for_the_fish Dec 8, 2011 +275

Or sudo Please work now. I hear that works. I'm not a coder, so you might have to use all caps.

SarcasticGuy Dec 8, 2011 +20

> sudo Please work now. "User not in sudoers file. This incident will be reported. Violators will be shot." Uh oh...

[deleted] Dec 8, 2011 +56

The "please" is important. You do not want to make UNIX angry.

IRBMe Dec 8, 2011 +79

    [dave@localhost]# alias Please=
    [dave@localhost]# alias work=
    [dave@localhost]# alias now.="echo \"I'm afraid I can't do that, Dave\""
    [dave@localhost]# Please work now.
    I'm afraid I can't do that, Dave

[deleted] Dec 8, 2011 +48

A wee bit shorter and a bit more flexible:

    [dave@localhost]# Please() { echo "I'm afraid I can't do that, Dave."; }
    [dave@localhost]# Please open the pod bay door, Hal.
    I'm afraid I can't do that, Dave.

TMTOWTDI...

ICanSayWhatIWantTo Dec 8, 2011 +7

> TMTOWTDI... Oh god, did that Perl bug just get ported to Bash?

[deleted] Dec 9, 2011 +3

Heh... Perl was the conglomeration of C + shell, which is also what makes it the best system administrator language around. There's a reason why the `grep` command is built directly into Perl. It's also why there are so many "strange" sigils... they're (mostly) all from Unix shell and awk -- `$?` as process status as one example.

jsshouldbeworking Dec 8, 2011 +6

Love the idea. Quote is actually: "I'm sorry, Dave. I'm afraid I can't do that. " http://www.youtube.com/watch?v=kkyUMmNl4hk (if it's worth quoting, it's worth quoting accurately.)

60177756 Dec 8, 2011 +117

> `rm -rf /*` FTFY. `rm -rf /` actually refuses to run (it complains that you're an idiot and does nothing - *try it!*), but this version works. Edit: did someone send me listnook gold for *this* ‽ Thanks!

Razor_Storm Dec 8, 2011 +19

Depends on your unix distribution. For instance, ubuntu absolutely disallows you to remove root unless you type --no-preserve-root, whereas my centos distro doesn't seem to care at all when I accidentally typed sudo rm -rf / instead of sudo rm -rf .

60177756 Dec 8, 2011 +6

Well `--no-preserve-root` takes forever to type; just `rm`ing `/*` has the same effect. When I f*** my life I like to do it efficiently.

Infra-red Dec 8, 2011 +44

Uhm, yeah, don't try that. That may be true now (not going to test it), but it certainly wasn't always the case. I've accidentally done an rm -rf / and it was quite messy. That was about 20 years ago now, but still.

GibletHead2000 Dec 8, 2011 +16

This is why I always type my command, and then press 'home' and add the 'sudo' afterwards... Because _some idiot decided to put backspace right next to enter_

[deleted] Dec 8, 2011 +4

"GNU rm refuses to execute rm -rf / if the --preserve-root option is given, which has been the default since version 6.4 of GNU Core Utilities was released in 2006." http://en.wikipedia.org/wiki/Rm_%28Unix%29
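To see why the guard catches `/` but not `/*`, here is a toy version of the check. It is a plain string comparison written for illustration, not how GNU rm actually implements `--preserve-root`, and it never touches the filesystem. The key point is that the shell expands `/*` into `/bin`, `/etc` and so on before the command ever sees its arguments, so no single argument is the root directory. The function names and the example argument list are assumptions.

```python
from typing import List


def is_root(path: str) -> bool:
    """True if the argument is made up only of slashes, e.g. '/' or '//'."""
    return path != "" and path.strip("/") == ""


def refused_targets(args: List[str], preserve_root: bool = True) -> List[str]:
    """Return the arguments a preserve-root style guard would refuse."""
    if not preserve_root:
        return []
    return [arg for arg in args if is_root(arg)]


# What the command sees for `rm -rf /`:
print(refused_targets(["/"]))

# What it sees for `rm -rf /*` after the shell has expanded the glob
# (the directory names here are just an example listing):
print(refused_targets(["/bin", "/etc", "/home", "/usr"]))
```

The first call reports `/` as refused; the second reports nothing, which is exactly why the glob version is the dangerous one.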

user2196 Dec 8, 2011 +221

You b******. *written from my second computer*

bradxism Dec 8, 2011 +38

I read this during breakfast and had orange juice come out of my nose in front of the grandkids.

CantHearYou Dec 8, 2011 +81

"Mom, why did orange juice come out of Grandpa's nose?" "Well, son, your grandpa is one cool dude and he reads listnook at the breakfast table instead of socializing with the rest of the family."

[deleted] Dec 8, 2011 +9

That actually sounds kind of handy. "More juice, kids?" \*sploot\*

[deleted] Dec 8, 2011 +3

That's what Live CDs are for. I think I'm going to put in a request for the devs so that when rm is used in this fashion you get a message like "Self destruct sequence activated! You have 5 seconds to copy or unmount anything you hold dear, or press Ctrl+C to cancel."

[deleted] Dec 8, 2011 +5

You know this joke, which is enough to know that this joke is strictly taboo in proper nerd culture. Cheers, */r/spacedicks subscriber annoyed with you making an off-color joke*

TheyCallMeRINO Dec 8, 2011 +20

It will cause you to stop worrying about memcached, that's for sure.

GrannyBacon81 Dec 8, 2011 +8

Hehe I freaked the IT guy out at work with this. I sent him an IM asking if rm - rf / Was the right command to use in vim. About 2 seconds later he bust through the door in a panic.

berlin_priez Dec 8, 2011 +25

>rm -rf / read mail -really fast/ ?

Serinus Dec 8, 2011 +19

>rm Delete > / Everything > -r And everything in it > -f Do what I say without asking questions.

Skid_Marx Dec 9, 2011 +4

Upvote for this guy. For the rest of you, "read mail really fast" is a joke, guys. A really old joke.

[deleted] Dec 8, 2011 +11

[deleted] Dec 8, 2011 +105

So, 4Chan wasn't DDoSing it?

alienth Dec 8, 2011 +157

Nope. Well, if they were, it wasn't enough for us to notice. A DDoS would have been much easier to address than what actually happened :/

sje46 Dec 8, 2011 +55

I'm just wondering though...what is the deal with the sticky on /b/? It seems as though moot--or some mod--is really pissed at listnook for some reason.

[deleted] Dec 8, 2011 +14

Probably not moot, maybe a mod though. moot thinks Listnook is ok, he even did an AMA once. It was probably just a joke.

alienth Dec 8, 2011 +98

Nah, moot is cool :)

EvilAce Dec 8, 2011 +20

the sticky went up at 6am. the site started having issues at 8am. I'm no expert, but that's a little suspicious. I agree there's very little chance moot had something to do with it, but a pissed off hacker from /b/ seems like a valid possibility. Especially since the site is open source, a good black hat hacker (which aren't in short supply on 4chan) could easily have found a hole in the security. that's my two cents anyway.

alienth Dec 8, 2011 +72

Not discounting the coincidence. All I can say is that based on the piece of the infrastructure that was having issues, and the symptoms of the issues, it is *highly* unlikely an external attack would have caused this. Additionally, the issues were consistent even when the site was completely detached from the public internet.

[deleted] Dec 8, 2011 -1

alienth Dec 8, 2011 +10

Well, we have 70k people viewing the site right now. The listnook tech team consists of 7 people. I think that might make us the .01%.

[deleted] Dec 8, 2011 +2

just 7 people...wow, that is amazing. could you guys do a group-style AMA?

scribbling_des Dec 8, 2011 +80

It's obviously a double agent. You should put everyone to the question.

[deleted] Dec 8, 2011 +61

Couldn't have been a double agent. All double agents were caught. Every. Single. One.

Galaxyman0917 Dec 8, 2011 +18

That part of the title of that post pissed me off.

[deleted] Dec 8, 2011 +570

I think I know why it went down [today](http://i.imgur.com/yZYNt.jpg).

znk Dec 8, 2011 +102

Personally I suspect a MythBusters cannon ball.

Bramsey89 Dec 8, 2011 +160

I'm not saying it was 4chan, but it was 4chan.

SPACE_LAWYER Dec 8, 2011 +61

I love how after Listnook goes down 4chan claims LOIC like Ansar al-Jihad al-Alami

shillbert Dec 8, 2011 +33

So basically, it wasn't regular aliens, it was aliens with a lisp. Got it.

Osthato Dec 8, 2011 +55

But Listnook is written in Python...

[deleted] Dec 8, 2011 +28

but it was written in lisp before that.

Mythbro Dec 8, 2011 +4

intelligent groovy cough rock enter grandfather sleep reply support gold *This post was mass deleted and anonymized with [Redact](https://redact.dev)*

alienth Dec 8, 2011 +14

Yeah, I'm well aware. 'Twas unrelated to this. They were attempting a DDoS, but the issue we actually had was a failure of an internal-facing service.

iHelix150 Dec 8, 2011 +3

Question- in the past, much of Listnook's downtime was caused by generic Amazon unreliability. Is Listnook still hosted on Amazon? (you mention 'our hosting provider...) Either way though, thanks. Your efforts are most appreciated, and Listnook has been rock solid reliable lately. Kudos.

We're still on Amazon. We've had issues in the past where issues at Amazon triggered very bad things to happen in our infra. We've mostly worked around those issues (dropping EBS was a big part of that). Also, in general, we're now more protected against hosting failures than we have been in the past.

lonnyk Dec 8, 2011 +1

What are you using instead of EBS?

davidreiss666 Dec 8, 2011 +6

I have decided to blame Jedberg. Cause, you know, he's always at fault. Always. But that chromakode guy is kind of shifty too.

immerc Dec 8, 2011 +2

The important thing to take away from this: The practice of adding a 'd' to the end of the name of something to indicate that it is a daemon works well with things like "httpd" and "imapd" and "logind", but when the word ends in an "e" and the "ed" ending can be interpreted as a past participle the convention breaks down. Instead of interpreting things like "memcached" as "memory cache daemon", it is more natural to interpret them as "memory cached", which makes no real sense. This leads to real confusion when people use phrases like "to restart each of our memcached instances", which *sounds* like "to restart each of our instances that are memcached", but in fact means "to restart each of our memcache-daemon instances". So if you're thinking of writing a "hire daemon" or a "fire daemon" or a "bake daemon", please be careful how you name it.

alienth Dec 8, 2011 +3

Yeah. I have similar peeves for things named after very common words, like Go :P What is funny is the last time I made a post regarding memcacheD, I just used "memcache", and more than a handful of people were extremely displeased with me. *shrug*. I vote we just refer to everything by numbers. There are plenty of those available.

myho Dec 8, 2011 +1

i know a bit of this and that about websites creation and programming generally, but I have NO idea what you just said.. the code behind listnook must be enormous and super awesome.. that's all

Architectural suggestion: Deploy two clusters of memcached servers (I don't know the technical specifics on how it works, but I'm assuming you can group them together in serverfarms or something similar), and deploy these as virtual machines on ESX hosts, two per box. Set affinity rules in VMware so that each ESX host is running one VM in cluster A, and one VM in cluster B. Only allow two VMs per ESX host. Now my thought is that since VMware does transparent page sharing, assuming that both VMs have similar memcached RAM caches, you can have both VMs using the same memory for the cache. This means that you can theoretically use the same bare metal hardware you have now, but have twice as many memcached servers. You can individually reload an entire single cluster, but still have 50% of your memcached servers up, and since you've oversubscribed the existing ones, 50% of the future state is actually 100% of your current state. Wait... I don't know if this actually solves anything, but I already typed this all out and it seems like it would be wasteful to select+A, delete, so I'll just post it anyway and see how people reply. I shouldn't post while on Ambien

kremmy Dec 8, 2011 +135

Let me share a story with you, random Listnook admin. I'm frantically waiting to hear back from a DBA specialist while they look at a server that went down earlier and took down production across three multimillion dollar manufacturing facilities. The reason? A database had to be restarted and didn't want to come back up. Sure, we have backups, but erasing 18 hours of production would f*** things up more than not being able to ship for a few hours. It's a proprietary database format too because my predecessors just kind of said "what the f***, why not?" and management has a largely "leave it alone until it breaks, then it's your fault for not upgrading it already with the money we didn't give you" mentality. Point is, shit happens. You're doing your best.

livefromheaven Dec 8, 2011 +48

Gotta love that mentality. "Just let IT deal with it, they're good with that stuff!"

farhannibal Dec 8, 2011 +26

That works if you give them the resources to handle it.

autotom Dec 8, 2011 +3

quite seriously, the best things i've done at work have been while im bored out of my mind twiddling my thumbs.. its in my nature to just entertain myself by making useful things

argv_minus_one Dec 8, 2011 +1

>Since the entire site now relied on our slower data stores, it was far from able to handle the capacity of a normal Wednesday morn. This meant we had to turn the site back on very slowly. We first threw everything into read-only mode, as it is considerably easier on the databases. We then turned things on piece by piece, in very small increments. What would have happened if you didn't do this and just turned the whole site back on in full and let the databases deal with it? Would it be atrociously slow, fail outright, or what? Also, I'll be curious to know what you find out about why your `memcached`s failed. Will you be announcing the results of your investigation?

chodeking Dec 8, 2011 +2

So, Listnook HQ isn't full of cats?

thermality Dec 8, 2011 +2

How many memcached instances is Listnook running?

[deleted] Dec 8, 2011 +20

forgetmenow Dec 8, 2011 +772

The downtime should have helped with my studying for exams. *Should have.* I still spent a considerable amount of time checking to see if the site was back up.

[deleted] Dec 8, 2011 +29

And now that it's back up, I have to make up for lost time by Listnooking even harder.

JStarx Dec 8, 2011 +118

There should be a support group for people like us... we could make our own sublistnook!

swaggle Dec 8, 2011 +126

r/procra?

IllThinkOfOneLater Dec 8, 2011 +460

We'll do it later.

TheeLinker Dec 8, 2011 +25

I'm pretty sure there literally isn't a single user on this entire website for whom it would be more appropriate to have made this comment. Exquisite.

[deleted] Dec 8, 2011 +138

MAKE THIS MAN A MOD ASAP. or tomorrow, whatever

rockerlkj Dec 8, 2011 +412

I went on 4chan and found [this](http://i.imgur.com/ADx8B.jpg).

TKInstinct Dec 8, 2011 +48

There was some discussion on /b/, surrounding someone who mentioned that they found an exploit on the servers. They said they were planning some sort of attack or something of the like. Not sure if anyone else saw that.

Yeah I saw that. I thought the problem was people in that thread doing a ddos attack.

[deleted] Dec 8, 2011 +16

I was seriously surprised, after seeing that thread stickied and so many posts on it, that barely anyone on listnook was talking about it as a possible cause. Seems like a weird coincidence, in any case.

[deleted] Dec 8, 2011 +18

The thread is actually still stickied. And I totally agree, it's at least an odd coincidence that the thread was full of people wanting to take Listnook down and then it went down just after that.

[deleted] Dec 8, 2011 +23

The power of prayer!

TKInstinct Dec 8, 2011 +10

It could have been, I didn't think much of it until after I saw listnook in read-only mode.

[deleted] Dec 8, 2011 +19

I read that in Jeremy Clarkson's voice, just as he's about to show something he found on the internet that the BBC has to censor...

foreverandalways Dec 8, 2011 +284

Sometimes things need to stay on 4chan and never leave.

letsRACEturtles Dec 8, 2011 +54

like cute cat pics?

foreverandalways Dec 8, 2011 +22

Like fast turtles.

jeckles Dec 8, 2011 +1

I hope you guys are all smoking lots of weed now! That sounds like a rougher than usual day at the office.

[deleted] Dec 8, 2011 +239

thanks for the fairly detailed technical explanation, i can appreciate that a lot. it's impressive the site works as well as it does actually.

centralbanker Dec 8, 2011 +17

This is true. If I could find a way to volunteer that would be useful, I'd do it -- alas I possess no technical programming skills, only the ability to make theories based on academic "research".

burnte Dec 8, 2011 +67

I assumed it was because Listnook is hosted on a Motorola XOOM and it went down with Verizon's LTE outage.

[deleted] Dec 8, 2011 +401

I didn't understand a word of that, but I read it to the bitter end. I think I got smarter?

[deleted] Dec 8, 2011 +736

backbob Dec 8, 2011 +54

I don't know if you care, but "memcache" is a piece of software that basically stores data and webpages in memory, which can then be retrieved very quickly. http://en.wikipedia.org/wiki/Memcached

2percentright Dec 8, 2011 +13

Memory *IS* RAM!

NothingsShocking Dec 8, 2011 +200

something something downtime something something reboot something something sorry.

[deleted] Dec 8, 2011 +68

Now you know how I feel when reading most of the math and science threads on this site. OH LOOK THE SMART PEOPLE ARE TALKING ABOUT THINGS.

gigitrix Dec 8, 2011 +19

**THE MEME CACHE IS UNSTABLE! IF WE DON'T ACT SOON WE WON'T EVEN BE ABLE TO "*SHUT. DOWN. EVERYTHING*"!**

somecallmemike Dec 8, 2011 +11

Haha, I like your definition better than what memcached actually does.

Jorgeragula05 Dec 8, 2011 +73

Cache all the memes!

That's how I feel reading textbooks.

[deleted] Dec 8, 2011 +31

Ha! Sometimes I think, "We're ... just going to go on to the next page here and hope that something stuck."

[deleted] Dec 8, 2011 +340

[deleted] Dec 8, 2011 +174

But what about the people *without* finals.

jc4p Dec 8, 2011 +260

Do you know how much I worked today?!?! Actually, not that much. But do you know what I had to do to waste time? TALK TO CO-WORKERS. I've learned some of their names! The horror :(

[deleted] Dec 8, 2011 +118

YEAH! I had to socialize with this cute girl, I ended up getting her number AND NOW WE'RE GOING OUT ON A DATE! The f*** is this shit? When I signed up to Listnook I signed my social and romantic life away, and I am dedicated to that cause.

monkeyx Dec 8, 2011 +67

> YEAH! I had to socialize with this cute girl, I ended up getting her number AND NOW WE'RE GOING OUT ON A DATE! This never happened.

[deleted] Dec 8, 2011 +42

[removed]

chamantra Dec 8, 2011 +15

Or was it disruptive durden? We will never know...

Howard_Campbell Dec 8, 2011 +2646

awesomekaptain Dec 8, 2011 +204

If that doesn't work, try unplugging it, waiting 10 seconds, then plugging it back in. Still not working? Oh, well f*** you then. Love, Comcast

rulsky Dec 8, 2011 +48

no, you're doing it wrong that's why it doesn't work.... you gotta unplug it for 30 seconds.

S_FrogPants Dec 8, 2011 +64

And if that doesn't work try licking it. I know it sounds crazy but trust me.

seagramsextradrygin Dec 8, 2011 +8

I figured this out when I was a kid, and when my brother saw me do it he was repulsed. He told me "You know if you do that 100 times, you die." I had no idea how many times I had done it already, but I completely believed him and this terrified me. From then on, I only did it when I *really* wanted to play.

apadula Dec 8, 2011 +7

This is exactly what I do as well! But everyone is always disgusted when I tell them.

rulsky Dec 8, 2011 +19

licking what? ಠ\_ಠ

PompousAss Dec 8, 2011 +24

You've got to lick it, before you stick it!

[deleted] Dec 8, 2011 +1518

**HIRE THIS MAN ADMINS! HE KNOWS HIS SHIT.**

[deleted] Dec 8, 2011 +33

FirstRyder Dec 8, 2011 +556

Ah, this is why you should leave IT to the professionals. This will never work. You have to turn it **off** and **on** again, not **on** and **off** again.

letsRACEturtles Dec 8, 2011 +384

on an unrelated note, are we going to be reimbursed for lost karma? i calculate my losses at 17,900 karma

FoxtrotBeta6 Dec 8, 2011 +148

Does that account for the Listnook Karma Inflationary Index? The incident created a huge downturn in the karma market resulting in a massive move to make up karma upon the return of the site. Although you lost karma during downtime, the likely karma inflation caused by the returning userbase likely compensated for the loss. Nonetheless, fill out form 47-Alpha and send it off to the admins.

letsRACEturtles Dec 8, 2011 +190

my grandfather didn't work in the dirty karma mines just so that i could go and lose everything i have in the karma markets... surely there must be some sort of... bailout... we, the listnookors, deserve

FoxtrotBeta6 Dec 8, 2011 +77

Pfft, only 28282 karma? Not until you reach 500,000 comment karma like the big boys high up in the Listnook hierarchy will you be able to get free karma. Get back to work prole, and don't you even think of protesting.

[deleted] Dec 8, 2011 +51

gotrees Dec 8, 2011 +14

Pssssh. You only have 12,500 comment karma. What a phoney.

FoxtrotBeta6 Dec 8, 2011 +56

I have 750,000 karma stored away offshore. It's the wave of the future.

philmardok Dec 8, 2011 +16

there is no bailout. your account is going to have to go into foreclosure. we'll all probably start getting calls from Bank of America soon.

ntr0p3 Dec 8, 2011 +3

>there is no bailout. your house and family are going to have to go into foreclosure. we'll all probably start getting calls from Bank of America soon. ftfy you should have been more responsible with your karma

TheyCallMeRINO Dec 8, 2011 +3

>Does that account for the Listnook Karma Inflationary Index? Wait - inflation? Is Listnook devaluing our karma by printing more karma and introducing it into the market through some sort of "karma easing"? End the FED!!

[deleted] Dec 8, 2011 +795

CtrlAltDemolish Dec 8, 2011 +44

Don't forget select and start, otherwise only one person will be able to use it.

pentium4borg Dec 8, 2011 +58

From the description of what they did to fix listnook, I think that's basically what they did.

[deleted] Dec 8, 2011 +34

Also, remove the battery for 20 - 30 seconds. That should do the trick.

KadruH Dec 8, 2011 +27

Guys... you forgot to unplug and replug the GODAMN PLUG!!!

[Relevant I.T. Crowd](https://www.youtube.com/watch?v=PtXtIivRRKQ).

swaggle Dec 8, 2011 +288

Make sure the channel's on AUX.

[deleted] Dec 8, 2011 +15

And check that RCA cable. It could be a little frayed right there where the thingie connects to the metal bits.

Legoandsprit Dec 8, 2011 +26

I thought it was channel 03? Maybe that's why I can't get it done.

BeliefSuspended2008 Dec 8, 2011 +401

I thought it had to be 3 or 4

[deleted] Dec 8, 2011 +268

Yep, we're old.

smile_e_face Dec 8, 2011 +90

Feels good.

axrael Dec 8, 2011 +23

yes if you were using an rf adapter it would. n64 did use vga tho *edit: i am being corrected in the comments, n64 had s video. thanks guys

sacwtd Dec 8, 2011 +20

Composite, you mean. VGA is a tad more complicated.

woofiegrrl Dec 8, 2011 +52

N64?! Why you whippersnapper!

Don't forget to buy [this](http://www.bestbuy.com/site/AudioQuest+-+Diamond+3.3%27+High-Speed+HDMI+Cable+-+Dark+Gray/Black/2383276.p?id=1218324437192&skuId=2383276), should help

doodleydoo Dec 8, 2011 +8

I really love how the admins feel obliged to notify us and really explain what happened. It's kind of like the company-wide emails I'd have to construct when a server crashed, or a database went haywire. I knew that most of it would sound like "flux capacitors" and "transmogrifiers" to the casual user but I felt better that *they* knew (or trusted) that I at least sounded like I knew what I was talking about.

I totally went out and passed a Cisco certification thanks to the downtime. Seriously.

theborgs Dec 8, 2011 +19

Just before the site went down, a lot of posts from /r/bondage showed up in the **default** RSS feed (http://listnook.com/.rss). They were not marked as NSFW. I personally don't give a f*** but I imagine some people (like people at work) don't like to have p**** links without any warnings. Can you explain why it happened and what correction you will take to make sure it won't happen again?

flyryan Dec 8, 2011 +8

Yep. I noticed this too. About 20 posts in there of chicks tied up. Thumbnails and all.

diamond Dec 8, 2011 +12

Some time tomorrow morning, just when it looks like everything is running smoothly, you'll realize that you have been running on backup generators for the last 12 hours. Then everything will come to a halt, and the velociraptors will get out, and OH MY GOD! AAAAAH! RUN!

>Memcached stores its entire dataset in memory, which makes it extremely fast, but also makes it completely disappear on restart. After restarting the memcached instances, our caches were completely empty. This meant that every single query on the site had to be retrieved from our slower permanent data stores, namely Postgres and Cassandra.

[Uhh huh, I see. That's what I thought happened.](http://www.youtube.com/watch?v=BVECpIxrR_0&feature=related)

[deleted] Dec 8, 2011 +13

Limerick time...

My cubicle mate, Mr. Kevin
Who logged on today on 12/7
He said, "yo, listnook's down"
and I said with a frown
"yea, it's been that way since 12:11"

ಠ\_ಠ

tophat02 Dec 8, 2011 +3

I REALLY think memcached needs a dump/restore feature. The official reason listed on the FAQ for why it isn't there is that non-persistence to disk is the whole reason memcache exists, but I think that ignores at least TWO very important use cases:

1. Situations like this. You run a huge site, you know you have to bring the whole memcached cluster down, and you're pretty sure the data itself in the cache isn't the problem. In this case, it would be nice to be able to do a "memcached -dump > somehugefile.dmp" and then load it back in with a "memcached -load < somehugefile.dmp". Maybe you could have a way to limit what gets dumped based on key name regexes or metadata just in case it would be toxic to restore some of the data.
2. Developers. I want to dump the contents of memcached to examine it in a text editor for errors. Or maybe I am maintaining a site that has to connect to a remote database and it takes FOREVER everytime I have to restart memcached for it to repopulate, so for the love of god why can't I just restore the previous state?

EDIT: To be clear, I completely agree that memcached persistence should not be a normal FEATURE. I just think it should be provided as a utility to be used when extenuating circumstances call for it.
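For anyone wondering what that dump/restore utility might look like, here is a minimal sketch. It assumes the cache can be treated as a plain Python dict of string keys to JSON-serializable values; `dump_cache`, `load_cache`, and the `key_pattern` filter are hypothetical names for illustration and are not memcached features or flags.

```python
import json
import re


def dump_cache(cache, path, key_pattern=None):
    """Write cache entries to a JSON file.

    If key_pattern is given, only keys matching that regex are dumped,
    which mirrors the "limit what gets dumped" idea from the comment.
    Returns the number of entries written.
    """
    matcher = re.compile(key_pattern) if key_pattern else None
    snapshot = {
        key: value
        for key, value in cache.items()
        if matcher is None or matcher.search(key)
    }
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(snapshot, fh)
    return len(snapshot)


def load_cache(cache, path):
    """Load entries from a dump file back into the cache.

    Returns the number of entries restored.
    """
    with open(path, encoding="utf-8") as fh:
        snapshot = json.load(fh)
    cache.update(snapshot)
    return len(snapshot)
```

A real tool would also have to deal with expiry times and binary values, which this sketch leaves out.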

Pravusmentis Dec 8, 2011 +25

# **MARK MY WORDS**

In 9 months from today there will be babies. So I thought you might like this: [The sleep-wake cycle of newborn human babies.](http://i.imgur.com/NRx6K.png)

But... it's listnook.

blackeagle613 Dec 8, 2011 +29

So basically [you tried turning it off and on again?](http://www.youtube.com/watch?v=nn2FB1P_Mn8)

madcowga Dec 8, 2011 +15

It's because I bought gold this week isn't it....knew it!

[deleted] Dec 8, 2011 +27

Now the joys of post-mortem debugging can begin! Enjoy the next week of hellish self-hatred.

throwaway123454321 Dec 8, 2011 +155

I almost went outside today... ಥ_ಥ (╯°□°）╯︵ ┻━┻

TeknOtaku Dec 8, 2011 +41

I was gonna but then I remembered - Google maps street view!

cpuenvy Dec 8, 2011 +77

Shit was close.

roy1990 Dec 8, 2011 +5

meanwhile shit got real on listnook's facebook page! I was there all night, refreshin' commentin' and likin'

KeytarVillain Dec 8, 2011 +55

TIL about /r/downtimebananas

oijoijoijasef Dec 8, 2011 +84

http://colonyworlds.com/wp-content/uploads/wereback.jpg

MatthiasII Dec 8, 2011 +483

homeless degree axiomatic toothbrush pet door hard-to-find consider fine selective *This post was mass deleted and anonymized with [Redact](https://redact.dev)*

It_does_get_in Dec 8, 2011 +38

"If you cache it, they will come". Kevin Costner Field of Listnooks.

OddAdviceGiver Dec 8, 2011 +2

I do memcache a lot, before it was "the thing" (slower servers back in the day, heavy traffic), and usually it was from collisions or bottlenecks at the wire/switch level that caused issues. A blast of too many requests and it'd start to spill over. At first it was null data, but then I put in a hook to put at least something in there to hunt for. Then I realized I could timestamp it. Probably not at the same scale.

One of the things I coded in, however, was the ability to be warned when it happens, and code to start wiping out entries right as it happened by using the timestamp. Yea, I timestamp the cache entries using an entry that looks strange to some, but I had the ability to do it from the start. Might take a while to run, but as it's running from a remote station, targeting and hitting the wipe from when the error started, normal cache can rebuild after whatever timestamp instead of the whole thing whacking the wires on a total rebuild.

I built my system from scratch, tho, so I know it's different than yours, but it was because it was all I had to keep a particular client afloat who couldn't afford resources yet was getting slammed with high spike peak traffic during a particular time of the year. It supports a million impressions a day, with the peak falling only within working hours during that time of year. They just couldn't afford pizza boxes or round-robin or clustering and the back-end SQL was always pegged, so this was a solution that I literally just gave them...

But sometimes it would crash and damn I share your pain. I think my biggest problem was some servers on a switch that was battling the old autosense war with another switch because of some f'd up routing rule or somesuch. But I remember those days of pain: wipe the cache, then omg shit just crawls for hours and hours and there's nothing you can do and you can't even hit the bar so you just sit and wait or watch BSG for an episode.

But I have maintenance and "watch" scripts that look out for the nulls and bottlenecks and alert, then I can either automate the partial wipes (instead of restarting) by direct memory address or do it manually; I still don't trust the automatic but I let it run when I'm on "vacation".
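The timestamp-and-partial-wipe idea described above could be sketched roughly like this. It is not the commenter's actual system: `TimestampedCache` and `wipe_since` are hypothetical names, the store is a plain in-process dict, and the injectable `clock` is only there to make the behaviour easy to check.

```python
import time


class TimestampedCache:
    """Toy cache that records when each entry was written."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._entries = {}  # key -> (written_at, value)

    def set(self, key, value):
        # Store the write time alongside the value.
        self._entries[key] = (self._clock(), value)

    def get(self, key):
        entry = self._entries.get(key)
        return None if entry is None else entry[1]

    def wipe_since(self, incident_start):
        """Drop only entries written at or after incident_start.

        Older entries stay warm, so the backing store only has to
        rebuild what was written during the bad window.
        Returns the number of entries removed.
        """
        suspect = [
            key
            for key, (written_at, _) in self._entries.items()
            if written_at >= incident_start
        ]
        for key in suspect:
            del self._entries[key]
        return len(suspect)
```

The point of the design is that a wipe scoped to the incident window avoids the cold-cache crawl a full restart causes.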

oorza Dec 8, 2011 +2

It's probably way too late into this thread for an admin to see this but... I've spent a lot of time and thought energy on the problem of memcache dependent sites like listnook (and a few other sites I've worked on). On the one hand, developing memcache dependent sites is incredibly easy and requires so little server hardware to operate at crazy volume. On the other hand, single points of failure are never good, and in a system as large as listnook is, I feel like they should be avoided at all costs.

Like I said, I spent a lot of time thinking about this problem and did eventually arrive at what I feel like is a perfectly acceptable solution. Keep in mind that I'm not sure what usage pattern listnook has against memcache or what you guys are doing to partition keys and whatnot, but the site that I was building for had roughly 10% write load against memcached, so the extra cost of writes wasn't significant.

What I wound up doing was writing a thin application that accepted memcache connections, then determined the request type. Any request that performed a write (SET, CAS, etc.) was reverse-proxied to both the memcache server *and* a memcachedb server. Read requests were just immediately reverse-proxied to the memcache server. The application had one other killer function: restoring a "backup." Once you had restarted your memcache server, you would issue another command that would request the values from the memcachedb server and set them in memcache.

I didn't finish working on it, but I had planned to do things like have it proxy key expiries against memcachedb (which at the time didn't support key expiration and I don't know if it still does or not), looking at key substrings for command, etc. I'm not sure if any of this is useful, but it's an idea I had.
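A rough sketch of that write-mirroring proxy, under heavy assumptions: both stores are stood in for by plain dicts, there is no networking or memcache protocol handling, and `MirroringProxy` and its method names are hypothetical rather than anything from the commenter's code.

```python
class MirroringProxy:
    """Sends writes to both a fast cache and a durable store.

    Reads go only to the fast cache; restore() refills the cache
    from the durable store after the cache has been restarted.
    """

    def __init__(self, cache, durable):
        self.cache = cache      # stand-in for the memcache server
        self.durable = durable  # stand-in for the persistent mirror

    def set(self, key, value):
        # Writes are mirrored to both backends.
        self.cache[key] = value
        self.durable[key] = value

    def get(self, key):
        # Reads never touch the durable store.
        return self.cache.get(key)

    def restore(self):
        """Copy every mirrored value back into the cache.

        Returns the number of entries copied.
        """
        self.cache.update(self.durable)
        return len(self.durable)
```

Usage after a simulated restart might look like this:

```python
proxy = MirroringProxy(cache={}, durable={})
proxy.set("front_page", "<html>...</html>")
proxy.cache.clear()   # the cache restarts and comes back empty
proxy.restore()       # values come back from the mirror
```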

damontoo Dec 8, 2011 +206

I don't know what to comment so here's [a picture of a pony](http://i.imgur.com/OxPdL.jpg).

dopplex Dec 8, 2011 +19

Pony? [](/b32 "This... wasn't what I was expecting!")

Lil' Sebastian! I love that f****** horse!

thatsnotthemike Dec 8, 2011 +151

Lil' Sebastian!

Cptn_Janeway Dec 8, 2011 +75

TREAT YO SELF!

nimofitze Dec 8, 2011 +12

That pony is Kurt Cobain.

the_mariner Dec 8, 2011 +56

this is why I love listnook: accountability.

[deleted] Dec 8, 2011 +40

[deleted] Dec 8, 2011 +36

Notice how alienth refused to blame it on Amazon by not even naming them: "Last night, our hosting provider had applied some patches to our instances [...]." Alienth is the definition of professionalism. That said, I don't think I trust Amazon yet.

TheyCallMeRINO Dec 8, 2011 +8

Unless I'm mistaken, Amazon doesn't patch their customer's server instances. They operate more like dedicated hosting than managed hosting. Which leads me to believe Listnook now has infrastructure somewhere other than EC2.

iamichi Dec 8, 2011 +17

I'm particularly fond of messages like the one I got today... "We have noticed that one or more of your instances is running on a host degraded due to hardware failure."

[deleted] Dec 8, 2011 +6

I'll be waiting to see a post like this nine months from now: "listnook was down 9 months ago...who just had a baby?"

josephanthony Dec 8, 2011 +3

"....in the rear of our main server, we found the remains of a hamster. It was dragging two feet of copper wire that was tied round its waist, and wearing a 4Chan t-shirt. There was a tiny gun still grasped in its paw, and an expression of triumph on its little face."

Zebidee Dec 8, 2011 +4

This is a free service, and you're apologising to *us* that it didn't work flawlessly for a couple of hours?!

We're back

196 Comments