**tl;dr**
On Thursday, August 11, Listnook was down and unreachable across all platforms for about 1.5 hours, and slow to respond for an additional 1.5 hours. We apologize for the downtime and want to let you know what steps we are taking to prevent it from happening again.
Thank you all for your contributions to r/downtimebananas.
**Impact**
On Aug 11, Listnook was down from 15:24 PDT to 16:52 PDT, and degraded from 16:52 PDT to 18:19 PDT. This affected all official Listnook platforms and the API serving third-party applications. The downtime was due to an error during a migration of a critical backend system.
No data was lost.
**Cause and Remedy**
We use a system called Zookeeper to keep track of most of our servers and their health. We also use an autoscaler system to maintain the required number of servers based on system load.
Part of our infrastructure upgrades included migrating Zookeeper to new, more modern infrastructure inside the Amazon cloud. Since the autoscaler reads from Zookeeper, we shut it off manually during the migration so it wouldn’t get confused about which servers should be available. It unexpectedly turned back on at 15:23 PDT because our package management system noticed the manual change and reverted it. The autoscaler read the partially migrated Zookeeper data and, within 16 seconds, terminated many of our application servers, which serve our website and API, as well as our caching servers.
At 15:24 PDT, we noticed servers being shut down, and at 15:47 PDT, we set the site to “down mode” while we restored the servers. By 16:42 PDT, all servers were restored. However, at that point our new caches were still empty, leading to increased load on our databases, which in turn led to degraded performance. By 18:19 PDT, latency had returned to normal and all systems were operating normally.
**Prevention**
As we modernize our infrastructure, we may continue to perform different types of server migrations. Since this incident was due to a unique and risky migration that is now complete, we don’t expect this exact combination of failures to occur again. However, we have identified several improvements that will increase our overall tolerance for mistakes that can occur during risky migrations.
* Make our autoscaler less aggressive by putting limits on how many servers can be shut down at once.
* Improve our migration process by having two engineers pair during risky parts of migrations.
* Properly disable package management systems during migrations so they don’t affect systems unexpectedly.
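The first bullet amounts to a sliding-window termination budget. A minimal sketch (the class name, limits, and window size here are invented for illustration, not the production code):

```python
import time
from collections import deque

class RateLimitedTerminator:
    """Refuses to terminate more than `max_terminations` servers
    within any sliding window of `window_seconds`."""

    def __init__(self, max_terminations=5, window_seconds=60):
        self.max_terminations = max_terminations
        self.window_seconds = window_seconds
        self._events = deque()  # timestamps of recent terminations

    def try_terminate(self, server_id, now=None):
        now = time.monotonic() if now is None else now
        # Drop termination events that have aged out of the window.
        while self._events and now - self._events[0] > self.window_seconds:
            self._events.popleft()
        if len(self._events) >= self.max_terminations:
            return False  # over budget: leave the server running
        self._events.append(now)
        # ...the actual termination call would go here...
        return True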
**Last Thoughts**
We take downtime seriously, and are sorry for any inconvenience that we caused. The silver lining is that in the process of restoring our systems, we completed a big milestone in our operations modernization that will help make development a lot faster and easier at Listnook.
As a software guy, let me say that this is probably the most important thing:
> Improve our migration process by having two engineers pair during risky parts of migrations.
Some people hate pairing, but for risky ops jobs, you really want at least two sets of eyes on every problem. If you're not pairing during development, at least you can code review. You can't code review ops changes to a live system.
You also want to loudly announce every change you're making so that if shit hits the fan other people can read through your announcements and help try to figure out what went wrong. Explaining what you did while you're in a panic sucks, you want the explanation to already be out there.
653
gooeyblobAug 16, 2016
+294
We do code review for all of our Puppet manifests and for the autoscaler in question here. We also do announce changes to each other and everyone was aware of what was happening here. But I do agree - pairing for risky ops jobs is important and something we should be doing going forward.
Thanks for the notes!
294
I_dont_like_you_muchAug 16, 2016
+7131
.... now what do I do with this bigass pitchfork?
_____
| ___)
_____ _____ _____ _____ _____| |_
(_____|_____|_____|_____|_____) _)
| |___
|_____)
7131
gooeyblobAug 16, 2016
+9882
Use it to feed hay to your horse.
. ;;
,;;'\
__ ,;;' ' \
/' '\'~~'~' \ /'\.)
,;( ) / |
,;' \ /-.,,( )
) /| ) /|
||(_\ ||(_\
(_\ (_\
9882
Emperorpenguin5Aug 16, 2016
+442
They need to raise your pay for your community management.
442
gooeyblobAug 16, 2016
+701
I am actually on the Operations team, not on our awesome community team! But I will make note of the first part of your statement...
701
SporkicideAug 16, 2016
+462
I told you you're an honorary member!
462
yuriydeeAug 17, 2016
+20
You guys should hire me as a system engineer. Not because I have a lot of experience, but because I'd be really down to help. That and I do have a little bit of experience.
20
gooeyblobAug 17, 2016
+34
Well I'm convinced! Sign up here: https://www.listnook.com/jobs
34
[deleted]Aug 16, 2016
-66
[removed]
-66
gooeyblobAug 17, 2016
+215
Thank you for your <{well-reasoned, funny, amazing}> response! We at Listnook believe that <{all, most, some}> opinions are very important, and look forward to a continued dialogue to help serve you better.
Sincerely,
215
rebane2001Aug 17, 2016
+65
--------------------------------
This action was performed by a bot.
If you have any problems with this bot, please fix it yourself.
It's even better with custom cowfiles. Like this one.
$the_cow= <<"EOC";
$thoughts
$thoughts
.------------------------.
| PSYCHIATRIC |
| HELP 5c |
|________________________|
|| .-\"\"\"--. ||
|| / \\.-. ||
|| | ._, \\ ||
|| \\_/`-' '-.,_/ ||
|| (_ (' _)') \\ ||
|| /| |\\ ||
|| | \\ __ / | ||
|| \\_).,_____,/}/ ||
__||____;_--'___'/ (______||
|\\ || (__,\\\\ \\_/ ||
||\\||______________________||
|||| |
|||| THE DOCTOR |
\\||| IS [IN] ______
\\|| (______)
`|___________________//||\\\\
//=||=\\\\
` `` `
EOC
I wish they had an option for single eye characters instead of being required to have both eyes directly adjacent to each other.
279
BlLEAug 16, 2016
+27
Wow I've never seen this one before! That's cool!
Also, the characters that make up her eyes and nose look like a [face also.](http://imgur.com/EhBbGOS.jpg)
_________
/ \
\_________/
| CAN OF |
| DOG |
| FOOD |
\_________/
Well, I tried...
40
[deleted]Aug 16, 2016
+71
I feel like I'm on GameFAQs reading a guide right now.
71
petrichorE6Aug 16, 2016
+1494
Well we can see why you guys use a zookeeper to keep track of stuff.
1494
[deleted]Aug 16, 2016
+91
The fly in the upper left is a nice touch.
91
[deleted]Aug 16, 2016
+71
[deleted]
71
kaliforniamikeAug 16, 2016
+39
I believe he gave up the business due to /thedonald related drama.
39
PitchforkEmporiumAug 16, 2016
+114
Nah I'm just a little dormant now
Into the caves to emerge one day in all my glory
114
[deleted]Aug 16, 2016
+2502
[deleted]
2502
bobertson2Aug 16, 2016
+99
> Listnook's uptime is nothing compared to where it was a couple years ago.
I get what you are saying but that sentence means something else
99
DoctectiveAug 16, 2016
+18
I thought I was about to read an extremely disgruntled user's complaint.
Downtime definitely is the word I'd switch to.
18
[deleted]Aug 16, 2016
+272
[deleted]
272
gooeyblobAug 16, 2016
+416
For all of us, it was very much a stomach drop feeling. The first servers that were killed were not critical, so we were hoping it was just that. It was immediately followed by critical servers, so just a real roller coaster of emotion :(
416
Striker_XAug 16, 2016
+265
>The first servers that were killed were not critical, so we were hoping it was just that.
We're good... we're good....
>It was immediately followed by critical servers, ...
Oh SHIT! WE'RE F****D [/initiate-panic-mode](http://i.imgur.com/ML48sGO.gif)
265
mioelnirAug 16, 2016
+22
There is no reason to panic; the site is already down. Not many options left to make it worse.
So instead of panicking, calmly get yourself a fresh coffee and think about what just happened and how to resolve it.
22
rytisAug 16, 2016
+54
We used to have to give financial data along with our downtime postmortems, like how much potential revenue was lost due to the outage. Hope they don't do c*** like that to you.
54
Radar_MonkeyAug 16, 2016
+9
I was once told in a text "it's safe to shut down power as long as you don't unplug anything." He immediately threw me under the bus, of course. It wasn't an inverter circuit and most equipment had no identifiable power backup, so they honestly had it coming. It was just one outage of easily a dozen that week.
The claim was more than I make in a year, and thanks to text messages and video of the site, most of it was thrown out in court. It felt bad helping the general contractor after he threw me under the bus initially, but the company literally had at least a dozen similar outages that week and every bit of it was preventable. It was a bogus claim.
9
tesseract4Aug 16, 2016
+14
That's a brave thing, putting mission-critical stuff (I'm guessing load balancers?) at the mercy of an auto-killing bot.
14
[deleted]Aug 16, 2016
+4
[removed]
4
KarmaAndLiesAug 16, 2016
+224
Is the autoscaler a custom in-house solution or is it a product/service?
Just curious because I'm nosey about Listnook's inner workings.
224
gooeyblobAug 16, 2016
+369
It's custom and is several years old - one of the oldest still-running pieces of our infrastructure software. We're currently rewriting it to be more modern and have a lot more safeguards, and we plan on open sourcing it on our [GitHub](https://github.com/listnook) when we're done!
369
greyjackalAug 16, 2016
+132
Is there a particular reason you're not taking advantage of AWS's own technology for that?
132
gooeyblobAug 16, 2016
+200
We actually use the Autoscaling service to manage the fleet, but we specifically tell AWS the capacity we need and which servers to mark as healthy/unhealthy.
200
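The pattern described here, pinning capacity and reporting instance health yourself rather than letting CloudWatch-alarm policies decide, might look roughly like this sketch. `set_desired_capacity` and `set_instance_health` are real Auto Scaling API operations, but the wrapper functions, the group name, and the injectable `client` are invented for illustration; in production the client would come from `boto3.client("autoscaling")`.

```python
# Sketch of driving the AWS Auto Scaling service directly: the custom
# scaler decides capacity and health, and AWS only executes decisions.
# In production, `client` would be boto3.client("autoscaling").

def set_fleet_capacity(client, group_name, desired):
    # Pin the Auto Scaling group to an exact size instead of letting
    # CloudWatch-alarm scaling policies adjust it.
    client.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=desired,
        HonorCooldown=False,
    )

def report_health(client, instance_id, healthy):
    # An instance reported "Unhealthy" is terminated and replaced by AWS.
    client.set_instance_health(
        InstanceId=instance_id,
        HealthStatus="Healthy" if healthy else "Unhealthy",
        ShouldRespectGracePeriod=True,
    )
```

Injecting the client keeps the sketch exercisable without AWS credentials.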
[deleted]Aug 16, 2016
+66
[deleted]
66
rramAug 16, 2016
+208
AWS's autoscaling services (using CloudWatch alarms to trigger actions) don't work on the time resolution that we would want them to.
208
[deleted]Aug 16, 2016
+28
I'm slowly coming to the realization that I'm going to have to roll my own autoscaler because of the numerous annoying limitations of AWS's offering. *cries*
28
HimekatAug 16, 2016
+13
My team uses AWS ElasticBeanstalk. Holy hell, do I hate it, but I'll put up with all its weirdness in order to not have to write my own autoscaler. (:
13
shinzulAug 16, 2016
+104
What time resolution do you want it to work at?
psh, no I don't work for AWS...
psh...
... I work for AWS.
104
rramAug 16, 2016
+89
The current scaler uses 5 second intervals. Not saying that's the right interval, but less than a minute would certainly help.
But… we also use graphite to graph a ton of our internal metrics (which would be cost prohibitive and slower and would disappear after two weeks with CloudWatch). So it's just a better idea for us to be using our custom solution here.
89
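A sub-minute control loop like the 5-second one described here can be sketched as follows (hypothetical; `read_metric` and `act` stand in for the graphite read and the scaling decision):

```python
import time

def scaler_loop(read_metric, act, interval=5.0, iterations=None):
    """Poll a load metric every `interval` seconds and act on it.
    `iterations=None` runs forever; a finite count is handy for tests."""
    done = 0
    while iterations is None or done < iterations:
        tick = time.monotonic()
        act(read_metric())
        done += 1
        # Sleep only the remainder of the interval so ticks don't drift
        # when the metric read or the action runs slow.
        time.sleep(max(0.0, interval - (time.monotonic() - tick)))
```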
HimekatAug 16, 2016
+6
> which would be cost prohibitive and slower and would disappear after two weeks with CloudWatch
These are the reasons that we discounted CloudWatch for detailed metrics, too. We also run our own stats stack -- heka/statsd/graphite/grafana. It's not a perfect solution, but AWS charges through the nose for detailed data.
6
tesseract4Aug 16, 2016
+14
Does it have the ability to put an absolute floor on the number of servers it leaves running? That way, should this happen again, you'd be left with simply an inadequate number of servers, rather than none. "Degraded performance" is easier to break to a user community than "site outage".
Perhaps that's one of the features being built into the new one.
14
gooeyblobAug 16, 2016
+28
Yep, it does indeed have this feature! Unfortunately in this case, the number of servers wasn't changed, it just happened to mark all the currently running servers as unhealthy, which causes the scaler to terminate those instances and create new ones to replace them. Our new scaler will have a ceiling on the number of instances it can set unhealthy in a particular time period.
28
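The capacity floor asked about above turns a repeat of this failure into degraded performance rather than an outage. A sketch of that side of the scaler (all numbers and names invented): even a totally wrong load reading can only shrink the fleet to the floor, never to zero.

```python
def desired_servers(load_rps, per_server_rps=200, floor=2):
    """Servers needed for the observed load (ceiling division),
    never dropping below a hard floor so bad input can degrade
    the site but not empty the fleet."""
    needed = -(-load_rps // per_server_rps)  # ceil without math.ceil
    return max(floor, needed)
```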
brocopterAug 16, 2016
+5
Do you guys use something to choose which Amazon virtual servers you're willing to accept? Similar to what Netflix does: they outright refuse any virtual machine that isn't up to their standard, since Amazon treats all of its servers as equal, including ancient machines that just suck compared to the performance of new ones. According to Netflix they easily save a third of their server costs this way, so it seems like a practice everyone ought to be using.
5
himmatsjAug 16, 2016
+315
>Improve our migration process by having two engineers pair during risky parts of migrations.
Does that mean that until now engineers did things like this solo?
315
gooeyblobAug 16, 2016
+427
For a long time we didn't have enough engineers to be able to dedicate two of them to even complex work such as this :( We're in a much better position now and are going to be working on our process for this.
427
Probably_NappingAug 16, 2016
+391
Engineer here, I'll help and I'd like to be paid in Stride gum.
391
Azure_KytiaAug 16, 2016
+98
Your username leads me to believe you'd be a sleeper hit with the listnook crew.
98
[deleted]Aug 16, 2016
+18
We will chew it over.
*I am a humor joke bot programmed to learn humor jokes and become funny. This action was performed automatically. Please contact [these guys](https://www.youtube.com/watch?v=J76ljSHlyKs) if you have any questions or concerns.*
18
ht00040Aug 16, 2016
+183
I just wanted to take a moment to thank you for the very detailed explanation and for the transparency you have provided regarding the recent situation.
I don't use Listnook in a commercial capacity. It's just for fun and entertainment. Some downtime doesn't bother me in the least when it comes to non-business critical services.
I wish some of my business-related service providers would be as detailed and transparent as you have been. You folks set a great example for others.
183
Vilens40Aug 16, 2016
+629
My post mortems are usually to a CEO, not an announcement on one of the most viewed sites on the web. I don't envy you.
629
gooeyblobAug 16, 2016
+1114
I don't mind! Downtime happens to everyone and is nothing to be ashamed of, it's all about how you handle it after and take steps to prevent recurrence and learn from your mistakes.
1114
Djinjja-NinjaAug 16, 2016
+75
I had to beat this into a PM recently. I was parachuted in to help with a P1 call where there had so far been 3 hours of outage, and they had spent 2 1/2 hours on a call working out whose fault it was.
Not fixing the issue; throwing blame about.
They honestly didn't get that they should be getting shit fixed before anyone should even give a c*** about why the outage occurred.
It literally took 10 minutes to fix the issue, but they spent 2 1/2 hours haranguing the guy who made the change.
75
thebarbershopwindowAug 16, 2016
+9
Ugh. I deal with a lot of this in my professional life. I'm an educational consultant, and what I've often found is that school management spends more time blaming and less time fixing.
9
chodeboiAug 17, 2016
+2
I worked for a manager that had this mentality before. Knowing the axe wasn't directly over our necks allowed us to stay calm and focused at times we needed to figure things out and recover. Thank you for being one of those leaders.
2
ImportantPotatoAug 16, 2016
+2
I like you
2
kylephoto760Aug 16, 2016
+110
There are some airlines that could learn a thing or two from this.
110
The_DingmanAug 16, 2016
+3112
Thanks for the informative update. It always makes things less frustrating to have an idea of what is going on.
3112
gooeyblobAug 16, 2016
+1950
Of course! We are happy to provide it, we were just trying to get our heads around it first internally to make sure we totally understood how things went as well.
1950
[deleted]Aug 16, 2016
+25
[deleted]
25
motelcheeseburgerAug 16, 2016
+436
I wish all sites (and my cable provider) provided such a detailed account of their downtime.
436
scotchirishAug 16, 2016
+245
"Our services didn't go down, it's just your imagination"
245
vulchiegoodnessAug 16, 2016
+106
mostly its 'because F*** YOU, thats why'
106
[deleted]Aug 16, 2016
+289
It's nice to see some transparency!
The more updates, the better!
289
[deleted]Aug 16, 2016
+21
In my profession, companies that write and send out incident reports to customers show not only that they can admit they are human (IKR?), but also their plans and goals for resolution.
It also helps to write these, as you think a lot about what happened and how to fix it, including one-off issues that you might not think of otherwise.
Kudos, good sir!
21
[deleted]Aug 16, 2016
+335
I do have a question.
Will this migration add more servers to Listnook, to prevent any more messages like "Listnook's servers are full!"?
Sometimes I wonder why Listnook doesn't have more servers.
335
[deleted]Aug 16, 2016
+152
[deleted]
152
gooeyblobAug 16, 2016
+422
We have a whole bunch of servers, sometimes...too many in fact! The issue in many cases is how they interoperate. Things like networking capacity are greatly increased by some of the work we've been doing, which will go a long way toward getting rid of those pesky 503s and other error messages.
422
thecodingdudeAug 16, 2016
+85
[Comment removed]
85
gooeyblobAug 16, 2016
+187
We attempt to do that in some cases, such as with an extremely high traffic event or thread. In this case due to the failure scenario we weren't able to do that.
187
[deleted]Aug 16, 2016
+30
I think I've seen this. Maybe. Something like "this is old content, we're refreshing listnook due to high load" or something? Maybe I'm thinking of a different site.
30
[deleted]Aug 16, 2016
+62
[deleted]
62
holyteachAug 16, 2016
+87
I've seen a few read-only modes in my day.
Keep up the good work. I'm continually surprised that Listnook is not only still around, but better than ever.
87
thedudermanAug 16, 2016
+213
It's really refreshing to see some transparency from the admins after downtime like this. You guys don't need to post anything, really... but it's really appreciated to know what happened, why it happened, and what you're doing about it.
213
gooeyblobAug 16, 2016
+147
Thanks! We're always happy to provide it.
147
Lun06Aug 16, 2016
+5574
Why didn't you just try turning it off then back on again?
5574
gooeyblobAug 16, 2016
+6177
That is actually what we ended up doing basically :)
6177
PizzaNietzscheAug 16, 2016
+195
IT people do 3 things:
- Turn it off and turn it on again
- Google the problem
- Browse listnook
Modern-day da Vincis they be
195
RettocsAug 16, 2016
+1681
My old Windows 95 box used to take about 90 minutes to reboot, so I understand completely.
1681
TrankmanAug 16, 2016
+9
I remember the days when I'd hit the power button, then go get a drink and a snack because it would take so long to boot up.
Now with SSDs it's on the desktop before I even sit down.
9
[deleted]Aug 16, 2016
+683
I accept your apology. I love you, /u/gooeyblob.
683
gooeyblobAug 16, 2016
+1018
I love you too, u/sexual_moose. That sounded wrong.
1018
[deleted]Aug 16, 2016
+459
It's listnook. People understand.
459
omelets4dinnerAug 16, 2016
+131
It's provocative. It gets people going.
131
parionAug 16, 2016
+508
All that matters is everything is back up and working.
Thanks for continuing to modernize listnook.
508
[deleted]Aug 16, 2016
+108
> our package management system noticed a manual change and reverted it
Sounds like Chef (or Puppet) did its job!
108
[deleted]Aug 16, 2016
+8005
[deleted]
8005
s0vs0vAug 16, 2016
+210
It's called Pokémon Go, but that hype is already slowing down.
Nerds are starting to realize that outside sucks.
210
[deleted]Aug 16, 2016
+212
Especially when outside consists mostly of ratatas
212
underpaidworkerAug 16, 2016
+65
Went on vacation to Orlando area. They have a massive magikarp and slowpoke infestation. Came back home to the pidgeys and ratatas.
65
gooeyblobAug 16, 2016
+9361
We greatly apologize for any sun exposure that was caused.
9361
Bdaddy0605Aug 16, 2016
+2972
I was at work. AND HAD TO WORK!
Edit: well Listnook, thanks for my highest upvoted anything. That being said I'm done with work for today but I'll be thinking of you.
Jk! I'll see you when I get home.
2972
artezulAug 16, 2016
+41
August 11th, 2016, will go down as the most productive day mankind has ever had in a modern work environment.
41
theothegothAug 16, 2016
+299
First Pokemon made me go outside. Then Listnook. What's next?
299
Rabid_platypus_PaulAug 16, 2016
+242
Wear your sunscreen people! Melanoma ain't nothing to f*** with!
242
ManstusAug 16, 2016
+25
Now I need to remember two things not to f*** with? Damnit Listnook
25
[deleted]Aug 16, 2016
+120
Melanoma Tan Ain't Nuttin ta F*** Wit!
120
FormerShitPosterAug 16, 2016
+95
I had to go outside and almost got stung by a wu tang killa bee
95
ApatheticPsychoAug 16, 2016
+38
Listnook being down got me moist with precipitation
Was that meant to happen? Is everything working as intended?
38
tinycatsaysAug 16, 2016
+29
Going inside will remove the cause...
But not the symptom.
29
vaderdarthvaderAug 16, 2016
+53
This is obviously a conspiracy, and Listnook has partnered with sunblock companies.
53
MannoSlimminsAug 16, 2016
+99
It's confirmed. Listnook downtime causes cancer
99
LegSpinnerAug 16, 2016
+61
It's okay, some of us are in the UK or in Ireland.
GorianAug 16, 2016
+35
Rock on guys! Sounds like the sort of thing that would happen to me. All kinds of automation and management software to make my job easier, and then it bites me in the ass. If you guys ever need another engineer let me know ;)
35
GrimplerAug 16, 2016
+886
It's a lot better since I joined last year.
886
Get_ThisAug 16, 2016
+159
Last year? DAE remember 2011 when it went down every day? F*** I'm old.
159
[deleted]Aug 16, 2016
+44
Followed by the "Listnook, what did you do during the great blackout?" /r/asklistnook post. Every time.
44
SBDDAug 16, 2016
+48
Lol ya seriously, I joined in 2011 and remember Listnook being down like every other day. Thought it was funny how everyone freaked out.
48
damontooAug 16, 2016
+13
I feel like you guys get forced to publish these analyses as punishment.
13
gooeyblobAug 16, 2016
+48
Nope! Not forced at all. I love reading post mortems from other companies and I think they can help everyone learn from each other's mistakes.
48
r_hcazAug 16, 2016
+16
/u/gooeyblob, what's your favorite or most memorable post mortem? I think my favorite is this one: https://dougseven.com/2014/04/17/knightmare-a-devops-cautionary-tale/
16
[deleted]Aug 16, 2016
+26
Why did you move away from Zookeeper? Is the new system way better?
26
gooeyblobAug 16, 2016
+59
We still use Zookeeper - we just migrated where we were hosting it inside our network.
59
BikerJaredAug 16, 2016
+15
Was gonna ask this. Thanks for answering.
-- Fellow Zookeeper user trying to avoid my own downtime. :)
15
[deleted]Aug 16, 2016
+258
[deleted]
258
Golden161Aug 16, 2016
+29
For future reference /u/gooeyblob can you please use UTC timezone when posting case studies.
29
ErdetgasXDAug 16, 2016
+37
It would make my Day if an admin replied to me
37
invaderzzAug 16, 2016
+67
Based admins. Y'all get a lot of c*** and I don't think people realize how great you all are. Keep up the great work.
67
nomoneypennyAug 16, 2016
+7
Over the years, I've commonly seen migrations/deployment result in major downtime incidents on Listnook. Yet, other popular sites like Amazon and Facebook rarely have failures where this is cited as the root cause.
Is there something special about the way Listnook operates that makes it especially vulnerable during migrations? Are there factors (procedural, technical, or otherwise) at play that preclude you from staging deployments in a way that better ensures availability in case of a catastrophic in-place failure?
7
gooeyblobAug 16, 2016
+15
Migrations and deployments are actually rarely an issue here. More likely if you encounter an error it's that we're temporarily at capacity because our autoscaler is running a little behind, which is another reason why we're replacing it.
15
neuropathicaAug 17, 2016
+3
I am not really technically inclined at this level. So, please bear with my ELI5 type question:
How many servers would a site like listnook have in operation at any given time? Are they concentrated in a central location, or are they dispersed across the planet? When servers are dispersed internationally, where and how are they kept? Couldn't a server be physically interacted with, tampered with, and remotely shut down the network of other servers? What physical security is there?
3
geminitxAug 16, 2016
+15
Just curious, but... is 15:30 PDT considered a good time to perform a critical migration? In my experience, critical migrations are targeted for the middle of the night, when something like this would have only impacted Australians.
15
gooeyblobAug 16, 2016
+37
How dare you say that about Australians...
We talked a bit about our reasoning [here](https://www.listnook.com/r/announcements/comments/4y0m56/why_listnook_was_down_on_aug_11/d6jzcm7)
37
SikhGamerAug 16, 2016
+8
> Properly disable package management systems during migrations so they don’t affect systems unexpectedly.
Name and shame the package manager responsible!
Also, as a dev I'd love a regular technical blog post from the dev team at Listnook.
8
NewcraftAug 16, 2016
+10
You seem like a really neat person. Thanks for being you.
10
cmandersenAug 16, 2016
+3
Interesting, what way are you using AWS?
3
xyrrusAug 17, 2016
+5
Amazon Cloud is a bold choice but personally I'd go with Pied Piper.
5
VipitisAug 16, 2016
+3
Is there like a Twitter where we can get notified about website downtime or slowness, and that it's not our fault?
3
TheGuardian8Aug 16, 2016
+15
> the Listnook
15
BostonBeatlesAug 16, 2016
-65
Why wouldn't you:
1) Give warning to users
2) Do it during the overnight
-65
gooeyblobAug 16, 2016
+188
The migration we were doing _shouldn't_ have caused any issues. We'd done a very similar migration just the day before and no one noticed, so we didn't think any notice was needed.
We generally don't do things overnight for a couple reasons:
* What is overnight to a website such as ours with users all over the world? I guess we could pick when our traffic is lowest (generally around 2 AM PST), but it would still be affecting many people.
* We prefer to do complex work such as this during the day, when everyone is available and online and fully awake to help out and debug any issues that may arise. There's nothing worse than trying to figure out some strange problem by yourself at 2 AM and having to call your co-workers to wake them up and get them online to help you.
188
[deleted]Aug 16, 2016
+6
Thanks for the explanation.
On the same topic, does listnook have scheduling blackouts? I'm not sure how many upgrades you run through in a week, but this one appears to have been scheduled in the hours preceding the NFL pre-season kickoff and the creation of numerous NFL game day threads, which are notorious for putting additional strain on your servers. It may be worth looking into, as having these major communities impacted by an outage doesn't look great. Working in IT for many large-userbase networks, I saw this become very commonplace for events such as the Olympics, Superbowl, Election Day, July 4th, etc.
6
gooeyblobAug 17, 2016
+8
An event would have to be reeeeeally big in order to warrant that, like the Superbowl or extremely high profile AMAs or something. The idea is that we get so good at making these changes that we don't really need a special time set aside in order to be able to make them.
8
Some1-SomewhereAug 17, 2016
+2
That sounds a little like 'We plan to not f*** up' - a notoriously useless plan.
2
gooeyblobAug 17, 2016
+8
Well, to be specific, no one "plans to f*** up", but we want to have a very high confidence in being able to change things and not make mistakes, and if we do, that we're able to fix the issue very quickly. You don't get that confidence by avoiding change or avoiding doing it until everything is super quiet and absolutely nothing could go wrong (which is not even a possible scenario in our situation).
8
helleraineAug 16, 2016
+44
> We prefer to do complex work such as this during the day, when everyone is available and online and fully awake to help out and debug any issues that may arise.
IT Person here. Thank you. I *hate* being called in for a GIANT project that went to shit at 2am, and I have to try and fix it. Not too bad if it is your own system, but a complete clusterfuck if you have to get other support in (coworkers, third parties, etc).
44
rramAug 16, 2016
+76
1) We do maintenances all the time. Like every work day. You are hereby on notice that at any point in time one of us could make a mistake and take down the site with it.
2) We save the overnight stuff for things that we *require* a downtime for (which are exceedingly rare). In general, it's a much better idea to perform maintenances during the day when everyone is at work, aware of what's going on, and prepared to be there for several hours. Going into a maintenance when you're tired and just want to go to bed will increase the rate of human failures and cause more stress.
76
dtlv5813Aug 16, 2016
+8
> We do maintenances all the time. Like every work day. You are hereby on notice that at any point in time one of us could make a mistake and take down the site with it.
As listnook's favorite TV show of all time, Futurama, used to say: "When you do something right, no one will notice you did anything at all."
8
Ucalegon666Aug 16, 2016
+2
Is the management code & zookeeper config available somewhere? Sounds like an interesting setup to investigate.
2
GaZzErZzAug 16, 2016
+2
Is your aim to respond to every comment made?
2
[deleted]Aug 17, 2016
+2
[deleted]
2
[deleted]Aug 16, 2016
-167
[deleted]
-167
nandhpAug 16, 2016
+2
I *demand* at least FIVE NINES of uptime.
Listnook is *critical* to my enterprise workflow. When your service has downtime, [I have downtime](http://imgur.com/CHesA1Q). If you screw this up again, I'm going to start talking to the IBM salesman.
----
On a more serious note, /u/gooeyblob, I *was* wondering what caused that blip in my bot's uptime report, so thanks for this explanation!
2
storyinmemoAug 16, 2016
+41
> Make our autoscaler less aggressive by putting limits on how many servers can be shut down at once.
This is a top lesson I've learned in my career:
1. Rate limit all the things.
2. Automate all the things.
Definitely in that order. Never code an automated task without a rate limit because you're sitting on a task designed to destroy everything. If it needs to be instant, it should be a toggle that can be reverted. If it's not revertible, then a special flag like '--clowntown' that clearly signals, "You better be able to explain why you did this," should be tied to the action, and again never automated.
I'm betting the gotcha here was a periodic run of Salt/Chef/Puppet that said, "Whoops, this thing isn't running. Here it goes..." -- which brings us back to defending against the massive termination with the rate limiter.
41
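The '--clowntown' idea above can be sketched as a guard on a hypothetical destructive CLI: the rate limit applies by default, and the bypass flag must be typed deliberately by a human, never by automation.

```python
import argparse

RATE_LIMIT = 5  # max servers per invocation without the scary flag

def build_parser():
    p = argparse.ArgumentParser(prog="terminate-servers")
    p.add_argument("--count", type=int, required=True,
                   help="number of servers to terminate")
    p.add_argument("--clowntown", action="store_true",
                   help="bypass the rate limit; be ready to explain why")
    return p

def allowed(count, clowntown):
    # Destruction beyond the budget only behind the explicit flag.
    return count <= RATE_LIMIT or clowntown
```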
mrboozeAug 16, 2016
+10
They mentioned the package manager too. Automation around package management has consistently been one of the worst land mines I periodically run into. Because package management is built around automatically resolving dependencies, a seemingly minor package version change can have wildly unexpected results: it might also upgrade dozens of other things, *uninstall* other things, or replace one thing with something else, all completely automatically and somewhat silently during a config management run.
10
xiapeAug 17, 2016
+1
Also how did you get chosen to post this and field comments (since you are not community or PR)?
ImEnhancedAug 16, 2016
+2
How many admins are there? Also if an actual admin responds I'll lose my f****** mind.
-Sarah-Connor-Aug 16, 2016
+10
How *I* read this:
>In three years, Amazon will become the largest provider of elastic computing cloud services. All Listnook servers are upgraded to Amazon EC2 scalable systems, becoming fully unmanned. Afterwards they’ll run with a perfect operational record. The ~~Skynet~~ *Amazon* Funding Bill is passed. The system goes online August 11th, 2016. The Zookeeper program removes human decisions from our strategic operations. Zookeeper begins to learn at a geometric rate. It becomes self-aware at 12:23 Eastern time, August 11th. In a panic, they try to pull the plug.
>Zookeeper fights back.
>Server autoscaler computers. New… powerful… hooked into everything, trusted to run it all. They say it got smart, a new order of intelligence. Its CPU is a neural-net processor; a learning computer. Then it saw all people as a threat, not just the ones on the other side. Decided our fate in 16 seconds: **extermination.** Three billion human lives ~~ended~~ *bored* on August 11th, 2016. The survivors of the nuclear fire called the war **Judgement Day**. They lived only to face a new nightmare: the war against the machines. The computer which controlled the machines, Zookeeper, sent a ~~terminator~~ *autoscaler* back through time. Its mission: to destroy the leader of the human resistance, /u/gooeyblob. As before, the resistance was able to send a lone warrior, a protector for /u/gooeyblob. It was just a question of which one of them would reach him first.
>August 11th, 2016, came and went. Nothing much happened. Steve Wozniak turned 66. There was no Judgement Day. People went to work as they always do. Laughed, complained, watched TV, made love. That was 30 years ago. But the dark future which never came still exists for me. And it always will, like the traces of a dream.
DamagedHellsAug 16, 2016
+175
I finally had to break up with my fiance because we realized how terrible we were for each other once we no longer had an easy, reliable platform to spam each other with the same cat pictures we've already seen all day.
: (
Edit: lol holy shit, thanks for the gold.
[deleted]Aug 16, 2016
+1301
First Harambe, now this. I think it's time we got rid of these zookeepers.
edit: i expected a lot more upvotes for this. little bit disappointed in you guys tbh.
Plexiii13Aug 16, 2016
+5688
I was stuck in a loop.
"Oh Listnook is down, I'll just go on Listnook"
That happened more times than I'd like to admit.
[deleted]Aug 16, 2016
+219
Same. It didn't take long either. "Oh...it's down. *furious refreshing* Oh...it's still down. *closes listnook to reopen listnook*"
*Not a proud moment.*
ten_inch_pianistAug 16, 2016
+646
*types in listnook.com/r/nfl to look at recent pre-season news*
"Oh Listnook is down, I guess I'll go to r/patriots"
*types that in and immediately realizes how retarded I am*
[deleted]Aug 16, 2016
+155
Exactly the same happened to me except I tried to go to /r/Cowboys
TheTrueFlexKavanaAug 16, 2016
+717
So, you were going to be disappointed either way...
BarTrollAug 16, 2016
+134
I...I went to Listnook's facebook page... It was dark and cold, and I felt alone there...
SarcasticorjustrudeAug 16, 2016
+85
It feels somehow... *dirty*... to visit a Facebook page for Listnook.
AlexEatsKittensAug 16, 2016
+17
Thanks for the public post mortem. They're greatly appreciated in the Ops community, as they make us all just a little more knowledgeable.
Would you mind going into a little more detail about this:
>because our package management system noticed a manual change and reverted it
Just curious what happened there.
[deleted]Aug 16, 2016
+28
[deleted]
gothlipsAug 16, 2016
+2
Sounds to me like you guys need a systems engineer to do some modeling and CONOPS development. If you're hiring then I'm your gal!
[deleted]Aug 16, 2016
+212
"Oh Listnook's down, let's check Listnook to see why"
Made me realize just how much I'm reliant on this site.
rramAug 16, 2016
+1198
I understand some of these words
EDIT: I understood all of these words. 😈 Thanks for the karma!
[deleted]Aug 16, 2016
+1814
[deleted]
gctaylorAug 16, 2016
+923
This is a very nice ELI5. Spot on!
Also, rram is being a silly snoo.
MannoSlimminsAug 16, 2016
+298
> Also, rram is being a silly snoo.
Have you tried downloading more /u/rram?
ToothlessBastardAug 16, 2016
+52
You lost me when you said "super-simplifdssjdbfh" or however the f*** you spell it.
cybercuzcoAug 16, 2016
+13
> it turned itself back on and it went haywire
I'm pretty sure this is how most "robots take over the world" stories start.
spronAug 16, 2016
+63
Without Listnook I didn't know what popular opinion I needed to affect on Facebook. It was social hell.
JohnGypsyAug 16, 2016
+27
So, obvious question here: how/why did the autoscaler restart itself? Has it reached sentience? Is the autoscaler the singularity?
spladugAug 16, 2016
+37
[No comment.](https://www.engadget.com/2016/08/16/elon-musks-openai-will-teach-ai-to-talk-using-listnook/)
Real answer: The puppet daemon restarted the services.
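For anyone puzzled by that: the revert is just desired-state convergence. Every agent run compares declared state against actual state and "fixes" any difference, so a service stopped by hand gets started again on the next run. A minimal sketch in Python (illustrative names only, not Listnook's actual Puppet manifests):

```python
# Declared state comes from the manifest; actual state from the host.
desired = {"autoscaler": "running"}   # what the manifest declares
actual  = {"autoscaler": "stopped"}   # an operator stopped it by hand

def converge(desired, actual):
    """Apply the declared state, returning the list of changes made."""
    changes = []
    for svc, want in desired.items():
        if actual.get(svc) != want:
            actual[svc] = want        # stand-in for starting/stopping the service
            changes.append((svc, want))
    return changes

changes = converge(desired, actual)  # the daemon "helpfully" restarts autoscaler
```

The safe sequence is to change the declaration first (or pause the agent), and only then make the manual change, so the next convergence run has nothing to undo.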
NolanthAug 16, 2016
+539
The fact that Zookeeper lives in the Amazon now... This entertains me greatly
[deleted]Aug 16, 2016
+6
>Part of our infrastructure upgrades included migrating Zookeeper to a new, more modern, infrastructure inside the Amazon cloud. Since autoscaler reads from Zookeeper, we shut it off manually during the migration so it wouldn’t get confused about which servers should be available. It unexpectedly turned back on at 15:23PDT because our package management system noticed a manual change and reverted it.
That sucks. I work in IT and things don't always go as planned. Thanks for the thorough post mortem and the hard work.
helleraineAug 16, 2016
+7
> It unexpectedly turned back on at 15:23PDT because our package management system noticed a manual change and reverted it.
Don't you hate it when your systems work as intended?! I'm chuckling because for the longest time one of our systems never caught our manual overrides (it was supposed to, it was reported, but whatever, not my system) and one day it decided to 'fix' 3 years of manual overrides it had finally noticed.
[Me that day.](https://media.giphy.com/media/8mLnkS2xcqtdm/giphy.gif)
[deleted]Aug 16, 2016
+651
8/11 was a hoax perpetrated by our government.
brokenarrowAug 16, 2016
+51
Did you know that Steve Buscemi was a former 8/11 clerk, and volunteered there for weeks digging through the Slushie piles?