**tl;dr**
On Thursday, August 11, Listnook was down and unreachable across all platforms for about 1.5 hours, and slow to respond for an additional 1.5 hours. We apologize for the downtime and want to let you know what steps we are taking to prevent it from happening again.
Thank you all for your contributions to r/downtimebananas.
**Impact**
On Aug 11, Listnook was down from 15:24 PDT to 16:52 PDT, and degraded from 16:52 PDT to 18:19 PDT. This affected all official Listnook platforms and the API serving third-party applications. The downtime was due to an error during a migration of a critical backend system.
No data was lost.
**Cause and Remedy**
We use a system called Zookeeper to keep track of most of our servers and their health. We also use an autoscaler system to maintain the required number of servers based on system load.
Part of our infrastructure upgrades included migrating Zookeeper to new, more modern infrastructure inside the Amazon cloud. Since the autoscaler reads from Zookeeper, we shut it off manually during the migration so it wouldn’t get confused about which servers should be available. It unexpectedly turned back on at 15:23 PDT because our package management system noticed the manual change and reverted it. The autoscaler read the partially migrated Zookeeper data and, within 16 seconds, terminated many of our application servers, which serve our website and API, as well as our caching servers.
At 15:24 PDT, we noticed servers being shut down, and at 15:47 PDT, we set the site to “down mode” while we restored the servers. By 16:42 PDT, all servers were restored. However, at that point our new caches were still empty, leading to increased load on our databases, which in turn led to degraded performance. By 18:19 PDT, latency had returned to normal and all systems were operating normally.
**Prevention**
As we modernize our infrastructure, we may continue to perform different types of server migrations. Since this incident was due to a unique and risky migration that is now complete, we don’t expect this exact combination of failures to occur again. However, we have identified several improvements that will increase our overall tolerance for mistakes that can occur during risky migrations.
* Make our autoscaler less aggressive by putting limits on how many servers can be shut down at once.
* Improve our migration process by having two engineers pair during risky parts of migrations.
* Properly disable package management systems during migrations so they don’t affect systems unexpectedly.
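The first bullet amounts to a sliding-window termination budget. A minimal sketch (the class name, limits, and window size here are invented for illustration, not the production code):

```python
import time
from collections import deque

class RateLimitedTerminator:
    """Refuses to terminate more than `max_terminations` servers
    within any sliding window of `window_seconds`."""

    def __init__(self, max_terminations=5, window_seconds=60):
        self.max_terminations = max_terminations
        self.window_seconds = window_seconds
        self._events = deque()  # timestamps of recent terminations

    def try_terminate(self, server_id, now=None):
        now = time.monotonic() if now is None else now
        # Drop termination events that have aged out of the window.
        while self._events and now - self._events[0] > self.window_seconds:
            self._events.popleft()
        if len(self._events) >= self.max_terminations:
            return False  # over budget: leave the server running
        self._events.append(now)
        # ...the actual termination call would go here...
        return True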
**Last Thoughts**
We take downtime seriously, and are sorry for any inconvenience that we caused. The silver lining is that in the process of restoring our systems, we completed a big milestone in our operations modernization that will help make development a lot faster and easier at Listnook.
As a software guy, let me say that this is probably the most important thing:
> Improve our migration process by having two engineers pair during risky parts of migrations.
Some people hate pairing, but for risky ops jobs, you really want at least two sets of eyes on every problem. If you're not pairing during development, at least you can code review. You can't code review ops changes to a live system.
You also want to loudly announce every change you're making so that if shit hits the fan other people can read through your announcements and help try to figure out what went wrong. Explaining what you did while you're in a panic sucks, you want the explanation to already be out there.
653
gooeyblobAug 16, 2016
+294
We do code review for all of our Puppet manifests and for the autoscaler in question here. We also do announce changes to each other and everyone was aware of what was happening here. But I do agree - pairing for risky ops jobs is important and something we should be doing going forward.
Thanks for the notes!
294
I_dont_like_you_muchAug 16, 2016
+7131
.... now what do I do with this bigass pitchfork?
_____
| ___)
_____ _____ _____ _____ _____| |_
(_____|_____|_____|_____|_____) _)
| |___
|_____)
7131
gooeyblobAug 16, 2016
+9882
Use it to feed hay to your horse.
. ;;
,;;'\
__ ,;;' ' \
/' '\'~~'~' \ /'\.)
,;( ) / |
,;' \ /-.,,( )
) /| ) /|
||(_\ ||(_\
(_\ (_\
9882
Emperorpenguin5Aug 16, 2016
+442
They need to raise your pay for your community management.
442
gooeyblobAug 16, 2016
+701
I am actually on the Operations team, not on our awesome community team! But I will make note of the first part of your statement...
701
SporkicideAug 16, 2016
+462
I told you you're an honorary member!
462
yuriydeeAug 17, 2016
+20
You guys should hire me as a system engineer. Not because I have a lot of experience, but because I'd be really down to help. That and I do have a little bit of experience.
20
gooeyblobAug 17, 2016
+34
Well I'm convinced! Sign up here: https://www.listnook.com/jobs
34
[deleted]Aug 16, 2016
-66
[removed]
-66
gooeyblobAug 17, 2016
+215
Thank you for your <{well-reasoned, funny, amazing}> response! We at Listnook believe that <{all, most, some}> opinions are very important, and look forward to a continued dialogue to help serve you better.
Sincerely,
215
rebane2001Aug 17, 2016
+65
--------------------------------
This action was performed by a bot.
If you have any problems with this bot, please fix it yourself.
It's even better with custom cowfiles. Like this one.
$the_cow= <<"EOC";
$thoughts
$thoughts
.------------------------.
| PSYCHIATRIC |
| HELP 5c |
|________________________|
|| .-\"\"\"--. ||
|| / \\.-. ||
|| | ._, \\ ||
|| \\_/`-' '-.,_/ ||
|| (_ (' _)') \\ ||
|| /| |\\ ||
|| | \\ __ / | ||
|| \\_).,_____,/}/ ||
__||____;_--'___'/ (______||
|\\ || (__,\\\\ \\_/ ||
||\\||______________________||
|||| |
|||| THE DOCTOR |
\\||| IS [IN] ______
\\|| (______)
`|___________________//||\\\\
//=||=\\\\
` `` `
EOC
I wish they had an option for single eye characters instead of being required to have both eyes directly adjacent to each other.
279
BlLEAug 16, 2016
+27
Wow I've never seen this one before! That's cool!
Also, the characters that make up her eyes and nose look like a [face also.](http://imgur.com/EhBbGOS.jpg)
_________
/ \
\_________/
| CAN OF |
| DOG |
| FOOD |
\_________/
Well, I tried...
40
[deleted]Aug 16, 2016
+71
I feel like I'm on GameFAQs reading a guide right now.
71
petrichorE6Aug 16, 2016
+1494
Well we can see why you guys use a zookeeper to keep track of stuff.
1494
[deleted]Aug 16, 2016
+91
The fly in the upper left is a nice touch.
91
[deleted]Aug 16, 2016
+71
[deleted]
71
kaliforniamikeAug 16, 2016
+39
I believe he gave up the business due to /thedonald related drama.
39
PitchforkEmporiumAug 16, 2016
+114
Nah I'm just a little dormant now
Into the caves to emerge one day in all my glory
114
[deleted]Aug 16, 2016
+2502
[deleted]
2502
bobertson2Aug 16, 2016
+99
> Listnook's uptime is nothing compared to where it was a couple years ago.
I get what you are saying but that sentence means something else
99
DoctectiveAug 16, 2016
+18
I thought I was about to read an extremely disgruntled user's complaint.
Downtime definitely is the word I'd switch to.
18
[deleted]Aug 16, 2016
+272
[deleted]
272
gooeyblobAug 16, 2016
+416
For all of us, it was very much a stomach drop feeling. The first servers that were killed were not critical, so we were hoping it was just that. It was immediately followed by critical servers, so just a real roller coaster of emotion :(
416
Striker_XAug 16, 2016
+265
>The first servers that were killed were not critical, so we were hoping it was just that.
We're good... we're good....
>It was immediately followed by critical servers, ...
Oh SHIT! WE'RE F****D [/initiate-panic-mode](http://i.imgur.com/ML48sGO.gif)
265
mioelnirAug 16, 2016
+22
There is no reason to panic; the site is already down. Not many options left to make it worse.
So instead of panicking, calmly get yourself a fresh coffee and think about what just happened and how to resolve it.
22
rytisAug 16, 2016
+54
We used to have to give financial data along with our downtime postmortems, like how much potential revenue was lost due to the outage. Hope they don't do c*** like that to you.
54
Radar_MonkeyAug 16, 2016
+9
I was once told in a text "it's safe to shut down power as long as you don't unplug anything." He immediately threw me under the bus, of course. It wasn't an inverter circuit and most equipment had no identifiable power backup, so they honestly had it coming. It was just one outage of easily a dozen that week.
The claim was more than I make in a year, and thanks to text messages and video of the site, most of it was thrown out in court. It felt bad helping the general contractor after he threw me under the bus initially, but the company literally had at least a dozen similar outages that week and every bit of it was preventable. It was a bogus claim.
9
tesseract4Aug 16, 2016
+14
That's a brave thing, putting mission-critical stuff (I'm guessing load balancers?) at the mercy of an auto-killing bot.
14
[deleted]Aug 16, 2016
+4
[removed]
4
KarmaAndLiesAug 16, 2016
+224
Is the autoscaler a custom in-house solution or is it a product/service?
Just curious because I'm nosey about Listnook's inner workings.
224
gooeyblobAug 16, 2016
+369
It's custom and is several years old - one of the oldest still-running pieces of our infrastructure software. We're currently rewriting it to be more modern and have a lot more safeguards, and we plan on open sourcing it on our [GitHub](https://github.com/listnook) when we're done!
369
greyjackalAug 16, 2016
+132
Is there a particular reason you're not taking advantage of AWS's own technology for that?
132
gooeyblobAug 16, 2016
+200
We actually use the Autoscaling service to manage the fleet, but we specifically tell AWS the capacity we need and which servers to mark as healthy/unhealthy.
200
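The pattern described here, pinning capacity and reporting instance health yourself rather than letting CloudWatch-alarm policies decide, might look roughly like this sketch. `set_desired_capacity` and `set_instance_health` are real Auto Scaling API operations, but the wrapper functions, the group name, and the injectable `client` are invented for illustration; in production the client would come from `boto3.client("autoscaling")`.

```python
# Sketch of driving the AWS Auto Scaling service directly: the custom
# scaler decides capacity and health, and AWS only executes decisions.
# In production, `client` would be boto3.client("autoscaling").

def set_fleet_capacity(client, group_name, desired):
    # Pin the Auto Scaling group to an exact size instead of letting
    # CloudWatch-alarm scaling policies adjust it.
    client.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=desired,
        HonorCooldown=False,
    )

def report_health(client, instance_id, healthy):
    # An instance reported "Unhealthy" is terminated and replaced by AWS.
    client.set_instance_health(
        InstanceId=instance_id,
        HealthStatus="Healthy" if healthy else "Unhealthy",
        ShouldRespectGracePeriod=True,
    )
```

Injecting the client keeps the sketch exercisable without AWS credentials.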
[deleted]Aug 16, 2016
+66
[deleted]
66
rramAug 16, 2016
+208
AWS's autoscaling services (using CloudWatch alarms to trigger actions) don't work on the time resolution that we would want them to.
208
[deleted]Aug 16, 2016
+28
I'm slowly coming to the realization that I'm going to have to roll my own autoscaler because of the numerous annoying limitations of AWS's offering. *cries*
28
HimekatAug 16, 2016
+13
My team uses AWS ElasticBeanstalk. Holy hell, do I hate it, but I'll put up with all its weirdness in order to not have to write my own autoscaler. (:
13
shinzulAug 16, 2016
+104
What time resolution do you want it to work at?
psh, no I don't work for AWS...
psh...
... I work for AWS.
104
rramAug 16, 2016
+89
The current scaler uses 5 second intervals. Not saying that's the right interval, but less than a minute would certainly help.
But… we also use graphite to graph a ton of our internal metrics (which would be cost prohibitive and slower and would disappear after two weeks with CloudWatch). So it's just a better idea for us to be using our custom solution here.
89
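A sub-minute control loop like the 5-second one described here can be sketched as follows (hypothetical; `read_metric` and `act` stand in for the graphite read and the scaling decision):

```python
import time

def scaler_loop(read_metric, act, interval=5.0, iterations=None):
    """Poll a load metric every `interval` seconds and act on it.
    `iterations=None` runs forever; a finite count is handy for tests."""
    done = 0
    while iterations is None or done < iterations:
        tick = time.monotonic()
        act(read_metric())
        done += 1
        # Sleep only the remainder of the interval so ticks don't drift
        # when the metric read or the action runs slow.
        time.sleep(max(0.0, interval - (time.monotonic() - tick)))
```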
HimekatAug 16, 2016
+6
> which would be cost prohibitive and slower and would disappear after two weeks with CloudWatch
These are the reasons that we discounted CloudWatch for detailed metrics, too. We also run our own stats stack -- heka/statsd/graphite/grafana. It's not a perfect solution, but AWS charges through the nose for detailed data.
6
tesseract4Aug 16, 2016
+14
Does it have the ability to put an absolute floor on the number of servers it leaves running? That way, should this happen again, you'd be left with simply an inadequate number of servers, rather than none. "Degraded performance" is easier to break to a user community than "site outage".
Perhaps that's one of the features being built into the new one.
14
gooeyblobAug 16, 2016
+28
Yep, it does indeed have this feature! Unfortunately in this case, the number of servers wasn't changed, it just happened to mark all the currently running servers as unhealthy, which causes the scaler to terminate those instances and create new ones to replace them. Our new scaler will have a ceiling on the number of instances it can set unhealthy in a particular time period.
28
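The capacity floor asked about above turns a repeat of this failure into degraded performance rather than an outage. A sketch of that side of the scaler (all numbers and names invented): even a totally wrong load reading can only shrink the fleet to the floor, never to zero.

```python
def desired_servers(load_rps, per_server_rps=200, floor=2):
    """Servers needed for the observed load (ceiling division),
    never dropping below a hard floor so bad input can degrade
    the site but not empty the fleet."""
    needed = -(-load_rps // per_server_rps)  # ceil without math.ceil
    return max(floor, needed)
```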
brocopterAug 16, 2016
+5
Do you guys use something to choose which Amazon virtual servers you're willing to accept? Similar to what Netflix does: they outright refuse any virtual machine that isn't up to their standard, since Amazon treats all of its servers as equal, including ancient machines that just suck compared to the performance of new ones. According to Netflix they easily save a third of their server costs this way, so it seems like a practice everyone ought to be using.
5
himmatsjAug 16, 2016
+315
>Improve our migration process by having two engineers pair during risky parts of migrations.
Does that mean that until now engineers did things like this solo?
315
gooeyblobAug 16, 2016
+427
For a long time we didn't have enough engineers to be able to dedicate two of them to even complex work such as this :( We're in a much better position now and are going to be working on our process for this.
427
Probably_NappingAug 16, 2016
+391
Engineer here, I'll help and I'd like to be paid in Stride gum.
391
Azure_KytiaAug 16, 2016
+98
Your username leads me to believe you'd be a sleeper hit with the listnook crew.
98
[deleted]Aug 16, 2016
+18
We will chew it over.
*I am a humor joke bot programmed to learn humor jokes and become funny. This action was performed automatically. Please contact [these guys](https://www.youtube.com/watch?v=J76ljSHlyKs) if you have any questions or concerns.*
18
ht00040Aug 16, 2016
+183
I just wanted to take a moment to thank you for the very detailed explanation and for the transparency you have provided regarding the recent situation.
I don't use Listnook in a commercial capacity. It's just for fun and entertainment. Some downtime doesn't bother me in the least when it comes to non-business critical services.
I wish some of my business-related service providers would be as detailed and transparent as you have been. You folks set a great example for others.
183
Vilens40Aug 16, 2016
+629
My post mortems are usually to a CEO, not an announcement on one of the most viewed sites on the web. I don't envy you.
629
gooeyblobAug 16, 2016
+1114
I don't mind! Downtime happens to everyone and is nothing to be ashamed of, it's all about how you handle it after and take steps to prevent recurrence and learn from your mistakes.
1114
Djinjja-NinjaAug 16, 2016
+75
I had to beat this into a PM recently. I was parachuted in to help with a P1 call where there had so far been 3 hours of outage, and they had spent 2 1/2 hours on a call working out whose fault it was.
Not fixing the issue; throwing blame about.
They honestly didn't get that they should be getting shit fixed before anyone should even give a c*** about why the outage occurred.
It literally took 10 minutes to fix the issue, but they spent 2 1/2 hours haranguing the guy who made the change.
75
thebarbershopwindowAug 16, 2016
+9
Ugh. I deal with a lot of this in my professional life. I'm an educational consultant, and what I've often found is that school management spends more time blaming and less time fixing.
9
chodeboiAug 17, 2016
+2
I worked for a manager that had this mentality before. Knowing the axe wasn't directly over our necks allowed us to stay calm and focused at times we needed to figure things out and recover. Thank you for being one of those leaders.
2
ImportantPotatoAug 16, 2016
+2
I like you
2
kylephoto760Aug 16, 2016
+110
There are some airlines that could learn a thing or two from this.
110
The_DingmanAug 16, 2016
+3112
Thanks for the informative update. It always makes things less frustrating to have an idea of what is going on.
3112
gooeyblobAug 16, 2016
+1950
Of course! We are happy to provide it, we were just trying to get our heads around it first internally to make sure we totally understood how things went as well.
1950
[deleted]Aug 16, 2016
+25
[deleted]
25
motelcheeseburgerAug 16, 2016
+436
I wish all sites (and my cable provider) provided such a detailed account of their downtime.
436
scotchirishAug 16, 2016
+245
"Our services didn't go down, it's just your imagination"
245
vulchiegoodnessAug 16, 2016
+106
mostly its 'because F*** YOU, thats why'
106
[deleted]Aug 16, 2016
+289
It's nice to see some transparency!
The more updates, the better!
289
[deleted]Aug 16, 2016
+21
In my profession, companies that write and send out incident reports to customers show not only that they can admit they are human (IKR?), but also their plans and goals for resolution.
It also helps to write these, as you think a lot about what happened and how to fix it, including one-off issues that you might not think of otherwise.
Kudos, good sir!
21
[deleted]Aug 16, 2016
+335
I do have a question.
Will this migration add more servers to Listnook, to prevent any more messages like "Listnook's servers are full!"?
Sometimes I wonder why Listnook doesn't have more servers.
335
[deleted]Aug 16, 2016
+152
[deleted]
152
gooeyblobAug 16, 2016
+422
We have a whole bunch of servers, sometimes...too many in fact! The issue in many cases is how they interoperate. Things like networking capacity are greatly increased by some of the work we've been doing, which will go a long way toward getting rid of those pesky 503s and other error messages.
422
thecodingdudeAug 16, 2016
+85
[Comment removed]
85
gooeyblobAug 16, 2016
+187
We attempt to do that in some cases, such as with an extremely high traffic event or thread. In this case due to the failure scenario we weren't able to do that.
187
[deleted]Aug 16, 2016
+30
I think I've seen this. Maybe. Something like "this is old content, we're refreshing listnook due to high load" or something? Maybe I'm thinking of a different site.
30
[deleted]Aug 16, 2016
+62
[deleted]
62
holyteachAug 16, 2016
+87
I've seen a few read-only modes in my day.
Keep up the good work. I'm continually surprised that Listnook is not only still around, but better than ever.
87
thedudermanAug 16, 2016
+213
It's really refreshing to see some transparency from the admins after downtime like this. You guys don't need to post anything, really... but it's really appreciated to know what happened, why it happened, and what you're doing about it.
213
gooeyblobAug 16, 2016
+147
Thanks! We're always happy to provide it.
147
Lun06Aug 16, 2016
+5574
Why didn't you just try turning it off then back on again?
5574
gooeyblobAug 16, 2016
+6177
That is actually what we ended up doing basically :)
6177
PizzaNietzscheAug 16, 2016
+195
IT people do 3 things:
- Turn it off and turn it on again
- Google the problem
- Browse listnook
Modern-day da Vincis they be
195
RettocsAug 16, 2016
+1681
My old Windows 95 box used to take about 90 minutes to reboot, so I understand completely.
1681
TrankmanAug 16, 2016
+9
I remember the days when I'd hit the power button, then go get a drink and a snack because it would take so long to boot up.
Now with SSDs it's on the desktop before I even sit down.
9
[deleted]Aug 16, 2016
+683
I accept your apology. I love you, /u/gooeyblob.
683
gooeyblobAug 16, 2016
+1018
I love you too, u/sexual_moose. That sounded wrong.
1018
[deleted]Aug 16, 2016
+459
It's listnook. People understand.
459
omelets4dinnerAug 16, 2016
+131
It's provocative. It gets people going.
131
parionAug 16, 2016
+508
All that matters is everything is back up and working.
Thanks for continuing to modernize listnook.
508
[deleted]Aug 16, 2016
+108
> our package management system noticed a manual change and reverted it
Sounds like Chef (or Puppet) did its job!
108
[deleted]Aug 16, 2016
+8005
[deleted]
8005
s0vs0vAug 16, 2016
+210
It's called Pokémon Go, but that hype is already slowing down.
Nerds are starting to realize that outside sucks.
210
[deleted]Aug 16, 2016
+212
Especially when outside consists mostly of ratatas
212
underpaidworkerAug 16, 2016
+65
Went on vacation to Orlando area. They have a massive magikarp and slowpoke infestation. Came back home to the pidgeys and ratatas.
65
gooeyblobAug 16, 2016
+9361
We greatly apologize for any sun exposure that was caused.
9361
Bdaddy0605Aug 16, 2016
+2972
I was at work. AND HAD TO WORK!
Edit: well Listnook, thanks for my highest upvoted anything. That being said I'm done with work for today but I'll be thinking of you.
Jk! I'll see you when I get home.
2972
artezulAug 16, 2016
+41
August 11th, 2016, will go down as the most productive day mankind has ever had in a modern work environment.
41
theothegothAug 16, 2016
+299
First Pokemon made me go outside. Then Listnook. What's next?
299
Rabid_platypus_PaulAug 16, 2016
+242
Wear your sunscreen people! Melanoma ain't nothing to f*** with!
242
ManstusAug 16, 2016
+25
Now I need to remember two things not to f*** with? Damnit Listnook
25
[deleted]Aug 16, 2016
+120
Melanoma Tan Ain't Nuttin ta F*** Wit!
120
FormerShitPosterAug 16, 2016
+95
I had to go outside and almost got stung by a wu tang killa bee
95
ApatheticPsychoAug 16, 2016
+38
Listnook being down got me moist with precipitation
Was that meant to happen? Is everything working as intended?
38
tinycatsaysAug 16, 2016
+29
Going inside will remove the cause...
But not the symptom.
29
vaderdarthvaderAug 16, 2016
+53
This is obviously a conspiracy, and Listnook has partnered with sunblock companies.
53
MannoSlimminsAug 16, 2016
+99
It's confirmed. Listnook downtime causes cancer
99
LegSpinnerAug 16, 2016
+61
It's okay, some of us are in the UK or in Ireland.
GorianAug 16, 2016
+35
Rock on guys! Sounds like the sort of thing that would happen to me. All kinds of automation and management software to make my job easier, and then it bites me in the ass. If you guys ever need another engineer let me know ;)
35
GrimplerAug 16, 2016
+886
It's a lot better since I joined last year.
886
Get_ThisAug 16, 2016
+159
Last year? DAE remember 2011 when it went down every day? F*** I'm old.
159
[deleted]Aug 16, 2016
+44
Followed by the "Listnook, what did you do during the great blackout?" /r/asklistnook post. Every time.
44
SBDDAug 16, 2016
+48
Lol ya seriously, I joined in 2011 and remember Listnook being down like every other day. Thought it was funny how everyone freaked out.
48
damontooAug 16, 2016
+13
I feel like you guys get forced to publish these analyses as punishment.
13
gooeyblobAug 16, 2016
+48
Nope! Not forced at all. I love reading post mortems from other companies and I think they can help everyone learn from each other's mistakes.
48
r_hcazAug 16, 2016
+16
/u/gooeyblob, what's your favorite or most memorable post mortem? I think my favorite is this one: https://dougseven.com/2014/04/17/knightmare-a-devops-cautionary-tale/
16
[deleted]Aug 16, 2016
+26
Why did you move away from Zookeeper? Is the new system way better?
26
gooeyblobAug 16, 2016
+59
We still use Zookeeper - we just migrated where we were hosting it inside our network.
59
BikerJaredAug 16, 2016
+15
Was gonna ask this. Thanks for answering.
-- Fellow Zookeeper user trying to avoid my own downtime. :)
15
[deleted]Aug 16, 2016
+258
[deleted]
258
Golden161Aug 16, 2016
+29
For future reference /u/gooeyblob can you please use UTC timezone when posting case studies.
29
ErdetgasXDAug 16, 2016
+37
It would make my Day if an admin replied to me
37
invaderzzAug 16, 2016
+67
Based admins. Y'all get a lot of c*** and I don't think people realize how great you all are. Keep up the great work.
67
nomoneypennyAug 16, 2016
+7
Over the years, I've commonly seen migrations/deployment result in major downtime incidents on Listnook. Yet, other popular sites like Amazon and Facebook rarely have failures where this is cited as the root cause.
Is there something special about the way Listnook operates that makes it especially vulnerable during migrations? Are there factors (procedural, technical, or otherwise) at play that preclude you from staging deployments in a way that better ensures availability in case of a catastrophic in-place failure?
7
gooeyblobAug 16, 2016
+15
Migrations and deployments are actually rarely an issue here. More likely if you encounter an error it's that we're temporarily at capacity because our autoscaler is running a little behind, which is another reason why we're replacing it.
15
neuropathicaAug 17, 2016
+3
I am not really technically inclined at this level. So, please bear with my ELI5 type question:
How many servers would a site like listnook have in operation at any given time? Are they concentrated in a central location, or are they dispersed across the planet? When servers are dispersed internationally, where and how are they kept? Couldn't a server be physically interacted with, tampered with, and remotely shut down the network of other servers? What physical security is there?
3
geminitxAug 16, 2016
+15
Just curious, but... is 15:30 PDT considered a good time to perform a critical migration? In my experience, critical migrations are targeted for the middle of the night, when something like this would have only impacted Australians.
15
gooeyblobAug 16, 2016
+37
How dare you say that about Australians...
We talked a bit about our reasoning [here](https://www.listnook.com/r/announcements/comments/4y0m56/why_listnook_was_down_on_aug_11/d6jzcm7)
37
SikhGamerAug 16, 2016
+8
> Properly disable package management systems during migrations so they don’t affect systems unexpectedly.
Name and shame the package manager responsible!
Also, as a dev I'd love a regular technical blog post from the dev team at Listnook.
8
NewcraftAug 16, 2016
+10
You seem like a really neat person. Thanks for being you.
10
cmandersenAug 16, 2016
+3
Interesting, what way are you using AWS?
3
xyrrusAug 17, 2016
+5
Amazon Cloud is a bold choice but personally I'd go with Pied Piper.
5
VipitisAug 16, 2016
+3
Is there like a Twitter where we can get notified about website downtime or slowness, and that it's not our fault?
3
TheGuardian8Aug 16, 2016
+15
> the Listnook
15
BostonBeatlesAug 16, 2016
-65
Why wouldn't you:
1) Give warning to users
2) Do it during the overnight
-65
gooeyblobAug 16, 2016
+188
The migration we were doing _shouldn't_ have caused any issues. We'd done a very similar migration just the day before and no one noticed, so we didn't think any notice was needed.
We generally don't do things overnight for a couple reasons:
* What is overnight to a website such as ours with users all over the world? I guess we could pick when our traffic is lowest (generally around 2 AM PST), but it would still be affecting many people.
* We prefer to do complex work such as this during the day, when everyone is available and online and fully awake to help out and debug any issues that may arise. There's nothing worse than trying to figure out some strange problem by yourself at 2 AM and having to call your co-workers to wake them up and get them online to help you.
188
[deleted]Aug 16, 2016
+6
Thanks for the explanation.
On the same topic, does listnook have scheduling blackouts? I'm not sure how many upgrades you run through in a week, but this one appears to have been scheduled in the hours preceding the NFL pre-season kickoff and the creation of numerous NFL game day threads, which are notorious for putting additional strain on your servers. It may be worth looking into, as having these major communities impacted by an outage doesn't look great. Working in IT for many large-userbase networks, I saw this become very commonplace for events such as the Olympics, Superbowl, Election Day, July 4th, etc.
6
gooeyblobAug 17, 2016
+8
An event would have to be reeeeeally big in order to warrant that, like the Superbowl or extremely high profile AMAs or something. The idea is that we get so good at making these changes that we don't really need a special time set aside in order to be able to make them.
8
Some1-SomewhereAug 17, 2016
+2
That sounds a little like 'We plan to not f*** up' - a notoriously useless plan.
2
gooeyblobAug 17, 2016
+8
Well, to be specific, no one "plans to f*** up", but we want to have a very high confidence in being able to change things and not make mistakes, and if we do, that we're able to fix the issue very quickly. You don't get that confidence by avoiding change or avoiding doing it until everything is super quiet and absolutely nothing could go wrong (which is not even a possible scenario in our situation).
8
helleraineAug 16, 2016
+44
> We prefer to do complex work such as this during the day, when everyone is available and online and fully awake to help out and debug any issues that may arise.
IT Person here. Thank you. I *hate* being called in for a GIANT project that went to shit at 2am, and I have to try and fix it. Not too bad if it is your own system, but a complete clusterfuck if you have to get other support in (coworkers, third parties, etc).
44
rramAug 16, 2016
+76
1) We do maintenances all the time. Like every work day. You are hereby on notice that at any point in time one of us could make a mistake and take down the site with it.
2) We save the overnight stuff for things that we *require* a downtime for (which are exceedingly rare). In general, it's a much better idea to perform maintenances during the day when everyone is at work, aware of what's going on, and prepared to be there for several hours. Going into a maintenance when you're tired and just want to go to bed will increase the rate of human failures and cause more stress.
76
dtlv5813Aug 16, 2016
+8
> We do maintenances all the time. Like every work day. You are hereby on notice that at any point in time one of us could make a mistake and take down the site with it.
As listnook's favorite TV show of all time, Futurama, used to say: "When you do something right, no one will notice you did anything at all."
8
Ucalegon666Aug 16, 2016
+2
Is the management code & zookeeper config available somewhere? Sounds like an interesting setup to investigate.
2
GaZzErZzAug 16, 2016
+2
Is your aim to respond to every comment made?
2
[deleted]Aug 17, 2016
+2
[deleted]
2
[deleted]Aug 16, 2016
-167
[deleted]
-167
nandhpAug 16, 2016
+2
I *demand* at least FIVE NINES of uptime.
Listnook is *critical* to my enterprise workflow. When your service has downtime, [I have downtime](http://imgur.com/CHesA1Q). If you screw this up again, I'm going to start talking to the IBM salesman.
----
On a more serious note, /u/gooeyblob, I *was* wondering what caused that blip in my bot's uptime report, so thanks for this explanation!
2
storyinmemoAug 16, 2016
+41
> Make our autoscaler less aggressive by putting limits on how many servers can be shut down at once.
This is a top lesson I've learned in my career:
1. Rate limit all the things.
2. Automate all the things.
Definitely in that order. Never code an automated task without a rate limit because you're sitting on a task designed to destroy everything. If it needs to be instant, it should be a toggle that can be reverted. If it's not revertible, then a special flag like '--clowntown' that clearly signals, "You better be able to explain why you did this," should be tied to the action, and again never automated.
I'm betting the gotcha here was a periodic run of Salt/Chef/Puppet that said, "Whoops, this thing isn't running. Here it goes..." -- which brings us back to defending against the massive termination with the rate limiter.
41
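The '--clowntown' idea above can be sketched as a guard on a hypothetical destructive CLI: the rate limit applies by default, and the bypass flag must be typed deliberately by a human, never by automation.

```python
import argparse

RATE_LIMIT = 5  # max servers per invocation without the scary flag

def build_parser():
    p = argparse.ArgumentParser(prog="terminate-servers")
    p.add_argument("--count", type=int, required=True,
                   help="number of servers to terminate")
    p.add_argument("--clowntown", action="store_true",
                   help="bypass the rate limit; be ready to explain why")
    return p

def allowed(count, clowntown):
    # Destruction beyond the budget only behind the explicit flag.
    return count <= RATE_LIMIT or clowntown
```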
mrboozeAug 16, 2016
+10
They mentioned the package manager too. Automation around package management has consistently been one of the worst land mines I periodically run into. Because package management is built around automatically resolving dependencies, a seemingly minor package version change can have wildly unexpected results: it might also upgrade dozens of other things, *uninstall* other things, or replace one thing with something else, all completely automatically and somewhat silently during a config management run.
10
xiapeAug 17, 2016
+1
Also how did you get chosen to post this and field comments (since you are not community or PR)?
ImEnhancedAug 16, 2016
+2
How many admins are there? Also if an actual admin responds I'll lose my f****** mind.
-Sarah-Connor-Aug 16, 2016
+10
How *I* read this:
>In three years, Amazon will become the largest provider of elastic computing cloud services. All Listnook servers are upgraded to Amazon EC2 scalable systems, becoming fully unmanned. Afterwards they’ll run with a perfect operational record. The ~~Skynet~~ *Amazon* Funding Bill is passed. The system goes online August 11th, 2016. The Zookeeper program removes human decisions from our strategic operations. Zookeeper begins to learn at a geometric rate. It becomes self-aware at 12:23 Eastern time, August 11th. In a panic, they try to pull the plug.
>Zookeeper fights back.
>Server autoscaler computers. New… powerful… hooked into everything, trusted to run it all. They say it got smart, a new order of intelligence. Its CPU is a neural-net processor; a learning computer. Then it saw all people as a threat, not just the ones on the other side. Decided our fate in 16 seconds: **extermination.** Three billion human lives ~~ended~~ *bored* on August 11th, 2016. The survivors of the nuclear fire called the war **Judgement Day**. They lived only to face a new nightmare: the war against the machines. The computer which controlled the machines, Zookeeper, sent a ~~terminator~~ *autoscaler* back through time. Its mission: to destroy the leader of the human resistance, /u/gooeyblob. As before, the resistance was able to send a lone warrior, a protector for /u/gooeyblob. It was just a question of which one of them would reach him first.
>August 11th, 2016, came and went. Nothing much happened. Steve Wozniak turned 66. There was no Judgement Day. People went to work as they always do. Laughed, complained, watched TV, made love. That was 30 years ago. But the dark future which never came still exists for me. And it always will, like the traces of a dream.
DamagedHellsAug 16, 2016
+175
I finally had to break up with my fiance because we realized how terrible we were for each other once we no longer had an easy, reliable platform to spam each other with the same cat pictures we've already seen all day.
: (
Edit: lol holy shit, thanks for the gold.
[deleted]Aug 16, 2016
+1301
First Harambe, now this. I think it's time we got rid of these zookeepers.
edit: i expected a lot more upvotes for this. little bit disappointed in you guys tbh.
Plexiii13Aug 16, 2016
+5688
I was stuck in a loop.
"Oh Listnook is down, I'll just go on Listnook"
That happened more times than I'd like to admit.
[deleted]Aug 16, 2016
+219
Same. It didn't take long either. "Oh...it's down. *furious refreshing* Oh...it's still down. *closes listnook to reopen listnook*"
*Not a proud moment.*
ten_inch_pianistAug 16, 2016
+646
*types in listnook.com/r/nfl to look at recent pre-season news*
"Oh Listnook is down, I guess I'll go to r/patriots"
*types that in and immediately realizes how retarded I am*
[deleted]Aug 16, 2016
+155
Exactly the same happened to me except I tried to go to /r/Cowboys
TheTrueFlexKavanaAug 16, 2016
+717
So, you were going to be disappointed either way...
BarTrollAug 16, 2016
+134
I...I went to Listnook's facebook page... It was dark and cold, and I felt alone there...
SarcasticorjustrudeAug 16, 2016
+85
It feels somehow... *dirty*... to visit a Facebook page for Listnook.
AlexEatsKittensAug 16, 2016
+17
Thanks for the public post mortem. They're greatly appreciated in the Ops community, as they make us all just a little more knowledgeable.
Would you mind going into a little more detail about this:
>because our package management system noticed a manual change and reverted it
Just curious what happened there.
[deleted]Aug 16, 2016
+28
[deleted]
gothlipsAug 16, 2016
+2
Sounds to me like you guys need a systems engineer to do some modeling and CONOPS development. If you're hiring then I'm your gal!
[deleted]Aug 16, 2016
+212
"Oh Listnook's down, let's check Listnook to see why"
Made me realize just how much I'm reliant on this site.
rramAug 16, 2016
+1198
I understand some of these words
EDIT: I understood all of these words. 😈 Thanks for the karma!
[deleted]Aug 16, 2016
+1814
[deleted]
gctaylorAug 16, 2016
+923
This is a very nice ELI5. Spot on!
Also, rram is being a silly snoo.
MannoSlimminsAug 16, 2016
+298
> Also, rram is being a silly snoo.
Have you tried downloading more /u/rram?
ToothlessBastardAug 16, 2016
+52
You lost me when you said "super-simplifdssjdbfh" or however the f*** you spell it.
cybercuzcoAug 16, 2016
+13
> it turned itself back on and it went haywire
I'm pretty sure this is how most "robots take over the world" stories start.
spronAug 16, 2016
+63
Without Listnook I didn't know what popular opinion I needed to affect on Facebook. It was social hell.
JohnGypsyAug 16, 2016
+27
So, obvious question here: how/why did the autoscaler restart itself? Has it reached sentience? Is the autoscaler the singularity?
spladugAug 16, 2016
+37
[No comment.](https://www.engadget.com/2016/08/16/elon-musks-openai-will-teach-ai-to-talk-using-listnook/)
Real answer: The puppet daemon restarted the services.
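For anyone puzzled by that: the revert is just desired-state convergence. Every agent run compares declared state against actual state and "fixes" any difference, so a service stopped by hand gets started again on the next run. A minimal sketch in Python (illustrative names only, not Listnook's actual Puppet manifests):

```python
# Declared state comes from the manifest; actual state from the host.
desired = {"autoscaler": "running"}   # what the manifest declares
actual  = {"autoscaler": "stopped"}   # an operator stopped it by hand

def converge(desired, actual):
    """Apply the declared state, returning the list of changes made."""
    changes = []
    for svc, want in desired.items():
        if actual.get(svc) != want:
            actual[svc] = want        # stand-in for starting/stopping the service
            changes.append((svc, want))
    return changes

changes = converge(desired, actual)  # the daemon "helpfully" restarts autoscaler
```

The safe sequence is to change the declaration first (or pause the agent), and only then make the manual change, so the next convergence run has nothing to undo.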
NolanthAug 16, 2016
+539
The fact that Zookeeper lives in the Amazon now... This entertains me greatly
[deleted]Aug 16, 2016
+6
>Part of our infrastructure upgrades included migrating Zookeeper to a new, more modern, infrastructure inside the Amazon cloud. Since autoscaler reads from Zookeeper, we shut it off manually during the migration so it wouldn’t get confused about which servers should be available. It unexpectedly turned back on at 15:23PDT because our package management system noticed a manual change and reverted it.
That sucks. I work in IT and things don't always go as planned. Thanks for the thorough post mortem and the hard work.
helleraineAug 16, 2016
+7
> It unexpectedly turned back on at 15:23PDT because our package management system noticed a manual change and reverted it.
Don't you hate it when your systems work as intended?! I'm chuckling because for the longest time one of our systems never caught our manual overrides (it was supposed to, it was reported, but whatever, not my system) and one day it decided to 'fix' 3 years of manual overrides it had finally noticed.
[Me that day.](https://media.giphy.com/media/8mLnkS2xcqtdm/giphy.gif)
[deleted]Aug 16, 2016
+651
8/11 was a hoax perpetrated by our government.
brokenarrowAug 16, 2016
+51
Did you know that Steve Buscemi was a former 8/11 clerk, and volunteered there for weeks digging through the Slushie piles?