Half the internet runs on AWS and one overheating building can take it all down. Wild how fragile the infrastructure really is.
1092
schnurble3 days ago
+738
if people would stop being so dependent on us-east-1, it would work a lot better.
738
2_Spicy_2_Impeach3 days ago
+378
As someone that worked for AWS... we told folks almost a decade ago to stop f****** building in us-east-1. It's the oldest and jankiest. As a customer I degraded us-east-1 doing a load test for a super bowl event.
378
ZAlternates3 days ago
+131
Part of the issue is that some of their global services like dns are still based out of us-east-1 for critical components. I believe IAM (the login service) is as well, but I can’t say I’ve researched it much recently. You’d like to hope they learn from past mistakes and improve…
131
kbn_3 days ago
+53
That stuff has been spread out more in recent years for exactly this reason. Still plenty of control planes in IAD but the major ones (Kinesis, Dynamo, Route53, etc) have a peer to peer failover mechanism to other regions.
53
2_Spicy_2_Impeach3 days ago
+27
Hopefully but probably not. I left because previously to get hired in you had to be more knowledgeable than half your team. Also, a lot of core product teams had so much turnover they were worthless due to culture. You had to befriend product teams to get info and prioritize PFRs. Still shocked stuff doesn't fall over more.
So many stories of idiots from partner and event side that I should write a book of what not to do. Also, stopped doing business with institutions after working with a lot of customers.
Then they hired thousands of idiots. For reference when I was there for ~1.5 years(FART), I was more senior than ~70% of the workforce. Stayed a while longer to cash out but what the f***.
27
footie_ruler3 days ago
+28
I was there for 7 years. Wanted to cash out all my initial RSUs because they spiked a lot. Worked in MSK, one of their largest money makers. The enshittification of AWS truly began in 23.
Random cuts and layoffs that were not cuts, but merely “restructuring to provide more HC to other services”. It’s all bullshit all the way down. They think they can get away with anything, but the recent migrations to GCP and Azure should just be the start of the floodgates. AWS is so shit now.
28
2_Spicy_2_Impeach3 days ago
+11
7 is impressive. Had a guy on my team that was \~15+ and came from the dot com side. Had some stories.
I dislike Azure more as a consumer. While there’s no love lost with AWS, I’ve never had more issues with a provider than Azure.
Their gov offering was interesting to say the least that I was forced to architect with. We had weekly standup with Microsoft to go over the issues we were experiencing.
Migrated away from AWS to Azure before my time there and regretted it.
11
speculatrix1 day ago
+1
I worked for a significant customer of aws, like $millions a month, and Google were so keen to win our business that they flew in several high level training experts across the Atlantic to give us four days of intensive training.
1
adx9312 days ago
+5
Why do I get the feeling there are load-bearing perl scripts underneath it all?
5
Aar0ns2 days ago
+2
We do not speak of the foundations on which all things are lain
2
schnurble3 days ago
+37
at my previous job I kept telling people to build primary sites in other regions - use2 if it HAS to be east, or usw1/2, but noooooo. 🤦♂️
37
rexspook3 days ago
+10
I work there now. Can confirm that a lot of the critical stuff has not really moved off of it. I mean my org is in every region by the nature of what we do. It’s the dependencies that haven’t been migrated yet that take you down.
10
2_Spicy_2_Impeach2 days ago
+2
Unfortunate to hear but not shocking. Genuinely curious how the morale is there now. My old manager is still there and randomly emails me at 4AM asking how things are.
Is the CloudFormation team finally fully staffed? It was an inside joke at the time.
2
rexspook2 days ago
+7
Honestly I stay as far away from anything CloudFormation related as possible. The running joke in our org is we keep making scripts and cli tools to fix the pipelines page because they don’t have the bandwidth to do it.
Morale is weird. This kind of LSE doesn’t really matter to most of us. Most of the morale problems right now are due to the numerous layoffs and rapidly advancing AI adoption with no clear guidance. It seems like half the engineers in my org are spending most of their time playing around with AI just because it’s available
7
2_Spicy_2_Impeach2 days ago
+3
Appreciate the insight. Yeah, the AI messaging was bizarre when I was there. Leadership said push, push, push (even though my team thankfully wasn’t goaled that way).
But most of the time there’d be some random internal workshop from someone who was bored and built something fun with no real world application.
3
rexspook2 days ago
+3
The first 2-3 months of 2026 was just tech demos of AI agents/skills/tooling that people built with nothing really being delivered related to our actual product lol
3
2_Spicy_2_Impeach2 days ago
+1
Oof.
Gotta get those demos/PoCs into one of the probably now hundreds of Salesforce tenants for goaling.
1
rexspook2 days ago
+2
Don’t know what that means. I’m not in sales. These were people showing off internal tooling to the technical teams
2
anengineerandacat2 days ago
+6
It's their legit first env for testing, and even if you're not using us-east-1, you're using it.
There are core services that rely on it regardless of the region of your services.
They have a page that goes into this in detail, I just don't have it on my current device.
6
2_Spicy_2_Impeach2 days ago
+1
Yes. I learned that as a customer with the DynamoDB outage the day after we launched our platform on AWS. More so that it has the most issues and expect shit to get weird. Control plane backed by DynamoDB having an issue and how many things had DynamoDB dependencies behind the scenes.
1
0neMinute3 days ago
+7
Are you sure about this? Most new services launch in us-east-1 and can initially only be used via us-east-1. If anything, they have been encouraging it, even if not saying it.
7
2_Spicy_2_Impeach3 days ago
+1
Yes. I’m sure.
1
0neMinute3 days ago
+1
Can you send me the docs?
1
2_Spicy_2_Impeach3 days ago
-5
No. Just go work for them and you’ll see.
-5
0neMinute3 days ago
+3
I do work for them, which is why I asked. All new services are launched in us-east-1. A lot of services are dependent on us-east-1. Does AWS want to fix this? Hard to actually say, as they haven’t yet, even after proving they can with the euro and China regions.
3
2_Spicy_2_Impeach3 days ago
Hilarious. Admitting to what I stated while asking for documentation. If you do actually work there, part of the reason I left and glad I did. It’s the biggest region and has the most issues. I don’t give a f*** if new services launch there. Again, as someone who worked with some of the largest consumers of AWS services we told folks to stop building there.
Don’t get me started on China and KMS implementation.
Have a good one.
0
0neMinute3 days ago
+4
Wth are you talking about? I said the opposite of what you're saying. They can absolutely fix us-east-1 worldwide; they only did the euro zone and China due to regulation changes. Those zones, like GovCloud, are heavily restricted on services. You should know this is obvious if you even use AWS?
Edit: also, as someone who worked for AWS, you should know one of the tenets is that if it's not documented, then it's not an official stance or practice.
4
dave0352x3 days ago
+2
Warm and fuzzy!
So glad I stopped working there
2
kbn_3 days ago
+4
Half the problem is the fact that new managed services usually come up in IAD first. The other half is IAD remains the interface default for a lot of things if you’re just clicking around in the console. Amazon could do a much better job getting people to spread out to other regions.
4
CrayonUpMyNose2 days ago
+1
Raise the price and put the fact as a question into every associate exam, so that every last engineer mentions it as a basic fact in meetings.
1
Ani-31 day ago
+1
Doesn’t help that AWS seems to default to that region even if all of your infrastructure is in a different region. I know there’s a setting to change it.
1
CircumspectCapybara3 days ago
+32
It's because service providers are c**** and too lazy to properly design highly available multi-region distributed systems.
If you want a five nine availability SLO, you have to be multi-regional, there's just no getting around it. A flood or hurricane or Iranian missile strike can take out an entire region and you can't do anything about that, you have to be in multiple regions. Service providers gotta stop being c**** and do proper engineering.
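The arithmetic behind that claim can be sketched roughly as follows. This is a minimal illustration, not anything from AWS: the 99.9% single-region figure is a made-up input, and the formula assumes regions fail independently, which shared dependencies (for example, control planes that live in one region) would break.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability: float) -> float:
    """Minutes of downtime per year that a given availability allows."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def combined_availability(per_region: float, regions: int) -> float:
    """Availability of a service that stays up while at least one region is up.

    ASSUMPTION: regions fail independently of each other. Real deployments
    with shared dependencies will do worse than this number.
    """
    return 1.0 - (1.0 - per_region) ** regions


if __name__ == "__main__":
    # Hypothetical single-region availability, chosen only for illustration.
    single_region = 0.999
    for n in (1, 2, 3):
        a = combined_availability(single_region, n)
        print(f"{n} region(s): availability={a:.9f}, "
              f"downtime={downtime_minutes_per_year(a):.3f} min/year")
```

Under those assumptions, a five-nines target leaves only a few minutes of downtime per year, which a single region at the illustrative 99.9% cannot meet, while two independent regions could on paper.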
32
eXecute_bit3 days ago
+12
Marketing wants to advertise all the nines, but investors and therefore the executives won't actually pay for it because (looking at all the nines they see from AWS) "it probably won't happen". Then it's somehow my fault when it does.
12
adx9312 days ago
+1
Every day brings news that makes me glad I retired when I did. I think the future is going to be a move back to slow, paper-based processes.
1
essjay243 days ago
+4
> Iranian missile strike
Don’t get me started.
I was getting heat because EMEA couldn’t log into my app. I asked them to send me an email about it because I knew that it wasn’t me but login services hosted out of a blown up data center. Imagine their surprise when they couldn’t log in to email either.
4
Loud_Ninja23623 days ago
+1
That also requires execs who are willing to pay for the hard engineering required to do that proper design work.
1
reasonman2 days ago
+4
it's not even necessarily use1 that was the issue, it was a specific building in a specific az. a lot of customers deploy in single az and refuse to do multiaz or multiregion arch and get bit for it, even the biggest names do this c**** shit
4
schnurble2 days ago
+1
In this case yes it's a single AZ, sure. But if you look back the bigger outages have always been us-east-1. And yes customers need to have more regional redundancy. But still use1 is the common thread.
1
reasonman1 day ago
+1
well yes, it's the largest, most populous region with every supported service and feature living there. there's a much larger surface area for things to go wrong. of course they shouldn't, but things happen, and if one region accounts for the vast majority of your load then naturally most of the impacts will be felt there.
1
drkspace22 days ago
+1
Even if you try your best to avoid us-east-1, you still need it for at least IAM and Route 53.
1
Acceptable_Bat3793 days ago
+43
I work in the field, the entire internet is held together by duct tape and dreams
43
kiss_my_what3 days ago
+15
BGP, DNS and a whole lot of whisky.
15
seriousnotshirley3 days ago
+8
Can confirm, worked on DNS and BGP teams at a company that had an awesome whiskey cabinet.
8
oldfogey123453 days ago
+41
It's people not paying for redundancy.
Oversimplifying a bit, but..
People who are affected by a heat issue for one building only paid Amazon to house their digital stuff in that one building.
If your service being down for any amount of time will cost your business enough money, then it's worth the extra money to pay Amazon to have a "copy" of your website in more than one building.
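That trade-off can be written down as a rough break-even check. This is only a sketch: every number in it (outage hours, cost per hour, extra spend) is a hypothetical input you would have to estimate for your own business, and the function names are made up for illustration.

```python
def expected_outage_cost(hours_down_per_year: float, cost_per_hour: float) -> float:
    """Expected yearly loss from downtime, given an hourly cost of being down."""
    return hours_down_per_year * cost_per_hour


def redundancy_pays_off(
    hours_down_single: float,
    hours_down_redundant: float,
    cost_per_hour: float,
    extra_annual_spend: float,
) -> bool:
    """True if the downtime cost avoided exceeds the extra yearly spend.

    ASSUMPTION: all inputs are estimates; this ignores engineering effort,
    reputational damage, and anything else that is hard to price.
    """
    avoided = expected_outage_cost(
        hours_down_single - hours_down_redundant, cost_per_hour
    )
    return avoided > extra_annual_spend


if __name__ == "__main__":
    # Hypothetical figures: 8 h/year down in one building vs 0.5 h/year with
    # a second copy, and 50,000 per year of extra hosting spend.
    for cost_per_hour in (1_000, 10_000):
        worth_it = redundancy_pays_off(8.0, 0.5, cost_per_hour, 50_000)
        print(f"cost per hour {cost_per_hour}: redundancy worth it = {worth_it}")
```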
41
Bovronius3 days ago
+22
If it was all a matter of personal responsibility, that'd be great... but even though in my decades I've never put anything on us-east, when it's down our company is pretty much paralyzed. Both sales tax calculation companies we can use for our software go dark when us-east goes down, half our vendor portals go down... our HR software... our bank's site...
The move to cloud has put us all in a shared risk pool, and unfortunately everything is becoming so interconnected and dependent that when any of the big 3 have problems everyone is going to feel it.
22
Think_Positively3 days ago
+11
It's also only early May. I live in New England so I'm no VA weather expert, but I also don't need to be to understand that it's going to get a LOT hotter in the coming weeks and months.
AWS should be counted as a utility at this point, and they have regulations in place to account for stuff like this (unless you're in TX).
11
Weaver2703 days ago
+7
Redundancies are for government and private companies who dont have to meet quarterly numbers and... Insert other excuses here ..
7
Dabaer773 days ago
+4
"Efficiency" as understood by an MBA is eliminating any kind of back up or redundancy and hoping things just never break. Then when they end up breaking a different group of MBAs get to say no one could have foreseen anything ever breaking.
4
sylbug2 days ago
+6
It wasn’t supposed to be like that. The strength in a distributed network is that you can route around damage.
We took a robust, distributed system and did a capitalism on it.
6
no_dice2 days ago
+1
This was literally one AZ in a region with 6 of them. Anyone who experienced an outage as a result implemented something that goes against best practices.
1
oneseason20003 days ago
+7
Maybe more like how fragile it is when a few people can make unilateral decisions impacting tens of millions. The wild part is how the unilateral bit comes about, imo.
7
shinjikun103 days ago
+19
Back when the internet first started there was a man who helped design TCP/IP. He said in a meeting in congress that he could take down the entire internet himself if he wanted. I can't remember his name.
19
Single_9_uptime3 days ago
+27
Probably in reference to BGP. Anyone with access to core internet routers from tier one providers could cause havoc. But there are a lot more controls on that today than there were in the early days of the internet, and the network is far more disparate to the extent it isn’t possible for one person to take down the entire internet. Limited things still get broken occasionally from bunk advertisements leaking out, but it’s very limited who’s capable of doing so.
If it were that easy to take down the entire internet in remotely modern times it would have happened already.
27
hitbythebus3 days ago
+2
Some folk would take your comment as a challenge.
2
shinjikun103 days ago
-2
It could have been Yakov Rekhter or maybe Vint Cerf. I can't remember.
-2
ThoughtsOfALayman3 days ago
+8
Are you referring to L0pht, maybe? It was a group, rather than one man, but they made that claim before congress.
8
diogenes-shadow3 days ago
+9
Each AWS zone has at least three data centers working together. Any one of the three buildings can go down and the others should be able to keep things running most of the time.
The internet and modern services are very fault tolerant. They have outages but you mostly never hear about it for this reason.
9
PNW_ModTraveler3 days ago
+3
Both statements are false. I don’t support data centers but if you want to cry wolf…
It’s closer to 32% and “taking it all down” is just a sensationalist take.
3
[deleted]3 days ago
-3
[removed]
-3
PaidUSA3 days ago
+3
This is worse than any slop post by far. "Regulate my free expression" is so much more detrimental than slop.
3
Justin__D3 days ago
+3
Right? The internet used to be a much more open forum. Now it's censored to hell and back, as proven by shit like use of words like "grape" and "unalive."
That's a core part of the *problem* with the modern internet, yet we have people wanting more of that?
3
[deleted]3 days ago
+1
[removed]
1
PaidUSA3 days ago
+1
Block most of the c*** people post. Don’t f****** backtrack now. That’s literally censorship you’re calling for.
1
PNW_ModTraveler3 days ago
-3
So we become a police state like China but worse!? 😂
-3
mineyCrafta253 days ago
+1
The "cloud" infrastructure at that
1
Every-Development3983 days ago
+1
AWS is not one region but many; this will impact some customers, but not all by any means.
1
weasel51343 days ago
+1
Infrastructure is so much worse than you know. Just in general
1
LittleKitty2351 day ago
+1
About 6-8% of US bridges are officially rated as poor condition or structurally deficient. The power grid in both Texas and California is held together with hopes and prayers.
If you've been paying attention you know exactly how bad it is now, and after Trump DOGE efforts expect it to get worse
1
weasel51341 day ago
+1
I have first hand horror stories
I worked underneath (not on just physically below) a bridge so bad I was scared to drive my truck back over again
1
Hedhunta3 days ago
+1
Even the nice-looking data centers are filled with kludge fixes. In my experience, it's amazing anything works at all.
1
JuicedRacingTwitch3 days ago
+1
If your ops are that critical you should plan for multi cloud failover and redundancy. This is a budget and scope issue.
1
Bornee352 days ago
+1
Those who came up with the original principle of a resilient, decentralized network for sharing information are probably rolling in their graves right now
1
DisillusionedPatriot2 days ago
+1
Even more wild is the lack of urgency to update said infrastructure.
1
Gzngahr1 day ago
+1
This is a weak point in the supposed AI job stealing apocalypse and why they would love to put data centers on the moon or in orbit.
Lay off too many people with little prospect of finding alternative work to continue affording their lives, someone is bound to attack the infrastructure. You don’t even have to destroy it, just sabotage the cooling systems or power supply.
1
haklor1 day ago
+1
The major cloud providers give architecture guidance to companies to ensure that services are not impacted when a single data center or region is impacted. It is on the various companies to do the cost/benefit analysis on whether they want to pay for the availability. Some companies refuse to pay until they get impacted by a small outage or degradation.
1
DukeandKate3 days ago
+83
Coinbase impacted - good.
83
livenn3 days ago
+57
Didn’t know those housed toilet paper
57
im-ba3 days ago
+13
I understood that reference 🔥🧻🔥
13
broke_boi13 days ago
+16
Took an AWS Cloud Architecting course a few months ago. One of the things they hammer is to deploy and have backups in different availability zones so shit like this doesn’t happen
16
adx9312 days ago
+3
Unfortunately, Amazon won't pay for the courses for their own people so they don't know that on the infrastructure side.
3
waidee701 day ago
+2
More like Coinbase didn’t know basic redundancy if one AZ caused actual issues for them
2
Sirwired2 days ago
+1
Any Amazon employee is eligible for the internal training.
1
fountain203 days ago
+27
Can we start doing this in real time. A year and a half has passed. Little late to fix the problem.
27
Magic_Neil3 days ago
+33
Well that explains why a half dozen of my servers went down a couple hours ago!
33
karateninjazombie2 days ago
+4
Meh. Non emergency.
Massive f*** up in planning though. Someone will lose their job as a result.
4
couchjitsu2 days ago
+3
Guess coinbase will have to axe another 14% of their workforce
3
RiversSecondWife3 days ago
+10
We have a bunch of fire here in Florida. You want some for that data center?
10
czs50563 days ago
+3
Better suck ALL the water in the entire state then to cool it off. /s
3
thepianoman4563 days ago
+4
Let’s just fuckin scrap AI.
It has a couple legitimate uses, but for all the AI slop the people generate for memes, or “creating art / music” (by stealing other people’s legitimately created art and music) we need more data centers.
If we all just refuse to use garbage ass generative AI, there won’t be a need for more data centers… at least, the absurd amount that the tech billionaires want to build.
4
Software_Quiet3 days ago
+3
as the kids say, let them cook!
3
Iconic2543 days ago
+3
The disruption was caused by overheating at a data center, which subsequently triggered a power loss that affected specific hardware.
3
Mrjlawrence3 days ago
+3
I’m sure this won’t result in tech bros clamoring for more data centers /s
3
Pardot423 days ago
+2
I'll bet there will be many data centers catastrophically overheating in the next few years
2
karer3is1 day ago
+1
We can only hope... although I can imagine the crypto and AI bro meltdowns that ensue will be even more intense than those of the data centers
1
olearyboy3 days ago
+1
Wasn’t even that hot yesterday
1
ReedForman3 days ago
+1
Come into one of the warehouses.. Amazon been going c**** on their AC bills lately
1
MrBahhum3 days ago
-2
They are overheating because they are poorly managed. All data centers are resource sinks. They need to disclose how much resources they use.
-2
i_am_voldemort3 days ago
-3
A communication disruption can mean only one thing: invasion.
For those who did not catch the reference:
https://youtu.be/eF4Hcr7XX3c
-3
secretqwerty103 days ago
-3
or, hear me out: the cooling failed, like it says in the article
-3
i_am_voldemort3 days ago
+4
It's a quote from The Phantom Menace you twit.
4
secretqwerty103 days ago
-2
ooo i'm sorry i don't remember a forgettable quote from a movie that's 3 years older than i am
-2
geekgirl1142 days ago
Is it one data center or the whole us-east-1 region? There are about 10 zones in the region