Amazon brain drain finally sent AWS down the spout - When your best engineers log off for good, don’t be surprised when the cloud forgets how DNS works

Corey Quinn, special to El Reg
Mon 20 Oct 2025 // 19:55 UTC

"It's always DNS" is a long-standing sysadmin saw, and with good reason: a disproportionate number of outages are at their heart DNS issues. And so today, as AWS is still repairing its downed cloud as this article goes to press, it becomes clear that the culprit is once again DNS. But if you or I know this, AWS certainly does.

And so, a quiet suspicion starts to circulate: where have the senior AWS engineers who've been to this dance before gone? And the answer increasingly is that they've left the building — taking decades of hard-won institutional knowledge about how AWS's systems work at scale right along with them.

What happened?​

AWS reports that on October 20, at 12:11 AM PDT, it began investigating “increased error rates and latencies for multiple AWS services in the US-EAST-1 Region.” About an hour later, at 1:26 AM, the company confirmed “significant error rates for requests made to the DynamoDB endpoint” in that region. By 2:01 AM, engineers had identified DNS resolution of the DynamoDB API endpoint for US-EAST-1 as the likely root cause, which led to cascading failures for most other things in that region. DynamoDB is a "foundational service" upon which a whole mess of other AWS services rely, so the blast radius for an outage touching this thing can be huge.
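For the terminally curious, this failure mode is easy to picture from the client side. What follows is a minimal sketch, not AWS's internal tooling: the hostname is the real public DynamoDB endpoint for US-EAST-1, but the probe function and its output are purely illustrative.

import socket

# Real public endpoint for DynamoDB in US-EAST-1; everything else in
# this sketch is illustrative.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

def check_endpoint_dns(hostname: str) -> list[str]:
    """Resolve a hostname and return its addresses, or [] on failure.

    During the outage, lookups like this one simply stopped resolving
    (socket.gaierror), so every SDK call aimed at DynamoDB in the region
    failed before a single packet ever reached AWS.
    """
    try:
        results = socket.getaddrinfo(hostname, 443, type=socket.SOCK_STREAM)
        # Each result is a 5-tuple; the last element holds the address.
        return sorted({sockaddr[0] for *_, sockaddr in results})
    except socket.gaierror as err:
        print(f"DNS resolution failed for {hostname}: {err}")
        return []

if __name__ == "__main__":
    addresses = check_endpoint_dns(ENDPOINT)
    print(addresses or "no addresses - this is what 'it's always DNS' looks like")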

As a result, much of the internet stopped working: banking, gaming, social media, government services, buying things I don't need on Amazon.com itself, etc.

AWS has, as is its tradition when outages strike, released increasing levels of detail as new information comes to light. Reading through it, one really gets the sense that it took them 75 minutes to go from "things are breaking" to "we've narrowed it down to a single service endpoint, but are still researching," which is something of a bitter pill to swallow. To be clear: I've seen zero signs that this stems from a lack of transparency, and every indication that they legitimately did not know what was breaking for a patently absurd length of time.

Note that for those 75 minutes, visitors to the AWS status page (reasonably wondering why their websites and other workloads had just burned down and crashed into the sea) were met with an "all is well!" default response. Ah well, it's not as if AWS had previously called out slow outage notification times as an area for improvement. Multiple times even. We can keep doing this if you'd like.

The prophecy​

AWS is very, very good at infrastructure. You can tell this is a true statement by the fact that a single one of their 38 regions going down (albeit a very important region!) causes this kind of attention, as opposed to it being "just another Monday outage." At AWS's scale, all of their issues are complex; this isn't going to be a simple issue that someone should have caught, just because they've already hit similar issues years ago and ironed out the kinks in their resilience story.

Once you reach a certain point of scale, there are no simple problems left. What's more concerning to me is the way it seems AWS has been flailing all day trying to run this one to ground. Suddenly, I'm reminded of something I had tried very hard to forget.

At the end of 2023, Justin Garrison left AWS and roasted them on his way out the door. He stated that AWS had seen an increase in Large Scale Events (or LSEs), and predicted significant outages in 2024. It would seem that he discounted the power of inertia, but the pace of senior AWS departures certainly hasn't slowed — and now, with an outage like this, one is forced to wonder whether those departures are themselves a contributing factor.

You can hire a bunch of very smart people who will explain how DNS works at a deep technical level (or you can hire me, and I'll incorrectly inform you that it's a database), but the one thing you can't hire for is the person who remembers that when DNS starts getting wonky, you check that seemingly unrelated system in the corner, because it has historically played a contributing role in outages of yesteryear.

When that tribal knowledge departs, you're left having to rebuild an awful lot of in-house expertise that walked out rather than participate in your RTO games or play Layoff Roulette yet again this cycle. This doesn't impact your service reliability, until one day it very much does, in spectacular fashion. I suspect that day is today.

The talent drain evidence​

This is The Register, a respected journalistic outlet. As a result, I know that if I publish this piece as it stands now, an AWS PR flak will appear as if by magic, waving their hands, insisting that "there is no talent exodus at AWS," a la Baghdad Bob. Therefore, let me forestall that time-wasting enterprise with some data.
  • It is a fact that there have been 27,000+ Amazonians impacted by layoffs between 2022 and 2024, continuing into 2025. It's hard to know how many of these were AWS versus other parts of its Amazon parent, because the company is notoriously tight-lipped about staffing issues.
  • Internal documents reportedly say that Amazon suffers from 69 percent to 81 percent regretted attrition across all employment levels. In other words, "people quitting who we wish didn't."
  • The internet is full of anecdata from senior Amazonians lamenting the ham-fisted approach of the Return to Office initiative; experts have weighed in with similar concerns.
If you were one of the early employees who built these systems, the world is your oyster. There's little reason to remain at a company that increasingly demonstrates apparent disdain for your expertise.

My take​

This is a tipping point moment. Increasingly, it seems that the talent who understood the deep failure modes is gone. The new, leaner, presumably less expensive teams lack the institutional knowledge to prevent these outages in the first place, or at least to significantly reduce the time to detection and recovery. Remember, there was a time when Amazon's "Frugality" leadership principle meant doing more with less, not doing everything with basically nothing. AWS's operational strength was built on redundant, experienced people, and when you cut to the bone, basic things start breaking.

I want to be very clear on one last point. This isn't about the technology being old. It's about the people maintaining it being new. If I had to guess what happens next, the market will forgive AWS this time, but the pattern will continue.

AWS will almost certainly say this was an "isolated incident," but when you've hollowed out your engineering ranks, every incident becomes more likely. The next outage is already brewing. It's just a matter of which understaffed team trips over which edge case first, because the chickens are coming home to roost. ®

Source (Archive)
 
Perhaps they should not have laid off all the Americans for a certain group of people..................
 
Perhaps they should not have laid off all the Americans for a certain group of people..................
It's not even the layoffs; their work environment sucks so much that they can't keep anyone who isn't dependent on them to stay in the country:
 
AWS has a huge “tribal knowledge” issue… But they also have issues keeping employees longer than two years. So, after the initial six-month new-hire period, people work for a year, maybe a year and a half, before bailing out because the job sucks. They have very few “legacy” employees, and from the people I know who have worked at AWS, they play the same corporate politics as anyone else. No one documents their processes, so as to make themselves seem more important, etc.
 
Once again, for those just joining us:

"No one is irreplaceable... but some people are a lot harder to replace than others."

Sadly, a lot of management and corporate types only get the first half of that maxim.
 
This is, unfortunately, a microcosm of the collapse of complex systems in the West in general. White men spend years, decades, or even generations building a complex system from nothing, in this case AWS services, and then they are ruthlessly purged from said system and replaced with browns and women since we are "all the same". The browns and women are not selected for intelligence, skill, or experience, but because of skin or genitals, and have absolutely nothing but contempt for the complex system and the White men who built it from nothing. Not only do they not respect the system they are placed in charge of, but they outright treat it with disdain, as they inherited something that works with well oiled precision, and leads them to believe that it must not be "that hard" since it works so well. Then, the system begins to decay because the White men who built and maintained it were forced out for having the audacity to be born wrong, and their female and swarthy replacements then struggle to handle a hiccup because they are profoundly ignorant about the system they have inherited from their betters.

Same thing for our airline systems, roadways, entertainment, military, the list goes on and on. This will continue to get worse and worse until we admit that the only reason the modern world exists is because White men built it, and allow White men to take it back over. That won't happen though, so I would expect AWS to have more outages like this in the future, that will probably be even more wide reaching and lengthy than this one has been.

Of course it's slightly more complicated than that, since I have heard working at Amazon sucks in general unless you're a golden child, but I still maintain that's the root cause of this problem. Shit in, shit out.
 
When your workplace is a toxic mess, your best employees move to better employers. The slacker assholes stay because they don't give a shit.

Also, competitors know which employers have unhappy employees and actively recruit. Former AWS employees will try to recruit their best former coworkers. They even get cash bonuses for bringing new people in.
 
Things never went down pre-2015; it's shocking how many companies have had national outages and customer data leaks since then. There's no penalty for it, and people are willing to suffer it. It's going to get worse, until people finally shout "enough!" and hold someone accountable.
 
AWS is very, very good at infrastructure. You can tell this is a true statement by the fact that a single one of their 38 regions going down (albeit a very important region!) causes this kind of attention, as opposed to it being "just another Monday outage." At AWS's scale, all of their issues are complex; this isn't going to be a simple issue that someone should have caught, just because they've already hit similar issues years ago and ironed out the kinks in their resilience story.
I’ve been studying for an AWS certification and I can tell you they are very proud of their reliability and resiliency. Their study materials read more like an ad than a guide to keeping your portion of the cloud running.
 
No one documents their processes, so as to make themselves seem more important, etc.
A big issue a lot of places have is that nobody has the time to document anything. Proper documentation - the kind that lets someone actually understand a process more or less from scratch - easily makes a process take 2-3x the time. Rather than just do X, Y, and Z, I'm going to also write out my thought processes, what each step does in the larger context, take screenshots and edit them to highlight what I'm doing... it takes a long time to make anything useful to the rest of the team. And when shit is broken and you need to fix it now nOW NOW NOW NWO then your manager is going to say to just get it working, don't waste time documenting, because he wants to look good to HIS manager by getting it fixed ASAP.

Smart people will recognize that having notes on emergencies would be useful, but middle managers are neither smart nor people so you get the obvious outcome. And even if they do let you document, most IT people suck cock at communicating well, so the documentation will vary in quality wildly unless you mandate some kind of technical writing classes for your staff, which ALSO take up more time in the workday. And also to your point, if you work for such a manager then "appearing important" is a serious currency and if you're spending time documenting things properly then your shitbag coworker can cut corners to appear twice as productive as you, and guess who gets promoted? So you leave, and soon the team is just people who play the game properly, and point fingers when things break.
 
Why didn't the best and brightest elite human capital from India do the needful and turn the internet back on?
 
I know former colleagues who worked at AWS; their IT jobs are just as much of a brutal sweatshop environment as their warehouses. Most smart people go there to make a shitload of money fast, then quit and work somewhere less intense.

Why didn't the best and brightest elite human capital from India do the needful and turn the internet back on?

They’re pajeets, what were you expecting? Also, Monday was Diwali, which is basically Pajeet Christmas.

Things never went down pre-2015; it's shocking how many companies have had national outages and customer data leaks since then. There's no penalty for it, and people are willing to suffer it. It's going to get worse, until people finally shout "enough!" and hold someone accountable.

This is completely untrue. Modern IT operations practices have greatly curtailed outages industry wide. Even if you exclusively hire the shittiest street shitters to staff your data center, uptime has probably greatly improved if only because the tools of the trade are better.

The reason it doesn’t seem that way is that everyone now leans on stuff like AWS to some extent, often indirectly, which means if AWS shits their pants, everyone gets shat on and the smell is unmissable. 20 years ago if the First National Bank of Bumfuck, Oklahoma’s systems took a dump, nobody cared except their 15 customers.

As for breaches and security issues, yes, it is worse now because there’s more valuable data to steal, and it is increasingly protected by less and less competent people who seem to get browner and browner by the day.
 
AWS has a huge “tribal knowledge” issue
I've known a few people who worked at Amazon. One of their biggest problems is documentation. If you've ever had to deal with their external documentation to integrate AWS into your stack, you'll know that it's dogshit. However, internal documentation... is also shit. The reason it's shit is the same reason Stack Overflow went to shit. First off, a bunch of it is written by Indians. That, itself, reduces the quality greatly. But the thing that makes it abysmal is that they have to write a certain amount of documentation to get promoted. This isn't inherently bad, but the problem is that no one checks it. So you've got jeets writing internal documentation that is incorrect, that stays incorrect because no one checks it, and that is based on other documentation that is also incorrect (because that's where they get the information to write it; they don't actually test shit out or make sure things actually work a certain way), which will one day be used to produce documentation that is even more incorrect.

Amazon has some of the most retarded ways of running a company ever. If Bezos didn't have billions/trillions of dollars of government money subsidizing his services, and in many cases handling his shit for him by lending him the handful of actually competent government technical employees, Amazon would go belly-up. In fact, most of the major tech companies are run so poorly that I have to assume GenAI is already a thing, and there is a secret supercomputer doing all the work for these companies behind the scenes, because I don't understand how we don't have more problems and outages than we do. It's hard to imagine that the technical experts and engineers of a decade ago set up systems so streamlined and optimized that only a handful of competent people could keep them up and running, to the point that we're only starting to see major cracks now.
 
Once again, for those just joining us:

"No one is irreplaceable... but some people are a lot harder to replace than others."

Sadly, a lot of management and corporate types only get the first half of that maxim.
...while simultaneously falling very far on the wrong side of the coin with regard to the second half.
 