Postmortem September 17th outage and rollback

Question for the nerds: how do enterprises and critical systems etc stop this happening? I guess it makes sense that if you have 4 drives made at the same time, being used in the same way, they will fail at the same time or very close to each other. Making RAID more risky?
Mixing stock helps. Within a RAID you generally want matched manufacturers and models, but you can mix where they're purchased to get different lots and manufacture dates. And you always make sure the real-time replicated backup lives on a different RAID, from a different manufacturer, on different hardware, and in a different disaster zone.
 
Question for the nerds: how do enterprises and critical systems etc stop this happening? I guess it makes sense that if you have 4 drives made at the same time, being used in the same way, they will fail at the same time or very close to each other. Making RAID more risky?

They have multiple servers that cost tens of thousands of dollars each, every one dedicated to a single purpose, and staff who constantly monitor them.
And also the aforementioned avoidance of using the same drive batches. If they have the same defect, they will fail at the same time. If you get various drives from different batches or even different manufacturers you're avoiding that chance.
 
I'm still reading through this conversation, but I have a question: Are the backups of the forum on more resilient long-term storage? How many of those drives have to fail before everything's lost, I suppose is what I'm asking. There are off-site backups as well, right?
Question for the nerds: how do enterprises and critical systems etc stop this happening? I guess it makes sense that if you have 4 drives made at the same time, being used in the same way, they will fail at the same time or very close to each other. Making RAID more risky?
The most resilient critical systems use multiple computers, running equivalent but different software, executing in lock-step, and any machine which diverges is eliminated from the running pool.
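For anyone curious what "diverging machines get eliminated" looks like, here is a rough toy sketch in Python. It is not how any real lock-step system is built; the function name `vote_and_evict` and the dictionary-of-replicas layout are made up for illustration. Each replica is an independent implementation of the same computation, all of them get the same input, and anything that disagrees with the majority is dropped from the pool.

```python
from collections import Counter

def vote_and_evict(pool, inputs):
    """Run every replica on the same inputs and majority-vote the result.

    `pool` maps a replica name to a callable (hypothetical layout).
    Returns (majority_result, surviving_pool). Replicas whose output
    differs from the majority are removed from the surviving pool.
    """
    results = {name: fn(inputs) for name, fn in pool.items()}
    majority, count = Counter(results.values()).most_common(1)[0]
    if count <= len(pool) // 2:
        # Without a strict majority we cannot tell who is wrong.
        raise RuntimeError("no majority among replicas")
    survivors = {name: fn for name, fn in pool.items()
                 if results[name] == majority}
    return majority, survivors

def sum_builtin(xs):
    return sum(xs)

def sum_loop(xs):
    total = 0
    for x in xs:
        total += x
    return total

def sum_faulty(xs):
    # Simulated defect: off by one.
    return sum(xs) + 1

pool = {"builtin": sum_builtin, "loop": sum_loop, "faulty": sum_faulty}
result, pool = vote_and_evict(pool, (1, 2, 3))
print(result, sorted(pool))
```

Using different implementations matters here for the same reason as mixing drive batches: identical copies would share the same bug and out-vote the truth together.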

The big technology companies, such as Google, really just throw bodies at the problem instead. IBM provides systems that are basically designed to never fail at all; now, sure, those companies also have a lot of employees managing them, but an IBM mainframe can do things like monitor itself and tell IBM to send spare parts before anything actually fails.
 
I have a feeling that if you and CrunkLord had been in charge of repairing the Millennium Falcon, there would have been a lot less drama in the Star Wars Trilogy.
 
Question for the nerds: how do enterprises and critical systems etc stop this happening? I guess it makes sense that if you have 4 drives made at the same time, being used in the same way, they will fail at the same time or very close to each other. Making RAID more risky?
In certain critical scenarios, IT/facilities/whoever else has 2N+1 redundancy for everything, and it's commissioned to verify that system failures would be independent.

I'm not Josh, so I don't know exactly what he's working with, but say the chance of a single device failing is 1/100 on any given day (failures are actually quite normal for a device running 24/7). Then the chance of four independent failures landing on the same day, at some point over the lifetime of a 10-year website, would be about 1 in 30 thousand (1/100 to the fourth power, times roughly 3,650 days). Devices tend to fail at similar times, so the real odds are likely worse: maybe 1 in 10 thousand, or even 1 in 100. With multiple things that could go wrong, it should not be surprising.
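If you want to check that arithmetic yourself, here is a minimal sketch. It assumes the figures from the post (1/100 daily failure chance, four drives, ten years of 365 days) and fully independent failures, which real drives from the same batch will not give you. The function name is made up for illustration.

```python
import math

def prob_all_fail_same_day(p_daily, n_drives, days):
    """Chance that all n_drives fail on the same day at least once in `days`.

    Assumes each drive fails independently with probability p_daily
    per day, which is an idealisation.
    """
    p_one_day = p_daily ** n_drives
    # 1 - (1 - p)^days, written with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(days * math.log1p(-p_one_day))

p = prob_all_fail_same_day(p_daily=0.01, n_drives=4, days=10 * 365)
print(f"about 1 in {1 / p:,.0f}")
```

With those inputs it comes out a little over 1 in 27 thousand, so "1 in 30 thousand" is the right ballpark for the independent case. Correlated failures from a shared batch only make it more likely.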

Big enterprises lower those odds by having multiple datacenters, built in different ways and in different locations (Google Cloud and Microsoft OneDrive are the most frequent backup choices for smaller enterprises). National companies put them on different power grids, and huge conglomerates like Microsoft and Google put them on different continents. Josh owns much of his equipment, and the total failure of the power grid in a single country could potentially take this website down. Then, in each datacenter, all the servers are independent, and literally every other piece of equipment, from transformers to HVAC equipment, is 2N+1 redundant as well.

This is not such a big deal for Kiwifarms, which doesn't bring in a million dollars of revenue every single hour (imagine how much money Google would lose if AdSense went down for an hour), but huge enterprises will build a huge fucking datacenter in another country and spend millions of dollars doing it.
 
DISCLAIMER: this user knows nothing about Computering . it is just a joke please dont Pummel me null about the body and/or face.
 
TKFD status: "More resilient than cockroaches"
TTD status: "Self fulfilling"

@Null might be a filthy chief janny, but he's also a great warrior fighting for our autism.
 