Postmortem September 17th outage and rollback

DavidS877 · Sep 17, 2023

Nopenopenope said:
Question for the nerds: how do enterprises and critical systems etc stop this happening? I guess it makes sense that if you have 4 drives made at the same time, being used in the same way, they will fail at the same time or very close to each other. Making RAID more risky?

Mixing stock helps. For RAID you generally want matched manufacturers and models, but you can mix where they're purchased to get different lots and manufacture dates. And you always make sure the real-time replicated backup lives on a different RAID, from a different manufacturer, on different hardware, and in a different disaster zone.
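A rough sketch of what "mixing lots" looks like as a check, for illustration only. The inventory, serials, model name, and lot codes below are all made up, and `shared_lots` is a hypothetical helper, not part of any real tool.

```python
from collections import Counter

# Hypothetical inventory: (serial, model, lot code) for each drive in one array.
drives = [
    ("S1", "ModelA", "LOT-2301"),
    ("S2", "ModelA", "LOT-2301"),
    ("S3", "ModelA", "LOT-2307"),
    ("S4", "ModelA", "LOT-2312"),
]

def shared_lots(drives):
    """Return the lot codes that appear on more than one drive in the array."""
    counts = Counter(lot for _serial, _model, lot in drives)
    return {lot: n for lot, n in counts.items() if n > 1}

# Any lot listed here is a candidate for correlated failure.
print(shared_lots(drives))
```

Same model throughout, but two drives share a lot, so those two are the pair you'd want to swap out or split across arrays.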

Slav Power · Sep 17, 2023

Nopenopenope said:
Question for the nerds: how do enterprises and critical systems etc stop this happening? I guess it makes sense that if you have 4 drives made at the same time, being used in the same way, they will fail at the same time or very close to each other. Making RAID more risky?

Null said:
https://youtu.be/JHVSoJDZ06U?t=169

They have multiple servers that cost tens of thousands of dollars each and which are dedicated to a single purpose and have staff that constantly monitor them.

There's also the aforementioned practice of avoiding drives from the same batch. If they share a defect, they will fail around the same time. If you get drives from different batches or even different manufacturers, you reduce that chance.

UERISIMILITUDO · Sep 17, 2023

I'm still reading through this conversation, but I have a question: Are the backups of the forum on more resilient long-term storage? How many of those drives have to fail before everything's lost, I suppose is what I'm asking. There are off-site backups as well, right?

Nopenopenope said:
Question for the nerds: how do enterprises and critical systems etc stop this happening? I guess it makes sense that if you have 4 drives made at the same time, being used in the same way, they will fail at the same time or very close to each other. Making RAID more risky?

The most resilient critical systems use multiple computers, running equivalent but different software, executing in lock-step, and any machine which diverges is eliminated from the running pool.
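As a toy illustration of that idea (not how any particular system implements it), the voting step can be sketched as a majority vote over the replicas' outputs. The replica names and the `vote` helper are hypothetical.

```python
from collections import Counter

def vote(outputs):
    """One lock-step round: outputs maps replica id -> result.

    Returns (agreed_value, diverging_ids). Raises if no strict majority
    exists, since then there is no way to tell which replica is wrong.
    """
    tally = Counter(outputs.values())
    value, count = tally.most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority among replicas")
    diverged = [rid for rid, out in outputs.items() if out != value]
    return value, diverged

# Three hypothetical replicas; "c" computes a different answer this round.
pool = {"a", "b", "c"}
outputs = {"a": 42, "b": 42, "c": 41}

value, diverged = vote(outputs)
pool -= set(diverged)  # the diverging machine leaves the running pool
print(value, sorted(pool))
```

With only two replicas left, a later disagreement can be detected but not resolved, which is one reason such systems tend to start with three or more.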

The big technology companies, such as Google, really just throw bodies at the problem instead. IBM provides systems that are basically designed to never fail at all; now, sure, those companies also have a lot of employees managing them, but an IBM mainframe can do things like monitor itself and tell IBM to send spare parts before anything actually fails.

Weeb Slinger · Sep 17, 2023

I have a feeling that if you and CrunkLord had been in charge of repairing the Millennium Falcon, there would have been a lot less drama in the Star Wars Trilogy.

DirectorDelta · Sep 17, 2023

Glory to our dear leader for getting us back up so fast, I'm glad it's only tech issues and not some sort of anniversary dropkiwifarm attack.

Artificial Stupidity · Sep 17, 2023

I was on Pol without kf, it's so fucking pozzed every post was just the usual bombardment of crap posts, low tier bait, glad to be back.

The Ultimate Ramotith · Sep 17, 2023

This is actually Depressing... getting close to Misery.

The 120.000.000 Internet Sticker thing is random.txt material, at least.

DavidS877 · Sep 17, 2023

The Ultimate Ramotith said:
This is actually Depressing... getting close to Misery.

The 120.000.000 Internet Sticker thing is random.txt material, at least.

I'm curious if alerts are a different table than reactions and if the alert table gets pruned at least. I'm pretty sure I don't need every alert from my $X years here.

NOT Sword Fighter Super · Sep 17, 2023

Thank you for your hard work, Dogman.

spastic_bag · Sep 17, 2023

Despite KiwiFarms' best efforts, KiwiFarms persists.

MetokurGroomedMe · Sep 17, 2023

thanks for keepin' it up, null, and thanks for the info about it. guess this is as close to 'dead' as KF has ever come.

Null · Sep 17, 2023

UERISIMILITUDO said:
Are the backups of the forum on more resilient long-term storage?

Yes. I have it set up so if a nuclear bomb detonates in the datacenter, we won't lose anything.

Mr Steal Your Farm · Sep 17, 2023

Nopenopenope said:
Question for the nerds: how do enterprises and critical systems etc stop this happening? I guess it makes sense that if you have 4 drives made at the same time, being used in the same way, they will fail at the same time or very close to each other. Making RAID more risky?

In certain critical scenarios, IT/facilities/whoever else has 2N+1 redundancy for everything, and it's commissioned to verify that system failures would be independent.

I'm not Josh, so I don't know exactly what he's working with, but suppose the chance of a single device failing is 1/100 every day (failures are actually quite normal for a device running 24/7). Then the odds of four independent failures landing on the same day at some point over the lifetime of a 10 year website would be about 1 in 30 thousand. Devices tend to fail at similar times, so the real odds are worse than that, maybe 1 in 10 thousand, or even 1 in 100. With multiple things that could go wrong, it should not be surprising.
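For anyone who wants to check the arithmetic, here is a quick sketch under the same assumptions: a 1/100 daily failure chance per device, four devices, fully independent failures, and a 10 year window. The variable names are just for illustration.

```python
p_daily = 1 / 100   # assumed chance that one device fails on a given day
devices = 4         # all four must fail on the same day
days = 10 * 365     # 10 year lifetime

# Chance that all four independent devices fail on one specific day.
p_all_same_day = p_daily ** devices

# Chance that this happens on at least one day over the whole lifetime.
p_lifetime = 1 - (1 - p_all_same_day) ** days

print(f"about 1 in {1 / p_lifetime:,.0f}")
```

That comes out to roughly 1 in 27,000, in line with the 1 in 30 thousand figure. It only holds if failures really are independent; drives from the same batch break that assumption, which is why the real odds are worse.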

Big enterprises lower those odds by having multiple datacenters, built using different methods, in different storage areas (Google Cloud/Microsoft OneDrive is the most frequent backup for smaller enterprises). National companies put them on different power grids, and huge conglomerates like Microsoft and Google put them on different continents. Josh owns much of his equipment, and the total failure of the power grid in a single country could potentially take this website down. Then, in each datacenter, all the servers are independent, and literally every other piece of equipment, from transformers to HVAC equipment, is 2N+1 redundant as well.

This is not such a big deal for Kiwifarms, which doesn't operate with a million dollars of revenue every single hour (imagine how much money Google would lose if AdSense went down for an hour), but huge enterprises will build a huge fucking datacenter in another country and spend millions of dollars.

Cats · Sep 17, 2023

DISCLAIMER: this user knows nothing about Computering . it is just a joke please dont Pummel me null about the body and/or face.

CognitiveDeficiency · Sep 17, 2023

I'm glad you enjoyed the match anon.

msd · Sep 17, 2023

Thanks for saving our stickers dad

Super Dilator Duel 3D · Sep 17, 2023

115,000,000 of those stickers are from me, sorry

AngryTreeRat · Sep 17, 2023

TKFD status: "More resilient than cockroaches"
TTD status: "Self fulfilling"

@Null might be a filthy chief janny, but he's also a great warrior fighting for our autism.

Edict of Expulsion · Sep 17, 2023

Please import the missing data once you determine if it was the backplane or the drives themselves that failed. Should just be some database commands
