Postmortem September 17th outage and rollback

I get why so many people are thinking the worst (and with good reason), but I suspect all 4 drives were taken out at once by a good old-fashioned disk controller failure. Though Null's hypothesis of all drives being from the same batch and the batch being dodgy is just as likely.

Thanks for getting the Farms back up so quickly, Null. Will be interesting to see what the actual cause of failure was.
So, "most" of the time was spent on importing the stickers.

Can we get a number here? I am thinking 5 hours minimum. That's kind of funny in a horrifying way.
Even if importing the stickers takes a whole day, it's a small price to pay. The last thing Null needs is for sticker spergs to spill their spaghetti all through the forum and into his DMs.
Man, the server's HDDs and Network Adapters love to die on Saturday/Sunday nights. Not only can't you catch a break, but they die at the most inconvenient of times.
It's nothing more sinister than Finagle's law at work.
 
The situation is funny to me, given how you recently mentioned in a MATI stream about that fediverse LGBTQIAP+ instance going down and staying down, due to having no backups.

PSA for all kiwis: Always back up your important data in at least two different ways.
 
I had two HDDs die within days of each other because they were the same batch, bought and installed at the same time. It is more common than you think.
I know and I agree. I have had identical hard drives failing within days of each other as well. But if four SSDs died at the exact same moment, I would think it was more likely that some other part caused the failure. A defective power supply or backplane for example. Or a firmware bug like those HP SSDs had. It caused them to always fail after 2¹⁵ hours of operation.
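For scale, here is a quick sketch of why a power-on-hours bug would take out every drive at once: drives installed together accumulate hours in lockstep, so they all reach the limit within the same hour. The 2¹⁵ figure comes from the post above; the function name and the sample counter values are made up for illustration.

```python
# Firmware limit mentioned above: 2**15 power-on hours.
LIMIT_HOURS = 2 ** 15  # 32,768


def hours_until_limit(power_on_hours: int) -> int:
    """Hours left before a drive's counter reaches the limit (0 if already past it)."""
    return max(LIMIT_HOURS - power_on_hours, 0)


# Convert the limit to years of continuous uptime.
limit_years = LIMIT_HOURS / 24 / 365.25
print(f"{LIMIT_HOURS} hours is about {limit_years:.2f} years of continuous uptime")

# Hypothetical array: four drives installed the same day, so identical counters.
drives = {"ssd0": 32_700, "ssd1": 32_700, "ssd2": 32_700, "ssd3": 32_700}
for name, hours in drives.items():
    print(name, "hits the limit in", hours_until_limit(hours), "hours")
```

With identical counters, every drive reports the same remaining time, which is exactly the failure pattern that RAID redundancy cannot absorb.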
 
Aside from a hardware fault [batch defect for the actual drives, the controller having a fit, the datacenter cat taking a piss on the rack - you name it] the other possibility is that the ddos retarder or some other component was making so many calls that it hit the disk read/write limit. It won't hurt to run diagnostics on the disks to check what's the story, plus if you send the info over to the manufacturer you may get a nice discount or straight up new drives if it's a bad batch.
 
A nuclear bomb would only set the KF back 12 hours max due to twice-daily back-ups.

The stickers though, those are too much. Please react sparingly!
 
Noooooooo my stickers! I had at least 4 more stickers before and now they're gone forever. How will I ever trust this site again now that it stole my sweet sweet stickers?
just tell me I'm good
 
An entire day's worth of autism. Gone. Like tears in the rain.
In an Afghanistan thread, rare archived Twitter/Facebook videos and posts about the Fall of Kabul in 2021 were also apparently lost when the same thing happened to the Farms in 2021 or so. That's not even about internet drama; these videos are an iconic part of human history that are now gone (unless any kiwis saved them), because turbo-autists are the only ones spergy enough to archive them (:_(
 
Question for the nerds: how do enterprises, critical systems, etc. stop this happening? I guess it makes sense that if you have 4 drives made at the same time, being used in the same way, they will fail at the same time or very close to each other. Does that make RAID more risky?
I would combine backups with some form of active-active or active-passive high availability. Microsoft (yuck) SQL Server can build out a database cluster and give you an alias to connect your front-end to the database. The end goal is to not rely on one server or storage pool. The problem is the more you scale out, the more expensive it gets.

Obligatory thanks Null and Frens for giving up your Sunday so I can sneed!
 
9/17
Never forget the day the quadruple drives failed and the posts that died.
"Shaka, when the drives fell."
 
I am guessing there is some sort of adapter on the frontloader's backplate which combines them to a single mobo PCIe slot or something and that shit out.
Yeah, NVMe PCIe bifurcation boards. I've thought about getting one for my PC since this new mobo only has 2 SATA ports.
 
Ah so I wasn't tripping this morning and this afternoon when I tried to access the farms and kept getting a 502, thank you for your painstaking efforts in keeping this forum alive Null.
 
I thought I had all my post privileges removed due to being a retard. Sad that some threads lost post data though. Stickers I couldn't really care less about, so if I lost some I didn't even know.
 
I've reinstalled the Kiwi Farms so much at this point that I know the procedure very well
Awesome that you have a consistent and well-rehearsed DR plan. Full on RAID failures suck, but you’ve got what most people haven’t when they need it - a working backup.

Just out of curiosity, what filesystem are you using? Assuming you’re running a traditional Linux-based stack, it might be worth using ZFS or btrfs to vault hourly CoW snapshots of the running system somewhere off-site to save you needing to reinstall in a similar scenario.
 
That brokenness appears to be affecting the onion site and only the onion site. Tried to log in multiple times there before hopping on .st, everything's good on the clear web version.
It's the exact opposite over here. I can only access the onion site, and .st has just shown this every time for the past 2 hours
 

Attachments

  • 20230917_184129.jpg
where most of that was just waiting on the database to import over 120,000,000 post stickers.

I'm surprised that took up much time, surely it's just a few integers attached to each post? Or is this sort of data stored in some other ridiculous fashion? If you're merely being facetious then disregard.
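A back-of-the-envelope sketch of the question above. The raw data really is small; what usually costs time in a SQL import is the per-row work (parsing, index updates, commits) rather than the bytes. The column count, integer width, and rows-per-second rates below are assumptions picked for illustration, not measurements from the site.

```python
ROWS = 120_000_000  # sticker count quoted above


def raw_size_gb(rows: int, ints_per_row: int = 3, bytes_per_int: int = 8) -> float:
    """Raw payload size in GB, assuming each row is just a few integers."""
    return rows * ints_per_row * bytes_per_int / 1e9


def import_hours(rows: int, rows_per_second: float) -> float:
    """Wall-clock hours to insert `rows` at a steady rate."""
    return rows / rows_per_second / 3600


print(f"raw data: about {raw_size_gb(ROWS):.2f} GB")

# Hypothetical sustained insert rates, from slow (row-by-row with indexes)
# to fast (bulk load); none of these are measured values.
for rate in (2_000, 5_000, 50_000):
    print(f"{rate:>6} rows/s -> {import_hours(ROWS, rate):.1f} hours")
```

Under these assumptions a few gigabytes of integers can still take anywhere from under an hour to most of a day, depending entirely on the insert rate the database sustains.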
 