Postmortem September 17th outage and rollback

Pee Cola · Sep 17, 2023

I get why so many people are thinking the worst (and with good reason), but I suspect all 4 drives were taken out at once by a good old-fashioned disk controller failure. Though Null's hypothesis that all the drives came from the same dodgy batch is just as likely.

Thanks for getting the Farms back up so quickly, Null. Will be interesting to see what the actual cause of failure was.

X Prime said:
So, "most" of the time was spent on importing the stickers.

Can we get a number here? I am thinking 5 hours minimum. That's kind of funny in a horrifying way.

Even if importing the stickers takes a whole day, it's a small price to pay. The last thing Null needs is for sticker spergs to spill their spaghetti all through the forum and into his DMs.

RiceofSkywalker said:
Man, the server's HDDs and Network Adapters love to die on Saturday/Sunday nights. Not only can't you catch a break, but they die at the most inconvenient of times.

It's nothing more sinister than Finagle's law at work.

Idiot doing idiot things · Sep 17, 2023

The situation is funny to me, given how you recently mentioned in a MATI stream about that fediverse LGBTQIAP+ instance going down and staying down, due to having no backups.

PSA for all kiwis: Always back up your important data in at least two different ways.

Real Fakeman · Sep 17, 2023

Dollar Store Sentai said:
I had two HDDs die within days of each other because they were the same batch, bought and installed at the same time. It is more common than you think.

I know and I agree. I have had identical hard drives failing within days of each other as well. But if four SSDs died at the exact same moment, I would think it was more likely that some other part caused the failure. A defective power supply or backplane for example. Or a firmware bug like those HP SSDs had. It caused them to always fail after 2¹⁵ hours of operation.
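For scale, the arithmetic behind that 2¹⁵ figure can be sketched as below. The idea that a signed 16-bit power-on-hours counter is what overflowed is an assumption on my part, not something stated in this thread, and the helper names are made up for illustration.

```python
import ctypes

HOURS_PER_YEAR = 24 * 365.25  # average year, including leap days


def hours_to_years(hours: int) -> float:
    """Convert power-on hours to years of continuous operation."""
    return hours / HOURS_PER_YEAR


def as_int16(value: int) -> int:
    """Reinterpret an integer as a signed 16-bit value (wraps on overflow)."""
    return ctypes.c_int16(value).value


limit = 2 ** 15
print(limit)                            # 32768 hours
print(round(hours_to_years(limit), 2))  # roughly 3.74 years of uptime
# Assumed mechanism: the last hour that fits, then the wrapped value one hour later.
print(as_int16(limit - 1), as_int16(limit))
```

If that assumption holds, drives powered on together would all hit the limit in the same hour, which is why a simultaneous failure points at something shared rather than at wear.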

Grotesque Bushes · Sep 17, 2023

Aside from a hardware fault [batch defect for the actual drives, the controller having a fit, the datacenter cat taking a piss on the rack - you name it] the other thing it could be is if the ddos retarder or any other component were making so many calls it hit the disk read/write limit. It won't hurt to run diagnostics on the disks to check what's the story, plus if you send the info over to the manufacturer you may get a nice discount or straight up new drives if it's a bad batch.

变性黑鬼 · Sep 17, 2023

A nuclear bomb would only set the KF back 12 hours max due to twice-daily back-ups.

The stickers though, those are too much. Please react sparingly!

Cat tit bingo · Sep 17, 2023

Noooooooo my stickers! I had at least 4 more stickers before and now they're gone forever. How will I ever trust this site again now that it stole my sweet sweet stickers?
just tell me I'm good

Dr Cruel · Sep 17, 2023

And here I was thinking my chimp raping a frog video got me permabanned.

Atlas Sneezed · Sep 17, 2023

Null said:
WD M.2 1.6TB enterprise drives

M.2 and enterprise aren't really a thing. Are you sure you don't mean U.2 (or U.3)? At any rate smartctl should give you all you need, unless they're totally dead in which case that would be pretty fucking sus
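If the drives still enumerate, a check along these lines could be scripted. This is only a sketch: it assumes smartmontools 7.0 or newer (for the `-j` JSON output), NVMe devices, and the health-log field names as I understand them from smartctl's JSON output. `read_smart_json` and `nvme_red_flags` are made-up helper names, and the thresholds are arbitrary.

```python
import json
import subprocess


def read_smart_json(device: str) -> dict:
    """Run `smartctl -a -j DEVICE` and parse its JSON output (usually needs root)."""
    # smartctl uses a bitmask exit status, so a non-zero code is not raised as an error here.
    result = subprocess.run(
        ["smartctl", "-a", "-j", device],
        capture_output=True, text=True, check=False,
    )
    return json.loads(result.stdout)


def nvme_red_flags(report: dict) -> list:
    """Return human-readable warnings from an NVMe health log (field names assumed)."""
    log = report.get("nvme_smart_health_information_log", {})
    flags = []
    if log.get("critical_warning", 0) != 0:
        flags.append("critical_warning is set")
    if log.get("media_errors", 0) > 0:
        flags.append("media errors recorded")
    if log.get("percentage_used", 0) >= 100:
        flags.append("rated endurance used up")
    if log.get("available_spare", 100) < log.get("available_spare_threshold", 0):
        flags.append("spare capacity below threshold")
    return flags
```

Usage would be something like `nvme_red_flags(read_smart_json("/dev/nvme0"))` for each drive, where the device path is a placeholder. An empty list from all four drives would make a shared component look more suspicious than the drives themselves.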

Kayfabe · Sep 17, 2023

mindlessobserver said:
An entire day's worth of autism. Gone. Like tears in the rain.

On an Afghanistan thread, rare archived Twitter/Facebook videos and posts about the Fall of Kabul in 2021 were also apparently lost when the same thing happened to the farms in 2021 or so. That's not even about internet drama. These videos are an iconic part of human history that are now gone (probably, unless any kiwis got them), because turbo-autists are the only ones spergy enough to archive them (:_(

Seething Troon Collector · Sep 17, 2023

Nopenopenope said:
Question for the nerds: how do enterprises and critical systems etc stop this happening? I guess it makes sense that if you have 4 drives made at the same time, being used in the same way, they will fail at the same time or very close to each other. Making RAID more risky?

I would combine backups with some form of active-active or active-passive high availability. Microsoft (yuck) SQL Server can build out a database cluster and give you an alias to connect your front end to the database. The end goal is to not rely on one server or storage pool. The problem is that the more you scale out, the more expensive it gets.

Obligatory thanks Null and Frens for giving up your Sunday so I can sneed!

make_it_so · Sep 17, 2023

Patrick Bait-man said:
9/17
Never forget the day the quadruple drives failed and the posts that died.

"Shaka, when the drives fell."

Mr.Bucket · Sep 17, 2023

Null said:
I am guessing there is some sort of adapter on the frontloader's backplate which combines them to a single mobo PCIe slot or something and that shit out.

Yeah, NVMe PCIe bifurcation boards. I've thought about getting one for my PC since this new mobo only has 2 SATA ports.

Mysterious Autist XX · Sep 17, 2023

Ah so I wasn't tripping this morning and this afternoon when I tried to access the farms and kept getting a 502, thank you for your painstaking efforts in keeping this forum alive Null.

Arkenas Einbrecht · Sep 17, 2023

Thank you for having the resilience of a Tarasque Erverrlerd. This site is too important (and fun) to allow it to collapse.

vomitusdeux · Sep 17, 2023

I thought I had all my post privileges removed due to being a retard. Sad that some threads lost post data though. I couldn't really care less about stickers, so if I lost some I didn't even notice.

Perfectly Innocent · Sep 17, 2023

Null said:
I've reinstalled the Kiwi Farms so much at this point that I know the procedure very well

Awesome that you have a consistent and well-rehearsed DR plan. Full on RAID failures suck, but you’ve got what most people haven’t when they need it - a working backup.

Just out of curiosity, what filesystem are you using? Assuming you’re running a traditional Linux-based stack, it might be worth using ZFS or btrfs to vault hourly CoW snapshots of the running system somewhere off-site to save you needing to reinstall in a similar scenario.
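A rough sketch of what that hourly job might look like with ZFS (btrfs would use `btrfs subvolume snapshot` and `btrfs send` instead). Everything specific here is an assumption for illustration: the dataset `tank/kf`, the host `backup.example.net`, the target `backup/kf`, and the helper names. The code only builds the commands so they can be reviewed before anything runs.

```python
import shlex
from datetime import datetime, timezone


def snapshot_name(dataset: str, now: datetime) -> str:
    """Name an hourly snapshot after the UTC hour it was taken in."""
    return f"{dataset}@hourly-{now:%Y%m%dT%H00Z}"


def snapshot_cmd(snapshot: str) -> list:
    """Command that creates the copy-on-write snapshot."""
    return ["zfs", "snapshot", snapshot]


def replicate_pipeline(prev: str, new: str, remote: str, target: str) -> str:
    """Shell pipeline that ships only the blocks changed between two snapshots."""
    send = ["zfs", "send", "-i", prev, new]
    recv = ["ssh", remote, "zfs", "receive", target]
    return " | ".join(shlex.join(cmd) for cmd in (send, recv))


prev = snapshot_name("tank/kf", datetime(2023, 9, 17, 3, 0, tzinfo=timezone.utc))
new = snapshot_name("tank/kf", datetime(2023, 9, 17, 4, 30, tzinfo=timezone.utc))
print(shlex.join(snapshot_cmd(new)))
print(replicate_pipeline(prev, new, "backup.example.net", "backup/kf"))
```

The incremental send is the point of the design: after the first full copy, each hour only moves what changed, so the off-site copy stays close to current without re-uploading the whole pool.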

ClashCity · Sep 17, 2023

Toolbox said:
That brokeness appears to be affecting the onion site and only the onion site. Tried to login multiple times there before hopping on .st, everything's good on the clear web version.

It's the exact opposite over here. I can only access the onion site, and .st has just shown this every time for the past 2 hours

Harm · Sep 17, 2023

Is the symbol for the Internet Tough Guys subforum appearing incorrectly for anyone else?

Quack_Quack · Sep 17, 2023

goodgrief said:
Thanks for making sure this site can always come back. Some threads here are seriously indispensable.

The amount of shit we've collectively compiled on who knows how many cows. It's like a mass index of the greasy side of the web.

St.Davis · Sep 17, 2023

Null said:
where most of that was just waiting on the database to import over 120,000,000 post stickers.

I'm surprised that took up much time, surely it's just a few integers attached to each post? Or is this sort of data stored in some other ridiculous fashion? If you're merely being facetious then disregard.
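A back-of-the-envelope check on that intuition, as a sketch only. The row layout (three 4-byte integers per sticker) is a guess, the 5-hour duration is X Prime's estimate from earlier in the thread rather than a confirmed number, and the function names are made up.

```python
ROWS = 120_000_000  # post stickers, per Null's figure


def raw_size_gib(rows: int, ints_per_row: int = 3, bytes_per_int: int = 4) -> float:
    """Raw payload size if each row were just a few packed integers (assumed layout)."""
    return rows * ints_per_row * bytes_per_int / 2 ** 30


def rows_per_second(rows: int, hours: float) -> float:
    """Average import rate needed to load all rows in the given time."""
    return rows / (hours * 3600)


print(round(raw_size_gib(ROWS), 2))     # about 1.34 GiB of raw integers
print(round(rows_per_second(ROWS, 5)))  # about 6667 rows per second
```

So the raw data would be small under these assumptions. If the import really took hours, the time probably went into per-row work such as index and constraint maintenance rather than into moving bytes, though the thread does not confirm that.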
