Postmortem September 17th outage and rollback

Is there a deadman switch in place to upload a torrent of the site in case something happens to you? I don't think trannies are going to assassinate you, but people die in traffic every day.
Isn't that what the oasis dev did?
It was some form of front page but with links to everything, he died or something and dropped the source.
 
1695004010461.png
and now we wait
 
I'm very thankful I get free enterprise help regarding the disks because I hate touching disks. I have a phobia of doing any disk operations.
Chad oramge cat vacuuming the dust out of his PC while it's on and running Ark on max settings vs virgin null scared to look at his own hard drives lest they self-destruct when they sense a person breathing in the same room as them
 
I like to imagine that the time Null spent getting the server back up was prolonged by all of my relatively worthless shitposts made over the years.
Kinda like this one.

Seriously though, thanks for all of your work Josh, you’re the best
 
Well at least the men who wear dresses still have not been successful at getting our stickers. I would hate to lose the amount of puzzle pieces that I have worked tirelessly to collect over the years. It’s like losing the equivalent of all the bathtub HRT that can be cooked up in an evening, would be an absolute tragedy. Resulting in absolute Kiwi Death.

Another L for the elements of nature that try to demolish this forum. At this point the forum is going to get knocked down by a bumble bee bumping into the server's power supply, and it will somehow end the same way: some downtime, then the forum returning in its former glory. Proving that not only men in dresses, but God, nature, and even the disabled have no chance of stopping this wild ride.
 
it only took about 7 hours, where most of that was just waiting on the database to import over 120,000,000 post stickers.
No wonder you dislike the idea of new reactions so much.

If post stickers are taking so much time, maybe consider not backing them up. We can live without imaginary internet awards.
You think you do, but you don't. The sticker system merely existing prevents so many unnecessary 1-word shitposts and bannings it would make your head spin. Threads would be practically unreadable, it would take ten times the effort to separate the useful info and content from the chaff, and unlimited paid staff could never hope to parse the rules being violated. React stickers are an indelible part of site culture at this point, removing them would be a catastrophic collapse.
 
Damn, I completely missed this whole happening due to Saturday being my day of rest during the "busy seasons" at work, and to me being too busy working all day today to even get on the Farms. It's pretty awesome that it was all taken care of and explained before I even noticed though. Viva la Sneed.

>Why don't you Farm on Saturdays, Jogger?

Because, Fresh Meat, Saturday is SneedlessHangoverDay, the Jogger's day of rest. That means that I don't work, I don't get in a car, I don't ride in a car, I don't pick up the phone, I don't turn on the oven, and I sure as shit DON'T FUCKING FARM. SneedlessHangoverDay!
 
"Hello Joshua. I want to play a game."

"You are currently strapped to your chair at your desk. If you'll look to the side, you'll see a bucket of water hanging above your datacenter."

"Within the next 60 seconds, this bucket will be dropped, resulting in irreversible damage to all data pertaining to the infamous stalker site known as Kiwi Farms; effectively killing it."

"If you wish to save this precious gossip site of yours, you must perform one small task: Chop off your dick."

"To complete this task, a butter knife has been placed in your right hand."

"Sneed or feed, the choice is yours."
billy.png
 
There were the HP SSDs that all failed after an exact number of hours, so if you started with them brand new and put them in an array together, they'd all fail practically on the same nanosecond.
Yes, but wasn't that more of a firmware issue than the actual drives failing? It's been a while since I've kept up on that stuff though, so I might be misremembering. It might just be me sperging, but I would really only consider it a "failure" if it was something that was basically unavoidable or had no workaround beforehand. I know flash/SSDs are limited by their read/write cycles and bit-rot, so the excess backups and all that could also cause an artificially short life as well.
 
I am guessing there is some sort of adapter on the frontloader's backplate which combines them to a single mobo PCIe slot or something and that shit out.
How likely is it that 4 disks can break simultaneously? 🤔

Autism follows. Feel free to correct my math if I got it wrong.

Drive: Western Digital Ultrastar DC SN620
Mean Time Between Failures (MTBF): 2 million hours (aka 228 years)
Odds one drive fails before MTBF: ~63% (1 - 1/e, assuming a constant failure rate)
Annualized Failure Rate (AFR): 0.44%

afr.png
Ultrastar Drive Specs

mtbfgraph.jpg
MTBF probability distribution
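
For anyone who wants to check the spec-sheet numbers above, here is a minimal sketch. It assumes a constant failure rate (an exponential lifetime model), which is the usual reading of an MTBF figure but is not stated on the spec sheet. The function names are made up for illustration. Under that assumption the chance of a drive failing before its MTBF works out to 1 - 1/e, roughly 63%.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def mtbf_years(mtbf_hours):
    """Convert an MTBF given in hours to years."""
    return mtbf_hours / HOURS_PER_YEAR


def prob_fail_before(t_hours, mtbf_hours):
    """P(failure before t) under an assumed exponential lifetime model."""
    return 1 - math.exp(-t_hours / mtbf_hours)


def annualized_failure_rate(mtbf_hours):
    """AFR = probability of failing within one year of powered-on time."""
    return prob_fail_before(HOURS_PER_YEAR, mtbf_hours)


if __name__ == "__main__":
    mtbf = 2_000_000  # hours, from the quoted drive specs
    print(f"MTBF in years:       {mtbf_years(mtbf):.0f}")
    print(f"AFR:                 {annualized_failure_rate(mtbf):.2%}")
    print(f"P(fail before MTBF): {prob_fail_before(mtbf, mtbf):.1%}")
```

The AFR this produces agrees with the 0.44% quoted from the spec sheet to two decimal places.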

Odds of four independent SSD drives all failing over the course of a year:

1 in 1÷(0.0044^4)=~ 1 in 2.7 billion

Odds of four independent SSD drives all failing on a single day:

With p the per-day failure probability for one drive:

(1-p)^365 = (1-0.0044)
p = 1-(.9956^(1/365)) =~ 0.0000121

1 in 1÷((1-(.9956^(1/365)))^4) =~ 1 in 4.7×10^19

This number (on the order of 10 to the 19th) is more than the estimated number of grains of sand on the earth.
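
The arithmetic above can be reproduced with a short sketch. It assumes the four drives fail independently and that the 0.44% AFR applies evenly across the year. The function names are hypothetical. The last function goes one step beyond the post: it estimates the odds for any single day of the year instead of one specific day, using a simple union bound (multiplying by 365), which is accurate here because the per-day probability is tiny.

```python
AFR = 0.0044   # annualized failure rate from the quoted spec sheet
DRIVES = 4
DAYS = 365


def all_fail_in_year(afr, n):
    """Probability that n independent drives all fail within the same year."""
    return afr ** n


def per_day_rate(afr, days=DAYS):
    """Daily failure probability p solving (1 - p)^days = 1 - afr."""
    return 1 - (1 - afr) ** (1 / days)


def all_fail_on_given_day(afr, n, days=DAYS):
    """Probability that n independent drives all fail on one specific day."""
    return per_day_rate(afr, days) ** n


def all_fail_on_any_same_day(afr, n, days=DAYS):
    """Union-bound estimate: all n drives fail together on some day of the year."""
    return days * all_fail_on_given_day(afr, n, days)


if __name__ == "__main__":
    print(f"Same year:         1 in {1 / all_fail_in_year(AFR, DRIVES):.2e}")
    print(f"One specific day:  1 in {1 / all_fail_on_given_day(AFR, DRIVES):.2e}")
    print(f"Any day in a year: 1 in {1 / all_fail_on_any_same_day(AFR, DRIVES):.2e}")
```

Even the more generous "any day in a year" figure stays far beyond anything plausible for independent failures, so the argument for a common cause does not depend on which version is used.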

Conclusion: The drive failures were not independent, and must have had a common cause.

Root causes from most to least likely:

(1) Firmware, Kernel, or ZoL Bug
(2) Failed PCH Electrical Component
(3) Power surge
(4) Sabotage / Exploit

Drive failures can cluster in time due to common flaws, increased load, etc., but four failing at exactly the same moment is very unusual, although not unheard of.

Many modern devices are brought down via bugs and/or shitty components.

ZoL is pretty stable in general, but when it has bugs they can be catastrophic, because it was ported from Solaris and its codebase is very complex. PCH firmware has similar complexity issues.

The most common circuit hardware component that fails (in any device) is the lowly electrolytic capacitor.

Manufacturers use garbage-tier Chinese capacitors which cannot take any thermal stress.
capfail.jpeg
Example of a capacitor failure: It typically bulges at the top when it goes bust.

My money is on (a) a firmware, kernel, or ZoL bug or (b) an electrical (capacitor) failure between the backplane and the PCH.

In both cases I expect you will recover at least two or three of the drives depending on the configuration.

So more likely a part failure related to the drives and not four different drives simultaneously failing?
Four drives didn't fail simultaneously without a common cause.

Made an account to clarify things. I'm the dude that helps Null with this stuff when he needs it.

There are a few reasons this could have happened, from most likely to less likely:
- BIOS/UEFI firmware stopped communicating with the NVMe drives. This happened with a certain BIOS setting when it was initially set up
- The drives actually died from the workload. Unlikely, considering these can handle 1.7 drive writes per day, but still feasible. These are 2nd hand enterprise drives
- The backplane/JNVMe headers exploded. Super unlikely

The drives are likely still alive, and the server's firmware probably took a shit.
We need to inspect the server's BIOS settings or possibly even update the firmware. Then we can determine if the drives are toast or usable.
There was no foul play at hand here. At best a firmware bug, at worst, the drives sudoku'd.
Great info thank you sir. 100% agree. Two or three of the drives are probably fine. Would be curious to know when you find out.

One drive (or the PCH) may have experienced an unusual failure mode and that bug cascaded up the stack. The system as a whole may have been unable to deal with the problem.

Thanks for your hard work together with @Null to track down the issue and bring the site back online.

Here’s what kills me. How does an entire shelf of NVMe drives fail? I’ve never seen that. Spinning rust sure, but solid state storage is pretty resilient. I can buy the firmware issue though, disk firmware can fuck storage faster than you can blink.

It's strange, I think it's a low level firmware bug or electrical problem in the motherboard.

I HIGHLY doubt that 4 drives all completely failed all at once, even if they were made on the same day, sequentially after each other. That's just a hair short of impossible.
Yeah agreed, there is no way they all failed at once unless there is an underlying proximal cause.
 