Postmortem September 17th outage and rollback

It's the exact opposite over here. I can only access the onion site, and .st has just shown this every time for the past 2 hours.
I worded that a bit poorly. I didn't mean to say I couldn't log in; I just figured that exiting the browser and logging in again after a bit might fix some things. I couldn't post, as they said. The button to show the password when logging in didn't work, and attempting to give posts stickers failed to open the menu, which happens quite a lot when something goes wrong. Replying to profile posts also did not work.
 
There was no foul play at hand here. At best a firmware bug, at worst, the drives sudoku'd.
Good to hear that because I was initially suspicious of troon shenanigans from some insider.
 
I'm surprised that took up much time, surely it's just a few integers attached to each post? Or is this sort of data stored in some other ridiculous fashion? If you're merely being facetious then disregard.
Hopefully MySQL dump/restore is smart enough to disable indexes (and FK constraints and similar) during the restore and rebuild them afterwards... if not, then I could see 120,000,000 stickers taking a while.
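
For what it's worth, a default mysqldump file does, as far as I know, wrap each table's inserts in DISABLE KEYS / ENABLE KEYS and switch off foreign key and unique checks for the session, although DISABLE KEYS only helps MyISAM's non-unique indexes. The general idea of "load first, index afterwards" can be sketched as below. This is a minimal illustration using Python's built-in sqlite3 as a stand-in for MySQL; the `stickers` table, its columns, and the row counts are all made up for the example and say nothing about how the forum actually stores its data.

```python
import sqlite3

def bulk_load(rows):
    """Load (post_id, user_id, sticker_id) rows, building the index once at the end."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE stickers (post_id INTEGER, user_id INTEGER, sticker_id INTEGER)"
    )
    # No secondary index exists yet, so each insert is a plain append.
    with conn:  # one transaction for the whole load
        conn.executemany("INSERT INTO stickers VALUES (?, ?, ?)", rows)
    # Build the index in a single pass after the data is in place.
    conn.execute("CREATE INDEX idx_stickers_post ON stickers (post_id)")
    return conn

# Hypothetical data: 100,000 stickers spread over 1,000 posts.
rows = [(i % 1000, i, i % 12) for i in range(100_000)]
conn = bulk_load(rows)
count = conn.execute(
    "SELECT COUNT(*) FROM stickers WHERE post_id = 7"
).fetchone()[0]
print(count)
```

Creating the index before the inserts would give the same final table, but every insert would then also have to update the index, which is the cost the post is worried about at 120,000,000 rows.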
 
I thought it was another Troon attack.
Troons working for WD have had a backdoor in the SSD firmware just in case Joshua Moon of Kiwi Farms, racist, homophobic, koumpounophobic, swatting forum got a hold of some and then and nuked all of them at once
 
I'm surprised that took up much time, surely it's just a few integers attached to each post? Or is this sort of data stored in some other ridiculous fashion? If you're merely being facetious then disregard.
When I first joined this website, I was impressed at the number of fancy things it does for its users. Many of them are append-only sequences of indefinite size. Stickers would best be represented as a list of reaction type and user elements, but there are probably other representations needed for efficient implementation of other features. If all of the stickers are verified and transformed into other representations, I can see it taking a while, although not hours.
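
To make "a list of reaction type and user elements" concrete, here is a rough sketch of an append-only sticker log and one derived representation (per-post counts). The `Sticker` type, the field names, and the sticker kinds are hypothetical; nothing here is taken from the forum software itself.

```python
from collections import Counter, defaultdict
from typing import NamedTuple

class Sticker(NamedTuple):
    post_id: int
    user_id: int
    kind: str

def tally(log):
    """Turn the append-only log into per-post counts by sticker kind."""
    counts = defaultdict(Counter)
    for s in log:
        counts[s.post_id][s.kind] += 1
    return counts

# Hypothetical log entries, appended in the order users reacted.
log = [
    Sticker(1, 10, "like"),
    Sticker(1, 11, "like"),
    Sticker(1, 12, "informative"),
    Sticker(2, 10, "agree"),
]
counts = tally(log)
print(dict(counts[1]))
```

Rebuilding derived views like `counts` from the raw log is the kind of verify-and-transform step that scales with the number of stickers, which is why it could plausibly take a while on a restore.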
ZFS is rock solid, but underlying hardware? Not so much these days.
Software wishes it could be as reliable as hardware. Unfortunately, yes, the trend is for hardware to become as reliable as software, in the worst possible way.
 
Here’s what kills me. How does an entire shelf of NVMe drives fail. I’ve never seen that. Spinning rust sure, but solid state storage is pretty resilient. I can buy the firmware issue though, disk firmware can fuck storage faster than you can blink.
 
Here’s what kills me. How does an entire shelf of NVMe drives fail. I’ve never seen that. Spinning rust sure, but solid state storage is pretty resilient. I can buy the firmware issue though, disk firmware can fuck storage faster than you can blink.
My thought is an input failure, as stated by others, or a single SSD failure that overloaded the others, leading to a cascade failure of the entire setup. The IT guy seems to think a firmware bug is likely, which is also possible. I HIGHLY doubt that 4 drives all completely failed all at once, even if they were made on the same day, sequentially after each other. That's just a hair short of impossible.
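
A back-of-the-envelope version of the "hair short of impossible" argument, assuming the four failures are independent. The 1% annual failure rate is an invented round number for illustration, not a figure from the thread or from any drive vendor.

```python
HOURS_PER_YEAR = 24 * 365

# Assumption: each drive has a 1% chance of failing in a given year,
# spread evenly over the year, independent of the other drives.
annual_failure_rate = 0.01
p_one = annual_failure_rate / HOURS_PER_YEAR  # chance one drive fails in a given hour

# Chance that all four fail in the same given hour, if truly independent.
p_all = p_one ** 4

print(f"one drive, one hour: {p_one:.3e}")
print(f"all four, same hour: {p_all:.3e}")
```

Under those assumptions the joint probability is vanishingly small, which is why a shared cause (firmware, backplane, power, driver) is the more natural suspect than four unrelated failures.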
 
How does an entire shelf of NVMe drives fail.
it is not uncommon for entire ssd raid arrays to fail at (roughly) the same time as the drives all roughly have the same lifetime
i've seen this mentioned in a couple of raid guides
hdds are a lot more unlikely to fail at the same time as long as someone isn't shouting NIGGER at the drives at the time of failure

the common poor man's solution to this is to buy an extra drive and swap it with each of the drives after several days of use, to make sure no two drives will die at the same time because they have reached the end of their lifetime
 
There was no foul play at hand here. At best a firmware bug, at worst, the drives sudoku'd.
Imagine being a troon who's wasted years of your already shortened life trying to bring down an autistic gossip site filled with harmless retards, only for you to fail spectacularly at every instance. But then, out of nowhere, the site drops and you are vindicated, all the federal crimes you committed to bring those transphobes down came to fruition and you can now harass women and children in peace.

But nope, all of your gay-ops did fuck all except cost us time, and it was a fucking drive failure of all things that almost caused us some actual damage, and we still tanked our way through that. The coping, seething, and dilating must be on another fucking level. Beautiful.

(And thanks for the help you've given to our Dear Feeder in keeping the site up It Dude, it's very much appreciated).
 
I HIGHLY doubt that 4 drives all completely failed all at once, even if they were made on the same day, sequentially after each other. That's just a hair short of impossible.
There were the HP SSDs that all failed after an exact number of hours, so if you started with them brand new and put them in an array together, they'd all fail practically on the same nanosecond.
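
The post doesn't give the number, but HPE's 2019 advisory for certain SAS SSDs cited 32,768 power-on hours, which is exactly where a signed 16-bit counter overflows. The sketch below only illustrates that kind of wraparound; it is an assumption about the general failure mode, not a description of the actual firmware, and the function names are made up.

```python
def to_int16(value):
    """Interpret an integer as a signed 16-bit two's-complement value."""
    return (value + 2**15) % 2**16 - 2**15

def hours_until_wrap(start_hours=0):
    """Count power-on hours until a signed 16-bit hour counter goes negative."""
    hours = start_hours
    while to_int16(hours) >= 0:
        hours += 1
    return hours - start_hours

# Four hypothetical drives, all brand new and powered on together.
drives = [hours_until_wrap() for _ in range(4)]
years = drives[0] / (24 * 365)
print(drives, round(years, 2))
```

Because the trigger depends only on power-on time, drives that start life together in the same array hit it together, which is how a whole set can go at effectively the same moment.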
 
Sounds like the backplane the drives slot into failed and fried the drives to me. Only decent explanation besides all 4 drives coming from the same batch. That backplane should also be discarded and replaced, hopefully the server chassis has more than 4 slots.
 
Here’s what kills me. How does an entire shelf of NVMe drives fail. I’ve never seen that. Spinning rust sure, but solid state storage is pretty resilient. I can buy the firmware issue though, disk firmware can fuck storage faster than you can blink.
It's much, much, MUCH more likely that something happened on one drive that "took them all down" from the viewpoint of the machine, though a reboot usually recovers most of it. ZFS can go sideways, but the admin sounds competent enough to notice the old "off by one" errors that cause all drives to show up as foreign and need importing.

But a drive locking up on a return, causing the driver (which is shared by all the drives) to lock up? That shit happens in Linux all the time.
 