Postmortem September 17th outage and rollback

Null

Ooperator
kiwifarms.net
Joined
Nov 14, 2012
The Kiwi Farms went down at approximately 5:30am EU. The database locked up unexpectedly and the site became unavailable. @CrunkLord420 correctly identified it was a disk issue, so I called in the two guys who help me manage our storage array ("raid"). They diagnosed that all four enterprise raid NVMe hard drives had failed simultaneously, which wiped out the entire database and all forum software. I verified we had a fresh copy of the backup and a recent-ish backup of the actual software.

I had two options moving forward:
1) Pull out remote hands on a US Saturday night and have them determine if the drives can be salvaged and overnight parts to fix whatever has actually broken, or
2) Completely reinstall the entire forum on a different raid until we can do #1 and then move it later.

I decided to go with #2 as the faster but more laborious option, which is why we've had a data rollback.

This was very close to a total nightmare scenario where the server would be completely destroyed. I've reinstalled the Kiwi Farms so much at this point that I know the procedure very well, and it only took about 7 hours, most of which was just waiting on the database to import over 120,000,000 post stickers.

Anyways, sorry, we're back up. Let me know if there's any weirdness. I'll get chat up in a bit.
 
NVMe hard drives?
I don't know the terminology for drives, I don't even like touching them. They're WD M.2 1.6TB enterprise drives in a frontloading form factor which should not shit out unless I am seriously abusing them with writes continuously.
 
Have you ever thought it possible that someone might try to physically sabotage Kiwi Farms?
 