I am guessing there is some sort of adapter on the frontloader's backplate which combines them into a single mobo PCIe slot or something, and that's what shit out.
How likely is it that 4 disks can break simultaneously?
Autism follows. Feel free to correct my math if I got it wrong.
Drive: Western Digital Ultrastar DC SN620
Mean Time Between Failures (MTBF): 2 million hours (aka 228 years)
Odds one drive fails before MTBF: ~63% (1 - 1/e, assuming a constant failure rate)
Annualized Failure Rate (AFR): 0.44%
Ultrastar Drive Specs
MTBF probability distribution
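For anyone who wants to check the spec numbers, here is a rough sketch of how MTBF turns into an annualized failure rate. It assumes a constant failure rate (exponential lifetime model), which is my assumption and not something taken from the WD datasheet; the function names are made up for illustration.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def p_fail_before(t_hours: float, mtbf_hours: float) -> float:
    """Chance a drive fails within t_hours, assuming a constant failure rate."""
    return 1.0 - math.exp(-t_hours / mtbf_hours)


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Annualized failure rate: chance of failing within one year of operation."""
    return p_fail_before(HOURS_PER_YEAR, mtbf_hours)


mtbf = 2_000_000  # hours, from the drive spec quoted above

print(f"MTBF in years:       {mtbf / HOURS_PER_YEAR:.0f}")       # 228
print(f"AFR:                 {afr_from_mtbf(mtbf):.2%}")          # 0.44%
print(f"P(fail before MTBF): {p_fail_before(mtbf, mtbf):.1%}")    # 63.2%
```

Under that model the 0.44% AFR falls straight out of the 2 million hour MTBF, so the two spec figures agree with each other.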
Odds of four independent SSD drives all failing over the course of a year:
1 in 1÷(0.0044^4) = ~1 in 2.7 billion
Odds of four independent SSD drives all failing on a single day:
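Let p be the chance that one drive fails on a given day, so that (1-p)^365 = (1-0.0044)
p = 1-(.9956^(1/365)) = ~0.0000121
1 in 1÷((1-(.9956^(1/365)))^4) = ~1 in 4.7×10^19, i.e. on the order of 1 in 10^19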
This number (10 to the 19th) is more than the number of grains of sand on the earth.
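If anyone wants to redo the arithmetic, here is a small sketch. It assumes the four failures are independent and uses the 0.44% AFR from the spec; the variable names are just mine. The last figure is an extra I added: the odds if any day of the year counts, rather than one specific day.

```python
AFR = 0.0044   # annualized failure rate of one drive, from the spec above
N_DRIVES = 4


def one_in(p: float) -> float:
    """Turn a probability into '1 in X' odds."""
    return 1.0 / p


# All four drives fail at some point within the same year.
p_year_all = AFR ** N_DRIVES

# Per-day failure chance p for one drive, solved from (1 - p)^365 = 1 - AFR.
p_day_single = 1.0 - (1.0 - AFR) ** (1.0 / 365)

# All four drives fail on one specific day.
p_day_all = p_day_single ** N_DRIVES

# All four drives fail together on any one of the 365 days (approximation).
p_any_day_all = 365 * p_day_all

print(f"Same year:        1 in {one_in(p_year_all):.2e}")
print(f"One specific day: 1 in {one_in(p_day_all):.2e}")
print(f"Any day of year:  1 in {one_in(p_any_day_all):.2e}")
```

Either way you slice the single-day case, the odds are far too long for independent failures to be a believable explanation.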
Conclusion: The drive failures were not independent, and must have had a common cause.
Root causes from most to least likely:
(1) Firmware, Kernel, or ZoL Bug
(2) Failed PCH Electrical Component
(3) Power surge
(4) Sabotage / Exploit
Drive failures can cluster in time due to common flaws, increased load, etc., but four failing at exactly the same moment is very unusual, although not unheard of.
Many modern devices are brought down via bugs and/or shitty components.
ZoL is pretty stable in general, but when it has bugs they can be catastrophic, because it was ported from Solaris and its codebase is very complex. PCH firmware has similar complexity issues.
The most common circuit hardware component that fails (in any device) is the lowly electrolytic capacitor.
Manufacturers use garbage-tier Chinese capacitors which cannot take any thermal stress.
Example of a capacitor failure: It typically bulges at the top when it goes bust.
My money is on (a) a firmware, kernel, or ZoL bug or (b) an electrical (capacitor) failure between the backplane and the PCH.
In both cases I expect you will recover at least two or three of the drives depending on the configuration.
So more likely a part failure related to the drives and not four different drives simultaneously failing?
Four drives didn't fail simultaneously without a common cause.
Made an account to clarify things. I'm the dude that helps Null with this stuff when he needs it.
There are a few reasons this could have happened, from most likely to less likely:
- BIOS/UEFI firmware stopped communicating with the NVMe drives. This happened with a certain BIOS setting when the server was initially set up.
- The drives actually died from the workload. Unlikely, considering these can handle 1.7 drive writes per day, but very feasible. These are second-hand enterprise drives.
- The backplane/JNVMe headers exploded. Super unlikely.
The drives are likely still alive, and the server's firmware probably took a shit.
We need to inspect the server's BIOS settings or possibly even update the firmware. Then we can determine if the drives are toast or usable.
There was no foul play at hand here. At best a firmware bug, at worst, the drives sudoku'd.
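One quick check before rebooting into the BIOS is to see whether the kernel still enumerates the NVMe controllers at all. This is only a sketch and assumes a Linux host that exposes controllers under /sys/class/nvme; it cannot read BIOS settings, and the function name is made up for illustration.

```python
from pathlib import Path


def list_nvme_controllers(sysfs_root: str = "/sys/class/nvme") -> list:
    """Return one dict per NVMe controller the kernel currently exposes."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return []  # no NVMe class directory: the kernel sees no controllers
    controllers = []
    for ctrl in sorted(root.iterdir()):
        info = {"name": ctrl.name}
        # Standard sysfs attributes on Linux NVMe controllers.
        for attr in ("model", "serial", "firmware_rev", "state"):
            try:
                info[attr] = (ctrl / attr).read_text().strip()
            except OSError:
                info[attr] = None  # attribute missing or unreadable
        controllers.append(info)
    return controllers


if __name__ == "__main__":
    found = list_nvme_controllers()
    if not found:
        print("No NVMe controllers visible to the kernel")
    for ctrl in found:
        print(ctrl)
```

If all four controllers are missing here, that points at firmware or the backplane rather than the drives themselves. If they show up with a state other than "live", the drives are at least still answering on the bus.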
Great info, thank you sir. 100% agree. Two or three of the drives are probably fine. Would be curious to know what you find out.
One drive (or the PCH) may have experienced an unusual failure mode and that bug cascaded up the stack. The system as a whole may have been unable to deal with the problem.
Thanks for your hard work together with @Null to track down the issue and bring the site back online.
Here’s what kills me. How does an entire shelf of NVMe drives fail? I’ve never seen that. Spinning rust sure, but solid state storage is pretty resilient. I can buy the firmware issue though, disk firmware can fuck storage faster than you can blink.
It's strange, I think it's a low-level firmware bug or an electrical problem in the motherboard.
I HIGHLY doubt that 4 drives all completely failed all at once, even if they were made on the same day, sequentially after each other. That's just a hair short of impossible.
Yeah, agreed, there is no way they all failed at once unless there is an underlying proximal cause.