I’ve recently come into possession of a few HGST HUH721010AL4200 drives (10TB SAS 4Kn) and some LSI/Broadcom/Avago 9211-8i and 9305-16i HBAs.
I’ve been setting them up in sets of 12 drives as ZFS RAIDZ2 pools on Debian 12 with a 9305-16i HBA. The driver is mpt3sas (v43.100.00.00 according to dmesg).
Initial checks of the drives came back just fine: around 45k to 50k hours of power-on time, no bad sectors reported. I moved some files (around 20TB) onto them, and then ZFS reported a drive as faulted due to read errors. Not to worry, I have loads of spares from the bunch, but that spare faulted with read errors almost immediately while the resilver was running. Bad luck, I thought, but then a third disk had the same issue. Yet SMART didn’t record any bad sectors or other defects. The issue reported in dmesg was SCSI commands timing out (taking longer than the default 30 seconds). Raising the time-out to 60 seconds made the issue go away, but made ZFS slow as hell.
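For reference, the timeout I was changing is the per-device SCSI command timeout exposed in sysfs. Here’s a minimal sketch for inspecting and bumping it; the 60-second value is just what I tried, and changes made this way don’t survive a reboot (you’d need a udev rule for that):

```python
#!/usr/bin/env python3
"""Minimal sketch: show (and optionally raise) the SCSI command timeout
for each sd* device via /sys/block/<dev>/device/timeout.
The 30s default is the kernel's; the 60s value below is illustrative."""
from pathlib import Path

NEW_TIMEOUT = "60"  # seconds; the workaround value I tried

for timeout_file in sorted(Path("/sys/block").glob("sd*/device/timeout")):
    dev = timeout_file.parent.parent.name
    print(f"{dev}: timeout={timeout_file.read_text().strip()}s")
    # Uncomment to apply the higher timeout (requires root, not persistent):
    # timeout_file.write_text(NEW_TIMEOUT)
```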
Now I started suspecting the HBA and replaced it with the same model on the newest firmware. Same issues. While testing, other drives faulted as well. Each time I would recreate the RAIDZ2 pool from scratch, fill it with garbage data, and start scrubbing while writing to create additional stress (roughly the procedure sketched below).
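This is roughly what each test cycle looked like; the pool name and the disk IDs are placeholders, and it obviously destroys the pool, so treat it as a sketch for scratch hardware only:

```python
#!/usr/bin/env python3
"""Rough sketch of my stress cycle: recreate the RAIDZ2 pool, fill it with
garbage data, then scrub while continuing to write.
Pool name 'tank' and the by-id paths are placeholders."""
import os
import subprocess

DISKS = [f"/dev/disk/by-id/scsi-EXAMPLE-{i}" for i in range(12)]  # placeholder IDs

def write_garbage(path: str, gib: int) -> None:
    """Write `gib` GiB of random data in 1 MiB chunks."""
    with open(path, "wb") as f:
        for _ in range(gib * 1024):
            f.write(os.urandom(1 << 20))

subprocess.run(["zpool", "destroy", "tank"], check=False)                # ok if absent
subprocess.run(["zpool", "create", "-f", "tank", "raidz2", *DISKS], check=True)

write_garbage("/tank/fill.bin", 100)                     # fill with garbage data
subprocess.run(["zpool", "scrub", "tank"], check=True)   # scrub runs in the background
write_garbage("/tank/more.bin", 100)                     # keep writing during the scrub

subprocess.run(["zpool", "status", "-v", "tank"], check=True)  # look for faulted drives
```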
Bad cables, maybe? For the 9305-16i I had to buy new cables, SFF8643 to SFF8087. It would be really bad luck to have bought four faulty cables, but to rule them out I switched to two 9211-8i HBAs and put back the SFF8087 cables that had worked for years and years. Same issues, same drives, again.
Could the backplanes be faulty? They had also worked for years and years without any issues. Nonetheless, I connected the drives directly via SFF8643->4xSATA and SFF8087->4xSATA breakout cables: same issues.
Next I swapped the mainboard for a Supermicro X10SDV-F just to rule that out: same issues. I also updated the drives’ firmware to the latest version, to no avail.
Another box I had built with a 12-drive RAIDZ2 pool started showing the same symptoms, but it has a different mainboard, case, backplanes, and PSU. The only things in common are the OS, the drive model, and the HBAs, and thus the same driver.
I dropped Debian on the larger box and installed TrueNAS Core, which is FreeBSD-based and uses a different driver for these HBAs. Lo and behold, it ran the stress tests for days without so much as a hiccup. So it’s the driver? I reinstalled Debian and ZFS and updated the driver to the newest one available from Broadcom (47.00.00.00). Everything has worked just fine since.
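After installing the Broadcom build, I wanted to be sure the kernel was actually loading the new module and not the old in-tree one still sitting in the initramfs. A quick sanity check along these lines (a sketch; it just compares the loaded module version with what modinfo reports for the module file on disk) confirmed it:

```python
#!/usr/bin/env python3
"""Check that the loaded mpt3sas module matches the one installed on disk.
A mismatch usually means the initramfs still contains the old module."""
import subprocess
from pathlib import Path

loaded = Path("/sys/module/mpt3sas/version").read_text().strip()
on_disk = subprocess.run(
    ["modinfo", "-F", "version", "mpt3sas"],
    capture_output=True, text=True, check=True,
).stdout.strip()

print(f"loaded:  {loaded}")
print(f"on disk: {on_disk}")
if loaded != on_disk:
    print("Mismatch: rebuild the initramfs or check the module path.")
```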
Has anyone encountered this (recently)? I searched everywhere for similar cases and found nothing that fits my situation. I wouldn’t think my combination of hardware is so special that it hits a driver edge case nobody else notices, especially since the 9211-8i is one of the most popular HBAs out there.
All in all, I could have spared myself the headache and work by just swapping the driver first, but I went down the hardware road instead.
Large Storage:
- Intel Xeon E3-1240Lv3
- Supermicro X10SLL-F
- 16GB DDR3 ECC RAM
- 550W PSU
- 1x Areca ARC1280ML RAID Controller
- Norco 4224 Case
- 2x LSI 9211-8i / 1x LSI 9305-16i
- 12x HGST HUH721010AL4200 on LSI and ZFS RAIDZ2
- 12x WD Red 6TB (pre-SMR era) on Areca as RAID6, to be replaced by 12x HGST HUH721212AL4200
Small Storage:
- Intel Xeon E3-1230Lv2
- Supermicro X9SCM
- 16GB DDR3 ECC RAM
- 500W PSU
- some Fantec 12-bay case
- 2x LSI 9211-8i
- 12x HGST HUH721010AL4200 on LSI and ZFS RAIDZ2