#780162 general debian base-system fix: default HDD timeouts cause data loss or corruption (silent controller resets)
- Package: smartmontools
- Source: smartmontools
- Description: control and monitor storage systems using S.M.A.R.T.
- Submitter: Chris
- Date: 2019-10-10 00:57:05 UTC
- Severity: important
The smartctl utility can set the "scterc" timeouts of HDDs.
Please also ship the provided scripts and default udev rule with the
smartmontools package, so that they try to configure safe timeouts,
depending on the drive's capabilities, usage and configuration.
The problem with mismatched default timeouts surfaced through repeated
reports on the linux-raid mailing list about drives being dropped from
arrays and a much higher rate of unrecoverable errors during
reconstruction, but it is not limited to redundant disk setups.
(The scripts have been posted upstream without response, but it is still
a distro responsibility to ensure that installations will have safe
defaults. Note that the provided udev rules specific to mdadm are only
to be included in the mdadm package.)
RATIONALE
The error recovery (ERC) time of a drive *must* be shorter than the
controller timeout.
Otherwise read errors will cause controller resets, leading to direct
data loss or, if it is a redundant disk, loss of redundancy and a very
high probability of another read error and data loss when
re-establishing the redundancy.
If a drive does not support adjusting its ERC timeout, the controller
timeout must be increased above the drive's maximal error recovery time.
If you don't want that kind of long device timeout, you should look for
a drive with SCT ERC timeout support. (smartctl -l scterc /dev/...)
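The rule above (drive ERC shorter than controller timeout) can be checked with a small shell sketch. The helper names parse_scterc_read and erc_is_safe are hypothetical and not part of smartmontools, and the parser assumes that smartctl prints lines such as "Read: 70 (7.0 seconds)" or "Read: Disabled"; adjust it if your version prints something else.

```shell
#!/bin/sh
# Sketch only: parse_scterc_read and erc_is_safe are hypothetical helpers.

# Read `smartctl -l scterc` output on stdin and print the read ERC timeout
# in deciseconds, or "disabled", or "unsupported".
# Assumed output format: "Read:     70 (7.0 seconds)" or "Read: Disabled".
parse_scterc_read() {
    awk '
        $1 == "Read:" && $2 ~ /^[0-9]+$/ { print $2; found = 1; exit }
        $1 == "Read:"                    { print "disabled"; found = 1; exit }
        END { if (!found) print "unsupported" }
    '
}

# erc_is_safe ERC_DECISECONDS CONTROLLER_TIMEOUT_SECONDS
# Succeeds only if the drive gives up before the controller timeout expires.
erc_is_safe() {
    erc_ds=$1
    timeout_s=$2
    case $erc_ds in
        ''|*[!0-9]*) return 1 ;;   # disabled/unsupported: recovery time unbounded
    esac
    [ "$erc_ds" -lt $((timeout_s * 10)) ]
}

# Example use on a real system (needs root; sdX is a placeholder):
#   erc=$(smartctl -l scterc /dev/sdX | parse_scterc_read)
#   erc_is_safe "$erc" "$(cat /sys/block/sdX/device/timeout)" \
#       || echo "sdX: timeout mismatch"
```

A drive reporting 70 (7.0 seconds) passes against the 30 second default, while a drive with ERC disabled or unsupported is reported as a mismatch.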
smartctl-timeouts README
The smartctl-timeouts scripts adjust controller and disk timeouts
according to disk redundancy status, and fix commonly mismatching
defaults for drives without error recovery timeout support or without
a default, which has often led to data loss.
The scripts are to be called by udev rules during device initialization,
and by kernel modules according to run-time redundancy status changes.
Every redundancy providing block device module may ship with proper
udev rules that initialize the timeouts for their possibly redundant
devices.
An alternative to these scripts may be to investigate the FASTFAIL
feature in the kernel.
TESTING: Extract all files of the .zip to /etc/udev/rules.d/ and reboot,
and see if the timeouts have been adjusted:
smartctl -l scterc /dev/sdX ;
cat /sys/block/sdX/device/timeout (replace "sdX" with your
device name)
If you have redundant devices, also unplug and re-plug a
device, to see if timeouts are properly re-applied.
NOTE: Correct execution during boot may in some cases require that
smartctl and the smartctl-timeouts scripts are available in
the initramfs?
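To check several disks at once, the sysfs part of the test above could be wrapped in a loop. This is a minimal sketch; report_timeouts is a hypothetical helper, and its optional directory argument exists only so the loop can be tried against a copy of the sysfs layout.

```shell
#!/bin/sh
# Sketch: print the controller (SCSI command) timeout of every sd* disk.
# report_timeouts is a hypothetical helper, not part of the provided scripts.
report_timeouts() {
    sysfs_block=${1:-/sys/block}
    for f in "$sysfs_block"/sd*/device/timeout; do
        [ -r "$f" ] || continue          # glob did not match or file unreadable
        dev=${f#"$sysfs_block"/}         # strip the leading directory
        dev=${dev%%/*}                   # keep only the device name, e.g. sda
        printf '%s controller_timeout=%ss\n' "$dev" "$(cat "$f")"
    done
}

# Example use:
#   report_timeouts
# Then, per device and as root:
#   smartctl -l scterc /dev/sdX
```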
IMPACT (without having specific timeouts configured)
For possibly redundant disks: If supported but simply disabled in the
drive, the ERC timeout is adjusted to the current controller timeout
minus 5 seconds.
The controller timeout is only raised (to
NONREDUNDANT_UNSURE_CONTROLLER_RESET_SECONDS) for drives without SCTERC
support and for entirely non-redundant disks, in an attempt to
allow these drives to finish their error recovery regularly before a
reset is triggered.
As controller timeouts are only increased selectively (only for drives
without SCTERC support and for disks known to be non-redundant), the
scripts only adapt mismatching timeouts by default. Existing manufacturer
or custom ERC timeout settings (as in professional, dedicated, redundant
setups, e.g. storage servers) won't be changed, except with
specific configuration options.
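The default policy described in this section can be summarized in a short shell sketch. It is not the actual smartctl-timeouts script: plan_timeouts and erc_for_timeout are hypothetical helpers, and the 180 second value is only a placeholder because the report does not state the real default of NONREDUNDANT_UNSURE_CONTROLLER_RESET_SECONDS.

```shell
#!/bin/sh
# Sketch of the default policy; it prints a decision and changes nothing.
# 180 is a placeholder value, not the scripts' real default.
NONREDUNDANT_UNSURE_CONTROLLER_RESET_SECONDS=${NONREDUNDANT_UNSURE_CONTROLLER_RESET_SECONDS:-180}

# erc_for_timeout SECONDS
# ERC value for "controller timeout minus 5 seconds", in deciseconds
# (the unit used by `smartctl -l scterc,READ,WRITE`).
erc_for_timeout() {
    echo $(( ($1 - 5) * 10 ))
}

# plan_timeouts SCTERC REDUNDANCY CONTROLLER_TIMEOUT_SECONDS
#   SCTERC:     unsupported | disabled | set
#   REDUNDANCY: redundant | nonredundant
plan_timeouts() {
    scterc=$1
    redundancy=$2
    timeout_s=$3
    if [ "$scterc" = unsupported ] || [ "$redundancy" = nonredundant ]; then
        # The controller timeout is only ever raised, never lowered.
        if [ "$timeout_s" -lt "$NONREDUNDANT_UNSURE_CONTROLLER_RESET_SECONDS" ]; then
            echo "controller_timeout=$NONREDUNDANT_UNSURE_CONTROLLER_RESET_SECONDS"
        else
            echo "unchanged"
        fi
    elif [ "$scterc" = disabled ]; then
        echo "scterc=$(erc_for_timeout "$timeout_s")"
    else
        echo "unchanged"    # existing ERC settings are left alone
    fi
}

# Example: plan_timeouts disabled redundant 30   prints scterc=250
```

With the 30 second default, a possibly redundant disk whose ERC is supported but disabled would get an ERC of 25 seconds, which smartctl expects as 250.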
TODO
* The ERC timeouts need to be re-applied to the disks after a resume!
This can be accomplished with the included systemd unit file that
triggers a udev change event.
https://bugs.debian.org/779412 "block devices loosing state after
resume: trigger udev rules to re-apply settings"
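The included systemd unit file is not reproduced in this report. As a rough sketch of what triggering a udev change event for block devices could look like, assuming the standard udevadm tool is used, consider the following; retrigger_block_devices and its DRY_RUN switch are hypothetical.

```shell
#!/bin/sh
# Sketch: re-run the udev rules for all block devices, e.g. after a resume.
# With DRY_RUN=1 the command is only printed, not executed.
retrigger_block_devices() {
    set -- udevadm trigger --subsystem-match=block --action=change
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"    # needs root on a real system
    fi
}

# Example use:
#   DRY_RUN=1; retrigger_block_devices
```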
Hi Chris, can you please let us know the link to the upstream discussion? From your description, I don't see an imminent risk of data loss that warrants an RC bug level. Therefore downgrading to important. (CC'ing also the smartmontools bug; same logic, but this is a feature request, aka wishlist bug)
On Sun, 15 Mar 2015 23:57:44 +0100, Tobias Frost <tobi@debian.org> wrote:

Thanks for responding. There are frequent reports and responses coming up (look at recent threads for "timeout mismatch"). Here is the thread where I got the hint, and from which I then gathered more information by looking up more precise responses: http://thread.gmane.org/gmane.linux.raid/48071/focus=48086 (The README in the .zip contains the latest version.)

AFAIK all drives that have no data recovery timeout and try to recover from a read error for longer than the 30s controller timeout (most regular non-raid drives) get completely reset upon a simple block error, risking not only the block but any open/unwritten data. A raid may be able to recover everything from a redundant disk, but only if there is no second read error while rebuilding the entire disk. The risk of an error while reading large disks is high (there is a significant number of such rebuild failure reports), and the second controller reset then makes the rebuild fail, leaving the array in a corrupt state. Without raid there is no chance to recover, neither the defective block nor any other open/unwritten data. That is why I set the bug level to RC, and it still looks quite RC to me.
I've thought about the severity some more, and concluded that I'll attempt setting the severity back to serious: The affected user base is very large (with regular non-raid drives). An occasional read or write error can happen anytime (as in "imminent"). And more severely: such errors, which would regularly be recoverable by the drive's firmware (with proper timeouts), are nothing unusual with magnetic disks. But for the affected user base the drive's recovery process will be interrupted by a complete controller reset (risking data). As long as one has not been hit by this bug it may not seem too severe, but that changes as soon as some common intermittent disk read/write error (possibly even unavoidable over time), which could be perfectly recoverable by the firmware with correct default timeouts, has silently caused data or redundancy loss or corruption.
Hi all, @ Chris - could you share the relevant bug number where you posted the scripts/patch upstream? From the initial bug you reported: "(The scripts have been posted upstream without response, but it is still a distro responsibility to ensure that installations will have safe defaults. Note that the provided udev rules specific to mdadm are only to be included in the mdadm package.)" This is at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=780162#5 I look forward to having that bug number so maybe we could also apply some pressure there as well.
Control: Severity -1 important
Risk = Probability * damage. The probability is quite low (otherwise we would have been flooded by bug reports, also upstream). Therefore reducing severity again. Tobo
There is now a report in the upstream issue tracker: https://www.smartmontools.org/ticket/658 Comments are welcome.
Hi, I haven't seen the issue come up on the btrfs mailing list recently, but in the year I've followed the list it has been fairly well established that the default kernel SCT is too short for desktop-class drives. I haven't personally run into issues, because my drives have 7sec SCT ERC, and the default kernel SCT is 30sec. In addition to linking to this bug, I referenced the following thread on our wiki.d.o/btrfs page: https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg53249.html

In my opinion, smartmontools should not default to attempting to modify the SCT ERC of a drive. Instead it should query the drive's capabilities and only modify the kernel timeout if the query fails (assume desktop drive) or returns a large value (proof of desktop drive). From what I gather, that would be sufficient to close this bug.

The second issue seems like it needs a new (normal priority) bug to be filed. That bug might be a request for a debconf interface to configure custom drive SCT ERC and kernel SCT values. I imagine a three-column interface with device, drive_SCT and kernel_SCT columns would do the trick. This would be useful for the following two cases:
- tuning for greater performance (e.g. read from the redundant copy instead of waiting 7 sec, for more consistent IOPS)
- allowing drives with firmware preconfigured for RAID to be used in single-disk volumes. E.g. if a user buys a "NAS" drive with 7sec SCT ERC, but wants the maximum chance of recovery from read errors in a single-drive configuration, then the drive's firmware timeout should be reconfigured to something big like 120sec and the kernel SCT should be bumped to 180sec; this is the expected behaviour for a single-disk configuration.

A long time ago I read that too low kernel SCT values are also a problem for ZFS, but of course the users who are hitting these problems are using "desktop" drives with crippled firmware for RAID.
As I see it, this is an opportunity for Debian to distinguish itself as exemplary, because other distributions have not yet addressed these emerging issues. Of course, those "in the know" have already configured their systems correctly using rc.local... Sincerely, Nicholas
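The single-disk example from this message (120sec drive ERC, 180sec kernel timeout) could be written down as follows. This is a sketch under stated assumptions: single_disk_profile is a hypothetical helper that only prints the two commands for review, sdX is a placeholder, and smartctl is assumed to take the ERC value in tenths of a second.

```shell
#!/bin/sh
# Sketch: long drive ERC, even longer kernel timeout, for a single disk.
# single_disk_profile DEVICE ERC_SECONDS KERNEL_TIMEOUT_SECONDS
# Prints the commands instead of running them; run them as root yourself.
single_disk_profile() {
    dev=$1
    erc_s=$2
    kernel_s=$3
    if [ "$erc_s" -ge "$kernel_s" ]; then
        echo "error: drive ERC (${erc_s}s) must be shorter than the kernel timeout (${kernel_s}s)" >&2
        return 1
    fi
    erc_ds=$((erc_s * 10))    # smartctl takes tenths of a second
    echo "smartctl -l scterc,$erc_ds,$erc_ds /dev/$dev"
    echo "echo $kernel_s > /sys/block/$dev/device/timeout"
}

# Example use:
#   single_disk_profile sdX 120 180
```

The helper refuses combinations where the drive would keep retrying longer than the kernel waits, since that is exactly the mismatch this bug is about.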
Looking at the Wikipedia article about Error Recovery Control, it mentions[1] that FreeBSD handles this better. Just what do they do and could it be done in Debian? Or is somebody already working on something like this for the kernel? "In a software RAID configuration whether or not TLER is helpful is dependent on the operating system. For example, in FreeBSD the ATA/CAM stack controls the timeouts, and is set to progressively increase the timeouts as they occur. Thus, if a desktop disk without TLER starts delaying a response to a sector read, FreeBSD will retry the read with successively longer timeouts to prevent prematurely dropping the disk out of the array." 1. https://en.wikipedia.org/wiki/Error_recovery_control#Standalone_vs._RAID_considerations
I understand that the mdadm maintainer chose[1] not to include a patch for changing SCT values, but maybe md, lvm, btrfs and friends could be patched to simply check the SCT values and emit warnings through syslog or dmesg if they find SCT has not been configured before a device is assembled or mounted? 1. https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=780207