#780162 general debian base-system fix: default HDD timeouts cause data loss or corruption (silent controller resets)

Package:
smartmontools
Source:
smartmontools
Description:
control and monitor storage systems using S.M.A.R.T.
Submitter:
Chris
Date:
2019-10-10 00:57:05 UTC
Severity:
important
#780162#5
Date:
2015-03-09 22:17:52 UTC
From:
To:
The smartctl utility can set the "scterc" timeouts of HDDs.
Please also ship the provided scripts and default udev rule with the
smartmontools package, so that they try to configure safe timeouts,
depending on the drive's capabilities, usage and configuration.

The problem with mismatching default timeouts surfaced through repeated
reports on the linux-raid mailing list about drives being dropped from
arrays and a much higher rate of unrecoverable errors during
reconstruction, but it is not limited to redundant disk setups.

(The scripts have been posted upstream without response, but it is still
a distro responsibility to ensure that installations will have safe
defaults. Note that the provided udev rules specific to mdadm are only
to be included in the mdadm package.)


RATIONALE

The error recovery (ERC) time of a drive *must* be shorter than the
controller timeout.

Otherwise read errors will cause controller resets, leading to direct
data loss or, if it is a redundant disk, loss of redundancy and a very
high probability of another read error and data loss when
re-establishing the redundancy.

If a drive does not support adjusting its ERC timeout, the controller
timeout must be increased above the drive's maximal error recovery time.
If you don't want that kind of long device timeout, you should look for
a drive with SCT ERC timeout support. (smartctl -l scterc /dev/...)
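
The rule above can be written down as a small check. This is only a
sketch: the helper name erc_fits is hypothetical and not part of
smartmontools. It assumes that smartctl expresses SCT ERC values in
tenths of a second (70 = 7.0 s), that an ERC value of 0 means the
feature is disabled, and that the controller timeout is given in whole
seconds as in /sys/block/sdX/device/timeout.

```shell
#!/bin/sh
# Sketch only: erc_fits is a hypothetical helper, not shipped anywhere.
#   $1  drive ERC limit in deciseconds (0 is treated as "disabled")
#   $2  controller timeout in seconds
# Succeeds only if the drive gives up before the controller does.
erc_fits() {
    erc_ds=$1
    timeout_s=$2
    [ "$erc_ds" -gt 0 ] && [ "$erc_ds" -lt $((timeout_s * 10)) ]
}

erc_fits 70 30   && echo "7.0 s ERC, 30 s controller timeout: ok"
erc_fits 0 30    || echo "ERC disabled, 30 s controller timeout: mismatch"
erc_fits 1200 30 || echo "120 s ERC, 30 s controller timeout: mismatch"
```

The two inputs would come from "smartctl -l scterc /dev/sdX" and
"cat /sys/block/sdX/device/timeout" respectively.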


smartctl-timeouts README

The smartctl-timeouts scripts adjust controller and disk timeouts
according to disk redundancy status, and fix the commonly mismatching
defaults of drives that lack error recovery timeout support or ship
without a default ERC timeout, a mismatch that has often led to data loss.

The scripts are to be called by udev rules during device initialization,
and by kernel modules according to run-time redundancy status changes.
Every redundancy providing block device module may ship with proper
udev rules that initialize the timeouts for their possibly redundant
devices.

An alternative to these scripts may be to investigate the FASTFAIL
feature in the kernel.

TESTING: Extract all files of the .zip to /etc/udev/rules.d/ and reboot,
         then check whether the timeouts have been adjusted (replace
         "sdX" with your device name):

          smartctl -l scterc /dev/sdX
          cat /sys/block/sdX/device/timeout

         If you have redundant devices, also unplug and re-plug a
         device, to see if timeouts are properly re-applied.
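
To check the controller timeout of every disk at once, a loop over
sysfs may be more convenient. This is a minimal sketch: show_timeouts is
a hypothetical helper, and its optional argument (an alternative sysfs
root) exists only so the loop can be tried against a fake directory
tree.

```shell
#!/bin/sh
# Sketch only: prints "<device> <controller timeout in seconds>" for
# every block device that exposes a device/timeout attribute.
show_timeouts() {
    sysfs=${1:-/sys}
    for f in "$sysfs"/block/*/device/timeout; do
        [ -r "$f" ] || continue          # skips an unmatched glob as well
        dev=${f#"$sysfs"/block/}
        dev=${dev%%/*}
        printf '%s %s\n' "$dev" "$(cat "$f")"
    done
}

show_timeouts
```

The ERC setting of each listed device still has to be queried separately
with "smartctl -l scterc /dev/<device>", which usually needs root.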

NOTE: Correct execution during boot may in some cases require that
      smartctl and the smartctl-timeouts scripts are available in
      the initramfs?




IMPACT (without having specific timeouts configured)

For possibly redundant disks: If supported but simply disabled in the
drive, the ERC timeout is adjusted to the current controller timeout
minus 5 seconds.
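
As a worked example of the "controller timeout minus 5 seconds" rule,
the sketch below converts a controller timeout in seconds into the
value that would be passed to smartctl. The helper name erc_for_timeout
is hypothetical, and it assumes that smartctl takes SCT ERC values in
tenths of a second.

```shell
#!/bin/sh
# Sketch only: erc_for_timeout is a hypothetical helper.
#   $1  controller timeout in seconds
#   $2  safety margin in seconds (defaults to the 5 s named above)
# Prints the ERC value in deciseconds; fails if no room is left.
erc_for_timeout() {
    timeout_s=$1
    margin_s=${2:-5}
    if [ "$timeout_s" -le "$margin_s" ]; then
        return 1
    fi
    echo $(( (timeout_s - margin_s) * 10 ))
}

erc_for_timeout 30   # prints 250, i.e. 25.0 seconds
```

With the default 30 s controller timeout this would correspond to
"smartctl -l scterc,250,250 /dev/sdX" (read and write limits).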

The controller timeout is only raised (to
NONREDUNDANT_UNSURE_CONTROLLER_RESET_SECONDS) for drives without SCTERC
support and for entirely non-redundant disks, in an attempt to allow
these drives to finish their error recovery regularly before a reset is
triggered.

As controller timeouts are only increased selectively (only drives
without SCTERC support and surely non-redundant disks), the scripts
only adapt mismatching timeouts, by default. Existing manufacturer or
custom ERC timeout settings (as in professional, dedicated, redundant
setups, e.g. storage servers etc.) won't be changed, except with
specific configuration options.




TODO

* The ERC timeouts need to be re-applied to the disks after a resume!
  This can be accomplished with the included systemd unit file that
  triggers a udev change event.
https://bugs.debian.org/779412 "block devices loosing state after
  resume: trigger udev rules to re-apply settings"
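
For illustration, a udev change event for all block devices can be
triggered by hand roughly as follows. This is a sketch and not the
content of the included unit file: the wrapper retrigger_block_rules
and the DRY_RUN switch are hypothetical. In dry-run mode (the default
here) the command is only printed.

```shell
#!/bin/sh
# Sketch only: re-run the udev rules for block devices, e.g. after resume.
# Set DRY_RUN=0 to actually execute the command (usually needs root).
retrigger_block_rules() {
    set -- udevadm trigger --action=change --subsystem-match=block
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

retrigger_block_rules
```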

#780162#12
Date:
2015-03-15 22:57:44 UTC
From:
To:
Hi Chris,

can you please let us know the link to the upstream discussion?

From your description, I don't see an imminent risk of data loss which warrants
an RC bug level. Therefore downgrading to important.

(CC'ing also the smartmontools bug; same logic, but this is a feature
request, aka wishlist bug)

#780162#19
Date:
2015-03-18 09:51:28 UTC
From:
To:
On Sun, 15 Mar 2015 23:57:44 +0100,
Tobias Frost <tobi@debian.org> wrote:

Thanks for responding.

There are frequent reports and responses coming up (look at recent
threads on "timeout mismatch", for example).

Here is the thread where I got the hint; I then gathered some more
info by looking up more precise responses:
http://thread.gmane.org/gmane.linux.raid/48071/focus=48086
(The README in the .zip contains the latest version.)

AFAIK all drives that have no data recovery timeout and try to recover
a read error for longer than the 30s controller timeout (most regular
non-raid drives) get completely reset upon a simple block error,
risking not only the block but any open/unwritten data.

A raid may be able to recover everything from a redundant disk, but only
if there is no second read error while rebuilding the entire disk. The
risk of an error while reading large disks is high (there is a
significant number of such rebuild failure reports), and the second
controller reset makes the rebuild fail, leaving the array in a corrupt
state.
Without raid, there is no chance to recover either the defective block
or any other open/unwritten data.

That was why I set the bug level to RC, and it does somehow still look
quite RC to me.

#780162#24
Date:
2015-03-19 14:36:20 UTC
From:
To:
I've thought about the severity some more, and concluded that I'll
attempt setting the severity back to serious:

The affected user base is very large (with regular non-raid drives).
An occasional read or write error can happen anytime (as in "imminent").
And more severely: such errors, which would normally be recoverable by
the drive's firmware (with proper timeouts), are nothing unusual with
magnetic disks.
But for the affected user base the recovery process of the drive will
be interrupted by a complete controller reset (risking data).

As long as one has not been hit by this bug it may not seem too severe,
but that changes as soon as some common intermittent disk read/write
error (possibly even unavoidable over time), that could be perfectly
recoverable by the firmware with correct default timeouts, has
silently caused data or redundancy loss or corruption.

#780162#39
Date:
2015-10-27 19:19:23 UTC
From:
To:
Hi all,

@ Chris - could you share the relevant bug number where you posted the
scripts/patch upstream ?

From the initial bug you reported

" (The scripts have been posted upstream without response, but it is still
a distro responsibility to ensure that installations will have safe
defaults. Note that the provided udev rules specific to mdadm are only
to be included in the mdadm package.)"

This is at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=780162#5

Looking forward to having that bug number, so that we could apply
some pressure there as well.

#780162#49
Date:
2015-11-21 16:35:33 UTC
From:
To:
Control: Severity -1 important

Risk = Probability * damage ..

Probability is quite low (otherwise we'd have been flooded by bug reports,
also upstream).

Therefore reducing severity again.

Tobo

#780162#56
Date:
2016-02-22 17:19:40 UTC
From:
To:
There is now a report in the upstream issue tracker
https://www.smartmontools.org/ticket/658

Comments are welcome.

#780162#61
Date:
2017-01-06 21:51:49 UTC
From:
To:
Hi,

I haven't seen the issue come up on the btrfs mailing list recently,
but in the year I've followed the list it has been fairly well
established that the default kernel SCT is too short for desktop-class
drives.  I haven't personally run into issues, because my drives have
7sec SCT ERC, and the default kernel SCT is 30sec.

In addition to linking to this bug, I referenced the following thread
on our wiki.d.o/btrfs page:
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg53249.html

In my opinion, smartmontools should not default to attempting to
modify the SCT ERC of a drive.  Instead it should query the drive's
capabilities and only modify the kernel timeout if the query fails
(assume desktop drive) or returns a large value (proof of desktop
drive).  From what I gather, that would be sufficient to close this
bug.
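
The policy described above could be sketched like this. It is only an
illustration of the idea, not an existing smartmontools feature: the
helper name kernel_timeout_for, the 180 second fallback and the 5
second margin are assumptions, and ERC values are taken to be in tenths
of a second as reported by smartctl.

```shell
#!/bin/sh
# Sketch only: decide which kernel timeout (in seconds) to use.
#   $1  drive ERC limit in deciseconds; empty if the query failed,
#       0 if ERC is disabled
#   $2  current kernel timeout in seconds
#   $3  fallback kernel timeout in seconds (assumed default: 180)
kernel_timeout_for() {
    erc_ds=$1
    current_s=$2
    fallback_s=${3:-180}
    if [ -z "$erc_ds" ] || [ "$erc_ds" -eq 0 ]; then
        # No usable answer: assume a desktop drive with long recovery.
        echo "$fallback_s"
    elif [ "$erc_ds" -ge $((current_s * 10)) ]; then
        # Drive may take longer than the kernel waits: raise the
        # kernel timeout to the ERC limit (rounded up) plus a margin.
        echo $(( (erc_ds + 9) / 10 + 5 ))
    else
        # Drive gives up first: leave the kernel timeout alone.
        echo "$current_s"
    fi
}

kernel_timeout_for 70 30     # prints 30
kernel_timeout_for "" 30     # prints 180
kernel_timeout_for 1200 30   # prints 125
```

The printed value would then be written to
/sys/block/sdX/device/timeout; the drive's own ERC setting is not
touched.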

The second issue seems like it needs a new (normal priority) bug to be
filed.  That bug might be a request for a debconf interface to
configure custom drive SCT ERC and kernel SCT values.  I imagine a
three column interface with device drive_SCT kernel_SCT columns would
do the trick.  This would be useful for the following two cases:
  - tuning for greater performance (eg: read from redundant copy
    instead of waiting 7 sec, for more consistent IOPS)
  - allows drives with firmware preconfigured for RAID to be used in
    single disk volumes.  Eg: if a user buys a "NAS" drive with 7sec
    SCT ERC, but wants maximum chance of recovery from read errors in
    single drive configuration, then the drive's firmware timeout
    should be reconfigured to something big like 120sec and the kernel
    SCT should be bumped to 180sec; this is the expected behaviour
    for single disk configuration.

A long time ago I read that too low kernel SCT values are also a
problem for ZFS, but of course the users who are hitting these
problems are using "desktop" drives with crippled firmware for RAID.

As I see it, this is an opportunity for Debian to distinguish itself
as exemplary, because other distributions have not yet addressed these
emerging issues.  Of course, those "in the know" have already
configured their systems correctly using rc.local...

Sincerely,
Nicholas

#780162#66
Date:
2017-02-24 20:26:09 UTC
From:
To:
Looking at the Wikipedia article about Error Recovery Control, it
mentions[1] that FreeBSD handles this better.

Just what do they do and could it be done in Debian?  Or is somebody
already working on something like this for the kernel?

"In a software RAID configuration whether or not TLER is helpful is
dependent on the operating system. For example, in FreeBSD the ATA/CAM
stack controls the timeouts, and is set to progressively increase the
timeouts as they occur. Thus, if a desktop disk without TLER starts
delaying a response to a sector read, FreeBSD will retry the read with
successively longer timeouts to prevent prematurely dropping the disk
out of the array."






1. https://en.wikipedia.org/wiki/Error_recovery_control#Standalone_vs._RAID_considerations

#780162#71
Date:
2017-02-24 20:26:13 UTC
From:
To:
I understand that the mdadm maintainer chose[1] not to include a patch
for changing SCT values, but maybe md, lvm, btrfs and friends could be
patched to simply check the SCT values and emit warnings through syslog
or dmesg if they find SCT has not been configured before a device is
assembled or mounted?



1. https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=780207