#588516 missing success/error reporting for checkarray cronjobs

Package:
mdadm
Source:
mdadm
Description:
Tool to administer Linux MD arrays (software RAID)
Submitter:
"C. Gatzemeier"
Date:
2023-02-25 12:54:05 UTC
Severity:
wishlist
Tags:
#588516#5
Date:
2010-07-09 09:50:47 UTC
As far as I can see, the redundancy of arrays is checked monthly (if the
machine happens to be running in the early morning of the first Sunday of
the month).

This seems to happen as a background (kernel) process which only
logs to syslog. The checkarray command run by the cron job does not give
any feedback whether all disks were in sync or if some errors were
found.

This means the disks could start going bad and the checks might correct
things, but it would go unnoticed until a disk fails completely.

If "checkarray" waited for the array checks to finish or fail,
cron could send checkarray's error/success output (possibly just grepped
from the syslog) in an email to the admin.
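
A minimal sketch of the kind of report meant here, assuming the check has already finished. The function name report_check_results and the overridable sysfs directory argument are hypothetical and not part of mdadm or checkarray; only the /sys/block/mdX/md/mismatch_cnt file is taken from the kernel's md sysfs interface.

```shell
#!/bin/sh
# Hypothetical helper, not part of mdadm: print one result line per array.
# $1 is the sysfs block directory; it defaults to /sys/block and can be
# pointed elsewhere for testing.
report_check_results() {
	sysroot=${1:-/sys/block}
	status=0
	for mddir in "$sysroot"/md*/md
	do
		# Skip when the glob matched nothing or the counter is unreadable.
		[ -r "$mddir/mismatch_cnt" ] || continue
		read -r cnt < "$mddir/mismatch_cnt"
		name=$(basename "$(dirname "$mddir")")
		if [ "$cnt" = 0 ]
		then
			echo "$name: check found no mismatches"
		else
			echo "$name: check found mismatch_cnt=$cnt"
			status=1
		fi
	done
	# Non-zero when at least one array reported mismatches.
	return $status
}
```

Since cron mails whatever a job prints, calling report_check_results at the end of a job that has waited for the check would put these lines in the admin's mailbox.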

#588516#10
Date:
2010-07-12 11:35:02 UTC
severity 588516 wishlist
tags 588516 wontfix
thanks

I'd have to make the cron job poll and block, which quickly gets
ugly. I won't do it, but if you want to submit a patch, please do.
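
For reference, a rough sketch of what "poll and block" could look like, bounded so the job cannot hang forever. The function name wait_for_idle, its arguments and its default values are hypothetical; the only kernel interface assumed is the md sync_action file, which reads "idle" when no check, repair or resync is running.

```shell
#!/bin/sh
# Hypothetical sketch: block until an array's sync_action reads "idle".
# $1 = md sysfs directory (e.g. /sys/block/md0/md)
# $2 = seconds between polls (default 60)
# $3 = maximum number of polls before giving up (default 1440)
wait_for_idle() {
	mddir=$1
	interval=${2:-60}
	tries=${3:-1440}
	while [ "$tries" -gt 0 ]
	do
		# Return 2 if the state file is missing or unreadable.
		[ -r "$mddir/sync_action" ] || return 2
		read -r action < "$mddir/sync_action"
		[ "$action" = idle ] && return 0
		tries=$((tries - 1))
		sleep "$interval"
	done
	# Still busy after the last poll.
	return 1
}
```

The return codes (0 idle, 1 timed out, 2 unreadable) are an arbitrary choice for this sketch; a real cron job would also have to decide what to report when the timeout is hit.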

#588516#19
Date:
2010-10-11 16:10:47 UTC
I proposed a (maybe partial) fix for this at the end of #405919, which I'll
also attach at the end of this message.

mdadm includes logcheck rules which mean that non-zero mismatch counts
get reported. However, due to a kernel bug (or at least weirdness), for
RAID1 and RAID10 the mismatch count is basically meaningless anyway.

The attached patches fix the logcheck rules so that they don't report
mismatches, and add a check to the daily cron job so that mismatches are
reported for array types where the mismatch count is actually worth
looking at.

So, you don't get an error message immediately, but you should get one
within 24 hours.
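
The filtering described above can also be sketched as a standalone function. This is a variation on the attached cron.daily patch, not the patch itself: the function name list_reportable_mismatches, the overridable sysfs directory and the wider md* glob (chosen so that names such as md10 or md127 are covered) are assumptions made for illustration.

```shell
#!/bin/sh
# Variation on the attached cron.daily patch (not the patch itself).
# $1 is the sysfs block directory; it defaults to /sys/block.
# Prints "<array> <level> <count>" for each array worth reporting.
list_reportable_mismatches() {
	sysroot=${1:-/sys/block}
	for mddir in "$sysroot"/md*/md
	do
		# Skip when the glob matched nothing or a file is unreadable.
		[ -r "$mddir/mismatch_cnt" ] && [ -r "$mddir/level" ] || continue
		read -r cnt < "$mddir/mismatch_cnt"
		read -r level < "$mddir/level"
		case $level in
			# The count is not meaningful for these levels (see above).
			raid1|raid10) continue ;;
		esac
		[ "$cnt" = 0 ] || echo "$(basename "$(dirname "$mddir")") $level $cnt"
	done
}
```

An empty result means there is nothing to mail; any output line names an array whose count deserves a closer look.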

Tim.


*** mdadm-logcheck-patch.diff
--- mdadm.orig	2010-09-28 16:45:03.000000000 +0100
+++ /etc/logcheck/ignore.d.server/mdadm	2010-09-28 16:58:25.000000000 +0100
@@ -17,7 +17,7 @@
 ^\w{3} [ :0-9]{11} [._[:alnum:]-]+ kernel:( \[ *[[:digit:]]+\.[[:digit:]]+\])? RAID([01456]|10) conf printout:$
 ^\w{3} [ :0-9]{11} [._[:alnum:]-]+ kernel:( \[ *[[:digit:]]+\.[[:digit:]]+\])?[[:space:]]+---( [wrf]d:[[:digit:]]+){2,3}$
 ^\w{3} [ :0-9]{11} [._[:alnum:]-]+ kernel:( \[ *[[:digit:]]+\.[[:digit:]]+\])?[[:space:]]+disk [[:digit:]]+,( wo:[[:digit:]]+,)? o:[[:digit:]]+, dev:[[:alnum:]]+$
-^\w{3} [ :0-9]{11} [._[:alnum:]-]+ mdadm(\[[[:digit:]]+\])?: Rebuild((Start|Finish)ed|[[:digit:]]+) event detected on md device /dev/[-_./[:alnum:]]+$
+^\w{3} [ :0-9]{11} [._[:alnum:]-]+ mdadm(\[[[:digit:]]+\])?: Rebuild((Start|Finish)ed|[[:digit:]]+) event detected on md device /dev/[-_./[:alnum:]]+(, component device  ?mismatches found: [[:digit:]]+)?$
 ^\w{3} [ :0-9]{11} [._[:alnum:]-]+ mdadm(\[[[:digit:]]+\])?: SpareActive event detected on md device /dev/[-_./[:alnum:]]+, component device /dev/[-_./[:alnum:]]+$
 ^\w{3} [ :0-9]{11} [._[:alnum:]-]+ mdadm(\[[[:digit:]]+\])?: (New|Degraded)Array event detected on md device /dev/[-_./[:alnum:]]+$
 ^\w{3} [ :0-9]{11} [._[:alnum:]-]+ mdadm(\[[[:digit:]]+\])?: DeviceDisappeared event detected on md device /dev/[-_./[:alnum:]]+$

*** /home/tim/mdadm-mismatch-fix.diff
--- /etc/cron.daily/mdadm.old	2010-09-28 15:35:15.954390947 +0100
+++ /etc/cron.daily/mdadm	2010-09-28 17:07:19.954518154 +0100
@@ -15,4 +15,59 @@
 MDADM=/sbin/mdadm
 [ -x $MDADM ] || exit 0 # package may be removed but not purged

+PRINT_SUMMARY=0
+
+for mcnt in /sys/block/md?/md/mismatch_cnt
+do
+	if [ -f "$mcnt" ]
+	then
+		read cnt < "$mcnt"
+		read level < "$( dirname "$mcnt" )/level"
+		if [ "$cnt" != 0 ] && ! ( [ "$level" = "raid10" ] || [ "$level" = "raid1" ] )
+		then
+			cat << WARN_TEXT
+
+Warning - $mcnt indicates that the associated RAID
+device has $cnt blocks in which the data on one array member is inconsistent
+with the data on the other array member(s).
+WARN_TEXT
+			PRINT_SUMMARY=1
+		fi
+	fi
+done
+
+# Fall through: the summary and "mdadm --monitor" below must still run.
+
+
+if [ $PRINT_SUMMARY != 0 ]
+then
+	cat << WARN_TEXT
+
+DATA LOSS MAY HAVE OCCURRED.
+
+This condition may have been caused by one or more of the following events:
+
+. A power failure whilst the array was being written-to.
+. Data corruption by faulty hard disk drive, drive controller, cabling, RAM,
+    motherboard, PSU etc. etc.
+. A kernel bug.
+. An array being forcibly created in an inconsistent state using the
+    "--assume-clean" argument to mdadm.
+
+This count is updated when the md subsystem carries out a 'check' or
+'repair' action.  In the case of 'repair' it reflects the number of
+mismatched blocks prior to carrying out the repair.
+
+Once you have fixed the error, carry out a 'check' action to reset the count
+to zero.
+
+Note that this check is only applied to arrays which aren't RAID1 or RAID10,
+due to a kernel limitation.  See the md (section 4) manual page, and the
+following URL for details:
+
+https://raid.wiki.kernel.org/index.php/Linux_Raid#Frequently_Asked_Questions_-_FAQ
+
+WARN_TEXT
+fi
+
 exec $MDADM --monitor --scan --oneshot