#778849 Support restoring initrd on shutdown and pivoting into it

#778849#5
Date:
2015-02-20 15:59:12 UTC
From:
To:
Dracut, which provides linux-initramfs-tool and is thus an alternative
to initramfs-tools, supports restoring the initrd on shutdown and
pivoting into it:
https://www.kernel.org/pub/linux/utils/boot/dracut/dracut.html#_dracut_on_shutdown

One example where this is needed is a ZFS root filesystem: A clean
shutdown requires unmounting the root filesystem and exporting the
ZFS storage pool containing that filesystem. Dracut modules may
contain shutdown scripts which are called after the system has
pivoted to the initrd. In the case of ZFS, the shutdown script
looks like this:
https://github.com/zfsonlinux/zfs/blob/master/dracut/90zfs/export-zfs.sh.in

This is not specific to ZFS: it also affects anyone who has the root
filesystem on LVM. Currently, dracut is the only option for
these folks. It would be nice if initramfs-tools supported a
shutdown procedure akin to dracut to give people a choice.

#778849#10
Date:
2017-04-06 09:29:02 UTC
From:
To:
Hi,

Lukas Wunner:

I might try to come up with a hackish PoC for Tails soon (rationale
for the curious: we will soon start relying on the kernel's memory
poisoning to erase most memory on shutdown; this can only work if the
read-write branch of our aufs filesystem is unmounted on shutdown, and
switching to dracut is a longer-term project, so our options so far
are either hacking this support into initramfs-tools, or using
a dracut-generated initrd for shutdown only).

FWIW, Arch Linux's mkinitcpio also does this:
https://git.archlinux.org/mkinitcpio.git/tree/shutdown

Details of the needed interface can be found in:

 * https://www.freedesktop.org/wiki/Software/systemd/InitrdInterface/
 * https://www.freedesktop.org/wiki/Software/systemd/RootStorageDaemons/
 * systemd-shutdown(8)

Cheers,

#778849#15
Date:
2017-04-07 10:02:46 UTC
From:
To:
Hi,

intrigeri:

Here we go! Installing the four following files (slightly adapted to
drop a couple Tails-specific bits) on a Stretch system seems to do the
job. I hope it can allow interested people to validate this approach,
and then if there's enough demand I bet someone will integrate it into
initramfs-tools properly :)

If additional cleanup must be done from inside the initramfs after
returning to it, drop snippets in /usr/share/initramfs-tools/hooks/*
that install the required scripts into /lib/systemd/system-shutdown/
*in the initramfs*. E.g. for Tails I had to do quite a bit more work there
to ensure the aufs stack our root filesystem uses is disassembled
properly (again in order to have the aufs read-write branch, on tmpfs,
cleaned up and its content erased by Linux' memory poisoning); I'll
contribute this code to live-boot if/when this feature is properly
integrated into initramfs-tools.
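
A minimal sketch of what such a hook snippet could look like, assuming
the cleanup script is a plain shell script. The function name and the
source path passed to it are hypothetical; DESTDIR is the variable
mkinitramfs sets while hooks run. In a real hook this would sit below
the usual prereqs boilerplate.

```shell
#!/bin/sh
# Hypothetical helper for an initramfs-tools hook: copy a cleanup script
# into the initramfs image under /lib/systemd/system-shutdown/, so that it
# is available there after the pivot on shutdown.
install_shutdown_script() (
    set -e
    src=$1
    dest_dir="${DESTDIR}/lib/systemd/system-shutdown"
    mkdir -p "$dest_dir"
    cp "$src" "$dest_dir/"
    # The script has to be executable inside the image.
    chmod 0755 "$dest_dir/${src##*/}"
)
```

Any binaries the cleanup script calls would still have to be added to
the image separately, e.g. with copy_exec from hook-functions.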

I don't know if I'll work more on this wrt. initramfs-tools.
It'll depend a lot on the timing of Tails moving to dracut, which is
entirely unclear at this time. Sorry!


/lib/systemd/system/initramfs-shutdown.service
----------------------------------------------

[Unit]
Description=Restore /run/initramfs on shutdown
Documentation=https://www.freedesktop.org/wiki/Software/systemd/InitrdInterface/
After=local-fs.target boot.mount boot.automount
Wants=local-fs.target
Conflicts=shutdown.target umount.target
DefaultDependencies=no
ConditionPathExists=!/run/initramfs/bin/sh

[Service]
RemainAfterExit=yes
Type=oneshot
ExecStart=/bin/true
ExecStop=/usr/share/initramfs-tools/initramfs-restore

[Install]
WantedBy=multi-user.target

/usr/share/initramfs-tools/initramfs-restore
--------------------------------------------

#!/bin/sh

set -e
set -u

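# Unpack the current initrd into a scratch directory and move the
# contents of its main archive to /run/initramfs, where systemd-shutdown
# looks for the "shutdown" executable.
WORKDIR=$(mktemp -d)
/usr/bin/unmkinitramfs /initrd.img "$WORKDIR"
mv "$WORKDIR"/main/* /run/initramfs/
rm -rf "$WORKDIR"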

/lib/systemd/system-shutdown/initramfs-tools
--------------------------------------------

#!/bin/sh

# Otherwise systemd-shutdown cannot execute /run/initramfs/shutdown
mount -o remount,exec /run

/usr/share/initramfs-tools/hooks/shutdown
-----------------------------------------

#!/bin/sh

set -e

PREREQ=""

prereqs () {
       echo "${PREREQ}"
}

case "${1}" in
       prereqs)
               prereqs
               exit 0
               ;;
esac

. /usr/share/initramfs-tools/hook-functions

# systemd-shutdown itself
mkdir -p "${DESTDIR}/lib/systemd"
copy_exec /lib/systemd/systemd-shutdown /shutdown

# Ensure systemd detects when we're in the initramfs on shutdown
# (see the in_initrd function in the systemd source tree)
touch "${DESTDIR}/etc/initrd-release"

exit 0



Cheers,

#778849#22
Date:
2022-11-06 13:38:30 UTC
From:
To:
I have yet to investigate intrigeri's suggestions from 2017. However, I would suggest that this needs to be upgraded from wishlist in 2022, and the reason is simple enough:

root@aki:~# nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
[..]
unsafe_shutdowns			: 106
[..]
num_err_log_entries			: 284
[..]
root@aki:~# nvme smart-log /dev/nvme1
Smart Log for NVME device:nvme1 namespace-id:ffffffff
[..]
unsafe_shutdowns			: 121
[..]
num_err_log_entries			: 291
[..]

Given that the frequency and number of SMART errors are deemed an indicator of drive health, that's bad. Also, improper shutdown on NVMe devices could be particularly problematic because they have caches and wear leveling and cleanup cycles that could happen any time the drive is "running" until a shutdown command is issued and responded to. There might actually be some risk of data corruption/loss. (I doubt it with commodity consumer SSDs, but Debian isn't just run on those.)
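
To check whether a particular shutdown was clean, one could record the counter before rebooting and compare it afterwards. A small sketch, assuming the "unsafe_shutdowns : N" line format quoted above (other nvme-cli versions may print it differently); the function name is made up for illustration:

```shell
#!/bin/sh
# Print the unsafe_shutdowns counter found in `nvme smart-log` output
# read from standard input.
unsafe_shutdowns() {
    awk -F: '$1 ~ /^unsafe_shutdowns[ \t]*$/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Possible use (needs root and nvme-cli):
#   nvme smart-log /dev/nvme0 | unsafe_shutdowns
```

If the number printed after a reboot is one higher than before, that shutdown was counted as unsafe by the drive.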

For a few weeks, we tried on #debian to sort out the cause of the above errors. Was it an NVMe drive quirk that Linux doesn't support? Was Linux issuing the shutdown command and not waiting long enough? There's Google bait suggesting that's a problem, and there are some BS factoids in dpkg, describing a "solution" which I've since discovered doesn't work, that I should remove the next time I connect to OFTC. This was hard to test because obviously no logger is running at this point of the shutdown process.

The root cause of the problem isn't an unknown quirk, it's that I have LVM on LUKS. (See what I did there?) I connected a drive with an unencrypted Debian system on it that mounted my main installation's /boot and even the LUKS/LVM root somewhere, and never got a single unsafe shutdown despite multiple reboots/shutdowns, because that temporary install's root was not on LVM on LUKS.

Dracut is a suboptimal solution. In part because after three days of trying to get it to boot my system, I've yet to see it do so, and because while there's lots of documentation for it, it's for other distributions, it's wrong, it's obsolete, or it's misleading. Including one rantthrough from 2017 that offers a profanity-laden survey of most of the others and why they don't work for Debian systems or at all.

As far as I can tell, you either need to significantly modify grub, or switch to systemd-boot, or set up Dracut to generate an EFI executable blob using files that aren't available on a Debian system, or throw up your hands and go use Fedora until you understand Dracut enough to try to use it on Debian. Or something. Again: what sparse documentation exists is spotty, inconsistent, and at least five years out of date. Dracut is not how Debian does things, just like OpenRC and rEFInd are not how Debian does things. That's all there if you want to set it up, but you're not going to find many Debian resources on using it.

I think unsafe shutdowns of NVMe devices are actually a bug. And I think they could cause data loss or corruption on more advanced hardware than I'm using. There are a few options for addressing it, and most of them become problems beyond initramfs-tools' scope. But this seven-year-old bug might be the path of least resistance.

Joseph

#778849#27
Date:
2022-12-24 14:16:54 UTC
From:
To:
Hi Joseph.

The last paragraph of this e-mail is specifically addressed to you, but
most of this e-mail is addressed generally.

Also, apologies if this message is a bit rushed.  I have a few things to
do today.

I was planning to try out intrigeri's solution on a VM but have not had
the chance to do this.

I agree that this should be higher than wishlist for the above reason
plus Lukas's ZFS shutdown problem mentioned in the initial
description/submission of this bug.

This really should be fixed for Bookworm.

A while back, I did have a look around the fix.  From what I remember,
intrigeri's solution used a systemd shutdown 'script' to check for
devmaps or whatever of LVMs, ZFS partitions, etc... and ran specific
commands to umount the partitions.

However, I think my memory may be bad because I "now" don't see evidence
of such umounts in intrigeri's solution!

I would like to try things out today but may be too rushed.

Jo, have you been able to try out intrigeri's solution (in GENERAL as
opposed to his specific patch/fix, which is mentioned in this bug report
and may have bits missing)?  The reason I say this is because you would
have the exact recreation steps and be able to do it easily.  For me, it
would be a shot in the dark or awkward for me to recreate.  I would only
be able to check that root LVM on LUKS would not cause any untoward
problems.

Thanks,
Gervase.

#778849#32
Date:
2023-01-10 22:39:49 UTC
From:
To:
Just in case it is not obvious (I did not see it until I toggled
"useless messages"!)...

This Bug#778849 (Severity: wishlist) blocks Bug#978642 [Wipe LUKS Disk
Encryption Key for Root Disk from RAM during Shutdown to defeat Cold
Boot Attacks from Initial Ramdisk (initramfs-tools or dracut)].

#778849#37
Date:
2023-01-11 00:17:44 UTC
From:
To:
Apparently, I got confused.  What I saw is the script called 'shutdown'
from the mkinitcpio package used in Arch Linux (see
https://gitlab.archlinux.org/archlinux/mkinitcpio/mkinitcpio/-/blob/master/shutdown
).

What it does is (1) recursively unmount the devices, (2) detach loop
devices and then (3) disassemble stacked devices (i.e. encrypted
devices, LVM and RAID).
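
The first of those three steps can be sketched roughly as follows. This
is not mkinitcpio's code, only a simplified illustration: the function
names are made up, /oldroot is assumed to be where the old root ends up
after the pivot, and mount points containing spaces (escaped in
/proc/self/mounts) are not handled.

```shell
#!/bin/sh
# Print every mount point at or below PREFIX, children before parents, so
# that each can be unmounted before its parent. Reads /proc/self/mounts
# unless another file is given. PREFIX must not be "/".
mounts_below() {
    awk -v p="$1" '$2 == p || index($2, p "/") == 1 { print $2 }' \
        "${2:-/proc/self/mounts}" | LC_ALL=C sort -r
}

# Step (1): unmount everything under the old root, deepest first.
umount_old_root() {
    mounts_below "${1:-/oldroot}" | while read -r mnt; do
        umount "$mnt" || echo "could not unmount $mnt" >&2
    done
}
```

Steps (2) and (3) would then need tools such as losetup, cryptsetup,
the LVM tools and mdadm inside the initramfs, depending on the storage
stack in use.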

In contrast, what intrigeri's solution SEEMS to do (I haven't done any
experimentation using the solution) is provide a way for Debian's initrd
process to "pivot" back to a systemd shutdown procedure within an
initramfs environment, as opposed to running the Arch Linux shutdown
script.  This shutdown procedure differs from Arch Linux's because its
initramfs infrastructure differs from Debian's, I assume?

As intrigeri wrote in his instructions, the relevant scripts would need
to be written for dismantling devices ('virtual' or physical) and placed
in /usr/share/initramfs-tools/hooks/* (if I understood things
correctly).  So, if ZFS was installed as root, there would need to be a
script for that and/or if LUKS was installed as root, there would need
to be a script for that, etc...

The way that intrigeri's solution sets up the shutdown executable by
just copying it to initramfs seems very clunky to me.  Shouldn't it be
in the initramfs image file already even before the system is switched
on/booted up!?

Anyway, the above is my understanding of the situation.  It may be
completely wrong because I barely understand the initrd process!

Thanks,
Gervase.

#778849#42
Date:
2023-03-19 02:16:19 UTC
From:
To:
The following initrd info may or may not be pertinent to this bug in
respect to how initrds may be created in future versions of Debian...

#778849#47
Date:
2023-09-18 07:52:49 UTC
From:
To:
    a. netbooting Debian Live on diskless hosts.
    b. "zpool export -a" on servers.

I am only considering case (a), below.

I tried intrigeri's approach for Debian Live but I ran into a couple of problems:

    1. it assumes /initrd.img inside the rootfs exists and
       is consistent with the already-running system.
       This is not the case for me (I remove it to save space), and
       also not necessarily the case during upgrades.

    2. it tries to unpack /initrd.img after systemd-networkd stops.
       Without KeepConfiguration= (which is a pain to guarantee),
       that means no network access, which means no access to remote rootfs.

I instead tried just keeping the boot initrd around.
Using a simple bind-mount didn't work (I don't understand why) – SOME files are missing after switch_root.
Doing a full cp -a did work, though.
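
A rough sketch of the "full cp -a" idea, not the code from the workaround linked below. It assumes it would be called before switch_root (e.g. from an init-bottom script) as `keep_boot_initramfs / /run/initramfs`; the function name and the skip list are assumptions based on the initramfs-tools layout, where the real root is mounted on /root.

```shell
#!/bin/sh
# Copy the top-level entries of SRC (the initramfs root) into DEST,
# creating empty directories for mount points instead of copying them.
# Hidden top-level files are not copied by this sketch.
keep_boot_initramfs() (
    set -e
    src=$1
    dest=$2
    mkdir -p "$dest"
    for entry in "$src"/*; do
        [ -e "$entry" ] || continue
        name=${entry##*/}
        case "$name" in
            dev|proc|sys|run|root|tmp) mkdir -p "$dest/$name" ;;
            *) cp -a "$entry" "$dest/" ;;
        esac
    done
)
```

Since /run is a tmpfs, the copy stays in RAM for the whole uptime, which is one reason to be unhappy with this approach.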

This method seems to work for my very simple test case of failed-to-unmount-rootfs error going away.
I'm really not happy with it overall, though.
I've run out of "time budget" to work on this in the short term.

https://github.com/cyberitsolutions/bootstrap2020/tree/twb/doc/workaround-778849

PS: I looked at dracut, but it's simply unsupported for live-boot (Debian Live / Tails), and
    for servers, I found it unreliable (much worse than initramfs-tools).
    (e.g. if bash has a security update, dracut doesn't trigger and the embedded copy of bash in the initrd remains vulnerable.)
    (e.g. telling dracut to use only busybox/klibc and not bash breaks, because lots of dracut components need bash but don't declare a dependency on it.)
    (e.g. dracut is written in bash and regularly has errors but doesn't exit non-zero, so you do not notice until the server doesn't actually boot anymore.)

#778849#52
Date:
2023-09-18 08:07:17 UTC
From:
To:
https://github.com/systemd/systemd/blob/v252/src/shutdown/shutdown.c#L422

i.e. it's similar to arch's script, except it's 1) C code; 2) distro-agnostic; and 3) a bit feature-limited.

I think if you want it to run arbitrary other commands (e.g. "zpool export -a"), you would need more code.

I think for that you'd want to systemdize /run/initramfs/shutdown
(i.e. be a copy of systemd's /bin/init), and then run some subset of
https://github.com/systemd/systemd/blob/v252/man/bootup.xml#L291-L330

Note that systemd can "be" the boot initrd, too, which is the previous flow chart:
https://github.com/systemd/systemd/blob/v252/man/bootup.xml#L236-L288

AFAIK Debian initramfs-tools doesn't support this at all.
AFAIK ArchLinux supports this, but it is opt in (off by default).

Last time I looked (around Debian 10),
Debian dracut theoretically supported putting systemd in charge of boot initrd (and shutdown initrd?), BUT
it also installed a zillion bits of coreutils that systemd itself doesn't use.
Since my goal was to REDUCE the attack surface of the boot initrd, I gave up on dracut at the time.

I think it'd be better if /run/initramfs/shutdown used existing code -- either
/lib/systemd/system-shutdown/*, or
maybe .service units, if that's appropriate.

But I confess I still do not understand how a "pure systemd" boot initrd + shutdown initrd would actually look.

#778849#57
Date:
2023-10-24 19:50:03 UTC
From:
To:
I was also experiencing improper shutdown of a root filesystem on nvme drives :(  This is a new Bookworm install with the root file system on an LVM thin pool which in turn resides on two nvme drives configured for Raid10 via mdadm.

I tried the suggestion from intrigeri but did not accomplish a successful pivot to initramfs.  I -believe- the problem was that the "mount -o remount,exec /run" occurred too "late" when it was attempted in /usr/lib/systemd/system-shutdown/initramfs-tools.  I moved the "mount -o remount,exec /run" to /usr/share/initramfs-tools/initramfs-restore and was able to get systemd to successfully pivot into the initramfs, and detach all drives ... YMMV :)
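
For reference, the change described above could look roughly like this when written as a function. It is a sketch, not the exact file used: the function name, the optional arguments and the fallback for images that unpack without a main/ subdirectory are assumptions.

```shell
#!/bin/sh
# Variant of initramfs-restore with the remount of /run moved in.
# Arguments (both optional): initrd image, target directory.
restore_initramfs() (
    set -e
    image=${1:-/initrd.img}
    target=${2:-/run/initramfs}

    # Done here because doing it from system-shutdown/ seemed too late.
    mount -o remount,exec /run

    workdir=$(mktemp -d)
    unmkinitramfs "$image" "$workdir"
    # Assumption: images with early (microcode) archives unpack into
    # main/, others directly into the work directory.
    if [ -d "$workdir/main" ]; then
        unpacked="$workdir/main"
    else
        unpacked="$workdir"
    fi
    mkdir -p "$target"
    mv "$unpacked"/* "$target"/
    rm -rf "$workdir"
)
```

Called without arguments, it behaves like the original script plus the remount.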

#778849#62
Date:
2025-04-14 07:19:44 UTC
From:
To:
Hello,

I sneak in just to report another situation where this feature would
be useful: root filesystem placed on an MD Raid array with an external
write-intent bitmap.

To be effective the write-intent bitmap file needs to be placed on an
external partition outside of the raid array. This creates a chicken
and egg problem where the raid array needs the partition where the
file is placed which in turn needs the root partition to be mounted.

Haven't tried it, but this probably could be solved within initramfs
by mounting the external partition somewhere under /run before
assembling the raid array. This needs to be unwound at shutdown and I
see no way to do it properly outside of initramfs, as the unwinding
would require the following steps:

- Unmount root
- Stop the raid array, so that the bitmap file is no longer used
- Unmount the write-intent bitmap partition
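
As an untested sketch, the three steps above could be written like this
for use from within the initramfs at shutdown. The function names and
all three arguments are hypothetical examples; setting DRY_RUN=1 prints
the commands instead of running them.

```shell
#!/bin/sh
# Run a command, or just print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Unwind a root filesystem on an MD array whose write-intent bitmap
# file lives on a separate partition.
unwind_raid_root() (
    set -e
    root_mnt=$1      # e.g. /oldroot
    array=$2         # e.g. /dev/md0
    bitmap_mnt=$3    # e.g. /run/md-bitmap
    run umount "$root_mnt"        # unmount root
    run mdadm --stop "$array"     # stop the array, releasing the bitmap file
    run umount "$bitmap_mnt"      # unmount the bitmap partition
)
```

Example: `DRY_RUN=1 unwind_raid_root /oldroot /dev/md0 /run/md-bitmap`
prints the three commands in order.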

Another solution might consider the following steps:

- Mount root read only, so no further writes could happen
- Remove the bitmap file from the raid array
- Unmount the write-intent bitmap partition

Maybe this could be done outside of initramfs but I'm not sure if the
bitmap file would be used again after reboot.

If I find some spare time I'll try to experiment using some VM.

FYI, with the root filesystem on a hybrid NVMe/HDD RAID 1 I'm
experiencing the same NVMe "unsafe_shutdowns" problem reported by
Joseph Carter. Not unexpected considering that this situation is
pretty similar.

Hope it helps.

Bye,