#955413 btrfs-progs: new mount-time checking causes systemd to timeout and boot failure

Package:
btrfs-progs
Source:
btrfs-progs
Description:
Checksumming Copy on Write Filesystem utilities
Submitter:
Graham Cobb
Date:
2025-08-13 19:01:03 UTC
Severity:
minor
#955413#5
Date:
2020-03-31 12:20:48 UTC
From:
To:
The new checks at mount time cause mount times for large filesystems to be much
longer. My roughly 10TB filesystem now takes over 90 seconds to mount.

Unfortunately, this is longer than the default systemd mount timer and systemd
assumes the mount has failed (and, in fact, cancels it). If the mount is not
marked with "nofail" this causes boot to fail and to drop into the rescue
console.

This new checking is a good thing, and neither the new checks, nor the systemd
behaviour are bugs, nor are they the responsibility of btrfs-progs.

It is likely that users of large btrfs filesystems will have btrfs-progs
installed. So, I believe it would be extremely useful to add a NEWS item to the
next btrfs-progs package release to warn btrfs users of this change and
recommend that they consider making changes to /etc/fstab for any btrfs
volumes mounted at boot time.

Alternatively, the NEWS item could be included in the kernel release, however
that would target a much larger group of people than just btrfs users.

A release notes entry for bullseye should also be considered.

Suggested NEWS text:

BTRFS filesystems are now checked for tree corruption at mount time.
For filesystems of around 10TB or larger, this is likely to cause the mount
time to exceed the default systemd mount timeout of 90 seconds. If the
filesystem is mounted at boot time, and not marked as "nofail", this will cause
the boot to fail.

The systemd mount timeout can be increased in /etc/fstab by setting the mount
option x-systemd.mount-timeout= with a new timeout value in seconds.
See systemd.mount(5) for full documentation on the systemd mount options
that can be specified in /etc/fstab.
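
A minimal sketch of the fstab change described above, assuming the
filesystem type is in the third fstab field and the options are in the
fourth. The function name add_mount_timeout and the 300-second default are
illustrative choices, not part of any package. It only prints a proposed
fstab and never writes to /etc/fstab itself.

```shell
#!/bin/sh
# Read fstab-format text on stdin and print it with
# x-systemd.mount-timeout=<seconds> appended to the options of every
# btrfs entry that does not already set it.
# Note: a modified line is re-joined with single spaces between fields.
add_mount_timeout() {
    timeout="${1:-300}"   # illustrative default, in seconds
    awk -v t="$timeout" '
        # Pass blank lines and comments through unchanged.
        NF == 0 || $1 ~ /^#/ { print; next }
        # Field 3 is the filesystem type, field 4 the mount options.
        NF >= 4 && $3 == "btrfs" && $4 !~ /x-systemd\.mount-timeout=/ {
            $4 = $4 ",x-systemd.mount-timeout=" t
        }
        { print }
    '
}

# Usage (review the result by hand before replacing anything):
#   add_mount_timeout 300 < /etc/fstab > /tmp/fstab.proposed
#   diff -u /etc/fstab /tmp/fstab.proposed
```

Printing to stdout rather than editing in place keeps the step reviewable,
which matters for a file that can stop the system from booting.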

#955413#10
Date:
2020-04-03 13:44:48 UTC
From:
To:
We see that, too, at the institute... any larger (few TB) filesystems
in /etc/fstab make systemd cause the system to fail at boot... leaving
it in a state with no remote rescue (ssh) being possible.

Since such filesystems would mount just fine... I would rather say that
this functionality is a severe bug.


Cheers,
Chris.

#955413#15
Date:
2020-08-01 12:38:14 UTC
From:
To:
(Sorry for the delay -- I investigated the cause but somehow failed to post
here.  Recently someone on the mailing list reported it also, with the same
conclusion.)

90 seconds for 10TB sounds like something is terribly wrong in your case.
I currently have only one spinning rust array, of 3 disks 7TB, and it
completes mount in around a second.

But, there are folks with massive arrays with mount times of several minutes
or tens of minutes, so the issue is real.

How can a mount be "cancelled"?  The syscall provides no way to abort a
mount attempt -- it either succeeds or fails, with no possible timeouts on
userspace side.

I'm not experienced at debugging systemd issues, but it appears to me that
systemd has a separate thread doing the mount() call, reports timeout
despite the call being still in progress, then upon successful mount
_unmounts_ the filesystem again.


From: Christoph Anton Mitterer <calestyo@scientia.net>
} We see that, too, at the institute... any larger (few TB) filesystems
} in /etc/fstab make systemd cause the system to fail at boot... leaving
} it in a state with no remote rescue (ssh) being possible.
}
} Since such filesystems would mount just fine... I would rather say that
} this functionality is a severe bug.

They do mount successfully with a manual mount, and also with initscripts.
The timeout is imposed on the systemd side, with the kernel doing as expected.

Also, it can't be fixed in btrfs-progs, as no code in this package is run at
mount time.


Meow!

#955413#20
Date:
2020-08-01 18:57:42 UTC
From:
To:
Thanks for looking into it. I agree with your analysis but the issue is real.
I think the long mount times tend to occur when there are large numbers of
subvolumes and other large trees.

Although the problem is not fixable in btrfs-progs, I think it needs a NEWS item
in btrfs-progs - otherwise many people who use btrfs disks will hit big problems after
their Debian upgrade with no idea what the problem is or how they can fix it.

I think including the NEWS text I proposed in the btrfs-progs package would
avoid problems for a significant percentage of existing Debian btrfs users.

#955413#25
Date:
2023-05-03 20:56:49 UTC
From:
To:
Hi,

Graham Cobb <g+debian@cobb.uk.net> writes:

I'm curious how "aged" the fs is, (largest generation from btrfs subvol
list), how many subvolumes, if qgroups are used, raid56, are reflink
copies of large trees regularly made, deduplication, compression, etc.
I've updated the btrfs page on wiki.debian.org with your report and
suggested workaround, and I'd like to provide more context to not scare
people off btrfs.  I.e., as Adam said, it's not normal for mount to take
this long.

Christoph Anton Mitterer <calestyo@scientia.net> writes:

I'm also curious whether there are any factors here that are causing
the failure.  If this is truly a typical btrfs+systemd scaling problem
then, yes, it's unconscionable to pretend that everything is ok...and it
at least needs to be documented.  Please feel free to contribute to the
wiki.debian.org page, by the way!

Adam Borowski <kilobyte@angband.pl> writes:

Hm, yes, I agree this sounds odd, but Fedora 33 shipped with btrfs by
default in Oct 2020, and has stayed with that default, so I wonder if
this bug was fixed in systemd by then?  If not, I wonder if we're "doing
things differently", so to speak, and triggering a bug that they're not
(initramfs vs dracut, etc).

Cheers,
Nicholas

#955413#30
Date:
2023-05-03 23:36:52 UTC
From:
To:
More recent kernels have definitely improved the situation. My largest
filesystem now takes about 60 seconds to mount at boot, although I have
not checked how long it takes if it was not cleanly unmounted.

This is an online backup disk (btrbk). It stores snapshots. Lots of
them. From many different systems. It is btrfs-over-lvm-over-luks. Data
is Single; Metadata is RAID1. There are two disks, each has an LV on a
VG on a LUKS volume.

# btrfs fi usage /mnt/snapshots/
Overall:
    Device size:                  18.00TiB
    Device allocated:             15.22TiB
    Device unallocated:            2.77TiB
    Device missing:                  0.00B
    Device slack:                    0.00B
    Used:                         15.10TiB
    Free (estimated):              2.88TiB      (min: 1.49TiB)
    Free (statfs, df):             2.88TiB
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)
    Multiple profiles:                  no
Data,single: Size:15.01TiB, Used:14.90TiB (99.29%)
   /dev/mapper/cryptdata16tb2--vg-backupsnapshot          10.52TiB
   /dev/mapper/cryptbackup16tb--vg-backupsnapshot          4.49TiB
Metadata,RAID1: Size:110.00GiB, Used:103.12GiB (93.74%)
   /dev/mapper/cryptdata16tb2--vg-backupsnapshot         110.00GiB
   /dev/mapper/cryptbackup16tb--vg-backupsnapshot        110.00GiB
System,RAID1: Size:32.00MiB, Used:1.80MiB (5.62%)
   /dev/mapper/cryptdata16tb2--vg-backupsnapshot          32.00MiB
   /dev/mapper/cryptbackup16tb--vg-backupsnapshot         32.00MiB
Unallocated:
   /dev/mapper/cryptdata16tb2--vg-backupsnapshot           1.37TiB
   /dev/mapper/cryptbackup16tb--vg-backupsnapshot          1.40TiB

Some answers to your questions:

I'm curious how "aged" the fs is, (largest generation from btrfs subvol
list) -- 1248641

how many subvolumes -- 1810

if qgroups are used -- no

raid56 -- no - single data, raid1 metadata

are reflink copies of large trees regularly made -- yes (btrbk)

deduplication -- no

compression, etc. -- some
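
For anyone wanting to report the same two figures (subvolume count and
largest generation), here is a rough sketch. It assumes the usual
"ID <n> gen <n> top level <n> path <p>" line format printed by
btrfs subvolume list. The function name summarize_subvols is made up for
this example.

```shell
#!/bin/sh
# Summarise `btrfs subvolume list <path>` output read on stdin:
# print the number of subvolumes and the largest generation seen.
summarize_subvols() {
    awk '
        # Expected line shape: ID 256 gen 1234 top level 5 path some/subvol
        $1 == "ID" && $3 == "gen" {
            count++
            if ($4 + 0 > max) max = $4 + 0
        }
        END { printf "subvolumes=%d max_gen=%d\n", count, max }
    '
}

# Usage (needs root and a mounted btrfs filesystem):
#   btrfs subvolume list /mnt/snapshots | summarize_subvols
```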

Note: there are two other btrfs filesystems on this system.
Same approach (btrfs-over-lvm-over-luks). Similar disks.
Both are around 6-8 TB. These mount much more quickly.
The main difference is they do not have anything like the number of
subvolumes, nor the continuous creation and deletion of subvolumes that
this backup disk has.

I think the way to frame this is not that it is a problem people are
likely to see, but that if they find themselves in that situation, the
answer is simple: add x-systemd.mount-timeout=180s (and consider adding
nofail).
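
To see which btrfs entries would be affected, something like the following
could be used. It is only a sketch: audit_btrfs_mounts is a hypothetical
helper name, and it assumes standard fstab field order (mount point in
field 2, type in field 3, options in field 4). It reports btrfs entries
that lack x-systemd.mount-timeout= or nofail.

```shell
#!/bin/sh
# Read fstab-format text on stdin and list btrfs mount points whose
# options lack x-systemd.mount-timeout= and/or nofail.
audit_btrfs_mounts() {
    awk '
        # Skip blank lines and comments.
        NF == 0 || $1 ~ /^#/ { next }
        NF >= 4 && $3 == "btrfs" {
            # Wrap the option list in commas so whole options can be matched.
            opts = "," $4 ","
            missing = ""
            if (opts !~ /,x-systemd\.mount-timeout=/) missing = missing " mount-timeout"
            if (opts !~ /,nofail,/) missing = missing " nofail"
            if (missing != "") print $2 ":" missing
        }
    '
}

# Usage:
#   audit_btrfs_mounts < /etc/fstab
```

Whether nofail is appropriate depends on whether the rest of the system can
run without that filesystem, so the output is a list to review rather than
a list of errors.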

I will try to take a look at the wiki in the next few days.

#955413#35
Date:
2025-08-13 19:00:35 UTC
From:
To:
Graham Cobb <g+debian@cobb.uk.net> writes:

I'm closing this bug, as a documentation issue, because the
documentation on our wiki appears to have been sufficient--there has
been no activity on this bug in two years.

Cheers,
Nicholas