Hi Debian kernel team,
Sorry for the short report yesterday. Reportbug sent it out earlier
than I expected. Here is the full report:
* What led up to the situation?
On Debian 10 buster, I upgraded the Linux kernel packages from
4.19+105+deb10u12 (buster) to 5.10.46-4~bpo10+1 (buster-backports).
Since then, the system has been hanging regularly. These are always
complete freezes: I cannot move the mouse pointer, cannot switch to a
virtual console, and the system no longer responds to network pings.
* What exactly did you do (or not do) that was effective (or
ineffective)?
Use Firefox or Evince. See below for details.
* What was the outcome of this action?
Complete system hang/freeze.
* What outcome did you expect instead?
No hang.
Git bisect results:
Bisect Debian: https://salsa.debian.org/kernel-team/linux.git
First bad commit: 3fcc0ffb or 0fc228cb. Probably 3fcc0ffb, the first
update from 5.6 to 5.7.
* | a2f70104 [amd64] Update "x86: Make x32 syscall support conditional ..." for 5.7
* | 0fc228cb lockdown: Update Secure Boot support patches for 5.7 <--- git bisect bad
* | 3fcc0ffb Update to 5.7-rc4 <--- git bisect skip, because it fails to build
* | 6e17c1ca Enable support for fsverity <--- git bisect good
|/
o b49338be (tag: debian/5.6.7-1) Prepare to release linux (5.6.7-1).
Bisect upstream:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
Notes about bisecting upstream:
- The pristine upstream kernel does not show the bug out of the box,
because it disables `intel_iommu` by default. You need to either:
- Apply the following two Debian patches:
features/x86/intel-iommu-add-option-to-exclude-integrated-gpu-only.patch
features/x86/intel-iommu-add-kconfig-option-to-exclude-igpu-by-default.patch
- Or boot the pristine upstream kernel with `intel_iommu=on`.
- The first bad commit is in a range of commits that fail to build
because of an unrelated problem:
depmod: ERROR: Cycle detected: drm_kms_helper -> drm -> drm_kms_helper
Cherry-pick the following later revert commits to solve that:
$ git cherry-pick -x 6ae1a4bb^..09912635
$ git commit --allow-empty
$ git cherry-pick --continue
(Only relevant for bisecting one specific old branch, otherwise not
relevant to this bug report.)
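For anyone repeating the bisect, the skip handling described above could
be sketched roughly as follows. This is only an illustration, not
necessarily how the bisect was done here: `next_action` is a hypothetical
helper name, and the build command passed to it is whatever builds your
kernel. Because the hang has to be observed by hand, the sketch only
decides between skipping a commit that fails to build and going on to a
manual test.

```shell
#!/bin/sh
# Sketch only: next_action is a hypothetical helper, not part of git.
# It runs the build command given as arguments. Build output is sent to
# stderr so that stdout carries only the verdict:
#   "skip" -> the build failed, so the commit cannot be tested
#   "test" -> the build succeeded, boot the kernel and test by hand
next_action() {
    if "$@" >&2; then
        echo test
    else
        echo skip
    fi
}
```

For example, when `next_action make bindeb-pkg` prints `skip`, the commit
can be marked with `git bisect skip`. When it prints `test`, boot the new
kernel, try to reproduce the hang, and then run `git bisect good` or
`git bisect bad`.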
First bad commit upstream: bf72c8c6, first merged in torvalds/master
with v5.7-rc1.
commit bf72c8c6ee77d46f74a2b143303a9c9923f9e7a7 (refs/bisect/bad)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Thu Jan 30 09:22:38 2020 +0000
drm/i915/gt: Skip global serialisation of clear_range for bxt vtd
VT'd on Broxton and on Braswell require serialisation of GGTT updates.
However, it seems to only be required for insertion, so drop the
complication and heavyweight stop_machine() for clears. The range will
be serialised again before use.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200130092239.1743672-1-chris@chris-wilson.co.uk
diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c b/drivers/gpu/drm/i915/gt/intel_ggtt.c
index fdfed921..f83070b5 100644
--- a/drivers/gpu/drm/i915/gt/intel_ggtt.c
+++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c
@@ -350,31 +350,6 @@ static void bxt_vtd_ggtt_insert_entries__BKL(struct i915_address_space *vm,
stop_machine(bxt_vtd_ggtt_insert_entries__cb, &arg, NULL);
}
-struct clear_range {
- struct i915_address_space *vm;
- u64 start;
- u64 length;
-};
-
-static int bxt_vtd_ggtt_clear_range__cb(void *_arg)
-{
- struct clear_range *arg = _arg;
-
- gen8_ggtt_clear_range(arg->vm, arg->start, arg->length);
- bxt_vtd_ggtt_wa(arg->vm);
-
- return 0;
-}
-
-static void bxt_vtd_ggtt_clear_range__BKL(struct i915_address_space *vm,
- u64 start,
- u64 length)
-{
- struct clear_range arg = { vm, start, length };
-
- stop_machine(bxt_vtd_ggtt_clear_range__cb, &arg, NULL);
-}
-
static void gen6_ggtt_clear_range(struct i915_address_space *vm,
u64 start, u64 length)
{
@@ -881,8 +856,6 @@ static int gen8_gmch_probe(struct i915_ggtt *ggtt)
IS_CHERRYVIEW(i915) /* fails with concurrent use/update */) {
ggtt->vm.insert_entries = bxt_vtd_ggtt_insert_entries__BKL;
ggtt->vm.insert_page = bxt_vtd_ggtt_insert_page__BKL;
- if (ggtt->vm.clear_range != nop_clear_range)
- ggtt->vm.clear_range = bxt_vtd_ggtt_clear_range__BKL;
}
ggtt->invalidate = gen8_ggtt_invalidate;
How to reproduce:
- BIOS: VT-d enabled.
and
- Kernel releases since v5.7-rc1. I can still reproduce it with drm-tip
e61e3604 of 2021-09-17 based on v5.15-rc1.
and
- Kernel from:
- Debian official packages, or
- Built from Debian linux kernel source (salsa), or
- Built from pristine upstream patched with these 2 Debian patches:
features/x86/intel-iommu-add-option-to-exclude-integrated-gpu-only.patch
features/x86/intel-iommu-add-kconfig-option-to-exclude-igpu-by-default.patch
or
- Built from pristine upstream without the 2 Debian patches, but then
use boot parameter `intel_iommu=on`.
and
- Linux boot parameter:
- Debian packages, or upstream patched with the 2 Debian patches:
- No boot parameter, or
- `intel_iommu=intgpu_off` (Debian-specific, Debian default), or
- `intel_iommu=on`.
or
- With pristine upstream, without Debian patches:
- `intel_iommu=on`.
and
- The buster-versions of binaries from these sources:
- mesa 18.3.6-2+deb10u1 (not 20.3.5-1~bpo10+1), and
- libglvnd 1.1.0-1 (not 1.3.2-1~bpo10+2)
and
- Xorg driver:
- modesetting: Easiest to reproduce with Evince: Hang in <2 minutes,
usually <30 seconds. Also reproducible with Firefox with
Compositing: OpenGL, though it will take longer, see Steps below.
or
- intel: Have not been able to reproduce with Evince, only with
Firefox with Compositing: OpenGL. Sometimes <1 min, sometimes 20
minutes. See Steps below.
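The kernel-side conditions above can be checked with a small script. The
following is a minimal sketch under stated assumptions: the function
names are hypothetical, the version check only looks at the first two
numeric components of a version string such as the output of `uname -r`,
and it says nothing about the BIOS VT-d setting or the mesa and libglvnd
versions.

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical.

# Print the value of every intel_iommu= parameter found on a kernel
# command line (passed as one string), one value per line.
# Prints nothing if the parameter is absent.
intel_iommu_params() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^intel_iommu=//p'
}

# Succeed if a version string such as "5.10.0-0.bpo.8-amd64" is 5.7 or
# later. Only the major and minor numbers are compared.
kernel_is_5_7_or_later() {
    major=${1%%.*}            # text before the first dot
    rest=${1#*.}              # text after the first dot
    minor=${rest%%[!0-9]*}    # leading digits of the remainder
    [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -ge 7 ]; }
}
```

On a running system these could be called as
`intel_iommu_params "$(cat /proc/cmdline)"` and
`kernel_is_5_7_or_later "$(uname -r)"`. Note that with the Debian kernels
an absent `intel_iommu` parameter still meets the conditions listed above.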
To prevent the bug:
- BIOS: VT-d disabled.
or
- Linux version 5.6 or lower.
or
- Boot parameter `intel_iommu=off` (this is the default for pristine
upstream without the 2 Debian patches)
or
- The buster-backports-versions of binaries from these sources:
- mesa 20.3.5-1~bpo10+1, and
- libglvnd 1.3.2-1~bpo10+2
No influence:
- Boot parameter `intel_iommu=strict iommu.strict=1` does not prevent
the bug.
Steps to reproduce:
I usually use Xorg modesetting + Evince, because it triggers the bug
fastest and most reliably. Firefox with OpenGL can trigger the bug with
Xorg modesetting or Xorg intel, but takes longer.
- With Evince (with Xorg modesetting, not Xorg intel driver): Open a
PDF with Evince (tested with 3.30.2-3+deb10u1) and quickly and
repeatedly scroll up and down the document. System should hang in
less than 2 minutes, usually even less than 30 seconds.
- Choose a PDF that can be scrolled through, but is not so big that
Evince will spend time "Loading..." during scrolling. For example,
on Debian I used:
- /usr/share/doc/quilt/quilt.pdf (12 pages)
- /usr/share/doc/dbconfig-common/dbapp-policy.pdf.gz (7 pages)
- With Firefox (tested with 78.14.0esr-1~deb10u1), in `about:config`:
gfx.webrender.all false
layers.acceleration.force-enabled true
layers.acceleration.disabled false
Now `about:support` should show `Compositing: OpenGL`. Now scroll and
switch between tabs, such as about:config, about:preferences,
about:performance, about:support, some random
offline documentation, or a random bug on bugs.debian.org. System may
hang in <1 min, but it may also take 20 minutes (or hours even).
- Make sure the page is not so long that it goes blank during
scrolling. The page should be long enough to be scrollable,
but short enough that the contents stay visible even during fast
scrolling.
- I am not sure whether it can be reproduced with Firefox with
Compositing: Basic or Compositing: WebRender.
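Instead of changing the three preferences by hand in `about:config`, they
could also be set through a `user.js` file in the Firefox profile
directory, which Firefox reads at startup. The following is a rough
sketch, not something that was used for the tests in this report: the
function name is hypothetical, and the path of the `user.js` file must be
supplied by the caller because it depends on the profile in use.

```shell
#!/bin/sh
# Sketch only: write_opengl_prefs is a hypothetical helper.
# It appends the three preferences from the steps above to the user.js
# file whose path is given as the first argument.
write_opengl_prefs() {
    cat >> "$1" <<'EOF'
user_pref("gfx.webrender.all", false);
user_pref("layers.acceleration.force-enabled", true);
user_pref("layers.acceleration.disabled", false);
EOF
}
```

After restarting Firefox, `about:support` can be used to confirm that it
shows `Compositing: OpenGL`.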
Further notes:
- I searched the upstream bug tracker for drm/intel a bit and so far
found only this report that may be related:
- https://gitlab.freedesktop.org/drm/intel/-/issues/4082
(System hangs during parallel media transcode operations after
enabling VT-d)
However, the discussion there focused more on the fact that they had
temporarily ignored `intel_iommu=igfx_off` for their CI, causing the
bug to surface. They did not really look into the underlying issue,
except:
> this kernel bug may have been there for longer time, or be HW/FW
> issue needing WA.
(I guess HW/FW=Hardware/Firmware, WA=Workaround.)
and:
> Complete system hang sounds like a possible hw bug.
and:
> I'm more inclined to think the issue being some race condition in
> kernel, which could trigger BUG/Oops/panic (i.e. cause machine also
> to drop network connection and not answer pings any more). IOMMU
> changes performance a bit, so it could trigger races that do not
> trigger with IOMMU off.
- Actually, I feel there are two bugs here:
- One Debian bug, in the 2 Debian intel iommu patches: I would expect
that the Debian-specific `INTEL_IOMMU_DEFAULT_ON_INTGPU_OFF` would
result in the same behavior as `intel_iommu=off`, but here it
actually seems to behave like `intel_iommu=on`. Or maybe I am
misunderstanding something.
- One upstream bug, since bf72c8c6, first merged in v5.7-rc1. Still
in drm-tip e61e3604 of 2021-09-17 based on v5.15-rc1. I thought I
had better first hear what you (the Debian kernel team) have to say
about this, so I did not report upstream yet.
- I do not know what to make of the fact that mesa and libglvnd from
buster-backports make the bug disappear. Perhaps this means that once
I upgrade to Debian 11 bullseye, I will not experience the bug
anymore anyway. But I thought mesa and libglvnd are like "user-space"
from the kernel's point-of-view and should never be able to make the
system freeze, whatever their version. How likely is it that later
some other user-space program not depending on mesa, or perhaps even
a later version of mesa, is able to trigger the bug again?
- Attached is a kernel log of a boot without any `intel_iommu` boot
parameters on which I was able to reproduce the bug. No log data for
the exact moment the bug is triggered, unfortunately. Note that the
MCE error does not occur on every boot and I have also seen hangs on
boots that did not have the MCE error.
Hope I did not forget anything, otherwise I will send more info later.
Thank you for your attention, and for all the work you do on packaging
the kernel. Really impressed by the sheer amount of work you all must
be doing to get all those packages out.
Best regards,
Peter Nowee