#994721 linux-image-5.10.0-0.bpo.8-amd64: Freeze on i915 Broxton with linux >=5.7 and old mesa; not prevented by intel_iommu=intgpu_off

#994721#5
Date:
2021-09-19 21:03:05 UTC
From:
To:
Dear Maintainer,

*** Reporter, please consider answering these questions, where appropriate ***

   * What led up to the situation?
   * What exactly did you do (or not do) that was effective (or
     ineffective)?
   * What was the outcome of this action?
   * What outcome did you expect instead?

*** End of the template - remove these template lines ***

#994721#10
Date:
2021-09-19 21:24:23 UTC
From:
To:
Details to follow tomorrow (bisected already, how to reproduce)
#994721#15
Date:
2021-09-20 21:54:14 UTC
From:
To:
Hi Debian kernel team,

Sorry for the short report yesterday. Reportbug sent it out earlier
than I expected. Here is the full report:

   * What led up to the situation?

On Debian 10 buster, I upgraded the linux kernel packages from
4.19+105+deb10u12 (buster) to 5.10.46-4~bpo10+1 (buster-backports).

Since then, the system has been hanging regularly. These are always
complete freezes: I cannot move the mouse pointer, cannot switch to a
virtual console, and the machine no longer responds to network pings.


   * What exactly did you do (or not do) that was effective (or
     ineffective)?

Use Firefox or Evince. See below for details.

   * What was the outcome of this action?

Complete system hang/freeze.

   * What outcome did you expect instead?

No hang.


Git bisect results:

Bisect Debian: https://salsa.debian.org/kernel-team/linux.git
First bad commit: 3fcc0ffb or 0fc228cb. Probably 3fcc0ffb, the first
update from 5.6 to 5.7.

    * | a2f70104 [amd64] Update "x86: Make x32 syscall support conditional ..." for 5.7
    * | 0fc228cb lockdown: Update Secure Boot support patches for 5.7       <--- git bisect bad
    * | 3fcc0ffb Update to 5.7-rc4                                          <--- git bisect skip, because it fails to build
    * | 6e17c1ca Enable support for fsverity                                <--- git bisect good
    |/
    o b49338be (tag: debian/5.6.7-1) Prepare to release linux (5.6.7-1).


Bisect upstream:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/

Notes about bisecting upstream:
- The pristine upstream kernel does not by default show the bug,
  because it disables `intel_iommu` by default. You need to either:
  - Apply the following two Debian patches:
        features/x86/intel-iommu-add-option-to-exclude-integrated-gpu-only.patch
        features/x86/intel-iommu-add-kconfig-option-to-exclude-igpu-by-default.patch
  - Or boot the pristine upstream kernel with `intel_iommu=on`.
- The first bad commit is in a range of commits that fail to build
  because of an unrelated problem:
      depmod: ERROR: Cycle detected: drm_kms_helper -> drm -> drm_kms_helper
  Cherrypick the following later revert commits to solve that:
      $ git cherry-pick -x 6ae1a4bb^..09912635
      $ git commit --allow-empty
      $ git cherry-pick --continue
  (Only relevant for bisecting one specific old branch, otherwise not
  relevant to this bug report.)
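
Because a bad kernel hangs the whole machine, a bisect like the one
described above cannot be driven by `git bisect run`; each step has to
be marked by hand after rebooting. The following is only a sketch of
that bookkeeping. The helper name `bisect_mark`, its outcome words and
the endpoints in the usage comment (v5.6 good, v5.7-rc1 bad) are
assumptions for illustration, not the commands actually used for this
report.

```shell
#!/bin/sh
# Map the outcome of one manually tested bisect step to the git bisect
# subcommand to run after rebooting into a working kernel.
#   $1: "built" or "build-failed"
#   $2: "froze" or "survived" (ignored when the build failed)
bisect_mark() {
    case "$1:${2-}" in
        build-failed:*) echo skip ;;   # cannot test: let git pick another commit
        built:froze)    echo bad ;;
        built:survived) echo good ;;
        *) echo "usage: bisect_mark built|build-failed [froze|survived]" >&2
           return 2 ;;
    esac
}

# Typical use inside the kernel tree (not executed here):
#   git bisect start
#   git bisect bad  v5.7-rc1     # assumed endpoints
#   git bisect good v5.6
#   ... build, boot, scroll in Evince for a few minutes ...
#   git bisect "$(bisect_mark built froze)"
```

Marking unbuildable commits as `skip` is why a bisect can end with a
small range of candidate commits instead of a single one.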

First bad commit upstream: bf72c8c6, first merged in torvalds/master
with v5.7-rc1.

    commit bf72c8c6ee77d46f74a2b143303a9c9923f9e7a7 (refs/bisect/bad)
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date:   Thu Jan 30 09:22:38 2020 +0000

        drm/i915/gt: Skip global serialisation of clear_range for bxt vtd

        VT'd on Broxton and on Braswell require serialisation of GGTT updates.
        However, it seems to only be required for insertion, so drop the
        complication and heavyweight stop_machine() for clears. The range will
        be serialised again before use.

        Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
        Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
        Link: https://patchwork.freedesktop.org/patch/msgid/20200130092239.1743672-1-chris@chris-wilson.co.uk

    diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c b/drivers/gpu/drm/i915/gt/intel_ggtt.c
    index fdfed921..f83070b5 100644
    --- a/drivers/gpu/drm/i915/gt/intel_ggtt.c
    +++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c
    @@ -350,31 +350,6 @@ static void bxt_vtd_ggtt_insert_entries__BKL(struct i915_address_space *vm,
            stop_machine(bxt_vtd_ggtt_insert_entries__cb, &arg, NULL);
     }

    -struct clear_range {
    -       struct i915_address_space *vm;
    -       u64 start;
    -       u64 length;
    -};
    -
    -static int bxt_vtd_ggtt_clear_range__cb(void *_arg)
    -{
    -       struct clear_range *arg = _arg;
    -
    -       gen8_ggtt_clear_range(arg->vm, arg->start, arg->length);
    -       bxt_vtd_ggtt_wa(arg->vm);
    -
    -       return 0;
    -}
    -
    -static void bxt_vtd_ggtt_clear_range__BKL(struct i915_address_space *vm,
    -                                         u64 start,
    -                                         u64 length)
    -{
    -       struct clear_range arg = { vm, start, length };
    -
    -       stop_machine(bxt_vtd_ggtt_clear_range__cb, &arg, NULL);
    -}
    -
     static void gen6_ggtt_clear_range(struct i915_address_space *vm,
                                      u64 start, u64 length)
     {
    @@ -881,8 +856,6 @@ static int gen8_gmch_probe(struct i915_ggtt *ggtt)
                IS_CHERRYVIEW(i915) /* fails with concurrent use/update */) {
                    ggtt->vm.insert_entries = bxt_vtd_ggtt_insert_entries__BKL;
                    ggtt->vm.insert_page    = bxt_vtd_ggtt_insert_page__BKL;
    -               if (ggtt->vm.clear_range != nop_clear_range)
    -                       ggtt->vm.clear_range = bxt_vtd_ggtt_clear_range__BKL;
            }

            ggtt->invalidate = gen8_ggtt_invalidate;



How to reproduce:
- BIOS: VT-d enabled.
and
- Kernel releases since v5.7-rc1. I can still reproduce it with drm-tip
  e61e3604 of 2021-09-17 based on v5.15-rc1.
and
- Kernel from:
  - Debian official packages, or
  - Built from Debian linux kernel source (salsa), or
  - Built from pristine upstream patched with these 2 Debian patches:
        features/x86/intel-iommu-add-option-to-exclude-integrated-gpu-only.patch
        features/x86/intel-iommu-add-kconfig-option-to-exclude-igpu-by-default.patch
    or
  - Built from pristine upstream without the 2 Debian patches, but then
    use boot parameter `intel_iommu=on`.
and
- Linux boot parameter:
  - Debian packages, or upstream patched with the 2 Debian patches:
    - No boot parameter, or
    - `intel_iommu=intgpu_off` (Debian-specific, Debian default), or
    - `intel_iommu=on`.
  or
  - With pristine upstream, without Debian patches:
    - `intel_iommu=on`.
and
- The buster-versions of binaries from these sources:
  - mesa      18.3.6-2+deb10u1  (not 20.3.5-1~bpo10+1), and
  - libglvnd  1.1.0-1           (not  1.3.2-1~bpo10+2)
and
- Xorg driver:
  - modesetting: Easiest to reproduce with Evince: Hang in <2 minutes,
    usually <30 seconds. Also reproducible with Firefox with
    Compositing: OpenGL, though it takes longer, see Steps below.
  or
  - intel: Have not been able to reproduce with Evince, only with
    Firefox with Compositing: OpenGL. Sometimes <1 min, sometimes 20
    minutes. See Steps below.

To prevent the bug:
- BIOS: VT-d disabled.
or
- Linux version 5.6 or lower.
or
- Boot parameter `intel_iommu=off` (this is the default for pristine
  upstream without the 2 Debian patches)
or
- The buster-backports-versions of binaries from these sources:
  - mesa       20.3.5-1~bpo10+1, and
  - libglvnd   1.3.2-1~bpo10+2

No influence:
- Boot parameter `intel_iommu=strict iommu.strict=1` does not prevent
  the bug.


Steps to reproduce:

I usually use Xorg modesetting + Evince, because it triggers the bug
fastest and most reliably. Firefox with OpenGL can trigger the bug with
Xorg modesetting or Xorg intel, but takes longer.

- With Evince (with Xorg modesetting, not Xorg intel driver): Open a
  PDF with Evince (tested with 3.30.2-3+deb10u1) and quickly and
  repeatedly scroll up and down the document. System should hang in
  less than 2 minutes, usually even less than 30 seconds.
  - Choose a PDF that can be scrolled through, but is not so big that
    Evince will spend time "Loading..." during scrolling. For example,
    on Debian I used:
    - /usr/share/doc/quilt/quilt.pdf (12 pages)
    - /usr/share/doc/dbconfig-common/dbapp-policy.pdf.gz (7 pages)
- With Firefox (tested with 78.14.0esr-1~deb10u1), in `about:config`:
      gfx.webrender.all                  false
      layers.acceleration.force-enabled  true
      layers.acceleration.disabled       false
  Now `about:support` should show `Compositing: OpenGL`. Now scroll and
  switch between tabs, such as about:config, about:preferences,
  about:performance, about:support, about:performance, some random
  offline documentation, or a random bug on bugs.debian.org. System may
  hang in <1 min, but it may also take 20 minutes (or hours even).
  - Make sure the page is not so long that it goes blank during
    scrolling. The page should be long enough to be scrollable,
    but short enough that the contents stay visible even during fast
    scrolling.
  - Not sure whether it can be reproduced with Firefox with
    Compositing: Basic or Compositing: WebRender.
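
Instead of setting the three Firefox preferences by hand in
`about:config`, they could also be placed in a `user.js` file in the
profile directory, which Firefox reads at startup. The sketch below
assumes this approach; the function name and the profile path in the
comment are hypothetical, and this is not how the preferences were set
for the tests in this report.

```shell
#!/bin/sh
# Append the three preferences from the report to user.js in the given
# Firefox profile directory.
write_opengl_prefs() {
    cat >> "$1/user.js" <<'EOF'
user_pref("gfx.webrender.all", false);
user_pref("layers.acceleration.force-enabled", true);
user_pref("layers.acceleration.disabled", false);
EOF
}

# Example (hypothetical profile path; adjust to your own):
#   write_opengl_prefs "$HOME/.mozilla/firefox/abcd1234.default-esr"
```

After restarting Firefox, `about:support` should still be checked for
`Compositing: OpenGL`, since the preferences alone do not guarantee it.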


Further notes:
- I searched upstream bug tracker for drm/intel a bit and so far found
  only this report that may be related:
  - https://gitlab.freedesktop.org/drm/intel/-/issues/4082
    (System hangs during parallel media transcode operations after
    enabling VT-d)
  However, the discussion there focused more on the fact that they had
  temporarily ignored `intel_iommu=igfx_off` for their CI, causing the
  bug to surface. They did not really look into the underlying issue,
  except:
  > this kernel bug may have been there for longer time, or be HW/FW
  > issue needing WA.
  (I guess HW/FW=Hardware/Firmware, WA=Workaround.)
  and:
  > Complete system hang sounds like a possible hw bug.
  and:
  > I'm more inclined to think the issue being some race condition in
  > kernel, which could trigger BUG/Oops/panic (i.e. cause machine also
  > to drop network connection and not answer pings any more).  IOMMU
  > changes performance a bit, so it could trigger races that do not
  > trigger with IOMMU off.
- Actually, I feel there are two bugs here:
  - One Debian bug, in the 2 Debian intel iommu patches: I would expect
    that the Debian-specific `INTEL_IOMMU_DEFAULT_ON_INTGPU_OFF` would
    result in the same behavior as `intel_iommu=off`, but here it
    actually seems to behave like `intel_iommu=on`. Or maybe I am
    misunderstanding something.
  - One upstream bug, since bf72c8c6, first merged in v5.7-rc1. Still
    in drm-tip e61e3604 of 2021-09-17 based on v5.15-rc1. I thought I
    better first hear what you (the Debian kernel team) have to say
    about this, so I did not report upstream yet.
- I do not know what to make of the fact that mesa and libglvnd from
  buster-backports make the bug disappear. Perhaps this means that once
  I upgrade to Debian 11 bullseye, I will not experience the bug
  anymore anyway. But I thought mesa and libglvnd are like "user-space"
  from the kernel's point-of-view and should never be able to make the
  system freeze, whatever their version. How likely is it that later
  some other user-space program not depending on mesa, or perhaps even
  a later version of mesa, is able to trigger the bug again?
- Attached is a kernel log of a boot without any `intel_iommu` boot
  parameters on which I was able to reproduce the bug. No log data for
  the exact moment the bug is triggered, unfortunately. Note that the
  MCE error does not occur on every boot and I have also seen hangs on
  boots that did not have the MCE error.

Hope I did not forget anything, otherwise I will send more info later.

Thank you for your attention, and for all the work you do on packaging
the kernel. Really impressed by the sheer amount of work you all must
be doing to get all those packages out.

Best regards,
Peter Nowee

#994721#32
Date:
2021-09-21 05:49:45 UTC
From:
To:
reproduce. I was just using a safe environment to practice reportbug
with, when it suddenly sent out the report already.

To reproduce the bug, use `intel_iommu=on`, `intel_iommu=intgpu_off` or
no boot parameter at all, as described in messages #10 and #15.

#994721#37
Date:
2025-02-19 15:10:56 UTC
From:
To:
Hi

This bug was filed for a very old kernel or the bug is old itself
without resolution.

If you can reproduce it with

- the current version in unstable/testing
- the latest kernel from backports

please reopen the bug, see https://www.debian.org/Bugs/server-control
for details.

Regards,
Salvatore
