#991546 linux-image-5.10.0-8-amd64: soft lockup after screen does a sleep cycle (dpms off then on)

Package:
src:linux
Source:
linux
Submitter:
Marc
Date:
2021-07-28 14:45:03 UTC
Severity:
important
Tags:
#991546#5
Date:
2021-07-27 07:19:31 UTC
From:
To:
Dear Maintainer,

After a recent move to sway (Wayland), I observe a BUG whenever the screen is switched off by swayidle.
The screen usually wakes up as expected, but the interface is partially frozen: Firefox is only partially drawn and unresponsive, although I can still run commands in an already opened terminal.

Dmesg usually shows this kind of report a few seconds after the wake-up:

[  101.233424] rcu: INFO: rcu_sched self-detected stall on CPU
[  101.233429] rcu: 	9-....: (5250 ticks this GP) idle=f7e/1/0x4000000000000000 softirq=5115/5115 fqs=2417
[  101.233435] 	(t=5250 jiffies g=11477 q=6538)
[  101.233437] NMI backtrace for cpu 9
[  101.233439] CPU: 9 PID: 1811 Comm: Xwayland Tainted: G S          E     5.11.0-rc4-00375-ga692a610d7ed #6
[  101.233441] Hardware name: Gigabyte Technology Co., Ltd. AB350M-Gaming 3/AB350M-Gaming 3-CF, BIOS F42d 10/18/2019
[  101.233442] Call Trace:
[  101.233444]  <IRQ>
[  101.233446]  dump_stack+0x6b/0x83
[  101.233450]  nmi_cpu_backtrace.cold+0x32/0x69
[  101.233452]  ? lapic_can_unplug_cpu+0x80/0x80
[  101.233456]  nmi_trigger_cpumask_backtrace+0xd7/0xe0
[  101.233459]  rcu_dump_cpu_stacks+0xa5/0xd3
[  101.233462]  rcu_sched_clock_irq.cold+0x202/0x3b3
[  101.233464]  update_process_times+0x8c/0xc0
[  101.233467]  tick_sched_handle+0x22/0x60
[  101.233469]  tick_sched_timer+0x7a/0xa0
[  101.233471]  ? tick_nohz_handler+0xb0/0xb0
[  101.233474]  __hrtimer_run_queues+0x12a/0x270
[  101.233475]  hrtimer_interrupt+0x10e/0x280
[  101.233477]  __sysvec_apic_timer_interrupt+0x5f/0xd0
[  101.233479]  asm_call_irq_on_stack+0x12/0x20
[  101.233482]  </IRQ>
[  101.233482]  sysvec_apic_timer_interrupt+0x72/0x80
[  101.233485]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[  101.233487] RIP: 0010:amdgpu_bo_do_create+0x323/0x4f0 [amdgpu]
[  101.233600] Code: 8b 78 08 48 89 de e8 0c ca 00 00 e9 19 ff ff ff 81 c2 ff 0f 00 00 49 81 c3 ff 0f 00 00 c1 fa 0c 49 81 e3 00 f0 ff ff 48 63 ca <4d> 89 dd 48 89 4c 24 08 e9 54 fd ff ff 48 c1 e8 0c 48 39 83 d8 01
[  101.233602] RSP: 0018:ffffabbb01bebc18 EFLAGS: 00000206
[  101.233604] RAX: 0000000000000004 RBX: ffffabbb01bebd38 RCX: 0000000000000200
[  101.233605] RDX: 0000000000000200 RSI: ffffabbb01bebd38 RDI: ffff9f0628800000
[  101.233606] RBP: ffff9f0628800000 R08: 0000000000000000 R09: 0000000000000000
[  101.233607] R10: 000000000000000a R11: 0000000404000000 R12: ffffabbb01bebd38
[  101.233608] R13: ffffabbb01bebd30 R14: ffff9f0628800000 R15: ffffabbb01bebe28
[  101.233611]  ? amdgpu_bo_do_create+0x29d/0x4f0 [amdgpu]
[  101.233719]  ? asm_sysvec_apic_timer_interrupt+0x12/0x20
[  101.233721]  amdgpu_bo_create+0x40/0x270 [amdgpu]
[  101.233830]  amdgpu_gem_create_ioctl+0x123/0x310 [amdgpu]
[  101.233939]  ? amdgpu_gem_force_release+0x150/0x150 [amdgpu]
[  101.234048]  drm_ioctl_kernel+0xaa/0xf0 [drm]
[  101.234068]  drm_ioctl+0x20f/0x3a0 [drm]
[  101.234087]  ? amdgpu_gem_force_release+0x150/0x150 [amdgpu]
[  101.234196]  ? do_setitimer+0x179/0x210
[  101.234198]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[  101.234303]  __x64_sys_ioctl+0x83/0xb0
[  101.234307]  do_syscall_64+0x33/0x80
[  101.234309]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  101.234311] RIP: 0033:0x7f4feaa02cc7
[  101.234313] Code: 00 00 00 48 8b 05 c9 91 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 99 91 0c 00 f7 d8 64 89 01 48
[  101.234314] RSP: 002b:00007ffe4a48f338 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  101.234316] RAX: ffffffffffffffda RBX: 00007ffe4a48f390 RCX: 00007f4feaa02cc7
[  101.234317] RDX: 00007ffe4a48f390 RSI: 00000000c0206440 RDI: 000000000000000a
[  101.234318] RBP: 00000000c0206440 R08: 00000000ffffffff R09: 00007f4feaaccbe0
[  101.234319] R10: 0000000000000100 R11: 0000000000000246 R12: 0000559ea056b9f0
[  101.234320] R13: 000000000000000a R14: 0000000404000000 R15: 0000000000200000

I was asked on IRC to open this issue.
I have tested with the latest upstream 5.13 kernel and can't reproduce the issue there.
I am currently bisecting to pinpoint the commit that seems to fix it upstream, but as this is also my main computer, I can only do a few iterations per day, so I don't expect to finish before the end of the week.

Here are the exact steps I use to reproduce it:

 - boot
 - ensure swayidle is not running
 - open firefox
 - execute: $ swaymsg "output * dpms off"; sleep 10; swaymsg "output * dpms on"
 - the screen is switched off, then on again
   - if the bug is present, Firefox has its upper part incorrectly drawn (transparent: my sway background shows instead of the toolbar). After a few seconds, dmesg shows the errors above
   - if firefox is correct, then the bug is not showing (I did several cycles without being able to reproduce).
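
The manual steps above could be scripted roughly as follows. This is only a sketch: the function names and the 30-second wait before checking the log are assumptions, not part of the original report, and reading dmesg may require root depending on kernel.dmesg_restrict.

```shell
#!/bin/sh
# Sketch only: function names and delays are illustrative assumptions.

# Succeeds if the kernel log read from stdin contains an RCU stall
# report like the one quoted in this bug.
has_rcu_stall() {
    grep -q 'rcu_sched self-detected stall on CPU'
}

# Switch all outputs off and on again, as in the manual steps.
# Optional argument: seconds to keep the outputs off (default 10).
dpms_cycle() {
    swaymsg "output * dpms off"
    sleep "${1:-10}"
    swaymsg "output * dpms on"
}

# Run one cycle, wait for the kernel to report, then print a verdict.
# Optional argument: seconds to wait before reading dmesg (default 30).
check_once() {
    dpms_cycle 10
    sleep "${1:-30}"
    if dmesg | has_rcu_stall; then
        echo bug
    else
        echo nobug
    fi
}
```

Calling check_once right after a fresh boot (with Firefox open and swayidle not running) prints "bug" or "nobug". A stall message stays in the kernel log until the next boot, so the verdict is only meaningful for the first cycles after booting.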

I'll update this issue when I have a better idea of when upstream changed this behavior.

#991546#10
Date:
2021-07-27 08:43:28 UTC
From:
To:
I'm adding some more info, as I just discovered that my kernel is tainted:

 * SMP kernel oops on an officially SMP incapable processor

I guess the reason is that my Ryzen 1600 has a hardware bug and I need to
disable its C6 state by writing to an MSR.

I don't think this is related to my problem, but I can verify that the
bug still happens with the MSR untouched if you think it is worth it.
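
For reference, the taint letters in the trace ("G S E") correspond to bits in /proc/sys/kernel/tainted. The sketch below lists which bits are set in such a value. The helper name taint_bits is hypothetical. According to the kernel's tainted-kernels documentation, bit 2 is the "S" flag and bit 13 is the "E" flag (unsigned module loaded), so a kernel showing only S and E would report 8196.

```shell
# Print the numbers of the bits set in a taint value, one per line.
# Hypothetical helper; argument: the decimal value from
# /proc/sys/kernel/tainted.
taint_bits() {
    value=$1
    bit=0
    while [ "$value" -gt 0 ]; do
        if [ $((value & 1)) -eq 1 ]; then
            echo "$bit"
        fi
        value=$((value >> 1))
        bit=$((bit + 1))
    done
}
```

On a running system this would be called as: taint_bits "$(cat /proc/sys/kernel/tainted)"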

#991546#15
Date:
2021-07-27 20:28:24 UTC
From:
To:
After bisecting, I get this SHA1 as the first commit that fixes the issue
(at least, it no longer shows up as easily as before). It makes sense, as
the backtrace points into amdgpu and this is a bug fix :)

8<---8<---8<---8<---8<---8<---8<---8<---8<---8<---8<---8<---8<---8<---8<---8<---8<---
89fa15ecdca7eb46a711476b961f70a74765bbe4 is the first nobug commit
commit 89fa15ecdca7eb46a711476b961f70a74765bbe4
Author: Huang Rui <ray.huang@amd.com>
Date:   Sat Jan 30 17:14:30 2021 +0800

    drm/amdgpu: fix the issue that retry constantly once the buffer is oversize

    We cannot modify initial_domain every time while the retry starts. That
    will cause the busy waiting that unable to switch to GTT while the vram
    is not enough.

    Fixes: f8aab60422c3 ("drm/amdgpu: Initialise drm_gem_object_funcs for imported BOs")

    Signed-off-by: Huang Rui <ray.huang@amd.com>
    Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
    Reviewed-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    Cc: stable@vger.kernel.org

 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
8<---8<---8<---8<---8<---8<---8<---8<---8<---8<---8<---8<---8<---8<---8<---8<---8<---

I also tried simply cherry-picking it on top of the v5.10 tag, and that
seems to fix the issue as well.

Here's the bisect log, in case this single SHA1 is not enough:

git bisect start '--term-old=bug' '--term-new=nobug'
# bug: [2c85ebc57b3e1817b6ce1a6b703928e113a90442] Linux 5.10
git bisect bug 2c85ebc57b3e1817b6ce1a6b703928e113a90442
# nobug: [62fb9874f5da54fdb243003b386128037319b219] Linux 5.13
git bisect nobug 62fb9874f5da54fdb243003b386128037319b219
# nobug: [d6560052c2f73db59834e9a3c0aba20579aa7059] Merge tag 'regulator-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
git bisect nobug d6560052c2f73db59834e9a3c0aba20579aa7059
# bug: [345b17acb1aa7a443741d9220f66b30d5ddd7c39] Merge tag 'for-linus-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml
git bisect bug 345b17acb1aa7a443741d9220f66b30d5ddd7c39
# nobug: [56bf6fc266ca14d2b9276c8a62e4ff6783bfe68b] Merge tag 'arm-defconfig-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect nobug 56bf6fc266ca14d2b9276c8a62e4ff6783bfe68b
# bug: [a692a610d7ed632cab31b61d6c350db68a10e574] Merge tag 'block-5.11-2021-01-24' of git://git.kernel.dk/linux-block
git bisect bug a692a610d7ed632cab31b61d6c350db68a10e574
# bug: [badc6ac3212294bd37304c56ddf573c9ba3202e6] Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
git bisect bug badc6ac3212294bd37304c56ddf573c9ba3202e6
# nobug: [295f830e53f4838344c97e12ce69637e2128ca8d] rxrpc: Fix dependency on IPv6 in udp tunnel config
git bisect nobug 295f830e53f4838344c97e12ce69637e2128ca8d
# nobug: [6016bf19b3854b6e70ba9278a7ca0fce75278d3a] Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
git bisect nobug 6016bf19b3854b6e70ba9278a7ca0fce75278d3a
# nobug: [eec79181212c9c2670423400a9e78bb1f0c0075d] Merge tag 'block-5.11-2021-02-05' of git://git.kernel.dk/linux-block
git bisect nobug eec79181212c9c2670423400a9e78bb1f0c0075d
# bug: [dd86e7fa07a3ec33c92c957ea7b642c4702516a0] Merge tag 'pci-v5.11-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
git bisect bug dd86e7fa07a3ec33c92c957ea7b642c4702516a0
# nobug: [97ba0c7413f83ab3b43a5ba05362ecc837fce518] Merge tag 'iommu-fixes-v5.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
git bisect nobug 97ba0c7413f83ab3b43a5ba05362ecc837fce518
# nobug: [cfd4951f935c5504e887ed80abaafba210cc0a44] Merge tag 'amd-drm-fixes-5.11-2021-02-03' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
git bisect nobug cfd4951f935c5504e887ed80abaafba210cc0a44
# nobug: [58180a0cc0c57fe62a799a112f95b60f6935bd96] drm/amd/display: Release DSC before acquiring
git bisect nobug 58180a0cc0c57fe62a799a112f95b60f6935bd96
# nobug: [cd9b0159beb7787bec38eb339ed7bc167d83b4ff] drm/amdgpu: enable freesync for A+A configs
git bisect nobug cd9b0159beb7787bec38eb339ed7bc167d83b4ff
# nobug: [b99a8c8f239d76820bbed33c1a42c381cc1f16db] drm/amdkfd: fix null pointer panic while free buffer in kfd
git bisect nobug b99a8c8f239d76820bbed33c1a42c381cc1f16db
# nobug: [89fa15ecdca7eb46a711476b961f70a74765bbe4] drm/amdgpu: fix the issue that retry constantly once the buffer is oversize
git bisect nobug 89fa15ecdca7eb46a711476b961f70a74765bbe4
# first nobug commit: [89fa15ecdca7eb46a711476b961f70a74765bbe4] drm/amdgpu: fix the issue that retry constantly once the buffer is oversize

Marc
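
The cherry-pick test described above can be driven from the bisect log itself. The sketch below pulls the commit out of the final "first nobug commit" line. The helper name is hypothetical, and "nobug" matches the custom --term-new used in this bisect.

```shell
# Read a git bisect log from stdin and print the full SHA-1 named on
# the "# first nobug commit:" line. Hypothetical helper name.
first_nobug_commit() {
    sed -n 's/^# first nobug commit: \[\([0-9a-f]\{40\}\)\].*/\1/p'
}
```

Assuming the log was saved to a file such as bisect.log (an illustrative name), the fix could then be applied on top of the tag with something like: git checkout -b fix-test v5.10 && git cherry-pick "$(first_nobug_commit < bisect.log)"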

#991546#20
Date:
2021-07-28 05:54:58 UTC
From:
To:
Hi Marc,

Thanks a lot for your investigation.

I wonder if this will really be enough. I'm asking since the commit
you identified is marked as fixing another commit, which was not
backported to the 5.10.y series.

Now that you have identified a possible fix for this issue, I would
suggest reporting it upstream. Can you do that and keep the Debian bug
in the loop?

Regards,
Salvatore
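
The question raised above (whether the commit named in the "Fixes:" tag is present in a given tree) can be checked mechanically. A minimal sketch with hypothetical helper names, using only the Fixes: line quoted in the commit message earlier in this thread:

```shell
# Print the commit IDs named in "Fixes:" lines of a commit message
# read from stdin. Hypothetical helper name.
fixes_targets() {
    sed -n 's/^[[:space:]]*Fixes:[[:space:]]*\([0-9a-f]\{7,40\}\).*/\1/p'
}

# Succeeds if commit $1 is an ancestor of revision $2 in the current
# repository. Hypothetical helper name.
is_in_tree() {
    git merge-base --is-ancestor "$1" "$2"
}
```

In a kernel checkout, "git show -s --format=%B 89fa15ecdca7 | fixes_targets" should print f8aab60422c3, and "is_in_tree f8aab60422c3 v5.10" then tells whether that commit is part of the v5.10 tag. Stable backports get new commit IDs, so for a 5.10.y branch this ancestry test is not sufficient; searching the branch log for the upstream ID (for example with git log --grep=f8aab60422c3) is one way to look for a backport.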

#991546#25
Date:
2021-07-28 06:10:49 UTC
From:
To:
Marc,
it is fixed, but with a view to getting the bugfix included in the
upstream 5.10.y stable series, I guess we need to clarify the
above.

Regards,
Salvatore

#991546#30
Date:
2021-07-28 14:40:49 UTC
From:
To:
I've opened an issue upstream:

https://bugzilla.kernel.org/show_bug.cgi?id=213889

I've tried "bts forwarded ..." as instructed on
https://wiki.debian.org/DebianKernelReportingBugs but I think it only
works for DDs?

Thanks,
Marc
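
As a possible alternative to the bts tool, the same "forwarded" command can be sent by hand as a mail to control@bugs.debian.org. As far as I know the control interface is not restricted to Debian developers, and bts itself only composes and sends such a mail, so it depends on a working local mail setup. The sketch below just prints the mail body. The helper name is hypothetical.

```shell
# Print the body of a mail to control@bugs.debian.org recording where
# a bug was forwarded. Arguments: bug number, upstream URL.
# Hypothetical helper name.
forwarded_control_body() {
    printf 'forwarded %s %s\nthanks\n' "$1" "$2"
}
```

For this bug the call would be: forwarded_control_body 991546 'https://bugzilla.kernel.org/show_bug.cgi?id=213889'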