#1012100 linux-image-5.17-1: KVM LIBVIRT fails to start, slow disk access, and a kernel thread goes wild on Intel Xeon X3430

#1012100#5
Date:
2022-05-30 07:59:06 UTC
From:
To:
Dear Maintainer,

   * What led up to the situation?
     Upgrading my Debian/testing installation with apt -u dist-upgrade
   * What exactly did you do (or not do) that was effective (or
     ineffective)?
     apt -u dist-upgrade
   * What was the outcome of this action?
     The new 5.17-1 kernel has several issues
   * What outcome did you expect instead?
     Working Linux kernel with no issues

Currently in Debian/testing, the new 5.17 kernel has several issues.

My LIBVIRT KVM VM instances fail to load with this new kernel.

Disk access is slow in some programs. For example, showing the KVM VM
instances in virt-manager fails due to broken disk access.

After some hours of uptime, a kernel thread goes wild and uses 100% of the CPU.

I am also running this new kernel (5.17) on a Debian/testing VMware ESXi VPS
instance at my provider's site, where everything works fine. Thus this is most
probably an issue specific to my hardware.

My hardware is the following:

adrian@g6 ~ % inxi -F
System:
  Host: g6.lan.dac Kernel: 5.16.0-6-amd64 arch: x86_64 bits: 64
    Desktop: Enlightenment v: 0.25.3 Distro: Debian GNU/Linux bookworm/sid
Machine:
  Type: Desktop System: HP product: ProLiant ML110 G6 v: N/A
    serial: <superuser required>
  Mobo: Wistron model: ProLiant ML110 G6 serial: <superuser required>
    BIOS: HP v: O27 date: 08/26/2011
CPU:
  Info: quad core model: Intel Xeon X3430 bits: 64 type: MCP cache:
    L2: 1024 KiB
  Speed (MHz): avg: 1635 min/max: 1197/2394 cores: 1: 1692 2: 1617 3: 1649
    4: 1585
Graphics:
  Device-1: AMD Oland GL [FirePro W2100] driver: radeon v: kernel
  Display: x11 server: X.Org v: 1.21.1.3 with: Xwayland v: 22.1.0 driver:
    X: loaded: radeon gpu: radeon resolution: 1920x1200
  OpenGL: renderer: AMD OLAND (DRM 2.50.0 5.16.0-6-amd64 LLVM 13.0.1)
    v: 4.5 Mesa 21.3.8
Audio:
  Device-1: AMD Oland/Hainan/Cape Verde/Pitcairn HDMI Audio [Radeon HD 7000
  Series]
    driver: snd_hda_intel
  Device-2: ASUSTek Xonar U1 Audio Station type: USB
    driver: hid-generic,snd-usb-audio,usbhid
  Sound Server-1: ALSA v: k5.16.0-6-amd64 running: yes
  Sound Server-2: PipeWire v: 0.3.51 running: yes
Network:
  Device-1: Broadcom NetXtreme BCM5723 Gigabit Ethernet PCIe driver: tg3
  IF: eth0 state: up speed: 1000 Mbps duplex: full mac: 64:31:50:d3:c0:f8
  IF-ID-1: br0 state: up speed: 1000 Mbps duplex: unknown
    mac: fe:40:ab:83:94:4a
  IF-ID-2: vnet0 state: unknown speed: 10 Mbps duplex: full
    mac: fe:54:00:c2:24:94
  IF-ID-3: vnet1 state: unknown speed: 10 Mbps duplex: full
    mac: fe:54:00:bf:35:8b
  IF-ID-4: vnet2 state: unknown speed: 10 Mbps duplex: full
    mac: fe:54:00:25:b0:8b
  IF-ID-5: vnet3 state: unknown speed: 10 Mbps duplex: full
    mac: fe:54:00:4a:c8:69
Drives:
  Local Storage: total: 11.79 TiB used: 5.87 TiB (49.8%)
  ID-1: /dev/sda vendor: Toshiba model: Q300 size: 447.13 GiB
  ID-2: /dev/sdb vendor: A-Data model: SP550 size: 447.13 GiB
  ID-3: /dev/sdc vendor: Toshiba model: HDWE140 size: 3.64 TiB
  ID-4: /dev/sdd vendor: Toshiba model: MG06ACA800E size: 7.28 TiB
Partition:
  ID-1: / size: 437.53 GiB used: 91.98 GiB (21.0%) fs: ext4 dev: /dev/sda3
  ID-2: /boot size: 1.2 GiB used: 272.2 MiB (22.2%) fs: ext4 dev: /dev/sda1
  ID-3: /home size: 3.64 TiB used: 1.76 TiB (48.4%) fs: btrfs
    dev: /dev/sdc1
Swap:
  ID-1: swap-1 type: partition size: 46.66 GiB used: 139.5 MiB (0.3%)
    dev: /dev/sdb2
Sensors:
  Permissions: Unable to run ipmi sensors. Root privileges required.
  System Temperatures: cpu: 49.0 C mobo: N/A gpu: radeon temp: 39.0 C
  Fan Speeds (RPM): N/A
Info:
  Processes: 410 Uptime: 1h 1m Memory: 15.62 GiB used: 9.43 GiB (60.4%)
  Shell: Zsh inxi: 3.3.16

Thank you very much for your kind attention.

Sincerely,

Adrian Kiess

#1012100#10
Date:
2022-05-30 09:45:29 UTC
From:
To:
Does everything work correctly with kernel 5.16.0-6 ?
Sid/Unstable currently has kernel 5.17.11 and it would be useful to know if
the issue is still present in that version. Can you test that?

If it is, then hopefully `dmesg` can give some clues. After you've noticed the
described symptoms again, can you do `dmesg --level emerg,alert,crit,err,warn`
and send that to this bug report?
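
A minimal sketch of how that output could be captured into a file for attaching to the report. The output path is an arbitrary choice for illustration, and on systems that restrict the kernel log to root the command has to be run as root or via sudo.

```shell
#!/bin/sh
# Save only the warning-and-worse kernel messages to a file.
# The file name below is a hypothetical choice, not something the report prescribes.
out="${TMPDIR:-/tmp}/dmesg-warnings.txt"

if ! dmesg --level=emerg,alert,crit,err,warn > "$out"; then
    # Reading the kernel log can be restricted to root on some systems.
    echo "dmesg failed; it may need root privileges" >&2
fi

echo "saved to $out"
```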

#1012100#15
Date:
2022-05-30 11:14:26 UTC
From:
To:
Dear Diederik,

yes, it works with the kernel 5.16.0-6, but disk access is still slow.
For example, virt-manager/viewer sometimes needs a minute to connect to
the KVM instances on localhost. But not all applications are this slow;
for example the E-Mail client Sylpheed starts as fast as before and is
operating at fast speed.

I assume there is now another bug in the system as well, not only one
caused by the new kernel. I also reported a separate bug in GDM3:
loading GDM3 after bootup and logging in as a normal user is also very,
very slow.

As you suggested, I installed the kernel 5.17.11 from Debian/unstable
and booted into this kernel.

virt-manager and my KVM VM instances do work again, but one VM instance
failed to load after bootup. I restarted the VM instance, and it is now
also operating fine.

When opening the virt-viewer instance from virt-manager, connecting to
the VM is still very slow with kernel 5.17.11. Something must be wrong
I/O wise.

I attached the dmesg output you requested as a TXT file to this e-mail.

Thank you very much for your answer!

Sincerely,

Adrian Kiess

On Mon, 30 May 2022 11:45:29 +0200 Diederik de Haas <didi.debian@cknow.org> wrote:

#1012100#20
Date:
2022-05-30 12:29:29 UTC
From:
To:
Hi Adrian,

Ok, but that issue was also happening before 5.17 and is not a new problem.
Do you have a(n old) kernel (still) installed which does NOT have this slow
disk access issue? If it happens on all kernel versions, then a hardware issue
becomes much more likely to be the real culprit.

In your initial report I noticed the following:
If not, that may be worth looking into.

Which sounds like the moment lots of files/data is read from disk to initialize
the session, which does point to a disk issue.
But if the initial boot isn't terribly slow as well, that would be odd.
Or is /home mounted from another disk?

Good, that sounds like major progress :) It looks to me that the KVM problem
is now (mostly?) fixed.

That still/all points to a disk problem
1) "[Firmware Warn]: HEST: Duplicated hardware error source ID: 9."
https://lkml.org/lkml/2011/6/27/370 seems relevant for that as it provided the
better warning, but it also points out that it *is* considered a firmware bug.
I noticed your BIOS is from 2011. Is there a newer version available? If so,
it may be worth trying that out to see if that improves things.

2) Several ACPI related warnings.
No idea if or what should be done with that.

3) "kvm: VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL does not work properly. Using
workaround" and "kvm: KVM_SET_TSS_ADDR need to be called before entering vcpu"
That looks like there are still KVM related issues (just not or less fatal)
There have been other bug reports related to KVM.

4) BUG: kernel NULL pointer dereference, address: 000000000000000b
That's never good. The dmesg output also contains a Call Trace and several
mentions of KVM, so it looks like there's still something not right about it.
I have no idea how to interpret those Call (or Stack) Traces, so hopefully
someone else chimes in who is familiar with that.

Cheers,
  Diederik

#1012100#25
Date:
2022-05-30 12:43:32 UTC
From:
To:
Dear Diederik,

I booted again into kernel 5.16.0-6, since the kernel 5.17.11 from
Debian/unstable also crashes when using LIBVIRT KVM.

I am now in Kernel 5.16.0-6 and the kernel oops seems not to happen
here!

I attached the output of journalctl with the Kernel oopses from Kernel
5.17.11 to this E-Mail.

The Kernel oops in Kernel 5.17.11 makes my system unusable after
running for a while longer.

The kernel oops seems to be this:

mai 30 14:09:10 g6.lan.dac kernel: RIP: 0010:kvm_replace_memslot+0xcf/0x390 [kvm]
mai 30 14:09:10 g6.lan.dac kernel: Code: 44 24 08 48 85 db 0f 84 3b 02 00 00 48 89 ea 48 c1 e2 04 48 01 da 48 8b 4a 08 48 85 c9 74 1e 48 8b 32 48 89 31 48 85 f6 74 04 <48> 89 4e 08 48 c7 02 00 00 00 00 48 c7 42 08 00 00 00 00 48 8d 54
mai 30 14:09:10 g6.lan.dac kernel: RSP: 0018:ffffa1fd8904fd70 EFLAGS: 00010206
mai 30 14:09:10 g6.lan.dac kernel: RAX: ffffa1fd89069058 RBX: ffff955df85d7000 RCX: ffffa1fd89069298
mai 30 14:09:10 g6.lan.dac kernel: RDX: ffff955df85d7000 RSI: 0000000000000003 RDI: ffffa1fd89069000
mai 30 14:09:10 g6.lan.dac kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
mai 30 14:09:10 g6.lan.dac kernel: R10: 0000000000000000 R11: 0000000000000004 R12: 0000000000000000
mai 30 14:09:10 g6.lan.dac kernel: R13: 0000000000000000 R14: 0000000000000000 R15: ffffa1fd89069000
mai 30 14:09:10 g6.lan.dac kernel: FS:  00007ff61d140640(0000) GS:ffff95606fd80000(0000) knlGS:0000000000000000
mai 30 14:09:10 g6.lan.dac kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
mai 30 14:09:10 g6.lan.dac kernel: CR2: 000000000000000b CR3: 000000024deaa000 CR4: 00000000000026e0
mai 30 14:09:10 g6.lan.dac kernel: Call Trace:
mai 30 14:09:10 g6.lan.dac kernel:  <TASK>
mai 30 14:09:10 g6.lan.dac kernel:  ? _raw_read_unlock+0x18/0x30
mai 30 14:09:10 g6.lan.dac kernel:  kvm_set_memslot+0x3c2/0x4a0 [kvm]
mai 30 14:09:10 g6.lan.dac kernel:  kvm_vm_ioctl+0x2cb/0xd80 [kvm]
mai 30 14:09:10 g6.lan.dac kernel:  ? __seccomp_filter+0x38c/0x5a0
mai 30 14:09:10 g6.lan.dac kernel:  __x64_sys_ioctl+0x82/0xb0
mai 30 14:09:10 g6.lan.dac kernel:  do_syscall_64+0x3b/0xc0
mai 30 14:09:10 g6.lan.dac kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
mai 30 14:09:10 g6.lan.dac kernel: RIP: 0033:0x7ff62085a397
mai 30 14:09:10 g6.lan.dac kernel: Code: 3c 1c e8 1c ff ff ff 85 c0 79 87 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a9 da 0d 00 f7 d8 64 89 01 48
mai 30 14:09:10 g6.lan.dac kernel: RSP: 002b:00007ff61d13ef98 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
mai 30 14:09:10 g6.lan.dac kernel: RAX: ffffffffffffffda RBX: 000000004020ae46 RCX: 00007ff62085a397
mai 30 14:09:10 g6.lan.dac kernel: RDX: 00007ff61d13f060 RSI: 000000004020ae46 RDI: 000000000000000e
mai 30 14:09:10 g6.lan.dac kernel: RBP: 000056326db4b1b0 R08: 0000000000000000 R09: 0000000000100000
mai 30 14:09:10 g6.lan.dac kernel: R10: 0000000000100000 R11: 0000000000000246 R12: 00007ff61d13f060
mai 30 14:09:10 g6.lan.dac kernel: R13: 000000007ff00000 R14: 000056326db25460 R15: 0000000000100000
mai 30 14:09:10 g6.lan.dac kernel:  </TASK>
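
For readers who want a quick overview of which functions appear in the trace above, the following is a rough sketch that lists the symbol+offset/size entries from such log text. The helper name oops_symbols is hypothetical, the pattern is a simple heuristic rather than a full oops parser, and it assumes a grep that supports -o (as GNU grep does).

```shell
#!/bin/sh
# List the symbol+offset/size entries found in an oops or call trace.
# Reads log text on stdin and prints one entry per line, in input order.
oops_symbols() {
    grep -oE '[A-Za-z_][A-Za-z0-9_.]*\+0x[0-9a-f]+/0x[0-9a-f]+'
}

# Example input: two lines taken from the trace in this message.
sample='mai 30 14:09:10 g6.lan.dac kernel: RIP: 0010:kvm_replace_memslot+0xcf/0x390 [kvm]
mai 30 14:09:10 g6.lan.dac kernel:  kvm_set_memslot+0x3c2/0x4a0 [kvm]'

symbols=$(printf '%s\n' "$sample" | oops_symbols)
printf '%s\n' "$symbols"
```

On a system with a persistent journal, the same helper could be fed from the previous boot with `journalctl -k -b -1 | oops_symbols`.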

Thank you very much!

Adrian Kiess

On Mon, 30 May 2022 14:29:29 +0200 Diederik de Haas <didi.debian@cknow.org> wrote:

#1012100#30
Date:
2022-05-30 18:30:07 UTC
From:
To:
That dmesg output sounds a lot like bugs 1010916 and 1011168 and a Xeon
X3430 CPU would be another older one that predates XSAVE.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1011168#15

#1012100#35
Date:
2022-06-20 15:19:36 UTC
From:
To:
Dear Diederik,

the new kernel:

root@g6 /opt # uname -a
Linux g6.lan.dac 5.18.0-1-amd64 #1 SMP PREEMPT_DYNAMIC Debian 5.18.2-1
(2022-06-06) x86_64 GNU/Linux

in Debian/testing works again, in the sense that LIBVIRT KVM works again!

I most probably found the reason for the slow disk access on my machine:

Please see this new bug report:

Debian Bug report logs - #1013260
coreutils: /bin/chown very slow in conjunction with storebackup

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1013260

Thank you very much for your answer!

Sincerely,

Adrian Kieß

On Mon, 30 May 2022 14:29:29 +0200 Diederik de Haas <didi.debian@cknow.org> wrote:




#1012100#40
Date:
2022-06-20 16:39:31 UTC
From:
To:
Version: linux/5.18.2-1

Excellent! Closing this bug with that version then. Thanks for reporting back.

I took a quick look at that bug and would suggest checking whether a 'chown'
operation on some file (can be a temp file) also takes that long.
Neither the Debian package nor upstream of storebackup looks too healthy
to me (the Debian package hasn't had a real update in 10 years), so it seems
worthwhile to verify whether the issue also occurs outside storebackup.
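
A minimal sketch of such a check, timing a single chown on a scratch file. It assumes GNU date (for the %N nanosecond format) and awk are available; the variable names are illustrative only. Since mktemp honours TMPDIR, the scratch file can be placed on the filesystem where the slowness was seen.

```shell
#!/bin/sh
# Time one chown on a scratch file, independent of storebackup.
scratch=$(mktemp)

# Re-assigning the current owner and group does not require root.
owner="$(id -u):$(id -g)"

start=$(date +%s.%N)   # GNU date: seconds with nanoseconds
chown "$owner" "$scratch"
chown_status=$?
end=$(date +%s.%N)

rm -f "$scratch"

# Subtract the two timestamps; awk handles the fractional arithmetic.
elapsed=$(awk -v s="$start" -v e="$end" 'BEGIN { printf "%.3f", e - s }')
echo "chown exit status: $chown_status, elapsed: ${elapsed}s"
```

If a plain chown like this returns almost instantly while the backup run stays slow, that would point away from coreutils itself.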

Also, coreutils was last updated 7 months ago and I got the impression the
problems are (far) more recent. Oftentimes a new problem coincides with a new
program version.

You're welcome and good luck!