Dear Maintainer,
* What led up to the situation?
Upgrading my Debian/testing installation with apt -u dist-upgrade
* What exactly did you do (or not do) that was effective (or
ineffective)?
apt -u dist-upgrade
* What was the outcome of this action?
The new 5.17-1 kernel has several issues
* What outcome did you expect instead?
Working Linux kernel with no issues
Currently in Debian/testing, the new 5.17 kernel has several issues.
My LIBVIRT KVM VM instances fail to load with this new kernel.
Disk access is slow in some programs; for example, showing the KVM VM
instances in virt-manager fails due to broken disk access.
After some hours of uptime, a kernel thread goes wild and uses 100% of the CPU.
I am also running this new kernel (5.17) on a Debian/testing VMWARE ESXI VPS
instance at my provider's site, where everything works fine. Thus this is most
probably an issue with my hardware.
My hardware is the following:
adrian@g6 ~ % inxi -F
System:
Host: g6.lan.dac Kernel: 5.16.0-6-amd64 arch: x86_64 bits: 64
Desktop: Enlightenment v: 0.25.3 Distro: Debian GNU/Linux bookworm/sid
Machine:
Type: Desktop System: HP product: ProLiant ML110 G6 v: N/A
serial: <superuser required>
Mobo: Wistron model: ProLiant ML110 G6 serial: <superuser required>
BIOS: HP v: O27 date: 08/26/2011
CPU:
Info: quad core model: Intel Xeon X3430 bits: 64 type: MCP cache:
L2: 1024 KiB
Speed (MHz): avg: 1635 min/max: 1197/2394 cores: 1: 1692 2: 1617 3: 1649
4: 1585
Graphics:
Device-1: AMD Oland GL [FirePro W2100] driver: radeon v: kernel
Display: x11 server: X.Org v: 1.21.1.3 with: Xwayland v: 22.1.0 driver:
X: loaded: radeon gpu: radeon resolution: 1920x1200
OpenGL: renderer: AMD OLAND (DRM 2.50.0 5.16.0-6-amd64 LLVM 13.0.1)
v: 4.5 Mesa 21.3.8
Audio:
Device-1: AMD Oland/Hainan/Cape Verde/Pitcairn HDMI Audio [Radeon HD 7000
Series]
driver: snd_hda_intel
Device-2: ASUSTek Xonar U1 Audio Station type: USB
driver: hid-generic,snd-usb-audio,usbhid
Sound Server-1: ALSA v: k5.16.0-6-amd64 running: yes
Sound Server-2: PipeWire v: 0.3.51 running: yes
Network:
Device-1: Broadcom NetXtreme BCM5723 Gigabit Ethernet PCIe driver: tg3
IF: eth0 state: up speed: 1000 Mbps duplex: full mac: 64:31:50:d3:c0:f8
IF-ID-1: br0 state: up speed: 1000 Mbps duplex: unknown
mac: fe:40:ab:83:94:4a
IF-ID-2: vnet0 state: unknown speed: 10 Mbps duplex: full
mac: fe:54:00:c2:24:94
IF-ID-3: vnet1 state: unknown speed: 10 Mbps duplex: full
mac: fe:54:00:bf:35:8b
IF-ID-4: vnet2 state: unknown speed: 10 Mbps duplex: full
mac: fe:54:00:25:b0:8b
IF-ID-5: vnet3 state: unknown speed: 10 Mbps duplex: full
mac: fe:54:00:4a:c8:69
Drives:
Local Storage: total: 11.79 TiB used: 5.87 TiB (49.8%)
ID-1: /dev/sda vendor: Toshiba model: Q300 size: 447.13 GiB
ID-2: /dev/sdb vendor: A-Data model: SP550 size: 447.13 GiB
ID-3: /dev/sdc vendor: Toshiba model: HDWE140 size: 3.64 TiB
ID-4: /dev/sdd vendor: Toshiba model: MG06ACA800E size: 7.28 TiB
Partition:
ID-1: / size: 437.53 GiB used: 91.98 GiB (21.0%) fs: ext4 dev: /dev/sda3
ID-2: /boot size: 1.2 GiB used: 272.2 MiB (22.2%) fs: ext4 dev: /dev/sda1
ID-3: /home size: 3.64 TiB used: 1.76 TiB (48.4%) fs: btrfs
dev: /dev/sdc1
Swap:
ID-1: swap-1 type: partition size: 46.66 GiB used: 139.5 MiB (0.3%)
dev: /dev/sdb2
Sensors:
Permissions: Unable to run ipmi sensors. Root privileges required.
System Temperatures: cpu: 49.0 C mobo: N/A gpu: radeon temp: 39.0 C
Fan Speeds (RPM): N/A
Info:
Processes: 410 Uptime: 1h 1m Memory: 15.62 GiB used: 9.43 GiB (60.4%)
Shell: Zsh inxi: 3.3.16
Thank you very much for your kind attention.
Sincerely,
Adrian Kiess
Does everything work correctly with kernel 5.16.0-6?
Sid/Unstable currently has kernel 5.17.11, and it would be useful to know whether the issue is still present in that version. Can you test that?
If it is, then hopefully `dmesg` can give some clues. After you've noticed the described symptoms again, can you run `dmesg --level emerg,alert,crit,err,warn` and send the output to this bug report?
Dear Diederik,
yes, it works with kernel 5.16.0-6, but disk access is still slow. For example, virt-manager/viewer sometimes needs a minute to connect to the KVM instances on localhost. But not all applications are this slow; for example, the e-mail client Sylpheed starts as fast as before and runs at full speed.
I assume there is now another bug in the system as well, not only one caused by the new kernel. There is also a bug in GDM3, which I have reported separately: loading GDM3 after bootup and logging in as a normal user is also very, very slow.
As you suggested, I installed kernel 5.17.11 from Debian/unstable and booted into it. virt-manager and my KVM VM instances work again, but one VM instance failed to load after bootup. I restarted that VM instance, and it is now also operating fine.
When opening the virt-viewer instance from virt-manager, connecting to the VM is still very slow with kernel 5.17.11. Something must be wrong I/O-wise.
I attached the dmesg output you requested as a TXT file to this e-mail.
Thank you very much for your answer!
Sincerely,
Adrian Kiess
On Mon, 30 May 2022 11:45:29 +0200 Diederik de Haas <didi.debian@cknow.org> wrote:
Hi Adrian,
Ok, but that issue was also happening before 5.17 and is not a new problem. Do you have a(n old) kernel (still) installed which does NOT have this slow disk access issue? If it happens on all kernel versions, then a hardware issue becomes much more likely to be the real culprit.
In your initial report I noticed the following:
If not, that may be worth looking into.
Which sounds like the moment lots of files/data is read from disk to initialize the session, which does point to a disk issue. But if the initial boot isn't terribly slow as well, that would be odd. Or is /home mounted from another disk?
Good, that sounds like major progress :) It looks to me that the KVM problem is now (mostly?) fixed.
That still/all points to a disk problem
1) "[Firmware Warn]: HEST: Duplicated hardware error source ID: 9."
https://lkml.org/lkml/2011/6/27/370 seems relevant for that as it provided the better warning, but it also points out that it *is* considered a firmware bug. I noticed your BIOS is from 2011. Is there a newer version available? If so, it may be worth trying that out to see if that improves things.
2) Several ACPI related warnings. No idea if or what should be done with that.
3) "kvm: VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL does not work properly. Using workaround" and "kvm: KVM_SET_TSS_ADDR need to be called before entering vcpu"
That looks like there are still KVM related issues (just not or less fatal). There have been other bug reports related to KVM.
4) BUG: kernel NULL pointer dereference, address: 000000000000000b
That's never good. The dmesg output also contains a Call Trace and several mentions of KVM, so it looks like there's still something not right about it. I have no idea how to interpret those Call (or Stack) Traces, so hopefully someone else chimes in who is familiar with that.
Cheers,
Diederik
Dear Diederik,
I booted into kernel 5.16.0-6 again, since kernel 5.17.11 from Debian/unstable also crashes when using LIBVIRT KVM.
I am now on kernel 5.16.0-6, and the kernel oops does not seem to happen here!
I attached the journalctl output with the kernel oopses from kernel 5.17.11 to this e-mail.
The kernel oops in kernel 5.17.11 makes my system unusable after it has been running for a while.
The kernel oops seems to be this:
mai 30 14:09:10 g6.lan.dac kernel: RIP: 0010:kvm_replace_memslot+0xcf/0x390 [kvm]
mai 30 14:09:10 g6.lan.dac kernel: Code: 44 24 08 48 85 db 0f 84 3b 02 00 00 48 89 ea 48 c1 e2 04 48 01 da 48 8b 4a 08 48 85 c9 74 1e 48 8b 32 48 89 31 48 85 f6 74 04 <48> 89 4e 08 48 c7 02 00 00 00 00 48 c7 42 08 00 00 00 00 48 8d 54
mai 30 14:09:10 g6.lan.dac kernel: RSP: 0018:ffffa1fd8904fd70 EFLAGS: 00010206
mai 30 14:09:10 g6.lan.dac kernel: RAX: ffffa1fd89069058 RBX: ffff955df85d7000 RCX: ffffa1fd89069298
mai 30 14:09:10 g6.lan.dac kernel: RDX: ffff955df85d7000 RSI: 0000000000000003 RDI: ffffa1fd89069000
mai 30 14:09:10 g6.lan.dac kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
mai 30 14:09:10 g6.lan.dac kernel: R10: 0000000000000000 R11: 0000000000000004 R12: 0000000000000000
mai 30 14:09:10 g6.lan.dac kernel: R13: 0000000000000000 R14: 0000000000000000 R15: ffffa1fd89069000
mai 30 14:09:10 g6.lan.dac kernel: FS: 00007ff61d140640(0000) GS:ffff95606fd80000(0000) knlGS:0000000000000000
mai 30 14:09:10 g6.lan.dac kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
mai 30 14:09:10 g6.lan.dac kernel: CR2: 000000000000000b CR3: 000000024deaa000 CR4: 00000000000026e0
mai 30 14:09:10 g6.lan.dac kernel: Call Trace:
mai 30 14:09:10 g6.lan.dac kernel: <TASK>
mai 30 14:09:10 g6.lan.dac kernel: ? _raw_read_unlock+0x18/0x30
mai 30 14:09:10 g6.lan.dac kernel: kvm_set_memslot+0x3c2/0x4a0 [kvm]
mai 30 14:09:10 g6.lan.dac kernel: kvm_vm_ioctl+0x2cb/0xd80 [kvm]
mai 30 14:09:10 g6.lan.dac kernel: ? __seccomp_filter+0x38c/0x5a0
mai 30 14:09:10 g6.lan.dac kernel: __x64_sys_ioctl+0x82/0xb0
mai 30 14:09:10 g6.lan.dac kernel: do_syscall_64+0x3b/0xc0
mai 30 14:09:10 g6.lan.dac kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
mai 30 14:09:10 g6.lan.dac kernel: RIP: 0033:0x7ff62085a397
mai 30 14:09:10 g6.lan.dac kernel: Code: 3c 1c e8 1c ff ff ff 85 c0 79 87 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a9 da 0d 00 f7 d8 64 89 01 48
mai 30 14:09:10 g6.lan.dac kernel: RSP: 002b:00007ff61d13ef98 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
mai 30 14:09:10 g6.lan.dac kernel: RAX: ffffffffffffffda RBX: 000000004020ae46 RCX: 00007ff62085a397
mai 30 14:09:10 g6.lan.dac kernel: RDX: 00007ff61d13f060 RSI: 000000004020ae46 RDI: 000000000000000e
mai 30 14:09:10 g6.lan.dac kernel: RBP: 000056326db4b1b0 R08: 0000000000000000 R09: 0000000000100000
mai 30 14:09:10 g6.lan.dac kernel: R10: 0000000000100000 R11: 0000000000000246 R12: 00007ff61d13f060
mai 30 14:09:10 g6.lan.dac kernel: R13: 000000007ff00000 R14: 000056326db25460 R15: 0000000000100000
mai 30 14:09:10 g6.lan.dac kernel: </TASK>
Thank you very much!
Adrian Kiess
On Mon, 30 May 2022 14:29:29 +0200 Diederik de Haas <didi.debian@cknow.org> wrote:
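For readers who, like Diederik, find these traces hard to read, here is a minimal sketch that pulls the faulting function and the call chain out of journalctl-style oops lines. The helper name `summarize_oops` and both regular expressions are hypothetical and only assume the line layout shown in the log above; a leading `?` marks a frame the kernel's unwinder considers unreliable.

```python
import re

# Hypothetical patterns, written against the journalctl line layout quoted in
# this report ("<date> <host> kernel: <symbol>+0xOFF/0xSIZE [module]").
RIP_RE = re.compile(
    r"kernel:\s+RIP:\s+[0-9a-f]{4}:([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(\w+)\])?"
)
FRAME_RE = re.compile(
    r"kernel:\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(\w+)\])?\s*$"
)


def summarize_oops(log_text):
    """Return ((rip_symbol, rip_module) or None, list of call-trace frames)."""
    rip = None
    frames = []
    for line in log_text.splitlines():
        match = RIP_RE.search(line)
        if match and rip is None:
            # First symbolic RIP line: the kernel function that faulted.
            rip = (match.group(1), match.group(2))
            continue
        match = FRAME_RE.search(line)
        if match:
            frames.append({
                "symbol": match.group(2),
                "module": match.group(3),
                # Frames prefixed with "?" are guesses by the unwinder.
                "reliable": match.group(1) is None,
            })
    return rip, frames


# A few lines copied from the oops in this report.
SAMPLE = """\
mai 30 14:09:10 g6.lan.dac kernel: RIP: 0010:kvm_replace_memslot+0xcf/0x390 [kvm]
mai 30 14:09:10 g6.lan.dac kernel: Call Trace:
mai 30 14:09:10 g6.lan.dac kernel: <TASK>
mai 30 14:09:10 g6.lan.dac kernel: ? _raw_read_unlock+0x18/0x30
mai 30 14:09:10 g6.lan.dac kernel: kvm_set_memslot+0x3c2/0x4a0 [kvm]
mai 30 14:09:10 g6.lan.dac kernel: kvm_vm_ioctl+0x2cb/0xd80 [kvm]
mai 30 14:09:10 g6.lan.dac kernel: ? __seccomp_filter+0x38c/0x5a0
mai 30 14:09:10 g6.lan.dac kernel: __x64_sys_ioctl+0x82/0xb0
mai 30 14:09:10 g6.lan.dac kernel: do_syscall_64+0x3b/0xc0
mai 30 14:09:10 g6.lan.dac kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
mai 30 14:09:10 g6.lan.dac kernel: RIP: 0033:0x7ff62085a397
mai 30 14:09:10 g6.lan.dac kernel: </TASK>
"""

rip, frames = summarize_oops(SAMPLE)
print("faulting function:", rip)
print("call chain:", [f["symbol"] for f in frames if f["reliable"]])
```

Read bottom-up, the reliable frames say that an ioctl system call entered `kvm_vm_ioctl`, which called `kvm_set_memslot`, and the fault happened inside `kvm_replace_memslot` in the `kvm` module. The second `RIP:` line is the user-space return address and carries no symbol, so the sketch ignores it.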
That dmesg output sounds a lot like bugs 1010916 and 1011168, and a Xeon X3430 CPU would be another older one that predates XSAVE.
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1011168#15
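Whether a given machine falls into that "predates XSAVE" group can be checked from the CPU flags the kernel reports. The following is a minimal sketch, assuming an x86 Linux system where `/proc/cpuinfo` contains a `flags` line; the function names `cpu_flags` and `has_xsave` are made up for illustration.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # On x86 the feature list is on the line whose key is "flags".
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_xsave(cpuinfo_text):
    """True if the CPU advertises the XSAVE feature (flag name: xsave)."""
    # Whole-word membership test, so "xsaveopt" alone does not count.
    return "xsave" in cpu_flags(cpuinfo_text)


# Usage on a live system (assumption: x86 Linux with /proc mounted):
#     with open("/proc/cpuinfo") as fh:
#         print("xsave supported:", has_xsave(fh.read()))
```

If this reports no `xsave` flag on the affected host, that would be consistent with the suspicion that the crash belongs to the same family as the linked bugs, though it does not prove it.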
Dear Diederik,
the new kernel in Debian/testing,
root@g6 /opt # uname -a
Linux g6.lan.dac 5.18.0-1-amd64 #1 SMP PREEMPT_DYNAMIC Debian 5.18.2-1 (2022-06-06) x86_64 GNU/Linux
works again, in the sense that LIBVIRT KVM works again!
I have most probably found the reason for the slow disk access on my machine. Please see this new bug report:
Debian Bug report logs - #1013260
coreutils: /bin/chown very slow in conjunction with storebackup
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1013260
Thank you very much for your answer!
Sincerely,
Adrian Kieß
On Mon, 30 May 2022 14:29:29 +0200 Diederik de Haas <didi.debian@cknow.org> wrote:
echo "g6.lan.dac uptime: " && /usr/bin/uptime
Version: linux/5.18.2-1
Excellent! Closing this bug with that version then. Thanks for reporting back.
I took a quick look at that bug, and I would suggest checking whether a 'chown' operation on some file (can be a temp file) also takes that long. Neither the Debian package nor upstream of storebackup looks too healthy to me (the Debian package hasn't had a real update in 10 years), so it seems worthwhile to verify whether the issue also occurs outside storebackup. Also, coreutils was last updated 7 months ago and I got the impression the problems are (far) more recent. Oftentimes a new problem coincides with a new program version.
You're welcome and good luck!
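The suggested check, timing a single chown on a throwaway file outside storebackup, could be sketched as follows. This is only an illustration: the function name `time_chown` and its parameters are hypothetical, the file is "changed" to the caller's own user and group so no root is needed, and `use_binary=True` assumes a `chown` command is on `PATH`.

```python
import os
import subprocess
import tempfile
import time


def time_chown(directory=None, use_binary=False):
    """Time one chown of a fresh temp file; returns elapsed seconds.

    directory  -- where to create the temp file (None = system default);
                  pass a path on the suspect filesystem to test that disk.
    use_binary -- run the external `chown` command instead of the
                  os.chown() system call wrapper.
    """
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    uid, gid = os.getuid(), os.getgid()
    try:
        start = time.perf_counter()
        if use_binary:
            # Times process startup plus whatever the chown command does.
            subprocess.run(["chown", f"{uid}:{gid}", path], check=True)
        else:
            # Times only the system call on the filesystem.
            os.chown(path, uid, gid)
        return time.perf_counter() - start
    finally:
        os.unlink(path)


print(f"syscall chown took {time_chown():.6f} s")
```

Comparing `time_chown(directory)` with `time_chown(directory, use_binary=True)` on the same directory gives a rough hint whether any delay comes from the filesystem itself or from the chown command; it does not reproduce the storebackup workload.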