Dear Maintainer,
* What led up to the situation?
A shutdown or reboot leads to a kernel crash at almost the last step. The problem is 100% reproducible. After the crash the machine is frozen/locked. It must be manually reset or powered off.
* What exactly did you do (or not do) that was effective (or
ineffective)?
shutdown -r now
* What was the outcome of this action?
The system goes through the shutdown routines as normal. When it reaches the (second to last) message from hpwdt about an unexpected close, the next (and final) message should say that the system is rebooting. Instead, the kernel crashes.
[ 2024.457121] hpwdt: Unexpected close, not stopping watchdog!
[ 2026.782750] Kernel panic - not syncing: An NMI occurred. Depending on your system the reason for the NMI is logged in any one of the following resources:
[ 2026.782750] 1. Integrated Management Log (IML)
[ 2026.782750] 2. OA Syslog
[ 2026.782750] 3. OA Forward Progress Log
[ 2026.782750] 4. iLO Event Log
[ 2026.782755] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.16.0-0.bpo.1-amd64 #1 Debian 4.16.5-1~bpo9+1
[ 2026.782756] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 02/10/2014
[ 2026.782758] Call Trace:
[ 2026.782763] <NMI>
[ 2026.782784] dump_stack+0x5c/0x85
[ 2026.782800] panic+0xe4/0x252
[ 2026.782807] nmi_panic+0x35/0x40
[ 2026.782812] hpwdt_pretimeout+0x27/0x48 [hpwdt]
[ 2026.782816] nmi_handle+0x72/0x120
[ 2026.782820] io_check_error+0x16/0x90
[ 2026.782823] default_do_nmi+0xa2/0x100
[ 2026.782825] do_nmi+0xe5/0x130
[ 2026.782829] end_repeat_nmi+0x16/0x50
[ 2026.782836] RIP: 0010:intel_idle+0x76/0x120
[ 2026.782837] RSP: 0018:ffffffffa9c03e50 EFLAGS: 00000046
[ 2026.782840] RAX: 0000000000000030 RBX: 0000000000000030 RCX: 0000000000000001
[ 2026.782842] RDX: 0000000000000000 RSI: ffffffffa9cb2d20 RDI: 0000000000000000
[ 2026.782843] RBP: 0000000000000005 R08: 00000000ffffffff R09: 0000000000000008
[ 2026.782845] R10: 0000000000000d12 R11: 0000000000000f29 R12: ffffffffa9cb2f18
[ 2026.782847] R13: 0000000000000005 R14: 0000000000000005 R15: 000001d7ac91a64e
[ 2026.782853] ? intel_idle+0x76/0x120
[ 2026.782857] ? intel_idle+0x76/0x120
[ 2026.782858] </NMI>
[ 2026.782863] cpuidle_enter_state+0x72/0x2b0
[ 2026.782868] do_idle+0x193/0x200
[ 2026.782872] cpu_startup_entry+0x6f/0x80
[ 2026.782877] start_kernel+0x458/0x478
[ 2026.782884] secondary_startup_64+0xa5/0xb0
[ 2026.915611] Kernel Offset: 0x27c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 2027.554878] ---[ end Kernel panic - not syncing: An NMI occurred. Depending on your system the reason for the NMI is logged in any one of the following resources:
[ 2027.554878] 1. Integrated Management Log (IML)
[ 2027.554878] 2. OA Syslog
[ 2027.554878] 3. OA Forward Progress Log
[ 2027.554878] 4. iLO Event Log
* What outcome did you expect instead?
The system should reboot or shut down.
[ OK ] Stopped Remount Root and Kernel File Systems.
[ OK ] Reached target Shutdown.
[ 1125.048758] hpwdt: Unexpected close, not stopping watchdog!
[ 1129.467546] Kernel panic - not syncing: An NMI occurred. Depending on your system the reason for the NMI is logged in any one of the following resources:
[ 1129.467546] 1. Integrated Management Log (IML)
[ 1129.467546] 2. OA Syslog
[ 1129.467546] 3. OA Forward Progress Log
[ 1129.467546] 4. iLO Event Log
[ 1129.467551] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.16.0-0.bpo.2-amd64 #1 Debian 4.16.12-1~bpo9+1
[ 1129.467553] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 02/10/2014
[ 1129.467554] Call Trace:
[ 1129.467559] <NMI>
[ 1129.467569] dump_stack+0x5c/0x85
[ 1129.467575] panic+0xe4/0x252
[ 1129.467581] nmi_panic+0x35/0x40
[ 1129.467587] hpwdt_pretimeout+0x27/0x48 [hpwdt]
[ 1129.467591] nmi_handle+0x72/0x120
[ 1129.467594] io_check_error+0x16/0x90
[ 1129.467597] default_do_nmi+0xa2/0x100
[ 1129.467600] do_nmi+0xe5/0x130
[ 1129.467604] end_repeat_nmi+0x16/0x50
[ 1129.467610] RIP: 0010:intel_idle+0x76/0x120
[ 1129.467612] RSP: 0018:ffffffffa9403e50 EFLAGS: 00000046
[ 1129.467615] RAX: 0000000000000030 RBX: 0000000000000030 RCX: 0000000000000001
[ 1129.467617] RDX: 0000000000000000 RSI: ffffffffa94b2ba0 RDI: 0000000000000000
[ 1129.467618] RBP: 0000000000000005 R08: 0000000000000f05 R09: 0000000000000018
[ 1129.467620] R10: 0000000000000c62 R11: 0000000000000f05 R12: ffffffffa94b2d98
[ 1129.467621] R13: 0000000000000005 R14: 0000000000000005 R15: 00000106b559f2b2
[ 1129.467628] ? intel_idle+0x76/0x120
[ 1129.467631] ? intel_idle+0x76/0x120
[ 1129.467633] </NMI>
[ 1129.467638] cpuidle_enter_state+0x72/0x2b0
[ 1129.467643] do_idle+0x193/0x200
[ 1129.467646] cpu_startup_entry+0x6f/0x80
[ 1129.467652] start_kernel+0x458/0x478
[ 1129.467658] secondary_startup_64+0xa5/0xb0
[ 1129.467748] Kernel Offset: 0x27400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 1130.244043] ---[ end Kernel panic - not syncing: An NMI occurred. Depending on your system the reason for the NMI is logged in any one of the following resources:
[ 1130.244043] 1. Integrated Management Log (IML)
[ 1130.244043] 2. OA Syslog
[ 1130.244043] 3. OA Forward Progress Log
[ 1130.244043] 4. iLO Event Log
I can confirm (somewhat) similar issues with the newest Debian 11 (Bullseye) and kernel 5.10 on HP ProLiant DL380 G7 servers. Boot sometimes works and sometimes fails (with a freeze caused by an NMI), leading to a very unstable OS. After disabling hpwdt (module blacklist), the problems were gone.

I documented the troubleshooting and solution on my blog: https://www.claudiokuenzler.com/blog/1125/debian-11-bullseye-boot-freeze-kernel-panic-hp-proliant-dl380

Ubuntu seems to have disabled/blacklisted the hpwdt module since 2015 already: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1432837

Maybe something Debian should also consider?
Just FYI, this bug still happens on the new Debian 12 Bookworm Alpha2, currently Debian Testing. The solution is the same as previously stated: disable (blacklist) the hpwdt kernel module.
Still happens in Debian 12.7. Blacklisting hpwdt seems to fix it. Hardware: HPE DL380 G7.
Hi Jerry,

The Debian kernel team received a number of reports over the past few years of instability of the ProLiant DL380 G7 and DL380p G8, seemingly related to the hpwdt driver (in that this goes away if it is not loaded). These reports can be seen at <https://bugs.debian.org/898336>.

The instability has been seen with kernel versions ranging from 4.16 to 6.1.y, including after the backport of commit dced0b3e51dd ("watchdog/hpwdt: Only claim UNKNOWN NMI if from iLO").

I can see that hpwdt seems to be used for error reporting, so it's not clear to me whether these are problems caused by the driver, or the driver is only reporting that something bad happened.

Do you have any ideas about what's going wrong here? Is there something odd about these models that needs to be handled in hpwdt, or are they just popular models?

Ben.
There are a couple of things that come to mind.

As you mentioned, hpwdt is used for error containment on ProLiants (especially on the older generations). Errors would be raised as NMI and the expectation was that hpwdt would handle the NMI and initiate a kdump. I have seen cases where shutting down file systems can raise PCIe errors which would be transmitted to the SUT as NMI and handled by hpwdt.

The second issue is that systemd enables WDT (not just hpwdt) during shutdown. This is to handle the case where shutdown hangs. The WDT is supposed to break the system out of such situations. The default timeout is 10 minutes:

/etc/systemd/system.conf:
    #RebootWatchdogSec=10min

(Note: I'm not a Debian user, but I believe the systemd behavior is the same on Debian as it is on rhel/sles.)

While a ten-minute delay to shutdown would be fairly obvious if you're doing interactive testing, it might not be noticed if the testing is automated.

To determine if either of the above is happening, you can:

o) Do the testing interactively and time the test. Does the NMI come in roughly 10 minutes after the shutdown?
o) Check the IEL and IML on the iLO web interface. Do you see any errors reported during the shutdown?

Questions:

1) The Debian bug above mentions only Gen 7 and 8 systems. Are you seeing this issue on other ProLiant systems?
2) You mentioned back-porting commit dced0b3e51dd. Does your drivers/watchdog/hpwdt.c source match upstream Linux? Or do you cherry-pick patches? (Sorry, not knowing Debian, I don't know how to find/navigate your kernel source.)

Please let me know what you find.

Jerry

-----------------------------------------------------------------------------
Jerry Hoemann Software Engineer Hewlett Packard Enterprise
-----------------------------------------------------------------------------
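As a rough aid for the timing check suggested above, the two panic logs already posted in this report can be compared by their kernel timestamps. The following is only a sketch: the helper name seconds_between and the two marker strings are illustrative choices, and the log lines are short excerpts of the originals. It measures only the gap between the hpwdt "Unexpected close" message and the panic, not the time since systemd armed the watchdog, so by itself it neither confirms nor rules out the 10-minute timeout.

```python
import re

# Matches the leading "[ seconds.micro]" timestamp of a kernel log line.
TS = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")

CLOSE = "hpwdt: Unexpected close"   # marker chosen for illustration
PANIC = "Kernel panic"              # marker chosen for illustration


def seconds_between(log_lines, first_marker, second_marker):
    """Seconds from the first line containing first_marker to the next
    line containing second_marker; None if either one is missing."""
    start = None
    for line in log_lines:
        m = TS.match(line.strip())
        if not m:
            continue  # not a timestamped kernel line (e.g. "[ OK ] ...")
        ts, msg = float(m.group(1)), m.group(2)
        if start is None and first_marker in msg:
            start = ts
        elif start is not None and second_marker in msg:
            return round(ts - start, 6)
    return None


# Excerpts from the 4.16.5 and 4.16.12 logs in this report.
log_4_16_5 = [
    "[ 2024.457121] hpwdt: Unexpected close, not stopping watchdog!",
    "[ 2026.782750] Kernel panic - not syncing: An NMI occurred.",
]
log_4_16_12 = [
    "[ OK ] Reached target Shutdown.",
    "[ 1125.048758] hpwdt: Unexpected close, not stopping watchdog!",
    "[ 1129.467546] Kernel panic - not syncing: An NMI occurred.",
]

for name, log in (("4.16.5", log_4_16_5), ("4.16.12", log_4_16_12)):
    print(name, seconds_between(log, CLOSE, PANIC), "s from close to panic")
```

In both posted logs the panic follows the close message within a few seconds (about 2.3 s and 4.4 s). Timing a full interactive shutdown, as suggested above, is still needed to relate this to the watchdog timeout.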
I have a DL380 G7 which I got for free a few months ago and set up in my home lab. I installed Debian 12.5 and got lots of errors in the logs and random freezes a few times a day. Investigating, I came across posts saying that the problem was hpwdt and to blacklist it. Since I did this, the server has been an absolute beauty with no issues at all.

Happy to run tests for you on weekends, but although I am not a noob at USING Debian (sysadmin here), I have no idea about kernel and module programming, so you may need to tell me exactly what to do to collect data for you.

Cheers
Marcos

inxi
CPU: 2x 6-core Intel Xeon X5680 (-MT MCP SMP-) speed/min/max: 2487/1596/3326 MHz
Kernel: 6.10.6+bpo-amd64 x86_64 Up: 5d 16h 55m Mem: 35.15/188.88 GiB (18.6%)
Storage: 34.83 TiB (3.5% used) Procs: 422 Shell: Bash inxi: 3.3.36

uname -a
Linux Earth2 6.10.6+bpo-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.10.6-1~bpo12+1 (2024-08-26) x86_64 GNU/Linux

lsmod
Module Size Used by
cpuid 12288 0
vhost_net 36864 5
vhost 65536 1 vhost_net
vhost_iotlb 16384 1 vhost
tap 32768 1 vhost_net
tun 69632 13 vhost_net
bridge 389120 0
stp 12288 1 bridge
llc 16384 2 bridge,stp
rfkill 40960 1
qrtr 53248 2
cpufreq_powersave 16384 0
amdgpu 12939264 0
amdxcp 12288 1 amdgpu
drm_exec 12288 1 amdgpu
binfmt_misc 28672 1
gpu_sched 65536 1 amdgpu
drm_buddy 20480 1 amdgpu
ipmi_ssif 45056 0
radeon 1888256 1
intel_powerclamp 16384 0
kvm_intel 413696 32
drm_suballoc_helper 12288 2 amdgpu,radeon
drm_display_helper 266240 2 amdgpu,radeon
kvm 1343488 21 kvm_intel
cec 69632 1 drm_display_helper
rc_core 73728 1 cec
drm_ttm_helper 12288 2 amdgpu,radeon
ttm 102400 3 amdgpu,radeon,drm_ttm_helper
ghash_clmulni_intel 16384 0
drm_kms_helper 253952 3 drm_display_helper,amdgpu,radeon
sha512_ssse3 45056 0
sha256_ssse3 32768 0
sha1_ssse3 32768 0
i2c_algo_bit 12288 2 amdgpu,radeon
video 77824 2 amdgpu,radeon
wmi 28672 1 video
aesni_intel 364544 0
crypto_simd 16384 1 aesni_intel
cryptd 28672 2 crypto_simd,ghash_clmulni_intel
sg 45056 0
hpilo 20480 0
joydev 24576 0
intel_cstate 24576 0
serio_raw 16384 0
evdev 28672 7
pcspkr 12288 0
ipmi_si 86016 1
intel_uncore 258048 0
iTCO_wdt 12288 0
intel_pmc_bxt 16384 1 iTCO_wdt
i7core_edac 32768 0
iTCO_vendor_support 12288 1 iTCO_wdt
watchdog 49152 1 iTCO_wdt
acpi_power_meter 24576 0
acpi_cpufreq 32768 0
acpi_ipmi 20480 1 acpi_power_meter
ipmi_devintf 16384 0
ipmi_msghandler 86016 4 ipmi_devintf,ipmi_si,acpi_ipmi,ipmi_ssif
button 24576 0
scsi_dh_alua 24576 1
dm_service_time 12288 0
dm_multipath 45056 1 dm_service_time
coretemp 16384 0
drm 749568 12 gpu_sched,drm_kms_helper,drm_exec,drm_suballoc_helper,drm_display_helper,drm_buddy,amdgpu,radeon,drm_ttm_helper,ttm,amdxcp
msr 12288 0
efi_pstore 12288 0
loop 40960 0
configfs 69632 1
ip_tables 28672 0
x_tables 53248 1 ip_tables
autofs4 57344 2
ext4 1130496 7
crc16 12288 1 ext4
mbcache 16384 1 ext4
jbd2 196608 1 ext4
efivarfs 28672 0
raid10 73728 0
raid456 196608 0
async_raid6_recov 20480 1 raid456
async_memcpy 16384 2 raid456,async_raid6_recov
async_pq 16384 2 raid456,async_raid6_recov
async_xor 16384 3 async_pq,raid456,async_raid6_recov
async_tx 16384 5 async_pq,async_memcpy,async_xor,raid456,async_raid6_recov
xor 20480 1 async_xor
raid6_pq 122880 3 async_pq,raid456,async_raid6_recov
libcrc32c 12288 1 raid456
crc32c_generic 12288 0
raid1 61440 0
raid0 24576 0
md_mod 225280 4 raid1,raid10,raid0,raid456
dm_mod 208896 25 dm_multipath
hid_generic 12288 0
usbhid 77824 0
hid 253952 2 usbhid,hid_generic
qla2xxx 1171456 2
sd_mod 81920 8
nvme_fc 53248 1 qla2xxx
nvme_fabrics 32768 1 nvme_fc
nvme_core 192512 2 nvme_fc,nvme_fabrics
t10_pi 20480 2 sd_mod,nvme_core
uhci_hcd 61440 0
crc64_rocksoft 16384 1 t10_pi
ehci_pci 16384 0
crc64 16384 1 crc64_rocksoft
hpsa 122880 6
ehci_hcd 110592 1 ehci_pci
crc_t10dif 16384 1 t10_pi
crct10dif_generic 12288 0
scsi_transport_fc 102400 1 qla2xxx
scsi_transport_sas 57344 1 hpsa
usbcore 401408 4 ehci_pci,usbhid,ehci_hcd,uhci_hcd
psmouse 208896 0
scsi_mod 319488 8 scsi_transport_sas,sd_mod,dm_multipath,qla2xxx,scsi_dh_alua,scsi_transport_fc,hpsa,sg
crct10dif_pclmul 12288 1
crc32_pclmul 12288 0
crc32c_intel 16384 14
bnx2 118784 0
lpc_ich 28672 0
usb_common 16384 3 usbcore,ehci_hcd,uhci_hcd
crct10dif_common 12288 3 crct10dif_generic,crc_t10dif,crct10dif_pclmul
scsi_common 16384 5 scsi_mod,sd_mod,qla2xxx,hpsa,sg

lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 273.4G 0 disk
├─sda1 8:1 0 487M 0 part /boot
├─sda2 8:2 0 1K 0 part
└─sda5 8:5 0 272.9G 0 part
  ├─Earth2--vg-root 254:0 0 43.3G 0 lvm /
  ├─Earth2--vg-var 254:1 0 9.3G 0 lvm /var
  ├─Earth2--vg-swap_1 254:2 0 976M 0 lvm [SWAP]
  ├─Earth2--vg-tmp 254:3 0 1.9G 0 lvm /tmp
  └─Earth2--vg-home 254:4 0 169.5G 0 lvm /home
sdb 8:16 0 17.3T 0 disk
└─sdb1 8:17 0 17.3T 0 part
sdc 8:32 0 17.3T 0 disk
└─sdc1 8:33 0 17.3T 0 part
  ├─Oort-VMDisks 254:5 0 7T 0 lvm /Oort/VMDisks
  └─Oort-NextcloudDisk 254:6 0 5T 0 lvm /Oort/NextcloudDisk
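
For anyone collecting the same data, a small sketch of how an lsmod listing like the one above could be checked for watchdog-related modules, for example to confirm that the hpwdt blacklist took effect. The helper name loaded_modules and the substring filter are illustrative assumptions, and the sample text is a short excerpt of the listing above, not the full output.

```python
def loaded_modules(lsmod_text):
    """Parse `lsmod` output into a set of module names.

    The header line ("Module Size Used by") is skipped because its
    second field is not a number.
    """
    names = set()
    for line in lsmod_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1].isdigit():
            names.add(fields[0])
    return names


# Excerpt of the lsmod output posted above (hpwdt is blacklisted there).
sample = """\
Module Size Used by
hpilo 20480 0
iTCO_wdt 12288 0
iTCO_vendor_support 12288 1 iTCO_wdt
watchdog 49152 1 iTCO_wdt
"""

mods = loaded_modules(sample)
print("hpwdt loaded:", "hpwdt" in mods)
# Simple substring filter; an assumption, not an exhaustive driver list.
print("watchdog-related:", sorted(m for m in mods if "wdt" in m or m == "watchdog"))
```

On the excerpt above this reports that hpwdt is not loaded, while iTCO_wdt and the watchdog core module are. On a live system the same function could be fed the real lsmod output instead of the sample text.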