Console output attached.
Hello Aaron,

I guess this problem is independent of ARCH=powerpc; still, I added the debian-powerpc list to Cc: in case someone there has an idea.

Is this problem reproducible? Does it also happen on a cold start?

Does the machine boot if you add

module_blacklist=ehci-pci,ehci-hcd,ehea

to the kernel command line? If yes, can you please try to reduce the list of modules to blacklist and report back which module (or combination of modules) is the relevant one here?

Best regards
Uwe
Uwe writes:
Perhaps? My Dell servers with EHCI hardware seem to be OK, but they
are all running 6.12.x kernels at present.
It is very reproducible. And "cold start" is weird on this platform.
I have tried completely powering off the server as part of the
troubleshooting process, but usually do use the much faster "activate
partition" action from the management server.
Well, no. But having looked at the kernel module loader source it
seems that the friendly '-' to '_' feature isn't there for the kernel
command line. So,
module_blacklist=ehci_pci
does the right thing. Which is good for me as 'ehea' is the Ethernet
driver.
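
To illustrate why the spelling matters, here is a minimal user-space sketch in C of an exact-match lookup in a comma-separated blacklist. It is not the kernel's code: the function name is_blacklisted() and the string handling are assumptions made for illustration only. With no '-' to '_' translation, an entry spelled ehci-pci does not match a module named ehci_pci, while ehea matches as written.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper (not a kernel function): return true if `name`
 * appears as a whole entry in the comma-separated `list`. The comparison
 * is byte-exact, so '-' and '_' are treated as different characters.
 */
bool is_blacklisted(const char *list, const char *name)
{
	size_t name_len = strlen(name);

	while (*list) {
		size_t len = strcspn(list, ",");	/* length of this entry */

		if (len == name_len && memcmp(list, name, len) == 0)
			return true;

		list += len;
		if (*list == ',')
			list++;				/* skip the separator */
	}
	return false;
}
```

Under these assumptions, is_blacklisted("ehci-pci,ehci-hcd,ehea", "ehci_pci") is false and is_blacklisted("ehci-pci,ehci-hcd,ehea", "ehea") is true, which is consistent with the behaviour described above.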
Unfortunately for our purposes here, this IBM beast is being gifted to
a friend sometime soon. I think I have it until 10 March. But am
happy to try new kernel packages until such time as it leaves my
possession.
Successful boot log attached.
Thanks!
- Aaron
Hello Aaron,

so 6.1 works fine? If so, can you bisect over the packaged kernels to find out when the problem was introduced? Also trying 6.19 (currently in experimental) would be a good test.

Another thing that might be worth trying: There are two different flavours of the powerpc kernel: the normal one and the 64k pages one. IIUC your machine should be able to work with both flavours. It would be obscure but not totally unheard of if this made a difference.

Other than that it might be worth forwarding this upstream to the USB and powerpc maintainers. But if you can do the bisection I suggest waiting with that and then including your findings in that report.

If the machine changes hands before the bug details are worked out and said friend won't continue the investigation, we'll probably close this bug.

Best regards
Uwe
Uwe Kleine-König writes:

Heh. linux-image-6.1.0-9-powerpc64 works without any issues, yes. No issues requiring modules be blacklisted. And, while I am at it, linux-image-6.18.15+deb14-powerpc64 works more or less as expected with ehci_pci blacklisted, too.

The 6.1.0-9 kernel package (and the grub packages installed on this machine) are from an 11.0.0 vintage ppc64 netinstall ISO. I think from https://cdimage.debian.org/cdimage/ports/snapshots/2023-05-14/debian-11.0.0-ppc64-NETINST-1.iso

There were grub bugs on the 12.x and 13.x installer ISOs that prevented me getting the installation image to even start a kernel. And I haven't yet tried the grub-ieee1275 2.14-2 packages. But that is an issue for a different bug report. Back to the topic at hand...

I'd be happy bisecting kernel packages and reporting results. But will need a little help locating the archives for unreleased architectures. Any pointers in that direction would be great.

I will give that a try and report results, too. Probably sooner than the kernel package bisecting process.

Sounds like a reasonable plan.

I understand. Unreleased architecture, unreproducible...

I've started working with the -64k kernel package. Expect some updates soon.

Thanks!
- Aaron
More console output updates with the currently available -64k kernel package, 6.18.15-1. I will pull a 6.19 image from experimental at some point today.

In short, the bigger page size -64k kernel image still needs the ehci_pci module blacklisted to boot far enough to get a getty running on the console.

Two bootlogs attached here. First without ehci_pci blacklisted and the hung boot. Second with ehci_pci blacklisted, and a (more or less) successful boot. This one still has an Oops in about the same place. It just does not seem to have as great an impact on the kernel and userspace startup process.

Thanks!
- Aaron
Looking at the oops, the crash occurs in pci_msi_domain_supports() at
offset +0x54, specifically on the dereference of domain->host_data (stored
in the 'info' variable) when accessing info->flags. The domain pointer
itself is not NULL and passes irq_domain_is_hierarchy(), so the early-exit
guard is bypassed, but host_data is NULL on this platform.
I noticed a relevant change between 6.11 and 6.12 in that function. In 6.11
the early-exit path was unconditional:
if (!domain || !irq_domain_is_hierarchy(domain))
return mode == ALLOW_LEGACY;
In 6.12 it became conditional on CONFIG_PCI_MSI_ARCH_FALLBACKS:
if (!domain || !irq_domain_is_hierarchy(domain)) {
if (IS_ENABLED(CONFIG_PCI_MSI_ARCH_FALLBACKS))
return mode == ALLOW_LEGACY;
return false;
}
Perhaps CONFIG_PCI_MSI_ARCH_FALLBACKS=y isn't set for ppc64?
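
The control flow described above can be modelled in user space. The following C sketch uses heavily simplified stand-ins for the kernel structures; the type names, the field set, and the function name model_domain_supports() are assumptions made for illustration, not the kernel's definitions. It shows the early-exit path with the config-dependent fallback, and adds a purely illustrative NULL check at the point where a missing host_data would otherwise be dereferenced.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins; the real kernel structures have many more fields. */
struct model_msi_info {
	unsigned int flags;
};

struct model_domain {
	bool hierarchy;			/* models irq_domain_is_hierarchy() */
	struct model_msi_info *host_data;
};

enum model_mode { ALLOW_LEGACY, DENY_LEGACY };

/*
 * Models the shape of the check discussed above. `arch_fallbacks` stands in
 * for IS_ENABLED(CONFIG_PCI_MSI_ARCH_FALLBACKS). The host_data guard is an
 * illustrative addition: without it, reading info->flags would dereference
 * NULL whenever host_data is unset.
 */
bool model_domain_supports(const struct model_domain *domain,
			   unsigned int feature_mask,
			   enum model_mode mode, bool arch_fallbacks)
{
	const struct model_msi_info *info;

	if (!domain || !domain->hierarchy) {
		if (arch_fallbacks)
			return mode == ALLOW_LEGACY;
		return false;
	}

	info = domain->host_data;
	if (!info)			/* guard against missing host data */
		return false;

	return (info->flags & feature_mask) == feature_mask;
}
```

In this model a domain that passes the hierarchy check but carries no host_data simply reports "not supported" instead of faulting. Whether that is the right behaviour for the real function is a separate question.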
I have done as much bisecting as I can with kernel-image packages found inside the installer ISO image archives at https://cdimage.debian.org/cdimage/ports/snapshots/. Here are my results from today:

kernel-image-6.12.6-powerpc64 works as expected. No blacklisting of ehci_pci is required. OS boots to full multiuser mode successfully.

kernel-image-6.12.17-powerpc64 fails in a manner similar to the original report. Kernel Oops inside pci_msi_domain_supports().

kernel-image-6.12.17-powerpc64 with ehci_pci blacklisted fails in a new fashion. There is still an Oops inside pci_msi_domain_supports(). But, as expected, ehci_pci_probe is not in the stack trace.

Console output for each of these is attached.

And following up on my friend Hayden's suggestion, CONFIG_PCI_MSI_ARCH_FALLBACKS is set on all of the installed kernel packages:

$ grep MSI_ARCH /boot/config-6.1*
/boot/config-6.1.0-9-powerpc64:CONFIG_PCI_MSI_ARCH_FALLBACKS=y
/boot/config-6.11.7-powerpc64:CONFIG_PCI_MSI_ARCH_FALLBACKS=y
/boot/config-6.12.17-powerpc64:CONFIG_PCI_MSI_ARCH_FALLBACKS=y
/boot/config-6.12.6-powerpc64:CONFIG_PCI_MSI_ARCH_FALLBACKS=y
/boot/config-6.16.3+deb14-powerpc64:CONFIG_PCI_MSI_ARCH_FALLBACKS=y
/boot/config-6.18.15+deb14-powerpc64:CONFIG_PCI_MSI_ARCH_FALLBACKS=y
/boot/config-6.18.15+deb14-powerpc64-64k:CONFIG_PCI_MSI_ARCH_FALLBACKS=y
/boot/config-6.18.9+deb14-powerpc64:CONFIG_PCI_MSI_ARCH_FALLBACKS=y
/boot/config-6.19-powerpc64:CONFIG_PCI_MSI_ARCH_FALLBACKS=y
/boot/config-6.19-powerpc64-64k:CONFIG_PCI_MSI_ARCH_FALLBACKS=y
$

So code changes in that area are likely to blame, but the config itself seems to be OK.

Is there an archive of corresponding kernel package sources (or even better, ppc64 binaries) available somewhere? If there is anything between the 6.12.6 and 6.12.17 releases, I'm happy to build (or install) as much as necessary to find the initial package with the issue and do further bisection. At least for as long as I have the machine in my possession.

It looks like there were 6.12.16-1, 6.12.15-1, 6.12.13-1, 6.12.12-1, 6.12.11-1, 6.12.10-1, 6.12.9-1, 6.12.8-1, and 6.12.6-1 kernel-image packages in the archive at some point.

Thanks!
- Aaron
Bisecting has been completed. And commit
aed157301c659a48f5564cc4568cf0e5c8831af0 has been identified as the
source of the issue:
$ git bisect log
git bisect start
# status: waiting for both good and bad commits
# good: [e9d65b48ce1aba50e9ec7eab6d9f73d1ba88420e] Linux 6.12.6
git bisect good e9d65b48ce1aba50e9ec7eab6d9f73d1ba88420e
# status: waiting for bad commit, 1 good commit known
# bad: [41b222412985dc8410b88fb7a0fda87e6640d4df] Linux 6.12.17
git bisect bad 41b222412985dc8410b88fb7a0fda87e6640d4df
# bad: [33e47d9573075342a41783a55c8c67bc71246fc1] bpf: Cancel the running bpf_timer through kworker for PREEMPT_RT
git bisect bad 33e47d9573075342a41783a55c8c67bc71246fc1
# bad: [e7960da6f2f438d907c17d463364dae6d242f775] sched_ext: switch class when preempted by higher priority scheduler
git bisect bad e7960da6f2f438d907c17d463364dae6d242f775
# bad: [9da1cfc4f111b7e4ea3d7f388b16b17bb881795e] Bluetooth: btusb: mediatek: add callback function in btusb_disconnect
git bisect bad 9da1cfc4f111b7e4ea3d7f388b16b17bb881795e
# good: [416226eb3f3a3008456903fc3695b0efa8ceafa1] selftests/bpf: Use asm constraint "m" for LoongArch
git bisect good 416226eb3f3a3008456903fc3695b0efa8ceafa1
# good: [6d9cd27105459f169993a4c5f216499a946dbf34] powerpc/pseries/vas: Add close() callback in vas_vm_ops struct
git bisect good 6d9cd27105459f169993a4c5f216499a946dbf34
# good: [16b54ee81d8a44781ddeb5e577262d22c4e6c853] blk-mq: register cpuhp callback after hctx is added to xarray table
git bisect good 16b54ee81d8a44781ddeb5e577262d22c4e6c853
# bad: [5e44779d12bd3cd3930d4b7566edec2ea69972b7] perf/x86/intel: Fix bitmask of OCR and FRONTEND events for LNC
git bisect bad 5e44779d12bd3cd3930d4b7566edec2ea69972b7
# good: [8659da87d21678088ddbed263ad713381db86745] perf/x86/intel/uncore: Add Clearwater Forest support
git bisect good 8659da87d21678088ddbed263ad713381db86745
# good: [8e8494c83cf73168118587e9567e4f7e50ce4fd8] io_uring/sqpoll: fix sqpoll error handling races
git bisect good 8e8494c83cf73168118587e9567e4f7e50ce4fd8
# good: [b939f108e86b76119428a6fa4e92491e09ac7867] x86/fred: Clear WFE in missing-ENDBRANCH #CPs
git bisect good b939f108e86b76119428a6fa4e92491e09ac7867
# bad: [aed157301c659a48f5564cc4568cf0e5c8831af0] PCI/MSI: Handle lack of irqdomain gracefully
git bisect bad aed157301c659a48f5564cc4568cf0e5c8831af0
# good: [1429ae7b7d4759a1e362456b8911c701bae655b4] virt: tdx-guest: Just leak decrypted memory on unrecoverable errors
git bisect good 1429ae7b7d4759a1e362456b8911c701bae655b4
# first bad commit: [aed157301c659a48f5564cc4568cf0e5c8831af0] PCI/MSI: Handle lack of irqdomain gracefully
$
And from there, I can see that commit was meant to fix a PCI MSI issue
seen on RISCV. Which, to be fair, I am sure it did.
I have built a kernel-image-6.19.6 package with commit
aed157301c659a48f5564cc4568cf0e5c8831af0 reverted. It boots
successfully. Boot log attached for that.
Not sure what my next steps are here. This server is likely going
away tomorrow. And the recipient is planning to run an inferior,
proprietary OS on it. I will get console output from the bisection
process saved and some notes prepared in the hopes that the IRQ domain
maintainer can find a more architecture-neutral solution.
Thanks!
- Aaron
#regzbot introduced: a60b990798eb17433d0283788280422b1bd94b18
#regzbot from: "Aaron D. Johnson" <debbugreporter@fnord.greeley.co.us>
#regzbot monitor: https://bugs.debian.org/1127635

Hello,

commit a60b990798eb17433d0283788280422b1bd94b18 ("PCI/MSI: Handle lack of irqdomain gracefully") went into v6.13-rc5 and was backported to 6.12.y and 6.6.y (aed157301c65 and b1f7476e07b9 respectively).

A Debian user (Aaron, on Cc:) on powerpc has boot problems and bisected them to this commit. The relevant boot log of the failure is:

[ 2.643879] BUG: Kernel NULL pointer dereference on read at 0x00000000
[ 2.643891] Faulting instruction address: 0xc000000000a39514
[ 2.643902] Oops: Kernel access of bad area, sig: 11 [#1]
[ 2.643909] BE PAGE_SIZE=4K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
[ 2.643920] Modules linked in: ohci_pci(+) ehci_hcd nvme_fabrics ohci_hcd nvme_keyring nvme_core usbcore nvme_auth scsi_transport_fc ipr configfs ehea(+) usb_common
[ 2.643965] CPU: 5 UID: 0 PID: 250 Comm: (udev-worker) Not tainted 6.12.17-powerpc64 #1 Debian 6.12.17-1
[ 2.643976] Hardware name: IBM,8204-E8A POWER6 (architected) 0x3e0302 0xf000002 of:IBM,EL350_118 hv:phyp pSeries
[ 2.643986] NIP: c000000000a39514 LR: c000000000a36ed8 CTR: c000000000a35820
[ 2.643995] REGS: c0000000351f6f60 TRAP: 0300 Not tainted (6.12.17-powerpc64 Debian 6.12.17-1)
[ 2.644004] MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI> CR: 24222288 XER: 00000000
[ 2.644031] CFAR: c00000000000cfc4 DAR: 0000000000000000 DSISR: 40000000 IRQMASK: 0
[ 2.644031] GPR00: c000000000a36ed8 c0000000351f7200 c00000000182e200 c0000003df294000
[ 2.644031] GPR04: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 2.644031] GPR08: 0000000000000001 0000000000000000 c00000000228fcc0 0000000044222288
[ 2.644031] GPR12: c000000000a35820 c00000000eeacb00 0000000000000020 0000010037fcab20
[ 2.644031] GPR16: 0000000022222248 0000000000020000 0000000000000000 00003fffebe8bb80
[ 2.644031] GPR20: 0000000000000000 c00000000204db60 c00000000204dd60 c00000000b1ae780
[ 2.644031] GPR24: 0000000000000000 00003fff8c9ac758 0000000000000000 c0000003df294000
[ 2.644031] GPR28: 0000000000000001 0000000000000000 c0000003df294000 0000000000000001
[ 2.644164] NIP [c000000000a39514] pci_msi_domain_supports (drivers/pci/msi/irqdomain.c:366)
[ 2.644181] LR [c000000000a36ed8] __pci_enable_msi_range (drivers/pci/msi/msi.c:437)
[ 2.644192] Call Trace:
[ 2.644197] [c0000000351f7200] [c0000000351f7304] 0xc0000000351f7304 (unreliable)
[ 2.644211] [c0000000351f7340] [c000000000a3578c] pci_alloc_irq_vectors_affinity (drivers/pci/msi/api.c:277)
[ 2.644225] [c0000000351f73d0] [c0003d0007d2f4d4] usb_hcd_pci_probe (drivers/usb/core/hcd-pci.c:192) usbcore
[ 2.644246] [c0000000351f7470] [c0003d00084e6030] ohci_pci_probe (drivers/usb/host/ohci-pci.c:285) ohci_pci
[ 2.644260] [c0000000351f7490] [c000000000a260e8] local_pci_probe (drivers/pci/pci-driver.c:324)
[ 2.644274] [c0000000351f7510] [c000000000a26218] pci_call_probe (drivers/pci/pci-driver.c:392 (discriminator 1))
[ 2.644287] [c0000000351f7670] [c000000000a27348] pci_device_probe (drivers/pci/pci-driver.c:452)
[ 2.644300] [c0000000351f76b0] [c000000000b2e658] really_probe (drivers/base/dd.c:579 drivers/base/dd.c:658)
[ 2.644314] [c0000000351f7740] [c000000000b2eb24] __driver_probe_device (drivers/base/dd.c:800)
[ 2.644327] [c0000000351f77c0] [c000000000b2edc4] driver_probe_device (drivers/base/dd.c:831)
[ 2.644340] [c0000000351f7800] [c000000000b2f188] __driver_attach (drivers/base/dd.c:1217)
[ 2.644352] [c0000000351f7880] [c000000000b2ac64] bus_for_each_dev (drivers/base/bus.c:370)
[ 2.644365] [c0000000351f78e0] [c000000000b2dac4] driver_attach (drivers/base/dd.c:1234)
[ 2.644377] [c0000000351f7900] [c000000000b2cd98] bus_add_driver (drivers/base/bus.c:675)
[ 2.644389] [c0000000351f7990] [c000000000b30ae4] driver_register (drivers/base/driver.c:246)
[ 2.644402] [c0000000351f7a00] [c000000000a24f88] __pci_register_driver (drivers/pci/pci-driver.c:1450)
[ 2.644415] [c0000000351f7a20] [c0003d00084e6800] ohci_pci_init (drivers/usb/host/ohci-pci.c:308) ohci_pci
[ 2.644429] [c0000000351f7a50] [c00000000000fd60] do_one_initcall (init/main.c:1269)
[ 2.644444] [c0000000351f7b30] [c0000000002760f8] do_init_module (kernel/module/main.c:2543)
[ 2.644460] [c0000000351f7bb0] [c000000000278fe4] init_module_from_file (kernel/module/main.c:3199)
[ 2.644473] [c0000000351f7c90] [c0000000002793e0] sys_finit_module (kernel/module/main.c:3211 kernel/module/main.c:3238 kernel/module/main.c:3221)
[ 2.644487] [c0000000351f7da0] [c00000000002c084] system_call_exception (arch/powerpc/kernel/syscall.c:171)
[ 2.644500] [c0000000351f7e50] [c00000000000cb54] system_call_common (arch/powerpc/kernel/interrupt_64.S:292)
[ 2.644515] --- interrupt: c00 at 0x3fff8d653d8c
[ 2.644522] NIP: 00003fff8d653d8c LR: 00003fff8c9a4680 CTR: 0000000000000000
[ 2.644531] REGS: c0000000351f7e80 TRAP: 0c00 Not tainted (6.12.17-powerpc64 Debian 6.12.17-1)
[ 2.644541] MSR: 800000000200f032 <SF,VEC,EE,PR,FP,ME,IR,DR,RI> CR: 22222222 XER: 00000000
[ 2.644573] IRQMASK: 0
[ 2.644573] GPR00: 0000000000000161 00003fffebe8b640 00003fff8d757100 0000000000000052
[ 2.644573] GPR04: 00003fff8c9ac758 0000000000000004 0000000000000058 000000000000005a
[ 2.644573] GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 2.644573] GPR12: 0000000000000000 00003fff8de947c0 0000000000000020 0000010037fcab20
[ 2.644573] GPR16: 0000000022222248 0000000000020000 0000000000000000 00003fffebe8bb80
[ 2.644573] GPR20: 0000000000000000 00003fffebe8bb70 0000000000000007 0000010037fca210
[ 2.644573] GPR24: 0000000000000000 0000000000000000 0000010037f6be40 0000000000000004
[ 2.644573] GPR28: 00003fff8c9ac758 0000000000020000 0000000000000004 0000010037fca210
[ 2.644698] NIP [00003fff8d653d8c] 0x3fff8d653d8c
[ 2.644705] LR [00003fff8c9a4680] 0x3fff8c9a4680
[ 2.644713] --- interrupt: c00
[ 2.644719] Code: 4182002c e92a0088 80690000 7c632038 7c632278 7c630034 5463d97e 786307e0 4e800020 60000000 60000000 e92a0020 <80690000> 4bffffd8 60000000 7ca50034
All code
========
   0:* 41 82 00 2c  beq 0x2c  <-- trapping instruction
   4:  e9 2a 00 88  ld r9,136(r10)
   8:  80 69 00 00  lwz r3,0(r9)
   c:  7c 63 20 38  and r3,r3,r4
  10:  7c 63 22 78  xor r3,r3,r4
  14:  7c 63 00 34  cntlzw r3,r3
  18:  54 63 d9 7e  srwi r3,r3,5
  1c:  78 63 07 e0  clrldi r3,r3,63
  20:  4e 80 00 20  blr
  24:  60 00 00 00  nop
  28:  60 00 00 00  nop
  2c:  e9 2a 00 20  ld r9,32(r10)
  30:  80 69 00 00  lwz r3,0(r9)
  34:  4b ff ff d8  b 0xc
  38:  60 00 00 00  nop
  3c:  7c a5 00 34  cntlzw r5,r5

Code starting with the faulting instruction
===========================================
   0:  80 69 00 00  lwz r3,0(r9)
   4:  4b ff ff d8  b 0xffffffffffffffdc
   8:  60 00 00 00  nop
   c:  7c a5 00 34  cntlzw r5,r5
[ 2.644769] ---[ end trace 0000000000000000 ]---

(That's the bug splat from the bug report piped through scripts/decode_stacktrace.sh)

The kernel has CONFIG_PCI_MSI_ARCH_FALLBACKS=y, so the first hunk shouldn't change anything. The disassembly of pci_msi_domain_supports in the kernel looks as follows:

c000000000a394c0 <pci_msi_domain_supports>:
pci_msi_domain_supports():
debian/build/build_powerpc_none_powerpc64/drivers/pci/msi/irqdomain.c:334
c000000000a394c0: 60 00 00 00 nop
c000000000a394c4: 60 00 00 00 nop
debian/build/build_powerpc_none_powerpc64/drivers/pci/msi/irqdomain.c:353
c000000000a394c8: e9 43 02 e8 ld r10,744(r3)
c000000000a394cc: 2c 2a 00 00 cmpdi r10,0
c000000000a394d0: 41 82 00 50 beq c000000000a39520 <pci_msi_domain_supports+0x60>
irq_domain_is_hierarchy():
debian/build/build_powerpc_none_powerpc64/include/linux/irqdomain.h:661
c000000000a394d4: 81 2a 00 28 lwz r9,40(r10)
pci_msi_domain_supports():
debian/build/build_powerpc_none_powerpc64/drivers/pci/msi/irqdomain.c:353 (discriminator 1)
c000000000a394d8: 71 28 00 01 andi. r8,r9,1
c000000000a394dc: 41 82 00 44 beq c000000000a39520 <pci_msi_domain_supports+0x60>
debian/build/build_powerpc_none_powerpc64/drivers/pci/msi/irqdomain.c:359 (discriminator 1)
c000000000a394e0: 71 29 01 00 andi. r9,r9,256
c000000000a394e4: 41 82 00 2c beq c000000000a39510 <pci_msi_domain_supports+0x50>
debian/build/build_powerpc_none_powerpc64/drivers/pci/msi/irqdomain.c:375
c000000000a394e8: e9 2a 00 88 ld r9,136(r10)
c000000000a394ec: 80 69 00 00 lwz r3,0(r9)
debian/build/build_powerpc_none_powerpc64/drivers/pci/msi/irqdomain.c:378
c000000000a394f0: 7c 63 20 38 and r3,r3,r4
c000000000a394f4: 7c 63 22 78 xor r3,r3,r4
c000000000a394f8: 7c 63 00 34 cntlzw r3,r3
c000000000a394fc: 54 63 d9 7e srwi r3,r3,5
debian/build/build_powerpc_none_powerpc64/drivers/pci/msi/irqdomain.c:379
c000000000a39500: 78 63 07 e0 clrldi r3,r3,63
c000000000a39504: 4e 80 00 20 blr
c000000000a39508: 60 00 00 00 nop
c000000000a3950c: 60 00 00 00 nop
debian/build/build_powerpc_none_powerpc64/drivers/pci/msi/irqdomain.c:366
c000000000a39510: e9 2a 00 20 ld r9,32(r10)
c000000000a39514: 80 69 00 00 lwz r3,0(r9)
c000000000a39518: 4b ff ff d8 b c000000000a394f0 <pci_msi_domain_supports+0x30>
c000000000a3951c: 60 00 00 00 nop
debian/build/build_powerpc_none_powerpc64/drivers/pci/msi/irqdomain.c:355
c000000000a39520: 7c a5 00 34 cntlzw r5,r5
c000000000a39524: 54 a3 d9 7e srwi r3,r5,5
debian/build/build_powerpc_none_powerpc64/drivers/pci/msi/irqdomain.c:379
c000000000a39528: 78 63 07 e0 clrldi r3,r3,63
c000000000a3952c: 4e 80 00 20 blr

so the trapping happens in drivers/pci/msi/irqdomain.c:366 which is:

365 info = domain->host_data;
366 supported = info->flags;

According to the register dump domain == r10 == NULL, but then this code would not have been reached and the faulting instruction would be at c000000000a39510. So maybe it's only .host_data = NULL and the register dump is unreliable?? The offsets match: .host_data is at offset 32 of struct irq_domain and .flags is at offset 0 of struct msi_domain_info.

For more details see https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1127635 .

Does someone spot the issue?

Best regards
Uwe
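
The offset argument can be checked in user space with offsetof(). The sketch below declares stand-in structures whose leading fields are assumed, chosen to be consistent with the offsets quoted above (ld r9,32(r10) for host_data, lwz r9,40(r10) for the domain flags, lwz r3,0(r9) for the info flags). It is not a copy of the kernel's struct irq_domain or struct msi_domain_info, all model_* names are hypothetical, and the numbers only hold on an LP64 target with 8-byte pointers.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a doubly linked list head: two pointers. */
struct model_list_head {
	void *next, *prev;
};

/* Assumed leading fields only; the real structure is much larger. */
struct model_irq_domain {
	struct model_list_head link;
	const char *name;
	const void *ops;
	void *host_data;
	unsigned int flags;
};

/* Assumed first field only. */
struct model_msi_domain_info {
	uint32_t flags;
};

/* Offset the "ld r9,32(r10)" load would correspond to in this model. */
size_t model_host_data_offset(void)
{
	return offsetof(struct model_irq_domain, host_data);
}

/* Offset the "lwz r9,40(r10)" load would correspond to in this model. */
size_t model_domain_flags_offset(void)
{
	return offsetof(struct model_irq_domain, flags);
}

/* Offset the faulting "lwz r3,0(r9)" load would correspond to. */
size_t model_info_flags_offset(void)
{
	return offsetof(struct model_msi_domain_info, flags);
}
```

If the model's layout assumption is right, the faulting load reads the first word behind the pointer fetched from offset 32 of the domain, which fits a NULL host_data rather than a NULL domain pointer.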
("PCI/MSI: Handle lack of irqdomain gracefully") [v6.13-rc5]
Am I missing something? If so: is that okay for everybody, or should we do
anything about this?
BTW, did anyone check if this happens with mainline (6.13/7.0) as well
to rule out that this is something that only happens in the stable
series it was backported to? If it's the latter I wonder if reverting
it there might be an easy way to resolve this.
Ciao, Thorsten
Thorsten Leemhuis writes: I have not seen any discussion. As the initial reporter, I can accept that it is a very niche hardware platform and may well go unfixed forever. The machine in question has left my possession. It may come back in the future. Or maybe not. The next-generation newer ppc64 machine I have (an IBM 8205-E6C p740) does not exhibit this behavior. I didn't check mainline specifically, no. All bisection was done against the stable git repo. Thanks! - Aaron
Yes. That fell through the cracks. Let me have a look.
So R9 is NULL, R10 is the domain pointer.
Correct. But the Oops is in the unchanged code part of
pci_msi_domain_supports().
user space register set from the syscall entry.
R10 contains the domain pointer and R9 is NULL, which does not make any
sense.
On 6.12 power64 still uses the global PCI/MSI domain model. According to
the splat this is pseries so the global PCI/MSI domain is created in
__pseries_msi_allocate_domains() via pci_msi_create_irq_domain(). The
latter takes a pointer to
static struct msi_domain_info pseries_msi_domain_info;
which is assigned to the global PCI/MSI domain::host_data.
Upstream got rid of that and uses per device domains, so it might have
been magically fixed by now, but I doubt it:
That new check in __pci_enable_msi_range() is benign as the actual
allocation code further down relies on domain::host_data being a valid
pointer as well.
It might not reach that point due to the subsequent checks, but if the
PCI device has pdev::dev::msi::domain populated, then this has to be
either a global PCI/MSI domain or a MSI parent domain. Both have
domain::host_data populated with a msi_domain_info pointer.
Something is mighty fishy here.
Aaron, can you please apply the patch below and see whether it fixes the
issue and provide the dmesg with the output of those pr_warn()'s?
The other information which would be useful: When you boot a kernel with
the commit reverted and look at that OHCI controller with lspci -vvv
then you should see whether it has MSI enabled or not. If it has MSI
enabled, then please provide the output of
/sys/kernel/debug/irq/irqs/$IRQNR
You need to enable CONFIG_GENERIC_IRQ_DEBUGFS for that.
And that's actually useful for the debug patch below too because you can
then look at the domain name output and gather more information from
/sys/kernel/debug/irq/domains/$NAME
Thanks,
tglx
---
 drivers/pci/msi/irqdomain.c |   15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -115,6 +115,8 @@ struct irq_domain *pci_msi_create_irq_do
 					     struct msi_domain_info *info,
 					     struct irq_domain *parent)
 {
+	struct irq_domain *domain;
+
 	if (WARN_ON(info->flags & MSI_FLAG_LEVEL_CAPABLE))
 		info->flags &= ~MSI_FLAG_LEVEL_CAPABLE;
 
@@ -135,7 +137,12 @@ struct irq_domain *pci_msi_create_irq_do
 	/* Let the core update the bus token */
 	info->bus_token = DOMAIN_BUS_PCI_MSI;
 
-	return msi_create_irq_domain(fwnode, info, parent);
+	domain = msi_create_irq_domain(fwnode, info, parent);
+	if (domain) {
+		pr_warn("Created global PCI/MSI domain %lx %s flags: %x\n",
+			(unsigned long)domain, domain->name, domain->flags);
+	}
+	return domain;
 }
 EXPORT_SYMBOL_GPL(pci_msi_create_irq_domain);
 
@@ -356,6 +363,12 @@ bool pci_msi_domain_supports(struct pci_
 		return false;
 	}
 
+	if (!domain->host_data) {
+		pr_warn("Device MSI domain %lx %s %x lacks host data\n",
+			(unsigned long)domain, domain->name, domain->flags);
+		return false;
+	}
+
 	if (!irq_domain_is_msi_parent(domain)) {
 		/*
 		 * For "global" PCI/MSI interrupt domains the associated
Thomas Gleixner writes: I can build it. But as stated earlier, the machine is no longer in my possession. If it comes back, (and it might -- its new owner is having problems supplying sufficient input power in his apartment), I will be happy to test patches. Noted. And thanks for the time looking into it. - Aaron