- Package:
- debian-installer
- Source:
- debian-installer
- Description:
- Debian Installer documentation
- Submitter:
- Phillip Susi
- Date:
- 2021-09-21 15:15:06 UTC
- Severity:
- normal
Every bullseye netinst image I have tried to boot in a xen domU has crashed and rebooted the domU after choosing any entry from the boot menu. I thought there might have been something wrong with my xen server, which was running Ubuntu 18.04, but I rebuilt it the other day using bullseye as the dom0, and I still can't install bullseye in a domU.
reassign 983357 linux
severity grave
thanks

I rebuilt the iso using the version of isolinux from stable and it still crashed the domU. When I rebuilt it using the vmlinux and initrd.gz from the stable iso, it successfully boots, so it appears to be caused by the kernel. Interestingly, there appears to be a different kernel build just for use under xen in install.amd/xen and using that one also works. Maybe we need a menu option in isolinux to load that kernel instead?
The netinst image I had contained the -3 kernel. I rebuilt the image with the current -6 kernel and it worked. I downloaded the latest weekly netinst iso and it already contains the -6 kernel, so it appears that this has been fixed in -4, -5, or -6.
reopen 983357
thanks

I don't know what happened, but it is back to not working, even with the install.amd/xen/ kernel.

Phillip Susi writes:
affects 983357 + debian-installer
severity 983357 serious
thanks

It appears that the root cause of this bug has been reported upstream here:

https://bugzilla.kernel.org/show_bug.cgi?id=207695

It seems that there is an error trying to udev trigger the Xen virtual keyboard, and this causes start-udev to bail out, which causes init to bail out and the kernel to panic. Removing the set -e from start-udev appears to be a viable workaround that d-i might want to consider.
I bisected the problem to this commit:
commit df44b479654f62b478c18ee4d8bc4e9f897a9844
Author: Peter Rajnoha <prajnoha@redhat.com>
Date: Wed Dec 5 12:27:44 2018 +0100
kobject: return error code if writing /sys/.../uevent fails
Propagate error code back to userspace if writing the /sys/.../uevent
file fails. Before, the write operation always returned with success,
even if we failed to recognize the input string or if we failed to
generate the uevent itself.
With the error codes properly propagated back to userspace, we are
able to react in userspace accordingly by not assuming and awaiting
a uevent that is not delivered.
Signed-off-by: Peter Rajnoha <prajnoha@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
So it appears that the Xen Virtual Keyboard driver was always broken but
the error in triggering the uevent was not previously reported. The
upstream bug report notes another driver failing. There are probably
other drivers too.
I will continue to try to find and fix the Xen keyboard error so it no
longer fails anyway, but it is probably a good idea to patch the
start-udev script in d-i to ignore errors. It is better to continue
with some device not triggering a cold plug event than to instantly panic
the kernel in early boot.
I dug down to the the -ENOMEM coming from the fact that the modalias is over 2KB of crap so it won't fit in the environment block when the input core tries to add it for the uevent. I don't see how it gets this way though because the MODULE_ALIAS() statement in the code just says it should be "xen: vkbd". When I read the modalias in sysfs, it says: input:b0001v5853pFFFFe0000-e0,1,k71,72,73,74,75,76,77,78,79,7A,7B,7C,7D,\ 7E,7F,80,81,82,83,84,85,86,87,88,89,8A,8B,8C,8D,8E,8F,90,91,92,93,94,95,\ 96,97,98,99,9A,9B,9C,9D,9E,9F,A0,A1,A2,A3,A4,A5,A6,A7,A8,A9,AA,AB,AC,AD,\ AE,AF,B0,B1,B2,B3,B4,B5,B6,B7,B8,B9,BA,BB,BC,BD,BE,BF,C0,C1,C2,C3,C4,C5,\ C6,C7,C8,C9,CA,CB,CC,CD,CE,CF,D0,D1,D2,D3,D4,D5,D6,D7,D8,D9,DA,DB,DC,DD,\ DE,DF,E0,E1,E2,E3,E4,E5,E6,E7,E8,E9,EA,EB,EC,ED,EE,EF,160,161,162,163,16\ 4,165,166,167,168,169,16A,16B,16C,16D,16E,16F,170,171,172,173,174,175,17\ 6,177,178,179,17A,17B,17C,17D,17E,17F,180,181,182,183,184,185,186,187,18\ 8,189,18A,18B,18C,18D,18E,18F,190,191,192,193,194,195,196,197,198,199,19\ A,19B,19C,19D,19E,19F,1A0,1A1,1A2,1A3,1A4,1A5,1A6,1A7,1A8,1A9,1AA,1AB,1A\ C,1AD,1AE,1AF,1B0,1B1,1B2,1B3,1B4,1B5,1B6,1B7,1B8,1B9,1BA,1BB,1BC,1BD,1B\ E,1BF,1C0,1C1,1C2,1C3,1C4,1C5,1C6,1C7,1C8,1C9,1CA,1CB,1CC,1CD,1CE,1CF,1D\ 0,1D1,1D2,1D3,1D4,1D5,1D6,1D7,1D8,1D9,1DA,1DB,1DC,1DD,1DE,1DF,1E0,1E1,1E\ 2,1E3,1E4,1E5,1E6,1E7,1E8,1E9,1EA,1EB,1EC,1ED,1EE,1EF,1F0,1F1,1F2,1F3,1F\ 4,1F5,1F6,1F7,1F8,1F9,1FA,1FB,1FC,1FD,1FE,1FF,200,201,202,203,204,205,20\ 6,207,208,209,20A,20B,20C,20D,20E,20F,210,211,212,213,214,215,216,217,21\ 8,219,21A,21B,21C,21D,21E,21F,220,221,222,223,224,225,226,227,228,229,22\ A,22B,22C,22D,22E,22F,230,231,232,233,234,235,236,237,238,239,23A,23B,23\ C,23D,23E,23F,240,241,242,243,244,245,246,247,248,249,24A,24B,24C,24D,24\ E,24F,250,251,252,253,254,255,256,257,258,259,25A,25B,25C,25D,25E,25F,26\ 0,261,262,263,264,265,266,267,268,269,26A,26B,26C,26D,26E,26F,270,271,27\ 2,273,274,275,276,277,278,279,27A,27B,27C,27D,27E,27F,280,281,282,283,28\ 
4,285,286,287,288,289,28A,28B,28C,28D,28E,28F,290,291,292,293,294,295,29\ 6,297,298,299,29A,29B,29C,29D,29E,29F,2A0,2A1,2A2,2A3,2A4,2A5,2A6,2A7,2A\ 8,2A9,2AA,2AB,2AC,2AD,2AE,2AF,2B0,2B1,2B2,2B3,2B4,2B5,2B6,2B7,2B8,2B9,2B\ A,2BB,2BC,2BD,2BE,2BF,2C0,2C1,2C2,2C3,2C4,2C5,2C6,2C7,2C8,2C9,2CA,2CB,2C\ C,2CD,2CE,2CF,2D0,2D1,2D2,2D3,2D4,2D5,2D6,2D7,2D8,2D9,2DA,2DB,2DC,2DD,2D\ E,2DF,2E0,2E1,2E2,2E3,2E4,2E5,2E6,2E7,2E8,2E9,2EA,2EB,2EC,2ED,2EE,2EF,2F\ 0,2F1,2F2,2F3,2F4,2F5,2F6,2F7,2F8,2F9,2FA,2FB,2FC,2FD,2FE,ramlsfw
The discussion upstream does not seem to be converging on a proper fix in the kernel, so I'm going to clone this bug and suggest that debian-installer patch the start-udev script to ignore the failure of the udevadm trigger command. To summarize: init ends up calling start-udev which calls udevadm trigger to cold plug all devices. Both scripts are set -e. The Xen Virtual Keyboard driver and at least one other driver have always failed to trigger due to having absurdly long modalias, but the error used to be ignored. The kernel now returns the error to udevadm, so it exits with an error, so start-udev exits with an error, so init exits with an error, causing the kernel to panic.
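The chain summarized above can be reproduced in miniature. This is only a toy model under stated assumptions: fake-trigger, fake-start-udev and fake-init are hypothetical stand-ins written to a temporary directory, not the real udevadm, start-udev or init, and the only thing being shown is how set -e passes one failing command up through two layers of scripts.

```shell
#!/bin/sh
# Toy model of the boot chain. Every script here is a hypothetical
# stand-in; none of them is the real init, start-udev or udevadm.
workdir=$(mktemp -d)

# Stand-in for "udevadm trigger" failing on one device.
cat > "$workdir/fake-trigger" <<'EOF'
#!/bin/sh
exit 1
EOF

# Stand-in for start-udev: set -e aborts at the first failing command.
cat > "$workdir/fake-start-udev" <<'EOF'
#!/bin/sh
set -e
sh "$(dirname "$0")/fake-trigger"
echo "start-udev finished"
EOF

# Stand-in for init, also running under set -e.
cat > "$workdir/fake-init" <<'EOF'
#!/bin/sh
set -e
sh "$(dirname "$0")/fake-start-udev"
echo "init continues"
EOF

# Run the chain and record what came back.
chain_status=0
chain_output=$(sh "$workdir/fake-init") || chain_status=$?
echo "fake-init exit status: $chain_status"
rm -rf "$workdir"
```

Neither echo line is reached, and the non-zero status ends up as the exit status of the outermost script. In the installer the outermost script is init itself, and its exit is what makes the kernel panic.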
Hi Phillip,
And thanks for debugging this… I must confess I've never touched
anything Xen related and I'd like to keep it that way in the near
future. ;-)
Phillip Susi <phill@thesusis.net> (2021-05-19):
Well, it's a little more complicated:
- start-udev is actually a script shipped by the udev udeb, i.e. the
responsibility of systemd maintainers;
- based on a quick grep, the installer contains two calls to that
start-udev script, in the rootskel source package:
+ rootskel/src/init (shipped as /init in the udeb)
+ rootskel/src/sbin/init-linux (shipped as /sbin/init in the udeb)
I'd be happy to have a comment from systemd maintainers before thinking
about patching rootskel. :)
Cheers,
Hi Phillip

On 24.05.2021 at 06:19, Cyril Brulebois wrote:

So this is a change in behaviour in the kernel? What happens if you boot the installed system? Does udevadm trigger fail there as well?

I feel a bit uneasy changing the udev start script this late in the release cycle (especially when it appears like covering up an issue someplace else). I'll let Marco make the judgement on this though, as he has the most experience with those udev udeb start scripts as the original author.

Michael
Michael Biebl writes:
silently failing:
commit df44b479654f62b478c18ee4d8bc4e9f897a9844
Author: Peter Rajnoha <prajnoha@redhat.com>
Date: Wed Dec 5 12:27:44 2018 +0100
kobject: return error code if writing /sys/.../uevent fails
Propagate error code back to userspace if writing the /sys/.../uevent
file fails. Before, the write operation always returned with success,
even if we failed to recognize the input string or if we failed to
generate the uevent itself.
With the error codes properly propagated back to userspace, we are
able to react in userspace accordingly by not assuming and awaiting
a uevent that is not delivered.
Signed-off-by: Peter Rajnoha <prajnoha@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Yes, it does; that is how I was able to track down the problem.
So far I have been removing the -e from the shebang line in the
start-udev script and remastering the iso so I can get it to boot. It
would probably be a better idea to just add a || true to the udevadm
trigger call. I feel fairly certain that no matter what the cause of
the coldplug failure, the user is going to be better off ignoring it and
trying to proceed than a kernel panic.
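A minimal sketch of the || true idea, assuming a hypothetical failing_trigger function in place of the real udevadm trigger call. It is not the actual start-udev script; it only shows that a guarded command no longer stops a script running under set -e.

```shell
#!/bin/sh
# Hypothetical stand-in for a coldplug trigger that fails on one device.
failing_trigger() {
    echo "trigger: failed to write uevent" >&2
    return 1
}

# Runs under set -e in a subshell. The "|| true" guard turns the
# failure into success, so the lines after it are still executed.
run_guarded() (
    set -e
    failing_trigger 2>/dev/null || true
    echo "reached"
)

guarded_output=$(run_guarded)
echo "guarded: $guarded_output"
```

One variation would be to print a warning in place of true, so that the coldplug failure is still visible on the console while boot continues.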
Hello,

This bug was noticed on the debian-user list recently and I have been testing various workarounds. Instead of removing -e from the shebang line, I came up with prepending the udevadm trigger call in the start-udev script with

    dmesg | grep DMI: | grep 'Xen HVM domU' ||

This causes the offending udevadm trigger call to never be invoked when running in a Xen HVM DomU. On all other systems, the call should be invoked like normal.

With this hack, I was able to create a modified ISO and run the bullseye installer from it in a Xen HVM DomU and complete an install without the crash and reboot.

I can also confirm that I always see the coldplug failure on the installed system in a Xen HVM DomU, but in that case the failure does not cause a crash and the system boots normally after reporting the failure. I also do not see the problem in a Xen PV DomU, which I think is what the /install.amd/xen folder on the installation media is for.

Chuck Zmudzinski
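The guard described above can be written as a small function that is easier to test. This is a sketch only: coldplug is a hypothetical stand-in for the real udevadm trigger call, and the sample kernel log lines are invented for illustration, so the exact DMI text on a real domU may differ.

```shell
#!/bin/sh
# Returns 0 when the kernel log text on stdin has a DMI line naming a
# Xen HVM domU, mirroring: dmesg | grep DMI: | grep 'Xen HVM domU'
is_xen_hvm_domu() {
    grep 'DMI:' | grep -q 'Xen HVM domU'
}

# Hypothetical stand-in for the real "udevadm trigger" call.
coldplug() {
    echo "coldplug triggered"
}

# Skip the trigger on a Xen HVM domU, run it everywhere else.
# Reads the kernel log text from stdin.
maybe_coldplug() {
    if is_xen_hvm_domu; then
        echo "skipping coldplug on Xen HVM domU"
    else
        coldplug
    fi
}

# Sample log lines; the exact DMI wording is an assumption.
printf '%s\n' '[    0.000000] DMI: Xen HVM domU, BIOS 4.11 01/01/2019' | maybe_coldplug
printf '%s\n' '[    0.000000] DMI: Example Vendor Example Board, BIOS 1.0' | maybe_coldplug
```

On a running system the input would come from dmesg, as in the one-liner quoted in the message.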
After reviewing Phillip's message at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=983357#43 which seems to point to the root cause of this bug, I can add:

On my Xen HVM DomU I see the absurdly long modalias for the Xen Virtual keyboard that seems to be causing this crash in sysfs at /sys/devices/virtual/input/input2/modalias

But at /sys/devices/vkbd-0/modalias, I see just 'xen:vkbd', which would probably not result in an error in the udev script if this was also written as the modalias at /sys/devices/virtual/input/input2/modalias

So the Xen virtual keyboard appears more than once in sysfs, and modalias is not the same in the different places. This seems to be a problem.

I understand the correct way to fix this bug is by modifying the Xen virtual keyboard (and any other devices that might cause this crash) and not the start-udev script on the netinst installation media, which is so far the only available workaround. Hopefully Xen will accept a fix if we can come up with one.

I am willing to try to debug this by testing patches to the Xen virtual keyboard, and tips from anyone who knows how udev works would be helpful. Is there documentation in udev for device developers somewhere to consult that explains how to update old device drivers so they are compatible with the modern version? Does the Xen virtual keyboard need to be managed by udev? Is there a simple way to disable incompatible devices so udev ignores them?

Chuck Zmudzinski
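To compare the modalias values mentioned above without printing them in full, something like the following could list their lengths. It is a rough sketch: modalias_lengths is a name made up here, and the demo runs against a temporary directory with an abbreviated input modalias rather than against a real /sys tree.

```shell
#!/bin/sh
# Prints "<length> <path>" for every file named modalias below the
# given directory (default /sys/devices), longest first.
modalias_lengths() {
    find "${1:-/sys/devices}" -name modalias -type f 2>/dev/null |
    while IFS= read -r path; do
        # Drop the trailing newline, then count the remaining bytes.
        len=$(tr -d '\n' < "$path" | wc -c)
        echo "$((len)) $path"
    done | sort -rn
}

# Demo tree standing in for sysfs. The input modalias is abbreviated;
# the real one in this report runs to more than 2 KB.
demo=$(mktemp -d)
mkdir -p "$demo/vkbd-0" "$demo/virtual/input/input2"
printf 'xen:vkbd\n' > "$demo/vkbd-0/modalias"
printf 'input:b0001v5853pFFFFe0000-e0,1,k71,72\n' > "$demo/virtual/input/input2/modalias"
modalias_lengths "$demo"
rm -rf "$demo"
```

Run with no argument on a domU, the longest entries come first, which should make any oversized modalias easy to spot.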
They are two different devices, and they should have different modaliases.

Linux has code for discovering devices on each kind of bus, including virtual buses, and that code creates "bus devices" such as vkbd-0. At this point the kernel doesn't know what the device is capable of. The modalias for a bus device carries some identifying information that can be used to select a driver module for it.

The driver does know what the device is capable of, and how to use it. It will normally create one or more "class devices" that support a particular set of operations; in this case input device operations.

Class devices typically don't have modaliases, since they don't need another layer of drivers on top. However, for input devices the modalias carries information about the device's capabilities. These may trigger loading of the evdev or joydev module.

[...]

I think a proper fix would be one of:

a. If the Xen virtual keyboard driver is advertising capabilities it doesn't have, stop it doing that.
b. Change the implementation of modalias attributes to allow longer values.

It's not clear to me whether the Xen driver is advertising correctly or not. If it is, then the solution should be b, but that may be too disruptive a change to the kernel. So a reasonable workaround might be:

c. Change the input subsystem to limit the length of the capabilities part of the modalias.

Ben.
So workaround c would not involve disruptions to the kernel or systemd? Workaround c seems too disruptive for stable to me, but maybe could go into unstable and eventually into testing. A problem with the approach of fixing this bug in the Xen keyboard driver is that the fix must be implemented in the underlying Dom0 system, which could be almost anything - another Linux distro or Debian stable or oldstable. Any fix upstream would probably get into a bullseye Dom0, but not oldstable Dom0, but perhaps it could be provided as a backport for anyone who is still on oldstable for their Xen Dom0. Anyway, I will look into the Xen virtual keyboard capabilities. The only capability I can think of that would be useful in this context is that it supports live migration of a VM through some sort of hot-swapping capability. If it has that capability, a workaround to support it would be good. But if it does not have that capability or if such a capability is not needed for a keyboard, then it should probably stop advertising itself as being able or needing to do that. Ultimately, it is up to Xen to decide if they are going to make changes to its virtual keyboard. Chuck
Ben Hutchings <ben@decadent.org.uk> writes:

and so it has no way of knowing what keys it actually has. It is a fake input device designed to pass through whatever input the Xen hypervisor sends down. As such, any key could come in. If it doesn't advertise that it has all of these keys, then they would not be accepted by libinput when the hypervisor sends them down.

This seems to be the heart of the problem: libinput was designed assuming that all keyboards can and must report what keys are actually present, and then libinput tries to cram that information into the modalias rather than some other sysfs attribute as it should (or not at all... I still don't see how this information is actually supposed to be useful to userspace).

As for b), the problem isn't with the modalias attribute itself, but when the kernel tries to copy it into the environment block for the udev callout. The environment block is only a single page, and so limited to 4 KB. And that's for everything else that goes into the environment, not just the modalias.
Right, that's what I feared. xen-kbdfront is setting the bits for keys in the ranges [KEY_ESC, KEY_UNKNOWN) and [KEY_OK, KEY_MAX), which I think works out to 654 keys and 2362 bytes in the modalias.

I think modaliases aren't intended to be interpreted by user-space, other than processing wildcards when matching to modules. For input devices, the same information is available through other variables in the uevent, in a more compact form.

The information *is* useful for user-space; e.g. in initramfs-tools we recognise keyboard devices and add their drivers to the initramfs but ignore other input devices.

Text-based sysfs attributes are limited to a page, but udev receives uevents through netlink, not sysfs. The current limit on the environment of a uevent appears to be 2 KB (UEVENT_BUFFER_SIZE defined in <linux/kobject.h>). That seems like it *might* be easier to change, so long as user-space doesn't have a similar limit.

I looked into systemd/udev, and it seems to use an 8 KB buffer for receiving uevents:

https://sources.debian.org/src/systemd/247.9-1/src/libsystemd/sd-device/device-monitor.c/?hl=390#L390

But as a first step I think increasing the kernel buffer size to 4 KB would be enough. Perhaps someone could test whether this patch to the domU kernel makes udev happier?

--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -30,7 +30,7 @@
 
 #define UEVENT_HELPER_PATH_LEN 256
 #define UEVENT_NUM_ENVP 64 /* number of env pointers */
-#define UEVENT_BUFFER_SIZE 2048 /* buffer for the variables */
+#define UEVENT_BUFFER_SIZE 4096 /* buffer for the variables */
 
 #ifdef CONFIG_UEVENT_HELPER
 /* path to the userspace helper executed on an event */
--- END ---

Ben.
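The 654 keys and 2362 bytes quoted above can be checked with a few lines of shell. This is a sketch of the arithmetic only: it assumes each key is written as upper-case hex followed by a comma, and it takes the four key code values from <linux/input-event-codes.h>. The range_cost helper is made up for this example.

```shell
#!/bin/sh
# Key code values as defined in <linux/input-event-codes.h>.
KEY_ESC=1
KEY_UNKNOWN=240
KEY_OK=352      # 0x160
KEY_MAX=767     # 0x2ff

# Prints "<keys> <bytes>" for the half-open range [first, last),
# counting each key as its upper-case hex code plus one comma.
range_cost() {
    code=$1
    keys=0
    bytes=0
    while [ "$code" -lt "$2" ]; do
        entry=$(printf '%X,' "$code")
        keys=$((keys + 1))
        bytes=$((bytes + ${#entry}))
        code=$((code + 1))
    done
    echo "$keys $bytes"
}

low=$(range_cost "$KEY_ESC" "$KEY_UNKNOWN")
high=$(range_cost "$KEY_OK" "$KEY_MAX")
total_keys=$(( ${low% *} + ${high% *} ))
total_bytes=$(( ${low#* } + ${high#* } ))
echo "keys=$total_keys bytes=$total_bytes limit=2048"
```

Under those assumptions the script arrives at the same 654 keys and 2362 bytes, which is over the 2048-byte buffer before any other uevent variable is counted. The modalias quoted earlier in this report starts at key 71, so the string actually seen is somewhat shorter than this estimate, though that message still describes it as over 2 KB.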
[...] I don't think it would be very disruptive. It might require a kernel ABI bump, but we do those regularly during a stable release. And this bug is severe enough that I think a fix would be suitable for Debian stable. [...] I agree that we need to fix this for domU independently of any protocol change to allow discovery of which keys the underlying input device has. So we can't solve this with approach a. Ben.
I will try it in my bullseye Xen HVM DomU. I am not sure how to rebuild the installation media with a patched systemd, but I can patch my installed Xen HVM DomU system with a patched systemd with the increased buffer size and see if the Coldplug failure early in the boot process goes away. If so, then it is likely this patch to systemd would also fix the installation media.

If it doesn't work, I am also willing to try approach a by patching the Linux kernel xen-kbdfront driver by removing the for loops that advertise those 654 keys. I tend to agree with Phillip that this is totally unnecessary, but I suppose I could be wrong about that. I read the discussion Phillip had with the Xen developers and they seemed to want to keep the Xen keyboard driver as it is.

Chuck
On Wed, 2021-08-25 at 12:45 -0400, Chuck Zmudzinski wrote: [...] [...] Sorry for not being clear - this is a patch for the kernel. Instructions for rebuilding the kernel package are at <https://kernel-team.pages.debian.net/kernel-handbook/ch-common-tasks.html#s-common-official>. I agree that you should check whether this fixes the coldplug error before we try rebuilding the installer. Ben.
The build failed with an error. I used the test-patches script to start the build:

chuckz@debian:~/linuxdata/sources-bullseye/kernel/linux-5.10.46$ bash debian/bin/test-patches ../patch

with Ben's patch to UEVENT_BUFFER_SIZE in ../patch. The build was running for over an hour and then failed with the last few lines on the console as:

RT_SYMBOL
zl10039_attach           module: drivers/media/dvb-frontends/zl10039, version: 0xc2effb6f -> 0x603a565b, export: EXPORT_SYMBOL
zl10353_attach           module: drivers/media/dvb-frontends/zl10353, version: 0x1faf92c1 -> 0x0baa0cfe, export: EXPORT_SYMBOL
zpa2326_isreg_precious   ignored, module: drivers/iio/pressure/zpa2326, version: 0xc887d5f5 -> 0xed2234b3, export: EXPORT_SYMBOL_GPL
zpa2326_isreg_readable   ignored, module: drivers/iio/pressure/zpa2326, version: 0x55c1d540 -> 0x70643406, export: EXPORT_SYMBOL_GPL
zpa2326_isreg_writeable  ignored, module: drivers/iio/pressure/zpa2326, version: 0x0d49987b -> 0x28ec793d, export: EXPORT_SYMBOL_GPL
zpa2326_pm_ops           ignored, module: drivers/iio/pressure/zpa2326, version: 0xf9a2894f -> 0x709ae67b, export: EXPORT_SYMBOL_GPL
zpa2326_probe            ignored, module: drivers/iio/pressure/zpa2326, version: 0x76b08b58 -> 0xeb45a43b, export: EXPORT_SYMBOL_GPL
zpa2326_remove           ignored, module: drivers/iio/pressure/zpa2326, version: 0xdb120e61 -> 0x1121e8d3, export: EXPORT_SYMBOL_GPL
zpool_register_driver    module: vmlinux, version: 0x2caae392 -> 0x4e86309a, export: EXPORT_SYMBOL
zpool_unregister_driver  module: vmlinux, version: 0x29f4da85 -> 0x4bd8098d, export: EXPORT_SYMBOL
make[1]: *** [debian/rules.real:214: debian/stamps/build_amd64_none_amd64] Error 1
make[1]: Leaving directory '/home/chuckz/linuxdata/sources-bullseye/kernel/linux-5.10.46'
make: *** [debian/rules.gen:27: binary-arch_amd64_none_amd64_real] Error 2

I think I have all the prerequisites, and I could not find a build log to find a more specific error. I know debuild creates a buildlog in the .. folder when building packages, but the test-patches script didn't do that.

Chuck
Chuck Zmudzinski <brchuckz@netscape.net> writes: That was the first thing I tried and the libinput maintainer pointed out that if you don't advertise the keys, you can't use the keys. In other words, somebody presses that key on their keyboard and the domU won't recognize it.
I tried this patch but the build failed after running for over an hour. I am not sure why, as I have not built a Linux kernel in many years. So I will do this:

1) Try to build the unmodified kernel on my system just to be sure I am building the kernel correctly and that my hardware is OK. Once I could not build the Linux kernel until I replaced a bad memory card.

2) If that succeeds, I will try the patch with a bump to the abi version. From the output of the failed build and what I read in the section on the Debian kernel ABI name, I think that the system detected an ABI change and so it failed. The build was checking symbols when it failed.

This will take a little while because it takes over an hour to build the kernel on my system.

Chuck
Well, good news - It looks like Ben's patch works, I just tested it in my full install in a Xen HVM domU and all looks good. I did not see the Coldplug failure at the beginning of the boot - it is hard to miss in the bright red letters on the console, and even more convincing is the fact that another symptom of the bug is gone.

This bug manifests itself in udev not being able to write uevent data to sysfs for the Xen Virtual Keyboard. With Ben's patch of increasing the UEVENT_BUFFER_SIZE from 2048 to 4096, udev can write its uevent data to sysfs for the Xen Virtual Keyboard:

With the current 5.10.0-8 kernel:

chuckz@debian:~$ cat /sys/devices/virtual/input/input2/uevent
chuckz@debian:~$

With the patched kernel with a change to the ABI version from 8 to 8.1:

chuckz@debian:~$ uname -r
5.10.0-8.1-amd64
chuckz@debian:~$ cat /sys/devices/virtual/input/input2/uevent
PRODUCT=1/5853/ffff/0
NAME="Xen Virtual Keyboard"
PHYS="xenbus/device/vkbd/0"
PROP=0
EV=3
KEY=7fffffffffffffff ffffffffffffffff ffffffffffffffff...
MODALIAS=input:b0001v5853pFFFFe0000-e0,1,k71,72... really long MODALIAS

I expect with that patch the installation media will work in a Xen HVM domU.

Cheers,
Chuck
I tested this patch on my Xen HVM bullseye system and it appears 4k is enough for the UEVENT_BUFFER_SIZE to accommodate the Xen Virtual Keyboard's large modalias. I needed to follow the instructions in the Kernel team's handbook for changing the ABI name of the kernel for the build to succeed with the patch. I just bumped it from 8 to 8.1.

Results:

1. No coldplug failure reported at boot time.
2. With the patch the system can write uevent data to sysfs for the Xen Virtual Keyboard device.

With the current 5.10.0-8 kernel:

chuckz@debian:~$ cat /sys/devices/virtual/input/input2/uevent
chuckz@debian:~$

With the patched kernel with a change to the ABI version from 8 to 8.1:

chuckz@debian:~$ uname -r
5.10.0-8.1-amd64
chuckz@debian:~$ cat /sys/devices/virtual/input/input2/uevent
PRODUCT=1/5853/ffff/0
NAME="Xen Virtual Keyboard"
PHYS="xenbus/device/vkbd/0"
PROP=0
EV=3
KEY=7fffffffffffffff ffffffffffffffff ffffffffffffffff...
MODALIAS=input:b0001v5853pFFFFe0000-e0,1,k71,72... really long MODALIAS

---------------------------------------------------------------------------

So I think a test of the installation media in a Xen HVM with the 4k buffer in the kernel is the next step. I would also like to test a live CD in a Xen HVM with this patch. It was also reported to fail to boot in a Xen HVM on the debian-user list.

BTW, my compliments to the Debian Kernel Team for the excellent handbook on building kernels for Debian. It is easy to understand and made it very easy for me to build and test the patch even though I have not built a Linux kernel in many years, and I never built a Debian kernel before.

All the best,
Chuck
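The before and after check shown above could be scripted along these lines. This is a sketch under assumptions: uevent_state is a name invented here, the two demo files only imitate the empty and populated cases from the console output, and the MODALIAS value in the demo is abbreviated.

```shell
#!/bin/sh
# Prints "ok" when the file has a MODALIAS= line, "empty" when it reads
# back with no content, and "no-modalias" or "unreadable" otherwise.
# Content is read rather than tested with -s because sysfs files
# report a nominal size regardless of what they contain.
uevent_state() {
    if ! content=$(cat "$1" 2>/dev/null); then
        echo "unreadable"
    elif [ -z "$content" ]; then
        echo "empty"
    elif printf '%s\n' "$content" | grep -q '^MODALIAS='; then
        echo "ok"
    else
        echo "no-modalias"
    fi
}

# On a domU the path from the message above would be passed directly:
#   uevent_state /sys/devices/virtual/input/input2/uevent
# Here two temporary files stand in for the unpatched and patched cases.
demo=$(mktemp -d)
: > "$demo/unpatched"
printf 'PRODUCT=1/5853/ffff/0\nEV=3\nMODALIAS=input:b0001v5853pFFFFe0000-e0,1,k71,72\n' > "$demo/patched"
echo "unpatched: $(uevent_state "$demo/unpatched")"
echo "patched: $(uevent_state "$demo/patched")"
rm -rf "$demo"
```

The input device number (input2 here) may differ between systems, so the path is left as an argument.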
Results of more tests with the patched kernel:

1. Boot on dom0 - works normally, can create VMs, run Linux containers, etc.
2. Boot in Xen PV - works normally
3. Boot on bare hardware - works normally

I do not see any issues with the patched kernel on my system.

Cheers,
Chuck
Even though this patch has been tested to apparently fix this bug and the bug has been elevated to important and tagged patch and upstream, AFAICT there is no action yet upstream or anywhere else after more than three weeks. Is this patch dead as a possible fix for this bug? Best wishes, Chuck