#1098973 grub - fails to start zboot linux on riscv64: Unhandled exception: Store/AMO access fault

#1098973#5
Date:
2025-02-22 10:59:38 UTC
From:
To:
Hi,

Starting with version 6.13.3-1~exp1, the riscv64 kernel is shipped as an
EFI binary with the payload compressed with zstd (using the EFI_ZBOOT
config option). In addition to breaking non-EFI systems, this change
simply prevents the kernel from booting on a VisionFive 2 board:

| Loading Linux 6.13-riscv64 ...
| Loading initial ramdisk ...
| EFI stub: Decompressing Linux Kernel...
| Unhandled exception: Store/AMO access fault
| EPC: 00000000fb64a6ea RA: 00000000fb64a6da TVAL: 0000000040020020
| EPC: 000000003b9046ea RA: 000000003b9046da reloc adjusted
|
| Code: 0506 9526 4783 0015 4703 0005 3583 ed84 (0e23 fef9)
| UEFI image [0x00000000fe6aa000:0x00000000fe6d0fff] '/efi\boot\bootriscv64.efi'
| UEFI image [0x00000000fb646000:0x00000000fbe933ff] pc=0x46ea
|
|
| resetting ...
| reset not supported yet
| ### ERROR ### Please RESET the board ###
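
For readers following the trace, the image-relative offset printed as
pc=0x46ea can be recomputed from the EPC and the base address of the
second UEFI image in the dump. The following is only a minimal Python
sketch using the values printed above; the function name image_offset is
made up for illustration and is not part of U-Boot or any other tool.

```python
def image_offset(epc: int, image_start: int, image_end: int) -> int:
    """Return the offset of epc inside [image_start, image_end], or raise."""
    if not image_start <= epc <= image_end:
        raise ValueError("address is outside the image")
    return epc - image_start


# Values copied from the crash dump above.
EPC = 0x00000000FB64A6EA
RA = 0x00000000FB64A6DA
IMAGE = (0x00000000FB646000, 0x00000000FBE933FF)  # second "UEFI image" line

print(hex(image_offset(EPC, *IMAGE)))  # 0x46ea, matching pc=0x46ea
print(hex(image_offset(RA, *IMAGE)))   # 0x46da
```

Both addresses fall inside the second image, so these offsets are the
ones to look up in a disassembly of the EFI binary that was loaded.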

Regards
Aurelien

#1098973#10
Date:
2025-02-23 20:45:28 UTC
From:
To:
Please re-assign to the bootloader package.

Bastian

#1098973#15
Date:
2025-02-23 21:07:33 UTC
From:
To:
I disagree. The bootloader is u-boot and while it might be fixable at
this level, Debian should be bootable on the original firmware.

BTW, you never explained the reason for your changes. It only brings a
smaller kernel, nothing more. And a working kernel is better than a
smaller kernel that does not work.

#1098973#20
Date:
2025-02-23 21:24:27 UTC
From:
To:
It needs to be fixed nevertheless.  What do you mean by "original
firmware"?  What is this setup anyway?

Smaller images, so often faster load times.  Feature parity between
architectures.  Fulfils the (U)EFI interface and works fine in edk2.

As I am currently trying to assemble a list of all the interfaces the
kernel fulfils:  How would you define this?  Running this in u-boot is
not (U)EFI, but something more strict, or there is a bug in the kernel
decompressor.

Bastian

#1098973#25
Date:
2025-02-23 21:41:44 UTC
From:
To:
- Vision Five 2 board: https://www.starfivetech.com/en/site/boards
- Using U-Boot as the firmware
- Booting is done through grub (grub-efi-riscv64 package)
- Installed with debian-installer

A smaller image is nice, but not mandatory. Other architectures also use
an uncompressed kernel.

Feature parity, do you mean only with arm64 and loong64? EFI_ZBOOT is
not enabled on other architectures.

Just like the kernel before your change.

The uncompressed kernel is a perfectly valid EFI binary that can be run
under U-Boot with either Distro Boot and Grub or with the loadefi
command. It can also be run under EDK2 either directly or also through
Grub.

#1098973#30
Date:
2025-02-24 18:18:17 UTC
From:
To:
Hi,

Let me summarize the situation for external reviewers.

The kernel for riscv64 used to rely on CONFIG_EFI_STUB=y, enabling the
kernel to be used either as an EFI executable or as a conventional ELF
file. Unlike x86, this requires the kernel to be uncompressed, which is
why it was shipped as vmlinux. Note that this is not the only
architecture where the kernel is uncompressed; this is also the case for
ppc64el and many ports architectures.

Commit 16b5ae589a679 ("[arm64, riscv64] Enable EFI_ZBOOT") [1] changed
three things for riscv64:
1) Changed the kernel file that ends up in the package from the
   uncompressed one (arch/riscv/boot/Image) to the compressed one
   (arch/riscv/boot/vmlinuz.efi)
2) Enabled EFI_ZBOOT to compress the kernel payload and include a
   decompressor in the EFI binary
3) Changed the kernel compression from GZIP to ZSTD

Note that technically changes 2 and 3 have basically no effect on the
resulting package without change 1. Please also note that change 1 was
done without renaming the kernel from vmlinux to vmlinuz to match the
(probably unwritten) standard of shipping compressed kernels as vmlinuz
and uncompressed ones as vmlinux. OTOH such a change would have
probably broken many things.

This change was made without checking with the porters and without any
justification. I quickly noticed the commit and was worried about change
1, as it basically enforces UEFI booting. Although Debian Installer
defaults to a UEFI installation with the standard ISO media or UKI
image, it is technically possible to use a system booting directly from
U-Boot, which some users prefer (this is particularly useful for
switching between non-UEFI vendor kernels and Debian kernels). In
addition a non-UEFI kernel is important for KVM, as it currently doesn't
support running in S-mode, therefore requiring a non-UEFI kernel to be
loaded directly without any firmware.

As a porter I requested on IRC for the riscv64 part of the code to be
reverted. I was told this is not possible, as Debian Installer does not
support non-UEFI, that this change will target forky only, and that I
can simply use a script to extract the payload from the UEFI kernel.
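
To illustrate what such an extraction script could look like, here is a
minimal Python sketch. It assumes the zboot header layout used by the
kernel's EFI zboot stub ("MZ" at offset 0, "zimg" at offset 4,
little-endian payload offset and size at offsets 8 and 12, and a
NUL-terminated compression type at offset 24). That layout and the
function names are assumptions for illustration and should be checked
against the kernel source before relying on them.

```python
import struct


def parse_zboot_header(image: bytes) -> dict:
    """Parse an EFI zboot header (layout assumed, see lead-in)."""
    if len(image) < 0x38:
        raise ValueError("image too short for a zboot header")
    if image[0:2] != b"MZ" or image[4:8] != b"zimg":
        raise ValueError("not an EFI zboot image")
    payload_offset, payload_size = struct.unpack_from("<II", image, 8)
    # The compression type is a NUL-terminated ASCII string at offset 24.
    comp_type = image[24:0x38].split(b"\0", 1)[0].decode("ascii")
    return {"offset": payload_offset, "size": payload_size,
            "compression": comp_type}


def extract_payload(image: bytes) -> bytes:
    """Return the still-compressed payload of a zboot image."""
    hdr = parse_zboot_header(image)
    end = hdr["offset"] + hdr["size"]
    if end > len(image):
        raise ValueError("payload extends past the end of the image")
    return image[hdr["offset"]:end]


# Synthetic image for demonstration only; this is not a real kernel.
fake = bytearray(0x40)
fake[0:2] = b"MZ"
fake[4:8] = b"zimg"
struct.pack_into("<II", fake, 8, 0x40, 5)
fake[24:28] = b"zstd"
fake += b"hello"

print(parse_zboot_header(bytes(fake)))
print(extract_payload(bytes(fake)))
```

The extracted payload is still compressed and would need to be
decompressed separately (for example with the zstd tool) before it can
be used as a non-UEFI kernel.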

The situation worsened when I realized that the changes do not even work
on a real riscv64 board installed using the standard Debian installer:

| Loading Linux 6.13-riscv64 ...
| Loading initial ramdisk ...
| EFI stub: Decompressing Linux Kernel...
| Unhandled exception: Store/AMO access fault
| EPC: 00000000fb64a6ea RA: 00000000fb64a6da TVAL: 0000000040020020
| EPC: 000000003b9046ea RA: 000000003b9046da reloc adjusted
|
| Code: 0506 9526 4783 0015 4703 0005 3583 ed84 (0e23 fef9)
| UEFI image [0x00000000fe6aa000:0x00000000fe6d0fff] '/efi\boot\bootriscv64.efi'
| UEFI image [0x00000000fb646000:0x00000000fbe933ff] pc=0x46ea
|
|
| resetting ...
| reset not supported yet
| ### ERROR ### Please RESET the board ###

Sure, this has been tested as mentioned in the MR [2], but it appears
that booting a kernel with QEMU + EDK2 is not comparable to booting a
kernel with a real board + U-Boot + Grub. I agree that there is an issue
in the firmware / bootloader / kernel stack (my current wild guess is
that it's a Grub issue), but still that change currently results in a
non-working kernel.

At this stage I have not seen any strong argument for the original
commit. The reasons that have been given a posteriori are:
- Smaller images, so often faster load times.
- Feature parity between architectures.
- Fulfils the interface (U)EFI and works fine in edk2.

I don't believe the above reasons are enough to enforce a UEFI-only
kernel and break the boot on existing boards. In addition the "forky
only" argument doesn't hold, as many newer riscv64 devices are expected
during the lifetime of trixie and will require a kernel from
trixie-backports. That is why I submitted an MR [3] to revert the
riscv64-specific part of the commit.

Regards
Aurelien


[1] https://salsa.debian.org/kernel-team/linux/-/commit/16b5ae589a679acbc9e43de9cb691f42fe058068
[2] https://salsa.debian.org/kernel-team/linux/-/merge_requests/1362
[3] https://salsa.debian.org/kernel-team/linux/-/merge_requests/1384

#1098973#35
Date:
2025-02-24 19:23:33 UTC
From:
To:
Linux both with zboot and without zboot is a valid EFI binary.  But zboot
seems to uncover a bug in u-boot.

So, now we have the options:

- We target EFI, the decompressor is correct, then u-boot is broken.
- We target EFI, the decompressor is invalid, then the kernel is broken.
- We target u-boot restricted EFI, then we have to revert that for all
  three architectures.

What we still can do is work around this bug.  But this is a defined
state and requires both sides.

Bastian

#1098973#40
Date:
2025-02-24 21:32:37 UTC
From:
To:
I dug a bit.  Yes, this is the file from
linux-image-6.13-riscv64_6.13.3-1~exp1_riscv64.deb.  It contains the
mentioned instructions:

|     46da:       0506                    slli    a0,a0,0x1
|     46dc:       9526                    add     a0,a0,s1
|     46de:       00154783                lbu     a5,1(a0)
|     46e2:       00054703                lbu     a4,0(a0)
|     46e6:       ed843583                ld      a1,-296(s0)
|     46ea:       fef90e23                sb      a5,-4(s2)
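
As a cross-check of the disassembly, the faulting instruction can also
be decoded by hand from the two 16-bit parcels shown in parentheses on
the "Code:" line of the crash dump, (0e23 fef9). The sketch below is a
minimal Python decoder for RISC-V S-type stores only; the helper names
are made up for illustration.

```python
def halfwords_to_insn(lo: int, hi: int) -> int:
    """Join two little-endian 16-bit parcels into one 32-bit instruction."""
    return (hi << 16) | lo


def decode_store(insn: int) -> tuple[str, int, int, int]:
    """Decode a 32-bit S-type store into (mnemonic, rs2, imm, rs1)."""
    if insn & 0x7F != 0x23:
        raise ValueError("not a STORE opcode")
    funct3 = (insn >> 12) & 0x7
    rs1 = (insn >> 15) & 0x1F
    rs2 = (insn >> 20) & 0x1F
    # The immediate is split: bits 31:25 hold imm[11:5], bits 11:7 imm[4:0].
    imm = ((insn >> 25) << 5) | ((insn >> 7) & 0x1F)
    if imm & 0x800:  # sign-extend the 12-bit immediate
        imm -= 0x1000
    mnemonic = {0: "sb", 1: "sh", 2: "sw", 3: "sd"}[funct3]
    return mnemonic, rs2, imm, rs1


insn = halfwords_to_insn(0x0E23, 0xFEF9)
print(hex(insn), decode_store(insn))  # 0xfef90e23 ('sb', 15, -4, 18)
```

Register x15 is a5 and x18 is s2 in the standard ABI, so this agrees
with the "sb a5,-4(s2)" line at offset 46ea.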

I did not manage to get the crash you mentioned.  The u-boot out of
u-boot-qemu_2024.01+dfsg-7_all.deb can start both the uncompressed EFI
file and the zboot compressed one. Sadly, both cases then fail shortly
afterwards for an unrelated reason.

Using the uncompressed file:

| % qemu-system-riscv64 -m 1024 -nographic -machine virt -device virtio-rng-pci -bios ../qemu-riscv64/u-boot.bin -device loader,file=../../../../boot/plain,addr=0x84000000
| U-Boot 2024.01+dfsg-7 (Jan 09 2025 - 19:14:04 +0000)
| CPU:   rv64imafdch_zic64b_zicbom_zicbop_zicboz_ziccamoa_ziccif_zicclsm_ziccrse_zicntr_zicsr_zifencei_zihintntl_zihintpause_zihpm_zmmul_za64rs_zaamo_zalrsc_zawrs_zfa_zca_zcd_zba_zbb_zbc_zbs_ssccptr_sscounterenw_sstc_sstvala_sstvecd_svadu_svvptc
| Model: riscv-virtio,qemu
| DRAM:  1 GiB
| Core:  25 devices, 12 uclasses, devicetree: board
| Flash: 32 MiB
| Loading Environment from nowhere... OK
| In:    serial,usbkbd
| Out:   serial,vidconsole
| Err:   serial,vidconsole
| No working controllers found
| Net:   No ethernet found.
[…]
| => bootefi 0x84000000:0x1a61000
| No EFI system partition
| No EFI system partition
| Failed to persist EFI variables
| Booting /MemoryMapped(0x0,0x84000000,0x1a61000)
| EFI stub: Booting Linux Kernel...
| EFI stub: Using DTB from configuration table
| EFI stub: Exiting boot services...
| Unhandled exception: Environment call from M-mode
| EPC: 00000000baa1bd6c RA: 00000000baa1be9c TVAL: 0000000000000000
| EPC: 000000007b2ddd6c RA: 000000007b2dde9c reloc adjusted
|
| Code: 8562 85de 865a 86d6 8752 87ce 8866 88a6 (0073 0000)
| UEFI image [0x00000000bc488000:0x00000000bdee8fff]

Using the zboot compressed file:

| % qemu-system-riscv64 -m 1024 -nographic -machine virt -device virtio-rng-pci -bios ../qemu-riscv64/u-boot.bin -device loader,addr=0x84000000,file=../../../../boot/vmlinux-6.13-riscv64
| U-Boot 2024.01+dfsg-7 (Jan 09 2025 - 19:14:04 +0000)
[…]
| => bootefi 0x84000000:0x80d200
| No EFI system partition
| No EFI system partition
| Failed to persist EFI variables
| Booting /MemoryMapped(0x0,0x84000000,0x80d200)
| EFI stub: Decompressing Linux Kernel...
| EFI stub: Using DTB from configuration table
| EFI stub: Exiting boot services...
| Unhandled exception: Environment call from M-mode
| EPC: 000000008001bd6c RA: 000000008001be9c TVAL: 0000000000000000
| EPC: 00000000408ddd6c RA: 00000000408dde9c reloc adjusted
|
| Code: 8562 85de 865a 86d6 8752 87ce 8866 88a6 (0073 0000)
| UEFI image [0x00000000bd69b000:0x00000000bdee83ff]

The executed code is bogus, but identical both times.  It lives at
different addresses.

Bastian

#1098973#45
Date:
2025-02-24 21:50:58 UTC
From:
To:
I have not been able to reproduce the crash under QEMU. I believe it
could be due to the fact that QEMU doesn't trap unaligned accesses. So
far I only reproduced the issue on real hardware.

It works fine when the kernel is directly started from U-Boot with
bootefi. It only fails when U-Boot launches Grub and Grub launches the
EFI file.

You should use OpenSBI as the bios, and U-Boot in S-mode as the kernel.

#1098973#50
Date:
2025-02-25 12:27:37 UTC
From:
To:
does not work.  Also I found reports that it seems to work for others on
this hardware.[1]  So this whole ordeal is not a bug fix, but a workaround
for another, as yet unidentified, bug in one of the components.

So, I see the following steps to see what the heck happens:
- Upgrade u-boot.  The version in Debian is one year old and several new
  releases have been made since then.
- Build u-boot with SHOW_REGS to see what exactly it failed on.  The
  already shown TVAL register should contain the trapping address and
  that is pretty near to the loaded u-boot.
- Try to find what this code is for.  Sadly the Linux package does not
  retain debugging information for the EFI wrappers.
- Change the instruction into a trap to be able to see the same error in
  other environments and compare.

Yeah, thanks, found that as well.  With that Linux is able to boot
correctly.

Bastian

[1]: At least I read the last lines in this log this way
https://libera.irclog.whitequark.org/u-boot/2024-05-10

#1098973#55
Date:
2025-02-26 12:02:33 UTC
From:
To:
Breaks non-EFI systems. Isn't that like 95+% of arm64 boards?

Wrong.

Testing on real hardware seems useful ...
Someone said: "arm64 build *looks* good to me" (emphasis mine)

If it was tested on real hardware it would have said so and mentioned
on which hardware. It doesn't, so it's safe to assume it was NOT tested
on real hardware.

Indeed. You can configure QEMU to have the features you want/need. That
does not mean that all real boards support that.

That's due to compression. You can have compression without EFI.

Looking at https://github.com/edk2-porting I see the following repos:
- edk2-rk3588 ("EDK2 UEFI firmware for Rockchip RK3588 platforms")
- edk2-msm ("Broken edk2 port for Qualcomm platforms xD")

So there is *partial* support for some rk3588 based devices and broken
support for (some?) Qualcomm based devices. That's it.
Looking at the contributors for edk2-rk3588 I see there are *3* people
with more than 10 commits ... and one indicates he's inactive.
I haven't found any other indication it has some real momentum.
by upgrading Debian's 6.13.2 kernel (which works) to the 6.13.4 kernel.
FWIW/FTR: My Q64-A board has a self-compiled U-Boot 2024.10-rc6.

Aurelien indicated he wanted this bug to be about RISC-V, so I'll just
attach my serial log in case ppl want to see that.

TL;DR: My U-Boot found out that it CAN'T load Debian's 6.13.4 kernel and
tries the next one till it finds one which it can boot ...
which was my 6.13 kernel (without EFI_ZBOOT).

Most people use the bootloader/U-Boot that was shipped with the product
and never update it. I can understand why as the goal of the bootloader
is to boot the device, so when it does that ... why upgrade?

https://bugs.debian.org/1095745 is about broken backward compatibility
and that is a *kernel* bug.

My 0.02

#1098973#60
Date:
2025-02-26 20:01:20 UTC
From:
To:
First, we are talking about riscv64, not arm64.  Second, which arm64
board can boot from nothing?

The riscv64 installer only supports EFI.

edk2 is the reference implementation.  uboot is what everything else
uses.

This kernel (after unpacking) boots fine on non-UEFI.  So, what is the
problem?

u-boot can load it as a zboot image, as also mentioned.  grub(!) fails
for some reason.

Bastian

#1098973#65
Date:
2025-02-26 20:43:05 UTC
From:
To:
Control: clone -1 -2
Control: reassign -2 src:grub
Control: retitle -2 grub - fails to start zboot linux on riscv64: Unhandled exception: Store/AMO access fault
Control: severity -2 important

So cloning the bug to grub accordingly.

The kernel team intends to change riscv64 to zboot for forky, so this
bug needs to be identified.

Bastian

#1098973#82
Date:
2025-05-31 17:08:23 UTC
From:
To:
control: reassign -1 u-boot
control: found -1 2024.01+dfsg-7
control: fixed -1 2025.01-1

I have upgraded u-boot to version 2025.01, and I can't reproduce the
issue anymore. So I guess we can consider the issue fixed. Reassigning
the bug accordingly.

This means we now need to find a way for users to easily upgrade u-boot
before that happens, so that they are able to reboot their board after a
kernel upgrade.

Regards
Aurelien