Dear Maintainer,
after a clean install of bullseye/testing, the Xen dmesg shows the following message:
(XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.1 d0 addr fffffffdf8000000 flags 0x8
This is the SATA device:
01:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller (rev 01)
or, on another motherboard:
01:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] Device 43eb
During write operations (e.g. dbench or a Windows guest) there are a lot of these messages; sometimes the filesystem goes read-only and the Windows guest crashes with a BSOD.
Tested on 3 hardware setups:
1. asus prime b450m-a, ryzen 5 2600x, md raid1, 2x samsung 1TB 860evo, lvm: problem does appear
2. asus prime b550m-k, ryzen 5 5600x, md raid1, 2x samsung 1TB 870evo, lvm: problem does appear
3. asus prime b550m-k, ryzen 5 5600x, 1x samsung 1TB 850evo, lvm: problem does not appear
3. asus prime b550m-k, ryzen 5 5600x, 1x samsung 128GB 840pro, lvm: problem does not appear
3. asus prime b550m-k, ryzen 5 5600x, samsung 1TB 850evo + samsung 128GB 840pro, lvm, dbench on 2 ssds in parallel: problem does appear
As far as I can see, the problem appears when data is written to 2 SSDs in parallel.
Thanks!
I tested on a 4th hardware setup:
4. asus m4n78 pro, phenom ii x4 905e, md raid1, 2x samsung 1TB 860evo, lvm: problem does not appear
As far as I can see, not all motherboard/chipset/SATA PCIe devices are affected.
Thanks!
severity 988477 normal
tags 988477 + moreinfo + upstream - bullseye-ignore
thanks
Hi!
Thanks for your report, and for trying out different combinations of hardware.
After a short internet search about the problems you're seeing with AMD Ryzen, SATA, NVMe and IOMMU, I suspect this problem does not have a lot to do with Xen specifically, but more with the hardware and its firmware. This also means that it's not a Debian packaging problem, and it cannot be fixed by me (or the Debian Xen team).
If you want to research this problem more, I can maybe be of some help by providing suggestions. Still, you will have to do all of the actual work, since I do not have your hardware here.
The first thing I would suggest is to try to reproduce the problem when booting just Linux without Xen, and then running the dbench test. If you don't actually need to directly pass through hardware to a Xen guest, you can also try disabling the IOMMU, or researching other iommu= options that can serve as a workaround.
In any case, further reports will need to have more detailed information. For example, instead of "there are a lot of messages", provide a text attachment with a piece of logging that shows these messages.
I'm tagging this bug 'moreinfo' now, since its progress will depend on your availability and ability to work on it.
Have fun,
Hans van Kranenburg
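As a complement to attaching the raw log, the fault messages can be condensed into a per-device count. The following is a minimal sketch in Python, assuming every fault line has the same layout as the single line quoted in the original report; the function name `summarize_faults` is made up for illustration, and other Xen versions may word the message differently.

```python
import re
from collections import Counter

# Pattern modelled on the one line quoted in the report (assumption:
# other Xen versions may format IO_PAGE_FAULT messages differently).
FAULT_RE = re.compile(
    r"\(XEN\) AMD-Vi: IO_PAGE_FAULT: "
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) "
    r"d(?P<dom>\d+) addr (?P<addr>[0-9a-f]+) flags (?P<flags>0x[0-9a-f]+)"
)


def summarize_faults(log_text):
    """Count IO_PAGE_FAULT lines per (PCI device, domain id) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            counts[(match.group("dev"), int(match.group("dom")))] += 1
    return counts


# The line from the original report, used here as sample input.
SAMPLE = (
    "(XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.1 d0 "
    "addr fffffffdf8000000 flags 0x8"
)

for (dev, dom), n in summarize_faults(SAMPLE).items():
    print(f"{dev} d{dom}: {n} fault(s)")
```

To use it on a real system, the output of `xl dmesg` would be saved to a file and its contents passed to `summarize_faults` in place of `SAMPLE`. The raw log is still what should be attached to the report; the summary only helps to see which device the faults concentrate on.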
tags 988477 - moreinfo
found 988477 4.17.2+76-ge1f9cb16e2-1~deb12u1
affects 988477 src:linux
severity 988477 critical
quit
I am also observing #988477 occur. This machine has an AMD Zen 4 processor. The first observation was when the motherboard/processor was swapped out; the older motherboard/processor was several generations old.
The pattern which is emerging is Linux MD RAID1 plus a recent AMD processor which has full IOMMU functionality. The older machine was believed to have an IOMMU, but the BIOS wasn't creating the appropriate ACPI tables (IVRS) and thus Xen was unable to utilize it.
This seems to be occurring with a small percentage of write operations. Subsequent read operations appear to be fine.
I am not convinced this is a Xen bug. I suspect this is instead a bug in the Linux MD subsystem, in particular if the DMA interface was designed assuming only a single device would ever access any page, but the MD RAID1 driver is reusing the same page for both devices.
IOMMU page release could be handled by marking the page unused in a device data structure, with the entry later removed by sweeping a table. In such a case, if the MD RAID1 driver were to redirect the page to another device between these two steps, the entry for the subsequent device could be wiped out when trying to invalidate the entry for the prior device.
Anyway, I'm also observing bug #988477. This could also be a kernel bug. So far no crashes or confirmed data loss have occurred, but sweeping the mirror does turn up small numbers of inconsistencies.
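The suspected sequence above can be illustrated with a toy model. This is only a sketch of the hypothesis as stated in the message, not of how Xen or the Linux DMA/IOMMU code actually work; the class `ToyIommu`, its methods, and the device names `sda`/`sdb` are all invented for the illustration.

```python
class ToyIommu:
    """Toy model of deferred unmapping. NOT real Xen or Linux behaviour."""

    def __init__(self):
        self.table = {}    # page -> device currently allowed to access it
        self.pending = []  # pages marked unused, waiting for the sweep

    def map(self, device, page):
        self.table[page] = device

    def unmap_lazy(self, device, page):
        # Step 1: only mark the page as unused; the table entry stays.
        self.pending.append(page)

    def sweep(self):
        # Step 2: drop every marked page without checking who owns it now.
        for page in self.pending:
            self.table.pop(page, None)
        self.pending.clear()

    def access(self, device, page):
        # True if the device may access the page, False models a page fault.
        return self.table.get(page) == device


def raid1_write(iommu, page, sweep_between):
    """Write one page to both mirror legs, reusing the same page."""
    iommu.map("sda", page)
    iommu.unmap_lazy("sda", page)  # first leg finished
    if sweep_between:
        iommu.sweep()              # sweep runs before the page is reused
    iommu.map("sdb", page)         # same page handed to the second leg
    iommu.sweep()                  # late sweep may hit the new entry
    return iommu.access("sdb", page)


print(raid1_write(ToyIommu(), 0x1000, sweep_between=True))
print(raid1_write(ToyIommu(), 0x1000, sweep_between=False))
```

In this model the second leg keeps its mapping only when the sweep runs before the page is reused; when the sweep runs afterwards, it removes the entry that now belongs to the second device. Whether anything like this happens in the real code is exactly what remains unconfirmed in this report.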
It was suggested as a debugging step, but adding the option "iommu=no-intremap" to Xen's command-line may work as a short-term mitigation for #988477.
Hi Elliott, I am changing the severity back to normal as the xen package works fine for many people without any serious issues. From your last message it also seems you found a workaround for your problem. Please don't change the bug severity without at least giving an explanation why you think the new severity is justified. From the few log lines in this bug report this seems to be an upstream issue with xen or the linux kernel. Please report your observations upstream. The Debian xen team does not have the resources and knowledge to debug or fix such problems. Once the issue has been identified and fixed upstream we can see if we can backport a fix to our Debian packages, but this is only possible once an upstream fix has landed. Thanks, Maxi
Yet for some lucky people data is corrupted/lost. There could be other people who reproduce this, but don't send e-mail saying "me too" to this bug report. Presently the main reason there aren't very many reproductions is that few people are bothering to use RAID with flash. The initial reports are that SSDs have a lower failure rate than disks, but the failure rate isn't even close to zero. Whereas the data loss/corruption easily reproduces.
While both cases in #988477 were on systems with AMD hardware, I am presently doubtful that is a requirement. The most similar known bug was found to be more severe on AMD hardware, but also occurs on Intel hardware. I suspect this issue may be similar, and simply no one has noticed the problem yet...
Something was found which seems to have made another issue more prominent. It may reduce the rate at which data corruption occurs, but I've since confirmed data loss/corruption continues to occur.
I had thought the original reporter's justification was sufficient. This appears to have some specific requirements to meet, but if you meet them you may be in trouble before alerts trigger.
So far both reports are with AMD machines with IOMMUv2 functionality (I tried on a machine with IOMMUv1/GART and it didn't reproduce). Both reports feature Samsung SATA devices. An NVMe device from another manufacturer also showed the issue (I'm almost certain Samsung NVMe devices will also show the issue).
I suspect Intel machines may also be affected by this issue, but it may not manifest as severely. I suspect this is a case of people with AMD machines being a bit more wary of hardware failure (thus actually bothering to use RAID1 even with flash devices).
Perhaps it has become easier to report things upstream, but the original procedure was that reporters were supposed to report to bugs.debian.org and NOT forward upstream. The other problem is I've run into a chasm with upstream and no way to build a bridge across.
I do have one more thing to try, but don't yet have a time-frame for when I'll check that.
found 988477 4.17.3+10-g091466ba55-1~deb12u1
severity 988477 critical
quit
Justification is the same as the original: data loss. I'm unsure where the
border between "data loss" and "serious data loss" is, but the original
reporter declared it so and I don't disagree.
critical
makes unrelated software on the system (or the whole system) break,
or causes serious data loss, or introduces a security hole on systems
where you install the package.
grave
makes the package in question unusable or mostly so, or causes data
loss, or introduces a security hole allowing access to the accounts
of users who use the package.
Both of those are lists of conditions. Since the conditions are
"causes serious data loss" and "causes data loss", those have been met
as there is no mention of "and cannot work acceptably for anyone".
The key word was "may". I was being cautious when testing due to the
severity of the issue. As stated in the previous message, it was found
to merely mildly change the messages and not fix the issue.
My understanding is being an upstream issue has no effect on severity.
It allows tagging as "upstream", but does not allow reducing severity.
The severity is meant as an alert to others there is a *severe* problem
lurking.
I've tried interacting with upstream, yet there has been a demand to
release `xl dmesg` to a public area. While I cannot state any
information in `xl dmesg` can be used to compromise systems, nor can
point to hardware serial numbers or other private data which leak in, it
still triggers the TMI detector.
As such I'm uncomfortable with that being public and I don't know any way
to bridge that chasm. If I was an installation of 10K nodes I wouldn't
be too bothered with details of a single test machine leaking, alas I'm
not in that category.
I could also send someone a pair of SATA devices known to manifest the
issue, but that has failed to generate interest. As such I'm stuck.
Question for the original submitter, Imre Szőllősi, what was your
situation prior to seeing #988477 manifest?
Were you installing Xen 4.14 for the first time on Debian 11/bullseye?
Had you previously used Xen 4.11 with Debian 10/buster or earlier?
Knowing whether the bug was introduced between Xen 4.11 and Xen 4.14
would be valuable knowledge if you have it. I had been using an older
processor with 4.14, so I hadn't observed it until 4.17.
A fix [1] for the IO_PAGE_FAULT went into xen 4.20, which is now available in testing and unstable. The 4.20.0-1 Debian source package can also be compiled for bookworm if you have a bookworm system running and want to test there. Please note that qemu also needs to be recompiled for this xen version if you are using qemu.
Can anyone affected by this bug confirm whether their issue is fixed in xen 4.20 or is still there?
[1] https://salsa.debian.org/xen-team/debian-xen/-/commit/b953a99da98d63a7c827248abc450d4e8e015ab6
user debian-release@lists.debian.org
usertag 1091027 + bsp-2025-04-at-vienna
usertag 1057462 + bsp-2025-04-at-vienna
usertag 994274 + bsp-2025-04-at-vienna
tag 1091027 + pending
tag 1057462 + pending
tag 994274 + pending
thanks
Uploaded an NMU to DELAYED/0-day:
Kind regards
Philipp Kern
software RAID data loss are distinct bugs. That patch/commit likely makes the correlated message disappear, but almost certainly leaves the software RAID data loss behind.
Do any of the Debian maintainers have an AMD machine set up for debugging? I'm not very well set up for debugging this particular issue. If you've got an AMD machine with a pair of available SATA ports (including SATA power!), I could send a pair of SATA devices known to readily reproduce the issue.
I'm not aware of anybody in our team having hardware where they can reproduce this issue, else I'm sure they would have already provided feedback here. There are also not many reports here of people running into this problem. Thus I assume it needs a special (and probably rare) hardware combination to trigger this.
One thing I can add is that I have been running software raid1 with Xen on two SATA SSDs on an Intel CPU for many years without seeing any data corruption.
As the Debian package versions of xen, linux, etc. have changed a bit since the last time this issue was reported as reproduced in this bug, it would be good to get confirmation that the problem is still there in Debian unstable or testing.
I'm skeptical of it being rare, but certainly uncommon. You've got some similarity to the reproductions, but there are differences.
First question, what brand/model are the SSDs? Samsung SSDs are known to be affected (severely affected for some models), while Crucial/Micron SSDs are unaffected (some models might be mildly affected).
Second question, where are the SATA ports? Are they on-motherboard? On an add-on card? The reproductions were with on-motherboard ports.
What generation is your processor? Are you sure it has an IOMMU and Xen is driving the IOMMU?
I had suspected Intel systems would be affected, but you may have disproven this. This is possible.
Uh. I did hope you could help narrow things down some.
Right now we've got two confirmed reproductions, while you're the only person who isn't seeing this reproduce. The biggest difference is you've got a system with an Intel processor. Yet we already know not all SSDs are affected, so it could be that your pair are ones which won't reproduce the issue. On top of that, similar to the spurious interrupt issue, it could be that it is less severe on Intel processors and that has kept you safe.
Presently the shortage of reports seems mostly attributable to few people using RAID1 with SSDs.
It is too soon for me to declare it conclusively, but it appears the Xen 4.20.0+68-g35cb38b222-1 and/or Linux 6.12.48-1 packages may resolve bug #988477. I mention kernel version since there is a possibility that was part of the issue. The bug will need to remain open for some time while I monitor things though. Quite a few mitigations were tried and removing those may reveal the bug is still present.
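For the monitoring mentioned above, one option is to read the md mismatch counter after each check pass. The sketch below, in Python, assumes the array exposes the standard md sysfs attribute `mismatch_cnt` under `/sys/block/<array>/md/`; the array name `md0` and the helper names are placeholders for illustration only.

```python
from pathlib import Path


def read_mismatch_cnt(md_name, sysfs_root="/sys/block"):
    """Return the mismatch counter of an md array, e.g. md_name="md0"."""
    path = Path(sysfs_root) / md_name / "md" / "mismatch_cnt"
    return int(path.read_text().strip())


def report(md_name, sysfs_root="/sys/block"):
    """Format a one-line summary suitable for a log or a cron mail."""
    count = read_mismatch_cnt(md_name, sysfs_root)
    status = "consistent" if count == 0 else "mismatches found"
    return f"{md_name}: mismatch_cnt={count} ({status})"


if __name__ == "__main__":
    # "md0" is a placeholder; substitute the name of the mirror in question.
    if (Path("/sys/block") / "md0" / "md" / "mismatch_cnt").exists():
        print(report("md0"))
```

The counter is only meaningful after a check pass, which is started by writing `check` to the array's `sync_action` attribute as root. A non-zero value on RAID1 is not by itself proof of corruption, so a trend over several passes is likely more informative than a single reading.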
That would be good news. Let's keep the bug open for a bit and if it turns out this issue is solved it can be closed. For the record the SSDs in my system are Crucial and are connected to the onboard SATA ports. The CPU is Intel Xeon Processor E3 v6 Family.
Which would not be expected to be affected by the issue. Crucial SATA devices were already known to be unaffected. Offerings sold under the Micron brand might have exhibited the issue, but not Crucial SATA (Crucial NVMe had mild issues). Devices sold under the Samsung brand had moderate to severe issues. I suspect Intel flash devices would also have been affected.
There was speculation as to whether systems with Intel IOMMUs would be affected or not (alas you couldn't answer that).
Seems you got very lucky this time. I lucked out in not ignoring warning messages and having backups.