#988477 xen-hypervisor-4.14-amd64: xen dmesg shows (XEN) AMD-Vi: IO_PAGE_FAULT on sata pci device

Package:
src:xen
Source:
xen
Submitter:
Imre Szőllősi
Date:
2025-12-01 01:05:02 UTC
Severity:
normal
Tags:
#988477#5
Date:
2021-05-13 19:13:44 UTC
From:
To:
Dear Maintainer,

After a clean install of bullseye/testing, the Xen dmesg shows the following message:
(XEN) AMD-Vi: IO_PAGE_FAULT: 0000:01:00.1 d0 addr fffffffdf8000000 flags 0x8 I
This is the SATA device:
01:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller (rev 01)
or, on another motherboard:
01:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] Device 43eb
During write operations (e.g. dbench or a Windows guest) there are a lot of these messages.
Sometimes the filesystem goes read-only, and the Windows guest hits a BSOD.
Tested on 3 hardware setups:
1. asus prime b450m-a, ryzen 5 2600x, md raid1, 2x samsung 1TB 860evo, lvm: problem does appear
2. asus prime b550m-k, ryzen 5 5600x, md raid1, 2x samsung 1TB 870evo, lvm: problem does appear
3a. asus prime b550m-k, ryzen 5 5600x, 1x samsung 1TB 850evo, lvm: problem does not appear
3b. asus prime b550m-k, ryzen 5 5600x, 1x samsung 128GB 840pro, lvm: problem does not appear
3c. asus prime b550m-k, ryzen 5 5600x, samsung 1TB 850evo + samsung 128GB 840pro, lvm, dbench on 2 ssds in parallel: problem does appear

As far as I can see, the problem appears when data is written to 2 SSDs in parallel.

Thanks!

#988477#10
Date:
2021-06-13 13:58:52 UTC
From:
To:
I tested on a 4th hardware setup:

4. asus m4n78 pro, phenom ii x4 905e, md raid1, 2x samsung 1TB 860evo,
lvm: problem does not appear

As far as I can see, not every motherboard/chipset/SATA PCIe device is affected.

Thanks!

#988477#17
Date:
2021-08-05 20:46:39 UTC
From:
To:
severity 988477 normal
tags 988477 + moreinfo + upstream - bullseye-ignore
thanks

Hi!

Thanks for your report, and for trying out different combinations of
hardware.

After a short internet search about the problems you're seeing
with AMD Ryzen, SATA, NVMe and IOMMU, I suspect this problem does
not have a lot to do with Xen specifically, but more with the hardware
and its firmware.

This also means that it's not a Debian packaging problem, and it cannot
be fixed by me (or the Debian Xen team). If you want to research this
problem more, I can maybe be of some help by providing suggestions.
Still, you will have to do all of the actual work, since I do not have
your hardware here.

The first thing I would suggest is to boot just Linux, without Xen,
run the dbench test, and see whether the problem reproduces.

If you don't actually need to directly pass-through hardware to a Xen
guest, you can also try disabling iommu, or researching other iommu=
options that can serve as a workaround.

In any case, further reports will need to have more detailed
information. For example, instead of "there are a lot of messages",
provide a text attachment with a piece of logging that shows these messages.
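
As a minimal sketch of how such a log excerpt could be collected, the
Python snippet below filters the IO_PAGE_FAULT lines out of saved
`xl dmesg` output and counts them per PCI device. The regular expression
is modelled only on the single line quoted in the original report, so
other Xen versions may need a different pattern; `summarize_faults` and
the file name in the usage note are hypothetical names, not part of any
Xen tool.

```python
import re
from collections import Counter

# Pattern modelled on the one log line quoted in this bug report.
# Assumption: other Xen versions may format the message differently.
FAULT_RE = re.compile(
    r"\(XEN\) AMD-Vi: IO_PAGE_FAULT: "
    r"(?P<device>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) "
    r"d(?P<domain>\d+) addr (?P<addr>[0-9a-f]+) flags (?P<flags>0x[0-9a-f]+)"
)


def summarize_faults(log_text):
    """Return (matching log lines, Counter of faults per PCI device)."""
    lines = []
    per_device = Counter()
    for line in log_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            lines.append(line.strip())
            per_device[match.group("device")] += 1
    return lines, per_device
```

For example, after `xl dmesg > xen-dmesg.txt`, calling
`summarize_faults(open("xen-dmesg.txt").read())` returns the matching
lines (suitable for a text attachment) and the per-device counts.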

I'm tagging this bug 'moreinfo' now, since it will depend on your
availability and abilities to work on it to have it advance.

Have fun,
Hans van Kranenburg

#988477#30
Date:
2021-08-08 13:34:42 UTC
From:
To:

#988477#35
Date:
2024-01-18 16:04:13 UTC
From:
To:
tags 988477 - moreinfo
found 988477 4.17.2+76-ge1f9cb16e2-1~deb12u1
affects 988477 src:linux
severity 988477 critical
quit

I am also observing #988477.  This machine has an AMD Zen 4
processor.  The first observation was after the motherboard/processor was
swapped out; the older motherboard/processor was several generations old.

The pattern which is emerging is Linux MD RAID1 plus recent AMD processor
which has full IOMMU functionality.  The older machine was believed to
have an IOMMU, but the BIOS wasn't creating appropriate ACPI tables
(IVRS) and thus Xen was unable to utilize it.

This seems to be occurring with a small percentage of write operations.
Subsequent read operations appear to be fine.

I am not convinced this is a Xen bug.  I suspect this is instead a bug
in the Linux MD subsystem.  In particular, this could happen if the DMA
interface was designed assuming only a single device would ever access
any page, but the MD RAID1 driver is reusing the same page for both devices.

IOMMU page release could be handled by marking the page unused in a
device data structure, with the entry removed later by sweeping a table.
In such a case, if the MD RAID1 driver were to redirect the page to another
device between these two steps, the entry for the subsequent device could
be wiped out when invalidating the entry for the prior device.
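
The hypothesized sequence can be illustrated with a toy model in Python.
This is purely a sketch of the suspicion described here, not the actual
Xen or Linux code: `ToyIommu`, its methods, and the device names `sda`
and `sdb` are all invented for illustration, and a real IOMMU keeps
per-device tables with far more state.

```python
class ToyIommu:
    """Toy model only: one mapping table keyed by page, with deferred unmap."""

    def __init__(self):
        self.table = {}    # page -> device currently allowed to DMA to it
        self.pending = []  # pages marked unused, waiting for the sweep

    def map_page(self, device, page):
        self.table[page] = device

    def release_page(self, device, page):
        # Step 1: only mark the page as unused; the table entry stays for now.
        self.pending.append(page)

    def sweep(self):
        # Step 2: drop every pending page without checking who owns it now.
        for page in self.pending:
            self.table.pop(page, None)
        self.pending.clear()

    def dma_allowed(self, device, page):
        return self.table.get(page) == device


def simulate():
    """Replay the suspected ordering and report if the second DMA is allowed."""
    iommu = ToyIommu()
    page = 0x1000
    iommu.map_page("sda", page)      # write issued to the first mirror member
    iommu.release_page("sda", page)  # marked unused, not yet swept
    iommu.map_page("sdb", page)      # same page redirected to the second member
    iommu.sweep()                    # stale release wipes the new mapping
    return iommu.dma_allowed("sdb", page)
```

In this model `simulate()` returns `False`: the second device has lost
its mapping, which is the kind of situation that would surface as an
I/O page fault.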


Anyway, I'm also observing bug #988477.  This could also be a kernel bug.
So far no crashes/confirmed data loss have occurred, but sweeping the
mirror does turn up small numbers of inconsistencies.
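
For readers wanting to run a similar sweep, here is a minimal Python
sketch, assuming "sweeping the mirror" corresponds to an md check pass
driven through the kernel's sysfs files `sync_action` and
`mismatch_cnt`. The array name `md0` and the helper names are
hypothetical, and writing to `sync_action` requires root.

```python
from pathlib import Path


def request_check(array="md0", sysfs_root="/sys"):
    """Ask md to start a read-only consistency check of the array."""
    path = Path(sysfs_root) / "block" / array / "md" / "sync_action"
    path.write_text("check\n")


def read_mismatch_count(array="md0", sysfs_root="/sys"):
    """Return the mismatch counter md recorded during its last check pass."""
    path = Path(sysfs_root) / "block" / array / "md" / "mismatch_cnt"
    return int(path.read_text().strip())
```

Call `request_check()` first, wait for the check to finish (progress is
visible in /proc/mdstat), then read the counter; a non-zero value means
the mirror halves differed somewhere.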

#988477#48
Date:
2024-07-10 19:25:06 UTC
From:
To:
It was suggested as a debugging step, but adding the option
"iommu=no-intremap" to Xen's command-line may work as a short-term
mitigation for #988477.

#988477#53
Date:
2024-08-25 21:41:44 UTC
From:
To:
Hi Elliott,

I am changing the severity back to normal as the xen package works fine for
many people without any serious issues. From your last message it also seems
you found a workaround for your problem. Please don't change the bug severity
without at least giving an explanation why you think the new severity is
justified.

From the few log lines in this bug report this seems to be an upstream issue
with xen or the linux kernel. Please report your observations upstream. The
Debian xen team does not have the resources and knowledge to debug or fix such
problems. Once the issue has been identified and fixed upstream we can see if
we can backport a fix to our Debian packages, but this is only possible once
an upstream fix has landed.

Thanks,
Maxi

#988477#60
Date:
2024-08-25 22:58:30 UTC
From:
To:
Yet for some lucky people data is corrupted/lost.  There could be other
people who reproduce this, but don't send e-mail saying "me too" to this
bug report.

Presently the main reason there aren't very many reproductions is that few
people bother to use RAID with flash.  Initial reports are that
SSDs have a lower failure rate than disks, but the failure rate isn't
even close to zero.  Meanwhile, the data loss/corruption reproduces easily.

While both cases in #988477 were on systems with AMD hardware, I am
presently doubtful that is a requirement.  The most similar known bug was
found to be more severe on AMD hardware, but also to occur on Intel
hardware.  I suspect this issue may be similar; simply no one has noticed
the problem yet...

Something was found which seems to have made another issue more
prominent.  It may reduce the rate at which data corruption occurs, but
I've since confirmed data loss/corruption continues to occur.

I had thought the original reporter's justification was sufficient.  This
appears to have some specific requirements to meet, but if you meet them
you may be in trouble before alerts trigger.

So far both reports are with AMD machines with IOMMUv2 functionality (I
tried on a machine with IOMMUv1/GART and it didn't reproduce).  Both
reports feature Samsung SATA devices.  A NVMe device from another
manufacturer also showed the issue (I'm almost certain Samsung NVMe
devices will also show the issue).

I suspect Intel machines may also be affected by this issue, but it may
not manifest as severely.  I suspect this is a case of people with AMD
machines being a bit more wary of hardware failure (thus actually
bothering to use RAID1 even with flash devices).

Perhaps it has become easier to report things upstream, but the original
procedure was that reporters were supposed to report to bugs.debian.org and
NOT forward upstream.

The other problem is that I've run into a chasm with upstream and see no
way to build a bridge across.

I do have one more thing to try, but don't yet have a time-frame for
when I'll check that.

#988477#65
Date:
2024-09-03 21:58:18 UTC
From:
To:
found 988477 4.17.3+10-g091466ba55-1~deb12u1
severity 988477 critical
quit

Justification is the same as the original: data loss.  I'm unsure where the
border between "data loss" and "serious data loss" is, but the original
reporter declared it so and I don't disagree.

critical
    makes unrelated software on the system (or the whole system) break,
    or causes serious data loss, or introduces a security hole on systems
    where you install the package.

grave
    makes the package in question unusable or mostly so, or causes data
    loss, or introduces a security hole allowing access to the accounts
    of users who use the package.

Both of those are lists of conditions.  Since the conditions are
"causes serious data loss" and "causes data loss", those have been met
as there is no mention of "and cannot work acceptably for anyone".

The key word was "may".  I was being cautious when testing due to the
severity of the issue.  As stated in the previous message, it was found
to merely mildly change the messages and not fix the issue.

My understanding is being an upstream issue has no effect on severity.
It allows tagging as "upstream", but does not allow reducing severity.
The severity is meant as an alert to others there is a *severe* problem
lurking.

I've tried interacting with upstream, yet there has been a demand to
release `xl dmesg` output to a public area.  While I cannot state that any
information in `xl dmesg` can be used to compromise systems, nor
point to hardware serial numbers or other private data which leak into it,
it still triggers the TMI detector.

As such I'm uncomfortable with that being public and I don't know any way
to bridge that chasm.  If I were running an installation of 10K nodes I
wouldn't be too bothered with details of a single test machine leaking;
alas, I'm not in that category.

I could also send someone a pair of SATA devices known to manifest the
issue, but that has failed to generate interest.  As such I'm stuck.



Question for the original submitter, Imre Szőllősi, what was your
situation prior to seeing #988477 manifest?

Were you installing Xen 4.14 for the first time on Debian 11/bullseye?

Had you previously used Xen 4.11 with Debian 10/buster or earlier?

Knowing whether the bug was introduced between Xen 4.11 and Xen 4.14
would be valuable knowledge if you have it.  I had been using an older
processor with 4.14, so I hadn't observed it until 4.17.

#988477#74
Date:
2025-03-14 21:42:24 UTC
From:
To:
A fix [1] for the IO_PAGE_FAULT went into xen 4.20 which is now available in
testing and unstable.
The 4.20.0-1 Debian source package can also be compiled for bookworm if you
have a bookworm system running and want to test there. Please note that qemu
also needs to be recompiled for this xen version if you are using qemu.

Can anyone affected by this bug confirm whether their issue is fixed in
xen 4.20 or is still there?

[1] https://salsa.debian.org/xen-team/debian-xen/-/commit/b953a99da98d63a7c827248abc450d4e8e015ab6

#988477#81
Date:
2025-04-13 11:22:07 UTC
From:
To:
user debian-release@lists.debian.org
usertag 1091027 + bsp-2025-04-at-vienna
usertag 1057462 + bsp-2025-04-at-vienna
usertag 994274 + bsp-2025-04-at-vienna
tag 1091027 + pending
tag 1057462 + pending
tag 994274 + pending
thanks

Uploaded an NMU to DELAYED/0-day:

Kind regards
Philipp Kern

#988477#86
Date:
2025-04-13 22:22:01 UTC
From:
To:
The IO_PAGE_FAULT message and the
software RAID data loss are distinct bugs.  That patch/commit likely
makes the correlated message disappear, but almost certainly leaves the
software RAID data loss behind.

Do any of the Debian maintainers have an AMD machine set up for debugging?
I'm not very well set up for debugging this particular issue.  If you've
got an AMD machine with a pair of available SATA ports (including SATA
power!), I could send a pair of SATA devices known to readily reproduce
the issue.

#988477#91
Date:
2025-05-18 12:10:25 UTC
From:
To:
I'm not aware of anybody in our team having hardware where they can reproduce
this issue, else I'm sure they would have already provided feedback here.
There are also not many reports here of people running into this problem. Thus
I assume it needs a special (and probably rare) hardware combination to
trigger this.
One thing I can add is that I have been running software raid1 with Xen on two
SATA SSDs on an Intel CPU for many years without seeing any data corruption.

As the Debian package versions of xen, linux, etc. have changed a bit since the
last time this issue was reported as reproduced in this bug, it would be good
to get confirmation the problem is still there in Debian unstable or testing.

#988477#96
Date:
2025-05-29 00:20:52 UTC
From:
To:
I'm skeptical of it being rare, but certainly uncommon.  You've got some
similarity to the reproductions, but there are differences.

First question, what brand/model are the SSDs?  Samsung SSDs are known to
be affected (severely affected for some models), while Crucial/Micron
SSDs are unaffected (some models might be mildly affected).

Second question, where are the SATA ports?  Are they on-motherboard?  On an
add-on card?  The reproductions were with on-motherboard ports.

What generation is your processor?  Are you sure it has an IOMMU and Xen
is driving the IOMMU?  I had suspected Intel systems would be affected,
but you may have disproven this.

This is possible.

#988477#101
Date:
2025-07-04 00:25:27 UTC
From:
To:
Uh.  I did hope you could help narrow things down some.  Right now
we've got two confirmed reproductions, while you're the only person who
isn't seeing this reproduce.

The biggest difference is you've got a system with an Intel processor.
Yet we already know not all SSDs are affected, so it could be your pair are
ones which won't reproduce the issue.  On top of that, similar to the
spurious interrupt issue, it could be less severe on Intel processors,
and that has kept you safe.

Presently the shortage of reports seems mostly attributable to few people
using RAID1 with SSDs.

#988477#108
Date:
2025-11-13 04:37:31 UTC
From:
To:
It is too soon for me to declare it conclusively, but it appears the
Xen 4.20.0+68-g35cb38b222-1 and/or Linux 6.12.48-1 packages may resolve
bug #988477.  I mention kernel version since there is a possibility that
was part of the issue.

The bug will need to remain open for some time while I monitor things
though.  Quite a few mitigations were tried and removing those may reveal
the bug is still present.

#988477#113
Date:
2025-11-30 21:56:44 UTC
From:
To:
That would be good news. Let's keep the bug open for a bit and if it turns out
this issue is solved it can be closed.

For the record the SSDs in my system are Crucial and are connected to the
onboard SATA ports. The CPU is Intel Xeon Processor E3 v6 Family.

#988477#118
Date:
2025-12-01 01:03:00 UTC
From:
To:
Which would not be expected to be affected by the issue.  Crucial SATA
devices were already known to be unaffected.  Offerings sold under the
Micron brand might have exhibited the issue, but not Crucial SATA
(Crucial NVMe had mild issues).  Devices sold under the Samsung brand had
moderate to severe issues.  I suspect Intel flash devices would also have
been affected.

There was speculation as to whether systems with Intel IOMMUs would be
affected or not (alas you couldn't answer that).  Seems you got very
lucky this time.  I lucked out in not ignoring warning messages and
having backups.