#810964 only partial EDAC information with Xen

Package:
src:xen
Source:
xen
Submitter:
Andreas Pflug
Date:
2023-10-03 05:18:05 UTC
Severity:
normal
Tags:
#810964#5
Date:
2016-01-14 09:39:57 UTC
From:
To:
Debian 8.2 installed on a supermicro H8SGL Board, AMD 6128 with 4x4GB
ECC RAM.

When booting the plain kernel (stock Jessie 3.16 or backport 4.1 or
4.3), both memory controllers (mc0 and mc1) appear under
/sys/devices/system/edac/mc with two csrow* each, as expected. The same
happens when booting with Xen 4.1.4-3+deb7u1.

When booted with Xen 4.4.1, only mc1 with two RAM modules is visible,
although all 16GB RAM is available in the OS (xl info).
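
To compare boots, it can help to dump what the kernel exposes under
/sys/devices/system/edac/mc. The following is a minimal Python sketch and not
part of the original report; the helper name list_edac_controllers and its
root argument are illustrative assumptions, and the mcN/csrowN layout is taken
from the description above.

```python
import os
import re

# Path named in the report; the "root" parameter below is only there so the
# helper can be pointed at a copy of the tree (an assumption for illustration).
EDAC_MC_ROOT = "/sys/devices/system/edac/mc"


def list_edac_controllers(root=EDAC_MC_ROOT):
    """Return {"mcN": ["csrowM", ...]} for each memory controller under root."""
    controllers = {}
    if not os.path.isdir(root):
        # EDAC driver not loaded, or sysfs not mounted: nothing to report.
        return controllers
    for entry in sorted(os.listdir(root)):
        mc_dir = os.path.join(root, entry)
        if not re.fullmatch(r"mc\d+", entry) or not os.path.isdir(mc_dir):
            continue
        controllers[entry] = sorted(
            name for name in os.listdir(mc_dir) if re.fullmatch(r"csrow\d+", name)
        )
    return controllers
```

On the affected machine, running this under plain Linux would be expected to
list both mc0 and mc1, and under Xen 4.4.1 only mc1.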

#810964#10
Date:
2016-01-14 10:22:56 UTC
From:
To:
/sys/devices/system/edac/mc
#810964#15
Date:
2016-01-20 11:33:41 UTC
From:
To:
issue as a hypervisor one, but I'm not sure.

Would you mind reporting this to upstream per:
    http://wiki.xen.org/wiki/Reporting_Bugs_against_Xen
please.

We could forward it but I expect there will need to be some back and forth
with the maintainers so it makes sense for you to speak to them directly.
You can CC 810964@bugs.debian.org to keep this bug in the loop.

In addition to the information you provide here I would expect upstream to
want to see the full "xl dmesg" and "dmesg" from dom0 with and without Xen
and with 4.1.4 as well as something exhibiting the issue (4.6 is probably
best).

Thanks,
Ian.

#810964#20
Date:
2016-01-20 15:01:52 UTC
From:
To:
Initially reported to debian
(http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=810964), redirected here:

With AMD Opteron 6xxx processors, half of the memory controllers are
missing from /sys/devices/system/edac/mc
Checked with single 6120 (dual memory controller) and twin 6344 (2x dual
MC), other dual-module CPUs might be affected too.

Booting plain Linux (3.2, 3.16, 4.1, 4.3), all memory controllers are
listed under /sys/devices/system/edac/mc as expected. The same happens when
Xen 4.1 is used: all MCs are present.

Starting with Xen 4.4 (Debian Jessie), only mc1 (on the single CPU
machine) or mc2/mc3 (dual CPU machine) are present, although the full
system memory is accessible. Checked versions were 4.1.4 (Debian
Wheezy), 4.4.1 (Jessie) and 4.6.0 (Sid).

#810964#27
Date:
2016-01-21 16:41:29 UTC
From:
To:
As already indicated by Ian in that bug, you should supply us with
full kernel and hypervisor logs for both the good and bad cases
(ideally with the same kernel version used in both runs, so that we
can exclude kernel behavior differences).

Jan

#810964#32
Date:
2016-01-22 09:09:04 UTC
From:
To:
On 21.01.16 at 17:41, Jan Beulich wrote:
Here are some dmesg excerpts, all performed with Linux 4.1.3.

When booting with Xen 4.1.4:

AMD64 EDAC driver v3.4.0
EDAC amd64: DRAM ECC enabled.
EDAC amd64: F10h detected (node 0).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0:     0MB 1:     0MB
EDAC amd64: MC: 2:  2048MB 3:  2048MB
EDAC amd64: MC: 4:     0MB 5:     0MB
EDAC amd64: MC: 6:     0MB 7:     0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0:     0MB 1:     0MB
EDAC amd64: MC: 2:  2048MB 3:  2048MB
EDAC amd64: MC: 4:     0MB 5:     0MB
EDAC amd64: MC: 6:     0MB 7:     0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC MC0: Giving out device to module amd64_edac controller F10h: DEV 0000:00:18.2 (INTERRUPT)
EDAC amd64: DRAM ECC enabled.
EDAC amd64: F10h detected (node 1).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0:     0MB 1:     0MB
EDAC amd64: MC: 2:  2048MB 3:  2048MB
EDAC amd64: MC: 4:     0MB 5:     0MB
EDAC amd64: MC: 6:     0MB 7:     0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0:     0MB 1:     0MB
EDAC amd64: MC: 2:  2048MB 3:  2048MB
EDAC amd64: MC: 4:     0MB 5:     0MB
EDAC amd64: MC: 6:     0MB 7:     0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC MC1: Giving out device to module amd64_edac controller F10h: DEV 0000:00:19.2 (INTERRUPT)

When booting with Xen 4.4.1:

AMD64 EDAC driver v3.4.0
EDAC amd64: DRAM ECC enabled.
EDAC amd64: NB MCE bank disabled, set MSR 0x0000017b[4] on node 0 to enable.
EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
 Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
 (Note that use of the override may cause unknown side effects.)
EDAC amd64: DRAM ECC enabled.
EDAC amd64: F10h detected (node 1).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0:     0MB 1:     0MB
EDAC amd64: MC: 2:  2048MB 3:  2048MB
EDAC amd64: MC: 4:     0MB 5:     0MB
EDAC amd64: MC: 6:     0MB 7:     0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0:     0MB 1:     0MB
EDAC amd64: MC: 2:  2048MB 3:  2048MB
EDAC amd64: MC: 4:     0MB 5:     0MB
EDAC amd64: MC: 6:     0MB 7:     0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC MC1: Giving out device to module amd64_edac controller F10h: DEV 0000:00:19.2 (INTERRUPT)

Apparently Xen 4.4 doesn't report the BIOS flag correctly. I added
ecc_enable_override=1 to amd64_edac_mod, and then I get:

EDAC MC: Ver: 3.0.0
AMD64 EDAC driver v3.4.0
EDAC amd64: DRAM ECC enabled.
EDAC amd64: NB MCE bank disabled, set MSR 0x0000017b[4] on node 0 to enable.
EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
EDAC amd64: Forcing ECC on!
EDAC amd64: F10h detected (node 0).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0:     0MB 1:     0MB
EDAC amd64: MC: 2:  2048MB 3:  2048MB
EDAC amd64: MC: 4:     0MB 5:     0MB
EDAC amd64: MC: 6:     0MB 7:     0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0:     0MB 1:     0MB
EDAC amd64: MC: 2:  2048MB 3:  2048MB
EDAC amd64: MC: 4:     0MB 5:     0MB
EDAC amd64: MC: 6:     0MB 7:     0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC MC0: Giving out device to module amd64_edac controller F10h: DEV 0000:00:18.2 (INTERRUPT)
EDAC amd64: DRAM ECC enabled.
EDAC amd64: F10h detected (node 1).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0:     0MB 1:     0MB
EDAC amd64: MC: 2:  2048MB 3:  2048MB
EDAC amd64: MC: 4:     0MB 5:     0MB
EDAC amd64: MC: 6:     0MB 7:     0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0:     0MB 1:     0MB
EDAC amd64: MC: 2:  2048MB 3:  2048MB
EDAC amd64: MC: 4:     0MB 5:     0MB
EDAC amd64: MC: 6:     0MB 7:     0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC MC1: Giving out device to module amd64_edac controller F10h: DEV 0000:00:19.2 (INTERRUPT)

This restored both MCs, so the BIOS flag seems to be the culprit.
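
The excerpts above can be compared mechanically. The following is a rough
Python sketch, not something used in the original investigation; the function
name summarize_edac_nodes is hypothetical, and the two patterns are simply
copied from the dmesg lines quoted in this message.

```python
import re

# Patterns taken from the log lines quoted in this thread.
DETECTED_RE = re.compile(r"EDAC amd64: F[0-9a-fA-F]+h detected \(node (\d+)\)")
NB_DISABLED_RE = re.compile(
    r"EDAC amd64: NB MCE bank disabled, set MSR 0x[0-9a-fA-F]+\[\d+\] on node (\d+)"
)


def summarize_edac_nodes(dmesg_text):
    """Return (detected, nb_disabled) as sets of node numbers.

    detected:    nodes for which the driver went on to probe the controller.
    nb_disabled: nodes for which it complained about the NB MCE bank.
    """
    detected = set()
    nb_disabled = set()
    for line in dmesg_text.splitlines():
        match = DETECTED_RE.search(line)
        if match:
            detected.add(int(match.group(1)))
        match = NB_DISABLED_RE.search(line)
        if match:
            nb_disabled.add(int(match.group(1)))
    return detected, nb_disabled
```

Fed the Xen 4.4.1 excerpt, this should show node 0 only in the second set;
with ecc_enable_override=1, node 0 appears in both.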

Regards,
Andreas

#810964#37
Date:
2016-01-22 10:40:52 UTC
From:
To:
I wonder how valid this message is. We actually write this MSR with
all ones during boot.

However, considering involved functions like
nb_mce_bank_enabled_on_node() or node_to_amd_nb() taking
node IDs as inputs, and considering that PV guests (including
Dom0) don't have a topology matching that of the host, I doubt
very much that this driver is even remotely prepared to run
under Xen. It working on Xen 4.1.x would then be by pure
accident.

Jan

#810964#42
Date:
2016-01-22 11:33:53 UTC
From:
To:
On 22.01.16 at 11:40, Jan Beulich wrote:
The dmesg is identical with or without Xen 4.1, so I'd guess it does work
if flags are detected correctly.

Regards
Andreas

#810964#47
Date:
2017-05-13 22:36:56 UTC
From:
To:
I haven't yet done as much experimentation as Andreas Pflug has, but I
can confirm I'm also running into this bug with Xen 4.4.1.

I've only tried Linux kernel 3.16.43, but as Dom0:

EDAC MC: Ver: 3.0.0
AMD64 EDAC driver v3.4.0
EDAC amd64: DRAM ECC enabled.
EDAC amd64: NB MCE bank disabled, set MSR 0x0000017b[4] on node 0 to enable.
EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
AMD64 EDAC driver v3.4.0
EDAC amd64: DRAM ECC enabled.
EDAC amd64: NB MCE bank disabled, set MSR 0x0000017b[4] on node 0 to enable.
EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.

Whereas directly booting:

EDAC MC: Ver: 3.0.0
AMD64 EDAC driver v3.4.0
EDAC amd64: DRAM ECC enabled.
EDAC amd64: F10h detected (node 0).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0:     0MB 1:     0MB
EDAC amd64: MC: 2:     0MB 3:     0MB
EDAC amd64: MC: 4:     0MB 5:     0MB
EDAC amd64: MC: 6:     0MB 7:     0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0:  4096MB 1:  4096MB
EDAC amd64: MC: 2:     0MB 3:     0MB
EDAC amd64: MC: 4:     0MB 5:     0MB
EDAC amd64: MC: 6:     0MB 7:     0MB
EDAC amd64: using x4 syndromes.
EDAC amd64: MCT channel count: 2
EDAC amd64: CS0: Unbuffered DDR3 RAM
EDAC amd64: CS1: Unbuffered DDR3 RAM
EDAC MC0: Giving out device to module amd64_edac controller F10h: DEV 0000:00:18.2 (INTERRUPT)
EDAC PCI0: Giving out device to module amd64_edac controller EDAC PCI controller: DEV 0000:00:18.2 (POLLED)

I have not tried force-enabling ECC checking.  Since I place high value
on my data, I rate this as a rather important bug.

#810964#54
Date:
2017-05-15 08:02:53 UTC
From:
To:
Afaict the driver as is simply can't work in a Xen Dom0; it needs
enabling (read: para-virtualizing). I'm actually glad to see it doesn't
load (the worse alternative would be for it to load and then do the
wrong thing or give you a false sense of safety of your data).

Jan

#810964#59
Date:
2017-05-16 03:47:04 UTC
From:
To:
I'm unsure of how to evaluate the situation.  Since ECC is enabled in the
BIOS, data should be safe whether or not the EDAC driver loads.  I
/suspect/ the EDAC driver failing to load merely means reporting of ECC
errors won't happen.  I suspect the only paravirtualization needed is to
map the physical address of the soft|hard errors to the VM whose memory
range was affected.  What this affects is which VM should panic in case
of hard errors.

Depending upon the environment there may or may not be cause to report
soft errors anywhere beside Dom0.  In most cases a soft error will at
worst trigger a desire to replace the memory module, but not trigger a
panic for the affected VM.  It is only once a hard error occurs that it
is urgent to warn the affected VM and cause a panic; in this case it
may also be desirable to first alert Dom0 anyway.

As such I'm inclined to think force-enabling ECC EDAC monitoring in Dom0
is the best approach for now.  As long as a hard error doesn't occur in
Dom0's address range, Dom0 is in the best position to deal with the
situation.  The worst case is a hard error occurring in Xen's address
range, since that will mean all VMs on the machine are likely to be
toast.

I think this should be a fairly high priority for Xen since ECC memory is
a feature very common on systems running with a hypervisor.

#810964#64
Date:
2017-05-16 09:54:37 UTC
From:
To:
"Merely" being relative here: The missing reports mean a false feeling
of safety, as they may be early indications of later double-bit errors.
It is not clear to me whether perhaps the driver would better live in the
hypervisor in the first place for that reason.

And there's a second piece of paravirtualization needed: The driver
doesn't distinguish physical and machine address spaces, yet the
addresses reported by hardware are machine ones and hence would
generally need translation to physical ones in order to assign Dom0-
local meaning to them (or to determine that the address belongs to
another VM or the hypervisor).
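
The translation step described above can be illustrated with a toy model. The
Python sketch below is purely illustrative: the dictionary standing in for a
machine-to-physical table, the owner labels, and the function name
machine_to_physical are all assumptions made for the example and do not
reflect Xen's actual data structures or interfaces.

```python
PAGE_SHIFT = 12                      # 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def machine_to_physical(machine_addr, m2p):
    """Translate a machine address into (owner, owner-local physical address).

    m2p is a toy table {machine frame number: (owner, owner frame number)}.
    Returns None when the frame is not assigned to any domain in the table,
    e.g. because it belongs to the hypervisor itself.
    """
    mfn = machine_addr >> PAGE_SHIFT
    entry = m2p.get(mfn)
    if entry is None:
        return None
    owner, pfn = entry
    # Keep the offset within the page, swap the frame number.
    return owner, (pfn << PAGE_SHIFT) | (machine_addr & PAGE_MASK)


# Hypothetical table: two machine frames handed out to two different domains.
toy_m2p = {
    0x1000: ("dom0", 0x20),
    0x1001: ("domU-3", 0x5),
}
```

An address reported by the hardware would first be looked up this way; only if
the owner is Dom0 does the resulting physical address mean anything to a
driver running there.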

Jan

#810964#69
Date:
2017-05-16 10:08:11 UTC
From:
To:
The driver should probably live directly in Xen; it needs to program a
number of northbridge and CPU registers including interrupt information.

For the reporting side of things, it looks like it would require vMCE to
pass on fault information to guests.

~Andrew

#810964#74
Date:
2017-05-16 18:02:30 UTC
From:
To:
Merely reporting the machine address to Dom0 is already high value since
it lets you attribute the failure to a memory module.  Without that you
may have a VM or whole machine randomly crash for a completely unknown
reason.

#810964#79
Date:
2019-02-09 05:37:25 UTC
From:
To:
I'm seeing bug #810964 occur in Xen 4.8 as well.  Perhaps #810964
should be reassigned to xen-hypervisor-common or src:xen ?

I don't know whether it affects Xen 4.11 yet...

#810964#84
Date:
2019-02-11 23:11:11 UTC
From:
To:
Hi,

Since the issue seems to be a lack of functionality to support certain
hardware in the upstream Xen product, I would recommend not having a
bug open against Debian at all.

It's not that I don't value your use case, but I just think as a package
maintainer team that ships the currently released upstream software,
*we* cannot be of any value to you for this, sitting in between you and
upstream developers.

Sometimes we can work around some things by tweaking the build scripts
or other things, but I just need to be honest here about the fact that
we will not be able to help you get low level hypervisor features
implemented.

This means you will have to do things like hop on the upstream
development mailing list, build a reproducible failure case, search for
a developer that has similar hardware and wants to spend time on it,
donate hardware to someone to reproduce the error scenarios or learn how
to do it yourself, or whatever it takes. :)

Hans

#810964#89
Date:
2019-02-13 06:36:11 UTC
From:
To:
I had hopes of avoiding doing such.  Problem is there are so many pieces
of software I have to use that jumping on the mailing lists of each of
them would be akin to trying to read all of Usenet.  I may not be able
to avoid that here, but...


Looks like Xen's MCE support is in near-useless shape.  The code in the
git repository mentions documentation for family 10h; problem is that
covers almost entirely decade-old processors.  The last apparently significant
change was in 2014.  The copyright is to AMD, so I guess that means they
need more funding.

Looks like Intel has been offering more support to Xen.  :-(

I'm surprised at Xen's handling of MCE.  Given Xen's approach to things I
would expect MCE handling to be done more by Domain 0.  Let Domain 0
handle talking to the memory controller and merely have Xen map the
physical address to a domain and domain address.  Domain 0 can log all
correctable memory errors to a single location, and in case of an
uncorrectable error it can panic the machine.  (plus Linux's MCE support
is in better shape)

Handling MCE errors in non-Domain 0 only seems to make sense in HVM where
you want to simulate memory errors.

#810964#94
Date:
2019-02-22 05:11:18 UTC
From:
To:
to find appropriate information.

I'm thinking Debian's default Xen log level is a little too high.  Adding
"loglvl=info" doesn't put all that much in Xen's dmesg.  I'm suspecting
"mce_verbosity=verbose" may be a different story though.

"loglvl=info" gets me "AMD Fam10h machine check reporting enabled", so
looks like Xen is successfully getting its MCE support operational.

Taking a closer look at Dom0's dmesg though:

MCE: In-kernel MCE decoding enabled.
EDAC amd64: DRAM ECC enabled.
EDAC amd64: NB MCE bank disabled, set MSR 0x0000017b[4] on node 0 to enable.
EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
 Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
 (Note that use of the override may cause unknown side effects.)

So it seems Linux wants bit 4 of MSR_IA32_MCG_CTL set before it will
willingly enable MCE support (I've no idea what this does).
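
For reference, the check the message asks for is a single-bit test on the
value of MSR 0x17b. The Python sketch below is only an illustration: the
function names are made up here, and read_msr assumes the Linux msr driver is
loaded and that /dev/cpu/N/msr is readable (root only); under Xen, Dom0 may
see a virtualized value rather than the hardware register.

```python
import os
import struct

MSR_IA32_MCG_CTL = 0x17B   # MSR number quoted in the kernel message
NB_MCE_BANK_BIT = 4        # the "[4]" in "MSR 0x0000017b[4]"


def nb_mce_bank_enabled(mcg_ctl_value):
    """Return True if bit 4 (the NB MCE bank enable) is set in the value."""
    return bool((mcg_ctl_value >> NB_MCE_BANK_BIT) & 1)


def read_msr(cpu, msr):
    """Read a 64-bit MSR through /dev/cpu/<cpu>/msr (assumes the msr driver)."""
    fd = os.open("/dev/cpu/%d/msr" % cpu, os.O_RDONLY)
    try:
        # The msr device uses the file offset as the MSR number.
        return struct.unpack("<Q", os.pread(fd, 8, msr))[0]
    finally:
        os.close(fd)
```

A call such as nb_mce_bank_enabled(read_msr(0, MSR_IA32_MCG_CTL)) in Dom0
would show whether the value Dom0 sees has the bit set, independent of what
Xen wrote to the real register at boot.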

This was done in commit b272353fe98db5bdc73fff3c60a0574835df4c87.

True, they might have merely been noticed at the same time and in fact
be two distinct issues.  Having EDAC reporting broken is *very* bad.

I am left though noticing how the state of Xen's EDAC support looks
rather odd compared with how other bits of Xen are evolving.  Rather than going
more in a direction of para-virtualization, this code looks to be heading
more towards true virtualization.

A more PV type approach might be to let Dom0 handle decoding the machine
check registers.  Then Dom0 asks Xen for what is at physical address X,
then potentially turns this into a PV message to the appropriate
domain and potentially logs the event.  Such an approach could be used
to synthesize machine check events for testing VMs.  Qemu would then
need code to simulate the appropriate register values for a HVM.

#810964#109
Date:
2023-10-02 17:18:49 UTC
From:
To:
reassign 810964 src:linux
tags 810964 -moreinfo
affects 810964 src:xen
found 810964 5.10.191-1
found 810964 6.1.52-1
found 810964 6.5.3-1
found 810964 5.10.127-2~bpo10+1
found 810964 6.1.38-4~bpo11+1
found 810964 6.4.4-3~bpo12+1
quit

Upon further investigation, while some part of #810964 may be in Xen,
the biggest issue is in the Linux kernel.

Appears MCE/EDAC support for Xen was implemented around 2008-2012.  Since
that time the maintainer has changed and the new maintainer was unaware
the driver was supposed to function on Xen.

As such the current maintainer has been adding in constructs which are
incompatible with operation on Xen, and at 767f4b620eda overtly broke Xen
support.

Part of the fix may require adjustments to Xen, but right now the
immediate source of breakage is the Linux kernel.

As such I'm reassigning this to src:linux.

#810964#137
Date:
2023-10-03 05:14:27 UTC
From:
To:
Hi Elliott,

Can you report your findings upstream with the EDAC and xen subsystem
maintainers and keep this bug in the loop?

Regards,
Salvatore
