#570365 nvidia driver does not work under xen-hypervisor

#570365#5
Date:
2010-02-18 11:32:38 UTC
From:
To:
Hi,

starting X under xen leaves the console broken.


I just built a xen pv-ops kernel from git (2.6.31.6 + xen patches) and
compiled the nvidia-kernel module against it using the instructions
from http://en.opensuse.org/Talk:Use_Nvidia_driver_with_Xen:

  vi /usr/src/modules/nvidia-kernel/Makefile.kbuild

  Insert the following code after EXTRA_CFLAGS += -Wall..
  XEN_FEATURES := $(shell grep "D xen_features" /boot/System.map-$(shell uname -r) | colrm 17)
  EXTRA_LDFLAGS := --defsym xen_features=0x$(XEN_FEATURES)

  Close the file and set some environment variables:
  export IGNORE_XEN_PRESENCE=1

make-kpkg --append-to-version -xen-2010.02.18 --revision 2.6-xen-2010.02.18-1 --added-modules nvidia-kernel modules-image
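
The Makefile fragment above reads the address of xen_features from
System.map and passes it to the linker via --defsym. A minimal sketch of
just the lookup step follows. The System.map excerpt and its addresses are
made up for illustration, and `cut -c1-16` stands in for `colrm 17` (both
keep the first 16 columns):

```shell
#!/bin/sh
# Sketch: extract the address of the xen_features symbol from a System.map.
# The sample map below is invented for illustration; on a real system the
# file would be /boot/System.map-$(uname -r).

xen_features_addr() {
    # $1: path to a System.map file
    # Lines look like "<16 hex digits> <type> <symbol>"; keep only the
    # 16-character address, as `colrm 17` does in the Makefile fragment.
    grep "D xen_features" "$1" | cut -c1-16
}

sample_map=$(mktemp)
cat > "$sample_map" <<'EOF'
ffffffff81000000 T _text
ffffffff817a3c40 D xen_features
ffffffff817a3c60 D xen_start_info
EOF

addr=$(xen_features_addr "$sample_map")
echo "EXTRA_LDFLAGS := --defsym xen_features=0x$addr"
rm -f "$sample_map"
```

If the function prints nothing, the symbol is not in the map and the
--defsym option would end up with an empty address.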


After that I installed the deb and modprobed nvidia. Then when I start
X I get the following in dmesg:

Xorg: Corrupted page table at address 7fc15c1f9000
PGD 7b488067 PUD 7b4a5067 PMD 6c132067 PTE fffffffffffff237
Bad pagetable: 000f [#4] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:10.0/0000:03:00.0/boot_vga
CPU 3
Modules linked in: nvidia(P) fuse nf_nat_irc nf_nat snd_ens1371 snd_ac97_codec ac97_bus dmfe psmouse snd_hda_codec_nvhdmi snd_hda_codec_realtek usbhid snd_hda_intel snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss ohci_hcd snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd ehci_hcd soundcore forcedeth i2c_nforce2 usbcore snd_page_alloc [last unloaded: nvidia]
Pid: 8683, comm: Xorg Tainted: P      D    2.6.31.6-xen-2010.02.18 #1 Point of View
RIP: e033:[<00007fc15632ea45>]  [<00007fc15632ea45>] 0x7fc15632ea45
RSP: e02b:00007fff3ea41448  EFLAGS: 00010246
RAX: 00007fc15c1f9000 RBX: 0000000002210840 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000020 RDI: 0000000002218eb0
RBP: 0000000000000001 R08: 0000000000000058 R09: 0101010101010101
R10: 0000000000000000 R11: 00007fc15632ea30 R12: 0000000002218eb0
R13: 0000000000000000 R14: 00007fc15684a7a0 R15: 0000000000000001
FS:  00007fc15c1e3790(0000) GS:ffffc90000042000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fc15c1f9000 CR3: 000000006c117000 CR4: 0000000000000660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process Xorg (pid: 8683, threadinfo ffff880079b94000, task ffff88007d2673b0)

RIP  [<00007fc15632ea45>] 0x7fc15632ea45
 RSP <00007fff3ea41448>
---[ end trace d693108671cb486f ]---


X is killed, leaving the screen broken: no text console, nothing.
Luckily ssh logins are unaffected, so it isn't a complete crash.

MfG
	Goswin

#570365#10
Date:
2010-05-12 00:07:38 UTC
From:
To:
Hi,

nvidia-graphics-drivers 195.36.24 was uploaded to unstable today. Please
try to reproduce your problem with current driver and Debian stock
kernel 2.6.32-5 with xen support.

Andreas

#570365#17
Date:
2010-05-21 10:09:05 UTC
From:
To:
Hi,

I have the same problem here with the nvidia-kernel-dkms
package.
I am using driver version 195.36.24-1 (testing) and kernel
version 2.6.32-5-xen-686 (unstable).

#570365#22
Date:
2010-08-14 18:09:53 UTC
From:
To:
I tested these setups:

1/
xen hypervisor 4.0.1~rc3-1
+ linux-image "xen" (linux-image-2.6.32-bpo.5-xen-amd64)
+ nvidia 195.36.24 built from dkms

2/
linux-image "xen" (linux-image-2.6.32-bpo.5-xen-amd64)
+ nvidia 195.36.24 built from dkms

3/
linux-image "without xen" (linux-image-2.6.32-bpo.5-amd64)
+ nvidia 195.36.24 built from dkms

Test results: 1 fails, whereas 2 and 3 work.
When the system fails, we have the message:

   Xorg: Corrupted page table

Tests 2 and 3 work well: launched successfully glxinfo,
glxgears and openarena.

Maybe this bug is related to #470817 ?

Test results point at a problem between Xen Hypervisor and
NVIDIA driver memory management.
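
Setups 1 and 2 use the same Xen-capable kernel and differ only in whether
it is booted under the hypervisor. A rough sketch for telling the two
cases apart on a running system follows. It assumes the kernel exposes
/sys/hypervisor/type and, in dom0 with xenfs mounted,
/proc/xen/capabilities. The optional root-prefix argument is hypothetical
and exists only so the function can be tried against a fake tree:

```shell
#!/bin/sh
# Sketch: report whether the system is running under the Xen hypervisor.
# Assumption: /sys/hypervisor/type contains "xen" under Xen, and
# /proc/xen/capabilities contains "control_d" in dom0.

xen_mode() {
    # $1: optional root prefix (for testing); empty means the real system
    root=${1:-}
    if [ -r "$root/sys/hypervisor/type" ] &&
       [ "$(cat "$root/sys/hypervisor/type")" = "xen" ]; then
        if grep -q control_d "$root/proc/xen/capabilities" 2>/dev/null; then
            echo "xen dom0"
        else
            echo "xen guest"
        fi
    else
        echo "not under xen"
    fi
}

xen_mode
```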

Best regards,
Michel

#570365#29
Date:
2010-09-21 08:07:48 UTC
From:
To:
Thanks for the detailed test report. Could you try the new versions of the
driver, too? 195.36.31-3 is in unstable and 256.53-1 in experimental.
Also it would be nice if you could test with the official Debian 2.6.32 kernel
(and xen hypervisor) packages (preferably the versions from unstable) instead
of a backport kernel.

Andreas

#570365#34
Date:
2010-10-31 11:51:12 UTC
From:
To:
Hello,

my latest try: X11 starts up and freezes with a black screen.

I've updated the nvidia package to 195.36.31-5.

ii  nvidia-glx                    195.36.31-5
ii  nvidia-kernel-2.6-xen-amd64   195.36.31+1
ii  nvidia-kernel-2.6.32-5-amd64  195.36.31+2+4+2.6.32-24
ii  nvidia-kernel-2.6.32-5-xen-am 195.36.31+1+2+2.6.32-21
ii  nvidia-kernel-common          20100522+1
ii  nvidia-kernel-dkms            195.36.31-5
ii  nvidia-settings               195.36.24-1
ii  nvidia-vdpau-driver           195.36.31-5

Xen and Kernel packages are up to date with testing :

ii  xen-hypervisor-4.0-amd64      4.0.1-1

ii  linux-image-2.6-amd64         2.6.32+28
ii  linux-image-2.6.32-5-amd64    2.6.32-26
ii  linux-image-2.6.32-5-xen-amd6 2.6.32-26
ii  linux-image-xen-amd64         2.6.32+28

When I boot with Xen hypervisor + Linux 2.6.32-5, X11 starts up and
freezes with a black screen.

I've attached Xorg.log and an extract of /var/log/messages :

- messages.firstTrace : similar to the trace I sent to Debian bug #601869
  (even though in #601869 the nvidia module was not loaded) -- at this
  point the boot process had not yet reached X startup.

- messages.secondTrace : kernel messages when X has started.


Kind regards,
Michel

PS: I don't understand why some packages have a - in their version
number and some have a +.

#570365#39
Date:
2010-11-13 11:04:55 UTC
From:
To:
some older test report that didn't make it into this bug report ...


Michel Briand <michelbriand@free.fr> - Sat, 25 Sep 2010 15:22:45 +0200

Hum.... bad news.

With 2.6.32-5-xen-amd64 under the hypervisor, Xorg does start but hangs
with a black screen.

I used Magic SysRq to reboot.

I attach the Xorg log file.

#570365#44
Date:
2010-11-13 23:35:57 UTC
From:
To:
Hi Michel,

something you could try is to patch nv-linux.h in /usr/src/nvidia*/ and
change the following:

Change the line

    #if defined(CONFIG_XEN) && !defined(CONFIG_PARAVIRT)

to

    #if defined(CONFIG_XEN) // && !defined(CONFIG_PARAVIRT)

and rebuild the module.

Only try this with the xen kernel running under the hypervisor. This
patch will probably break the nvidia module that currently works for
the normal kernel and for the xen kernel when it is not running under
the hypervisor.

The intention of this patch is to explicitly reactivate some old-style
behaviour described in this comment in nv-linux.h, but I don't know if
the driver still compiles in this mode:

/*
 * Traditionally, CONFIG_XEN indicated that the target kernel was
 * built exclusively for use under a Xen hypervisor, requiring
 * modifications to or disabling of a variety of NVIDIA graphics
 * driver code paths. As of the introduction of CONFIG_PARAVIRT
 * and support for Xen hypervisors within the CONFIG_PARAVIRT_GUEST
 * architecture, CONFIG_XEN merely indicates that the target
 * kernel can run under a Xen hypervisor, but not that it will.
 *
 * If CONFIG_XEN and CONFIG_PARAVIRT are defined, the old Xen
 * specific code paths are disabled. If the target kernel executes
 * stand-alone, the NVIDIA graphics driver will work fine. If the
 * kernels executes under a Xen (or other) hypervisor, however, the
 * NVIDIA graphics driver has no way of knowing and is unlikely
 * to work correctly.
 */
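
Which side of the guard `#if defined(CONFIG_XEN) && !defined(CONFIG_PARAVIRT)`
a given kernel selects can be read from its config file. A minimal sketch
follows, assuming the usual `CONFIG_FOO=y` format (for example
/boot/config-$(uname -r)). The function names and the sample config are
made up for illustration:

```shell
#!/bin/sh
# Sketch: predict which side of the nv-linux.h guard
#   #if defined(CONFIG_XEN) && !defined(CONFIG_PARAVIRT)
# a kernel config selects. Only built-in (=y) options are considered.

has_opt() {
    # $1: config file, $2: option name
    grep -q "^$2=y\$" "$1"
}

nv_xen_paths() {
    # $1: config file
    if has_opt "$1" CONFIG_XEN && ! has_opt "$1" CONFIG_PARAVIRT; then
        echo "old-style Xen code paths enabled"
    elif has_opt "$1" CONFIG_XEN; then
        echo "old-style Xen code paths disabled (pv-ops kernel)"
    else
        echo "kernel has no Xen support"
    fi
}

# Invented sample resembling a pv-ops kernel config.
cfg=$(mktemp)
printf 'CONFIG_PARAVIRT=y\nCONFIG_XEN=y\n' > "$cfg"
nv_xen_paths "$cfg"
rm -f "$cfg"
```

A pv-ops kernel with both options set lands in the second case, which is
the situation the comment describes as unlikely to work under a hypervisor.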

If this still does not reactivate xen support under hypervisor, that
configuration is probably unsupported by upstream. At least upstream
docs do not say anything about xen at all ...
In that case I have no further clues and the only thing we can document
is: does not work in the xen kernel running under the hypervisor.
Or is there some xen patch for current drivers that I'm not aware of?

Andreas

#570365#51
Date:
2011-01-13 04:52:31 UTC
From:
To:
I believe this is still an ongoing issue. I did a full dist-upgrade
2 days ago. Here are what should be the relevant pkgs:

dpkg -l | egrep "linux-image|xen"
ii libxenstore3.0 4.0.1-1 Xenstore communications library for Xen
ii linux-headers-2.6.32-5-common-xen 2.6.32-29 Common header files for
Linux 2.6.32-5-xen
ii linux-headers-2.6.32-5-xen-amd64 2.6.32-29 Header files for Linux
2.6.32-5-xen-amd64
ii linux-image-2.6-amd64 2.6.32+28 Linux 2.6 for 64-bit PCs (meta-package)
ii linux-image-2.6.26-2-xen-amd64 2.6.26-15 Linux 2.6.26 image on AMD64,
oldstyle Xen support
ii linux-image-2.6.32-4-amd64 2.6.32-11 Linux 2.6.32 for 64-bit PCs
ii linux-image-2.6.32-5-amd64 2.6.32-29 Linux 2.6.32 for 64-bit PCs
ii linux-image-2.6.32-5-vserver-amd64 2.6.32-29 Linux 2.6.32 for 64-bit
PCs, Linux-VServer support
ii linux-image-2.6.32-5-xen-amd64 2.6.32-29 Linux 2.6.32 for 64-bit PCs,
Xen dom0 support
ii linux-image-xen-amd64 2.6.32+28 Linux for 64-bit PCs (meta-package),
Xen dom0 support, Xen dom0 support
ii linux-modules-2.6.26-2-xen-amd64 2.6.26-15 Linux 2.6.26 modules on AMD64
ii xen-hypervisor-4.0-amd64 4.0.1-1 The Xen Hypervisor on AMD64
ii xen-utils-4.0 4.0.1-1 XEN administrative tools
ii xen-utils-common 4.0.0-1 XEN administrative tools - common files
ii xenstore-utils 4.0.1-1 Xenstore utilities for Xen

I made the change requested in the previous post and attempted to build,
but the build failed. The errors look like this:


In file included from /var/lib/dkms/nvidia/195.36.31/build/nv.c:14:
/var/lib/dkms/nvidia/195.36.31/build/nv-linux.h:153:23: error:
asm/maddr.h: No such file or directory
In file included from
/usr/src/linux-headers-2.6.32-5-common-xen/include/linux/compat.h:14,
from
/usr/src/linux-headers-2.6.32-5-common-xen/arch/x86/include/asm/mtrr.h:173,
from /var/lib/dkms/nvidia/195.36.31/build/nv-linux.h:163,
from /var/lib/dkms/nvidia/195.36.31/build/nv.c:14:
/usr/src/linux-headers-2.6.32-5-common-xen/arch/x86/include/asm/compat.h: In
function ‘arch_compat_alloc_user_space’:
/usr/src/linux-headers-2.6.32-5-common-xen/arch/x86/include/asm/compat.h:210:
warning: pointer of type ‘void *’ used in arithmetic
/var/lib/dkms/nvidia/195.36.31/build/nv.c: In function ‘nv_kern_open’:
/var/lib/dkms/nvidia/195.36.31/build/nv.c:2245: error: implicit
declaration of function ‘HYPERVISOR_memory_op’
make[4]: *** [/var/lib/dkms/nvidia/195.36.31/build/nv.o] Error 1
make[3]: *** [_module_/var/lib/dkms/nvidia/195.36.31/build] Error 2
make[2]: *** [sub-make] Error 2
make[1]: *** [all] Error 2
make[1]: Leaving directory `/usr/src/linux-headers-2.6.32-5-xen-amd64'
make: *** [modules] Error 2
make: Leaving directory `/var/lib/dkms/nvidia/195.36.31/build'

perhaps i'm just missing a needed build/src module, lemme know if you
want me to get some source in there i'm missing. but i figure it could
just be an unavoidable invalid configuration with that modification.

anywho, i'll try to keep up on this report for a few days, i have it in
a good position to build, and the error is identical to originally
reported. frozen machine, black screen, crashed x with error about
corrupted page table.

#570365#56
Date:
2011-01-15 16:25:36 UTC
From:
To:
Hi,

I'm sorry Andreas, but I didn't have the time to test your proposed
change before now. It seems that someone did it (Joseph) and that the
driver does not compile.

It does not compile here either:

In file included from .../tmp/nvidia-195.36.31/nv.c:14:
.../tmp/nvidia-195.36.31/nv-linux.h:153:23: error: asm/maddr.h: No such file or directory

and

/usr/src/linux-headers-2.6.32-5-common/include/xen/interface/memory.h: At top level:
/usr/src/linux-headers-2.6.32-5-common/include/xen/interface/memory.h:32: error: expected specifier-qualifier-list before 'GUEST_HANDLE'

... plus 3 more definition errors.

and

.../tmp/nvidia-195.36.31/nv.c: In function 'nv_kern_open':
.../tmp/nvidia-195.36.31/nv.c:2245: error: implicit declaration of function 'HYPERVISOR_memory_op'

maddr.h does not exist on my machine.
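
A quick way to see which of the headers involved are actually shipped in a
headers tree is sketched below. The function name is made up, and the tree
path in the example call is the one from the build log earlier in this
report; adjust it to the installed headers package:

```shell
#!/bin/sh
# Sketch: list which headers are missing from a kernel headers tree.

missing_headers() {
    # $1: headers tree root; remaining args: header paths as written
    # in an #include <...> line
    tree=$1; shift
    for h in "$@"; do
        # Accept the header under any directory named "include" in the tree.
        if [ -z "$(find "$tree" -path "*/include/$h" -type f 2>/dev/null)" ]; then
            echo "$h"
        fi
    done
}

missing_headers /usr/src/linux-headers-2.6.32-5-common-xen \
    asm/maddr.h asm/xen/hypercall.h xen/interface/memory.h
```

Every header printed is absent from the tree. If the tree itself does not
exist, all of them are reported as missing.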

I've found that HYPERVISOR_memory_op is defined in hypercall.h.
So I've added hypercall.h to try to make this compile :

#if defined(CONFIG_XEN) // && !defined(CONFIG_PARAVIRT)
//#include <asm/maddr.h>
#include <asm/xen/hypercall.h>
#include <xen/interface/memory.h>
#define NV_XEN_SUPPORT_OLD_STYLE_KERNEL
#endif

But it fails soon after :

.../tmp/nvidia-195.36.31/nv-vm.c: In function 'nv_vm_malloc_pages':
.../tmp/nvidia-195.36.31/nv-vm.c:507: error: implicit declaration of function 'phys_to_machine'

I believe this comes from :

#if defined(NV_XEN_SUPPORT_OLD_STYLE_KERNEL)
#define NV_GET_DMA_ADDRESS(phys_addr) phys_to_machine(phys_addr)
#else
#define NV_GET_DMA_ADDRESS(phys_addr) (phys_addr)
#endif

However I managed to try a new test at work: I've installed the latest
squeeze kernel, xen and nvidia packages (as of 09/01/2011). I used dkms
-- great tool ;) -- to try nvidia driver version 195 and version 260
(from experimental). Both of them are nicely compiled and installed by
dkms! But none of them would work under Xen (DOM0). Same problem with
black screen and freeze a few seconds after X starts.

Best regards,
Michel

#570365#61
Date:
2011-03-16 11:36:48 UTC
From:
To:
Hello,
I am trying to run the nvidia drivers (260.19.44 and 195.36.24) on an
Asus AT3IONT-I Deluxe with Debian squeeze (2.6.32-5-xen-amd64 #1
SMP Wed Jan 12 05:46:49 UTC 2011 x86_64 GNU/Linux), but I also get errors:

Message from syslogd@neptun2 at Mar 16 12:20:13
kernel:[   36.146786] Bad pagetable: 000f [#2] SMP


Message from syslogd@neptun2 at Mar 16 12:20:13
kernel:[   36.146797] last sysfs file: /sys/bus/acpi/drivers/NVIDIA ACPI Video Driver/uevent

Is there any workaround?

Regards,

#570365#70
Date:
2011-11-24 15:01:28 UTC
From:
To:
Hello,
What is the conclusion of this bug?
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=570365

Are Debian squeeze, xen and the nvidia driver unsupported together, and
will they not work in this release?

What is causing the problem? Is it the nvidia driver? Is there a version
of the nvidia driver where this issue is fixed?

Is there a workaround or patch?

What needs to happen for this to work in squeeze?

Bug was marked as upstream/won't fix. What does that mean for users?

Thanks,
Lucas

#570365#75
Date:
2011-11-24 15:23:43 UTC
From:
To:
Lukasz Szybalski <szybalski@gmail.com> writes:

I'm pretty sure it doesn't work in squeeze and is not going to change
there. Or did the point release contain updates for nvidia?

But it should be working in wheezy or when you use a newer kernel and
nvidia on squeeze (backports?).

For me it was that the nvidia module would conflict with
CONFIG_XEN*. They simply failed to build even when following some
workarounds I found with google.

Try it with the latest kernel+driver and google if you have problems. I
hope the issue is fixed upstream by now but I simply haven't had time to
setup a test system to try again yet.

So what needs to happen now is that someone tries this again and reports
back.

Don't expect it to work in squeeze but it should be fixable with
backports (if it isn't fixed already). Again, someone needs to test and
report back.

MfG
        Goswin

#570365#80
Date:
2011-11-24 16:34:52 UTC
From:
To:

I have installed the "wheezy" nvidia driver and it still does not work
with the current 2.6.32 kernel, so the future version of Debian will
probably not work either. The module compiles fine for me using m-a, but
Xorg runs at 100% CPU and I get a black screen, unable to do anything.
It's a "grave" situation :(