#724944 xserver-xorg-video-intel: segfaults when trying to play movies

Package:
xserver-xorg-video-intel
Source:
xserver-xorg-video-intel
Description:
X.Org X server -- Intel i8xx, i9xx display driver
Submitter:
Bas Wijnen
Date:
2013-11-04 10:06:09 UTC
Severity:
important
#724944#5
Date:
2013-09-29 20:01:18 UTC
From:
To:
I can reliably crash the X server by starting almost any video with almost any
player.  At least mplayer, xine, totem, and xawtv (showing images from my
webcam) have managed to crash it.  Log file attached.  If you need debugging
symbols or anything like that, let me know how to generate it and I'm happy to
send it.

The only way I have currently found to play video without crashing X, is using
mplayer -vo x11.  I think that disables a lot of hardware acceleration, but I'm
by no means an expert on this topic.

Thanks,
Bas

#724944#10
Date:
2013-09-29 20:18:00 UTC
From:
To:
Bas Wijnen <wijnen@debian.org> (2013-09-29):

No, that isn't data loss.
http://x.debian.net/howto/use-gdb.html

Also, where's the bug script output?

Mraw,
KiBi.

#724944#17
Date:
2013-09-30 01:21:12 UTC
From:
To:
Control: notfound -1 2:2.21.15-1+b2

I sure lost data on it, but I'm not disputing that the lower severity is
reasonable.

The backtrace was included in the log file, hence I didn't think it would be
useful to generate another one.  I can see how it would be helpful to have a
trace with symbols, but I needed help with getting those; that was on the page
however, so I installed xserver-xorg-core-dbg.  For that, I needed to upgrade
to the +b2 version, which doesn't crash anymore...

I don't have a proper mail system set up on this host, so reportbug doesn't
work well (I think).  But you're right, I should have generated that list.
Sorry about that.

Thanks for your quick reply and sorry for the noise,
Bas

#724944#22
Date:
2013-09-30 02:45:32 UTC
From:
To:
Ok, so it isn't as easily reproducible as it was and now it seems many times a
video will actually play without a problem, but at random times, it still
happens.  I didn't yet change my gdm config, so I don't have a backtrace with
symbols for you yet.  (Then again, I'm not sure if that will be much better
when I do get a core; would you expect more symbols than this?)

[519676.567] (EE) Backtrace:
[519676.567] (EE) 0: /usr/bin/Xorg (xorg_backtrace+0x49) [0xb76dff89]
[519676.567] (EE) 1: /usr/bin/Xorg (0xb7540000+0x1a3d14) [0xb76e3d14]
[519676.567] (EE) 2: linux-gate.so.1 (__kernel_rt_sigreturn+0x0) [0xb751e40c]
[519676.567] (EE) 3: /usr/lib/xorg/modules/drivers/intel_drv.so (0xb6dbb000+0xfcd79) [0xb6eb7d79]
[519676.567] (EE) 4: /usr/lib/xorg/modules/drivers/intel_drv.so (0xb6dbb000+0xf8215) [0xb6eb3215]
[519676.567] (EE) 5: /usr/bin/Xorg (0xb7540000+0x9776c) [0xb75d776c]
[519676.567] (EE) 6: /usr/bin/Xorg (XvdiPutImage+0x1cc) [0xb762203c]
[519676.567] (EE) 7: /usr/bin/Xorg (0xb7540000+0xe3638) [0xb7623638]
[519676.567] (EE) 8: /usr/bin/Xorg (ProcXvDispatch+0x2e) [0xb7625d9e]
[519676.567] (EE) 9: /usr/bin/Xorg (0xb7540000+0x3c35d) [0xb757c35d]
[519676.567] (EE) 10: /usr/bin/Xorg (0xb7540000+0x2a38a) [0xb756a38a]
[519676.568] (EE) 11: /lib/i386-linux-gnu/i686/cmov/libc.so.6 (__libc_start_main+0xf5) [0xb710f8c5]
[519676.568] (EE) 12: /usr/bin/Xorg (0xb7540000+0x2a768) [0xb756a768]
[519676.568] (EE)
[519676.568] (EE) Segmentation fault at address 0xe
[519676.568] (EE)
Fatal server error:
[519676.568] (EE) Caught signal 11 (Segmentation fault). Server aborting

Reportbug's output is attached.  Note that the logfile it includes is from a
server I ran with X :1 vt9 and tried unsuccessfully to crash.  The above is
from a logfile where the server did crash.

Thanks,
Bas

#724944#27
Date:
2013-09-30 03:14:21 UTC
From:
To:
I've now made it crash while it was running with -core.  It took me some time
to find the core file (it would be good to mention on that page that for (new)
gdm, it is in /var/lib/gdm3).

Anyway, I attached the gdb logs of the bt and bt full commands.  I don't know
why it says it has no symbol table; I did install the -dbg package.  Should I
do something to load the symbols from that package?

Thanks,
Bas

#724944#32
Date:
2013-09-30 20:39:47 UTC
From:
To:
That trace doesn't seem to have debug symbols for the driver.

Cheers,
Julien

#724944#37
Date:
2013-10-01 00:25:28 UTC
From:
To:
I didn't realize there was a per-driver debug package; it would be useful if
the FIXME in the gdb document at least mentioned that.

Hopefully the attached file is better, even though it still complains about
some missing symbols.  Please tell me what to do or install if you need more
information.

Thanks,
Bas

#724944#42
Date:
2013-10-01 14:51:01 UTC
From:
To:
Still no info about what's going on in the driver.

Cheers,
Julien

#724944#47
Date:
2013-10-01 15:01:01 UTC
From:
To:

#724944#52
Date:
2013-10-11 12:38:12 UTC
From:
To:
First of all, I can see how you're busy, but if you think my problem is
trivial, please just tell me so.  If I'm sending a message saying "I don't know
how to continue", even explicitly saying that I know this may not be what you
need, a reply only saying "this is not what we need" is totally unhelpful.  It
shouldn't be too much effort to type the extra sentence "what you describe you
did should have worked, did you try restarting the server?" (which I thought I
did, but I suppose I didn't).  A line like that helps more than you might
think; it confirms that I was on the right track.  Since I don't know much
about the code or how it's supposed to work, that is good to know.

Anyway, a server which is unable to play any video 5 minutes after starting
gives quite a strong motivation to fix things.  So after I got a backtrace, I
debugged the thing.  The problem was that the result of
intel_get_pixmap_private() could be NULL, but that wasn't checked.  So I
grepped for it and added checks to all calls of that function.  The patch is
attached.  You will want to check if I'm handling it the right way everywhere,
because I just guessed the proper course of action.  Then again, most code
would segfault without handling it, so perhaps most of these can't ever be
triggered anyway (but I'm not too sure about that; it certainly can set it to
NULL when calloc fails).

Thanks,
Bas

#724944#64
Date:
2013-10-11 18:53:03 UTC
From:
To:
Thanks.  Can you please send this upstream to
intel-gfx@lists.freedesktop.org?

Cheers,
Julien

#724944#67
Date:
2013-10-11 19:24:54 UTC
From:
To:
Hello,

My X server was crashing when playing video, and I wrote a patch to fix
it.  Please find the background and the patch at
http://bugs.debian.org/724944 .

Thanks,
Bas

#724944#72
Date:
2013-10-12 03:15:09 UTC
From:
To:
Done.  (I didn't subscribe to the list; not sure if that was required.
My mail wasn't bounced, so I suppose it worked.)

By the way, I just noticed that while the patch does prevent the server
from crashing, it doesn't actually solve the problem: videos are now all
black.  Not crashing the server is certainly an improvement, but this is
still unusable. :-(

I'm guessing the problem is whatever sets the intel_pixmap_private field
to NULL, but I have no idea where to look for that, or how to debug it.

It only happens after the server has been running for some time (a few
minutes), which sounds like it will not be easy to track down,
unfortunately.  If anyone wants to try, or can tell me what I can try,
please let me know.

Thanks,
Bas

#724944#73
Date:
2013-10-12 20:46:14 UTC
From:
To:
The patch is a shotgun solution, putting NULL pointer checks where the
pointer is explicitly not allowed to be NULL. I need an actual
stacktrace to find the root cause.
-Chris

#724944#78
Date:
2013-10-13 03:49:04 UTC
From:
To:
Sure thing; you can find it attached.  Of course it shows when the
segfault is triggered, not when the data became NULL.  And that should
be fixed, because even though the server doesn't crash with the patch,
it also doesn't play video.

If you need any more information (like debug statements in the
set_pixmap_private?), please let me know how I can generate it.

Thanks,
Bas

#724944#83
Date:
2013-10-13 09:43:49 UTC
From:
To:
commit f9a18c9f38d09c145eb513ca989966dc135c1e9b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Oct 13 10:36:35 2013 +0100

    uxa: Check for allocation failure in i915 video

    For a large screen, we have to create a temporary surface for rendering
    the textured video. If this pixmap creation fails we may be left with a
    system memory only pixmap leading to a segfault.

    Reported-by: Bas Wijnen <wijnen@debian.org>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

However, that still leaves the question as to how you ended up being
unable to allocate the bo...

You can watch /sys/kernel/debug/dri/0/i915_gem_objects (or just use
intel-gpu-overlay) and see if there is an object leak.
-Chris

#724944#88
Date:
2013-10-15 01:46:08 UTC
From:
To:
This does indeed stop the server from crashing, but actually makes the
problem worse: it used to play video for a few minutes and then crash
when trying.  With my patch it would play video for a few minutes and
then present black screens when trying.  With your patch, it presents
black screens from the start.

I must say I'm not entirely sure if the backtrace I sent you is a
"typical" case; I managed to crash it sooner than usual, so perhaps it
wasn't the bug that I triggered before.  It did stop the crashing
however.

I don't have enough knowledge about the internals to know how that
works.  I can see the file if I mount the debugfs, but what am I looking
for?

I don't seem to have intel-gpu-overlay on my system; does it make sense
to install it?  If so, where do I get it?

While looking for it I did find and try intel-gpu-time, and noticed that
it always reports the gpu 100% busy, even when running intel-gpu-time
sleep 5 from a linux virtual terminal (so not even X is displayed).  Is
that normal?

Thanks,
Bas

#724944#93
Date:
2013-10-15 08:25:41 UTC
From:
To:
Start of video, or beginning of X?

I made two changes. The first to check for a failed GPU pixmap
allocation during video playback and the second to check for a failed
malloc during Screen initialisation.

Neither should be likely.

An increase in the number of total objects and allocated bytes.

It just presents the same information, so not really important if you
are happy with catting the debugfs file.

Hmm, looks like it should report correctly on i915.
-Chris

#724944#98
Date:
2013-10-16 14:30:57 UTC
From:
To:
Beginning of X.  After starting and logging in, I can play them for a
few minutes; afterwards it will crash.

I didn't check the backtrace myself, but when I wrote my shotgun-patch,
the problem was that pixmap_private was NULL; bo is in there, right?  So
at least in that case, it could never have allocated it, or at least it
couldn't store the pointer.

Due to unrelated problems (unbearable slowness) I switched from gnome to
xfce.  It does report 0% now.  It seems gnome keeps the gpu busy even if
it's not displaying anything...

Thanks,
Bas

#724944#103
Date:
2013-10-16 15:22:43 UTC
From:
To:
Still weird. Can you attach the Xorg.log from the black screen and/or crash?

I doubt we failed to malloc the intel_pixmap, so the reason why the
intel_pixmap would be NULL is more likely a failure to allocate the
GPU buffer object. The two are semantically interchangeable: an
intel_pixmap attached to a Pixmap implies we have a GPU bo. Similarly,
checking for the intel_pixmap should be enough to assert that the GPU bo
exists.
-Chris

#724944#108
Date:
2013-10-23 00:30:51 UTC
From:
To:
That took some time, because since I switched to xfce, it is a lot more
stable.  However, after running for a few days it still crashed when
trying to play a video.  The log is attached.

I would have attached a detailed backtrace as well, but unfortunately I
forgot to switch the core dump option on when switching from gdm to xdm,
so I don't have a core this time.

Thanks,
Bas

#724944#113
Date:
2013-10-23 08:28:28 UTC
From:
To:
No worries, if you can run

addr2line -e /usr/lib/xorg/modules/drivers/intel_drv.so -i 0xfcd79 0xf8215

that should give me the information needed to pinpoint the crash.
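
For reference, the two offsets passed to addr2line are the module-relative
parts of frames 3 and 4 in the backtrace from the earlier message. A minimal
Python sketch of pulling them out of a log follows; the function name
module_offsets and the regular expression are illustrative assumptions, not
part of any Xorg tool, and the sample lines are copied from that backtrace.

```python
import re

# Matches unsymbolised Xorg backtrace frames of the form
#   "3: /path/module.so (0xBASE+0xOFFSET) [0xADDRESS]"
# Frames that already name a symbol (e.g. "XvdiPutImage+0x1cc") do not match.
FRAME_RE = re.compile(
    r"(?P<index>\d+): (?P<module>\S+) "
    r"\(0x(?P<base>[0-9a-f]+)\+0x(?P<offset>[0-9a-f]+)\) "
    r"\[0x(?P<address>[0-9a-f]+)\]"
)


def module_offsets(log_text, module_name):
    """Return module-relative offsets (as hex strings) for frames in module_name."""
    offsets = []
    for match in FRAME_RE.finditer(log_text):
        if not match.group("module").endswith(module_name):
            continue
        base = int(match.group("base"), 16)
        offset = int(match.group("offset"), 16)
        address = int(match.group("address"), 16)
        # Sanity check: load base plus offset should equal the absolute address.
        if base + offset == address:
            offsets.append(hex(offset))
    return offsets


# Sample frames copied from the backtrace quoted earlier in this report.
LOG = """\
[519676.567] (EE) 3: /usr/lib/xorg/modules/drivers/intel_drv.so (0xb6dbb000+0xfcd79) [0xb6eb7d79]
[519676.567] (EE) 4: /usr/lib/xorg/modules/drivers/intel_drv.so (0xb6dbb000+0xf8215) [0xb6eb3215]
[519676.567] (EE) 5: /usr/bin/Xorg (0xb7540000+0x9776c) [0xb75d776c]
[519676.567] (EE) 6: /usr/bin/Xorg (XvdiPutImage+0x1cc) [0xb762203c]
"""

print(" ".join(module_offsets(LOG, "intel_drv.so")))
```

The printed offsets are what goes on the addr2line command line after -i; the
load base itself is not needed because addr2line works on addresses relative
to the shared object.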
-Chris

#724944#118
Date:
2013-10-25 03:46:53 UTC
From:
To:
$ addr2line -e /usr/lib/xorg/modules/drivers/intel_drv.so -i 0xfcd79 0xf8215
/build/xserver-xorg-video-intel-WbV7Z9/xserver-xorg-video-intel-2.21.15/build/src/uxa/../../../src/uxa/intel.h:138
/build/xserver-xorg-video-intel-WbV7Z9/xserver-xorg-video-intel-2.21.15/build/src/uxa/../../../src/uxa/i915_video.c:156
/build/xserver-xorg-video-intel-WbV7Z9/xserver-xorg-video-intel-2.21.15/build/src/uxa/../../../src/uxa/intel_video.c:1584

Note that I'm running the unpatched Debian version again (so not with
your or my patch), which is why it was crashing.

In case you have different sources, here's some context for those lines:

intel.h:138 is
 static inline Bool intel_pixmap_tiled(PixmapPtr pixmap)
 {
 }

i915_video.c:156 is
                /* front buffer, pitch, offset */
			tiling = BUF_3D_TILED_SURFACE;

and intel_video.c:1584 is
        } else {
		width, height, dstPitch, dstPitch2,
		src_w, src_h, drw_w, drw_h,
		pixmap);
	}

Thanks,
Bas

#724944#123
Date:
2013-10-25 08:33:12 UTC
From:
To:
Ah. Ok, but we still don't know how we end up in this situation. If you
apply the patch to prevent the crash here, can you please report what
the contents of /sys/kernel/debug/dri/0/i915_gem_objects is at the time
the video goes black?
-Chris

#724944#128
Date:
2013-11-03 15:15:18 UTC
From:
To:
Hi Chris,

I got a black screen while using your patch.
/sys/kernel/debug/dri/0/i915_gem_objects contents are shown below.  The first
time is while the video is running; the second after stopping it.  AFAICS,
there is no difference between them.

However, after starting a new video, there is a difference in active objects;
not sure if it is related (I don't really know what any of it means).  That is
the third one.

Thanks,
Bas

root@star:/sys/kernel/debug/dri/0# cat i915_gem_objects
220 objects, 36782080 bytes
131 [131] objects, 34430976 [34430976] bytes in gtt
  0 [0] active objects, 0 [0] bytes
  131 [131] inactive objects, 34430976 [34430976] bytes
49 unbound objects, 638976 bytes
1 purgeable objects, 4096 bytes
6 pinned mappable objects, 15884288 bytes
118 fault mappable objects, 27901952 bytes
536870912 [268435456] gtt total

Xorg: 217 objects, 36642816 bytes (0 active, 30703616 inactive, 5922816 unbound)
root@star:/sys/kernel/debug/dri/0# cat i915_gem_objects
220 objects, 36782080 bytes
131 [131] objects, 34430976 [34430976] bytes in gtt
  0 [0] active objects, 0 [0] bytes
  131 [131] inactive objects, 34430976 [34430976] bytes
49 unbound objects, 638976 bytes
1 purgeable objects, 4096 bytes
6 pinned mappable objects, 15884288 bytes
118 fault mappable objects, 27901952 bytes
536870912 [268435456] gtt total

Xorg: 217 objects, 36642816 bytes (0 active, 30703616 inactive, 5922816 unbound)
root@star:/sys/kernel/debug/dri/0# cat i915_gem_objects
220 objects, 36782080 bytes
131 [131] objects, 34430976 [34430976] bytes in gtt
  2 [2] active objects, 32768 [32768] bytes
  129 [129] inactive objects, 34398208 [34398208] bytes
49 unbound objects, 638976 bytes
1 purgeable objects, 4096 bytes
6 pinned mappable objects, 15884288 bytes
118 fault mappable objects, 27901952 bytes
536870912 [268435456] gtt total

Xorg: 217 objects, 36642816 bytes (32768 active, 30670848 inactive, 5922816 unbound)
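
The snapshots above can also be compared mechanically rather than by eye. The
following is a minimal Python sketch, assuming the file keeps the layout shown
in this message; the helper names parse_gem_objects and diff_snapshots are
hypothetical, and only the first lines of the first and third dumps are
reproduced as sample input. A steadily growing total between snapshots would
point at an object leak.

```python
import re


def parse_gem_objects(text):
    """Extract total and active object/byte counts from one i915_gem_objects dump."""
    total = re.search(r"^(\d+) objects, (\d+) bytes$", text, re.MULTILINE)
    active = re.search(
        r"^[ \t]*(\d+) \[\d+\] active objects, (\d+) \[\d+\] bytes$",
        text,
        re.MULTILINE,
    )
    return {
        "objects": int(total.group(1)),
        "bytes": int(total.group(2)),
        "active_objects": int(active.group(1)),
        "active_bytes": int(active.group(2)),
    }


def diff_snapshots(before, after):
    """Return the per-field change between two parsed snapshots."""
    return {key: after[key] - before[key] for key in before}


# First lines of the first and third dumps from this message.
WHILE_BLACK = """\
220 objects, 36782080 bytes
131 [131] objects, 34430976 [34430976] bytes in gtt
  0 [0] active objects, 0 [0] bytes
  131 [131] inactive objects, 34430976 [34430976] bytes
"""

NEW_VIDEO = """\
220 objects, 36782080 bytes
131 [131] objects, 34430976 [34430976] bytes in gtt
  2 [2] active objects, 32768 [32768] bytes
  129 [129] inactive objects, 34398208 [34398208] bytes
"""

print(diff_snapshots(parse_gem_objects(WHILE_BLACK), parse_gem_objects(NEW_VIDEO)))
```

For these two dumps the totals are unchanged and only the active counts move,
which matches the observation above that nothing appears to be leaking.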

#724944#133
Date:
2013-11-04 09:45:58 UTC
From:
To:
Indeed, that scuppers my theory about it being an allocation failure due to
GEM space exhaustion. I am not sure what else to suggest to explain why
it spontaneously fails to allocate that BO. You could try drm.debug=7,
but searching for the failure would be like hunting for a needle in a
haystack - and I'm not sure if we give any information as to how it
fails, nor does libdrm_intel have any such debug. You can try disabling
the uxa BO cache with Option "BufferCache" "false" and seeing if that
makes any difference.
-Chris