- Package:
- xserver-xorg-video-intel
- Source:
- xserver-xorg-video-intel
- Description:
- X.Org X server -- Intel i8xx, i9xx display driver
- Submitter:
- Bas Wijnen
- Date:
- 2013-11-04 10:06:09 UTC
- Severity:
- important
I can reliably crash the X server by starting almost any video with almost any player. At least mplayer, xine, totem, and xawtv (showing images from my webcam) have managed to crash it. Log file attached. If you need debugging symbols or anything like that, let me know how to generate it and I'm happy to send it. The only way I have currently found to play video without crashing X, is using mplayer -vo x11. I think that disables a lot of hardware acceleration, but I'm by no means an expert on this topic. Thanks, Bas
Bas Wijnen <wijnen@debian.org> (2013-09-29): No, that isn't data loss. http://x.debian.net/howto/use-gdb.html Also, where's the bug script output? Mraw, KiBi.
Control: notfound -1 2:2.21.15-1+b2 I sure lost data on it, but I'm not arguing that the lower severity is reasonable. The backtrace was included in the log file, hence I didn't think it would be useful to generate another one. I can see how it would be helpful to have a trace with symbols, but I needed help with getting those; that was on the page however, so I installed xserver-xorg-core-dbg. For that, I needed to upgrade to the +b2 version, which doesn't crash anymore... I don't have a proper mail system set up on this host, so reportbug doesn't work well (I think). But you're right, I should have generated that list. Sorry about that. Thanks for your quick reply and sorry for the noise, Bas
Ok, so it isn't as easily reproducible as it was and now it seems many times a video will actually play without a problem, but at random times, it still happens. I didn't yet change my gdm config, so I don't have a backtrace with symbols for you yet. (Then again, I'm not sure if that will be much better when I do get a core; would you expect more symbols than this?)
[519676.567] (EE) Backtrace:
[519676.567] (EE) 0: /usr/bin/Xorg (xorg_backtrace+0x49) [0xb76dff89]
[519676.567] (EE) 1: /usr/bin/Xorg (0xb7540000+0x1a3d14) [0xb76e3d14]
[519676.567] (EE) 2: linux-gate.so.1 (__kernel_rt_sigreturn+0x0) [0xb751e40c]
[519676.567] (EE) 3: /usr/lib/xorg/modules/drivers/intel_drv.so (0xb6dbb000+0xfcd79) [0xb6eb7d79]
[519676.567] (EE) 4: /usr/lib/xorg/modules/drivers/intel_drv.so (0xb6dbb000+0xf8215) [0xb6eb3215]
[519676.567] (EE) 5: /usr/bin/Xorg (0xb7540000+0x9776c) [0xb75d776c]
[519676.567] (EE) 6: /usr/bin/Xorg (XvdiPutImage+0x1cc) [0xb762203c]
[519676.567] (EE) 7: /usr/bin/Xorg (0xb7540000+0xe3638) [0xb7623638]
[519676.567] (EE) 8: /usr/bin/Xorg (ProcXvDispatch+0x2e) [0xb7625d9e]
[519676.567] (EE) 9: /usr/bin/Xorg (0xb7540000+0x3c35d) [0xb757c35d]
[519676.567] (EE) 10: /usr/bin/Xorg (0xb7540000+0x2a38a) [0xb756a38a]
[519676.568] (EE) 11: /lib/i386-linux-gnu/i686/cmov/libc.so.6 (__libc_start_main+0xf5) [0xb710f8c5]
[519676.568] (EE) 12: /usr/bin/Xorg (0xb7540000+0x2a768) [0xb756a768]
[519676.568] (EE)
[519676.568] (EE) Segmentation fault at address 0xe
[519676.568] (EE) Fatal server error:
[519676.568] (EE) Caught signal 11 (Segmentation fault). Server aborting
Reportbug's output is attached. Note that the logfile it includes is from a server I ran with X :1 vt9 and tried unsuccessfully to crash. The above is from a logfile where the server did crash. Thanks, Bas
I've now made it crash while it was running with -core. It took me some time to find the core file (it would be good to mention on that page that for (new) gdm, it is in /var/lib/gdm3). Anyway, I attached the gdb logs of the bt and bt full commands. I don't know why it says it has no symbol table; I did install the -dbg package. Should I do something to load the symbols from that package? Thanks, Bas
That trace doesn't seem to have debug symbols for the driver. Cheers, Julien
I didn't realize there was a per-driver debug package; it would be useful if this FIXME in the gdb document at least mentions that. Hopefully the attached file is better, even though it still complains about some missing symbols. Please tell me what to do or install if you need more information. Thanks, Bas
Still no info about what's going on in the driver. Cheers, Julien
First of all, I can see how you're busy, but if you think my problem is trivial, please just tell me so. If I'm sending a message saying "I don't know how to continue", even explicitly saying that I know this may not be what you need, a reply only saying "this is not what we need" is totally unhelpful. It shouldn't be too much effort to type the extra sentence "what you describe you did should have worked, did you try restarting the server?" (which I thought I did, but I suppose I didn't). A line like that helps more than you might think; it confirms that I was on the right track. Since I don't know much about the code or how it's supposed to work, that is good to know. Anyway, a server which is unable to play any video 5 minutes after starting gives quite a strong motivation to fix things. So after I got a backtrace, I debugged the thing. The problem was that the result of intel_get_pixmap_private() could be NULL, but that wasn't checked. So I grepped for it and added checks to all calls of that function. The patch is attached. You will want to check if I'm handling it the right way everywhere, because I just guessed the proper course of action. Then again, most code would segfault without handling it, so perhaps most of these can't ever be triggered anyway (but I'm not too sure about that; it certainly can set it to NULL when calloc fails). Thanks, Bas
Thanks. Can you please send this upstream to intel-gfx@lists.freedesktop.org? Cheers, Julien
Hello, My X server was crashing when playing video, and I wrote a patch to fix it. Please find the background and the patch at http://bugs.debian.org/724944 . Thanks, Bas
Done. (I didn't subscribe to the list; not sure if that was required. My mail wasn't bounced, so I suppose it worked.) By the way, I just noticed that while the patch does prevent the server from crashing, it doesn't actually solve the problem: videos are now all black. Not crashing the server is certainly an improvement, but this is still unusable. :-( I'm guessing the problem is whatever sets the intel_pixmap_private field to NULL, but I have no idea where to look for that, or how to debug it. It only happens after the server has been running for some time (a few minutes), which sounds like it will not be easy to track down, unfortunately. If anyone wants to try, or can tell me what I can try, please let me know. Thanks, Bas
The patch is a shotgun solution, putting NULL pointer checks where the pointer is explicitly not allowed to be NULL. I need an actual stacktrace to find the root cause. -Chris
Sure thing; you can find it attached. Of course it shows when the segfault is triggered, not when the data became NULL. And that should be fixed, because even though the server doesn't crash with the patch, it also doesn't play video. If you need any more information (like debug statements in the set_pixmap_private?), please let me know how I can generate it. Thanks, Bas
commit f9a18c9f38d09c145eb513ca989966dc135c1e9b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Sun Oct 13 10:36:35 2013 +0100
uxa: Check for allocation failure in i915 video
For a large screen, we have to create a temporary surface for rendering
the textured video. If this pixmap creation fails we may be left with a
system memory only pixmap leading to a segfault.
Reported-by: Bas Wijnen <wijnen@debian.org>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
However, that still leaves the question as to how you ended up being
unable to allocate bo...
You can watch /sys/kernel/debug/dri/0/i915_gem_objects (or just use
intel-gpu-overlay) and see if there is an object leak.
-Chris
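One way to watch for the object leak mentioned above is to take two snapshots of /sys/kernel/debug/dri/0/i915_gem_objects and compare the totals. The following is only a rough sketch: it assumes the file starts with a summary of the form "N objects, M bytes", and the helper names (parse_totals, growth, read_snapshot) are made up for illustration. Reading the file needs root and a mounted debugfs.

```python
import re


def parse_totals(snapshot):
    """Return (object_count, byte_count) from the first 'N objects, M bytes' summary."""
    match = re.search(r"(\d+) objects, (\d+) bytes", snapshot)
    if match is None:
        raise ValueError("no 'N objects, M bytes' summary found")
    return int(match.group(1)), int(match.group(2))


def growth(before, after):
    """Return (extra_objects, extra_bytes) between two snapshots of the file."""
    before_objects, before_bytes = parse_totals(before)
    after_objects, after_bytes = parse_totals(after)
    return after_objects - before_objects, after_bytes - before_bytes


def read_snapshot(path="/sys/kernel/debug/dri/0/i915_gem_objects"):
    """Read the current contents of the debugfs file (requires root)."""
    with open(path) as handle:
        return handle.read()
```

Calling read_snapshot() before and after playing a video and passing both results to growth() gives the change in object count and allocated bytes; numbers that keep climbing across repeated runs would hint at a leak.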
This does indeed stop the server from crashing, but actually makes the problem worse: it used to play video for a few minutes and then crash when trying. With my patch it would play video for a few minutes and then present black screens when trying. With your patch, it presents black screens from the start. I must say I'm not entirely sure if the backtrace I sent you is a "typical" case; I managed to crash it sooner than usual, so perhaps it wasn't the bug that I triggered before. It did stop the crashing however. I don't have enough knowledge about the internals to know how that works. I can see the file if I mount the debugfs, but what am I looking for? I don't seem to have intel-gpu-overlay on my system; does it make sense to install it? If so, where do I get it? While looking for it I did find and try intel-gpu-time, and noticed that it always reports the gpu 100% busy, even when running intel-gpu-time sleep 5 from a linux virtual terminal (so not even X is displayed). Is that normal? Thanks, Bas
Start of video, or beginning of X? I made two changes. The first to check for a failed GPU pixmap allocation during video playback and the second to check for a failed malloc during Screen initialisation. Neither should be likely. An increase in the number of total objects and allocated bytes. It just presents the same information, so not really important if you are happy with catting the debugfs file. Hmm, looks like it should report correctly on i915. -Chris
Beginning of X. After starting and logging in, I can play them for a few minutes; afterwards it will crash. I didn't check the backtrace myself, but when I wrote my shotgun-patch, the problem was that pixmap_private was NULL; bo is in there, right? So at least in that case, it could never have allocated it, or at least it couldn't store the pointer. Due to unrelated problems (unbearable slowness) I switched from gnome to xfce. It does report 0% now. It seems gnome keeps the gpu busy even if it's not displaying anything... Thanks, Bas
Still weird. Can you attach the Xorg.log from the black screen and/or crash? I doubt we failed to malloc the intel_pixmap, so the reason why the intel_pixmap would be NULL is more likely a failure to allocate the GPU buffer object. The two are semantically interchangeable: an intel_pixmap attached to a Pixmap implies we have a GPU bo. Similarly, checking for the intel_pixmap should be enough to assert that the GPU bo exists. -Chris
That took some time, because since I switched to xfce, it is a lot more stable. However, after running for a few days it still crashed when trying to play a video. The log is attached. I would have attached a detailed backtrace as well, but unfortunately I forgot to switch the core dump option on when switching from gdm to xdm, so I don't have a core this time. Thanks, Bas
No worries, if you can run addr2line -e /usr/lib/xorg/modules/drivers/intel_drv.so -i 0xfcd79 0xf8215 that should give me the information needed to pinpoint the crash. -Chris
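The two offsets in that command are the second number in the parentheses of the intel_drv.so frames in the Xorg.log backtrace, i.e. the offset from the module's load address. As a rough sketch (module_offsets is a made-up helper name, and the frame format is assumed to match the "(base+offset) [address]" lines in the log quoted earlier in this report), they could be pulled out of a log like this:

```python
import re

# Matches frames such as:
#   3: /usr/lib/xorg/modules/drivers/intel_drv.so (0xb6dbb000+0xfcd79) [0xb6eb7d79]
# Frames that already carry a symbol name, e.g. (XvdiPutImage+0x1cc), do not match.
FRAME = re.compile(r"(\S+) \((0x[0-9a-f]+)\+(0x[0-9a-f]+)\) \[(0x[0-9a-f]+)\]")


def module_offsets(log_text, module_name):
    """Return the offsets (as hex strings) of unresolved frames inside module_name."""
    offsets = []
    for path, base, offset, address in FRAME.findall(log_text):
        if not path.endswith(module_name):
            continue
        # Sanity check: base + offset should equal the absolute address.
        if int(base, 16) + int(offset, 16) == int(address, 16):
            offsets.append(offset)
    return offsets
```

The returned offsets can then be passed to addr2line -e <module> -i in the same order.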
$ addr2line -e /usr/lib/xorg/modules/drivers/intel_drv.so -i 0xfcd79 0xf8215
/build/xserver-xorg-video-intel-WbV7Z9/xserver-xorg-video-intel-2.21.15/build/src/uxa/../../../src/uxa/intel.h:138
/build/xserver-xorg-video-intel-WbV7Z9/xserver-xorg-video-intel-2.21.15/build/src/uxa/../../../src/uxa/i915_video.c:156
/build/xserver-xorg-video-intel-WbV7Z9/xserver-xorg-video-intel-2.21.15/build/src/uxa/../../../src/uxa/intel_video.c:1584
Note that I'm running the unpatched Debian version again (so not with
your or my patch), which is why it was crashing.
In case you have different sources, here's some context for those lines:
intel.h:138 is
static inline Bool intel_pixmap_tiled(PixmapPtr pixmap)
{
}
i915_video.c:156 is
/* front buffer, pitch, offset */
tiling = BUF_3D_TILED_SURFACE;
and intel_video.c:1584 is
} else {
width, height, dstPitch, dstPitch2,
src_w, src_h, drw_w, drw_h,
pixmap);
}
Thanks,
Bas
Ah. Ok, but we still don't know how we end up in this situation. If you apply the patch to prevent the crash here, can you please report what the contents of /sys/kernel/debug/dri/0/i915_gem_objects is at the time the video goes black? -Chris
Hi Chris, I got a black screen while using your patch. /sys/kernel/debug/dri/0/i915_gem_objects contents are shown below. The first time is while the video is running; the second after stopping it. AFAICS, there is no difference between them. However, after starting a new video, there is a difference in active objects; not sure if it is related (I don't really know what any of it means). That is the third one. Thanks, Bas
root@star:/sys/kernel/debug/dri/0# cat i915_gem_objects
220 objects, 36782080 bytes
131 [131] objects, 34430976 [34430976] bytes in gtt
0 [0] active objects, 0 [0] bytes
131 [131] inactive objects, 34430976 [34430976] bytes
49 unbound objects, 638976 bytes
1 purgeable objects, 4096 bytes
6 pinned mappable objects, 15884288 bytes
118 fault mappable objects, 27901952 bytes
536870912 [268435456] gtt total
Xorg: 217 objects, 36642816 bytes (0 active, 30703616 inactive, 5922816 unbound)
root@star:/sys/kernel/debug/dri/0# cat i915_gem_objects
220 objects, 36782080 bytes
131 [131] objects, 34430976 [34430976] bytes in gtt
0 [0] active objects, 0 [0] bytes
131 [131] inactive objects, 34430976 [34430976] bytes
49 unbound objects, 638976 bytes
1 purgeable objects, 4096 bytes
6 pinned mappable objects, 15884288 bytes
118 fault mappable objects, 27901952 bytes
536870912 [268435456] gtt total
Xorg: 217 objects, 36642816 bytes (0 active, 30703616 inactive, 5922816 unbound)
root@star:/sys/kernel/debug/dri/0# cat i915_gem_objects
220 objects, 36782080 bytes
131 [131] objects, 34430976 [34430976] bytes in gtt
2 [2] active objects, 32768 [32768] bytes
129 [129] inactive objects, 34398208 [34398208] bytes
49 unbound objects, 638976 bytes
1 purgeable objects, 4096 bytes
6 pinned mappable objects, 15884288 bytes
118 fault mappable objects, 27901952 bytes
536870912 [268435456] gtt total
Xorg: 217 objects, 36642816 bytes (32768 active, 30670848 inactive, 5922816 unbound)
Indeed, that scuppers my theory about it being an allocation failure due to GEM space exhaustion. I am not sure what else to suggest to explain why it spontaneously fails to allocate that BO. You could try drm.debug=7, but searching for the failure would be like hunting for a needle in a haystack, and I'm not sure if we give any information as to how it fails, nor does libdrm_intel have any such debug. You can try disabling the uxa BO cache with Option "BufferCache" "false" and seeing if that makes any difference. -Chris
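For reference, the option would normally go in the Device section of xorg.conf (or a file under xorg.conf.d). The fragment below is only a sketch: the Identifier is a made-up name, and only the Option line comes from the message above.

```
Section "Device"
    # "Intel Graphics" is a hypothetical identifier; any unique name works.
    Identifier "Intel Graphics"
    Driver     "intel"
    # Disable the uxa BO cache, as suggested above.
    Option     "BufferCache" "false"
EndSection
```

The X server has to be restarted for the change to take effect.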