[ubuntu-x] Intel GPU hangs and batchbuffer dumps

Geir Ove Myhr gomyhr at gmail.com
Wed Mar 3 16:52:56 GMT 2010


>> With the patch from Chris Wilson, it should be sufficient to capture
>> only the file i915_error_state, but I guess we have to get the timing
>> right. The udev rule is only triggered when the kernel notices that
>> the GPU is hung, right? At that time the GPU is reset and this is
>> probably also the time that i915_error_state shows up. So I'm
>> wondering if we currently end up with recording a GPU dump of a
>> reinitialized GPU, which is not very useful. Maybe this would have
>> been obvious to me if I knew how to read the output of
>> intel_gpu_dump...

I have read up a bit on intel_gpu_dump. Apparently, there was some
rationale for doing it the way it's currently done. I found this in
xserver-xorg-video-intel_2:2.9.0-1ubuntu2_2:2.9.0-1ubuntu4.diff.gz:

--- xserver-xorg-video-intel-2.9.0.orig/debian/xserver-xorg-video-intel.udev
+++ xserver-xorg-video-intel-2.9.0/debian/xserver-xorg-video-intel.udev
@@ -0,0 +1,10 @@
+# do not edit this file, it will be overwritten on update
+
+# Jesse Barnes on ubuntu-devel at lists.ubuntu.com:
+#   You'll get three events, one when the error is detected, one before the
+#   reset and one after.  Each has a different environment variable set; the
+#   initial error has ERROR=1, the pre-reset event has RESET=1 and the
+#   post-reset event has ERROR=0.
+
+
+DRIVER=="i915", ACTION=="change", ENV{ERROR}==1, PROGRAM="/usr/share/apport/apport-gpu-error-intel.py"

So the event is indeed triggered before the reset happens. At that
point intel_gpu_dump should give a useful dump, and i915_error_state
will not contain anything useful yet. At some point, at least once the
capture-error-state patch is in the Ubuntu kernel, we should trigger
on ERROR=0 instead and capture i915_error_state (which can be decoded
with intel_error_decode from the latest intel-gpu-tools git).
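
To make the timing concrete, here is a rough Python sketch of how a
hook could tell the three events apart, going only by the environment
variables Jesse describes. The function name and the returned labels
are made up for illustration, and ignoring the RESET=1 event is my
assumption; this is not what apport-gpu-error-intel.py actually does.

```python
import os


def classify_gpu_event(env):
    """Map a uevent environment to a capture step (hypothetical labels)."""
    if env.get("ERROR") == "1":
        # Error detected, GPU not yet reset: intel_gpu_dump is useful here.
        return "dump"
    if env.get("RESET") == "1":
        # Pre-reset event: assumed to need no capture of its own.
        return "ignore"
    if env.get("ERROR") == "0":
        # Post-reset event: the point at which to read i915_error_state.
        return "error_state"
    return "ignore"


if __name__ == "__main__":
    # When run by udev, the variables would arrive in the process environment.
    print(classify_gpu_event(os.environ))
```

With the current rule only the "dump" case is ever reached, since the
rule matches ERROR=1 only.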

> Jesse, here are a few examples of the dumps we're collecting now.  Mind
> doublechecking that this are actually useful dumps?

I'm not an expert, but it looks like they have some potentially useful
information.

>  https://bugs.edge.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/529702
This one is maybe not so useful. The ringbuffer isn't shown.

>  https://bugs.edge.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/529410
The ringbuffer is all zero (MI_NOOP), but PGTBL_ER: 0x00000010
indicates that the hardware has detected an error. According to the
i965 PRM [1], it is "Invalid GTT Entry during Display A Fetch".

>  https://bugs.edge.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/528795
PGTBL_ER: 0x00000029
This one also has something in the Page Table Error register. Here,
bits 5, 3 and 0 are set. On i965, bits 5 and 3 are reserved, but bit 0
means "Invalid GTT Entry during Fetch on behalf of the Host".

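The bit arithmetic for the two PGTBL_ER values above can be checked
with a small Python sketch. The table holds only the two i965 bit
meanings quoted in this mail, plus the two bits I read as reserved;
everything else in the PRM is left out, and the names here are made up
for illustration.

```python
# Only the i965 meanings mentioned in this mail; not the full PRM table.
PGTBL_ER_I965 = {
    0: "Invalid GTT Entry during Fetch on behalf of the Host",
    4: "Invalid GTT Entry during Display A Fetch",
}
RESERVED_I965 = {3, 5}


def decode_pgtbl_er(value):
    """Return (bit, description) for every set bit of a 32-bit value."""
    result = []
    for bit in range(32):
        if value & (1 << bit):
            if bit in PGTBL_ER_I965:
                desc = PGTBL_ER_I965[bit]
            elif bit in RESERVED_I965:
                desc = "reserved on i965"
            else:
                desc = "not listed here, see the PRM"
            result.append((bit, desc))
    return result


if __name__ == "__main__":
    for reg in (0x00000010, 0x00000029):
        print(hex(reg), decode_pgtbl_er(reg))
```
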
It will be interesting once we get the first bug reports from
xserver-xorg-video-intel 2.9.1-1ubuntu8, since those should also have
hardware information attached :-)

[1]: http://intellinuxgraphics.org/VOL_1_graphics_core.pdf


