[ubuntu-x] Intel GPU hangs and batchbuffer dumps

Bryce Harrington bryce at canonical.com
Wed Mar 3 18:10:05 GMT 2010


On Wed, Mar 03, 2010 at 05:52:56PM +0100, Geir Ove Myhr wrote:
> >> With the patch from Chris Wilson, it should be sufficient to capture
> >> only the file i915_error_state, but I guess we have to get the timing
> >> right. The udev rule is only triggered when the kernel notices that
> >> the GPU is hung, right? At that time the GPU is reset and this is
> >> probably also the time that i915_error_state shows up. So I'm
> >> wondering if we currently end up with recording a GPU dump of a
> >> reinitialized GPU, which is not very useful. Maybe this would have
> >> been obvious to me if I knew how to read the output of
> >> intel_gpu_dump...
> 
> I have read up a bit on intel_gpu_dump. Apparently, there was some
> rationale for doing it the way it's currently done. I found this in
> xserver-xorg-video-intel_2:2.9.0-1ubuntu2_2:2.9.0-1ubuntu4.diff.gz:
> 
> --- xserver-xorg-video-intel-2.9.0.orig/debian/xserver-xorg-video-intel.udev
> +++ xserver-xorg-video-intel-2.9.0/debian/xserver-xorg-video-intel.udev
> @@ -0,0 +1,10 @@
> +# do not edit this file, it will be overwritten on update
> +
> +# Jesse Barnes on ubuntu-devel at lists.ubuntu.com:
> +#   You'll get three events, one when the error is detected, one before the
> +#   reset and one after.  Each has a different environment variable set; the
> +#   initial error has ERROR=1, the pre-reset event has RESET=1 and the
> +#   post-reset event has ERROR=0.
> +
> +
> +DRIVER=="i915", ACTION=="change", ENV{ERROR}=="1", PROGRAM="/usr/share/apport/apport-gpu-error-intel.py"
> 
> So the event is indeed triggered before the reset happens. At that
> point intel_gpu_dump should give a useful dump and i915_error_state
> will contain nothing useful yet. At some point, at least when the
> capture-error-state patch is in the Ubuntu kernel, we should trigger
> at ERROR=0 and capture i915_error_state (which can be decoded with
> intel_error_decode from intel-gpu-tools in newest git).

Hrm, that sounds rather involved.  I'm not sure how we could arrange to
have apport file a bug with info from two separate invocations.  Think
of any way to collect it in one call?
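
For the sake of discussion, here is a minimal sketch of how a hook could tell
the three events apart, going only by Jesse's description quoted above.  The
function and table names are made up for illustration, and it assumes the
hook can see the udev environment as a plain dict (for instance os.environ),
which I have not verified.  It does not solve the two-invocation problem; it
only picks which capture fits which event.

```python
def classify_gpu_event(env):
    """Map the udev environment to the stage described in the udev rule comment.

    ERROR=1 -> error detected, RESET=1 -> about to reset, ERROR=0 -> reset done.
    """
    if env.get("RESET") == "1":
        return "pre-reset"
    error = env.get("ERROR")
    if error == "1":
        return "error-detected"
    if error == "0":
        return "post-reset"
    return "unknown"


# Hypothetical mapping from stage to the capture that is useful at that point:
# intel_gpu_dump while the GPU is still hung, i915_error_state after the reset
# (the latter only once the capture-error-state patch is in the kernel).
CAPTURE_FOR_STAGE = {
    "error-detected": "intel_gpu_dump",
    "post-reset": "i915_error_state",
}


def capture_for_event(env):
    """Return the name of the capture to take for this event, or None."""
    return CAPTURE_FOR_STAGE.get(classify_gpu_event(env))
```

A hook would call capture_for_event(os.environ) and do nothing when it
returns None.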

> > Jesse, here are a few examples of the dumps we're collecting now.  Mind
> > double-checking that these are actually useful dumps?
> 
> I'm not an expert, but it looks like they have some potentially useful
> information.
> 
> > https://bugs.edge.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/529702
> This one is maybe not so useful. The ringbuffer isn't shown.
> 
> > https://bugs.edge.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/529410
> The ringbuffer is all zero (MI_NOOP), but PGTBL_ER: 0x00000010
> indicates that the hardware has detected an error. According to the
> i965 PRM [1] it is "Invalid GTT Entry during Display A Fetch".
> 
> > https://bugs.edge.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/528795
> PGTBL_ER: 0x00000029
> This one also has something in the Page Table Error register. Here,
> bits 5, 3 and 0 are set. On i965, bits 5 and 3 are reserved, but bit 0
> means "Invalid GTT Entry during Fetch on behalf of the Host".

Great!

I'm impressed you know how to decipher these - would you mind writing a
paragraph or so on the freeze troubleshooting page in wiki about how to
get this info?

(And I wonder if there's some way we could automatically suss out the
error in the apport script itself...?)
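
Something along these lines, perhaps.  This is only a rough sketch: the bit
meanings are just the two mentioned in the quoted analysis above (bits 0 and
4, with 3 and 5 reserved on i965), the function names are invented, and the
"PGTBL_ER: 0x..." line format is assumed from the values quoted in the dumps.
Any other bit would need to be looked up in the PRM.

```python
import re

# Only the bits named in this thread; everything else is left undecoded.
KNOWN_PGTBL_ER_BITS = {
    0: "Invalid GTT Entry during Fetch on behalf of the Host",
    4: "Invalid GTT Entry during Display A Fetch",
}
RESERVED_ON_I965 = {3, 5}


def find_pgtbl_er(dump_text):
    """Return the PGTBL_ER value found in a dump, or None if absent."""
    match = re.search(r"PGTBL_ER:\s*(0x[0-9a-fA-F]+)", dump_text)
    return int(match.group(1), 16) if match else None


def decode_pgtbl_er(value):
    """Return a list of (bit, description) for every bit set in value."""
    findings = []
    for bit in range(32):
        if not value & (1 << bit):
            continue
        if bit in KNOWN_PGTBL_ER_BITS:
            findings.append((bit, KNOWN_PGTBL_ER_BITS[bit]))
        elif bit in RESERVED_ON_I965:
            findings.append((bit, "reserved on i965"))
        else:
            findings.append((bit, "not covered by this sketch, see the PRM"))
    return findings
```

The script could then put the decoded descriptions into the bug report next
to the raw dump.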
 
> It will be interesting once we get the first bug reports from
> xserver-xorg-video-intel 2.9.1-1ubuntu8, since that also should have
> hardware information attached :-)

Definitely

> [1]: http://intellinuxgraphics.org/VOL_1_graphics_core.pdf


