[ubuntu-x] Automatic GPU lockup bug reports

Wed Mar 17 10:16:06 GMT 2010

A little incremental update on the apport GPU lockup reports...

On how GPU reset works:

I have looked a little at the code, and the first thing that pops out
is that only i965 and newer chipsets (GM45 included) are reset. i945, G33,
and below are not reset. This resonates well with what I see in the
bug reports. On the chipsets where the GPU is not reset, the attached
IntelGpuDump.txt is compatible with the (limited) information in
i915_error_state. For the bug reports where I have got a manual dump
of i915_error_state from a drm-intel-next kernel, which dumps all the
relevant information there, the information is compatible with
IntelGpuDump.txt, although more complete (i.e. it includes all the
relevant buffers; IntelGpuDump.txt often lacks some important ones).
On chipsets where the GPU is reset, IntelGpuDump.txt is a dump of a
freshly initialized GPU. The best sign is that HEAD is right at
the beginning of the ringbuffer, i.e. it just got started. The other
sign is that ACTHD and IPEHR are different from the ones recorded in
i915_error_state. With drm.debug=0x02 as kernel parameter, we can also
see that the GPU is being reset in dmesg output (see [1] for an
example from LP #516909). The code that triggers the reset is
i915_error_work_func in drivers/gpu/drm/i915/i915_irq.c [2]. The
actual reset happens in i965_reset in i915_drv.c [3].

[1]: https://bugs.freedesktop.org/attachment.cgi?id=34126&action=edit
[2]: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/gpu/drm/i915/i915_irq.c;h=5388354da0d176df4ff2a3b7c33de069abff12da;hb=HEAD
[3]: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/gpu/drm/i915/i915_drv.c;h=1b2e95455c05d0cce04d17483c7bd4ff9f218fe0;hb=HEAD
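
The two signs of a reset described above could be checked mechanically.
Here is a minimal sketch, assuming the register values have already been
parsed out of both files into plain dicts. The function name, the dict
layout and the ring_start parameter are my own invention for
illustration; nothing like this exists in apport or the kernel as far as
I know. It is deliberately strict (it wants both signs), which may be
too conservative in practice.

```python
def looks_like_reset(error_state, gpu_dump, ring_start=0):
    """Guess whether gpu_dump was taken after the GPU had been reset.

    Both arguments are hypothetical dicts of register name -> int,
    parsed by the caller from i915_error_state and IntelGpuDump.txt.
    """
    # Sign 1: HEAD sits right at the beginning of the ringbuffer.
    head_at_start = gpu_dump.get("HEAD") == ring_start

    # Sign 2: ACTHD and IPEHR both differ from the values recorded
    # in i915_error_state at the time of the error.
    regs_differ = all(
        reg in error_state
        and reg in gpu_dump
        and error_state[reg] != gpu_dump[reg]
        for reg in ("ACTHD", "IPEHR")
    )
    return head_at_start and regs_differ
```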

On how the udev events are triggered:

The udev events are sent from i915_error_work_func mentioned above.
When a GPU reset happens, three events are sent. One is
at the beginning of the function, when we know that an error has been
detected, one right before the reset and one after. The last two
only happen on i965 and above, so we don't want to listen for them.
The first happens whether the GPU is wedged or not (as defined by
dev_priv->mm.wedged). So there is no uevent that is triggered on all
chipsets but only when the GPU is wedged, which may be what we want.
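
To make the sequence easier to reason about, here is a small model of
which events get sent in which case, as I read the code. It is only a
sketch of the behaviour described above: the function name and the
event labels are placeholders of mine, not the strings the kernel
actually sends.

```python
def uevents_for_error(wedged, is_i965_or_above):
    """Return the uevents sent for one detected error, in order.

    The labels are made-up placeholders for the three events.
    """
    # The first event is always sent, whether the GPU is wedged or not.
    events = ["error-detected"]

    # The reset, and the two events around it, only happen when the
    # GPU is wedged and the chipset is i965 or above.
    if wedged and is_i965_or_above:
        events.append("before-reset")
        events.append("after-reset")
    return events
```

The point is that a wedged i945 and a non-wedged error on any chipset
look the same from the udev side: a single event.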

The i915_error_work_func is called from the end of i915_handle_error
(also in i915_irq.c), which takes care of recording the error state to
i915_error_state in debugfs first, so it's fine to grab this file on
the first udev event also in the cases where the GPU will be reset (I
was worried about this in previous emails). i915_handle_error is
called from two places. One is when a bit in the error register EIR
gets set, which triggers an interrupt. The other is when the hangcheck
timer elapses, i.e. EIR is not set, but the GPU makes no progress. In
the latter case "Hangcheck timer elapsed... GPU hung\n" is logged. In
both cases i915_handle_error prints "render error detected, EIR:
0x%08x\n" (i.e. the EIR register is printed), but this will probably
change in drm-intel-next soon, so that it is only printed when a bit
in EIR is set [4].

[4]: http://lists.freedesktop.org/archives/intel-gfx/2010-March/006150.html

On what upstream wants:

Chris Wilson says that they would prefer dumps from kernels with the
i915_error_state dumping patch [5]. IntelGpuDump.txt usually lacks
some important information.

[5]: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=9df30794f609d9412f14cfd0eb7b45dd64d0b14e

On what we can do:

1. Differentiate between "GPU hung" and other GPU errors. I think I
got this part right in my previous email:
- If there is "Hangcheck timer elapsed... GPU hung" in dmesg, give
title "GPU hung ++",
- If there is "page table error" in dmesg, give title "GPU page table error ++"
- If none of the above, simply let the title be "GPU error ++" for now.
2. Include the error registers in the title, in this order of priority:
- If PGTBL_ER is non-zero, use that.
- Otherwise, if EIR is non-zero, use that.
- Ignore ESR, it's useless.
3. If possible, carry the record-batch-buffer-following-GPU-error
patch [5] (above) in the kernel. Possibly drop it before release. This
will improve the dumps for pre-i965, and will make the post-i965 dumps
useful.
4. Possibly add some message in the apport-script that says that while
we are recording the logs of the incident, they don't tell us how the
reporter experienced the problem. We get a lot of descriptions that
only say things like "problem happened", and we don't know if the
computer hung and needed a reboot, or if the computer recovered all by
itself and the only thing the user noticed was that apport asked them
to report a problem they were unaware of.
5. Fix whatever caused
https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/539533
This seems to have happened to a lot of people since yesterday. It seems
to be related to trying to add the MachineType to the title.
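
Items 1 and 2 above could look roughly like this in the apport hook.
This is only a sketch: the function name, its arguments and the exact
title format are assumptions of mine (the "++" in the titles above
stands for whatever else gets appended), and it assumes the register
values have already been parsed into integers.

```python
def make_title(dmesg, pgtbl_er=0, eir=0):
    """Build a bug title from the dmesg text and error registers."""
    # Item 1: tell "GPU hung" apart from other GPU errors.
    if "Hangcheck timer elapsed... GPU hung" in dmesg:
        kind = "GPU hung"
    elif "page table error" in dmesg:
        kind = "GPU page table error"
    else:
        kind = "GPU error"

    # Item 2: PGTBL_ER takes priority over EIR; ESR is ignored.
    if pgtbl_er:
        reg = "PGTBL_ER: 0x%08x" % pgtbl_er
    elif eir:
        reg = "EIR: 0x%08x" % eir
    else:
        reg = None

    if reg is None:
        return kind
    return "%s (%s)" % (kind, reg)
```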

Open question:
- Is wedged the same as hung, or is there a subtle difference?

Geir Ove