[ubuntu-x] Automatic GPU lockup bug reports

Wed Mar 10 08:12:33 GMT 2010

> Yes, the userspace notification is asynchronous and the kernel does not
> wait before starting the reset procedure (if supported). Hence there is a
> race to capture the accurate data.
>
> The current i915_error_state gets around this by performing the capture in
> the error handler and aims to collect all the data that is strictly relevant
> to the crash. I would strongly recommend that this is used, and I want to
> deprecate the ringbuffer_info and batchbuffers debug files in the future -
> hence killing intel_gpu_dump.

I have noticed in the GPU-lockup bug reports that we have been
receiving (https://launchpad.net/ubuntu/+bugs?field.searchtext=GPU+lockup&orderby=-datecreated)
that the attached IntelGpuDump.txt is usually incomplete, but it can
still be useful for gathering statistics, since dumps on the same
chipset often have similar characteristics. The incompleteness may be
due to the race condition that Chris mentions.

One thing that I see a lot is that only the ringbuffer is captured
while the GPU is executing a batchbuffer (see
https://wiki.ubuntu.com/X/InterpretingIntelGpuDump for a high-level
description of ringbuffers and batchbuffers). One example is the
IntelGpuDump.txt from
https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/535477 .
The first line captures the memory address of the active head, i.e.
where the GPU is currently executing (ACTHD: 0x0e366d50). From the
ringbuffer dump we see:
0x00012500:      0x18800080: MI_BATCH_BUFFER_START
0x00012504:      0x0e363001:    dword 1
0x00012508: HEAD 0x02000004: MI_FLUSH
which means that the last command executed in the ringbuffer was to
start a batchbuffer at memory address 0x0e363001. ACTHD lies a little
past that address, so we can assume that the GPU is executing in that
batchbuffer. The batchbuffer is not part of the dump, however, which
makes it hard to say what the GPU is up to. The only thing we can see
is that the last executed instruction is 0x15000000 (from the IPEHR
register, which is loaded with every instruction that is processed).
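
The check described above can be sketched in a few lines of Python. This
is only an illustration of the reasoning, not part of intel_gpu_dump: the
function name last_batch_start is made up, the line format is inferred
from the three dump lines quoted above, and masking off the low two bits
of the address dword is an assumption on my part. Since the dump does not
give the batchbuffer's length, the result shows only how far ACTHD is
from the batch start, not that it is definitely inside the batch.

```python
import re

# Line format inferred from the ringbuffer excerpt quoted in this mail:
#   <ring offset>: [HEAD ]<dword value>: <decoded text>
RING_LINE = re.compile(r"^(0x[0-9a-f]+):\s+(HEAD\s+)?(0x[0-9a-f]+):\s+(.*)$")

RING_DUMP = """\
0x00012500:      0x18800080: MI_BATCH_BUFFER_START
0x00012504:      0x0e363001:    dword 1
0x00012508: HEAD 0x02000004: MI_FLUSH
"""

ACTHD = 0x0E366D50  # from the first line of the same IntelGpuDump.txt


def last_batch_start(ring_dump):
    """Return the dword after the last MI_BATCH_BUFFER_START before HEAD.

    Hypothetical helper: returns None if no such command is found.
    """
    batch_addr = None
    expect_addr = False
    for line in ring_dump.splitlines():
        m = RING_LINE.match(line.strip())
        if not m:
            continue
        _offset, is_head, value, text = m.groups()
        if is_head:
            break  # commands from HEAD onwards have not been executed yet
        if expect_addr:
            batch_addr = int(value, 16)
            expect_addr = False
        elif text.startswith("MI_BATCH_BUFFER_START"):
            expect_addr = True
    return batch_addr


start = last_batch_start(RING_DUMP)
base = start & ~0x3  # assumption: the low two bits are flags, not address
offset = ACTHD - base
print(f"ACTHD is {offset:#x} bytes past batch start {base:#x}")
```

For the dump quoted above this puts ACTHD 0x3d50 bytes past the batch
start, which is consistent with the GPU being somewhere inside that
batchbuffer.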

I'm also wondering if there are many false positives, since I don't
always see signs of a GPU error in the dmesg output. Even when there
are GPU hung messages, dmesg may contain messages logged long after
them, which means that it couldn't have been that GPU hang that
triggered the udev rule. I'm not sure how to interpret this.

Since the number of bug reports is quite overwhelming, I think a
suitable thing to do would be to lump similar automatic reports
together by marking them as duplicates of a master bug report. Most
likely, the i8xx reports are mostly this issue:
http://bugs.freedesktop.org/show_bug.cgi?id=26345 . The bugs on i945
also seem similar to one another. Then we can coordinate some testing
from the master bug report, but ask people to comment on their
findings on their own reports. That way the master bug report will not
be flooded with comments and we can easily detach bug reports later.

Geir Ove