[ubuntu-x] Automatic GPU lockup bug reports

Fri Mar 12 10:58:34 GMT 2010

On Fri, Mar 12, 2010 at 1:48 AM, Bryce Harrington <bryce at canonical.com> wrote:
> On Wed, Mar 10, 2010 at 09:12:33AM +0100, Geir Ove Myhr wrote:
>> I have noticed in the GPU-lockup bug reports that we have been
>> receiving (https://launchpad.net/ubuntu/+bugs?field.searchtext=GPU+lockup&orderby=-datecreated)
>> that the attached IntelGpuDump.txt is usually incomplete, but it
>> can be useful for gathering statistics since dumps on the same chipset
>> often have similar characteristics. This may be due to the race
>> condition that Chris mentions.
>
> Incomplete in what sense?

Like below where the currently executing batchbuffer is completely
missing. I have also seen dumps without any ringbuffer or batchbuffer
at all. Then there's the issue of how much to trust the dump if the
kernel is racing to reset the GPU while userspace is trying to dump
it.

> Btw, you've noticed the random number strings that are included in
> titles.  That is basically a checksum hex of the dump report, which I'm
> calling the 'dump sign'.  If two bug reports have exactly the same gpu
> dump (character-for-character) then they'll have identical dump signs
> and thus are almost assuredly dupes.  Looking through our existing bug
> reports I found half a dozen with the same hex, and sure enough they
> were all against 915gm and so I marked them all dupes.

Yes, I knew that. I just thought I'd replace it with what I thought
was the actual problem. I think we will only get matches in degenerate
cases, like the 915gm one where the ringbuffer is only a big 0. It is
kind of like taking the MD5 of a 'ps aux' dump. If one symptom is that
no processes are started, we will get matching MD5s, but for any
"normal" output the MD5 will not match even if other characteristics
are the same. I think it's okay to have the first few hex-digits in
the title along with any other useful information that we can add,
like you did for the last ones.
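
To illustrate the point about checksums, here is a minimal Python sketch. It assumes the dump sign is an MD5 over the dump text; the function name dump_sign, the 8-digit truncation and the sample strings are made up for the example and are not taken from the apport hook.

```python
import hashlib

def dump_sign(dump_text, digits=8):
    """Return the first `digits` hex characters of the MD5 of a dump (hypothetical helper)."""
    return hashlib.md5(dump_text.encode("utf-8")).hexdigest()[:digits]

# Made-up sample dumps: one degenerate, two that differ in a single register value.
degenerate = "0x00000000: 0x00000000\n"
normal_a = "ACTHD: 0x0e366d50\n0x00012500: 0x18800080\n"
normal_b = "ACTHD: 0x0e366d54\n0x00012500: 0x18800080\n"

# Identical dumps give identical signs; a one-character difference does not.
same = dump_sign(degenerate) == dump_sign(degenerate)
differ = dump_sign(normal_a) != dump_sign(normal_b)
```

So the sign only groups reports whose dumps are byte-for-byte identical, which is why I expect it to match mostly in the degenerate cases.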

> However, I recognize these hex strings are nigh-unreadable for triagers,
> and notice you've been replacing them with the PGTBL_ER or ESR values in
> some cases.  To save you some typing I've updated the report to append
> these to the title, if the values are non-zero.  I did not include
> looking at the EIR but notice this is discussed in your other email -
> let me know if that would be worth including and if it should be
> used preferentially to ESR and/or PGTBL_ER.

Good idea. I added the ESR before I found out that it is essentially
useless. So PGTBL_ER should be used before EIR, and ESR should never
be used. Bit 0x10 in EIR is the general sign of a page table error, and
in that case more detailed information about it can be found in
PGTBL_ER, so EIR always has bit 0x10 set (possibly along with other
error bits) if PGTBL_ER is non-zero.
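
As a rough sketch of that preference order in Python: this assumes the registers appear in the dump as "NAME: 0x..." lines (as the ACTHD line does); that PGTBL_ER, EIR and ESR use the same layout is an assumption. The names parse_registers and title_suffix and the suffix format are hypothetical, not the actual hook code.

```python
import re

# Assumed register line layout: "NAME: 0xVALUE" at the start of a line.
REGISTER_RE = re.compile(r"^[ \t]*([A-Za-z_]\w*):[ \t]*(0x[0-9a-fA-F]+)", re.MULTILINE)

def parse_registers(dump_text):
    """Map register names to integer values for every 'NAME: 0x...' line."""
    return {name: int(value, 16) for name, value in REGISTER_RE.findall(dump_text)}

def title_suffix(dump_text):
    """Prefer PGTBL_ER, fall back to EIR, and never use ESR."""
    regs = parse_registers(dump_text)
    for name in ("PGTBL_ER", "EIR"):  # ESR is deliberately left out
        value = regs.get(name, 0)
        if value:
            return f"{name}: 0x{value:08x}"
    return ""  # nothing non-zero worth adding to the title
```

With PGTBL_ER non-zero the suffix names PGTBL_ER, with only EIR non-zero it names EIR, and a dump where only ESR is non-zero gets no suffix at all.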

>> One thing that I see a lot is that only the ringbuffer is captured,
>> while the GPU is executing a batchbuffer (see
>> https://wiki.ubuntu.com/X/InterpretingIntelGpuDump for a high level
>> description of ringbuffers and batchbuffers). One example is
>> IntelGpuDump.txt from
>> https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/535477
>> . The first line captures the memory address of the active head, i.e.
>> where the GPU is currently executing (ACTHD: 0x0e366d50). From the
>> ringbuffer dump we see
>> 0x00012500:      0x18800080: MI_BATCH_BUFFER_START
>> 0x00012504:      0x0e363001:    dword 1
>> 0x00012508: HEAD 0x02000004: MI_FLUSH
>> which means that the last executed command in the ringbuffer was to
>> start a batch buffer at memory address 0x0e363001. ACTHD is a little
>> bit past this address, so we can assume that the GPU is executing in
>> that batchbuffer, but the batchbuffer is not part of the dump, which makes
>> it hard to say what the GPU is up to. The only thing we can see is
>> that the last executed instruction is 0x15000000 (from the IPEHR
>> register which is loaded with every instruction that is processed).
> Can you propose a mechanism for how we can solve this?  I only half grok
> the freeze dumping stuff, and unfortunately some other X projects are
> demanding my time.  But if you can propose some specific changes I can
> at least supply some time to update the apport hook and/or get the bits
> into the archive.  I would love patches or even just bash snippets that
> can be put into the apport hook, udev hook, or whatever.

One option would be to carry the record-GPU-error-state kernel patch
http://git.kernel.org/?p=linux/kernel/git/anholt/drm-intel.git;a=commit;h=9df30794f609d9412f14cfd0eb7b45dd64d0b14e
until some time before release and capture the i915_error_state a
little later. This would need some testing though, so our best option
may be to simply leave it at the status quo and ask the reporters of
the most promising automatic reports to test a drm-intel-next kernel
and get a manual dump.
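
In the meantime, the check from the quoted ringbuffer excerpt (find the last MI_BATCH_BUFFER_START before HEAD and compare its address with ACTHD) could be scripted to pick out the dumps where the batchbuffer is missing. Below is a minimal Python sketch, assuming the ringbuffer lines look exactly like the three lines quoted above; the function name last_batch_start is hypothetical, and the address dword is used as-is without masking any low bits.

```python
import re

# Assumed line layout: "0xOFFSET: [HEAD ]0xVALUE: decoded text"
LINE_RE = re.compile(r"^(0x[0-9a-fA-F]+):\s+(HEAD\s+)?(0x[0-9a-fA-F]+):\s*(.*)$")

def last_batch_start(ring_lines):
    """Return the dword following the last MI_BATCH_BUFFER_START before HEAD, or None."""
    batch_addr = None
    expect_addr = False
    for line in ring_lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        _offset, is_head, value, text = match.groups()
        if is_head:
            break  # commands from HEAD onwards have not been executed yet
        if expect_addr:
            batch_addr = int(value, 16)  # the "dword 1" line carries the address
            expect_addr = False
        elif text.startswith("MI_BATCH_BUFFER_START"):
            expect_addr = True
    return batch_addr

ring = [
    "0x00012500:      0x18800080: MI_BATCH_BUFFER_START",
    "0x00012504:      0x0e363001:    dword 1",
    "0x00012508: HEAD 0x02000004: MI_FLUSH",
]
acthd = 0x0e366d50
batch = last_batch_start(ring)
offset = acthd - batch  # how far ACTHD lies past the batch start address
```

A small positive offset together with no batchbuffer section in the dump would flag the report as one where a manual dump is worth asking for.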

>> I'm also wondering if there are many false positives, since I don't
>> always see signs of a GPU error in the dmesg output. Even when there
>> are GPU hung messages, there may be messages in dmesg for a long time
>> after that, which means that it couldn't have been that GPU hang that
>> triggered the udev rule. I'm not sure how to interpret this.
>
> Can you propose a string to look for in the dmesg output?  It would be
> straightforward to have the apport hook scan for that string and refuse
> to file a bug report unless it sees it.

"GPU hung" would catch all cases where the kernel detects that the GPU
is really hung. It would not catch the ones with page table errors. Maybe
it's best to keep the gates open for now, so that we don't exclude
error types that we are not aware of?  One possible way to make life
easier is to distinguish them in the bug title: "GPU hung" in dmesg
gives bug title "GPU hung ++", "page table error" could give "GPU page
table error ++" in the title, and other bugs would simply be "GPU error
++".
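
A minimal Python sketch of that title scheme follows. It assumes the literal substrings "GPU hung" and "page table error" are what shows up in dmesg; the function name bug_title_prefix is hypothetical, and the "++" is treated as a placeholder for whatever else gets appended to the title.

```python
def bug_title_prefix(dmesg_text):
    """Pick a title prefix from dmesg without rejecting any report."""
    if "GPU hung" in dmesg_text:
        return "GPU hung"
    if "page table error" in dmesg_text:
        return "GPU page table error"
    return "GPU error"  # unknown error type: keep the gates open

# Made-up dmesg fragments, only for illustration.
hung = bug_title_prefix("example line: GPU hung\n")
pgtbl = bug_title_prefix("example line: page table error\n")
other = bug_title_prefix("example line: nothing relevant\n")
```

Checking "GPU hung" first means a report that has both messages is filed as a hang, which is a choice made for this sketch and could just as well go the other way.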

>> Since the number of bug reports is quite overwhelming, I think a
>> suitable thing to do would be to lump similar automatic reports
>> together by duplicating them to a master bug report. Most likely, the
>> i8xx reports are mostly this issue:
>> http://bugs.freedesktop.org/show_bug.cgi?id=26345 .
>
> It could be.  Are we sufficiently confident that we could just dupe all
> the bug reports in launchpad?  Or if we're not sure, we could go ahead
> and start forwarding the bug reports and let upstream dupe them there.
> The former is probably less total work, and like you mention we can
> always undupe them ourselves as we learn more.

Upstream has already duped many (but not all) of our 8xx bugs to fdo
26345. For now, we can probably dupe the automatic ones.

> With 8xx, another option we could pursue would be to blacklist KMS in
> the kernel and force them to use UMS instead.  Do you know if there has
> been testing to verify that the freezes experienced by 8xx are specific
> to KMS?  I'd hate to blacklist 845 for example, only to find it still
> doesn't work.

I haven't seen any such testing in Lucid, but in Karmic it usually
didn't help to use UMS instead of KMS. I'll ask someone to test this
in Lucid.

I know this would be a good time to invest some extra time and effort
into ubuntu bugs, but due to other commitments I don't have any room
for that.

Geir Ove