[ubuntu-x] Automatic GPU lockup bug reports

Fri Mar 12 00:48:07 GMT 2010

On Wed, Mar 10, 2010 at 09:12:33AM +0100, Geir Ove Myhr wrote:
> I have noticed in the GPU-lockup bug reports that we have been
> receiving (https://launchpad.net/ubuntu/+bugs?field.searchtext=GPU+lockup&orderby=-datecreated)
> that the attached IntelGpuDump.txt is usually incomplete, but
> can be useful for gathering statistics since dumps on the same chipset
> often have similar characteristics. This may be due to the race
> condition that Chris mentions.

Incomplete in what sense?

Btw, you may have noticed the random-looking hex strings that are
included in titles.  Each one is basically a hex checksum of the dump
report, which I'm calling the 'dump sign'.  If two bug reports have
exactly the same GPU dump (character-for-character) then they'll have
identical dump signs and thus are almost assuredly dupes.  Looking
through our existing bug reports I found half a dozen with the same hex,
and sure enough they were all against 915gm, so I marked them all dupes.
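
For anyone who wants to check for dupes by hand, here is a rough sketch
of the dump sign idea.  The choice of MD5 and the names dump_sign and
group_by_sign are assumptions made for illustration; the actual hook may
compute its checksum differently.

```python
import hashlib


def dump_sign(dump_text):
    """Return a hex checksum of a GPU dump.

    MD5 is an assumption here; any stable checksum gives the same
    property: identical dumps yield identical signs.
    """
    return hashlib.md5(dump_text.encode("utf-8")).hexdigest()


def group_by_sign(dumps):
    """Group bug ids by dump sign.

    `dumps` maps a bug id to the text of its IntelGpuDump.txt.
    Any group with more than one bug id is a likely set of dupes.
    """
    groups = {}
    for bug_id, text in dumps.items():
        groups.setdefault(dump_sign(text), []).append(bug_id)
    return groups
```

Only character-for-character identical dumps land in the same group, so
this catches exact dupes and nothing fuzzier.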

Ideally, when apport tries filing a bug report with the same dump sign
as one already filed, it should automatically set it as a dupe.  I don't
know whether this is working yet; the dupe detection stuff is still
magical to me.

However, I recognize these hex strings are nigh-unreadable for triagers,
and notice you've been replacing them with the PGTBL_ER or ESR values in
some cases.  To save you some typing I've updated the report to append
these to the title, if the values are non-zero.  I did not include the
EIR, but I notice this is discussed in your other email.  Let me know
if that would be worth including and if it should be used preferentially
to ESR and/or PGTBL_ER.
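
Roughly, the title handling amounts to something like the sketch below.
It assumes the dump lists registers as "NAME: 0x..." lines (as in the
"ACTHD: 0x0e366d50" line quoted further down); the function names and
the suffix format are made up for illustration and are not the hook's
actual code.

```python
import re

# Assumed register line format: "NAME: 0x<hex>".  Names must start with
# a letter so ringbuffer rows like "0x00012500: 0x18800080: ..." are
# not mistaken for registers.
REG_LINE = re.compile(r"^\s*([A-Za-z_]\w*):\s*(0x[0-9a-fA-F]+)", re.MULTILINE)


def read_registers(dump_text):
    """Return a dict mapping register name to integer value."""
    return {name: int(value, 16) for name, value in REG_LINE.findall(dump_text)}


def title_suffix(dump_text, names=("PGTBL_ER", "ESR")):
    """Build a title suffix from the named registers that are non-zero."""
    regs = read_registers(dump_text)
    parts = ["%s: 0x%08x" % (n, regs[n]) for n in names if regs.get(n, 0) != 0]
    return " ".join(parts)
```

Adding EIR would then just be a matter of extending `names`, with the
order of the tuple deciding which register is listed first.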

> One thing that I see a lot is that only the ringbuffer is captured,
> while the GPU is executing a batchbuffer (see
> https://wiki.ubuntu.com/X/InterpretingIntelGpuDump for a high level
> description of ringbuffers and batchbuffers). One example is
> IntelGpuDump.txt from
> https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/535477
> . The first line captures the memory address of the active head, i.e.
> where the GPU is currently executing (ACTHD: 0x0e366d50). From the
> ringbuffer dump we see
> 0x00012500:      0x18800080: MI_BATCH_BUFFER_START
> 0x00012504:      0x0e363001:    dword 1
> 0x00012508: HEAD 0x02000004: MI_FLUSH
> which means that the last executed command in the ringbuffer was to
> start a batch buffer at memory address 0x0e363001. ACTHD is a little
> past this address, so we can assume that the GPU is executing in that
> batchbuffer, but the batchbuffer is not part of the dump, which makes
> it hard to say what the GPU is up to. The only thing we can see is
> that the last executed instruction is 0x15000000 (from the IPEHR
> register which is loaded with every instruction that is processed).

Can you propose a mechanism for how we can solve this?  I only half grok
the freeze dumping stuff, and unfortunately some other X projects are
demanding my time.  But if you can propose some specific changes I can
at least supply some time to update the apport hook and/or get the bits
into the archive.  I would love patches or even just bash snippets that
can be put into the apport hook, udev hook, or whatever.
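
As a starting point, the check you describe by hand could be sketched
like this, so a hook could at least flag dumps where the batchbuffer is
missing.  The line formats are taken from the excerpt you quoted;
masking the low two bits of the batch address (on the guess that they
are flag bits rather than part of the address) and all the names here
are assumptions, not tested against real dumps.

```python
import re

# Ringbuffer row, e.g. "0x00012508: HEAD 0x02000004: MI_FLUSH"
DWORD = re.compile(r"^(0x[0-9a-fA-F]+):\s+(HEAD\s+)?(0x[0-9a-fA-F]+):\s*(.*)$")
ACTHD = re.compile(r"^ACTHD:\s*(0x[0-9a-fA-F]+)", re.MULTILINE)


def analyze(dump_text):
    """Locate the last batch start before HEAD and compare it to ACTHD."""
    m = ACTHD.search(dump_text)
    if m is None:
        return None
    acthd = int(m.group(1), 16)

    rows = []  # (offset, is_head, value, decoded text)
    for line in dump_text.splitlines():
        d = DWORD.match(line.strip())
        if d:
            rows.append((int(d.group(1), 16), bool(d.group(2)),
                         int(d.group(3), 16), d.group(4)))

    batch_start = None
    for i, (offset, is_head, value, text) in enumerate(rows):
        if is_head:
            break  # commands at and after HEAD have not been executed
        if text.startswith("MI_BATCH_BUFFER_START") and i + 1 < len(rows):
            # Address is in the following dword; low bits assumed to be flags.
            batch_start = rows[i + 1][2] & ~0x3

    dumped = {offset for offset, _, _, _ in rows}
    return {
        "acthd": acthd,
        "batch_start": batch_start,
        # False means the dump has no row at the address being executed.
        "acthd_dumped": acthd in dumped,
    }
```

If "acthd_dumped" comes back False while "batch_start" is set, the dump
most likely stopped at the ringbuffer, which is the case you describe.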

> I'm also wondering if there are many false positives, since I don't
> always see signs of a GPU error in the dmesg output. Even when there
> are GPU hung messages, there may be messages in dmesg for a long time
> after that, which means that it couldn't have been that GPU hang that
> triggered the udev rule. I'm not sure how to interpret this.

Can you propose a string to look for in the dmesg output?  It would be
straightforward to have the apport hook scan for that string and refuse
to file a bug report unless it sees it.
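
Something along these lines is what I have in mind for the hook.  The
marker strings are deliberately left as a parameter, since picking the
right ones is exactly the open question; the function name is
hypothetical and the matching is a plain case-insensitive substring
test.

```python
def should_file_report(dmesg_text, markers):
    """Return True only if dmesg contains at least one marker string.

    `markers` is whatever list of strings we agree indicates a real
    GPU hang; with an empty list no report would be filed.
    """
    lowered = dmesg_text.lower()
    return any(marker.lower() in lowered for marker in markers)
```

This would not address your point about hang messages that appear long
before the udev rule fires; for that the hook would also need to look at
the timestamps.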

> Since the number of bug reports is quite overwhelming, I think a
> suitable thing to do would be to lump similar automatic reports
> together by duplicating them to a master bug report. Most likely, the
> i8xx reports are mostly this issue:
> http://bugs.freedesktop.org/show_bug.cgi?id=26345 .

It could be.  Are we sufficiently confident that we could just dupe all
the bug reports in launchpad?  Or if we're not sure, we could go ahead
and start forwarding the bug reports and let upstream dupe them there.
The former is probably less total work, and like you mention we can
always undupe them ourselves as we learn more.

With 8xx, another option we could pursue would be to blacklist KMS in
the kernel and force them to use UMS instead.  Do you know if there has
been testing to verify that the freezes experienced by 8xx are specific
to KMS?  I'd hate to blacklist 845 for example, only to find it still
doesn't work.

I've removed the --kms-only flag on -intel, so it should now be possible
for 8xx users to switch off KMS via modeset=0, I think.  If we can get
some verification that this helps eliminate the freezes, let me know
and we can proceed with blacklisting 8xx chips.

> The bugs on i945
> also seem similar to one another. Then we can coordinate some testing
> from the master bug report, but ask people to comment on their
> findings on their own reports. That way the master bug report will not
> be overcommented and we can easily detach bug reports later.

Sounds good.

Bryce