[ubuntu-x] Automatic GPU lockup bug reports

Geir Ove Myhr gomyhr at gmail.com
Wed Mar 10 06:52:35 GMT 2010


I have been in touch with Chris Wilson at intel regarding automatic
GPU dumps. Here is what he wrote (two messages edited into one). I'll
follow up with my thoughts in a separate email.


---------- Forwarded message ----------
From: Chris Wilson <chris at chris-wilson.co.uk>
Date: Tue, Mar 9, 2010 at 6:25 PM
Subject: Re: Automatic GPU lockup bug reports with GPU dump in Ubuntu
To: Geir Ove Myhr <gomyhr at gmail.com>


On Mon, 8 Mar 2010 12:06:41 +0100, Geir Ove Myhr <gomyhr at gmail.com> wrote:
> Hi Chris,
>
> I wondered if you could help with some advice as to how to best
> collect useful information for GPU hung bugs in Ubuntu. There has been
> some code in Ubuntu for a few months that would automatically collect
> a GPU dump on a detected GPU hang, but it was only recently activated
> and the initial problems sorted out so that it now seems to collect
> useful information. I wondered if you could take a look and see if the
> current bug reports are good enough to send upstream or we need to
> change something before we can do that.
>
> Currently, it works like this:
>
> Following the advice from Jesse Barnes [1], there is an udev rule [2]
> that triggers before the GPU is reset and runs a script,
> /usr/share/apport/apport-gpu-error-intel.py which collects
> intel_gpu_dump output and other logs and offers to file a bug report
> with this information later. Since intel_gpu_dump is run before the
> GPU reset, it should collect the same information as
> intel_error_decode would do after the reset with the
> record-batchbuffer-after-GPU-error patch in the kernel. That is,
> unless there is some race condition and the GPU is reset while
> apport-gpu-error-intel.py is running.

Yes, the userspace notification is asynchronous and the kernel does not
wait before starting the reset procedure (if supported). Hence there is a
race to capture the accurate data.

The current i915_error_state gets around this by performing the capture in
the error handler and aims to collect all the data that is strictly relevant
to the crash. I would strongly recommend that this is used, and I want to
deprecate the ringbuffer_info and batchbuffers debug files in the future -
hence killing intel_gpu_dump.

> [1]: https://lists.ubuntu.com/archives/ubuntu-devel/2009-September/029014.html
> [2]: SUBSYSTEM=="drm", ACTION=="change", ENV{ERROR}=="1",
> RUN+="/usr/share/apport/apport-gpu-error-intel.py"
>
> So far we have around 25 bug reports submitted this way. The
> description is usually of low quality, but the logs seem to be okay to
> an untrained eye like mine. I had one of the reporters test with the
> latest drm-intel-next and attaching i915_error_state and it seems that
> all the information I could extract from that was also contained in
> the automatic report. This latter report is this one:
> https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/532100

This is odd as I can't see the up-to-date output of i915_error_state in
this bug report. Anyway that bug in particular is likely to the fb
misalignment issue.

> and this search should give you a list of all the reports (because of
> some limitations in the apport files, new bug reports of this kind end
> up in nvidia-graphics-drivers):
> https://launchpad.net/ubuntu/+bugs?field.searchtext=GPU+lockup&orderby=-datecreated
> The title is of the form [chipset] GPU lockup md5sum-of-intel_gpu_dump.
>
> If the automatically collected information is of good quality, is
> there anything else besides "what happened?" that we should ask the
> submitters?
> Two days ago the kernel 2.6.32-16 was uploaded with 2.6.33 drm. Should
> we ask people who reported with 2.6.32-15 (with 2.6.32 drm) to
> re-test?

Hmm, no. The interesting stuff is in 33-rc1, and I have a number of patches
to push to Anholt, so 33-rc2... Always jam tomorrow. ;-)

I need to have a look through that list and see if there is enough
information to classify those bugs.

> I hope that we can gather information in a way that is useful for you guys.

The information is definitely useful. And you play an important role in
filtering that information to provide good upstream bug reports. As far we
are concerned regressions are release critical, both performance and
stability. Fortunately, regressions also tend to be bisectable. After that
we need to establish cause and effect, and just as importantly which
silicon. This is well established in the bug reporting process.

> >> https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/532100
> > This is odd as I can't see the up-to-date output of i915_error_state in
> > this bug report. Anyway that bug in particular is likely to the fb
> > misalignment issue.
> Do you mean the patch "drm/i915: Increase fb alignment to 64k"? That
> should already be in the kernel. Or is there still an outstanding
> issue here?

There's a second patch I have to unbind the fb if it already has an
invalid alignment to hopefully address the same issue post resume. I'm
still waiting on feedback as to whether that helps. Otherwise, it is a
different issue.

> >> Two days ago the kernel 2.6.32-16 was uploaded with 2.6.33 drm. Should
> >> we ask people who reported with 2.6.32-15 (with 2.6.32 drm) to
> >> re-test?
> >
> > Hmm, no. The interesting stuff is in 33-rc1, and I have a number of patches
> > to push to Anholt, so 33-rc2... Always jam tomorrow. ;-)
>
> You mean 34-rc1 an 34-rc2, I suppose?

Yes. I'll be typing .33 for a while yet. :|

> Does that mean that reports from
> GPU lockups with 2.6.33 drm are no longer useful for upstream?

No, just the results are likely to be more reliable.

> We have
> pre-built drm-intel-next kernels that we may ask people to test. If
> this or 2.6.34-rcX is a requirement, then we may as well modify the
> udev rule to trigger after the GPU reset. Then the script that is
> triggered can check if the kernel is new enough to contain the
> record-batchbuffer patch and record <debugfs>/dri/*/i915_error_state
> along with dmesg, Xorg.0.log and other logs (btw, what does the *
> stand for? I have 0 and 64 on my computer). If the kernel is too old
> to give useful data, the script may as well exit silently. Does this
> sound like a better approach?

The vision is for multiple drm nodes to differentiate between master and
clients, with different minors for different roles. AFAIK, there currently
is no difference.

The only interesting thing about older kernels for us is regression
analysis. That likely can be done without the support of the gpu dumping
script, but you need to be careful that you remain bisecting the same
bug. It's not crash reports from older kernels aren't useful, it is just
that we need to check for known errors using unreliable means. And it is
likely that the interfaces will change again in future as we uncover more
information that we need to capture to diagnose bugs. At the moment I
think there is no better method than manual inspection to classify bugs.

> > The information is definitely useful. And you play an important role in
> > filtering that information to provide good upstream bug reports.
>
> Good. We just need to recalibrate those filters every now and then as
> the code changes. I have read a bit in the PRMs to get an idea of how
> the different registers should behave. I'm working on a wiki page [1]
> so that other non-developers may look at the ring- and batchbuffer
> dumps and do some basic screening. One thing I have noticed is that
> the ESR often gets set to 0x1, even when I check intel_gpu_dump on my
> computer as I write. Is this something we should ignore in the dumps?

Yes, EIR is the important one. I've started adding the logic to
intel_error_decode to convert the registers back to "English" which helps
a lot. I'd focus on using intel_error_decode, which is very similar to the
GPU dump.
-Chris

--
Chris Wilson, Intel Open Source Technology Centre



More information about the Ubuntu-x mailing list