Debugging tools/approach for GPU hangs?

Jesse Barnes jesse.barnes at intel.com
Fri Sep 4 16:12:09 BST 2009


On Fri, 4 Sep 2009 02:25:45 -0700
Bryce Harrington <bryce at canonical.com> wrote:

> On Thu, Sep 03, 2009 at 05:02:45PM -0700, Matt Zimmerman wrote:
> > With more of the graphics stack moving into the kernel, we are
> > starting to see more bugs of this type:
> > 
> > http://launchpad.net/bugs/359392
> > http://launchpad.net/bugs/388357
> > http://launchpad.net/bugs/424055
> >
> > Where the GPU is hung, but the system is otherwise still
> > responsive.  This is annoyingly difficult to debug, with the
> > primary technique being to ssh into the system from a nearby one
> > (because the console is useless).
> 
> Actually there have been GPU hang bugs for a long time.  It's just
> that there wasn't a way to debug them until recently.
> 
> > I think it would be a worthwhile investment to work on improved
> > tools and methods for debugging this scenario, including:
> > 
> >  * Detecting (programmatically) when this situation occurs and
> > capturing an apport problem report, as described in
> >    http://mdzlog.alcor.net/2009/06/17/collecting-debug-information-when-your-gpu-hangs/
> > 
> >    Bryce (and Jesse Barnes at Intel) mentioned that the kernel is
> > now supposed to log an error message when this happens, but I've
> > never seen evidence of that happening.
> 
> I'm cc'ing jbarnes here.  Last I heard this was implemented upstream
> but hadn't yet filtered down.

Yeah, the kernel should be sending a uevent from the drm device now as
well.  I've seen dumps occur in the wild too; when an error occurs the
kernel will dump some info to dmesg (not a complete dump, just a
summary) and capture the error record in debugfs under
i915_error_state.  If you catch the uevent, you could perform a GPU
dump and capture the dmesg and other debugfs state into a package.
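
A rough sketch of what such a collector might look like, in Python.
Several things here are assumptions rather than confirmed details: that
the event arrives as a drm "change" uevent carrying ERROR=1, that
debugfs is mounted at /sys/kernel/debug with the device at dri/0, and
the output directory. The function names are made up for illustration.
It only grabs dmesg and i915_error_state; a full GPU dump step is left
out.

```python
import os
import socket
import subprocess
import tarfile
import time

NETLINK_KOBJECT_UEVENT = 15  # value from <linux/netlink.h>

# Assumed location: debugfs must be mounted, and the DRM minor may differ.
ERROR_STATE = "/sys/kernel/debug/dri/0/i915_error_state"


def parse_uevent(payload: bytes) -> dict:
    """Split a kernel uevent (NUL-separated KEY=VALUE strings) into a dict."""
    fields = {}
    for part in payload.split(b"\0"):
        key, sep, value = part.decode("utf-8", "replace").partition("=")
        if sep:  # skip the "action@devpath" header and empty trailers
            fields[key] = value
    return fields


def is_gpu_error(fields: dict) -> bool:
    """Assumption: a hang shows up as a drm uevent with ERROR=1."""
    return fields.get("SUBSYSTEM") == "drm" and fields.get("ERROR") == "1"


def collect(out_dir: str, error_state: str = ERROR_STATE) -> str:
    """Save dmesg and the error record into a tarball; return its path."""
    os.makedirs(out_dir, exist_ok=True)
    bundle = os.path.join(out_dir, "gpu-hang-%d.tar.gz" % int(time.time()))
    dmesg_path = os.path.join(out_dir, "dmesg.txt")
    with open(dmesg_path, "wb") as f:
        f.write(subprocess.run(["dmesg"], capture_output=True).stdout)
    with tarfile.open(bundle, "w:gz") as tar:
        tar.add(dmesg_path, arcname="dmesg.txt")
        if os.path.exists(error_state):
            # debugfs files report size 0, so copy the contents to a
            # regular file before archiving.
            copy_path = os.path.join(out_dir, "i915_error_state.txt")
            with open(error_state, "rb") as src, open(copy_path, "wb") as dst:
                dst.write(src.read())
            tar.add(copy_path, arcname="i915_error_state.txt")
    return bundle


def watch(out_dir: str = "/var/tmp/gpu-hang") -> None:
    """Block on the kernel uevent netlink socket and collect on each error."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_DGRAM,
                         NETLINK_KOBJECT_UEVENT)
    sock.bind((0, 1))  # pid 0: kernel assigns; group 1: kernel uevents
    while True:
        if is_gpu_error(parse_uevent(sock.recv(65536))):
            print("saved", collect(out_dir))
```

Calling watch() as root (needed to read debugfs) would leave it waiting
for the next error event. Listening on the raw netlink socket avoids a
dependency on udev bindings; a udev rule that runs a script would
probably work just as well.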

> >  * Providing some means for the user to get the system into a
> > debuggable state, i.e. where they can see something on the screen.
> > Maybe it's possible to re-POST the video device to see if it gets
> > back to a sane state?
> > 
> >  * Documenting all of the above so that it can be easily executed by
> >    reasonably technical users

There are some reset patches available, but they don't seem to work in
all cases.  It would be good to know if they were useful for any of
your reported hangs...  It might be that they're better than nothing
and we should push them upstream now.

-- 
Jesse Barnes, Intel Open Source Technology Center


