Toluwaleke Ogundipe: GPU Reset Recovery in Mutter: A Progress Update
news.movim.eu / PlanetGnome • 15:05 • 10 minutes
It’s overdue, but here is my first progress update. If you haven’t read my introductory post, the short version: I’m implementing GPU reset recovery in Mutter, the Wayland compositor at the heart of GNOME Shell. When the GPU encounters a hardware- or driver-level fault and resets, invalidating the EGL context and wiping out all allocated GPU memory, Mutter currently has no way to recover. My project aims to change that.
Here is where things stand at the time of this writing:
Graphics reset recovery demo
After a period of normal operation, a reset is triggered: the display goes dark, then comes back once a new framebuffer is created (currently triggered manually by maximizing the active window via a keyboard shortcut). Windows are rendering, input works, the session is alive. You’ll notice the desktop background looks wrong after recovery; I’ll explain why later in this post. The compositor itself, however, is no longer dead.
The Problem
A GPU reset is the hardware’s way of recovering from a hang or fault in the graphics pipeline. For the rest of the system, it means the GPU’s state has been wiped: any GL context that existed before the reset is invalid, and all GPU-allocated memory (textures, framebuffers, shader programs, etc.) is lost.
Mutter’s fundamental problem is that it doesn’t create a robust GL context. While Mutter does call GetGraphicsResetStatus() as part of its existing rendering logic, without opting into the appropriate reset notification mechanism, that call will never return a reset status, even when one has occurred. Mutter (on main) simply has no way to detect a GPU reset, let alone recover from one.
The consequences in practice are severe. On some drivers (notably Mesa’s radeonsi/amdgpu), the driver deliberately kills the process when a non-robust GL context is present for a GPU that was reset. On others, Mutter would continue rendering blindly against an invalid context, with monitors on the affected GPU frozen on the last frame before the reset. GL errors accumulate on every subsequent frame, and the only way out is to kill the process. Either way, the session is lost, along with every application running within it. This has been a known, long-standing issue.
The Foundation
The API that makes detection and recovery possible is provided by the EXT_robustness OpenGL ES extension and its EGL counterpart, EXT_create_context_robustness. By specifying the EGL_CONTEXT_OPENGL_RESET_NOTIFICATION_STRATEGY attribute with the value EGL_LOSE_CONTEXT_ON_RESET at context creation, we opt into deterministic reset notification: GetGraphicsResetStatus() will now reliably return a reset status when one has occurred, and the context is invalidated in a well-defined manner. Additionally, GetGraphicsResetStatus() can be called repeatedly with the lost context itself and will return NO_ERROR once the reset has fully completed, which is precisely how we know when it is safe to attempt restoration. It’s worth noting this isn’t universally available: not every driver implements reset notification, and even where the EGL/GL layer supports it, actual reset detection depends on cooperation from the kernel driver. Robustness is something we can build on, not something we can assume.
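A minimal sketch of the opt-in described above, written so that it compiles without EGL installed. The two EGL_..._EXT constants and EGL_NONE carry the values published in the Khronos EGL headers and are defined locally only for that reason; has_extension() and build_context_attribs() are hypothetical helpers for illustration, not Cogl functions. In real code the resulting list would be passed to eglCreateContext().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values as published in the Khronos EGL headers; defined here only so the
 * sketch builds standalone. Real code would include <EGL/eglext.h>. */
#ifndef EGL_NONE
#define EGL_NONE 0x3038
#endif
#ifndef EGL_CONTEXT_OPENGL_RESET_NOTIFICATION_STRATEGY_EXT
#define EGL_CONTEXT_OPENGL_RESET_NOTIFICATION_STRATEGY_EXT 0x3138
#endif
#ifndef EGL_LOSE_CONTEXT_ON_RESET_EXT
#define EGL_LOSE_CONTEXT_ON_RESET_EXT 0x31BF
#endif

/* Hypothetical helper: true if `name` appears as a whole, space-separated
 * token in the EGL extension string. */
static bool
has_extension (const char *extensions, const char *name)
{
  size_t len = strlen (name);
  const char *p = extensions;

  if (len == 0)
    return false;

  while ((p = strstr (p, name)) != NULL)
    {
      bool starts_token = (p == extensions || p[-1] == ' ');
      bool ends_token = (p[len] == ' ' || p[len] == '\0');

      if (starts_token && ends_token)
        return true;
      p += len;
    }

  return false;
}

/* Hypothetical helper: fills `attribs` (room for 3 entries) and returns the
 * number of entries written, including the EGL_NONE terminator. The reset
 * notification strategy is only requested when the extension is advertised. */
static size_t
build_context_attribs (const char *egl_extensions, int32_t attribs[3])
{
  size_t n = 0;

  if (has_extension (egl_extensions, "EGL_EXT_create_context_robustness"))
    {
      attribs[n++] = EGL_CONTEXT_OPENGL_RESET_NOTIFICATION_STRATEGY_EXT;
      attribs[n++] = EGL_LOSE_CONTEXT_ON_RESET_EXT;
    }

  attribs[n++] = EGL_NONE;
  return n;
}
```

The whole-token check matters because extension names can be prefixes of one another; a plain substring search could report support that isn't there.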
Before GSoC began, Robert had already laid the groundwork for this. His commit did the following:
- Registered EXT_create_context_robustness as a Cogl EGL winsys feature, so that its availability can be queried at runtime.
- Added EGL_CONTEXT_OPENGL_RESET_NOTIFICATION_STRATEGY and EGL_LOSE_CONTEXT_ON_RESET to the attribute list at context creation, when that feature is available, creating a robust context for the first time.
- Added a stub clutter_backend_reset_context() as the hook where recovery logic would eventually live. For the time being, it simply emitted a warning.
- Wired up the detection path in meta_compositor_real_after_paint(); whenever GetGraphicsResetStatus() returned a reset status, the stub was called.
He also added support for simulating GPU resets in Mesa’s llvmpipe software renderer (MR !40681). When a file is created or modified at the path specified by the LP_CONTEXT_RESET_FILE environment variable, all llvmpipe contexts using the LOSE_CONTEXT_ON_RESET strategy that were created before that change start reporting a reset; contexts created afterwards correctly report no error. Without this, the only way to test would be to induce actual hardware or driver failures, which is considerably less convenient.
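As a rough illustration, a test helper could trigger such a simulated reset by touching that file. The sketch below is an assumption about how one might do it, not part of Mesa or Mutter: trigger_simulated_reset() is a hypothetical name, and appending a byte is simply one way to cover both the "create" and the "modify" case, since the post doesn't say how llvmpipe notices the change.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: creates the trigger file, or modifies it if it already
 * exists, by appending a single byte. If `path` is NULL, the location is read
 * from the LP_CONTEXT_RESET_FILE environment variable.
 * Returns 0 on success and -1 on failure. */
static int
trigger_simulated_reset (const char *path)
{
  FILE *f;

  if (path == NULL)
    path = getenv ("LP_CONTEXT_RESET_FILE");
  if (path == NULL || path[0] == '\0')
    return -1;

  f = fopen (path, "a");
  if (f == NULL)
    return -1;

  if (fputc ('\n', f) == EOF)
    {
      fclose (f);
      return -1;
    }

  return fclose (f) == 0 ? 0 : -1;
}
```

Presumably the compositor under test needs the same variable in its own environment, so that llvmpipe knows which path to look at.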
The Recovery Cycle
The naive first instinct, which is to detect the reset, immediately recreate the context, and resume rendering, does not work, and for a subtle reason: a GPU reset is not instantaneous. After GetGraphicsResetStatus() first returns an error, the hardware may still be in the process of resetting. Attempting to recreate the context mid-reset invites further failure, and so the correct approach is to wait for the reset to complete before attempting any restoration.
This led to designing the recovery as a cycle with two phases, Reset and Restoration, tracked by a set of five states.
All of this state management currently lives in ClutterBackend, with initial reset detection in ClutterStageView’s frame handler. When GetGraphicsResetStatus() first returns an error, the state transitions to RESET_IN_PROGRESS and a GLib timeout source begins polling for reset completion every 20 milliseconds. Once GetGraphicsResetStatus() returns NO_ERROR, indicating the hardware has finished resetting, we move to RESET_COMPLETED. If the reset takes longer than 2 seconds, we proceed to restoration anyway; empirically, if a reset is going to complete, it does so quickly.
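The reset phase can be modelled as a small state machine. The following is a sketch under stated assumptions rather than the code on my branch: the type and function names are hypothetical, RECOVERY_IDLE is a made-up name for the one state the text does not name, the poll is driven by plain function calls instead of a GLib timeout source, and the driver query is replaced by a boolean argument. What state the machine sits in on the timeout path is also an assumption; here it simply stays put and reports that restoration should start.

```c
#include <stdbool.h>

#define POLL_INTERVAL_MS 20
#define RESET_TIMEOUT_MS 2000

typedef enum
{
  RECOVERY_IDLE, /* hypothetical name for the normal, no-reset state */
  RECOVERY_RESET_IN_PROGRESS,
  RECOVERY_RESET_COMPLETED,
  RECOVERY_RESTORING,
  RECOVERY_FAILED,
} RecoveryState;

typedef enum
{
  POLL_CONTINUE,          /* keep the poll source alive */
  POLL_START_RESTORATION, /* stop polling and schedule restoration */
} PollResult;

typedef struct
{
  RecoveryState state;
  int elapsed_ms; /* time spent polling so far */
} Recovery;

/* Called when the frame handler first sees a reset status. */
static void
recovery_begin (Recovery *r)
{
  if (r->state == RECOVERY_IDLE)
    {
      r->state = RECOVERY_RESET_IN_PROGRESS;
      r->elapsed_ms = 0;
    }
}

/* One tick of the 20 ms poll. `status_is_no_error` stands in for
 * "GetGraphicsResetStatus() returned NO_ERROR". */
static PollResult
recovery_poll (Recovery *r, bool status_is_no_error)
{
  if (r->state != RECOVERY_RESET_IN_PROGRESS)
    return POLL_START_RESTORATION;

  r->elapsed_ms += POLL_INTERVAL_MS;

  if (status_is_no_error)
    {
      r->state = RECOVERY_RESET_COMPLETED;
      return POLL_START_RESTORATION;
    }

  /* Give up waiting after 2 seconds and try to restore anyway. */
  if (r->elapsed_ms >= RESET_TIMEOUT_MS)
    return POLL_START_RESTORATION;

  return POLL_CONTINUE;
}

/* Called from outside frame dispatch once restoration actually starts. */
static void
recovery_start_restoration (Recovery *r)
{
  r->state = RECOVERY_RESTORING;
}
```

With a 20 ms interval and a 2-second limit, a reset that never completes is polled 100 times before the machine moves on.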
The restoration phase runs entirely outside the frame dispatch loop, using GLib idle and timeout sources. When it starts, the state transitions to RESTORING. An important lesson from an earlier implementation, for which I have Jonas to thank, was that restoration must not happen during frame dispatch, because that code runs per monitor; you do not want to recreate the EGL context for every connected display.
During recovery, every frame dispatch is aborted. The frame handler signals to the frame clock that the frame should be dropped without scheduling a replacement. This check occurs both at the beginning of the frame handler, before any rendering occurs, and at the end, to catch resets that occur mid-frame.
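The double check can be sketched as follows. This is a toy model with hypothetical names (ToyBackend, handle_frame(), the FrameResult values), and the reset query is replaced by booleans injected by the caller; the real frame handler is of course more involved.

```c
#include <stdbool.h>

typedef enum
{
  FRAME_PRESENTED,
  FRAME_DROPPED, /* dropped without scheduling a replacement frame */
} FrameResult;

typedef struct
{
  bool recovering;    /* set once a reset is seen, cleared after restoration */
  int paint_attempts; /* how many times painting was actually started */
} ToyBackend;

/* Stand-in for the reset query: `gpu_was_reset` is injected by the caller.
 * Once a reset has been seen, the backend stays in recovery. */
static bool
check_for_reset (ToyBackend *b, bool gpu_was_reset)
{
  if (gpu_was_reset)
    b->recovering = true;
  return b->recovering;
}

static FrameResult
handle_frame (ToyBackend *b, bool reset_before_paint, bool reset_during_paint)
{
  /* Check 1: before any rendering happens. */
  if (check_for_reset (b, reset_before_paint))
    return FRAME_DROPPED;

  b->paint_attempts++; /* painting would happen here */

  /* Check 2: after painting, to catch a reset that occurred mid-frame. */
  if (check_for_reset (b, reset_during_paint))
    return FRAME_DROPPED;

  return FRAME_PRESENTED;
}
```

Once the flag is set, later frames are dropped at the first check and never reach the paint step, which is the behaviour wanted while recovery is in flight.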
If restoration fails and retries exhaust a 2-second timeout, the state transitions to FAILED and Mutter exits gracefully with a descriptive error message, rather than aborting or hanging indefinitely.
Restoring The Graphics Pipeline
Once RESET_COMPLETED is reached, restoration begins. The core of it lives in clutter_backend_restore_graphics(), and the sequence is:
1. Unref and destroy the current CoglContext.
2. Tear down the CoglDisplay, destroying the underlying EGLContext.
3. Re-setup the CoglDisplay, creating a fresh EGLContext.
4. Create a new CoglContext against the newly set up display.
5. Emit ClutterBackend::graphics-reset to notify everything else.
Step 2 required a small new addition to Cogl: cogl_display_destroy(). Previously, there was no way to tear down a CoglDisplay’s contents without destroying the object itself, which would have invalidated references to it held throughout the codebase. The function calls the object’s destroy() implementation and marks it as no longer set up, leaving the object intact and ready to be re-setup.
Straightforward in isolation, but the interesting work is in everything that happens at step 5.
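To make the shape of the sequence concrete, here is a toy model of the five steps. None of these types or functions are the real Cogl or Clutter ones; they are hypothetical stand-ins, with a generation counter playing the role of the EGL context and a counter playing the role of the signal emission. The point it tries to show is that the display object survives the teardown, so pointers to it held elsewhere remain valid.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for CoglDisplay: the object persists across resets, while the
 * EGL context it owns is modelled as a generation counter. */
typedef struct
{
  bool is_setup;
  int egl_context_generation;
} ToyDisplay;

/* Toy stand-in for CoglContext. */
typedef struct
{
  ToyDisplay *display;
  int generation; /* which EGL context this was created against */
} ToyContext;

typedef struct
{
  ToyDisplay *display;
  ToyContext *context;
  int graphics_reset_emissions; /* stands in for emitting the signal */
} ToyBackend;

/* Tears down the display's contents but keeps the object itself alive. */
static void
toy_display_destroy (ToyDisplay *d)
{
  d->is_setup = false;
}

/* Sets the display up again, creating a fresh "EGL context". */
static bool
toy_display_setup (ToyDisplay *d)
{
  if (!d->is_setup)
    {
      d->egl_context_generation++;
      d->is_setup = true;
    }
  return true;
}

static ToyContext *
toy_context_new (ToyDisplay *d)
{
  ToyContext *c;

  if (!d->is_setup)
    return NULL;

  c = malloc (sizeof *c);
  if (c == NULL)
    return NULL;

  c->display = d;
  c->generation = d->egl_context_generation;
  return c;
}

static bool
toy_backend_restore_graphics (ToyBackend *b)
{
  free (b->context);                         /* 1. drop the old context */
  b->context = NULL;
  toy_display_destroy (b->display);          /* 2. tear down the display */
  if (!toy_display_setup (b->display))       /* 3. set it up again */
    return false;
  b->context = toy_context_new (b->display); /* 4. new context */
  if (b->context == NULL)
    return false;
  b->graphics_reset_emissions++;             /* 5. notify everything else */
  return true;
}
```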
Propagating the reset through the compositor
ClutterBackend::graphics-reset is the hook through which the rest of the compositor learns that a new EGL context exists and needs to respond. Several objects connect to it, and the order in which their handlers run matters significantly.
The current sequence, enforced through the use of the G_CONNECT_AFTER flag, is as follows:
1. ClutterStage unrealizes: The stage is hidden and unrealize() is called on the stage actor, cascading down to all child actors (including every ClutterText in the scene graph). This must happen before the font renderer is recreated; more on why in the next section.
2. ClutterContext recreates the font renderer: ClutterPangoRenderer, which holds the GPU-backed glyph cache, is destroyed and recreated with the new CoglContext.
3. MetaBackend updates the stage: Stage views are rebuilt, and cursor rendering is updated.
4. ClutterStage realizes: The stage is realized against the freshly rebuilt views and shown again.
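The effect of the flag can be illustrated with a toy signal that mimics the relevant part of GLib’s emission order: handlers connected normally run first, in connection order, and handlers connected "after" run once those are done, again in connection order. This is a deliberately simplified model (no class handler, no GObject), all names are hypothetical, and which of the four handlers actually uses G_CONNECT_AFTER in my branch is an assumption made here for the sake of the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_HANDLERS 8

typedef void (*ToyHandler) (char *log);

typedef struct
{
  ToyHandler normal[MAX_HANDLERS];
  size_t n_normal;
  ToyHandler after[MAX_HANDLERS];
  size_t n_after;
} ToySignal;

/* Connects a handler; `after` plays the role of G_CONNECT_AFTER. */
static bool
toy_connect (ToySignal *s, ToyHandler h, bool after)
{
  if (after)
    {
      if (s->n_after == MAX_HANDLERS)
        return false;
      s->after[s->n_after++] = h;
    }
  else
    {
      if (s->n_normal == MAX_HANDLERS)
        return false;
      s->normal[s->n_normal++] = h;
    }
  return true;
}

/* Runs normal handlers in connection order, then "after" handlers. */
static void
toy_emit (const ToySignal *s, char *log)
{
  size_t i;

  for (i = 0; i < s->n_normal; i++)
    s->normal[i] (log);
  for (i = 0; i < s->n_after; i++)
    s->after[i] (log);
}

/* Each handler appends its step number to the log. */
static void stage_unrealize (char *log)        { strcat (log, "1"); }
static void recreate_font_renderer (char *log) { strcat (log, "2"); }
static void update_stage (char *log)           { strcat (log, "3"); }
static void stage_realize (char *log)          { strcat (log, "4"); }
```

In this model the stage can connect its realize handler first, flagged as "after", and it still runs last. The three normal handlers, however, still run in whatever order they were connected, which is the remaining fragility.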
One improvement made along the way: Compositor view recreation was previously triggered by a signal emitted when monitors or monitor settings change. With graphics reset recovery requiring the same operation, MetaCompositor would have needed to listen to two sources, and that would have raised ordering concerns relative to MetaBackend’s handler, which rebuilds the stage views. Instead, a new MetaRenderer::views-rebuilt signal was added, emitted at the end of meta_renderer_real_rebuild_views() regardless of what triggered the rebuild. MetaCompositor now listens to that single, unified signal, and the ordering fragility is avoided entirely.
The signal handler ordering for the graphics-reset signal still depends to some extent on GLib connection order, which is not ideal. A more explicit, well-defined ordering mechanism is already in the works.
Recovering the glyph cache
Window content and client-rendered surfaces come back naturally once the context is recreated and stage views are rebuilt: clients re-render their Wayland buffers, and Mutter composites them. But some GPU-tied state lives inside the compositor itself and needs explicit recovery. The most intricate case encountered so far is the glyph cache.
ClutterPangoRenderer maintains a texture atlas of rendered glyphs, built up as text is drawn to the screen. When the GPU context is lost, that atlas is gone. Simply recreating the renderer, as ClutterContext does in step 2 above, creates a fresh, empty one. But there is a subtlety: ClutterText actors internally cache PangoLayout objects, and each layout carries rendering data tied to the old renderer via GObject qdata. Drawing text with a stale layout against the new renderer produces incorrect results or crashes.
The fix required a small chain of additions:
- clutter_forget_layout(), a new function in clutter-pango-render, removes the qdata from a given PangoLayout, severing its tie to the old renderer. It verifies that both the renderer and the qdata exist and that the qdata was produced by the current renderer before clearing it.
- ClutterText.unrealize(), a new virtual method implementation, calls clutter_forget_layout() on each of ClutterText’s internally cached layouts when the actor is unrealized.
- When ClutterStage unrealizes (step 1), the unrealize cascade reaches every ClutterText actor in the scene graph, clearing stale layout data across the board.
This is precisely why the stage must unrealize before the font renderer is recreated. ClutterText’s unrealize() calls clutter_forget_layout(), which requires the old renderer to still be alive to destroy the layout’s qdata. If the renderer had already been replaced, it would be impossible to destroy that data correctly. The stage is then re-realized in step 4, after the new renderer exists and stage views have been rebuilt, allowing text to render cleanly from a fresh cache.
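The ordering constraint is easier to see in a toy model. The sketch below uses hypothetical stand-in types (no Pango, no GObject qdata) and a plain global for the current renderer; it only mirrors the three checks described above and shows what happens when the renderer is swapped before the layouts are forgotten.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define N_CACHED_LAYOUTS 3 /* arbitrary size for the sketch */

typedef struct { int id; } ToyRenderer;                     /* ClutterPangoRenderer stand-in */
typedef struct { const ToyRenderer *renderer; } LayoutData; /* qdata payload stand-in */
typedef struct { LayoutData *qdata; } ToyLayout;            /* PangoLayout stand-in */
typedef struct { ToyLayout cached[N_CACHED_LAYOUTS]; } ToyText; /* ClutterText stand-in */

static const ToyRenderer *current_renderer; /* toy global, NULL when absent */

/* Attaches render data produced by the current renderer to a layout. */
static bool
toy_layout_attach (ToyLayout *layout)
{
  LayoutData *data;

  if (current_renderer == NULL || layout->qdata != NULL)
    return false;

  data = malloc (sizeof *data);
  if (data == NULL)
    return false;

  data->renderer = current_renderer;
  layout->qdata = data;
  return true;
}

/* Clears a layout's render data, but only if a renderer exists, the data
 * exists, and the data was produced by the current renderer. */
static bool
toy_forget_layout (ToyLayout *layout)
{
  if (current_renderer == NULL)
    return false;
  if (layout->qdata == NULL)
    return false;
  if (layout->qdata->renderer != current_renderer)
    return false;

  free (layout->qdata);
  layout->qdata = NULL;
  return true;
}

/* Forgets every cached layout; returns how many were actually cleared. */
static int
toy_text_unrealize (ToyText *text)
{
  int cleared = 0;
  size_t i;

  for (i = 0; i < N_CACHED_LAYOUTS; i++)
    if (toy_forget_layout (&text->cached[i]))
      cleared++;

  return cleared;
}
```

If the current renderer is replaced before toy_text_unrealize() runs, the ownership check fails and the stale data stays attached, which is the situation the unrealize-first ordering is meant to rule out.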
Where things stand
As the recording shows, the compositor survives a GPU reset, and the session remains usable: windows update correctly, input is responsive, and the session doesn’t crash or freeze. There are two notable gaps, though.
First, framebuffer recreation isn’t automatic yet. After recovery, the display stays dark until something triggers the creation of a new framebuffer; in the recording, I do this manually by maximizing the active window via a keyboard shortcut. Without that nudge, stage views are rebuilt, but nothing causes a fresh framebuffer to actually be allocated, so the screen just stays black. Making this automatic is one of the next things to sort out.
Second, once the display is back, the desktop background renders with incorrect or garbled textures. MetaBackgroundImage and MetaBackground hold references to GPU textures (primarily the background image loaded from disk), and MetaBackgroundContent defines GLSL shaders, all of which are invalidated by the reset. Recovering them requires updating these objects to re-upload their GPU-side resources after the reset.
There are also residual GL errors after recovery that need investigation, and the signal handler ordering situation deserves a more deterministic solution.
What’s next
- Signal handler ordering: replacing the implicit connection order dependency with an explicit, documented mechanism.
- Automatic framebuffer recreation: removing the dependency on a manual trigger (like maximizing a window) for the display to actually come back after recovery.
- Background texture recovery: getting MetaBackground and its counterparts to listen to the graphics reset signal and reload their GPU-side resources.
- Auditing remaining GPU-tied resources: ensuring nothing else in the compositor holds stale references after recovery.
- MR review and upstream integration: the current implementation lives on my fork and has so far been reviewed primarily by my mentors; as the approach stabilises, it will be proposed for upstream review.
Thanks
A huge thank you to my mentors Jonas Ådahl, Robert Mader, and Carlos Garnacho for their incredible guidance since the beginning. I’d also like to thank Bilal Elmoussaoui for the valuable review comments on my merge request. On to what’s next!