      Toluwaleke Ogundipe: GPU Reset Recovery in Mutter: A Progress Update

      news.movim.eu / PlanetGnome • 15:05 • 10 minutes

    It’s overdue, but here is my first progress update. If you haven’t read my introductory post, the short version: I’m implementing GPU reset recovery in Mutter, the Wayland compositor at the heart of GNOME Shell. When the GPU encounters a hardware- or driver-level fault and resets, invalidating the EGL context and wiping out all allocated GPU memory, Mutter currently has no way to recover. My project aims to change that.

    Here is where things stand at the time of this writing:

    [Video: Graphics reset recovery demo]

    After a period of normal operation, a reset is triggered: the display goes dark, then comes back once a new framebuffer is created (currently triggered manually by maximizing the active window via a keyboard shortcut). Windows are rendering, input works, the session is alive. You’ll notice the desktop background looks wrong after recovery; I’ll explain why later in this post. The compositor itself, however, is no longer dead.

    The Problem

    A GPU reset is the hardware’s way of recovering from a hang or fault in the graphics pipeline. For the rest of the system, it means the GPU’s state has been wiped: any GL context that existed before the reset is invalid, and all GPU-allocated memory (textures, framebuffers, shader programs, etc.) is lost.

    Mutter’s fundamental problem is that it doesn’t create a robust GL context. While Mutter does call GetGraphicsResetStatus() as part of its existing rendering logic, without opting into the appropriate reset notification mechanism, that call will never return a reset status, even when one has occurred. Mutter (on main) simply has no way to detect a GPU reset, let alone recover from one.

    The consequences in practice are severe. On some drivers (notably Mesa’s radeonsi/amdgpu), the driver deliberately kills the process when a non-robust GL context is present for a GPU that was reset. On others, Mutter would continue rendering blindly against an invalid context, with monitors on the affected GPU frozen on the last frame before the reset. GL errors accumulate on every subsequent frame, and the only way out is to kill the process. Either way, the session is lost, along with every application running within it. This has been a known, long-standing issue.

    The Foundation

    The API that makes detection and recovery possible is provided by the EXT_robustness OpenGL ES extension and its EGL counterpart, EXT_create_context_robustness. By specifying the EGL_CONTEXT_OPENGL_RESET_NOTIFICATION_STRATEGY attribute with the value EGL_LOSE_CONTEXT_ON_RESET at context creation, we opt into deterministic reset notification: GetGraphicsResetStatus() will now reliably return a reset status when one has occurred, and the context is invalidated in a well-defined manner. Additionally, GetGraphicsResetStatus() can be called repeatedly with the lost context itself and will return NO_ERROR once the reset has fully completed, which is precisely how we know when it is safe to attempt restoration. It’s worth noting this isn’t universally available: not every driver implements reset notification, and even where the EGL/GL layer supports it, actual reset detection depends on cooperation from the kernel driver. Robustness is something we can build on, not something we can assume.
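
    As a rough illustration of the opt-in (not Cogl's actual code), the attribute list passed to eglCreateContext() could be assembled as below. The helper name, the SKETCH_ prefixes, and the client version of 2 are assumptions made for this sketch. The token values are the ones published in the Khronos EGL registry for the _EXT-suffixed names; real code takes them from <EGL/egl.h> and <EGL/eglext.h> rather than redefining them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Registry values, redefined here only so the sketch builds without EGL
 * headers. The SKETCH_ names are not real EGL identifiers. */
enum {
  SKETCH_EGL_NONE = 0x3038,
  SKETCH_EGL_CONTEXT_CLIENT_VERSION = 0x3098,
  SKETCH_EGL_CONTEXT_OPENGL_RESET_NOTIFICATION_STRATEGY_EXT = 0x3138,
  SKETCH_EGL_LOSE_CONTEXT_ON_RESET_EXT = 0x31BF,
};

#define SKETCH_MAX_ATTRIBS 8

/* Hypothetical helper: fills `attribs` and returns the number of entries
 * written, including the terminating EGL_NONE. */
size_t
sketch_build_context_attribs (bool has_robustness,
                              int32_t attribs[SKETCH_MAX_ATTRIBS])
{
  size_t n = 0;

  assert (attribs != NULL);

  /* Assumption: a GLES 2 context; the post does not say which version. */
  attribs[n++] = SKETCH_EGL_CONTEXT_CLIENT_VERSION;
  attribs[n++] = 2;

  /* Only request reset notification when the extension was advertised. */
  if (has_robustness)
    {
      attribs[n++] = SKETCH_EGL_CONTEXT_OPENGL_RESET_NOTIFICATION_STRATEGY_EXT;
      attribs[n++] = SKETCH_EGL_LOSE_CONTEXT_ON_RESET_EXT;
    }

  attribs[n++] = SKETCH_EGL_NONE;
  return n;
}
```

    With has_robustness set to false, the list degrades to a plain context request, which matches the point that robustness cannot be assumed.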

    Before GSoC began, Robert had already laid the groundwork for this. His commit did the following:

    • Registered EXT_create_context_robustness as a Cogl EGL winsys feature, so that its availability can be queried at runtime.
    • Added EGL_CONTEXT_OPENGL_RESET_NOTIFICATION_STRATEGY and EGL_LOSE_CONTEXT_ON_RESET to the attribute list at context creation, when that feature is available, creating a robust context for the first time.
    • Added a stub clutter_backend_reset_context() as the hook where recovery logic would eventually live. For the time being, it simply emitted a warning.
    • Wired up the detection path in meta_compositor_real_after_paint(); whenever GetGraphicsResetStatus() returned a reset status, the stub was called.

    He also added support for simulating GPU resets in Mesa’s llvmpipe software renderer (MR !40681). By creating or modifying a file at a path specified by the LP_CONTEXT_RESET_FILE environment variable, all llvmpipe contexts using the LOSE_CONTEXT_ON_RESET strategy that were created before that file was created or modified start reporting a reset; contexts created afterwards correctly report no error. Without this, the only way to test would be to induce actual hardware or driver failures, which is considerably less convenient.
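
    For test tooling, triggering the simulated reset could look something like the sketch below. Both function names are hypothetical, and it is an assumption that appending a byte counts as "modifying" the file for llvmpipe's purposes; only the LP_CONTEXT_RESET_FILE variable name comes from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: creates the file if it is missing, otherwise
 * modifies it by appending a newline. Returns true on success. */
bool
sketch_touch_reset_file (const char *path)
{
  FILE *file;
  bool wrote;

  assert (path != NULL);

  file = fopen (path, "a");
  if (file == NULL)
    return false;

  wrote = fputc ('\n', file) != EOF;
  return fclose (file) == 0 && wrote;
}

/* Hypothetical wrapper: reads the path from the environment variable and
 * does nothing when the variable is unset or empty. */
bool
sketch_trigger_llvmpipe_reset (void)
{
  const char *path = getenv ("LP_CONTEXT_RESET_FILE");

  return path != NULL && path[0] != '\0' && sketch_touch_reset_file (path);
}
```

    The compositor under test has to be started with the same variable set, and with llvmpipe as the renderer, for the trigger to have any effect.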

    The Recovery Cycle

    The naive first instinct, which is to detect the reset, immediately recreate the context, and resume rendering, does not work, and for a subtle reason: a GPU reset is not instantaneous. After GetGraphicsResetStatus() first returns an error, the hardware may still be in the process of resetting. Attempting to recreate the context mid-reset invites further failure, and so the correct approach is to wait for the reset to complete before attempting any restoration.

    This led to designing the recovery as a cycle with two phases, Reset and Restoration, tracked by a set of five states:

    [Figure: Graphics reset recovery cycle. A state diagram showing the five states of Mutter's graphics reset recovery cycle: Normal Operation transitions to Reset in Progress when a reset is detected, which transitions to Reset Completed once the hardware finishes resetting or after a 2 second timeout. Reset Completed transitions down to Restoring, which either loops back to Normal Operation once the context and resources are successfully restored, or transitions to Recovery Failed after retries time out, exiting the process.]

    All of this state management currently lives in ClutterBackend, with initial reset detection in ClutterStageView’s frame handler. When GetGraphicsResetStatus() first returns an error, the state transitions to RESET_IN_PROGRESS and a GLib timeout source begins polling for reset completion every 20 milliseconds. Once GetGraphicsResetStatus() returns NO_ERROR, indicating the hardware has finished resetting, we move to RESET_COMPLETED. If the reset takes longer than 2 seconds, we proceed to restoration anyway; empirically, if a reset is going to complete, it does so quickly.

    The restoration phase runs entirely outside the frame dispatch loop, using GLib idle and timeout sources. When it starts, the state transitions to RESTORING. An important lesson from an earlier implementation, learned thanks to Jonas, was that restoration must not happen during frame dispatch, because that code runs per monitor; you do not want to recreate the EGL context for every connected display.

    During recovery, every frame dispatch is aborted. The frame handler signals to the frame clock that the frame should be dropped without scheduling a replacement. This check occurs both at the beginning of the frame handler, before any rendering occurs, and at the end, to catch resets that occur mid-frame.
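
    The double check can be pictured with the following sketch. It is a simplified model, not ClutterStageView's frame handler: the types, the result enum, and the two demo paint callbacks are hypothetical, and the context_lost flag stands in for what GetGraphicsResetStatus() would report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
  SKETCH_FRAME_PRESENTED,
  SKETCH_FRAME_DROPPED, /* dropped without scheduling a replacement */
} SketchFrameResult;

typedef struct {
  bool recovering;    /* set once a reset is seen, cleared after restoration */
  bool context_lost;  /* stand-in for the driver's reset status */
  int frames_painted;
} SketchView;

typedef void (*SketchPaintFunc) (SketchView *view);

/* Latches the recovering flag if the context was lost. */
bool
sketch_check_reset (SketchView *view)
{
  if (view->context_lost)
    view->recovering = true;
  return view->recovering;
}

SketchFrameResult
sketch_handle_frame (SketchView *view, SketchPaintFunc paint)
{
  assert (view != NULL && paint != NULL);

  /* Check 1: before any rendering. */
  if (sketch_check_reset (view))
    return SKETCH_FRAME_DROPPED;

  paint (view);
  view->frames_painted++;

  /* Check 2: catches a reset that happened while painting. */
  if (sketch_check_reset (view))
    return SKETCH_FRAME_DROPPED;

  return SKETCH_FRAME_PRESENTED;
}

/* Demo callbacks: a normal paint, and one interrupted by a reset. */
void sketch_paint_ok (SketchView *view) { (void) view; }
void sketch_paint_with_reset (SketchView *view) { view->context_lost = true; }
```

    Running a frame with sketch_paint_with_reset shows the second check at work: the frame is painted but dropped, and every later frame is rejected by the first check until recovery clears the flag.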

    If restoration fails and retries exhaust a 2-second timeout, the state transitions to FAILED and Mutter exits gracefully with a descriptive error message, rather than aborting or hanging indefinitely.
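
    The decision taken after each restoration attempt might be expressed as below. The function and enum names are hypothetical, the interval between retries is deliberately left to the caller because the post does not specify it, and treating a late success as success is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_RESTORE_TIMEOUT_MS 2000 /* retry budget from the post */

typedef enum {
  SKETCH_RESTORE_DONE,   /* back to normal operation */
  SKETCH_RESTORE_RETRY,  /* schedule another attempt */
  SKETCH_RESTORE_FAILED, /* report the error and exit */
} SketchRestoreOutcome;

/* Decides what to do after one restoration attempt, given when the
 * restoration phase started and the current time. */
SketchRestoreOutcome
sketch_after_restore_attempt (bool attempt_succeeded,
                              int64_t started_ms,
                              int64_t now_ms)
{
  assert (now_ms >= started_ms);

  if (attempt_succeeded)
    return SKETCH_RESTORE_DONE;

  if (now_ms - started_ms >= SKETCH_RESTORE_TIMEOUT_MS)
    return SKETCH_RESTORE_FAILED;

  return SKETCH_RESTORE_RETRY;
}
```

    Keeping the decision separate from the attempt itself makes the failure path easy to test without a GPU.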

    Restoring The Graphics Pipeline

    Once RESET_COMPLETED is reached, restoration begins. The core of it lives in clutter_backend_restore_graphics(), and the sequence is:

    1. Unref and destroy the current CoglContext.
    2. Tear down the CoglDisplay, destroying the underlying EGLContext.
    3. Set up the CoglDisplay again, creating a fresh EGLContext.
    4. Create a new CoglContext against the newly set up display.
    5. Emit ClutterBackend::graphics-reset to notify everything else.

    Step 2 required a small new addition to Cogl: cogl_display_destroy(). Previously, there was no way to tear down a CoglDisplay’s contents without destroying the object itself, which would have invalidated references to it held throughout the codebase. The function calls the object’s destroy() implementation and marks it as no longer set up, leaving the object intact and ready to be set up again.
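
    A toy model of the five steps, and of the "destroy the contents, keep the object" contract, is sketched below. None of this is Cogl's real API: the Sketch types are hypothetical, an integer id stands in for the EGLContext, and the signal emission is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct {
  bool is_setup;
  int egl_context_id;   /* stand-in for the EGLContext; 0 means none */
  int contexts_created;
} SketchDisplay;

typedef struct {
  SketchDisplay *display;
  int egl_context_id;   /* the context this object was created against */
} SketchContext;

typedef struct {
  SketchDisplay display;
  SketchContext *context;
  int graphics_reset_emissions;
} SketchBackend;

void
sketch_display_setup (SketchDisplay *display)
{
  assert (display != NULL && !display->is_setup);
  display->egl_context_id = ++display->contexts_created;
  display->is_setup = true;
}

/* Tears down the contents but leaves the object itself valid. */
void
sketch_display_destroy (SketchDisplay *display)
{
  assert (display != NULL);
  display->egl_context_id = 0;
  display->is_setup = false;
}

SketchContext *
sketch_context_new (SketchDisplay *display)
{
  SketchContext *context;

  assert (display != NULL && display->is_setup);
  context = malloc (sizeof *context);
  if (context == NULL)
    return NULL;
  context->display = display;
  context->egl_context_id = display->egl_context_id;
  return context;
}

bool
sketch_restore_graphics (SketchBackend *backend)
{
  assert (backend != NULL);

  free (backend->context);                                    /* step 1 */
  backend->context = NULL;
  sketch_display_destroy (&backend->display);                 /* step 2 */
  sketch_display_setup (&backend->display);                   /* step 3 */
  backend->context = sketch_context_new (&backend->display);  /* step 4 */
  if (backend->context == NULL)
    return false;
  backend->graphics_reset_emissions++;                        /* step 5 */
  return true;
}
```

    Any pointer to the display taken before the call is still valid afterwards and now refers to a display with a fresh context id, which is the property the new teardown function exists to provide.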

    The sequence is straightforward in isolation; the interesting work is in everything that happens at step 5.

    Propagating the reset through the compositor

    ClutterBackend::graphics-reset is the hook through which the rest of the compositor learns that a new EGL context exists and needs to respond. Several objects connect to it, and the order in which their handlers run matters significantly.

    The current sequence, enforced through the use of the G_CONNECT_AFTER flag, is as follows:

    1. ClutterStage unrealizes: The stage is hidden and unrealize() is called on the stage actor, cascading down to all child actors (including every ClutterText in the scene graph). This must happen before the font renderer is recreated; more on why in the next section.
    2. ClutterContext recreates the font renderer: ClutterPangoRenderer, which holds the GPU-backed glyph cache, is destroyed and recreated with the new CoglContext.
    3. MetaBackend updates the stage: Stage views are rebuilt, and cursor rendering is updated.
    4. ClutterStage realizes: The stage is realized against the freshly rebuilt views and shown again.
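
    To show how an "after" flag shapes handler order, here is a miniature signal emitter in plain C. It only models the part of GObject's behaviour relevant here: handlers connected normally run first in connection order, then handlers connected with the after flag run in connection order (real GObject signals may also run a class handler in between). Which of the four handlers actually use G_CONNECT_AFTER in Mutter is not stated in the post, so the assignment used in the usage note is purely illustrative, and all Sketch names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_HANDLERS 8

typedef void (*SketchHandler) (char *trace, size_t size);

typedef struct {
  SketchHandler fn[SKETCH_MAX_HANDLERS];
  bool after[SKETCH_MAX_HANDLERS];
  size_t count;
} SketchSignal;

void
sketch_connect (SketchSignal *signal, SketchHandler fn, bool after)
{
  assert (signal != NULL && fn != NULL);
  assert (signal->count < SKETCH_MAX_HANDLERS);
  signal->fn[signal->count] = fn;
  signal->after[signal->count] = after;
  signal->count++;
}

/* Pass 0 runs normal handlers, pass 1 runs "after" handlers. */
void
sketch_emit (SketchSignal *signal, char *trace, size_t size)
{
  assert (signal != NULL && trace != NULL);
  for (int pass = 0; pass < 2; pass++)
    for (size_t i = 0; i < signal->count; i++)
      if (signal->after[i] == (pass == 1))
        signal->fn[i] (trace, size);
}

/* Appends a step name to the trace if it still fits. */
static void
sketch_append (char *trace, size_t size, const char *step)
{
  if (strlen (trace) + strlen (step) < size)
    strcat (trace, step);
}

/* Stand-ins for the four handlers listed above. */
void sketch_stage_unrealize (char *t, size_t n) { sketch_append (t, n, "unrealize "); }
void sketch_recreate_fonts (char *t, size_t n) { sketch_append (t, n, "fonts "); }
void sketch_update_stage (char *t, size_t n) { sketch_append (t, n, "views "); }
void sketch_stage_realize (char *t, size_t n) { sketch_append (t, n, "realize"); }
```

    If sketch_update_stage and sketch_stage_realize are connected with the after flag and the other two without it, the emission order is unrealize, fonts, views, realize even when the connect calls are interleaved. Within each group the result still depends on connection order, which is the residual fragility the next paragraphs mention.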

    One improvement made along the way: Compositor view recreation was previously triggered by a signal emitted when monitors or monitor settings change. With graphics reset recovery requiring the same operation, MetaCompositor would have needed to listen to two sources, along with ordering concerns relative to MetaBackend’s handler, which rebuilds the stage views. Instead, a new MetaRenderer::views-rebuilt signal was added, and emitted at the end of meta_renderer_real_rebuild_views() regardless of what triggered the rebuild. MetaCompositor now listens to that single, unified signal, and the ordering fragility is avoided entirely.

    The signal handler ordering for the graphics-reset signal still depends to some extent on GLib connection order, which is not ideal. A more explicit, well-defined ordering mechanism is already in the works.

    Recovering the glyph cache

    Window content and client-rendered surfaces come back naturally once the context is recreated and stage views are rebuilt: clients re-render their Wayland buffers, and Mutter composites them. But some GPU-tied state lives inside the compositor itself and needs explicit recovery. The most intricate case encountered so far is the glyph cache.

    ClutterPangoRenderer maintains a texture atlas of rendered glyphs, built up as text is drawn to the screen. When the GPU context is lost, that atlas is gone. Simply recreating the renderer, as ClutterContext does in step 2 above, creates a fresh, empty one. But there is a subtlety: ClutterText actors internally cache PangoLayout objects, and each layout carries rendering data tied to the old renderer via GObject qdata. Drawing text with a stale layout against the new renderer produces incorrect results or crashes.

    The fix required a small chain of additions:

    • clutter_forget_layout(), a new function in clutter-pango-render, removes the qdata from a given PangoLayout, severing its tie to the old renderer. It verifies that both the renderer and the qdata exist and that the qdata was produced by the current renderer before clearing it.
    • ClutterText.unrealize(), a new virtual method implementation, calls clutter_forget_layout() on each of ClutterText’s internally cached layouts when the actor is unrealized.
    • When ClutterStage unrealizes (step 1), the unrealize cascade reaches every ClutterText actor in the scene graph, clearing stale layout data across the board.

    This is precisely why the stage must unrealize before the font renderer is recreated. ClutterText’s unrealize() calls clutter_forget_layout(), which requires the old renderer to still be alive to destroy the layout’s qdata. If the renderer were already replaced, it would be impossible to correctly destroy the data. The stage is then re-realized in step 4, after the new renderer exists and stage views have been rebuilt, allowing text to render cleanly from a fresh cache.
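
    The ownership check that drives this ordering can be modelled in a few lines. This is a simplified stand-in, not the real clutter_forget_layout(): plain structs replace PangoLayout, GObject qdata, and ClutterPangoRenderer, and every name below is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct {
  int id;
} SketchRenderer;

/* Per-layout rendering data, tagged with the renderer that produced it. */
typedef struct {
  const SketchRenderer *owner;
} SketchLayoutData;

typedef struct {
  SketchLayoutData *qdata; /* NULL when nothing is attached */
} SketchLayout;

/* Attaches rendering data produced by `renderer` to the layout. */
bool
sketch_layout_attach (const SketchRenderer *renderer, SketchLayout *layout)
{
  assert (renderer != NULL && layout != NULL && layout->qdata == NULL);
  layout->qdata = malloc (sizeof *layout->qdata);
  if (layout->qdata == NULL)
    return false;
  layout->qdata->owner = renderer;
  return true;
}

/* Clears the data only if a renderer exists, data exists, and the data
 * was produced by that renderer. Returns true if anything was cleared. */
bool
sketch_forget_layout (const SketchRenderer *current, SketchLayout *layout)
{
  assert (layout != NULL);

  if (current == NULL || layout->qdata == NULL)
    return false;
  if (layout->qdata->owner != current)
    return false; /* produced by a different renderer: left untouched */

  free (layout->qdata);
  layout->qdata = NULL;
  return true;
}
```

    Calling sketch_forget_layout() with a replacement renderer leaves the stale data in place, whereas calling it while the original renderer is still current clears it. That mirrors why unrealize has to run before the font renderer is swapped.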

    Where things stand

    As the recording shows, the compositor survives a GPU reset, and the session remains usable: windows update correctly, input is responsive, and the session doesn’t crash or freeze. There are two notable gaps, though.

    First, framebuffer recreation isn’t automatic yet. After recovery, the display stays dark until something triggers the creation of a new framebuffer; in the recording, I do this manually by maximizing the active window via a keyboard shortcut. Without that nudge, stage views are rebuilt, but nothing causes a fresh framebuffer to actually be allocated, so the screen just stays black. Making this automatic is one of the next things to sort out.

    Second, once the display is back, the desktop background renders with incorrect or garbled textures. MetaBackgroundImage and MetaBackground hold references to GPU textures (primarily the background image loaded from disk), and MetaBackgroundContent defines GLSL shaders, all of which are invalidated by the reset. Recovering them requires updating these objects to re-upload their GPU-side resources after the reset.

    There are also residual GL errors after recovery that need investigation, and the signal handler ordering situation deserves a more deterministic solution.

    What’s next

    • Signal handler ordering: replacing the implicit connection order dependency with an explicit, documented mechanism.
    • Automatic framebuffer recreation: removing the dependency on a manual trigger (like maximizing a window) for the display to actually come back after recovery.
    • Background texture recovery: getting MetaBackground and its counterparts to listen to the graphics reset signal and reload their GPU-side resources.
    • Auditing remaining GPU-tied resources: ensuring nothing else in the compositor holds stale references after recovery.
    • MR review and upstream integration: the current implementation lives on my fork and has so far been reviewed mainly by my mentors; as the approach stabilises, it will be proposed for upstream review.

    Thanks

    A huge thank you to my mentors Jonas Ådahl, Robert Mader, and Carlos Garnacho for their incredible guidance since the beginning. I’d also like to thank Bilal Elmoussaoui for the valuable review comments on my merge request. On to what’s next! 🦾 ❤