Running LLMs on a Steam Deck in 5 minutes

A step-by-step guide to setting up your Steam Deck to run local Large Language Models using llama.cpp and Distrobox

Overview

This guide shows how to run large language models on a Steam Deck with GPU acceleration. The approach combines a containerized Linux environment (Distrobox) with a Vulkan-enabled build of llama.cpp to achieve functional AI inference on portable hardware.

Key Specifications

The Steam Deck features an AMD APU with 4 Zen 2 CPU cores, 8 RDNA 2 GPU compute units, and 16GB of shared LPDDR5 RAM. By default, only 1GB of this memory is allocated to the GPU, though the BIOS allows raising the allocation to 4GB, which is recommended when deploying larger models.

Setup Process

Container Environment

The guide uses Distrobox to create an Ubuntu 24.04 container, leaving the host SteamOS installation untouched.
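
A minimal sketch of the container setup, assuming Distrobox and Podman are already available on SteamOS (the container name is illustrative):

```bash
# Create an Ubuntu 24.04 container; the name "llm-box" is arbitrary
distrobox create --name llm-box --image ubuntu:24.04

# Enter the container; the home directory is shared with the host
distrobox enter llm-box
```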

Build Configuration

After enabling SSH and entering the container, users install development tools and build llama.cpp with Vulkan GPU support using CMake:

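The exact commands are not preserved here; the following is a representative sequence, assuming Ubuntu 24.04 package names and the current llama.cpp CMake flag for the Vulkan backend:

```bash
# Install the toolchain, Vulkan headers, and the GLSL shader compiler
sudo apt update
sudo apt install -y build-essential cmake git libvulkan-dev glslc

# Fetch llama.cpp and build it with the Vulkan backend enabled
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j"$(nproc)"
```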

Model Execution

The Gemma 3 1B model can then be run from the command line, with the GPU layer count set to -1 to offload all layers for full acceleration.
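
A hedged example invocation (the model filename and quantization are illustrative; use whichever GGUF file you downloaded):

```bash
# Run Gemma 3 1B; -ngl -1 offloads all layers to the GPU, per the guide
./build/bin/llama-cli \
  -m models/gemma-3-1b-it-Q4_K_M.gguf \
  -ngl -1 \
  -p "Explain what a Steam Deck is in one sentence."
```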

Performance Metrics

During inference, the GPU draws approximately 10-11 watts, while overall system power reaches 20-25 watts, a surprisingly efficient result for quantized models at the 7B scale.
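
One way to watch these figures live, an assumption on my part rather than part of the guide (the amdgpu hwmon path and attribute names can vary across kernel versions):

```bash
# Poll the APU's reported average power draw (in microwatts) once per second
watch -n 1 'cat /sys/class/drm/card*/device/hwmon/hwmon*/power1_average'
```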

Practical Outcome

This approach enables portable LLM inference without modifying the base operating system, making the Steam Deck a viable platform for experimental AI applications and on-device language processing tasks.