Overview
This guide demonstrates how to run large language models on a Steam Deck using GPU acceleration. The process pairs a Distrobox container with a Vulkan-enabled build of llama.cpp to achieve working AI inference on portable hardware.
Key Specifications
The Steam Deck features an AMD APU with 4 Zen 2 CPU cores, 8 RDNA 2 GPU compute units, and 16GB of shared LPDDR5 RAM. By default, only 1GB of that memory is reserved for the GPU, though the BIOS allows raising the allocation to 4GB, which is recommended when deploying larger models.
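To check how much memory is currently reserved for the GPU, the amdgpu driver exposes the VRAM total through sysfs. A minimal sketch, assuming the APU appears as card0 (the card index may differ):

```bash
# Reports the GPU's reserved VRAM in bytes; card0 is an assumption
cat /sys/class/drm/card0/device/mem_info_vram_total
```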
Setup Process
Container Environment
The guide uses Distrobox to create an Ubuntu 24.04 container, leaving the host SteamOS installation untouched.
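As a sketch of the container setup, assuming Distrobox is already available on the host and using `llm-box` as an arbitrary container name:

```bash
# Create an Ubuntu 24.04 container on top of the SteamOS host
distrobox create --name llm-box --image ubuntu:24.04

# Enter the container; the home directory is shared with the host
distrobox enter llm-box
```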
Build Configuration
After enabling SSH and entering the container, users install development tools and build llama.cpp with Vulkan GPU support using CMake:
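The build might look like the following. Exact package names (particularly for the Vulkan headers and shader compiler) vary by distribution, so treat this as a sketch rather than the guide's verbatim commands:

```bash
# Install the toolchain plus Vulkan development packages inside the container
sudo apt update
sudo apt install -y build-essential cmake git libvulkan-dev glslc

# Fetch llama.cpp and compile it with the Vulkan backend enabled
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j"$(nproc)"
```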
Model Execution
The Gemma 3 1B model runs through llama.cpp's command-line interface, with the GPU layer count set to -1 to offload all layers for full acceleration.
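A representative invocation, assuming a quantized GGUF of the model has been downloaded into a models/ directory (the filename here is illustrative):

```bash
# Offload all layers to the GPU (-ngl -1); model path is an assumption
./build/bin/llama-cli -m models/gemma-3-1b-it-Q4_K_M.gguf -ngl -1 \
  -p "Explain what a GPU compute unit is in one paragraph."
```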
Performance Metrics
During inference, the GPU draws approximately 10-11 watts while overall system power stays around 20-25 watts, which is surprisingly efficient for quantized models at the 7B scale.
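Figures like these can be sampled on the Deck itself. One possible approach, assuming the amdgpu hwmon interface exposes package power on this APU (the attribute name and hwmon index are system-dependent):

```bash
# GPU power draw in microwatts; path and attribute are assumptions, run on the host
cat /sys/class/drm/card0/device/hwmon/hwmon*/power1_average
```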
Practical Outcome
This approach enables portable LLM inference without modifying the base operating system, making the Steam Deck viable for experimental AI applications and local language processing tasks.