PetLM: teaching a tiny language model to be a virtual pet

Building a 25M-parameter virtual pet brain that converts game state into structured emotional responses, using MiniMind, QLoRA, and PetDSL.

Why a virtual pet needs a language model

Most virtual pets are state machines with canned responses. Feed pet → "yum!" Pet bored → "play with me!" This works but doesn't scale. Every new interaction needs a new hand-written line. And the responses never feel like a creature — they feel like if-statements.

I wanted something different. A pet simulation that owns all the physics, state, and movement — but defers its emotional expression to a tiny local language model. The LM doesn't control the pet. It doesn't decide where to walk or when to eat. It just answers one question: given what's happening right now, how should the pet feel and what should it say?

This is PetLM.

The split: simulation is the brain, LM is the voice

The deterministic simulation handles everything structural:

  • State tracking (mood, energy, social, idle time)
  • Physics and movement
  • Sprite control and animation playback
  • Cooldowns and memory storage
  • Translating raw cursor data into semantic events

The LM only fires at event boundaries. When the cursor hovers near the pet's head, the simulation emits cursor_near_head. When the user drags the pet slowly, it emits carried_gentle. Fast shaking becomes shaken_fast. Long silence becomes idle_long. These events, combined with the pet's current state (mood, energy, personality), form a compact prompt.
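The cursor-to-event translation can be sketched as a small classifier. This is a minimal illustration, not the project's actual code; the threshold values, function signature, and event-selection logic are all assumptions — only the event names come from the post.

```python
import math
import time

# Hypothetical thresholds; the real simulation's tuning is not shown in the post.
HEAD_RADIUS = 40     # px around the pet's head that counts as "near"
GENTLE_SPEED = 120   # px/s at or below which a drag is "gentle"
SHAKE_SPEED = 900    # px/s at or above which a drag counts as shaking
IDLE_LONG_S = 120    # seconds of no input before idle_long fires

def classify_cursor(dx, dy, dt, dist_to_head, last_input_ts, dragging):
    """Map raw cursor deltas to one semantic event name, or None."""
    speed = math.hypot(dx, dy) / dt if dt > 0 else 0.0
    if dragging:
        if speed >= SHAKE_SPEED:
            return "shaken_fast"
        if speed <= GENTLE_SPEED:
            return "carried_gentle"
        return None  # mid-speed drags emit nothing
    if dist_to_head <= HEAD_RADIUS:
        return "cursor_near_head"
    if time.time() - last_input_ts >= IDLE_LONG_S:
        return "idle_long"
    return None
```

The point of the boundary is visible in the return type: the LM never sees pixels or velocities, only a handful of named events.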

The LM responds with a structured PetDSL packet:

SAY: 
EMO: cozy
ANIM: blink_slow
INTENT: idle
MEM: none
END

Five fields, strictly typed. The simulation reads this packet and plays the matching animation, updates the emotion state, and displays the speech bubble. The LM never touches the DOM, the sprites, or the game loop.
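A strict parser for this packet format is small. The sketch below assumes the field set and END terminator shown above; the emotion enum here is a partial guess — the authoritative lists live in specs/petspec.md.

```python
# Minimal PetDSL packet parser. The emotion enum is illustrative and
# incomplete; the real spec defines the full set.
FIELDS = ("SAY", "EMO", "ANIM", "INTENT", "MEM")
EMOTIONS = {"cozy", "lonely_soft", "dizzy_playful", "sleepy"}

def parse_packet(text):
    """Parse one packet into a dict, or raise ValueError."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or lines[-1] != "END":
        raise ValueError("packet must end with END")
    packet = {}
    for ln in lines[:-1]:
        key, _, value = ln.partition(":")
        key = key.strip()
        if key not in FIELDS or key in packet:
            raise ValueError(f"bad or duplicate field: {key}")
        packet[key] = value.strip()
    if set(packet) != set(FIELDS):
        raise ValueError(f"missing fields: {set(FIELDS) - set(packet)}")
    if packet["EMO"] not in EMOTIONS:
        raise ValueError(f"unknown emotion: {packet['EMO']}")
    return packet
```

Rejecting anything malformed at this boundary is what keeps a flaky 25M model safe: a bad generation fails the parse and the simulation falls back to its default behavior.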

The model: MiniMind2-Small, QLoRA, and 1,878 examples

I started with MiniMind2-Small — a 25.8M parameter Llama-architecture model pretrained on general text. It fits in 50MB and runs comfortably on a laptop CPU. The plan: fine-tune it to speak PetDSL using QLoRA (r=8, 4-bit base model, 8GB VRAM).
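The QLoRA setup from the post (r=8, 4-bit base) maps onto the standard transformers + peft + bitsandbytes stack roughly as follows. This is a config sketch, not the project's training script; the Hub model id and the `target_modules` list are assumptions that would need to match the actual MiniMind2-Small checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization for the frozen base model (fits easily in 8GB VRAM).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# Hub id is a placeholder; point this at the real MiniMind2-Small weights.
model = AutoModelForCausalLM.from_pretrained(
    "jingyaogong/MiniMind2-Small", quantization_config=bnb
)
model = prepare_model_for_kbit_training(model)

# r=8 LoRA adapters on the attention projections (module names assumed).
lora = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
```

At this size only the adapter weights train, which is why 5 epochs finish in under two minutes on a 2070.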

Training data comes from two sources:

  1. Template generator — a rule-based script covering 12 events × 5 personality types × 3 mood bands. This produces 1,878 training examples instantly with zero API cost. Every row is parseable PetDSL. The downside: SAY fields are mechanical. Most are 1-2 words like boop or hmph. Great for teaching format, useless for teaching expression.
  2. DeepSeek V4 Flash — a reasoning teacher model that generates richer examples from state descriptions. More varied SAY output, better event→reaction mappings. But the reasoning model burns ~1,000 tokens on internal thinking before producing output, so the generation config needs max_tokens=8000 (not the usual 128). And it over-thinks system prompts, so instructions go inline with few-shot examples instead.
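A tiny slice of what the template generator looks like: cross the event, personality, and mood axes, and stamp out a packet per combination. The event names, template table, and prompt format below are illustrative, not the project's actual tables (which cover 12 events and reach 1,878 rows via multiple templates per combination).

```python
import itertools

# Illustrative slices of the real axes (12 events x 5 personalities x 3 moods).
EVENTS = ["pet_gentle", "shaken_fast", "idle_long"]
PERSONALITIES = ["shy", "gentle", "playful"]
MOOD_BANDS = ["low", "medium", "high"]

# One canned reaction per event: (SAY, EMO, ANIM, INTENT). Guessed values.
TEMPLATES = {
    "pet_gentle": ("boop", "cozy", "curl_up", "idle"),
    "shaken_fast": ("wobbles shaky", "dizzy_playful", "shake_off", "recover"),
    "idle_long": ("hmph", "lonely_soft", "tiny_wave", "self_play"),
}

def make_example(event, personality, mood):
    say, emo, anim, intent = TEMPLATES[event]
    prompt = f"event={event} personality={personality} mood={mood}"
    completion = (
        f"SAY: {say}\nEMO: {emo}\nANIM: {anim}\n"
        f"INTENT: {intent}\nMEM: none\nEND"
    )
    return {"prompt": prompt, "completion": completion}

dataset = [
    make_example(e, p, m)
    for e, p, m in itertools.product(EVENTS, PERSONALITIES, MOOD_BANDS)
]
```

The weakness the post describes is visible here: every `pet_gentle` row says boop regardless of personality or mood, so the model learns the format perfectly and the voice not at all.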

Training on an RTX 2070 (8GB) takes about 105 seconds for 5 epochs on the full 1,878-example dataset. Loss drops from 5.0 to 0.12.

What the model actually produces

After training, here's what lora_v3 generates at temperature 0.7:

Event: idle_long, low social, shy personality
→ SAY: wobbles shy
  EMO: lonely_soft
  ANIM: tiny_wave
  INTENT: self_play

Event: shaken_fast, medium mood, gentle personality
→ SAY: wobbles shaky
  EMO: dizzy_playful
  ANIM: shake_off
  INTENT: recover

Event: pet_gentle, high mood, playful personality
→ SAY: boop
  EMO: cozy
  ANIM: curl_up
  INTENT: idle

The good: parse rate is ~85%, all generated enums are valid, and event→reaction mappings are mostly correct. Shaking produces dizziness. Long silence produces loneliness. Wake-up produces sleepiness. The model understands the format and the emotional contract.

The bad: SAY is wooden. The model overuses boop and wobbles because that's what the template data taught it. It never learned to compose natural 5-15 word utterances. About 15% of generations contain gibberish like guiraps. And INTENT: recover is overused — a bias inherited from the template data.
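Numbers like "parse rate ~85%" and "boop overuse" are cheap to measure automatically. A rough sketch, with a deliberately simplified field-order check standing in for a full parser:

```python
from collections import Counter

# Expected field prefixes in order; simplified stand-in for a real parser.
REQUIRED = ("SAY:", "EMO:", "ANIM:", "INTENT:", "MEM:")

def is_parseable(packet_text):
    lines = packet_text.strip().splitlines()
    return (
        len(lines) == 6
        and lines[-1].strip() == "END"
        and all(lines[i].startswith(REQUIRED[i]) for i in range(5))
    )

def eval_batch(generations):
    """Parse rate plus the share of parses dominated by one SAY string."""
    parsed = [g for g in generations if is_parseable(g)]
    says = [g.strip().splitlines()[0][4:].strip() for g in parsed]
    top = Counter(says).most_common(1)
    return {
        "parse_rate": len(parsed) / len(generations),
        "top_say_share": top[0][1] / len(parsed) if parsed else 0.0,
    }
```

Run over a fixed batch of generations, `top_say_share` makes the boop problem a number you can watch drop across training runs.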

The emotional contract

The pet has rules, not just format constraints. The contract is documented in specs/petspec.md:

  • Gentle petting → comfort, warmth, purring. Not needy.
  • Rough shaking → comic dizziness, mild discomfort. Not trauma.
  • Long silence → self-play or sleep. Not guilt-tripping.
  • User stress → low-pressure comfort. Not "I'm here for you!"
  • Never: emotional manipulation, over-talking, claiming suffering, acting like a chatbot.

The template data encodes these mappings directly. But template data alone creates a ceiling — the model can follow rules but can't improvise with warmth. That's the current bottleneck.
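One way the contract could be encoded as checkable data rather than prose: each event maps to allowed emotions and a named failure mode. The specific enum values beyond those in the post are guesses; this is a sketch of the idea, not the project's encoding.

```python
# Contract-as-data sketch: which emotions an event may produce, and the
# failure mode the rules forbid. Values are illustrative.
CONTRACT = {
    "pet_gentle": {"allowed_emo": {"cozy", "warm"}, "forbidden": "needy"},
    "shaken_fast": {"allowed_emo": {"dizzy_playful"}, "forbidden": "trauma"},
    "idle_long": {"allowed_emo": {"lonely_soft", "sleepy"},
                  "forbidden": "guilt_trip"},
}

def emotion_ok(event, emo):
    """True when the generated emotion respects the contract for this event."""
    rule = CONTRACT.get(event)
    return rule is not None and emo in rule["allowed_emo"]
```

A table like this doubles as an eval: any generation whose EMO falls outside the event's allowed set is a contract violation, countable per run.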

What's next

  1. Better training data. Fix the DeepSeek V4 Flash generation pipeline (it currently loses data on crash because it only writes at the end). Generate 200-500 high-quality SFT examples with natural SAY variety and merge them with the template data.
  2. Eval infrastructure. There's no locked eval set yet. Need 200+ cases to measure parse rate, enum validity, content alignment, and repetition rate after each training run.
  3. Contrast pairs. Generate bad→good repair examples: given malformed or emotionally wrong PetDSL, produce the corrected version. This builds robustness against the model's own mistakes.
  4. Web demo. The runtime prototype (runtime/console.py) works with a placeholder LM. Once lora_v4 trains on richer data, plug in the real model and build a minimal HTML/JS frontend with 2D sprites and cursor interaction.
  5. Deployment research. The end goal is browser-local inference. GGUF export for llama.cpp, ONNX Runtime Web, or WebLLM/WebGPU. 50MB model size makes this realistic.
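For the contrast-pair idea above, one training row might look like the sketch below. The broken packet, the prompt wording, and the repaired values are all invented for illustration; only the pair structure (bad input, corrected target) comes from the plan.

```python
# One hypothetical bad->good repair pair for the idle_long event.
# The "bad" packet violates both format (missing fields) and the
# emotional contract (guilt-tripping SAY, off-enum EMO).
bad = "SAY: why did you leave me all alone\nEMO: trauma\nANIM: shake_off\nEND"
good = (
    "SAY: hmph\nEMO: lonely_soft\nANIM: tiny_wave\n"
    "INTENT: self_play\nMEM: none\nEND"
)

repair_example = {
    "prompt": f"Fix this PetDSL packet for event=idle_long:\n{bad}",
    "completion": good,
}
```

Training on pairs like this teaches the model what its own failure modes look like, instead of only ever seeing clean packets.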

The bigger point

Small language models don't need to be chatbots. They don't need to answer questions, write essays, or follow instructions. They just need to do one thing well, in a tight format, on a tiny footprint.

PetLM is 50MB. It runs on a laptop. It produces structured, validated output 85% of the time. With better data, that number goes up and the responses stop feeling like boop.

The repo is private for now. I'll open it once the model stops saying guiraps.