Overview
This technical project documents the creation of "The Reluctant Buddha," a specialized chatbot built on Meta's Llama 3.2 1B-parameter model. The system runs efficiently on CPU-only hardware using a quantized model and delivers responses with a distinctive personality that blends internet culture with philosophical wisdom.
Technical Specifications
- Base Model: Llama 3.2 (1B parameters)
- Quantization: Q4_K_M GGUF format
- Hardware: i5-7500T CPU (4 cores/4 threads), 16GB RAM (1.5GB utilized)
- Training Data: 400 curated samples selected from an initial 1,000 prompt-response pairs
Dataset Creation
The project's foundation was a specialized training corpus whose responses capture the character's voice, with phrases like "real friends... echo chambers and curated identities? Bwahaha. Good luck with that..."
The dataset incorporated initial "gold standard" examples from Claude 3.7 Sonnet and DeepSeek V3, with expanded variations generated using Granite 3.2.
Fine-Tuning Methodology
Framework Stack
- Unsloth for optimization
- Google Colab with T4 GPU support
- TRL library's SFTTrainer
Key Technical Approaches
- Data Preparation: JSON dataset transformed into the Llama 3.1/3.2 chat template format with a standardized role/content structure (see the sketches after this list)
- Model Optimization: LoRA (Low-Rank Adaptation) applied with r=16 targeting projection matrices (q_proj, k_proj, v_proj), incorporating gradient checkpointing for memory efficiency
- Training Configuration: 120 training steps with loss computed on response tokens only, using 8-bit optimizers
- Model Export: Dual format export (PyTorch and GGUF) with Q4_K_M quantization
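The list above maps onto a short code sketch. Assuming Unsloth's `FastLanguageModel` API and an illustrative checkpoint name, data preparation and LoRA setup might look roughly like this (the dataset filename, field names, sequence length, and `lora_alpha` are assumptions not stated above):

```python
# Rough sketch of data preparation and LoRA setup, assuming Unsloth's API.
# The checkpoint id, max_seq_length, lora_alpha, and dataset field names
# are illustrative assumptions.
from datasets import load_dataset
from unsloth import FastLanguageModel

# Load the 1B base model with 4-bit weights so it fits comfortably on a T4.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-Instruct",  # assumed checkpoint id
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters: r=16 on the attention projections, with gradient
# checkpointing for memory efficiency.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj"],
    lora_alpha=16,  # assumed value
    use_gradient_checkpointing="unsloth",
)

# Render each JSON record into the Llama 3.1/3.2 chat template as plain text.
dataset = load_dataset("json", data_files="buddha_dataset.json", split="train")

def to_chat_text(sample):
    messages = [
        {"role": "user", "content": sample["prompt"]},
        {"role": "assistant", "content": sample["response"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = dataset.map(to_chat_text)
```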
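Continuing from that sketch, training and export might look as follows. The batch size, learning rate, and output paths are assumptions; the 120 steps, 8-bit optimizer, response-only loss, and Q4_K_M GGUF export come from the description above:

```python
# Sketch of the training run and dual-format export, continuing from the
# previous snippet. Batch size, learning rate, and output paths are assumed.
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth.chat_templates import train_on_responses_only

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    args=TrainingArguments(
        max_steps=120,
        per_device_train_batch_size=2,   # assumed
        gradient_accumulation_steps=4,   # assumed
        learning_rate=2e-4,              # assumed
        optim="adamw_8bit",              # 8-bit optimizer per the text
        output_dir="outputs",
    ),
)

# Mask the loss so only assistant tokens contribute (response-only training).
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)
trainer.train()

# Export merged PyTorch weights and a Q4_K_M-quantized GGUF file.
model.save_pretrained_merged("buddha-merged", tokenizer, save_method="merged_16bit")
model.save_pretrained_gguf("buddha-gguf", tokenizer, quantization_method="q4_k_m")
```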
Inference Architecture
The deployment leveraged llama-cpp-python to achieve performance comparable to Ollama on CPU infrastructure while maintaining:
- Minimal memory overhead
- Token-by-token streaming capabilities
- Acceptable inference latency
Backend Implementation
The FastAPI server orchestrates model operations:
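A minimal sketch of this backend, assuming llama-cpp-python's high-level `Llama` API; the model path, endpoint, system prompt, and temperature below are illustrative assumptions rather than the project's exact values:

```python
# Minimal sketch of the streaming backend, assuming llama-cpp-python's
# high-level Llama API. Model path, endpoint, system prompt, and
# temperature are illustrative assumptions.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from llama_cpp import Llama

app = FastAPI()

# Load the Q4_K_M GGUF export once at startup; n_threads matches the
# i5-7500T's four cores.
llm = Llama(
    model_path="buddha-gguf/model-q4_k_m.gguf",  # assumed path
    n_ctx=2048,
    n_threads=4,
)

class ChatRequest(BaseModel):
    message: str

@app.post("/chat")
def chat(req: ChatRequest):
    # stream=True makes create_chat_completion yield chunks as tokens decode.
    stream = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": "You are The Reluctant Buddha."},  # assumed prompt
            {"role": "user", "content": req.message},
        ],
        stream=True,
        temperature=0.8,  # assumed value
    )

    def token_generator():
        for chunk in stream:
            delta = chunk["choices"][0]["delta"]
            if "content" in delta:
                yield delta["content"]

    # StreamingResponse forwards each token fragment to the client as it arrives.
    return StreamingResponse(token_generator(), media_type="text/plain")
```

Because the generator yields token fragments as llama.cpp produces them, the browser can render the reply incrementally instead of waiting for the full completion.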
Challenges and Resolutions
| Challenge | Solution |
|-----------|----------|
| GPU resource limitations | Unsloth optimizations with 4-bit quantization |
| Response style consistency | Refined system prompting and temperature tuning |
| CPU inference speed | GGUF quantization + llama-cpp-python migration |
| Token streaming in web context | FastAPI streaming responses with async handling |
Key Learning Outcomes
- Dataset quality and stylistic consistency prove critical for personality-driven fine-tuning
- LoRA effectively adapts large models with constrained computational resources
- Inference optimization substantially impacts user experience quality
- Modern frameworks enable practical LLM deployment on modest hardware
Future Development Directions
- Expanded dataset diversity
- Larger base model experiments (3B-8B parameters)
- Persistent conversation memory
- User feedback integration for iterative improvement
Conclusion
The project successfully demonstrates the complete development cycle, from data curation through fine-tuning to deployment as a web service. It validates that even a small model (1B parameters) can achieve meaningful specialization when properly fine-tuned, making advanced AI capabilities accessible beyond high-resource environments.