Reluctant Buddha: Fine-Tuned LLM for Shitty Insights

Building a 1B-parameter chatbot with a distinctive personality that runs on CPU-only hardware

Overview

This write-up documents the creation of "The Reluctant Buddha," a specialized chatbot built on Meta's Llama 3.2 1B model. The system runs efficiently on CPU-only hardware using a quantized model and delivers responses with a distinctive personality that blends internet culture with philosophical wisdom.

Technical Specifications

  • Base Model: Llama 3.2 (1B parameters)
  • Quantization: Q4_K_M GGUF format
  • Hardware: i5-7500T CPU (4 cores/4 threads), 16GB RAM (1.5GB utilized)
  • Training Data: 400 samples selected from an initial 1,000 prompt-response pairs

Dataset Creation

The project's foundation was a specialized training corpus of prompt-response pairs. Responses were written to capture the character's voice, with phrases like "real friends... echo chambers and curated identities? Bwahaha. Good luck with that..."

The dataset combined initial "gold standard" examples from Claude 3.7 Sonnet and DeepSeek V3 with expanded variations generated using Granite 3.2.

Fine-Tuning Methodology

Framework Stack

  • Unsloth for optimization
  • Google Colab with T4 GPU support
  • TRL library's SFTTrainer

Key Technical Approaches

  1. Data Preparation: JSON dataset transformed into the Llama 3.1/3.2 chat template format with a standardized role/content structure
  2. Model Optimization: LoRA (Low-Rank Adaptation) applied with r=16 targeting the attention projection matrices (q_proj, k_proj, v_proj), with gradient checkpointing for memory efficiency
  3. Training Configuration: 120 training steps with loss computed only on response tokens and an 8-bit optimizer
  4. Model Export: Dual-format export (PyTorch and GGUF) with Q4_K_M quantization (a sketch of these steps follows this list)
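To make these steps concrete, here is a minimal sketch of what the Unsloth + TRL pipeline might look like, assuming a Colab T4 session. The checkpoint name, dataset path, batch size, and learning rate are assumptions, and exact argument names vary across TRL versions; only the LoRA rank, target projections, 120-step budget, response-only loss, 8-bit optimizer, and Q4_K_M export come from the steps above.

```python
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template, train_on_responses_only
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load the 1B base model in 4-bit so it fits comfortably on a T4 GPU.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-Instruct",  # assumed checkpoint name
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters: r=16 on the attention projections, as described above.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj"],
    lora_alpha=16,
    use_gradient_checkpointing="unsloth",  # memory-efficient checkpointing
)

# Render the role/content JSON dataset with the Llama 3.1/3.2 chat template.
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")
dataset = load_dataset("json", data_files="buddha_dataset.json", split="train")  # assumed path

def to_text(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,   # assumed
        gradient_accumulation_steps=4,   # assumed
        max_steps=120,                   # 120 training steps
        learning_rate=2e-4,              # assumed
        optim="adamw_8bit",              # 8-bit optimizer
        output_dir="outputs",
    ),
)

# Mask the prompt so loss is computed only on the assistant's responses.
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)
trainer.train()

# Export both formats: merged PyTorch weights and a Q4_K_M-quantized GGUF file.
model.save_pretrained_merged("buddha-1b", tokenizer, save_method="merged_16bit")
model.save_pretrained_gguf("buddha-1b", tokenizer, quantization_method="q4_k_m")
```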

Inference Architecture

The deployment leveraged llama-cpp-python to achieve performance comparable to Ollama on CPU infrastructure while maintaining:

  • Minimal memory overhead
  • Token-by-token streaming capabilities
  • Acceptable inference latency
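As a rough illustration of this setup, the snippet below loads the quantized GGUF model with llama-cpp-python and streams a reply token by token. The model filename, context size, prompts, and sampling temperature are assumptions; the thread count mirrors the 4-core CPU listed in the specifications. The web backend described next wraps the same calls.

```python
from llama_cpp import Llama

# Load the Q4_K_M-quantized model for CPU inference (filename is hypothetical).
llm = Llama(
    model_path="buddha-1b-q4_k_m.gguf",
    n_ctx=2048,    # assumed context window
    n_threads=4,   # matches the 4-core i5-7500T
)

# Stream the reply token by token, similar to what Ollama provides.
stream = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are The Reluctant Buddha."},  # placeholder persona prompt
        {"role": "user", "content": "Why do we doomscroll?"},
    ],
    temperature=0.8,  # assumed
    stream=True,
)
for chunk in stream:
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)
```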

Backend Implementation

The FastAPI server orchestrates model operations, handling chat requests and streaming generated tokens back to the client.

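The original write-up does not include the server code, so the following is only a minimal sketch of how such an endpoint could be wired up: FastAPI's StreamingResponse wraps a generator that yields tokens from llama-cpp-python as they are produced. The route name, request schema, system prompt, and model filename are all assumptions.

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from llama_cpp import Llama

app = FastAPI()
# Load the quantized model once when the process starts (filename is hypothetical).
llm = Llama(model_path="buddha-1b-q4_k_m.gguf", n_ctx=2048, n_threads=4)
SYSTEM_PROMPT = "You are The Reluctant Buddha."  # placeholder persona prompt

class ChatRequest(BaseModel):
    message: str

@app.post("/chat")  # assumed route
def chat(req: ChatRequest):
    def token_stream():
        # Forward each token to the client as soon as llama.cpp emits it.
        for chunk in llm.create_chat_completion(
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": req.message},
            ],
            stream=True,
        ):
            delta = chunk["choices"][0]["delta"]
            if "content" in delta:
                yield delta["content"]
    return StreamingResponse(token_stream(), media_type="text/plain")
```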

Challenges and Resolutions

| Challenge | Solution |
|-----------|----------|
| GPU resource limitations | Unsloth optimizations with 4-bit quantization |
| Response style consistency | Refined system prompting and temperature tuning |
| CPU inference speed | GGUF quantization + llama-cpp-python migration |
| Token streaming in web context | FastAPI streaming responses with async handling |

Key Learning Outcomes

  1. Dataset quality and stylistic consistency prove critical for personality-driven fine-tuning
  2. LoRA effectively adapts large models with constrained computational resources
  3. Inference optimization substantially impacts user experience quality
  4. Modern frameworks enable practical LLM deployment on modest hardware

Future Development Directions

  • Expanded dataset diversity
  • Larger base model experiments (3B-8B parameters)
  • Persistent conversation memory
  • User feedback integration for iterative improvement

Conclusion

The project demonstrates the complete development cycle, from data curation through deployment as a web service. It shows that even small models (1B parameters) can achieve meaningful specialization when properly fine-tuned, making advanced AI capabilities accessible beyond high-resource environments.