Overview
This technical project documents the creation of "The Reluctant Buddha," a specialized chatbot built on Meta's Llama 3.2 1B-parameter model. The system runs efficiently on CPU-only hardware using a quantized model and delivers responses with a distinctive personality that blends internet culture with philosophical wisdom.
Technical Specifications
- Base Model: Llama 3.2 (1B parameters)
- Quantization: Q4_K_M GGUF format
- Hardware: i5-7500T CPU (4 cores/4 threads), 16GB RAM (1.5GB utilized)
- Training Data: 400 curated samples selected from an initial 1,000 prompt-response pairs
Dataset Creation
The project's foundation was a specialized training corpus whose responses capture the character's voice, with phrases like "real friends... echo chambers and curated identities? Bwahaha. Good luck with that..."
The dataset incorporated initial "gold standard" examples from Claude 3.7 Sonnet and DeepSeek V3, with expanded variations generated using Granite 3.2.
Fine-Tuning Methodology
Framework Stack
- Unsloth for optimization
- Google Colab with T4 GPU support
- TRL library's SFTTrainer
Key Technical Approaches
- Data Preparation: JSON dataset transformed into the Llama 3.1/3.2 chat template format with a standardized role/content structure (see the sketches after this list)
- Model Optimization: LoRA (Low-Rank Adaptation) applied with r=16 targeting projection matrices (q_proj, k_proj, v_proj), incorporating gradient checkpointing for memory efficiency
- Training Configuration: 120 training steps with loss computed on response tokens only, using 8-bit optimizers
- Model Export: Dual format export (PyTorch and GGUF) with Q4_K_M quantization
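The list above maps onto a short code sketch. Assuming Unsloth's `FastLanguageModel` API and an illustrative checkpoint name, data preparation and LoRA setup might look roughly like this (the dataset filename, field names, sequence length, and `lora_alpha` are assumptions not stated above):

```python
# Rough sketch of data preparation and LoRA setup, assuming Unsloth's API.
# The checkpoint id, max_seq_length, lora_alpha, and dataset field names
# are illustrative assumptions.
from datasets import load_dataset
from unsloth import FastLanguageModel

# Load the 1B base model with 4-bit weights so it fits comfortably on a T4.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-Instruct",  # assumed checkpoint id
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters: r=16 on the attention projections, with gradient
# checkpointing for memory efficiency.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj"],
    lora_alpha=16,  # assumed value
    use_gradient_checkpointing="unsloth",
)

# Render each JSON record into the Llama 3.1/3.2 chat template as plain text.
dataset = load_dataset("json", data_files="buddha_dataset.json", split="train")

def to_chat_text(sample):
    messages = [
        {"role": "user", "content": sample["prompt"]},
        {"role": "assistant", "content": sample["response"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = dataset.map(to_chat_text)
```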
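Continuing from that sketch, training and export might look as follows. The batch size, learning rate, and output paths are assumptions; the 120 steps, 8-bit optimizer, response-only loss, and Q4_K_M GGUF export come from the description above:

```python
# Sketch of the training run and dual-format export, continuing from the
# previous snippet. Batch size, learning rate, and output paths are assumed.
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth.chat_templates import train_on_responses_only

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    args=TrainingArguments(
        max_steps=120,
        per_device_train_batch_size=2,   # assumed
        gradient_accumulation_steps=4,   # assumed
        learning_rate=2e-4,              # assumed
        optim="adamw_8bit",              # 8-bit optimizer per the text
        output_dir="outputs",
    ),
)

# Mask the loss so only assistant tokens contribute (response-only training).
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)
trainer.train()

# Export merged PyTorch weights and a Q4_K_M-quantized GGUF file.
model.save_pretrained_merged("buddha-merged", tokenizer, save_method="merged_16bit")
model.save_pretrained_gguf("buddha-gguf", tokenizer, quantization_method="q4_k_m")
```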
Inference Architecture
The deployment leveraged llama-cpp-python to achieve performance comparable to Ollama on CPU infrastructure while maintaining:
- Minimal memory overhead
- Token-by-token streaming capabilities
- Acceptable inference latency
Backend Implementation
The FastAPI server orchestrates model operations:
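A minimal sketch of this backend, assuming llama-cpp-python's high-level `Llama` API; the model path, endpoint, system prompt, and temperature below are illustrative assumptions rather than the project's exact values:

```python
# Minimal sketch of the streaming backend, assuming llama-cpp-python's
# high-level Llama API. Model path, endpoint, system prompt, and
# temperature are illustrative assumptions.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from llama_cpp import Llama

app = FastAPI()

# Load the Q4_K_M GGUF export once at startup; n_threads matches the
# i5-7500T's four cores.
llm = Llama(
    model_path="buddha-gguf/model-q4_k_m.gguf",  # assumed path
    n_ctx=2048,
    n_threads=4,
)

class ChatRequest(BaseModel):
    message: str

@app.post("/chat")
def chat(req: ChatRequest):
    # stream=True makes create_chat_completion yield chunks as tokens decode.
    stream = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": "You are The Reluctant Buddha."},  # assumed prompt
            {"role": "user", "content": req.message},
        ],
        stream=True,
        temperature=0.8,  # assumed value
    )

    def token_generator():
        for chunk in stream:
            delta = chunk["choices"][0]["delta"]
            if "content" in delta:
                yield delta["content"]

    # StreamingResponse forwards each token fragment to the client as it arrives.
    return StreamingResponse(token_generator(), media_type="text/plain")
```

Because the generator yields token fragments as llama.cpp produces them, the browser can render the reply incrementally instead of waiting for the full completion.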
Challenges and Resolutions
| Challenge | Solution |
|-----------|----------|
| GPU resource limitations | Unsloth optimizations with 4-bit quantization |
| Response style consistency | Refined system prompting and temperature tuning |
| CPU inference speed | GGUF quantization + llama-cpp-python migration |
| Token streaming in web context | FastAPI streaming responses with async handling |
Key Learning Outcomes
- Dataset quality and stylistic consistency prove critical for personality-driven fine-tuning
- LoRA effectively adapts large models with constrained computational resources
- Inference optimization substantially impacts user experience quality
- Modern frameworks enable practical LLM deployment on modest hardware
Future Development Directions
- Expanded dataset diversity
- Larger base model experiments (3B-8B parameters)
- Persistent conversation memory
- User feedback integration for iterative improvement
Conclusion
The project successfully demonstrates the complete development cycle, from data curation through fine-tuning to deployment as a web service. It validates that even a small model (1B parameters) can achieve meaningful specialization when properly fine-tuned, making advanced AI capabilities accessible beyond high-resource environments.