Overview
This post documents a project where I fine-tuned Llama 3.2 3B to analyze voice transcripts locally, demonstrating that smaller specialized models can outperform much larger general-purpose alternatives.
Key Achievements
- Performance boost: evaluation score rose from 5.35 (base model) to 8.55 after fine-tuning
- Training efficiency: 4 hours on a single RTX 4090 with batch size 16
- Competitive results: Outperformed 70B general models on this specific task
- Local inference: Merged to GGUF format (Q4_K_M quantization) for LM Studio deployment
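Once the quantized GGUF model is loaded in LM Studio, it can be queried through LM Studio's OpenAI-compatible local server. The sketch below assumes the default endpoint at `localhost:1234`; the system prompt and model name are illustrative, not taken from the project.

```python
import json
import urllib.request

# LM Studio's default OpenAI-compatible chat endpoint (assumed default port).
LM_STUDIO_URL = "http://localhost:1234/v1/chat/completions"

def build_request(transcript: str) -> dict:
    """Build a chat-completion payload asking the local model to analyze a transcript."""
    return {
        "model": "local-model",  # LM Studio serves whichever model is currently loaded
        "messages": [
            {"role": "system",
             "content": "Analyze the voice transcript and reply with a JSON report."},
            {"role": "user", "content": transcript},
        ],
        "temperature": 0.1,  # low temperature keeps the JSON output stable
    }

def analyze(transcript: str) -> str:
    """Send the transcript to the local server and return the model's reply text."""
    req = urllib.request.Request(
        LM_STUDIO_URL,
        data=json.dumps(build_request(transcript)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Everything stays on the local machine: no API keys, no per-token costs, and transcripts never leave the device.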
Technical Approach
Dataset Creation
I started with 13 real voice transcripts and used Kimi K2 as a teacher model to generate "gold standard" JSON outputs. This seeded the creation of over 40,000 synthetic transcripts with corresponding reference outputs.
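Each synthetic transcript and its teacher-generated reference output can be packaged as one supervised training pair. A minimal sketch of that packaging step, with an assumed instruction-style JSONL layout (the field names here are illustrative, not the project's exact schema):

```python
import json

def make_training_pair(transcript: str, gold_json: dict) -> dict:
    """One supervised example: raw transcript in, teacher's reference JSON out."""
    return {
        "instruction": "Analyze this voice transcript and return a structured JSON report.",
        "input": transcript,
        # The target is serialized JSON so the model learns to emit valid JSON text.
        "output": json.dumps(gold_json, ensure_ascii=False),
    }

def write_dataset(pairs, path):
    """Write the dataset as JSONL, one training pair per line."""
    with open(path, "w", encoding="utf-8") as f:
        for pair in pairs:
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```

Serializing the gold output as a string (rather than a nested object) matters: the fine-tuned model is trained to produce JSON as text, so the reference must be the exact text it should reproduce.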
Training Setup
Training used the Unsloth library for speed, applying LoRA with rank 128 and alpha 128 over a single epoch at a 5e-5 learning rate.
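A configuration sketch of that setup, assuming Unsloth's standard `FastLanguageModel` API together with TRL's `SFTTrainer`; the model name, sequence length, and target modules are assumptions, while the rank, alpha, batch size, epoch count, and learning rate come from the run described above:

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

# Load the base model in 4-bit to fit comfortably on a single RTX 4090.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",  # assumed checkpoint name
    max_seq_length=4096,                          # assumed; long enough for transcripts
    load_in_4bit=True,
)

# Attach LoRA adapters: rank 128, alpha 128, as in the run above.
model = FastLanguageModel.get_peft_model(
    model,
    r=128,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,  # the ~40k synthetic transcript/JSON pairs
    args=TrainingArguments(
        per_device_train_batch_size=16,
        num_train_epochs=1,
        learning_rate=5e-5,
        output_dir="outputs",
    ),
)
trainer.train()
```

This is a sketch, not the project's exact script; Unsloth's quantization-aware kernels are what make a 3B LoRA run finish in roughly four hours on one consumer GPU.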
Output Structure
The model produces detailed JSON payloads containing titles, cleaned transcripts, categories, tags, summaries, action items, entity extraction, and temporal data.
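A hypothetical example of that payload shape, with field names following the list above (the exact schema and values here are illustrative assumptions):

```python
import json

# Illustrative analysis payload; field names mirror the categories described above.
example_output = {
    "title": "Weekly planning notes",
    "cleaned_transcript": "Okay, so for next week I need to call Dana about the budget.",
    "category": "work",
    "tags": ["planning", "budget"],
    "summary": "Planning notes covering next week's budget call.",
    "action_items": [{"task": "Call Dana about the budget", "due": "next week"}],
    "entities": {"people": ["Dana"], "organizations": []},
    "temporal": {"references": ["next week"]},
}

# The model emits this structure as a JSON string; round-trip it to confirm it parses.
payload = json.dumps(example_output, ensure_ascii=False)
assert json.loads(payload)["tags"] == ["planning", "budget"]
```

Validating that the emitted string round-trips through a JSON parser is a cheap sanity check worth running on every model response before downstream use.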
Comparative Analysis
The fine-tuned 3B model ranked second overall with a score of 8.40, behind only the teacher model (Kimi K2), and surpassed numerous larger models, including a 70B Hermes variant that scored 8.18.
Conclusion
The project validates that specialized, efficiently trained smaller models can deliver superior task-specific performance compared to larger general-purpose models.