TL;DR
I'm building a system that takes Quran audio and tells you exactly which verse is being recited. Fully offline, no API calls. The pipeline: Whisper encoder (fine-tuned on Quran by Tarteel) -> CTC head -> WFST beam search -> verse ID.
After the full training run, I hit 70% accuracy. Then I discovered a problem: the model doesn't generalise to new voices. At all. 3% on a held-out reciter, 0% on my own recitation. So now I'm adding more training data (23 reciters instead of 11), data augmentation, and retraining. The decoding graph covers all 6,236 verses and fits in 7.9 MB.
Where this started
I came across yazinsai/offline-tarteel, a really well-documented project trying to solve the same problem: identify which Quran verse is being recited, entirely on-device. Yazin's analysis of the problem space is sharp. His central insight is correct - there are no small Arabic wav2vec2 models, and that's the fundamental blocker for traditional ASR approaches.
His best results were 67-72% accuracy with Tarteel's Whisper model using transcribe-then-fuzzy-match. An 81% result with a larger CTC model proved the ceiling is high when the acoustic model actually understands Arabic, but that model was 1.2GB - way too big for on-device use.
I thought I could do better with a different decoding strategy. The Quran is a closed corpus - 6,236 verses, and the text will never change. That's the dream scenario for constrained decoding.
The key insight: WFST decoding
Instead of transcribing audio to text and then fuzzy-matching against a database (which is what most approaches do), I compiled the entire Quran into a weighted finite state transducer. It's basically a massive trie where every verse is a valid path through token states, and the decoder can only follow valid paths.
Think of it as autocomplete but for the entire Quran. Even a mediocre acoustic model becomes dramatically more accurate because it literally can't hallucinate. If the model is 60% confident between two similar-sounding words, the WFST resolves the ambiguity by checking which one is valid in that context.
```
Audio (16kHz) -> Mel Spectrogram -> Whisper Encoder (512-dim)
  -> CTC Head (512 -> 3000) -> Log-Softmax
  -> WFST Beam Search -> Verse ID
```
Building the pieces
Tokeniser
I trained a SentencePiece Unigram tokeniser on all 6,236 verses with a vocab of 3,000 tokens. Unigram over BPE because Arabic is morphologically rich - you want subword units that respect the structure of the language rather than just frequent byte pairs.
100% round-trip accuracy on all 6,236 verses. I set normalization_rule_name="identity" to preserve the Uthmani script exactly. Quran text is not the place to get creative with normalisation.
Average tokens per verse: 32.7. Max: 382 (Al-Baqarah 2:282, the longest verse in the Quran).
The decoding graph
This was three steps:
- Word-to-token lexicon - mapped all 21,580 unique Uthmani word forms to their token sequences. 100% reconstruction accuracy.
- Grammar FST (G.fst) - built a trie-structured FST where each verse is a valid path. The last arc of each verse carries a verse ID (surah * 1000 + ayah). State 0 is both start and final, with epsilon arcs from verse-end states back to start - this allows recognising consecutive verse spans.
- CTC topology (LG.fst) - added blank self-loops on every state so the CTC decoder can emit blank frames anywhere.
186K states, 192K arcs, 5.1 MB for G.fst. After adding CTC topology: 7.9 MB for LG.fst.
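The trie structure is easiest to see without FST machinery. A dict-based sketch (in the real G.fst these are kaldifst states and arcs, and a shared last token between verses needs a separate final arc rather than a dict slot; verse IDs use the surah * 1000 + ayah scheme):

```python
def build_trie(verses):
    """verses: {(surah, ayah): [token_id, ...]} -> nested-dict trie.
    Each verse is a path from the root; the final transition carries
    verse_id = surah * 1000 + ayah instead of another branch."""
    root = {}
    for (surah, ayah), tokens in verses.items():
        node = root
        for tok in tokens[:-1]:
            node = node.setdefault(tok, {})
        node[tokens[-1]] = surah * 1000 + ayah
    return root

def lookup(trie, tokens):
    """Follow a token sequence; return the verse ID only if it ends a verse."""
    node = trie
    for tok in tokens:
        if not isinstance(node, dict) or tok not in node:
            return None  # decoder could never follow this path
        node = node[tok]
    return node if isinstance(node, int) else None

# Hypothetical token IDs, purely for illustration.
verses = {
    (1, 1): [5, 12, 7],
    (1, 2): [5, 12, 9, 3],   # shares the prefix [5, 12] with 1:1
    (2, 255): [8, 4],
}
trie = build_trie(verses)
assert lookup(trie, [5, 12, 9, 3]) == 1002
assert lookup(trie, [8, 4]) == 2255
assert lookup(trie, [5, 12]) is None  # a prefix, not a complete verse
```

The epsilon arcs from verse-end states back to the start correspond to restarting `lookup` at the root after emitting a verse ID, which is what allows consecutive verse spans.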
I originally tried k2 for the FST operations but its C extension wouldn't load on Python 3.13 with macOS ARM. kaldifst (lighter OpenFst wrapper) worked straight away.
For CTC with SentencePiece tokens, the lexicon is implicit - each word's token sequence is deterministic from the tokeniser, so the grammar FST already operates at token level. A traditional L ∘ G composition would only be needed with a separate phone or grapheme layer; SentencePiece handles the word-to-token mapping directly. Simpler than expected.
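The blank self-loops exist because CTC emits one symbol per frame, including blanks and repeats. The collapse rule that the topology encodes, as a pure-Python sketch (blank id 0 is an assumption):

```python
BLANK = 0  # assumed blank token id

def ctc_collapse(frame_ids):
    """Collapse a frame-level CTC path: merge adjacent repeats, drop blanks.
    The blank self-loops in LG.fst let the decoder sit on a state while
    the acoustic model emits blank or repeated frames."""
    out = []
    prev = None
    for t in frame_ids:
        if t != prev and t != BLANK:
            out.append(t)
        prev = t
    return out

# Frames "_ _ 5 5 _ 12 12 12 _ 7" collapse to the token path [5, 12, 7].
assert ctc_collapse([0, 0, 5, 5, 0, 12, 12, 12, 0, 7]) == [5, 12, 7]
# A blank between repeats keeps both copies.
assert ctc_collapse([4, 0, 4]) == [4, 4]
```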
Training data
I needed ayah-level audio from multiple reciters. Tarteel's own CDN returns 403s, and qurancdn.com works for some reciters but has limited coverage. alquran.cloud was the reliable one - a public API, no auth, 27+ reciters.
Downloaded 68,500 MP3 files across 11 reciters, about 18 GB total. The split is by reciter, not by verse - I want the model to generalise across voices, not memorise how Alafasy sounds on verse 2:255.
| Split | Entries | Reciters |
|---|---|---|
| Train | 49,887 | 8 Arabic reciters |
| Val | 6,228 | Sudais |
| Test | 6,236 | Ibrahim Walk (English) |
One bug I caught early: the original manifests included verse-number tokens (the Arabic numerals for 1, 2, 3, and so on) at the end of each verse's token sequence. They come from the Uthmani script convention, but reciters don't actually read verse numbers aloud. Stripped them before tokenisation.
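The fix was a small preprocessing step before tokenisation. A sketch, assuming the verse numbers appear as trailing Arabic-Indic digits (U+0660-U+0669):

```python
import re

# Arabic-Indic digits ٠-٩ (U+0660-U+0669), with optional surrounding
# whitespace, anchored to the end of the verse text.
_VERSE_NUM = re.compile(r"\s*[\u0660-\u0669]+\s*$")

def strip_verse_number(text: str) -> str:
    """Drop the trailing verse-number digits that reciters never read."""
    return _VERSE_NUM.sub("", text)

assert strip_verse_number("بسم الله الرحمن الرحيم ١") == "بسم الله الرحمن الرحيم"
assert strip_verse_number("no trailing number") == "no trailing number"
```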
Training
I'm using tarteel-ai/whisper-base-ar-quran as the base - a Whisper model already fine-tuned on Quran recitation by Tarteel. I stripped the decoder, kept only the encoder (512-dim), and bolted a linear CTC head onto it (512 -> 3000, matching the tokeniser vocab).
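The wiring is a thin module around the encoder. A sketch with a stand-in encoder for shape-checking (the real code loads tarteel-ai/whisper-base-ar-quran via transformers and keeps its encoder; none of this is the exact training code):

```python
import torch
import torch.nn as nn

class EncoderCTC(nn.Module):
    """Frozen-or-not encoder plus a linear CTC head (512 -> vocab)."""
    def __init__(self, encoder: nn.Module, d_model: int = 512, vocab: int = 3000):
        super().__init__()
        self.encoder = encoder            # e.g. the Whisper encoder
        self.ctc_head = nn.Linear(d_model, vocab)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(features)            # (B, T, 512)
        logits = self.ctc_head(hidden)             # (B, T, 3000)
        return torch.log_softmax(logits, dim=-1)   # CTC wants log-probs

# Stand-in encoder: identity over already-512-dim frames.
model = EncoderCTC(nn.Identity())
log_probs = model(torch.randn(2, 100, 512))
assert log_probs.shape == (2, 100, 3000)
```

With the real Whisper encoder you'd pass mel features and take its hidden states instead of the raw identity output, but the head and log-softmax are exactly this shape.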
Three-phase progressive unfreezing:
| Phase | What's trained | Learning rate | Epochs |
|---|---|---|---|
| A | CTC head only (encoder frozen) | 1e-3 | 3 |
| B | Top 4 encoder layers + head | 1e-5 | 3 |
| C | Everything | 5e-6 | 3 |
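The phase switches are just requires_grad toggles plus a fresh optimiser at each phase's learning rate. A sketch with a toy stacked encoder (the layer layout is an assumption; in the real model the transformer blocks live under the encoder's layer list):

```python
import torch.nn as nn

def configure_phase(encoder: nn.Module, head: nn.Module, phase: str):
    """Freeze/unfreeze for the three-phase schedule.
    A: head only. B: top 4 encoder layers + head. C: everything."""
    for p in encoder.parameters():
        p.requires_grad = False
    for p in head.parameters():
        p.requires_grad = True
    layers = list(encoder.children())  # assumes a flat stack of blocks
    if phase == "B":
        for layer in layers[-4:]:
            for p in layer.parameters():
                p.requires_grad = True
    elif phase == "C":
        for p in encoder.parameters():
            p.requires_grad = True

# Toy 6-layer "encoder" to show the effect.
encoder = nn.Sequential(*[nn.Linear(8, 8) for _ in range(6)])
head = nn.Linear(8, 16)

configure_phase(encoder, head, "A")
assert sum(p.requires_grad for p in encoder.parameters()) == 0
configure_phase(encoder, head, "B")
assert sum(p.requires_grad for p in encoder.parameters()) == 8  # 4 layers x (W, b)
configure_phase(encoder, head, "C")
assert all(p.requires_grad for p in encoder.parameters())
```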
nn.CTCLoss doesn't work on MPS. The workaround is running the forward pass on GPU, then moving log-probs to CPU for the loss computation. Gradients still flow through the .cpu() call (I verified), but it adds about 10% overhead. Better than running the whole encoder on CPU though.
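The workaround looks like this. It runs identically on CPU (the real run just produces the log-probs on device "mps" first), and the final assert is the gradient-flow check mentioned above:

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

# Pretend these came off the encoder + CTC head on the GPU.
logits = torch.randn(50, 2, 3000, requires_grad=True)  # (T, B, vocab)
log_probs = torch.log_softmax(logits, dim=-1)

targets = torch.randint(1, 3000, (2, 20))              # (B, S), no blanks
input_lengths = torch.full((2,), 50, dtype=torch.long)
target_lengths = torch.full((2,), 20, dtype=torch.long)

# nn.CTCLoss has no MPS kernel: move only the log-probs to CPU for the loss.
loss = ctc_loss(log_probs.cpu(), targets.cpu(),
                input_lengths.cpu(), target_lengths.cpu())
loss.backward()

# Gradients flow back through the .cpu() call to the original tensor.
assert logits.grad is not None
```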
Results after full training
Ran the full 9 epochs (3 each phase). Total training time: about 16.5 hours on MPS.
| Epoch | Phase | Train Loss | Val Loss | Time |
|---|---|---|---|---|
| 1 | A | 2.7719 | 3.2833 | 76m |
| 2 | A | 1.7187 | 3.2009 | 77m |
| 3 | A | 1.6002 | 3.2048 | 75m |
| 4 | B | 1.0152 | 2.0172 | 113m |
| 5 | B | 0.6649 | 1.8522 | 117m |
| 6 | B | 0.4874 | 1.7677 | 116m |
| 7 | C | 0.3683 | 1.7312 | 144m |
| 8 | C | 0.3099 | 1.7289 | 137m |
| 9 | C | 0.2702 | 1.7036 | 141m |
After the full run, evaluated on the 50 EveryAyah test samples:
| Metric | Phase A only | Full 3-phase |
|---|---|---|
| SeqAcc | 4/50 (8%) | 35/50 (70%) |
| Empty predictions | 33/50 | 15/50 |
The generalisation problem
Then I tested on a held-out reciter (Sudais) that wasn't in training. 3% accuracy. Tried my own recitation. 0%.
The model memorised reciter acoustics instead of learning the language. Eight training reciters with zero augmentation wasn't enough - it learned to recognise specific voices, not Quranic Arabic phonemes.
This is embarrassing but useful. The WFST approach is sound, the CTC training works, but I starved the model of speaker diversity.
What I'm doing now
Phase 4: fix generalisation.
| Split | Reciters | Reciter count |
|---|---|---|
| Train | 18 reciters (including 3 from Tarteel) | 18 |
| Val | Sudais, Husary Mujawwad, Hudhaify | 3 |
| Test | Muhammad Jibreel (completely held out) | 1 |
Running 3+5+5 epochs this time (more in phases B and C since there's more data and regularisation now).
Why I still think this gets to 95%+
Yazin's 67% with the Tarteel model came from transcribe-then-match, and most of its errors were partial - close enough that the WFST graph should resolve them. The encoder already understands Quranic Arabic; it just needs the CTC head trained to produce useful frame-level outputs, and the WFST then handles disambiguation.
The generalisation failure is fixable. More speakers, augmentation, and regularisation should force the model to learn phoneme-level patterns rather than voice fingerprints. The architecture is right - I just under-trained on diversity.
The full decoding graph (7.9 MB) plus the quantised model should fit comfortably under 200 MB, which was the whole point - something that can run on a phone or cheap device without connectivity.
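The budget arithmetic behind that claim, as a sketch (the ~20M encoder parameter count is my rough figure for a whisper-base-sized encoder, not a measured number):

```python
# Rough size budget for the on-device bundle, assuming INT8 quantisation
# (1 byte per weight) and a ~20M-parameter whisper-base-class encoder.
encoder_params = 20_000_000          # assumption, not measured
head_params = 512 * 3000 + 3000      # linear CTC head: weights + bias
graph_mb = 7.9                       # LG.fst, from the build above

int8_model_mb = (encoder_params + head_params) / 1e6
total_mb = int8_model_mb + graph_mb  # roughly 30 MB all in

assert total_mb < 200
```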
What's next
Retrain with the expanded dataset and augmentation. If Sudais val accuracy crosses 80%, I'll move to quantisation (INT8, maybe INT4) and ONNX export for on-device deployment via sherpa-onnx.
I'll post an update once the new training run finishes.
All code is in quran-asr if you want to follow along.