I wanted my blog posts to have audio narration. Not a podcast, not a read-aloud button that sends text to a cloud API. Local TTS with narro, my 80M-parameter CPU model, generating Opus files that live next to the markdown source. One command to narrate an entire Hugo site.
That part was straightforward. The part that got interesting was highlighting: tracking which sentence is being spoken and lighting it up in the browser as the audio plays.
## The Pipeline
Three commands:
```shell
narro hugo install ~/mysite    # copy player assets (JS, CSS, HTML partial)
narro hugo generate ~/mysite   # narrate all posts with tts: true
narro hugo status ~/mysite     # show what's been narrated and what hasn't
```
`generate` walks the content directory, finds posts with `tts: true` in frontmatter, extracts the prose (stripping code blocks, math, shortcodes, and frontmatter), runs it through narro, converts the result to Opus via ffmpeg, and drops `narration.opus` and `narration.json` next to the post's `index.md`. The JSON file contains sentence-level timestamps; the HTML partial picks them up and wires everything together.
The player is a vanilla JS widget. No dependencies. Play, pause, spacebar toggle. The active paragraph highlights as the audio plays.
## The Alignment Problem
Timestamps are the interesting part. Narro uses a causal language model (Qwen-based, 80M params) to generate hidden states, then a Vocos decoder converts those to audio. There is no explicit alignment signal in this architecture.
My first attempt tried word-level timestamps using attention weights. The idea: each generated audio token attends to input text tokens, so you can compute a center-of-mass over the attention distribution to estimate when each word is spoken. This works beautifully for encoder-decoder models with cross-attention. It does not work for causal LM self-attention.
Causal self-attention attends to everything that came before. The distributions are diffuse. Every word’s center-of-mass lands somewhere in the middle of the sequence, producing overlapping time ranges that are useless for highlighting.
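For concreteness, this is roughly what the abandoned estimate looked like (a sketch, not narro's code): for each text token, take the attention-weighted mean of the audio steps that attend to it. When the weights are uniform, every token collapses to the same midpoint, which is exactly the failure mode above.

```javascript
// attn[t][i] = attention weight of audio step t on text token i.
// Returns the attention-weighted mean audio step for one text token.
function centerOfMass(attn, textToken) {
  let num = 0, den = 0;
  for (let t = 0; t < attn.length; t++) {
    num += t * attn[t][textToken];
    den += attn[t][textToken];
  }
  return den > 0 ? num / den : 0;
}
```

With a diffuse (near-uniform) distribution over 4 audio steps, every text token gets a center of mass of 1.5 — the middle of the sequence, regardless of when the word is actually spoken.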
## Sentence-Level Timestamps
The fix was to stop trying to be precise about words and use the precision the model actually gives you.
Narro processes each sentence independently. The number of generated tokens per sentence is known exactly. Multiply by the token duration (2048 samples / 32kHz = 64ms per token) and you get exact sentence timing. No approximation, no heuristics. The timestamps are ground truth.
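The computation is a running sum, sketched here with an assumed input shape of per-sentence token counts:

```javascript
// Vocos emits 2048 samples per token at 32 kHz: 64 ms per token.
const SAMPLES_PER_TOKEN = 2048;
const SAMPLE_RATE = 32000;
const TOKEN_SECONDS = SAMPLES_PER_TOKEN / SAMPLE_RATE; // 0.064

// sentences: [{text, tokens}] in playback order.
function buildAlignment(sentences) {
  let t = 0;
  return sentences.map(({ text, tokens }) => {
    const start = t;
    t += tokens * TOKEN_SECONDS;
    return { text, start, end: t };
  });
}
```

Each sentence's end is the next sentence's start, so the windows tile the audio with no gaps.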
The alignment JSON is one entry per sentence:
```json
[
  {"text": "Hello world.", "start": 0.0, "end": 0.64},
  {"text": "Goodbye moon.", "start": 0.64, "end": 1.6}
]
```
I initially tried distributing each sentence's duration across its words proportionally by character count. It worked, but the DOM manipulation was fragile: wrapping every word in a `<span>`, walking the tree to skip headings and code blocks, matching word indices sequentially to alignment entries. If any word got missed or doubled, every subsequent highlight drifted.
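The proportional split itself was trivial — a sketch of what I dropped, assuming the sentence-level alignment entries above:

```javascript
// Split a sentence's time window across its words in proportion to
// character count. The math was fine; the DOM bookkeeping was not.
function wordTimestamps(sentence) {
  const words = sentence.text.split(/\s+/).filter(Boolean);
  const totalChars = words.reduce((n, w) => n + w.length, 0);
  const span = sentence.end - sentence.start;
  let t = sentence.start;
  return words.map((word) => {
    const start = t;
    t += (word.length / totalChars) * span;
    return { word, start, end: t };
  });
}
```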
Sentence-level is better. The JS player matches sentences to existing `<p>` elements by substring containment. The DOM stays untouched. The highlighting is a CSS class toggled on the paragraph. Simpler code, no drift, and the visual result is actually clearer to read.
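The matching reduces to a one-liner. A sketch against plain strings, keeping the DOM out of the way — in the real player the paragraph texts would come from each `<p>` element's `textContent`:

```javascript
// For each alignment entry, find the first paragraph whose text
// contains the sentence; -1 if the sentence was stripped (e.g. it
// only existed in the extracted prose, not the rendered page).
function matchSentences(alignment, paragraphs) {
  return alignment.map(({ text }) =>
    paragraphs.findIndex((p) => p.includes(text))
  );
}
```

Unmatched sentences simply highlight nothing for their window, which degrades gracefully instead of drifting.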
## The Browser Side
The JS player runs a binary search over the alignment data once per frame via requestAnimationFrame and toggles a CSS class on the active paragraph. That is essentially the whole thing: no DOM rewriting, no word wrapping, no skip-tag lists.
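The per-frame lookup looks roughly like this (a sketch, assuming alignment entries sorted by start time):

```javascript
// Binary search for the sentence whose [start, end) window contains
// the current playback time. Returns an index into alignment, or -1
// if the time falls outside every window.
function activeSentence(alignment, time) {
  let lo = 0, hi = alignment.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (alignment[mid].end <= time) {
      lo = mid + 1;
    } else if (alignment[mid].start > time) {
      hi = mid - 1;
    } else {
      return mid;
    }
  }
  return -1;
}
```

In the rAF callback you call this with `audio.currentTime`, and only touch the DOM when the returned index changes.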
## What It Sounds Like
Every post on this blog with `tts: true` in frontmatter has a player at the top. Hit play. The active paragraph highlights as it is spoken. The model runs on CPU at about 20x real time.
The source is at github.com/queelius/narro.