I Tried to Voice-Clone Myself in 90 MB
The experiment: a tiny SmolLM2 running in your browser. A token-level n-gram trained on every word I have published. Mix the two distributions in probability space at every generation step. Sample from the mix.
You can try it at /ask. There is a slider for how strongly the n-gram bleeds in.
I expected a small chatbot that sounds like me. I got a 135M-parameter model that uses my words to produce paranoid lorem-ipsum, and a 1.7B model that mostly behaves like a competent chatbot decorated with my function words. Sometimes a phrase comes out that I might actually write. More often the output is grammatically OK but conceptually empty.
The result is weak. The architecture is interesting anyway, and I want to write up why.
The Pieces
Three things, all running locally:
- SmolLM2-Instruct in 135M, 360M, or 1.7B sizes. Q4_K_M GGUF, served from HuggingFace, run via Wllama. 90 MB to 1 GB on disk. CPU only.
- A token-level n-gram over my blog corpus: every post tokenized with SmolLM2’s BPE, indexed with a suffix array (sketched below). 1.6 MB of source text, 470,000 tokens, 1.9 MB suffix array.
- A token-by-token sampling loop that mixes the LLM’s output distribution with the n-gram’s, in probability space.
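The suffix array is what makes piece two cheap. Here is a minimal sketch of the index, assuming the corpus is one flat array of BPE token ids and sa is its suffix array (every start position, sorted by the token sequence beginning there); the class name, the maxLen cap, and the method shapes are mine, matched to how the generation loop below calls them, not the production code:

// Sketch only: a tiny suffix-array index over the token stream.
class IndexedGrams {
  constructor(tokens, sa) {
    this.tokens = tokens; // whole corpus as one array of token ids
    this.sa = sa;         // suffix array over `tokens`
  }

  // Compare the corpus at `pos` against `pattern`:
  // -1 below, 0 if the corpus starts with `pattern` here, 1 above.
  cmp(pos, pattern) {
    for (let i = 0; i < pattern.length; i++) {
      const t = this.tokens[pos + i];
      if (t === undefined || t < pattern[i]) return -1;
      if (t > pattern[i]) return 1;
    }
    return 0;
  }

  // Two binary searches give the [lo, hi) band of suffixes
  // that start with `pattern`.
  range(pattern) {
    let lo = 0, hi = this.sa.length;
    while (lo < hi) {
      const mid = (lo + hi) >> 1;
      if (this.cmp(this.sa[mid], pattern) < 0) lo = mid + 1; else hi = mid;
    }
    const start = lo;
    hi = this.sa.length;
    while (lo < hi) {
      const mid = (lo + hi) >> 1;
      if (this.cmp(this.sa[mid], pattern) <= 0) lo = mid + 1; else hi = mid;
    }
    return [start, lo];
  }

  // Longest suffix of `context` (capped at maxLen) found in the corpus.
  longestSuffixMatch(context, maxLen = 8) {
    for (let n = Math.min(maxLen, context.length); n > 0; n--) {
      const suffix = context.slice(context.length - n);
      const [lo, hi] = this.range(suffix);
      if (hi > lo) return { suffixLen: n, matchedTokens: suffix };
    }
    return { suffixLen: 0, matchedTokens: [] };
  }

  // Empirical distribution over the tokens that follow `matchedTokens`.
  continuations(matchedTokens) {
    const [lo, hi] = this.range(matchedTokens);
    const counts = new Map();
    for (let i = lo; i < hi; i++) {
      const next = this.tokens[this.sa[i] + matchedTokens.length];
      if (next !== undefined) counts.set(next, (counts.get(next) ?? 0) + 1);
    }
    const total = [...counts.values()].reduce((a, b) => a + b, 0);
    return [...counts.entries()].map(([token, count]) => ({ token, prob: count / total }));
  }
}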
The third piece is the part worth thinking about.
The Math
At each generation step the LLM produces p_llm(t) over its 49,152-token vocabulary, exposed by Wllama’s getLogits(-1). The n-gram, given the longest suffix of the current context that occurs in the corpus, produces a sparse p_ngram(t): nonzero on tokens it has seen following that context, zero elsewhere.
Linear combination:
p_mix(t) = α · p_ngram(t) + (1 − α) · p_llm(t)
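To put numbers on it (illustrative values, not measurements): at α = 0.5, a token the LLM gives p_llm = 0.02 but that my corpus has seen follow the matched context 30% of the time ends up at 0.5 · 0.30 + 0.5 · 0.02 = 0.16, an eight-fold boost. A token absent from the corpus keeps exactly half its LLM probability.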
That is the whole algorithm. The inner loop is small enough to fit on screen:
for (let step = 0; step < N; step++) {
  // Full next-token distribution from the LLM.
  const llm = await wllama.getLogits(-1);

  // Sparse distribution from the n-gram: the tokens seen after the
  // longest suffix of the context that occurs in the corpus.
  const m = ig.longestSuffixMatch(context);
  const ngram = m.suffixLen > 0
    ? new Map(ig.continuations(m.matchedTokens).map(c => [c.token, c.prob]))
    : null;

  // Linear mix of the two distributions in probability space.
  const mix = new Map();
  for (const { token, p } of llm) {
    const pn = ngram?.get(token) ?? 0;
    mix.set(token, alpha * pn + (1 - alpha) * p);
  }

  const next = sample(mix, temperature);
  if (await wllama.isTokenEOG(next)) break;
  context.push(next);
  await wllama.decode([next]); // feed the token back for the next step
}
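sample() isn't shown above. A minimal sketch of the standard recipe I'd assume it follows, not the exact implementation: raise each probability to 1/temperature, then draw proportionally to the reweighted mass.

// Sketch only: temperature sampling from a Map of token -> probability.
function sample(probs, temperature) {
  // p^(1/T): T < 1 sharpens the distribution, T > 1 flattens it.
  const weighted = [...probs.entries()]
    .map(([token, p]) => [token, Math.pow(p, 1 / temperature)]);
  const total = weighted.reduce((sum, [, w]) => sum + w, 0);
  let r = Math.random() * total; // drawing against the total renormalizes implicitly
  for (const [token, w] of weighted) {
    r -= w;
    if (r <= 0) return token;
  }
  return weighted[weighted.length - 1][0]; // guard against float drift
}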
Tokens unseen by the n-gram have p_ngram = 0 and retain (1 − α) · p_llm in the mixture. They are not zeroed out, just unboosted.
α = 0 is the LLM. α = 1 is the n-gram, which loops as soon as the generated context drifts off-corpus. In between is a model running on LLM grammar with n-gram register.