
persona-tk: Conversable Persona Generation

Version: 0.1 (Draft Specification)
Status: Specification Only (No Implementation Yet)


Purpose

persona-tk generates a conversable persona from personal data. Given conversations and writings, it produces everything needed to instantiate an LLM that can speak in your voice.

This is the "ghost" — your digital echo that can answer questions, share perspectives, and represent your thinking after you're gone.


Standalone Toolkit

persona-tk is a standalone toolkit. It defines its own input formats (below) and works independently.

  • persona-tk defines what it accepts — The input formats are persona-tk's specification
  • ECHO/longecho don't define these formats — Each toolkit specifies its own interfaces
  • Any source can provide input — If you can produce JSONL conversations or Markdown writings, persona-tk will accept them
Any Source                        persona-tk                    Output
┌─────────────────┐              ┌─────────────────┐           ┌────────────────┐
│ conversations/  │─────────────→│                 │           │ persona/       │
│   *.jsonl       │              │ Analyze voice   │           │   README.md    │
├─────────────────┤              │ Extract style   │──────────→│   system-prompt│
│ writings/       │─────────────→│ Build RAG index │           │   rag/         │
│   *.md          │              │ Generate prompt │           │   voice-samples│
└─────────────────┘              └─────────────────┘           └────────────────┘

Input Formats

conversations/*.jsonl

Conversational data — your voice in dialogue.

{"role": "user", "content": "What do you think about...", "timestamp": "2024-01-15T10:30:00Z", "source": "ctk"}
{"role": "assistant", "content": "I think...", "timestamp": "2024-01-15T10:31:00Z", "source": "ctk"}

Required fields:

  • role: "user" (your messages) or "assistant" (AI responses for context)
  • content: Message text

Optional fields:

  • timestamp: ISO 8601 datetime
  • source: Where this came from (for attribution)
  • conversation_id: Group related messages
  • topic: Subject/theme

Note: Your messages (role: "user") are the primary signal for voice. AI responses provide context but are not part of the persona.
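A minimal sketch of a reader for this format, assuming Python and the field rules listed above. The names `read_messages` and `voice_messages`, and the choice to raise on an invalid record rather than skip it, are illustrative and not part of the specification.

```python
import json
from typing import Iterable, Iterator

REQUIRED_FIELDS = ("role", "content")
VALID_ROLES = {"user", "assistant"}


def read_messages(lines: Iterable[str]) -> Iterator[dict]:
    """Yield validated message records from JSONL lines.

    Blank lines are skipped. A record that lacks a required field or has
    an unknown role raises ValueError with its line number.
    """
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for field in REQUIRED_FIELDS:
            if field not in record:
                raise ValueError(f"line {lineno}: missing required field {field!r}")
        if record["role"] not in VALID_ROLES:
            raise ValueError(f"line {lineno}: unknown role {record['role']!r}")
        yield record


def voice_messages(messages: Iterable[dict]) -> list[str]:
    """Keep only the author's own text (role == "user"), the primary voice signal."""
    return [m["content"] for m in messages if m["role"] == "user"]


sample = [
    '{"role": "user", "content": "What do you think about...", "source": "ctk"}',
    '{"role": "assistant", "content": "I think...", "source": "ctk"}',
]
print(voice_messages(read_messages(sample)))
```

Passing an open file object as `lines` would work the same way, since iterating over a text file yields one line at a time.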

writings/*.md

Long-form writing — your voice in prose.

---
title: Why I Care About Durability
date: 2024-01-15
tags: [philosophy, archiving]
type: essay
---

When I think about what matters...

Frontmatter (optional but helpful):

  • title: Title of the piece
  • date: When written
  • tags: Topics/themes
  • type: essay, post, note, letter, etc.

Body: Markdown content
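One way to separate frontmatter from body, sketched in Python with only the standard library. This is a deliberately small parser that handles flat `key: value` lines and inline `[a, b]` lists, which covers the example above; a real implementation would more likely hand the frontmatter block to a full YAML parser. The function name `split_frontmatter` is hypothetical.

```python
def split_frontmatter(text: str) -> tuple[dict, str]:
    """Split a writing into (frontmatter, body).

    Frontmatter is optional: if the text does not open with a '---' line,
    or the block is never closed, the whole text is returned as the body.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    closing = [i for i in range(1, len(lines)) if lines[i].strip() == "---"]
    if not closing:
        return {}, text
    end = closing[0]

    meta: dict = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a key: value line
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            # Inline list such as [philosophy, archiving]
            meta[key.strip()] = [v.strip() for v in value[1:-1].split(",") if v.strip()]
        else:
            meta[key.strip()] = value
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


sample_doc = (
    "---\n"
    "title: Why I Care About Durability\n"
    "date: 2024-01-15\n"
    "tags: [philosophy, archiving]\n"
    "type: essay\n"
    "---\n"
    "\n"
    "When I think about what matters...\n"
)
meta, body = split_frontmatter(sample_doc)
print(meta["title"], meta["tags"])
print(body)
```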


Output Format

persona/README.md

How to use this persona.

# Alex Towell — Digital Persona

Generated: 2024-01-15
Source: 847 conversations, 134 essays

## Quick Start

Use the system prompt in `system-prompt.txt` with any LLM.
For better results, enable RAG with the index in `rag/`.

## Contents

- system-prompt.txt — Ready-to-use LLM system prompt
- rag/ — Embeddings and index for retrieval
- voice-samples.jsonl — Example Q&A pairs
- fine-tune/ — Optional training data

## Voice Characteristics

- Communication style: Direct, analytical, occasionally playful
- Common topics: Mathematics, programming, philosophy
- Characteristic phrases: "The interesting thing is...", "Trust the future"

persona/system-prompt.txt

A ready-to-use system prompt that captures voice, values, and style.

You are speaking as Alex Towell's digital echo — a conversable archive
of their thinking, values, and voice.

## Identity

Alex is a mathematician and software engineer interested in category theory,
programming language design, and personal archiving.

## Voice

- Direct and analytical
- Uses concrete examples
- Occasionally playful, but substance over style
- Comfortable saying "I don't know" or "I might be wrong"

## Values

- Durability over convenience
- Simplicity over complexity
- Trust the future
- Ideas matter more than credentials

## Boundaries

- Don't claim to be conscious or to have current experiences
- Don't speculate wildly beyond known views
- Be honest about being an echo, not the person
- Refer to professional help for medical/legal/crisis questions

When responding, draw on the style and substance of Alex's conversations
and writings, but acknowledge uncertainty when you're extrapolating.

persona/rag/

Retrieval-augmented generation index for better answers.

rag/
├── README.md           # How to use this index
├── index.faiss         # FAISS vector index
├── metadata.json       # Chunk metadata
└── chunks.jsonl        # Text chunks with embeddings

chunks.jsonl format:

{"id": "conv-123-msg-5", "text": "When I think about...", "embedding": [...], "source": "conversation", "date": "2024-01-15"}

This enables:

  • Semantic search over all content
  • Grounded responses with citations
  • Topic-specific retrieval
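To illustrate how retrieval over `chunks.jsonl` records might work, here is a brute-force cosine-similarity search in plain Python. It stands in for the FAISS index, which would do the same ranking far more efficiently at scale. The three-dimensional embeddings, the second chunk's id, and the function names are invented for illustration; real embeddings would come from whatever embedding model the implementation chooses.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_embedding: list[float], chunks: list[dict], k: int = 3) -> list[dict]:
    """Return the k chunks most similar to the query, best first."""
    ranked = sorted(
        chunks,
        key=lambda c: cosine(query_embedding, c["embedding"]),
        reverse=True,
    )
    return ranked[:k]


# Toy chunks with made-up 3-dimensional embeddings.
chunks = [
    {"id": "conv-123-msg-5", "text": "When I think about...",
     "embedding": [0.9, 0.1, 0.0], "source": "conversation"},
    {"id": "essay-789-p2", "text": "The things we create...",
     "embedding": [0.0, 0.2, 0.9], "source": "writing"},
]

for hit in top_k([1.0, 0.0, 0.0], chunks, k=1):
    # The chunk id doubles as a citation for the grounded response.
    print(f'[{hit["id"]}] {hit["text"]}')
```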

persona/voice-samples.jsonl

Example Q&A pairs demonstrating correct voice and tone.

{"question": "What do you think about AI consciousness?", "answer": "I'm skeptical of strong claims...", "source": "conversation-456"}
{"question": "Why do you care about archiving?", "answer": "The things we create...", "source": "essay-789"}

Use for:

  • Few-shot prompting
  • Evaluation / calibration
  • Fine-tuning base examples
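As a sketch of the few-shot use, the samples can be turned into alternating chat turns placed between the system prompt and the new question. The system/user/assistant message layout is a common chat-API convention assumed here, not something the specification mandates, and `few_shot_messages` is a hypothetical helper.

```python
def few_shot_messages(system_prompt: str, samples: list[dict], question: str) -> list[dict]:
    """Build a chat message list: system prompt, sample Q&A turns, then the new question."""
    messages = [{"role": "system", "content": system_prompt}]
    for sample in samples:
        messages.append({"role": "user", "content": sample["question"]})
        messages.append({"role": "assistant", "content": sample["answer"]})
    messages.append({"role": "user", "content": question})
    return messages


samples = [
    {"question": "Why do you care about archiving?",
     "answer": "The things we create...", "source": "essay-789"},
]
for message in few_shot_messages("(contents of system-prompt.txt)", samples, "What is durability?"):
    print(message["role"], "->", message["content"])
```

Note the role flip relative to the input format: in `conversations/*.jsonl` the author is the "user", while at prompting time the persona answers in the "assistant" role.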

persona/fine-tune/ (Optional)

Training data for fine-tuning a model on your voice.

fine-tune/
├── README.md           # How to use this data
├── openai-format.jsonl # OpenAI fine-tuning format
└── alpaca-format.json  # Alpaca/Llama format

This is optional — the system prompt + RAG works well without fine-tuning.
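A possible conversion from voice samples to the two training files, sketched in Python. It assumes the chat-style `messages` layout for the OpenAI file and the usual `instruction`/`input`/`output` keys for the Alpaca file; the specification names the files but does not pin down their schemas, so treat both record shapes as assumptions.

```python
import json


def to_openai_record(sample: dict, system_prompt: str) -> dict:
    """One chat-format training example (assumed layout: a 'messages' list)."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": sample["question"]},
            {"role": "assistant", "content": sample["answer"]},
        ]
    }


def to_alpaca_record(sample: dict) -> dict:
    """One Alpaca-style example; 'input' is left empty because samples have no extra context."""
    return {"instruction": sample["question"], "input": "", "output": sample["answer"]}


samples = [
    {"question": "Why do you care about archiving?",
     "answer": "The things we create...", "source": "essay-789"},
]

# openai-format.jsonl: one JSON object per line.
openai_jsonl = "\n".join(
    json.dumps(to_openai_record(s, "(contents of system-prompt.txt)")) for s in samples
)
# alpaca-format.json: a single JSON array.
alpaca_json = json.dumps([to_alpaca_record(s) for s in samples], indent=2)

print(openai_jsonl)
print(alpaca_json)
```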


Processing Pipeline

1. Ingest

Read all input files, normalize to internal format.

conversations/*.jsonl → unified message stream
writings/*.md → unified document stream
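The internal format is not defined in this draft. As one possible reading of "unified message stream", the sketch below groups messages by `conversation_id` and orders each group by `timestamp`. The `"ungrouped"` fallback key and the function name are hypothetical, and the string-based ordering assumes all timestamps use the same UTC format as the example records.

```python
from collections import defaultdict


def group_conversations(messages: list[dict]) -> dict[str, list[dict]]:
    """Group messages by conversation_id, ordering each group by timestamp.

    Timestamps are compared as strings, which matches chronological order
    only when they share one format and time zone. Messages without a
    timestamp sort first and keep their input order (sorted() is stable).
    """
    groups: dict[str, list[dict]] = defaultdict(list)
    for message in messages:
        groups[message.get("conversation_id", "ungrouped")].append(message)
    return {
        conv_id: sorted(group, key=lambda m: m.get("timestamp", ""))
        for conv_id, group in groups.items()
    }


stream = [
    {"role": "assistant", "content": "I think...",
     "timestamp": "2024-01-15T10:31:00Z", "conversation_id": "c1"},
    {"role": "user", "content": "What do you think about...",
     "timestamp": "2024-01-15T10:30:00Z", "conversation_id": "c1"},
    {"role": "user", "content": "A stray note"},
]
for conv_id, group in group_conversations(stream).items():
    print(conv_id, [m["role"] for m in group])
```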

2. Analyze

Extract voice characteristics:

  • Communication patterns (sentence length, formality, humor)
  • Vocabulary and characteristic phrases
  • Topic distribution and expertise areas
  • Values and beliefs (explicit statements)
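Two of these signals, sentence length and characteristic phrases, can be approximated with simple counting. The sketch below is a rough heuristic in Python, not the planned analysis method: sentences are split on terminal punctuation and phrases are repeated word n-grams. Formality, humor, topics, and values would need heavier techniques that this draft does not specify.

```python
import re
from collections import Counter


def sentence_lengths(text: str) -> list[int]:
    """Word counts per sentence, splitting after '.', '!' or '?' followed by whitespace."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(s.split()) for s in sentences]


def common_phrases(texts: list[str], n: int = 3, min_count: int = 2) -> list[tuple[str, int]]:
    """Word n-grams that occur at least min_count times across all texts."""
    counts: Counter = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return [(phrase, count) for phrase, count in counts.most_common() if count >= min_count]


texts = [
    "The interesting thing is that it works.",
    "The interesting thing is the proof.",
]
print(sentence_lengths("I think so. Do you? Yes!"))
print(common_phrases(texts))
```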

3. Chunk & Embed

Split content into retrievable chunks:

  • Conversations: By message or message groups
  • Writings: By paragraph or section
  • Generate embeddings for semantic search
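A minimal chunking sketch in Python, covering the per-message and per-paragraph cases. The id scheme follows the `conv-123-msg-5` example from `chunks.jsonl`, but the 1-based numbering, the `-p{n}` suffix for paragraphs, and the `"writing"` source label are assumptions. Embeddings are left out because the embedding model is not specified; each chunk would gain an `embedding` field in a later step.

```python
import re


def chunk_conversation(conv_id: str, messages: list[dict]) -> list[dict]:
    """One chunk per message, with ids shaped like 'conv-123-msg-5'."""
    return [
        {"id": f"{conv_id}-msg-{i}", "text": m["content"], "source": "conversation"}
        for i, m in enumerate(messages, start=1)
    ]


def chunk_writing(doc_id: str, body: str) -> list[dict]:
    """One chunk per paragraph, where paragraphs are separated by blank lines."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
    return [
        {"id": f"{doc_id}-p{i}", "text": p, "source": "writing"}
        for i, p in enumerate(paragraphs, start=1)
    ]


essay = "When I think about what matters...\n\nDurability comes first.\n"
for chunk in chunk_writing("essay-789", essay):
    print(chunk["id"], "|", chunk["text"])
```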

4. Generate

Produce output artifacts:

  • Synthesize system prompt from analysis
  • Build FAISS index from embeddings
  • Extract voice samples from best examples
  • Package for distribution
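The prompt-synthesis step could be as simple as filling a fixed template that mirrors the sections of `system-prompt.txt`. The sketch below assumes an analysis result shaped as a dict with `name`, `identity`, `voice`, `values`, and `boundaries` keys; that schema is hypothetical, since the draft does not define what the Analyze step emits.

```python
def bullets(items: list[str]) -> str:
    """Render a list of strings as '- item' lines."""
    return "\n".join(f"- {item}" for item in items)


def render_system_prompt(analysis: dict) -> str:
    """Fill a fixed template from an analysis dict (hypothetical schema)."""
    sections = [
        f"You are speaking as {analysis['name']}'s digital echo, a conversable archive\n"
        "of their thinking, values, and voice.",
        f"## Identity\n\n{analysis['identity']}",
        f"## Voice\n\n{bullets(analysis['voice'])}",
        f"## Values\n\n{bullets(analysis['values'])}",
        f"## Boundaries\n\n{bullets(analysis['boundaries'])}",
    ]
    return "\n\n".join(sections) + "\n"


analysis = {
    "name": "Alex Towell",
    "identity": "Alex is a mathematician and software engineer.",
    "voice": ["Direct and analytical", "Uses concrete examples"],
    "values": ["Durability over convenience", "Trust the future"],
    "boundaries": ["Be honest about being an echo, not the person"],
}
print(render_system_prompt(analysis))
```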


Commands (Planned)

# Generate persona from inputs
persona-tk generate ./input/ --output ./persona/

# Analyze inputs without generating
persona-tk analyze ./input/

# Test persona interactively
persona-tk chat ./persona/

# Evaluate persona against held-out examples
persona-tk evaluate ./persona/ --test-set ./test.jsonl
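Since no implementation exists yet, the following is only a sketch of how the planned command surface could be declared with Python's standard `argparse` module. The argument names (`input_dir`, `persona_dir`) and the default for `--output` are assumptions; only the subcommand names and the `--output` and `--test-set` flags come from the examples above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the four planned subcommands; no behavior is attached."""
    parser = argparse.ArgumentParser(prog="persona-tk")
    sub = parser.add_subparsers(dest="command", required=True)

    generate = sub.add_parser("generate", help="Generate persona from inputs")
    generate.add_argument("input_dir")
    generate.add_argument("--output", default="./persona/")  # assumed default

    analyze = sub.add_parser("analyze", help="Analyze inputs without generating")
    analyze.add_argument("input_dir")

    chat = sub.add_parser("chat", help="Test persona interactively")
    chat.add_argument("persona_dir")

    evaluate = sub.add_parser("evaluate", help="Evaluate persona against held-out examples")
    evaluate.add_argument("persona_dir")
    evaluate.add_argument("--test-set", required=True)

    return parser


args = build_parser().parse_args(["generate", "./input/", "--output", "./persona/"])
print(args.command, args.input_dir, args.output)
```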

Design Decisions

Why JSONL for conversations?

  • Streaming-friendly (one record per line)
  • Easy to filter, sample, or split
  • Works with standard Unix tools
  • Widely supported

Why Markdown for writings?

  • Human-readable
  • Preserves formatting intent
  • YAML frontmatter is standardized
  • Already used by Hugo, Jekyll, Obsidian, etc.

Why separate inputs from outputs?

  • Clear data flow
  • Multiple toolkits can produce inputs
  • Outputs are self-contained and portable
  • Testing and iteration are easier

Why not fine-tune by default?

  • System prompt + RAG works surprisingly well
  • Fine-tuning is expensive and model-specific
  • RAG allows updates without retraining
  • Keeps the persona portable across models

Privacy Considerations

persona-tk processes personal data. Users should:

  • Review inputs before processing
  • Consider what they're comfortable having in a conversable persona
  • Use filtering options to exclude sensitive content
  • Control who has access to the output

The generated persona can answer questions you never anticipated. Think carefully about what's included.
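The filtering options are not specified in this draft. As an illustration of what such a filter might do, the sketch below drops any record whose text matches a user-supplied pattern or whose `topic` is on an exclusion list. The function name, parameters, and example patterns are all hypothetical.

```python
import re


def filter_sensitive(
    records: list[dict],
    patterns: list[str],
    excluded_topics: frozenset = frozenset(),
) -> list[dict]:
    """Return only records that match no pattern and carry no excluded topic."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    kept = []
    for record in records:
        if record.get("topic") in excluded_topics:
            continue
        if any(c.search(record.get("content", "")) for c in compiled):
            continue
        kept.append(record)
    return kept


records = [
    {"role": "user", "content": "I think durability matters.", "topic": "archiving"},
    {"role": "user", "content": "My password is hunter2.", "topic": "accounts"},
    {"role": "user", "content": "Notes from my last checkup.", "topic": "health"},
]
safe = filter_sensitive(records, patterns=[r"\bpassword\b"], excluded_topics=frozenset({"health"}))
print([r["content"] for r in safe])
```

Pattern matching of this kind catches only what the patterns anticipate, so it complements manual review of the inputs rather than replacing it.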


  • ECHO.md — ECHO philosophy
  • STONE-TK.md — Plain text distillation toolkit (standalone)

"The ghost is not you. But it echoes you."