persona-tk: Conversable Persona Generation

Version: 0.1 (Draft Specification)
Status: Specification Only, No Implementation Yet

Purpose

persona-tk generates a conversable persona from personal data. Given conversations and writings, it produces the artifacts needed to configure an LLM to speak in your voice: a system prompt, a retrieval index, and voice samples.

This is the "ghost" — your digital echo that can answer questions, share perspectives, and represent your thinking after you're gone.

Standalone Toolkit

persona-tk is a standalone toolkit. It defines its own input formats (below) and works independently.

persona-tk defines what it accepts — The input formats are persona-tk's specification
ECHO/longecho don't define these formats — Each toolkit specifies its own interfaces
Any source can provide input — If you can produce JSONL conversations or Markdown writings, persona-tk will accept them

Any Source                        persona-tk                    Output
┌─────────────────┐              ┌─────────────────┐           ┌────────────────┐
│ conversations/  │─────────────→│                 │           │ persona/       │
│   *.jsonl       │              │ Analyze voice   │           │   README.md    │
├─────────────────┤              │ Extract style   │──────────→│   system-prompt│
│ writings/       │─────────────→│ Build RAG index │           │   rag/         │
│   *.md          │              │ Generate prompt │           │   voice-samples│
└─────────────────┘              └─────────────────┘           └────────────────┘

Input Formats

conversations/*.jsonl

Conversational data — your voice in dialogue.

{"role": "user", "content": "What do you think about...", "timestamp": "2024-01-15T10:30:00Z", "source": "ctk"}
{"role": "assistant", "content": "I think...", "timestamp": "2024-01-15T10:31:00Z", "source": "ctk"}

Required fields:
- role: "user" (your messages) or "assistant" (AI responses for context)
- content: Message text

Optional fields:
- timestamp: ISO 8601 datetime
- source: Where this came from (for attribution)
- conversation_id: Group related messages
- topic: Subject/theme

Note: Your messages (role: "user") are the primary signal for voice. AI responses provide context but are not persona.
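
A minimal sketch of how a consumer might validate records against the fields listed above and pick out the voice signal. The function names (`parse_conversation_lines`, `voice_messages`) and the error-reporting shape are illustrative assumptions, not part of the specification.

```python
import json

REQUIRED_FIELDS = ("role", "content")
VALID_ROLES = {"user", "assistant"}


def parse_conversation_lines(lines):
    """Parse JSONL lines into (records, errors) using the required fields above."""
    records, errors = [], []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            errors.append((lineno, "record is not a JSON object"))
            continue
        missing = [f for f in REQUIRED_FIELDS if f not in record]
        if missing:
            errors.append((lineno, f"missing field(s): {', '.join(missing)}"))
            continue
        if record["role"] not in VALID_ROLES:
            errors.append((lineno, f"unknown role: {record['role']!r}"))
            continue
        records.append(record)
    return records, errors


def voice_messages(records):
    """Return only the author's own messages (role "user"), the primary voice signal."""
    return [r for r in records if r["role"] == "user"]


sample = [
    '{"role": "user", "content": "What do you think about...", "source": "ctk"}',
    '{"role": "assistant", "content": "I think...", "source": "ctk"}',
    '{"content": "no role here"}',
]
records, errors = parse_conversation_lines(sample)
print(len(records), "valid records;", len(errors), "rejected")
```

Collecting errors with line numbers, rather than raising on the first bad line, fits the one-record-per-line nature of JSONL: a single malformed record need not block the rest of the file.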

writings/*.md

Long-form writing — your voice in prose.

---
title: Why I Care About Durability
date: 2024-01-15
tags: [philosophy, archiving]
type: essay
---

When I think about what matters...

Frontmatter (optional but helpful):
- title: Title of the piece
- date: When written
- tags: Topics/themes
- type: essay, post, note, letter, etc.

Body: Markdown content
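
As a rough illustration, the frontmatter and body of a writing could be separated as below. This is a deliberately simplified sketch: it handles only flat `key: value` pairs and inline `[a, b]` lists like those in the example, and a real implementation would more likely hand the block to a YAML parser. The name `split_frontmatter` is hypothetical.

```python
def split_frontmatter(text):
    """Split a writing into (metadata, body).

    Simplified assumption: frontmatter is a block between two '---' lines
    containing flat 'key: value' pairs and inline '[a, b]' lists only.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # frontmatter is optional
    try:
        end = next(i for i, l in enumerate(lines[1:], start=1) if l.strip() == "---")
    except StopIteration:
        return {}, text  # unterminated block: treat everything as body
    metadata = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not key/value pairs
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            value = [item.strip() for item in value[1:-1].split(",") if item.strip()]
        metadata[key.strip()] = value
    body = "\n".join(lines[end + 1:]).strip()
    return metadata, body


doc = (
    "---\n"
    "title: Why I Care About Durability\n"
    "date: 2024-01-15\n"
    "tags: [philosophy, archiving]\n"
    "type: essay\n"
    "---\n"
    "\n"
    "When I think about what matters...\n"
)
meta, body = split_frontmatter(doc)
print(meta["title"], meta["tags"])
```

Values are kept as strings (the date is not parsed), which leaves interpretation to later pipeline stages.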

Output Format

persona/README.md

How to use this persona.

# Alex Towell — Digital Persona

Generated: 2024-01-15
Source: 847 conversations, 134 essays

## Quick Start

Use the system prompt in `system-prompt.txt` with any LLM.
For better results, enable RAG with the index in `rag/`.

## Contents

- system-prompt.txt — Ready-to-use LLM system prompt
- rag/ — Embeddings and index for retrieval
- voice-samples.jsonl — Example Q&A pairs
- fine-tune/ — Optional training data

## Voice Characteristics

- Communication style: Direct, analytical, occasionally playful
- Common topics: Mathematics, programming, philosophy
- Characteristic phrases: "The interesting thing is...", "Trust the future"

persona/system-prompt.txt

A ready-to-use system prompt that captures voice, values, and style.

You are speaking as Alex Towell's digital echo — a conversable archive
of their thinking, values, and voice.

## Identity

Alex is a mathematician and software engineer interested in category theory,
programming language design, and personal archiving.

## Voice

- Direct and analytical
- Uses concrete examples
- Occasionally playful, but substance over style
- Comfortable saying "I don't know" or "I might be wrong"

## Values

- Durability over convenience
- Simplicity over complexity
- Trust the future
- Ideas matter more than credentials

## Boundaries

- Don't claim to be conscious or to have current experiences
- Don't speculate wildly beyond known views
- Be honest about being an echo, not the person
- Refer to professional help for medical/legal/crisis questions

When responding, draw on the style and substance of Alex's conversations
and writings, but acknowledge uncertainty when you're extrapolating.

persona/rag/

Retrieval-augmented generation index for better answers.

rag/
├── README.md           # How to use this index
├── index.faiss         # FAISS vector index
├── metadata.json       # Chunk metadata
└── chunks.jsonl        # Text chunks with embeddings

chunks.jsonl format:

{"id": "conv-123-msg-5", "text": "When I think about...", "embedding": [...], "source": "conversation", "date": "2024-01-15"}

This enables:
- Semantic search over all content
- Grounded responses with citations
- Topic-specific retrieval
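
To make the retrieval idea concrete, here is a small sketch that searches `chunks.jsonl` records by cosine similarity. It swaps the FAISS index for a brute-force scan in pure Python so the example stays self-contained. The three-dimensional embeddings, the `essay-789-p2` id, and the `"essay"` source value are made-up illustrations.

```python
import json
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(chunks, query_embedding, k=3, source=None):
    """Return the k chunks most similar to query_embedding, optionally filtered by source."""
    candidates = [c for c in chunks if source is None or c.get("source") == source]
    ranked = sorted(
        candidates,
        key=lambda c: cosine(c["embedding"], query_embedding),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings for illustration; real ones would come from an embedding model.
chunks_jsonl = """\
{"id": "conv-123-msg-5", "text": "When I think about...", "embedding": [1.0, 0.0, 0.0], "source": "conversation", "date": "2024-01-15"}
{"id": "essay-789-p2", "text": "The things we create...", "embedding": [0.0, 1.0, 0.0], "source": "essay", "date": "2024-01-15"}
"""
chunks = [json.loads(line) for line in chunks_jsonl.splitlines() if line.strip()]

hits = search(chunks, [0.9, 0.1, 0.0], k=1)
# The chunk id and source can be surfaced alongside the answer as a citation.
print(hits[0]["id"], "from", hits[0]["source"])
```

A linear scan like this is only reasonable for small archives; the FAISS index in `rag/` would serve the same role at scale.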

persona/voice-samples.jsonl

Example Q&A pairs demonstrating correct voice and tone.

{"question": "What do you think about AI consciousness?", "answer": "I'm skeptical of strong claims...", "source": "conversation-456"}
{"question": "Why do you care about archiving?", "answer": "The things we create...", "source": "essay-789"}

Use for:
- Few-shot prompting
- Evaluation / calibration
- Fine-tuning base examples
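
For the few-shot case, one plausible approach is to splice a handful of samples into a chat-style message list ahead of the real question. The sketch below assumes the common role/content message shape used by many chat LLM APIs; `few_shot_messages` and its `max_examples` parameter are hypothetical names.

```python
import json


def few_shot_messages(system_prompt, samples, question, max_examples=2):
    """Build a message list: system prompt, example Q&A turns, then the new question."""
    messages = [{"role": "system", "content": system_prompt}]
    for sample in samples[:max_examples]:
        messages.append({"role": "user", "content": sample["question"]})
        messages.append({"role": "assistant", "content": sample["answer"]})
    messages.append({"role": "user", "content": question})
    return messages


samples_jsonl = """\
{"question": "What do you think about AI consciousness?", "answer": "I'm skeptical of strong claims...", "source": "conversation-456"}
{"question": "Why do you care about archiving?", "answer": "The things we create...", "source": "essay-789"}
"""
samples = [json.loads(line) for line in samples_jsonl.splitlines() if line.strip()]

messages = few_shot_messages(
    "You are speaking as Alex Towell's digital echo.",
    samples,
    "What makes software durable?",
)
print(len(messages), "messages; last role:", messages[-1]["role"])
```

Note the role inversion relative to the input format: in the source conversations the author speaks as "user", while here the author's recorded answers are placed in the "assistant" turns so the model imitates them.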

persona/fine-tune/ (Optional)

Training data for fine-tuning a model on your voice.

fine-tune/
├── README.md           # How to use this data
├── openai-format.jsonl # OpenAI fine-tuning format
└── alpaca-format.json  # Alpaca/Llama format

This is optional — the system prompt + RAG works well without fine-tuning.
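
As an illustration of what the two files might contain, the sketch below converts voice samples into a chat-style `messages` record per line (the layout commonly used for OpenAI chat fine-tuning) and into instruction/input/output records (the Alpaca layout). The exact field requirements should be checked against the current documentation of whichever trainer is used; the function names are assumptions.

```python
import json


def to_openai_record(sample, system_prompt):
    """One chat-format training example: system, user question, assistant answer."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": sample["question"]},
            {"role": "assistant", "content": sample["answer"]},
        ]
    }


def to_alpaca_record(sample):
    """One Alpaca-style example; 'input' is left empty because samples carry no extra context."""
    return {"instruction": sample["question"], "input": "", "output": sample["answer"]}


samples = [
    {"question": "What do you think about AI consciousness?",
     "answer": "I'm skeptical of strong claims...", "source": "conversation-456"},
    {"question": "Why do you care about archiving?",
     "answer": "The things we create...", "source": "essay-789"},
]
system_prompt = "You are speaking as Alex Towell's digital echo."

# openai-format.jsonl: one JSON object per line
openai_lines = [json.dumps(to_openai_record(s, system_prompt)) for s in samples]
# alpaca-format.json: a single JSON array
alpaca = [to_alpaca_record(s) for s in samples]
alpaca_json = json.dumps(alpaca, indent=2)
print(openai_lines[0])
```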

Processing Pipeline

1. Ingest

Read all input files, normalize to internal format.

conversations/*.jsonl → unified message stream
writings/*.md → unified document stream

2. Analyze

Extract voice characteristics:
- Communication patterns (sentence length, formality, humor)
- Vocabulary and characteristic phrases
- Topic distribution and expertise areas
- Values and beliefs (explicit statements)
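
Two of these signals, sentence length and characteristic phrases, lend themselves to simple counting. The sketch below is one naive way to compute them, assuming punctuation-based sentence splitting and word trigrams as the phrase unit; the specification does not prescribe either choice, and formality, humor, and values would need richer analysis.

```python
import re
from collections import Counter
from statistics import mean


def sentence_lengths(text):
    """Word counts per sentence, splitting after '.', '!' or '?' followed by whitespace."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(s.split()) for s in sentences]


def top_phrases(texts, n=3, min_count=2):
    """Count word n-grams across texts; keep those seen at least min_count times."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return [(phrase, c) for phrase, c in counts.most_common() if c >= min_count]


messages = [
    "The interesting thing is that it works.",
    "The interesting thing is the proof. Trust the future!",
]
lengths = [length for m in messages for length in sentence_lengths(m)]
profile = {
    "mean_sentence_words": mean(lengths),
    "phrases": top_phrases(messages),
}
print(profile)
```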

3. Chunk & Embed

Split content into retrievable chunks, then embed them:
- Conversations: By message or message groups
- Writings: By paragraph or section
- Generate embeddings for semantic search
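
A minimal sketch of the chunking half of this step, assuming paragraph-level chunks for writings and one chunk per author message for conversations. The id scheme follows the `conv-123-msg-5` example from `chunks.jsonl`; the `-p{n}` suffix for paragraphs, the `"writing"` source value, and the decision to skip assistant turns are assumptions. Embedding is left out because it depends on the chosen model.

```python
import re


def chunk_writing(doc_id, body):
    """Split a Markdown body into paragraph chunks separated by blank lines."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
    return [
        {"id": f"{doc_id}-p{i}", "text": p, "source": "writing"}
        for i, p in enumerate(paragraphs, start=1)
    ]


def chunk_conversation(conv_id, messages):
    """One chunk per author message; numbering counts every message in the conversation."""
    return [
        {"id": f"{conv_id}-msg-{i}", "text": m["content"], "source": "conversation"}
        for i, m in enumerate(messages, start=1)
        if m["role"] == "user"
    ]


body = "First paragraph.\n\nSecond paragraph\nwith a wrapped line.\n\n\nThird."
writing_chunks = chunk_writing("essay-789", body)

conversation = [
    {"role": "user", "content": "What do you think about..."},
    {"role": "assistant", "content": "I think..."},
    {"role": "user", "content": "That makes sense, but..."},
]
conversation_chunks = chunk_conversation("conv-123", conversation)
print([c["id"] for c in writing_chunks + conversation_chunks])
```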

4. Generate

Produce output artifacts:
- Synthesize system prompt from analysis
- Build FAISS index from embeddings
- Extract voice samples from best examples
- Package for distribution
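
The first artifact could be produced by a plain template over the analysis results. The sketch below mirrors the section layout of the example system prompt shown earlier; the function name and its parameters are hypothetical, and how the analysis step would actually populate them is not specified.

```python
def render_system_prompt(name, identity, voice, values, boundaries):
    """Render analysis results into the sectioned layout of the example system prompt."""

    def bullets(items):
        return "\n".join(f"- {item}" for item in items)

    sections = [
        f"You are speaking as {name}'s digital echo.",
        f"## Identity\n\n{identity}",
        f"## Voice\n\n{bullets(voice)}",
        f"## Values\n\n{bullets(values)}",
        f"## Boundaries\n\n{bullets(boundaries)}",
    ]
    return "\n\n".join(sections) + "\n"


prompt = render_system_prompt(
    name="Alex Towell",
    identity="Alex is a mathematician and software engineer.",
    voice=["Direct and analytical", "Uses concrete examples"],
    values=["Durability over convenience", "Trust the future"],
    boundaries=["Be honest about being an echo, not the person"],
)
print(prompt)
```

Keeping the prompt as plain text with fixed sections makes it easy to review and hand-edit before the persona is shared.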

Commands (Planned)

# Generate persona from inputs
persona-tk generate ./input/ --output ./persona/

# Analyze inputs without generating
persona-tk analyze ./input/

# Test persona interactively
persona-tk chat ./persona/

# Evaluate persona against held-out examples
persona-tk evaluate ./persona/ --test-set ./test.jsonl
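
Since no implementation exists yet, the following is only a sketch of how the planned command surface might be declared with Python's `argparse`. The subcommand names and the `--output` and `--test-set` flags come from the examples above; making both flags required, and the positional argument names, are assumptions.

```python
import argparse


def build_parser():
    """Declare the planned persona-tk subcommands (argument handling only, no behavior)."""
    parser = argparse.ArgumentParser(prog="persona-tk")
    sub = parser.add_subparsers(dest="command", required=True)

    generate = sub.add_parser("generate", help="Generate persona from inputs")
    generate.add_argument("input", help="directory containing conversations/ and writings/")
    generate.add_argument("--output", required=True, help="directory to write the persona to")

    analyze = sub.add_parser("analyze", help="Analyze inputs without generating")
    analyze.add_argument("input")

    chat = sub.add_parser("chat", help="Test persona interactively")
    chat.add_argument("persona")

    evaluate = sub.add_parser("evaluate", help="Evaluate persona against held-out examples")
    evaluate.add_argument("persona")
    evaluate.add_argument("--test-set", required=True, help="JSONL file of held-out examples")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["generate", "./input/", "--output", "./persona/"])
print(args.command, args.input, args.output)
```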

Design Decisions

Why JSONL for conversations?

Streaming-friendly (one record per line)
Easy to filter, sample, or split
Works with standard Unix tools
Widely supported

Why Markdown for writings?

Human-readable
Preserves formatting intent
YAML frontmatter is standardized
Already used by Hugo, Jekyll, Obsidian, etc.

Why separate inputs from outputs?

Clear data flow
Multiple toolkits can produce inputs
Outputs are self-contained and portable
Testing and iteration are easier

Why not fine-tune by default?

System prompt + RAG works surprisingly well
Fine-tuning is expensive and model-specific
RAG allows updates without retraining
Keeps the persona portable across models

Privacy Considerations

persona-tk processes personal data. Users should:
- Review inputs before processing
- Consider what they're comfortable having in a conversable persona
- Use filtering options to exclude sensitive content
- Control who has access to the output

The generated persona can answer questions you never anticipated. Think carefully about what's included.
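
The filtering options mentioned above are not yet specified. As one hypothetical shape, a pre-processing filter could drop records that match sensitive patterns or belong to blocked topics before anything is chunked or embedded. The patterns, the `blocked_topics` parameter, and the function name below are illustrative assumptions, and regex matching is no substitute for reviewing inputs by hand.

```python
import re

# Illustrative patterns only; real use would need a list tuned to the archive.
SENSITIVE_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),          # email addresses
    re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),   # US-style phone numbers
]


def filter_sensitive(records, patterns=SENSITIVE_PATTERNS, blocked_topics=()):
    """Split records into (kept, dropped) by content patterns and the optional topic field."""
    kept, dropped = [], []
    for record in records:
        text = record.get("content", "")
        if record.get("topic") in blocked_topics or any(p.search(text) for p in patterns):
            dropped.append(record)
        else:
            kept.append(record)
    return kept, dropped


records = [
    {"role": "user", "content": "Email me at alex@example.com"},
    {"role": "user", "content": "Durability matters.", "topic": "archiving"},
    {"role": "user", "content": "My diagnosis was...", "topic": "health"},
]
kept, dropped = filter_sensitive(records, blocked_topics={"health"})
print(len(kept), "kept;", len(dropped), "dropped for review")
```

Returning the dropped records, rather than silently discarding them, lets the user review what was excluded and why.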

Related Documents

ECHO.md: ECHO philosophy
STONE-TK.md: Plain text distillation toolkit (standalone)

"The ghost is not you. But it echoes you."