src2md: Context-Window-Optimized Code for LLMs

src2md solves a fundamental problem when working with LLMs: how do you fit a meaningful codebase into a limited context window while preserving the most important information?

The Problem

You want an LLM to understand your codebase, but:

  • GPT-4 Turbo's context window is ~128K tokens; Claude's is ~200K
  • A medium-sized project easily exceeds this
  • Naive truncation loses critical context
  • Manual curation doesn’t scale
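A rough back-of-the-envelope calculation shows how quickly a project blows past these limits. This stdlib-only sketch uses the common ~4-characters-per-token heuristic (an approximation, not a real tokenizer):

```python
from pathlib import Path

def estimate_tokens(root: str, exts=(".py", ".ts", ".md")) -> int:
    """Rough token estimate for a source tree: ~4 characters per token."""
    chars = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in exts:
            chars += len(path.read_text(errors="ignore"))
    return chars // 4
```

By this estimate, a 2 MB source tree comes to roughly 500K tokens, several times a 128K window.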

The Solution

src2md uses intelligent summarization to fit codebases into LLM context windows:

pip install src2md

# Basic markdown generation
src2md /path/to/project -o documentation.md

# With context optimization for GPT-4
src2md /path/to/project --gpt4 -o optimized.md

# With intelligent summarization
src2md /path/to/project --summarize --compression-ratio 0.3

Key Features

Context Window Optimization

Intelligently fit codebases into specific LLM context windows:

# Target specific LLM context windows
src2md . --target-tokens 128000  # GPT-4
src2md . --target-tokens 200000  # Claude

# Predefined windows
src2md . --window gpt-4-turbo
src2md . --window claude-3

Multi-Tier Summarization

Progressive compression strategies based on file importance:

from src2md import Converter

converter = Converter(
    target_tokens=100000,
    summarization_levels={
        'critical': 'full',       # Keep full source
        'important': 'ast',       # AST-based summary
        'supporting': 'minimal',  # Docstrings only
        'peripheral': 'exclude'   # Skip entirely
    }
)

File Importance Scoring

Multi-factor analysis prioritizes critical files:

  • Centrality: How many other files import this?
  • Complexity: Cyclomatic complexity, LOC
  • Recency: Recently modified files matter more
  • Naming: main.py, index.ts get priority
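src2md's actual weighting is internal to the tool, but a weighted combination of these factors might look like the following sketch (the weights, normalization, and tier thresholds here are hypothetical):

```python
def importance_score(centrality: float, complexity: float,
                     recency: float, name_bonus: float) -> float:
    """Combine normalized [0, 1] factors into one score.

    Weights are illustrative; src2md's real weighting may differ.
    """
    return (0.40 * centrality    # how many other files import this one
            + 0.25 * complexity  # cyclomatic complexity, LOC
            + 0.25 * recency     # recently modified files score higher
            + 0.10 * name_bonus) # entry points like main.py, index.ts

def assign_tier(score: float) -> str:
    """Map a score to a summarization tier (thresholds hypothetical)."""
    if score >= 0.75:
        return "critical"
    if score >= 0.50:
        return "important"
    if score >= 0.25:
        return "supporting"
    return "peripheral"
```

A heavily imported, recently touched entry point lands in the "critical" tier and keeps its full source, while a rarely imported utility file drifts toward "supporting" or "peripheral".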

AST-Based Analysis

Language-aware summarization extracts structure:

# From a 500-line Python file, extract:
# - Class/function signatures
# - Docstrings
# - Type hints
# - Key logic patterns
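Python's standard `ast` module is enough to sketch the idea. This toy extractor pulls out signatures and first docstring lines; it is a simplification, not src2md's actual implementation (it ignores type hints and logic patterns, for instance):

```python
import ast

def summarize_source(source: str) -> str:
    """Extract function/class signatures and first docstring lines."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}:")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"def {node.name}({args}):")
        else:
            continue
        doc = ast.get_docstring(node)
        if doc:
            lines.append(f'    """{doc.splitlines()[0]}"""')
    return "\n".join(lines)
```

Run over a 500-line module, something like this keeps the structural skeleton while discarding most of the body text, which is where the bulk of the token savings comes from.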

Multiple Output Formats

src2md . --format markdown    # Default
src2md . --format json        # Structured data
src2md . --format jsonl       # Line-delimited JSON
src2md . --format html        # Web-viewable
src2md . --format text        # Plain text

Python API

from src2md import Repository, ContextWindow

# Basic usage
output = Repository("/path/to/project").analyze().to_markdown()

# Optimize for GPT-4 context window
output = (Repository("/path/to/project")
    .optimize_for(ContextWindow.GPT_4)
    .analyze()
    .to_markdown())

# Full fluent API with all features
result = (Repository("/path/to/project")
    .name("MyProject")
    .include("src/", "lib/")
    .exclude("tests/", "*.log")
    .with_importance_scoring()
    .with_summarization(
        compression_ratio=0.3,
        preserve_important=True,
        use_llm=True
    )
    .optimize_for_tokens(100_000)
    .analyze()
    .to_json(pretty=True))

LLM-Powered Compression

For semantic understanding, use LLM-powered summarization:

# Use OpenAI for semantic compression
export OPENAI_API_KEY=...
src2md . --llm-compress --provider openai

# Use Anthropic
export ANTHROPIC_API_KEY=...
src2md . --llm-compress --provider anthropic

The LLM understands what’s important and produces human-readable summaries rather than just truncating.

Use Cases

  • Code Review: Give LLMs full project context for better reviews
  • Documentation: Generate docs with full codebase awareness
  • Debugging: Provide complete context for bug analysis
  • Onboarding: Create digestible project overviews

Installation

pip install src2md

# With LLM support
pip install src2md[llm]

src2md: Because LLMs deserve to see your whole codebase, not just the first 10 files.
