src2md solves a fundamental problem when working with LLMs: how do you fit a meaningful codebase into a limited context window while preserving the most important information?
## The Problem
You want an LLM to understand your codebase, but:
- GPT-4's context window is ~128K tokens, Claude's ~200K
- A medium-sized project easily exceeds this
- Naive truncation loses critical context
- Manual curation doesn’t scale
## The Solution
src2md uses intelligent summarization to fit codebases into LLM context windows:
```bash
pip install src2md

# Basic markdown generation
src2md /path/to/project -o documentation.md

# With context optimization for GPT-4
src2md /path/to/project --gpt4 -o optimized.md

# With intelligent summarization
src2md /path/to/project --summarize --compression-ratio 0.3
```
## Key Features

### Context Window Optimization
Intelligently fit codebases into specific LLM context windows:
```bash
# Target specific LLM context windows
src2md . --target-tokens 128000   # GPT-4
src2md . --target-tokens 200000   # Claude

# Predefined windows
src2md . --window gpt-4-turbo
src2md . --window claude-3
```
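Conceptually, fitting a repository into a budget means counting tokens per file and including files in priority order until the budget is spent. Here is a minimal sketch of that idea (not src2md's internals), assuming the `tiktoken` tokenizer:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-family encoding

def fit_to_budget(files, target_tokens=128_000):
    """files: (path, source) pairs, sorted most-important first."""
    selected, used = [], 0
    for path, source in files:
        n = len(enc.encode(source))
        if used + n > target_tokens:
            continue  # over budget; a real tool summarizes instead of dropping
        selected.append(path)
        used += n
    return selected, used
```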
### Multi-Tier Summarization
Progressive compression strategies based on file importance:
```python
from src2md import Converter

converter = Converter(
    target_tokens=100000,
    summarization_levels={
        'critical': 'full',        # Keep full source
        'important': 'ast',        # AST-based summary
        'supporting': 'minimal',   # Docstrings only
        'peripheral': 'exclude'    # Skip entirely
    }
)
```
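Which tier a file lands in is a function of its importance score. The thresholds below are illustrative assumptions, not src2md's actual values:

```python
def tier_for(score: float) -> str:
    """Map a normalized importance score to a summarization tier."""
    if score >= 0.8:
        return 'critical'     # keep full source
    if score >= 0.5:
        return 'important'    # AST-based summary
    if score >= 0.2:
        return 'supporting'   # docstrings only
    return 'peripheral'       # exclude entirely
```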
### File Importance Scoring
Multi-factor analysis prioritizes critical files (see the scoring sketch after this list):
- Centrality: How many other files import this?
- Complexity: Cyclomatic complexity, LOC
- Recency: Recently modified files matter more
- Naming: `main.py`, `index.ts` get priority
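One plausible way to combine these signals is a weighted sum over normalized factors. The weights here are assumptions for illustration, not src2md's actual formula:

```python
# Illustrative weights; each factor is normalized to [0, 1] beforehand.
WEIGHTS = {'centrality': 0.4, 'complexity': 0.2, 'recency': 0.2, 'naming': 0.2}

def importance(factors: dict[str, float]) -> float:
    """Weighted sum of normalized importance signals."""
    return sum(WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS)

importance({'centrality': 0.9, 'complexity': 0.5, 'recency': 0.3, 'naming': 1.0})  # 0.72
```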
### AST-Based Analysis
Language-aware summarization extracts structure:
```python
# From a 500-line Python file, extract:
# - Class/function signatures
# - Docstrings
# - Type hints
# - Key logic patterns
```
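Python's standard-library `ast` module is enough to show the idea. This sketch extracts signatures and first docstring lines, a simplified stand-in for what a full summarizer does:

```python
import ast

def signatures(source: str) -> list[str]:
    """Extract class/function signatures plus first docstring lines."""
    out = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            out.append(f"def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            out.append(f"class {node.name}")
        else:
            continue
        doc = ast.get_docstring(node)
        if doc:
            out.append(f'    """{doc.splitlines()[0]}"""')
    return out
```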
### Multiple Output Formats
```bash
src2md . --format markdown   # Default
src2md . --format json       # Structured data
src2md . --format jsonl      # Line-delimited JSON
src2md . --format html       # Web-viewable
src2md . --format text       # Plain text
```
## Python API
```python
from src2md import Repository, ContextWindow

# Basic usage
output = Repository("/path/to/project").analyze().to_markdown()

# Optimize for GPT-4 context window
output = (Repository("/path/to/project")
          .optimize_for(ContextWindow.GPT_4)
          .analyze()
          .to_markdown())

# Full fluent API with all features
result = (Repository("/path/to/project")
          .name("MyProject")
          .include("src/", "lib/")
          .exclude("tests/", "*.log")
          .with_importance_scoring()
          .with_summarization(
              compression_ratio=0.3,
              preserve_important=True,
              use_llm=True
          )
          .optimize_for_tokens(100_000)
          .analyze()
          .to_json(pretty=True))
```
## LLM-Powered Compression
For semantic understanding, use LLM-powered summarization:
```bash
# Use OpenAI for semantic compression
export OPENAI_API_KEY=...
src2md . --llm-compress --provider openai

# Use Anthropic
export ANTHROPIC_API_KEY=...
src2md . --llm-compress --provider anthropic
```
The LLM understands what’s important and produces human-readable summaries rather than just truncating.
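The shape of that call is simple. Here is a minimal sketch using the official OpenAI SDK (illustrative only, with an assumed model and prompt, not src2md's actual code):

```python
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

def llm_summarize(path: str, source: str) -> str:
    """Ask the model for a compressed, human-readable file summary."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice
        messages=[{
            "role": "user",
            "content": (f"Summarize {path} for code review. "
                        f"Keep public APIs, types, and key logic:\n\n{source}"),
        }],
    )
    return resp.choices[0].message.content
```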
## Use Cases
- Code Review: Give LLMs full project context for better reviews
- Documentation: Generate docs with full codebase awareness
- Debugging: Provide complete context for bug analysis
- Onboarding: Create digestible project overviews
## Installation
```bash
pip install src2md

# With LLM support (quoted so the brackets survive shells like zsh)
pip install "src2md[llm]"
```
## Resources
- PyPI: [pypi.org/project/src2md/](https://pypi.org/project/src2md/)
- GitHub: [github.com/queelius/src2md](https://github.com/queelius/src2md)
src2md: Because LLMs deserve to see your whole codebase, not just the first 10 files.