src2md
Convert source code to structured, context-optimized markdown for LLMs with intelligent summarization.
src2md converts source code repositories into structured representations sized for the context windows of Large Language Models (LLMs). It addresses the challenge of fitting a meaningful codebase into a limited context window by keeping the most important information through AST-based analysis, intelligent summarization, and optional LLM-powered compression.
Features
New in v2.1
- Context Window Optimization: Intelligently fit codebases into LLM context windows with smart truncation
- Intelligent Summarization: AST-based code analysis with multiple compression levels
- LLM-Powered Compression: Optional OpenAI/Anthropic integration for semantic summarization
- Fluent API: Elegant method chaining with new summarization methods
- File Importance Scoring: Multi-factor analysis to prioritize critical files
- Predefined LLM Windows: Built-in support for GPT-4, Claude, and more
- Progressive Summarization: Multi-tier compression strategies for different file types
Core Features
- Multiple Output Formats: JSON, JSONL, Markdown, HTML, and plain text
- Smart Token Management: Accurate token counting with tiktoken and structure-aware truncation
- Multi-Language Support: Specialized summarizers for Python, JavaScript, TypeScript, JSON, YAML
- Code Statistics: Automatic generation of project metrics and complexity analysis
- Flexible Filtering: Customizable include/exclude patterns
- Rich CLI Interface: Beautiful progress indicators and colored output
Installation
Install from PyPI with pip:
pip install src2md
Usage
Quick Start - Fluent API
from src2md import Repository, ContextWindow
# Basic usage
output = Repository("/path/to/project").analyze().to_markdown()
# Optimize for GPT-4 context window
output = (Repository("/path/to/project")
    .optimize_for(ContextWindow.GPT_4)
    .analyze()
    .to_markdown())
# Full fluent API with all features
result = (Repository("/path/to/project")
    .name("MyProject")
    .branch("main")
    .include("src/", "lib/")
    .exclude("tests/", "*.log")
    .with_importance_scoring()
    .with_summarization(
        compression_ratio=0.3,    # Target 30% of original size
        preserve_important=True,  # Keep critical files intact
        use_llm=True              # Use LLM if available
    )
    .prioritize(["main.py", "core/"])
    .optimize_for_tokens(100_000)  # 100K token limit
    .analyze()
    .to_json(pretty=True))
Command Line Interface
# Basic markdown generation
src2md /path/to/project -o documentation.md
# With context optimization
src2md /path/to/project --gpt4 -o optimized.md
src2md /path/to/project --claude3 --importance
# With intelligent summarization
src2md /path/to/project --summarize --compression-ratio 0.3
src2md /path/to/project --summarize-tests --summarize-docs
# With LLM-powered summarization (requires API key)
src2md /path/to/project --use-llm --llm-model gpt-3.5-turbo
# Multiple output formats
src2md /path/to/project --format json --pretty
src2md /path/to/project --format html -o docs.html
Python API Examples
Basic Context Optimization
from src2md import Repository, ContextWindow
# Optimize for different LLM context windows
repo = Repository("./my-project")
output = repo.optimize_for(ContextWindow.CLAUDE_3).analyze().to_markdown()
# Custom token limit with importance scoring
repo = (Repository("./my-project")
    .with_importance_scoring()
    .optimize_for_tokens(50_000)
    .analyze())
Intelligent Summarization
# Enable smart summarization with compression
repo = (Repository("./my-project")
    .with_summarization(
        compression_ratio=0.3,    # Compress to 30% of original
        preserve_important=True,  # Keep critical files intact
        use_llm=False             # Use AST-based summarization
    )
    .optimize_for(ContextWindow.GPT_4)
    .analyze())
# Use LLM-powered summarization (requires API key)
import os
os.environ['OPENAI_API_KEY'] = 'your-key-here'
repo = (Repository("./my-project")
    .with_summarization(
        compression_ratio=0.2,  # More aggressive compression
        use_llm=True,
        llm_model="gpt-3.5-turbo"
    )
    .analyze())
Multi-Tier Compression Strategy
# Configure different summarization levels for different file types
repo = (Repository("./my-project")
    .with_importance_scoring()
    .prioritize(["src/core/", "api/"])  # Critical paths
    .summarize_tests()                  # Compress test files
    .summarize_docs()                   # Compress documentation
    .with_summarization(
        compression_ratio=0.25,
        preserve_important=True
    )
    .optimize_for_tokens(100_000)
    .analyze())
# Access summarization metadata
data = repo.to_dict()
for file in data['source_files']:
    if file.get('was_summarized'):
        print(f"Summarized {file['path']}: {file['original_size']} -> {file['size']} bytes")
Generate Multiple Formats
repo = Repository("./my-project").analyze()
markdown = repo.to_markdown()
json_data = repo.to_json()
html_doc = repo.to_html()
# Access raw data
data = repo.to_dict()
print(f"Files: {data['metadata']['file_count']}")
print(f"Token usage: {data['metadata'].get('total_tokens', 0)}")
print(f"Compression achieved: {data['metadata'].get('compression_ratio', 1.0):.1%}")
Summarization Features
AST-Based Python Summarization
src2md uses Abstract Syntax Tree (AST) analysis to intelligently summarize Python code while preserving structure:
- MINIMAL: Only class/function signatures
- OUTLINE: Signatures with structural hierarchy
- DOCSTRINGS: Signatures plus documentation
- SIGNATURES: Full signatures with type hints
- FULL: No summarization
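To give a rough idea of what signature-level summarization involves, here is a minimal sketch built only on Python's standard `ast` module (Python 3.9+ for `ast.unparse`). It is not src2md's implementation: the `outline` function and its `keep_docstrings` flag are hypothetical names for this example, loosely mirroring the MINIMAL and DOCSTRINGS levels listed above.

```python
import ast

def outline(source: str, keep_docstrings: bool = False) -> str:
    """Return class/function headers, optionally with the first docstring line."""
    lines = []

    def visit(node: ast.AST, depth: int) -> None:
        # Only definitions sitting directly in a module, class, or function
        # body are reported; definitions nested under if/try are skipped.
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                keyword = "async def" if isinstance(child, ast.AsyncFunctionDef) else "def"
                header = f"{keyword} {child.name}({ast.unparse(child.args)})"
                if child.returns is not None:
                    header += f" -> {ast.unparse(child.returns)}"
            elif isinstance(child, ast.ClassDef):
                bases = ", ".join(ast.unparse(base) for base in child.bases)
                header = f"class {child.name}({bases})" if bases else f"class {child.name}"
            else:
                continue
            indent = "    " * depth
            lines.append(f"{indent}{header}:")
            doc = ast.get_docstring(child)
            if keep_docstrings and doc:
                lines.append(f'{indent}    """{doc.splitlines()[0]}"""')
            visit(child, depth + 1)

    visit(ast.parse(source), 0)
    return "\n".join(lines)

SAMPLE = '''
class Greeter:
    """Says hello."""

    def greet(self, name: str) -> str:
        """Return a greeting.

        Longer explanation that a summary can drop.
        """
        return f"hello {name}"
'''

print(outline(SAMPLE))                        # signatures only
print(outline(SAMPLE, keep_docstrings=True))  # signatures plus first docstring line
```

Function bodies never reach the output, which is where most of the size reduction in a signature-level summary comes from.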
Multi-Language Support
Specialized summarizers for different file types:
- Python: AST-based analysis with import/export preservation
- JavaScript/TypeScript: Function and class extraction
- JSON/YAML: Schema extraction with sample data
- Test Files: Test name and assertion extraction
- Documentation: Heading and key point extraction
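For data files, the idea of keeping the shape while dropping the bulk can be sketched as follows. This is an illustrative assumption about what schema extraction might look like, not src2md's actual summarizer; `describe` and `max_items` are hypothetical names.

```python
import json

def describe(value, max_items: int = 1):
    """Replace leaf values with their type names and shorten long lists."""
    if isinstance(value, dict):
        return {key: describe(item, max_items) for key, item in value.items()}
    if isinstance(value, list):
        sample = [describe(item, max_items) for item in value[:max_items]]
        if len(value) > max_items:
            # Record how many items were dropped instead of listing them.
            sample.append(f"... {len(value) - max_items} more")
        return sample
    return type(value).__name__

raw = '{"name": "demo", "tags": ["a", "b", "c"], "size": 3, "ok": true}'
print(json.dumps(describe(json.loads(raw)), indent=2))
```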
Smart Truncation
When files must be truncated to fit token limits:
- Preserves code structure (complete functions/classes)
- Maintains syntax validity
- Prioritizes public APIs over private methods
- Keeps imports and exports intact
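The rules above can be approximated in a few lines with the standard `ast` module. The sketch below is an assumption-laden illustration rather than src2md's algorithm: `truncate_structurally` and `estimate_tokens` are hypothetical names, and the four-characters-per-token estimate is a crude stand-in for a real tokenizer. It requires Python 3.9+ for `ast.unparse`, which also discards comments.

```python
import ast

def estimate_tokens(text: str) -> int:
    # Crude stand-in (about 4 characters per token); not a real tokenizer.
    return max(1, len(text) // 4)

def truncate_structurally(source: str, budget: int) -> str:
    """Keep imports, then whole top-level statements, public definitions first."""
    chunks = []  # (priority, original position, source text)
    for position, node in enumerate(ast.parse(source).body):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            priority = 0  # imports are always kept
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            priority = 2 if node.name.startswith("_") else 1
        else:
            priority = 3  # other top-level statements go last
        chunks.append((priority, position, ast.unparse(node)))

    kept, used = [], 0
    for priority, position, text in sorted(chunks):
        cost = estimate_tokens(text)
        # A statement is kept whole or dropped whole, never cut in the middle.
        if priority == 0 or used + cost <= budget:
            kept.append((position, text))
            used += cost
    # Emit the survivors in their original order.
    return "\n\n".join(text for _, text in sorted(kept))

SOURCE = (
    "import os\n\n"
    "def _helper():\n    return os.getcwd()\n\n"
    "def public_api(path):\n    return os.path.join(path, 'x')\n"
)
print(truncate_structurally(SOURCE, budget=18))
```

Because whole statements are either kept or dropped, the result always parses, at the cost of sometimes leaving part of the budget unused.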
LLM-Powered Summarization
Optional integration with OpenAI and Anthropic for semantic compression:
# Set API keys
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
# Use LLM summarization
src2md /path/to/project --use-llm --llm-model gpt-3.5-turbo
src2md /path/to/project --use-llm --llm-model claude-3-haiku-20240307
Output Formats
JSON
Structured data perfect for programmatic processing:
{
  "metadata": {
    "project_name": "my-project",
    "generated_at": "2025-01-01T12:00:00",
    "patterns": {...}
  },
  "statistics": {
    "total_files": 42,
    "languages": {"python": {"count": 15, "total_size": 50000}},
    "project_complexity": 3.2
  },
  "documentation": [...],
  "source_files": [...]
}
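As a small example of programmatic processing, the sketch below reads a document with the shape shown above and ranks languages by size. The inline sample fills in only the fields listed in this README, and `language_report` is a hypothetical helper, not part of src2md.

```python
import json

sample = """
{
  "metadata": {"project_name": "my-project", "generated_at": "2025-01-01T12:00:00"},
  "statistics": {
    "total_files": 42,
    "languages": {"python": {"count": 15, "total_size": 50000}},
    "project_complexity": 3.2
  },
  "documentation": [],
  "source_files": []
}
"""

def language_report(document: dict) -> list[str]:
    """Summarize each language, largest total size first."""
    languages = document["statistics"]["languages"]
    ranked = sorted(languages.items(), key=lambda item: item[1]["total_size"], reverse=True)
    return [f"{name}: {info['count']} files, {info['total_size']} bytes" for name, info in ranked]

report = language_report(json.loads(sample))
print("\n".join(report))
```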
JSONL
One JSON object per line - perfect for streaming and big data tools:
{"type": "metadata", "data": {...}}
{"type": "statistics", "data": {...}}
{"type": "source_file", "data": {...}}
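Because each line is an independent JSON object, the output can be consumed one record at a time. The sketch below groups records by their `type` field; `group_records` is a hypothetical helper, and the sample records contain made-up field values that follow the envelope shown above.

```python
import io
import json
from collections import defaultdict

def group_records(lines):
    """Collect each record's data under its type, reading one line at a time."""
    grouped = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        grouped[record["type"]].append(record["data"])
    return dict(grouped)

# An in-memory stream stands in for an open JSONL file.
stream = io.StringIO(
    '{"type": "metadata", "data": {"project_name": "my-project"}}\n'
    '{"type": "source_file", "data": {"path": "a.py"}}\n'
    '{"type": "source_file", "data": {"path": "b.py"}}\n'
)
records = group_records(stream)
print({kind: len(items) for kind, items in records.items()})
```

Passing an open file object instead of the `StringIO` works the same way, since both yield one line per iteration.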
HTML
Beautiful, styled documentation ready for the web with syntax highlighting and responsive design.
Markdown
Clean, readable documentation compatible with GitHub, GitLab, and other platforms.
Advanced Options
File Patterns
# Custom documentation patterns
src2md project --doc-pat '*.md' '*.rst' '*.txt'
# Specific source file types
src2md project --src-pat '*.py' '*.js' '*.ts'
# Ignore patterns
src2md project --ignore-pat '*.pyc' 'node_modules/' '.git/'
Ignore Files
Create a .src2mdignore file in your project root:
# Dependencies
node_modules/
__pycache__/
*.pyc
# Build outputs
dist/
build/
*.egg-info/
# IDE files
.vscode/
.idea/
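The README does not spell out the matching rules for this file, so the following is only a sketch of one plausible interpretation: lines starting with `#` are comments, patterns ending in `/` match directory names anywhere in a path, and other patterns match the file name. `load_patterns` and `is_ignored` are hypothetical helpers built on the standard `fnmatch` module, not src2md functions.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

def load_patterns(text: str) -> list[str]:
    """Return non-empty, non-comment lines from ignore-file text."""
    stripped = (line.strip() for line in text.splitlines())
    return [line for line in stripped if line and not line.startswith("#")]

def is_ignored(path: str, patterns: list[str]) -> bool:
    """Check a relative file path (forward slashes) against the patterns."""
    parts = PurePosixPath(path).parts
    if not parts:
        return False
    for pattern in patterns:
        if pattern.endswith("/"):
            # Directory pattern: compare against every directory component.
            name = pattern.rstrip("/")
            if any(fnmatchcase(part, name) for part in parts[:-1]):
                return True
        elif fnmatchcase(parts[-1], pattern):
            # File pattern: compare against the file name only.
            return True
    return False

IGNORE_TEXT = """\
# Dependencies
node_modules/
__pycache__/
*.pyc
# Build outputs
dist/
"""

patterns = load_patterns(IGNORE_TEXT)
for candidate in ["node_modules/lib/index.js", "src/old.pyc", "src/main.py"]:
    print(candidate, is_ignored(candidate, patterns))
```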
Configuration
# Use custom ignore file
src2md project --ignore-file .gitignore
# Disable statistics
src2md project --no-stats
# Metadata only (no file contents)
src2md project --no-content
Use Cases
- LLM Context: Generate structured context for AI/ML models
- Documentation: Create beautiful project documentation
- Code Analysis: Extract metrics and statistics from codebases
- Data Export: Convert code to structured formats for analysis
- Archive: Create comprehensive snapshots of projects
- CI/CD: Generate documentation automatically in build pipelines
Statistics & Metrics
src2md automatically generates:
- File Metrics: Counts by type and language
- Code Complexity: Cyclomatic complexity scores
- Token Usage: Actual token counts for LLM context
- Compression Stats: Before/after summarization metrics
- Importance Scores: File prioritization rankings
- Language Breakdown: Distribution of code by language
- Structure Analysis: Dependency and module relationships
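For readers unfamiliar with cyclomatic complexity, it is roughly one plus the number of decision points in the code. The sketch below shows a common approximation using the standard `ast` module; it is not necessarily how src2md computes its scores, and `cyclomatic_complexity` is a hypothetical name.

```python
import ast

# Node types that each add one independent path through the code.
BRANCH_NODES = (ast.If, ast.For, ast.AsyncFor, ast.While, ast.ExceptHandler, ast.IfExp)

def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 plus the number of decision points."""
    score = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, BRANCH_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two short-circuit branches.
            score += len(node.values) - 1
        elif isinstance(node, ast.comprehension):
            # One for the loop itself plus one per "if" filter.
            score += 1 + len(node.ifs)
    return score

CODE = '''
def classify(n):
    if n < 0 and n != -1:
        return "negative"
    for i in range(n):
        if i % 2:
            return "odd-ish"
    return "other"
'''
print(cyclomatic_complexity(CODE))
```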
Migration from v0.x
Version 2.1 is backward compatible with v0.x, so existing commands work unchanged:
# This still works exactly as before
src2md project -o docs.md --doc-pat '*.md' --src-pat '*.py'
New features are opt-in through additional flags and the Python API.
License
MIT License - see LICENSE file for details.