# CLAUDE.md
This file provides guidance to Claude Code when working with this repository.
## Development Commands

### Installation

```bash
pip install -r requirements.txt

# Development dependencies
pip install pytest pytest-cov black flake8
```
### Testing

```bash
# Run all tests
python -m pytest

# Run with coverage
python -m coverage run -m pytest
python -m coverage report

# Run a specific test file
python -m pytest tests/test_storage.py -v
```
### Code Quality

```bash
# Format code
python -m black src/ tests/ examples/

# Lint
python -m flake8 src/ tests/

# Pre-commit check
python -m black --check src/ tests/ && python -m flake8 src/ tests/
```
## Architecture Overview

### Design Philosophy

Simplified, production-ready architecture:

- SQLite only (no optional backends)
- In-memory numpy similarity search (O(n) brute force, acceptable for <100k nodes)
- Network topology for intelligent retrieval
- Clean separation of concerns
### Core Components (Redesigned)
#### 1. Storage Layer (src/storage.py)
SQLiteStorage: Single storage backend
- save_node(node_id, content, metadata): Store node with content hash
- save_embedding(node_id, embedding, model_name, config): Store embedding
- load_embeddings(model_name, node_ids): Retrieve embeddings as dict
- get_node(node_id): Get node metadata + content
- search_nodes(filters): Query nodes by metadata
Schema:
- nodes table: id, source_id, node_type, content_hash, content_text, metadata
- embeddings table: node_id, model_name, model_config, embedding (blob), dimension
Key features:

- Content deduplication via SHA256 hashing
- Multiple embeddings per node (different models)
- ACID transactions
- No edge storage (computed on-demand)
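A minimal usage sketch of this API. The constructor call matches the usage example at the end of this file; the numpy dtype, the vector dimension, and the model name are illustrative assumptions:

```python
import numpy as np
from src.storage import SQLiteStorage  # import path assumes the src/ layout above

storage = SQLiteStorage("data.db")
storage.save_node("node1", "some content...", {"type": "doc"})

# Embedding stored as a raw vector; dimension and model name are illustrative only
vector = np.random.rand(384).astype(np.float32)
storage.save_embedding("node1", vector, model_name="example-model", config={})

embeddings = storage.load_embeddings("example-model", ["node1"])  # {node_id: np.ndarray}
```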
#### 2. Embedding Layer (src/embeddings.py)
EmbeddingProvider (abstract):
- embed(texts) -> np.ndarray
- get_dimension() -> int
- get_model_name() -> str
- get_model_config() -> dict
Implementations:
- OllamaEmbedding: Remote Ollama server
- SentenceTransformerEmbedding: Local sentence-transformers
- TFIDFEmbedding: Fast TF-IDF (no external deps)
WeightedEmbeddingConfig: For multi-part content
- Weights: {"user": 1.5, "assistant": 1.0} (weight roles differently)
- Aggregation: weighted_avg, concat, max_pool
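A hedged sketch of the abstract interface and the weighted_avg aggregation. The method names follow the listing above; the type hints and the helper function are illustrative, not the project's exact code:

```python
from abc import ABC, abstractmethod
import numpy as np

class EmbeddingProvider(ABC):
    """Sketch of the abstract provider; method names follow the listing above."""

    @abstractmethod
    def embed(self, texts: list[str]) -> np.ndarray:
        """Return an array of shape (len(texts), dimension)."""

    @abstractmethod
    def get_dimension(self) -> int: ...

    @abstractmethod
    def get_model_name(self) -> str: ...

    @abstractmethod
    def get_model_config(self) -> dict: ...

def weighted_avg(parts: dict[str, np.ndarray], weights: dict[str, float]) -> np.ndarray:
    """Illustrative weighted_avg aggregation, e.g. weights={"user": 1.5, "assistant": 1.0}."""
    total = sum(weights.get(role, 1.0) for role in parts)
    return sum(weights.get(role, 1.0) * vec for role, vec in parts.items()) / total
```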
#### 3. Network Analysis (src/network.py)
NetworkBuilder: Constructs similarity graphs
- build_network(embeddings, min_similarity): Create NetworkX graph
- Uses vectorized cosine similarity (sklearn)
- Only creates edges where similarity ≥ threshold
- Returns sparse graph (O(n log n) edges for threshold=0.7)
NetworkAnalyzer: Topology analysis
- detect_communities(): Louvain algorithm
- get_bridge_nodes(min_betweenness): High-betweenness nodes
- get_hub_nodes(min_degree): High-degree nodes
- get_neighbors(node_id, k_hops): k-hop neighborhood
- get_community_for_node(node_id): Community membership
- sample_community_nodes(community_id, n): Sample from community
- auto_tag_community(community_id): TF-IDF auto-tagging
#### 4. Retrieval (src/retrieval.py)
RetrievalStrategy (abstract):
- retrieve(query_embedding, embeddings_dict, graph, n) -> List[node_id]
Implementations:
- SimilarityRetrieval: Pure cosine similarity (traditional RAG)
- CommunityRetrieval: Community-aware retrieval
- BridgeRetrieval: Cross-domain via bridges
- HubRetrieval: Versatile knowledge via hubs
- HybridRetrieval: Combines all strategies
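A hedged sketch of the pure-similarity strategy. The retrieve signature matches the abstract interface above; the internals are an assumption:

```python
import numpy as np

class SimilarityRetrieval:
    """Sketch: rank nodes by cosine similarity to the query (traditional RAG)."""

    def retrieve(self, query_embedding, embeddings_dict, graph, n):
        node_ids = list(embeddings_dict.keys())
        matrix = np.vstack([embeddings_dict[nid] for nid in node_ids])
        # Cosine similarity of the query against every stored embedding
        norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_embedding)
        scores = matrix @ query_embedding / np.clip(norms, 1e-12, None)
        top = np.argsort(scores)[::-1][:n]
        return [node_ids[i] for i in top]
```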
#### 5. Main Interface (src/core.py)

NetworkRAG: Main user-facing class

```python
def __init__(storage, embedding_provider, min_similarity=0.7)
def add_node(node_id, content, metadata)
def build_network(rebuild=False)
def detect_communities()
def find_similar(query, n=10, strategy="hybrid")
def auto_tag_community(community_id, n_samples=20)
def get_bridges_for_community(community_id)
def get_hubs_for_community(community_id, min_degree=5)
def get_neighbors(node_id, k_hops=1)
```
## Implementation Phases

Current Status: Phase 0 (research/demo code)

Phase 1 (Week 1): Storage layer
- SQLite schema + CRUD operations
- Migration from in-memory to persistent storage
- Transaction safety

Phase 2 (Week 2): Network analysis
- Community utilities (sample, tag)
- Bridge/hub identification
- Neighbor exploration

Phase 3 (Week 3): Advanced features
- Weighted embeddings
- Multi-strategy retrieval
- Incremental updates

Phase 4 (Week 4): Production polish
- Error handling, logging
- Concurrent access
- Documentation
## Key Algorithms

Network Construction (O(n²) similarity, O(n log n) edges):

```python
from itertools import combinations
from sklearn.metrics.pairwise import cosine_similarity

sim_matrix = cosine_similarity(embeddings)  # sklearn, vectorized
for i, j in combinations(range(len(embeddings)), 2):  # all unordered pairs
    if sim_matrix[i, j] >= threshold:
        graph.add_edge(i, j, weight=sim_matrix[i, j])  # graph: networkx.Graph
```
Community Detection: Louvain algorithm (networkx)
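With NetworkX this is essentially one call. A sketch using the graph built above; detect_communities() presumably wraps it:

```python
import networkx as nx

# Louvain partition of the similarity graph (available in networkx >= 2.8)
communities = nx.community.louvain_communities(graph, weight="weight", seed=42)
node_to_community = {node: i for i, members in enumerate(communities) for node in members}
```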
Auto-tagging: TF-IDF on sampled community nodes

```python
from sklearn.feature_extraction.text import TfidfVectorizer

samples = sample_community_nodes(community_id, n=20)
tfidf = TfidfVectorizer(max_features=100, stop_words='english')
vectors = tfidf.fit_transform([node.content for node in samples])
mean_weights = vectors.mean(axis=0).A1  # average TF-IDF weight per term
terms = tfidf.get_feature_names_out()
top_terms = [terms[i] for i in mean_weights.argsort()[::-1][:5]]  # top 5 terms
```
## Similarity Thresholds
Critical configuration:
- min_similarity = 0.7: Recommended for <10k nodes
- min_similarity = 0.8: Recommended for >10k nodes
- strong_similarity = 0.8: Marks high-confidence edges (future use)
Edge complexity (see the density-check sketch below):

- 0.7 → ~3% density → O(n log n) edges
- 0.8 → ~0.5% density → O(n) edges
- 0.9 → ~0.05% density → very sparse
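A small sketch for measuring actual edge density at a candidate threshold before settling on min_similarity (the percentages above are approximations):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def edge_density(embeddings: np.ndarray, threshold: float) -> float:
    """Fraction of node pairs whose cosine similarity clears the threshold."""
    sim = cosine_similarity(embeddings)
    n = sim.shape[0]
    if n < 2:
        return 0.0
    iu = np.triu_indices(n, k=1)              # upper triangle: each pair counted once
    above = int((sim[iu] >= threshold).sum())
    return above / (n * (n - 1) / 2)
```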
## Testing Strategy

Unit tests (a minimal example follows at the end of this section):

- Storage: CRUD operations, transactions, deduplication
- Embeddings: Provider interface compliance
- Network: Graph construction, community detection
- Retrieval: Strategy correctness

Integration tests:

- End-to-end: add nodes → build network → retrieve
- CTK integration example
- Performance benchmarks (1k, 10k nodes)
Coverage goal: >80% for core modules
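A hedged example of a storage unit test; the import path and constructor follow the layout and usage shown elsewhere in this file, and tmp_path is pytest's built-in temporary-directory fixture:

```python
from src.storage import SQLiteStorage  # path assumes the src/ layout above

def test_node_roundtrip(tmp_path):
    # tmp_path is pytest's built-in temporary-directory fixture
    storage = SQLiteStorage(str(tmp_path / "test.db"))
    storage.save_node("n1", "hello world", {"type": "doc"})
    assert storage.get_node("n1") is not None
```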
## Performance Notes

Brute-force pairwise similarity (full network construction) is acceptable at these scales:

- 1k nodes: <1s
- 10k nodes: ~20s
- 100k nodes: ~30min (use incremental updates or batching)

Optimization priorities (Phase 4):

1. Caching similarity computations
2. Incremental network updates (add a single node; see the sketch below)
3. Batch operations for imports
4. Optional: FAISS approximate search for >100k nodes
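A hedged sketch of the incremental update idea from item 2: compare only the new node's embedding against existing embeddings instead of rebuilding the whole network. Names are illustrative, not the project's API:

```python
import numpy as np
import networkx as nx

def add_node_incrementally(graph: nx.Graph, embeddings_dict: dict, node_id, embedding,
                           threshold: float = 0.7) -> None:
    """Attach one new node to an existing similarity graph (illustrative only)."""
    graph.add_node(node_id)
    for other_id, other_vec in embeddings_dict.items():
        sim = float(embedding @ other_vec /
                    (np.linalg.norm(embedding) * np.linalg.norm(other_vec)))
        if sim >= threshold:
            graph.add_edge(node_id, other_id, weight=sim)
    embeddings_dict[node_id] = embedding
```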
## Code Style
- Black formatting (line length 88)
- Type hints where helpful (not mandatory)
- Docstrings for public methods
- Comments for non-obvious algorithms
## Example Usage Pattern

```python
# Imports follow the src/ layout described above
from src.storage import SQLiteStorage
from src.embeddings import OllamaEmbedding
from src.core import NetworkRAG

# Setup
storage = SQLiteStorage("data.db")
embedder = OllamaEmbedding(host="http://localhost:11434")
rag = NetworkRAG(storage, embedder, min_similarity=0.7)

# Add data
rag.add_node("node1", "content...", {"type": "doc"})

# Build network
rag.build_network()
communities = rag.detect_communities()

# Retrieve
results = rag.find_similar("query", n=10, strategy="hybrid")

# Analysis
tags = rag.auto_tag_community(0)
bridges = rag.get_bridges_for_community(0)
hubs = rag.get_hubs_for_community(0)
```