
CLAUDE.md

This file provides guidance to Claude Code when working with this repository.

Development Commands

Installation

pip install -r requirements.txt

# Development dependencies
pip install pytest pytest-cov black flake8

Testing

# Run all tests
python -m pytest

# Run with coverage
python -m coverage run -m pytest
python -m coverage report

# Run specific test
python -m pytest tests/test_storage.py -v

Code Quality

# Format code
python -m black src/ tests/ examples/

# Lint
python -m flake8 src/ tests/

# Pre-commit check
python -m black --check src/ tests/ && python -m flake8 src/ tests/

Architecture Overview

Design Philosophy

Simplified, production-ready architecture:

  • SQLite only (no optional backends)
  • In-memory numpy similarity search (O(n) brute force, acceptable for <100k nodes)
  • Network topology for intelligent retrieval
  • Clean separation of concerns

Core Components (Redesigned)

1. Storage Layer (src/storage.py)

SQLiteStorage: Single storage backend

  • save_node(node_id, content, metadata): Store node with content hash
  • save_embedding(node_id, embedding, model_name, config): Store embedding
  • load_embeddings(model_name, node_ids): Retrieve embeddings as dict
  • get_node(node_id): Get node metadata + content
  • search_nodes(filters): Query nodes by metadata

Schema:

  • nodes table: id, source_id, node_type, content_hash, content_text, metadata
  • embeddings table: node_id, model_name, model_config, embedding (blob), dimension
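
A minimal sketch of how these two tables might be created, assuming the columns listed above; exact types, constraints, and JSON/blob serialization are assumptions, not the repository's actual schema. The composite primary key on embeddings reflects the "multiple embeddings per node" feature.

import sqlite3

conn = sqlite3.connect("data.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS nodes (
    id            TEXT PRIMARY KEY,
    source_id     TEXT,
    node_type     TEXT,
    content_hash  TEXT,
    content_text  TEXT,
    metadata      TEXT           -- JSON-encoded
);
CREATE TABLE IF NOT EXISTS embeddings (
    node_id       TEXT REFERENCES nodes(id),
    model_name    TEXT,
    model_config  TEXT,          -- JSON-encoded
    embedding     BLOB,          -- raw numpy bytes
    dimension     INTEGER,
    PRIMARY KEY (node_id, model_name)
);
""")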

Key features:

  • Content deduplication via SHA256 hashing
  • Multiple embeddings per node (different models)
  • ACID transactions
  • No edge storage (computed on-demand)
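
A hedged sketch of how SHA256 content hashing might feed into save_node; the real method lives in src/storage.py and may differ in how it handles duplicates.

import hashlib
import json

def save_node(conn, node_id, content, metadata):
    # Identical content always yields the same SHA256 digest,
    # so duplicates can be detected by comparing content_hash values
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    conn.execute(
        "INSERT OR REPLACE INTO nodes (id, content_hash, content_text, metadata) "
        "VALUES (?, ?, ?, ?)",
        (node_id, digest, content, json.dumps(metadata)),
    )
    conn.commit()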

2. Embedding Layer (src/embeddings.py)

EmbeddingProvider (abstract):

  • embed(texts) -> np.ndarray
  • get_dimension() -> int
  • get_model_name() -> str
  • get_model_config() -> dict
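
A minimal sketch of this abstract interface; concrete providers subclass it and fill in the methods. Bodies here are placeholders, not the repository's implementation.

from abc import ABC, abstractmethod
import numpy as np

class EmbeddingProvider(ABC):
    @abstractmethod
    def embed(self, texts: list[str]) -> np.ndarray:
        """Return an array of shape (len(texts), dimension)."""

    @abstractmethod
    def get_dimension(self) -> int: ...

    @abstractmethod
    def get_model_name(self) -> str: ...

    @abstractmethod
    def get_model_config(self) -> dict: ...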

Implementations:

  • OllamaEmbedding: Remote Ollama server
  • SentenceTransformerEmbedding: Local sentence-transformers
  • TFIDFEmbedding: Fast TF-IDF (no external deps)

WeightedEmbeddingConfig: For multi-part content

  • Weights: {"user": 1.5, "assistant": 1.0} (weight roles differently)
  • Aggregation: weighted_avg, concat, max_pool
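
A sketch of the weighted_avg aggregation for multi-part content, assuming each part is embedded separately and then combined by its role weight. The function name and signature are illustrative, not the library's API.

import numpy as np

def weighted_avg(parts: dict[str, str], weights: dict[str, float], embedder) -> np.ndarray:
    # Embed each part (e.g. "user", "assistant") and average by role weight
    roles = list(parts)
    vectors = embedder.embed([parts[r] for r in roles])    # (n_parts, dim)
    w = np.array([weights.get(r, 1.0) for r in roles])     # per-role weights
    return (vectors * w[:, None]).sum(axis=0) / w.sum()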

3. Network Analysis (src/network.py)

NetworkBuilder: Constructs similarity graphs

  • build_network(embeddings, min_similarity): Create NetworkX graph
  • Uses vectorized cosine similarity (sklearn)
  • Only creates edges where similarity ≥ threshold
  • Returns sparse graph (O(n log n) edges for threshold=0.7)

NetworkAnalyzer: Topology analysis

  • detect_communities(): Louvain algorithm
  • get_bridge_nodes(min_betweenness): High-betweenness nodes
  • get_hub_nodes(min_degree): High-degree nodes
  • get_neighbors(node_id, k_hops): k-hop neighborhood
  • get_community_for_node(node_id): Community membership
  • sample_community_nodes(community_id, n): Sample from community
  • auto_tag_community(community_id): TF-IDF auto-tagging
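
These methods map onto standard NetworkX primitives. A sketch of the likely building blocks; the wrapper function, the cutoff values, and the example node id are assumptions.

import networkx as nx
from networkx.algorithms import community

def analyze(graph: nx.Graph):
    # Louvain community detection (requires networkx >= 2.8)
    communities = community.louvain_communities(graph, weight="weight", seed=42)

    # Bridge nodes: high betweenness centrality (cutoff is illustrative)
    betweenness = nx.betweenness_centrality(graph)
    bridges = [n for n, b in betweenness.items() if b >= 0.05]

    # Hub nodes: high degree (min_degree=5 mirrors the default used elsewhere)
    hubs = [n for n, d in graph.degree() if d >= 5]

    # k-hop neighborhood of a single node
    neighbors = list(nx.ego_graph(graph, "node1", radius=2).nodes())

    return communities, bridges, hubs, neighbors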

4. Retrieval (src/retrieval.py)

RetrievalStrategy (abstract):

  • retrieve(query_embedding, embeddings_dict, graph, n) -> List[node_id]

Implementations:

  • SimilarityRetrieval: Pure cosine similarity (traditional RAG)
  • CommunityRetrieval: Community-aware retrieval
  • BridgeRetrieval: Cross-domain via bridges
  • HubRetrieval: Versatile knowledge via hubs
  • HybridRetrieval: Combines all strategies
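
A sketch of the simplest strategy, SimilarityRetrieval, under the interface above; argument handling and tie-breaking are assumptions.

import numpy as np

class SimilarityRetrieval:
    def retrieve(self, query_embedding, embeddings_dict, graph, n):
        node_ids = list(embeddings_dict)
        matrix = np.stack([embeddings_dict[i] for i in node_ids])   # (N, dim)

        # Cosine similarity between the query and every stored embedding
        q = query_embedding / np.linalg.norm(query_embedding)
        m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
        scores = m @ q

        top = np.argsort(scores)[::-1][:n]          # highest similarity first
        return [node_ids[i] for i in top]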

5. Main Interface (src/core.py)

NetworkRAG: Main user-facing class

class NetworkRAG:
    def __init__(self, storage, embedding_provider, min_similarity=0.7): ...
    def add_node(self, node_id, content, metadata): ...
    def build_network(self, rebuild=False): ...
    def detect_communities(self): ...
    def find_similar(self, query, n=10, strategy="hybrid"): ...
    def auto_tag_community(self, community_id, n_samples=20): ...
    def get_bridges_for_community(self, community_id): ...
    def get_hubs_for_community(self, community_id, min_degree=5): ...
    def get_neighbors(self, node_id, k_hops=1): ...

Implementation Phases

Current Status: Phase 0 (research/demo code)

Phase 1 (Week 1): Storage layer

  • SQLite schema + CRUD operations
  • Migration from in-memory to persistent storage
  • Transaction safety

Phase 2 (Week 2): Network analysis

  • Community utilities (sample, tag)
  • Bridge/hub identification
  • Neighbor exploration

Phase 3 (Week 3): Advanced features

  • Weighted embeddings
  • Multi-strategy retrieval
  • Incremental updates

Phase 4 (Week 4): Production polish

  • Error handling, logging
  • Concurrent access
  • Documentation

Key Algorithms

Network Construction (O(n²) similarity, O(n log n) edges):

import networkx as nx, numpy as np
from sklearn.metrics.pairwise import cosine_similarity

sim_matrix = cosine_similarity(embeddings)  # sklearn, vectorized
graph = nx.Graph()
for i, j in zip(*np.triu_indices(len(embeddings), k=1)):  # all pairs i < j
    if sim_matrix[i, j] >= threshold:
        graph.add_edge(int(i), int(j), weight=float(sim_matrix[i, j]))

Community Detection: Louvain algorithm (networkx)

Auto-tagging: TF-IDF on sampled community nodes

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

samples = sample_community_nodes(community_id, n=20)
tfidf = TfidfVectorizer(max_features=100, stop_words='english')
vectors = tfidf.fit_transform([node.content for node in samples])
mean_weights = np.asarray(vectors.mean(axis=0)).ravel()   # mean TF-IDF per term
top_terms = tfidf.get_feature_names_out()[np.argsort(mean_weights)[::-1][:5]]

Similarity Thresholds

Critical configuration:

  • min_similarity = 0.7: Recommended for <10k nodes
  • min_similarity = 0.8: Recommended for >10k nodes
  • strong_similarity = 0.8: Marks high-confidence edges (future use)

Edge complexity:

  • 0.7 → ~3% density → O(n log n) edges
  • 0.8 → ~0.5% density → O(n) edges
  • 0.9 → ~0.05% density → very sparse

Testing Strategy

Unit tests:

  • Storage: CRUD operations, transactions, deduplication
  • Embeddings: Provider interface compliance
  • Network: Graph construction, community detection
  • Retrieval: Strategy correctness

Integration tests:

  • End-to-end: add nodes → build network → retrieve
  • CTK integration example
  • Performance benchmarks (1k, 10k nodes)
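
A sketch of the kind of storage round-trip test implied above; the import path and the shape of get_node's return value are assumptions.

from src.storage import SQLiteStorage

def test_node_round_trip(tmp_path):
    storage = SQLiteStorage(str(tmp_path / "test.db"))
    storage.save_node("n1", "hello world", {"type": "doc"})
    node = storage.get_node("n1")
    assert node is not None  # content + metadata should round-trip through SQLite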

Coverage goal: >80% for core modules

Performance Notes

Acceptable O(n) brute-force similarity:

  • 1k nodes: <1s
  • 10k nodes: ~20s
  • 100k nodes: ~30min (use incremental updates or batching)

Optimization priorities (Phase 4):

  1. Caching similarity computations
  2. Incremental network updates (add single node; see the sketch below)
  3. Batch operations for imports
  4. Optional: FAISS approximate search for >100k nodes
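
A hedged sketch of an incremental update (priority 2): when one node is added, only its similarities against existing embeddings need to be computed, avoiding a full O(n²) rebuild. Function and variable names are illustrative.

import numpy as np

def add_node_incremental(graph, embeddings_dict, node_id, embedding, threshold=0.7):
    graph.add_node(node_id)
    if embeddings_dict:
        ids = list(embeddings_dict)
        matrix = np.stack([embeddings_dict[i] for i in ids])
        # Cosine similarity of the new node against all existing nodes: O(n), not O(n^2)
        sims = (matrix @ embedding) / (
            np.linalg.norm(matrix, axis=1) * np.linalg.norm(embedding)
        )
        for other, sim in zip(ids, sims):
            if sim >= threshold:
                graph.add_edge(node_id, other, weight=float(sim))
    embeddings_dict[node_id] = embedding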

Code Style

  • Black formatting (line length 88)
  • Type hints where helpful (not mandatory)
  • Docstrings for public methods
  • Comments for non-obvious algorithms

Example Usage Pattern

# Setup (import paths assume the src/ module layout described above)
from src.core import NetworkRAG
from src.embeddings import OllamaEmbedding
from src.storage import SQLiteStorage

storage = SQLiteStorage("data.db")
embedder = OllamaEmbedding(host="http://localhost:11434")
rag = NetworkRAG(storage, embedder, min_similarity=0.7)

# Add data
rag.add_node("node1", "content...", {"type": "doc"})

# Build network
rag.build_network()
communities = rag.detect_communities()

# Retrieve
results = rag.find_similar("query", n=10, strategy="hybrid")

# Analysis
tags = rag.auto_tag_community(0)
bridges = rag.get_bridges_for_community(0)
hubs = rag.get_hubs_for_community(0)