complex-network-rag
Topology-aware RAG using complex network analysis. Features community detection, hub/bridge identification, and a YAML DSL for configuring field embeddings and similarity computation.
Complex Network RAG
Topology-Aware Retrieval for Smarter Knowledge Graphs
Complex Network RAG is a production-ready retrieval-augmented generation (RAG) system that uses network topology and structured similarity for intelligent document retrieval. Unlike traditional RAG systems that rely solely on embedding similarity, Complex Network RAG leverages graph structure to discover communities, identify knowledge hubs, and find conceptual bridges between domains.
Key Features
- Structured Similarity: Field-specific embeddings for documents (title, abstract, tags, etc.)
- Hybrid Linkage: Combine semantic embeddings with attribute similarity (Jaccard, Dice, exact match)
- Network Topology: Leverage communities, hubs, and bridges for intelligent retrieval
- YAML DSL: Declarative configuration for complex similarity strategies
- Three Interfaces: Fluent API, CLI, and interactive REPL
- Production Ready: SQLite storage, comprehensive testing (755+ tests), proven at scale
System Architecture
┌────────────────────────────────────────────────────────────────┐
│                     Three User Interfaces                      │
├───────────────────┬─────────────────────┬──────────────────────┤
│    Fluent API     │    CLI Commands     │   Interactive REPL   │
│     (Python)      │   (Bash/Scripts)    │    (Exploration)     │
└─────────┬─────────┴──────────┬──────────┴───────────┬──────────┘
          │                    │                      │
          └────────────────────┼──────────────────────┘
                               │
                      ┌────────▼────────┐
                      │    YAML DSL     │◀── Core Abstraction
                      │  (Declarative   │
                      │ Configuration)  │
                      └────────┬────────┘
                               │
          ┌────────────────────┼────────────────────┐
          │                    │                    │
     ┌────▼─────┐       ┌──────▼──────┐      ┌──────▼──────┐
     │Structured│       │   Network   │      │   Storage   │
     │Similarity│       │  Analysis   │      │    Layer    │
     │• Field   │       │• Communities│      │• SQLite     │
     │  Embed   │       │• Hubs       │      │• Embeddings │
     │• Attrib  │       │• Bridges    │      │• Metadata   │
     │  Sim     │       │• Topology   │      └─────────────┘
     └──────────┘       └─────────────┘
          │                    │
          └─────────┬──────────┘
                    │
            ┌───────▼────────┐
            │   Knowledge    │
            │     Graph      │
            │   (NetworkX)   │
            └───────┬────────┘
                    │
            ┌───────▼────────┐
            │   Retrieval    │
            │   Strategies   │
            │• Topology-aware│
            │• Context-rich  │
            └────────────────┘
Why Complex Network RAG?
Traditional RAG systems treat documents as isolated points in embedding space. Complex Network RAG recognizes that knowledge forms a network with meaningful structure:
- Communities cluster related knowledge (e.g., all ML papers, all medical documents)
- Hubs connect multiple domains (e.g., “attention mechanism” links NLP and vision)
- Bridges enable cross-domain transfer (e.g., finding analogies between fields)
- Neighborhoods capture local context beyond simple similarity
This topology-aware approach enables:
- Better context through community-aware retrieval
- Cross-domain insights via bridge nodes
- Robust retrieval that considers both content and structure
- Explainable results based on network position
Quick Start
Installation
# Clone repository
git clone https://github.com/queelius/complex-network-rag.git
cd complex-network-rag
# Install dependencies
pip install -r requirements.txt
# Optional: Install in development mode
pip install -e .
Three Ways to Use Complex Network RAG
1. YAML DSL (Recommended for Configuration)
Define your similarity strategy declaratively:
# config/papers.yaml
document:
  fields:
    - name: title
      type: text
      embed: true
    - name: abstract
      type: text
      embed: true
    - name: tags
      type: list
similarity:
  components:
    - type: field_embedding
      field: title
      model: tfidf
      weight: 0.3
    - type: field_embedding
      field: abstract
      model: tfidf
      weight: 0.5
      chunking:
        method: sentences
        max_tokens: 512
    - type: attribute_similarity
      field: tags
      metric: jaccard
      weight: 0.2
  min_combined_similarity: 0.4
Use it:
from src.network_rag import NetworkRAG
# Load from config
rag = NetworkRAG.builder().from_config('config/papers.yaml').build()
# Add documents
rag.add('paper1', document={
    'title': 'Attention Is All You Need',
    'abstract': 'The dominant sequence transduction models...',
    'tags': ['transformers', 'attention', 'nlp']
})
# Build network and search
rag.build_network()
results = rag.search('transformer architecture').top(10)
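The `chunking` entry in the configuration above asks for sentence-based chunks of at most 512 tokens. How the library implements this is not shown in this README, so the following is only a minimal sketch of one plausible reading. It assumes whitespace-separated words stand in for tokens and uses a simple regex sentence splitter; the function name `chunk_by_sentences` is hypothetical.

```python
import re

def chunk_by_sentences(text, max_tokens=512):
    """Greedily pack whole sentences into chunks of at most max_tokens words.

    Assumption: a "token" is a whitespace-separated word; a real tokenizer
    would count differently. A single sentence longer than max_tokens is
    kept whole rather than split mid-sentence.
    """
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

abstract = "Transformers rely on attention. They avoid recurrence. Training is parallel."
chunks = chunk_by_sentences(abstract, max_tokens=8)
print(chunks)
```

With a budget of 8 words the first two sentences share a chunk and the third starts a new one. Each chunk would then be embedded separately, which is presumably how a long abstract is kept within the embedding model's limits.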
2. Interactive REPL (Great for Exploration)
# Start REPL
network-rag repl
# Interactive session
[no db]> db connect papers.db
✓ Connected to papers.db
papers.db> add "Attention Is All You Need - introducing transformers"
✓ Added document: abc123
papers.db [1 docs]> build
✓ Network built: 1 nodes, 0 edges
papers.db [1 docs]> search "transformers"
Search: transformers
Found 1 results:
1. abc123 (score: 0.892)
Attention Is All You Need - introducing transformers
See examples/repl_demo.md for the complete guide.
3. Fluent API (Powerful Python Interface)
from src.network_rag import NetworkRAG
# Build with fluent API
rag = (NetworkRAG.builder()
       .with_storage('knowledge.db')
       .with_tfidf_embeddings(max_features=512)
       .with_similarity_threshold(min_similarity=0.7)
       .from_config('config/papers.yaml')
       .build())
# Add documents in a batch
with rag.batch() as batch:
    batch.add("Document 1...", metadata={'category': 'ML'})
    batch.add("Document 2...", metadata={'category': 'NLP'})
    batch.add("Document 3...", metadata={'category': 'Vision'})
# Advanced queries
results = (rag.search("neural networks")
           .with_strategy("community")
           .filter(category="ML")
           .expand_neighbors(hops=2)
           .prioritize_hubs()
           .top(10))
# Rich result objects
for result in results:
    print(f"{result.id}: {result.score:.3f}")
    print(f"  Community: {result.community_id}")
    print(f"  Content: {result.content[:100]}...")
See Fluent API Guide for complete API reference.
Core Concepts
Structured Similarity
Instead of embedding entire documents, Complex Network RAG embeds individual fields:
# Different fields can have different strategies
title: TF-IDF embedding (semantic, exact titles matter)
abstract: TF-IDF with sentence chunking (semantic, handles long text)
tags: Jaccard similarity (set overlap, no embedding needed)
authors: Exact match (boolean, either same or not)
This gives you:
- Fine-grained control over what matters
- Better performance (no need to embed everything)
- Hybrid strategies combining semantics + metadata
- Explainable scores (see contribution of each component)
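To make the weighted, per-field scoring concrete, here is a minimal sketch that combines field similarities with the weights from the example config (title 0.3, abstract 0.5, tags 0.2). It is not the library's implementation: raw term-count cosine stands in for TF-IDF embeddings, and the names `cosine`, `jaccard`, `dice`, `COMPONENTS`, and `combined_similarity` are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two texts using raw term counts.

    Assumption: plain term counts stand in for the TF-IDF vectors the
    library would use; the weighting differs but the idea is the same.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def jaccard(a, b):
    """Set overlap: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def dice(a, b):
    """Dice coefficient: 2|A & B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

# (field, similarity function, weight), mirroring the example config.
COMPONENTS = [
    ("title", cosine, 0.3),
    ("abstract", cosine, 0.5),
    ("tags", jaccard, 0.2),
]

def combined_similarity(doc_a, doc_b, components=COMPONENTS):
    """Return (total score, per-field weighted contributions)."""
    parts = {field: weight * fn(doc_a[field], doc_b[field])
             for field, fn, weight in components}
    return sum(parts.values()), parts

a = {"title": "attention is all you need",
     "abstract": "sequence transduction with attention",
     "tags": ["transformers", "attention", "nlp"]}
b = {"title": "attention is all you need",
     "abstract": "sequence transduction with attention",
     "tags": ["transformers", "pretraining", "nlp"]}
score, parts = combined_similarity(a, b)
print(round(score, 3), {k: round(v, 3) for k, v in parts.items()})
```

The `parts` dictionary is what makes the score explainable: each field's weighted contribution is visible on its own. Swapping `jaccard` for `dice` in `COMPONENTS` changes only the tag term, and an edge would be kept when the total meets a threshold such as `min_combined_similarity`.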
Network Topology
Documents form a similarity graph where edges connect similar documents. This enables:
Communities: Clusters of related knowledge
communities = rag.detect_communities()
# Community 0: ML papers
# Community 1: Medical documents
# Community 2: Legal texts
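This README does not say which algorithm `detect_communities()` uses. Many community detection methods score a candidate partition by its modularity, so a small worked calculation may help build intuition for what a good split looks like. The toy graph, node names, and the `modularity` helper below are illustrative assumptions, not part of the library.

```python
def modularity(edges, communities):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities c of (L_c / m) - (d_c / (2m))^2, where m is the
    number of edges, L_c the edges inside c, and d_c the total degree of c.
    """
    m = len(edges)
    label = {node: i for i, group in enumerate(communities) for node in group}
    inside = [0] * len(communities)   # L_c: edges with both ends in c
    degree = [0] * len(communities)   # d_c: sum of degrees of nodes in c
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            inside[label[u]] += 1
    return sum(inner / m - (d / (2 * m)) ** 2
               for inner, d in zip(inside, degree))

# Two triangles joined by a single edge (c-d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
split = modularity(edges, [{"a", "b", "c"}, {"d", "e", "f"}])
merged = modularity(edges, [{"a", "b", "c", "d", "e", "f"}])
print(round(split, 3), round(merged, 3))
```

Splitting the graph into its two triangles scores higher than lumping everything together, which is the signal a modularity-based detector would follow.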
Hubs: High-degree nodes connecting domains
hubs = rag.get_hub_nodes(min_degree=10)
# Node "attention-mechanisms" connects NLP, Vision, and Speech
Bridges: High-betweenness nodes enabling cross-domain transfer
bridges = rag.get_bridge_nodes(min_betweenness=0.1)
# Node "optimization-theory" bridges ML and Operations Research
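The difference between a hub (high degree) and a bridge (high betweenness) is easier to see on a toy graph. The sketch below uses only the standard library: degree is counted directly and betweenness is computed with Brandes' algorithm for unweighted graphs. The library itself is built on NetworkX, so this is a stand-in for illustration; the graph and the helper names `build_adjacency` and `betweenness` are assumptions.

```python
from collections import deque

def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def betweenness(adj):
    """Normalised betweenness centrality (Brandes, unweighted, undirected)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        order, preds = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    n = len(adj)
    # Every unordered pair was visited from both ends, so dividing by
    # (n-1)(n-2) gives the usual normalised value for undirected graphs.
    scale = (n - 1) * (n - 2)
    return {v: (b / scale if scale else 0.0) for v, b in score.items()}

# A star around "h", a triangle p-q-r, and "x" linking the two groups.
edges = [("h", "a"), ("h", "b"), ("h", "c"), ("h", "d"),
         ("d", "x"), ("x", "p"),
         ("p", "q"), ("q", "r"), ("p", "r")]
adj = build_adjacency(edges)
bc = betweenness(adj)
hubs = sorted(v for v in adj if len(adj[v]) >= 4)
bridges = sorted(v for v, b in bc.items() if b >= 0.5)
print(hubs, bridges)
```

Node `x` has only two neighbours, so a degree cut-off never selects it, yet every path between the two groups runs through it and its betweenness is high. That is the distinction the `min_degree` and `min_betweenness` parameters above appear to capture.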
Retrieval Strategies
Complex Network RAG offers multiple retrieval strategies:
- similarity: Traditional cosine similarity (baseline)
- community: Prefer documents in same community (coherence)
- hub: Prioritize high-degree nodes (versatile knowledge)
- bridge: Emphasize cross-domain connectors (transfer learning)
- hybrid: Intelligent combination of all strategies (recommended)
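One way to picture how these strategies differ is as re-ranking rules applied to the same candidate list. The sketch below is an assumption about the general idea, not the library's scoring: the `rerank` function, the candidate fields, and the boost values are all hypothetical, and the bridge strategy is left out (it would add a betweenness term in the same way the hub strategy adds a degree term).

```python
def rerank(candidates, strategy="similarity", query_community=None,
           community_boost=0.2, hub_boost=0.1):
    """Re-rank candidates under a named strategy.

    Each candidate is a dict with 'id', 'score' (base similarity in [0, 1]),
    'community' (an integer label) and 'degree' (number of graph neighbours).
    The boost values are arbitrary illustrative choices.
    """
    max_degree = max((c["degree"] for c in candidates), default=0) or 1

    def adjusted(c):
        score = c["score"]
        # Community-aware: reward documents in the query's community.
        if strategy in ("community", "hybrid") and c["community"] == query_community:
            score += community_boost
        # Hub-aware: reward well-connected documents, scaled to [0, hub_boost].
        if strategy in ("hub", "hybrid"):
            score += hub_boost * c["degree"] / max_degree
        return score

    return sorted(candidates, key=adjusted, reverse=True)

candidates = [
    {"id": "p1", "score": 0.80, "community": 1, "degree": 2},
    {"id": "p2", "score": 0.70, "community": 0, "degree": 3},
    {"id": "p3", "score": 0.65, "community": 0, "degree": 10},
]
for name in ("similarity", "community", "hub", "hybrid"):
    ranked = rerank(candidates, strategy=name, query_community=0)
    print(name, [c["id"] for c in ranked])
```

The same three candidates come out in a different order under each strategy, which shows why the choice of strategy matters even when the base similarity scores are fixed.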
Project Structure
complex-network-rag/
├── src/                          # Core source code
│   ├── network_rag.py            # Main NetworkRAG class
│   ├── fluent.py                 # Fluent API builders
│   ├── cli.py                    # Command-line interface
│   ├── repl.py                   # Interactive REPL
│   ├── storage.py                # SQLite storage backend
│   ├── embeddings.py             # Embedding providers
│   ├── structured_similarity.py  # Structured similarity engine
│   ├── yaml_parser.py            # YAML DSL parser
│   ├── config_builder.py         # Interactive config wizard
│   └── ...
├── config/                       # Example YAML configurations
│   ├── papers_full.yaml          # Research papers (complete)
│   ├── papers_minimal.yaml       # Research papers (simple)
│   ├── products_basic.yaml       # E-commerce products
│   ├── conversations.yaml        # Chat messages
│   └── ...
├── examples/                     # Example scripts
│   ├── repl_demo.md              # REPL tutorial
│   ├── fluent_api.py             # Fluent API examples
│   ├── basic_usage.py            # Basic usage
│   └── ...
├── tests/                        # Comprehensive test suite (755+ tests)
├── docs/                         # Documentation (see below)
└── README.md                     # This file
Documentation
Getting Started
- Getting Started - New user guide with step-by-step tutorials
- examples/repl_demo.md - Interactive REPL walkthrough
Core Concepts
- Core Concepts - Deep dive into structured similarity and network topology
- YAML DSL Reference - Complete YAML DSL specification
- Chunking Guide - Text chunking strategies
API References
- API Reference - Fluent API and NetworkRAG class documentation
- CLI Reference - Command-line interface guide
- Fluent API Guide - Fluent API patterns and examples
Implementation Details
- CLAUDE.md - Development guide for contributors
- Architecture Overview - High-level system design
- REPL Phase 1 - REPL technical details
Examples
Research Papers
See config/papers_full.yaml for complete configuration.
from src.network_rag import NetworkRAG
# Load papers config
rag = NetworkRAG.builder().from_config('config/papers_full.yaml').build()
# Add papers
rag.add('transformer', document={
    'title': 'Attention Is All You Need',
    'abstract': 'The dominant sequence transduction models...',
    'authors': ['Vaswani', 'Shazeer', 'Parmar'],
    'tags': ['transformers', 'attention', 'seq2seq']
})
rag.add('bert', document={
    'title': 'BERT: Pre-training of Deep Bidirectional Transformers',
    'abstract': 'We introduce BERT, a new language representation model...',
    'authors': ['Devlin', 'Chang', 'Lee', 'Toutanova'],
    'tags': ['transformers', 'pretraining', 'nlp']
})
# Build and query
rag.build_network()
communities = rag.detect_communities()
# Find papers about transformers
results = rag.search('transformer architecture').with_strategy('hybrid').top(10)
E-commerce Products
See config/products_basic.yaml for complete configuration.
# Products with semantic + metadata matching
rag = NetworkRAG.builder().from_config('config/products_basic.yaml').build()
rag.add('laptop1', document={
    'name': 'MacBook Pro 14"',
    'description': 'Powerful laptop with M3 chip...',
    'category': 'laptops',
    'price': 1999.00,
    'brand': 'Apple'
})
# Query combines semantic similarity (name, description)
# with exact category match and price range
results = rag.search('powerful laptop for development').filter(category='laptops').top(10)
Chat Conversations
See config/conversations.yaml for role-weighted chat embeddings.
Performance
Storage and Timing Estimates (768-dim embeddings)
| Documents | Storage Size | Build Time | Search Time |
|---|---|---|---|
| 1K | ~3.5 MB | ~0.3s | <10ms |
| 10K | ~35 MB | ~25s | <50ms |
| 100K | ~350 MB | ~5min | <200ms |
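A quick back-of-the-envelope check shows where figures of this size come from. The sketch assumes one dense 768-dimensional float32 vector per document (4 bytes per value); field-level embeddings, metadata, and SQLite overhead would add to that, which may account for the gap between the raw figure and the table. The helper names are hypothetical.

```python
def raw_embedding_bytes(num_docs, dims=768, bytes_per_value=4):
    """Bytes needed for one dense vector per document (float32 assumed)."""
    return num_docs * dims * bytes_per_value

def candidate_pairs(num_docs):
    """Unordered document pairs an exhaustive comparison would visit."""
    return num_docs * (num_docs - 1) // 2

for n in (1_000, 10_000, 100_000):
    mb = raw_embedding_bytes(n) / 1e6
    print(f"{n:>7} docs: {mb:7.1f} MB raw vectors, {candidate_pairs(n):,} pairs")
```

Raw vectors grow linearly with the number of documents (about 3.1 MB per thousand under these assumptions), while the number of candidate pairs grows quadratically, which is consistent with build time rising faster than storage in the table.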
Similarity Thresholds
The min_similarity threshold controls network density:
| Threshold | Use Case | Edge Density | Notes |
|---|---|---|---|
| 0.6 | Exploratory | ~5% | Dense network, more connections |
| 0.7 | General use | ~3% | Balanced (recommended) |
| 0.8 | Large datasets | ~0.5% | Sparse, high precision |
| 0.9 | Very large | ~0.05% | Very sparse, may miss connections |
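The relationship between threshold and density can be sketched on a handful of documents. Edge density here means the fraction of possible undirected edges that survive the threshold. The similarity values below are made up for illustration and the helper names `build_edges` and `edge_density` are hypothetical; real corpora will give much lower densities, as in the table.

```python
def build_edges(similarity, min_similarity):
    """Keep an edge for every pair whose similarity meets the threshold."""
    return [pair for pair, s in similarity.items() if s >= min_similarity]

def edge_density(num_nodes, num_edges):
    """Fraction of possible undirected edges that are present."""
    possible = num_nodes * (num_nodes - 1) // 2
    return num_edges / possible if possible else 0.0

# Made-up pairwise similarities for four documents.
similarity = {
    ("d1", "d2"): 0.92, ("d1", "d3"): 0.75, ("d1", "d4"): 0.40,
    ("d2", "d3"): 0.81, ("d2", "d4"): 0.62, ("d3", "d4"): 0.66,
}
densities = {t: edge_density(4, len(build_edges(similarity, t)))
             for t in (0.6, 0.7, 0.8, 0.9)}
for t, d in densities.items():
    print(f"min_similarity={t}: density={d:.2f}")
```

Raising the threshold can only remove edges, so density falls monotonically. Measuring it on a sample of your own data is a cheap way to pick a starting value before building the full network.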
Testing
Comprehensive test suite with 755+ tests:
# Run all tests
pytest
# Run with coverage
pytest --cov=src --cov-report=html
# Run specific test files
pytest tests/test_network_rag.py -v
pytest tests/test_structured_similarity.py -v
pytest tests/test_yaml_parser.py -v
Test categories:
- Unit tests: Individual components (storage, embeddings, similarity)
- Integration tests: End-to-end workflows
- REPL tests: Interactive shell functionality
- API tests: Fluent interface and builders
- Parser tests: YAML DSL validation
Contributing
See CLAUDE.md for development setup and guidelines.
Key areas for contribution:
- New embedding providers (OpenAI, Cohere, etc.)
- Additional chunking strategies
- Network visualization tools
- Performance optimizations
- Documentation improvements
Roadmap
Current (v0.1.0)
- ✅ Structured similarity with field-level embeddings
- ✅ Hybrid linkage (embeddings + attribute similarity)
- ✅ YAML DSL with full validation
- ✅ Fluent API, CLI, and REPL interfaces
- ✅ SQLite storage with hierarchical embeddings
- ✅ Network analysis (communities, hubs, bridges)
- ✅ Comprehensive testing (755+ tests)
Near Term (v0.2.0)
- Additional embedding providers (OpenAI, Cohere, HuggingFace)
- Advanced chunking strategies (sliding window, semantic)
- REPL Phase 3: Session persistence, script export
- Network visualization (matplotlib, networkx)
- Performance optimizations (caching, incremental updates)
Future (v1.0.0)
- Distributed processing for large datasets
- Real-time incremental updates
- Advanced retrieval strategies (PageRank, random walks)
- Integration with LLM frameworks (LangChain, LlamaIndex)
- REST API for remote access
Citation
If you use Complex Network RAG in your research, please cite:
@software{complex_network_rag,
  title = {Complex Network RAG: Topology-Aware Retrieval for Knowledge Graphs},
  author = {Your Name},
  year = {2025},
  url = {https://github.com/queelius/complex-network-rag}
}
License
MIT License - see LICENSE file for details.
Acknowledgments
- Built on NetworkX for graph algorithms
- Uses scikit-learn for similarity computations
- REPL powered by Click
- Testing with pytest
Support
- Documentation: See docs/ directory
- Examples: See examples/ directory
- Issues: GitHub issue tracker
- Discussions: GitHub discussions
Complex Network RAG - Making knowledge graphs smarter through topology-aware retrieval.