Core Concepts¶
This guide provides a deep dive into the two fundamental innovations of Complex Network RAG:

1. Structured Similarity - Field-level document comparison
2. Network Topology - Graph-based knowledge organization
Table of Contents¶
- Traditional RAG vs Complex Network RAG
- Structured Similarity
- Network Topology
- Similarity Components
- Network Analysis
- Retrieval Strategies
- Design Patterns
Traditional RAG vs Complex Network RAG¶
Traditional RAG¶
Traditional Retrieval-Augmented Generation systems:
┌──────────┐    ┌───────────┐    ┌─────────┐    ┌─────────────┐
│ Document │───▶│   Embed   │───▶│ Vector  │───▶│    Top-K    │
│  (Text)  │    │ (Single)  │    │   DB    │    │ Similarity  │
└──────────┘    └───────────┘    └─────────┘    └─────────────┘
                                                       │
                                                       ▼
                                                 ┌──────────┐
                                                 │ Results  │
                                                 └──────────┘
Limitations:

- Treats documents as atomic units (ignores internal structure)
- Pure distance-based retrieval (no graph structure)
- All content equally weighted
- No community awareness
- Limited cross-domain discovery
Complex Network RAG¶
Complex Network RAG extends this with two key innovations:
┌──────────────────┐
│     Document     │
│ {title: "...",   │
│  abstract: "...",│
│  tags: [...]}    │
└────────┬─────────┘
         │
         ▼
┌─────────────────────────────────────┐
│        Structured Similarity        │
│                                     │
│  ┌───────────┐      ┌───────────┐   │
│  │   Field   │      │ Attribute │   │
│  │Embeddings │      │Similarity │   │
│  │  (title,  │      │  (tags,   │   │
│  │ abstract) │      │ authors)  │   │
│  └─────┬─────┘      └─────┬─────┘   │
│        │                  │         │
│        └────────┬─────────┘         │
│                 ▼                   │
│        ┌────────────────┐           │
│        │ Combined Score │           │
│        └────────┬───────┘           │
└─────────────────┼───────────────────┘
                  │
                  ▼
       ┌─────────────────────┐
       │  Similarity Graph   │
       │                     │
       │   A ──── B ──── C   │
       │   │      │      │   │
       │   D ──── E      F   │
       │   │             │   │
       │   G ─────────── H   │
       └──────────┬──────────┘
                  │
                  ▼
     ┌──────────────────────────┐
     │ Topology-Aware Retrieval │
     │  • Communities           │
     │  • Hubs & Bridges        │
     │  • Multi-hop paths       │
     └──────────┬───────────────┘
                │
                ▼
          ┌──────────┐
          │ Results  │
          └──────────┘
Advantages:

- Field-specific embeddings (title ≠ abstract ≠ tags)
- Hybrid similarity (embeddings + metadata)
- Weighted components (configure what matters)
- Community detection (find clusters)
- Hub and bridge identification (cross-domain)
- Topology-aware strategies (leverage structure)
Structured Similarity¶
The Problem¶
Consider a research paper document:
{
  "title": "Attention Is All You Need",
  "abstract": "The dominant sequence transduction models...",
  "authors": ["Vaswani", "Shazeer", "Parmar"],
  "tags": ["transformers", "attention", "seq2seq"],
  "year": 2017,
  "citations": 75823
}
Question: How should we compare two papers?
Traditional approach: Concatenate everything and embed:
# Naive: flatten every field into one string and embed it
text = f"{title} {abstract} {' '.join(authors)} {' '.join(tags)}"
embedding = embed(text)
similarity = cosine(embedding1, embedding2)
Problems:

- Long abstracts dominate similarity
- Author names treated as semantic text
- Year and citations ignored (not text)
- No control over what matters
The Solution: Field-Level Similarity¶
Complex Network RAG computes similarity per field:
similarity:
  components:
    # Semantic similarity on title (30%)
    - type: field_embedding
      field: title
      model: tfidf
      weight: 0.3

    # Semantic similarity on abstract (40%)
    - type: field_embedding
      field: abstract
      model: tfidf
      weight: 0.4

    # Set overlap on tags (20%)
    - type: attribute_similarity
      field: tags
      metric: jaccard
      weight: 0.2

    # Set overlap on authors (10%)
    - type: attribute_similarity
      field: authors
      metric: jaccard
      weight: 0.1
Result: Fine-grained, explainable similarity:
Paper A vs Paper B:

  title:    "Attention Is All You Need" vs "BERT: Transformers for NLP"
            → embedding similarity 0.75, weighted: 0.75 × 0.3 = 0.225

  abstract: [500 words on transformers] vs [500 words on BERT]
            → embedding similarity 0.68, weighted: 0.68 × 0.4 = 0.272

  tags:     {transformers, attention, seq2seq} vs {transformers, pretraining, nlp}
            → jaccard 1/5 = 0.2, weighted: 0.2 × 0.2 = 0.040

  authors:  {Vaswani, Shazeer, ...} vs {Devlin, Chang, ...}
            → jaccard 0/8 = 0.0, weighted: 0.0 × 0.1 = 0.000
  ──────────────────────────────────────────────────────────
  Combined similarity: 0.537
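In code, the combination is just a weighted sum. A minimal sketch (illustrative only; in practice the per-field scores come from the configured embedding and attribute components):

scores  = {"title": 0.75, "abstract": 0.68, "tags": 0.2, "authors": 0.0}
weights = {"title": 0.3, "abstract": 0.4, "tags": 0.2, "authors": 0.1}

combined = sum(weights[field] * score for field, score in scores.items())
print(round(combined, 3))  # 0.537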
Component Types¶
1. Field Embedding Component¶
What: Embed text field, compute cosine similarity
When to use:

- Text fields (title, abstract, description)
- Semantic matching needed
- Order matters
Configuration:
- type: field_embedding
  field: title
  model: tfidf          # or: sentence_bert, ollama
  weight: 0.3
  min_similarity: 0.5   # Only consider if ≥ 0.5

  # Optional: for long text
  chunking:
    method: sentences   # or: fixed, sliding
    max_tokens: 512
    overlap: 50
Example: Paper titles, product descriptions, article text
2. Attribute Similarity Component¶
What: Compare non-text attributes directly (no embedding)
When to use:

- Categorical data (tags, categories)
- Exact matching (IDs, codes)
- Set overlap (authors, keywords)
- Boolean fields
Configuration:
- type: attribute_similarity
  field: tags
  metric: jaccard       # or: dice, exact, overlap
  weight: 0.2
  min_similarity: 0.1
Metrics:
- jaccard: |A ∩ B| / |A ∪ B| (set overlap)
- dice: 2|A ∩ B| / (|A| + |B|) (harmonic mean)
- exact: 1.0 if equal, 0.0 otherwise
- overlap: |A ∩ B| / min(|A|, |B|) (containment)
Example: Tags, categories, authors, keywords
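For concreteness, the four metrics reduce to a few lines of Python set arithmetic. A minimal sketch (standalone helpers, not the library's API):

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

def exact(a, b):
    return 1.0 if a == b else 0.0

def overlap(a, b):
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0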
3. Composite Embedding Component¶
What: Combine multiple fields into single embedding
When to use:

- Fields that should be considered together
- You want a single semantic representation
- Differential weighting within the composite
Configuration:
- type: composite_embedding
  component_id: full_text
  fields:
    - field: name
      weight: 0.4
    - field: description
      weight: 0.6
  model: tfidf
  aggregation: weighted_avg
  weight: 0.8
Example: Product name + description, title + abstract
Weight Normalization¶
Weights are automatically normalized to sum to 1.0:
# You write:
components:
  - field: title
    weight: 3      # 3/10 = 0.3
  - field: abstract
    weight: 5      # 5/10 = 0.5
  - field: tags
    weight: 2      # 2/10 = 0.2

# System interprets as:
components:
  - field: title
    weight: 0.3
  - field: abstract
    weight: 0.5
  - field: tags
    weight: 0.2
Visual representation of component weights:
Title [██████████████████████████████] 30%
Abstract [██████████████████████████████████████████████████] 50%
Tags [████████████████████] 20%
Combined Similarity = 0.3×sim(title) + 0.5×sim(abstract) + 0.2×sim(tags)
This makes configuration intuitive (use ratios like 3:5:2).
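The normalization itself is one line of arithmetic. A minimal sketch (the library applies this internally when it loads the config):

def normalize_weights(raw):
    total = sum(raw)
    return [w / total for w in raw]

print(normalize_weights([3, 5, 2]))  # [0.3, 0.5, 0.2]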
Chunking Strategies¶
Long text fields are automatically chunked:
Sentence Chunking (Recommended)¶
How it works:

1. Split text into sentences
2. Group sentences until max_tokens is reached
3. Overlap the last N tokens with the next chunk
4. Embed each chunk separately
5. Aggregate chunk embeddings (mean, max, etc.)
Best for: Natural text (abstracts, articles, descriptions)
Fixed Chunking¶
How it works:

1. Split text every max_tokens tokens
2. No sentence boundaries
3. Optional overlap
Best for: Code, structured text, logs
Sliding Window¶
How it works:

1. Fixed-size window
2. Slide by (max_tokens - overlap)
3. Dense coverage
Best for: Finding local patterns, overlapping context
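To make the windowing concrete, here is a rough sketch of fixed/sliding chunking over a token list (illustrative only; the real chunker also respects sentence boundaries and tokenizer limits):

def chunk_tokens(tokens, max_tokens=512, overlap=50):
    # Slide by (max_tokens - overlap) so consecutive chunks share context
    step = max_tokens - overlap
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), step)]

tokens = "the quick brown fox jumps over the lazy dog".split()
print(chunk_tokens(tokens, max_tokens=4, overlap=1))
# [['the', 'quick', 'brown', 'fox'], ['fox', 'jumps', 'over', 'the'], ['the', 'lazy', 'dog']]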
Aggregation Methods¶
When you have multiple chunks or fields, how do you combine embeddings?
Options:
- weighted_avg: Weighted average of embeddings
- max_pool: Max value per dimension (emphasizes peaks)
- concat: Concatenate embeddings (increases dimension)
- sum: Sum embeddings (emphasizes frequency)
Example:
import numpy as np

# Abstract split into 3 chunks
chunk_embs = np.array([
    [0.5, 0.2, 0.8],
    [0.6, 0.1, 0.7],
    [0.4, 0.3, 0.6],
])

# weighted_avg (default; equal weights here)
result = chunk_embs.mean(axis=0)   # [0.50, 0.20, 0.70]

# max_pool
result = chunk_embs.max(axis=0)    # [0.60, 0.30, 0.80]
Network Topology¶
From Embeddings to Graphs¶
Complex Network RAG doesn't just compute embeddings—it builds a similarity graph:
Documents → Embeddings → Similarity Matrix → Filtered Graph
Step 1: Compute Pairwise Similarity
      ┌─────┬─────┬─────┬─────┐
      │     │  A  │  B  │  C  │
      ├─────┼─────┼─────┼─────┤
      │  A  │ 1.0 │ 0.7 │ 0.3 │
      │  B  │ 0.7 │ 1.0 │ 0.5 │
      │  C  │ 0.3 │ 0.5 │ 1.0 │
      └─────┴─────┴─────┴─────┘

Step 2: Filter by Threshold (≥ 0.4)

      A ────── B
               │
               C

Only keep strong connections!
Key insight: Only create edges where similarity ≥ threshold.
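The filtering step is easy to sketch with networkx, assuming pairwise similarities have already been computed (the dict below is a hypothetical stand-in for the matrix above):

import networkx as nx

sims = {("A", "B"): 0.7, ("A", "C"): 0.3, ("B", "C"): 0.5}
threshold = 0.4

G = nx.Graph()
G.add_nodes_from(["A", "B", "C"])
for (u, v), s in sims.items():
    if s >= threshold:  # only keep strong connections
        G.add_edge(u, v, weight=s)

print(sorted(G.edges()))  # [('A', 'B'), ('B', 'C')]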
Why Graphs?¶
Traditional RAG: "Find 10 most similar documents"
Problem: No context about relationships between retrieved documents.
Complex Network RAG: "Find 10 most similar documents, considering their network position"
Query → Graph → [hub1: 0.9, bridge3: 0.87, doc5: 0.84, ...]
                 (hub connects 15 docs)  (bridge connects 2 communities)
Benefit: Retrieve diverse, structurally important documents.
Similarity Threshold¶
The min_combined_similarity threshold controls edge creation:
Effect:

- Only create edges where combined similarity ≥ 0.4
- Controls network density
- Critical tuning parameter
Choosing a Threshold¶
| Threshold | Edge Density | Network Type | Use Case |
|---|---|---|---|
| 0.3 | ~10% | Very dense | Exploratory, small datasets |
| 0.4 | ~5% | Dense | General use, medium datasets |
| 0.5 | ~3% | Moderate | Balanced (recommended) |
| 0.6 | ~1% | Sparse | Large datasets, high precision |
| 0.7+ | ~0.1% | Very sparse | Very large datasets |
Edge complexity:

- n documents with threshold t
- Expected edges ≈ (n²/2) × P(similarity ≥ t), since there are n(n-1)/2 pairs
- Dense networks (many edges) are slower but capture more relationships
- Sparse networks (few edges) are faster but may miss connections
Empirical rules:

- 1K documents: threshold 0.4-0.5
- 10K documents: threshold 0.5-0.6
- 100K documents: threshold 0.6-0.7
Network Properties¶
Degree¶
Definition: Number of connections a node has
Interpretation:

- High degree = hub (connects many documents)
- Low degree = specialized (few connections)
Communities¶
Definition: Dense clusters of highly connected nodes
Algorithm: Louvain community detection (modularity optimization)
Interpretation:

- Nodes in the same community are tightly related
- Different communities represent different topics/domains
- Community size indicates topic prevalence
Example:
Community 0: Machine Learning papers (45 nodes)
Community 1: Medical texts (32 nodes)
Community 2: Legal documents (28 nodes)
Community 3: News articles (67 nodes)
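Community detection itself can be sketched with networkx's built-in Louvain implementation (networkx ≥ 2.8; the karate-club graph stands in for a real similarity graph):

import networkx as nx

G = nx.karate_club_graph()  # stand-in for the thresholded similarity graph

# Louvain community detection (modularity optimization)
communities = nx.community.louvain_communities(G, seed=42)
for i, members in enumerate(communities):
    print(f"Community {i}: {len(members)} nodes")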
Visual representation of network topology:
 Community 0 (ML)               Community 1 (Medical)
┌─────────────────────┐        ┌─────────────────────┐
│  A ─── B ─── C      │        │  G ─── H ─── I      │
│  │     │     │      │        │  │     │     │      │
│  D ─── E ─── F ─────┼────────┼─→ J ─── K ─── L     │
│        │            │ Bridge │         │           │
│        M (Hub)      │        │         N           │
│      │ │ │          │        │         │           │
│      O P Q          │        └─────────────────────┘
└─────────────────────┘
Legend:
• Nodes (A-Q): Documents
• ─── : Strong similarity edge
• Hub (M): High degree (connects many docs)
• Bridge (F-J): Connects communities (cross-domain)
• Dense clusters: Communities (related topics)
Key insights from this topology:

- Community 0 contains ML papers (tightly connected)
- Community 1 contains medical papers (separate cluster)
- Node M is a hub (survey paper? foundational concept?)
- Edge F-J is a bridge (ML application in medicine?)
Betweenness Centrality¶
Definition: How often a node lies on shortest paths between other nodes
Interpretation:

- High betweenness = bridge (connects different parts of the graph)
- Low betweenness = internal to a community
Example:
Node "optimization-theory": betweenness = 0.35
→ Bridges ML community and Operations Research community
Node "transformer-intro": betweenness = 0.02
→ Internal to NLP community, not a bridge
Clustering Coefficient¶
Definition: How connected a node's neighbors are
Interpretation:

- High clustering = tightly knit neighborhood
- Low clustering = star pattern (hub with unconnected neighbors)
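All three properties are standard graph measures. A minimal networkx sketch (again with a stand-in graph rather than a real similarity graph):

import networkx as nx

G = nx.karate_club_graph()  # stand-in for the document similarity graph

degree = dict(G.degree())                    # hubs: high degree
betweenness = nx.betweenness_centrality(G)   # bridges: high betweenness
clustering = nx.clustering(G)                # tight neighborhoods: high coefficient

hub = max(degree, key=degree.get)
print(f"node {hub}: degree={degree[hub]}, "
      f"betweenness={betweenness[hub]:.2f}, clustering={clustering[hub]:.2f}")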
Similarity Components¶
Embedding Components¶
TF-IDF Embedding¶
What: Term Frequency-Inverse Document Frequency
Pros:

- Fast (no neural network)
- Interpretable (sparse vectors)
- Works well for keywords
- No external dependencies

Cons:

- Bag-of-words (ignores order)
- No semantic understanding
- Vocabulary limited to training data
Use for: Titles, tags, short text, keywords
Sentence Transformers¶
What: Neural embeddings (BERT-based)
Pros:

- Semantic understanding
- Handles synonyms
- Contextual (order matters)
- Pre-trained models available

Cons:

- Slower than TF-IDF
- Requires model download
- Dense vectors (higher memory)
Use for: Abstracts, descriptions, full text
- type: field_embedding
  field: abstract
  model: sentence_bert
  model_name: all-MiniLM-L6-v2
  weight: 0.5
Ollama (Remote)¶
What: Remote embedding service
Pros:

- Latest models
- No local compute
- Scalable

Cons:

- Requires network
- API costs
- Latency
Use for: Production deployments, large-scale
- type: field_embedding
  field: content
  model: ollama
  model_name: nomic-embed-text
  host: http://localhost:11434
  weight: 0.7
Attribute Components¶
Jaccard Similarity¶
Formula: |A ∩ B| / |A ∪ B|
Example:
A = {'ml', 'nlp', 'transformers'}
B = {'nlp', 'transformers', 'bert'}
intersection = {'nlp', 'transformers'} # |A ∩ B| = 2
union = {'ml', 'nlp', 'transformers', 'bert'} # |A ∪ B| = 4
jaccard = 2 / 4 = 0.5
Use for: Tags, keywords, categories
Dice Similarity¶
Formula: 2|A ∩ B| / (|A| + |B|)
Example:
A = {'ml', 'nlp', 'transformers'}
B = {'nlp', 'transformers', 'bert'}
intersection = {'nlp', 'transformers'} # |A ∩ B| = 2
dice = 2 × 2 / (3 + 3) = 4 / 6 = 0.67
Use for: cases where shared elements should count more heavily than Jaccard allows
Exact Match¶
Formula: 1.0 if A == B, else 0.0
Example:

A = "transformers"
B = "transformers"
exact = 1.0   # values are equal

A = "transformers"
B = "rnn"
exact = 0.0   # any difference scores zero
Use for: Categories, IDs, exact-match fields
Network Analysis¶
Community Detection¶
Goal: Find clusters of related documents
Algorithm: Louvain (fast, hierarchical)
Usage:
# Detect communities
communities = rag.detect_communities()

# Group by community
from collections import defaultdict

comm_groups = defaultdict(list)
for node_id, comm_id in communities.items():
    comm_groups[comm_id].append(node_id)

# Analyze
for comm_id, nodes in comm_groups.items():
    # Auto-tag community
    tags = rag.auto_tag_community(comm_id)
    print(f"Community {comm_id}: {', '.join(tags)}")
Auto-tagging:

- Sample random nodes from the community
- Extract text content
- Run TF-IDF to find distinctive terms (sketched below)
- Return top-K keywords
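A rough sketch of the TF-IDF step, assuming scikit-learn is available (top_terms is a hypothetical helper, not the library's API):

from sklearn.feature_extraction.text import TfidfVectorizer

def top_terms(community_texts, k=5):
    # Rank terms by their total TF-IDF weight across the community's documents
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(community_texts)
    scores = X.sum(axis=0).A1            # total weight per term
    terms = vec.get_feature_names_out()
    return [terms[i] for i in scores.argsort()[::-1][:k]]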
Hub Identification¶
Goal: Find highly connected nodes
Criteria: Degree ≥ threshold
Usage:
# Find hubs (min 10 connections)
hubs = rag.get_hub_nodes(min_degree=10)

# Analyze hubs
for node_id in hubs:
    degree = rag.graph.degree(node_id)
    community = rag.get_community_for_node(node_id)
    print(f"{node_id}: {degree} connections in community {community}")
Why hubs matter:

- Represent foundational concepts
- Good candidates for retrieval (versatile)
- Often overview documents
- Connect specialized sub-topics
Bridge Identification¶
Goal: Find nodes connecting communities
Criteria: Betweenness centrality ≥ threshold
Usage:
# Find bridges
bridges = rag.get_bridge_nodes(min_betweenness=0.1)

# Analyze bridges
for node_id in bridges:
    # Get neighboring communities
    neighbors = rag.get_neighbors(node_id, k_hops=1)
    neighbor_comms = {rag.get_community_for_node(n) for n in neighbors}
    print(f"{node_id} connects communities: {neighbor_comms}")
Why bridges matter:

- Enable cross-domain transfer
- Often interdisciplinary work
- Good for exploration (diverse results)
- Connect different perspectives
Retrieval Strategies¶
Strategy Overview¶
| Strategy | Focus | Use Case |
|---|---|---|
| similarity | Pure cosine | Baseline, simple queries |
| community | Same cluster | Domain-specific retrieval |
| hub | High degree | Foundational documents |
| bridge | Cross-domain | Interdisciplinary queries |
| hybrid | Combination | General use (recommended) |
Similarity Strategy (Baseline)¶
Algorithm: Pure cosine similarity
When to use:

- Baseline comparison
- Very specific queries
- Small, homogeneous datasets
Community Strategy¶
Algorithm: Boost nodes in same community as top results
How it works:

1. Find top-K by similarity
2. Detect their communities
3. Boost other nodes in those communities

When to use:

- Want domain-coherent results
- Exploring a specific topic
- Need contextual documents
Hub Strategy¶
Algorithm: Boost high-degree nodes
How it works:

1. Find top-K by similarity
2. Compute degree for each
3. Boost high-degree nodes

When to use:

- Want foundational documents
- Overview/survey papers
- Versatile knowledge
Bridge Strategy¶
Algorithm: Boost high-betweenness nodes
How it works:

1. Find top-K by similarity
2. Compute betweenness for each
3. Boost bridge nodes

When to use:

- Cross-domain queries
- Want diverse perspectives
- Transfer learning tasks
Hybrid Strategy (Recommended)¶
Algorithm: Intelligent combination of all strategies
How it works:

1. Pure similarity (baseline)
2. Community boost (coherence)
3. Hub boost (foundation)
4. Bridge boost (diversity)
5. Combine with learned weights (sketched below)

When to use:

- General purpose (default)
- Want balanced results
- Let the system optimize
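The weights below are made up for illustration; the actual combination (and any learned weighting) is internal to the library. A minimal sketch of the re-ranking idea:

def hybrid_score(base_sim, same_community, degree_norm, betweenness_norm,
                 w_sim=0.6, w_comm=0.2, w_hub=0.1, w_bridge=0.1):
    # Blend pure similarity with topology signals; all inputs in [0, 1]
    return (w_sim * base_sim
            + w_comm * same_community        # 1.0 if node shares a top community
            + w_hub * degree_norm            # normalized degree (hub boost)
            + w_bridge * betweenness_norm)   # normalized betweenness (bridge boost)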
Design Patterns¶
Pattern 1: Multi-Field Products¶
# E-commerce: semantic + categorical
schema:
  name: text
  description: text
  category: text
  brand: text

embeddings:
  name_vec:
    field: name
    model: tfidf
  desc_vec:
    field: description
    model: tfidf
  product_vec:
    combine:
      - ref: name_vec
        weight: 0.3
      - ref: desc_vec
        weight: 0.7

similarities:
  text_sim:
    embedding: product_vec
  category_sim:
    field: category
    metric: exact
  brand_sim:
    field: brand
    metric: exact
  overall:
    combine:
      - ref: text_sim
        weight: 0.6
      - ref: category_sim
        weight: 0.3
      - ref: brand_sim
        weight: 0.1

network:
  edges:
    similarity: overall
    min: 0.4
Pattern 2: Research Papers¶
# Papers: precise title + semantic abstract + metadata
schema:
  title: text
  abstract: text
  tags:
    type: list
    default: []
  authors:
    type: list
    default: []

embeddings:
  title_vec:
    field: title
    model: tfidf
  abstract_vec:
    field: abstract
    model: tfidf
    chunking:
      method: sentences
      max_tokens: 512
  text_vec:
    combine:
      - ref: title_vec
        weight: 0.4
      - ref: abstract_vec
        weight: 0.6

similarities:
  text_sim:
    embedding: text_vec
  tag_sim:
    field: tags
    metric: jaccard
  author_sim:
    field: authors
    metric: jaccard
  overall:
    combine:
      - ref: text_sim
        weight: 0.7
      - ref: tag_sim
        weight: 0.2
      - ref: author_sim
        weight: 0.1

network:
  edges:
    similarity: overall
    min: 0.35
Pattern 3: Hierarchical Content¶
# Blog posts: title > summary > full text
schema:
  title: text
  summary: text
  content: text
  tags:
    type: list
    default: []

embeddings:
  title_vec:
    field: title
    model: tfidf
  summary_vec:
    field: summary
    model: tfidf
  content_vec:
    field: content
    model: tfidf
    chunking:
      method: sentences
      max_tokens: 1024
  text_vec:
    combine:
      - ref: title_vec
        weight: 0.4
      - ref: summary_vec
        weight: 0.35
      - ref: content_vec
        weight: 0.25

similarities:
  text_sim:
    embedding: text_vec
  tag_sim:
    field: tags
    metric: jaccard
  overall:
    combine:
      - ref: text_sim
        weight: 0.9
      - ref: tag_sim
        weight: 0.1

network:
  edges:
    similarity: overall
    min: 0.4
Pattern 4: Conversations¶
# Chat logs: role-weighted messages
schema:
  user_message: text
  assistant_message: text
  topic: text

embeddings:
  user_vec:
    field: user_message
    model: tfidf
  assistant_vec:
    field: assistant_message
    model: tfidf
  chat_vec:
    combine:
      - ref: user_vec
        weight: 0.6   # User message weighted higher
      - ref: assistant_vec
        weight: 0.4

similarities:
  chat_sim:
    embedding: chat_vec
  topic_sim:
    field: topic
    metric: exact
  overall:
    combine:
      - ref: chat_sim
        weight: 0.8
      - ref: topic_sim
        weight: 0.2

network:
  edges:
    similarity: overall
    min: 0.4
Summary¶
Structured Similarity:

- ✅ Field-level embeddings for fine-grained control
- ✅ Hybrid components (embeddings + attributes)
- ✅ Weighted combination with automatic normalization
- ✅ Chunking for long text
- ✅ Explainable scores

Network Topology:

- ✅ Similarity graph with threshold filtering
- ✅ Community detection for topic clusters
- ✅ Hub identification for foundational concepts
- ✅ Bridge identification for cross-domain transfer
- ✅ Topology-aware retrieval strategies

Next Steps:

- See YAML DSL Reference for complete configuration syntax
- See API Reference for programmatic usage
- See Tutorials for end-to-end examples
Complex Network RAG - Structure + Topology = Smarter Retrieval