Core Concepts¶
This guide provides a deep dive into the two fundamental innovations of Complex Network RAG:

1. Structured Similarity - Field-level document comparison
2. Network Topology - Graph-based knowledge organization
Table of Contents¶
- Traditional RAG vs Complex Network RAG
- Structured Similarity
- Network Topology
- Similarity Components
- Network Analysis
- Retrieval Strategies
- Design Patterns
Traditional RAG vs Complex Network RAG¶
Traditional RAG¶
Traditional Retrieval-Augmented Generation systems:
┌──────────┐    ┌───────────┐    ┌─────────┐    ┌─────────────┐
│ Document │───▶│   Embed   │───▶│ Vector  │───▶│    Top-K    │
│  (Text)  │    │ (Single)  │    │   DB    │    │ Similarity  │
└──────────┘    └───────────┘    └─────────┘    └─────────────┘
                                                       │
                                                       ▼
                                                 ┌──────────┐
                                                 │ Results  │
                                                 └──────────┘
Limitations:

- Treats documents as atomic units (ignores internal structure)
- Pure distance-based retrieval (no graph structure)
- All content equally weighted
- No community awareness
- Limited cross-domain discovery
Complex Network RAG¶
Complex Network RAG extends this with two key innovations:
┌──────────────────┐
│     Document     │
│ {title: "...",   │
│  abstract: "...",│
│  tags: [...]}    │
└────────┬─────────┘
         │
         ▼
┌─────────────────────────────────────┐
│        Structured Similarity        │
│                                     │
│  ┌───────────┐      ┌───────────┐   │
│  │   Field   │      │ Attribute │   │
│  │Embeddings │      │Similarity │   │
│  │  (title,  │      │  (tags,   │   │
│  │ abstract) │      │ authors)  │   │
│  └─────┬─────┘      └─────┬─────┘   │
│        │                  │         │
│        └────────┬─────────┘         │
│                 ▼                   │
│        ┌────────────────┐           │
│        │ Combined Score │           │
│        └────────┬───────┘           │
└─────────────────┼───────────────────┘
                  │
                  ▼
       ┌─────────────────────┐
       │  Similarity Graph   │
       │                     │
       │   A ──── B ──── C   │
       │   │      │      │   │
       │   D ──── E      F   │
       │   │             │   │
       │   G ─────────── H   │
       └──────────┬──────────┘
                  │
                  ▼
     ┌──────────────────────────┐
     │ Topology-Aware Retrieval │
     │  • Communities           │
     │  • Hubs & Bridges        │
     │  • Multi-hop paths       │
     └──────────┬───────────────┘
                │
                ▼
          ┌──────────┐
          │ Results  │
          └──────────┘
Advantages:

- Field-specific embeddings (title ≠ abstract ≠ tags)
- Hybrid similarity (embeddings + metadata)
- Weighted components (configure what matters)
- Community detection (find clusters)
- Hub and bridge identification (cross-domain)
- Topology-aware strategies (leverage structure)
Structured Similarity¶
The Problem¶
Consider a research paper document:
{
  "title": "Attention Is All You Need",
  "abstract": "The dominant sequence transduction models...",
  "authors": ["Vaswani", "Shazeer", "Parmar"],
  "tags": ["transformers", "attention", "seq2seq"],
  "year": 2017,
  "citations": 75823
}
Question: How should we compare two papers?
Traditional approach: Concatenate everything and embed:
# Naive: flatten every field into one string and embed it
text = f"{title} {abstract} {' '.join(authors)} {' '.join(tags)}"
embedding = embed(text)
similarity = cosine(embedding1, embedding2)
Problems:

- Long abstracts dominate similarity
- Author names treated as semantic text
- Year and citations ignored (not text)
- No control over what matters
The Solution: Field-Level Similarity¶
Complex Network RAG computes similarity per field:
similarity:
  components:
    # Semantic similarity on title (30%)
    - type: field_embedding
      field: title
      model: tfidf
      weight: 0.3

    # Semantic similarity on abstract (40%)
    - type: field_embedding
      field: abstract
      model: tfidf
      weight: 0.4

    # Set overlap on tags (20%)
    - type: attribute_similarity
      field: tags
      metric: jaccard
      weight: 0.2

    # Set overlap on authors (10%)
    - type: attribute_similarity
      field: authors
      metric: jaccard
      weight: 0.1
Result: Fine-grained, explainable similarity:
Paper A vs Paper B:

  title:    "Attention Is All You Need" vs "BERT: Transformers for NLP"
            → embedding similarity 0.75, weighted: 0.75 × 0.3 = 0.225

  abstract: [500 words on transformers] vs [500 words on BERT]
            → embedding similarity 0.68, weighted: 0.68 × 0.4 = 0.272

  tags:     {transformers, attention, seq2seq} vs {transformers, pretraining, nlp}
            → jaccard 1/5 = 0.2, weighted: 0.2 × 0.2 = 0.040

  authors:  {Vaswani, Shazeer, ...} vs {Devlin, Chang, ...}
            → jaccard 0/8 = 0.0, weighted: 0.0 × 0.1 = 0.000
  ──────────────────────────────────────────────────────────
  Combined similarity: 0.537
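In code, the combination is just a weighted sum. A minimal sketch (illustrative only; in practice the per-field scores come from the configured embedding and attribute components):

scores  = {"title": 0.75, "abstract": 0.68, "tags": 0.2, "authors": 0.0}
weights = {"title": 0.3, "abstract": 0.4, "tags": 0.2, "authors": 0.1}

combined = sum(weights[field] * score for field, score in scores.items())
print(round(combined, 3))  # 0.537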
Component Types¶
1. Field Embedding Component¶
What: Embed text field, compute cosine similarity
When to use:

- Text fields (title, abstract, description)
- Semantic matching needed
- Order matters
Configuration:
- type: field_embedding
  field: title
  model: tfidf          # or: sentence_bert, ollama
  weight: 0.3
  min_similarity: 0.5   # Only consider if ≥ 0.5

  # Optional: for long text
  chunking:
    method: sentences   # or: fixed, sliding
    max_tokens: 512
    overlap: 50
Example: Paper titles, product descriptions, article text
2. Attribute Similarity Component¶
What: Compare non-text attributes directly (no embedding)
When to use:

- Categorical data (tags, categories)
- Exact matching (IDs, codes)
- Set overlap (authors, keywords)
- Boolean fields
Configuration:
- type: attribute_similarity
  field: tags
  metric: jaccard       # or: dice, exact, overlap
  weight: 0.2
  min_similarity: 0.1
Metrics:
- jaccard: |A ∩ B| / |A ∪ B| (set overlap)
- dice: 2|A ∩ B| / (|A| + |B|) (harmonic mean)
- exact: 1.0 if equal, 0.0 otherwise
- overlap: |A ∩ B| / min(|A|, |B|) (containment)
Example: Tags, categories, authors, keywords
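For concreteness, the four metrics reduce to a few lines of Python set arithmetic. A minimal sketch (standalone helpers, not the library's API):

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

def exact(a, b):
    return 1.0 if a == b else 0.0

def overlap(a, b):
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0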
3. Composite Embedding Component¶
What: Combine multiple fields into single embedding
When to use:

- Fields that should be considered together
- You want a single semantic representation
- Differential weighting within the composite
Configuration:
- type: composite_embedding
  component_id: full_text
  fields:
    - field: name
      weight: 0.4
    - field: description
      weight: 0.6
  model: tfidf
  aggregation: weighted_avg
  weight: 0.8
Example: Product name + description, title + abstract
Weight Normalization¶
Weights are automatically normalized to sum to 1.0:
# You write:
components:
  - field: title
    weight: 3      # 3/10 = 0.3
  - field: abstract
    weight: 5      # 5/10 = 0.5
  - field: tags
    weight: 2      # 2/10 = 0.2

# System interprets as:
components:
  - field: title
    weight: 0.3
  - field: abstract
    weight: 0.5
  - field: tags
    weight: 0.2
Visual representation of component weights:
Title [██████████████████████████████] 30%
Abstract [██████████████████████████████████████████████████] 50%
Tags [████████████████████] 20%
Combined Similarity = 0.3×sim(title) + 0.5×sim(abstract) + 0.2×sim(tags)
This makes configuration intuitive (use ratios like 3:5:2).
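The normalization itself is one line of arithmetic. A minimal sketch (the library applies this internally when it loads the config):

def normalize_weights(raw):
    total = sum(raw)
    return [w / total for w in raw]

print(normalize_weights([3, 5, 2]))  # [0.3, 0.5, 0.2]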
Chunking Strategies¶
Long text fields are automatically chunked:
Sentence Chunking (Recommended)¶
How it works:

1. Split text into sentences
2. Group sentences until max_tokens is reached
3. Overlap the last N tokens with the next chunk
4. Embed each chunk separately
5. Aggregate chunk embeddings (mean, max, etc.)
Best for: Natural text (abstracts, articles, descriptions)
Fixed Chunking¶
How it works:

1. Split text every max_tokens tokens
2. No sentence boundaries
3. Optional overlap
Best for: Code, structured text, logs
Sliding Window¶
How it works:

1. Fixed-size window
2. Slide by (max_tokens - overlap)
3. Dense coverage
Best for: Finding local patterns, overlapping context
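To make the windowing concrete, here is a rough sketch of fixed/sliding chunking over a token list (illustrative only; the real chunker also respects sentence boundaries and tokenizer limits):

def chunk_tokens(tokens, max_tokens=512, overlap=50):
    # Slide by (max_tokens - overlap) so consecutive chunks share context
    step = max_tokens - overlap
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), step)]

tokens = "the quick brown fox jumps over the lazy dog".split()
print(chunk_tokens(tokens, max_tokens=4, overlap=1))
# [['the', 'quick', 'brown', 'fox'], ['fox', 'jumps', 'over', 'the'], ['the', 'lazy', 'dog']]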
Aggregation Methods¶
When you have multiple chunks or fields, how do you combine embeddings?
Options:
- weighted_avg: Weighted average of embeddings
- max_pool: Max value per dimension (emphasizes peaks)
- concat: Concatenate embeddings (increases dimension)
- sum: Sum embeddings (emphasizes frequency)
Example:
import numpy as np

# Abstract split into 3 chunks
chunk_embs = np.array([
    [0.5, 0.2, 0.8],
    [0.6, 0.1, 0.7],
    [0.4, 0.3, 0.6],
])

# weighted_avg (default; equal weights here)
result = chunk_embs.mean(axis=0)   # [0.50, 0.20, 0.70]

# max_pool
result = chunk_embs.max(axis=0)    # [0.60, 0.30, 0.80]
Network Topology¶
From Embeddings to Graphs¶
Complex Network RAG doesn't just compute embeddings—it builds a similarity graph:
Documents → Embeddings → Similarity Matrix → Filtered Graph
Step 1: Compute Pairwise Similarity
      ┌─────┬─────┬─────┬─────┐
      │     │  A  │  B  │  C  │
      ├─────┼─────┼─────┼─────┤
      │  A  │ 1.0 │ 0.7 │ 0.3 │
      │  B  │ 0.7 │ 1.0 │ 0.5 │
      │  C  │ 0.3 │ 0.5 │ 1.0 │
      └─────┴─────┴─────┴─────┘

Step 2: Filter by Threshold (≥ 0.4)

      A ────── B
               │
               C

Only keep strong connections!
Key insight: Only create edges where similarity ≥ threshold.
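The filtering step is easy to sketch with networkx, assuming pairwise similarities have already been computed (the dict below is a hypothetical stand-in for the matrix above):

import networkx as nx

sims = {("A", "B"): 0.7, ("A", "C"): 0.3, ("B", "C"): 0.5}
threshold = 0.4

G = nx.Graph()
G.add_nodes_from(["A", "B", "C"])
for (u, v), s in sims.items():
    if s >= threshold:  # only keep strong connections
        G.add_edge(u, v, weight=s)

print(sorted(G.edges()))  # [('A', 'B'), ('B', 'C')]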
Why Graphs?¶
Traditional RAG: "Find 10 most similar documents"
Problem: No context about relationships between retrieved documents.
Complex Network RAG: "Find 10 most similar documents, considering their network position"
Query → Graph → [hub1: 0.9, bridge3: 0.87, doc5: 0.84, ...]
                 (hub connects 15 docs)  (bridge connects 2 communities)
Benefit: Retrieve diverse, structurally important documents.
Similarity Threshold¶
The min_combined_similarity threshold controls edge creation:
Effect:

- Only create edges where combined similarity ≥ 0.4
- Controls network density
- Critical tuning parameter
Choosing a Threshold¶
| Threshold | Edge Density | Network Type | Use Case |
|---|---|---|---|
| 0.3 | ~10% | Very dense | Exploratory, small datasets |
| 0.4 | ~5% | Dense | General use, medium datasets |
| 0.5 | ~3% | Moderate | Balanced (recommended) |
| 0.6 | ~1% | Sparse | Large datasets, high precision |
| 0.7+ | ~0.1% | Very sparse | Very large datasets |
Edge complexity:

- n documents with threshold t
- Expected edges ≈ (n²/2) × P(similarity ≥ t), since there are n(n-1)/2 pairs
- Dense networks (many edges) are slower but capture more relationships
- Sparse networks (few edges) are faster but may miss connections
Empirical rules:

- 1K documents: threshold 0.4-0.5
- 10K documents: threshold 0.5-0.6
- 100K documents: threshold 0.6-0.7
Network Properties¶
Degree¶
Definition: Number of connections a node has
Interpretation:

- High degree = hub (connects many documents)
- Low degree = specialized (few connections)
Communities¶
Definition: Dense clusters of highly connected nodes
Algorithm: Louvain community detection (modularity optimization)
Interpretation:

- Nodes in the same community are tightly related
- Different communities represent different topics/domains
- Community size indicates topic prevalence
Example:
Community 0: Machine Learning papers (45 nodes)
Community 1: Medical texts (32 nodes)
Community 2: Legal documents (28 nodes)
Community 3: News articles (67 nodes)
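Community detection itself can be sketched with networkx's built-in Louvain implementation (networkx ≥ 2.8; the karate-club graph stands in for a real similarity graph):

import networkx as nx

G = nx.karate_club_graph()  # stand-in for the thresholded similarity graph

# Louvain community detection (modularity optimization)
communities = nx.community.louvain_communities(G, seed=42)
for i, members in enumerate(communities):
    print(f"Community {i}: {len(members)} nodes")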
Visual representation of network topology:
 Community 0 (ML)               Community 1 (Medical)
┌─────────────────────┐        ┌─────────────────────┐
│  A ─── B ─── C      │        │  G ─── H ─── I      │
│  │     │     │      │        │  │     │     │      │
│  D ─── E ─── F ─────┼────────┼─→ J ─── K ─── L     │
│        │            │ Bridge │         │           │
│        M (Hub)      │        │         N           │
│      │ │ │          │        │         │           │
│      O P Q          │        └─────────────────────┘
└─────────────────────┘
Legend:
• Nodes (A-Q): Documents
• ─── : Strong similarity edge
• Hub (M): High degree (connects many docs)
• Bridge (F-J): Connects communities (cross-domain)
• Dense clusters: Communities (related topics)
Key insights from this topology:

- Community 0 contains ML papers (tightly connected)
- Community 1 contains medical papers (separate cluster)
- Node M is a hub (survey paper? foundational concept?)
- Edge F-J is a bridge (ML application in medicine?)
Betweenness Centrality¶
Definition: How often a node lies on shortest paths between other nodes
Interpretation:

- High betweenness = bridge (connects different parts of the graph)
- Low betweenness = internal to a community
Example:
Node "optimization-theory": betweenness = 0.35
→ Bridges ML community and Operations Research community
Node "transformer-intro": betweenness = 0.02
→ Internal to NLP community, not a bridge
Clustering Coefficient¶
Definition: How connected a node's neighbors are
Interpretation:

- High clustering = tightly knit neighborhood
- Low clustering = star pattern (hub with unconnected neighbors)
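All three properties are standard graph measures. A minimal networkx sketch (again with a stand-in graph rather than a real similarity graph):

import networkx as nx

G = nx.karate_club_graph()  # stand-in for the document similarity graph

degree = dict(G.degree())                    # hubs: high degree
betweenness = nx.betweenness_centrality(G)   # bridges: high betweenness
clustering = nx.clustering(G)                # tight neighborhoods: high coefficient

hub = max(degree, key=degree.get)
print(f"node {hub}: degree={degree[hub]}, "
      f"betweenness={betweenness[hub]:.2f}, clustering={clustering[hub]:.2f}")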
Similarity Components¶
Embedding Components¶
TF-IDF Embedding¶
What: Term Frequency-Inverse Document Frequency
Pros:

- Fast (no neural network)
- Interpretable (sparse vectors)
- Works well for keywords
- No external dependencies

Cons:

- Bag-of-words (ignores order)
- No semantic understanding
- Vocabulary limited to training data
Use for: Titles, tags, short text, keywords
Sentence Transformers¶
What: Neural embeddings (BERT-based)
Pros:

- Semantic understanding
- Handles synonyms
- Contextual (order matters)
- Pre-trained models available

Cons:

- Slower than TF-IDF
- Requires model download
- Dense vectors (higher memory)
Use for: Abstracts, descriptions, full text
- type: field_embedding
  field: abstract
  model: sentence_bert
  model_name: all-MiniLM-L6-v2
  weight: 0.5
Ollama (Remote)¶
What: Remote embedding service
Pros:

- Latest models
- No local compute
- Scalable

Cons:

- Requires network
- API costs
- Latency
Use for: Production deployments, large-scale
- type: field_embedding
  field: content
  model: ollama
  model_name: nomic-embed-text
  host: http://localhost:11434
  weight: 0.7
Attribute Components¶
Jaccard Similarity¶
Formula: |A ∩ B| / |A ∪ B|
Example:
A = {'ml', 'nlp', 'transformers'}
B = {'nlp', 'transformers', 'bert'}
intersection = {'nlp', 'transformers'} # |A ∩ B| = 2
union = {'ml', 'nlp', 'transformers', 'bert'} # |A ∪ B| = 4
jaccard = 2 / 4 = 0.5
Use for: Tags, keywords, categories
Dice Similarity¶
Formula: 2|A ∩ B| / (|A| + |B|)
Example:
A = {'ml', 'nlp', 'transformers'}
B = {'nlp', 'transformers', 'bert'}
intersection = {'nlp', 'transformers'} # |A ∩ B| = 2
dice = 2 × 2 / (3 + 3) = 4 / 6 = 0.67
Use for: cases where shared elements should count more heavily than Jaccard allows
Exact Match¶
Formula: 1.0 if A == B, else 0.0
Example:

A = "transformers"
B = "transformers"
exact = 1.0   # values are equal

A = "transformers"
B = "rnn"
exact = 0.0   # any difference scores zero
Use for: Categories, IDs, exact-match fields
Network Analysis¶
Community Detection¶
Goal: Find clusters of related documents
Algorithm: Louvain (fast, hierarchical)
Usage:
# Detect communities
communities = rag.detect_communities()

# Group by community
from collections import defaultdict

comm_groups = defaultdict(list)
for node_id, comm_id in communities.items():
    comm_groups[comm_id].append(node_id)

# Analyze
for comm_id, nodes in comm_groups.items():
    # Auto-tag community
    tags = rag.auto_tag_community(comm_id)
    print(f"Community {comm_id}: {', '.join(tags)}")
Auto-tagging:

- Sample random nodes from the community
- Extract text content
- Run TF-IDF to find distinctive terms (sketched below)
- Return top-K keywords
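A rough sketch of the TF-IDF step, assuming scikit-learn is available (top_terms is a hypothetical helper, not the library's API):

from sklearn.feature_extraction.text import TfidfVectorizer

def top_terms(community_texts, k=5):
    # Rank terms by their total TF-IDF weight across the community's documents
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(community_texts)
    scores = X.sum(axis=0).A1            # total weight per term
    terms = vec.get_feature_names_out()
    return [terms[i] for i in scores.argsort()[::-1][:k]]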
Hub Identification¶
Goal: Find highly connected nodes
Criteria: Degree ≥ threshold
Usage:
# Find hubs (min 10 connections)
hubs = rag.get_hub_nodes(min_degree=10)

# Analyze hubs
for node_id in hubs:
    degree = rag.graph.degree(node_id)
    community = rag.get_community_for_node(node_id)
    print(f"{node_id}: {degree} connections in community {community}")
Why hubs matter:

- Represent foundational concepts
- Good candidates for retrieval (versatile)
- Often overview documents
- Connect specialized sub-topics
Bridge Identification¶
Goal: Find nodes connecting communities
Criteria: Betweenness centrality ≥ threshold
Usage:
# Find bridges
bridges = rag.get_bridge_nodes(min_betweenness=0.1)

# Analyze bridges
for node_id in bridges:
    # Get neighboring communities
    neighbors = rag.get_neighbors(node_id, k_hops=1)
    neighbor_comms = {rag.get_community_for_node(n) for n in neighbors}
    print(f"{node_id} connects communities: {neighbor_comms}")
Why bridges matter:

- Enable cross-domain transfer
- Often interdisciplinary work
- Good for exploration (diverse results)
- Connect different perspectives
Retrieval Strategies¶
Strategy Overview¶
| Strategy | Focus | Use Case |
|---|---|---|
| similarity | Pure cosine | Baseline, simple queries |
| community | Same cluster | Domain-specific retrieval |
| hub | High degree | Foundational documents |
| bridge | Cross-domain | Interdisciplinary queries |
| hybrid | Combination | General use (recommended) |
Similarity Strategy (Baseline)¶
Algorithm: Pure cosine similarity
When to use:

- Baseline comparison
- Very specific queries
- Small, homogeneous datasets
Community Strategy¶
Algorithm: Boost nodes in same community as top results
How it works:

1. Find top-K by similarity
2. Detect their communities
3. Boost other nodes in those communities

When to use:

- Want domain-coherent results
- Exploring a specific topic
- Need contextual documents
Hub Strategy¶
Algorithm: Boost high-degree nodes
How it works:

1. Find top-K by similarity
2. Compute degree for each
3. Boost high-degree nodes

When to use:

- Want foundational documents
- Overview/survey papers
- Versatile knowledge
Bridge Strategy¶
Algorithm: Boost high-betweenness nodes
How it works:

1. Find top-K by similarity
2. Compute betweenness for each
3. Boost bridge nodes

When to use:

- Cross-domain queries
- Want diverse perspectives
- Transfer learning tasks
Hybrid Strategy (Recommended)¶
Algorithm: Intelligent combination of all strategies
How it works:

1. Pure similarity (baseline)
2. Community boost (coherence)
3. Hub boost (foundation)
4. Bridge boost (diversity)
5. Combine with learned weights (sketched below)

When to use:

- General purpose (default)
- Want balanced results
- Let the system optimize
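The weights below are made up for illustration; the actual combination (and any learned weighting) is internal to the library. A minimal sketch of the re-ranking idea:

def hybrid_score(base_sim, same_community, degree_norm, betweenness_norm,
                 w_sim=0.6, w_comm=0.2, w_hub=0.1, w_bridge=0.1):
    # Blend pure similarity with topology signals; all inputs in [0, 1]
    return (w_sim * base_sim
            + w_comm * same_community        # 1.0 if node shares a top community
            + w_hub * degree_norm            # normalized degree (hub boost)
            + w_bridge * betweenness_norm)   # normalized betweenness (bridge boost)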
Design Patterns¶
Pattern 1: Multi-Field Products¶
# E-commerce: semantic + categorical
schema:
  name: text
  description: text
  category: text
  brand: text

embeddings:
  name_vec:
    field: name
    model: tfidf
  desc_vec:
    field: description
    model: tfidf
  product_vec:
    combine:
      - ref: name_vec
        weight: 0.3
      - ref: desc_vec
        weight: 0.7

similarities:
  text_sim:
    embedding: product_vec
  category_sim:
    field: category
    metric: exact
  brand_sim:
    field: brand
    metric: exact
  overall:
    combine:
      - ref: text_sim
        weight: 0.6
      - ref: category_sim
        weight: 0.3
      - ref: brand_sim
        weight: 0.1

network:
  edges:
    similarity: overall
    min: 0.4
Pattern 2: Research Papers¶
# Papers: precise title + semantic abstract + metadata
schema:
  title: text
  abstract: text
  tags:
    type: list
    default: []
  authors:
    type: list
    default: []

embeddings:
  title_vec:
    field: title
    model: tfidf
  abstract_vec:
    field: abstract
    model: tfidf
    chunking:
      method: sentences
      max_tokens: 512
  text_vec:
    combine:
      - ref: title_vec
        weight: 0.4
      - ref: abstract_vec
        weight: 0.6

similarities:
  text_sim:
    embedding: text_vec
  tag_sim:
    field: tags
    metric: jaccard
  author_sim:
    field: authors
    metric: jaccard
  overall:
    combine:
      - ref: text_sim
        weight: 0.7
      - ref: tag_sim
        weight: 0.2
      - ref: author_sim
        weight: 0.1

network:
  edges:
    similarity: overall
    min: 0.35
Pattern 3: Hierarchical Content¶
# Blog posts: title > summary > full text
schema:
  title: text
  summary: text
  content: text
  tags:
    type: list
    default: []

embeddings:
  title_vec:
    field: title
    model: tfidf
  summary_vec:
    field: summary
    model: tfidf
  content_vec:
    field: content
    model: tfidf
    chunking:
      method: sentences
      max_tokens: 1024
  text_vec:
    combine:
      - ref: title_vec
        weight: 0.4
      - ref: summary_vec
        weight: 0.35
      - ref: content_vec
        weight: 0.25

similarities:
  text_sim:
    embedding: text_vec
  tag_sim:
    field: tags
    metric: jaccard
  overall:
    combine:
      - ref: text_sim
        weight: 0.9
      - ref: tag_sim
        weight: 0.1

network:
  edges:
    similarity: overall
    min: 0.4
Pattern 4: Conversations¶
# Chat logs: role-weighted messages
schema:
  user_message: text
  assistant_message: text
  topic: text

embeddings:
  user_vec:
    field: user_message
    model: tfidf
  assistant_vec:
    field: assistant_message
    model: tfidf
  chat_vec:
    combine:
      - ref: user_vec
        weight: 0.6   # User message weighted higher
      - ref: assistant_vec
        weight: 0.4

similarities:
  chat_sim:
    embedding: chat_vec
  topic_sim:
    field: topic
    metric: exact
  overall:
    combine:
      - ref: chat_sim
        weight: 0.8
      - ref: topic_sim
        weight: 0.2

network:
  edges:
    similarity: overall
    min: 0.4
Summary¶
Structured Similarity:

- ✅ Field-level embeddings for fine-grained control
- ✅ Hybrid components (embeddings + attributes)
- ✅ Weighted combination with automatic normalization
- ✅ Chunking for long text
- ✅ Explainable scores

Network Topology:

- ✅ Similarity graph with threshold filtering
- ✅ Community detection for topic clusters
- ✅ Hub identification for foundational concepts
- ✅ Bridge identification for cross-domain transfer
- ✅ Topology-aware retrieval strategies

Next Steps:

- See YAML DSL Reference for complete configuration syntax
- See API Reference for programmatic usage
- See Tutorials for end-to-end examples
Complex Network RAG - Structure + Topology = Smarter Retrieval