Getting Started with Complex Network RAG¶

This guide will walk you through installing, configuring, and using Complex Network RAG from scratch. By the end, you'll understand the core concepts and be able to build your own knowledge graphs.

Table of Contents¶

Installation
Your First Knowledge Graph
Understanding the Three Interfaces
Working with Structured Documents
Network Analysis
Next Steps
Common Patterns
Troubleshooting

Learning Pathway¶

This guide follows a progressive learning path:

Level 1: Simple Text Documents
┌──────────────────────────────┐
│  Plain text → Embeddings →   │
│  Basic similarity matching   │
└──────────┬───────────────────┘
           │
           ▼
Level 2: Network Structure
┌──────────────────────────────┐
│  Similarity graph →          │
│  Communities, hubs, bridges  │
└──────────┬───────────────────┘
           │
           ▼
Level 3: Structured Documents
┌──────────────────────────────┐
│  Field-specific embeddings → │
│  Hybrid similarity (YAML)    │
└──────────┬───────────────────┘
           │
           ▼
Level 4: Production Use
┌──────────────────────────────┐
│  Choose your interface:      │
│  • REPL (exploration)        │
│  • CLI (automation)          │
│  • API (integration)         │
└──────────────────────────────┘

Time estimates:

- Level 1: 15 minutes
- Level 2: 20 minutes
- Level 3: 30 minutes
- Level 4: 20 minutes

Total: about 85 minutes to work through all four levels

Installation¶

Prerequisites¶

Python 3.8 or higher
pip package manager

Basic Installation¶

# Clone the repository
git clone https://github.com/yourusername/complex-network-rag.git
cd complex-network-rag

# Install dependencies
pip install -r requirements.txt

# Optional: Install in development mode
pip install -e .

Verify Installation¶

# Check if CLI works
network-rag version

# Run tests to ensure everything is working
pytest tests/ -v

Your First Knowledge Graph¶

Let's build a simple knowledge graph of machine learning papers.

Option 1: Using the REPL (Recommended for Beginners)¶

The REPL is perfect for learning and experimentation:

# Start the interactive shell
network-rag repl

Once in the REPL:

[no db]> db connect papers.db
✓ Connected to papers.db

papers.db> add "Attention Is All You Need - The Transformer architecture uses self-attention"
✓ Added document: doc-1

papers.db [1 docs]> add "BERT: Bidirectional transformers for language understanding"
✓ Added document: doc-2

papers.db [2 docs]> add "GPT-3: Language Models are Few-Shot Learners"
✓ Added document: doc-3

papers.db [3 docs]> build
ℹ Fitting TF-IDF vectorizer...
ℹ Building network...
✓ Network built: 3 nodes, 2 edges

papers.db [3 docs, 2 edges]> search "transformer models"
Search: transformer models
Found 3 results:

1. doc-1 (score: 0.856)
   Attention Is All You Need - The Transformer architecture uses self-attention

2. doc-2 (score: 0.742)
   BERT: Bidirectional transformers for language understanding

3. doc-3 (score: 0.621)
   GPT-3: Language Models are Few-Shot Learners

What just happened?

1. Created a new database (papers.db)
2. Added three documents about transformers
3. Built a similarity network (found 2 edges connecting similar papers)
4. Searched for "transformer models" and got ranked results
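
To make the build step less of a black box, here is a minimal, standard-library sketch of the idea behind it: turn each document into a TF-IDF vector, compare every pair with cosine similarity, and keep an edge when the score reaches a threshold. The function names, the toy documents, and the 0.3 threshold are illustrative assumptions. This is not the library's implementation, and its tokenizer and weighting may differ.

```python
import math
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter or digit."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per document id."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))
    vectors = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        # Smoothed IDF variant: ln((1 + N) / (1 + df)) + 1
        vectors[doc_id] = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_edges(docs, min_similarity):
    """Compare every pair of documents and keep pairs at or above the threshold."""
    vectors = tfidf_vectors(docs)
    edges = []
    for a, b in combinations(sorted(vectors), 2):
        similarity = cosine(vectors[a], vectors[b])
        if similarity >= min_similarity:
            edges.append((a, b, similarity))
    return edges


# Hypothetical toy documents, chosen so that the overlap is easy to see.
docs = {
    "a": "transformer attention model",
    "b": "transformer language model",
    "c": "image convolution network",
}
edges = build_edges(docs, min_similarity=0.3)
for a, b, similarity in edges:
    print(f"{a} -- {b} (similarity: {similarity:.3f})")
```

Only documents a and b share terms, so they are the only pair that gets an edge. Raising the threshold gives a sparser graph, and lowering it gives a denser one.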

Option 2: Using Python API¶

Create a file first_rag.py:

from src.network_rag import NetworkRAG

# Create a RAG instance with TF-IDF embeddings
rag = (NetworkRAG.builder()
       .with_storage('papers.db')
       .with_tfidf_embeddings()
       .with_similarity_threshold(min_similarity=0.7)
       .build())

# Add documents
rag.add('transformer', content="Attention Is All You Need - The Transformer architecture uses self-attention")
rag.add('bert', content="BERT: Bidirectional transformers for language understanding")
rag.add('gpt3', content="GPT-3: Language Models are Few-Shot Learners")

# Build the network
print("Building network...")
rag.build_network()

# Search
print("\nSearching for 'transformer models':")
results = rag.search("transformer models").top(3)

for i, result in enumerate(results, 1):
    print(f"\n{i}. {result.id} (score: {result.score:.3f})")
    print(f"   {result.content[:80]}...")

Run it:

python first_rag.py

Option 3: Using the CLI¶

# Initialize database
network-rag db init --db papers.db

# Add documents
network-rag add "Attention Is All You Need - The Transformer architecture uses self-attention" --db papers.db
network-rag add "BERT: Bidirectional transformers for language understanding" --db papers.db
network-rag add "GPT-3: Language Models are Few-Shot Learners" --db papers.db

# Build network
network-rag build --db papers.db

# Search
network-rag search "transformer models" --db papers.db --top-k 3

Understanding the Three Interfaces¶

Complex Network RAG offers three complementary interfaces. Choose based on your use case:

1. REPL - For Exploration¶

Best for:

- Learning the system
- Prototyping configurations
- Exploring existing databases
- Interactive data analysis

Strengths:

- Immediate feedback
- Contextual prompts
- Command history
- No code required

Example workflow:

> db connect :memory:           # Quick in-memory database
> add "Document 1"
> add "Document 2"
> build
> search "query"
> graph communities             # Explore network structure

2. Python API - For Integration¶

Best for:

- Production applications
- Complex workflows
- Integration with other systems
- Programmatic control

Strengths:

- Type safety
- Method chaining
- Rich objects
- Batch operations

Example workflow:

# Fluent API with builder pattern
rag = (NetworkRAG.builder()
       .with_storage('data.db')
       .from_config('config.yaml')
       .build())

# Batch add documents
with rag.batch() as batch:
    for doc in documents:
        batch.add(doc['content'], **doc['metadata'])

# Advanced queries
results = (rag.search(query)
           .with_strategy('hybrid')
           .filter(category='ML')
           .top(10))

3. CLI - For Automation¶

Best for:

- Scripts and pipelines
- Batch processing
- CI/CD integration
- Command-line workflows

Strengths:

- Shell integration
- Scriptable
- Standard Unix tools
- Remote execution

Example workflow:

#!/bin/bash
# Shell script for batch import (the shebang must be the first line of the file)
network-rag db init --db papers.db
network-rag import data.jsonl --format jsonl --db papers.db
network-rag build --db papers.db
network-rag communities --db papers.db > communities.txt

Working with Structured Documents¶

The examples so far used plain text documents. Complex Network RAG is most useful with structured documents, where different fields should contribute differently to similarity.

Why Structured Similarity?¶

Consider research papers:

- Title: Should match very precisely (high weight)
- Abstract: Captures main ideas (medium weight, handle long text)
- Tags: Exact set overlap (Jaccard similarity, not embedding)
- Authors: Exact match or not (boolean)
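
As a quick illustration of the two non-embedding comparisons in that list, the sketch below computes Jaccard similarity for tag lists and a boolean exact match for author lists. The function names are hypothetical helpers for this example, not library functions. The tag lists are the ones used in the structured documents later in this guide.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections treated as sets: |A & B| / |A | B|."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0  # Convention chosen for this sketch when both sets are empty
    return len(set_a & set_b) / len(set_a | set_b)


def exact_match(a, b):
    """Boolean comparison: 1.0 if both collections hold the same items, else 0.0."""
    return 1.0 if set(a) == set(b) else 0.0


tags_a = ["transformers", "attention", "seq2seq"]
tags_b = ["transformers", "pretraining", "nlp"]

# One shared tag out of five distinct tags
tag_similarity = jaccard(tags_a, tags_b)
print(f"Tag similarity: {tag_similarity:.2f}")

author_similarity = exact_match(["Vaswani", "Shazeer"], ["Devlin", "Chang"])
print(f"Author match: {author_similarity:.1f}")
```

Jaccard rewards partial overlap, which suits tags. The exact-match version is all or nothing.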

Creating a YAML Configuration¶

Create config/my_papers.yaml:

document:
  fields:
    # Text fields that will be embedded
    - name: title
      type: text
      embed: true

    - name: abstract
      type: text
      embed: true

    # Metadata fields (no embedding needed)
    - name: tags
      type: list

    - name: authors
      type: list

similarity:
  components:
    # Title embedding (30% weight)
    - type: field_embedding
      field: title
      model: tfidf
      weight: 0.3
      min_similarity: 0.5

    # Abstract embedding with chunking (40% weight)
    - type: field_embedding
      field: abstract
      model: tfidf
      weight: 0.4
      min_similarity: 0.3
      chunking:
        method: sentences
        max_tokens: 512
        overlap: 50

    # Tag similarity using Jaccard (20% weight)
    - type: attribute_similarity
      field: tags
      metric: jaccard
      weight: 0.2
      min_similarity: 0.1

    # Author similarity using Jaccard (10% weight)
    - type: attribute_similarity
      field: authors
      metric: jaccard
      weight: 0.1
      min_similarity: 0.0

  # Overall threshold for creating edges
  min_combined_similarity: 0.4

Using the Configuration¶

With Python:

from src.network_rag import NetworkRAG

# Load configuration
rag = (NetworkRAG.builder()
       .with_storage('papers.db')
       .from_config('config/my_papers.yaml')
       .build())

# Add structured documents
rag.add('paper1', document={
    'title': 'Attention Is All You Need',
    'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks...',
    'tags': ['transformers', 'attention', 'seq2seq'],
    'authors': ['Vaswani', 'Shazeer', 'Parmar', 'Uszkoreit']
})

rag.add('paper2', document={
    'title': 'BERT: Pre-training of Deep Bidirectional Transformers',
    'abstract': 'We introduce a new language representation model called BERT...',
    'tags': ['transformers', 'pretraining', 'nlp'],
    'authors': ['Devlin', 'Chang', 'Lee', 'Toutanova']
})

# Build and search
rag.build_network()
results = rag.search('transformer architecture').top(10)

With REPL:

[no db]> config load config/my_papers.yaml
✓ Loaded configuration from config/my_papers.yaml

[no db]> db connect papers.db
✓ Connected to papers.db

papers.db> # Now add documents as before...

With CLI:

network-rag db init --config config/my_papers.yaml --db papers.db
network-rag add --config config/my_papers.yaml --db papers.db # ...

How It Works¶

When you add a document:

1. Field-level embedding: Each embedded field gets its own embedding
2. Chunking: Long fields are automatically split (abstract → sentences)
3. Storage: Embeddings stored hierarchically (node → field → chunks)
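
The chunking step can be pictured with a small sketch. It assumes that tokens are whitespace-separated words, that sentences end in ".", "!" or "?", and that overlap is a token budget filled with whole trailing sentences from the previous chunk. The library's actual tokenizer and overlap rules are not described in this guide, so treat the code as an illustration of the max_tokens and overlap settings rather than the real chunker.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_sentences(text, max_tokens=512, overlap=50):
    """Greedily pack whole sentences into chunks of at most max_tokens words.

    Each new chunk starts with trailing sentences of the previous chunk,
    as long as they fit within the overlap budget. A single sentence that
    is longer than max_tokens becomes its own oversized chunk.
    """
    chunks = []
    current, current_len = [], 0
    for sentence in split_sentences(text):
        n_tokens = len(sentence.split())
        if current and current_len + n_tokens > max_tokens:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward, newest first, within the budget.
            carried, carried_len = [], 0
            for previous in reversed(current):
                previous_len = len(previous.split())
                if carried_len + previous_len > overlap:
                    break
                carried.insert(0, previous)
                carried_len += previous_len
            # Drop carried sentences that would push the new chunk over budget.
            while carried and carried_len + n_tokens > max_tokens:
                carried_len -= len(carried.pop(0).split())
            current, current_len = carried, carried_len
        current.append(sentence)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
chunks = chunk_sentences(text, max_tokens=6, overlap=3)
for i, chunk in enumerate(chunks, 1):
    print(f"chunk {i}: {chunk}")
```

With a budget of 6 tokens and an overlap of 3, every chunk holds two sentences and repeats the last sentence of the chunk before it.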

When building the network:

1. Component similarity: Each component computes its similarity
2. Weighted combination: Components combined with weights
3. Threshold: Edge created if combined similarity ≥ threshold

Example:

Paper A vs Paper B:
  title similarity:    0.8 × 0.3 = 0.24
  abstract similarity: 0.6 × 0.4 = 0.24
  tag similarity:      0.5 × 0.2 = 0.10
  author similarity:   0.0 × 0.1 = 0.00
  ────────────────────────────────────
  Combined:                       0.58  ✓ (≥ 0.4, create edge)
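
The same arithmetic can be written as a few lines of Python. The weights and the 0.4 threshold come from the YAML configuration above. The function names are hypothetical, and the per-component min_similarity settings are left out because this guide does not spell out how they are applied.

```python
# Weights and threshold copied from config/my_papers.yaml
WEIGHTS = {"title": 0.3, "abstract": 0.4, "tags": 0.2, "authors": 0.1}
MIN_COMBINED_SIMILARITY = 0.4


def combined_similarity(component_scores, weights=WEIGHTS):
    """Weighted sum of per-component similarities; missing components count as 0."""
    return sum(weight * component_scores.get(name, 0.0) for name, weight in weights.items())


def should_create_edge(component_scores, threshold=MIN_COMBINED_SIMILARITY):
    """An edge is created when the combined similarity reaches the threshold."""
    return combined_similarity(component_scores) >= threshold


# Paper A vs Paper B, using the component scores from the example above
scores = {"title": 0.8, "abstract": 0.6, "tags": 0.5, "authors": 0.0}
combined = combined_similarity(scores)
print(f"Combined: {combined:.2f}")
print(f"Create edge: {should_create_edge(scores)}")
```

Because the weights sum to 1.0, the combined score stays between 0 and 1 whenever each component score does.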

Network Analysis¶

Once you've built your network, you can analyze its structure:

Detecting Communities¶

REPL:

papers.db [10 docs, 24 edges]> graph communities

Detected 3 communities:

Community 0: 5 nodes (NLP papers)
  - transformer_paper
  - bert_paper
  - gpt3_paper
  - ...

Community 1: 3 nodes (Computer Vision papers)
  - vit_paper
  - resnet_paper
  - ...

Community 2: 2 nodes (Speech Recognition)
  - wav2vec_paper
  - conformer_paper

Python:

# Detect communities
communities = rag.detect_communities()

# Group nodes by community
from collections import defaultdict
comm_groups = defaultdict(list)
for node_id, comm_id in communities.items():
    comm_groups[comm_id].append(node_id)

# Analyze each community
for comm_id, nodes in comm_groups.items():
    print(f"\nCommunity {comm_id}: {len(nodes)} papers")

    # Auto-tag community
    tags = rag.auto_tag_community(comm_id, n_samples=10)
    print(f"  Keywords: {', '.join(tags)}")

Finding Hubs¶

Hubs are highly connected nodes that often represent foundational concepts:

# Find hub nodes (min 10 connections)
hubs = rag.get_hub_nodes(min_degree=10)

for node_id in hubs:
    node = rag.storage.get_node(node_id)
    degree = rag.graph.degree(node_id)
    print(f"{node_id}: {degree} connections")
    print(f"  {node['content_text'][:80]}...")
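
Conceptually, a hub query counts how many edges touch each node and keeps the nodes at or above a minimum degree. The sketch below does that on a plain edge list. The find_hubs helper and the edges are made up for illustration, and get_hub_nodes may be implemented differently.

```python
from collections import Counter


def find_hubs(edges, min_degree):
    """Return (node, degree) pairs with degree >= min_degree, highest degree first."""
    degree = Counter()
    for a, b in edges:
        # Each undirected edge adds one connection to both endpoints.
        degree[a] += 1
        degree[b] += 1
    hubs = [(node, count) for node, count in degree.items() if count >= min_degree]
    # Sort by descending degree, then by name so the order is stable.
    return sorted(hubs, key=lambda item: (-item[1], item[0]))


# Hypothetical edges between a few papers
edges = [
    ("transformer_paper", "bert_paper"),
    ("transformer_paper", "gpt3_paper"),
    ("transformer_paper", "vit_paper"),
    ("bert_paper", "gpt3_paper"),
]
hubs = find_hubs(edges, min_degree=2)
for node, count in hubs:
    print(f"{node}: {count} connections")
```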

Finding Bridges¶

Bridges connect different communities and enable knowledge transfer:

# Find bridge nodes (high betweenness centrality)
bridges = rag.get_bridge_nodes(min_betweenness=0.1)

for node_id in bridges:
    # Get communities this bridge connects
    neighbors = rag.get_neighbors(node_id, k_hops=1)
    neighbor_communities = {rag.get_community_for_node(n) for n in neighbors}

    print(f"{node_id} bridges communities: {neighbor_communities}")
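
For readers curious about what betweenness centrality measures, here is a self-contained sketch of Brandes' algorithm for an unweighted, undirected graph. It counts, for every node, the share of shortest paths between other node pairs that pass through it, normalized to the range 0 to 1. The adjacency dict and node names are hypothetical, and this guide does not say how get_bridge_nodes computes its scores.

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Normalized betweenness for an unweighted, undirected graph (Brandes' algorithm)."""
    nodes = list(adjacency)
    score = {v: 0.0 for v in nodes}
    for source in nodes:
        # Phase 1: breadth-first search that counts shortest paths from source.
        visit_order = []
        predecessors = {v: [] for v in nodes}
        n_paths = {v: 0 for v in nodes}
        n_paths[source] = 1
        distance = {v: -1 for v in nodes}
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            visit_order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:  # First time w is reached
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:  # v lies on a shortest path to w
                    n_paths[w] += n_paths[v]
                    predecessors[w].append(v)
        # Phase 2: walk back from the farthest nodes, accumulating dependencies.
        dependency = {v: 0.0 for v in nodes}
        while visit_order:
            w = visit_order.pop()
            for v in predecessors[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1.0 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Every undirected pair was counted from both ends, and there are
    # (n - 1)(n - 2) / 2 pairs that exclude a given node, so divide by (n - 1)(n - 2).
    n = len(nodes)
    if n > 2:
        scale = 1.0 / ((n - 1) * (n - 2))
        score = {v: s * scale for v, s in score.items()}
    return score


# Two small clusters that are only connected through "survey"
adjacency = {
    "nlp1": ["nlp2", "survey"],
    "nlp2": ["nlp1", "survey"],
    "survey": ["nlp1", "nlp2", "cv1", "cv2"],
    "cv1": ["survey", "cv2"],
    "cv2": ["survey", "cv1"],
}
scores = betweenness_centrality(adjacency)
bridges = [node for node, value in scores.items() if value >= 0.1]
print(bridges)
```

In this toy graph, every shortest path between the two clusters runs through the "survey" node, so it is the only node that clears a 0.1 threshold.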

Network Statistics¶

# Get graph
graph = rag.graph

# Basic stats
print(f"Nodes: {len(graph.nodes())}")
print(f"Edges: {len(graph.edges())}")

# Density
if len(graph.nodes()) > 1:
    max_edges = len(graph.nodes()) * (len(graph.nodes()) - 1) / 2
    density = len(graph.edges()) / max_edges
    print(f"Density: {density:.3f}")

# Average degree (guarded: an empty graph has no nodes to average over)
if len(graph.nodes()) > 0:
    avg_degree = 2 * len(graph.edges()) / len(graph.nodes())
    print(f"Average degree: {avg_degree:.1f}")

Next Steps¶

Tutorials¶

For complete end-to-end examples, see:

- Tutorials - Research papers, products, chat logs
- examples/ - Working code examples

Deep Dives¶

To understand the system better:

- Core Concepts - Structured similarity and network topology explained
- YAML DSL Reference - Complete YAML configuration reference
- Chunking Guide - Text chunking strategies

API References¶

For detailed API documentation:

- API Reference - Fluent API and NetworkRAG class
- CLI Reference - All CLI commands
- Fluent API Guide - Advanced patterns

Examples in the Repository¶

Check out these working examples:

# Basic usage
python examples/basic_usage.py

# Fluent API
python examples/fluent_api.py

# Structured similarity
python examples/structured_tutorial.py

# API comparison
python examples/api_comparison.py

Configuration Templates¶

The config/ directory has ready-to-use configurations:

papers_minimal.yaml - Simple paper configuration
papers_full.yaml - Complete paper configuration with chunking
products_basic.yaml - E-commerce products
conversations.yaml - Chat messages with role weighting

Common Patterns¶

Batch Import from JSONL¶

import json
from src.network_rag import NetworkRAG

rag = NetworkRAG.builder().from_config('config.yaml').build()

# Batch import
with rag.batch() as batch:
    with open('documents.jsonl') as f:
        for line in f:
            doc = json.loads(line)
            batch.add(doc['content'], id=doc['id'], **doc['metadata'])
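
The import loop above expects every line of documents.jsonl to be a JSON object with id, content, and metadata keys. The sketch below reads such a file defensively, skipping blank lines and reporting the line number of a bad record before anything reaches the database. The read_jsonl helper and the sample records are assumptions for illustration.

```python
import io
import json

# Keys that the import loop above reads from each record
REQUIRED_KEYS = ("id", "content", "metadata")


def read_jsonl(stream):
    """Yield one validated record per non-empty line of a JSONL stream."""
    for line_number, line in enumerate(stream, 1):
        line = line.strip()
        if not line:
            continue  # Tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number}: invalid JSON ({exc.msg})") from exc
        if not isinstance(record, dict):
            raise ValueError(f"line {line_number}: expected a JSON object")
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_number}: missing keys {missing}")
        yield record


# In-memory stand-in for documents.jsonl (hypothetical sample data)
sample = io.StringIO(
    '{"id": "paper1", "content": "Attention Is All You Need", "metadata": {"year": 2017}}\n'
    "\n"
    '{"id": "paper2", "content": "BERT: Pre-training of Deep Bidirectional Transformers", "metadata": {"year": 2018}}\n'
)
records = list(read_jsonl(sample))
for record in records:
    print(record["id"], record["metadata"])
```

With a real file, pass the open file handle instead of the StringIO object and call batch.add inside the loop.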

Incremental Updates¶

# Load existing database
rag = NetworkRAG.builder().with_storage('existing.db').build()

# Add new documents
new_docs = get_new_documents()
for doc in new_docs:
    rag.add(doc['content'], **doc['metadata'])

# Rebuild network (only recomputes for new docs)
rag.build_network(rebuild=True)

Custom Retrieval¶

# Combine multiple strategies
results = (rag.search(query)
           .with_strategy('hybrid')
           .in_community(0)              # Restrict to community 0
           .expand_neighbors(hops=2)     # Include 2-hop neighbors
           .prioritize_hubs()            # Boost hub nodes
           .filter(year=2024)            # Metadata filter
           .top(20))
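
Neighbor expansion is the least obvious step in that chain. One plausible reading of hops=2 is "every node reachable within two edges of a starting node". The sketch below shows that idea as a breadth-first search over a hypothetical adjacency dict. It is not the library's code, and expand_neighbors may rank or filter the expanded nodes differently.

```python
from collections import deque


def neighbors_within(adjacency, start, hops):
    """Return {node: distance} for every node within `hops` edges of start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == hops:
            continue  # Do not expand past the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    del distance[start]  # The starting node is not its own neighbor
    return distance


# A simple chain: a - b - c - d
adjacency = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c"],
}
expanded = neighbors_within(adjacency, "a", hops=2)
print(expanded)
```

Node d is three edges away from a, so a two-hop expansion leaves it out.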

Troubleshooting¶

Common Issues¶

"No module named 'src'"

- Make sure you're in the project root directory
- Or install with pip install -e .

"Database locked"

- SQLite allows only one writer at a time, so concurrent writes can fail
- Use separate processes or implement locking

"Out of memory during build_network()"

- Reduce dataset size
- Increase min_similarity threshold (fewer edges)
- Use chunking to reduce embedding dimensions

"Search returns no results"

- Check if network is built: rag.build_network()
- Lower min_similarity threshold
- Verify documents were added correctly

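For the "Database locked" case, it can help to see how a SQLite connection is usually made more tolerant of contention. The sketch below uses only Python's standard sqlite3 module: a busy timeout makes a writer wait for a lock instead of failing at once, and write-ahead logging lets readers continue while one writer is active. Whether Complex Network RAG exposes or already sets these options is not covered in this guide, so the open_for_writing helper is an assumption that applies to connections you manage yourself.

```python
import os
import sqlite3
import tempfile


def open_for_writing(path, timeout_seconds=30.0):
    """Open a SQLite connection that waits for locks instead of failing immediately."""
    # timeout: how long to wait for another connection's lock to clear
    conn = sqlite3.connect(path, timeout=timeout_seconds)
    # Ask for write-ahead logging; SQLite replies with the mode actually in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    return conn, mode


# Demonstration on a throwaway database file
with tempfile.TemporaryDirectory() as tmp_dir:
    conn, mode = open_for_writing(os.path.join(tmp_dir, "demo.db"))
    with conn:  # Commits on success, rolls back on error
        conn.execute("CREATE TABLE notes (body TEXT)")
        conn.execute("INSERT INTO notes VALUES (?)", ("hello",))
    rows = conn.execute("SELECT body FROM notes").fetchall()
    conn.close()

print(f"journal mode: {mode}, rows: {rows}")
```

Even with these settings, only one writer makes progress at a time, so funnelling all writes through a single process is still the simplest design.
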
Getting Help¶

Check the issue tracker
Read Core Concepts for a deeper understanding
Review examples/ for working code
See the Claude Code Guide for development guidance

Summary¶

You've learned:

- ✅ How to install Complex Network RAG
- ✅ The three interfaces (REPL, Python, CLI)
- ✅ How to build a simple knowledge graph
- ✅ Working with structured documents via YAML
- ✅ Network analysis (communities, hubs, bridges)
- ✅ Where to find more documentation

Next: Dive into Core Concepts to understand structured similarity and network topology in depth, or explore Tutorials for complete examples.

Welcome to Complex Network RAG! Build smarter knowledge graphs with topology-aware retrieval.