Getting Started with Complex Network RAG

This guide will walk you through installing, configuring, and using Complex Network RAG from scratch. By the end, you'll understand the core concepts and be able to build your own knowledge graphs.

Table of Contents

  1. Installation
  2. Your First Knowledge Graph
  3. Understanding the Three Interfaces
  4. Working with Structured Documents
  5. Network Analysis
  6. Next Steps

Learning Pathway

This guide follows a progressive learning path:

Level 1: Simple Text Documents
┌──────────────────────────────┐
│  Plain text → Embeddings →   │
│  Basic similarity matching   │
└──────────┬───────────────────┘
Level 2: Network Structure
┌──────────────────────────────┐
│  Similarity graph →          │
│  Communities, hubs, bridges  │
└──────────┬───────────────────┘
Level 3: Structured Documents
┌──────────────────────────────┐
│  Field-specific embeddings → │
│  Hybrid similarity (YAML)    │
└──────────┬───────────────────┘
Level 4: Production Use
┌──────────────────────────────┐
│  Choose your interface:      │
│  • REPL (exploration)        │
│  • CLI (automation)          │
│  • API (integration)         │
└──────────────────────────────┘

Time estimates:

  • Level 1: 15 minutes
  • Level 2: 20 minutes
  • Level 3: 30 minutes
  • Level 4: 20 minutes

Total: about 85 minutes to work through all four levels

Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager

Basic Installation

# Clone the repository
git clone https://github.com/yourusername/complex-network-rag.git
cd complex-network-rag

# Install dependencies
pip install -r requirements.txt

# Optional: Install in development mode
pip install -e .

Verify Installation

# Check if CLI works
network-rag version

# Run tests to ensure everything is working
pytest tests/ -v

Your First Knowledge Graph

Let's build a simple knowledge graph of machine learning papers.

Option 1: Using the REPL

The REPL is well suited to learning and experimentation:

# Start the interactive shell
network-rag repl

Once in the REPL:

[no db]> db connect papers.db
✓ Connected to papers.db

papers.db> add "Attention Is All You Need - The Transformer architecture uses self-attention"
✓ Added document: doc-1

papers.db [1 docs]> add "BERT: Bidirectional transformers for language understanding"
✓ Added document: doc-2

papers.db [2 docs]> add "GPT-3: Language Models are Few-Shot Learners"
✓ Added document: doc-3

papers.db [3 docs]> build
ℹ Fitting TF-IDF vectorizer...
ℹ Building network...
✓ Network built: 3 nodes, 2 edges

papers.db [3 docs, 2 edges]> search "transformer models"
Search: transformer models
Found 3 results:

1. doc-1 (score: 0.856)
   Attention Is All You Need - The Transformer architecture uses self-attention

2. doc-2 (score: 0.742)
   BERT: Bidirectional transformers for language understanding

3. doc-3 (score: 0.621)
   GPT-3: Language Models are Few-Shot Learners

What just happened?

  1. Created a new database (papers.db)
  2. Added three documents about transformers
  3. Built a similarity network (found 2 edges connecting similar papers)
  4. Searched for "transformer models" and got ranked results
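To make the build step less of a black box, here is a minimal pure-Python sketch of the general idea: turn each document into a TF-IDF vector, compare every pair by cosine similarity, and keep an edge when the score clears a threshold. The tokenizer, the smoothed IDF formula, the helper names, and the 0.05 threshold are all illustrative assumptions, not the library's actual implementation, so the scores will not match the REPL transcript above.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Map each doc id to a sparse {term: weight} TF-IDF vector."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(docs)
    doc_freq = Counter(t for toks in tokens.values() for t in set(toks))
    # Smoothed IDF; one common variant, chosen here as an assumption.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return {
        doc_id: {t: count * idf[t] for t, count in Counter(toks).items()}
        for doc_id, toks in tokens.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_edges(docs, min_similarity):
    """Return (u, v, score) for every pair whose similarity clears the threshold."""
    vectors = tfidf_vectors(docs)
    edges = []
    for u, v in combinations(sorted(vectors), 2):
        score = cosine(vectors[u], vectors[v])
        if score >= min_similarity:
            edges.append((u, v, score))
    return edges


docs = {
    "doc-1": "Attention Is All You Need - The Transformer architecture uses self-attention",
    "doc-2": "BERT: Bidirectional transformers for language understanding",
    "doc-3": "GPT-3: Language Models are Few-Shot Learners",
}

for u, v, score in build_edges(docs, min_similarity=0.05):
    print(f"{u} -- {v}: {score:.3f}")
```

Because this sketch matches exact tokens only, "Transformer" and "transformers" count as different terms. A real vectorizer may apply stemming or n-grams, which is one reason its edge count can differ.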

Option 2: Using Python API

Create a file first_rag.py:

from src.network_rag import NetworkRAG

# Create a RAG instance with TF-IDF embeddings
rag = (NetworkRAG.builder()
       .with_storage('papers.db')
       .with_tfidf_embeddings()
       .with_similarity_threshold(min_similarity=0.7)
       .build())

# Add documents
rag.add('transformer', content="Attention Is All You Need - The Transformer architecture uses self-attention")
rag.add('bert', content="BERT: Bidirectional transformers for language understanding")
rag.add('gpt3', content="GPT-3: Language Models are Few-Shot Learners")

# Build the network
print("Building network...")
rag.build_network()

# Search
print("\nSearching for 'transformer models':")
results = rag.search("transformer models").top(3)

for i, result in enumerate(results, 1):
    print(f"\n{i}. {result.id} (score: {result.score:.3f})")
    print(f"   {result.content[:80]}...")

Run it:

python first_rag.py

Option 3: Using the CLI

# Initialize database
network-rag db init --db papers.db

# Add documents
network-rag add "Attention Is All You Need - The Transformer architecture uses self-attention" --db papers.db
network-rag add "BERT: Bidirectional transformers for language understanding" --db papers.db
network-rag add "GPT-3: Language Models are Few-Shot Learners" --db papers.db

# Build network
network-rag build --db papers.db

# Search
network-rag search "transformer models" --db papers.db --top-k 3

Understanding the Three Interfaces

Complex Network RAG offers three complementary interfaces. Choose based on your use case:

1. REPL - For Exploration

Best for:

  • Learning the system
  • Prototyping configurations
  • Exploring existing databases
  • Interactive data analysis

Strengths:

  • Immediate feedback
  • Contextual prompts
  • Command history
  • No code required

Example workflow:

> db connect :memory:           # Quick in-memory database
> add "Document 1"
> add "Document 2"
> build
> search "query"
> graph communities             # Explore network structure

2. Python API - For Integration

Best for:

  • Production applications
  • Complex workflows
  • Integration with other systems
  • Programmatic control

Strengths:

  • Type safety
  • Method chaining
  • Rich objects
  • Batch operations

Example workflow:

# Fluent API with builder pattern
rag = (NetworkRAG.builder()
       .with_storage('data.db')
       .from_config('config.yaml')
       .build())

# Batch add documents
with rag.batch() as batch:
    for doc in documents:
        batch.add(doc['content'], **doc['metadata'])

# Advanced queries
results = (rag.search(query)
           .with_strategy('hybrid')
           .filter(category='ML')
           .top(10))

3. CLI - For Automation

Best for:

  • Scripts and pipelines
  • Batch processing
  • CI/CD integration
  • Command-line workflows

Strengths:

  • Shell integration
  • Scriptable
  • Standard Unix tools
  • Remote execution

Example workflow:

#!/bin/bash
# Shell script for batch import
network-rag db init --db papers.db
network-rag import data.jsonl --format jsonl --db papers.db
network-rag build --db papers.db
network-rag communities --db papers.db > communities.txt

Working with Structured Documents

So far we've used simple text documents. Complex Network RAG is most useful with structured documents, where different fields should contribute differently to similarity.

Why Structured Similarity?

Consider research papers:

  • Title: Should match very precisely (high weight)
  • Abstract: Captures main ideas (medium weight, handle long text)
  • Tags: Exact set overlap (Jaccard similarity, not embedding)
  • Authors: Exact match or not (boolean)
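The two non-embedding comparisons in that list are simple enough to sketch directly. The snippet below shows Jaccard similarity for tag sets and a boolean exact match for author lists. The function names, and the choice to score two empty sets as 0.0, are assumptions made for illustration rather than the library's documented behavior.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0  # assumed convention for two empty sets
    return len(set_a & set_b) / len(set_a | set_b)


def exact_match(a, b):
    """1.0 when both collections hold the same items, otherwise 0.0."""
    return 1.0 if set(a) == set(b) else 0.0


tags_a = ["transformers", "attention", "seq2seq"]
tags_b = ["transformers", "pretraining", "nlp"]

# One shared tag out of five distinct tags.
print(jaccard(tags_a, tags_b))                  # 0.2
print(exact_match(["Vaswani"], ["Devlin"]))     # 0.0
```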

Creating a YAML Configuration

Create config/my_papers.yaml:

document:
  fields:
    # Text fields that will be embedded
    - name: title
      type: text
      embed: true

    - name: abstract
      type: text
      embed: true

    # Metadata fields (no embedding needed)
    - name: tags
      type: list

    - name: authors
      type: list

similarity:
  components:
    # Title embedding (30% weight)
    - type: field_embedding
      field: title
      model: tfidf
      weight: 0.3
      min_similarity: 0.5

    # Abstract embedding with chunking (40% weight)
    - type: field_embedding
      field: abstract
      model: tfidf
      weight: 0.4
      min_similarity: 0.3
      chunking:
        method: sentences
        max_tokens: 512
        overlap: 50

    # Tag similarity using Jaccard (20% weight)
    - type: attribute_similarity
      field: tags
      metric: jaccard
      weight: 0.2
      min_similarity: 0.1

    # Author similarity using Jaccard (10% weight)
    - type: attribute_similarity
      field: authors
      metric: jaccard
      weight: 0.1
      min_similarity: 0.0

  # Overall threshold for creating edges
  min_combined_similarity: 0.4

Using the Configuration

With Python:

from src.network_rag import NetworkRAG

# Load configuration
rag = (NetworkRAG.builder()
       .with_storage('papers.db')
       .from_config('config/my_papers.yaml')
       .build())

# Add structured documents
rag.add('paper1', document={
    'title': 'Attention Is All You Need',
    'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks...',
    'tags': ['transformers', 'attention', 'seq2seq'],
    'authors': ['Vaswani', 'Shazeer', 'Parmar', 'Uszkoreit']
})

rag.add('paper2', document={
    'title': 'BERT: Pre-training of Deep Bidirectional Transformers',
    'abstract': 'We introduce a new language representation model called BERT...',
    'tags': ['transformers', 'pretraining', 'nlp'],
    'authors': ['Devlin', 'Chang', 'Lee', 'Toutanova']
})

# Build and search
rag.build_network()
results = rag.search('transformer architecture').top(10)

With REPL:

[no db]> config load config/my_papers.yaml
✓ Loaded configuration from config/my_papers.yaml

[no db]> db connect papers.db
✓ Connected to papers.db

papers.db> # Now add documents as before...

With CLI:

network-rag db init --config config/my_papers.yaml --db papers.db
network-rag add --config config/my_papers.yaml --db papers.db # ...

How It Works

When you add a document:

  1. Field-level embedding: Each embedded field gets its own embedding
  2. Chunking: Long fields are automatically split (abstract → sentences)
  3. Storage: Embeddings stored hierarchically (node → field → chunks)
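The chunking step can be pictured with a short sketch. It packs whole sentences into chunks of at most max_tokens and carries trailing sentences worth up to overlap tokens into the next chunk, mirroring the max_tokens and overlap settings in the YAML above. The regex sentence splitter, the use of whitespace-separated words as a stand-in for tokens, and the function names are assumptions; the library's own chunker may behave differently.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_sentences(text, max_tokens=512, overlap=50):
    """Group sentences into chunks, repeating trailing sentences as overlap."""
    chunks, current, current_len = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())  # whitespace words stand in for tokens
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry trailing sentences totalling at most `overlap` tokens.
            carried, carried_len = [], 0
            for prev in reversed(current):
                m = len(prev.split())
                if carried_len + m > overlap:
                    break
                carried.insert(0, prev)
                carried_len += m
            # Drop the overlap if it would push the new chunk over the limit.
            if carried_len + n > max_tokens:
                carried, carried_len = [], 0
            current, current_len = carried, carried_len
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
for chunk in chunk_sentences(text, max_tokens=6, overlap=3):
    print(chunk)
```

A single sentence longer than max_tokens ends up as its own oversized chunk in this sketch; a production chunker would need to split it further.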

When building the network:

  1. Component similarity: Each component computes its similarity
  2. Weighted combination: Components combined with weights
  3. Threshold: Edge created if combined similarity ≥ threshold

Example:

Paper A vs Paper B:
  title similarity:    0.8 × 0.3 = 0.24
  abstract similarity: 0.6 × 0.4 = 0.24
  tag similarity:      0.5 × 0.2 = 0.10
  author similarity:   0.0 × 0.1 = 0.00
  ────────────────────────────────────
  Combined:                       0.58  ✓ (≥ 0.4, create edge)
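The same arithmetic can be written as a few lines of Python. This is only a sketch of the weighted sum and threshold check described above: the dictionary layout and function name are hypothetical, and it ignores the per-component min_similarity values from the YAML because this guide does not spell out how they are applied.

```python
# Weights and threshold taken from config/my_papers.yaml above.
WEIGHTS = {"title": 0.3, "abstract": 0.4, "tags": 0.2, "authors": 0.1}
MIN_COMBINED_SIMILARITY = 0.4


def combined_similarity(scores, weights):
    """Weighted sum of per-component similarity scores."""
    return sum(weights[name] * scores.get(name, 0.0) for name in weights)


# Per-component scores for Paper A vs Paper B from the worked example.
scores = {"title": 0.8, "abstract": 0.6, "tags": 0.5, "authors": 0.0}

combined = combined_similarity(scores, WEIGHTS)
create_edge = combined >= MIN_COMBINED_SIMILARITY
print(f"combined={combined:.2f}, create_edge={create_edge}")
# combined=0.58, create_edge=True
```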

Network Analysis

Once you've built your network, you can analyze its structure:

Detecting Communities

REPL:

papers.db [10 docs, 24 edges]> graph communities

Detected 3 communities:

Community 0: 5 nodes (NLP papers)
  - transformer_paper
  - bert_paper
  - gpt3_paper
  - ...

Community 1: 3 nodes (Computer Vision papers)
  - vit_paper
  - resnet_paper
  - ...

Community 2: 2 nodes (Speech Recognition)
  - wav2vec_paper
  - conformer_paper

Python:

# Detect communities
communities = rag.detect_communities()

# Group nodes by community
from collections import defaultdict
comm_groups = defaultdict(list)
for node_id, comm_id in communities.items():
    comm_groups[comm_id].append(node_id)

# Analyze each community
for comm_id, nodes in comm_groups.items():
    print(f"\nCommunity {comm_id}: {len(nodes)} papers")

    # Auto-tag community
    tags = rag.auto_tag_community(comm_id, n_samples=10)
    print(f"  Keywords: {', '.join(tags)}")

Finding Hubs

Hubs are highly connected nodes that often represent foundational concepts:

# Find hub nodes (min 10 connections)
hubs = rag.get_hub_nodes(min_degree=10)

for node_id in hubs:
    node = rag.storage.get_node(node_id)
    degree = rag.graph.degree(node_id)
    print(f"{node_id}: {degree} connections")
    print(f"  {node['content_text'][:80]}...")
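For intuition about what a degree cutoff like min_degree selects, the following standalone sketch counts connections from a plain edge list. It does not use the library at all: find_hubs and the toy edge list are hypothetical, and get_hub_nodes may apply additional criteria.

```python
from collections import Counter


def find_hubs(edges, min_degree):
    """Return (node, degree) pairs with degree >= min_degree, highest first."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    hubs = [(node, d) for node, d in degree.items() if d >= min_degree]
    return sorted(hubs, key=lambda item: (-item[1], item[0]))


# Toy undirected similarity graph.
edges = [
    ("survey", "bert"),
    ("survey", "gpt3"),
    ("survey", "vit"),
    ("bert", "gpt3"),
]

print(find_hubs(edges, min_degree=3))  # [('survey', 3)]
```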

Finding Bridges

Bridges connect different communities and enable knowledge transfer:

# Find bridge nodes (high betweenness centrality)
bridges = rag.get_bridge_nodes(min_betweenness=0.1)

for node_id in bridges:
    # Get communities this bridge connects
    neighbors = rag.get_neighbors(node_id, k_hops=1)
    neighbor_communities = {rag.get_community_for_node(n) for n in neighbors}

    print(f"{node_id} bridges communities: {neighbor_communities}")
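Betweenness centrality measures how often a node sits on shortest paths between other nodes, which is why it highlights bridges. The sketch below computes it with Brandes' algorithm on a toy graph of two triangles joined by a single edge. It is a self-contained illustration under the assumption of an unweighted, undirected graph stored as an adjacency dict; how get_bridge_nodes computes or normalizes its scores is not specified in this guide.

```python
from collections import deque


def betweenness(adj):
    """Normalized betweenness centrality (Brandes) for an undirected graph."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        order = []                            # nodes in non-decreasing distance
        preds = {v: [] for v in adj}          # shortest-path predecessors
        sigma = dict.fromkeys(adj, 0)         # number of shortest paths from s
        dist = dict.fromkeys(adj, -1)
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:               # first time we reach w
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:    # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):             # accumulate dependencies
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    n = len(adj)
    # Each unordered pair was counted from both endpoints, so dividing by
    # (n - 1)(n - 2) scales scores into the range [0, 1].
    scale = 1.0 / ((n - 1) * (n - 2)) if n > 2 else 1.0
    return {v: score * scale for v, score in bc.items()}


# Two triangles (a, b, c) and (d, e, f) joined by the edge c-d.
adj = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}

scores = betweenness(adj)
bridges = sorted(node for node, score in scores.items() if score >= 0.1)
print(bridges)  # ['c', 'd']
```

Only c and d lie on paths between the two triangles, so they are the only nodes that clear the 0.1 cutoff.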

Network Statistics

# Get graph
graph = rag.graph

# Basic stats
print(f"Nodes: {len(graph.nodes())}")
print(f"Edges: {len(graph.edges())}")

# Density
if len(graph.nodes()) > 1:
    max_edges = len(graph.nodes()) * (len(graph.nodes()) - 1) / 2
    density = len(graph.edges()) / max_edges
    print(f"Density: {density:.3f}")

# Average degree (skipped for an empty graph to avoid dividing by zero)
if len(graph.nodes()) > 0:
    avg_degree = 2 * len(graph.edges()) / len(graph.nodes())
    print(f"Average degree: {avg_degree:.1f}")

Next Steps

Tutorials

For complete end-to-end examples, see:

  • Tutorials - Research papers, products, chat logs
  • examples/ - Working code examples

Deep Dives

To understand the system better:

  • Core Concepts - Structured similarity and network topology explained
  • YAML DSL Reference - Complete YAML configuration reference
  • Chunking Guide - Text chunking strategies

API References

For detailed API documentation:

  • API Reference - Fluent API and NetworkRAG class
  • CLI Reference - All CLI commands
  • Fluent API Guide - Advanced patterns

Examples in the Repository

Check out these working examples:

# Basic usage
python examples/basic_usage.py

# Fluent API
python examples/fluent_api.py

# Structured similarity
python examples/structured_tutorial.py

# API comparison
python examples/api_comparison.py

Configuration Templates

The config/ directory has ready-to-use configurations:

  • papers_minimal.yaml - Simple paper configuration
  • papers_full.yaml - Complete paper configuration with chunking
  • products_basic.yaml - E-commerce products
  • conversations.yaml - Chat messages with role weighting

Common Patterns

Batch Import from JSONL

import json
from src.network_rag import NetworkRAG

rag = NetworkRAG.builder().from_config('config.yaml').build()

# Batch import
with rag.batch() as batch:
    with open('documents.jsonl') as f:
        for line in f:
            doc = json.loads(line)
            batch.add(doc['content'], id=doc['id'], **doc['metadata'])

Incremental Updates

# Load existing database
rag = NetworkRAG.builder().with_storage('existing.db').build()

# Add new documents
# (get_new_documents() is a placeholder for your own loader)
new_docs = get_new_documents()
for doc in new_docs:
    rag.add(doc['content'], **doc['metadata'])

# Rebuild network (only recomputes for new docs)
rag.build_network(rebuild=True)

Custom Retrieval

# Combine multiple strategies
results = (rag.search(query)
           .with_strategy('hybrid')
           .in_community(0)              # Restrict to community 0
           .expand_neighbors(hops=2)     # Include 2-hop neighbors
           .prioritize_hubs()            # Boost hub nodes
           .filter(year=2024)            # Metadata filter
           .top(20))

Troubleshooting

Common Issues

"No module named 'src'"

  • Make sure you're in the project root directory
  • Or install with pip install -e .

"Database locked"

  • SQLite doesn't handle concurrent writes well
  • Use separate processes or implement locking

"Out of memory during build_network()"

  • Reduce dataset size
  • Increase min_similarity threshold (fewer edges)
  • Use chunking to reduce embedding dimensions

"Search returns no results"

  • Check if network is built: rag.build_network()
  • Lower min_similarity threshold
  • Verify documents were added correctly

Getting Help

Summary

You've learned:

  • ✅ How to install Complex Network RAG
  • ✅ The three interfaces (REPL, Python, CLI)
  • ✅ How to build a simple knowledge graph
  • ✅ Working with structured documents via YAML
  • ✅ Network analysis (communities, hubs, bridges)
  • ✅ Where to find more documentation

Next: Dive into Core Concepts to understand structured similarity and network topology in depth, or explore Tutorials for complete examples.


Welcome to Complex Network RAG! Build smarter knowledge graphs with topology-aware retrieval.