Getting Started with Complex Network RAG¶
This guide will walk you through installing, configuring, and using Complex Network RAG from scratch. By the end, you'll understand the core concepts and be able to build your own knowledge graphs.
Table of Contents¶
- Installation
- Your First Knowledge Graph
- Understanding the Three Interfaces
- Working with Structured Documents
- Network Analysis
- Next Steps
Learning Pathway¶
This guide follows a progressive learning path:
Level 1: Simple Text Documents
┌──────────────────────────────┐
│ Plain text → Embeddings → │
│ Basic similarity matching │
└──────────┬───────────────────┘
│
▼
Level 2: Network Structure
┌──────────────────────────────┐
│ Similarity graph → │
│ Communities, hubs, bridges │
└──────────┬───────────────────┘
│
▼
Level 3: Structured Documents
┌──────────────────────────────┐
│ Field-specific embeddings → │
│ Hybrid similarity (YAML) │
└──────────┬───────────────────┘
│
▼
Level 4: Production Use
┌──────────────────────────────┐
│ Choose your interface: │
│ • REPL (exploration) │
│ • CLI (automation) │
│ • API (integration) │
└──────────────────────────────┘
Time estimates:
- Level 1: 15 minutes
- Level 2: 20 minutes
- Level 3: 30 minutes
- Level 4: 20 minutes

Total: ~85 minutes to full proficiency
Installation¶
Prerequisites¶
- Python 3.8 or higher
- pip package manager
Basic Installation¶
# Clone the repository
git clone https://github.com/yourusername/complex-network-rag.git
cd complex-network-rag
# Install dependencies
pip install -r requirements.txt
# Optional: Install in development mode
pip install -e .
Verify Installation¶
# Check if CLI works
network-rag version
# Run tests to ensure everything is working
pytest tests/ -v
Your First Knowledge Graph¶
Let's build a simple knowledge graph of machine learning papers.
Option 1: Using the REPL (Recommended for Beginners)¶
The REPL is perfect for learning and experimentation. Once in the REPL:
[no db]> db connect papers.db
✓ Connected to papers.db
papers.db> add "Attention Is All You Need - The Transformer architecture uses self-attention"
✓ Added document: doc-1
papers.db [1 docs]> add "BERT: Bidirectional transformers for language understanding"
✓ Added document: doc-2
papers.db [2 docs]> add "GPT-3: Language Models are Few-Shot Learners"
✓ Added document: doc-3
papers.db [3 docs]> build
ℹ Fitting TF-IDF vectorizer...
ℹ Building network...
✓ Network built: 3 nodes, 2 edges
papers.db [3 docs, 2 edges]> search "transformer models"
Search: transformer models
Found 3 results:
1. doc-1 (score: 0.856)
Attention Is All You Need - The Transformer architecture uses self-attention
2. doc-2 (score: 0.742)
BERT: Bidirectional transformers for language understanding
3. doc-3 (score: 0.621)
GPT-3: Language Models are Few-Shot Learners
What just happened?
1. Created a new database (papers.db)
2. Added three documents about transformers
3. Built a similarity network (found 2 edges connecting similar papers)
4. Searched for "transformer models" and got ranked results
Option 2: Using Python API¶
Create a file first_rag.py:
from src.network_rag import NetworkRAG
# Create a RAG instance with TF-IDF embeddings
rag = (NetworkRAG.builder()
    .with_storage('papers.db')
    .with_tfidf_embeddings()
    .with_similarity_threshold(min_similarity=0.7)
    .build())
# Add documents
rag.add('transformer', content="Attention Is All You Need - The Transformer architecture uses self-attention")
rag.add('bert', content="BERT: Bidirectional transformers for language understanding")
rag.add('gpt3', content="GPT-3: Language Models are Few-Shot Learners")
# Build the network
print("Building network...")
rag.build_network()
# Search
print("\nSearching for 'transformer models':")
results = rag.search("transformer models").top(3)
for i, result in enumerate(results, 1):
    print(f"\n{i}. {result.id} (score: {result.score:.3f})")
    print(f"   {result.content[:80]}...")
Run it:
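python first_rag.py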
Option 3: Using the CLI¶
# Initialize database
network-rag db init --db papers.db
# Add documents
network-rag add "Attention Is All You Need - The Transformer architecture uses self-attention" --db papers.db
network-rag add "BERT: Bidirectional transformers for language understanding" --db papers.db
network-rag add "GPT-3: Language Models are Few-Shot Learners" --db papers.db
# Build network
network-rag build --db papers.db
# Search
network-rag search "transformer models" --db papers.db --top-k 3
Understanding the Three Interfaces¶
Complex Network RAG offers three complementary interfaces. Choose based on your use case:
1. REPL - For Exploration¶
Best for:
- Learning the system
- Prototyping configurations
- Exploring existing databases
- Interactive data analysis

Strengths:
- Immediate feedback
- Contextual prompts
- Command history
- No code required
Example workflow:
> db connect :memory: # Quick in-memory database
> add "Document 1"
> add "Document 2"
> build
> search "query"
> graph communities # Explore network structure
2. Python API - For Integration¶
Best for:
- Production applications
- Complex workflows
- Integration with other systems
- Programmatic control

Strengths:
- Type safety
- Method chaining
- Rich objects
- Batch operations
Example workflow:
# Fluent API with builder pattern
rag = (NetworkRAG.builder()
    .with_storage('data.db')
    .from_config('config.yaml')
    .build())

# Batch add documents
with rag.batch() as batch:
    for doc in documents:
        batch.add(doc['content'], **doc['metadata'])

# Advanced queries
results = (rag.search(query)
    .with_strategy('hybrid')
    .filter(category='ML')
    .top(10))
3. CLI - For Automation¶
Best for:
- Scripts and pipelines
- Batch processing
- CI/CD integration
- Command-line workflows

Strengths:
- Shell integration
- Scriptable
- Standard Unix tools
- Remote execution
Example workflow:
#!/bin/bash
# Shell script for batch import
network-rag db init --db papers.db
network-rag import data.jsonl --format jsonl --db papers.db
network-rag build --db papers.db
network-rag communities --db papers.db > communities.txt
Working with Structured Documents¶
So far we've used simple text documents. But Complex Network RAG really shines with structured documents where you want different fields to contribute differently to similarity.
Why Structured Similarity?¶
Consider research papers:
- Title: Should match very precisely (high weight)
- Abstract: Captures main ideas (medium weight, handle long text)
- Tags: Exact set overlap (Jaccard similarity, not embedding; see the sketch below)
- Authors: Exact match or not (boolean)
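For intuition, here is what Jaccard similarity computes on two tag sets (a minimal, self-contained sketch; the tag values are illustrative):

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |intersection| / |union| of two sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

tags_a = {'transformers', 'attention', 'seq2seq'}
tags_b = {'transformers', 'pretraining', 'nlp'}
print(jaccard(tags_a, tags_b))  # 1 shared tag out of 5 total -> 0.2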
Creating a YAML Configuration¶
Create config/my_papers.yaml:
document:
  fields:
    # Text fields that will be embedded
    - name: title
      type: text
      embed: true
    - name: abstract
      type: text
      embed: true
    # Metadata fields (no embedding needed)
    - name: tags
      type: list
    - name: authors
      type: list

similarity:
  components:
    # Title embedding (30% weight)
    - type: field_embedding
      field: title
      model: tfidf
      weight: 0.3
      min_similarity: 0.5

    # Abstract embedding with chunking (40% weight)
    - type: field_embedding
      field: abstract
      model: tfidf
      weight: 0.4
      min_similarity: 0.3
      chunking:
        method: sentences
        max_tokens: 512
        overlap: 50

    # Tag similarity using Jaccard (20% weight)
    - type: attribute_similarity
      field: tags
      metric: jaccard
      weight: 0.2
      min_similarity: 0.1

    # Author similarity using Jaccard (10% weight)
    - type: attribute_similarity
      field: authors
      metric: jaccard
      weight: 0.1
      min_similarity: 0.0

  # Overall threshold for creating edges
  min_combined_similarity: 0.4
Using the Configuration¶
With Python:
from src.network_rag import NetworkRAG
# Load configuration
rag = (NetworkRAG.builder()
    .with_storage('papers.db')
    .from_config('config/my_papers.yaml')
    .build())

# Add structured documents
rag.add('paper1', document={
    'title': 'Attention Is All You Need',
    'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks...',
    'tags': ['transformers', 'attention', 'seq2seq'],
    'authors': ['Vaswani', 'Shazeer', 'Parmar', 'Uszkoreit']
})

rag.add('paper2', document={
    'title': 'BERT: Pre-training of Deep Bidirectional Transformers',
    'abstract': 'We introduce a new language representation model called BERT...',
    'tags': ['transformers', 'pretraining', 'nlp'],
    'authors': ['Devlin', 'Chang', 'Lee', 'Toutanova']
})
# Build and search
rag.build_network()
results = rag.search('transformer architecture').top(10)
With REPL:
[no db]> config load config/my_papers.yaml
✓ Loaded configuration from config/my_papers.yaml
[no db]> db connect papers.db
✓ Connected to papers.db
papers.db> # Now add documents as before...
With CLI:
network-rag db init --config config/my_papers.yaml --db papers.db
network-rag add --config config/my_papers.yaml --db papers.db # ...
How It Works¶
When you add a document:
1. Field-level embedding: Each embedded field gets its own embedding
2. Chunking: Long fields are automatically split (abstract → sentences)
3. Storage: Embeddings stored hierarchically (node → field → chunks; see the sketch below)
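Conceptually, the stored embeddings form a node → field → chunks hierarchy. The sketch below is purely illustrative (not the actual storage schema; vectors are shortened for readability):

# Illustrative only: node -> field -> list of chunk embeddings
embeddings = {
    'paper1': {
        'title': [[0.12, 0.98, 0.05]],      # short field: a single chunk
        'abstract': [[0.31, 0.44, 0.09],    # long field: one embedding
                     [0.27, 0.51, 0.12]],   # per chunk
    }
}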
When building the network:
1. Component similarity: Each component computes its similarity
2. Weighted combination: Components combined with weights
3. Threshold: Edge created if combined similarity ≥ threshold
Example:
Paper A vs Paper B:
title similarity: 0.8 × 0.3 = 0.24
abstract similarity: 0.6 × 0.4 = 0.24
tag similarity: 0.5 × 0.2 = 0.10
author similarity: 0.0 × 0.1 = 0.00
────────────────────────────────────
Combined: 0.58 ✓ (≥ 0.4, create edge)
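The same calculation in plain Python (a sketch; the per-component scores are the illustrative values above, and the weights come from the YAML config):

# (score, weight) per similarity component
components = {
    'title': (0.8, 0.3),
    'abstract': (0.6, 0.4),
    'tags': (0.5, 0.2),
    'authors': (0.0, 0.1),
}
combined = sum(score * weight for score, weight in components.values())
print(f"Combined: {combined:.2f}")                      # 0.58
print("create edge" if combined >= 0.4 else "no edge")  # create edge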
Network Analysis¶
Once you've built your network, you can analyze its structure:
Detecting Communities¶
REPL:
papers.db [10 docs, 24 edges]> graph communities
Detected 3 communities:
Community 0: 5 nodes (NLP papers)
- transformer_paper
- bert_paper
- gpt3_paper
- ...
Community 1: 3 nodes (Computer Vision papers)
- vit_paper
- resnet_paper
- ...
Community 2: 2 nodes (Speech Recognition)
- wav2vec_paper
- conformer_paper
Python:
# Detect communities
communities = rag.detect_communities()
# Group nodes by community
from collections import defaultdict

comm_groups = defaultdict(list)
for node_id, comm_id in communities.items():
    comm_groups[comm_id].append(node_id)

# Analyze each community
for comm_id, nodes in comm_groups.items():
    print(f"\nCommunity {comm_id}: {len(nodes)} papers")
    # Auto-tag community
    tags = rag.auto_tag_community(comm_id, n_samples=10)
    print(f"  Keywords: {', '.join(tags)}")
Finding Hubs¶
Hubs are highly connected nodes that often represent foundational concepts:
# Find hub nodes (min 10 connections)
hubs = rag.get_hub_nodes(min_degree=10)
for node_id in hubs:
    node = rag.storage.get_node(node_id)
    degree = rag.graph.degree(node_id)
    print(f"{node_id}: {degree} connections")
    print(f"  {node['content_text'][:80]}...")
Finding Bridges¶
Bridges connect different communities and enable knowledge transfer:
# Find bridge nodes (high betweenness centrality)
bridges = rag.get_bridge_nodes(min_betweenness=0.1)
for node_id in bridges:
    # Get communities this bridge connects
    neighbors = rag.get_neighbors(node_id, k_hops=1)
    neighbor_communities = {rag.get_community_for_node(n) for n in neighbors}
    print(f"{node_id} bridges communities: {neighbor_communities}")
Network Statistics¶
# Get the graph object
graph = rag.graph

# Basic stats
print(f"Nodes: {len(graph.nodes())}")
print(f"Edges: {len(graph.edges())}")

# Density and average degree (guarded so an empty graph doesn't divide by zero)
if len(graph.nodes()) > 1:
    max_edges = len(graph.nodes()) * (len(graph.nodes()) - 1) / 2
    density = len(graph.edges()) / max_edges
    print(f"Density: {density:.3f}")

    avg_degree = 2 * len(graph.edges()) / len(graph.nodes())
    print(f"Average degree: {avg_degree:.1f}")
Next Steps¶
Tutorials¶
For complete end-to-end examples, see:
- Tutorials - Research papers, products, chat logs
- examples/ - Working code examples
Deep Dives¶
To understand the system better:
- Core Concepts - Structured similarity and network topology explained
- YAML DSL Reference - Complete YAML configuration reference
- Chunking Guide - Text chunking strategies
API References¶
For detailed API documentation:
- API Reference - Fluent API and NetworkRAG class
- CLI Reference - All CLI commands
- Fluent API Guide - Advanced patterns
Examples in the Repository¶
Check out these working examples:
# Basic usage
python examples/basic_usage.py
# Fluent API
python examples/fluent_api.py
# Structured similarity
python examples/structured_tutorial.py
# API comparison
python examples/api_comparison.py
Configuration Templates¶
The config/ directory has ready-to-use configurations:
- papers_minimal.yaml - Simple paper configuration
- papers_full.yaml - Complete paper configuration with chunking
- products_basic.yaml - E-commerce products
- conversations.yaml - Chat messages with role weighting
Common Patterns¶
Batch Import from JSONL¶
import json
from src.network_rag import NetworkRAG
rag = NetworkRAG.builder().from_config('config.yaml').build()
# Batch import
with rag.batch() as batch:
    with open('documents.jsonl') as f:
        for line in f:
            doc = json.loads(line)
            batch.add(doc['content'], id=doc['id'], **doc['metadata'])
Incremental Updates¶
# Load existing database
rag = NetworkRAG.builder().with_storage('existing.db').build()
# Add new documents
new_docs = get_new_documents()
for doc in new_docs:
    rag.add(doc['content'], **doc['metadata'])
# Rebuild network (only recomputes for new docs)
rag.build_network(rebuild=True)
Custom Retrieval¶
# Combine multiple strategies
results = (rag.search(query)
    .with_strategy('hybrid')
    .in_community(0)           # Restrict to community 0
    .expand_neighbors(hops=2)  # Include 2-hop neighbors
    .prioritize_hubs()         # Boost hub nodes
    .filter(year=2024)         # Metadata filter
    .top(20))
Troubleshooting¶
Common Issues¶
"No module named 'src'"
- Make sure you're in the project root directory
- Or install with pip install -e .
"Database locked" - SQLite doesn't handle concurrent writes well - Use separate processes or implement locking
"Out of memory during build_network()" - Reduce dataset size - Increase min_similarity threshold (fewer edges) - Use chunking to reduce embedding dimensions
"Search returns no results"
- Check if network is built: rag.build_network()
- Lower min_similarity threshold
- Verify documents were added correctly
Getting Help¶
- Check issues
- Read Core Concepts for deeper understanding
- Review examples/ for working code
- See Claude Code Guide for development guide
Summary¶
You've learned:
- ✅ How to install Complex Network RAG
- ✅ The three interfaces (REPL, Python, CLI)
- ✅ How to build a simple knowledge graph
- ✅ Working with structured documents via YAML
- ✅ Network analysis (communities, hubs, bridges)
- ✅ Where to find more documentation
Next: Dive into Core Concepts to understand structured similarity and network topology in depth, or explore Tutorials for complete examples.
Welcome to Complex Network RAG! Build smarter knowledge graphs with topology-aware retrieval.