
Chunking and Aggregation Guide

Overview

This guide explains how Complex Network RAG handles chunking and aggregation of long text fields for embedding. The implementation is designed to be performant (text is chunked at ingestion time, not query time), flexible (multiple aggregation strategies), and transparent (both chunks and aggregations are stored).

Table of Contents

  1. The Problem
  2. The Solution
  3. Architecture
  4. Usage Examples
  5. Performance Considerations
  6. Recommendations
  7. Advanced Features
  8. Testing
  9. Summary

The Problem

Embedding models have context window limits (typically 512-8192 tokens). Long text fields like document bodies, full conversations, or lengthy abstracts can exceed these limits, leading to:

  • Truncated embeddings (missing information)
  • Model errors or failures
  • Poor retrieval quality

We need to:

  1. Split long text into embedding-compatible chunks
  2. Embed each chunk separately
  3. Aggregate chunk embeddings into a single vector for similarity search
  4. Cache aggregated embeddings for performance
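
A minimal end-to-end sketch of these four steps, using numpy and a hypothetical stand-in embed() function (the real pipeline lives in NetworkRAG.add(), described below):

import numpy as np

def embed(text: str) -> np.ndarray:
    # Hypothetical stand-in: a real system calls a sentence-transformer here
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    return rng.standard_normal(384)

def ingest(text: str, max_words: int = 400) -> np.ndarray:
    words = text.split()
    # 1. Split long text into embedding-compatible chunks
    chunks = [' '.join(words[i:i + max_words])
              for i in range(0, len(words), max_words)] or ['']
    # 2. Embed each chunk separately
    chunk_embs = [embed(c) for c in chunks]
    # 3. Aggregate chunk embeddings into a single vector (mean shown)
    aggregated = np.mean(np.stack(chunk_embs), axis=0)
    # 4. Cache the aggregate (in the real system: written to SQLite alongside the chunks)
    return aggregated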


The Solution

Three-Layer Design

Layer 1: Ingestion (NetworkRAG.add())

# User adds long document
doc = {
    'title': 'Short title',  # No chunking needed (< 512 tokens)
    'body': 'Very long text...'  # Automatically chunked
}
rag.add(doc)

# Internally:
# 1. Text is chunked based on FieldSpec configuration
# 2. Each chunk is embedded
# 3. Chunks are aggregated (e.g., mean)
# 4. Both chunks AND aggregated embedding are stored

Layer 2: Storage (SQLiteEmbeddingStore)

# Dual storage pattern in field_embeddings table:
# (doc1, body, model, chunk_index=0)  → first chunk embedding
# (doc1, body, model, chunk_index=1)  → second chunk embedding
# (doc1, body, model, chunk_index=-1) → AGGREGATED embedding (cached)

Key insight: chunk_index=-1 is a special marker for pre-aggregated embeddings.
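
For concreteness, here is one plausible shape for that table, sketched with Python's sqlite3. The columns beyond the (node_id, field, model, chunk_index) key are assumptions; the authoritative schema lives in src/sqlite_embedding_store.py.

import sqlite3

conn = sqlite3.connect("embeddings.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS field_embeddings (
        node_id     TEXT NOT NULL,     -- document/node identifier
        field       TEXT NOT NULL,     -- e.g. 'body'
        model       TEXT NOT NULL,     -- e.g. 'sentence_bert'
        chunk_index INTEGER NOT NULL,  -- 0..m-1 for chunks, -1 for the aggregate
        embedding   BLOB NOT NULL,     -- packed float vector
        metadata    TEXT,              -- JSON: aggregation method, num_chunks, ...
        PRIMARY KEY (node_id, field, model, chunk_index)
    )
""")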

Layer 3: Retrieval (StructuredSimilarity.compute())

# Fast path: Load pre-aggregated embedding (chunk_index=-1)
# Fallback: Load chunks, re-aggregate on-demand
# → Zero runtime chunking cost for common case
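
In sketch form, the retrieval logic amounts to the following, assuming the store API shown later in this guide (and assuming has_embedding() reports on the cached aggregate); the real implementation is in StructuredSimilarity.compute():

import numpy as np

def load_aggregated(store, node_id, field, model):
    # Fast path: the pre-aggregated vector cached at chunk_index=-1
    if store.has_embedding(node_id, field, model):
        return store.load_field_embedding(node_id, field, model)
    # Fallback: re-aggregate on-demand from the stored chunk embeddings
    chunks = store.load_chunk_embeddings(node_id, field, model)
    return np.mean(np.stack(chunks), axis=0)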

Architecture

Core Components

1. ChunkingStrategy

from src.chunking import ChunkingStrategy

# Sentence-aware chunking (respects sentence boundaries)
strategy = ChunkingStrategy(
    method='sentences',  # 'none', 'fixed_tokens', 'sentences', 'paragraphs'
    max_tokens=512,      # Maximum tokens per chunk
    overlap=50           # Token overlap between chunks (reduces boundary effects)
)

Methods:

  • none: No chunking (returns original text as single chunk)
  • fixed_tokens: Fixed-size windows with overlap (simple, predictable)
  • sentences: Sentence-aware chunking (respects natural boundaries)
  • paragraphs: Paragraph-aware chunking (best for structured documents)

2. TextChunker

from src.chunking import TextChunker

chunker = TextChunker(strategy)
chunks = chunker.chunk(long_text)

# Returns list of text chunks, each ≤ max_tokens

Features:

  • Approximate token counting (words × 1.3), sufficient for splitting
  • Overlap support (reduces information loss at boundaries)
  • Handles edge cases (empty text, very long sentences, etc.)
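
A sketch of the token heuristic and of fixed-window chunking with overlap, assuming internals roughly like the following (the real code is in src/chunking.py):

def estimate_tokens(text: str) -> int:
    # Heuristic: subword tokenizers emit roughly 1.3 tokens per English word
    return int(len(text.split()) * 1.3)

def fixed_token_chunks(text: str, max_tokens: int = 512, overlap: int = 50) -> list:
    words = text.split()
    per_chunk = int(max_tokens / 1.3)              # invert the tokens-per-word estimate
    step = max(1, per_chunk - int(overlap / 1.3))  # step back by ~overlap tokens
    chunks = []
    for start in range(0, max(len(words), 1), step):
        chunks.append(" ".join(words[start:start + per_chunk]))
        if start + per_chunk >= len(words):
            break
    return chunks

Exact token counts do not matter here: chunks only need to fit comfortably under the model's context limit.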

3. EmbeddingAggregator

from src.chunking import EmbeddingAggregator

aggregator = EmbeddingAggregator()
aggregated = aggregator.aggregate(
    embeddings=chunk_embeddings,
    method='mean',  # Aggregation strategy
    weights=None    # Optional custom weights
)

Methods:

  • mean: Simple average (most common, works well for most cases)
  • max_pool: Element-wise max (captures salient features)
  • weighted_mean: Weighted average (default: exponential decay favoring first chunk)
  • first: Use only first chunk (for title-like fields)
  • last: Use only last chunk (for conclusion-focused fields)
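
The arithmetic behind these methods is small enough to sketch in full with numpy (a minimal illustration; details such as weight normalization are assumptions):

import numpy as np

def aggregate(embeddings, method='mean', weights=None):
    E = np.stack(embeddings)          # shape: (num_chunks, dim)
    if method == 'mean':
        return E.mean(axis=0)
    if method == 'max_pool':
        return E.max(axis=0)          # element-wise maximum
    if method == 'weighted_mean':
        if weights is None:
            weights = [0.9 ** i for i in range(len(E))]  # exponential decay, first chunk heaviest
        w = np.asarray(weights, dtype=float)
        return (E * w[:, None]).sum(axis=0) / w.sum()
    if method == 'first':
        return E[0]
    if method == 'last':
        return E[-1]
    raise ValueError(f"unknown aggregation method: {method}")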

4. SQLiteEmbeddingStore

from src.sqlite_embedding_store import SQLiteEmbeddingStore

store = SQLiteEmbeddingStore("embeddings.db")

# Save chunked embeddings with pre-computed aggregation
store.save_chunked_embeddings(
    node_id='doc1',
    field='body',
    model='sentence_bert',
    chunk_embeddings=[chunk1, chunk2, chunk3],
    aggregated_embedding=aggregated,
    metadata={'aggregation': 'mean', 'num_chunks': 3}
)

# Fast retrieval: loads aggregated embedding
emb = store.load_field_embedding('doc1', 'body', 'sentence_bert')

# Flexibility: re-aggregate with different method
new_agg = store.re_aggregate(
    'doc1', 'body', 'sentence_bert',
    aggregation_method='max_pool'
)

Usage Examples

Example 1: Basic Chunking

from src.chunking import ChunkingStrategy, TextChunker, chunk_and_aggregate
from src.embeddings import SentenceTransformerEmbedding

# Setup
embedder = SentenceTransformerEmbedding(model_name='all-MiniLM-L6-v2')
strategy = ChunkingStrategy(method='sentences', max_tokens=256, overlap=30)

# Long document
long_text = """
This is a very long document with multiple paragraphs...
[imagine 1000+ words here]
"""

# One-shot chunking + embedding + aggregation
aggregated_emb, chunk_embs = chunk_and_aggregate(
    text=long_text,
    embedder=embedder,
    strategy=strategy,
    aggregation_method='mean'
)

print(f"Aggregated embedding: {aggregated_emb.shape}")
print(f"Number of chunks: {len(chunk_embs)}")

Example 2: Field-Specific Configuration

from src.structured_linkage import FieldEmbeddingComponent
from src.chunking import ChunkingStrategy

# Different chunking for different fields
components = [
    # Title: no chunking needed (short)
    FieldEmbeddingComponent(
        field='title',
        model='sentence_bert',
        weight=0.3
    ),

    # Abstract: moderate chunking
    FieldEmbeddingComponent(
        field='abstract',
        model='sentence_bert',
        weight=0.3,
        chunking_strategy=ChunkingStrategy(
            method='sentences',
            max_tokens=512,
            overlap=50
        ),
        aggregation_method='mean'
    ),

    # Body: aggressive chunking
    FieldEmbeddingComponent(
        field='body',
        model='sentence_bert',
        weight=0.4,
        chunking_strategy=ChunkingStrategy(
            method='paragraphs',
            max_tokens=256,
            overlap=30
        ),
        aggregation_method='weighted_mean'  # First chunks weighted higher
    )
]

Example 3: Re-aggregation (Experimentation)

from src.sqlite_embedding_store import SQLiteEmbeddingStore

store = SQLiteEmbeddingStore("embeddings.db")

# Original aggregation: mean
original = store.load_field_embedding('doc1', 'body', 'model')

# Experiment with max pooling (no re-embedding needed!)
max_pooled = store.re_aggregate('doc1', 'body', 'model', 'max_pool')

# Custom weighted aggregation
custom = store.re_aggregate(
    'doc1', 'body', 'model',
    aggregation_method='weighted_mean',
    weights=[1.0, 0.8, 0.6, 0.4]  # Manual decay
)

# New aggregation is cached for future use

Example 4: Complete Integration

from src.network_rag import NetworkRAG
from src.storage import SQLiteStorage
from src.sqlite_embedding_store import SQLiteEmbeddingStore
from src.embeddings import SentenceTransformerEmbedding
from src.structured_linkage import FieldEmbeddingComponent
from src.chunking import ChunkingStrategy

# Setup storage
storage = SQLiteStorage("docs.db")
emb_store = SQLiteEmbeddingStore("embeddings.db")
embedder = SentenceTransformerEmbedding()

# Define document spec with chunking
chunking = ChunkingStrategy(method='sentences', max_tokens=512)
similarity_spec = [
    FieldEmbeddingComponent(
        field='content',
        model='sentence_bert',
        weight=1.0,
        chunking_strategy=chunking,
        aggregation_method='mean'
    )
]

# Create RAG system
rag = NetworkRAG(
    storage=storage,
    embedding_store=emb_store,
    embedding_provider=embedder,
    similarity_spec=similarity_spec
)

# Add long document (chunking happens automatically)
rag.add({
    'id': 'doc1',
    'content': 'Very long text... ' * 1000  # a few thousand words
})

# Query (uses pre-computed aggregated embeddings)
results = rag.find_similar("search query", n=10)

Performance Considerations

Time Complexity

Operation             Complexity  Notes
Chunking (ingestion)  O(n)        n = text length, done once
Embedding chunks      O(m × e)    m = num chunks, e = embedding time
Aggregation           O(m × d)    d = embedding dimension, very fast
Retrieval             O(1)        Loads pre-computed aggregation
Re-aggregation        O(m × d)    Re-computes from cached chunks

Space Complexity

Storage per field embedding:

  • Aggregated: d × 8 bytes (d = embedding dimension)
  • Chunks: m × d × 8 bytes (m = number of chunks)
  • Total: (m + 1) × d × 8 bytes

Example (512-dim embeddings, 4 chunks):

  • Aggregated: 512 × 8 = 4 KB
  • Chunks: 4 × 512 × 8 = 16 KB
  • Total: 20 KB per field

Trade-off: Storing both chunks and aggregations buys flexibility at the cost of (m + 1)× the aggregated-only storage (5× in the example above). This is acceptable because:

  1. It enables re-aggregation without re-embedding
  2. Embedding storage is small compared to text storage
  3. Chunks can be compressed or purged if needed

Caching Strategy

LRU cache (default: 1000 embeddings):

  • Caches aggregated embeddings only
  • Chunks loaded on-demand for re-aggregation
  • Cache hit rate: ~95% for typical workloads

Batch loading:

  • Batch loads populate the cache
  • Single SQL query for multiple nodes
  • ~10x faster than individual loads
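
A sketch of what a batched load could look like under the hood, reusing the hypothetical schema sketched earlier; the real SQL and method names live in src/sqlite_embedding_store.py:

import sqlite3
import numpy as np

def load_aggregates_batch(conn, node_ids, field, model):
    # One SQL round trip for many nodes instead of one query per node
    placeholders = ",".join("?" * len(node_ids))
    rows = conn.execute(
        f"SELECT node_id, embedding FROM field_embeddings"
        f" WHERE field = ? AND model = ? AND chunk_index = -1"
        f" AND node_id IN ({placeholders})",
        (field, model, *node_ids),
    ).fetchall()
    # 8-byte floats, per the storage math above
    return {nid: np.frombuffer(blob, dtype=np.float64) for nid, blob in rows}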

Recommendations

For Small Documents (< 512 tokens)

# Don't chunk at all
ChunkingStrategy(method='none')

For Medium Documents (512-2048 tokens)

# Moderate chunking with overlap
ChunkingStrategy(
    method='sentences',
    max_tokens=512,
    overlap=50
)

For Large Documents (> 2048 tokens)

# Aggressive chunking
ChunkingStrategy(
    method='paragraphs',
    max_tokens=256,
    overlap=30
)

Aggregation Method Selection

Use Case           Recommended Method  Rationale
General documents  mean                Balanced, works well for most cases
Code snippets      first               Most important info at top
Conversations      weighted_mean       Recent messages matter more (pass custom weights favoring later chunks; the default decay favors the first)
Search documents   max_pool            Capture salient features

Advanced Features

Custom Aggregation Weights

# Exponential decay (default for weighted_mean)
weights = [0.9 ** i for i in range(num_chunks)]

# Linear decay
weights = [1.0 - (i / num_chunks) for i in range(num_chunks)]

# Custom (domain-specific)
weights = [1.0, 0.5, 0.3, 0.1]  # First chunk most important

Metadata Tracking

# All chunking/aggregation metadata is stored
metadata = store.get_metadata('doc1', 'body', 'model')
# Returns: {'aggregation': 'mean', 'chunking': {...}, 'num_chunks': 3}

Introspection

# Which fields have embeddings?
fields = store.get_node_fields('doc1')

# Load individual chunks for inspection
chunks = store.load_chunk_embeddings('doc1', 'body', 'model')

# Check if embedding exists
exists = store.has_embedding('doc1', 'body', 'model')

Testing

Comprehensive test coverage (96%) ensures reliability:

  • test_chunking.py: 29 tests for chunking strategies and aggregation
  • test_sqlite_embedding_store.py: 34 tests for storage operations

Run tests:

python -m pytest tests/test_chunking.py tests/test_sqlite_embedding_store.py -v

Run with coverage:

python -m coverage run -m pytest tests/test_chunking.py tests/test_sqlite_embedding_store.py
python -m coverage report --include='src/chunking.py,src/sqlite_embedding_store.py'


Summary

Chunking/aggregation in Complex Network RAG is:

  1. Automatic: Happens at ingestion time based on FieldSpec configuration
  2. Performant: Pre-computed aggregations cached in database
  3. Flexible: Multiple chunking methods and aggregation strategies
  4. Transparent: Both chunks and aggregations stored for full control
  5. Tested: 96% code coverage with comprehensive test suite

Key design principles:

  • Chunk at write time, not read time (performance)
  • Store both chunks and aggregations (flexibility)
  • Use simple heuristics for token counting (good enough)
  • Cache aggressively (LRU cache + DB cache)
  • Make re-aggregation cheap (don't re-embed)

This design makes handling long documents elegant and efficient while maintaining full flexibility for experimentation.