Chunking and Aggregation Guide¶
Overview¶
This guide explains how Complex Network RAG handles chunking and aggregation of long text fields for embedding. The implementation is designed to be performant (chunking happens at ingestion time, not query time), flexible (multiple aggregation strategies), and transparent (both chunks and aggregations are stored).
The Problem¶
Embedding models have context window limits (typically 512-8192 tokens). Long text fields like document bodies, full conversations, or lengthy abstracts can exceed these limits, leading to:
- Truncated embeddings (missing information)
- Model errors or failures
- Poor retrieval quality
We need to:
1. Split long text into embedding-compatible chunks
2. Embed each chunk separately
3. Aggregate chunk embeddings into a single vector for similarity search
4. Cache aggregated embeddings for performance
The Solution¶
Three-Layer Design¶
Layer 1: Ingestion (NetworkRAG.add())¶
# User adds a long document
doc = {
    'title': 'Short title',      # No chunking needed (< 512 tokens)
    'body': 'Very long text...'  # Automatically chunked
}
rag.add(doc)

# Internally:
# 1. Text is chunked based on FieldSpec configuration
# 2. Each chunk is embedded
# 3. Chunk embeddings are aggregated (e.g., mean)
# 4. Both chunks AND the aggregated embedding are stored
Layer 2: Storage (SQLiteEmbeddingStore)¶
# Dual storage pattern in field_embeddings table:
# (doc1, body, model, chunk_index=0) → first chunk embedding
# (doc1, body, model, chunk_index=1) → second chunk embedding
# (doc1, body, model, chunk_index=-1) → AGGREGATED embedding (cached)
Key insight: chunk_index=-1 is a special marker for pre-aggregated embeddings.
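For illustration, here is a minimal sketch of a table that implements this pattern; the column names, types, and DDL are assumptions for exposition, not SQLiteEmbeddingStore's actual schema:

import sqlite3

# Hypothetical DDL showing the dual-storage pattern; the real schema may differ.
conn = sqlite3.connect("embeddings.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS field_embeddings (
        node_id     TEXT    NOT NULL,
        field       TEXT    NOT NULL,
        model       TEXT    NOT NULL,
        chunk_index INTEGER NOT NULL,  -- 0..N-1 for chunks, -1 for the aggregation
        embedding   BLOB    NOT NULL,  -- e.g. float64 bytes from numpy.tobytes()
        metadata    TEXT,              -- JSON: aggregation method, num_chunks, ...
        PRIMARY KEY (node_id, field, model, chunk_index)
    )
""")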
Layer 3: Retrieval (StructuredSimilarity.compute())¶
# Fast path: Load pre-aggregated embedding (chunk_index=-1)
# Fallback: Load chunks, re-aggregate on-demand
# → Zero runtime chunking cost for common case
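A minimal sketch of that fast-path/fallback logic, assuming the store methods shown later in this guide and that load_field_embedding returns None on a miss (an assumption):

def load_aggregated(store, aggregator, node_id, field, model, method='mean'):
    # Fast path: the pre-aggregated embedding stored at chunk_index=-1
    emb = store.load_field_embedding(node_id, field, model)
    if emb is not None:
        return emb
    # Fallback: load the individual chunks and aggregate on demand
    chunks = store.load_chunk_embeddings(node_id, field, model)
    return aggregator.aggregate(embeddings=chunks, method=method)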
Architecture¶
Core Components¶
1. ChunkingStrategy¶
from src.chunking import ChunkingStrategy
# Sentence-aware chunking (respects sentence boundaries)
strategy = ChunkingStrategy(
    method='sentences',  # 'none', 'fixed_tokens', 'sentences', 'paragraphs'
    max_tokens=512,      # Maximum tokens per chunk
    overlap=50           # Token overlap between chunks (reduces boundary effects)
)
Methods:
- none: No chunking (returns original text as single chunk)
- fixed_tokens: Fixed-size windows with overlap (simple, predictable)
- sentences: Sentence-aware chunking (respects natural boundaries)
- paragraphs: Paragraph-aware chunking (best for structured documents)
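To make the fixed-window variant concrete, here is a sketch of the idea (not TextChunker's actual code), using words as a stand-in for tokens:

def fixed_window_chunks(text, max_tokens=512, overlap=50):
    # Slide a max_tokens-wide window, advancing by (max_tokens - overlap)
    # so consecutive chunks share `overlap` tokens at the boundary.
    words = text.split()
    step = max_tokens - overlap
    return [' '.join(words[i:i + max_tokens]) for i in range(0, len(words), step)]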
2. TextChunker¶
from src.chunking import TextChunker
chunker = TextChunker(strategy)
chunks = chunker.chunk(long_text)
# Returns list of text chunks, each ≤ max_tokens
Features:
- Approximate token counting (words × 1.3), sufficient for splitting
- Overlap support (reduces information loss at boundaries)
- Handles edge cases (empty text, very long sentences, etc.)
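The words × 1.3 heuristic is simple enough to show inline; a sketch (the exact constant and rounding inside TextChunker are assumptions):

def estimate_tokens(text):
    # English text averages roughly 1.3 tokens per word for common tokenizers.
    # Good enough to decide where to split; not exact enough for truncation limits.
    return int(len(text.split()) * 1.3)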
3. EmbeddingAggregator¶
from src.chunking import EmbeddingAggregator
aggregator = EmbeddingAggregator()
aggregated = aggregator.aggregate(
    embeddings=chunk_embeddings,
    method='mean',  # Aggregation strategy
    weights=None    # Optional custom weights
)
Methods:
- mean: Simple average (most common, works well for most cases)
- max_pool: Element-wise max (captures salient features)
- weighted_mean: Weighted average (default: exponential decay favoring first chunk)
- first: Use only first chunk (for title-like fields)
- last: Use only last chunk (for conclusion-focused fields)
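All of these are cheap array operations; a NumPy sketch of what they might compute (illustrative, not EmbeddingAggregator's actual code; the 0.9 decay matches the default shown under Advanced Features):

import numpy as np

def aggregate(chunks, method='mean', weights=None):
    E = np.stack(chunks)  # shape: (num_chunks, dim)
    if method == 'mean':
        return E.mean(axis=0)
    if method == 'max_pool':
        return E.max(axis=0)  # element-wise max across chunks
    if method == 'weighted_mean':
        if weights is None:
            weights = [0.9 ** i for i in range(len(chunks))]  # decay favors early chunks
        w = np.asarray(weights, dtype=float)
        return (E * w[:, None]).sum(axis=0) / w.sum()
    if method == 'first':
        return E[0]
    if method == 'last':
        return E[-1]
    raise ValueError(f"Unknown aggregation method: {method}")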
4. SQLiteEmbeddingStore¶
from src.sqlite_embedding_store import SQLiteEmbeddingStore
store = SQLiteEmbeddingStore("embeddings.db")
# Save chunked embeddings with pre-computed aggregation
store.save_chunked_embeddings(
    node_id='doc1',
    field='body',
    model='sentence_bert',
    chunk_embeddings=[chunk1, chunk2, chunk3],
    aggregated_embedding=aggregated,
    metadata={'aggregation': 'mean', 'num_chunks': 3}
)

# Fast retrieval: loads the aggregated embedding
emb = store.load_field_embedding('doc1', 'body', 'sentence_bert')

# Flexibility: re-aggregate with a different method
new_agg = store.re_aggregate(
    'doc1', 'body', 'sentence_bert',
    aggregation_method='max_pool'
)
Usage Examples¶
Example 1: Basic Chunking¶
from src.chunking import ChunkingStrategy, TextChunker, chunk_and_aggregate
from src.embeddings import SentenceTransformerEmbedding
# Setup
embedder = SentenceTransformerEmbedding(model_name='all-MiniLM-L6-v2')
strategy = ChunkingStrategy(method='sentences', max_tokens=256, overlap=30)
# Long document
long_text = """
This is a very long document with multiple paragraphs...
[imagine 1000+ words here]
"""
# One-shot chunking + embedding + aggregation
aggregated_emb, chunk_embs = chunk_and_aggregate(
    text=long_text,
    embedder=embedder,
    strategy=strategy,
    aggregation_method='mean'
)
print(f"Aggregated embedding: {aggregated_emb.shape}")
print(f"Number of chunks: {len(chunk_embs)}")
Example 2: Field-Specific Configuration¶
from src.structured_linkage import FieldEmbeddingComponent
from src.chunking import ChunkingStrategy
# Different chunking for different fields
components = [
    # Title: no chunking needed (short)
    FieldEmbeddingComponent(
        field='title',
        model='sentence_bert',
        weight=0.3
    ),
    # Abstract: moderate chunking
    FieldEmbeddingComponent(
        field='abstract',
        model='sentence_bert',
        weight=0.3,
        chunking_strategy=ChunkingStrategy(
            method='sentences',
            max_tokens=512,
            overlap=50
        ),
        aggregation_method='mean'
    ),
    # Body: aggressive chunking
    FieldEmbeddingComponent(
        field='body',
        model='sentence_bert',
        weight=0.4,
        chunking_strategy=ChunkingStrategy(
            method='paragraphs',
            max_tokens=256,
            overlap=30
        ),
        aggregation_method='weighted_mean'  # First chunks weighted higher
    )
]
Example 3: Re-aggregation (Experimentation)¶
from src.sqlite_embedding_store import SQLiteEmbeddingStore
store = SQLiteEmbeddingStore("embeddings.db")
# Original aggregation: mean
original = store.load_field_embedding('doc1', 'body', 'model')
# Experiment with max pooling (no re-embedding needed!)
max_pooled = store.re_aggregate('doc1', 'body', 'model', 'max_pool')
# Custom weighted aggregation
custom = store.re_aggregate(
    'doc1', 'body', 'model',
    aggregation_method='weighted_mean',
    weights=[1.0, 0.8, 0.6, 0.4]  # Manual decay
)
# New aggregation is cached for future use
Example 4: Complete Integration¶
from src.network_rag import NetworkRAG
from src.storage import SQLiteStorage
from src.sqlite_embedding_store import SQLiteEmbeddingStore
from src.embeddings import SentenceTransformerEmbedding
from src.structured_linkage import FieldEmbeddingComponent
from src.chunking import ChunkingStrategy
# Setup storage
storage = SQLiteStorage("docs.db")
emb_store = SQLiteEmbeddingStore("embeddings.db")
embedder = SentenceTransformerEmbedding()
# Define document spec with chunking
chunking = ChunkingStrategy(method='sentences', max_tokens=512)
similarity_spec = [
    FieldEmbeddingComponent(
        field='content',
        model='sentence_bert',
        weight=1.0,
        chunking_strategy=chunking,
        aggregation_method='mean'
    )
]
# Create RAG system
rag = NetworkRAG(
    storage=storage,
    embedding_store=emb_store,
    embedding_provider=embedder,
    similarity_spec=similarity_spec
)
# Add long document (chunking happens automatically)
rag.add({
    'id': 'doc1',
    'content': 'Very long text...' * 1000  # ~3,000 words
})
# Query (uses pre-computed aggregated embeddings)
results = rag.find_similar("search query", n=10)
Performance Considerations¶
Time Complexity¶
| Operation | Complexity | Notes |
|---|---|---|
| Chunking (ingestion) | O(n) | n = text length, done once |
| Embedding chunks | O(m × e) | m = num chunks, e = embedding time |
| Aggregation | O(m × d) | d = embedding dimension, very fast |
| Retrieval | O(1) | Loads pre-computed aggregation |
| Re-aggregation | O(m × d) | Re-computes from cached chunks |
Space Complexity¶
Storage per field embedding:
- Aggregated: d × 8 bytes (d = embedding dimension)
- Chunks: m × d × 8 bytes (m = number of chunks)
- Total: (m + 1) × d × 8 bytes
Example (512-dim embeddings, 4 chunks):
- Aggregated: 512 × 8 = 4 KB
- Chunks: 4 × 512 × 8 = 16 KB
- Total: 20 KB per field
Trade-off: Storing both chunks and aggregations buys flexibility at the cost of (m + 1)× the storage of the aggregation alone (5× in the example above). This is acceptable because:
1. It enables re-aggregation without re-embedding
2. Embedding storage is small compared to text storage
3. Chunks can be compressed or purged if needed
Caching Strategy¶
LRU cache (default: 1000 embeddings):
- Caches aggregated embeddings only
- Chunks are loaded on demand for re-aggregation
- Cache hit rate: ~95% for typical workloads

Batch loading:
- Batch loads populate the cache
- A single SQL query covers multiple nodes
- ~10x faster than individual loads
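As a sketch of the caching idea, with functools.lru_cache standing in for whatever eviction policy the store actually uses (the wrapper class is hypothetical):

from functools import lru_cache

class CachedEmbeddingStore:
    def __init__(self, store, maxsize=1000):
        self._store = store
        # Per-instance LRU over aggregated-embedding lookups only;
        # chunk loads stay uncached and hit the database directly.
        self.load_field_embedding = lru_cache(maxsize=maxsize)(
            self._store.load_field_embedding
        )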
Recommendations¶
For Small Documents (< 512 tokens)¶
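Nothing to configure here; a component without a chunking_strategy embeds the field as-is (the field name and weight below are illustrative):

from src.structured_linkage import FieldEmbeddingComponent

# Short fields fit the model's context window, so skip chunking entirely.
component = FieldEmbeddingComponent(field='title', model='sentence_bert', weight=1.0)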
For Medium Documents (512-2048 tokens)¶
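A reasonable starting point is sentence-aware chunking with mean aggregation, mirroring the abstract field in Example 2 (the numbers are illustrative):

from src.chunking import ChunkingStrategy
from src.structured_linkage import FieldEmbeddingComponent

component = FieldEmbeddingComponent(
    field='abstract',
    model='sentence_bert',
    weight=1.0,
    chunking_strategy=ChunkingStrategy(method='sentences', max_tokens=512, overlap=50),
    aggregation_method='mean'
)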
For Large Documents (> 2048 tokens)¶
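Paragraph-aware chunking with weighted_mean, mirroring the body field in Example 2 (illustrative values):

from src.chunking import ChunkingStrategy
from src.structured_linkage import FieldEmbeddingComponent

component = FieldEmbeddingComponent(
    field='body',
    model='sentence_bert',
    weight=1.0,
    chunking_strategy=ChunkingStrategy(method='paragraphs', max_tokens=256, overlap=30),
    aggregation_method='weighted_mean'  # early chunks weighted higher by default
)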
Aggregation Method Selection¶
| Use Case | Recommended Method | Rationale |
|---|---|---|
| General documents | mean | Balanced; works well for most cases |
| Code snippets | first | Most important info is at the top |
| Conversations | weighted_mean | Recent messages matter more (pass custom weights favoring later chunks) |
| Search documents | max_pool | Captures salient features |
Advanced Features¶
Custom Aggregation Weights¶
# Exponential decay (default for weighted_mean)
weights = [0.9 ** i for i in range(num_chunks)]
# Linear decay
weights = [1.0 - (i / num_chunks) for i in range(num_chunks)]
# Custom (domain-specific)
weights = [1.0, 0.5, 0.3, 0.1] # First chunk most important
Metadata Tracking¶
# All chunking/aggregation metadata is stored
metadata = store.get_metadata('doc1', 'body', 'model')
# Returns: {'aggregation': 'mean', 'chunking': {...}, 'num_chunks': 3}
Introspection¶
# Which fields have embeddings?
fields = store.get_node_fields('doc1')
# Load individual chunks for inspection
chunks = store.load_chunk_embeddings('doc1', 'body', 'model')
# Check if embedding exists
exists = store.has_embedding('doc1', 'body', 'model')
Testing¶
Comprehensive test coverage (96%) ensures reliability:
- test_chunking.py: 29 tests for chunking strategies and aggregation
- test_sqlite_embedding_store.py: 34 tests for storage operations
Run tests:
python -m pytest tests/test_chunking.py tests/test_sqlite_embedding_store.py
Run with coverage:
python -m coverage run -m pytest tests/test_chunking.py tests/test_sqlite_embedding_store.py
python -m coverage report --include='src/chunking.py,src/sqlite_embedding_store.py'
Summary¶
Chunking/aggregation in Complex Network RAG is:
- Automatic: Happens at ingestion time based on FieldSpec configuration
- Performant: Pre-computed aggregations cached in database
- Flexible: Multiple chunking methods and aggregation strategies
- Transparent: Both chunks and aggregations stored for full control
- Tested: 96% code coverage with comprehensive test suite
Key design principles:
- Chunk at write time, not read time (performance)
- Store both chunks and aggregations (flexibility)
- Use simple heuristics for token counting (good enough)
- Cache aggressively (LRU cache + DB cache)
- Make re-aggregation cheap (don't re-embed)
This design makes handling long documents elegant and efficient while maintaining full flexibility for experimentation.