# YAML DSL Reference

Complete specification for the Complex Network RAG configuration language.
## Table of Contents
- Overview
- Design Principles
- Schema Section
- Embeddings Section
- Similarities Section
- Network Section
- Complete Examples
- Best Practices
## Overview

The DSL has four orthogonal sections, each handling one concern:

```yaml
schema:        # Data shape and extraction
embeddings:    # Field → vector mappings
similarities:  # Comparison functions
network:       # Edge creation rules
```
### Minimal Example

```yaml
schema:
  content: text

embeddings:
  content_vec:
    field: content
    model: tfidf

similarities:
  default:
    embedding: content_vec

network:
  edges:
    similarity: default
    min: 0.3
```
## Design Principles

The DSL follows the three key principles from SICP (Structure and Interpretation of Computer Programs): primitives, means of combination, and means of abstraction.

### 1. Primitives

Minimal, orthogonal building blocks:
| Primitive | Section | Purpose |
|---|---|---|
| Field | `schema` | Data shape and extraction |
| Embedding | `embeddings` | Field → vector mapping |
| Similarity | `similarities` | Comparison function |
| Edge rule | `network` | When to create edges |
### 2. Combination

Uniform syntax everywhere using `combine`:

```yaml
# Combining embeddings
text_vec:
  combine:
    - ref: title_vec
      weight: 0.3
    - ref: abstract_vec
      weight: 0.7

# Combining similarities (same syntax)
overall:
  combine:
    - ref: text_sim
      weight: 0.8
    - ref: tag_sim
      weight: 0.2
```
### 3. Abstraction
Named definitions for reuse and clarity. Every embedding and similarity has a name that can be referenced.
## Schema Section

Defines data shape, extraction, and computed fields.

### Field Syntax

```yaml
schema:
  # Minimal: just type
  title: text

  # With options
  abstract:
    type: text
    required: true

  # List with default
  tags:
    type: list
    default: []

  # JSONPath extraction
  authors:
    type: list
    from: "$.metadata.authors[*].name"

  # With reduce and transforms
  first_author:
    type: text
    from: "$.authors[*].name"
    reduce: first
    transform: [strip]
```
### Field Properties

| Property | Type | Default | Description |
|---|---|---|---|
| `type` | string | `text` | Data type: `text`, `list`, `number`, `dict` |
| `required` | bool | `false` | Must be present |
| `default` | any | `null` | Value when missing |
| `from` | string | - | JSONPath extraction expression |
| `reduce` | string | - | How to reduce multiple matches |
| `flatten` | bool | `false` | Flatten nested lists |
| `transform` | list | - | Value transformations |
| `validate` | dict | - | Validation constraints |
### JSONPath Extraction

Extract values from nested JSON structures:

```yaml
schema:
  # Direct field
  title:
    from: "$.title"

  # Nested access
  abstract:
    from: "$.metadata.abstract"

  # Array elements
  first_author:
    from: "$.authors[0].name"

  # All array elements
  all_authors:
    type: list
    from: "$.authors[*].name"

  # Nested arrays (flattened)
  all_affiliations:
    type: list
    from: "$.authors[*].affiliations[*]"
    flatten: true
```
### Reduce Operations

When JSONPath matches multiple values:

| Operation | Description | Example |
|---|---|---|
| `first` | First match | `[a,b,c]` → `a` |
| `last` | Last match | `[a,b,c]` → `c` |
| `list` | All as list | `[a,b,c]` → `[a,b,c]` |
| `join` | Join strings | `[a,b,c]` → `"a, b, c"` |
| `count` | Count matches | `[a,b,c]` → `3` |
| `unique` | Deduplicate | `[a,b,a]` → `[a,b]` |
| `sum` | Sum numbers | `[1,2,3]` → `6` |
| `mean` | Average | `[2,4,6]` → `4.0` |
| `min` | Minimum | `[3,1,2]` → `1` |
| `max` | Maximum | `[3,1,2]` → `3` |
| `flatten` | Flatten nested | `[[a],[b,c]]` → `[a,b,c]` |
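For instance, one extraction can be reduced in different ways; the field names here are illustrative:

```yaml
schema:
  author_list:
    type: text
    from: "$.authors[*].name"
    reduce: join    # [Alice, Bob, Carol] → "Alice, Bob, Carol"
  author_count:
    type: number
    from: "$.authors[*].name"
    reduce: count   # [Alice, Bob, Carol] → 3
```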
### Transforms

Preprocessing applied to field values:

| Transform | Description |
|---|---|
| `lowercase` | Convert to lowercase |
| `strip` | Strip whitespace |
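Transforms are written as a list, so chaining both should look like the sketch below; that they apply in list order is an assumption, not stated in this reference:

```yaml
schema:
  tags:
    type: list
    transform: [lowercase, strip]   # assumed to apply left to right
```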
### Validators

Constraints checked at ingestion time:

```yaml
schema:
  year:
    type: number
    validate:
      min: 1900
      max: 2100

  email:
    type: text
    validate:
      pattern: "^[a-z]+@[a-z]+\\.[a-z]+$"

  status:
    type: text
    validate:
      choices: [draft, published, archived]

  title:
    type: text
    validate:
      min_length: 1
      max_length: 200
```
| Validator | Applies To | Description |
|---|---|---|
| `min` | number | Minimum value |
| `max` | number | Maximum value |
| `min_length` | text, list | Minimum length |
| `max_length` | text, list | Maximum length |
| `pattern` | text | Regex pattern |
| `choices` | any | Allowed values |
### Computed Fields

Derive fields from existing ones:

```yaml
schema:
  title: text
  abstract: text
  authors:
    type: list

compute:
  # Short form: just formula
  full_text: "concat(title, ' ', abstract)"
  primary_author: "first(authors)"

  # Dict form with type
  author_count:
    formula: "count(authors)"
    type: number
```
### Formula Operations

| Operation | Description | Example |
|---|---|---|
| `concat(a, b, ...)` | Concatenate | `concat(title, ' - ', abstract)` |
| `join(field, sep)` | Join list | `join(tags, ', ')` |
| `first(field)` | First element | `first(authors)` |
| `last(field)` | Last element | `last(authors)` |
| `count(field)` | Count elements | `count(tags)` |
| `upper(field)` | Uppercase | `upper(title)` |
| `lower(field)` | Lowercase | `lower(title)` |
| `format(tpl, ...)` | Format string | `format('{} by {}', title, author)` |
| `coalesce(a, b, ...)` | First non-null | `coalesce(subtitle, title)` |
| `default(field, val)` | With fallback | `default(year, 2024)` |
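A sketch combining several of these operations; the `subtitle` field here is illustrative, not part of the earlier examples:

```yaml
schema:
  title: text
  subtitle: text
  authors:
    type: list

compute:
  display_title: "coalesce(subtitle, title)"            # fall back to title when subtitle is null
  byline: "format('{} by {}', title, first(authors))"   # e.g. "Deep Learning by Alice"
```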
## Embeddings Section

Named mappings from fields to vector space.

### Primitive Embedding

Map a single field to a vector:
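```yaml
embeddings:
  title_vec:
    field: title   # which schema field to embed
    model: tfidf   # which model vectorizes it
```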
### With Chunking

For long text fields:

```yaml
embeddings:
  abstract_vec:
    field: abstract
    model: tfidf
    chunking:
      method: sentences
      max_tokens: 512
      overlap: 50
```

| Chunking Option | Default | Description |
|---|---|---|
| `method` | `sentences` | `none`, `fixed_tokens`, `sentences`, `paragraphs` |
| `max_tokens` | `512` | Maximum tokens per chunk |
| `overlap` | `50` | Token overlap between chunks |
### Composite Embedding

Combine multiple embeddings:

```yaml
embeddings:
  title_vec:
    field: title
    model: tfidf
  abstract_vec:
    field: abstract
    model: tfidf

  # Weighted combination
  text_vec:
    combine:
      - ref: title_vec
        weight: 0.3
      - ref: abstract_vec
        weight: 0.7
```
### Available Models

| Model | Description |
|---|---|
| `tfidf` | TF-IDF vectorization (fast, interpretable) |
| `sentence_bert` | Sentence transformers (semantic) |
| `ollama` | Ollama embeddings (local LLM) |
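Switching models is a one-key change. A minimal sketch, assuming `sentence_bert` takes no extra parameters (this reference doesn't specify any):

```yaml
embeddings:
  abstract_vec:
    field: abstract
    model: sentence_bert   # semantic alternative to the tfidf default used in the examples
```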
## Similarities Section

Named comparison functions.

### From Embedding

Compares nodes by cosine similarity over a named embedding:
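```yaml
similarities:
  text_sim:
    embedding: text_vec   # cosine similarity over this embedding's vectors
```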
### From Field (Attribute)

Direct field comparison:

```yaml
similarities:
  tag_sim:
    field: tags
    metric: jaccard

  year_sim:
    field: year
    metric: numeric
    params:
      max_distance: 50
```
### Available Metrics

| Metric | Formula | Best For |
|---|---|---|
| `jaccard` | \|A ∩ B\| / \|A ∪ B\| | Tags, categories |
| `dice` | 2\|A ∩ B\| / (\|A\| + \|B\|) | Emphasize overlap |
| `overlap` | \|A ∩ B\| / min(\|A\|, \|B\|) | Subset matching |
| `exact` | 1.0 if A == B else 0.0 | Exact match |
| `numeric` | 1 - \|a - b\| / max_distance | Numeric ranges |
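As a worked example: for tag sets A = {ml, nlp} and B = {ml, vision}, |A ∩ B| = 1 and |A ∪ B| = 3, so `jaccard` yields 1/3 ≈ 0.33, `dice` yields 2·1/(2+2) = 0.5, and `overlap` yields 1/min(2, 2) = 0.5.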
### Composite Similarity

Combine multiple similarities:

```yaml
similarities:
  text_sim:
    embedding: text_vec
  tag_sim:
    field: tags
    metric: jaccard

  overall:
    combine:
      - ref: text_sim
        weight: 0.7
      - ref: tag_sim
        weight: 0.3
```
## Network Section

Edge creation and community detection rules.

### Edges

```yaml
network:
  edges:
    similarity: overall   # Which similarity to use
    min: 0.35             # Create edge if score >= 0.35
    strong: 0.50          # Mark as strong if score >= 0.50
    max_per_node: 100     # Optional: limit edges per node
    max_total: 50000      # Optional: limit total edges
```
### Weights

Transform similarity scores:

```yaml
network:
  weights:
    transform: sqrt    # identity, sqrt, square, log
    normalize: true    # Normalize to [0, 1]
```

| Transform | Effect | Use Case |
|---|---|---|
| `identity` | No change | Default |
| `sqrt` | Emphasize weak links | Exploration |
| `square` | Emphasize strong links | Precision |
| `log` | Compress range | Wide distribution |
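As a quick check of the table: `sqrt` lifts a weak score of 0.25 to 0.5 while leaving 1.0 unchanged, which is why it emphasizes weak links; `square` does the opposite, pushing 0.5 down to 0.25.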
### Communities
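Group densely connected nodes into communities. `louvain` is the only algorithm shown in this reference; a minimal form, matching the complete examples below:

```yaml
network:
  communities:
    algorithm: louvain
```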
## Complete Examples

### Academic Papers

```yaml
schema:
  title:
    type: text
    required: true
    transform: [strip]
  abstract: text
  authors:
    type: list
    from: "$.metadata.authors[*].name"
  tags:
    type: list
    from: "$.metadata.keywords"
    transform: [lowercase]
    default: []
  year:
    type: number
    from: "$.metadata.year"
    validate:
      min: 1900
      max: 2100

compute:
  full_text: "concat(title, ' ', abstract)"
  primary_author: "first(authors)"
  citation: "format('{} et al. ({})', first(authors), year)"

embeddings:
  title_vec:
    field: title
    model: tfidf
  abstract_vec:
    field: abstract
    model: tfidf
    chunking:
      method: sentences
      max_tokens: 512
  text_vec:
    combine:
      - ref: title_vec
        weight: 0.3
      - ref: abstract_vec
        weight: 0.7

similarities:
  text_sim:
    embedding: text_vec
  tag_sim:
    field: tags
    metric: jaccard
  author_sim:
    field: authors
    metric: jaccard
  overall:
    combine:
      - ref: text_sim
        weight: 0.6
      - ref: tag_sim
        weight: 0.25
      - ref: author_sim
        weight: 0.15

network:
  edges:
    similarity: overall
    min: 0.35
    strong: 0.50
  communities:
    algorithm: louvain
```
### E-commerce Products

```yaml
schema:
  name:
    type: text
    required: true
    validate:
      min_length: 1
      max_length: 200
  description: text
  category:
    type: text
    required: true
    validate:
      choices: [electronics, clothing, home, sports]
  brand: text
  price:
    type: number
    validate:
      min: 0
  tags:
    type: list
    transform: [lowercase]
    default: []

compute:
  display_title: "concat(brand, ' ', name)"

embeddings:
  name_vec:
    field: name
    model: tfidf
  desc_vec:
    field: description
    model: tfidf
  product_vec:
    combine:
      - ref: name_vec
        weight: 0.4
      - ref: desc_vec
        weight: 0.6

similarities:
  text_sim:
    embedding: product_vec
  category_sim:
    field: category
    metric: exact
  tag_sim:
    field: tags
    metric: jaccard
  brand_sim:
    field: brand
    metric: exact
  overall:
    combine:
      - ref: text_sim
        weight: 0.5
      - ref: category_sim
        weight: 0.3
      - ref: tag_sim
        weight: 0.15
      - ref: brand_sim
        weight: 0.05

network:
  edges:
    similarity: overall
    min: 0.4
    max_per_node: 50
  communities:
    algorithm: louvain
```
## Best Practices

### 1. Separate Concerns

Keep each section focused:

- **Schema**: only data shape and extraction
- **Embeddings**: only vectorization
- **Similarities**: only comparison logic
- **Network**: only edge rules
### 2. Name Things Clearly

```yaml
# Good: descriptive names
embeddings:
  semantic_content: ...
similarities:
  topic_overlap: ...

# Avoid: generic names
embeddings:
  vec1: ...
similarities:
  sim: ...
```
### 3. Use JSONPath for Nested Data

Don't preprocess JSON. Extract directly:
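For example, pulling author names straight from nested metadata, as in the schema section above:

```yaml
schema:
  authors:
    type: list
    from: "$.metadata.authors[*].name"
```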
### 4. Validate at Ingestion

Catch bad data early:
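For example, reusing the `year` validator from the schema section:

```yaml
schema:
  year:
    type: number
    validate:
      min: 1900
      max: 2100
```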
### 5. Start Simple

Begin with a minimal config and add complexity as needed:

```yaml
# Start here
schema:
  content: text

embeddings:
  content_vec:
    field: content
    model: tfidf

similarities:
  default:
    embedding: content_vec

network:
  edges:
    similarity: default
    min: 0.3
```
### 6. Use Composition for Flexibility

Build complex behaviors from simple parts:

```yaml
similarities:
  # Simple parts
  text_sim: { embedding: text_vec }
  tag_sim: { field: tags, metric: jaccard }

  # Composed
  overall:
    combine:
      - ref: text_sim
        weight: 0.7
      - ref: tag_sim
        weight: 0.3
```
## See Also
- Core Concepts - Understanding structured similarity
- Chunking Guide - Text chunking strategies
- API Reference - Python API documentation