
YAML DSL Reference

Complete specification for the Complex Network RAG configuration language.

Table of Contents

  1. Overview
  2. Design Principles
  3. Schema Section
  4. Embeddings Section
  5. Similarities Section
  6. Network Section
  7. Complete Examples
  8. Best Practices

Overview

The DSL has four orthogonal sections, each handling one concern:

schema:       # Data shape and extraction
embeddings:   # Field → vector mappings
similarities: # Comparison functions
network:      # Edge creation rules

Minimal Example

schema:
  content: text

embeddings:
  content_vec:
    field: content
    model: tfidf

similarities:
  default:
    embedding: content_vec

network:
  edges:
    similarity: default
    min: 0.3
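
The four sections refer to each other by name: an embedding names a schema field, a similarity names an embedding, and the edge rule names a similarity. A minimal Python sketch of that reference check, assuming the YAML has already been parsed into a plain dict. `check_references` is a hypothetical helper written for illustration, not part of the documented tool.

```python
def check_references(config: dict) -> list[str]:
    """Return messages for names that do not resolve (empty list if all do)."""
    errors = []
    schema = config.get("schema", {})
    # Plain fields plus computed fields; "compute" itself is not a field.
    fields = (set(schema) - {"compute"}) | set(schema.get("compute", {}))
    embeddings = config.get("embeddings", {})
    similarities = config.get("similarities", {})

    for name, emb in embeddings.items():
        if "field" in emb and emb["field"] not in fields:
            errors.append(f"embedding {name!r}: unknown field {emb['field']!r}")
    for name, sim in similarities.items():
        if "embedding" in sim and sim["embedding"] not in embeddings:
            errors.append(f"similarity {name!r}: unknown embedding {sim['embedding']!r}")
    edge_sim = config.get("network", {}).get("edges", {}).get("similarity")
    if edge_sim not in similarities:
        errors.append(f"network.edges: unknown similarity {edge_sim!r}")
    return errors


minimal = {
    "schema": {"content": "text"},
    "embeddings": {"content_vec": {"field": "content", "model": "tfidf"}},
    "similarities": {"default": {"embedding": "content_vec"}},
    "network": {"edges": {"similarity": "default", "min": 0.3}},
}
print(check_references(minimal))  # []
```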

Design Principles

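The DSL is built around the three elements of language design described in SICP (Structure and Interpretation of Computer Programs): primitives, means of combination, and means of abstraction.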

1. Primitives

Minimal, orthogonal building blocks:

Primitive   Section       Purpose
Field       schema        Data shape and extraction
Embedding   embeddings    Field → vector mapping
Similarity  similarities  Comparison function
Edge rule   network       When to create edges

2. Combination

One uniform syntax, combine, is used wherever definitions are composed:

# Combining embeddings
text_vec:
  combine:
    - ref: title_vec
      weight: 0.3
    - ref: abstract_vec
      weight: 0.7

# Combining similarities (same syntax)
overall:
  combine:
    - ref: text_sim
      weight: 0.8
    - ref: tag_sim
      weight: 0.2

3. Abstraction

Named definitions for reuse and clarity. Every embedding and similarity has a name that can be referenced.

Schema Section

Defines data shape, extraction, and computed fields.

Field Syntax

schema:
  # Minimal: just type
  title: text

  # With options
  abstract:
    type: text
    required: true

  # List with default
  tags:
    type: list
    default: []

  # JSONPath extraction
  authors:
    type: list
    from: "$.metadata.authors[*].name"

  # With reduce and transforms
  first_author:
    type: text
    from: "$.authors[*].name"
    reduce: first
    transform: [strip]

Field Properties

Property   Type    Default  Description
type       string  text     Data type: text, list, number, dict
required   bool    false    Must be present
default    any     null     Value when missing
from       string  -        JSONPath extraction expression
reduce     string  -        How to reduce multiple matches
flatten    bool    false    Flatten nested lists
transform  list    -        Value transformations
validate   dict    -        Validation constraints

JSONPath Extraction

Extract values from nested JSON structures:

schema:
  # Direct field
  title:
    from: "$.title"

  # Nested access
  abstract:
    from: "$.metadata.abstract"

  # Array elements
  first_author:
    from: "$.authors[0].name"

  # All array elements
  all_authors:
    type: list
    from: "$.authors[*].name"

  # Nested arrays (flattened)
  all_affiliations:
    type: list
    from: "$.authors[*].affiliations[*]"
    flatten: true
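
To make the path expressions above concrete, here is a small Python sketch of an evaluator for just the subset they use: `$`, `.key`, `[index]`, and `[*]`. It is an illustration only. The `extract` function and the sample record are hypothetical, and the actual tool may rely on a full JSONPath library with different edge-case behavior.

```python
import re

# One path step: either ".key" or "[index]" / "[*]".
_STEP = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+|\*)\]")


def extract(doc, path: str) -> list:
    """Return every value matched by a small JSONPath subset."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    matches = [doc]
    pos = 1
    while pos < len(path):
        m = _STEP.match(path, pos)
        if m is None:
            raise ValueError(f"unsupported syntax at position {pos}: {path!r}")
        key, index = m.group(1), m.group(2)
        step = []
        for node in matches:
            if key is not None:
                if isinstance(node, dict) and key in node:
                    step.append(node[key])
            elif index == "*":
                if isinstance(node, list):
                    step.extend(node)  # fan out over all elements
            elif isinstance(node, list) and int(index) < len(node):
                step.append(node[int(index)])
        matches = step
        pos = m.end()
    return matches


paper = {
    "title": "Graph RAG",
    "authors": [
        {"name": "Ada", "affiliations": ["X", "Y"]},
        {"name": "Bob", "affiliations": ["Z"]},
    ],
}
print(extract(paper, "$.authors[0].name"))             # ['Ada']
print(extract(paper, "$.authors[*].name"))             # ['Ada', 'Bob']
print(extract(paper, "$.authors[*].affiliations[*]"))  # ['X', 'Y', 'Z']
```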

Reduce Operations

When JSONPath matches multiple values:

Operation  Description     Example
first      First match     [a,b,c] → a
last       Last match      [a,b,c] → c
list       All as list     [a,b,c] → [a,b,c]
join       Join strings    [a,b,c] → "a, b, c"
count      Count matches   [a,b,c] → 3
unique     Deduplicate     [a,b,a] → [a,b]
sum        Sum numbers     [1,2,3] → 6
mean       Average         [2,4,6] → 4.0
min        Minimum         [3,1,2] → 1
max        Maximum         [3,1,2] → 3
flatten    Flatten nested  [[a],[b,c]] → [a,b,c]
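
The reduce operations map naturally onto small functions over the list of matches. The following Python sketch reproduces the examples in the table. The `REDUCERS` table and `reduce_matches` are hypothetical names, and details such as the ", " join separator and order-preserving deduplication are assumptions read off the examples rather than documented behavior.

```python
from statistics import mean

REDUCERS = {
    "first": lambda xs: xs[0] if xs else None,
    "last": lambda xs: xs[-1] if xs else None,
    "list": list,
    "join": lambda xs: ", ".join(str(x) for x in xs),
    "count": len,
    "unique": lambda xs: list(dict.fromkeys(xs)),  # keeps first-seen order
    "sum": sum,
    "mean": lambda xs: float(mean(xs)),
    "min": min,
    "max": max,
    # Flatten one level of nesting; non-list items pass through unchanged.
    "flatten": lambda xs: [y for x in xs for y in (x if isinstance(x, list) else [x])],
}


def reduce_matches(matches: list, op: str):
    """Collapse a list of JSONPath matches using the named operation."""
    return REDUCERS[op](matches)


print(reduce_matches(["a", "b", "c"], "join"))      # a, b, c
print(reduce_matches(["a", "b", "a"], "unique"))    # ['a', 'b']
print(reduce_matches([["a"], ["b", "c"]], "flatten"))  # ['a', 'b', 'c']
```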

Transforms

Preprocessing applied to field values:

schema:
  title:
    transform: [lowercase, strip]

Transform  Description
lowercase  Convert to lowercase
strip      Strip whitespace
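
A transform list reads as a pipeline applied left to right. The sketch below shows one way that could work in Python. Applying transforms element-wise to list fields (as the tags examples in this reference suggest) is an assumption, and `apply_transforms` is a hypothetical name.

```python
TRANSFORMS = {
    "lowercase": str.lower,
    "strip": str.strip,
}


def apply_transforms(value, names):
    """Apply the named transforms in order; lists are handled element-wise."""
    if isinstance(value, list):
        return [apply_transforms(item, names) for item in value]
    for name in names:
        value = TRANSFORMS[name](value)
    return value


print(apply_transforms("  Graph RAG  ", ["lowercase", "strip"]))  # graph rag
print(apply_transforms(["NLP", "Graphs"], ["lowercase"]))         # ['nlp', 'graphs']
```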

Validators

Constraints checked at ingestion time:

schema:
  year:
    type: number
    validate:
      min: 1900
      max: 2100

  email:
    type: text
    validate:
      pattern: "^[a-z]+@[a-z]+\\.[a-z]+$"

  status:
    type: text
    validate:
      choices: [draft, published, archived]

  title:
    type: text
    validate:
      min_length: 1
      max_length: 200

Validator   Applies To  Description
min         number      Minimum value
max         number      Maximum value
min_length  text, list  Minimum length
max_length  text, list  Maximum length
pattern     text        Regex pattern
choices     any         Allowed values
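
As a rough illustration of what ingestion-time checking involves, the Python sketch below evaluates a validate block against a single value and collects error messages. The `validate` function, its message wording, and the use of `re.search` for pattern are assumptions made for the example, not the tool's implementation.

```python
import re


def validate(value, rules: dict) -> list[str]:
    """Check one field value against a `validate` block; return error messages."""
    errors = []
    if "min" in rules and value < rules["min"]:
        errors.append(f"{value!r} is below min {rules['min']}")
    if "max" in rules and value > rules["max"]:
        errors.append(f"{value!r} is above max {rules['max']}")
    if "min_length" in rules and len(value) < rules["min_length"]:
        errors.append(f"length {len(value)} is below min_length {rules['min_length']}")
    if "max_length" in rules and len(value) > rules["max_length"]:
        errors.append(f"length {len(value)} is above max_length {rules['max_length']}")
    if "pattern" in rules and re.search(rules["pattern"], value) is None:
        errors.append(f"{value!r} does not match pattern {rules['pattern']!r}")
    if "choices" in rules and value not in rules["choices"]:
        errors.append(f"{value!r} is not one of {rules['choices']}")
    return errors


print(validate(2015, {"min": 1900, "max": 2100}))                    # []
print(len(validate(1850, {"min": 1900, "max": 2100})))               # 1
print(validate("draft", {"choices": ["draft", "published", "archived"]}))  # []
```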

Computed Fields

Derive fields from existing ones:

schema:
  title: text
  abstract: text
  authors:
    type: list

  compute:
    # Short form: just formula
    full_text: "concat(title, ' ', abstract)"
    primary_author: "first(authors)"

    # Dict form with type
    author_count:
      formula: "count(authors)"
      type: number

Formula Operations

Operation            Description     Example
concat(a, b, ...)    Concatenate     concat(title, ' - ', abstract)
join(field, sep)     Join list       join(tags, ', ')
first(field)         First element   first(authors)
last(field)          Last element    last(authors)
count(field)         Count elements  count(tags)
upper(field)         Uppercase       upper(title)
lower(field)         Lowercase       lower(title)
format(tpl, ...)     Format string   format('{} by {}', title, author)
coalesce(a, b, ...)  First non-null  coalesce(subtitle, title)
default(field, val)  With fallback   default(year, 2024)
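
To show what a few of these formulas would produce, the following Python sketch writes them as ordinary functions and applies them to a sample record. It does not parse formula strings. The function bodies, the record, and the choice to treat missing values as empty strings in concat are all assumptions for illustration.

```python
def concat(*parts):
    # Assumption: None is rendered as an empty string.
    return "".join("" if p is None else str(p) for p in parts)


def join(items, sep):
    return sep.join(str(item) for item in items)


def first(items):
    return items[0] if items else None


def count(items):
    return len(items)


def coalesce(*values):
    return next((v for v in values if v is not None), None)


record = {
    "title": "Graph RAG",
    "subtitle": None,
    "abstract": "Networks for retrieval.",
    "authors": ["Ada", "Bob"],
    "year": 2021,
}

computed = {
    "full_text": concat(record["title"], " ", record["abstract"]),
    "primary_author": first(record["authors"]),
    "author_count": count(record["authors"]),
    "display_title": coalesce(record["subtitle"], record["title"]),
    # Equivalent of format('{} et al. ({})', first(authors), year)
    "citation": "{} et al. ({})".format(first(record["authors"]), record["year"]),
}
print(computed["full_text"])  # Graph RAG Networks for retrieval.
print(computed["citation"])   # Ada et al. (2021)
```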

Embeddings Section

Named mappings from fields to vector space.

Primitive Embedding

Single field to vector:

embeddings:
  title_vec:
    field: title
    model: tfidf

With Chunking

For long text fields:

embeddings:
  abstract_vec:
    field: abstract
    model: tfidf
    chunking:
      method: sentences
      max_tokens: 512
      overlap: 50

Chunking Option  Default    Description
method           sentences  none, fixed_tokens, sentences, paragraphs
max_tokens       512        Maximum tokens per chunk
overlap          50         Token overlap between chunks
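
The interaction between max_tokens and overlap is easiest to see with the fixed_tokens method. Below is a minimal Python sketch that uses whitespace-separated words as a stand-in for tokens. The real tokenizer and the exact boundary handling are not specified in this reference, so treat `chunk_fixed_tokens` as a hypothetical illustration.

```python
def chunk_fixed_tokens(text: str, max_tokens: int = 512, overlap: int = 50) -> list[str]:
    """Split text into windows of max_tokens words that share `overlap` words."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    tokens = text.split()
    step = max_tokens - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


print(chunk_fixed_tokens("a b c d e f g h i j", max_tokens=4, overlap=1))
# ['a b c d', 'd e f g', 'g h i j']
```

Each chunk starts `max_tokens - overlap` tokens after the previous one, so the last `overlap` tokens of one chunk reappear at the start of the next.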

Composite Embedding

Combine multiple embeddings:

embeddings:
  title_vec:
    field: title
    model: tfidf

  abstract_vec:
    field: abstract
    model: tfidf

  # Weighted combination
  text_vec:
    combine:
      - ref: title_vec
        weight: 0.3
      - ref: abstract_vec
        weight: 0.7
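
This reference does not say how component vectors are merged. One common choice, sketched here in Python purely as an assumption, is to L2-normalize each component, scale it by the square root of its normalized weight, and concatenate. With that construction, the cosine similarity of two combined vectors equals the weighted average of the component cosines. The names `combine_embeddings` and `cosine` are hypothetical.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def combine_embeddings(parts):
    """parts: list of (vector, weight) pairs. Returns one concatenated vector."""
    total = sum(weight for _, weight in parts)
    combined = []
    for vec, weight in parts:
        scale = math.sqrt(weight / total)
        combined.extend(scale * x for x in l2_normalize(vec))
    return combined


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Two documents: titles are orthogonal (cosine 0), abstracts identical (cosine 1).
doc_a = combine_embeddings([([1.0, 0.0], 0.3), ([1.0, 1.0], 0.7)])
doc_b = combine_embeddings([([0.0, 1.0], 0.3), ([1.0, 1.0], 0.7)])
print(round(cosine(doc_a, doc_b), 6))  # 0.7
```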

Available Models

Model          Description
tfidf          TF-IDF vectorization (fast, interpretable)
sentence_bert  Sentence transformers (semantic)
ollama         Ollama embeddings (local LLM)

Similarities Section

Named comparison functions.

From Embedding

Uses cosine similarity:

similarities:
  text_sim:
    embedding: text_vec

From Field (Attribute)

Direct field comparison:

similarities:
  tag_sim:
    field: tags
    metric: jaccard

  year_sim:
    field: year
    metric: numeric
    params:
      max_distance: 50

Available Metrics

Metric   Formula                    Best For
jaccard  |A ∩ B| / |A ∪ B|          Tags, categories
dice     2|A ∩ B| / (|A| + |B|)     Emphasize overlap
overlap  |A ∩ B| / min(|A|, |B|)    Subset matching
exact    1.0 if A == B else 0.0     Exact match
numeric  1 - |a-b|/max_distance     Numeric ranges
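
The formulas in the table translate directly into Python. The sketch below follows them term by term. Two details are assumptions, since the table does not cover them: empty sets score 0.0, and the numeric metric is clamped at 0.0 once the distance exceeds max_distance.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a, b):
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def overlap(a, b):
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def exact(a, b):
    return 1.0 if a == b else 0.0


def numeric(a, b, max_distance):
    # Assumption: clamp so that very distant values score 0.0, not negative.
    return max(0.0, 1 - abs(a - b) / max_distance)


print(dice(["ml", "nlp"], ["nlp", "cv"]))        # 0.5
print(overlap(["nlp"], ["nlp", "cv"]))           # 1.0
print(numeric(2000, 2010, max_distance=50))      # 0.8
```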

Composite Similarity

Combine multiple similarities:

similarities:
  text_sim:
    embedding: text_vec

  tag_sim:
    field: tags
    metric: jaccard

  overall:
    combine:
      - ref: text_sim
        weight: 0.7
      - ref: tag_sim
        weight: 0.3
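
A composite similarity can be read as a weighted sum over the scores of the definitions it references. The Python sketch below resolves ref names recursively for one pair of documents. It assumes weights already sum to 1, as in the examples in this reference, and does no normalization. `evaluate` and the sample scores are hypothetical.

```python
def evaluate(name: str, definitions: dict, leaf_scores: dict) -> float:
    """Score the named similarity for one document pair.

    definitions: the parsed `similarities` section.
    leaf_scores: precomputed scores for non-composite similarities.
    """
    spec = definitions[name]
    if "combine" not in spec:
        return leaf_scores[name]
    return sum(
        part["weight"] * evaluate(part["ref"], definitions, leaf_scores)
        for part in spec["combine"]
    )


similarities = {
    "text_sim": {"embedding": "text_vec"},
    "tag_sim": {"field": "tags", "metric": "jaccard"},
    "overall": {
        "combine": [
            {"ref": "text_sim", "weight": 0.7},
            {"ref": "tag_sim", "weight": 0.3},
        ]
    },
}
pair_scores = {"text_sim": 0.5, "tag_sim": 1.0}
print(round(evaluate("overall", similarities, pair_scores), 2))  # 0.65
```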

Network Section

Edge creation and community detection rules.

Edges

network:
  edges:
    similarity: overall    # Which similarity to use
    min: 0.35             # Create edge if >= 0.35
    strong: 0.50          # Mark as strong if >= 0.50
    max_per_node: 100     # Optional: limit edges per node
    max_total: 50000      # Optional: limit total edges
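
The thresholds above can be sketched as a filter over pairwise scores. In the Python example below, min and strong follow the comments in the configuration. The way max_per_node and max_total are enforced (greedily, keeping the highest-scoring pairs first) is an assumption, and `build_edges` is a hypothetical name.

```python
def build_edges(scores, min_score=0.35, strong=0.50, max_per_node=None, max_total=None):
    """scores: {(u, v): similarity}. Returns (u, v, score, is_strong) tuples."""
    edges = []
    degree = {}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (u, v), score in ranked:
        if score < min_score:
            break  # sorted descending, so nothing later can qualify
        if max_total is not None and len(edges) >= max_total:
            break
        if max_per_node is not None and (
            degree.get(u, 0) >= max_per_node or degree.get(v, 0) >= max_per_node
        ):
            continue  # one endpoint already has its quota of edges
        edges.append((u, v, score, score >= strong))
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return edges


pairs = {("a", "b"): 0.6, ("a", "c"): 0.4, ("b", "c"): 0.2}
print(build_edges(pairs))
# [('a', 'b', 0.6, True), ('a', 'c', 0.4, False)]
```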

Weights

Transform similarity scores:

network:
  weights:
    transform: sqrt       # identity, sqrt, square, log
    normalize: true       # Normalize to [0, 1]

Transform  Effect                  Use Case
identity   No change               Default
sqrt       Emphasize weak links    Exploration
square     Emphasize strong links  Precision
log        Compress range          Wide distribution
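
For scores in [0, 1], sqrt raises small values more than large ones, while square pushes small values toward zero, which is why the table pairs them with weak and strong links respectively. The Python sketch below applies each transform. The exact definitions are not given in this reference, so log is assumed to mean log(1 + s) and normalize is assumed to divide by the largest weight. `transform_weights` is a hypothetical name.

```python
import math

TRANSFORMS = {
    "identity": lambda s: s,
    "sqrt": math.sqrt,
    "square": lambda s: s * s,
    "log": math.log1p,  # assumption: log(1 + s), which is defined at s = 0
}


def transform_weights(scores, transform="identity", normalize=False):
    """Apply the named transform to each score, optionally rescaling to [0, 1]."""
    values = [TRANSFORMS[transform](s) for s in scores]
    if normalize and values:
        top = max(values)
        # Assumption: rescale by the largest weight so the maximum becomes 1.0.
        values = [v / top for v in values] if top > 0 else values
    return values


print(transform_weights([0.25, 1.0], "sqrt"))            # [0.5, 1.0]
print(transform_weights([0.5, 1.0], "square"))           # [0.25, 1.0]
print(transform_weights([0.2, 0.4], normalize=True))     # [0.5, 1.0]
```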

Communities

network:
  communities:
    algorithm: louvain    # louvain, label_propagation

Complete Examples

Academic Papers

schema:
  title:
    type: text
    required: true
    transform: [strip]

  abstract: text

  authors:
    type: list
    from: "$.metadata.authors[*].name"

  tags:
    type: list
    from: "$.metadata.keywords"
    transform: [lowercase]
    default: []

  year:
    type: number
    from: "$.metadata.year"
    validate:
      min: 1900
      max: 2100

  compute:
    full_text: "concat(title, ' ', abstract)"
    primary_author: "first(authors)"
    citation: "format('{} et al. ({})', first(authors), year)"

embeddings:
  title_vec:
    field: title
    model: tfidf

  abstract_vec:
    field: abstract
    model: tfidf
    chunking:
      method: sentences
      max_tokens: 512

  text_vec:
    combine:
      - ref: title_vec
        weight: 0.3
      - ref: abstract_vec
        weight: 0.7

similarities:
  text_sim:
    embedding: text_vec

  tag_sim:
    field: tags
    metric: jaccard

  author_sim:
    field: authors
    metric: jaccard

  overall:
    combine:
      - ref: text_sim
        weight: 0.6
      - ref: tag_sim
        weight: 0.25
      - ref: author_sim
        weight: 0.15

network:
  edges:
    similarity: overall
    min: 0.35
    strong: 0.50

  communities:
    algorithm: louvain

E-commerce Products

schema:
  name:
    type: text
    required: true
    validate:
      min_length: 1
      max_length: 200

  description: text

  category:
    type: text
    required: true
    validate:
      choices: [electronics, clothing, home, sports]

  brand: text

  price:
    type: number
    validate:
      min: 0

  tags:
    type: list
    transform: [lowercase]
    default: []

  compute:
    display_title: "concat(brand, ' ', name)"

embeddings:
  name_vec:
    field: name
    model: tfidf

  desc_vec:
    field: description
    model: tfidf

  product_vec:
    combine:
      - ref: name_vec
        weight: 0.4
      - ref: desc_vec
        weight: 0.6

similarities:
  text_sim:
    embedding: product_vec

  category_sim:
    field: category
    metric: exact

  tag_sim:
    field: tags
    metric: jaccard

  brand_sim:
    field: brand
    metric: exact

  overall:
    combine:
      - ref: text_sim
        weight: 0.5
      - ref: category_sim
        weight: 0.3
      - ref: tag_sim
        weight: 0.15
      - ref: brand_sim
        weight: 0.05

network:
  edges:
    similarity: overall
    min: 0.4
    max_per_node: 50

  communities:
    algorithm: louvain

Best Practices

1. Separate Concerns

Keep each section focused:

- Schema: Only data shape and extraction
- Embeddings: Only vectorization
- Similarities: Only comparison logic
- Network: Only edge rules

2. Name Things Clearly

# Good: descriptive names
embeddings:
  semantic_content: ...

similarities:
  topic_overlap: ...

# Avoid: generic names
embeddings:
  vec1: ...

similarities:
  sim: ...

3. Use JSONPath for Nested Data

Don't preprocess JSON. Extract directly:

schema:
  author_names:
    type: list
    from: "$.metadata.authors[*].name"

4. Validate at Ingestion

Catch bad data early:

schema:
  year:
    type: number
    validate:
      min: 1900
      max: 2100

5. Start Simple

Begin with minimal config, add complexity as needed:

# Start here
schema:
  content: text

embeddings:
  content_vec:
    field: content
    model: tfidf

similarities:
  default:
    embedding: content_vec

network:
  edges:
    similarity: default
    min: 0.3

6. Use Composition for Flexibility

Build complex behaviors from simple parts:

similarities:
  # Simple parts
  text_sim: { embedding: text_vec }
  tag_sim: { field: tags, metric: jaccard }

  # Composed
  overall:
    combine:
      - ref: text_sim
        weight: 0.7
      - ref: tag_sim
        weight: 0.3
