YAML DSL Reference¶

Complete specification for the Complex Network RAG configuration language.

Table of Contents¶

Overview
Design Principles
Schema Section
Embeddings Section
Similarities Section
Network Section
Complete Examples
Best Practices

Overview¶

The DSL has four orthogonal sections, each handling one concern:

schema:       # Data shape and extraction
embeddings:   # Field → vector mappings
similarities: # Comparison functions
network:      # Edge creation rules

Minimal Example¶

schema:
  content: text

embeddings:
  content_vec:
    field: content
    model: tfidf

similarities:
  default:
    embedding: content_vec

network:
  edges:
    similarity: default
    min: 0.3
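
The four sections refer to each other by name: an embedding names a field, a similarity names an embedding, and the edge rule names a similarity. The sketch below is a hypothetical checker, not the library's own loader. It assumes the config has already been parsed into a plain dict, and `check_references` is an illustrative name.

```python
def check_references(config):
    """Return messages for names that do not resolve (empty list if all do)."""
    schema = config.get("schema", {})
    # Plain fields plus computed fields are both valid embedding inputs (assumption).
    fields = (set(schema) - {"compute"}) | set(schema.get("compute", {}))
    embeddings = config.get("embeddings", {})
    similarities = config.get("similarities", {})
    problems = []

    for name, spec in embeddings.items():
        if "field" in spec and spec["field"] not in fields:
            problems.append(f"embedding {name!r}: unknown field {spec['field']!r}")

    for name, spec in similarities.items():
        if "embedding" in spec and spec["embedding"] not in embeddings:
            problems.append(f"similarity {name!r}: unknown embedding {spec['embedding']!r}")

    edge_sim = config.get("network", {}).get("edges", {}).get("similarity")
    if edge_sim not in similarities:
        problems.append(f"network.edges: unknown similarity {edge_sim!r}")
    return problems


# The minimal example above, written as a Python dict.
minimal = {
    "schema": {"content": "text"},
    "embeddings": {"content_vec": {"field": "content", "model": "tfidf"}},
    "similarities": {"default": {"embedding": "content_vec"}},
    "network": {"edges": {"similarity": "default", "min": 0.3}},
}
```

Calling `check_references(minimal)` returns an empty list, since every name in the minimal example resolves.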

Design Principles¶

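The DSL is organized around the three ideas that SICP (Structure and Interpretation of Computer Programs) uses to describe any language: primitives, means of combination, and means of abstraction.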

1. Primitives¶

Minimal, orthogonal building blocks:

Primitive	Section	Purpose
Field	`schema`	Data shape and extraction
Embedding	`embeddings`	Field → vector mapping
Similarity	`similarities`	Comparison function
Edge rule	`network`	When to create edges

2. Combination¶

The same `combine` syntax is used for both embeddings and similarities:

# Combining embeddings
text_vec:
  combine:
    - ref: title_vec
      weight: 0.3
    - ref: abstract_vec
      weight: 0.7

# Combining similarities (same syntax)
overall:
  combine:
    - ref: text_sim
      weight: 0.8
    - ref: tag_sim
      weight: 0.2

3. Abstraction¶

Named definitions for reuse and clarity. Every embedding and similarity has a name that can be referenced.

Schema Section¶

Defines data shape, extraction, and computed fields.

Field Syntax¶

schema:
  # Minimal: just type
  title: text

  # With options
  abstract:
    type: text
    required: true

  # List with default
  tags:
    type: list
    default: []

  # JSONPath extraction
  authors:
    type: list
    from: "$.metadata.authors[*].name"

  # With reduce and transforms
  first_author:
    type: text
    from: "$.authors[*].name"
    reduce: first
    transform: [strip]

Field Properties¶

Property	Type	Default	Description
`type`	string	`text`	Data type: `text`, `list`, `number`, `dict`
`required`	bool	`false`	Must be present
`default`	any	`null`	Value when missing
`from`	string	-	JSONPath extraction expression
`reduce`	string	-	How to reduce multiple matches
`flatten`	bool	`false`	Flatten nested lists
`transform`	list	-	Value transformations
`validate`	dict	-	Validation constraints

JSONPath Extraction¶

Extract values from nested JSON structures:

schema:
  # Direct field
  title:
    from: "$.title"

  # Nested access
  abstract:
    from: "$.metadata.abstract"

  # Array elements
  first_author:
    from: "$.authors[0].name"

  # All array elements
  all_authors:
    type: list
    from: "$.authors[*].name"

  # Nested arrays (flattened)
  all_affiliations:
    type: list
    from: "$.authors[*].affiliations[*]"
    flatten: true
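
To make the path expressions above concrete, here is a minimal sketch of an evaluator for only the JSONPath subset used in this reference (`$`, `.key`, `[n]`, `[*]`). It is an illustration, not the extractor the library uses, and `extract` is a hypothetical name.

```python
import re

# One path step: either ".key" or "[index]" / "[*]".
_STEP = re.compile(r"\.([A-Za-z_]\w*)|\[(\d+|\*)\]")


def extract(doc, path):
    """Evaluate a small JSONPath subset and return the list of matches."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    matches = [doc]
    pos = 1
    while pos < len(path):
        step = _STEP.match(path, pos)
        if step is None:
            raise ValueError(f"unsupported syntax at offset {pos}: {path!r}")
        key, index = step.group(1), step.group(2)
        selected = []
        for node in matches:
            if key is not None:
                # ".key": descend into dicts that have the key.
                if isinstance(node, dict) and key in node:
                    selected.append(node[key])
            elif index == "*":
                # "[*]": every element of a list becomes its own match.
                if isinstance(node, list):
                    selected.extend(node)
            elif isinstance(node, list) and int(index) < len(node):
                # "[n]": a single element, if it exists.
                selected.append(node[int(index)])
        matches = selected
        pos = step.end()
    return matches
```

In this sketch a missing key simply produces no match, so `default` or `required` handling would happen in a later step.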

Reduce Operations¶

When a JSONPath expression matches multiple values, `reduce` selects how they are collapsed into the field value:

Operation	Description	Example
`first`	First match	`[a,b,c]` → `a`
`last`	Last match	`[a,b,c]` → `c`
`list`	All as list	`[a,b,c]` → `[a,b,c]`
`join`	Join strings	`[a,b,c]` → `"a, b, c"`
`count`	Count matches	`[a,b,c]` → `3`
`unique`	Deduplicate	`[a,b,a]` → `[a,b]`
`sum`	Sum numbers	`[1,2,3]` → `6`
`mean`	Average	`[2,4,6]` → `4.0`
`min`	Minimum	`[3,1,2]` → `1`
`max`	Maximum	`[3,1,2]` → `3`
`flatten`	Flatten nested	`[[a],[b,c]]` → `[a,b,c]`
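
The table can be read as a lookup from operation name to a function over the list of matches. The sketch below mirrors the examples in the table. The behavior on an empty match list and the `", "` separator for `join` are assumptions taken from the example column, not confirmed defaults.

```python
# Hypothetical reducer table; each function takes the list of JSONPath matches.
REDUCERS = {
    "first": lambda xs: xs[0] if xs else None,
    "last": lambda xs: xs[-1] if xs else None,
    "list": list,
    "join": lambda xs: ", ".join(str(x) for x in xs),
    "count": len,
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    "unique": lambda xs: list(dict.fromkeys(xs)),
    "sum": sum,
    "mean": lambda xs: sum(xs) / len(xs),
    "min": min,
    "max": max,
    # One level of flattening; non-list items are kept as they are.
    "flatten": lambda xs: [y for x in xs for y in (x if isinstance(x, list) else [x])],
}


def reduce_matches(matches, operation):
    """Apply a named reduce operation to a list of matches."""
    if operation not in REDUCERS:
        raise ValueError(f"unknown reduce operation: {operation!r}")
    return REDUCERS[operation](matches)
```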

Transforms¶

Preprocessing applied to field values:

schema:
  title:
    transform: [lowercase, strip]

Transform	Description
`lowercase`	Convert to lowercase
`strip`	Strip whitespace

Validators¶

Constraints checked at ingestion time:

schema:
  year:
    type: number
    validate:
      min: 1900
      max: 2100

  email:
    type: text
    validate:
      pattern: "^[a-z]+@[a-z]+\\.[a-z]+$"

  status:
    type: text
    validate:
      choices: [draft, published, archived]

  title:
    type: text
    validate:
      min_length: 1
      max_length: 200

Validator	Applies To	Description
`min`	number	Minimum value
`max`	number	Maximum value
`min_length`	text, list	Minimum length
`max_length`	text, list	Maximum length
`pattern`	text	Regex pattern
`choices`	any	Allowed values
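
As a rough illustration of how these constraints could be checked, the sketch below applies a `validate` dict to a single value and collects error messages. The function name, the message wording, and the use of `re.search` for `pattern` are assumptions made for the example.

```python
import re


def validate(value, rules):
    """Return a list of constraint violations for one field value."""
    errors = []
    if "min" in rules and value < rules["min"]:
        errors.append(f"{value!r} is below min {rules['min']}")
    if "max" in rules and value > rules["max"]:
        errors.append(f"{value!r} is above max {rules['max']}")
    if "min_length" in rules and len(value) < rules["min_length"]:
        errors.append(f"length {len(value)} is below min_length {rules['min_length']}")
    if "max_length" in rules and len(value) > rules["max_length"]:
        errors.append(f"length {len(value)} is above max_length {rules['max_length']}")
    if "pattern" in rules and re.search(rules["pattern"], value) is None:
        errors.append(f"{value!r} does not match pattern {rules['pattern']!r}")
    if "choices" in rules and value not in rules["choices"]:
        errors.append(f"{value!r} is not one of {rules['choices']}")
    return errors
```

For example, `validate(1850, {"min": 1900, "max": 2100})` reports one violation, while `validate(2020, {"min": 1900, "max": 2100})` returns an empty list.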

Computed Fields¶

Derive fields from existing ones:

schema:
  title: text
  abstract: text
  authors:
    type: list

  compute:
    # Short form: just formula
    full_text: "concat(title, ' ', abstract)"
    primary_author: "first(authors)"

    # Dict form with type
    author_count:
      formula: "count(authors)"
      type: number

Formula Operations¶

Operation	Description	Example
`concat(a, b, ...)`	Concatenate	`concat(title, ' - ', abstract)`
`join(field, sep)`	Join list	`join(tags, ', ')`
`first(field)`	First element	`first(authors)`
`last(field)`	Last element	`last(authors)`
`count(field)`	Count elements	`count(tags)`
`upper(field)`	Uppercase	`upper(title)`
`lower(field)`	Lowercase	`lower(title)`
`format(tpl, ...)`	Format string	`format('{} by {}', title, author)`
`coalesce(a, b, ...)`	First non-null	`coalesce(subtitle, title)`
`default(field, val)`	With fallback	`default(year, 2024)`
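
The formulas in the table happen to be valid Python expression syntax, so one way to sketch their evaluation is to parse them with Python's `ast` module and walk only calls, names, and constants. This is an illustration under that assumption, not the DSL's actual parser. `evaluate` and the `FUNCS` table are hypothetical, and an unknown field name resolves to `None` here.

```python
import ast

# Hypothetical implementations of the formula operations listed above.
FUNCS = {
    "concat": lambda *parts: "".join(str(p) for p in parts),
    "join": lambda items, sep: sep.join(str(i) for i in items),
    "first": lambda items: items[0] if items else None,
    "last": lambda items: items[-1] if items else None,
    "count": len,
    "upper": lambda s: s.upper(),
    "lower": lambda s: s.lower(),
    "format": lambda template, *args: template.format(*args),
    "coalesce": lambda *vals: next((v for v in vals if v is not None), None),
    "default": lambda val, fallback: fallback if val is None else val,
}


def evaluate(formula, record):
    """Evaluate a formula against a record (dict of field name -> value)."""

    def walk(node):
        if isinstance(node, ast.Constant):
            return node.value                  # literals such as ' ' or 2024
        if isinstance(node, ast.Name):
            return record.get(node.id)         # field lookup; missing -> None
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.keywords or node.func.id not in FUNCS:
                raise ValueError(f"unsupported call: {node.func.id}")
            return FUNCS[node.func.id](*[walk(arg) for arg in node.args])
        raise ValueError(f"unsupported expression in formula: {formula!r}")

    return walk(ast.parse(formula, mode="eval").body)
```

Because only three node types are accepted, arbitrary Python such as attribute access or arithmetic is rejected rather than executed.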

Embeddings Section¶

Named mappings from fields to vector space.

Primitive Embedding¶

Single field to vector:

embeddings:
  title_vec:
    field: title
    model: tfidf

With Chunking¶

For long text fields:

embeddings:
  abstract_vec:
    field: abstract
    model: tfidf
    chunking:
      method: sentences
      max_tokens: 512
      overlap: 50

Chunking Option	Default	Description
`method`	`sentences`	`none`, `fixed_tokens`, `sentences`, `paragraphs`
`max_tokens`	`512`	Maximum tokens per chunk
`overlap`	`50`	Token overlap between chunks
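
The interaction between `max_tokens` and `overlap` is easiest to see for the `fixed_tokens` method: each chunk starts `max_tokens - overlap` tokens after the previous one. The sketch below assumes the input is already a list of tokens. How the library tokenizes text is not specified here, and `chunk_tokens` is an illustrative name.

```python
def chunk_tokens(tokens, max_tokens=512, overlap=50):
    """Split a token list into windows of max_tokens that share `overlap` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        # Stop once a window reaches the end, so no trailing chunk is pure overlap.
        if start + max_tokens >= len(tokens):
            break
    return chunks
```

With ten tokens, `max_tokens=4`, and `overlap=1`, this yields three chunks starting at positions 0, 3, and 6.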

Composite Embedding¶

Combine multiple embeddings:

embeddings:
  title_vec:
    field: title
    model: tfidf

  abstract_vec:
    field: abstract
    model: tfidf

  # Weighted combination
  text_vec:
    combine:
      - ref: title_vec
        weight: 0.3
      - ref: abstract_vec
        weight: 0.7
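
This reference does not say how weighted vectors are merged, so the following is only one plausible reading, stated as an assumption: normalize each component, scale it by the square root of its weight, and concatenate. With weights summing to 1, the cosine similarity of two combined vectors then equals the weighted sum of the per-component cosine similarities. The function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def combine_embeddings(parts):
    """parts: list of (vector, weight) pairs. Returns one concatenated vector."""
    combined = []
    for vec, weight in parts:
        scale = math.sqrt(weight)  # sqrt so that dot products scale by `weight`
        combined.extend(scale * x for x in l2_normalize(vec))
    return combined


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Concatenation also sidesteps the fact that two TF-IDF embeddings fitted on different fields generally have different dimensions, which would rule out a plain element-wise weighted sum.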

Available Models¶

Model	Description
`tfidf`	TF-IDF vectorization (fast, interpretable)
`sentence_bert`	Sentence transformers (semantic)
`ollama`	Ollama embeddings (local LLM)

Similarities Section¶

Named comparison functions.

From Embedding¶

A similarity defined from an embedding compares vectors with cosine similarity:

similarities:
  text_sim:
    embedding: text_vec

From Field (Attribute)¶

Direct field comparison:

similarities:
  tag_sim:
    field: tags
    metric: jaccard

  year_sim:
    field: year
    metric: numeric
    params:
      max_distance: 50

Available Metrics¶

Metric	Formula	Best For
`jaccard`	|A ∩ B| / |A ∪ B|	Tags, categories
`dice`	2|A ∩ B| / (|A| + |B|)	Emphasize overlap
`overlap`	|A ∩ B| / min(|A|, |B|)	Subset matching
`exact`	1.0 if A == B else 0.0	Exact match
`numeric`	1 - |a-b|/max_distance	Numeric ranges
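
The formulas in the table translate directly into a few lines of Python. Two details below are assumptions rather than documented behavior: empty inputs score 0.0 instead of raising, and `numeric` is clamped at 0.0 when the distance exceeds `max_distance`.

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|"""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a, b):
    """2|A ∩ B| / (|A| + |B|)"""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def overlap(a, b):
    """|A ∩ B| / min(|A|, |B|)"""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def exact(a, b):
    """1.0 if the two values are equal, else 0.0."""
    return 1.0 if a == b else 0.0


def numeric(a, b, max_distance):
    """1 - |a - b| / max_distance, clamped at 0.0 (the clamp is an assumption)."""
    return max(0.0, 1.0 - abs(a - b) / max_distance)
```

For the tag lists `["ml", "nlp"]` and `["nlp", "cv"]`, Jaccard gives 1/3 while Dice gives 0.5, which shows how Dice weights the shared element more heavily.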

Composite Similarity¶

Combine multiple similarities:

similarities:
  text_sim:
    embedding: text_vec

  tag_sim:
    field: tags
    metric: jaccard

  overall:
    combine:
      - ref: text_sim
        weight: 0.7
      - ref: tag_sim
        weight: 0.3
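
A composite similarity can be read as a weighted average over the similarities it references, which may themselves be composites. The sketch below resolves `ref` names recursively for one pair of documents. Dividing by the total weight and rejecting cycles are assumptions about sensible behavior, and `combined_score` is a hypothetical name.

```python
def combined_score(name, definitions, primitive_scores, _seen=()):
    """Score one document pair under the named similarity.

    definitions: the parsed `similarities` section (name -> spec dict).
    primitive_scores: precomputed scores for non-composite similarities.
    """
    if name in _seen:
        raise ValueError(f"circular reference through {name!r}")
    spec = definitions[name]
    if "combine" not in spec:
        return primitive_scores[name]
    terms = spec["combine"]
    total_weight = sum(term["weight"] for term in terms)
    weighted = sum(
        term["weight"]
        * combined_score(term["ref"], definitions, primitive_scores, _seen + (name,))
        for term in terms
    )
    # Normalizing keeps the result in [0, 1] even if weights do not sum to 1.
    return weighted / total_weight
```

With `text_sim` at 0.8 and `tag_sim` at 0.5, the `overall` definition above evaluates to 0.7 × 0.8 + 0.3 × 0.5 = 0.71.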

Network Section¶

Edge creation and community detection rules.

Edges¶

network:
  edges:
    similarity: overall    # Which similarity to use
    min: 0.35             # Create edge if >= 0.35
    strong: 0.50          # Mark as strong if >= 0.50
    max_per_node: 100     # Optional: limit edges per node
    max_total: 50000      # Optional: limit total edges
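
The edge options can be sketched as a filter over scored document pairs. The example below is a hypothetical implementation: it assumes caps are applied greedily from the highest score down, which this reference does not specify, and `build_edges` is an illustrative name.

```python
def build_edges(scores, min_score, strong=None, max_per_node=None, max_total=None):
    """scores: dict mapping (node_a, node_b) -> similarity.

    Returns a list of (node_a, node_b, score, is_strong) tuples.
    """
    degree = {}
    edges = []
    # Visit pairs from best to worst so the caps keep the strongest edges.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (a, b), score in ranked:
        if score < min_score:
            break  # everything after this is weaker still
        if max_total is not None and len(edges) >= max_total:
            break
        if max_per_node is not None and (
            degree.get(a, 0) >= max_per_node or degree.get(b, 0) >= max_per_node
        ):
            continue
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
        edges.append((a, b, score, strong is not None and score >= strong))
    return edges
```

With `min: 0.35` and `strong: 0.50`, a pair scoring 0.6 becomes a strong edge, a pair scoring 0.4 becomes an ordinary edge, and a pair scoring 0.2 is dropped.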

Weights¶

Transform similarity scores:

network:
  weights:
    transform: sqrt       # identity, sqrt, square, log
    normalize: true       # Normalize to [0, 1]

Transform	Effect	Use Case
`identity`	No change	Default
`sqrt`	Emphasize weak links	Exploration
`square`	Emphasize strong links	Precision
`log`	Compress range	Wide distribution
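
For scores in [0, 1], `sqrt` raises small values (0.25 becomes 0.5) and `square` lowers them (0.5 becomes 0.25), which is why they emphasize weak and strong links respectively. The sketch below illustrates this. Using `log(1 + s)` for `log` and dividing by the maximum for `normalize` are assumptions, since the exact formulas are not given here.

```python
import math

# Hypothetical transform table keyed by the option names above.
TRANSFORMS = {
    "identity": lambda s: s,
    "sqrt": math.sqrt,
    "square": lambda s: s * s,
    "log": math.log1p,  # assumption: log(1 + s), which stays finite at s = 0
}


def edge_weights(scores, transform="identity", normalize=False):
    """Map similarity scores to edge weights."""
    weights = [TRANSFORMS[transform](s) for s in scores]
    if normalize and weights:
        top = max(weights)
        if top > 0:
            # Assumption: normalize by the largest weight so the maximum is 1.
            weights = [w / top for w in weights]
    return weights
```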

Communities¶

network:
  communities:
    algorithm: louvain    # louvain, label_propagation

Complete Examples¶

Academic Papers¶

schema:
  title:
    type: text
    required: true
    transform: [strip]

  abstract: text

  authors:
    type: list
    from: "$.metadata.authors[*].name"

  tags:
    type: list
    from: "$.metadata.keywords"
    transform: [lowercase]
    default: []

  year:
    type: number
    from: "$.metadata.year"
    validate:
      min: 1900
      max: 2100

  compute:
    full_text: "concat(title, ' ', abstract)"
    primary_author: "first(authors)"
    citation: "format('{} et al. ({})', first(authors), year)"

embeddings:
  title_vec:
    field: title
    model: tfidf

  abstract_vec:
    field: abstract
    model: tfidf
    chunking:
      method: sentences
      max_tokens: 512

  text_vec:
    combine:
      - ref: title_vec
        weight: 0.3
      - ref: abstract_vec
        weight: 0.7

similarities:
  text_sim:
    embedding: text_vec

  tag_sim:
    field: tags
    metric: jaccard

  author_sim:
    field: authors
    metric: jaccard

  overall:
    combine:
      - ref: text_sim
        weight: 0.6
      - ref: tag_sim
        weight: 0.25
      - ref: author_sim
        weight: 0.15

network:
  edges:
    similarity: overall
    min: 0.35
    strong: 0.50

  communities:
    algorithm: louvain

E-commerce Products¶

schema:
  name:
    type: text
    required: true
    validate:
      min_length: 1
      max_length: 200

  description: text

  category:
    type: text
    required: true
    validate:
      choices: [electronics, clothing, home, sports]

  brand: text

  price:
    type: number
    validate:
      min: 0

  tags:
    type: list
    transform: [lowercase]
    default: []

  compute:
    display_title: "concat(brand, ' ', name)"

embeddings:
  name_vec:
    field: name
    model: tfidf

  desc_vec:
    field: description
    model: tfidf

  product_vec:
    combine:
      - ref: name_vec
        weight: 0.4
      - ref: desc_vec
        weight: 0.6

similarities:
  text_sim:
    embedding: product_vec

  category_sim:
    field: category
    metric: exact

  tag_sim:
    field: tags
    metric: jaccard

  brand_sim:
    field: brand
    metric: exact

  overall:
    combine:
      - ref: text_sim
        weight: 0.5
      - ref: category_sim
        weight: 0.3
      - ref: tag_sim
        weight: 0.15
      - ref: brand_sim
        weight: 0.05

network:
  edges:
    similarity: overall
    min: 0.4
    max_per_node: 50

  communities:
    algorithm: louvain

Best Practices¶

1. Separate Concerns¶

Keep each section focused:

- Schema: Only data shape and extraction
- Embeddings: Only vectorization
- Similarities: Only comparison logic
- Network: Only edge rules

2. Name Things Clearly¶

# Good: descriptive names
embeddings:
  semantic_content: ...

similarities:
  topic_overlap: ...

# Avoid: generic names
embeddings:
  vec1: ...

similarities:
  sim: ...

3. Use JSONPath for Nested Data¶

Don't preprocess JSON. Extract directly:

schema:
  author_names:
    type: list
    from: "$.metadata.authors[*].name"

4. Validate at Ingestion¶

Catch bad data early:

schema:
  year:
    type: number
    validate:
      min: 1900
      max: 2100

5. Start Simple¶

Begin with minimal config, add complexity as needed:

# Start here
schema:
  content: text

embeddings:
  content_vec:
    field: content
    model: tfidf

similarities:
  default:
    embedding: content_vec

network:
  edges:
    similarity: default
    min: 0.3

6. Use Composition for Flexibility¶

Build complex behaviors from simple parts:

similarities:
  # Simple parts
  text_sim: { embedding: text_vec }
  tag_sim: { field: tags, metric: jaccard }

  # Composed
  overall:
    combine:
      - ref: text_sim
        weight: 0.7
      - ref: tag_sim
        weight: 0.3