jsonl-algebra: Production-Grade Relational Algebra for Nested JSON

jsonl-algebra (command: ja) is a production-grade implementation of relational algebra for JSONL data. It's the full-featured, battle-tested realization of dotsuite's dotrelate component—bringing SQL-like operations to the command line with first-class support for nested JSON structures.

The Relationship to Dotsuite¶

In dotsuite's architecture, dotrelate provides relational operations on document collections:

dotrelate: Join, union, project, difference on collections
Theoretical foundation: Relational algebra lifted to nested documents

jsonl-algebra (ja) is the production implementation of these concepts, with: - ✅ Published on PyPI: pip install jsonl-algebra - ✅ Battle-tested: Used in real-world data pipelines - ✅ Feature-complete: All relational operations + aggregations - ✅ Streaming: Process datasets larger than memory - ✅ Schema tools: Inference and validation - ✅ Interactive REPL: Build pipelines step-by-step

Core Insight: Relational Algebra for Nested Data¶

Traditional relational algebra assumes flat tables:

SELECT name, age FROM users WHERE age > 30

But modern JSON is deeply nested:

{
  "user": {
    "id": 1,
    "name": "Alice",
    "address": {
      "city": "NYC",
      "zip": "10001"
    }
  },
  "orders": [
    {"id": 101, "amount": 50}
  ]
}

jsonl-algebra bridges this gap by extending relational algebra with dot notation for nested access:

ja select 'user.age > 30' users.jsonl
ja project user.name,user.address.city users.jsonl
ja join users.jsonl orders.jsonl --on user.id=customer_id

The Five Core Operations¶

Relational algebra has five fundamental operations that form a complete algebra:

1. Selection (σ) — Filter Rows¶

Mathematical notation: \(\sigma_{\text{predicate}}(R)\)

# Filter where status is "active"
ja select 'status == `"active"`' data.jsonl

# Filter on nested fields
ja select 'user.age > 30' users.jsonl

# Complex boolean logic
ja select 'price < 100 and category == `"electronics"`' products.jsonl

Properties: - Commutative: \(\sigma_{p_1}(\sigma_{p_2}(R)) = \sigma_{p_2}(\sigma_{p_1}(R))\) - Combinable: \(\sigma_{p_1}(\sigma_{p_2}(R)) = \sigma_{p_1 \land p_2}(R)\)

2. Projection (π) — Select/Compute Columns¶

Mathematical notation: \(\pi_{\text{columns}}(R)\)

# Pick specific fields
ja project id,name data.jsonl

# Access nested fields
ja project user.name,user.address.city users.jsonl

# Computed columns (coming soon)
ja project name,annual_income=salary*12 employees.jsonl

Properties: - Idempotent (for simple projections): \(\pi_a(\pi_{a,b}(R)) = \pi_a(R)\)

3. Join (⋈) — Combine Relations¶

Mathematical notation: \(R \bowtie_{\text{condition}} S\)

# Inner join on user ID
ja join users.jsonl orders.jsonl --on user.id=customer_id

# Join on nested fields
ja join posts.jsonl comments.jsonl --on post.id=comment.post_id

# Multiple join keys
ja join users.jsonl accounts.jsonl --on id=user_id,email=account_email

Properties: - Commutative: \(R \bowtie S = S \bowtie R\) - Associative: \((R \bowtie S) \bowtie T = R \bowtie (S \bowtie T)\)

This means you can join multiple files in any order:

ja join users.jsonl orders.jsonl --on user.id=customer_id \
  | ja join - products.jsonl --on product_id=id

4. Union (∪) — Combine All Rows¶

Mathematical notation: \(R \cup S\)

# Combine employees and contractors
ja union employees.jsonl contractors.jsonl

# Union multiple sources
ja union jan.jsonl feb.jsonl mar.jsonl

Properties: - Commutative: \(R \cup S = S \cup R\) - Associative: \((R \cup S) \cup T = R \cup (S \cup T)\) - Identity: \(R \cup \emptyset = R\)

5. Difference (−) — Set Subtraction¶

Mathematical notation: \(R - S\)

# Find users without orders
ja difference users.jsonl customers_with_orders.jsonl --on id

# Set operations
ja difference all_items.jsonl sold_items.jsonl --on sku

Properties: - Not commutative: \(R - S \neq S - R\) - Anti-identity: \(R - \emptyset = R\), but \(\emptyset - R = \emptyset\)

Aggregation Operations: Extending the Algebra¶

While the five core operations are complete, practical data processing needs aggregation:

GroupBy (γ) — Aggregate by Groups¶

Mathematical notation: \(\gamma_{\text{group\_keys}, \text{aggregations}}(R)\)

# Count by category
ja groupby category --agg count products.jsonl

# Sum by customer
ja groupby customer_id --agg sum:amount orders.jsonl

# Multiple aggregations
ja groupby department --agg avg:salary,count,max:age employees.jsonl

Supported aggregations: - count: Number of items - sum: Sum of values - avg: Average of values - min/max: Minimum/maximum - list: Collect all values into array

Chained Grouping: An Innovation¶

Traditional SQL requires nested subqueries for multi-level grouping. jsonl-algebra enables chained grouping through metadata preservation:

# Multi-level grouping
cat sales.jsonl \
  | ja groupby region \        # First level
  | ja groupby product \       # Second level
  | ja groupby month \         # Third level
  | ja agg total=sum:amount    # Final aggregation

How it works: Each groupby adds metadata without aggregating:

{
  "sale_id": 101,
  "region": "east",
  "product": "laptop",
  "month": "jan",
  "amount": 1200,
  "_groups": [
    {"field": "region", "value": "east"},
    {"field": "product", "value": "laptop"},
    {"field": "month", "value": "jan"}
  ],
  "_group_size": 5,
  "_group_index": 0
}

This preserves the relational structure while tracking grouping hierarchy—a key innovation that enables composition.

Real-World Example: The 5-Minute Tour¶

Setup: Two JSONL files

users.jsonl:

{"user": {"id": 1, "name": "Alice"}}
{"user": {"id": 2, "name": "Bob"}}

orders.jsonl:

{"order_id": 101, "customer_id": 1, "amount": 50}
{"order_id": 102, "customer_id": 1, "amount": 75}
{"order_id": 103, "customer_id": 2, "amount": 120}

Goal: Total amount spent by each user

Solution:

# Join users and orders, then aggregate
ja join users.jsonl orders.jsonl --on user.id=customer_id \
  | ja groupby user.name --agg sum:amount

Output:

{"user.name": "Alice", "sum_amount": 125}
{"user.name": "Bob", "sum_amount": 120}

This demonstrates: - ✅ Nested field access (user.id, user.name) - ✅ Relational join - ✅ Aggregation - ✅ Unix-style piping

Additional Operations¶

Distinct — Remove Duplicates¶

# Unique users
ja distinct users.jsonl

# Distinct on specific fields
ja distinct --key user.id,email contacts.jsonl

Sort — Order Results¶

# Sort by age ascending
ja sort age users.jsonl

# Sort descending
ja sort age:desc users.jsonl

# Multi-key sort
ja sort department,salary:desc employees.jsonl

Rename — Field Aliasing¶

# Rename single field
ja rename id=user_id data.jsonl

# Rename nested fields
ja rename user.loc=user.location data.jsonl

Head/Tail — Limit Results¶

# First 10 rows
ja head 10 data.jsonl

# Last 10 rows
ja tail 10 data.jsonl

Schema Operations: Inference and Validation¶

Schema Inference¶

Automatically discover the structure of your data:

# Infer JSON Schema
ja schema infer users.jsonl > users_schema.json

Output:

{
  "type": "object",
  "properties": {
    "id": {"type": "integer"},
    "name": {"type": "string"},
    "age": {"type": "integer"},
    "address": {
      "type": "object",
      "properties": {
        "city": {"type": "string"},
        "zip": {"type": "string"}
      }
    }
  }
}

Schema Validation¶

Validate data against a schema:

# Check if data conforms to schema
ja schema validate users_schema.json users.jsonl

This is crucial for: - Data quality checks - ETL pipeline validation - API contract testing

Format Conversion: CSV Integration¶

JSONL to CSV¶

# Flatten nested structure to CSV
ja to-csv users.jsonl > users.csv

# Select specific fields
ja project user.name,user.age users.jsonl | ja to-csv > simple.csv

Intelligent flattening:

{"user": {"name": "Alice", "age": 30}}

Becomes:

user.name,user.age
Alice,30

CSV to JSONL¶

# Convert CSV to JSONL
ja from-csv users.csv > users.jsonl

Interactive REPL¶

Build and test pipelines interactively:

$ ja repl users.jsonl

ja> select 'age > 25'
# Preview first few results...

ja> project name,email
# Preview transformed data...

ja> groupby department --agg avg:salary
# See aggregated results...

ja> write output.jsonl
# Save pipeline results

The REPL is perfect for: - Exploring unfamiliar data - Testing query predicates - Iterative pipeline development - Teaching relational algebra concepts

Command-Line Examples¶

Log Analysis¶

# Find error logs from auth service
ja select 'level == `"ERROR"` and service == `"auth"`' app.log.jsonl

# Count errors by service
ja select 'level == `"ERROR"`' app.log.jsonl \
  | ja groupby service --agg count

ETL Pipeline¶

# Extract, transform, load
ja select 'status == `"completed"`' sales.jsonl \
  | ja project customer_id,amount,date \
  | ja join - customers.jsonl --on customer_id=id \
  | ja groupby customer.region --agg sum:amount \
  | ja sort sum_amount:desc \
  | ja head 10

Data Quality Checks¶

# Find users without email
ja select 'not (email != `null`)' users.jsonl

# Find duplicate user IDs
ja groupby user_id users.jsonl \
  | ja select '_group_size > 1'

Report Generation¶

# Monthly sales report
ja join orders.jsonl products.jsonl --on product_id=id \
  | ja groupby month,category --agg sum:revenue,count:orders \
  | ja sort month,sum_revenue:desc \
  | ja to-csv > monthly_report.csv

Python API¶

Use ja programmatically:

from ja.core import read_jsonl, join, groupby_agg, select_rows
from ja.schema import infer_schema

# Load data
users = read_jsonl("users.jsonl")
orders = read_jsonl("orders.jsonl")

# Join
joined = join(users, orders, on=[("user.id", "customer_id")])

# Filter
high_value = select_rows(joined, "amount > 100")

# Aggregate
result = groupby_agg(
    high_value,
    group_by_key="user.name",
    aggregations=[("sum", "amount"), ("count", None)]
)

print(list(result))

# Schema inference
schema = infer_schema(users)

Integration with Ecosystem¶

jsonl-algebra works seamlessly with other tools:

With JAF (filtering)¶

# JAF for boolean filtering, ja for relational ops
jaf filter users.jsonl '(eq? @status "active")' --eval \
  | ja join - orders.jsonl --on id=user_id \
  | ja groupby name --agg sum:amount

With jq (JSON manipulation)¶

# ja for relational ops, jq for complex transformations
ja join users.jsonl orders.jsonl --on id=user_id \
  | jq '.metadata = {processed: now, version: "1.0"}'

With standard Unix tools¶

# Combine with grep, sort, uniq
ja project email users.jsonl \
  | grep '@gmail.com' \
  | sort \
  | uniq -c

Theoretical Foundation: Relational Algebra¶

jsonl-algebra implements Codd's relational algebra extended for nested data:

Completeness¶

The five operations (σ, π, ⋈, ∪, −) form a complete algebra—any relational query can be expressed using these primitives.

Closure Property¶

Every operation produces a valid relation:

\[ \text{Relation} \xrightarrow{\text{Operation}} \text{Relation} \]

This ensures composability: R | op1 | op2 | op3 always works.

Query Optimization¶

Because operations have well-defined properties, queries can be optimized:

Selection pushdown: [ \sigma_p(R \bowtie S) \equiv (\sigma_p(R)) \bowtie S \quad \text{(if p only references R)} ]

Projection elimination: [ \pi_a(\pi_{a,b}(R)) \equiv \pi_a(R) ]

These properties enable automatic query optimization (future work).

Nested Data Extension¶

Traditional relational algebra assumes flat tuples. jsonl-algebra extends this with:

Path Expressions¶

Replace column names with dot-separated paths:

\[ \sigma_{\text{user.age} > 30}(R) \quad \text{vs} \quad \sigma_{\text{age} > 30}(R) \]

Deep Equality¶

Join conditions can reference nested fields:

\[ R \bowtie_{\text{user.id} = \text{order.customer\_id}} S \]

Projection Flattening¶

Projection can flatten nested structures:

\[ \pi_{\text{user.name as name}, \text{user.address.city as city}}(R) \]

Performance Characteristics¶

Operation	Memory	Time	Notes
select	O(1)	O(n)	Streaming filter
project	O(1)	O(n)	Field extraction
join	O(min(R,S))	O(R×S)	Hash join (smaller table in memory)
union	O(1)	O(R+S)	Streaming concatenation
distinct	O(k)	O(n)	k = unique items
groupby	O(g)	O(n)	g = number of groups
sort	O(n)	O(n log n)	External sort for large data

jsonl-algebra is designed for: - ✅ Streaming operations where possible - ✅ Bounded memory for most operations - ✅ Efficient joins via hash tables - ✅ External sorting for datasets larger than RAM

When to Use jsonl-algebra¶

Use ja when: - ✅ Working with JSONL data - ✅ Need SQL-like operations on nested JSON - ✅ Building command-line data pipelines - ✅ Joining multiple data sources - ✅ Aggregating and grouping data - ✅ Want schema inference/validation

Use something else when: - ❌ Need a persistent database → PostgreSQL, MongoDB - ❌ Complex analytical queries → DuckDB, ClickHouse - ❌ Real-time streaming → Apache Kafka, Flink - ❌ Just filtering/mapping → JAF

Combine with: - JAF: Boolean filtering before relational ops - jq: Complex JSON transformations - SQL databases: Load processed data

Design Philosophy¶

🎯 Unix Philosophy: Do one thing well (relational algebra)

🔗 Composability: Every operation produces a valid relation

📦 Streaming First: Process data without loading into memory

🛡️ Type Safety: Schema inference and validation

📚 Pedagogical: Clear mapping to mathematical operations

⚡ Production-Ready: Battle-tested, published on PyPI

Quick Start¶

# Install
pip install jsonl-algebra

# Basic usage
ja select 'age > 30' users.jsonl
ja project name,email users.jsonl
ja join users.jsonl orders.jsonl --on id=user_id

# Interactive mode
ja repl data.jsonl

Comparison: dotsuite vs jsonl-algebra¶

Feature	dotsuite (dotrelate)	jsonl-algebra (ja)
Status	Pedagogical concept	Production tool
Distribution	Source code	PyPI package
Operations	Basic join, union	Full relational algebra
Aggregation	Planned	Complete (sum, avg, count, etc.)
Schema	None	Inference + validation
Interactive	No	REPL included
Format conversion	No	CSV support
Documentation	Concept docs	Full CLI reference
Testing	Basic	Comprehensive test suite

Recommendation: Learn relational concepts through dotsuite, use ja for production.

Resources¶

PyPI: pypi.org/project/jsonl-algebra/
Repository: github.com/queelius/jsonl-algebra
Quick Start: docs/quickstart.md
Concepts: docs/concepts/jsonl-algebra.md

License¶

MIT

jsonl-algebra: Bringing the power of relational algebra to nested JSON, one operation at a time. From SQL's flat tables to JSON's nested structures, with full mathematical foundations.