jsonl-algebra: Production-Grade Relational Algebra for Nested JSON
jsonl-algebra (command: ja) is a production-grade implementation of relational algebra for JSONL data. It's the full-featured, battle-tested realization of dotsuite's dotrelate component—bringing SQL-like operations to the command line with first-class support for nested JSON structures.
The Relationship to Dotsuite¶
In dotsuite's architecture, dotrelate provides relational operations on document collections:
- dotrelate: Join, union, project, difference on collections
- Theoretical foundation: Relational algebra lifted to nested documents
jsonl-algebra (ja) is the production implementation of these concepts, with:
- ✅ Published on PyPI: pip install jsonl-algebra
- ✅ Battle-tested: Used in real-world data pipelines
- ✅ Feature-complete: All relational operations + aggregations
- ✅ Streaming: Process datasets larger than memory
- ✅ Schema tools: Inference and validation
- ✅ Interactive REPL: Build pipelines step-by-step
Core Insight: Relational Algebra for Nested Data¶
Traditional relational algebra assumes flat tables:
But modern JSON is deeply nested:
{
"user": {
"id": 1,
"name": "Alice",
"address": {
"city": "NYC",
"zip": "10001"
}
},
"orders": [
{"id": 101, "amount": 50}
]
}
jsonl-algebra bridges this gap by extending relational algebra with dot notation for nested access:
ja select 'user.age > 30' users.jsonl
ja project user.name,user.address.city users.jsonl
ja join users.jsonl orders.jsonl --on user.id=customer_id
The Five Core Operations¶
Relational algebra has five fundamental operations that form a complete algebra:
1. Selection (σ) — Filter Rows¶
Mathematical notation: \(\sigma_{\text{predicate}}(R)\)
# Filter where status is "active"
ja select 'status == `"active"`' data.jsonl
# Filter on nested fields
ja select 'user.age > 30' users.jsonl
# Complex boolean logic
ja select 'price < 100 and category == `"electronics"`' products.jsonl
Properties: - Commutative: \(\sigma_{p_1}(\sigma_{p_2}(R)) = \sigma_{p_2}(\sigma_{p_1}(R))\) - Combinable: \(\sigma_{p_1}(\sigma_{p_2}(R)) = \sigma_{p_1 \land p_2}(R)\)
2. Projection (π) — Select/Compute Columns¶
Mathematical notation: \(\pi_{\text{columns}}(R)\)
# Pick specific fields
ja project id,name data.jsonl
# Access nested fields
ja project user.name,user.address.city users.jsonl
# Computed columns (coming soon)
ja project name,annual_income=salary*12 employees.jsonl
Properties: - Idempotent (for simple projections): \(\pi_a(\pi_{a,b}(R)) = \pi_a(R)\)
3. Join (⋈) — Combine Relations¶
Mathematical notation: \(R \bowtie_{\text{condition}} S\)
# Inner join on user ID
ja join users.jsonl orders.jsonl --on user.id=customer_id
# Join on nested fields
ja join posts.jsonl comments.jsonl --on post.id=comment.post_id
# Multiple join keys
ja join users.jsonl accounts.jsonl --on id=user_id,email=account_email
Properties: - Commutative: \(R \bowtie S = S \bowtie R\) - Associative: \((R \bowtie S) \bowtie T = R \bowtie (S \bowtie T)\)
This means you can join multiple files in any order:
ja join users.jsonl orders.jsonl --on user.id=customer_id \
| ja join - products.jsonl --on product_id=id
4. Union (∪) — Combine All Rows¶
Mathematical notation: \(R \cup S\)
# Combine employees and contractors
ja union employees.jsonl contractors.jsonl
# Union multiple sources
ja union jan.jsonl feb.jsonl mar.jsonl
Properties: - Commutative: \(R \cup S = S \cup R\) - Associative: \((R \cup S) \cup T = R \cup (S \cup T)\) - Identity: \(R \cup \emptyset = R\)
5. Difference (−) — Set Subtraction¶
Mathematical notation: \(R - S\)
# Find users without orders
ja difference users.jsonl customers_with_orders.jsonl --on id
# Set operations
ja difference all_items.jsonl sold_items.jsonl --on sku
Properties: - Not commutative: \(R - S \neq S - R\) - Anti-identity: \(R - \emptyset = R\), but \(\emptyset - R = \emptyset\)
Aggregation Operations: Extending the Algebra¶
While the five core operations are complete, practical data processing needs aggregation:
GroupBy (γ) — Aggregate by Groups¶
Mathematical notation: \(\gamma_{\text{group\_keys}, \text{aggregations}}(R)\)
# Count by category
ja groupby category --agg count products.jsonl
# Sum by customer
ja groupby customer_id --agg sum:amount orders.jsonl
# Multiple aggregations
ja groupby department --agg avg:salary,count,max:age employees.jsonl
Supported aggregations:
- count: Number of items
- sum: Sum of values
- avg: Average of values
- min/max: Minimum/maximum
- list: Collect all values into array
Chained Grouping: An Innovation¶
Traditional SQL requires nested subqueries for multi-level grouping. jsonl-algebra enables chained grouping through metadata preservation:
# Multi-level grouping
cat sales.jsonl \
| ja groupby region \ # First level
| ja groupby product \ # Second level
| ja groupby month \ # Third level
| ja agg total=sum:amount # Final aggregation
How it works: Each groupby adds metadata without aggregating:
{
"sale_id": 101,
"region": "east",
"product": "laptop",
"month": "jan",
"amount": 1200,
"_groups": [
{"field": "region", "value": "east"},
{"field": "product", "value": "laptop"},
{"field": "month", "value": "jan"}
],
"_group_size": 5,
"_group_index": 0
}
This preserves the relational structure while tracking grouping hierarchy—a key innovation that enables composition.
Real-World Example: The 5-Minute Tour¶
Setup: Two JSONL files
users.jsonl:
orders.jsonl:
{"order_id": 101, "customer_id": 1, "amount": 50}
{"order_id": 102, "customer_id": 1, "amount": 75}
{"order_id": 103, "customer_id": 2, "amount": 120}
Goal: Total amount spent by each user
Solution:
# Join users and orders, then aggregate
ja join users.jsonl orders.jsonl --on user.id=customer_id \
| ja groupby user.name --agg sum:amount
Output:
This demonstrates:
- ✅ Nested field access (user.id, user.name)
- ✅ Relational join
- ✅ Aggregation
- ✅ Unix-style piping
Additional Operations¶
Distinct — Remove Duplicates¶
# Unique users
ja distinct users.jsonl
# Distinct on specific fields
ja distinct --key user.id,email contacts.jsonl
Sort — Order Results¶
# Sort by age ascending
ja sort age users.jsonl
# Sort descending
ja sort age:desc users.jsonl
# Multi-key sort
ja sort department,salary:desc employees.jsonl
Rename — Field Aliasing¶
# Rename single field
ja rename id=user_id data.jsonl
# Rename nested fields
ja rename user.loc=user.location data.jsonl
Head/Tail — Limit Results¶
Schema Operations: Inference and Validation¶
Schema Inference¶
Automatically discover the structure of your data:
Output:
{
"type": "object",
"properties": {
"id": {"type": "integer"},
"name": {"type": "string"},
"age": {"type": "integer"},
"address": {
"type": "object",
"properties": {
"city": {"type": "string"},
"zip": {"type": "string"}
}
}
}
}
Schema Validation¶
Validate data against a schema:
This is crucial for: - Data quality checks - ETL pipeline validation - API contract testing
Format Conversion: CSV Integration¶
JSONL to CSV¶
# Flatten nested structure to CSV
ja to-csv users.jsonl > users.csv
# Select specific fields
ja project user.name,user.age users.jsonl | ja to-csv > simple.csv
Intelligent flattening:
Becomes:CSV to JSONL¶
Interactive REPL¶
Build and test pipelines interactively:
$ ja repl users.jsonl
ja> select 'age > 25'
# Preview first few results...
ja> project name,email
# Preview transformed data...
ja> groupby department --agg avg:salary
# See aggregated results...
ja> write output.jsonl
# Save pipeline results
The REPL is perfect for: - Exploring unfamiliar data - Testing query predicates - Iterative pipeline development - Teaching relational algebra concepts
Command-Line Examples¶
Log Analysis¶
# Find error logs from auth service
ja select 'level == `"ERROR"` and service == `"auth"`' app.log.jsonl
# Count errors by service
ja select 'level == `"ERROR"`' app.log.jsonl \
| ja groupby service --agg count
ETL Pipeline¶
# Extract, transform, load
ja select 'status == `"completed"`' sales.jsonl \
| ja project customer_id,amount,date \
| ja join - customers.jsonl --on customer_id=id \
| ja groupby customer.region --agg sum:amount \
| ja sort sum_amount:desc \
| ja head 10
Data Quality Checks¶
# Find users without email
ja select 'not (email != `null`)' users.jsonl
# Find duplicate user IDs
ja groupby user_id users.jsonl \
| ja select '_group_size > 1'
Report Generation¶
# Monthly sales report
ja join orders.jsonl products.jsonl --on product_id=id \
| ja groupby month,category --agg sum:revenue,count:orders \
| ja sort month,sum_revenue:desc \
| ja to-csv > monthly_report.csv
Python API¶
Use ja programmatically:
from ja.core import read_jsonl, join, groupby_agg, select_rows
from ja.schema import infer_schema
# Load data
users = read_jsonl("users.jsonl")
orders = read_jsonl("orders.jsonl")
# Join
joined = join(users, orders, on=[("user.id", "customer_id")])
# Filter
high_value = select_rows(joined, "amount > 100")
# Aggregate
result = groupby_agg(
high_value,
group_by_key="user.name",
aggregations=[("sum", "amount"), ("count", None)]
)
print(list(result))
# Schema inference
schema = infer_schema(users)
Integration with Ecosystem¶
jsonl-algebra works seamlessly with other tools:
With JAF (filtering)¶
# JAF for boolean filtering, ja for relational ops
jaf filter users.jsonl '(eq? @status "active")' --eval \
| ja join - orders.jsonl --on id=user_id \
| ja groupby name --agg sum:amount
With jq (JSON manipulation)¶
# ja for relational ops, jq for complex transformations
ja join users.jsonl orders.jsonl --on id=user_id \
| jq '.metadata = {processed: now, version: "1.0"}'
With standard Unix tools¶
# Combine with grep, sort, uniq
ja project email users.jsonl \
| grep '@gmail.com' \
| sort \
| uniq -c
Theoretical Foundation: Relational Algebra¶
jsonl-algebra implements Codd's relational algebra extended for nested data:
Completeness¶
The five operations (σ, π, ⋈, ∪, −) form a complete algebra—any relational query can be expressed using these primitives.
Closure Property¶
Every operation produces a valid relation:
This ensures composability: R | op1 | op2 | op3 always works.
Query Optimization¶
Because operations have well-defined properties, queries can be optimized:
Selection pushdown: [ \sigma_p(R \bowtie S) \equiv (\sigma_p(R)) \bowtie S \quad \text{(if p only references R)} ]
Projection elimination: [ \pi_a(\pi_{a,b}(R)) \equiv \pi_a(R) ]
These properties enable automatic query optimization (future work).
Nested Data Extension¶
Traditional relational algebra assumes flat tuples. jsonl-algebra extends this with:
Path Expressions¶
Replace column names with dot-separated paths:
Deep Equality¶
Join conditions can reference nested fields:
Projection Flattening¶
Projection can flatten nested structures:
Performance Characteristics¶
| Operation | Memory | Time | Notes |
|---|---|---|---|
| select | O(1) | O(n) | Streaming filter |
| project | O(1) | O(n) | Field extraction |
| join | O(min(R,S)) | O(R×S) | Hash join (smaller table in memory) |
| union | O(1) | O(R+S) | Streaming concatenation |
| distinct | O(k) | O(n) | k = unique items |
| groupby | O(g) | O(n) | g = number of groups |
| sort | O(n) | O(n log n) | External sort for large data |
jsonl-algebra is designed for: - ✅ Streaming operations where possible - ✅ Bounded memory for most operations - ✅ Efficient joins via hash tables - ✅ External sorting for datasets larger than RAM
When to Use jsonl-algebra¶
Use ja when:
- ✅ Working with JSONL data
- ✅ Need SQL-like operations on nested JSON
- ✅ Building command-line data pipelines
- ✅ Joining multiple data sources
- ✅ Aggregating and grouping data
- ✅ Want schema inference/validation
Use something else when: - ❌ Need a persistent database → PostgreSQL, MongoDB - ❌ Complex analytical queries → DuckDB, ClickHouse - ❌ Real-time streaming → Apache Kafka, Flink - ❌ Just filtering/mapping → JAF
Combine with: - JAF: Boolean filtering before relational ops - jq: Complex JSON transformations - SQL databases: Load processed data
Design Philosophy¶
🎯 Unix Philosophy: Do one thing well (relational algebra)
🔗 Composability: Every operation produces a valid relation
📦 Streaming First: Process data without loading into memory
🛡️ Type Safety: Schema inference and validation
📚 Pedagogical: Clear mapping to mathematical operations
⚡ Production-Ready: Battle-tested, published on PyPI
Quick Start¶
# Install
pip install jsonl-algebra
# Basic usage
ja select 'age > 30' users.jsonl
ja project name,email users.jsonl
ja join users.jsonl orders.jsonl --on id=user_id
# Interactive mode
ja repl data.jsonl
Comparison: dotsuite vs jsonl-algebra¶
| Feature | dotsuite (dotrelate) | jsonl-algebra (ja) |
|---|---|---|
| Status | Pedagogical concept | Production tool |
| Distribution | Source code | PyPI package |
| Operations | Basic join, union | Full relational algebra |
| Aggregation | Planned | Complete (sum, avg, count, etc.) |
| Schema | None | Inference + validation |
| Interactive | No | REPL included |
| Format conversion | No | CSV support |
| Documentation | Concept docs | Full CLI reference |
| Testing | Basic | Comprehensive test suite |
Recommendation: Learn relational concepts through dotsuite, use ja for production.
Resources¶
- PyPI: pypi.org/project/jsonl-algebra/
- Repository: github.com/queelius/jsonl-algebra
- Quick Start: docs/quickstart.md
- Concepts: docs/concepts/jsonl-algebra.md
License¶
MIT
jsonl-algebra: Bringing the power of relational algebra to nested JSON, one operation at a time. From SQL's flat tables to JSON's nested structures, with full mathematical foundations.