Relational Algebra on the Command Line

Codd's relational algebra has five operations: selection, projection, join, union, and difference. Every operation takes one or two relations and returns a relation. That's the closure property. It means you can chain operations without limit, because the output of any operation is a valid input to the next.

This is the same structural idea as SICP's closure property for cons: the result of combining things is the same kind of thing you started with. For cons, that gives you trees. For relational algebra, it gives you composable queries.

jsonl-algebra (ja) is a command-line tool I built to make this concrete. Each subcommand implements one relational operation on JSONL data. Each reads a relation from stdin, writes a relation to stdout. Unix pipes become the composition mechanism.

ja join users.jsonl orders.jsonl --on user.id=customer_id \
  | ja select 'amount > 100' \
  | ja groupby user.name --agg sum:amount \
  | ja sort sum_amount:desc

Every | is function composition. The algebra guarantees that each intermediate result is a valid relation, so the pipeline always type-checks.

The Five Operations¶

Selection (\(\sigma\)): filter rows¶

ja select 'user.age > 30' users.jsonl
ja select 'price < 100 and category == `"electronics"`' products.jsonl

Selection is commutative: \(\sigma_{p_1}(\sigma_{p_2}(R)) = \sigma_{p_2}(\sigma_{p_1}(R))\). The order of two filters doesn't matter. It's also combinable: chaining two selections is equivalent to one selection with both predicates ANDed.

Projection (\(\pi\)): select columns¶

ja project id,name data.jsonl
ja project user.name,user.address.city users.jsonl

Projection is idempotent: \(\pi_a(\pi_{a,b}(R)) = \pi_a(R)\). Projecting a subset of already-projected columns gives the same result as projecting directly.

Join (\(\bowtie\)): combine relations¶

ja join users.jsonl orders.jsonl --on user.id=customer_id

Join is commutative and associative: \((R \bowtie S) \bowtie T = R \bowtie (S \bowtie T)\). You can join three files in any grouping and get the same result. This matters for multi-source pipelines.

Union (\(\cup\)): combine all rows¶

ja union employees.jsonl contractors.jsonl

Union is commutative, associative, and has an identity (the empty relation). It's a monoid.

Difference (\(-\)): set subtraction¶

ja difference all_items.jsonl sold_items.jsonl --on sku

Difference is the one non-commutative operation: \(R - S \neq S - R\). Items without sales is not the same as sales without items.

Extending the Algebra: Aggregation¶

The five core operations are relationally complete, but practical data work needs aggregation. ja adds groupby with the standard functions:

ja groupby department --agg avg:salary,count,max:age employees.jsonl

The interesting design choice is chained grouping. Each groupby adds metadata to the records without aggregating, so you can stack multiple grouping levels through pipes:

cat sales.jsonl \
  | ja groupby region \
  | ja groupby product \
  | ja agg total=sum:amount

This preserves the pipeline model. Each intermediate step is still a valid relation.

Nested Data¶

Traditional relational algebra assumes flat tuples. JSON is deeply nested. ja extends the algebra with dot-path expressions:

ja select 'user.address.city == `"NYC"`' users.jsonl
ja join posts.jsonl comments.jsonl --on post.id=comment.post_id

This lifts selection, projection, and join conditions to work over nested structure. The algebraic properties still hold. \(\sigma_{\text{user.age} > 30}(R)\) is still a selection, it just reaches into the document tree.

Why Pipes¶

The usual way to compose relational operations is SQL: you write a single query with nested subexpressions. SQL is powerful, but it requires a query planner and a runtime that understands the full expression before executing.

Pipes give you a different composition model. Each stage is an independent process. The shell handles the plumbing. You get tee for free (capture intermediates), head for free (limit output), and the full Unix toolkit at every junction.

The tradeoff: pipes can't do the cross-stage optimizations that a query planner can (like pushing a selection past a join). But for the kind of ad-hoc data exploration that JSONL pipelines are good for, the transparency and composability of pipes is worth it.

# Capture the join result for inspection, then continue
ja join users.jsonl orders.jsonl --on user.id=customer_id \
  | tee joined.jsonl \
  | ja groupby user.name --agg sum:amount

A Worked Example¶

Two files:

users.jsonl:

{"user": {"id": 1, "name": "Alice"}}
{"user": {"id": 2, "name": "Bob"}}

orders.jsonl:

{"order_id": 101, "customer_id": 1, "amount": 50}
{"order_id": 102, "customer_id": 1, "amount": 75}
{"order_id": 103, "customer_id": 2, "amount": 120}

Total spent per user:

ja join users.jsonl orders.jsonl --on user.id=customer_id \
  | ja groupby user.name --agg sum:amount

{"user.name": "Alice", "sum_amount": 125}
{"user.name": "Bob", "sum_amount": 120}

Join, then aggregate. Two operations, composed through a pipe. The algebra guarantees the intermediate join result is a valid relation, so the groupby has well-defined input.

The Point¶

Five operations. One closure property. Unix pipes as the glue.

The relational algebra isn't just a theoretical framework. It's a practical design constraint: if every operation consumes and produces the same kind of thing, composition is free. You don't need a query planner to compose operations. You don't need a special pipeline syntax. The shell already knows how to compose processes that read stdin and write stdout.

SICP calls this the closure property and treats it as the key to building complex structures from simple parts. Codd discovered the same principle for data: a small set of closed operations is enough to express any query. ja just connects the two ideas through a pipe.

This post is part of a series on SICP, exploring how the ideas from Structure and Interpretation of Computer Programs appear in modern programming practice.