Operations Overview
ja
provides a complete set of relational algebra operations designed for JSON data. Each operation follows the principle of taking relations (JSONL data) and producing relations, enabling infinite composability.
Core Operations
Selection and Filtering
Operation |
Symbol |
Purpose |
Example |
select |
σ |
Filter rows based on conditions |
ja select 'age > 30' |
distinct |
δ |
Remove duplicate rows |
ja distinct |
Operation |
Symbol |
Purpose |
Example |
project |
π |
Select/compute columns |
ja project name,total=price*qty |
rename |
ρ |
Rename fields |
ja rename old_name=new_name |
Set Operations
Operation |
Symbol |
Purpose |
Example |
union |
∪ |
Combine all rows from relations |
ja union file1.jsonl file2.jsonl |
intersection |
∩ |
Keep only rows in both relations |
ja intersection file1.jsonl file2.jsonl |
difference |
- |
Keep rows in left but not right |
ja difference file1.jsonl file2.jsonl |
Join Operations
Operation |
Symbol |
Purpose |
Example |
join |
⋈ |
Inner join on condition |
ja join users.jsonl orders.jsonl --on id=user_id |
product |
× |
Cartesian product |
ja product features.jsonl options.jsonl |
Grouping and Aggregation
Operation |
Symbol |
Purpose |
Example |
groupby |
γ |
Group rows (chainable) |
ja groupby department |
agg |
γ |
Aggregate data |
ja agg count,total=sum(amount) |
Ordering
Operation |
Symbol |
Purpose |
Example |
sort |
τ |
Sort rows |
ja sort name,age --desc |
Quick Reference
Selection (Filter Rows)
# Basic filtering
ja select 'age > 30' people.jsonl
# Complex conditions
ja select 'status == "active" and score >= 80' users.jsonl
# Nested field filtering
ja select 'user.location.country == "US"' data.jsonl
# Using group metadata
ja groupby department data.jsonl | ja select '_group_size > 10'
# Select specific fields
ja project name,email people.jsonl
# Compute new fields
ja project name,total=price*quantity orders.jsonl
# Nested field selection
ja project user.name,user.email,order.total data.jsonl
# Flatten nested structures
ja project user.name,user.email --flatten data.jsonl
Aggregation
# Simple aggregation
ja agg count,total=sum(amount) orders.jsonl
# After grouping
ja groupby department employees.jsonl | ja agg avg_salary=avg(salary)
# Chained grouping
ja groupby region sales.jsonl | ja groupby product | ja agg revenue=sum(amount)
# Conditional aggregation
ja agg shipped=count_if(status=="shipped"),total=sum(amount) orders.jsonl
Joins
# Inner join
ja join users.jsonl orders.jsonl --on id=user_id
# Join with nested fields
ja join customers.jsonl orders.jsonl --on customer.id=customer_id
# Multiple files in pipeline
cat orders.jsonl | ja join customers.jsonl --on customer_id=id
Set Operations
# Combine files
ja union current.jsonl historical.jsonl
# Find common records
ja intersection whitelist.jsonl data.jsonl
# Find differences
ja difference all_users.jsonl inactive_users.jsonl
Advanced Features
Nested Data Support
All operations support dot notation for nested fields:
# Works with any level of nesting
ja select 'user.profile.preferences.notifications == true'
ja project user.profile.name,user.profile.email
ja groupby user.location.country
ja join file1.jsonl file2.jsonl --on user.id=customer.id
Expression Language
Rich expression support in select
and project
:
# Arithmetic
ja project name,annual_salary=monthly_salary*12
# String operations
ja select 'name.startswith("A")'
# Comparisons
ja select 'created_date > "2024-01-01"'
# Logical operations
ja select 'age >= 18 and status == "active"'
Aggregation Functions
Function |
Purpose |
Example |
count |
Count rows |
count |
sum(field) |
Sum numeric values |
sum(amount) |
avg(field) |
Average of values |
avg(score) |
min(field) |
Minimum value |
min(date) |
max(field) |
Maximum value |
max(price) |
list(field) |
Collect values in list |
list(tag) |
first(field) |
First value |
first(name) |
last(field) |
Last value |
last(status) |
Conditional Aggregations
Function |
Purpose |
Example |
count_if(condition) |
Count matching rows |
count_if(status=="paid") |
sum_if(field, condition) |
Sum matching values |
sum_if(amount, region=="North") |
avg_if(field, condition) |
Average matching values |
avg_if(score, active==true) |
Composition Examples
Building Complex Pipelines
# E-commerce analysis pipeline
cat orders.jsonl \
| ja select 'status == "completed"' \
| ja join products.jsonl --on product_id=id \
| ja join customers.jsonl --on customer_id=id \
| ja project \
customer.name, \
product.category, \
total=quantity*price, \
month=date[0:7] \
| ja groupby customer.tier \
| ja groupby product.category \
| ja agg \
revenue=sum(total), \
orders=count, \
avg_order=avg(total)
Data Quality Pipeline
# Clean and validate data
cat raw_data.jsonl \
| ja select 'email != null and age > 0' \
| ja project \
name, \
email=email.lower(), \
age, \
valid=email.contains("@") \
| ja select 'valid == true' \
| ja distinct \
| ja sort name
Time Series Analysis
# Analyze user engagement over time
cat events.jsonl \
| ja project \
user_id, \
event_type, \
date=timestamp[0:10] \
| ja groupby date \
| ja groupby event_type \
| ja agg \
events=count, \
unique_users=count_distinct(user_id) \
| ja sort date
Optimization Tips
- Filter Early: Use
select
before expensive operations
- Project Only Needed Fields: Reduce data size with
project
- Use Direct Aggregation: Use
--agg
for simple grouping
- Order Operations: Put most selective filters first
# Optimized pipeline
cat large_file.jsonl \
| ja select 'timestamp > "2024-01-01"' \
| ja project user_id,amount,category \
| ja groupby category \
| ja agg total=sum(amount)
Memory Efficiency
All operations are streaming by design:
- Process arbitrarily large files
- Constant memory usage (except for operations requiring sorting/grouping)
- Natural integration with Unix pipes
Error Handling
Common Issues
Invalid Expressions:
# Check expression syntax
ja select 'age > thirty' # Error: 'thirty' not defined
ja select 'age > 30' # Correct
Missing Fields:
# Handle missing fields gracefully
ja select 'field != null' # Only rows where field exists
Type Mismatches:
# Ensure consistent types
ja select 'amount > 0' # Assumes amount is numeric
Next Steps