# Future Plans for ja

## Lazy Evaluation and JSON Query Language

### The Core Idea
What if we could represent JSONL algebra pipelines as JSON? Instead of executing each step immediately, we'd build a query plan that could be inspected, modified, and optimized before execution.
### Simple Example

```bash
# Current eager mode (executes immediately)
cat orders.jsonl | ja select 'amount > 100' | ja groupby customer | ja agg total=sum(amount)

# Future lazy mode (builds a query plan)
cat orders.jsonl | ja query --build | ja select 'amount > 100' | ja groupby customer | ja agg total=sum(amount)
```
This would output a JSON query plan:
```json
{
  "source": "stdin",
  "operations": [
    {"op": "select", "expr": "amount > 100"},
    {"op": "groupby", "key": "customer"},
    {"op": "agg", "spec": {"total": "sum(amount)"}}
  ]
}
```
### Why This Matters
- Inspection: See what will happen before it happens
- Optimization: Rearrange operations for better performance
- Reusability: Save and share query definitions
- Tooling: Other tools could generate or consume these plans
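Because the plan is plain JSON, even a simple optimizer is just a list-to-list transformation. Here is a minimal sketch in Python, assuming the plan shape shown above; the function and the rewrite pass are illustrative, not a committed `ja` API. It merges consecutive `select` steps into a single predicate:

```python
def merge_selects(plan: dict) -> dict:
    """Fold runs of adjacent select ops into one select.

    Two back-to-back filters are equivalent to one filter whose
    predicates are joined with 'and', so the plan below rewrites
    to a single select step.
    """
    merged = []
    for op in plan["operations"]:
        if merged and op["op"] == "select" and merged[-1]["op"] == "select":
            merged[-1]["expr"] = f"({merged[-1]['expr']}) and ({op['expr']})"
        else:
            merged.append(dict(op))  # copy so the input plan is untouched
    return {**plan, "operations": merged}


plan = {
    "source": "stdin",
    "operations": [
        {"op": "select", "expr": "amount > 100"},
        {"op": "select", "expr": "region == 'EU'"},
        {"op": "groupby", "key": "customer"},
    ],
}
print(merge_selects(plan)["operations"][0])
# {'op': 'select', 'expr': "(amount > 100) and (region == 'EU')"}
```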
### Multi-Source Example

```bash
# A join pipeline
cat transactions.jsonl | ja query --build | ja join users.jsonl --on user_id | ja select 'user.active == true'
```
Query plan:
```json
{
  "source": "stdin",
  "operations": [
    {
      "op": "join",
      "with": "users.jsonl",
      "on": ["user_id", "id"]
    },
    {
      "op": "select",
      "expr": "user.active == true"
    }
  ]
}
```
### Execution Options

```bash
# Build and save a query plan
ja query --build < orders.jsonl > plan.json

# Execute a saved plan
ja query --execute plan.json < orders.jsonl

# Show what would happen (dry run)
ja query --explain plan.json
```
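To make `--execute` concrete, here is a minimal sketch of a plan interpreter in Python. It assumes the plan format shown above and handles only two ops; `eval_expr` is a stand-in for ja's real expression evaluator, which is not specified here:

```python
import json
import sys
from functools import partial
from itertools import islice


def eval_expr(expr: str, row: dict) -> bool:
    # Placeholder: a real evaluator would parse ja's expression
    # language rather than run Python eval on untrusted input.
    return bool(eval(expr, {}, dict(row)))


def execute(plan: dict, rows):
    """Apply each operation in order, keeping the pipeline lazy."""
    for op in plan["operations"]:
        if op["op"] == "select":
            rows = filter(partial(eval_expr, op["expr"]), rows)
        elif op["op"] == "limit":
            rows = islice(rows, op["n"])
        else:
            raise NotImplementedError(f"op not sketched here: {op['op']}")
    return rows


if __name__ == "__main__":
    with open(sys.argv[1]) as f:          # e.g. plan.json
        plan = json.load(f)
    for row in execute(plan, (json.loads(line) for line in sys.stdin)):
        print(json.dumps(row))
```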
### stdin Handling
For stdin sources, we keep it simple:
```bash
# If stdin is too large for lazy mode
$ cat huge_file.jsonl | ja query --build
Error: stdin too large for query building (>10MB)
Options:
  1. Save to a file first: cat huge_file.jsonl > data.jsonl
  2. Use eager mode: ja select ... (without --build)
  3. Use existing tools: cat huge_file.jsonl | head -10000 | ja query --build
```
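The size cap itself is cheap to enforce; a rough sketch, assuming the 10 MB limit from the message above:

```python
import sys

MAX_STDIN_BYTES = 10 * 1024 * 1024  # the 10MB cap from the error above


def read_stdin_for_plan() -> bytes:
    """Buffer stdin for query building, refusing oversized input early."""
    data = sys.stdin.buffer.read(MAX_STDIN_BYTES + 1)
    if len(data) > MAX_STDIN_BYTES:
        sys.exit("Error: stdin too large for query building (>10MB)")
    return data
```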
### Integration with Unix Tools
The beauty is that JSON query plans work well with existing tools:
```bash
# Use jq to modify query plans
ja query --build < data.jsonl | jq '.operations += [{"op": "limit", "n": 100}]' | ja query --execute

# Version control your queries
git add queries/monthly_report.json

# Generate queries programmatically
python generate_query.py | ja query --execute < data.jsonl
```
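The `generate_query.py` in that last line could be as small as the following sketch, which writes a plan in the format shown earlier (the script name comes from the example above; its parameter handling is hypothetical):

```python
#!/usr/bin/env python3
"""generate_query.py: emit a ja query plan on stdout."""
import json
import sys

# The threshold could come from argv, an env var, a template, ...
threshold = int(sys.argv[1]) if len(sys.argv) > 1 else 100

plan = {
    "source": "stdin",
    "operations": [
        {"op": "select", "expr": f"amount > {threshold}"},
        {"op": "groupby", "key": "customer"},
        {"op": "agg", "spec": {"total": "sum(amount)"}},
    ],
}
json.dump(plan, sys.stdout, indent=2)
```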
### Potential Benefits

- Query Optimization: rearrange or merge operations before anything executes
- Debugging: inspect exactly what a pipeline will do
- Alternative Execution Engines: other tools could consume the same plans
### Open Questions

- Should this be part of `ja` or a separate tool?
- How much optimization is worth the complexity?
- What's the right balance between lazy and eager execution?
### Next Steps

Start simple:

1. Add `--dry-run` to show what an operation would do
2. Add `--explain` to show row count estimates
3. Gather feedback on whether full lazy evaluation is needed
The goal is to enhance ja's power while maintaining its simplicity. JSON query plans could be the bridge between simple command-line usage and more complex data processing needs.
## Streaming Mode and Window Processing

### The `--streaming` Flag

Add a `--streaming` flag that enforces streaming constraints:
```bash
# This would error
ja sort --streaming data.jsonl
Error: sort operation requires seeing all data and cannot be performed in streaming mode

# This would work
ja select 'amount > 100' --streaming data.jsonl
```
Benefits:

- Explicit about memory usage expectations
- Fail fast when streaming isn't possible
- Good for production pipelines with memory constraints
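Enforcing the flag can be a simple capability check before anything runs. A sketch, assuming each operation is tagged with whether it streams (the actual capabilities are listed in the section below):

```python
# Hypothetical capability set; see "Operations by Streaming
# Capability" below for which ops actually stream.
STREAMABLE = {"select", "project", "rename"}


def check_streaming(ops: list, streaming: bool) -> None:
    """Fail fast if --streaming is set but an op needs full data."""
    if not streaming:
        return
    for op in ops:
        if op not in STREAMABLE:
            raise SystemExit(
                f"Error: {op} operation requires seeing all data "
                "and cannot be performed in streaming mode"
            )


check_streaming(["select", "sort"], streaming=True)  # exits with the error
```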
### Window-Based Processing

For operations that normally require seeing all data, add `--window-size` support:

```bash
# Sort within 1000-row windows
ja sort amount --window-size 1000 huge.jsonl

# Collect groups within windows
ja groupby region huge.jsonl | ja collect --window-size 5000

# Remove duplicates within windows
ja distinct --window-size 10000 huge.jsonl
```
This provides a middle ground:

- Process arbitrarily large files
- Trade completeness for memory efficiency
- Useful for approximate results on huge datasets
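As a sketch of the semantics (not ja's implementation): a windowed sort reads a fixed number of rows, sorts that batch, emits it, and repeats, so output is ordered within each window but not globally:

```python
import json
import sys
from itertools import islice


def windowed_sort(lines, key: str, window_size: int):
    """Sort rows within fixed-size windows; memory stays O(window_size).

    Rows are ordered inside each window but not across windows:
    the completeness-for-memory trade-off described above.
    Assumes every row carries the sort key.
    """
    it = iter(lines)
    while window := [json.loads(line) for line in islice(it, window_size)]:
        window.sort(key=lambda row: row[key])
        yield from window


for row in windowed_sort(sys.stdin, key="amount", window_size=1000):
    print(json.dumps(row))
```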
### Operations by Streaming Capability

#### Always Streaming

- `select` - Row-by-row filtering
- `project` - Row-by-row transformation
- `rename` - Row-by-row field renaming
- `groupby` (without `--agg`) - Adds metadata only

#### Never Streaming (Need Full Data)

- `sort` - Must compare all rows
- `distinct` - Must track all seen values
- `groupby --agg` - Must see all groups
- `collect` - Must gather all group members
- `join` (currently) - Needs to load the right side

#### Could Be Streaming

- `join` with pre-sorted data and a merge join
- `union` with duplicate handling disabled
- `intersection`/`difference` with Bloom filters
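To illustrate the last item: with a Bloom filter built over the right-hand side's keys, `difference` can stream the left side in constant memory, at the cost of occasionally dropping a row on a false positive. A self-contained sketch (the `id` key field and the filter sizing are assumptions):

```python
import hashlib
import json


class Bloom:
    """Tiny Bloom filter: false positives possible, false negatives never."""

    def __init__(self, size_bits: int = 1 << 20, hashes: int = 4):
        self.size, self.hashes = size_bits, hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def streaming_difference(left_lines, right_keys):
    """Emit left rows whose key is definitely absent on the right.

    A false positive wrongly drops a left-only row, so this is an
    approximate difference: the trade-off named above.
    """
    bloom = Bloom()
    for key in right_keys:
        bloom.add(key)
    for line in left_lines:
        row = json.loads(line)
        if str(row.get("id")) not in bloom:
            yield row
```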
### Implementation Plan

- Phase 1: Add `--streaming` flag to enforce constraints
- Phase 2: Implement `--window-size` for sort, distinct, collect
- Phase 3: Document streaming characteristics in help text
- Phase 4: Add approximate algorithms for streaming versions
### Example: Memory-Conscious Pipeline

```bash
# Process a 1TB log file with memory constraints
cat huge_log.jsonl \
  | ja select 'level == "ERROR"' --streaming \
  | ja project timestamp,message,host \
  | ja groupby host \
  | ja collect --window-size 10000 \
  | ja agg errors=count --window-size 10000
```
This processes the entire file while never holding more than 10,000 rows in memory.
## Path-like Matching

We provide dot notation today; we will extend it to support more complex matching.

The path `field1.*.field2[<condition-predicate>].field4` can point to many values at `field4`. Each value might be a simple string or integer, or an arbitrarily complex JSON value. When we group by such a path, a single JSONL line can therefore yield many values.
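For intuition, here is a rough sketch of wildcard path extraction in Python, ignoring condition predicates (whose syntax is not settled); the function is illustrative, not ja's implementation:

```python
def extract(value, parts):
    """Yield every value reached by following parts, where '*'
    fans out across all children of a dict or list."""
    if not parts:
        yield value
        return
    head, rest = parts[0], parts[1:]
    if head == "*":
        children = value.values() if isinstance(value, dict) else value
        for child in children:
            yield from extract(child, rest)
    elif isinstance(value, dict) and head in value:
        yield from extract(value[head], rest)


row = {"field1": {"a": {"field2": {"field4": 1}},
                  "b": {"field2": {"field4": 2}}}}
print(list(extract(row, "field1.*.field2.field4".split("."))))  # [1, 2]
```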
When we perform a group-by operation, we group by the value at the end of that path. If two JSON lines have, say, the same value associated with `field4` in the example above, then they are placed in the same group. Right?
Not so fast.
Suppose we have a JSONL file with 3 entries:

If we group by `a.key`, we get one group with the value `value` and the other group…