Tutorial: Analyzing Log Files¶
Learn how to use jsonl-algebra to analyze log files, extract insights, and identify issues. This tutorial walks through a real-world scenario: analyzing web server access logs.
Time required: 15-20 minutes
What you'll learn:
- Parsing and filtering log data
- Calculating statistics and trends
- Identifying errors and anomalies
- Creating reports from logs
- Building monitoring dashboards
Scenario¶
You're analyzing access logs from a web application. The logs are in JSONL format with this structure:
{"timestamp": "2025-10-27T10:15:30Z", "method": "GET", "path": "/api/users", "status": 200, "duration_ms": 45, "user_id": 1001, "ip": "192.168.1.100"}
{"timestamp": "2025-10-27T10:15:31Z", "method": "POST", "path": "/api/orders", "status": 201, "duration_ms": 120, "user_id": 1002, "ip": "192.168.1.101"}
{"timestamp": "2025-10-27T10:15:32Z", "method": "GET", "path": "/api/products", "status": 500, "duration_ms": 5000, "user_id": 1001, "ip": "192.168.1.100"}
Setup: Create Sample Data¶
First, let's create a realistic sample log file:
cat > access_logs.jsonl << 'EOF'
{"timestamp": "2025-10-27T10:00:00Z", "method": "GET", "path": "/api/users", "status": 200, "duration_ms": 45, "user_id": 1001, "ip": "192.168.1.100"}
{"timestamp": "2025-10-27T10:00:05Z", "method": "POST", "path": "/api/orders", "status": 201, "duration_ms": 120, "user_id": 1002, "ip": "192.168.1.101"}
{"timestamp": "2025-10-27T10:00:10Z", "method": "GET", "path": "/api/products", "status": 500, "duration_ms": 5000, "user_id": 1001, "ip": "192.168.1.100"}
{"timestamp": "2025-10-27T10:00:15Z", "method": "GET", "path": "/api/users/1001", "status": 200, "duration_ms": 30, "user_id": 1001, "ip": "192.168.1.100"}
{"timestamp": "2025-10-27T10:00:20Z", "method": "DELETE", "path": "/api/orders/500", "status": 404, "duration_ms": 20, "user_id": 1003, "ip": "192.168.1.102"}
{"timestamp": "2025-10-27T10:00:25Z", "method": "GET", "path": "/api/products", "status": 200, "duration_ms": 55, "user_id": 1002, "ip": "192.168.1.101"}
{"timestamp": "2025-10-27T10:00:30Z", "method": "POST", "path": "/api/users", "status": 400, "duration_ms": 15, "user_id": null, "ip": "192.168.1.103"}
{"timestamp": "2025-10-27T10:00:35Z", "method": "GET", "path": "/api/orders", "status": 200, "duration_ms": 80, "user_id": 1001, "ip": "192.168.1.100"}
{"timestamp": "2025-10-27T10:00:40Z", "method": "PUT", "path": "/api/users/1002", "status": 500, "duration_ms": 3000, "user_id": 1002, "ip": "192.168.1.101"}
{"timestamp": "2025-10-27T10:00:45Z", "method": "GET", "path": "/api/products/search", "status": 200, "duration_ms": 150, "user_id": 1001, "ip": "192.168.1.100"}
{"timestamp": "2025-10-27T10:00:50Z", "method": "POST", "path": "/api/orders", "status": 201, "duration_ms": 95, "user_id": 1003, "ip": "192.168.1.102"}
{"timestamp": "2025-10-27T10:00:55Z", "method": "GET", "path": "/api/users", "status": 200, "duration_ms": 40, "user_id": 1002, "ip": "192.168.1.101"}
{"timestamp": "2025-10-27T10:01:00Z", "method": "GET", "path": "/api/products/123", "status": 404, "duration_ms": 25, "user_id": 1001, "ip": "192.168.1.100"}
{"timestamp": "2025-10-27T10:01:05Z", "method": "POST", "path": "/api/auth/login", "status": 401, "duration_ms": 10, "user_id": null, "ip": "192.168.1.104"}
{"timestamp": "2025-10-27T10:01:10Z", "method": "GET", "path": "/api/orders/1001", "status": 200, "duration_ms": 60, "user_id": 1001, "ip": "192.168.1.100"}
EOF
Task 1: Find All Errors¶
Let's identify all error responses (status >= 400):
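ja select 'status >= 400' access_logs.jsonl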
Output:
{"timestamp": "2025-10-27T10:00:10Z", "method": "GET", "path": "/api/products", "status": 500, ...}
{"timestamp": "2025-10-27T10:00:20Z", "method": "DELETE", "path": "/api/orders/500", "status": 404, ...}
{"timestamp": "2025-10-27T10:00:30Z", "method": "POST", "path": "/api/users", "status": 400, ...}
{"timestamp": "2025-10-27T10:00:40Z", "method": "PUT", "path": "/api/users/1002", "status": 500, ...}
{"timestamp": "2025-10-27T10:01:00Z", "method": "GET", "path": "/api/products/123", "status": 404, ...}
{"timestamp": "2025-10-27T10:01:05Z", "method": "POST", "path": "/api/auth/login", "status": 401, ...}
Save Errors to File¶
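Redirect the filtered errors to a file so you can reuse them in later steps (the filter is the same one used above; the output filename is just a suggestion):
ja select 'status >= 400' access_logs.jsonl > errors.jsonl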
Count Error Types¶
ja select 'status >= 400' access_logs.jsonl \
| ja groupby status --agg count \
| ja sort count --desc
Output:
{"status": 404, "count": 2}
{"status": 500, "count": 2}
{"status": 400, "count": 1}
{"status": 401, "count": 1}
HTTP Status Categories
- 400-499: Client errors (bad requests, auth failures)
- 500-599: Server errors (application crashes, timeouts)
Task 2: Identify Slow Requests¶
Find requests that took longer than 1 second (1000ms):
ja select 'duration_ms > 1000' access_logs.jsonl \
| ja project timestamp,method,path,duration_ms \
| ja sort duration_ms --desc
Output:
{"timestamp": "2025-10-27T10:00:10Z", "method": "GET", "path": "/api/products", "duration_ms": 5000}
{"timestamp": "2025-10-27T10:00:40Z", "method": "PUT", "path": "/api/users/1002", "duration_ms": 3000}
Calculate Performance Statistics¶
ja groupby path --agg count,avg_ms=avg:duration_ms,max_ms=max:duration_ms access_logs.jsonl \
| ja sort avg_ms --desc
Output:
{"path": "/api/products", "count": 3, "avg_ms": 1683.33, "max_ms": 5000}
{"path": "/api/orders", "count": 3, "avg_ms": 98.33, "max_ms": 120}
{"path": "/api/products/search", "count": 1, "avg_ms": 150.0, "max_ms": 150}
...
Task 3: Analyze Request Patterns¶
Requests by HTTP Method¶
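Group by HTTP method and sort by frequency:
ja groupby method --agg count access_logs.jsonl \
| ja sort count --desc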
Output:
{"method": "GET", "count": 9}
{"method": "POST", "count": 4}
{"method": "PUT", "count": 1}
{"method": "DELETE", "count": 1}
Requests by Endpoint¶
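Group by path to see which endpoints receive the most traffic:
ja groupby path --agg count access_logs.jsonl \
| ja sort count --desc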
Output:
{"path": "/api/products", "count": 2}
{"path": "/api/users", "count": 2}
{"path": "/api/orders", "count": 2}
...
Success Rate by Endpoint¶
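One way to get at success rates is to count total requests and error requests per endpoint, using the same conditional count aggregation shown in Task 7; the success rate for each endpoint is then (total - errors) / total:
ja groupby path --agg total=count,"errors=count:status >= 400" access_logs.jsonl \
| ja sort total --desc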
Task 4: User Activity Analysis¶
Most Active Users¶
ja select 'user_id != null' access_logs.jsonl \
| ja groupby user_id --agg requests=count \
| ja sort requests --desc
Output:
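{"user_id": 1001, "requests": 7}
{"user_id": 1002, "requests": 4}
{"user_id": 1003, "requests": 2}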
User Error Analysis¶
Find users experiencing the most errors:
ja select 'status >= 400 and user_id != null' access_logs.jsonl \
| ja groupby user_id --agg errors=count \
| ja sort errors --desc
Output:
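{"user_id": 1001, "errors": 2}
{"user_id": 1002, "errors": 1}
{"user_id": 1003, "errors": 1}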
Task 5: Create a Summary Report¶
Combine multiple analyses into a comprehensive report:
#!/bin/bash
# log_report.sh
LOG_FILE="access_logs.jsonl"
echo "=== Log Analysis Report ==="
echo
echo "Total Requests:"
wc -l < "$LOG_FILE"
echo
echo "Error Summary:"
ja select 'status >= 400' "$LOG_FILE" \
| ja groupby status --agg count \
| ja sort count --desc
echo
echo "Top 5 Endpoints by Request Count:"
ja groupby path --agg count "$LOG_FILE" \
| ja sort count --desc \
| head -5
echo
echo "Slowest Endpoints (avg response time):"
ja groupby path --agg avg_ms=avg:duration_ms "$LOG_FILE" \
| ja sort avg_ms --desc \
| head -5
echo
echo "Most Active Users:"
ja select 'user_id != null' "$LOG_FILE" \
| ja groupby user_id --agg count \
| ja sort count --desc \
| head -5
Make it executable and run:
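chmod +x log_report.sh
./log_report.sh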
Task 6: Time-based Analysis¶
Extract Hour from Timestamp¶
To analyze patterns by hour, the simplest route is to add a derived hour field in a preprocessing step (or use the data explorer for more complex time parsing). For a quick first look, project the raw timestamps:
# Inspect the raw timestamps alongside path and status
ja project timestamp,path,status access_logs.jsonl | head -5
Peak Usage Times¶
For a real-world scenario, you'd parse timestamps. Here's a conceptual approach:
# preprocess.py - Add an hour field derived from the timestamp
import json
from datetime import datetime

with open('access_logs.jsonl') as f:
    for line in f:
        record = json.loads(line)
        dt = datetime.fromisoformat(record['timestamp'].replace('Z', '+00:00'))
        record['hour'] = dt.hour
        print(json.dumps(record))
Then analyze:
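python preprocess.py \
| ja groupby hour --agg count \
| ja sort count --desc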
Task 7: Alerting on Anomalies¶
High Error Rate Detection¶
# Find endpoints with >50% error rate
ja groupby path \
--agg total=count,"errors=count:status >= 400" \
access_logs.jsonl \
| ja select 'errors * 2 > total' # More than 50% errors
Performance Degradation¶
# Endpoints averaging >500ms
ja groupby path --agg avg_ms=avg:duration_ms access_logs.jsonl \
| ja select 'avg_ms > 500'
Output:
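{"path": "/api/products", "avg_ms": 2527.5}
{"path": "/api/users/1002", "avg_ms": 3000.0}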
Task 8: Multi-File Analysis¶
If logs are split across multiple files:
# Combine all log files
ja union logs_morning.jsonl logs_afternoon.jsonl logs_evening.jsonl \
> full_day_logs.jsonl
# Or analyze directly
ja union logs_*.jsonl \
| ja select 'status >= 500' \
| ja groupby path --agg count
Task 9: Export for Visualization¶
Create CSV for Spreadsheet Analysis¶
ja groupby path --agg count,avg_ms=avg:duration_ms,errors="count:status >= 400" access_logs.jsonl \
| ja export csv > endpoint_stats.csv
Open endpoint_stats.csv in Excel, Google Sheets, or similar.
Create JSON for Dashboard¶
ja groupby path --agg count,avg_ms=avg:duration_ms access_logs.jsonl \
| ja export json > dashboard_data.json
Advanced Patterns¶
Pattern 1: Funnel Analysis¶
Track user journey through endpoints:
# Users who hit products endpoint
ja select 'path == "/api/products"' access_logs.jsonl \
| ja project user_id \
| ja distinct > viewed_products.jsonl
# Of those, who placed orders?
ja join viewed_products.jsonl access_logs.jsonl --on user_id=user_id \
| ja select 'path == "/api/orders" and method == "POST"' \
| ja project user_id \
| ja distinct
Pattern 2: Session Reconstruction¶
Group requests by user to reconstruct sessions:
ja select 'user_id != null' access_logs.jsonl \
| ja sort user_id,timestamp \
| ja groupby user_id --agg requests=count,paths=list:path
Pattern 3: Correlation Analysis¶
Find endpoints often accessed together:
# Collect each user's paths in timestamp order; the first element of each
# list is that user's entry point (taking only the first path would need
# a custom aggregation)
ja select 'user_id != null' access_logs.jsonl \
| ja sort user_id,timestamp \
| ja groupby user_id --agg paths=list:path
Real-World Integration¶
With Monitoring Tools¶
# Continuous monitoring
tail -f /var/log/app/access.log \
| ja select 'status >= 500' \
| ja project timestamp,path,status,duration_ms \
> critical_errors.jsonl &
# Alert when threshold reached
watch -n 60 'ja select "status >= 500" critical_errors.jsonl | wc -l'
With Log Aggregation¶
# Process logs from multiple servers
for server in web{1..5}; do
  scp "$server:/var/log/app/access.log" "${server}_access.jsonl"
done
# Combine and summarize (assumes each record carries a server_hostname field)
ja union *_access.jsonl \
| ja groupby server_hostname --agg errors="count:status >= 500"
Best Practices¶
- Filter Early - Reduce data size before running expensive operations (see the sketch below)
- Save Intermediate Results - Write filtered or aggregated data to files for multi-step analyses
- Use Scripts for Reports - Automate repetitive analysis
- Timestamp Preprocessing - Add derived time fields early
- Monitor Performance - Keep an eye on query execution time
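For example, a minimal sketch of the first two practices, filtering before aggregating and saving the filtered subset for reuse (the errors.jsonl filename is just a suggestion):
# Filter early, save the intermediate result, then aggregate the smaller file
ja select 'status >= 400' access_logs.jsonl > errors.jsonl
ja groupby path --agg count errors.jsonl | ja sort count --desc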
Troubleshooting¶
Issue: Memory errors with large logs¶
Solution: Filter or sample first
ja select 'status >= 400' huge_logs.jsonl | ja groupby path
# Or
head -100000 huge_logs.jsonl | ja groupby path
Issue: Inconsistent timestamps¶
Solution: Normalize in preprocessing
# normalize_timestamps.py - Normalize timestamps read from stdin
import json
import sys

from dateutil import parser

for line in sys.stdin:
    record = json.loads(line)
    record['timestamp'] = parser.parse(record['timestamp']).isoformat()
    print(json.dumps(record))
Summary¶
You've learned how to:
- ✅ Filter logs for errors and anomalies
- ✅ Calculate performance statistics
- ✅ Analyze request patterns
- ✅ Track user activity
- ✅ Create summary reports
- ✅ Detect performance issues
- ✅ Export data for visualization
Next Steps¶
- Real-time Monitoring Tutorial - Set up live monitoring
- ETL Pipeline Tutorial - Build data pipelines
- Log Analyzer Integration - Use the built-in tool
- Data Quality Tutorial - Validate and clean data
Practice Exercises¶
Try these on your own logs:
- Find the endpoint with the highest error rate
- Identify the slowest hour of the day
- Calculate 95th percentile response time per endpoint
- Find users with the most failed authentication attempts
- Create an automated daily summary report
Well Done!
You can now analyze logs like a pro with jsonl-algebra!