Dataset Management¶
The REPL provides commands to load, switch between, list, inspect, and save datasets.
Loading Datasets¶
load <file> [name]¶
Load a JSONL file into the workspace:
# Load with default name (filename without extension)
ja> load users.jsonl
Loaded: users (current)
Path: /home/user/users.jsonl
# Load with custom name
ja> load /data/customers.jsonl clients
Loaded: clients (current)
Path: /data/customers.jsonl
Behavior:
- The dataset becomes the current dataset
- Default name is the filename stem (without .jsonl)
- Names must be unique (loading fails if name already exists)
- File paths are stored, not data (streaming model preserved)
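The behavior above amounts to a small name-to-path registry. The following is a minimal Python sketch of that idea, not the REPL's actual implementation: the `DatasetRegistry` class and its attributes are hypothetical names chosen for illustration, and the sketch does not check that the file exists.

```python
from pathlib import Path


class DatasetRegistry:
    """Hypothetical registry mapping dataset names to file paths."""

    def __init__(self):
        self.paths = {}      # name -> path; no row data is held in memory
        self.current = None  # name of the current dataset

    def load(self, file, name=None):
        path = Path(file).resolve()
        # Default name is the filename stem, e.g. "users" for users.jsonl.
        name = name or path.stem
        if name in self.paths:
            raise ValueError(
                f"Dataset '{name}' already exists. Use a different name."
            )
        self.paths[name] = path
        self.current = name  # a newly loaded dataset becomes current
        return name


registry = DatasetRegistry()
registry.load("users.jsonl")                       # registered as "users"
registry.load("/data/customers.jsonl", "clients")  # registered as "clients"
print(registry.current)  # clients
```

Storing only the path keeps loading cheap regardless of file size, which is what lets later commands stream the file instead of holding it in memory.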
Switching Between Datasets¶
cd <name>¶
Change the current dataset to another registered dataset, identified by name.
pwd / current¶
Show the current dataset.
Listing Datasets¶
datasets¶
List all registered datasets:
ja> datasets
Registered datasets:
filtered
/tmp/ja_repl_abc123/filtered_1.jsonl
orders
/home/user/orders.jsonl
users (current)
/home/user/users.jsonl
Output shows:
- Dataset names (alphabetically sorted)
- Current dataset marked with (current)
- File paths (temp files for derived datasets)
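As a rough illustration of the listing rules above (sorted names, a marker on the current dataset, one path per dataset), here is a short Python sketch. The function name `format_datasets` is hypothetical, and the exact indentation of the real output is not reproduced.

```python
def format_datasets(paths, current):
    """Render a name -> path mapping in the style of the `datasets` output."""
    lines = ["Registered datasets:"]
    for name in sorted(paths):  # alphabetical order
        marker = " (current)" if name == current else ""
        lines.append(f"{name}{marker}")
        lines.append(str(paths[name]))
    return "\n".join(lines)


print(format_datasets(
    {"users": "/home/user/users.jsonl", "orders": "/home/user/orders.jsonl"},
    current="users",
))
```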
Inspecting Datasets¶
info [name]¶
Show detailed statistics about a dataset:
# Show info for current dataset
ja> info
Dataset: users
Path: /home/user/users.jsonl
Rows: 1,234
Size: 456.7 KB
Fields: id, name, age, email, location.city, location.state
Sample (first row):
{
"id": 1,
"name": "Alice",
"age": 30,
...
}
# Show info for specific dataset
ja> info filtered
Dataset: filtered
Path: /tmp/ja_repl_abc123/filtered_1.jsonl
Rows: 523
Size: 198.4 KB
...
Information displayed:
- Dataset name
- File path
- Row count (with comma formatting)
- File size (in B, KB, or MB)
- Field names (with dot notation for nested fields)
- Sample of first row
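To make the statistics concrete, the sketch below shows one way such values could be computed in Python. It is an illustration under assumptions, not the REPL's code: the helper names are hypothetical, sizes are assumed to use 1024-byte units, and field names are taken from the first row only.

```python
import json
import os
import tempfile


def format_size(num_bytes):
    """Format a byte count as B, KB, or MB (assuming 1024-byte units)."""
    if num_bytes < 1024:
        return f"{num_bytes} B"
    if num_bytes < 1024 ** 2:
        return f"{num_bytes / 1024:.1f} KB"
    return f"{num_bytes / 1024 ** 2:.1f} MB"


def field_names(record, prefix=""):
    """List field names, using dot notation for nested objects."""
    names = []
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            names.extend(field_names(value, prefix=path + "."))
        else:
            names.append(path)
    return names


def dataset_info(path):
    """Stream the file once, counting rows and keeping only the first one."""
    rows, first = 0, None
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            if first is None:
                first = json.loads(line)
            rows += 1
    return {
        "rows": f"{rows:,}",  # comma formatting, e.g. 1,234
        "size": format_size(os.path.getsize(path)),
        "fields": field_names(first) if first else [],
    }


# Demo on a throwaway two-row file.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "users.jsonl")
    with open(sample, "w", encoding="utf-8") as f:
        f.write('{"id": 1, "location": {"city": "NYC", "state": "NY"}}\n')
        f.write('{"id": 2, "location": {"city": "LA", "state": "CA"}}\n')
    info = dataset_info(sample)

print(info["rows"], info["fields"])
```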
ls [name] [--limit N]¶
Preview dataset contents:
# Preview current dataset (default: window-size lines)
ja> ls
{"id": 1, "name": "Alice", ...}
{"id": 2, "name": "Bob", ...}
...
# Preview with custom limit
ja> ls --limit 3
{"id": 1, "name": "Alice", ...}
{"id": 2, "name": "Bob", ...}
{"id": 3, "name": "Charlie", ...}
# Preview specific dataset
ja> ls users --limit 5
...
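A preview with a row limit can be served without reading the whole file. The Python sketch below illustrates that idea with `itertools.islice`; the `preview` function and its default of 10 lines are assumptions for illustration, since the actual window size is not stated here.

```python
import itertools
import os
import tempfile


def preview(path, limit=10):
    """Return up to `limit` lines from the start of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        # islice stops after `limit` lines, so the rest is never read.
        return [line.rstrip("\n") for line in itertools.islice(f, limit)]


# Demo on a throwaway file with five rows.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "users.jsonl")
    with open(sample, "w", encoding="utf-8") as f:
        for i in range(1, 6):
            f.write(f'{{"id": {i}}}\n')
    first_three = preview(sample, limit=3)

print(first_three)  # ['{"id": 1}', '{"id": 2}', '{"id": 3}']
```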
Saving Datasets¶
save <file>¶
Persist the current dataset to a file:
ja> save output.jsonl
Saved users to: output.jsonl
# Save to a different location
ja> save /tmp/backup.jsonl
Saved users to: /tmp/backup.jsonl
Notes:
- Only saves the current dataset
- Does NOT register the saved file as a new dataset
- Overwrites existing files without warning
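Because datasets are tracked as file paths, saving can be pictured as a plain file copy. The Python sketch below assumes that model; the `save` function is hypothetical and only demonstrates the overwrite-without-warning behavior described in the notes.

```python
import os
import shutil
import tempfile


def save(current_path, destination):
    """Copy the current dataset's file, replacing any existing destination."""
    shutil.copyfile(current_path, destination)  # overwrites silently
    return f"Saved to: {destination}"


# Demo: the destination already exists and is replaced.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "users.jsonl")
    target = os.path.join(tmp, "output.jsonl")
    with open(source, "w", encoding="utf-8") as f:
        f.write('{"id": 1}\n')
    with open(target, "w", encoding="utf-8") as f:
        f.write("stale contents\n")
    save(source, target)
    with open(target, encoding="utf-8") as f:
        saved = f.read()

print(saved == '{"id": 1}\n')  # True
```

Since nothing warns before an overwrite, it is worth checking the target path before saving over a file you still need.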
Dataset Lifecycle¶
Original Files vs. Derived Datasets¶
- Original files: Loaded with load, stored at their original paths
- Derived datasets: Created by operations, stored in temp directory
ja> load users.jsonl
# users -> /home/user/users.jsonl (original)
ja> select 'age > 30' adults
# adults -> /tmp/ja_repl_xyz/adults_1.jsonl (derived)
Temporary Files¶
Derived datasets are automatically stored in a temporary directory, such as /tmp/ja_repl_abc123/ in the examples above.
Cleanup:
- Temp files persist for the session duration
- Automatically cleaned up when REPL exits
- Use save to persist important results
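The session-scoped lifecycle described above can be sketched in Python with `tempfile` and `atexit`. This is an illustration only: the `ja_repl_` prefix and the `<name>_<n>.jsonl` file pattern are inferred from the example paths in this document, and `derived_path` is a hypothetical helper.

```python
import atexit
import shutil
import tempfile
from pathlib import Path

# One temp directory per session, e.g. /tmp/ja_repl_abc123 on Linux.
temp_dir = Path(tempfile.mkdtemp(prefix="ja_repl_"))

# Remove the directory and everything in it when the process exits.
atexit.register(shutil.rmtree, temp_dir, ignore_errors=True)


def derived_path(name, counter):
    """Build the temp-file path for a derived dataset (hypothetical scheme)."""
    return temp_dir / f"{name}_{counter}.jsonl"


print(derived_path("adults", 1).name)  # adults_1.jsonl
```

Anything written under this directory disappears at exit, which is why results worth keeping should be saved to a permanent location first.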
Name Conflict Prevention¶
Dataset names must be unique:
ja> load users.jsonl
Loaded: users (current)
ja> load users.jsonl
Error: Dataset 'users' already exists. Use a different name.
# Solution: Use custom name
ja> load users.jsonl users2
Loaded: users2 (current)
This applies to both loading and operations:
ja> select 'age > 30' filtered
Created: filtered (current)
ja> select 'city == "NYC"' filtered
Error: Dataset 'filtered' already exists. Use a different name.
Best Practices¶
- Use descriptive names for derived datasets.
- Check info before operations to understand your data.
- Use datasets to track your workspace.
- Save important results before continuing.
Examples¶
Loading Multiple Files¶
ja> load users.jsonl
Loaded: users (current)
ja> load orders.jsonl
Loaded: orders (current)
ja> load products.jsonl
Loaded: products (current)
ja> datasets
Registered datasets:
orders
products (current)
users
Working with Custom Names¶
ja> load jan_sales.jsonl sales_jan
Loaded: sales_jan (current)
ja> load feb_sales.jsonl sales_feb
Loaded: sales_feb (current)
ja> cd sales_jan
Current dataset: sales_jan
ja> union sales_feb q1_sales
Created: q1_sales (current)
ja> info q1_sales
Dataset: q1_sales
Rows: 2,456
...