Dataset Management¶

The REPL provides comprehensive dataset management commands to load, organize, and inspect your data.

Loading Datasets¶

`load <file> [name]`¶

Load a JSONL file into the workspace:

# Load with default name (filename without extension)
ja> load users.jsonl
Loaded: users (current)
  Path: /home/user/users.jsonl

# Load with custom name
ja> load /data/customers.jsonl clients
Loaded: clients (current)
  Path: /data/customers.jsonl

Behavior: - The dataset becomes the current dataset - Default name is the filename stem (without .jsonl) - Names must be unique (loading fails if name already exists) - File paths are stored, not data (streaming model preserved)

Switching Between Datasets¶

`cd <name>`¶

Change the current dataset:

ja> load users.jsonl
ja> load orders.jsonl
ja> cd users
Current dataset: users

`pwd` / `current`¶

Show the current dataset:

ja> pwd
Current dataset: users
  Path: /home/user/users.jsonl

Listing Datasets¶

`datasets`¶

List all registered datasets:

ja> datasets
Registered datasets:
  orders
    /home/user/orders.jsonl
  users (current)
    /home/user/users.jsonl
  filtered
    /tmp/ja_repl_abc123/filtered_1.jsonl

Output shows: - Dataset names (alphabetically sorted) - Current dataset marked with (current) - File paths (temp files for derived datasets)

Inspecting Datasets¶

`info [name]`¶

Show detailed statistics about a dataset:

# Show info for current dataset
ja> info

Dataset: users
Path: /home/user/users.jsonl
Rows: 1,234
Size: 456.7 KB
Fields: id, name, age, email, location.city, location.state

Sample (first row):
  {
    "id": 1,
    "name": "Alice",
    "age": 30,
    ...
  }

# Show info for specific dataset
ja> info filtered

Dataset: filtered
Path: /tmp/ja_repl_abc123/filtered_1.jsonl
Rows: 523
Size: 198.4 KB
...

Information displayed: - Dataset name - File path - Row count (with comma formatting) - File size (in B, KB, or MB) - Field names (with dot notation for nested fields) - Sample of first row

`ls [name] [--limit N]`¶

Preview dataset contents:

# Preview current dataset (default: window-size lines)
ja> ls
{"id": 1, "name": "Alice", ...}
{"id": 2, "name": "Bob", ...}
...

# Preview with custom limit
ja> ls --limit 3
{"id": 1, "name": "Alice", ...}
{"id": 2, "name": "Bob", ...}
{"id": 3, "name": "Charlie", ...}

# Preview specific dataset
ja> ls users --limit 5
...

Saving Datasets¶

`save <file>`¶

Persist the current dataset to a file:

ja> save output.jsonl
Saved users to: output.jsonl

# Save to a different location
ja> save /tmp/backup.jsonl
Saved users to: /tmp/backup.jsonl

Notes: - Only saves the current dataset - Does NOT register the saved file as a new dataset - Overwrites existing files without warning

Dataset Lifecycle¶

Original Files vs. Derived Datasets¶

Original files: Loaded with load, stored at their original paths
Derived datasets: Created by operations, stored in temp directory

ja> load users.jsonl
# users -> /home/user/users.jsonl (original)

ja> select 'age > 30' adults
# adults -> /tmp/ja_repl_xyz/adults_1.jsonl (derived)

Temporary Files¶

Derived datasets are automatically stored in a temporary directory:

/tmp/ja_repl_<session_id>/
  ├── adults_1.jsonl
  ├── filtered_2.jsonl
  └── joined_3.jsonl

Cleanup: - Temp files persist for the session duration - Automatically cleaned up when REPL exits - Use save to persist important results

Name Conflict Prevention¶

Dataset names must be unique:

ja> load users.jsonl
Loaded: users (current)

ja> load users.jsonl
Error: Dataset 'users' already exists. Use a different name.

# Solution: Use custom name
ja> load users.jsonl users2
Loaded: users2 (current)

This applies to both loading and operations:

ja> select 'age > 30' filtered
Created: filtered (current)

ja> select 'city == "NYC"' filtered
Error: Dataset 'filtered' already exists. Use a different name.

Best Practices¶

Use descriptive names for derived datasets:

ja> select 'status == "active"' active_users
ja> select 'age > 65' seniors

Check info before operations to understand your data:

ja> load data.jsonl
ja> info  # Check structure first
ja> select '...' filtered

Use datasets to track your workspace:

ja> datasets  # See what's loaded
ja> info users
ja> info processed

Save important results before continuing:

ja> select ... important
ja> save important_data.jsonl  # Persist it
ja> select ... more_work

Examples¶

Loading Multiple Files¶

ja> load users.jsonl
Loaded: users (current)

ja> load orders.jsonl
Loaded: orders (current)

ja> load products.jsonl
Loaded: products (current)

ja> datasets
Registered datasets:
  orders
  products (current)
  users

Working with Custom Names¶

ja> load jan_sales.jsonl sales_jan
Loaded: sales_jan (current)

ja> load feb_sales.jsonl sales_feb
Loaded: sales_feb (current)

ja> cd sales_jan
Current dataset: sales_jan

ja> union sales_feb q1_sales
Created: q1_sales (current)

ja> info q1_sales
Dataset: q1_sales
Rows: 2,456
...

Exploring Unknown Data¶

ja> load mystery.jsonl
Loaded: mystery (current)

ja> info
Dataset: mystery
Rows: 10,234
Size: 2.3 MB
Fields (15 total): id, timestamp, user.name, user.email, ...

ja> ls --limit 2
{"id": 1, "timestamp": "2024-01-01", ...}
{"id": 2, "timestamp": "2024-01-02", ...}