Loading Datasets in Infinigram¶
Quick Answer¶
No, you don't need an absolute path! You can use either: 1. Relative paths - relative to your storage directory 2. Absolute paths - if you want full control
Basic Usage¶
Option 1: Using VirtualFilesystem (Recommended)¶
The VirtualFilesystem manages datasets for you with Unix-like paths:
from pathlib import Path
from infinigram.vfs import VirtualFilesystem
# Create VFS with a storage directory
# This can be relative or absolute
vfs = VirtualFilesystem(storage_dir=Path("./datasets"))
# OR
vfs = VirtualFilesystem(storage_dir=Path.home() / ".infinigram" / "datasets")
# Load existing dataset by name (no path needed!)
dataset = vfs.get_dataset("math")
# The dataset is located at: storage_dir/math/
Option 2: Using Dataset Directly¶
If you want direct control, use the Dataset class:
from pathlib import Path
from infinigram.storage import Dataset
# Relative path (relative to current working directory)
dataset = Dataset(Path("./my_datasets/math"))
# Absolute path
dataset = Dataset(Path("/home/user/datasets/math"))
# Using home directory
dataset = Dataset(Path.home() / "datasets" / "math")
Complete Examples¶
Example 1: Create and Load with VFS¶
from pathlib import Path
from infinigram.vfs import VirtualFilesystem
# Setup VFS with relative path
vfs = VirtualFilesystem(storage_dir=Path("./infinigram_data"))
# Create a new dataset
math_dataset = vfs.create_dataset("math")
# Add some documents
math_dataset.add_document("Addition is combining numbers")
math_dataset.add_document("Subtraction is the inverse of addition")
math_dataset.add_tag(0, "basics")
# Later, load the dataset (same session or different session)
vfs2 = VirtualFilesystem(storage_dir=Path("./infinigram_data"))
loaded_dataset = vfs2.get_dataset("math")
print(f"Documents: {loaded_dataset.count_documents()}") # Output: 2
print(loaded_dataset.get_document(0)) # Output: "Addition is combining numbers"
Example 2: Direct Dataset Loading¶
from pathlib import Path
from infinigram.storage import Dataset
# Load dataset from relative path
dataset = Dataset(Path("./data/my_corpus"))
# Add documents
dataset.add_document("The cat sat on the mat")
dataset.add_tag(0, "animals/cat")
# Later, load from the same path
dataset2 = Dataset(Path("./data/my_corpus"))
print(dataset2.count_documents()) # Output: 1
Example 3: Using Environment Variables¶
import os
from pathlib import Path
from infinigram.vfs import VirtualFilesystem
# Get storage directory from environment variable
storage_dir = Path(os.getenv("INFINIGRAM_DATA", "./infinigram_data"))
vfs = VirtualFilesystem(storage_dir=storage_dir)
dataset = vfs.get_dataset("my_dataset")
Path Resolution Details¶
VirtualFilesystem¶
When you use VirtualFilesystem(storage_dir=Path("./datasets")):
- The
storage_diris where ALL datasets are stored - Each dataset is a subdirectory:
storage_dir/dataset_name/ - You access datasets by name, not by full path
./datasets/ # storage_dir
├── math/ # dataset "math"
│ ├── documents.jsonl
│ ├── index.db
│ └── metadata.json
├── science/ # dataset "science"
│ ├── documents.jsonl
│ ├── index.db
│ └── metadata.json
└── history/ # dataset "history"
├── documents.jsonl
├── index.db
└── metadata.json
To load:
vfs = VirtualFilesystem(Path("./datasets"))
math_ds = vfs.get_dataset("math") # loads ./datasets/math/
science_ds = vfs.get_dataset("science") # loads ./datasets/science/
Dataset Direct¶
When you use Dataset(path):
- The
pathis the full path to the dataset directory - Can be relative or absolute
- The directory contains:
documents.jsonl,index.db,metadata.json
# These all work:
Dataset(Path("./my_data")) # relative
Dataset(Path("/home/user/data")) # absolute
Dataset(Path.home() / "infinigram" / "data") # home directory
Dataset(Path.cwd() / "datasets" / "math") # current working dir
Common Patterns¶
Pattern 1: Default Location¶
from pathlib import Path
from infinigram.vfs import VirtualFilesystem
# Use a consistent default location
DEFAULT_STORAGE = Path.home() / ".infinigram" / "datasets"
vfs = VirtualFilesystem(storage_dir=DEFAULT_STORAGE)
Pattern 2: Project-Local Data¶
from pathlib import Path
from infinigram.vfs import VirtualFilesystem
# Store datasets in project directory
vfs = VirtualFilesystem(storage_dir=Path("./data"))
Pattern 3: Shared System Data¶
from pathlib import Path
from infinigram.vfs import VirtualFilesystem
# System-wide location
vfs = VirtualFilesystem(storage_dir=Path("/var/lib/infinigram/datasets"))
Checking if Dataset Exists¶
With VFS¶
vfs = VirtualFilesystem(Path("./datasets"))
if vfs.dataset_exists("math"):
dataset = vfs.get_dataset("math")
else:
dataset = vfs.create_dataset("math")
With Dataset¶
from pathlib import Path
dataset_path = Path("./datasets/math")
if dataset_path.exists():
dataset = Dataset(dataset_path)
else:
# Create new (Dataset creates directory automatically)
dataset = Dataset(dataset_path)
Listing Available Datasets¶
vfs = VirtualFilesystem(Path("./datasets"))
# List all datasets
datasets = vfs.list_datasets()
print(f"Available datasets: {datasets}")
# Load each one
for name in datasets:
ds = vfs.get_dataset(name)
print(f"{name}: {ds.count_documents()} documents")
Error Handling¶
from infinigram.vfs import VirtualFilesystem
from pathlib import Path
vfs = VirtualFilesystem(Path("./datasets"))
try:
dataset = vfs.get_dataset("nonexistent")
except FileNotFoundError as e:
print(f"Dataset not found: {e}")
# Create it instead
dataset = vfs.create_dataset("nonexistent")
Best Practices¶
✅ Recommended¶
- Use VirtualFilesystem for most cases - it manages paths for you
- Use relative paths for project-specific data
- Use
Path.home()for user-specific data - Check existence before loading if not sure dataset exists
❌ Avoid¶
- Hard-coding absolute paths in code
- Using string paths instead of
Pathobjects - Creating datasets in system directories without permissions
Summary¶
To answer your question directly:
# NO absolute path needed!
from pathlib import Path
from infinigram.vfs import VirtualFilesystem
# Use relative path for storage directory
vfs = VirtualFilesystem(Path("./my_datasets"))
# Load dataset by name (just the name, not a path!)
dataset = vfs.get_dataset("math")
# The dataset is automatically loaded from: ./my_datasets/math/
The path you provide to VirtualFilesystem() can be relative or absolute - your choice! Then you access datasets by simple names, no paths needed.