Infinigram REST API Documentation¶
Version: 0.4.0
Status: Production Ready
Compatibility: OpenAI API v1
Overview¶
Infinigram provides an OpenAI-compatible REST API for corpus-based language modeling. The API allows you to:
- Generate text completions using variable-length n-gram matching
- Manage multiple models simultaneously
- Use hierarchical weighted predictions
- Get detailed match metadata and confidence scores
- Introspect suffix matches and confidence at any context
Quick Start¶
1. Start the Server¶
# Option 1: Use the module directly
python -m infinigram.server.api

# Option 2: Start the server programmatically (Python)
from infinigram.server.api import app, model_manager
import uvicorn

# Load your models
model_manager.add_model("my-model", corpus=[1, 2, 3, 4, 5], max_length=10)

# Start the server
uvicorn.run(app, host="0.0.0.0", port=8000)
2. Test the API¶
# Check health
curl http://localhost:8000/health
# List models
curl http://localhost:8000/v1/models
# Generate completion
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "demo",
"prompt": [2, 3],
"max_tokens": 5,
"top_k": 10
}'
API Endpoints¶
Core Endpoints¶
GET /¶
Root endpoint with API information.
Response:
{
"message": "Infinigram API",
"version": "0.4.0",
"description": "Corpus-based language model with OpenAI-compatible API",
"endpoints": {
"completions": "/v1/completions",
"models": "/v1/models",
"predict": "/v1/predict",
"predict_backoff": "/v1/predict_backoff",
"suffix_matches": "/v1/suffix_matches",
"longest_suffix": "/v1/longest_suffix",
"confidence": "/v1/confidence",
"count": "/v1/count",
"search": "/v1/search",
"transforms": "/v1/transforms",
"health": "/health"
}
}
GET /health¶
Health check endpoint.
Response (illustrative; the exact fields may differ):
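{
  "status": "healthy"
}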
Completion Endpoints¶
POST /v1/completions¶
Create a text completion (OpenAI-compatible).
Request Body:
{
"model": "demo", // Required: Model ID
"prompt": [1, 2, 3], // Required: List of integer token IDs
"max_tokens": 10, // Optional: Maximum tokens to generate (default: 10)
"temperature": 1.0, // Optional: Sampling temperature (not yet implemented)
"top_k": 50, // Optional: Return top k predictions (default: 50)
"weight_function": "quadratic", // Optional: "linear", "quadratic", "exponential", "sigmoid"
"min_length": 1, // Optional: Minimum suffix length for weighted prediction
"max_length": null, // Optional: Maximum suffix length
"echo": false, // Optional: Echo prompt in response
"logprobs": 3 // Optional: Return log probabilities for top N tokens
}
Response:
{
"id": "cmpl-1760741740364",
"object": "text_completion",
"created": 1760741740,
"model": "demo",
"choices": [
{
"text": "[4, 2, 3, 5, 6]",
"index": 0,
"logprobs": null,
"finish_reason": "length",
"metadata": {
"match_position": 1,
"match_length": 7,
"confidence": 0.493,
"tokens": [4, 2, 3, 5, 6]
}
}
],
"usage": {
"prompt_tokens": 2,
"completion_tokens": 5,
"total_tokens": 7
}
}
Example with Weighted Prediction:
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "demo",
"prompt": [2, 3],
"max_tokens": 3,
"weight_function": "quadratic",
"min_length": 1,
"max_length": 5
}'
Example with Log Probabilities:
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "demo",
"prompt": [2, 3],
"max_tokens": 2,
"logprobs": 3
}'
Response includes detailed probability information:
{
"logprobs": {
"content": [
{
"tokens": ["4", "5", "1"],
"token_logprobs": [-0.307, -1.399, -6.014],
"top_logprobs": {
"4": -0.307,
"5": -1.399,
"1": -6.014
}
}
]
}
}
Model Management Endpoints¶
GET /v1/models¶
List all available models (OpenAI-compatible).
Response:
{
"object": "list",
"data": [
{
"id": "demo",
"object": "model",
"created": 1760741705,
"owned_by": "infinigram",
"description": "Simple demo model with numeric tokens",
"corpus_size": 17,
"vocab_size": 9,
"max_length": 10
}
]
}
GET /v1/models/{model_id}¶
Get information about a specific model.
Example:
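curl http://localhost:8000/v1/models/demo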
Response:
{
"id": "demo",
"object": "model",
"created": 1760741759,
"owned_by": "infinigram",
"description": "Simple demo model with numeric tokens",
"corpus_size": 17,
"vocab_size": 9,
"max_length": 10
}
POST /v1/models/load¶
Load a new model from a corpus.
Request:
{
"model_id": "my-custom-model",
"corpus": [1, 2, 3, 4, 5, 6, 7, 8],
"max_length": 10,
"description": "My custom model description"
}
Example:
curl -X POST http://localhost:8000/v1/models/load \
-H "Content-Type: application/json" \
-d '{
"model_id": "test-model",
"corpus": [1,2,3,4,5,2,3,6],
"max_length": 5,
"description": "Test model"
}'
Response (illustrative; the exact fields may differ):
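{
  "status": "loaded",
  "model_id": "test-model"
}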
DELETE /v1/models/{model_id}¶
Unload a model from memory.
Example:
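curl -X DELETE http://localhost:8000/v1/models/test-model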
Response (illustrative; the exact fields may differ):
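{
  "status": "unloaded",
  "model_id": "test-model"
}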
Introspection Endpoints¶
These endpoints provide direct access to model introspection without generating completions.
POST /v1/predict¶
Get next-byte predictions for a context.
Request Body:
{
"model": "demo",
"context": "the cat",
"top_k": 50,
"smoothing": 0.0,
"weight_function": null,
"transforms": ["lowercase"]
}
Response:
{
"model": "demo",
"context": "the cat",
"predictions": [
{"byte": 32, "char": " ", "probability": 0.853},
{"byte": 115, "char": "s", "probability": 0.042}
],
"transforms": ["lowercase"],
"weight_function": null
}
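Example (mirroring the request fields shown above):
curl -X POST http://localhost:8000/v1/predict \
  -H "Content-Type: application/json" \
  -d '{
    "model": "demo",
    "context": "the cat",
    "top_k": 5
  }'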
POST /v1/predict_backoff¶
Get predictions using Stupid Backoff smoothing.
Request Body:
{
"model": "demo",
"context": "the cat",
"top_k": 50,
"backoff_factor": 0.4,
"min_count_threshold": 1,
"smoothing": 0.0,
"transforms": null
}
Response:
{
"model": "demo",
"context": "the cat",
"predictions": [
{"byte": 32, "char": " ", "probability": 0.853}
],
"backoff_factor": 0.4,
"min_count_threshold": 1,
"transforms": null
}
POST /v1/suffix_matches¶
Find all matching suffixes at different lengths.
Request Body:
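{
  "model": "demo",
  "context": "the cat",
  "transforms": null
}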
Response:
{
"model": "demo",
"context": "the cat",
"context_length": 7,
"matches": [
{"length": 7, "suffix": "the cat", "count": 3, "positions": [12, 45, 78]},
{"length": 3, "suffix": "cat", "count": 8, "positions": [12, 45, 78, 102, ...]}
],
"transforms": null
}
POST /v1/longest_suffix¶
Find the longest matching suffix.
Request Body:
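{
  "model": "demo",
  "context": "the cat sat on",
  "transforms": null
}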
Response:
{
"model": "demo",
"context": "the cat sat on",
"context_length": 14,
"match_position": 42,
"match_length": 10,
"matched_suffix": "at sat on",
"transforms": null
}
POST /v1/confidence¶
Get confidence score for a context.
Request Body:
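{
  "model": "demo",
  "context": "the cat",
  "transforms": null
}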
Response:
{
"model": "demo",
"context": "the cat",
"confidence": 0.78,
"match_length": 7,
"context_length": 7,
"transforms": null
}
GET /v1/transforms¶
List available transforms.
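Example:
curl http://localhost:8000/v1/transforms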
Response:
{
"transforms": ["lowercase", "uppercase", "casefold", "strip", "normalize_whitespace"],
"description": "Available runtime query transforms"
}
Advanced Features¶
Hierarchical Weighted Prediction¶
Use multiple suffix lengths with configurable weighting:
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "demo",
"prompt": [1, 2, 3],
"max_tokens": 5,
"weight_function": "exponential",
"min_length": 1,
"max_length": 10
}'
Available weight functions:
- linear: w(k) = k (default)
- quadratic: w(k) = k²
- exponential: w(k) = 2^k
- sigmoid: w(k) = 1 / (1 + exp(-k + 5))
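For intuition, here is a minimal sketch (plain Python, not the server's actual implementation) of how weighted prediction can combine the next-token distribution found at each matching suffix length k, with longer matches contributing more:
import math

# Hypothetical per-suffix-length next-token distributions (token ID -> probability).
# In the real server these come from corpus matches at each suffix length k.
dists = {
    1: {4: 0.5, 5: 0.5},
    2: {4: 0.8, 5: 0.2},
    3: {4: 1.0},
}

def weight(k, name="quadratic"):
    # Weight for suffix length k, matching the functions listed above
    return {
        "linear": k,
        "quadratic": k ** 2,
        "exponential": 2 ** k,
        "sigmoid": 1 / (1 + math.exp(-k + 5)),
    }[name]

# Mix the distributions: longer suffix matches are weighted more heavily
mixed = {}
for k, dist in dists.items():
    for token, p in dist.items():
        mixed[token] = mixed.get(token, 0.0) + weight(k) * p

# Renormalize to a probability distribution
total = sum(mixed.values())
mixed = {t: p / total for t, p in mixed.items()}
print(mixed)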
Metadata and Confidence¶
Every completion includes metadata about the match:
{
"metadata": {
"match_position": 42, // Position in corpus where match was found
"match_length": 5, // Length of longest matching suffix
"confidence": 0.78, // Confidence score (0-1)
"tokens": [4, 5, 6] // Raw token IDs generated
}
}
Integration Examples¶
Python Client¶
import requests
# Create completion
response = requests.post(
"http://localhost:8000/v1/completions",
json={
"model": "demo",
"prompt": [1, 2, 3],
"max_tokens": 10,
"top_k": 50
}
)
result = response.json()
print(f"Generated tokens: {result['choices'][0]['metadata']['tokens']}")
print(f"Confidence: {result['choices'][0]['metadata']['confidence']}")
LLM Probability Mixing¶
Use Infinigram to ground LLM predictions in a specific corpus:
import requests

# llm_api, parse_probs_from_logprobs, and sample are placeholders for your
# own LLM client and helper functions
llm_probs = llm_api.get_next_token_probs(context)

# Get Infinigram probabilities (request logprobs so they can be parsed out)
infinigram_response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "domain-corpus",
        "prompt": context,
        "max_tokens": 1,
        "logprobs": 50
    }
).json()
infinigram_probs = parse_probs_from_logprobs(infinigram_response)

# Mix the two distributions (the 0.7/0.3 weights are a tunable choice)
mixed_probs = 0.7 * llm_probs + 0.3 * infinigram_probs
next_token = sample(mixed_probs)
Error Handling¶
Model Not Found¶
HTTP Status: 404
Invalid Request¶
HTTP Status: 422
Note: String prompts are now fully supported and automatically converted to UTF-8 bytes.
Unknown Weight Function¶
{
"detail": "Unknown weight function 'invalid'. Available: ['linear', 'quadratic', 'exponential', 'sigmoid']"
}
Performance Characteristics¶
- Latency: <10ms for typical queries (100-token context)
- Throughput: 1000+ requests/second on single CPU
- Memory: O(corpus_size) per model
- Model loading: Instant (no training required)
Roadmap¶
Completed in v0.4.0:
- [x] String prompt support (auto UTF-8 conversion)
- [x] Introspection endpoints (predict, suffix_matches, confidence)
- [x] Backoff smoothing endpoint
- [x] Transforms listing endpoint

Planned enhancements:
- [ ] Streaming responses for long completions
- [ ] String tokenization (BPE/WordPiece support)
- [ ] Authentication and API keys
- [ ] Rate limiting
- [ ] Batch completion endpoint
- [ ] Model persistence to disk
- [ ] Prometheus metrics endpoint
- [ ] WebSocket support for real-time predictions