Trusted Pool Coordinator¶
A simple pool coordinator that routes jobs to allowlisted workers. Workers are trusted based on their Ed25519 public keys.
Architecture¶
Client → Pool Coordinator → Trusted Workers
                 ↓
        Allowlist (public keys)
        Health checking
        Load balancing
Trust Model¶
- Workers are allowlisted by their Ed25519 public keys
- No result verification needed (trusted execution)
- Health checking ensures worker availability
- Load balancing distributes jobs across available workers
This is simpler than the trustless pool because:

- No consensus needed
- No verification of results
- No economic incentives (stake/slash)
- Workers are pre-approved and trusted
Setup¶
1. Install Dependencies¶
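The coordinator is a Python script (coordinator.py, shown in the deployment example below). Assuming the example ships a requirements.txt alongside it (adjust to your checkout):

    pip install -r requirements.txt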
2. Configure Workers¶
Create workers.json with your trusted workers:
[
  {
    "worker_id": "base64-encoded-ed25519-public-key",
    "endpoint": "http://worker1.example.com:8443",
    "max_concurrent_jobs": 4
  },
  {
    "worker_id": "another-public-key-base64",
    "endpoint": "http://worker2.example.com:8443",
    "max_concurrent_jobs": 4
  }
]
To get a worker's public key (worker_id):
# On worker machine:
./sandrun --generate-key /etc/sandrun/worker.pem
# Output shows:
# ✅ Saved worker key to: /etc/sandrun/worker.pem
# Worker ID: <base64-encoded-public-key>
Add the worker ID to your workers.json allowlist.
3. Start Workers¶
On each worker machine:
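As in the deployment example at the end of this page:

    sudo ./sandrun --port 8443 --worker-key /etc/sandrun/worker.pem

The --worker-key flag gives the worker the identity that the coordinator checks against the allowlist.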
4. Start Pool Coordinator¶
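On the coordinator machine, point the script at your allowlist:

    python coordinator.py --port 9000 --workers workers.json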
Usage¶
Submit Job to Pool¶
Instead of submitting directly to a worker, submit to the pool coordinator:
curl -X POST http://pool.example.com:9000/submit \
-F "files=@project.tar.gz" \
-F 'manifest={"entrypoint":"main.py","interpreter":"python3"}'
Response (a sketch; the job_id format and initial "queued" status follow the status example and Job Flow section below):
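{
  "job_id": "pool-a1b2c3d4e5f6",
  "status": "queued"
}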
Check Job Status¶
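Poll the coordinator with the pool job ID (per GET /status/{job_id} in the API reference below):

    curl http://pool.example.com:9000/status/pool-a1b2c3d4e5f6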
Response:
{
  "job_id": "pool-a1b2c3d4e5f6",
  "pool_status": "running",
  "worker_id": "base64-worker-public-key",
  "worker_status": {
    "job_id": "remote-job-id-on-worker",
    "status": "running",
    "execution_metadata": {
      "cpu_seconds": 1.23,
      "memory_peak_bytes": 52428800
    }
  },
  "submitted_at": 1234567890.123,
  "completed_at": null
}
Download Output¶
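Fetch output files through the coordinator via GET /outputs/{job_id}/{path}. The stdout path here is illustrative; use whatever paths your job produced:

    curl -O http://pool.example.com:9000/outputs/pool-a1b2c3d4e5f6/stdout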
Check Pool Status¶
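Query the coordinator's /pool endpoint:

    curl http://pool.example.com:9000/pool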
Response:
{
  "total_workers": 3,
  "healthy_workers": 2,
  "total_jobs": 15,
  "queued_jobs": 2,
  "workers": [
    {
      "worker_id": "worker-1-public-key",
      "endpoint": "http://worker1.example.com:8443",
      "is_healthy": true,
      "active_jobs": 3,
      "max_concurrent_jobs": 4,
      "last_health_check": 1234567890.123
    }
  ]
}
API Endpoints¶
POST /submit¶
Submit a job to the pool.
Request:

- files: Tarball of project files (multipart/form-data)
- manifest: Job manifest JSON
Response (same sketch as under Submit Job to Pool above):
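{
  "job_id": "pool-xxx",
  "status": "queued"
}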
GET /status/{job_id}¶
Get job status.
Response:
{
  "job_id": "pool-xxx",
  "pool_status": "running",
  "worker_id": "worker-public-key",
  "worker_status": { ... },
  "submitted_at": 1234567890.123,
  "completed_at": null
}
GET /outputs/{job_id}/{path}¶
Download output file.
Response: Binary file content
GET /pool¶
Get pool status.
Response:
{
  "total_workers": 3,
  "healthy_workers": 2,
  "total_jobs": 10,
  "queued_jobs": 1,
  "workers": [ ... ]
}
How It Works¶
Job Flow¶
- Client submits job to pool coordinator
- Job enters queue with "queued" status
- Coordinator finds available worker (healthy, not overloaded)
- Job dispatched to worker via HTTP POST to worker's /submit endpoint
- Worker executes job in sandbox
- Client polls status via pool coordinator (proxied to worker)
- Client downloads outputs via pool coordinator (proxied from worker)
Health Checking¶
- Pool coordinator checks each worker every 30 seconds
- Health check: GET http://worker:8443/health
- Expected response: {"status":"healthy","worker_id":"..."}
- Unhealthy workers are excluded from routing
Load Balancing¶
- Jobs routed to the worker with the fewest active jobs (see the sketch after this list)
- Workers have a max_concurrent_jobs limit (default: 4)
- If no workers are available, the job waits in the queue
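A minimal sketch of that least-loaded selection, assuming workers are tracked as dicts with the is_healthy, active_jobs, and max_concurrent_jobs fields shown in the /pool response (the actual coordinator.py may differ):

    def pick_worker(workers):
        """Return the healthy, non-full worker with the fewest active jobs, or None."""
        candidates = [
            w for w in workers
            if w["is_healthy"] and w["active_jobs"] < w["max_concurrent_jobs"]
        ]
        if not candidates:
            return None  # no capacity anywhere: the job stays queued
        return min(candidates, key=lambda w: w["active_jobs"])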
Failure Handling¶
- If worker rejects job → job re-queued
- If worker fails health check → marked unhealthy, excluded from routing
- Jobs in progress on failed workers remain assigned (client can retry)
Differences from Trustless Pool¶
| Feature | Trusted Pool | Trustless Pool |
|---|---|---|
| Worker authorization | Allowlist (public keys) | Open (anyone can join) |
| Result verification | None (trust workers) | Hash comparison + consensus |
| Economic model | None | Stake + slashing |
| Complexity | Simple (~200 lines) | Complex (~1000+ lines) |
| Use case | Private cluster, known workers | Public compute, anonymous workers |
Security Considerations¶
Worker Authentication¶
Workers must be started with --worker-key to have an identity. The pool coordinator verifies worker identity during health checks:
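A sketch of that check, assuming the coordinator simply compares the worker_id reported by /health against the allowlist entry (the exact logic in coordinator.py may differ):

    import requests

    def verify_worker(endpoint, expected_worker_id):
        """Health-check a worker and confirm it reports the allowlisted identity."""
        resp = requests.get(f"{endpoint}/health", timeout=5)
        resp.raise_for_status()
        body = resp.json()
        # The health response carries the worker's Ed25519 public key as worker_id;
        # a mismatch means some other server is answering on that endpoint.
        return body.get("status") == "healthy" and body.get("worker_id") == expected_worker_id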
This prevents:

- Impersonation: Rogue server can't pretend to be allowlisted worker
- Unauthorized workers: Only allowlisted workers receive jobs
Network Security¶
Since this is a trusted pool, you should:
- Use private network or VPN for worker communication
- Enable TLS on workers (add HTTPS support)
- Firewall workers to only accept connections from the coordinator IP (see the example after this list)
- Restrict pool coordinator to authorized clients
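For the firewall rule, a ufw sketch (the coordinator IP 192.168.1.100 is hypothetical):

    # On each worker machine: only the coordinator may reach the sandrun port
    sudo ufw allow from 192.168.1.100 to any port 8443 proto tcp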
Resource Limits¶
Workers enforce their own resource limits (as configured in sandrun). The pool coordinator adds:

- max_concurrent_jobs: Prevent worker overload
- Job queueing: Prevent coordinator overload
- Health checks: Detect and exclude failed workers
Monitoring¶
Logs¶
The coordinator logs:

- Job submissions and dispatching
- Worker health status changes
- Errors and warnings
Example:
INFO:__main__:Added trusted worker: a1b2c3d4e5f6... at http://worker1:8443
INFO:__main__:Queued job pool-abc123
INFO:__main__:Dispatched job pool-abc123 to a1b2c3d4e5f6... (remote: job-xyz789)
WARNING:__main__:Health check failed for b2c3d4e5f6g7...: Connection refused
Metrics¶
Check the /pool endpoint for real-time metrics:

- Total workers and healthy count
- Total jobs and queue depth
- Per-worker active job count
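For a quick look from the shell (assumes jq is installed):

    curl -s http://pool.example.com:9000/pool | jq '{healthy_workers, queued_jobs}'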
Future Enhancements¶
Potential improvements for production use:
- Persistent storage for job history (currently in-memory)
- Worker capacity discovery (auto-detect max_concurrent_jobs)
- Job priority queues (high/low priority jobs)
- Authentication for clients (API keys, OAuth)
- TLS support for encrypted communication
- Metrics export (Prometheus, Grafana)
- Job cancellation (cancel in-progress jobs)
- Worker drain mode (stop accepting new jobs for maintenance)
Troubleshooting¶
No workers available¶
Causes:

- All workers unhealthy (check worker logs)
- All workers at max capacity (check /pool endpoint)
- Workers not started with --worker-key

Solution:

- Start more workers
- Increase max_concurrent_jobs per worker
- Check worker health endpoints directly
Jobs stuck in "queued" status¶
Causes:

- No healthy workers available
- Worker endpoints incorrect in workers.json

Solution:

- Check /pool endpoint for worker health status
- Verify worker endpoints are reachable
- Check worker logs for errors
Worker rejected job¶
Causes:

- Invalid manifest format
- Files too large for worker
- Worker resource limits exceeded

Solution:

- Check worker logs for specific error
- Verify manifest is valid JSON
- Reduce job size or increase worker limits
Example Deployment¶
3-Worker Pool¶
# Worker 1
sudo ./sandrun --port 8443 --worker-key /etc/sandrun/worker1.pem
# Worker 2
sudo ./sandrun --port 8443 --worker-key /etc/sandrun/worker2.pem
# Worker 3
sudo ./sandrun --port 8443 --worker-key /etc/sandrun/worker3.pem
# Coordinator
python coordinator.py --port 9000 --workers workers.json
workers.json:
[
  {
    "worker_id": "worker1-public-key-from-generate-key",
    "endpoint": "http://192.168.1.101:8443",
    "max_concurrent_jobs": 4
  },
  {
    "worker_id": "worker2-public-key-from-generate-key",
    "endpoint": "http://192.168.1.102:8443",
    "max_concurrent_jobs": 4
  },
  {
    "worker_id": "worker3-public-key-from-generate-key",
    "endpoint": "http://192.168.1.103:8443",
    "max_concurrent_jobs": 4
  }
]
Now you can submit jobs to the pool at http://coordinator-ip:9000/submit and they will be automatically distributed across the 3 workers!