Trusted Pool Coordinator¶
A simple pool coordinator that routes jobs to allowlisted workers. Workers are trusted based on their Ed25519 public keys.
Architecture¶
Client → Pool Coordinator → Trusted Workers
                 ↓
        Allowlist (public keys)
        Health checking
        Load balancing
Trust Model¶
- Workers are allowlisted by their Ed25519 public keys
- No result verification needed (trusted execution)
- Health checking ensures worker availability
- Load balancing distributes jobs across available workers
This is simpler than the trustless pool because:

- No consensus needed
- No verification of results
- No economic incentives (stake/slash)
- Workers are pre-approved and trusted
Setup¶
1. Install Dependencies¶
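The coordinator is a Python script (coordinator.py, shown in the deployment example below). Assuming the example ships a requirements.txt alongside it (adjust to your checkout):

    pip install -r requirements.txt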
2. Configure Workers¶
Create workers.json with your trusted workers:
[
  {
    "worker_id": "base64-encoded-ed25519-public-key",
    "endpoint": "http://worker1.example.com:8443",
    "max_concurrent_jobs": 4
  },
  {
    "worker_id": "another-public-key-base64",
    "endpoint": "http://worker2.example.com:8443",
    "max_concurrent_jobs": 4
  }
]
To get a worker's public key (worker_id):
# On worker machine:
./sandrun --generate-key /etc/sandrun/worker.pem
# Output shows:
# ✅ Saved worker key to: /etc/sandrun/worker.pem
# Worker ID: <base64-encoded-public-key>
Add the worker ID to your workers.json allowlist.
3. Start Workers¶
On each worker machine:
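As in the deployment example at the end of this page:

    sudo ./sandrun --port 8443 --worker-key /etc/sandrun/worker.pem

The --worker-key flag gives the worker the identity that the coordinator checks against the allowlist.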
4. Start Pool Coordinator¶
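On the coordinator machine, point the script at your allowlist:

    python coordinator.py --port 9000 --workers workers.json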
Usage¶
Submit Job to Pool¶
Instead of submitting directly to a worker, submit to the pool coordinator:
curl -X POST http://pool.example.com:9000/submit \
-F "files=@project.tar.gz" \
-F 'manifest={"entrypoint":"main.py","interpreter":"python3"}'
Response (a sketch; the job_id format and initial "queued" status follow the status example and Job Flow section below):
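{
  "job_id": "pool-a1b2c3d4e5f6",
  "status": "queued"
}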
Check Job Status¶
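Poll the coordinator with the pool job ID (per GET /status/{job_id} in the API reference below):

    curl http://pool.example.com:9000/status/pool-a1b2c3d4e5f6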
Response:
{
  "job_id": "pool-a1b2c3d4e5f6",
  "pool_status": "running",
  "worker_id": "base64-worker-public-key",
  "worker_status": {
    "job_id": "remote-job-id-on-worker",
    "status": "running",
    "execution_metadata": {
      "cpu_seconds": 1.23,
      "memory_peak_bytes": 52428800
    }
  },
  "submitted_at": 1234567890.123,
  "completed_at": null
}
Download Output¶
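Fetch output files through the coordinator via GET /outputs/{job_id}/{path}. The stdout path here is illustrative; use whatever paths your job produced:

    curl -O http://pool.example.com:9000/outputs/pool-a1b2c3d4e5f6/stdout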
Check Pool Status¶
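Query the coordinator's /pool endpoint:

    curl http://pool.example.com:9000/pool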
Response:
{
  "total_workers": 3,
  "healthy_workers": 2,
  "total_jobs": 15,
  "queued_jobs": 2,
  "workers": [
    {
      "worker_id": "worker-1-public-key",
      "endpoint": "http://worker1.example.com:8443",
      "is_healthy": true,
      "active_jobs": 3,
      "max_concurrent_jobs": 4,
      "last_health_check": 1234567890.123
    }
  ]
}
API Endpoints¶
POST /submit¶
Submit a job to the pool.
Request:

- files: Tarball of project files (multipart/form-data)
- manifest: Job manifest JSON
Response (same sketch as under Submit Job to Pool above):
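{
  "job_id": "pool-xxx",
  "status": "queued"
}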
GET /status/{job_id}¶
Get job status.
Response:
{
  "job_id": "pool-xxx",
  "pool_status": "running",
  "worker_id": "worker-public-key",
  "worker_status": { ... },
  "submitted_at": 1234567890.123,
  "completed_at": null
}
GET /outputs/{job_id}/{path}¶
Download output file.
Response: Binary file content
GET /pool¶
Get pool status.
Response:
{
  "total_workers": 3,
  "healthy_workers": 2,
  "total_jobs": 10,
  "queued_jobs": 1,
  "workers": [ ... ]
}
How It Works¶
Job Flow¶
- Client submits job to pool coordinator
- Job enters queue with "queued" status
- Coordinator finds available worker (healthy, not overloaded)
- Job dispatched to worker via HTTP POST to worker's /submit endpoint
- Worker executes job in sandbox
- Client polls status via pool coordinator (proxied to worker)
- Client downloads outputs via pool coordinator (proxied from worker)
Health Checking¶
- Pool coordinator checks each worker every 30 seconds
- Health check: GET http://worker:8443/health
- Expected response: {"status":"healthy","worker_id":"..."}
- Unhealthy workers are excluded from routing
Load Balancing¶
- Jobs routed to the worker with the fewest active jobs (see the sketch after this list)
- Workers have a max_concurrent_jobs limit (default: 4)
- If no workers are available, the job waits in the queue
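A minimal sketch of that least-loaded selection, assuming workers are tracked as dicts with the is_healthy, active_jobs, and max_concurrent_jobs fields shown in the /pool response (the actual coordinator.py may differ):

    def pick_worker(workers):
        """Return the healthy, non-full worker with the fewest active jobs, or None."""
        candidates = [
            w for w in workers
            if w["is_healthy"] and w["active_jobs"] < w["max_concurrent_jobs"]
        ]
        if not candidates:
            return None  # no capacity anywhere: the job stays queued
        return min(candidates, key=lambda w: w["active_jobs"])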
Failure Handling¶
- If worker rejects job → job re-queued
- If worker fails health check → marked unhealthy, excluded from routing
- Jobs in progress on failed workers remain assigned (client can retry)
Differences from Trustless Pool¶
| Feature | Trusted Pool | Trustless Pool |
|---|---|---|
| Worker authorization | Allowlist (public keys) | Open (anyone can join) |
| Result verification | None (trust workers) | Hash comparison + consensus |
| Economic model | None | Stake + slashing |
| Complexity | Simple (~200 lines) | Complex (~1000+ lines) |
| Use case | Private cluster, known workers | Public compute, anonymous workers |
Security Considerations¶
Worker Authentication¶
Workers must be started with --worker-key to have an identity. The pool coordinator verifies worker identity during health checks:
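A sketch of that check, assuming the coordinator simply compares the worker_id reported by /health against the allowlist entry (the exact logic in coordinator.py may differ):

    import requests

    def verify_worker(endpoint, expected_worker_id):
        """Health-check a worker and confirm it reports the allowlisted identity."""
        resp = requests.get(f"{endpoint}/health", timeout=5)
        resp.raise_for_status()
        body = resp.json()
        # The health response carries the worker's Ed25519 public key as worker_id;
        # a mismatch means some other server is answering on that endpoint.
        return body.get("status") == "healthy" and body.get("worker_id") == expected_worker_id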
This prevents:

- Impersonation: Rogue server can't pretend to be allowlisted worker
- Unauthorized workers: Only allowlisted workers receive jobs
Network Security¶
Since this is a trusted pool, you should:
- Use private network or VPN for worker communication
- Enable TLS on workers (add HTTPS support)
- Firewall workers to only accept connections from the coordinator IP (see the example after this list)
- Restrict pool coordinator to authorized clients
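For the firewall rule, a ufw sketch (the coordinator IP 192.168.1.100 is hypothetical):

    # On each worker machine: only the coordinator may reach the sandrun port
    sudo ufw allow from 192.168.1.100 to any port 8443 proto tcp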
Resource Limits¶
Workers enforce their own resource limits (as configured in sandrun). The pool coordinator adds:

- max_concurrent_jobs: Prevent worker overload
- Job queueing: Prevent coordinator overload
- Health checks: Detect and exclude failed workers
Monitoring¶
Logs¶
The coordinator logs:

- Job submissions and dispatching
- Worker health status changes
- Errors and warnings
Example:
INFO:__main__:Added trusted worker: a1b2c3d4e5f6... at http://worker1:8443
INFO:__main__:Queued job pool-abc123
INFO:__main__:Dispatched job pool-abc123 to a1b2c3d4e5f6... (remote: job-xyz789)
WARNING:__main__:Health check failed for b2c3d4e5f6g7...: Connection refused
Metrics¶
Check the /pool endpoint for real-time metrics:

- Total workers and healthy count
- Total jobs and queue depth
- Per-worker active job count
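For a quick look from the shell (assumes jq is installed):

    curl -s http://pool.example.com:9000/pool | jq '{healthy_workers, queued_jobs}'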
Future Enhancements¶
Potential improvements for production use:
- Persistent storage for job history (currently in-memory)
- Worker capacity discovery (auto-detect max_concurrent_jobs)
- Job priority queues (high/low priority jobs)
- Authentication for clients (API keys, OAuth)
- TLS support for encrypted communication
- Metrics export (Prometheus, Grafana)
- Job cancellation (cancel in-progress jobs)
- Worker drain mode (stop accepting new jobs for maintenance)
Troubleshooting¶
No workers available¶
Causes:

- All workers unhealthy (check worker logs)
- All workers at max capacity (check /pool endpoint)
- Workers not started with --worker-key

Solution:

- Start more workers
- Increase max_concurrent_jobs per worker
- Check worker health endpoints directly
Jobs stuck in "queued" status¶
Causes:

- No healthy workers available
- Worker endpoints incorrect in workers.json

Solution:

- Check /pool endpoint for worker health status
- Verify worker endpoints are reachable
- Check worker logs for errors
Worker rejected job¶
Causes:

- Invalid manifest format
- Files too large for worker
- Worker resource limits exceeded

Solution:

- Check worker logs for specific error
- Verify manifest is valid JSON
- Reduce job size or increase worker limits
Example Deployment¶
3-Worker Pool¶
# Worker 1
sudo ./sandrun --port 8443 --worker-key /etc/sandrun/worker1.pem
# Worker 2
sudo ./sandrun --port 8443 --worker-key /etc/sandrun/worker2.pem
# Worker 3
sudo ./sandrun --port 8443 --worker-key /etc/sandrun/worker3.pem
# Coordinator
python coordinator.py --port 9000 --workers workers.json
workers.json:
[
  {
    "worker_id": "worker1-public-key-from-generate-key",
    "endpoint": "http://192.168.1.101:8443",
    "max_concurrent_jobs": 4
  },
  {
    "worker_id": "worker2-public-key-from-generate-key",
    "endpoint": "http://192.168.1.102:8443",
    "max_concurrent_jobs": 4
  },
  {
    "worker_id": "worker3-public-key-from-generate-key",
    "endpoint": "http://192.168.1.103:8443",
    "max_concurrent_jobs": 4
  }
]
Now you can submit jobs to the pool at http://coordinator-ip:9000/submit and they will be automatically distributed across the 3 workers!