Troubleshooting Guide¶
Solutions to common issues and error messages.
Quick Diagnosis¶
Start here to quickly identify your issue:
# Check if Sandrun is running
ps aux | grep sandrun
# Check if port is accessible
curl http://localhost:8443/
# Check system resources
free -h
df -h
# Check kernel support
grep CONFIG_SECCOMP= /boot/config-$(uname -r) # Should print CONFIG_SECCOMP=y
ls /proc/self/ns/ # Should list namespaces
# Check recent logs
journalctl -u sandrun -n 50
Installation Issues¶
CMake Cannot Find Dependencies¶
Symptom:
Solution:
Build Fails with Compiler Errors¶
Symptom:
Solution:
# Check GCC version (need 7.0+)
gcc --version
# Update if needed (Ubuntu)
sudo apt-get install g++-9
export CXX=g++-9
# Clean and rebuild
rm -rf build
cmake -B build -DCMAKE_CXX_COMPILER=g++-9
cmake --build build
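The version check above can be scripted. The following is a minimal sketch, assuming GNU `sort -V` is available; `version_ge` is a hypothetical helper name, not part of Sandrun, and the same helper works for the CMake and kernel minimums mentioned elsewhere in this guide.

```shell
# version_ge A B: succeed if version A is greater than or equal to version B.
# Relies on `sort -V` (version sort) so that 10.2 sorts after 9.9.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example: compare the installed GCC against the 7.0 minimum.
# Passing both -dump flags makes newer GCC releases print the full version.
if command -v gcc >/dev/null 2>&1; then
    gcc_version="$(gcc -dumpfullversion -dumpversion)"
    if version_ge "$gcc_version" 7.0; then
        echo "GCC $gcc_version is new enough"
    else
        echo "GCC $gcc_version is too old (need 7.0+)"
    fi
fi
```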
CMake Version Too Old¶
Symptom:
Solution:
# Install latest CMake
wget https://github.com/Kitware/CMake/releases/download/v3.25.0/cmake-3.25.0-linux-x86_64.sh
chmod +x cmake-3.25.0-linux-x86_64.sh
sudo ./cmake-3.25.0-linux-x86_64.sh --prefix=/usr/local --skip-license
# Verify
cmake --version
Server Startup Issues¶
Permission Denied Creating Namespace¶
Symptom:
Solutions:
Port Already in Use¶
Symptom:
Solutions:
# Option 1: Find what's using the port
sudo lsof -i :8443
sudo netstat -tulpn | grep 8443
# Option 2: Kill the process
sudo kill <PID>
# Option 3: Use different port
sudo ./build/sandrun --port 9000
# Option 4: Stop existing Sandrun instance
sudo systemctl stop sandrun
# or
sudo pkill -f sandrun
Seccomp Not Supported¶
Symptom:
Solution:
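# Check seccomp support in the running kernel's build config
grep CONFIG_SECCOMP /boot/config-$(uname -r)
# Should include: CONFIG_SECCOMP=y and CONFIG_SECCOMP_FILTER=y
# If missing, rebuild the kernel with CONFIG_SECCOMP=y
# Or use a distribution with seccomp enabled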
# Check kernel version (need 4.6+)
uname -r
# Update kernel if needed
sudo apt-get install linux-generic-hwe-20.04
sudo reboot
Cannot Open Worker Key File¶
Symptom:
Solution:
# Generate new worker key
sudo mkdir -p /etc/sandrun
sudo ./build/sandrun --generate-key /etc/sandrun/worker.pem
# Or skip worker key if not using pools
sudo ./build/sandrun --port 8443 # No --worker-key flag
Job Submission Issues¶
Invalid Manifest¶
Symptom:
Solution:
# Validate JSON syntax
echo '{"entrypoint":"main.py"}' | jq .
# Minimum valid manifest
cat > manifest.json <<EOF
{
"entrypoint": "main.py",
"interpreter": "python3"
}
EOF
# Submit with manifest
curl -X POST http://localhost:8443/submit \
-F "files=@job.tar.gz" \
-F "manifest=$(cat manifest.json)"
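A manifest can be checked before submission so that malformed JSON is caught locally. This is a sketch only: `validate_manifest` is a hypothetical helper, and the rule that `entrypoint` must be a non-empty string is an assumption based on the minimum manifest above, not a documented schema.

```shell
# validate_manifest FILE: succeed if FILE parses as JSON and has a
# non-empty string "entrypoint" (assumed requirement, see lead-in).
validate_manifest() {
    jq -e '(.entrypoint | type) == "string" and (.entrypoint | length) > 0' \
        "$1" >/dev/null 2>&1
}

# Example usage:
# validate_manifest manifest.json || echo "manifest.json is invalid"
```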
Tarball Too Large¶
Symptom:
Solutions:
# Check tarball size
ls -lh job.tar.gz
# Compress better
tar cf - my_project/ | gzip --best > job.tar.gz
# Remove unnecessary files
tar czf job.tar.gz \
--exclude='*.pyc' \
--exclude='__pycache__' \
--exclude='.git' \
--exclude='node_modules' \
my_project/
# Split into multiple jobs if needed
Files Missing in Tarball¶
Symptom: Job fails with FileNotFoundError: main.py
Solution:
# List tarball contents
tar -tzf job.tar.gz
# Ensure entrypoint is included
tar -tzf job.tar.gz | grep main.py
# Create tarball correctly
cd project_directory
tar czf ../job.tar.gz .
# Not: tar czf job.tar.gz project_directory
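The difference between the two tarball layouts can be checked automatically. The sketch below uses a hypothetical helper, `entrypoint_at_root`, that succeeds only when the entrypoint sits at the top level of the archive (listed as `main.py` or `./main.py`) rather than inside a subdirectory.

```shell
# entrypoint_at_root ARCHIVE ENTRYPOINT
# Succeeds if ENTRYPOINT appears at the archive root, with or without
# the leading "./" that `tar czf ../job.tar.gz .` produces.
entrypoint_at_root() {
    tar -tzf "$1" | grep -qxF -e "$2" -e "./$2"
}

# Example usage:
# entrypoint_at_root job.tar.gz main.py || echo "main.py is not at the archive root"
```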
Rate Limit Exceeded¶
Symptom:
{
"error": "Rate limit exceeded",
"reason": "CPU quota exhausted (10.2/10.0 seconds used)",
"retry_after": 45
}
Solutions:
# Check your quota
curl http://localhost:8443/stats
# Wait for quota to reset
sleep 60
# Optimize code to use less CPU
# Split long jobs into smaller chunks
# Use more efficient algorithms
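Rather than sleeping for a fixed 60 seconds, a client can read the `retry_after` value from the error response shown above. This is a minimal sketch: `retry_after_seconds` is a hypothetical helper, and the 60-second fallback is an assumption for responses that omit the field.

```shell
# retry_after_seconds: read a JSON error response on stdin and print the
# number of seconds to wait. Falls back to 60 if the field is missing
# or the input is not valid JSON.
retry_after_seconds() {
    jq -r '.retry_after // 60' 2>/dev/null || echo 60
}

# Example usage (the response variable holds the server's JSON reply):
# sleep "$(printf '%s' "$response" | retry_after_seconds)"
```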
Job Execution Issues¶
Job Failed with Exit Code 1¶
Symptom:
Diagnosis:
# Get full logs
curl http://localhost:8443/logs/job-abc123
# Common causes:
# - Syntax error in code
# - Missing dependency
# - File not found
# - Permission error
Solutions:
# Test locally first
cd project_directory
python3 main.py # Test before submitting
# Add debugging
cat > main.py <<EOF
import os
import sys
print("Python version:", sys.version)
print("Working directory:", os.getcwd())
print("Files:", os.listdir('.'))
# ... your code ...
EOF
Job Killed (Exit Code 137)¶
Symptom:
Causes:
- Out of memory (exceeded 512MB limit)
- Timeout (exceeded 5 minute limit)
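Exit code 137 follows the common shell convention of 128 plus the signal number, so 137 means signal 9 (SIGKILL), which is what both an out-of-memory kill and a forced timeout typically deliver. The sketch below decodes any exit code this way; `describe_exit` is a hypothetical helper, not a Sandrun command.

```shell
# describe_exit CODE: print a human-readable description of an exit code.
# Codes above 128 are interpreted as "terminated by signal (CODE - 128)".
describe_exit() {
    code="$1"
    if [ "$code" -eq 0 ]; then
        echo "success"
    elif [ "$code" -gt 128 ]; then
        sig=$((code - 128))
        echo "killed by signal $sig ($(kill -l "$sig"))"
    else
        echo "exited with status $code"
    fi
}

# Example usage:
# describe_exit 137
```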
Solutions:
# Increase memory limit in manifest
cat > manifest.json <<EOF
{
"entrypoint": "main.py",
"memory_mb": 1024
}
EOF
# Increase timeout
cat > manifest.json <<EOF
{
"entrypoint": "main.py",
"timeout": 600
}
EOF
# Optimize memory usage
# - Use generators instead of lists
# - Process data in chunks
# - Delete large objects when done
Permission Denied Inside Sandbox¶
Symptom:
Explanation:
This is expected behavior. The sandbox restricts access to:
- Host filesystem (only job directory accessible)
- Network (completely blocked)
- System files (/etc, /proc, /sys are read-only)
Solutions:
# Copy needed files into job directory
cp /etc/hosts my_project/hosts
tar czf job.tar.gz my_project/
# In your code, use relative paths
with open('hosts', 'r') as f: # Not /etc/hosts
data = f.read()
Import Errors (Missing Dependencies)¶
Symptom:
Solutions:
Job Stuck in "queued" Status¶
Symptom: Job never starts executing.
Diagnosis:
# Check system stats
curl http://localhost:8443/stats
# Response shows:
# "queue_length": 10 # Many queued jobs
# "active_jobs": 2 # System busy
Causes:
- Too many concurrent jobs (limit of 2 per IP)
- System overloaded
- No available workers (if using pool)
Solutions:
# Wait for active jobs to complete
# Or cancel queued jobs if needed
# Check worker health (pool deployments)
curl http://pool:9000/pool
Cannot Download Output Files¶
Symptom:
Causes:
- Job auto-deleted after 1 hour
- Output already downloaded (files are deleted immediately after download)
- Job failed (failed jobs are deleted after 5 minutes)
Solutions:
# Download immediately after completion
# Check status first
STATUS=$(curl -s http://localhost:8443/status/job-abc123 | jq -r '.status')
if [ "$STATUS" = "completed" ]; then
curl http://localhost:8443/download/job-abc123/output.txt -o output.txt
fi
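Because outputs are deleted soon after completion, a polling loop helps fetch them promptly. The sketch below assumes the `/status` endpoint and `.status` field used above; the `failed` status string, the helper names `fetch_status` and `wait_for_job`, and the default interval and retry count are assumptions for illustration.

```shell
# fetch_status JOB_ID: print the job's current status string.
fetch_status() {
    curl -s "http://localhost:8443/status/$1" | jq -r '.status'
}

# wait_for_job JOB_ID [INTERVAL_SECONDS] [MAX_TRIES]
# Returns 0 when the job completes, 1 if it fails, 2 if polling gives up.
wait_for_job() {
    job_id="$1"
    interval="${2:-2}"
    max_tries="${3:-150}"
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        status="$(fetch_status "$job_id")"
        case "$status" in
            completed) return 0 ;;
            failed)    return 1 ;;
        esac
        tries=$((tries + 1))
        sleep "$interval"
    done
    return 2
}

# Example usage:
# wait_for_job job-abc123 && \
#     curl http://localhost:8443/download/job-abc123/output.txt -o output.txt
```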
# Use WebSocket streaming to monitor completion
Pool Coordinator Issues¶
No Available Workers¶
Symptom:
Diagnosis:
Causes:
- All workers offline
- Workers failed health check
- Workers at max capacity
Solutions:
# Check worker health directly
curl http://worker1:8443/health
# Start more workers
sudo ./build/sandrun --port 8443 --worker-key /etc/sandrun/worker.pem
# Check worker logs
journalctl -u sandrun -f
# Verify workers.json configuration
cat workers.json
# Ensure worker IDs and endpoints are correct
Worker Authentication Failed¶
Symptom:
Solution:
# Regenerate worker key
sudo ./build/sandrun --generate-key /etc/sandrun/worker.pem
# Copy output Worker ID
# Update workers.json with new worker_id
# Restart worker
sudo systemctl restart sandrun
Jobs Stuck in Pool Queue¶
Symptom: Jobs never dispatched to workers.
Diagnosis:
# Check pool logs
journalctl -u pool-coordinator -f
# Check worker endpoints
for worker in worker1 worker2 worker3; do
echo "Testing $worker:"
curl http://$worker:8443/health
done
Solutions:
# Verify network connectivity
ping worker1
telnet worker1 8443
# Check firewall rules
sudo iptables -L -n
# Ensure workers are reachable from coordinator
# Update workers.json with correct endpoints
Performance Issues¶
Slow Job Execution¶
Diagnosis:
# Check system load
uptime
top
# Check I/O wait
iostat -x 1
# Check memory pressure
free -h
vmstat 1
Solutions:
# Increase system resources
# Add more RAM
# Use faster CPU
# Add more workers for horizontal scaling
# Optimize job code
# Use compiled languages for CPU-intensive tasks
# Minimize disk I/O
# Use efficient algorithms
High Memory Usage¶
Diagnosis:
# Check memory usage
free -h
# Check tmpfs usage
df -h /dev/shm
mount | grep tmpfs
# Check per-process memory
ps aux --sort=-%mem | head -20
Solutions:
# Increase system RAM
# Reduce concurrent job limit
# Reduce per-job memory limit
# Optimize job code for memory efficiency
# Clean up old jobs manually if needed
sudo systemctl restart sandrun
Debugging Tools¶
Enable Debug Logging¶
# Build with debug symbols
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build
# Run with verbose logging
sudo ./build/sandrun --port 8443 --verbose
# Or set environment variable
export SANDRUN_LOG_LEVEL=debug
sudo -E ./build/sandrun --port 8443
Use GDB for Crashes¶
# Build with debug symbols
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build
# Run under GDB
sudo gdb ./build/sandrun
(gdb) run --port 8443
# Wait for crash
(gdb) backtrace
(gdb) info locals
Memory Leak Detection¶
# Run with Valgrind
sudo valgrind \
--leak-check=full \
--show-leak-kinds=all \
--track-origins=yes \
--verbose \
--log-file=valgrind.log \
./build/sandrun --port 8443
# Analyze results
cat valgrind.log
Network Debugging¶
# Monitor HTTP traffic
sudo tcpdump -i any -nn -A port 8443
# Or use Wireshark
sudo wireshark
# Test with verbose curl
curl -v http://localhost:8443/
Getting Help¶
If you're still stuck:
- Check logs:
- Gather system info:
- Search existing issues: GitHub Issues
- Ask for help:
    - GitHub Discussions
    - Include: OS, kernel version, error messages, logs
- Report bugs:
    - File an issue
    - Include: steps to reproduce, expected vs actual behavior
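The details requested above (OS, kernel version, logs) can be gathered in one step. This is a sketch under stated assumptions: `collect_info` is a hypothetical helper, and the `sandrun` unit name is taken from the `journalctl` commands used earlier in this guide.

```shell
# collect_info: print kernel, OS, and recent Sandrun log lines in a form
# that can be pasted into an issue or discussion.
collect_info() {
    echo "== Kernel =="
    uname -srm
    echo "== OS =="
    if [ -r /etc/os-release ]; then
        grep -E '^(NAME|VERSION)=' /etc/os-release
    else
        echo "unknown"
    fi
    echo "== Recent logs =="
    journalctl -u sandrun -n 50 --no-pager 2>/dev/null \
        || echo "(journalctl unavailable)"
}

# Example usage:
# collect_info > sandrun-report.txt
```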
Still need help? Open an issue →