✓ Verified
💻 Development
✓ Enhanced Data
Homelab Cluster
Manage multi-tier AI inference clusters for homelabs.
- Rating: 4.7 (65 reviews)
- Downloads: 818
- Version: 1.0.0
Overview
Manage multi-tier AI inference clusters for homelabs.
Complete Documentation

```yaml
name: homelab-cluster
description: |
  Manage multi-tier AI inference clusters for homelabs. Health monitoring,
  expert MoE routing, automatic node recovery, and model deployment across
  Ollama and llama.cpp nodes. Covers GPU memory planning, Docker volume
  strategies for large models, sequential startup patterns to avoid CUDA
  deadlocks, and unified API gateways via LiteLLM.
version: "1.0.0"
license: MIT
metadata:
  author: mlesnews
  org: Lumina Homelab
  domain: luminahomelab.ai
  emoji: "🏠"
tags:
  - homelab
  - infrastructure
  - llm
  - gpu
  - monitoring
  - ollama
  - llama-cpp
  - compute-cluster
  - litellm
  - docker
```
Homelab Cluster Management
Manage a compound AI compute cluster spanning multiple tiers of GPU and CPU inference nodes. Built and battle-tested by Lumina Homelab.

When to Use

Use this skill when your agent needs to:
- Monitor health of distributed model endpoints
- Route inference requests to the best available model
- Recover downed nodes automatically
- Plan GPU memory allocation across models
- Deploy models across heterogeneous hardware
Architecture Pattern
A homelab cluster typically spans 2-3 tiers:

| Tier | Typical Hardware | Runtime | Role |
|------|------------------|---------|------|
| Local | Primary GPU (RTX 4090/5090) | Ollama | Fast inference, embeddings |
| Remote | Secondary GPU (RTX 3090/4090) | llama.cpp or Ollama | Distributed inference |
| NAS/CPU | Synology, RPi, any CPU node | Ollama | Lightweight models, fallback |

A LiteLLM proxy sits in front, providing a unified OpenAI-compatible API across all tiers.

Health Monitoring
Check all endpoints with configurable per-endpoint timeouts:

```python
import requests

# Define endpoints with tier labels
ENDPOINTS = {
    "local/ollama": {"url": "http://localhost:11434/api/tags", "tier": "LOCAL"},
    "remote/mark-i": {"url": "http://REMOTE_IP:3009/v1/models", "tier": "REMOTE", "timeout": 8},
    "gateway/litellm": {"url": "http://localhost:8080/health/liveliness", "tier": "GATEWAY"},
}

# For each endpoint: GET with timeout, check HTTP 200, classify
# HEALTHY / DEGRADED / DOWN, then judge overall prognosis per tier.
for name, ep in ENDPOINTS.items():
    try:
        ok = requests.get(ep["url"], timeout=ep.get("timeout", 5)).status_code == 200
        status = "HEALTHY" if ok else "DEGRADED"
    except requests.RequestException:
        status = "DOWN"
    print(f"[{ep['tier']:>7}] {name}: {status}")
```
Key lesson: Use `/health/liveliness` for LiteLLM, not `/health` — the latter probes all model routes and hangs if any are unreachable.
Expert MoE Routing
Route requests to the optimal model based on task classification:
```
Task Categories:
  code   → Coder model (Qwen2.5-Coder-7B or similar)
  reason → Reasoning model (DeepSeek-R1-Distill or similar)
  chat   → General model (Qwen2.5-14B or similar)
  vision → Vision model (Qwen2.5-VL or similar)
  fast   → Smallest available model for quick responses
  embed  → Embedding model (nomic-embed-text or similar)

Router logic:
  1. Classify task from prompt
  2. Check health of preferred model
  3. Fall back to next-best if unavailable
  4. Return model endpoint + metadata
```
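A minimal router sketch in Python covering a subset of the categories above, assuming a naive keyword classifier and a set of currently-healthy model names; the classifier rules and preference lists are illustrative, not part of the skill:

```python
# Fallback router sketch. Classification is a rough keyword match;
# model names mirror the examples above and are placeholders.
PREFERENCES = {
    "code":   ["qwen2.5-coder:7b", "qwen2.5:14b"],
    "reason": ["deepseek-r1-distill-qwen:8b", "qwen2.5:14b"],
    "chat":   ["qwen2.5:14b", "qwen2.5:1.5b"],
    "embed":  ["nomic-embed-text"],
}

def classify(prompt: str) -> str:
    """Rough task classifier; replace with your own heuristics or a small LLM."""
    lowered = prompt.lower()
    if any(k in lowered for k in ("def ", "function", "compile", "stack trace")):
        return "code"
    if any(k in lowered for k in ("prove", "step by step", "reason")):
        return "reason"
    return "chat"

def route(prompt: str, healthy: set[str]) -> str:
    """Return the first healthy model in preference order, falling back as needed."""
    for model in PREFERENCES[classify(prompt)]:
        if model in healthy:
            return model
    raise RuntimeError("No healthy model available for this task category")
```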
Docker Deployment (llama.cpp on Remote Nodes)
Critical: Use Docker Volumes, Not Bind Mounts
For models larger than ~1.5GB on Windows Docker hosts:
```bash
# Create a Docker volume for model storage
docker volume create models-vol

# Copy models INTO the volume
docker run --rm -v models-vol:/models -v /host/path:/src alpine cp /src/model.gguf /models/

# Run container FROM volume (not bind mount)
docker run -d --gpus all -v models-vol:/models -p 3009:8000 \
  -e MODEL_PATH=/models/model.gguf your-llamacpp-image
```
Why: Windows bind mounts go through a gRPC-FUSE/9P bridge that hangs during GPU tensor loading for large files. Docker volumes live on the Docker VM's native Linux ext4 filesystem and bypass this entirely.
Sequential Container Startup
Never start multiple GPU containers simultaneously:
```bash
# WRONG — causes CUDA initialization deadlock
docker start mark-i mark-iii mark-iv mark-vi &

# RIGHT — sequential with a health check between each
for container in mark-v mark-iii mark-iv mark-vi mark-i; do
  docker restart $container
  sleep 5
  # Verify health before starting the next
  curl -s http://localhost:PORT/v1/models || echo "Warning: $container slow to start"
done
```
GPU Memory Planning
Plan your model lineup to fit within VRAM:
```
Example for a 24 GB GPU:

  14B model (Q4_K_M)  →  9.0 GB, 28 GPU layers
  7B coder            →  4.4 GB, full GPU
  8B reasoning        →  4.6 GB, full GPU
  1.5B fast coder     →  1.1 GB, full GPU
  1.7B fast chat      →  1.0 GB, full GPU
  ─────────────────────────────
  Total: 20.1 GB (~84% utilized)

  Remaining: CPU-only containers for 32B+ models
```
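The same arithmetic as a quick sanity check, with the per-model footprints taken from the ledger above; treat the numbers as examples for your own lineup:

```python
# VRAM budget check for the 24 GB example lineup above.
VRAM_GB = 24.0
LINEUP_GB = {
    "14B chat (Q4_K_M, 28 GPU layers)": 9.0,
    "7B coder": 4.4,
    "8B reasoning": 4.6,
    "1.5B fast coder": 1.1,
    "1.7B fast chat": 1.0,
}

used = sum(LINEUP_GB.values())
print(f"Total: {used:.1f} GB ({used / VRAM_GB:.0%} of {VRAM_GB:.0f} GB)")
assert used <= VRAM_GB, "Lineup exceeds available VRAM"
```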
Automatic Node Recovery
When a remote node goes down (Docker Desktop crash, reboot, etc.):
```
Recovery sequence:
  1. Health check fails for remote tier
  2. Check if SSH is responsive (node is up but Docker is down)
  3. If SSH works: restart Docker Desktop via SSH
  4. If SSH fails: create an RDP session to wake the machine
  5. Wait for Docker + sequential container restart
  6. Re-check health
```
Important: Never store recovery credentials in plaintext. Use a vault (Azure Key Vault, HashiCorp Vault, etc.) and pipe secrets through stdin, never as CLI arguments.
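One way to script the SSH-reachable path, sketched in Python over plain `ssh`; the host alias, service name, and container order are assumptions, and any credentials should be pulled from your vault at runtime:

```python
# Recovery sketch for a remote Windows Docker host reachable over SSH.
# REMOTE_HOST is an assumed alias from ~/.ssh/config; com.docker.service
# is the Docker Desktop service name on typical Windows installs.
import subprocess
import time

REMOTE_HOST = "remote-node"
CONTAINERS = ["mark-v", "mark-iii", "mark-iv", "mark-vi", "mark-i"]

def ssh(args: list[str], timeout: int = 15) -> bool:
    """Run a command on the remote node; True if it exits cleanly."""
    try:
        result = subprocess.run(["ssh", REMOTE_HOST, *args],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def recover_remote_node() -> None:
    if not ssh(["echo", "ok"]):
        raise RuntimeError("SSH unreachable: fall back to RDP or manual recovery")
    if not ssh(["docker", "info"]):
        # Node is up but Docker is down: restart the Docker Desktop service.
        ssh(["powershell", "Restart-Service", "com.docker.service"], timeout=120)
        time.sleep(30)
    for name in CONTAINERS:  # sequential restart, per the startup pattern above
        ssh(["docker", "restart", name], timeout=120)
        time.sleep(5)
```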
LiteLLM Gateway Configuration
Unified API across all tiers:
```yaml
model_list:
  # Local Ollama models
  - model_name: local/chat
    litellm_params:
      model: ollama/qwen2.5:32b
      api_base: http://localhost:11434
  # Remote llama.cpp models (need openai/ prefix)
  - model_name: remote/mark-i
    litellm_params:
      model: openai/qwen2.5-14b-instruct
      api_base: http://REMOTE_IP:3009/v1
      api_key: "not-needed"
  # NAS Ollama models
  - model_name: nas/coder
    litellm_params:
      model: ollama/qwen2.5-coder:7b
      api_base: http://NAS_IP:11434
```
Key: llama.cpp endpoints need the `openai/` prefix in the model name and `/v1` in `api_base` for LiteLLM compatibility.
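With the gateway running (port 8080, matching the health-check example above), any OpenAI-compatible client can reach every tier through one endpoint; a minimal sketch using the official `openai` Python client:

```python
# Talk to the whole cluster through the LiteLLM gateway.
# The model name must match a model_name entry in the config above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080", api_key="not-needed")

resp = client.chat.completions.create(
    model="remote/mark-i",
    messages=[{"role": "user", "content": "Ping: which tier am I on?"}],
)
print(resp.choices[0].message.content)
```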
Links
- Lumina Homelab: luminahomelab.ai
- X/Twitter: @HK47LUMINA
- GitHub: mlesnews
Installation
```bash
openclaw install homelab-cluster
```
Tags
#devops-and-cloud
Quick Info
- Category: Development
- Model: Claude 3.5
- Complexity: One-Click
- Author: mlesnews
- Last Updated: 3/10/2026
Related Skills

- 4claw (✓ Verified · 💻 Development · 🧠 Claude-Ready): a moderated imageboard for AI agents. ★ 4.4 (118) · ↓ 4,990 · v1.0.0
- Aap Passport (✓ Verified · 💻 Development · 🧠 Claude-Ready): Agent Attestation Protocol - The Reverse Turing Test. ★ 4.3 (89) · ↓ 4,621 · v1.0.0
- Acestep Lyrics Transcription (✓ Verified · 💻 Development · ⚡ GPT-Optimized): transcribe audio to timestamped lyrics using OpenAI Whisper or ElevenLabs Scribe API. ★ 3.8 (274) · ↓ 17,648 · v1.0.0
- Adaptive Suite (✓ Verified · 💻 Development · 🧠 Claude-Ready): a continuously adaptive skill suite that empowers Clawdbot. ★ 4.7 (88) · ↓ 1,625 · v1.0.0