MLX Local Inference
Full local AI inference stack on Apple Silicon Macs via MLX.
- Rating: 4.1 (144 reviews)
- Downloads: 2,458
- Version: 1.0.0
MLX Local Inference Stack
Full local AI inference on Apple Silicon Macs. All services expose OpenAI-compatible APIs.
Services Overview
| Service | Port | Access | Models |
|---|---|---|---|
| LLM + Whisper + Embedding | 8787 | LAN (0.0.0.0) | qwen3-14b, gemma-3-12b, whisper-large-v3-turbo, qwen3-embedding-0.6b/4b |
| ASR (Qwen3-ASR) | 8788 | localhost only | Qwen3-ASR-1.7B-8bit |
| Transcribe Daemon | — | file-based | Uses ASR + LLM |
launchd service labels: com.mlx-server (8787), com.mlx-audio-server (8788), com.mlx-transcribe-daemon
1. LLM — Local Chat Completions
Models
| Model ID | Params | Best For |
|---|---|---|
| qwen3-14b | 14B 4bit | Chinese, deep reasoning (built-in think mode) |
| gemma-3-12b | 12B 4bit | English, code generation |
API
curl -X POST http://localhost:8787/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-14b",
"messages": [{"role": "user", "content": "Hello"}],
"temperature": 0.7,
"max_tokens": 2048
}'
Add "stream": true for streaming.
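With "stream": true, OpenAI-style servers normally reply with server-sent events: one `data: {...}` line per chunk, ending with `data: [DONE]`. Below is a minimal sketch of collecting the streamed text, assuming this server follows that format. The helper name `iter_stream_text` and the sample lines are illustrative and not taken from the server.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines (assumed format)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample in the assumed chunk format
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```

In practice the lines would come from the HTTP response body, for example the output of `curl -N` or a client's line iterator.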
Python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8787/v1", api_key="unused")
response = client.chat.completions.create(
model="qwen3-14b",
messages=[{"role": "user", "content": "Hello"}],
temperature=0.7, max_tokens=2048
)
print(response.choices[0].message.content)
Qwen3 Think Mode
Qwen3 may wrap its chain-of-thought in `<think>...</think>` tags. Strip them before using the output:
import re
text = re.sub(r'<think>.*?</think>\s*', '', text, flags=re.DOTALL)
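The same regex can be wrapped as a small helper. This is a sketch: `strip_think` is a hypothetical name, and the extra handling of an unclosed `<think>` block is an assumption about what a reply cut off by `max_tokens` might look like, not documented behavior.

```python
import re

_THINK_CLOSED = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)
_THINK_OPEN = re.compile(r"<think>.*\Z", flags=re.DOTALL)


def strip_think(text: str) -> str:
    """Remove <think>...</think> blocks; also drop a trailing unclosed one."""
    text = _THINK_CLOSED.sub("", text)
    text = _THINK_OPEN.sub("", text)
    return text.strip()


print(strip_think("<think>\nplan the answer\n</think>\n\nHello!"))
```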
Model Selection Guide
| Scenario | Recommended |
|---|---|
| Chinese text | qwen3-14b |
| Cantonese | qwen3-14b |
| English writing | gemma-3-12b |
| Code generation | Either |
| Deep reasoning | qwen3-14b (think mode) |
| Quick Q&A | gemma-3-12b |
2. ASR — Speech-to-Text
Qwen3-ASR (best for Chinese/Cantonese)
curl -X POST http://127.0.0.1:8788/v1/audio/transcriptions \
-F "file=@audio.wav" \
-F "model=mlx-community/Qwen3-ASR-1.7B-8bit" \
-F "language=zh"
Whisper (multilingual, 99 languages)
curl -X POST http://localhost:8787/v1/audio/transcriptions \
-F "file=@audio.wav" \
-F "model=whisper-large-v3-turbo"
ASR Model Comparison
| | Qwen3-ASR (port 8788) | Whisper (port 8787) |
|---|---|---|
| Chinese/Cantonese | Strong | Average |
| Multilingual | No | Yes (99 langs) |
| LAN access | No (localhost) | Yes |
| Loading | On-demand | Always loaded |
Supported audio formats
wav, mp3, m4a, flac, ogg, webm
Long audio
Split into 10-min chunks first:
ffmpeg -y -ss 0 -t 600 -i long.wav -ar 16000 -ac 1 chunk_000.wav
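The command above cuts only the first chunk. To cover a whole file, the same flags can be repeated with an increasing `-ss` offset. Below is a sketch that only builds the command strings; `chunk_commands` is a hypothetical helper, and the total duration is assumed to be known already (for example from ffprobe).

```python
import math
import shlex


def chunk_commands(src: str, total_seconds: float, chunk_seconds: int = 600) -> list[str]:
    """Build one ffmpeg command per chunk, mirroring the flags shown above."""
    count = max(1, math.ceil(total_seconds / chunk_seconds))
    commands = []
    for i in range(count):
        args = [
            "ffmpeg", "-y",
            "-ss", str(i * chunk_seconds),  # chunk start offset in seconds
            "-t", str(chunk_seconds),       # maximum chunk length in seconds
            "-i", src,
            "-ar", "16000", "-ac", "1",     # 16 kHz, mono
            f"chunk_{i:03d}.wav",
        ]
        commands.append(shlex.join(args))
    return commands


for cmd in chunk_commands("long.wav", total_seconds=1500):
    print(cmd)
```

ffmpeg stops at the end of the input, so the last chunk can be shorter than 10 minutes.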
3. Embeddings — Text Vectorization
Models
| Model ID | Size | Use Case |
|---|---|---|
| qwen3-embedding-0.6b | 0.6B 4bit | Fast retrieval, low latency |
| qwen3-embedding-4b | 4B 4bit | High-accuracy semantic matching |
API
curl -X POST http://localhost:8787/v1/embeddings \
-H "Content-Type: application/json" \
-d '{"model": "qwen3-embedding-0.6b", "input": "text to embed"}'
Batch
curl -X POST http://localhost:8787/v1/embeddings \
-H "Content-Type: application/json" \
-d '{"model": "qwen3-embedding-4b", "input": ["text 1", "text 2"]}'
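A typical next step is comparing the returned vectors. Here is a minimal sketch, assuming the response follows the OpenAI embeddings shape (`data[i].embedding`). The vectors below are made-up toy values; real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy stand-in for a parsed /v1/embeddings response (assumed shape)
response = {
    "data": [
        {"index": 0, "embedding": [1.0, 0.0, 0.0]},
        {"index": 1, "embedding": [0.6, 0.8, 0.0]},
    ]
}
vectors = [item["embedding"] for item in response["data"]]
print(round(cosine_similarity(vectors[0], vectors[1]), 3))
```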
4. OCR — Image Text Extraction
Default Model: PaddleOCR-VL-1.5-6bit
| Item | Value |
|---|---|
| Model ID | paddleocr-vl-6bit |
| Speed | ~185 t/s |
| Memory | ~3.3 GB |
| Prompt | OCR: |
CLI
~/.mlx-server/venv/bin/python -m mlx_vlm.generate \
--model mlx-community/PaddleOCR-VL-1.5-6bit \
--image image.jpg \
--prompt "OCR:" \
--max-tokens 512 --temp 0.0
Python
from mlx_vlm import generate, load
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
model, processor = load("mlx-community/PaddleOCR-VL-1.5-6bit")
config = load_config("mlx-community/PaddleOCR-VL-1.5-6bit")
prompt = apply_chat_template(processor, config, "OCR:", num_images=1)
out = generate(model, processor, prompt, "image.jpg",
max_tokens=512, temperature=0.0, verbose=False)
print(out.text if hasattr(out, "text") else out)
Notes
- Prompt must be exactly `OCR:` for PaddleOCR-VL
- `temperature=0.0` for deterministic output
- RGBA images must be converted to RGB first
- Venv: `~/.mlx-server/venv`
5. TTS — Text-to-Speech
Model: Qwen3-TTS (cached, not auto-served)
| Item | Value |
|---|---|
| Model | Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit |
| Memory | ~2GB |
| Feature | Custom voice cloning |
CLI
~/.mlx-server/venv/bin/mlx_audio.tts.generate \
--model mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit \
--text "你好,这是一段测试语音"
As API (via mlx_audio.server on port 8788)
curl -X POST http://127.0.0.1:8788/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit",
"input": "你好世界"
}' --output speech.wav
6. Transcribe Daemon — Automatic Batch Transcription
Drop audio files into ~/transcribe/ for automatic processing:
1. Daemon detects file (polls every 15s)
2. Phase 1: Transcribe via Qwen3-ASR → `filename_raw.md`
3. Phase 2: Correct via Qwen3-14B LLM → `filename_corrected.md`
4. Move results to `~/transcribe/done/`
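The naming scheme in the steps above can be sketched as follows. This is not the daemon's code: `is_supported` and `plan_outputs` are hypothetical helpers, the extension set mirrors the supported formats listed for this daemon, and case-insensitive extension matching is an assumption. The paths shown are the final locations after the move to `done/`.

```python
from pathlib import Path

SUPPORTED = {".wav", ".mp3", ".m4a", ".flac", ".ogg", ".webm"}


def is_supported(path: Path) -> bool:
    """True if the file extension is one of the listed audio formats."""
    return path.suffix.lower() in SUPPORTED


def plan_outputs(audio: Path) -> dict[str, Path]:
    """Map an input file to its Phase 1 and Phase 2 output names."""
    done = audio.parent / "done"
    return {
        "raw": done / f"{audio.stem}_raw.md",              # Phase 1: ASR transcript
        "corrected": done / f"{audio.stem}_corrected.md",  # Phase 2: LLM-corrected text
    }


inbox = Path.home() / "transcribe"
audio = inbox / "meeting.m4a"
if is_supported(audio):
    for phase, target in plan_outputs(audio).items():
        print(phase, target)
```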
LLM Correction Rules
- Fix homophone errors (的/得/地, 在/再)
- Preserve Cantonese characters (嘅、唔、咁、喺、冇、佢)
- Add punctuation and paragraphs
- Remove filler words
Supported formats
wav, mp3, m4a, flac, ogg, webm
Service Management
# LLM + Whisper + Embedding server (port 8787)
launchctl kickstart -k gui/$(id -u)/com.mlx-server
# ASR server (port 8788)
launchctl kickstart -k gui/$(id -u)/com.mlx-audio-server
# Transcribe daemon
launchctl kickstart gui/$(id -u)/com.mlx-transcribe-daemon
# Logs
tail -f ~/.mlx-server/logs/server.log
tail -f ~/.mlx-server/logs/mlx-audio-server.err.log
tail -f ~/.mlx-server/logs/transcribe-daemon.err.log
Requirements
- Apple Silicon Mac (M1/M2/M3/M4)
- Python 3.10+ with mlx, mlx-lm, mlx-audio, mlx-vlm
- Recommended: 32GB+ RAM for running multiple models
Installation
openclaw install mlx-local-inference