
MLX Local Inference

Full local AI inference stack on Apple Silicon Macs via MLX.

Rating: 4.1 (144 reviews)
Downloads: 2,458
Version: 1.0.0


# MLX Local Inference Stack

Full local AI inference on Apple Silicon Macs. All services expose OpenAI-compatible APIs.

## Services Overview

| Service | Port | Access | Models |
|---------|------|--------|--------|
| LLM + Whisper + Embedding | 8787 | LAN (0.0.0.0) | qwen3-14b, gemma-3-12b, whisper-large-v3-turbo, qwen3-embedding-0.6b/4b |
| ASR (Qwen3-ASR) | 8788 | localhost only | Qwen3-ASR-1.7B-8bit |
| Transcribe Daemon | — | file-based | Uses ASR + LLM |

LaunchAgents: `com.mlx-server` (8787), `com.mlx-audio-server` (8788), `com.mlx-transcribe-daemon`
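
Since everything speaks the OpenAI wire format, a quick way to confirm the main server is up is to list the models it advertises. A minimal sketch, assuming the server implements the standard `/v1/models` endpoint:

```python
from openai import OpenAI

# Point the stock OpenAI client at the local server; the API key is unused.
client = OpenAI(base_url="http://localhost:8787/v1", api_key="unused")

# /v1/models is part of the standard OpenAI surface; if the server implements
# it, this prints every model ID the 8787 server is serving.
for model in client.models.list():
    print(model.id)
```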


## 1. LLM — Local Chat Completions

### Models

| Model ID | Params | Best For |
|----------|--------|----------|
| `qwen3-14b` | 14B 4bit | Chinese, deep reasoning (built-in think mode) |
| `gemma-3-12b` | 12B 4bit | English, code generation |

### API

```bash
curl -X POST http://localhost:8787/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-14b",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.7,
    "max_tokens": 2048
  }'
```

Add "stream": true for streaming.

### Python

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8787/v1", api_key="unused")
response = client.chat.completions.create(
    model="qwen3-14b",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7, max_tokens=2048
)
print(response.choices[0].message.content)
```
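
Streaming works the same way from Python; a sketch, assuming the server emits standard OpenAI streaming chunks:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8787/v1", api_key="unused")

# stream=True yields deltas as they are generated instead of one final message.
stream = client.chat.completions.create(
    model="qwen3-14b",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7,
    max_tokens=2048,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```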

### Qwen3 Think Mode

Qwen3 may include `<think>...</think>` chain-of-thought tags in its output. Strip them:

```python
import re

text = re.sub(r'<think>.*?</think>\s*', '', text, flags=re.DOTALL)
```

### Model Selection Guide

| Scenario | Recommended |
|----------|-------------|
| Chinese text | `qwen3-14b` |
| Cantonese | `qwen3-14b` |
| English writing | `gemma-3-12b` |
| Code generation | Either |
| Deep reasoning | `qwen3-14b` (think mode) |
| Quick Q&A | `gemma-3-12b` |
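
If you route requests programmatically, the table reduces to a lookup. A hypothetical helper (the scenario keys below are illustrative, not part of any API):

```python
# Hypothetical routing table mirroring the guide above.
MODEL_BY_SCENARIO = {
    "chinese_text": "qwen3-14b",
    "cantonese": "qwen3-14b",
    "english_writing": "gemma-3-12b",
    "deep_reasoning": "qwen3-14b",  # benefits from built-in think mode
    "quick_qa": "gemma-3-12b",
}

def pick_model(scenario: str) -> str:
    """Return the recommended model ID, defaulting to qwen3-14b."""
    return MODEL_BY_SCENARIO.get(scenario, "qwen3-14b")
```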

## 2. ASR — Speech-to-Text

### Qwen3-ASR (best for Chinese/Cantonese)

```bash
curl -X POST http://127.0.0.1:8788/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=mlx-community/Qwen3-ASR-1.7B-8bit" \
  -F "language=zh"
```

### Whisper (multilingual, 99 languages)

```bash
curl -X POST http://localhost:8787/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=whisper-large-v3-turbo"
```

### ASR Model Comparison

| | Qwen3-ASR (port 8788) | Whisper (port 8787) |
|---|---|---|
| Chinese/Cantonese | **Strong** | Average |
| Multilingual | No | Yes (99 langs) |
| LAN access | No (localhost) | Yes |
| Loading | On-demand | Always loaded |

### Supported audio formats

wav, mp3, m4a, flac, ogg, webm

### Long audio

Split into 10-minute chunks first:

```bash
ffmpeg -y -ss 0 -t 600 -i long.wav -ar 16000 -ac 1 chunk_000.wav
```
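
To handle a long recording end-to-end, the same chunking can be driven from Python: probe the duration, cut 10-minute chunks, and transcribe each one. A sketch, assuming `ffmpeg` and `ffprobe` are on PATH:

```python
import subprocess
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8787/v1", api_key="unused")
CHUNK = 600  # seconds per chunk, matching the ffmpeg command above

def duration_seconds(path: str) -> float:
    # ffprobe prints the container duration as a bare number.
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        check=True, capture_output=True, text=True,
    )
    return float(out.stdout)

def transcribe_long(path: str) -> str:
    parts = []
    for i, start in enumerate(range(0, int(duration_seconds(path)), CHUNK)):
        chunk = f"chunk_{i:03d}.wav"
        # Cut one 16 kHz mono chunk starting at `start` seconds.
        subprocess.run(
            ["ffmpeg", "-y", "-ss", str(start), "-t", str(CHUNK),
             "-i", path, "-ar", "16000", "-ac", "1", chunk],
            check=True, capture_output=True,
        )
        with open(chunk, "rb") as f:
            parts.append(client.audio.transcriptions.create(
                model="whisper-large-v3-turbo", file=f).text)
    return " ".join(parts)

print(transcribe_long("long.wav"))
```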


## 3. Embeddings — Text Vectorization

### Models

| Model ID | Size | Use Case |
|----------|------|----------|
| `qwen3-embedding-0.6b` | 0.6B 4bit | Fast retrieval, low latency |
| `qwen3-embedding-4b` | 4B 4bit | High-accuracy semantic matching |

### API

```bash
curl -X POST http://localhost:8787/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-embedding-0.6b", "input": "text to embed"}'
```

### Batch

```bash
curl -X POST http://localhost:8787/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-embedding-4b", "input": ["text 1", "text 2"]}'
```


## 4. OCR — Image Text Extraction

### Default Model: PaddleOCR-VL-1.5-6bit

| Item | Value |
|------|-------|
| Model ID | `paddleocr-vl-6bit` |
| Speed | ~185 t/s |
| Memory | ~3.3 GB |
| Prompt | `OCR:` |

### CLI

```bash
cd ~/.mlx-server/venv
python -m mlx_vlm.generate \
  --model mlx-community/PaddleOCR-VL-1.5-6bit \
  --image image.jpg \
  --prompt "OCR:" \
  --max-tokens 512 --temp 0.0
```

### Python

```python
from mlx_vlm import generate, load
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model, processor = load("mlx-community/PaddleOCR-VL-1.5-6bit")
config = load_config("mlx-community/PaddleOCR-VL-1.5-6bit")
prompt = apply_chat_template(processor, config, "OCR:", num_images=1)
out = generate(model, processor, prompt, "image.jpg",
               max_tokens=512, temperature=0.0, verbose=False)
print(out.text if hasattr(out, "text") else out)
```

### Notes

- Prompt must be exactly `OCR:` for PaddleOCR-VL
- `temperature=0.0` for deterministic output
- RGBA images must be converted to RGB first (see the sketch below)
- Venv: `~/.mlx-server/venv`
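
For the RGBA caveat, a short Pillow snippet drops the alpha channel before the image reaches the model:

```python
from PIL import Image

# PaddleOCR-VL expects 3-channel input; flatten any alpha channel first.
img = Image.open("image.png")
if img.mode == "RGBA":
    img = img.convert("RGB")
img.save("image_rgb.jpg")
```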

## 5. TTS — Text-to-Speech

### Model: Qwen3-TTS (cached, not auto-served)

| Item | Value |
|------|-------|
| Model | Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit |
| Memory | ~2GB |
| Feature | Custom voice cloning |

### CLI

```bash
~/.mlx-server/venv/bin/mlx_audio.tts.generate \
  --model mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit \
  --text "你好,这是一段测试语音"
```

### As API (via mlx_audio.server on port 8788)

```bash
curl -X POST http://127.0.0.1:8788/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit",
    "input": "你好世界"
  }' --output speech.wav
```
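
From Python, a plain HTTP POST is enough; a sketch with `requests`, assuming the server accepts the same JSON body as the curl call above:

```python
import requests

# Same JSON body as the curl example; the response body is the audio itself.
resp = requests.post(
    "http://127.0.0.1:8788/v1/audio/speech",
    json={
        "model": "mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit",
        "input": "你好世界",
    },
    timeout=300,
)
resp.raise_for_status()
with open("speech.wav", "wb") as f:
    f.write(resp.content)
```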


## 6. Transcribe Daemon — Automatic Batch Transcription

Drop audio files into `~/transcribe/` for automatic processing (a usage sketch follows the list):

1. Daemon detects the file (polls every 15 s)
2. **Phase 1**: Transcribe via Qwen3-ASR → `filename_raw.md`
3. **Phase 2**: Correct via Qwen3-14B LLM → `filename_corrected.md`
4. Results are moved to `~/transcribe/done/`
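
A minimal client-side sketch of that workflow: copy a recording in, then poll `~/transcribe/done/` until the corrected transcript appears. The file name `meeting.m4a` is illustrative; the 15 s sleep mirrors the daemon's own polling interval:

```python
import shutil
import time
from pathlib import Path

inbox = Path.home() / "transcribe"

# Drop the recording into the watched folder...
src = Path("meeting.m4a")
shutil.copy(src, inbox / src.name)

# ...then wait for the corrected transcript to land in done/.
corrected = inbox / "done" / f"{src.stem}_corrected.md"
while not corrected.exists():
    time.sleep(15)
print(corrected.read_text())
```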

### LLM Correction Rules

- Fix homophone errors (的/得/地, 在/再)
- Preserve Cantonese characters (嘅、唔、咁、喺、冇、佢)
- Add punctuation and paragraphs
- Remove filler words

### Supported formats

wav, mp3, m4a, flac, ogg, webm


## Service Management

```bash
# LLM + Whisper + Embedding server (port 8787)
launchctl kickstart -k gui/$(id -u)/com.mlx-server

# ASR server (port 8788)
launchctl kickstart -k gui/$(id -u)/com.mlx-audio-server

# Transcribe daemon
launchctl kickstart gui/$(id -u)/com.mlx-transcribe-daemon

# Logs
tail -f ~/.mlx-server/logs/server.log
tail -f ~/.mlx-server/logs/mlx-audio-server.err.log
tail -f ~/.mlx-server/logs/transcribe-daemon.err.log
```

## Requirements

- Apple Silicon Mac (M1/M2/M3/M4)
- Python 3.10+ with mlx, mlx-lm, mlx-audio, mlx-vlm
- Recommended: 32 GB+ RAM for running multiple models

## Installation

```bash
openclaw install mlx-local-inference
```

Tags

#devops-and-cloud

Quick Info

Category: Development
Model: Claude 3.5
Complexity: One-Click
Author: bendusy
Last Updated: 3/10/2026