Zhipu ASR
Automatic Speech Recognition (ASR) using the Zhipu AI (BigModel) GLM-ASR model.
- Rating: 4.3 (53 reviews)
- Downloads: 8,864
- Version: 1.0.0
Zhipu AI Automatic Speech Recognition (ASR)
Transcribe Chinese audio files to text using Zhipu AI's GLM-ASR model.
Setup
1. Get an API key from the Zhipu AI Console.
2. Set it in your environment:
export ZHIPU_API_KEY="your-key-here"
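Before transcribing, it can help to confirm that the variable is actually visible to the shell. The following is a minimal sketch; `require_api_key` is a hypothetical helper name and is not part of the skill's scripts.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the skill): succeed only when
# ZHIPU_API_KEY is set to a non-empty value.
require_api_key() {
  if [ -z "${ZHIPU_API_KEY:-}" ]; then
    echo "ZHIPU_API_KEY is not set; export it before transcribing." >&2
    return 1
  fi
}

# Usage:
#   require_api_key && bash scripts/speech_to_text.sh recording.wav
```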
Supported Audio Formats
- WAV - Recommended, best quality
- MP3 - Widely supported
- OGG - Auto-converted to MP3
- M4A - Auto-converted to MP3
- AAC - Auto-converted to MP3
- FLAC - Auto-converted to MP3
- WMA - Auto-converted to MP3
Note: The API accepts only WAV and MP3. The script uses ffmpeg to convert any other format to MP3, so any format that ffmpeg can decode works as input.
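The conversion rule in the note can be sketched roughly as follows. This is an illustration of the idea, not the script's actual code: `needs_conversion` and `prepare_audio` are hypothetical names, and the sketch assumes ffmpeg is on the PATH and was built with an MP3 encoder.

```bash
#!/usr/bin/env bash
# Hypothetical helper: return 0 when the file extension is neither
# wav nor mp3 (case-insensitive), meaning a conversion is needed.
needs_conversion() {
  local ext
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    wav|mp3) return 1 ;;
    *)       return 0 ;;
  esac
}

# Hypothetical helper: print the path of a file the API should accept.
# WAV and MP3 inputs are passed through; anything else is converted to
# an MP3 written next to the input.
prepare_audio() {
  local out
  if needs_conversion "$1"; then
    out="${1%.*}.mp3"
    ffmpeg -nostdin -loglevel error -y -i "$1" "$out" || return 1
    printf '%s\n' "$out"
  else
    printf '%s\n' "$1"
  fi
}

# Usage:
#   bash scripts/speech_to_text.sh "$(prepare_audio voice_memo.m4a)"
```

Checking only the extension is a simplification; a file with a misleading extension would still be rejected by the API.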
File Constraints
- Maximum file size: 25 MB
- Maximum duration: 30 seconds
- Recommended sample rate: 16000 Hz or higher
- Audio channels: Mono or stereo
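A pre-flight check against these limits might look like the sketch below. The function names are hypothetical, "25 MB" is assumed here to mean 25 × 1024 × 1024 bytes, and the duration check assumes ffprobe (shipped with ffmpeg) is installed.

```bash
#!/usr/bin/env bash
MAX_BYTES=$((25 * 1024 * 1024))   # assumption: 25 MB read as 25 MiB
MAX_SECONDS=30

# Hypothetical helper: return 0 when the file exists and is at most
# MAX_BYTES in size.
size_ok() {
  local bytes
  [ -f "$1" ] || return 1
  bytes=$(wc -c < "$1" | tr -d ' ')
  [ "$bytes" -le "$MAX_BYTES" ]
}

# Hypothetical helper: return 0 when ffprobe reports a duration of at
# most MAX_SECONDS.
duration_ok() {
  local secs
  secs=$(ffprobe -v error -show_entries format=duration \
           -of default=noprint_wrappers=1:nokey=1 "$1") || return 1
  awk -v d="$secs" -v max="$MAX_SECONDS" 'BEGIN { exit !(d <= max) }'
}

# Usage:
#   size_ok recording.wav && duration_ok recording.wav \
#     && bash scripts/speech_to_text.sh recording.wav
```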
Usage
Basic Transcription
Transcribe an audio file with default settings:
bash scripts/speech_to_text.sh recording.wav
Transcription with Context
Provide previous transcription or context for better accuracy:
bash scripts/speech_to_text.sh recording.wav "这是之前的转录内容,有助于提高准确性"
Transcription with Hotwords
Use custom vocabulary to improve recognition of specific terms:
bash scripts/speech_to_text.sh recording.mp3 "" "人名,地名,专业术语,公司名称"
Full Options
Combine context and hotwords:
bash scripts/speech_to_text.sh recording.wav "会议记录片段" "张三,李四,项目名称"
Parameters:
- audio_file (required): Path to audio file (.wav or .mp3)
- prompt (optional): Previous transcription or context text (max 8000 chars)
- hotwords (optional): Comma-separated list of specific terms (max 100 words)
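Since the hotword list is capped at 100 terms, a quick count before calling the script can catch oversized lists early. The sketch below uses hypothetical helper names and assumes terms are separated by ASCII commas, as in the examples above.

```bash
#!/usr/bin/env bash
MAX_HOTWORDS=100   # limit stated in the parameter list

# Hypothetical helper: print the number of non-empty, comma-separated
# terms in the first argument.
count_hotwords() {
  printf '%s\n' "$1" |
    awk -F',' '{ n = 0; for (i = 1; i <= NF; i++) if ($i != "") n++; print n }'
}

# Hypothetical helper: return 0 when the list is within the limit.
hotwords_ok() {
  [ "$(count_hotwords "$1")" -le "$MAX_HOTWORDS" ]
}

# Usage:
#   words="张三,李四,项目名称"
#   hotwords_ok "$words" && bash scripts/speech_to_text.sh recording.wav "" "$words"
```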
Features
Context Prompts
Why use context prompts:
- Improves accuracy in long conversations
- Helps with domain-specific terminology
- Maintains consistency across multiple segments
When to use:
- Multi-part conversations or meetings
- Technical or specialized content
- Continuing from previous transcriptions
Example:
bash scripts/speech_to_text.sh part2.wav "第一部分的转录内容:讨论了项目进展和下一步计划"
Hotwords
What are hotwords: Custom vocabulary list that boosts recognition accuracy for specific terms.
Best use cases:
- Proper names (people, places)
- Domain-specific terminology
- Company names and products
- Technical jargon
- Industry-specific terms
Examples:
# Medical transcription
bash scripts/speech_to_text.sh medical.wav "" "患者,症状,诊断,治疗方案"
# Business meeting
bash scripts/speech_to_text.sh meeting.wav "" "张经理,李总,项目代号,预算"
# Tech discussion
bash scripts/speech_to_text.sh tech.wav "" "API,数据库,算法,框架"
Workflow Examples
Transcribe a Meeting
# Part 1
bash scripts/speech_to_text.sh meeting_part1.wav
# Part 2 with context
bash scripts/speech_to_text.sh meeting_part2.wav "第一部分讨论了项目进度" "张总,李经理,项目名称"
# Part 3 with context
bash scripts/speech_to_text.sh meeting_part3.wav "前两部分讨论了项目进度和预算" "张总,李经理,项目名称"
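The manual steps above could also be chained automatically by feeding each part's transcribed text to the next part as its context prompt. The sketch below is one possible approach, not part of the skill: `transcribe_in_order` is a hypothetical name, jq is assumed to be installed, and the script is assumed to print the JSON described under Output Format on stdout.

```bash
#!/usr/bin/env bash
# Path to the transcription script; override STT_SCRIPT to point elsewhere.
STT_SCRIPT="${STT_SCRIPT:-scripts/speech_to_text.sh}"
HOTWORDS="${HOTWORDS:-张总,李经理,项目名称}"

# Hypothetical helper: transcribe the given files in order, passing the
# previous part's text as the context prompt for the next part.
# Prints one line of transcribed text per part.
transcribe_in_order() {
  local context="" part json text
  for part in "$@"; do
    json=$(bash "$STT_SCRIPT" "$part" "$context" "$HOTWORDS") || return 1
    text=$(printf '%s' "$json" | jq -r '.text // empty') || return 1
    printf '%s\n' "$text"
    context="$text"   # only the previous part is carried forward
  done
}

# Usage:
#   transcribe_in_order meeting_part1.wav meeting_part2.wav meeting_part3.wav
```

Carrying forward only the previous part keeps the prompt well under the 8000-character limit; a hand-written summary, as in the manual example, may give better context for long meetings.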
Transcribe a Lecture
bash scripts/speech_to_text.sh lecture.wav "" "教授,课程名称,专业术语1,专业术语2"
Process Multiple Files
for file in recording_*.wav; do
  # Skip the literal pattern when no file matches the glob
  [ -e "$file" ] || continue
  bash scripts/speech_to_text.sh "$file"
done
Audio Quality Tips
Best practices for accurate transcription:
1. Clear audio source
   - Minimize background noise
   - Use good quality microphone
   - Speak clearly and at moderate pace
2. Optimal audio settings
   - Sample rate: 16000 Hz or higher
   - Bit depth: 16-bit or higher
   - Single channel (mono) is sufficient
3. File preparation
   - Remove silence from beginning/end
   - Normalize audio levels
   - Ensure consistent volume
Output Format
The script outputs JSON with:
- id: Task ID
- created: Request timestamp (Unix timestamp)
- request_id: Unique request identifier
- model: Model name used
- text: Transcribed text
Example output:
{
  "id": "task-12345",
  "created": 1234567890,
  "request_id": "req-abc123",
  "model": "glm-asr-2512",
  "text": "你好,这是转录的文本内容"
}
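When only the transcript is needed, the `text` field can be pulled out of the JSON with jq. This is a small sketch that assumes jq is installed and that the script writes the JSON shown above to stdout; `extract_text` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read the script's JSON on stdin and print only the
# transcribed text. Exits non-zero when the "text" field is missing or null.
extract_text() {
  jq -e -r '.text'
}

# Usage:
#   bash scripts/speech_to_text.sh recording.wav | extract_text > recording.txt
```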
Troubleshooting
File Size Issues:
- Split audio files larger than 25 MB
- Reduce sample rate or bit depth
- Use compression (MP3) for smaller files
Duration Issues:
- Split recordings longer than 30 seconds
- Process segments separately
- Use context prompts to maintain continuity
Poor Accuracy:
- Improve audio quality
- Use hotwords for specific terms
- Provide context prompts
- Ensure clear speech and minimal noise
Format Errors:
- Ensure the file is .wav or .mp3
- Check that the file is not corrupted
- Verify that the audio plays in a standard player
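For recordings that exceed the duration limit, one way to split them is ffmpeg's segment muxer. The sketch below is an assumption about a workable approach rather than something the skill provides: the helper names are hypothetical, ffmpeg must be installed, and segments are cut at 29 seconds to leave a small margin under the 30-second limit.

```bash
#!/usr/bin/env bash
SEGMENT_SECONDS=29   # one second under the 30-second limit, as a margin

# Hypothetical helper: print how many segments a recording of the given
# length (whole seconds) will produce.
segments_needed() {
  echo $(( ($1 + SEGMENT_SECONDS - 1) / SEGMENT_SECONDS ))
}

# Hypothetical helper: split the input into numbered WAV segments.
# Re-encodes to 16 kHz mono 16-bit PCM, matching the recommended settings.
split_audio() {
  local input="$1" prefix="${2:-segment}"
  ffmpeg -nostdin -loglevel error -i "$input" \
    -ar 16000 -ac 1 -c:a pcm_s16le \
    -f segment -segment_time "$SEGMENT_SECONDS" \
    "${prefix}_%03d.wav"
}

# Usage:
#   split_audio long_meeting.wav meeting_part
#   for f in meeting_part_*.wav; do bash scripts/speech_to_text.sh "$f"; done
```

Fixed-length cuts can land in the middle of a word, so passing the previous segment's text as a context prompt may help maintain continuity.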
Limitations
- Maximum audio duration: 30 seconds per request
- File size limit: 25 MB
- Maximum hotwords: 100 terms
- Context prompt limit: 8000 characters
- Best performance with Chinese language audio
Performance Notes
- Typical transcription time: 1-3 seconds
- Real-time or faster for most audio
- Processing time scales with audio quality and length
Installation
openclaw install zhipu-asr