Talking to OpenClaw: Voice and Audio Interfaces
Use voice to control OpenClaw and DenchClaw with speech-to-text input, audio responses, and the openai-whisper skill. Full guide to voice-first AI workspace workflows.
You can talk to OpenClaw using voice input. DenchClaw supports speech-to-text through the system keyboard's built-in voice input, the openai-whisper skill for local transcription, and text-to-speech audio responses. This guide covers every voice interaction option — from the simplest (iOS dictation) to the most capable (local Whisper with audio output).
Not sure what DenchClaw is? Start with what is DenchClaw. Already set up? Follow along from the setup guide.
Voice Input Options#
There are three ways to use voice input with DenchClaw, in order of complexity:
- System dictation — use your OS or phone's built-in speech-to-text
- Browser speech input — the web UI's built-in microphone button
- OpenAI Whisper skill — local transcription via Whisper CLI, no API key needed
Option 1: System Dictation (Easiest)#
Every major operating system includes speech-to-text. This is the simplest approach — no configuration, no extra software.
macOS#
Enable Dictation:
- System Settings → Keyboard → Dictation
- Toggle Dictation on
- Set a shortcut (default: press Fn twice)
Use it in DenchClaw:
- Click in the DenchClaw chat input
- Press Fn Fn (or your custom shortcut)
- Speak your message
- Press Fn again or tap the microphone icon to finish
macOS Dictation runs on-device on Apple Silicon (no internet required) and works system-wide in any text field.
iOS/iPadOS#
On the mobile companion app:
- Tap the chat input
- Tap the microphone icon on the iOS keyboard
- Speak
- Tap the microphone again to stop and send
iOS dictation is fast and accurate for English. For languages with less training data, accuracy varies.
Windows#
- Press Windows + H to open the Voice Typing panel
- Click the microphone
- Speak into DenchClaw's chat input
- Click Stop when done
Android#
- Tap the chat input in the companion app
- Tap the microphone icon on the Google keyboard
- Speak
- The transcription appears automatically
Option 2: Browser Microphone Button#
DenchClaw's web UI includes a built-in microphone button in the chat input bar. It uses the browser's Web Speech API for transcription.
- Open DenchClaw at localhost:3100
- Click the microphone icon in the chat input
- Allow microphone access when prompted
- Speak your message
- Click the microphone again to stop — the transcription fills the input field
- Press Enter to send
The browser microphone sends audio to a speech recognition service (Google's by default via Chrome). If you're concerned about audio privacy, use the Whisper skill instead — it transcribes locally with no data sent to any cloud service.
Option 3: OpenAI Whisper Skill (Most Capable)#
The openai-whisper skill adds local speech-to-text transcription using OpenAI's Whisper model, running entirely on your machine. No API key required. No audio leaves your device.
Install the Skill#
clawhub install openai-whisper
Or through DenchClaw chat:
"Install the openai-whisper skill"
Prerequisites#
Whisper requires Python and the Whisper CLI:
# Install Python 3.9+ (macOS with Homebrew)
brew install python@3.11
# Install Whisper
pip3 install openai-whisper
# Verify
whisper --help
The first run downloads the model weights. Choose your model based on your hardware:
| Model | Parameters | Speed | Accuracy | VRAM |
|---|---|---|---|---|
| tiny | 39M | Very fast | Basic | ~1 GB |
| base | 74M | Fast | Good | ~1 GB |
| small | 244M | Moderate | Better | ~2 GB |
| medium | 769M | Slow | Great | ~5 GB |
| large | 1550M | Slowest | Best | ~10 GB |
For most users on a modern Mac or PC: base or small is the right balance.
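If you want to pick a model programmatically, the VRAM column above can drive the choice. A minimal sketch (`pick_model` is a hypothetical helper for illustration, and the thresholds are estimates taken from the table, not hard limits):

```shell
#!/bin/bash
# Pick a Whisper model from a rough memory budget in GB.
# Thresholds mirror the VRAM column above.
pick_model() {
  local gb=$1
  if   [ "$gb" -ge 10 ]; then echo "large"
  elif [ "$gb" -ge 5  ]; then echo "medium"
  elif [ "$gb" -ge 2  ]; then echo "small"
  else                        echo "base"
  fi
}

pick_model 8   # prints "medium"
```

You could then call `whisper note.wav --model "$(pick_model 8)"` to use the result directly.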
Using Whisper in DenchClaw#
Once the skill is installed, tell the agent to transcribe audio:
"Transcribe the recording at ~/Downloads/meeting-notes.m4a"
Or record directly from the terminal and pipe to Whisper:
# Record 30 seconds and transcribe (macOS; `rec` comes from SoX: brew install sox)
rec -r 16000 -c 1 -b 16 /tmp/note.wav trim 0 30
whisper /tmp/note.wav --model base --language English --output_format txt
cat /tmp/note.txt
The DenchClaw agent can then take the transcribed text and process it — log it as a note, update a contact record, create a task, or whatever you ask.
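If you'd rather keep a plain-text log alongside whatever the agent does, the transcription can also be appended to a dated notes file. A sketch (the `~/notes` location and file layout are assumptions, not a DenchClaw convention):

```shell
# Append the transcription to today's notes file, with a timestamp header.
mkdir -p ~/notes
{
  echo "## Voice note, $(date '+%Y-%m-%d %H:%M')"
  cat /tmp/note.txt
  echo
} >> ~/notes/"$(date +%Y-%m-%d)".md
```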
Setting Up a Voice Note Workflow#
Here's a practical workflow for logging voice notes after calls:
Step 1: Install the Skill and a Recording Shortcut#
clawhub install openai-whisper
Create a quick recording script:
#!/bin/bash
# ~/scripts/voice-note.sh (requires SoX's `rec` and the Whisper CLI)
# Usage: voice-note.sh [duration_seconds]
DURATION=${1:-60}
OUTFILE="/tmp/voice-note-$(date +%Y%m%d-%H%M%S).wav"
echo "Recording for ${DURATION}s... Press Ctrl+C to stop early."
rec -r 16000 -c 1 -b 16 "$OUTFILE" trim 0 "$DURATION"
echo "Transcribing..."
whisper "$OUTFILE" --model small --language English --output_format txt
echo "Transcription saved to ${OUTFILE%.wav}.txt"
cat "${OUTFILE%.wav}.txt"
Make it executable:
chmod +x ~/scripts/voice-note.sh
Step 2: Record, Transcribe, and Log#
After a call:
~/scripts/voice-note.sh 120
Speak your notes for up to 2 minutes. When done, the transcription appears in your terminal. Copy and paste it into DenchClaw, or use the agent directly:
"Log this as a note for the Acme contact: [paste transcription]"
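To avoid hunting through /tmp for the right file when pasting, a small helper can surface the newest transcription. A sketch (the filename pattern matches the script above; the clipboard line is macOS-only):

```shell
# Print the path of the newest voice-note transcription.
latest_note() {
  ls -t /tmp/voice-note-*.txt 2>/dev/null | head -n 1
}

latest_note
# On macOS, pipe the contents straight to the clipboard:
# cat "$(latest_note)" | pbcopy
```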
Step 3: Automate With the Agent#
If you've set up the Whisper skill in DenchClaw, you can ask the agent to handle the transcription file directly:
"Transcribe /tmp/voice-note-20260326.wav and log it as a note for whoever I mention in the recording under today's date"
The agent transcribes, identifies the contact from the text, and logs the note.
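This can be pushed one step further with a small poller that hands each new recording to a command of your choice. This is a sketch, not a DenchClaw feature; the folder, the marker file, and the suggested `whisper` invocation are assumptions:

```shell
#!/bin/bash
# Run a command once on every new voice-note recording in a folder,
# tracking what has already been handled in a marker file.
process_new() {
  local dir=$1; shift
  local seen="$dir/.processed"
  touch "$seen"
  for f in "$dir"/voice-note-*.wav; do
    [ -e "$f" ] || continue                 # glob matched nothing
    grep -qxF "$f" "$seen" && continue      # already handled
    "$@" "$f"                               # run the supplied command
    echo "$f" >> "$seen"
  done
}

# Example: transcribe anything new, then ask the agent to log the .txt files
# process_new /tmp whisper --model base --output_format txt
```

Run it from cron or a login item and each recording is processed exactly once.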
Text-to-Speech: Audio Responses#
DenchClaw can respond in audio using text-to-speech (TTS). Two approaches:
macOS Built-In TTS#
Ask the agent to use say:
"Read me today's pipeline summary out loud"
If the agent has shell access (which it does by default), it can use macOS's say command:
say -v Samantha "You have three open deals. Acme is in negotiation. Stripe is in demo scheduled. Notion is in proposal sent."
DenchClaw's Built-In TTS#
DenchClaw has native TTS via its tts tool. The agent can speak responses directly when you ask it to. This uses your system's audio output.
Example:
"Give me my top three tasks today as a spoken response"
The agent will generate the TTS audio and play it through your default audio output.
Voice Shortcuts and Hotkeys#
For power users, setting up hotkeys makes voice interaction much faster:
macOS: Alfred / Raycast#
If you use Alfred or Raycast, create a workflow that:
- Opens DenchClaw PWA (or focuses it if already open)
- Triggers macOS Dictation (Fn Fn)
This gives you a single shortcut to open DenchClaw and start speaking in under a second.
macOS: Custom Key Binding#
Using BetterTouchTool or Keyboard Maestro:
- Trigger: custom hotkey (e.g., ⌘⇧V)
- Action: activate the DenchClaw window + trigger dictation (Keyboard Maestro can trigger system dictation)
Privacy Notes#
Voice input privacy depends on the method:
| Method | Audio Goes Where |
|---|---|
| macOS Dictation (online mode) | Apple's servers |
| macOS Dictation (enhanced mode, Apple Silicon) | On-device only |
| Browser Web Speech API (Chrome) | Google's servers |
| OpenAI Whisper skill | Your machine only |
| iOS Dictation | Apple's servers (or on-device for enhanced) |
If audio privacy is important to you, use the Whisper skill or macOS enhanced dictation on Apple Silicon — both process audio locally.
FAQ#
Does voice input work in the mobile app?
Yes. The mobile companion app uses your phone's system keyboard dictation. Tap the microphone icon on your keyboard while the chat input is focused. See the mobile app guide for more.
Can the agent listen continuously for a wake word?
Not currently. Each voice input session is manual — you activate dictation, speak, and send. Continuous listening with a wake word is on the experimental roadmap.
Does Whisper support languages other than English?
Yes. Whisper supports 90+ languages. Specify with the --language flag: whisper file.wav --language Spanish. For auto-detection, omit the flag.
How accurate is Whisper for technical vocabulary (company names, product terms)?
Accuracy for technical terms varies. The medium and large models handle unusual names better. For critical notes, scan the transcription before logging it.
Can I connect a hardware push-to-talk button?
Yes, with some setup. On macOS, a USB or Bluetooth button that sends a keyboard shortcut can trigger macOS Dictation. Configure the shortcut in System Settings → Keyboard → Dictation.
Ready to try DenchClaw? Install in one command: npx denchclaw. Full setup guide →
