Talking to OpenClaw: Voice and Audio Interfaces
Use voice to control OpenClaw and DenchClaw with speech-to-text input, audio responses, and the openai-whisper skill. Full guide to voice-first AI workspace workflows.
You can talk to OpenClaw using voice input. DenchClaw supports speech-to-text through the system keyboard's built-in voice input, the openai-whisper skill for local transcription, and text-to-speech audio responses. This guide covers every voice interaction option — from the simplest (iOS dictation) to the most capable (local Whisper with audio output).
Not sure what DenchClaw is? Start with what is DenchClaw. Already set up? Follow along from the setup guide.
Voice Input Options#
There are three ways to use voice input with DenchClaw, in order of complexity:
- System dictation — use your OS or phone's built-in speech-to-text
- Browser speech input — the web UI's built-in microphone button
- OpenAI Whisper skill — local transcription via Whisper CLI, no API key needed
Option 1: System Dictation (Easiest)#
Every major operating system includes speech-to-text. This is the simplest approach — no configuration, no extra software.
macOS#
Enable Dictation:
- System Settings → Keyboard → Dictation
- Toggle Dictation on
- Set a shortcut (default: press Fn twice)
Use it in DenchClaw:
- Click in the DenchClaw chat input
- Press Fn Fn (or your custom shortcut)
- Speak your message
- Press Fn again or tap the microphone icon to finish
macOS Dictation runs on-device on Apple Silicon (no internet required) and works system-wide in any text field.
iOS/iPadOS#
On the mobile companion app:
- Tap the chat input
- Tap the microphone icon on the iOS keyboard
- Speak
- Tap the microphone again to stop and send
iOS dictation is fast and accurate for English. For languages with less training data, accuracy varies.
Windows#
- Press Windows + H to open the Voice Typing panel
- Click the microphone
- Speak into DenchClaw's chat input
- Click Stop when done
Android#
- Tap the chat input in the companion app
- Tap the microphone icon on the Google keyboard
- Speak
- The transcription appears automatically
Option 2: Browser Microphone Button#
DenchClaw's web UI includes a built-in microphone button in the chat input bar. It uses the browser's Web Speech API for transcription.
- Open DenchClaw at localhost:3100
- Click the microphone icon in the chat input
- Allow microphone access when prompted
- Speak your message
- Click the microphone again to stop — the transcription fills the input field
- Press Enter to send
The browser microphone sends audio to a speech recognition service (Google's by default via Chrome). If you're concerned about audio privacy, use the Whisper skill instead — it transcribes locally with no data sent to any cloud service.
Option 3: OpenAI Whisper Skill (Most Capable)#
The openai-whisper skill adds local speech-to-text transcription using OpenAI's Whisper model, running entirely on your machine. No API key required. No audio leaves your device.
Install the Skill#
clawhub install openai-whisper
Or through DenchClaw chat:
"Install the openai-whisper skill"
Prerequisites#
Whisper requires Python and the Whisper CLI:
# Install Python 3.9+ (macOS with Homebrew)
brew install python@3.11
# Install Whisper
pip3 install openai-whisper
# Verify
whisper --help
The first run downloads the model weights. Choose your model based on your hardware:
| Model | Parameters | Speed | Accuracy | VRAM |
|---|---|---|---|---|
| tiny | 39M | Very fast | Basic | ~1 GB |
| base | 74M | Fast | Good | ~1 GB |
| small | 244M | Moderate | Better | ~2 GB |
| medium | 769M | Slow | Great | ~5 GB |
| large | 1550M | Slowest | Best | ~10 GB |
For most users on a modern Mac or PC: base or small is the right balance.
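If you want to pick a model programmatically, the VRAM column above can drive the choice. A minimal sketch (`pick_model` is a hypothetical helper for illustration, and the thresholds are estimates taken from the table, not hard limits):

```shell
#!/bin/bash
# Pick a Whisper model from a rough memory budget in GB.
# Thresholds mirror the VRAM column above.
pick_model() {
  local gb=$1
  if   [ "$gb" -ge 10 ]; then echo "large"
  elif [ "$gb" -ge 5  ]; then echo "medium"
  elif [ "$gb" -ge 2  ]; then echo "small"
  else                        echo "base"
  fi
}

pick_model 8   # prints "medium"
```

You could then call `whisper note.wav --model "$(pick_model 8)"` to use the result directly.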
Using Whisper in DenchClaw#
Once the skill is installed, tell the agent to transcribe audio:
"Transcribe the recording at ~/Downloads/meeting-notes.m4a"
Or record directly from the terminal and pipe to Whisper:
# Record 30 seconds and transcribe (macOS; `rec` comes from SoX: brew install sox)
rec -r 16000 -c 1 -b 16 /tmp/note.wav trim 0 30
whisper /tmp/note.wav --model base --language English --output_format txt
cat /tmp/note.txt
The DenchClaw agent can then take the transcribed text and process it — log it as a note, update a contact record, create a task, or whatever you ask.
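If you'd rather keep a plain-text log alongside whatever the agent does, the transcription can also be appended to a dated notes file. A sketch (the `~/notes` location and file layout are assumptions, not a DenchClaw convention):

```shell
# Append the transcription to today's notes file, with a timestamp header.
mkdir -p ~/notes
{
  echo "## Voice note, $(date '+%Y-%m-%d %H:%M')"
  cat /tmp/note.txt
  echo
} >> ~/notes/"$(date +%Y-%m-%d)".md
```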
Setting Up a Voice Note Workflow#
Here's a practical workflow for logging voice notes after calls:
Step 1: Install the Skill and a Recording Shortcut#
clawhub install openai-whisper
Create a quick recording script:
#!/bin/bash
# ~/scripts/voice-note.sh (requires SoX's `rec` and the Whisper CLI)
# Usage: voice-note.sh [duration_seconds]
DURATION=${1:-60}
OUTFILE="/tmp/voice-note-$(date +%Y%m%d-%H%M%S).wav"
echo "Recording for ${DURATION}s... Press Ctrl+C to stop early."
rec -r 16000 -c 1 -b 16 "$OUTFILE" trim 0 "$DURATION"
echo "Transcribing..."
whisper "$OUTFILE" --model small --language English --output_format txt
echo "Transcription saved to ${OUTFILE%.wav}.txt"
cat "${OUTFILE%.wav}.txt"
Make it executable:
chmod +x ~/scripts/voice-note.sh
Step 2: Record, Transcribe, and Log#
After a call:
~/scripts/voice-note.sh 120
Speak your notes for up to 2 minutes. When done, the transcription appears in your terminal. Copy and paste it into DenchClaw, or use the agent directly:
"Log this as a note for the Acme contact: [paste transcription]"
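To avoid hunting through /tmp for the right file when pasting, a small helper can surface the newest transcription. A sketch (the filename pattern matches the script above; the clipboard line is macOS-only):

```shell
# Print the path of the newest voice-note transcription.
latest_note() {
  ls -t /tmp/voice-note-*.txt 2>/dev/null | head -n 1
}

latest_note
# On macOS, pipe the contents straight to the clipboard:
# cat "$(latest_note)" | pbcopy
```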
Step 3: Automate With the Agent#
If you've set up the Whisper skill in DenchClaw, you can ask the agent to handle the transcription file directly:
"Transcribe /tmp/voice-note-20260326.wav and log it as a note for whoever I mention in the recording under today's date"
The agent transcribes, identifies the contact from the text, and logs the note.
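This can be pushed one step further with a small poller that hands each new recording to a command of your choice. This is a sketch, not a DenchClaw feature; the folder, the marker file, and the suggested `whisper` invocation are assumptions:

```shell
#!/bin/bash
# Run a command once on every new voice-note recording in a folder,
# tracking what has already been handled in a marker file.
process_new() {
  local dir=$1; shift
  local seen="$dir/.processed"
  touch "$seen"
  for f in "$dir"/voice-note-*.wav; do
    [ -e "$f" ] || continue                 # glob matched nothing
    grep -qxF "$f" "$seen" && continue      # already handled
    "$@" "$f"                               # run the supplied command
    echo "$f" >> "$seen"
  done
}

# Example: transcribe anything new, then ask the agent to log the .txt files
# process_new /tmp whisper --model base --output_format txt
```

Run it from cron or a login item and each recording is processed exactly once.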
Text-to-Speech: Audio Responses#
DenchClaw can respond in audio using text-to-speech (TTS). Two approaches:
macOS Built-In TTS#
Ask the agent to use say:
"Read me today's pipeline summary out loud"
If the agent has shell access (which it does by default), it can use macOS's say command:
say -v Samantha "You have three open deals. Acme is in negotiation. Stripe is in demo scheduled. Notion is in proposal sent."
DenchClaw's Built-In TTS#
DenchClaw has native TTS via its tts tool. The agent can speak responses directly when you ask it to. This uses your system's audio output.
Example:
"Give me my top three tasks today as a spoken response"
The agent will generate the TTS audio and play it through your default audio output.
Voice Shortcuts and Hotkeys#
For power users, setting up hotkeys makes voice interaction much faster:
macOS: Alfred / Raycast#
If you use Alfred or Raycast, create a workflow that:
- Opens DenchClaw PWA (or focuses it if already open)
- Triggers macOS Dictation (Fn Fn)
This gives you a single shortcut to open DenchClaw and start speaking in under a second.
macOS: Custom Key Binding#
Using BetterTouchTool or Keyboard Maestro:
- Trigger: custom hotkey (e.g., ⌘⇧V)
- Action: activate the DenchClaw window + trigger dictation (Keyboard Maestro can trigger system dictation)
Privacy Notes#
Voice input privacy depends on the method:
| Method | Audio Goes Where |
|---|---|
| macOS Dictation (online mode) | Apple's servers |
| macOS Dictation (enhanced mode, Apple Silicon) | On-device only |
| Browser Web Speech API (Chrome) | Google's servers |
| OpenAI Whisper skill | Your machine only |
| iOS Dictation | Apple's servers (or on-device for enhanced) |
If audio privacy is important to you, use the Whisper skill or macOS enhanced dictation on Apple Silicon — both process audio locally.
FAQ#
Does voice input work in the mobile app?
Yes. The mobile companion app uses your phone's system keyboard dictation. Tap the microphone icon on your keyboard while the chat input is focused. See the mobile app guide for more.
Can the agent listen continuously for a wake word?
Not currently. Each voice input session is manual — you activate dictation, speak, and send. Continuous listening with a wake word is on the experimental roadmap.
Does Whisper support languages other than English?
Yes. Whisper supports 90+ languages. Specify with the --language flag: whisper file.wav --language Spanish. For auto-detection, omit the flag.
How accurate is Whisper for technical vocabulary (company names, product terms)?
Accuracy for technical terms varies. The medium and large models handle unusual names better. For critical notes, scan the transcription before logging it.
Can I connect a hardware push-to-talk button?
Yes, with some setup. On macOS, a USB or Bluetooth button that sends a keyboard shortcut can trigger macOS Dictation. Configure the shortcut in System Settings → Keyboard → Dictation.
Ready to try DenchClaw? Install in one command: npx denchclaw. Full setup guide →
