Voice input
Speak your DojOps commands instead of typing them. All transcription happens locally via whisper.cpp — no audio ever leaves your machine.
Difficulty: Beginner
Duration: 25 minutes (includes building whisper.cpp from source)
What you’ll build: A fully working voice-input setup that lets you speak prompts in chat, plan, and autonomous agent mode
What you’ll learn
- Install whisper.cpp and SoX for your operating system
- Verify the setup with `dojops doctor`
- Use `/voice` inside interactive chat sessions
- Speak a plan goal with `dojops plan --voice`
- Run autonomous tasks with `dojops auto --voice`
- Combine voice with `--execute` and `--yes` for hands-free workflows
- Diagnose the most common voice setup failures
Prerequisites
- DojOps 1.1.6 installed (`npm i -g @dojops/cli`)
- A working microphone
- A C compiler, cmake, and git (for building whisper.cpp — covered below)
Workshop steps
Step 1: Install build tools and SoX
Voice input requires two system dependencies: SoX records audio from your microphone, and whisper.cpp converts that audio to text. Install the build tools first, then SoX.
macOS:

```shell
xcode-select --install
brew install cmake sox
```

Linux / WSL (Debian / Ubuntu):

```shell
sudo apt update && sudo apt install -y \
  build-essential cmake git sox libsox-fmt-all
```

Linux (Fedora / RHEL):

```shell
sudo dnf install -y gcc gcc-c++ cmake git sox sox-plugins-freeworld
```

Linux (Arch):

```shell
sudo pacman -S base-devel cmake git sox
```

Verify SoX is installed and can see your audio device:

```shell
sox --version
```

```
SoX v14.4.2
```

If SoX is installed but can’t access your microphone on Linux, check that your user is in the `audio` group:

```shell
groups $USER | grep audio
```

If `audio` isn’t listed, run `sudo usermod -aG audio $USER`, then log out and back in.

Windows: SoX is available from sox.sourceforge.net. Voice input on Windows is experimental and may have audio device issues.
Step 2: Install whisper.cpp
DojOps builds whisper.cpp from source and places everything in ~/.dojops/toolchain/ — no system permissions needed. The default model (ggml-base.en.bin) is approximately 142 MB and downloads automatically.
```shell
dojops toolchain install whisper-cpp
```

```
◐ Cloning whisper.cpp v1.7.3...
◐ Building (this takes 1-3 minutes)...
✓ whisper-cpp v1.7.3 installed
  Binary: ~/.dojops/toolchain/bin/whisper-cli
◐ Downloading model ggml-base.en.bin (142 MB)...
✓ Model saved to ~/.dojops/voice/ggml-base.en.bin
```

The build duration depends on your hardware. On a modern machine it’s about 90 seconds. The model download time depends on your connection speed.
Build requirements: whisper.cpp compiles from C/C++ source, which requires `gcc`/`g++` (or `clang`), `make`, and about 512 MB of RAM during compilation. On minimal containers or CI runners, install `build-essential` (Debian/Ubuntu) or `base-devel` (Arch) first. The resulting binary is small (~2 MB).
The whisper.cpp installation is global to your user account, not per-project. You only need to do this once.
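If you want to confirm what the installer left behind, you can check the two files yourself. The helper below is a sketch of my own — the `check_voice_files` name and its optional base-directory argument are not part of DojOps; only the `~/.dojops` paths come from this tutorial.

```shell
# Sketch: verify the whisper.cpp binary and default model are where the
# installer puts them. "check_voice_files" is a hypothetical helper, not a
# DojOps command; the paths are the ~/.dojops defaults from this tutorial.
check_voice_files() {
  local base="${1:-$HOME/.dojops}"
  local missing=0
  local f
  for f in "$base/toolchain/bin/whisper-cli" "$base/voice/ggml-base.en.bin"; do
    if [ -e "$f" ]; then
      echo "pass  $f"
    else
      echo "fail  $f"
      missing=1
    fi
  done
  return "$missing"
}

# Usage: check_voice_files            # checks ~/.dojops
#        check_voice_files /tmp/alt   # checks an alternate root
```

The function returns non-zero if either file is missing, so it can gate a provisioning script.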
Step 3: Verify the setup
Run the doctor command and check the Voice section:
```shell
dojops doctor
```

```
System checks
─────────────────────────────────────────
Node.js        pass  v20.11.0
DojOps CLI     pass  v1.1.6
Provider       pass  openai (gpt-4o)

Voice
─────────────────────────────────────────
whisper.cpp    pass  Found (~/.dojops/toolchain/bin/whisper-cli)
SoX (rec)      pass  Found (/usr/bin/rec)
Whisper model  pass  Found (~/.dojops/voice/ggml-base.en.bin, 142 MB)

Optional tools
─────────────────────────────────────────
trivy          pass  v0.49.1
checkov        warn  Not installed (IaC scanning unavailable)
```

All three Voice lines must show `pass`. If any shows `warn` or `fail`, go back to Step 1 or 2 for the specific dependency.
Test actual transcription before moving on:
```shell
dojops voice-test
```

```
● Recording for 3 seconds... (say something)
● Transcribed: "hello this is a test"
✓ Voice input working correctly
```

If the transcription is empty or garbled, see “Transcription returns empty text or obvious garbage” in the Troubleshooting section at the bottom of this page.
Step 4: Use voice in chat
Start an interactive chat session:
```shell
dojops chat
```

At the prompt, type `/voice` to start a recording:

```
┌ DojOps Interactive Chat
│
◆ You: /voice
● Recording... Speak now (press Enter to stop, max 30s)
```

Speak your question clearly, then press Enter to stop recording. The audio is transcribed locally:
```
● Transcribed: "What CI tools are configured in this project?"
◇ DojOps: Based on the project files, I can see:
│  · GitHub Actions workflow at .github/workflows/ci.yml
│    (Node 20/22 matrix, pnpm, Vitest)
│  · Dockerfile with multi-stage build
│
◆ You: _
```

You can use `/voice` as many times as you want during a session. Typed and voice inputs mix freely — use whichever is faster for each message.
A practical tip: start chat with `--voice` to validate dependencies before you get deep into a session.
```shell
dojops chat --voice
```

```
✓ Voice input ready (whisper-cli + ggml-base.en.bin)
┌ DojOps Interactive Chat
│
◆ You: _
```

If a dependency is missing, the error appears here at startup instead of mid-conversation.
Step 5: Speak a plan goal
Use `--voice` with the plan command to speak a multi-step goal instead of typing it:
```shell
dojops plan --voice
```

```
● Recording... Speak your plan goal (press Enter to stop, max 30s)
● Transcribed: "Set up CI/CD for a Node.js app with Docker and push to GitHub Container Registry"
◇ Decomposing into tasks...
◇ Plan: Set up CI/CD for a Node.js app with Docker (4 tasks)
│
│  task-1  github-actions: Create GitHub Actions CI workflow
│  task-2  dockerfile: Create multi-stage Dockerfile
│  task-3  docker-compose: Create docker-compose.yml for local dev
│  task-4  github-actions: Create release workflow for GHCR push
│
◆ Plan saved as plan-50372126
```

The transcription fills in the prompt. Everything after that works exactly as if you had typed it.
Step 6: Execute immediately after speaking
Voice composes with all plan flags. Speak and execute in one command:
```shell
dojops plan --voice --execute
```

This records your goal, decomposes it into tasks, shows you the plan, and waits for approval before executing each task.
For fully automated execution with no approval prompts:
```shell
dojops plan --voice --execute --yes
```

Use `--yes` only when you’ve already reviewed the plan pattern and trust the output. A good workflow is to run `--execute` first (with approval prompts) on a new class of task, then `--execute --yes` once you’re confident in what the plan produces.
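If the speak-then-execute pattern becomes routine, a small wrapper function saves typing. This is purely a sketch of my own — `vplan` is not a DojOps command, just a local alias for the flags shown in this step.

```shell
# Hypothetical wrapper ("vplan" is my own name, not part of DojOps).
# Keeps approval prompts on by default; pass --yes explicitly once you
# trust the plan pattern.
vplan() {
  dojops plan --voice --execute "$@"
}

# Usage:
#   vplan        # speak, review, approve each task
#   vplan --yes  # fully hands-free (only for plan patterns you trust)
```

Because extra arguments are forwarded via `"$@"`, any other plan flag composes the same way.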
Step 7: Speak an autonomous agent task
The autonomous agent accepts voice input the same way:
```shell
dojops auto --voice
```

```
● Recording... Speak your task (press Enter to stop, max 30s)
● Transcribed: "Check our Dockerfile and update it to follow current best practices"
◐ Starting autonomous agent...
◇ read_file Dockerfile
◇ search_files .dockerignore
◇ run_skill dockerfile (with current file as context)
◇ edit_file Dockerfile (pinned base image, added .dockerignore, removed root user)
◇ run_command docker build --dry-run .
◇ done
✓ Done
  3 changes applied · 6 iterations · 5,200 tokens
```

The agent receives your transcribed text as its task and runs the full iterative loop. Voice is purely an input mechanism — the agent behavior is identical to typing the prompt.
Step 8: How recording works
A few specifics about recording behavior that affect day-to-day use:
- Press Enter or Space to stop recording. Ctrl+C also stops without exiting.
- The maximum duration is 30 seconds. Recording stops automatically at the limit.
- SoX records 16 kHz mono WAV — exactly what whisper.cpp expects.
- The `.wav` file is written to a temp directory and deleted immediately after transcription.
- No audio data leaves your machine. whisper.cpp runs the model locally.
Thirty seconds is enough even for long, complex tasks — most people can describe a DevOps task in 10–15 seconds. If you find yourself running out of time, split the task into smaller prompts and chain them in chat.
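The recording behavior above can be reproduced by hand, which is handy when debugging. The sketch below drives SoX and whisper.cpp directly; the `whisper-cli` flags (`-m`, `-f`, `-nt`) are upstream whisper.cpp's, not DojOps-specific, and the mic-recording and transcription lines are left commented out so the script runs without audio hardware — it synthesizes a test tone in the same format instead.

```shell
# Manual sketch of the /voice pipeline (assumptions: upstream whisper.cpp
# flags -m/-f/-nt; ~/.dojops paths from this tutorial).
WAV="$(mktemp -d)/prompt.wav"

# 1. Record 16 kHz mono 16-bit WAV, capped at 30 s (needs a microphone):
#      rec -r 16000 -c 1 -b 16 "$WAV" trim 0 30
# Mic-free stand-in: synthesize a 1-second 440 Hz tone in the same format.
if command -v sox >/dev/null 2>&1; then
  sox -n -r 16000 -c 1 -b 16 "$WAV" synth 1 sine 440
fi

# 2. Transcribe locally (-m model, -f input file, -nt = no timestamps):
#      "$HOME/.dojops/toolchain/bin/whisper-cli" \
#        -m "$HOME/.dojops/voice/ggml-base.en.bin" -f "$WAV" -nt

# 3. DojOps deletes the temp .wav right after transcription; do the same:
rm -f "$WAV"
```

Nothing here talks to the network — the same local-only property the built-in flow has.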
Step 9: Set a custom model (optional)
The default model (`ggml-base.en.bin`) is optimized for English. If you work in another language, download a multilingual model and point DojOps to it:
```shell
# Download the multilingual base model (~142 MB)
curl -L -o ~/.dojops/voice/ggml-base.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.bin

# Use it for all sessions
export DOJOPS_WHISPER_MODEL=~/.dojops/voice/ggml-base.bin
```

Add the export to your shell profile (`~/.zshrc` or `~/.bashrc`) to make it permanent.
If you have a custom-built whisper.cpp binary:

```shell
export DOJOPS_WHISPER_BIN=/usr/local/bin/whisper-cli
```

Most users don’t need either of these — DojOps auto-detects the binary and model from `~/.dojops/toolchain/` and `~/.dojops/voice/`.
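The lookup order, as I read it from the behavior described above, is: explicit environment variable first, auto-detected default second. A minimal sketch of that resolution — the `resolve` helper is hypothetical, not DojOps source code:

```shell
# Sketch of env-var override with a default fallback ("resolve" is a
# hypothetical helper, not DojOps code).
resolve() {  # $1 = env value (may be empty), $2 = default
  echo "${1:-$2}"
}

model="$(resolve "${DOJOPS_WHISPER_MODEL:-}" "$HOME/.dojops/voice/ggml-base.en.bin")"
bin="$(resolve "${DOJOPS_WHISPER_BIN:-}" "$HOME/.dojops/toolchain/bin/whisper-cli")"
echo "model:  $model"
echo "binary: $bin"
```

The `${1:-$2}` expansion is the whole trick: it yields `$1` unless `$1` is unset or empty, in which case it falls back to `$2`.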
Try it yourself
Challenge 1: Run `dojops chat --voice` and conduct a full multi-turn conversation using only voice input. Start by asking about your project structure, then ask for a CI workflow, then ask for a Dockerfile. No typing beyond `/voice` for each turn.
Challenge 2: Use `dojops plan --voice --execute` to generate and apply a real config. Compare the quality of the output to a typed prompt — does speaking produce a more or less detailed prompt than you normally type?
Challenge 3: Download the large whisper model (`ggml-large-v3.bin`, ~3 GB) and compare transcription accuracy against the base model on a technical prompt with tool names and acronyms like “GHCR”, “trivy”, “kubectl”, and “Terraform”.
Troubleshooting
“Voice input requires: whisper.cpp binary”
The binary isn’t installed or can’t be found. Run `dojops toolchain install whisper-cpp`. If the build fails, check that cmake and a C compiler are installed:

```shell
cmake --version
gcc --version   # or: clang --version on macOS
```

If cmake is missing, install it via your package manager and retry.
“Voice input requires: sox”
SoX isn’t installed. Go back to Step 1 and install it for your OS. Verify with `sox --version` after installation.
“Recording failed — no audio captured”
Your microphone isn’t accessible. On Linux, test the microphone directly:

```shell
rec test.wav trim 0 3 && aplay test.wav
```

If `rec` fails, the issue is at the OS level. Check your audio device configuration and verify your user is in the `audio` group. On macOS, check System Settings > Privacy & Security > Microphone and ensure Terminal (or your terminal emulator) has permission.
Transcription returns empty text or obvious garbage
Two common causes: the microphone volume is too low, or you’re using the English-only model for non-English speech. For low volume, speak closer to the microphone and increase input gain in your system audio settings. For non-English, download the multilingual model as described in Step 9.
What you learned
Voice input in DojOps is a local-only feature — no audio data leaves your machine. SoX captures the recording, whisper.cpp runs inference on the model at `~/.dojops/voice/ggml-base.en.bin`, and the transcribed text is passed to the same pipeline as a typed prompt. The `--voice` flag works identically across chat, plan, and auto commands, and it composes with other flags like `--execute` and `--yes`. Once the setup is working, the only difference from typed input is how you provide the prompt.
Next steps
- Interactive Chat Sessions — Full chat tutorial with slash commands and agent pinning
- Autonomous Agent Mode — Deep dive into the ReAct loop the agent runs after you speak a task
- Advanced Plan Execution — Dry run, resume, replay, and parallel task execution
- CLI Reference — Full command reference for all `--voice` flags