
Voice input

Speak your DojOps commands instead of typing them. All transcription happens locally via whisper.cpp — no audio ever leaves your machine.

Difficulty: Beginner
Duration: 25 minutes (includes building whisper.cpp from source)
What you’ll build: A fully working voice-input setup that lets you speak prompts in chat, plan, and autonomous agent mode


What you’ll learn

  • Install whisper.cpp and SoX for your operating system
  • Verify the setup with dojops doctor
  • Use /voice inside interactive chat sessions
  • Speak a plan goal with dojops plan --voice
  • Run autonomous tasks with dojops auto --voice
  • Combine voice with --execute and --yes for hands-free workflows
  • Diagnose the most common voice setup failures

Prerequisites

  • DojOps 1.1.6 installed (npm i -g @dojops/cli)
  • A working microphone
  • A C compiler, cmake, and git (for building whisper.cpp — covered below)

Workshop steps

Step 1: Install build tools and SoX

Voice input requires two system dependencies: SoX records audio from your microphone, and whisper.cpp converts that audio to text. Install the build tools first, then SoX.

macOS:

xcode-select --install
brew install cmake sox

Linux / WSL (Debian / Ubuntu):

sudo apt update && sudo apt install -y \
  build-essential cmake git sox libsox-fmt-all

Linux (Fedora / RHEL):

sudo dnf install -y gcc gcc-c++ cmake git sox sox-plugins-freeworld

Linux (Arch):

sudo pacman -S base-devel cmake git sox

Verify SoX is installed and can see your audio device:

sox --version
SoX v14.4.2

If SoX is installed but can’t access your microphone on Linux, check that your user is in the audio group:

groups $USER | grep audio

If audio isn’t listed: sudo usermod -aG audio $USER, then log out and back in.

Windows: SoX is available from sox.sourceforge.net. Voice input on Windows is experimental and may have audio device issues.


Step 2: Install whisper.cpp

DojOps builds whisper.cpp from source and places everything in ~/.dojops/toolchain/ — no system permissions needed. The default model (ggml-base.en.bin) is approximately 142 MB and downloads automatically.

dojops toolchain install whisper-cpp
◐ Cloning whisper.cpp v1.7.3...
◐ Building (this takes 1-3 minutes)...
✓ whisper-cpp v1.7.3 installed
  Binary: ~/.dojops/toolchain/bin/whisper-cli
◐ Downloading model ggml-base.en.bin (142 MB)...
✓ Model saved to ~/.dojops/voice/ggml-base.en.bin

The build duration depends on your hardware. On a modern machine it’s about 90 seconds. The model download depends on your connection speed.

Build requirements: whisper.cpp compiles from C/C++ source, which requires gcc/g++ (or clang), make, and about 512 MB of RAM during compilation. On minimal containers or CI runners, install build-essential (Debian/Ubuntu) or base-devel (Arch) first. The resulting binary is small (~2 MB).
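On a minimal container it can save a failed build to confirm those tools are present before running the install. A small sketch (the tool list is an assumption based on the requirements above; adjust it for your distribution):

```shell
#!/bin/sh
# Report any build tools missing from PATH before attempting the
# whisper.cpp build.
check_tools() {
  missing=""
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
  echo "all build prerequisites found"
}

check_tools cc make cmake git || echo "install the missing tools, then re-run the toolchain install"
```

`command -v` is POSIX, so this works in any sh-compatible shell without extra dependencies.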

The whisper.cpp installation is global to your user account, not per-project. You only need to do this once.


Step 3: Verify the setup

Run the doctor command and check the Voice section:

dojops doctor
System checks
─────────────────────────────────────────
Node.js        pass  v20.11.0
DojOps CLI     pass  v1.1.6
Provider       pass  openai (gpt-4o)

Voice
─────────────────────────────────────────
whisper.cpp    pass  Found (~/.dojops/toolchain/bin/whisper-cli)
SoX (rec)      pass  Found (/usr/bin/rec)
Whisper model  pass  Found (~/.dojops/voice/ggml-base.en.bin, 142 MB)

Optional tools
─────────────────────────────────────────
trivy          pass  v0.49.1
checkov        warn  Not installed (IaC scanning unavailable)

All three Voice lines must show pass. If any shows warn or fail, go back to Step 1 or 2 for the specific dependency.

Test actual transcription before moving on:

dojops voice-test
● Recording for 3 seconds... (say something)
● Transcribed: "hello this is a test"
✓ Voice input working correctly

If the transcription is empty or garbled, see the Troubleshooting section at the bottom of this page, in particular the entries on failed recordings and empty transcriptions.


Step 4: Use voice in chat

Start an interactive chat session:

dojops chat

At the prompt, type /voice to start a recording:

┌ DojOps Interactive Chat
◆ You: /voice
● Recording... Speak now (press Enter to stop, max 30s)

Speak your question clearly, then press Enter to stop recording. The audio is transcribed locally:

● Transcribed: "What CI tools are configured in this project?"
◇ DojOps: Based on the project files, I can see:
│  · GitHub Actions workflow at .github/workflows/ci.yml
│    (Node 20/22 matrix, pnpm, Vitest)
│  · Dockerfile with multi-stage build
◆ You: _

You can use /voice as many times as you want during a session. Typed and voice inputs mix freely — use whichever is faster for each message.

A practical tip: start chat with --voice to validate dependencies before you get deep into a session.

dojops chat --voice
✓ Voice input ready (whisper-cli + ggml-base.en.bin)
┌ DojOps Interactive Chat
◆ You: _

If a dependency is missing, the error appears here at startup instead of mid-conversation.


Step 5: Speak a plan goal

Use --voice with the plan command to speak a multi-step goal instead of typing it:

dojops plan --voice
● Recording... Speak your plan goal (press Enter to stop, max 30s)
● Transcribed: "Set up CI/CD for a Node.js app with Docker and push to GitHub Container Registry"
◇ Decomposing into tasks...
◇ Plan: Set up CI/CD for a Node.js app with Docker (4 tasks)
│  task-1  github-actions: Create GitHub Actions CI workflow
│  task-2  dockerfile: Create multi-stage Dockerfile
│  task-3  docker-compose: Create docker-compose.yml for local dev
│  task-4  github-actions: Create release workflow for GHCR push
◆ Plan saved as plan-50372126

The transcription fills in the prompt. Everything after that works exactly as if you had typed it.


Step 6: Execute immediately after speaking

Voice composes with all plan flags. Speak and execute in one command:

dojops plan --voice --execute

This records your goal, decomposes it into tasks, shows you the plan, and waits for approval before executing each task.

For fully automated execution with no approval prompts:

dojops plan --voice --execute --yes

Use --yes only when you’ve already reviewed the plan pattern and trust the output. A good workflow is to run --execute first (with approval prompts) on a new class of task, then --execute --yes once you’re confident in what the plan produces.


Step 7: Speak an autonomous agent task

The autonomous agent accepts voice input the same way:

dojops auto --voice
● Recording... Speak your task (press Enter to stop, max 30s)
● Transcribed: "Check our Dockerfile and update it to follow current best practices"
◐ Starting autonomous agent...
◇ read_file Dockerfile
◇ search_files .dockerignore
◇ run_skill dockerfile (with current file as context)
◇ edit_file Dockerfile (pinned base image, added .dockerignore, removed root user)
◇ run_command docker build --dry-run .
◇ done
✓ Done  3 changes applied · 6 iterations · 5,200 tokens

The agent receives your transcribed text as its task and runs the full iterative loop. Voice is purely an input mechanism — the agent behavior is identical to typing the prompt.


Step 8: How recording works

A few specifics about recording behavior that affect day-to-day use:

  • Press Enter or Space to stop recording. Ctrl+C also stops without exiting.
  • The maximum duration is 30 seconds. Recording stops automatically at the limit.
  • SoX records 16kHz mono WAV — exactly what whisper.cpp expects.
  • The .wav file is written to a temp directory and deleted immediately after transcription.
  • No audio data leaves your machine. whisper.cpp runs the model locally.

The 30-second cap is enough for most tasks: most people can describe a DevOps task in 10–15 seconds. If you find yourself running out of time, split the work into smaller prompts and chain them in chat.
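Given the 16 kHz mono format, the temp file stays small even at the 30-second cap. A quick back-of-the-envelope check (assuming 16-bit samples, the common bit depth for WAV recordings):

```shell
# 16,000 samples/s x 2 bytes/sample x 30 s, plus a 44-byte WAV header
echo $((16000 * 2 * 30 + 44))   # 960044 bytes, i.e. just under 1 MB
```

So even a maximum-length recording costs well under a megabyte of temp-directory space before it is deleted.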


Step 9: Set a custom model (optional)

The default model (ggml-base.en.bin) is optimized for English. If you work in another language, download a multilingual model and point DojOps to it:

# Download the multilingual base model (~142 MB)
curl -L -o ~/.dojops/voice/ggml-base.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.bin

# Use it for all sessions
export DOJOPS_WHISPER_MODEL=~/.dojops/voice/ggml-base.bin

Add the export to your shell profile (~/.zshrc or ~/.bashrc) to make it permanent.
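If you revisit this step, a guarded append keeps the profile from accumulating duplicate lines. A sketch (the variable and path come from this page; the idempotent-append pattern itself is generic):

```shell
#!/bin/sh
# Append a line to a file only if that exact line isn't already present,
# so re-running this step stays idempotent.
append_once() {
  line="$1"
  file="$2"
  grep -qxF "$line" "$file" 2>/dev/null || echo "$line" >> "$file"
}

append_once 'export DOJOPS_WHISPER_MODEL=~/.dojops/voice/ggml-base.bin' "$HOME/.zshrc"
```

Swap `~/.zshrc` for `~/.bashrc` if bash is your login shell.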

If you have a custom-built whisper.cpp binary:

export DOJOPS_WHISPER_BIN=/usr/local/bin/whisper-cli

Most users don’t need either of these — DojOps auto-detects the binary and model from ~/.dojops/toolchain/ and ~/.dojops/voice/.


Try it yourself

Challenge 1: Run dojops chat --voice and conduct a full multi-turn conversation using only voice input. Start by asking about your project structure, then ask for a CI workflow, then ask for a Dockerfile. No typing beyond /voice for each turn.

Challenge 2: Use dojops plan --voice --execute to generate and apply a real config. Compare the quality of the output to a typed prompt — does speaking produce a more or less detailed prompt than you normally type?

Challenge 3: Download the large whisper model (ggml-large-v3.bin, ~3 GB) and compare transcription accuracy against the base model on a technical prompt with tool names and acronyms like “GHCR”, “trivy”, “kubectl”, and “Terraform”.


Troubleshooting

“Voice input requires: whisper.cpp binary”

The binary isn’t installed or can’t be found. Run dojops toolchain install whisper-cpp. If the build fails, check that cmake and a C compiler are installed:

cmake --version
gcc --version    # or: clang --version on macOS

If cmake is missing, install it via your package manager and retry.

“Voice input requires: sox”

SoX isn’t installed. Go back to Step 1 and install it for your OS. Verify with sox --version after installation.

“Recording failed — no audio captured”

Your microphone isn’t accessible. On Linux, test the microphone directly:

rec test.wav trim 0 3 && aplay test.wav

If rec fails, the issue is at the OS level. Check your audio device configuration and verify your user is in the audio group. On macOS, check System Settings > Privacy & Security > Microphone and ensure Terminal (or your terminal emulator) has permission.

Transcription returns empty text or obvious garbage

Two common causes: the microphone volume is too low, or you’re using the English-only model for non-English speech. For low volume, speak closer to the microphone and increase input gain in your system audio settings. For non-English, download the multilingual model as described in Step 9.


What you learned

Voice input in DojOps is a local-only feature — no audio data leaves your machine. SoX captures the recording, whisper.cpp runs inference on the model at ~/.dojops/voice/ggml-base.en.bin, and the transcribed text is passed to the same pipeline as a typed prompt. The --voice flag works identically across chat, plan, and auto commands, and it composes with other flags like --execute and --yes. Once the setup is working, the only difference from typed input is how you provide the prompt.


Next steps