How We Optimize Speed Without Sacrificing WER
Speech products are easy to make faster by making them worse. The real work is improving latency while holding or lowering word error rate (WER). That requires measuring the full pipeline, not just the model call, across each of these stages:
- Audio preprocessing: normalization, sample rate, compression, and silence handling.
- Provider order: fastest reliable path first, stronger fallbacks behind it.
- Retry policy: retry transient failures without wasting time on quota or auth errors.
- Chunk fallback: recover long or difficult audio without making short dictation pay the cost.
- Local/remote routing: keep easy segments local and escalate hard ones.
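One way to see where the time goes across these stages is to time each one separately instead of timing only the model call. The following is a minimal sketch, not DictatorFlow's actual instrumentation: `StageTimer` and the stage names are hypothetical, and the injectable `clock` parameter exists only to make the helper easy to test.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock seconds per named pipeline stage (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        """Time the body of a `with` block and add it to the stage's total."""
        start = self._clock()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raised.
            self.totals[name] += self._clock() - start

    def report(self):
        """Return {stage: (seconds, share_of_total)} for every recorded stage."""
        total = sum(self.totals.values())
        return {
            name: (secs, secs / total if total else 0.0)
            for name, secs in self.totals.items()
        }
```

Wrapping each step in `with timer.stage("preprocess"):`, `with timer.stage("transcribe"):`, and so on makes it visible when upload or preprocessing, rather than the model, dominates time-to-result.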
The provider chain is part of performance
DictatorFlow's API does not depend on one transcription provider. It runs a chain with circuit breakers, transient retries, quota-aware cooldowns, and a chunk fallback for harder audio. That means a single slow or failing provider does not define the user experience.
The fastest system is not always the one with the fastest model. It is the one that avoids dead paths quickly, keeps short audio on the low-overhead path, and only pays extra compute when the transcript needs it.
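To make the idea of "avoiding dead paths quickly" concrete, here is a minimal sketch of a provider chain with transient retries, a circuit breaker, and quota-aware cooldowns. It is an illustration under stated assumptions, not the production implementation: the exception classes, `Provider`, `ProviderChain`, and every threshold and cooldown value are hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


class TransientError(Exception):
    """Hypothetical: timeouts, 5xx responses, dropped connections."""


class QuotaError(Exception):
    """Hypothetical: the provider reports an exhausted quota."""


class AuthError(Exception):
    """Hypothetical: bad or missing credentials."""


@dataclass
class Provider:
    name: str
    transcribe: Callable[[bytes], str]
    failures: int = 0        # consecutive transient failures
    open_until: float = 0.0  # provider is skipped until this clock value


@dataclass
class ProviderChain:
    providers: List[Provider]      # ordered: fastest reliable path first
    max_retries: int = 1           # extra attempts after a transient failure
    failure_threshold: int = 3     # transient failures before the breaker opens
    breaker_cooldown_s: float = 30.0
    quota_cooldown_s: float = 300.0
    clock: Callable[[], float] = time.monotonic

    def transcribe(self, audio: bytes) -> str:
        for p in self.providers:
            if self.clock() < p.open_until:
                continue  # dead path: skip it without spending any latency
            for _attempt in range(self.max_retries + 1):
                try:
                    text = p.transcribe(audio)
                    p.failures = 0
                    return text
                except TransientError:
                    p.failures += 1
                    if p.failures >= self.failure_threshold:
                        p.open_until = self.clock() + self.breaker_cooldown_s
                        break
                    # Otherwise fall through and retry the same provider.
                except QuotaError:
                    # Retrying cannot help: cool down and move on.
                    p.open_until = self.clock() + self.quota_cooldown_s
                    break
                except AuthError:
                    # Assumption: disable the provider until it is reconfigured.
                    p.open_until = float("inf")
                    break
        raise RuntimeError("all providers failed or are cooling down")
```

The design point is that only transient failures earn a retry. Quota and auth errors move straight to the next provider, and a provider that is cooling down costs no time at all on later requests.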
Compression can help or hurt
Smaller audio uploads reduce network time, but compression can erase the speech details that models need. Our compression tuning script sweeps sample rates, codecs, and bitrates, then measures file size and WER together. A candidate only wins if it reduces bytes without creating recognition regressions.
GEMINI_API_KEY=... python3 e2e/compression_tuning.py
python3 e2e/stt_eval.py --audio-dir e2e/audio
# optimize for latency and bandwidth,
# reject changes that raise WER
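The acceptance rule for a compression candidate (fewer bytes, no recognition regression) can be sketched in a few lines. This is an illustration of the rule as described here, not the contents of `e2e/compression_tuning.py`: `Candidate`, `pick_winner`, and the `wer_tolerance` parameter are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    label: str       # free-form description of codec, sample rate, bitrate
    size_bytes: int  # encoded size over the evaluation set
    wer: float       # word error rate over the same set


def pick_winner(
    baseline: Candidate,
    candidates: Iterable[Candidate],
    wer_tolerance: float = 0.0,
) -> Optional[Candidate]:
    """Return the smallest candidate that does not regress WER, or None."""
    eligible = [
        c
        for c in candidates
        if c.size_bytes < baseline.size_bytes
        and c.wer <= baseline.wer + wer_tolerance
    ]
    # Among the settings that hold WER, the fewest bytes wins.
    return min(eligible, key=lambda c: c.size_bytes, default=None)
```

A setting that shrinks uploads but raises WER never becomes eligible, so the smallest file cannot win on size alone. Returning `None` means the baseline stays.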
WER is the guardrail
Every speed change needs a regression gate. We track WER against human references when available, compare cloud and local outputs against a Gemini reference, and record latency for each path. The resulting table is what lets us decide whether a change is genuinely faster or merely cheaper.
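For readers who want the gate spelled out, WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. Below is a minimal sketch of that calculation plus a pass/fail check. It assumes simple lowercasing and whitespace tokenization, which may differ from the normalization used in `e2e/stt_eval.py`, and `passes_gate` is a hypothetical helper.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        # Convention chosen for this sketch: an empty reference scores 0.0
        # only when the hypothesis is also empty.
        return 0.0 if not hyp else 1.0

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


def passes_gate(baseline_wer: float, candidate_wer: float,
                max_regression: float = 0.0) -> bool:
    """A speed change passes only if WER does not rise beyond the allowance."""
    return candidate_wer <= baseline_wer + max_regression
```

For example, scoring "turn off the light" against the reference "turn on the light" gives one substitution over four reference words, a WER of 0.25. For voice control that single word is the difference between two opposite commands.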
The principle is simple: improve time-to-result, but never hide accuracy debt. For voice control, the transcript is the command surface, so speed and WER have to move together.