Fast Local Transcription, Remote Fallback When Confidence Drops
The best speech system should feel local most of the time and still recover when local output is uncertain. That means running a fast local pass first, measuring confidence or disagreement, and falling back to a stronger remote model only for the segments that need it.
| Path | Best for | Tradeoff |
|---|---|---|
| Local | Short, clean dictation and privacy-sensitive text. | Lowest latency, but may struggle with noise, names, or domain terms. |
| Remote | Noisy, accented, long-form, or high-value segments. | Better recovery, but adds network latency and provider cost. |
| Hybrid | Interactive control where speed matters but errors are expensive. | Requires a confidence gate and continuous eval. |
The router is an eval problem
A fallback strategy is only useful if it improves the combined system. The eval has to report four numbers together: local accept rate, remote fallback rate, routed WER, and routed latency. Optimizing only WER sends everything remote. Optimizing only latency accepts too many bad transcripts.
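To make the four numbers concrete, here is a minimal sketch of how they could be aggregated from per-segment records. The `Segment` fields and the `routed_report` function are hypothetical names for illustration, not the eval's actual schema. It also assumes a fallback segment pays for the local pass and then the remote call in sequence; a system that runs both engines in parallel would charge latency differently.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical per-segment record; field names are illustrative only.
    ref_words: int        # word count of the human reference transcript
    local_errors: int     # word errors in the local transcript
    remote_errors: int    # word errors in the remote transcript
    local_ms: float       # local engine latency in milliseconds
    remote_ms: float      # remote engine latency in milliseconds
    accepted_local: bool  # True if the router kept the local output


def routed_report(segments):
    """Aggregate the four routing metrics over a list of segments."""
    n = len(segments)
    accepted = sum(1 for s in segments if s.accepted_local)
    # Errors come from whichever transcript the router actually used.
    errors = sum(
        s.local_errors if s.accepted_local else s.remote_errors
        for s in segments
    )
    ref_words = sum(s.ref_words for s in segments)
    # Assumption: a fallback runs the local pass first, then the remote call.
    total_ms = sum(
        s.local_ms + (0.0 if s.accepted_local else s.remote_ms)
        for s in segments
    )
    return {
        "local_accept_rate": accepted / n,
        "remote_fallback_rate": (n - accepted) / n,
        "routed_wer": errors / ref_words,
        "routed_latency_ms": total_ms / n,
    }


segments = [
    Segment(10, 1, 0, 100.0, 900.0, accepted_local=True),
    Segment(10, 4, 1, 120.0, 1000.0, accepted_local=False),
]
report = routed_report(segments)
```

Reporting all four values from one function makes it harder to tune one in isolation: sending every segment remote shows up immediately as a fallback rate of 1.0 and a higher routed latency, even though routed WER improves.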
Our STT eval now records per-engine latency and can simulate a hybrid route: accept local output when its disagreement with the Gemini reference stays under a threshold, otherwise use the remote transcript. That is not the production confidence model by itself, but it is a practical offline harness for testing whether the routing idea moves the accuracy-latency tradeoff in the right direction.
    python3 e2e/stt_eval.py \
      --dictatorflow-bin ./voicetypezig/zig-out/bin/dictatorflow \
      --hybrid-local-gemini-threshold 0.08
    # reports local accept rate, remote fallback rate,
    # hybrid WER vs human, and hybrid latency
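The simulated gate can be sketched in a few lines. This is an illustration under stated assumptions, not the code in `e2e/stt_eval.py`: it assumes disagreement is the word-level edit distance between the local and remote transcripts, normalized by the remote transcript's word count, and the function names `word_edit_distance`, `disagreement`, and `route` are hypothetical. The default threshold mirrors the 0.08 in the command above.

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, 1):
        cur = [i]
        for j, word_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                       # deletion
                cur[j - 1] + 1,                    # insertion
                prev[j - 1] + (word_a != word_b),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def disagreement(local_text, remote_text):
    """Assumed metric: word edits needed to turn local into remote,
    divided by the remote word count."""
    local = local_text.lower().split()
    remote = remote_text.lower().split()
    if not remote:
        return 0.0 if not local else 1.0
    return word_edit_distance(local, remote) / len(remote)


def route(local_text, remote_text, threshold=0.08):
    """Keep the local transcript when disagreement stays under the threshold."""
    if disagreement(local_text, remote_text) < threshold:
        return "local", local_text
    return "remote", remote_text


same = route("open the file", "open the file")
differs = route("open the vial", "open the file")
```

Note that this gate needs the remote transcript for every segment, which is affordable offline but would cancel the latency benefit in production. That is why it works as a harness for testing thresholds rather than as the confidence model itself.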
Why this matters for computer control
A wrong word in a paragraph is annoying. A wrong command can be worse. Hybrid transcription lets us keep fast local text for the common case while escalating ambiguous control phrases, file names, code symbols, and long noisy audio to stronger remote models.