Making Local TTS Actually Stream: Fixing Kokoro FastAPI for Real-Time Audio
If you’ve been following along with my local AI setup, you’ll know I run most of my services in Proxmox VE LXC or Podman containers. One of those is Kokoro, a self-hosted text-to-speech service based on the Kokoro-82M ONNX model. The generated audio is surprisingly good, and the model is small enough that inference is fast even on a CPU. There are fifty-nine voices covering several languages, and, importantly, it exposes an OpenAI-compatible API that plugs straight into Open WebUI.
On its own that would not have warranted a blog post, since it’s just a text-to-speech engine in a repo. But, no surprise, what started as a simple Firefox bug fix turned into a streaming pipeline investigation: the usual benchmarks, agent-assisted code analysis, a duplicate container sandbox, and finally a fix that meaningfully reduces time-to-first-audio for conversational use cases.
The Firefox Bug
It worked fine in Chrome, but produced an error in Firefox when clicking Generate Speech:
The culprit was a single line in AudioService.js:
this.sourceBuffer = this.mediaSource.addSourceBuffer('audio/mpeg');

It turns out Firefox does not support audio/mpeg in Media Source Extensions (MSE). The fix was to test for support and fall back when MSE is unavailable:
if (!window.MediaSource || !MediaSource.isTypeSupported('audio/mpeg')) {
await this.setupBufferedStream(stream, response, onProgress, estimatedChunks);
return;
}

The setupBufferedStream fallback collects all incoming audio chunks into a Blob and sets it as a plain audio.src. No MSE required, and it works everywhere. Rather than rebuilding the image, the patched file was saved locally and injected with podman cp.
Benchmarking: Does Format or Voice Matter?
With the Firefox issue sorted, I ran a latency benchmark, then another and another, across three formats (mp3, pcm and wav) and three voices. The test phrase was short and topical, since I'm seeking some consultancy:
“Hi Mediclinic, your EHR project sounds interesting and has the potential for a lot of impact.”
Three runs per combination, stream: false, measured with Python’s time.perf_counter().
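A minimal sketch of the benchmark harness I used. The payload fields follow the OpenAI-compatible speech API that Kokoro exposes; the localhost:8880 URL is an assumption about the local deployment:

```python
import json
import time
import urllib.request

API_URL = "http://localhost:8880/v1/audio/speech"  # assumed local Kokoro port

def build_payload(voice: str, fmt: str, text: str) -> dict:
    """Request body for the OpenAI-compatible speech endpoint."""
    return {"model": "kokoro", "voice": voice, "input": text,
            "response_format": fmt, "stream": False}

def timed(fn):
    """Run fn once, return (elapsed_ms, result)."""
    t0 = time.perf_counter()
    result = fn()
    return round((time.perf_counter() - t0) * 1000), result

def synthesize(payload: dict) -> bytes:
    """POST the payload and return the raw audio bytes."""
    req = urllib.request.Request(
        API_URL, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Three runs per (format, voice) combination, averaged (needs a live server):
# runs = [timed(lambda: synthesize(build_payload("af_heart", "mp3", PHRASE)))[0]
#         for _ in range(3)]
# print(sum(runs) / len(runs))
```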
By format (averaged across all voices)
| Format | Avg latency | File size |
|---|---|---|
| WAV | 1382 ms | ~256 KB |
| PCM | 1417 ms | ~256 KB |
| MP3 | 1457 ms | ~86 KB |
By voice (averaged across all formats)
The available voices cover English, Japanese, Mandarin, Spanish, French, Hindi, Italian, and Brazilian Portuguese. I tested three:
| Voice | Description | Avg latency |
|---|---|---|
| af_heart | American English female | 1379 ms |
| bm_fable | British English male | 1439 ms |
| ef_dora | Spanish female | 1438 ms |
The takeaway: format and voice choice barely matter for latency. The ONNX inference dominates; everything else (MP3 encoding, voice model differences) contributes at most ~80 ms. Since MP3 encoding time is minimal, MP3 is the right choice for web playback given its file size advantage.
The Spanish voice (ef_dora) performs on par with the English voices, which is a good sign for multilingual deployments.
Can we go faster?
While reading the documentation (yes, a bad habit I developed when I was young), I spotted that the API has a stream: true parameter. For conversational applications I thought this would be useful, and a simple switch to enable... You can probably guess that I was being naive: I assumed I could just enable the flag and the server would stream audio during generation, reducing perceived latency. It turns out that streaming works at the sentence level, so I split the test phrase to start with a nice short opening sentence:
“Hi Mediclinic. Your EHR project sounds interesting and has the potential for a lot of impact.”
Then Claude wrote some Python to track exactly when each 1 KB chunk arrived at the client:
import time
import urllib.request

# req is the prepared urllib.request.Request for the TTS endpoint
t_start = time.perf_counter()
chunks = []
with urllib.request.urlopen(req) as resp:
    while True:
        chunk = resp.read(1024)
        if not chunk:
            break
        t = round((time.perf_counter() - t_start) * 1000)
        chunks.append((t, len(chunk)))
print(f"First chunk: {chunks[0][0]}ms")
print(f"Last chunk: {chunks[-1][0]}ms")

Results for stream: true, af_heart, MP3:
First chunk: 1462ms
Last chunk: 1464ms
Chunks: 89

All chunks arrived within 2 ms of each other, after a full 1.4 second wait. stream: false was identical. Even PCM, which has zero encoder overhead, behaved the same. Eh? This didn't seem right. What was going on? Was something buffering the audio before a single byte was sent?
The Rabbit Hole
I made a copy of the base container called kokoro-stream, on port 8881, as an isolated sandbox for Claude to play with. The server code uses async generators and yield statements all the way from the HTTP handler down to the ONNX inference layer, which is good practice. The StreamingResponse even sets X-Accel-Buffering: no, so streaming should work.
Three hypotheses:
| ID | Hypothesis | Evidence |
|---|---|---|
| H1 | ONNX inference batches both sentences as one call | PCM (no encoder) also shows simultaneous delivery |
| H2 | Uvicorn buffers the response body below a threshold | No asyncio yield points between sentence yields |
| H3 | PyAV MP3 encoder buffers early frames | Secondary — can’t explain PCM behaviour |
What the code actually does
Inside tts_service.py, smart_split() splits the input text into chunks before inference, which is good. However, it batches sentences together when their combined token count is under 250 tokens. Guess what? The two-sentence test is only 105 tokens, so both sentences were delivered as a single string to KokoroV1.generate().
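A simplified sketch of that batching behaviour. This mimics what I observed, not the actual smart_split implementation, and the token counts are illustrative:

```python
def batch_sentences(sentences, token_counts, max_tokens=250):
    """Greedily merge consecutive sentences while the combined
    token count stays under max_tokens (sketch of the observed
    behaviour, not the real smart_split)."""
    batches, current, count = [], [], 0
    for sent, toks in zip(sentences, token_counts):
        if current and count + toks >= max_tokens:
            batches.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += toks
    if current:
        batches.append(" ".join(current))
    return batches

# Two sentences totalling 105 tokens stay under the 250 threshold,
# so they come out as ONE batch, i.e. one inference call:
print(batch_sentences(
    ["Hi Mediclinic.", "Your EHR project sounds interesting."], [5, 100]))
```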
Inside kokoro_v1.py, the pipeline was called with split_pattern=r'\n+', meaning it would only split on newlines, not full stops. Since there were no newlines, both sentences went through as a single inference call producing a single audio file. No amount of downstream async would fix that.
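The two patterns can be compared directly in a Python shell:

```python
import re

text = "Hi Mediclinic. Your EHR project sounds interesting."

# Newline-only pattern: there are no newlines, so the whole
# string stays as ONE chunk:
print(re.split(r'\n+', text))
# ['Hi Mediclinic. Your EHR project sounds interesting.']

# Sentence pattern: split after ., ! or ? followed by whitespace,
# yielding TWO chunks:
print(re.split(r'(?<=[.!?])\s+', text))
# ['Hi Mediclinic.', 'Your EHR project sounds interesting.']
```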
Even if the sentences had been processed separately, the
for result in pipeline(...) loop is synchronous and never
returns control to the asyncio event loop between sentences, so the HTTP
layer has no opportunity to flush.
The Fix
Two changes:
inference/kokoro_v1.py
Change the split pattern so it also breaks on sentence-ending punctuation:
# before
split_pattern=r'\n+'
# after
split_pattern=r'(?<=[.!?])\s+'

inference/kokoro_v1.py and services/tts_service.py
Add yield points:
yield AudioChunk(...)
await asyncio.sleep(0)  # return control to event loop → HTTP layer can flush

Before and after: Time To First Audio (TTFA)
| Metric | Before | After |
|---|---|---|
| First chunk (TTFA) | ~1400 ms | ~575 ms |
| Last chunk | ~1400 ms | ~1400 ms |
| Gap | ~2 ms | ~1100 ms |
The first audio now arrives after ~575 ms, while the second sentence is still being generated. Total generation time is unchanged, unsurprisingly, but perceived latency is much lower, which is just what conversational applications need, like calling several service centres to ask about the availability and cost of servicing a car. I was surprised that online systems here in South Africa don't show available slots; instead they are lead-generation forms, and a human sends an email or calls you a few hours later.
Conclusion
A few things worth noting:
The architecture. Kokoro uses async generators throughout, so the issue wasn’t bad design; it was two small configuration defaults that affect short inputs. The token batching threshold (250 tokens) and the newline-only split pattern make sense in isolation, but together they eliminate sentence-level streaming for my test input.
PCM as a diagnostic tool. Benchmarking PCM (raw samples, no encoding) alongside MP3 was valuable for identifying and eliminating the audio encoder as a suspect early. When PCM and MP3 show similar timings, the bottleneck is upstream of the encoder.
asyncio.sleep(0) is surprisingly powerful. A zero-duration sleep doesn’t actually sleep; it yields control to the event loop. That’s enough for uvicorn to flush pending response bytes to the socket. It’s a one-liner with an outsized impact on latency.
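A toy demonstration of the effect, independent of Kokoro: a producer that never awaits runs to completion before any other task gets a turn, while an await asyncio.sleep(0) after each chunk lets a second task interleave (the stand-in here for uvicorn getting a chance to flush):

```python
import asyncio

events = []

async def producer(cooperative: bool):
    for i in range(3):
        events.append(f"chunk{i}")      # simulate yielding an audio chunk
        if cooperative:
            await asyncio.sleep(0)      # hand control back to the event loop

async def watcher():
    # runs whenever the producer yields control
    for _ in range(3):
        events.append("flush")
        await asyncio.sleep(0)

async def main(cooperative: bool):
    events.clear()
    await asyncio.gather(producer(cooperative), watcher())
    return list(events)

# Without sleep(0) the producer emits all chunks before the watcher
# ever runs; with it, chunks and flushes interleave.
print(asyncio.run(main(False)))
print(asyncio.run(main(True)))
```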
Environment: Podman on Ubuntu 24.04. Kokoro image: ghcr.io/remsky/kokoro-fastapi-cpu:latest. Voices used: af_heart, bm_fable, ef_dora.