Making Local TTS Actually Stream: Fixing Kokoro FastAPI for Real-Time Audio
If you’ve been following along with my local AI setup, you’ll know I run most of my services in Proxmox VE LXC or Podman containers. One of those is Kokoro, a self-hosted text-to-speech service based on the Kokoro-82M ONNX model. The generated audio is surprisingly good, and the model is small enough that inference is fast even on a CPU. There are fifty-nine voices covering several languages, and, importantly, it exposes an OpenAI-compatible API that plugs straight into Open WebUI.
On its own that would not have warranted a blog post, since it’s just a text-to-speech engine in a repo. But, no surprise, what started as a simple Firefox bug fix turned into a streaming pipeline investigation: the usual benchmarks, agent-assisted code analysis, a duplicate container sandbox, and finally a fix that meaningfully reduces time-to-first-audio for conversational use cases.
The Firefox Bug
It worked fine in Chrome, but produced an error in Firefox when clicking Generate Speech:
The culprit was a single line in AudioService.js:
this.sourceBuffer = this.mediaSource.addSourceBuffer('audio/mpeg');

It turns out Firefox does not support audio/mpeg in Media Source Extensions (MSE). The fix was to test for support and fall back when MSE is unavailable:
if (!window.MediaSource || !MediaSource.isTypeSupported('audio/mpeg')) {
await this.setupBufferedStream(stream, response, onProgress, estimatedChunks);
return;
}

The setupBufferedStream fallback collects all incoming audio chunks into a Blob and sets it as a plain audio.src. No MSE required, and it works everywhere. Rather than rebuilding the image, the patched file was saved locally and injected with podman cp.
Benchmarking: Does Format or Voice Matter?
With the Firefox issue sorted, I ran a latency benchmark, then another and another, across three formats (mp3, pcm and wav) and three voices. The test phrase was short and topical, since I'm seeking some consultancy:
“Hi Mediclinic, your EHR project sounds interesting and has the potential for a lot of impact.”
Three runs per combination, stream: false, measured with Python’s time.perf_counter().
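A minimal sketch of the benchmark harness I used. The payload fields follow the OpenAI-compatible speech API that Kokoro exposes; the localhost:8880 URL is an assumption about the local deployment:

```python
import json
import time
import urllib.request

API_URL = "http://localhost:8880/v1/audio/speech"  # assumed local Kokoro port

def build_payload(voice: str, fmt: str, text: str) -> dict:
    """Request body for the OpenAI-compatible speech endpoint."""
    return {"model": "kokoro", "voice": voice, "input": text,
            "response_format": fmt, "stream": False}

def timed(fn):
    """Run fn once, return (elapsed_ms, result)."""
    t0 = time.perf_counter()
    result = fn()
    return round((time.perf_counter() - t0) * 1000), result

def synthesize(payload: dict) -> bytes:
    """POST the payload and return the raw audio bytes."""
    req = urllib.request.Request(
        API_URL, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Three runs per (format, voice) combination, averaged (needs a live server):
# runs = [timed(lambda: synthesize(build_payload("af_heart", "mp3", PHRASE)))[0]
#         for _ in range(3)]
# print(sum(runs) / len(runs))
```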
By format (averaged across all voices)
| Format | Avg latency | File size |
|---|---|---|
| WAV | 1382 ms | ~256 KB |
| PCM | 1417 ms | ~256 KB |
| MP3 | 1457 ms | ~86 KB |
By voice (averaged across all formats)
The available voices cover English, Japanese, Mandarin, Spanish, French, Hindi, Italian, and Brazilian Portuguese. I tested three:
| Voice | Description | Avg latency |
|---|---|---|
| af_heart | American English female | 1379 ms |
| bm_fable | British English male | 1439 ms |
| ef_dora | Spanish female | 1438 ms |
The takeaway: format and voice choice barely matter for latency. The ONNX inference dominates; everything else (MP3 encoding, voice model differences) contributes at most ~80 ms. Since MP3 encoding time is minimal, MP3 is the right choice for web playback given its file size advantage.
The Spanish voice (ef_dora) performs on par with the English voices, which is a good sign for multilingual deployments.
Can we go faster?
While reading the documentation (yes, a bad habit I developed when I was young), I spotted that the API has a stream: true parameter. For conversational applications I thought this would be useful, and a simple switch to enable... You can probably guess that I was being naive: I assumed I could just enable the flag and the server would stream audio during generation, reducing perceived latency. It turns out that streaming works at the sentence level, so I split the test phrase to start with a nice short opening sentence:
“Hi Mediclinic. Your EHR project sounds interesting and has the potential for a lot of impact.”
Then Claude wrote some Python to track exactly when each 1 KB chunk arrived at the client:
import time
import urllib.request

# req is the prepared urllib.request.Request for the TTS endpoint
t_start = time.perf_counter()
chunks = []
with urllib.request.urlopen(req) as resp:
    while True:
        chunk = resp.read(1024)
        if not chunk:
            break
        t = round((time.perf_counter() - t_start) * 1000)
        chunks.append((t, len(chunk)))
print(f"First chunk: {chunks[0][0]}ms")
print(f"Last chunk: {chunks[-1][0]}ms")

Results for stream: true, af_heart, MP3:
First chunk: 1462ms
Last chunk: 1464ms
Chunks: 89

All chunks arrived within 2 ms of each other, after a full 1.4 second wait. stream: false was identical. Even PCM, which has zero encoder overhead, behaved the same. Eh? This didn't seem right. What was going on? Was something buffering the audio before a single byte was sent?
The Rabbit Hole
I made a copy of the base container called kokoro-stream, on port 8881, as an isolated sandbox for Claude to play with. The server code uses async generators and yield statements all the way from the HTTP handler down to the ONNX inference layer, which is good practice. The StreamingResponse even sets X-Accel-Buffering: no, so streaming should work.
Three hypotheses:
| ID | Hypothesis | Evidence |
|---|---|---|
| H1 | ONNX inference batches both sentences as one call | PCM (no encoder) also shows simultaneous delivery |
| H2 | Uvicorn buffers the response body below a threshold | No asyncio yield points between sentence yields |
| H3 | PyAV MP3 encoder buffers early frames | Secondary — can’t explain PCM behaviour |
What the code actually does
Inside tts_service.py, smart_split() splits the input text into chunks before inference, which is good. However, it batches sentences together when their combined token count is under 250 tokens. Guess what? The two-sentence test is only 105 tokens, so both sentences were delivered as a single string to KokoroV1.generate().
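A simplified sketch of that batching behaviour. This mimics what I observed, not the actual smart_split implementation, and the token counts are illustrative:

```python
def batch_sentences(sentences, token_counts, max_tokens=250):
    """Greedily merge consecutive sentences while the combined
    token count stays under max_tokens (sketch of the observed
    behaviour, not the real smart_split)."""
    batches, current, count = [], [], 0
    for sent, toks in zip(sentences, token_counts):
        if current and count + toks >= max_tokens:
            batches.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += toks
    if current:
        batches.append(" ".join(current))
    return batches

# Two sentences totalling 105 tokens stay under the 250 threshold,
# so they come out as ONE batch, i.e. one inference call:
print(batch_sentences(
    ["Hi Mediclinic.", "Your EHR project sounds interesting."], [5, 100]))
```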
Inside kokoro_v1.py, the pipeline was called with split_pattern=r'\n+', meaning it would only split on newlines, not full stops. Since there were no newlines, both sentences went through as a single inference call producing a single audio file. No amount of downstream async would fix that.
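The two patterns can be compared directly in a Python shell:

```python
import re

text = "Hi Mediclinic. Your EHR project sounds interesting."

# Newline-only pattern: there are no newlines, so the whole
# string stays as ONE chunk:
print(re.split(r'\n+', text))
# ['Hi Mediclinic. Your EHR project sounds interesting.']

# Sentence pattern: split after ., ! or ? followed by whitespace,
# yielding TWO chunks:
print(re.split(r'(?<=[.!?])\s+', text))
# ['Hi Mediclinic.', 'Your EHR project sounds interesting.']
```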
Even if the sentences had been processed separately, the
for result in pipeline(...) loop is synchronous and never
returns control to the asyncio event loop between sentences, so the HTTP
layer has no opportunity to flush.
The Fix
Two changes:
inference/kokoro_v1.py
Change the split pattern so it also breaks on sentence-ending punctuation:
# before
split_pattern=r'\n+'
# after
split_pattern=r'(?<=[.!?])\s+'

inference/kokoro_v1.py and services/tts_service.py
Add yield points:
yield AudioChunk(...)
await asyncio.sleep(0)  # return control to event loop → HTTP layer can flush

Before and after: Time To First Audio (TTFA)
| Metric | Before | After |
|---|---|---|
| First chunk (TTFA) | ~1400 ms | ~575 ms |
| Last chunk | ~1400 ms | ~1400 ms |
| Gap | ~2 ms | ~1100 ms |
The first audio now arrives after ~575 ms, while the second sentence is still being generated. Total generation time is unchanged, unsurprisingly, but perceived latency is much lower, which is just what conversational applications need, like calling several service centres to ask about the availability and cost of servicing a car. I was surprised that online systems here in South Africa don't show available slots; instead they are lead-generation forms, and a human sends an email or calls you a few hours later.
Conclusion
A few things worth noting:
The architecture. Kokoro uses async generators throughout, so the issue wasn’t bad design; it was two small configuration defaults that affect short inputs. The token batching threshold (250 tokens) and the newline-only split pattern make sense in isolation, but together they eliminate sentence-level streaming for my test input.
PCM as a diagnostic tool. Benchmarking PCM (raw samples, no encoding) alongside MP3 was valuable for identifying and eliminating the audio encoder as a suspect early. When PCM and MP3 show similar timings, the bottleneck is upstream of the encoder.
asyncio.sleep(0) is surprisingly powerful. A zero-duration sleep doesn’t actually sleep; it yields control to the event loop. That’s enough for uvicorn to flush pending response bytes to the socket. It’s a one-liner with an outsized impact on latency.
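A toy demonstration of the effect, independent of Kokoro: a producer that never awaits runs to completion before any other task gets a turn, while an await asyncio.sleep(0) after each chunk lets a second task interleave (the stand-in here for uvicorn getting a chance to flush):

```python
import asyncio

events = []

async def producer(cooperative: bool):
    for i in range(3):
        events.append(f"chunk{i}")      # simulate yielding an audio chunk
        if cooperative:
            await asyncio.sleep(0)      # hand control back to the event loop

async def watcher():
    # runs whenever the producer yields control
    for _ in range(3):
        events.append("flush")
        await asyncio.sleep(0)

async def main(cooperative: bool):
    events.clear()
    await asyncio.gather(producer(cooperative), watcher())
    return list(events)

# Without sleep(0) the producer emits all chunks before the watcher
# ever runs; with it, chunks and flushes interleave.
print(asyncio.run(main(False)))
print(asyncio.run(main(True)))
```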
Environment: Podman on Ubuntu 24.04. Kokoro image: ghcr.io/remsky/kokoro-fastapi-cpu:latest. Voices used: af_heart, bm_fable, ef_dora.