himmi

SuperSay: On-Device Text-to-Speech That Streams in Under 200ms

Tags: engineering, deep-dive, swift, python, tts, supersay

SuperSay is a native macOS app that turns text into speech — entirely on-device. No cloud TTS API, no text transmitted anywhere by default, 28 voices across 8 languages, running a Kokoro-82M model locally on Apple Silicon. The constraint that makes this interesting isn't the model — it's making local inference feel as fast and seamless as a cloud API that has a server farm behind it, without giving up the privacy guarantee that's the entire point of running on-device in the first place.

This is a deep dive into the architecture decisions that got time-to-first-audio under 200ms despite the model only being able to do one thing at a time, why a "free" performance win (INT8 quantization) got tried, measured, and thrown out, and the production bugs that taught the team to verify before acting.

The problem

Cloud TTS is the easy path: send text to an API, get audio back, the server handles concurrency and scaling. The cost is that every word you ever convert to speech passes through someone else's infrastructure. SuperSay's bet is that on-device inference can deliver comparable speed and quality without that tradeoff — but on-device means you're bound by the Mac's CPU, a model with real thread-safety constraints, and a phonemizer library that wasn't written with this use case in mind. None of that is solvable by buying more servers; it has to be solved with architecture.

Architecture

[Diagram: system architecture]

Solid arrows are the always-on, fully local path. Dashed arrows are the two places anything ever leaves the Mac — both opt-in, both covered in the privacy section below.

| Layer | Choice | Why |
|---|---|---|
| Frontend | Swift 6, SwiftUI + AppKit | Native UI, AppKit for global hotkeys and system services |
| Audio playback | AVAudioEngine | Frame-accurate progress tracking, hardware-synced |
| Backend | Python 3.11, FastAPI, bundled via PyInstaller | Compiled into a standalone macOS binary shipped inside the app |
| TTS model | Kokoro-82M, ONNX runtime | 82M parameters, fast enough for realtime on consumer hardware |
| Phonemizer | espeak-ng (bundled) | Grapheme-to-phoneme for 9 locales |
| Local storage | SQLite (WAL mode) | Audiobook metadata, fully on-device |
| Optional cloud | Supabase (Postgres) + Vercel | Counts-only telemetry, optional email attribution; opt-in, not required for the app to function |

The frontend and backend talk over HTTP, but it's not a network call in any meaningful sense — the backend binds exclusively to 127.0.0.1:10101, loopback only, so nothing outside the Mac can reach it. That detail matters enough that it's worth its own section below.

Design decisions and tradeoffs

Sequential inference, parallel-feeling latency

The single most consequential constraint in the system: espeak-ng keeps C-level shared state (phonemizer tables and language buffers) that is not thread-safe. Run two inference jobs concurrently and that state gets corrupted, producing garbled audio or a crash. So Kokoro inference runs through a ThreadPoolExecutor(max_workers=1): strictly one job at a time, no exceptions.
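A minimal sketch of that single-worker pattern, assuming an asyncio-based backend. The function names, the `stats` dictionary, and the voice ID `af_heart` are illustrative placeholders rather than SuperSay's actual code, and `synthesize_blocking` stands in for the real phonemizer and model call.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# A single worker guarantees that only one job touches the phonemizer at a time.
INFERENCE_EXECUTOR = ThreadPoolExecutor(max_workers=1)

# Bookkeeping for the demo only: tracks how many jobs ever ran at once.
stats = {"active": 0, "peak": 0}


def synthesize_blocking(text: str, voice: str) -> bytes:
    """Hypothetical stand-in for espeak-ng phonemization plus Kokoro inference."""
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    time.sleep(0.01)  # pretend to do the real work
    stats["active"] -= 1
    return f"{voice}:{text}".encode("utf-8")


async def synthesize(text: str, voice: str) -> bytes:
    # Offload to the one-worker pool so the event loop keeps accepting requests
    # while jobs queue up and run strictly one after another.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        INFERENCE_EXECUTOR, synthesize_blocking, text, voice
    )


async def demo() -> list[bytes]:
    # Three requests arrive together; the executor still runs them serially.
    jobs = (synthesize(f"sentence {i}", "af_heart") for i in range(3))
    return await asyncio.gather(*jobs)
```

Running `asyncio.run(demo())` accepts all three requests at once, yet `stats["peak"]` never rises above 1, which is the property the thread-unsafe phonemizer needs.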

That sounds like it should make the app feel slow. It doesn't, because speed comes from pipelining at the sentence level instead of from parallel inference:

  1. Text is split into sentences (5+ words, ending on punctuation).
  2. Sentence 1 is inferred — roughly 150–250ms on Apple Silicon.
  3. While sentence 1 is playing in hardware, sentence 2 is already inferring in the executor.
  4. Time-to-first-audio is just sentence 1's latency, not the whole text's.
  5. Inference time is still the sum of the per-sentence costs, but it overlaps with playback instead of blocking it.
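
[Diagram: sentence-level pipeline timeline, with each sentence's inference overlapping playback of the previous sentence]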

Only one inference box is ever active at a time — the executor genuinely never runs two at once — but because each sentence's inference finishes well before the previous sentence's playback does, the next sentence is always ready the instant it's needed. The listener never sees the serial constraint; they just hear continuous audio.

Measured time-to-first-audio sits around 511ms cold, with inference itself running at roughly 4× realtime: the model generates audio faster than it plays back, which is what makes the overlap strategy work at all. The lesson here generalizes past TTS: when a resource is fundamentally serial, the fix usually isn't fighting the constraint, it's restructuring the unit of work so serial execution still produces continuous output.
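The pipelining idea can be sketched as a producer/consumer pair. This is a simulation under stated assumptions, not the app's implementation: `split_sentences` is one plausible reading of the "5+ words, ending on punctuation" rule, `infer` and `play` are hypothetical stand-ins that only sleep, and the sleep durations are chosen to mimic inference running about 4× faster than playback.

```python
import asyncio
import re


def split_sentences(text: str, min_words: int = 5) -> list[str]:
    """Split on sentence-ending punctuation, merging fragments shorter than min_words."""
    pieces = [p.strip() for p in re.split(r"(?<=[.!?])\s+", text) if p.strip()]
    segments: list[str] = []
    pending = ""
    for piece in pieces:
        pending = f"{pending} {piece}".strip()
        if len(pending.split()) >= min_words:
            segments.append(pending)
            pending = ""
    if pending:
        # A short trailing fragment joins the previous segment if there is one.
        if segments:
            segments[-1] = f"{segments[-1]} {pending}"
        else:
            segments.append(pending)
    return segments


async def infer(sentence: str) -> bytes:
    await asyncio.sleep(0.01)  # stand-in for serial model inference
    return sentence.encode("utf-8")


async def play(pcm: bytes) -> None:
    await asyncio.sleep(0.04)  # playback takes about 4x as long as inference


async def speak(text: str) -> list[tuple[str, int]]:
    """Run the pipeline and return an ordered event log of (event, sentence_index)."""
    log: list[tuple[str, int]] = []
    queue = asyncio.Queue()

    async def producer() -> None:
        # Inference is strictly sequential: one sentence after another.
        for index, sentence in enumerate(split_sentences(text)):
            pcm = await infer(sentence)
            log.append(("inferred", index))
            await queue.put((index, pcm))
        await queue.put(None)  # sentinel: no more audio

    async def consumer() -> None:
        while (item := await queue.get()) is not None:
            index, pcm = item
            log.append(("play_start", index))
            await play(pcm)
            log.append(("play_end", index))

    await asyncio.gather(producer(), consumer())
    return log
```

In the returned log, the `("inferred", 1)` event lands before `("play_end", 0)`: the second sentence is ready while the first is still playing, even though inference itself never runs in parallel.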

A lookahead cache that turns common cases into near-zero latency

On top of sentence pipelining, the client can call POST /prewarm { text, voice, speed } proactively — triggered by clipboard changes or window focus — so the backend pre-runs inference on the likely-first segment and caches the PCM result (LRU, 10 entries, keyed by (segment_text, voice, speed)). When the matching /speak request actually arrives, it hits the cache and returns audio in under 20ms instead of waiting on a fresh inference pass. It's a small, almost opportunistic optimization, but for the dominant real-world workflow — copy text, hit the speak shortcut — it's the difference between "instant" and "fast."
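A small sketch of such a cache, assuming an `OrderedDict`-backed LRU. The class name `PrewarmCache` and its method names are hypothetical; only the capacity of 10 and the `(segment_text, voice, speed)` key come from the description above.

```python
from collections import OrderedDict
from typing import Optional


class PrewarmCache:
    """LRU cache of synthesized PCM, keyed by (segment_text, voice, speed)."""

    def __init__(self, capacity: int = 10) -> None:
        self._capacity = capacity
        self._entries = OrderedDict()

    def put(self, segment_text: str, voice: str, speed: float, pcm: bytes) -> None:
        key = (segment_text, voice, speed)
        self._entries[key] = pcm
        self._entries.move_to_end(key)  # mark as most recently used
        while len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used

    def get(self, segment_text: str, voice: str, speed: float) -> Optional[bytes]:
        key = (segment_text, voice, speed)
        pcm = self._entries.get(key)
        if pcm is not None:
            self._entries.move_to_end(key)  # a hit refreshes recency
        return pcm

    def __len__(self) -> int:
        return len(self._entries)
```

Including voice and speed in the key matters: the same text at a different speed or in a different voice is different audio, so it must miss the cache rather than return the wrong PCM.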

Idle-unload: trading a 1.1-second cold start for ~550MB of idle memory

The Kokoro model occupies roughly 600MB loaded, settling around 527MB steady-state. A desktop utility app has no business holding that much memory when nobody's using it, so after 300 seconds of inactivity the model unloads entirely — idle RSS drops to about 43MB. The next request pays a cold-start cost: roughly 0.45s to reload the model plus 0.67s of warmup, about 1.1 seconds total.

That tradeoff only works because most usage is bursty, not continuous — people speak something, read or listen, then come back later — and because the lookahead cache independently covers the "I just copied text and I'm about to hit speak" case regardless of whether the model is warm. Idle-unload and lookahead caching solve two different problems (steady-state memory footprint vs. perceived first-request latency) and neither one would be sufficient alone.
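One way to structure idle-unload is a holder object that reloads lazily and is polled by a periodic check. This is a sketch under assumptions: `IdleUnloadingModel`, `loader`, and the injectable `clock` are hypothetical names, and how the real backend schedules the check is not described here.

```python
import time
from typing import Any, Callable, Optional


class IdleUnloadingModel:
    """Holds a lazily loaded model and drops it after a period of inactivity."""

    def __init__(
        self,
        loader: Callable[[], Any],
        idle_timeout_s: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._idle_timeout_s = idle_timeout_s
        self._clock = clock
        self._model: Optional[Any] = None
        self._last_used = clock()
        self.load_count = 0  # how many times a (re)load was paid for

    @property
    def loaded(self) -> bool:
        return self._model is not None

    def get(self) -> Any:
        # Pay the cold-start cost only when the model is not resident.
        if self._model is None:
            self._model = self._loader()
            self.load_count += 1
        self._last_used = self._clock()
        return self._model

    def unload_if_idle(self) -> bool:
        """Call periodically; returns True if the model was unloaded."""
        idle_for = self._clock() - self._last_used
        if self._model is not None and idle_for >= self._idle_timeout_s:
            self._model = None  # release the reference so memory can be reclaimed
            return True
        return False
```

Injecting the clock keeps the 300-second rule testable without waiting five minutes.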

INT8 quantization: tried, measured, rejected

Quantizing a model to INT8 is one of the standard "free" inference speedups — except here it wasn't free, and the team verified that with data instead of assuming it. Two attempts:

  • Full dynamic INT8 quantization turned Kokoro's convolution layers into ConvInteger ops, which the ONNX Runtime CPU execution provider doesn't support at all — the model simply fails to load.
  • MatMul-only INT8 quantization loads, but delivers a 1.02× speedup (statistical noise, not a real gain) while destroying audio quality — output correlation against the unquantized baseline measured 0.09–0.29, with RMS error exceeding 100%.

The root cause is architectural: Kokoro is convolution-dominated, not MatMul-heavy, so the quantization technique that helps transformer-style MatMul-bound models does essentially nothing here while actively corrupting the parts of the network that matter. The honest conclusion, documented rather than quietly dropped: a real speedup would require static/QDQ quantization with calibration or a CoreML export, both out of scope for now. This is the kind of negative result that's easy to skip writing down — and exactly the kind worth writing down, so nobody re-discovers the same dead end in six months.
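The two quality metrics quoted above can be computed along these lines. This is a sketch that assumes Pearson correlation and RMS error relative to the baseline over sample-aligned waveforms; how the team actually computed its numbers is not stated, and the function names are illustrative.

```python
import math
from typing import Sequence


def pearson_correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sample sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and the same length")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant signal")
    return cov / math.sqrt(var_a * var_b)


def relative_rms_error(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """RMS of (candidate - reference) divided by RMS of reference; 1.0 means 100%."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("sequences must be non-empty and the same length")
    err = math.sqrt(sum((c - r) ** 2 for r, c in zip(reference, candidate)) / len(reference))
    ref = math.sqrt(sum(r * r for r in reference) / len(reference))
    if ref == 0:
        raise ValueError("relative error is undefined for a silent reference")
    return err / ref
```

A faithful quantized model should score a correlation near 1.0 and a relative error near 0. A relative RMS error above 1.0 means the error signal carries more energy than the baseline audio itself.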

Multilingual support, gated by what actually sounds correct

Kokoro-82M physically contains voices across 9 locales, but SuperSay ships 8 languages and 28 voices — Japanese is deliberately excluded. The reason is upstream of the model: phonemization runs through espeak-ng, which has no kanji grapheme-to-phoneme conversion. Real Japanese input gets transliterated as "Chinese letter" repeated, producing nine seconds of nonsense audio. That's not a quality compromise worth shipping, so Japanese was cut entirely rather than shipped broken, and Mandarin — where espeak does produce correct pinyin-IPA but tone fidelity couldn't be confirmed without human listening review — shipped as Beta rather than full support. Every other supported language got the same treatment: actual audio output reviewed before shipping, not just "the model technically supports this locale."

The implementation detail that keeps this maintainable: language is derived entirely from the voice ID's first character, backend-side — the client never transmits a language field at all, just a voice ID:

# backend/app/services/languages.py
# first character → espeak locale
# a/b → en-us/en-gb, e → es, f → fr-fr, i → it,
# p → pt-br, h → hi, z → cmn (Beta)

That keeps the wire protocol backward-compatible — audiobooks that stored a voice ID under the old single-language version of the app still resolve correctly today, since the language was never part of the contract to begin with.
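A runnable sketch of that lookup, built from the prefix table in the comment above. The names `VOICE_PREFIX_TO_LOCALE`, `BETA_LOCALES`, and `locale_for_voice` are assumptions, as is the choice to raise `ValueError` for unknown prefixes; the voice IDs in the usage note are examples only.

```python
# First character of the voice ID selects the espeak locale.
VOICE_PREFIX_TO_LOCALE = {
    "a": "en-us",
    "b": "en-gb",
    "e": "es",
    "f": "fr-fr",
    "i": "it",
    "p": "pt-br",
    "h": "hi",
    "z": "cmn",
}

# Locales shipped with a Beta label rather than full support.
BETA_LOCALES = frozenset({"cmn"})


def locale_for_voice(voice_id: str) -> str:
    """Derive the espeak locale from a voice ID; the client never sends a language."""
    if not voice_id:
        raise ValueError("voice ID must not be empty")
    prefix = voice_id[0]
    if prefix not in VOICE_PREFIX_TO_LOCALE:
        # Covers excluded locales such as Japanese, which has no entry here.
        raise ValueError(f"unsupported voice prefix {prefix!r}")
    return VOICE_PREFIX_TO_LOCALE[prefix]
```

With this shape, `locale_for_voice("af_heart")` resolves to `"en-us"`, while an ID starting with `j` is rejected outright instead of producing broken audio.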

Identity without accounts

There are no passwords and no real account system. Every install generates a persistent anonymous UUID on first launch. A user can optionally enter an email during onboarding, which gets POSTed once and attached to that UUID purely for returning-user attribution — nothing in the app is gated behind it, and no JWT or session is ever issued. The simplicity is the point: no password means no credential-breach surface, no phishing vector, and no auth flow to debug. The explicit tradeoff is that there's no way to restore history across devices — a non-goal stated directly in the project's own design docs, not an oversight.

Telemetry that can't smuggle content, enforced twice

When telemetry is on (it's on by default, with a real off switch), only aggregate counts ever leave the device — never raw text, never filenames. The whitelist of allowed event properties is enforced in two independent places: once client-side, where unknown keys are dropped before HTTP serialization even happens, and again server-side, where the ingestion endpoint re-validates and drops anything not on the list before it touches the database. Defense in depth here isn't decorative — it means a bug in either the client or the server alone can't leak a field that was never supposed to be sent. Flip the telemetry toggle off and the outbox is wiped immediately, not just stopped going forward.
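The whitelist check itself is small. The sketch below shows the shape of the filter that would run on both sides; the property names in `ALLOWED_EVENT_PROPERTIES` are invented for illustration and are not the project's real list.

```python
from typing import Any, Mapping

# Hypothetical whitelist; the real list of allowed properties is not shown here.
ALLOWED_EVENT_PROPERTIES = frozenset({"event", "count", "app_version"})


def sanitize_event(raw: Mapping[str, Any]) -> dict[str, Any]:
    """Drop every property that is not explicitly whitelisted."""
    return {key: value for key, value in raw.items() if key in ALLOWED_EVENT_PROPERTIES}
```

Applying the same function before serialization on the client and again at ingestion means a stray field such as `{"text": "..."}` has to slip past two independent filters before it could reach the database.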

Consent and a hard cost ceiling on the one feature that does leave the device

The audiobook feature can optionally use Google Gemini to clean up OCR'd or messy PDF text — the one feature where document content genuinely leaves the Mac. It requires an explicit per-book checkbox in the upload UI, not a one-time global setting, and the backend independently enforces a per-book cost cap (default $5.00, computed from estimated token usage at upload time) — exceed it and the upload is rejected with HTTP 413 before any content is transmitted. Consent UI alone is something a user can misunderstand or a future refactor can accidentally bypass; a server-side cost gate is a hard backstop that doesn't depend on the client behaving correctly.
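A sketch of how the consent check and cost gate could compose on the backend. Everything numeric here except the $5.00 default is a placeholder: `CHARS_PER_TOKEN` is a rough heuristic and `USD_PER_MILLION_TOKENS` is not a real Gemini price. The names `plan_cleanup` and `CostCapExceeded` are likewise hypothetical.

```python
import math

CHARS_PER_TOKEN = 4            # rough heuristic, assumption for illustration
USD_PER_MILLION_TOKENS = 1.00  # placeholder price, not an actual Gemini rate
DEFAULT_COST_CAP_USD = 5.00    # per-book cap described above


class CostCapExceeded(Exception):
    """Raised before any content is transmitted; maps to HTTP 413."""

    status_code = 413


def estimate_cost_usd(text: str) -> float:
    tokens = math.ceil(len(text) / CHARS_PER_TOKEN)
    return tokens * USD_PER_MILLION_TOKENS / 1_000_000


def plan_cleanup(
    text: str,
    cloud_cleanup_opted_in: bool,
    cap_usd: float = DEFAULT_COST_CAP_USD,
) -> str:
    """Decide how a book is cleaned up: 'local' or 'cloud', or raise on the cap."""
    if not cloud_cleanup_opted_in:
        return "local"  # without the per-book checkbox, nothing leaves the Mac
    estimated = estimate_cost_usd(text)
    if estimated > cap_usd:
        raise CostCapExceeded(
            f"estimated ${estimated:.4f} exceeds cap ${cap_usd:.2f}"
        )
    return "cloud"
```

The ordering is the point: the estimate and the cap check both happen before the first byte would be sent, so a rejected upload transmits nothing.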

Loopback-only, the hard way

An earlier version of the backend bound to 0.0.0.0 — reachable from every device on the same LAN, no authentication. The Swift client only ever talked to it via localhost anyway, so nothing about the feature set required network-wide exposure; it was a default that was simply wrong. The fix was a breaking change: bind exclusively to 127.0.0.1. It's a one-line change with an outsized blast radius if missed, which is exactly why it's documented as a hard architectural invariant (HARD-004) rather than left as an implicit assumption about deployment environment.
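An invariant like this can also be enforced in code rather than only in documentation. The guard below is a sketch, not the project's implementation: `require_loopback` is a hypothetical helper, and the commented launch call assumes the FastAPI app is served with uvicorn, which the text above does not confirm.

```python
import ipaddress

BACKEND_PORT = 10101


def require_loopback(host: str) -> str:
    """Return host unchanged if it is a loopback IP literal, else raise ValueError."""
    address = ipaddress.ip_address(host)  # raises ValueError for non-IP strings
    if not address.is_loopback:
        raise ValueError(f"refusing to bind to non-loopback address {host!r}")
    return host


BIND_HOST = require_loopback("127.0.0.1")

# Assumed launch call if the app is served with uvicorn:
#   uvicorn.run(app, host=BIND_HOST, port=BACKEND_PORT)
```

With the guard in the startup path, a regression back to `0.0.0.0` fails loudly at launch instead of silently exposing the backend to the LAN.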

Hard problems, solved

A reinstalled app that looked completely dead

After a fresh reinstall, the app carries a new ad-hoc signature, and macOS silently stops honoring the Accessibility permission that was tied to the old one. The global hotkey for "speak selected text" depends on that permission. Onboarding, which is where the permission gets requested, didn't re-run, because the hasOnboarded flag had already been persisted from the previous install and survived the reinstall untouched. The result: the hotkey silently failed to read selected text, with the main thread sitting idle in the normal AppKit event loop and no error surfaced anywhere. From the outside, the app looked simply broken.

The fix: onboarding now re-runs if either hasOnboarded is false or AXIsProcessTrusted() returns false — so a reinstall that quietly revokes the permission automatically triggers the re-request flow instead of leaving the user stuck. The retrospective note in the project's own engineering journal is worth keeping verbatim: the investigation initially theorized across several turns about what might be wrong with the app, when the actual fix only took one step — checking the permission and onboarding gate state directly on the live process. Data first, theories second.

Stuck at 0:00

A user reported the waveform UI frozen at 0:00, with the Mandarin voice selected over English text. Three separate things were compounding: the model had idle-unloaded after 300+ seconds of inactivity and needed a cold reload, the voice selected didn't match the language of the text (producing mangled espeak output), and — the part that actually caused the "frozen" perception — the UI flipped its status to .speaking at the moment the request was sent, not when audio actually started arriving, so during the cold-reload window the waveform sat at zero with no indication anything was happening.

The fix on the UI side: an awaitingFirstChunk flag keeps the displayed status at .thinking until the first real audio chunk plays, so the waveform only ever shows "speaking" once there's something to show. Separately, automatic language detection (matching voice to text) closes the wrong-voice-for-text half of the bug. Reproducing the backend behavior directly via curl confirmed it was never broken — 2.8 seconds for Mandarin-voice-on-English-text, 0.8 seconds for Mandarin-on-Chinese — the entire bug lived in the client's state machine, not the inference pipeline. That distinction mattered: a less careful diagnosis could have spent real effort optimizing backend latency that was never the actual problem.
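The client-side fix reduces to a tiny state machine. The real code lives in the Swift client; the Python below only models the logic, and apart from the `awaitingFirstChunk` idea every name in it is an assumption.

```python
from enum import Enum


class Status(Enum):
    IDLE = "idle"
    THINKING = "thinking"
    SPEAKING = "speaking"


class PlaybackStatus:
    """Derives the displayed status from request and audio events."""

    def __init__(self) -> None:
        self._request_active = False
        self.awaiting_first_chunk = False

    def request_sent(self) -> None:
        # Sending the request does not mean audio exists yet.
        self._request_active = True
        self.awaiting_first_chunk = True

    def first_chunk_started_playing(self) -> None:
        self.awaiting_first_chunk = False

    def finished(self) -> None:
        self._request_active = False
        self.awaiting_first_chunk = False

    @property
    def status(self) -> Status:
        if not self._request_active:
            return Status.IDLE
        if self.awaiting_first_chunk:
            return Status.THINKING
        return Status.SPEAKING
```

During a cold reload the model can take over a second to respond, and in this design the UI stays on `THINKING` for that whole window instead of showing a waveform stuck at zero.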

Seeking past the end corrupted playback state forever

Seeking after a clip or audiobook had already finished playing could leave the app stuck reporting "Speaking" indefinitely. Two separate bugs combined: seek() didn't increment a generation counter used elsewhere as a guard against stale audio-buffer callbacks, and seekAudiobook() never cleared a playbackCompleted flag, so scrubbing backward after a natural end would either restart from zero or get stuck on stale handlers depending on timing. The fix made both seek paths bump the generation counter and clear playbackCompleted, mirroring a pattern the existing stop() path already used correctly — the bug was an inconsistency between two code paths that needed to agree on state, not a new mechanism.
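The generation-counter guard is easiest to see in miniature. This is a Python model of the pattern, not the app's Swift code: `Player`, `schedule_buffer`, and `_invalidate` are hypothetical names, and the completion callback stands in for an audio-buffer completion handler.

```python
from typing import Callable


class Player:
    """Models the guard that makes stale buffer callbacks harmless."""

    def __init__(self) -> None:
        self.generation = 0
        self.playback_completed = False
        self.status = "idle"

    def _invalidate(self) -> None:
        # Shared by stop() and seek(): both paths must agree on state.
        self.generation += 1
        self.playback_completed = False

    def schedule_buffer(self) -> Callable[[], None]:
        """Start playback and return a completion callback bound to this generation."""
        scheduled_generation = self.generation
        self.status = "speaking"

        def on_complete() -> None:
            if scheduled_generation != self.generation:
                return  # stale callback from before a seek or stop; ignore it
            self.playback_completed = True
            self.status = "idle"

        return on_complete

    def seek(self) -> Callable[[], None]:
        self._invalidate()
        return self.schedule_buffer()

    def stop(self) -> None:
        self._invalidate()
        self.status = "idle"
```

Because `seek()` goes through the same `_invalidate()` step as `stop()`, a callback scheduled before the seek can fire late without flipping the player into a finished state it no longer belongs in.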

A security finding where the code turned out to be correct as written

An audit pass flagged a potential infinite crash-loop: if a guard flag (_isLaunching) got stuck true during backend startup, the app might never recover. Investigating before fixing it revealed the flag is correctly reset on every exit path of the launch routine, success or failure, plus by the termination handler — and the one branch that deliberately does not clear it (an offline health-check path) does so on purpose, because clearing it there would let a concurrent launch attempt double-spawn the backend process while one was already mid-startup, which is precisely the failure mode the guard exists to prevent. The proposed "fix" would have been a regression, not an improvement.

This is documented in the project's own journal explicitly as a case study: verify a finding against the actual code before acting on it, especially when the finding comes from an automated audit rather than a reproduced failure. Plausible-sounding bug reports aren't automatically correct ones.

What I'd do differently

A few things are explicitly logged as deferred rather than ignored. Automatic language detection currently defaults to off, requiring users to opt in, even though the team's own assessment is that it should probably default to on — left as a product decision rather than an engineering one. The PyInstaller build spec has a hardcoded development-machine path baked in from how the bundling was originally set up; it works today because builds happen from the same machine, but it's flagged as something that needs proper CoreML/build-environment testing before it can be generalized. Both are honest admissions of scope cut for now, not bugs nobody noticed — the kind of debt that's fine to defer as long as it's written down where the next person (or the same person, six months later) will actually find it.

Stack at a glance

Frontend: Swift 6, SwiftUI + AppKit, AVAudioEngine
Backend: Python 3.11, FastAPI, bundled to a standalone binary via PyInstaller
TTS model: Kokoro-82M, ONNX Runtime, ~4× realtime on Apple Silicon
Phonemizer: espeak-ng, 9 locales, single-threaded by hard constraint
Local storage: SQLite, WAL mode, fully on-device
Network boundary: Loopback-only (127.0.0.1:10101); nothing reachable off-device
Optional cloud: Supabase + Vercel, counts-only telemetry, opt-in email attribution
Tests: 178 backend tests (76% coverage), 76 frontend tests
Languages shipped: 8 (28 voices), each validated by actual audio review before release

More deep dives in this series: BroSki, a Rust task runner whose transactional output promotion solves the same "never leave things half-done" problem SuperSay's staged audiobook pipeline cares about, and SuperZen, a macOS wellness app that makes the same bet — do the invasive-looking thing (local-only, on-device) and engineer around the constraints it creates instead of reaching for the cloud by default.