SuperSay: On-Device Text-to-Speech That Streams in Under 200ms

Jul 2, 2026

24 min read

Tags: engineering, deep-dive, swift, python, tts, supersay

SuperSay is a native macOS app that turns text into speech — entirely on-device. No cloud TTS API, no text transmitted anywhere by default, 28 voices across 8 languages, running a Kokoro-82M model locally on Apple Silicon. The constraint that makes this interesting isn't the model — it's making local inference feel as fast and seamless as a cloud API backed by a server farm, without giving up the privacy guarantee that's the entire point of running on-device in the first place.

This is a deep dive into four things: the architecture that gets time-to-first-audio under 200ms even though the model can only run one inference at a time; a "free" performance win (INT8 quantization) that was tried, measured, and thrown out; a pre-launch adversarial audit that turned up a real security gap before it shipped; and the business reasoning behind betting on local compute instead of a metered API. It covers five phases of shipped work in under six months, on a single-developer project with 308 backend tests and 74 Swift tests green.

The problem

Cloud TTS is the easy path: send text to an API, get audio back, the server handles concurrency and scaling. The cost is that every word ever converted to speech passes through someone else's infrastructure, and someone else's meter runs on every character. SuperSay's bet is that on-device inference can deliver comparable speed and quality without that tradeoff — but on-device means being bound by the Mac's CPU, a model with real thread-safety constraints, and a phonemizer library that wasn't written with this use case in mind. None of that is solvable by buying more servers; it has to be solved with architecture.

Architecture

[Architecture diagram]

Solid arrows are the always-on, fully local path. Dashed arrows are the two places anything ever leaves the Mac — both opt-in, both covered in the privacy section below.

Layer	Choice	Why
Frontend	Swift 6, SwiftUI + AppKit	Native UI, AppKit for global hotkeys and system services
Audio playback	AVAudioEngine	Frame-accurate progress tracking, hardware-synced
Backend	Python 3.11, FastAPI, bundled via PyInstaller	Compiled into a standalone macOS binary shipped inside the app
TTS model	Kokoro-82M, ONNX runtime	82M parameters, fast enough for realtime on consumer hardware
Phonemizer	espeak-ng (bundled)	Grapheme-to-phoneme for 9 locales
Local storage	SQLite (WAL mode)	Audiobook metadata, fully on-device
Optional cloud	Supabase (Postgres) + Vercel	Counts-only telemetry, optional email attribution — opt-in, not required for the app to function

The frontend and backend talk over HTTP, but it's not a network call in any meaningful sense — the backend binds exclusively to 127.0.0.1:10101, loopback only, so nothing outside the Mac can reach it. That detail matters enough that it's worth its own section below.

Design decisions and tradeoffs

Sequential inference, parallel-feeling latency

The single most consequential constraint in the system: espeak-ng has C-level shared state (phonemizer tables and language buffers) that is not thread-safe. Run two inference jobs concurrently and that shared state gets corrupted, producing garbled audio or a crash. So Kokoro inference runs through a ThreadPoolExecutor(max_workers=1): strictly one job at a time, no exceptions.

That sounds like it should make the app feel slow. It doesn't, because speed comes from pipelining at the sentence level instead of from parallel inference:

Text is split into sentences (5+ words, ending on punctuation).
Sentence 1 is inferred — roughly 150–250ms on Apple Silicon.
While sentence 1 is playing in hardware, sentence 2 is already inferring in the executor.
Time-to-first-audio is just sentence 1's latency, not the whole text's.
Total wall-clock time becomes the sum of per-sentence inference, overlapped with playback rather than blocking on it.

[Diagram: sentence-level inference overlapped with playback]

Only one inference box is ever active at a time — the executor genuinely never runs two at once — but because each sentence's inference finishes well before the previous sentence's playback does, the next sentence is always ready the instant it's needed. The listener never sees the serial constraint; they just hear continuous audio.
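A minimal sketch of that overlap, under stated assumptions: `fake_infer`, `speak`, and the `play` callback are hypothetical names invented for illustration, and the sleep stands in for real model inference. Only the single-worker executor comes from the description above.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# One worker: two inference jobs can never run at the same time.
executor = ThreadPoolExecutor(max_workers=1)

def fake_infer(sentence: str) -> bytes:
    """Hypothetical stand-in for model inference; returns placeholder 'audio'."""
    time.sleep(0.01)
    return sentence.encode("utf-8")

def speak(sentences, infer=fake_infer, play=lambda audio: None):
    """Play sentences in order while the next one is inferred in the background."""
    sentences = list(sentences)
    if not sentences:
        return []
    played = []
    pending = executor.submit(infer, sentences[0])
    for upcoming in sentences[1:] + [None]:
        audio = pending.result()  # block only on the sentence needed right now
        if upcoming is not None:
            # Queue the next sentence before playback starts, so its
            # inference runs while the current audio is playing.
            pending = executor.submit(infer, upcoming)
        play(audio)
        played.append(audio)
    return played
```

The caller only ever waits for the first sentence up front; every later `pending.result()` tends to return immediately as long as inference stays faster than playback.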

Measured time-to-first-audio sits around 511ms cold, with inference itself running at roughly 4× realtime: the model generates audio faster than it plays back, which is what makes the overlap strategy work at all. The lesson generalizes past TTS: when a resource is fundamentally serial, the fix usually isn't fighting the constraint, it's restructuring the unit of work so serial execution still produces continuous output.

A lookahead cache that turns common cases into near-zero latency

On top of sentence pipelining, the client can call POST /prewarm { text, voice, speed } proactively — triggered by clipboard changes or window focus — so the backend pre-runs inference on the likely-first segment and caches the PCM result (LRU, 10 entries, keyed by (segment_text, voice, speed)). When the matching /speak request actually arrives, it hits the cache and returns audio in under 20ms instead of waiting on a fresh inference pass. It's a small, almost opportunistic optimization, but for the dominant real-world workflow — copy text, hit the speak shortcut — it's the difference between "instant" and "fast."
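One way such a cache could look, as a rough sketch rather than the app's actual code: the `PrewarmCache` class and its method names are hypothetical, while the capacity of 10 and the `(segment_text, voice, speed)` key follow the description above.

```python
from collections import OrderedDict

class PrewarmCache:
    """Small LRU cache of PCM bytes keyed by (segment_text, voice, speed)."""

    def __init__(self, capacity: int = 10):
        self.capacity = capacity
        self._entries = OrderedDict()

    def put(self, text: str, voice: str, speed: float, pcm: bytes) -> None:
        key = (text, voice, speed)
        self._entries[key] = pcm
        self._entries.move_to_end(key)  # mark as most recently used
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used

    def get(self, text: str, voice: str, speed: float):
        """Return cached PCM, or None on a miss (caller falls back to inference)."""
        key = (text, voice, speed)
        pcm = self._entries.get(key)
        if pcm is not None:
            self._entries.move_to_end(key)
        return pcm
```

Keying on all three fields matters: the same text at a different speed or voice is different audio, so it has to miss.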

The same instinct shows up in onboarding: rather than round-tripping to the local backend just to preview a voice, all 28 voices ship as pre-generated ~3-second WAV samples (~4MB total) bundled directly into the app's resources, each reading a native-language pangram. "Hear a sample" plays instantly via AVAudioPlayer instead of waiting on even a fast local inference call — a deliberate choice to make the first five seconds of using the app feel as fast as the thousandth.

Idle-unload: trading a 1.1-second cold start for ~550MB of idle memory

The Kokoro model occupies roughly 600MB loaded, settling around 527MB steady-state. A desktop utility app has no business holding that much memory when nobody's using it, so after 300 seconds of inactivity the model unloads entirely — idle RSS drops to about 43MB. The next request pays a cold-start cost: roughly 0.45s to reload the model plus 0.67s of warmup, about 1.1 seconds total.

That tradeoff only works because most usage is bursty, not continuous — people speak something, read or listen, then come back later — and because the lookahead cache independently covers the "I just copied text and I'm about to hit speak" case regardless of whether the model is warm. Idle-unload and lookahead caching solve two different problems (steady-state memory footprint vs. perceived first-request latency) and neither one would be sufficient alone.
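The idle-unload half can be sketched as a small holder object. This is an illustration under assumptions: `IdleUnloadingModel`, the `load` callable, and the injectable `clock` are hypothetical names, and only the 300-second window comes from the text.

```python
import time

class IdleUnloadingModel:
    """Holds a model and drops it after a period with no requests."""

    def __init__(self, load, idle_seconds: float = 300.0, clock=time.monotonic):
        self._load = load                # callable that performs the expensive load
        self._idle_seconds = idle_seconds
        self._clock = clock              # injectable so the behavior is testable
        self._model = None
        self._last_used = clock()

    @property
    def loaded(self) -> bool:
        return self._model is not None

    def get(self):
        """Return the model, paying the cold-start cost if it was unloaded."""
        if self._model is None:
            self._model = self._load()
        self._last_used = self._clock()
        return self._model

    def unload_if_idle(self) -> bool:
        """Call periodically; returns True if the model was just unloaded."""
        idle_for = self._clock() - self._last_used
        if self._model is not None and idle_for >= self._idle_seconds:
            self._model = None
            return True
        return False
```

Dropping the reference only frees memory if nothing else holds the model, so a real implementation would need to make sure no in-flight request still has it.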

INT8 quantization: tried, measured, rejected

Quantizing a model to INT8 is one of the standard "free" inference speedups. Here it wasn't free, and that was established with measurements rather than assumed. Two attempts:

Full dynamic INT8 quantization turned Kokoro's convolution layers into ConvInteger ops, which the ONNX Runtime CPU execution provider doesn't support at all — the model simply fails to load.
MatMul-only INT8 quantization loads, but delivers a 1.02× speedup (statistical noise, not a real gain) while destroying audio quality — output correlation against the unquantized baseline measured 0.09–0.29, with RMS error exceeding 100%.

The root cause is architectural: Kokoro is convolution-dominated, not MatMul-heavy, so the quantization technique that helps transformer-style MatMul-bound models does essentially nothing here while actively corrupting the parts of the network that matter. The honest conclusion, documented rather than quietly dropped: a real speedup would require static/QDQ quantization with calibration or a CoreML export, both out of scope for now. This is the kind of negative result that's easy to skip writing down — and exactly the kind worth writing down, so nobody re-discovers the same dead end in six months.
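For readers who want to run a similar comparison, here is one plausible way to compute the two quality numbers. The exact definitions used in the original measurement are not stated, so Pearson correlation and RMS error relative to the baseline's RMS are assumptions, as are the function names.

```python
import math

def pearson_correlation(baseline, candidate):
    """Pearson correlation between two equal-length sample sequences."""
    n = len(baseline)
    mean_a = sum(baseline) / n
    mean_b = sum(candidate) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(baseline, candidate))
    var_a = sum((a - mean_a) ** 2 for a in baseline)
    var_b = sum((b - mean_b) ** 2 for b in candidate)
    return cov / math.sqrt(var_a * var_b)

def relative_rms_error(baseline, candidate):
    """RMS of the difference, as a fraction of the baseline's own RMS."""
    n = len(baseline)
    err = math.sqrt(sum((a - b) ** 2 for a, b in zip(baseline, candidate)) / n)
    ref = math.sqrt(sum(a ** 2 for a in baseline) / n)
    return err / ref
```

Under these definitions a relative RMS error above 1.0 (100%) means the difference signal carries more energy than the baseline itself, which is consistent with output that no longer resembles the original audio.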

Multilingual support, gated by what actually sounds correct

Kokoro-82M physically contains voices across 9 locales, but SuperSay ships 8 languages and 28 voices — Japanese is deliberately excluded. The reason is upstream of the model: phonemization runs through espeak-ng, which has no kanji grapheme-to-phoneme conversion. Real Japanese input gets transliterated as "Chinese letter" repeated, producing nine seconds of nonsense audio. That's not a quality compromise worth shipping, so Japanese was cut entirely rather than shipped broken, and Mandarin — where espeak does produce correct pinyin-IPA but tone fidelity couldn't be confirmed without human listening review — shipped as Beta rather than full support. Every other supported language got the same treatment: actual audio output reviewed before shipping, not just "the model technically supports this locale."


The implementation detail that keeps this maintainable: language is derived entirely from the voice ID's first character, backend-side — the client never transmits a language field at all, just a voice ID:

# backend/app/services/languages.py
# first character → espeak locale
# a/b → en-us/en-gb, e → es, f → fr-fr, i → it,
# p → pt-br, h → hi, z → cmn (Beta)

That keeps the wire protocol backward-compatible — audiobooks that stored a voice ID under the old single-language version of the app still resolve correctly today, since the language was never part of the contract to begin with. It also means the catalog grew 8 voices → 28 voices with a zero-line change to the HTTP contract — the picker just offers more entries, grouped by language, and the backend already knew what to do with any of them.
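A sketch of what that lookup might look like in practice. The prefix-to-locale pairs are taken from the comment above; the names `VOICE_PREFIX_TO_LOCALE` and `locale_for_voice` are hypothetical, and the voice IDs in the usage note are illustrative only.

```python
# Prefix-to-locale pairs as listed in the comment above.
VOICE_PREFIX_TO_LOCALE = {
    "a": "en-us",
    "b": "en-gb",
    "e": "es",
    "f": "fr-fr",
    "i": "it",
    "p": "pt-br",
    "h": "hi",
    "z": "cmn",  # Beta
}

def locale_for_voice(voice_id: str) -> str:
    """Derive the espeak locale from the first character of a voice ID."""
    if not voice_id:
        raise ValueError("empty voice ID")
    prefix = voice_id[0]
    try:
        return VOICE_PREFIX_TO_LOCALE[prefix]
    except KeyError:
        # Unshipped locales (for example a Japanese "j" prefix) land here.
        raise ValueError(f"unsupported voice prefix: {prefix!r}") from None
```

With a mapping like this, `locale_for_voice("af_heart")` resolves to `"en-us"`, and a voice ID starting with `j` is rejected instead of being phonemized badly.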

Identity without accounts

There are no passwords and no real account system. Every install generates a persistent anonymous UUID on first launch. A user can optionally enter an email during onboarding, which gets POSTed once and attached to that UUID purely for returning-user attribution — nothing in the app is gated behind it, and no JWT or session is ever issued. The simplicity is the point: no password means no credential-breach surface, no phishing vector, and no auth flow to debug. The explicit tradeoff is that there's no way to restore history across devices — a non-goal stated directly in the project's own design docs, not an oversight.

Telemetry that can't smuggle content, enforced twice

When telemetry is on (it's on by default, with a real off switch), only aggregate counts ever leave the device — never raw text, never filenames. The whitelist of allowed event properties is enforced in two independent places: once client-side, where unknown keys are dropped before HTTP serialization even happens, and again server-side, where the ingestion endpoint re-validates and drops anything not on the list before it touches the database. Both sides log to dropped_keys, so schema drift between client and server would show up as a signal instead of silently vanishing. Defense in depth here isn't decorative — it means a bug in either the client or the server alone can't leak a field that was never supposed to be sent. Flip the telemetry toggle off and the outbox is wiped immediately, not just stopped going forward.
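The whitelist step itself is small. The sketch below assumes a hypothetical set of allowed property names (the real list is not given in this post) and a hypothetical `filter_props` helper; the same function shape would work on either side of the wire.

```python
# Hypothetical whitelist; the real property names are not listed in this post.
ALLOWED_PROPS = frozenset({"event", "count", "voice", "app_version"})

def filter_props(props: dict, allowed=ALLOWED_PROPS):
    """Split event properties into (kept, dropped_keys).

    Anything not on the whitelist is removed before it can be serialized
    or stored; the dropped key names (never their values) are returned so
    they can be logged as a schema-drift signal.
    """
    kept = {key: value for key, value in props.items() if key in allowed}
    dropped_keys = sorted(key for key in props if key not in allowed)
    return kept, dropped_keys
```

Returning only the key names in `dropped_keys` keeps the drift signal useful without logging the values that were supposed to stay on the device.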

Consent and a hard cost ceiling on the one feature that does leave the device

The audiobook feature can optionally use Google Gemini to clean up OCR'd or messy PDF text — the one feature where document content genuinely leaves the Mac. It requires an explicit per-book checkbox in the upload UI, and the backend independently enforces a per-book cost cap (default $5.00, computed from estimated token usage at upload time) — exceed it and the upload is rejected with HTTP 413 before any content is transmitted. A pre-launch adversarial audit caught that the consent checkbox itself was, for a while, enforced client-side only — a modified client could skip the checkbox entirely and still trigger a Gemini upload. The fix threads consent through as a required, server-verified X-Gemini-Consent header on /audiobook/{id}/start, so the guarantee doesn't depend on the official app being the one making the request. That's the difference between a UI convenience and an actual boundary.
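Both server-side checks reduce to simple predicates. In this sketch the function names are hypothetical, and two details are assumptions not stated above: that the consent header carries the literal value `true`, and that a missing header is answered with 403. The 413 status and the $5.00 default come from the text.

```python
DEFAULT_COST_CAP_USD = 5.00

def check_gemini_upload(estimated_cost_usd: float,
                        cap_usd: float = DEFAULT_COST_CAP_USD) -> int:
    """Return the HTTP status for an upload, given its estimated cleaning cost."""
    if estimated_cost_usd > cap_usd:
        return 413  # rejected before any content is sent to the cloud model
    return 200

def check_gemini_start(headers: dict) -> int:
    """Return the HTTP status for a start request, based on the consent header."""
    # Header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    consent = normalized.get("x-gemini-consent", "")
    if consent.strip().lower() != "true":  # assumed header value
        return 403                          # assumed status for missing consent
    return 200
```

The point of doing this on the server is that neither check trusts anything the client UI did or did not show.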

Loopback-only, the hard way

An earlier version of the backend bound to 0.0.0.0 — reachable from every device on the same LAN, no authentication. The Swift client only ever talked to it via localhost anyway, so nothing about the feature set required network-wide exposure; it was a default that was simply wrong. The fix was a breaking change: bind exclusively to 127.0.0.1. It's a one-line change with an outsized blast radius if missed, which is exactly why it's documented as a hard architectural invariant (HARD-004) rather than left as an implicit assumption about deployment environment.
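The difference between the two bind addresses is easy to show with the standard library. This is a stand-in sketch, not the app's FastAPI code: `HealthHandler` and `make_server` are hypothetical names, and only the address and port come from the text.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    """Minimal handler so the server has something to answer with."""

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet

def make_server(port: int = 10101) -> ThreadingHTTPServer:
    # "127.0.0.1" restricts the listener to the loopback interface.
    # "0.0.0.0" (or an empty host) would listen on every interface,
    # which is what made the earlier version reachable from the LAN.
    return ThreadingHTTPServer(("127.0.0.1", port), HealthHandler)
```

If the FastAPI app is served with uvicorn (an assumption; the post does not name the server), the equivalent is passing `host="127.0.0.1"` to `uvicorn.run`.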

Built to survive contact with the real world

A reinstalled app that looked completely dead

After a fresh reinstall, macOS re-signs the app with a new ad-hoc signature, and the system silently revokes the Accessibility permission that was tied to the old signature. The global hotkey for "speak selected text" depends on that permission. Onboarding — which is where the permission gets requested — didn't re-run, because the hasOnboarded flag had already been persisted from the previous install and survived the reinstall untouched. The result: the hotkey silently failed to read selected text, with the main thread sitting idle in the normal AppKit event loop and no error surfaced anywhere. From the outside, the app looked simply broken.

The fix: onboarding now re-runs if either hasOnboarded is false or AXIsProcessTrusted() returns false — so a reinstall that quietly revokes the permission automatically triggers the re-request flow instead of leaving the user stuck. The retrospective note in the project's own engineering journal is worth keeping verbatim: the investigation initially theorized across several turns about what might be wrong with the app, when the actual fix only took one step — checking the permission and onboarding gate state directly on the live process. Data first, theories second.

A permission bug hiding behind a beta-OS guard

A related but separate permission failure surfaced in a pre-launch sweep: notification prompts had stopped appearing at all on current macOS. The proximate cause looked like a #available(macOS 26, *) guard added earlier as a workaround for a different, now-fixed crash — but removing that guard only mattered because of a deeper bug underneath it. The app's Info.plist was missing CFBundleIdentifier entirely, and both the Accessibility and Notification permission systems key off bundle identity — so the same missing key had been silently breaking two unrelated permission prompts at once. The macOS-version guard had been a plausible-looking patch over a symptom; the actual fix was restoring the bundle identity the OS needed to grant permissions against in the first place.

A redirect that silently ate every identify call

The optional email-attribution endpoint (/api/supersay/identify) was returning 400 on every real call in production, even though it worked fine locally. Reproducing it with curl against the live endpoint showed why: the client pointed at the www subdomain, which 308-redirects to the bare domain — and URLSession on macOS drops the POST body when it follows a redirect. Every identify request arrived server-side as an empty body, and the endpoint correctly rejected it as invalid. The fix was a one-line change (point the client at the bare domain, no redirect hop), but finding it required reproducing the exact failure against production rather than trusting that "the client and the endpoint look right in isolation" was the same thing as "the request actually arrives intact."

Stuck at 0:00

A user reported the waveform UI frozen at 0:00, with the Mandarin voice selected over English text. Three separate things were compounding: the model had idle-unloaded after 300+ seconds of inactivity and needed a cold reload, the voice selected didn't match the language of the text (producing mangled espeak output), and — the part that actually caused the "frozen" perception — the UI flipped its status to .speaking at the moment the request was sent, not when audio actually started arriving, so during the cold-reload window the waveform sat at zero with no indication anything was happening.

The fix on the UI side: an awaitingFirstChunk flag keeps the displayed status at .thinking until the first real audio chunk plays, so the waveform only ever shows "speaking" once there's something to show. Separately, automatic language detection (matching voice to text, via Apple's on-device NLLanguageRecognizer — zero dependencies, fully offline) closes the wrong-voice-for-text half of the bug. Reproducing the backend behavior directly via curl confirmed it was never broken — 2.8 seconds for Mandarin-voice-on-English-text, 0.8 seconds for Mandarin-on-Chinese — the entire bug lived in the client's state machine, not the inference pipeline. That distinction mattered: a less careful diagnosis could have spent real effort optimizing backend latency that was never the actual problem.
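The state-machine change can be modeled in a few lines. The real client is Swift; this Python model is only an illustration of the transition rule, and the class and method names are hypothetical apart from the `awaitingFirstChunk` idea described above.

```python
from enum import Enum

class Status(Enum):
    IDLE = "idle"
    THINKING = "thinking"
    SPEAKING = "speaking"

class PlaybackState:
    """Shows 'speaking' only once audio has actually started playing."""

    def __init__(self):
        self.status = Status.IDLE
        self.awaiting_first_chunk = False

    def request_sent(self):
        # Before the fix, the status flipped straight to SPEAKING here.
        self.awaiting_first_chunk = True
        self.status = Status.THINKING

    def chunk_started_playing(self):
        # The first real chunk is what moves the UI out of THINKING.
        self.awaiting_first_chunk = False
        self.status = Status.SPEAKING
```

During a cold reload the state simply stays in `THINKING` for longer, which is an honest description of what the app is doing.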

Seeking past the end corrupted playback state forever

Seeking after a clip or audiobook had already finished playing could leave the app stuck reporting "Speaking" indefinitely. Two separate bugs combined: seek() didn't increment a generation counter used elsewhere as a guard against stale audio-buffer callbacks, and seekAudiobook() never cleared a playbackCompleted flag, so scrubbing backward after a natural end would either restart from zero or get stuck on stale handlers depending on timing. The fix made both seek paths bump the generation counter and clear playbackCompleted, mirroring a pattern the existing stop() path already used correctly — the bug was an inconsistency between two code paths that needed to agree on state, not a new mechanism.
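The generation-counter pattern is worth seeing in isolation. This is a simplified Python model of the idea, not the Swift implementation: the `Player` class, its string statuses, and the handler shape are all hypothetical.

```python
class Player:
    """Every play or seek starts a new generation; old callbacks become no-ops."""

    def __init__(self):
        self.generation = 0
        self.playback_completed = False
        self.status = "idle"

    def _begin(self):
        # Shared by play and seek so the two paths cannot drift apart.
        self.generation += 1
        self.playback_completed = False
        self.status = "speaking"
        return self._completion_handler(self.generation)

    def play(self):
        return self._begin()

    def seek(self):
        return self._begin()

    def _completion_handler(self, generation: int):
        def on_buffer_finished() -> bool:
            if generation != self.generation:
                return False  # stale callback from before the last seek
            self.playback_completed = True
            self.status = "idle"
            return True
        return on_buffer_finished
```

Routing both entry points through one helper is a design choice in this sketch; it makes the "two code paths that need to agree on state" problem structurally harder to reintroduce.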

A security finding where the code turned out to be correct as written

An audit pass flagged a potential infinite crash-loop: if a guard flag (_isLaunching) got stuck true during backend startup, the app might never recover. Investigating before fixing it revealed the flag is correctly reset on every exit path of the launch routine, success or failure, plus by the termination handler — and the one branch that deliberately does not clear it (an offline health-check path) does so on purpose, because clearing it there would let a concurrent launch attempt double-spawn the backend process while one was already mid-startup, which is precisely the failure mode the guard exists to prevent. The proposed "fix" would have been a regression, not an improvement.

This is documented in the project's own journal explicitly as a case study: verify a finding against the actual code before acting on it, especially when the finding comes from an automated audit rather than a reproduced failure. Plausible-sounding bug reports aren't automatically correct ones.

The adversarial process itself

Most of the bugs above weren't found by users — they were found by deliberately trying to break the app before shipping it. The pattern that runs through this whole project: raise findings cheaply, then spend the real effort verifying each one against actual behavior before touching any code.

[Diagram: audit findings, six fixed and two debunked]

The two debunked findings matter as much as the six fixed ones. Both looked plausible on first read; both turned out to be wrong once traced against the real lock chain and lifecycle. Shipping the "fix" for either would have been a regression: one would have opened the door to the backend double-spawn that the launch guard exists to prevent, the other would have "fixed" a lost-update race that a per-book asyncio.Lock already closes. Treating an audit finding as a hypothesis to verify, not a ticket to close, is what kept both out of the codebase.
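For context on why a per-book lock closes a lost-update race, here is a small self-contained illustration. The `BookProgress` class and the counter it protects are hypothetical; what the real lock guards is not described in this post.

```python
import asyncio
from collections import defaultdict

class BookProgress:
    """Read-modify-write on per-book state, serialized by a per-book lock."""

    def __init__(self):
        self._locks = defaultdict(asyncio.Lock)  # one lock per book ID
        self.chapters_done = defaultdict(int)

    async def mark_chapter_done(self, book_id: str) -> None:
        async with self._locks[book_id]:
            current = self.chapters_done[book_id]
            # Yield to the event loop. Without the lock, another task could
            # read the same 'current' here and one increment would be lost.
            await asyncio.sleep(0)
            self.chapters_done[book_id] = current + 1

async def demo() -> int:
    progress = BookProgress()
    await asyncio.gather(
        *(progress.mark_chapter_done("book-1") for _ in range(50))
    )
    return progress.chapters_done["book-1"]
```

Because the lock is keyed by book, updates to different books never wait on each other; only concurrent updates to the same book are serialized.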

Five phases in five months


246 commits, one developer, and roadmap phases that each shipped as a real, tested version rather than a perpetual work-in-progress. Phase 3's optional accounts/analytics layer and Phase 4's security sweep are both explicitly not new features; they're deliberate passes to make the previous phase's work trustworthy before building the next one on top of it. That ordering (features, then hardening, then more features) shows up in the git history as a repeating rhythm, not a one-time cleanup.

The business case for running the model yourself

	Typical cloud TTS API	SuperSay
Marginal cost per generation	Metered — billed per character or per audio-second	$0. The model runs on hardware the user already owns
Works with no internet	No	Yes — the core feature has no network dependency at all
Text leaves the device	Always, by construction	Never, for the core feature
Blast radius of a usage spike	Someone's cloud bill	Someone's own CPU cycles
The one feature that does call a paid API	—	Audiobook cleaning — opt-in, the user's own Gemini key, hard $5/book cost ceiling enforced server-side

That last row is the interesting one from a product-economics standpoint: SuperSay isn't reflexively cloud-averse, it's cloud-deliberate. The one feature that legitimately benefits from a large cloud model (cleaning messy OCR'd text before narration) uses one — but the design treats "this could get expensive" as a real constraint to engineer against, not just a line item to monitor after the fact. A cost cap enforced server-side, rejecting the upload before a single API call fires, is a stronger guarantee than a dashboard alert that fires after the money's already spent.

The bigger structural bet is that the core value proposition, instant speech from any selected text, has genuinely zero marginal cost to serve, which is not true of any metered cloud TTS API. No rate limits to buy around, no API key to provision, no per-character bill that scales against usage. The tradeoff is real: no server-side elasticity to hide behind, no ability to swap in a bigger model without shipping a new binary, and every millisecond of latency has to be earned on the user's own machine, which is exactly why so much of this project's engineering effort went into pipelining, caching, and idle-memory management instead of just "add more compute."

Instrumented before it needed to be

Phase 3 (v2.0) shipped a full analytics stack — batched counts-only telemetry, a nightly Supabase rollup cron, retention-cohort views, and a public metrics dashboard (DAU/WAU, audio-hours, voice distribution, D1/D7/D30 retention, audiobook funnel) — months before the app had anything resembling a real public-launch push behind it. That ordering is deliberate: build the instrumentation for the growth curve before the growth curve exists, rather than retrofitting analytics onto a userbase that already accumulated with no way to see how it behaved.

The philosophy shows up in small details as much as the big pipeline: every rollup is idempotent (INSERT ... ON CONFLICT DO UPDATE) so a cron re-run can never double-count a day; the props whitelist that keeps telemetry honest is enforced on both sides of the wire, not just documented as a policy; and the onboarding flow captures optional email attribution without gating a single feature behind it, so the growth signal doesn't cost the product its promise of "no account required." None of that complexity is visible to a user who never opens the metrics dashboard — it exists so that whenever real usage starts, the first day of data is already trustworthy instead of being the day the tracking finally got built.
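The idempotent-rollup idea can be tried locally. The production rollup runs against Postgres on Supabase; this sketch uses SQLite (which accepts the same upsert form for this statement) purely so it runs standalone, and the `daily_rollup` table, its columns, and the function names are hypothetical.

```python
import sqlite3

def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a one-row-per-day rollup table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE daily_rollup (day TEXT PRIMARY KEY, events INTEGER NOT NULL)"
    )
    return conn

def rollup_day(conn: sqlite3.Connection, day: str, events: int) -> None:
    """Write the day's total; re-running with the same input changes nothing."""
    conn.execute(
        """
        INSERT INTO daily_rollup (day, events) VALUES (?, ?)
        ON CONFLICT(day) DO UPDATE SET events = excluded.events
        """,
        (day, events),
    )
    conn.commit()
```

The key detail is that the update overwrites with the recomputed total rather than adding to the stored value, so a second cron run for the same day converges to the same row.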

What I'd do differently

A few things are explicitly logged as deferred rather than ignored. Automatic language detection currently defaults to off, requiring users to opt in, even though the project's own assessment is that it should probably default to on — left as a product decision rather than an engineering one, precisely because English text always keeps the user's chosen voice and only non-English text ever re-routes, which makes the "on by default" case safer than it sounds. The PyInstaller build spec still has a hardcoded development-machine path baked into how espeak's data files get bundled — it works today because every build happens from the same machine, but it would break a clean checkout or a CI runner, and it's flagged rather than silently left for someone to rediscover the hard way. Both are honest admissions of scope cut for now, not bugs nobody noticed — the kind of debt that's fine to defer as long as it's written down where the next person (or the same person, months later) will actually find it.

Stack at a glance


Frontend	Swift 6, SwiftUI + AppKit, AVAudioEngine
Backend	Python 3.11, FastAPI, bundled to a standalone binary via PyInstaller
TTS model	Kokoro-82M, ONNX Runtime, ~4× realtime on Apple Silicon
Phonemizer	espeak-ng, 9 locales, single-threaded by hard constraint
Local storage	SQLite, WAL mode, fully on-device
Network boundary	Loopback-only (127.0.0.1:10101), nothing reachable off-device
Optional cloud	Supabase + Vercel, counts-only telemetry, opt-in email attribution, server-enforced consent + cost cap on the one paid-API feature
Tests	308 backend tests, 81% coverage (verified); 74 Swift tests
Languages shipped	8 (28 voices), each validated by actual audio review before release
History	246 commits, v1.0 → v2.3.0, 5 roadmap phases, under 6 months

More deep dives in this series: BroSki, a Rust task runner whose transactional output promotion solves the same "never leave things half-done" problem SuperSay's staged audiobook pipeline cares about; SuperZen, a macOS wellness app that makes the same bet of doing the invasive-looking thing (local-only, on-device) and engineering around the constraints it creates instead of reaching for the cloud by default; and ytdld, a terminal downloader built on the same instinct as SuperSay's time-to-first-audio work: restructure a UI around a blocking, single-purpose third-party engine instead of fighting it.