Chanakya: A 13-Agent Portfolio Brain, Engineered for Restraint

Jul 2, 2026

21 min read

Tags: engineering, deep-dive, python, typescript, llm-agents, fastapi, chanakya

Chanakya is a personal stock-portfolio intelligence platform — a system that watches a real, dual-market (US + India) portfolio, reasons about it through a pipeline of specialized agents, and proposes trades that a human has to approve before anything touches money. It's private: proprietary, personal-use only, no public repo. That's not a marketing constraint I'm working around here — it's the actual design center. Every architectural decision in this system assumes a single operator who is also the only person accountable if it's wrong, which is a very different set of incentives than a SaaS product optimizing for growth. That shows up everywhere: in how aggressively cost gets treated as a first-class engineering concern, in how conservatively money-moving code is gated, and in a sprint-and-retro discipline that reads more like an engineering journal than a roadmap.

503 commits, 15 shipped sprints (each closed with a written retro), 47 database migrations, roughly 66k lines of Python and 31k lines of TypeScript, built solo over about seven weeks. The interesting part isn't the line count — it's that the system was formally declared "feature-complete" once, at sprint 1, and then kept growing for 14 more sprints anyway, because a personal tool's real requirements only reveal themselves once you're actually using it daily.

The problem

Retail portfolio tools are single-lens: a charting app for price action, a separate news reader, a spreadsheet for fundamentals, a mental model for macro regime. Forming a considered view means manually synthesizing all of that, every time, for every symbol — and there's no natural place to encode "I don't trust this signal enough to act on it alone" versus "three independent signals agree, this is worth a closer look." Chanakya's bet is that an event-driven multi-agent pipeline can do that synthesis continuously and cheaply enough to run daily, as long as two things are true: the system has to stay honest about its own accuracy (a forecast that's just confidently wrong is worse than no forecast), and nothing with real financial consequence executes without a human in the loop.

Architecture

Layer	Choice	Why
Data ingestion	9 async collectors (SEC EDGAR, Alpaca, FRED, Finnhub, Yahoo/Google News RSS, Reddit/StockTwits)	Each source used where it's actually strongest — EDGAR for fundamentals (official XBRL), Alpaca for OHLCV, FRED for macro — rather than one vendor's mediocre coverage of everything
Event bus	Kafka (Redpanda)	Every agent is an independent consumer group on the same topics — restartable and horizontally scalable piece by piece, no central orchestrator to keep alive
Storage	TimescaleDB hypertables + pgvector	Hypertables for the genuinely time-series tables (`bars`, `indicators`, `macro_series`); plain Postgres for everything sparse (`corporate_actions`, `tickers`); pgvector for news-dedup and agent-memory embeddings
Agent runtime	13 agents, one `AIOKafkaConsumer` process, topic → handler map	Deliberately not arq yet — one process is sufficient at this scale, and the topic-per-agent structure is already shaped for that migration if it's ever needed
API	FastAPI + SQLAlchemy 2 (async)	REST + Server-Sent Events for job progress; every state-mutating, safety-gated endpoint requires session auth and CSRF
Frontend	Next.js 15, ECharts via a shared `ChartContainer`	Pure `build*Option()` functions per chart type, unit-tested without touching a canvas
LLM	Google Gemini, 3-tier routing (lite / flash / pro)	Each agent routes to the reasoning tier its task actually needs — a classification call and a judgment call are not the same job


Design decisions and tradeoffs

Every agent is a Kafka consumer group, not a call chain

The README's architecture diagram draws a clean linear arrow — News → Sentiment → Technical → ... → Grading — and it's easy to read that as a sequential pipeline. It isn't one. There's no orchestrator class calling agents in order; there's a single TOPIC_HANDLERS map (main.py) wiring 11 Kafka topics to handler functions, and each agent is its own consumer group subscribed to whatever topic feeds it. indicators.updated triggers TechnicalAgent. theses.generated triggers DecisionAgent, which emits decisions.proposed, which triggers RiskAgent, which emits decisions.approved, which triggers ExecutionAgent. The "pipeline" is an emergent property of topic names, not a control-flow object anywhere in the code — which means any agent can be killed, redeployed, or scaled independently without anyone else in the system noticing, and a crash in one doesn't cascade into the others' consumer offsets.
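
To make the "pipeline as topic names" idea concrete, here is a minimal in-memory sketch. Only the topic names and the TOPIC_HANDLERS name come from the description above; the handler bodies, the `trace_topics` helper, the placeholder symbol, and the use of an `asyncio.Queue` in place of Kafka are assumptions for illustration, not the project's code.

```python
import asyncio
from typing import Awaitable, Callable

# A handler consumes one event and returns the (topic, event) pairs it emits.
Handler = Callable[[dict], Awaitable[list]]


async def decision_agent(event: dict) -> list:
    # Stand-in for DecisionAgent: consumes a thesis, proposes a decision.
    return [("decisions.proposed", {**event, "proposed": True})]


async def risk_agent(event: dict) -> list:
    # Stand-in for RiskAgent: approves everything in this toy version.
    return [("decisions.approved", {**event, "risk_ok": True})]


async def execution_agent(event: dict) -> list:
    # Terminal stage: emits nothing further.
    return []


# The only "wiring" is this map. No function here calls another agent.
TOPIC_HANDLERS: dict = {
    "theses.generated": decision_agent,
    "decisions.proposed": risk_agent,
    "decisions.approved": execution_agent,
}


async def trace_topics(initial_topic: str, payload: dict) -> list:
    """Drain an in-memory queue and record the order in which topics fire."""
    queue: asyncio.Queue = asyncio.Queue()
    await queue.put((initial_topic, payload))
    seen = []
    while not queue.empty():
        topic, event = await queue.get()
        seen.append(topic)
        handler = TOPIC_HANDLERS.get(topic)
        if handler is None:
            continue  # nobody subscribed to this topic
        for out_topic, out_event in await handler(event):
            await queue.put((out_topic, out_event))
    return seen


order = asyncio.run(trace_topics("theses.generated", {"symbol": "EXAMPLE"}))
print(order)
```

Deleting one entry from the map removes that stage without touching any other handler, which is the property the paragraph above describes.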

Two resilience mechanisms sit under all of this: a Supervisor that restarts crashed asyncio tasks with exponential backoff, and a per-topic CircuitBreaker that trips open after a 50% failure rate over a 20-event sliding window and drops events for a cooldown period. Both exist because of a real incident — see below.
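
A sliding-window breaker of this shape can be sketched in a few lines. The 50% threshold and 20-event window come from the description above; the cooldown length, the choice to trip only once the window is full, the `>=` comparison, and the method names are assumptions for illustration.

```python
import time
from collections import deque


class CircuitBreaker:
    """Per-topic breaker: opens when the recent failure rate hits a threshold."""

    def __init__(self, window=20, threshold=0.5, cooldown_s=60.0, clock=time.monotonic):
        self._results = deque(maxlen=window)  # True means the event failed
        self._threshold = threshold
        self._cooldown_s = cooldown_s  # assumed value, not from the article
        self._clock = clock
        self._open_until = None

    def allow(self) -> bool:
        """Return False while the breaker is open (events should be dropped)."""
        return self._open_until is None or self._clock() >= self._open_until

    def record(self, failed: bool) -> None:
        """Record one outcome and trip the breaker if the window is unhealthy."""
        self._results.append(failed)
        window_full = len(self._results) == self._results.maxlen
        if window_full and sum(self._results) / len(self._results) >= self._threshold:
            self._open_until = self._clock() + self._cooldown_s
            self._results.clear()  # start a fresh window after the cooldown
```

Injecting the clock keeps the breaker testable without sleeping, which matters for a mechanism whose whole behavior is time-dependent.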

Not everything is event-triggered, either. Synthesis, Debate, Discovery, and the LLM forecast run on scheduled daily sweeps, not per-event — matching inference cadence to how markets actually move (most of what matters shows up once, in the pre-market window) rather than to how often an upstream topic happens to emit.

One agent, three roles: how the debate actually works

The README calls it a "Bull/Bear/Judge debate," which reads like three separate agents. It's actually one DebateAgent making three sequential Gemini calls inside a single tracked run:

[Diagram: a single DebateAgent run making three sequential Gemini calls (Bull, Bear, Judge).]

The Bull and Bear prompts each explicitly require an acknowledged weakness — "the judge will penalize blind optimism" — so the debate can't degenerate into two agents independently agreeing with the price action. The Judge runs on the higher-reasoning pro tier and is fed both stances plus the underlying data, and its judge_score becomes one of seven weighted signals — tied for the highest weight (0.20) alongside the technical signal — that a separate, purely deterministic DecisionAgent composes into a final conviction. If any of the three calls fails, no DebateRound is persisted at all, and DecisionAgent renormalizes its weights around the missing signal rather than defaulting it to a neutral 0.5 — a fix for a documented earlier bug where a silent default caused an "all-HOLD" failure mode.
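
The difference between renormalizing and defaulting is easy to see in a small sketch. Only the 0.20 weights for the technical and judge signals come from the text; the other five signal names and weights are placeholders chosen to sum to 1.0, and `compose_conviction` is a hypothetical name.

```python
from __future__ import annotations

# Placeholder weights: only "technical" and "judge" (0.20 each) are from the article.
WEIGHTS = {
    "technical": 0.20,
    "judge": 0.20,
    "signal_c": 0.15,
    "signal_d": 0.15,
    "signal_e": 0.10,
    "signal_f": 0.10,
    "signal_g": 0.10,
}


def compose_conviction(signals: dict[str, float | None]) -> float | None:
    """Weighted average over the signals that are actually present.

    A missing signal (None or absent) is dropped and the remaining weights
    are rescaled to sum to 1, instead of substituting a neutral 0.5.
    """
    present = {k: v for k, v in signals.items() if k in WEIGHTS and v is not None}
    total_weight = sum(WEIGHTS[k] for k in present)
    if total_weight == 0:
        return None  # nothing to compose from
    return sum(WEIGHTS[k] * v for k, v in present.items()) / total_weight


signals = {name: 0.8 for name in WEIGHTS}
signals["judge"] = None  # the debate failed, so no judge_score exists
print(compose_conviction(signals))
```

With six signals at 0.8 and the judge missing, renormalizing keeps the conviction at roughly 0.8, whereas a silent 0.5 default would drag it to about 0.74. Repeated across every symbol, that pull toward the middle is the kind of drift that can produce an "all-HOLD" outcome.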

The LLM proposes, deterministic code disposes

This is the split that actually matters for a system where a bad call costs real money: of the 13 agents, only 9 ever call an LLM at all. Sentiment, Decision, Risk, and Execution — the four agents closest to actually moving money — are pure deterministic Python. RiskAgent runs five hard checks against every proposed decision: single-name position weight cap (10%), gross exposure cap (100%), a correlation-penalty veto, sector/name concentration hard caps, and a drawdown gate against a Redis-persisted portfolio high-water-mark that force-halves position size at a 10% drawdown and fully blocks new risk at 15%. None of that is an LLM judgment call — it's arithmetic against hard thresholds, on purpose.
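
As an illustration of "arithmetic against hard thresholds," here is a sketch of the drawdown gate alone. The 10% and 15% thresholds are from the paragraph above; the function name, the inclusive comparisons, and passing the high-water mark as an argument (rather than reading it from Redis) are assumptions.

```python
def drawdown_size_multiplier(equity: float, high_water_mark: float) -> float:
    """Return the factor applied to a proposed position size.

    1.0 = no restriction, 0.5 = force-halve, 0.0 = block new risk entirely.
    """
    if high_water_mark <= 0:
        raise ValueError("high_water_mark must be positive")
    # Drawdown is measured from the peak; equity above the peak counts as zero.
    drawdown = max(0.0, 1.0 - equity / high_water_mark)
    if drawdown >= 0.15:
        return 0.0
    if drawdown >= 0.10:
        return 0.5
    return 1.0
```

For example, `drawdown_size_multiplier(89_000, 100_000)` returns 0.5 and `drawdown_size_multiplier(84_000, 100_000)` returns 0.0. There is no prompt and no model in the loop, so the same inputs always produce the same veto.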

ExecutionAgent sits behind an even harder boundary — a triple gate that must unanimously agree before real money moves:

def evaluate_gates(decision: Decision | None) -> LiveTradingDecision:
    settings = get_settings()
    gate_env = not settings.alpaca_paper        # CK_ALPACA_PAPER=false
    gate_flag = settings.live_trading_enabled    # runtime kill-switch
    gate_hitl = decision is not None and decision.human_approved_at is not None
    if gate_env and gate_flag and gate_hitl:
        return LiveTradingDecision(mode="live", ...)
    # any single gate failing → silent fallback to paper, never a silent "live" attempt
    return LiveTradingDecision(mode="paper", ...)  # fallback return assumed; elided in the excerpt

Toggling live_trading_enabled is deliberately not exposed over the API — the docstring calls this out directly, to "preserve the deliberate-action property of the gate." Every live-money attempt gets written to an audit log table; paper attempts don't, by design. This is the kind of asymmetry that only shows up when you've actually thought about what happens when the LLM is wrong.

The leakage guard and the neutral band

ForecastAgent — the newest agent, added in the sprint that also rebuilt the cost model — predicts a 3-class direction (up / down / flat, internally nicknamed Bull/Bear/Horse) for both a 1-day and 5-day horizon in a single Gemini call per symbol, about 109 calls each morning. The interesting engineering detail is how it prevents the single most common backtesting sin — using information that wouldn't have been available at decision time:

# LEAKAGE GUARD: never surface an article published after the call time.
news_rows = await session.execute(
    select(NewsArticle, SentimentScore)
    .where(
        NewsArticle.symbol == sym,
        NewsArticle.published_at <= now,   # LEAKAGE GUARD
        NewsArticle.published_at >= since,
    )
)

now is the caller-supplied decision timestamp, not datetime.now() — so replaying a historical decision for backtesting or grading uses exactly the news that existed at that moment, not news from the future. This is asserted directly by a dedicated test (test_forecast_agent_leakage.py) that plants future-dated articles and confirms they never enter the context.

The "flat" class isn't a fixed ±0.5% dead zone either — it's volatility-scaled per symbol and horizon, computed deterministically (never by the LLM):

def neutral_band(atr: float | None, close: float, horizon: int) -> float:
    atr_pct = (atr / close) if (atr is not None and close > 0) else 0.015
    return _clamp(_BAND_K * atr_pct * math.sqrt(horizon), _BAND_FLOOR, _BAND_CEIL)

A volatile small-cap gets a wider "flat" band than a stable large-cap, and grading against the eventual actual close uses the same band the forecast was made against, so a call is only wrong if it missed the class that was actually in play. The scorecard that grades this (3×3 confusion matrix, per-class precision/recall, and a Gorodkin multiclass MCC) is deliberately not raw accuracy: on a skewed base rate, accuracy flatters a constant predictor, while balanced accuracy falls to chance (1/3 for three classes) and MCC to zero for a model that always guesses "flat." That anti-gameability property is asserted in the test suite rather than checked by inspection: an always-flat predictor scores balanced_accuracy ≈ 1/3 and MCC = 0.
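
The always-flat check can be reproduced with a short, dependency-free sketch. The formulas are the standard balanced accuracy (mean per-class recall) and the multiclass MCC computed from a confusion matrix; the function names, the 20/20/60 class split, and the convention of returning 0 when the denominator vanishes are assumptions for illustration.

```python
import math


def balanced_accuracy(cm):
    """Mean per-class recall. cm[i][j] = count of true class i predicted as j."""
    recalls = [row[i] / sum(row) for i, row in enumerate(cm) if sum(row) > 0]
    return sum(recalls) / len(recalls)


def multiclass_mcc(cm):
    """Multiclass Matthews correlation coefficient from a confusion matrix."""
    n = len(cm)
    s = sum(sum(row) for row in cm)                           # total samples
    c = sum(cm[k][k] for k in range(n))                       # correct predictions
    t = [sum(cm[k]) for k in range(n)]                        # true count per class
    p = [sum(cm[i][k] for i in range(n)) for k in range(n)]   # predicted count per class
    numerator = c * s - sum(pk * tk for pk, tk in zip(p, t))
    denominator = math.sqrt(
        (s * s - sum(pk * pk for pk in p)) * (s * s - sum(tk * tk for tk in t))
    )
    # A constant predictor makes the denominator zero; report 0 by convention.
    return numerator / denominator if denominator else 0.0


# Class order: up, down, flat. True split 20/20/60, every prediction is "flat".
always_flat = [
    [0, 0, 20],
    [0, 0, 20],
    [0, 0, 60],
]
raw_accuracy = sum(always_flat[k][k] for k in range(3)) / 100
print(raw_accuracy, balanced_accuracy(always_flat), multiclass_mcc(always_flat))
```

On this made-up split, raw accuracy is 0.6 while balanced accuracy is 1/3 and MCC is 0, which is the gap the scorecard is designed to expose.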

Feedback loops that close over time

Four systems keep the agents from running forever on frozen heuristics:

Model promotion gate — a new prediction model trains every Sunday night, is scored on a holdout split, and only replaces the live champion model if it beats the incumbent on AUC × hit-rate × Brier score. A worse retrain silently stays a candidate; the live model is untouched. (This gate exists because of a specific prior bug: an earlier version let a worse retrain silently replace a better incumbent — a real risk when the model is informing live trades.)
Calibration scorecard — walks every prediction past its target time, computes the realized return, and assigns an A–F grade per confidence quantile band, so the dashboard shows whether the model's stated confidence intervals are actually honest.
Regime-conditional weight matrix — 5 market regimes × 7 agent signals = 35 weight cells. DecisionAgent multiplies a regime prior against a Bayesian per-agent posterior before composing the final conviction, and every closed trade updates the cell for the regime it was opened in — the matrix genuinely learns over time rather than being hand-tuned once.
HITL meta-classifier — trains a LightGBM model on (decision features → was the human's override actually correct), using the same feature vector DecisionAgent itself used. Every decision response now carries an override_confidence and a recommendation (trust_agent / consider_override / override_likely_correct) next to the approve/reject buttons — the system learning when to trust its own operator's gut versus its own signal.
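
The first of these, the promotion gate, can be sketched as a single comparison. The article does not say how the three metrics are combined, so the composite below is one possible reading and is an assumption: because a lower Brier score is better, it enters the product as `1 - brier`. The class and function names are also hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HoldoutMetrics:
    auc: float       # higher is better
    hit_rate: float  # higher is better
    brier: float     # lower is better, in [0, 1] for binary outcomes


def composite(metrics: HoldoutMetrics) -> float:
    # ASSUMPTION: one way to fold three metrics into a single comparable score.
    return metrics.auc * metrics.hit_rate * (1.0 - metrics.brier)


def should_promote(candidate: HoldoutMetrics, champion: HoldoutMetrics) -> bool:
    """Promote only on a strict improvement; ties leave the champion in place."""
    return composite(candidate) > composite(champion)
```

The important property is the default: when the comparison fails, nothing changes and the live model is never overwritten by a retrain that merely finished.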

Hard problems, solved

Rethinking the inference cadence

A recurring finding from actually running the system daily: hourly, per-symbol sweeps on the strongest model tier were architectural waste, not something a smaller tuning knob could fix. The signals that matter mostly settle once a day, in the pre-market window — polling for them hourly just re-asks a question whose answer hasn't changed. That realization led to a coordinated redesign rather than a series of small patches:

Lever	Before	After
Sweep cadence	Hourly, per-symbol	Once daily, pre-market window per region, plus a startup-catchup pass
Model tier	Mostly the strongest tier by default	Tiered: lightweight for classification-style calls, mid-tier for extraction, strongest tier reserved for judgment calls (the debate's Judge, Synthesis, Macro)
Generation parameters	One-size-fits-all defaults	Reasoning depth configured per call type, so a structured classification call isn't paying for the same open-ended deliberation a judgment call needs
Embeddings	Hosted API, 768-dim	Local `sentence-transformers` running on-device (Apple Silicon MPS), 384-dim — no external dependency for a high-volume, low-value-per-call workload
Repeat work	Every restart re-runs everything	SHA-256 input-hash dedup, 24h window
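
The tiering row can be pictured as a lookup table. The tier labels (lite / flash / pro) and the three kinds of work come from this article; pairing lite with classification and flash with extraction, the call-type strings, and the `route` helper are assumptions.

```python
from enum import Enum


class Tier(str, Enum):
    LITE = "lite"
    FLASH = "flash"
    PRO = "pro"


# Hypothetical call-type names, mapped to the cheapest tier assumed to suffice.
TIER_BY_CALL_TYPE = {
    "classification": Tier.LITE,
    "extraction": Tier.FLASH,
    "judgment": Tier.PRO,
}


def route(call_type: str) -> Tier:
    """Pick a tier for a call type; unknown types fail loudly."""
    try:
        return TIER_BY_CALL_TYPE[call_type]
    except KeyError:
        raise ValueError(f"no tier configured for call type {call_type!r}") from None
```

Raising on an unknown call type, rather than falling back to the strongest tier, is a deliberate choice in this sketch: a silent default to the most expensive option is how "mostly the strongest tier" tends to come back.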

The dedup mechanism deserves its own callout because of how directly it's verified — not "restarts should be cheaper now" but a test that asserts an LLM mock is called exactly once across two identical requests:

dedup_window_seconds: int = 86400   # class attribute on Agent

if not force and self.dedup_window_seconds > 0:
    cached = await self._find_cached_run(input_hash)
    if cached is not None:
        return await self._return_cached_result(cached, input_hash)

A cache hit writes a status='cached' marker row pointing back at the original run rather than making a fresh call, and downstream orchestration explicitly skips re-emitting Kafka events on a cache hit — so a restart doesn't cascade into re-triggering DecisionAgent for work that already happened. That's as much a reliability property as an efficiency one: the system is safe to restart mid-day without duplicating effort or double-counting anything downstream. Together, the cadence change, the tiered routing, and the dedup layer brought daily inference load down by a substantial, multi-fold margin — without touching a single line of the alpha-generating logic itself.
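
A self-contained version of the same idea, with the "called exactly once" property visible, might look like the sketch below. The SHA-256 input hash, the 24-hour window, the `force` flag, and the `cached` status come from the text; the canonical-JSON hashing, the in-memory dict standing in for the database, and the function names are assumptions.

```python
import hashlib
import json


def input_hash(payload: dict) -> str:
    """SHA-256 over a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_with_dedup(payload, call_llm, cache, now, window_seconds=86400, force=False):
    """Return a cached result if an identical input ran inside the window."""
    key = input_hash(payload)
    hit = cache.get(key)
    if not force and window_seconds > 0 and hit is not None:
        created_at, result = hit
        if now - created_at < window_seconds:
            # Cache hit: no LLM call, and the caller can skip re-emitting events.
            return {"status": "cached", "result": result}
    result = call_llm(payload)
    cache[key] = (now, result)
    return {"status": "ok", "result": result}
```

A caller that checks `status == "cached"` before publishing downstream events gets the restart-safety property described above: identical inputs within the window produce neither a second model call nor a second event.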

A single failed call that could drain a day's budget

The token-budget guard reserves tokens optimistically before a call and reconciles after. The refund() method exists because of a bug that was actually observed in production, documented directly in the code:

"Without this, every failed call permanently subtracts expected_tokens from the daily budget — a 404 storm or a flaky network drains the counter to zero in minutes (observed: 1,008,022 tokens 'used' at process start with no successful calls)."

A million tokens of phantom spend before a single real call succeeded. The fix is small — refund the reservation on any failure path — but finding it required actually watching the system misbehave in a way that pure code review wouldn't have caught, since the bug only manifests under the specific failure pattern of many rapid, real failures early in a process's life.
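
The reserve / reconcile / refund cycle can be sketched as follows. Only `refund()` and `expected_tokens` are named in the article; the class name, the other method names, and the `guarded_call` wrapper are assumptions for illustration.

```python
class TokenBudget:
    """Daily token counter with optimistic reservation."""

    def __init__(self, daily_limit: int):
        self.daily_limit = daily_limit
        self.used = 0

    def reserve(self, expected_tokens: int) -> bool:
        """Claim tokens up front; refuse if the claim would exceed the limit."""
        if self.used + expected_tokens > self.daily_limit:
            return False
        self.used += expected_tokens
        return True

    def reconcile(self, expected_tokens: int, actual_tokens: int) -> None:
        """Replace the estimate with what the call really consumed."""
        self.used += actual_tokens - expected_tokens

    def refund(self, expected_tokens: int) -> None:
        """Give the reservation back when the call never succeeded."""
        self.used = max(0, self.used - expected_tokens)


def guarded_call(budget: TokenBudget, expected_tokens: int, call):
    """Run `call` (which returns the tokens it used) under the budget guard."""
    if not budget.reserve(expected_tokens):
        raise RuntimeError("daily token budget exhausted")
    try:
        actual_tokens = call()
    except Exception:
        budget.refund(expected_tokens)  # the fix: failures must not leak tokens
        raise
    budget.reconcile(expected_tokens, actual_tokens)
    return actual_tokens
```

Without the `refund` line in the `except` branch, a burst of failing calls would each leave their reservation behind, which matches the phantom-spend pattern quoted above.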

211 consecutive failures and the missing circuit breaker

Related but distinct: once the daily budget or every model tier's rate limit is genuinely exhausted, the per-article NewsAgent stream — which fires once per incoming article, potentially dozens per hour — was spawning one failing run per article rather than backing off, observed directly as 211 consecutive failures in a single window. The fix is a third layer of budget protection, on top of the per-call reservation and per-tier rate limiting: a 15-minute cooldown that trips the moment any run fails with a budget-related error, checked before dispatch so the news stream stops hammering an already-exhausted budget instead of generating pure wasted CPU and log noise.
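
A pre-dispatch cooldown of this kind is small enough to sketch in full. The 15-minute duration and the "check before dispatch" placement come from the paragraph above; `BudgetExhaustedError`, `BudgetCooldown`, and `process_articles` are hypothetical names, and the loop swallows exceptions only to keep the example short.

```python
import time


class BudgetExhaustedError(Exception):
    """Hypothetical marker for budget or rate-limit exhaustion."""


class BudgetCooldown:
    def __init__(self, cooldown_s: float = 15 * 60, clock=time.monotonic):
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._blocked_until = None

    def should_dispatch(self) -> bool:
        """False while the cooldown is active."""
        return self._blocked_until is None or self._clock() >= self._blocked_until

    def record_failure(self, error: Exception) -> None:
        """Only budget-related failures start the cooldown."""
        if isinstance(error, BudgetExhaustedError):
            self._blocked_until = self._clock() + self._cooldown_s


def process_articles(articles, run_agent, cooldown: BudgetCooldown) -> int:
    """Dispatch one run per article unless the cooldown is active."""
    dispatched = 0
    for article in articles:
        if not cooldown.should_dispatch():
            continue  # skip cheaply instead of spawning a run that must fail
        dispatched += 1
        try:
            run_agent(article)
        except Exception as exc:  # a real handler would also log the error
            cooldown.record_failure(exc)
    return dispatched
```

With an exhausted budget, a burst of 211 articles produces one failing run and 210 cheap skips in this sketch, instead of 211 failures.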

A stale chunk interval nobody had revisited

TimescaleDB hypertables were created with 7-day chunks: reasonable for high-frequency data, wrong for once-a-day OHLCV bars. Daily data puts about 5 rows per symbol in a 7-day chunk, so 20 years of history fans out to roughly 1,000 chunks, and an unfiltered query needs a lock per chunk it might touch. That was the actual reason max_locks_per_transaction had been raised to 512 earlier in the project: a symptom that had been patched without anyone revisiting the cause. Retuning to 365-day chunks collapsed that to about 20 chunks for the same 20 years of history. Because set_chunk_time_interval only governs future chunks, the migration was sequenced to run before the daily seed step: a fresh database gets the coarser chunking from row one, while an already-running database keeps its legacy 7-day chunks until they age out naturally.
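
The chunk counts follow from simple division, shown here as a back-of-the-envelope sketch. The 7-day and 365-day intervals and the 20-year span are from the paragraph above; the helper name and the 365-day year are simplifying assumptions.

```python
import math


def chunk_count(history_days: int, chunk_days: int) -> int:
    """Number of time chunks needed to cover a span of history."""
    return math.ceil(history_days / chunk_days)


TWENTY_YEARS_DAYS = 20 * 365  # ignores leap days for simplicity

weekly_chunks = chunk_count(TWENTY_YEARS_DAYS, 7)     # 1043
yearly_chunks = chunk_count(TWENTY_YEARS_DAYS, 365)   # 20
print(weekly_chunks, yearly_chunks)
```

Since an unfiltered query may need a lock per chunk, the lock demand shrinks by the same factor of roughly 50.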

A laptop running at 180°F

Not every hard problem is subtle. Running the full 13-process fleet with 10 separate Docker images was thermally punishing enough to become an explicit sprint theme, root-caused to two compounding issues: no per-container CPU ceiling, and no cap on native thread pools (OpenBLAS/OMP/numba each independently deciding to use every core). The fix collapsed 10 Python images into a single shared base image — build time dropped from roughly 30 minutes to 140 seconds, image size from ~19 GB to ~2 GB, and roughly 100 GB of accumulated Docker disk got reclaimed in the same pass. A companion fix (lazy intraday seeding instead of eagerly backfilling every interval on setup) cut first-run setup time from ~30 minutes to ~4.2 minutes — a 7× improvement that came from questioning whether the eager work was ever necessary, not from making the eager work faster.
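
For the thread-pool half of the root cause, one common way to cap native pools is through environment variables set before the numeric libraries are imported. The three variable names below are the standard ones for OpenBLAS, OpenMP, and numba; the helper name and the default of 2 threads are assumptions, and the article does not say this is how the project applied its cap.

```python
import os

# Standard environment variables read by OpenBLAS, OpenMP runtimes, and numba.
_THREAD_ENV_VARS = ("OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS", "NUMBA_NUM_THREADS")


def cap_native_threads(max_threads: int = 2) -> dict:
    """Limit native thread pools for this process.

    Must run before numpy, scipy, or numba are imported, because those
    libraries typically read these variables once at import time.
    """
    if max_threads < 1:
        raise ValueError("max_threads must be at least 1")
    applied = {}
    for name in _THREAD_ENV_VARS:
        os.environ[name] = str(max_threads)
        applied[name] = os.environ[name]
    return applied
```

Calling `cap_native_threads()` at the very top of a service entrypoint keeps each process from independently claiming every core. The per-container CPU ceiling is a separate control set at the container runtime level.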

What I'd do differently

The project is candid with itself about remaining debt, and independent review of the current code surfaces a few more:

Read-only reasoning transcripts are unauthenticated. GET /agents/runs/{id} — which returns the full LLM message transcript for any agent run — has no get_current_user dependency, unlike every mutating endpoint in the same file. A prior sprint explicitly closed the same auth gap on /portfolio and /decisions reads; this one was missed. Worth closing, even for a single-operator system, since exposed reasoning transcripts are still exposed data.
CSRF fails open under Redis outage. The double-submit CSRF check is Redis-backed and explicitly documented as fail-open if Redis is unreachable — the same policy applied to the drawdown risk gate ("never block on infra unavailability"). That's a defensible tradeoff for solo use, but it's worth being honest that it is a tradeoff, not a guarantee.
A few docs and diagrams have drifted from the code. The README's architecture diagram still shows OpenTelemetry tracing wired across API, workers, collectors, and agents; in the current code it's only wired in the API service. The README's "Discovery → Macro → News → Trader" terminology and its /review page were both swept away by later refactors without the diagram catching up. A local-only Astro documentation site (33 pages, 12 real dashboard screenshots) is genuinely well-built, but by its own README's admission, deliberately never deployed — and deliberately withholds the specific agent reasoning and sizing math, on the reasoning that "those are Chanakya's edge."
The rebrand's own rationale isn't documented. The project shipped for most of its life as "SuperWealth" before a same-sprint rename to "Chanakya" — mechanically well-documented (an env-prefix change across 83 files, a CLI rename, a Redis-namespace migration, all done as one atomic commit specifically because a partial rename would break the pre-commit type-check), but there's no ADR or README note explaining why that name. Worth writing down, if only so future-me remembers.

None of these are severe for a system whose entire threat model is "one operator, personal use" — but writing them down here is the same discipline the project already applies to itself in every sprint retro: the honest gap is more useful than the confident gloss.

The growth story: sprint discipline as the actual product

The most telling data point isn't a feature, it's the test count over time: 299 → 340 → 360 → ~1,004 at the point the project first declared itself "feature-complete" (Sprint 1, ~53 atomic tasks, May 2026) → 972 → 830 backend / 175 frontend → 210 frontend alone → 1,128 backend / 336 frontend → 681+ backend / 376 frontend today, after later suites were deliberately re-scoped to exclude slow DB-integration tests from the fast local loop. A monotonic count would suggest feature creep; this trajectory — including the deliberate contractions — reads more like an engineer who kept the suite honestly load-bearing rather than optimizing for a vanity number.

Every sprint close carries the same shape: a quality gate (ruff clean · pyright 0 errors · biome clean · tsc clean, plus the test counts) and a written retro with what actually went right or wrong. A few of those retros are worth quoting directly, because they're the kind of insight that only comes from actually shipping something and watching it break:

"Preference wiring exposed the 'dead code' pattern. Every agent that ignored its configured prefs was silently computing work and discarding it."

"Lean-by-default was the highest-leverage change: the platform ran 14 heavy services for a dashboard that needs 3."

"Not done — external blockers, not deferred negligence: IBKR live account, Twitter API key, multi-user rollout, mobile UI, sandboxed scripts. These require external decisions, not code."

That last one matters for reading Chanakya's roadmap honestly. plans/ holds six deferred efforts, among them Interactive Brokers as a second execution venue, Twitter/X sentiment (gated on a cost-benefit ADR against a materially pricier API tier), a mobile-responsive overhaul, sandboxed user-defined strategy scripts, and full multi-user isolation (~28 router gates, 7 new foreign keys, CSRF/CORS hardening). Every one of them is gated on an explicit external trigger, not "got around to it eventually." Multi-user rollout waits on a second user actually asking for access. IBKR waits on Alpaca live trading being proven first with real money. That's a roadmap shaped by actual demand signals rather than a backlog grooming exercise.

Stack at a glance


Backend	Python 3.12, FastAPI, SQLAlchemy 2 (async), Alembic — 47 migrations
Data layer	TimescaleDB (hypertables + compression), pgvector (384-dim, local embeddings)
Event bus	Kafka via Redpanda — 13 agents, each an independent consumer group
Agents	13 wired (+ 1 pure-function scorer), only 9 ever call an LLM
LLM	Google Gemini, 3-tier routing (lite / flash / pro), per-call reasoning-depth tuning
Frontend	Next.js 15, TypeScript, ECharts via pure `build*Option()` functions
Execution	Alpaca (paper by default), triple-gated live trading
Tests	681+ backend (pytest), 376 frontend (vitest), Playwright e2e
CI	GitHub Actions — ruff, pyright, pytest, biome, tsc, vitest, gated e2e
Observability	OpenTelemetry (API), Prometheus + Grafana, Jaeger — opt-in stack
License	Proprietary, personal-use only

More deep dives in this series: SuperZen, where a single 1Hz heartbeat replaces a pile of independent timers for the same reason Chanakya's DecisionAgent renormalizes around a missing signal instead of defaulting it — one disciplined mechanism beats a special case bolted onto every caller; and BroSki, a Rust task runner built around the same instinct that shaped Chanakya's budget-refund fix — that a system should be able to explain exactly why it did (or didn't) do something, not just that it did it.