AgentForge Clinical Co-Pilot — System Visibility

Live introspection into the deployed agent: corpus, supervisor routing, eval coverage, recent decisions, and a hybrid-retrieval inspector. Companion view: /adversarial (W3 adversarial platform — coverage, vuln pipeline, recent campaigns).

Supervisor → worker graph

                      ┌──────────────┐
      user message →  │  supervisor  │ ← hops counter (cap: 5)
                      └──┬────┬───┬──┘
                         │    │   │
            attachment?  │    │   │  guideline-trigger token?
                         │    │   │  (and no evidence yet)
         ┌───────────────┘    │   └────────┐
         ▼                    │            ▼
  ┌──────────────┐            │  ┌────────────────────┐
  │   intake_    │            │  │ evidence_retriever │
  │  extractor   │            │  │ (BM25+dense+rerank)│
  │   (vision)   │            │  └─────────┬──────────┘
  └──────┬───────┘            │            │
         └────→ supervisor (re-route) ←────┘
                              │
                              ▼ default
                      ┌──────────────┐
                      │ answer node  │ ← W1 orchestrator (verifier + retry)
                      └──────────────┘

Routing rules (deterministic, in evaluation order)

Rule | Test | Decision | Rationale
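The routing order drawn in the graph can be sketched roughly as follows. This is a minimal illustration, not the deployed supervisor: the `TurnState` fields, the `GUIDELINE_TRIGGERS` tokens, and the choice to fall through to the answer node once the hop cap is reached are all assumptions made for the example. Only the node names, the rule order, and the cap of 5 come from this page.

```python
from dataclasses import dataclass

MAX_HOPS = 5  # hop cap shown in the graph

# Hypothetical trigger tokens; the real list is not shown on this page.
GUIDELINE_TRIGGERS = ("guideline", "screening", "recommend")


@dataclass
class TurnState:
    message: str
    has_attachment: bool = False  # an attachment is still waiting for extraction
    has_evidence: bool = False    # evidence_retriever has already run this turn
    hops: int = 0


def route(state: TurnState) -> str:
    """Return the next node name, checking the rules in a fixed order."""
    if state.hops >= MAX_HOPS:
        return "answer"  # assumption: at the cap, stop re-routing and answer
    if state.has_attachment:
        return "intake_extractor"
    text = state.message.lower()
    if not state.has_evidence and any(t in text for t in GUIDELINE_TRIGGERS):
        return "evidence_retriever"
    return "answer"  # default branch


def run_turn(state: TurnState) -> list[str]:
    """Follow routing decisions until the answer node is reached."""
    path = []
    while True:
        node = route(state)
        path.append(node)
        if node == "answer":
            return path
        # Each worker hands control back to the supervisor (one hop).
        if node == "intake_extractor":
            state.has_attachment = False
        elif node == "evidence_retriever":
            state.has_evidence = True
        state.hops += 1
```

Because every rule is a plain predicate over the turn state, the same input always yields the same path, which is what makes the decisions easy to log and replay.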

Recent supervisor decisions

In-memory ring buffer (last 20). Cleared on agent restart.
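A buffer with this behavior could be sketched with a bounded `collections.deque`. The field names mirror the table columns; the function names and the 80-character preview length are assumptions for illustration only.

```python
from collections import deque
from datetime import datetime, timezone

# A deque with maxlen drops the oldest entry automatically once it is full.
# Being process-local, it is empty again after a restart.
RECENT_DECISIONS = deque(maxlen=20)


def record_decision(decision, hops, has_attachment, has_evidence, message):
    """Append one supervisor decision to the ring buffer."""
    RECENT_DECISIONS.append({
        "when": datetime.now(timezone.utc).isoformat(),
        "decision": decision,
        "hops": hops,
        "has_attachment": has_attachment,
        "has_evidence": has_evidence,
        "message_preview": message[:80],  # preview length is an assumption
    })


def recent_decisions():
    """Return the buffered decisions, newest first, for display."""
    return list(reversed(RECENT_DECISIONS))
```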

When | Decision | Hops | Has attachment | Has evidence | Message preview

Live retrieval inspector

Type any clinical question and see what the hybrid retriever returns BEFORE the LLM sees it. Same retriever the agent uses on every chat turn.

Architecture

  • BM25 over the corpus text — keyword recall layer, dependency-free, in-memory.
  • Voyage voyage-3 for dense embeddings — semantic recall layer. Optional; degrades gracefully to BM25-only if VOYAGE_API_KEY is unset.
  • Cohere Rerank 3 over the BM25 ∪ dense union — final precision layer. Optional; if missing, the retriever fuses upstream scores via reciprocal rank.
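The fallback path described above, where upstream rankings are fused by reciprocal rank when no reranker is available, can be sketched as below. This is an illustration under stated assumptions: `bm25_search`, `dense_search`, and `rerank` are hypothetical callables standing in for the real BM25, Voyage, and Cohere integrations, and `k=60` is a commonly used smoothing constant rather than a value taken from this system.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of chunk IDs: score(d) = sum of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for stable output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def hybrid_retrieve(query, bm25_search, dense_search=None, rerank=None, top_k=5):
    """Hypothetical wiring of the three layers with graceful degradation."""
    rankings = [bm25_search(query)]           # keyword layer always runs
    if dense_search is not None:              # e.g. skipped when no API key is set
        rankings.append(dense_search(query))
    candidates = reciprocal_rank_fusion(rankings)  # union of both result sets
    if rerank is not None:
        # With a reranker, the fused order only supplies the candidate set.
        return rerank(query, candidates)[:top_k]
    return candidates[:top_k]
```

Reciprocal rank fusion uses only rank positions, so BM25 scores and embedding similarities never need to be put on a common scale before they are combined.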

Per-category coverage

Category | Cases | Target | Baseline

Per-rubric baseline rate

Rubric | Pass rate

Clinical guideline corpus

Hand-curated chunks drawn from USPSTF, ADA, ACIP, ACC/AHA, and CDC guidelines.

Source | Year | Title | Chunk ID

Selected chunk

Click a row above to inspect the full text.