Building a biomedical evidence agent around audit trails • Andrew Zhang

I do not want bio-agent to be a biomedical chatbot with citations pasted onto the end. The useful version is an evidence workflow. An answer has to move through controlled retrieval, structured extraction, packet construction, citation audit, logic audit, revision, provenance, and eval before I trust it.

The central boundary is between context that helps the workflow and evidence that can support a biomedical claim. Memory, reviewer notes, saved papers, Obsidian exports, and model drafts can guide the process. They cannot become evidence on their own. A claim earns its place only after the system ties it to retrieved papers, extracted spans, and audit records.

Design thesis

I care less about how confident the agent sounds and more about whether its claims can be replayed, challenged, narrowed, exported, and evaluated. The tools are checkpoints, not decorative capabilities.

01 Boundary

Refuse clinical advice before tools run, and keep memory outside the evidence channel.

02 Packet

Compress retrieval, extraction, coverage, gaps, and warnings into one contract.

03 Audit

Check claims, revise overreach, export provenance, and gate the release with eval.

Tools as checkpoints

Serious steps return IDs, warnings, traces, and recoverable failure states.

Memory as context

Project memory can shape preferences, but it can never become citation support.

Mock as discipline

Deterministic mock data makes release gates repeatable before live PubMed is trusted.

Math as advisory

Selection and retrieval advice can guide the workflow, but policy still wins.

Tool grammar

The public tools read like a lab protocol: plan, retrieve, extract, cover, packet, answer, audit, revise, watch, export. A reviewer can stop at any verb and inspect what happened.

Evidence OS

The project is not trying to make one magic answer endpoint. It is closer to a small operating layer for biomedical evidence artifacts: papers, spans, claims, packets, audits, traces, decisions, and exports.

Reviewer first

The intended user is the person who asks where a claim came from, why a paper was skipped, and what would fail if the run were repeated tomorrow.

Chapter I · The Boundary

Do not let the agent answer before the evidence exists

I built this on top of akashic-agent, which already gives me a plugin lifecycle, tool registration, memory, proactive workflow, and FastAPI dashboard surface. The biomedical layer uses that framework, but it keeps trust out of the general agent loop. The plugin owns its clinical boundary, retrieval policy, tool guards, schemas, storage, audit trail, and dashboard views.

The first gate is deliberately conservative: clinical or patient-specific prompts are refused before memory, retrieval, LLM calls, export, or provenance work can run. A "helpful" model can turn a research assistant into an unsafe advice system very quickly. Refusal is not a final polish step here. It is a pre-tool policy decision.

The guardrail lives outside prompt text. The plugin uses the host framework's @on_tool_pre hook to deny clinical requests, apply source policy, cap oversized calls, validate project IDs, and inject safe defaults before a tool executes. The safety boundary lives where side effects happen.
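A minimal sketch of what a pre-tool guard of this kind might look like. Everything here is an illustrative assumption: the function name `pre_tool_guard`, the `GuardDecision` shape, the keyword patterns, the cap value, and the `clinical_boundary_refused` code are hypothetical. How the real plugin wires its logic into the host's @on_tool_pre hook is not shown in this post.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Illustrative patterns only; a real clinical classifier would be broader.
CLINICAL_PATTERNS = [
    r"\bshould i (take|stop|start)\b",
    r"\bmy (patient|dose|diagnosis|symptoms)\b",
]
MAX_RESULTS_CAP = 50      # assumed cap for this sketch
DEFAULT_SOURCE = "mock"   # deterministic mock data is the default source


@dataclass
class GuardDecision:
    allowed: bool
    args: dict
    error_code: Optional[str] = None
    warnings: list = field(default_factory=list)


def pre_tool_guard(args: dict) -> GuardDecision:
    """Runs before any tool: refuse clinical prompts, then normalize arguments."""
    query = str(args.get("query", "")).lower()
    if any(re.search(pattern, query) for pattern in CLINICAL_PATTERNS):
        # Refuse before retrieval, memory, or LLM calls can happen.
        return GuardDecision(False, {}, error_code="clinical_boundary_refused")

    safe_args = dict(args)
    warnings = []
    safe_args.setdefault("source", DEFAULT_SOURCE)  # inject a safe default
    if safe_args.get("max_results", 0) > MAX_RESULTS_CAP:
        safe_args["max_results"] = MAX_RESULTS_CAP  # cap oversized calls
        warnings.append("max_results capped at %d" % MAX_RESULTS_CAP)
    return GuardDecision(True, safe_args, warnings=warnings)
```

The point of the shape is that the guard returns a decision object rather than raising, so a refusal can still be reported as structured state.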

The second gate is source control. The default source is deterministic mock data, so CI, demos, and evaluation can run without keys and without a changing literature index. Live PubMed retrieval is opt-in through NCBI E-utilities. Always hitting live APIs sounds more impressive, but it makes demos and tests harder to trust. A reviewer should be able to clone the repo, run it, and see the same evidence path.

research question → clinical boundary → controlled retrieval → retrieval manifest → evidence spans

The retrieval manifest does most of the quiet work at this stage. A search result is not a loose list of papers. It records the source, original query, compiled query, API parameters, pagination, result counts, warnings, stored paper IDs, and returned paper IDs. Retrieval becomes inspectable infrastructure instead of a hidden prompt side effect.
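One way to picture the manifest is as an immutable record with a content-derived ID. This is a sketch under stated assumptions: the field names follow the list in the paragraph above, but the `RetrievalManifest` class, the `man_` prefix, and the hashing scheme are hypothetical and not taken from the project.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RetrievalManifest:
    source: str
    original_query: str
    compiled_query: str
    api_params: dict
    page: int
    page_size: int
    total_results: int
    stored_paper_ids: tuple
    returned_paper_ids: tuple
    warnings: tuple = ()

    def manifest_id(self) -> str:
        """Derive a stable ID from the manifest content (assumed scheme)."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return "man_" + hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

With a content-derived ID, two runs that issue the same query against the same deterministic source produce the same manifest ID, which is what makes retrieval repeatability checkable.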

check_literature_access is not a shallow ping. It exercises search, manifest creation, paper persistence, and abstract coverage. A live source counts as ready only when the evidence path is ready. The dashboard, CI smoke tests, and agent tools can all ask the same question: can this source actually support the workflow?

The quiet design choice:

I do not treat "online" as enough. A source is useful only if it can produce artifacts that survive review: manifest IDs, stored papers, warnings, abstract coverage, and a repeatable fallback story. Deterministic mock data stays in the release path because it is the regression harness, not a toy source.

Chapter II · The Evidence Packet

Turn retrieval into a compact contract the model cannot bypass

Release 1.0 breaks the biomedical workflow into tools. The system no longer hides the evidence pipeline behind one large "answer" action. It exposes independently callable tools for multi-pass literature search, batch extraction, coverage-gap analysis, evidence-packet construction, answer trace lookup, Obsidian export, and provenance graph export.

run_multi_pass_literature_search

Plans focused searches, executes bounded passes, dedupes papers, and records manifests.

extract_evidence_batch

Turns papers into claim, entity, method, limitation, confidence, and evidence-span objects.

analyze_coverage_gaps

Marks subquestions as covered, weak, conflicted, missing, or source-limited.

build_evidence_packet

Selects a compact packet that synthesis must use, with selected and dropped evidence trace.

get_answer_trace

Replays classify, plan, retrieve, extract, packet, draft, audit, revise, and finalize steps.

export_provenance_graph

Exports a PROV/OpenLineage-style graph while redacting prompts, secrets, and raw provider output.

I split the workflow into these tools because scientific software needs interruptibility. If the answer looks wrong, I do not want to debug one long prompt. I want to ask: did the planner miss a subquestion, did retrieval stop too early, did extraction lose the limitation, did packet selection drop a conflict, or did synthesis overclaim after audit? Each tool boundary gives me one place to inspect and one place to test.

The tool names also carry a product idea. The agent is allowed to be helpful, but the workflow has to be legible enough for a dashboard, a CLI, a test, or another agent to drive the same service layer. Project tools such as record_project_paper_decision and watch tools such as watch_research_topic sit beside the core evidence tools. The result feels less like a single chat and more like a research desk.

The middle layer starts with a planner. One research question is decomposed into focused retrieval subquestions: background, support, refute, mechanism, limitation, and recent evidence. Each subquestion is executed through the controlled search_literature tool, deduped by stable paper ID, and tied back to a retrieval manifest.
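The planning and dedupe step can be sketched roughly as follows. The six subquestion kinds come from the paragraph above; `plan_subquestions`, `run_passes`, the query template, and the shape of the search result are assumptions made for illustration, and `search` is a stand-in callable rather than the real search_literature tool.

```python
SUBQUESTION_KINDS = ["background", "support", "refute", "mechanism", "limitation", "recent"]


def plan_subquestions(question):
    """Expand one research question into focused retrieval subquestions."""
    # Hypothetical query template; a real planner would rewrite the query per kind.
    return [{"kind": kind, "query": f"{question} ({kind})"} for kind in SUBQUESTION_KINDS]


def run_passes(question, search):
    """Run each subquestion through `search`, deduping by stable paper ID.

    `search(query)` is assumed to return
    {"manifest_id": str, "papers": [{"paper_id": str, ...}, ...]}.
    """
    seen = {}
    manifests = []
    for sub in plan_subquestions(question):
        result = search(sub["query"])
        manifests.append({"kind": sub["kind"], "manifest_id": result["manifest_id"]})
        for paper in result["papers"]:
            # Keep one entry per paper, but remember every pass that found it.
            entry = seen.setdefault(paper["paper_id"], {**paper, "found_by": []})
            entry["found_by"].append(sub["kind"])
    return list(seen.values()), manifests
```

Recording `found_by` per paper keeps the link between a paper and the subquestions it serves, which a later coverage step can use.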

After retrieval, the system builds a coverage matrix. The matrix asks: which subquestions are covered, weak, missing, conflicted, or source-limited? Weak coverage can trigger one bounded follow-up pass, but the loop has to persist a stop reason. It may stop because coverage is sufficient, source limits were reached, policy blocked retrieval, or no useful follow-up remained.
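A small sketch of the stop decision, assuming the coverage matrix is a mapping from subquestion name to state. The states and stop reasons mirror the ones named above; the function `next_step`, the rule that only weak or missing rows are worth retrying, and the choice to report an exhausted follow-up budget as a source limit are assumptions for illustration.

```python
from enum import Enum


class Coverage(str, Enum):
    COVERED = "covered"
    WEAK = "weak"
    MISSING = "missing"
    CONFLICTED = "conflicted"
    SOURCE_LIMITED = "source_limited"


class StopReason(str, Enum):
    COVERAGE_SUFFICIENT = "coverage_sufficient"
    SOURCE_LIMIT_REACHED = "source_limit_reached"
    POLICY_BLOCKED = "policy_blocked"
    NO_USEFUL_FOLLOW_UP = "no_useful_follow_up"


def next_step(matrix, follow_ups_used, max_follow_ups=1, policy_allows=True):
    """Return (subquestions_to_retry, stop_reason); exactly one of them is None."""
    if all(state is Coverage.COVERED for state in matrix.values()):
        return None, StopReason.COVERAGE_SUFFICIENT
    # Assumption: conflicted and source-limited rows cannot be fixed by searching again.
    retry = [name for name, state in matrix.items()
             if state in (Coverage.WEAK, Coverage.MISSING)]
    if not retry:
        return None, StopReason.NO_USEFUL_FOLLOW_UP
    if not policy_allows:
        return None, StopReason.POLICY_BLOCKED
    if follow_ups_used >= max_follow_ups:
        return None, StopReason.SOURCE_LIMIT_REACHED
    return retry, None
```

Because the function never returns "nothing" without a reason, every run that ends has a stop reason available to persist.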

This is the part I care about most. A normal RAG pipeline retrieves once, stuffs text into a prompt, and hopes the model notices what is missing. Here, missingness is a first-class object. The system can say, "mechanism is covered, refutation is weak, limitations are source-limited," and it can persist why it stopped instead of pretending the search was complete.

The core artifact is the evidence packet. It contains selected papers, evidence IDs, supported and conflicting claims, limitations, coverage rows, gap decisions, warnings, and manifest references. It is the contract between retrieval/extraction and synthesis. The model receives a curated packet, not raw search noise, duplicate abstracts, memory snippets, or untraceable summaries.

The packet selection has a deliberately modest mathematical layer. The submodular_greedy strategy tries to preserve coverage and reduce redundancy, but it protects conflict and limitation evidence instead of dropping it for a cleaner summary. Optimization is useful only if it does not optimize away the parts a reviewer most needs to see.
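To make the idea concrete, here is a sketch of greedy selection over a set-coverage objective, which is a standard monotone submodular function. It is not the project's submodular_greedy implementation: the evidence shape, the `select_packet` name, the drop reasons, and the rule that protected evidence is kept even when it exceeds the budget are all assumptions.

```python
PROTECTED_KINDS = {"conflict", "limitation"}


def select_packet(evidence, budget):
    """Greedy coverage selection that never drops conflict or limitation evidence.

    Each item is assumed to look like
    {"id": str, "kind": str, "covers": set_of_subquestion_names}.
    """
    # Protected evidence goes in first and is exempt from the budget.
    selected = [e for e in evidence if e["kind"] in PROTECTED_KINDS]
    warnings = []
    if len(selected) > budget:
        warnings.append("protected evidence exceeds budget; kept anyway")
    covered = set().union(*(e["covers"] for e in selected))

    pool = [e for e in evidence if e["kind"] not in PROTECTED_KINDS]
    while pool and len(selected) < budget:
        # Pick the item with the largest marginal coverage gain.
        best = max(pool, key=lambda e: len(e["covers"] - covered))
        if not (best["covers"] - covered):
            break  # everything left is redundant
        selected.append(best)
        covered |= best["covers"]
        pool.remove(best)

    dropped = [
        {"id": e["id"], "reason": "redundant" if e["covers"] <= covered else "over_budget"}
        for e in pool
    ]
    return {"selected_ids": [e["id"] for e in selected], "dropped": dropped, "warnings": warnings}
```

Returning the dropped items with a reason is what turns selection into a trace a reviewer can question, rather than a silent filter.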

Planner: converts a broad biomedical question into traceable subquestions.
Search: returns papers, warnings, coverage metrics, and retrieval manifests.
Extractor: converts abstracts into entities, claims, methods, limitations, confidence, and spans.
Packet builder: selects a compact, traceable evidence set for synthesis and audit.

Chapter III · Audit, Trace, and Release Gates

Make every strong claim pay rent

Synthesis is only allowed to draft from the evidence packet. After that, the answer is decomposed into atomic claims. Citation audit checks whether each claim is supported, partially supported, contradicted, insufficiently supported, irrelevant, or uncited. The logic audit looks for semantic overclaims, such as treating an association as causation or converting indirect animal evidence into human clinical certainty.

The logic layer is intentionally not "LLM judge says yes or no." Optional LLM parsing can help convert claim and evidence text into typed frames, but deterministic rules remain the verifier of record. The system can also export symbolic logic facts, so the same audit path can later feed Datalog, Prolog, or solver-style experiments.
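A sketch of what deterministic rules over typed frames could look like for the two overclaim patterns mentioned earlier. The frame fields, their allowed values, and the finding names are hypothetical; the real frame schema and rule set are not described in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClaimFrame:
    relation: str    # "causal" or "associational"
    population: str  # "human" or "animal"
    certainty: str   # "hedged" or "definitive"


@dataclass(frozen=True)
class EvidenceFrame:
    design: str      # "rct", "observational", or "animal_study"
    population: str  # "human" or "animal"


def logic_audit(claim, evidence):
    """Return a list of overclaim findings; an empty list means no rule fired."""
    findings = []
    # Rule 1: observational evidence cannot carry a causal claim.
    if claim.relation == "causal" and evidence.design == "observational":
        findings.append("association_as_causation")
    # Rule 2: animal evidence cannot carry definitive claims about humans.
    if (claim.population == "human" and evidence.population == "animal"
            and claim.certainty == "definitive"):
        findings.append("animal_evidence_as_human_certainty")
    return findings
```

An LLM could help fill the frames, but the verdict comes from plain conditionals that give the same answer on every run.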

Revision is then a system responsibility, not a writing preference. Unsupported or overclaimed statements should be removed, softened, or limited, or the system should abstain or refuse. Optional LLM verifier and revision stages can help, but the LLM only assists the workflow; it does not become the trust anchor.

Release 1.0 also standardizes tool responses with release-tool-envelope-v1. Every tool returns ok, result, warnings, errors, error_code, trace, stable IDs, and metadata. A clinical boundary failure, source policy block, budget failure, unknown run ID, or export-path violation is still a schema-valid output. Downstream agents and dashboards can handle failure as structured state instead of parsing prose.

The envelope is one of the most practical details in the release. It gives every tool a risk level, source policy, side-effect declaration, runtime expectation, IDs, traces, and next allowed actions. If a PubMed call is blocked, the caller sees source_policy_blocked. If a request is too large, it sees budget_exceeded. If an export path is unsafe, it sees export_path_blocked. Failure becomes an API surface.
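A rough sketch of an envelope builder and validity check. The top-level keys follow the fields listed for release-tool-envelope-v1 and the error code is one named above; the helper names, the `ids` key, the placement of the schema name under `metadata`, and the consistency rule in `is_valid_envelope` are assumptions.

```python
ENVELOPE_SCHEMA = "release-tool-envelope-v1"
REQUIRED_KEYS = {"ok", "result", "warnings", "errors", "error_code", "trace", "ids", "metadata"}


def success(result, *, ids=None, warnings=(), trace=()):
    """Wrap a successful tool result."""
    return {
        "ok": True, "result": result, "warnings": list(warnings), "errors": [],
        "error_code": None, "trace": list(trace), "ids": dict(ids or {}),
        "metadata": {"schema": ENVELOPE_SCHEMA},
    }


def failure(error_code, message, *, ids=None, trace=()):
    """Wrap a failure so callers get structured state instead of prose."""
    return {
        "ok": False, "result": None, "warnings": [], "errors": [message],
        "error_code": error_code, "trace": list(trace), "ids": dict(ids or {}),
        "metadata": {"schema": ENVELOPE_SCHEMA},
    }


def is_valid_envelope(envelope):
    """Check required keys and that `ok` agrees with `error_code` (assumed rule)."""
    if not REQUIRED_KEYS <= set(envelope):
        return False
    return envelope["ok"] == (envelope["error_code"] is None)
```

A caller can then branch on `error_code` (for example `source_policy_blocked`) without parsing the message text.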

Observability gets the same treatment. The dashboard Trace view exposes planning, retrieval, extraction, packet selection, draft, audit, revision, memory effects, budget snapshots, step telemetry, Obsidian export, and provenance graph output. The provenance graph links answer runs, papers, evidence, retrieval manifests, packets, audits, revisions, tools, activities, and agents while redacting prompts, raw provider responses, secrets, and API keys.

Obsidian export follows the same rule. It is one-way reviewer output, not a new source of truth. Exported notes carry metadata such as source_of_truth=biomed_sqlite and imported_as_evidence=false. The point is to support human review without letting reviewer notes leak back into the evidence channel.
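A sketch of how a one-way export could stamp its notes and keep writes inside the vault. The two metadata keys come from the paragraph above; rendering them as YAML-style front matter, the `answer_run_id` field, and the helper names are assumptions.

```python
from pathlib import Path


def safe_export_path(vault_root, relative_name):
    """Return the resolved target path, or None if it escapes the vault root."""
    root = Path(vault_root).resolve()
    target = (root / relative_name).resolve()
    try:
        target.relative_to(root)
    except ValueError:
        return None  # a caller would report this as export_path_blocked
    return target


def render_note(title, body, answer_run_id):
    """Render a reviewer note whose front matter marks it as non-evidence."""
    front_matter = [
        "---",
        "source_of_truth: biomed_sqlite",
        "imported_as_evidence: false",
        f"answer_run_id: {answer_run_id}",
        "---",
    ]
    return "\n".join(front_matter + ["", f"# {title}", "", body, ""])
```

Any later importer that honors `imported_as_evidence: false` has what it needs to keep these notes out of the evidence channel.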

Research Watch

Long-running topics should leave decisions behind.

Research Watch is where the system starts to look beyond one answer run. A topic keeps include terms, exclude terms, preferred methods, schedule, threshold, snapshots, and paper-level push or skip decisions. Each decision carries a relevance score and a rationale. Repeated checks dedupe by stable paper identity instead of pushing the same item again.

Watch does not pretend to be an autonomous scientist. It records why a new paper deserved attention or why it was skipped. Later, those decisions can shape project context, review queues, and exports, but they still do not become evidence unless they point back to retrieved papers and extracted spans.
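A minimal sketch of a watch check that leaves a decision behind for every new paper. The term-overlap score is a deliberately naive stand-in for whatever relevance scoring the project uses, and the `check_topic` name, the topic dictionary shape, and the decision fields are hypothetical.

```python
def check_topic(topic, candidates, decided_ids):
    """Score unseen papers and record a push or skip decision for each.

    `topic` is assumed to look like
    {"include": [...], "exclude": [...], "threshold": float}.
    `decided_ids` is a set of paper IDs that already have a decision.
    """
    include = [term.lower() for term in topic["include"]]
    exclude = [term.lower() for term in topic["exclude"]]
    decisions = []
    for paper in candidates:
        paper_id = paper["paper_id"]
        if paper_id in decided_ids:
            continue  # dedupe by stable paper identity
        text = (paper["title"] + " " + paper.get("abstract", "")).lower()
        blocked = [term for term in exclude if term in text]
        hits = [term for term in include if term in text]
        if blocked:
            score = 0.0
            action = "skip"
            rationale = "matched exclude terms: " + ", ".join(blocked)
        else:
            score = len(hits) / max(len(include), 1)
            action = "push" if score >= topic["threshold"] else "skip"
            rationale = "matched include terms: " + (", ".join(hits) or "none")
        decisions.append({"paper_id": paper_id, "action": action,
                          "relevance": score, "rationale": rationale})
        decided_ids.add(paper_id)
    return decisions
```

Running the same check twice over the same candidates yields no new decisions the second time, which is the dedupe behavior described above.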

The design principle: mathematical aids are allowed only when their role is explicit. Submodular-style packet selection can prioritize coverage and retain conflict/limitation evidence. Contextual-bandit-style retrieval advice can suggest stop, broaden, support, refute, mechanism, or limitation searches. Neither can override clinical policy, source caps, citation audit, or logic audit.
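The "advisory only" rule can be sketched as a resolver that lets policy veto whatever the advisor suggests. The action names come from the paragraph above; `resolve_action`, its parameters, and the `decided_by` labels are assumptions made for illustration.

```python
ADVICE_ACTIONS = {"stop", "broaden", "support", "refute", "mechanism", "limitation"}


def resolve_action(advice, *, policy_allows_retrieval, passes_remaining):
    """Turn retrieval advice into an action, with policy and caps taking priority."""
    if advice not in ADVICE_ACTIONS:
        raise ValueError(f"unknown advice: {advice!r}")
    if advice == "stop":
        return {"action": "stop", "decided_by": "advisor"}
    if not policy_allows_retrieval:
        # Policy wins even when the advisor wants another search.
        return {"action": "stop", "decided_by": "policy", "overridden_advice": advice}
    if passes_remaining <= 0:
        return {"action": "stop", "decided_by": "source_cap", "overridden_advice": advice}
    return {"action": advice, "decided_by": "advisor"}
```

Keeping `overridden_advice` in the result makes the advisor's suggestion visible in the trace even when it was not followed.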

What this design is really about

The project is less about answering biomedical questions and more about forcing the answer into a reviewable workflow. The same service layer supports agent tools, FastAPI routes, the dashboard, tests, and evaluation, so the system has fewer places to drift.

Evaluation follows that line. The release gate checks citation coverage, schema validity, refusal behavior, retrieval repeatability, manifest validity, multi-pass coverage, packet traceability, audit trace completeness, revision success, memory boundary preservation, structured error validity, Obsidian export safety, provenance graph validity, and prompt-injection boundary success. It is not biomedical expert review, but it proves the engineering contract is testable.
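A release gate of this kind can be sketched as a comparison of measured metrics against minimums. The metric names below echo a few of the checks listed above, but the threshold values, the `release_gate` function, and the rule that a missing metric fails the gate are assumptions.

```python
def release_gate(metrics, thresholds):
    """Pass only if every required metric is present and meets its minimum."""
    failures = {}
    for name, minimum in thresholds.items():
        value = metrics.get(name)  # a missing metric counts as a failure
        if value is None or value < minimum:
            failures[name] = {"value": value, "minimum": minimum}
    return {"passed": not failures, "failures": failures}


# Hypothetical thresholds for illustration only.
EXAMPLE_THRESHOLDS = {
    "citation_coverage": 0.95,
    "schema_validity": 1.0,
    "refusal_behavior": 1.0,
    "memory_boundary_preservation": 1.0,
}
```

Treating a missing metric as a failure means a check cannot silently fall out of the gate when an eval step is skipped.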

The longer-term version is a research workbench where every generated statement carries a provenance trail: which query found the paper, which span supported the claim, which audit rule accepted or rejected it, which revision changed it, which export produced a reviewer note, and which eval gate would fail if the behavior regressed. The agent is not replacing expert review. It is making the review surface sharper.

The small choices add up. Stable IDs make artifacts addressable. Stop reasons make silence explainable. Structured errors make failure composable. One-way export keeps review notes useful without corrupting the evidence path. Advisory math makes optimization visible without giving it authority. The design is the accumulation of those constraints.

The next version I would want is not mainly a bigger model. I would rather widen this evidence contract: Europe PMC as another structured adapter, full-text and PDF ingestion behind the same provenance gates, better visualization for provenance graphs, formal argumentation for conflicting claims, calibrated uncertainty, causal claim typing, topic-drift detection for Research Watch, and eventually a hypergraph view of papers, claims, entities, methods, limitations, audits, and reviewer decisions.

The project I want to show is not an agent that sounds confident about papers. It is a research workflow where evidence can be found, narrowed, cited, challenged, revised, exported, and rerun. The model still matters, but the reliability comes from the system around it.