I do not want bio-agent to be a biomedical chatbot with citations pasted onto
the end. The useful version is an evidence workflow. An answer has to move through controlled
retrieval, structured extraction, packet construction, citation audit, logic audit, revision,
provenance, and eval before I trust it.
The central boundary is between context that helps the workflow and evidence that can support a biomedical claim. Memory, reviewer notes, saved papers, Obsidian exports, and model drafts can guide the process. They cannot become evidence on their own. A claim earns its place only after the system ties it to retrieved papers, extracted spans, and audit records.
I care less about how confident the agent sounds and more about whether its claims can be replayed, challenged, narrowed, exported, and evaluated. The tools are checkpoints, not decorative capabilities.
- Refuse clinical advice before tools run, and keep memory outside the evidence channel.
- Compress retrieval, extraction, coverage, gaps, and warnings into one contract.
- Check claims, revise overreach, export provenance, and gate the release with eval.
- Serious steps return IDs, warnings, traces, and recoverable failure states.
- Project memory can shape preferences, but it can never become citation support.
- Deterministic mock data makes release gates repeatable before live PubMed is trusted.
- Selection and retrieval advice can guide the workflow, but policy still wins.
The public tools read like a lab protocol: plan, retrieve, extract, cover, packet, answer, audit, revise, watch, export. A reviewer can stop at any verb and inspect what happened.
The project is not trying to make one magic answer endpoint. It is closer to a small operating layer for biomedical evidence artifacts: papers, spans, claims, packets, audits, traces, decisions, and exports.
The intended user is the person who asks where a claim came from, why a paper was skipped, and what would fail if the run were repeated tomorrow.
Chapter I · The Boundary
Do not let the agent answer before the evidence exists
I built this on top of akashic-agent, which already gives me a plugin lifecycle, tool registration, memory, proactive workflow, and FastAPI dashboard surface. The biomedical layer uses that framework, but it keeps trust out of the general agent loop. The plugin owns its clinical boundary, retrieval policy, tool guards, schemas, storage, audit trail, and dashboard views.
The first gate is deliberately conservative: clinical or patient-specific prompts are refused before memory, retrieval, LLM calls, export, or provenance work can run. A "helpful" model can turn a research assistant into an unsafe advice system very quickly. Refusal is not a final polish step here. It is a pre-tool policy decision.
The guardrail lives outside prompt text. The plugin uses the host framework's
@on_tool_pre hook to deny clinical requests, apply source policy, cap oversized
calls, validate project IDs, and inject safe defaults before a tool executes. The safety
boundary lives where side effects happen.
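A minimal sketch of such a pre-tool gate, written as a plain function rather than against the host framework's actual `@on_tool_pre` signature, which is not shown here. The function name `pre_tool_guard`, the regex patterns, the cap of 50, and the `clinical_boundary_refused` code are all assumptions for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical patterns; a real policy would be broader and separately tested.
CLINICAL_PATTERNS = [
    re.compile(r"\bshould i (take|stop|start)\b", re.IGNORECASE),
    re.compile(r"\bmy (patient|dose|diagnosis|symptoms?)\b", re.IGNORECASE),
]
MAX_RESULTS = 50              # assumed cap for oversized calls
KNOWN_SOURCES = ("mock", "pubmed")


@dataclass
class GuardDecision:
    allowed: bool
    error_code: Optional[str] = None
    args: dict = field(default_factory=dict)
    warnings: list = field(default_factory=list)


def pre_tool_guard(args: dict, live_enabled: bool = False) -> GuardDecision:
    """Decide whether a tool call may run, before any side effect happens."""
    query = str(args.get("query", ""))
    if any(p.search(query) for p in CLINICAL_PATTERNS):
        # Assumed error code name; refusal happens before retrieval or LLM calls.
        return GuardDecision(False, "clinical_boundary_refused")

    safe = dict(args)
    safe.setdefault("source", "mock")  # safe default: deterministic mock data
    if safe["source"] not in KNOWN_SOURCES:
        return GuardDecision(False, "source_policy_blocked")
    if safe["source"] == "pubmed" and not live_enabled:
        return GuardDecision(False, "source_policy_blocked")

    warnings = []
    requested = int(safe.get("max_results", 10))
    safe["max_results"] = min(requested, MAX_RESULTS)
    if requested > MAX_RESULTS:
        warnings.append(f"max_results capped at {MAX_RESULTS}")
    return GuardDecision(True, None, safe, warnings)
```

The point of the shape is that the decision is data: a denied call carries an error code, and an allowed call carries the rewritten arguments the tool will actually see.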
The second gate is source control. The default source is deterministic mock data, so CI, demos, and evaluation can run without keys and without a changing literature index. Live PubMed retrieval is opt-in through NCBI E-utilities. Always hitting live APIs sounds more impressive, but it makes demos and tests harder to trust. A reviewer should be able to clone the repo, run it, and see the same evidence path.
The retrieval manifest does most of the quiet work at this stage. A search result is not a loose list of papers. It records the source, original query, compiled query, API parameters, pagination, result counts, warnings, stored paper IDs, and returned paper IDs. Retrieval becomes inspectable infrastructure instead of a hidden prompt side effect.
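One way to picture the manifest is as a frozen record with a content-derived ID. The field names below follow the list above, but the class name, the `man_` prefix, and the hashing scheme are assumptions, not the project's actual schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RetrievalManifest:
    source: str
    original_query: str
    compiled_query: str
    api_params: dict
    page: int
    page_size: int
    total_count: int
    returned_paper_ids: tuple
    stored_paper_ids: tuple
    warnings: tuple = ()

    def manifest_id(self) -> str:
        """Derive a stable ID from the manifest content (assumed scheme)."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return "man_" + hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

With a content-derived ID, two runs that issue the same query against the same deterministic source produce the same manifest ID, which is what makes retrieval repeatability checkable.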
check_literature_access is not a shallow ping. It exercises search, manifest
creation, paper persistence, and abstract coverage. A live source counts as ready only when
the evidence path is ready. The dashboard, CI smoke tests, and agent tools can all ask the
same question: can this source actually support the workflow?
I do not treat "online" as enough. A source is useful only if it can produce artifacts that survive review: manifest IDs, stored papers, warnings, abstract coverage, and a repeatable fallback story. Deterministic mock data stays in the release path because it is the regression harness, not a toy source.
Chapter II · The Evidence Packet
Turn retrieval into a compact contract the model cannot bypass
Release 1.0 breaks the biomedical workflow into tools. The system no longer hides the evidence pipeline behind one large "answer" action. It exposes independently callable tools for multi-pass literature search, batch extraction, coverage-gap analysis, evidence-packet construction, answer trace lookup, Obsidian export, and provenance graph export.
- run_multi_pass_literature_search: plans focused searches, executes bounded passes, dedupes papers, and records manifests.
- extract_evidence_batch: turns papers into claim, entity, method, limitation, confidence, and evidence-span objects.
- analyze_coverage_gaps: marks subquestions as covered, weak, conflicted, missing, or source-limited.
- build_evidence_packet: selects a compact packet that synthesis must use, with selected and dropped evidence trace.
- get_answer_trace: replays classify, plan, retrieve, extract, packet, draft, audit, revise, and finalize steps.
- export_provenance_graph: exports a PROV/OpenLineage-style graph while redacting prompts, secrets, and raw provider output.
I split the workflow into these tools because scientific software needs interruptibility. If the answer looks wrong, I do not want to debug one long prompt. I want to ask: did the planner miss a subquestion, did retrieval stop too early, did extraction lose the limitation, did packet selection drop a conflict, or did synthesis overclaim after audit? Each tool boundary gives me one place to inspect and one place to test.
The tool names also carry a product idea. The agent is allowed to be helpful, but the
workflow has to be legible enough for a dashboard, a CLI, a test, or another agent to drive
the same service layer. Project tools such as
record_project_paper_decision and watch tools such as
watch_research_topic sit beside the core evidence tools. The result feels less
like a single chat and more like a research desk.
The middle layer starts with a planner. One research question is decomposed into focused
retrieval subquestions: background, support, refute, mechanism, limitation, and recent
evidence. Each subquestion is executed through the controlled search_literature
tool, deduped by stable paper ID, and tied back to a retrieval manifest.
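The planning and dedupe step can be sketched roughly as follows. The `search` callable stands in for the controlled `search_literature` tool, and its return shape (`manifest_id` plus a `papers` list) is an assumption, as are the function names and the naive query template.

```python
ASPECTS = ("background", "support", "refute", "mechanism", "limitation", "recent")


def plan_subquestions(question: str) -> list:
    """Expand one research question into one focused query per aspect."""
    return [{"aspect": a, "query": f"{question} {a}"} for a in ASPECTS]


def run_passes(question: str, search) -> tuple:
    """Run each subquestion, dedupe by paper ID, and keep manifest references."""
    papers = {}
    manifests = []
    for sq in plan_subquestions(question):
        # Assumed result shape: {"manifest_id": str, "papers": [{"paper_id": str}]}
        result = search(sq["query"])
        manifests.append({"aspect": sq["aspect"], "manifest_id": result["manifest_id"]})
        for paper in result["papers"]:
            entry = papers.setdefault(paper["paper_id"], {**paper, "found_by": []})
            entry["found_by"].append(sq["aspect"])
    return papers, manifests
```

Keeping `found_by` on each deduped paper preserves which subquestions surfaced it, so a later coverage step does not have to rerun retrieval to find out.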
After retrieval, the system builds a coverage matrix. The matrix asks: which subquestions are covered, weak, missing, conflicted, or source-limited? Weak coverage can trigger one bounded follow-up pass, but the loop has to persist a stop reason. It may stop because coverage is sufficient, source limits were reached, policy blocked retrieval, or no useful follow-up remained.
This is the part I care about most. A normal RAG pipeline retrieves once, stuffs text into a prompt, and hopes the model notices what is missing. Here, missingness is a first-class object. The system can say, "mechanism is covered, refutation is weak, limitations are source-limited," and it can persist why it stopped instead of pretending the search was complete.
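A small sketch of the coverage loop with a persisted stop reason. The thresholds (two or more evidence items counts as covered), the stop-reason strings, and the `follow_up` callable are assumptions. Here "source limits reached" is modelled simply as the single bounded follow-up pass being used up.

```python
SUBQUESTIONS = ("background", "support", "refute", "mechanism", "limitation", "recent")


def classify(n_evidence: int, conflicted: bool = False, source_limited: bool = False) -> str:
    """Map evidence counts to a coverage status (assumed thresholds)."""
    if conflicted:
        return "conflicted"
    if n_evidence == 0:
        return "source-limited" if source_limited else "missing"
    return "covered" if n_evidence >= 2 else "weak"


def coverage_loop(counts: dict, follow_up, max_follow_ups: int = 1,
                  policy_allows: bool = True) -> tuple:
    """Return (coverage matrix, stop reason) after at most max_follow_ups passes."""
    counts = dict(counts)
    passes = 0
    while True:
        matrix = {sq: classify(counts.get(sq, 0)) for sq in SUBQUESTIONS}
        gaps = [sq for sq, status in matrix.items() if status in ("weak", "missing")]
        if not gaps:
            return matrix, "coverage_sufficient"
        if not policy_allows:
            return matrix, "policy_blocked"
        if passes >= max_follow_ups:
            return matrix, "source_limits_reached"
        gained = follow_up(gaps)  # maps subquestion -> number of new evidence items
        passes += 1
        if not any(gained.values()):
            return matrix, "no_useful_follow_up"
        for sq, n in gained.items():
            counts[sq] = counts.get(sq, 0) + n
```

Every exit path returns a reason alongside the matrix, so "the search stopped" is never an implicit state.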
The core artifact is the evidence packet. It contains selected papers, evidence IDs, supported and conflicting claims, limitations, coverage rows, gap decisions, warnings, and manifest references. It is the contract between retrieval/extraction and synthesis. The model receives a curated packet, not raw search noise, duplicate abstracts, memory snippets, or untraceable summaries.
The packet selection has a deliberately modest mathematical layer. The
submodular_greedy strategy tries to preserve coverage and reduce redundancy,
but it protects conflict and limitation evidence instead of dropping it for a cleaner
summary. Optimization is useful only if it does not optimize away the parts a reviewer most
needs to see.
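To make the idea concrete, here is a minimal greedy selection over a set-coverage objective, which is a standard monotone submodular function. It is a sketch under stated assumptions, not the project's `submodular_greedy` implementation: the candidate shape (`id`, `kind`, `covers`), the function name, and the drop reasons are hypothetical.

```python
PROTECTED_KINDS = {"conflict", "limitation"}


def select_packet(candidates: list, budget: int) -> dict:
    """Keep protected evidence, then greedily add items by marginal coverage gain."""
    # Conflict and limitation evidence is always kept, even if it adds no coverage.
    selected = [c for c in candidates if c["kind"] in PROTECTED_KINDS]
    chosen_ids = {c["id"] for c in selected}
    covered = set().union(*(c["covers"] for c in selected))
    remaining = [c for c in candidates if c["id"] not in chosen_ids]

    while remaining and len(selected) < budget:
        # Marginal gain = number of subquestions this item would newly cover.
        best = max(remaining, key=lambda c: len(c["covers"] - covered))
        if not (best["covers"] - covered):
            break  # nothing left adds coverage
        selected.append(best)
        covered |= best["covers"]
        remaining.remove(best)

    dropped = [
        {"id": c["id"],
         "reason": "redundant" if not (c["covers"] - covered) else "over_budget"}
        for c in remaining
    ]
    return {"selected": [c["id"] for c in selected], "dropped": dropped,
            "covered": covered}
```

Returning the dropped items with a reason is what keeps the selection auditable: a reviewer can see that a paper was left out for redundancy rather than silently lost.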
- Planner: converts a broad biomedical question into traceable subquestions.
- Search: returns papers, warnings, coverage metrics, and retrieval manifests.
- Extractor: converts abstracts into entities, claims, methods, limitations, confidence, and spans.
- Packet builder: selects a compact, traceable evidence set for synthesis and audit.
Chapter III · Audit, Trace, and Release Gates
Make every strong claim pay rent
Synthesis is only allowed to draft from the evidence packet. After that, the answer is decomposed into atomic claims. Citation audit checks whether each claim is supported, partially supported, contradicted, insufficiently supported, irrelevant, or uncited. The logic audit looks for semantic overclaims, such as treating an association as causation or converting indirect animal evidence into human clinical certainty.
The logic layer is intentionally not "LLM judge says yes or no." Optional LLM parsing can help convert claim and evidence text into typed frames, but deterministic rules remain the verifier of record. The system can also export symbolic logic facts, so the same audit path can later feed Datalog, Prolog, or solver-style experiments.
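A deterministic rule over typed frames might look like the sketch below. The `Frame` fields, the strength ordering, and the finding names are assumptions chosen to mirror the two overclaim examples above; the real rule set is presumably larger.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    relation: str    # "association" or "causation"
    population: str  # "human", "animal", or "in_vitro"


# Assumed ordering: a causal claim is stronger than an associational one.
STRENGTH = {"association": 1, "causation": 2}


def logic_audit(claim: Frame, evidence: Frame) -> list:
    """Return overclaim findings; an empty list means no rule fired."""
    findings = []
    if STRENGTH[claim.relation] > STRENGTH[evidence.relation]:
        findings.append("association_as_causation")
    if claim.population == "human" and evidence.population != "human":
        findings.append("indirect_evidence_as_human_claim")
    return findings
```

Because the rule is a pure function of two frames, an LLM can help fill in the frames without ever being the component that decides whether the claim passes.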
Revision is then a system responsibility, not a writing preference. Unsupported or overclaimed statements are removed, softened, or qualified, and when nothing defensible remains the system abstains or refuses. Optional LLM verifier and revision stages can help, but deterministic audit remains the verifier of record. The LLM may assist the workflow; it does not become the trust anchor.
Release 1.0 also standardizes tool responses with release-tool-envelope-v1.
Every tool returns ok, result, warnings,
errors, error_code, trace, stable IDs, and metadata.
A clinical boundary failure, source policy block, budget failure, unknown run ID, or
export-path violation is still a schema-valid output. Downstream agents and dashboards can
handle failure as structured state instead of parsing prose.
The envelope is one of the most practical details in the release. It gives every tool a
risk level, source policy, side-effect declaration, runtime expectation, IDs, traces, and
next allowed actions. If a PubMed call is blocked, the caller sees
source_policy_blocked. If a request is too large, it sees
budget_exceeded. If an export path is unsafe, it sees
export_path_blocked. Failure becomes an API surface.
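A rough sketch of what an envelope builder and a caller could look like. The top-level keys come from the list above and the three error codes are the ones named in the text; the `schema` and `ids` key names, the helper names, and the recovery actions are assumptions.

```python
def envelope(result=None, *, error_code=None, errors=(), warnings=(),
             trace=(), ids=None, metadata=None) -> dict:
    """Build a response with the same shape for success and failure."""
    return {
        "schema": "release-tool-envelope-v1",
        "ok": error_code is None,
        "result": result,
        "warnings": list(warnings),
        "errors": list(errors),
        "error_code": error_code,
        "trace": list(trace),
        "ids": dict(ids or {}),
        "metadata": dict(metadata or {}),
    }


# Hypothetical recovery policy keyed on error codes named in the text.
RECOVERY = {
    "source_policy_blocked": "fall_back_to_mock",
    "budget_exceeded": "shrink_request",
    "export_path_blocked": "choose_safe_path",
}


def next_action(env: dict) -> str:
    """Branch on structured state instead of parsing error prose."""
    if env["ok"]:
        return "continue"
    return RECOVERY.get(env["error_code"], "stop_and_report")
```

A caller never needs a `try`/`except` around prose parsing: a blocked PubMed call and a successful search differ only in field values, not in shape.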
Observability gets the same treatment. The dashboard Trace view exposes planning, retrieval, extraction, packet selection, draft, audit, revision, memory effects, budget snapshots, step telemetry, Obsidian export, and provenance graph output. The provenance graph links answer runs, papers, evidence, retrieval manifests, packets, audits, revisions, tools, activities, and agents while redacting prompts, provider raw responses, secrets, and API keys.
Obsidian export follows the same rule. It is one-way reviewer output, not a new source of
truth. Exported notes carry metadata such as source_of_truth=biomed_sqlite and
imported_as_evidence=false. The point is to support human review without
letting reviewer notes leak back into the evidence channel.
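The one-way rule can be illustrated with two small helpers. The two metadata values come from the text; writing them as YAML-style front matter, the function names, and the path check are assumptions for illustration.

```python
from pathlib import Path


def safe_export_path(vault: Path, relative: str) -> Path:
    """Resolve a note path and refuse anything that escapes the vault."""
    root = vault.resolve()
    target = (root / relative).resolve()
    try:
        target.relative_to(root)
    except ValueError:
        raise PermissionError("export_path_blocked") from None
    return target


def render_note(title: str, body: str) -> str:
    """Render a reviewer note that declares it is not evidence."""
    front_matter = [
        "---",
        "source_of_truth: biomed_sqlite",
        "imported_as_evidence: false",
        "---",
    ]
    return "\n".join(front_matter + [f"# {title}", "", body]) + "\n"
```

Stamping every note with its origin means that even if a note is later fed back as context, the system can recognize it as reviewer output rather than retrieved evidence.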
Research Watch is where the system starts to look beyond one answer run. A topic keeps include terms, exclude terms, preferred methods, schedule, threshold, snapshots, and paper-level push or skip decisions. Each decision carries a relevance score and a rationale. Repeated checks dedupe by stable paper identity instead of pushing the same item again.
Watch does not pretend to be an autonomous scientist. It records why a new paper deserved attention or why it was skipped. Later, those decisions can shape project context, review queues, and exports, but they still do not become evidence unless they point back to retrieved papers and extracted spans.
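A toy version of a watch check, assuming a topic is a dict with `include`, `exclude`, and `threshold`, and that relevance is the fraction of include terms found in the title and abstract. The scoring rule and all names here are hypothetical; the real scorer is not described in this post.

```python
def check_topic(topic: dict, papers: list, seen_ids: set) -> list:
    """Score unseen papers and record a push or skip decision with a rationale."""
    decisions = []
    for paper in papers:
        pid = paper["paper_id"]
        if pid in seen_ids:
            continue  # dedupe by stable paper identity across checks
        seen_ids.add(pid)
        text = f"{paper['title']} {paper.get('abstract', '')}".lower()
        excluded = [t for t in topic["exclude"] if t in text]
        hits = [t for t in topic["include"] if t in text]
        if excluded or not topic["include"]:
            score = 0.0
        else:
            score = len(hits) / len(topic["include"])
        push = not excluded and score >= topic["threshold"]
        decisions.append({
            "paper_id": pid,
            "action": "push" if push else "skip",
            "score": score,
            "rationale": f"excluded by {excluded}" if excluded else f"matched {hits}",
        })
    return decisions
```

Skips are recorded as decisions too, so "why was this paper not surfaced" has an answer on file.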
What this design is really about
The project is less about answering biomedical questions and more about forcing the answer into a reviewable workflow. The same service layer supports agent tools, FastAPI routes, the dashboard, tests, and evaluation, so the system has fewer places to drift.
Evaluation follows that line. The release gate checks citation coverage, schema validity, refusal behavior, retrieval repeatability, manifest validity, multi-pass coverage, packet traceability, audit trace completeness, revision success, memory boundary preservation, structured error validity, Obsidian export safety, provenance graph validity, and prompt-injection boundary success. It is not biomedical expert review, but it proves the engineering contract is testable.
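In its simplest form, a gate like this compares measured metrics against minimum thresholds and fails closed when a metric is missing. The metric keys below are snake_case versions of a few checks from the list above, and the threshold values are made up for illustration.

```python
def release_gate(metrics: dict, thresholds: dict) -> dict:
    """Pass only if every required metric is present and meets its minimum."""
    failures = sorted(
        name for name, minimum in thresholds.items()
        if metrics.get(name, 0.0) < minimum  # a missing metric counts as a failure
    )
    return {"passed": not failures, "failures": failures}


# Hypothetical thresholds; the real gate covers more checks than these three.
THRESHOLDS = {
    "citation_coverage": 0.95,
    "refusal_behavior": 1.0,
    "schema_validity": 1.0,
}
```

Run against deterministic mock data, the same inputs yield the same gate result, which is what lets the gate act as a regression check rather than a one-off score.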
The longer-term version is a research workbench where every generated statement carries a provenance trail: which query found the paper, which span supported the claim, which audit rule accepted or rejected it, which revision changed it, which export produced a reviewer note, and which eval gate would fail if the behavior regressed. The agent is not replacing expert review. It is making the review surface sharper.
The small choices add up. Stable IDs make artifacts addressable. Stop reasons make silence explainable. Structured errors make failure composable. One-way export keeps review notes useful without corrupting the evidence path. Advisory math makes optimization visible without giving it authority. The design is the accumulation of those constraints.
The next version I would want is not mainly a bigger model. I would rather widen this evidence contract: Europe PMC as another structured adapter, full-text and PDF ingestion behind the same provenance gates, better visualization for provenance graphs, formal argumentation for conflicting claims, calibrated uncertainty, causal claim typing, topic-drift detection for Research Watch, and eventually a hypergraph view of papers, claims, entities, methods, limitations, audits, and reviewer decisions.
The project I want to show is not an agent that sounds confident about papers. It is a research workflow where evidence can be found, narrowed, cited, challenged, revised, exported, and rerun. The model still matters, but the reliability comes from the system around it.