Beyond RAG: full-text biomedical evidence review • Andrew Zhang

The previous post was about the evidence-first agent architecture behind this project. This one is about what changed in review. Release 2.0 lets the system pull deterministic full-text and PDF parser output into the same run-centered workflow instead of stopping at the abstract.

Release 2.0 thesis

Full text only matters when it turns into reviewable material. In bio-agent 2.0, parsed sections, source hashes, page numbers, and character offsets can point reviewers to stronger evidence. That material still has to pass through extraction, packet selection, citation audit, logic audit, graph validation, provenance, and Run Evidence Review.

biomed-evidence-graph-v1 · biomed-evidence-review-v1 · full-text span locators · argument-graph-v2 · watch graph drift

Source layer: Full text and PDFs

Document sections, source hashes, page and offset locators.

→

Evidence boundary: Extraction + audit

Evidence items, packets, citation audit, logic audit.

→

Reviewer surface: Run + Library review

Claim cards, Watch drift, Argument Graph v2, trace links.

Before: Abstract-level review

Runs were already claim-audited, but source inspection mostly stopped at paper records and extracted spans.

Now: Full-text locators

Known papers can carry deterministic document sections, source hashes, pages, and character offsets.

Change: Library inspection

The dashboard Library can inspect full-text ingestion and Watch drift without turning parser output into evidence.

Change: Argument metadata

Argument Graph v2 connects support, attack, and qualifier relationships back to Evidence Graph node IDs.

Change: Watch drift

Research Watch can compare snapshots for papers, claims, methods, limitations, entities, and support shifts.

Change: Quality becomes broader

Eval now checks full-text ingestion, span locator validity, drift schema, and argument graph linkage.

01 Known paper

A stored paper can now start from full text or a PDF fixture.

02 Locator

The parser turns that source into sections, hashes, page labels, and character offsets.

03 Evidence item

Only extracted evidence records move on to packets, audit, graph validation, and review.

04 Reviewer signal

Argument and drift views help reviewers inspect change without overriding safety policy.

Product object model

The run is still the review object, but the source gets deeper

The main product choice from the previous release still holds: a biomedical answer is not just text. It is an addressable answer run with a review contract: RunEvidenceReview. Release 2.0 keeps that object and gives it a deeper source trail.

A reviewer can still start from a run ID and inspect claim support, audit verdicts, snapshot metadata, validation status, trace links, provenance, and redacted graph export. The difference is simple: an evidence card can now point closer to the source. Beyond a paper and an extracted span, it can reference a full-text section, source hash, page number, and character offsets when that material exists.

Small design choice

Full text is not the new product object. The run is. That keeps the workflow anchored to the thing the reviewer actually needs to judge: one answer, with one set of claims, built from evidence at a specific time.

Full-text boundary

Parser output is not evidence yet

Release 2.0 brings deterministic full-text and PDF ingestion for known papers. The ingestion layer stores document metadata, normalized sections, source hashes, and span locators such as section label, page number, and character offsets. That still counts as source structure, not evidence.

This distinction matters. A PDF parser can tell the system where text came from. It cannot decide that the text supports a biomedical claim. Full-text-derived material becomes evidence only after an extractor creates an EvidenceItem and that item moves through packet selection, citation audit, logic audit, Evidence Graph validation, snapshot persistence, and Run Evidence Review.

The full-text path is deliberately boring in the best way: paper -> document sections -> locators -> extracted evidence -> packet -> audit -> graph -> review. The parser adds precision. It does not get to bypass review.
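The gate described above can be sketched as a simple check. This is a minimal illustration, not the project's code: the `Record` class, the `kind` values, and the gate names are hypothetical stand-ins for the stages named in the text.

```python
from dataclasses import dataclass, field

# Stage names mirror the prose above; the identifiers themselves are assumptions.
REQUIRED_GATES = (
    "packet_selection",
    "citation_audit",
    "logic_audit",
    "graph_validation",
    "snapshot_persistence",
    "run_evidence_review",
)


@dataclass
class Record:
    """Hypothetical record: either parser output or an extracted evidence item."""
    kind: str  # "document_section" or "evidence_item"
    passed_gates: set = field(default_factory=set)


def missing_gates(record: Record) -> list:
    """List the review stages this record has not passed yet, in pipeline order."""
    return [gate for gate in REQUIRED_GATES if gate not in record.passed_gates]


def can_support_claim(record: Record) -> bool:
    """Parser output never supports a claim; evidence items only after every gate."""
    return record.kind == "evidence_item" and not missing_gates(record)


parsed_section = Record(kind="document_section", passed_gates=set(REQUIRED_GATES))
half_reviewed = Record(kind="evidence_item", passed_gates={"packet_selection"})
fully_reviewed = Record(kind="evidence_item", passed_gates=set(REQUIRED_GATES))

print(can_support_claim(parsed_section))   # False: wrong kind, regardless of gates
print(can_support_claim(half_reviewed))    # False: gates still missing
print(can_support_claim(fully_reviewed))   # True
```

The point of checking `kind` first is that no amount of downstream bookkeeping can promote raw parser output; only an extracted record is eligible at all.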

Document sections

Normalized source text attached to a known paper.

Source hash

Integrity signal for the text a locator points into.

Page and offsets

Reviewers can inspect the local source span, not just the paper ID.

Evidence item gate

Only extracted records can support claims downstream.
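To make the locator pieces concrete, here is a minimal sketch of how a span locator could be resolved and checked. The `SpanLocator` fields, the choice of SHA-256, and the `resolve_span` helper are all assumptions for illustration; the post does not specify the actual schema or hash function.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanLocator:
    """Hypothetical locator shape; field names are assumptions, not the real schema."""
    paper_id: str
    section_label: str
    page: int
    char_start: int
    char_end: int
    source_hash: str  # hash of the normalized section text (SHA-256 assumed here)


def hash_text(text: str) -> str:
    """Hash normalized section text so later edits to the source are detectable."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def resolve_span(locator: SpanLocator, section_text: str) -> str:
    """Return the text a locator points at, or raise if the locator is invalid."""
    if hash_text(section_text) != locator.source_hash:
        raise ValueError("source hash mismatch: section text changed since ingestion")
    if not (0 <= locator.char_start < locator.char_end <= len(section_text)):
        raise ValueError("character offsets fall outside the section text")
    return section_text[locator.char_start:locator.char_end]


section = "Methods. We enrolled 120 adults. Results. Response rate was 42%."
locator = SpanLocator(
    paper_id="paper-001",
    section_label="Results",
    page=3,
    char_start=section.index("Response"),
    char_end=len(section),
    source_hash=hash_text(section),
)
span = resolve_span(locator, section)
print(span)  # Response rate was 42%.
```

Checking the hash before slicing means an offset can never silently point into text that differs from what was ingested, which is the integrity role the source hash plays above.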

Review surface

The Library becomes part of the review workspace

The dashboard is now organized around Chat, Runs, Review Queue, Library, and Settings. Runs remain the main inspector for answer review, snapshot diffs, trace, evidence packets, audit, logic, argument, math, and provenance. Release 2.0 makes the Library more useful because that is where full-text ingestion and Watch drift become inspectable.

That split feels right to me. A reviewer should not have to hunt through raw graph nodes to understand whether a run is trustworthy. The run view answers, "what happened in this answer?" The Library answers, "what source material and ongoing topic state can I inspect around it?"

Runs: claim cards, support status, audit verdicts, graph snapshots, trace, and provenance.
Library: full-text document inspection, source sections, locators, evidence records, and graph lookup.
Watch: topic snapshots, drift comparison, relevance decisions, and advisory change signals.
Settings: research-only boundary, clinical refusal behavior, memory policy, and retrieval limits.

Reviewer signals

Watch drift and Argument Graph v2 are signals, not authority

Release 2.0 adds two review aids that would be dangerous if they were treated as truth. Research Watch graph drift compares topic snapshots and highlights changes in papers, claims, methods, limitations, entities, and support shifts. Argument Graph v2 connects support, attack, and qualifier relationships back to Evidence Graph node IDs.

Watch drift

Useful for seeing what changed between snapshots: new papers, changed claims, support/contradiction shifts, method changes, limitations, and clusters.

Argument Graph v2

Useful for making argument structure visible while preserving links back to Evidence Graph nodes, with qualifier edges marking partial or overclaimed support.

Both remain advisory reviewer context. They can point at places worth inspecting. They cannot override clinical refusal, source policy, citation audit, logic audit, or the evidence packet contract.
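A drift comparison of this kind can be sketched as a plain diff between two snapshots. This is an illustrative sketch only: `TopicSnapshot`, its fields, and the status strings are hypothetical, and the real drift schema covers more dimensions (methods, limitations, entities) than shown here.

```python
from dataclasses import dataclass, field


@dataclass
class TopicSnapshot:
    """Hypothetical snapshot: paper IDs plus a support status per claim ID."""
    paper_ids: set = field(default_factory=set)
    claim_support: dict = field(default_factory=dict)  # claim_id -> status string


def compare_snapshots(old: TopicSnapshot, new: TopicSnapshot) -> dict:
    """Describe what changed between snapshots; the result is advisory only."""
    shared_claims = old.claim_support.keys() & new.claim_support.keys()
    return {
        "advisory": True,  # a signal for reviewers, never an audit verdict
        "papers_added": sorted(new.paper_ids - old.paper_ids),
        "papers_removed": sorted(old.paper_ids - new.paper_ids),
        "claims_added": sorted(new.claim_support.keys() - old.claim_support.keys()),
        "support_shifts": {
            claim_id: (old.claim_support[claim_id], new.claim_support[claim_id])
            for claim_id in sorted(shared_claims)
            if old.claim_support[claim_id] != new.claim_support[claim_id]
        },
    }


old = TopicSnapshot({"p1", "p2"}, {"c1": "supported", "c2": "supported"})
new = TopicSnapshot(
    {"p2", "p3"},
    {"c1": "supported", "c2": "contradicted", "c3": "supported"},
)
drift = compare_snapshots(old, new)
print(drift["papers_added"])    # ['p3']
print(drift["support_shifts"])  # {'c2': ('supported', 'contradicted')}
```

The `advisory` flag is there to keep the output honest about its role: it tells a reviewer where to look, and nothing downstream should read it as a decision.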

Side-effect boundary

Dashboard Chat gets the same biomedical boundary

Release 2.0 also documents the framework-native Dashboard Chat boundary. Chat runs through the shared dashboard channel, session history, event stream, agent loop, and tool hooks. The biomedical plugin does not get a separate chat backend. Its policy still lives where the tools and routes live.

Read and inspect:
GET /api/biomed/answer-runs/{run_id}/evidence-review
GET /api/biomed/papers/{paper_id}/full-text
GET /api/biomed/watch/{watch_id}/drift

Deny until durable approval: review decisions, write/export tools, sensitive biomedical actions.

Clinical requests still stop before memory, retrieval, LLM work, parsing, graph construction, or export. Sensitive write, review, and export tools stay denied in chat until the framework has durable approval and resume semantics.
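The chat boundary can be pictured as an allowlist with a default deny. The sketch below assumes a hypothetical `chat_may_call` helper and a `clinical_request` flag supplied by some upstream classifier; only the three GET routes come from the post, and the real policy lives in the framework's tool hooks and routes rather than in a function like this.

```python
import re

# The three read-only routes listed in the post, with path parameters as wildcards.
READ_ROUTES = [
    re.compile(r"/api/biomed/answer-runs/[^/]+/evidence-review"),
    re.compile(r"/api/biomed/papers/[^/]+/full-text"),
    re.compile(r"/api/biomed/watch/[^/]+/drift"),
]


def chat_may_call(method: str, path: str, clinical_request: bool = False) -> bool:
    """Hypothetical policy check: allow known read routes, deny everything else."""
    if clinical_request:
        # Clinical requests stop before any retrieval, parsing, or export work.
        return False
    if method.upper() != "GET":
        # Write, review, and export actions stay denied in chat.
        return False
    return any(route.fullmatch(path) for route in READ_ROUTES)


print(chat_may_call("GET", "/api/biomed/papers/p1/full-text"))       # True
print(chat_may_call("POST", "/api/biomed/papers/p1/full-text"))      # False
print(chat_may_call("GET", "/api/biomed/watch/w1/drift", True))      # False
```

Default deny is the design choice worth noting: a new route is unavailable in chat until someone adds it to the allowlist on purpose.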

Quality contract

The eval gate now checks the deeper review trail

The old graph and review checks still matter: schema validity, validation rate, traceability, refusal behavior, export redaction, and run evidence review validity. Release 2.0 adds new checks for the new surface area instead of asking reviewers to trust it by inspection alone.

The deterministic mock eval now includes full-text ingestion success, full-text span locator validity, Watch drift schema validity, Argument Graph v2 schema validity, and argument-to-evidence link rate. The README also records a live PubMed plus DeepSeek smoke path with 27/27 checks passing for audited answer, trace, evidence packet, provenance graph, and clinical guardrail behavior.
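As one example of what such a check might compute, here is a minimal sketch of an argument-to-evidence link rate. The edge shape, the `evidence_node_id` key, and the choice to score an empty graph as 0.0 are assumptions made for illustration; the post does not define the metric's exact formula or any passing threshold.

```python
def argument_link_rate(edges: list, evidence_node_ids: set) -> float:
    """Fraction of argument edges whose evidence_node_id exists in the Evidence Graph."""
    if not edges:
        return 0.0  # assumption: an empty argument graph earns no credit
    linked = sum(1 for edge in edges if edge.get("evidence_node_id") in evidence_node_ids)
    return linked / len(edges)


def unlinked_edges(edges: list, evidence_node_ids: set) -> list:
    """Return the edges a reviewer would need to inspect because they dangle."""
    return [edge for edge in edges if edge.get("evidence_node_id") not in evidence_node_ids]


evidence_ids = {"ev-1", "ev-2"}
argument_edges = [
    {"relation": "support", "evidence_node_id": "ev-1"},
    {"relation": "attack", "evidence_node_id": "ev-2"},
    {"relation": "qualifier", "evidence_node_id": "ev-1"},
    {"relation": "support", "evidence_node_id": "ev-9"},  # dangling reference
]

rate = argument_link_rate(argument_edges, evidence_ids)
print(rate)  # 0.75
print(unlinked_edges(argument_edges, evidence_ids))
```

Reporting the dangling edges alongside the rate keeps the metric actionable: a number below 1.0 comes with the specific links to fix.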

Why this is better than "we parsed the PDF"

A parsed document can decorate an interface just as easily as a citation can decorate a sentence. Release 2.0 is useful because parser output has to become traceable evidence before it can support a claim.

The product change that matters in 2.0

The useful part of Release 2.0 is not that bio-agent can ingest more text. More text is easy to romanticize and easy to misuse. The useful part is that full text enters through the same discipline as the rest of the system: locators, extraction, packet selection, audit, graph validation, snapshots, provenance, and reviewer-visible surfaces.

That is the step beyond RAG I care about now. A biomedical answer is not generated text with citations attached, and it is not a PDF parser with a nicer UI. It is a reviewable evidence object that can point back to source text, show how claims changed, expose argument structure, and still refuse to turn workflow context into biomedical truth.