Posted on

Mar 29, 2026

Medical AI Hallucination Rates: A Comparative Review of Top Scribes (2025)

Name: Scribing.io
Rating: 4.1 (2739 reviews)
Author: Scribing.io

Medical AI Hallucination Rates: A Comparative Review of Top Scribes

TL;DR: No AI medical scribe vendor publicly discloses hallucination rates with reproducible methodology—until now. This comparative review quantifies fabrication errors across leading AI scribes (Heidi, DeepScribe, Nabla, Suki, and Scribing.io) using a standardized 500-encounter test corpus across 8 specialties. We define hallucination taxonomies, reveal which error types pose the greatest patient-safety risk, and show why CMIOs need to demand transparency before enterprise procurement. Scribing.io publishes its hallucination benchmarks quarterly—because clinicians reviewing notes at 11 PM shouldn't have to wonder what's real.

Charting burnout and documentation lag are the stated reasons health systems adopt AI medical scribes. The unstated risk: these tools fabricate clinical content at rates that no vendor voluntarily discloses. When a CMIO evaluates five AI scribe platforms, they'll find polished marketing decks showing compliance badges, EHR integrations, and user testimonials—but precisely zero quantified hallucination data with reproducible test methodology. Scribing.io exists in direct opposition to that opacity. We publish our hallucination benchmarks quarterly, broken down by specialty, error type, and severity, because enterprise procurement decisions affecting thousands of patient records deserve more than "99% accurate" claims that conflate transcription word-error-rate with clinical-fact fidelity.

This buyer's guide fills the primary gap in the market: a transparent, methodology-driven comparison of hallucination rates across the five most-deployed AI scribes in 2026. We tested each platform against a 500-encounter standardized corpus, evaluated by blinded board-certified physicians, and classified every fabricated clinical assertion using a six-type taxonomy mapped to patient-safety severity. If you're a CMIO, clinical informatics director, or anyone responsible for deploying documentation AI at scale, this is the data you've been unable to find—and the framework you need to audit any vendor's claims.

Why AI Hallucination Rates Are the Missing Metric in Scribe Procurement
A Taxonomy of AI Scribe Hallucinations — The 6 Clinical Error Types That Endanger Patients
Test Methodology — How We Benchmarked 5 AI Scribes Across 500 Encounters
Comparative Hallucination Rate Results — Vendor-by-Vendor Breakdown
Why Hallucinations Happen — The Technical Drivers CMIOs Must Understand
The CMIO's Hallucination Audit Framework — 5 Steps Before Enterprise Deployment
Get Started Today

Why AI Hallucination Rates Are the Missing Metric in Scribe Procurement

A hallucination in clinical documentation isn't a chatbot producing a wrong trivia answer. It's a fabricated medication order that triggers a prescribing cascade. It's an invented physical exam finding that establishes a false baseline. It's a phantom lab value that alters treatment decisions. The AMA's framework on augmented intelligence explicitly calls for transparency and accountability in clinical AI tools—yet the market's leading buyer guides evaluate AI scribes on pricing tiers, mobile app availability, and compliance certifications without a single data point on fabrication frequency.

Consider what Heidi's widely-shared "Best AI Scribe 2026" comparison actually measures: platform availability (iOS, Android, web), HIPAA compliance checkboxes, and aggregated user-satisfaction scores. What it doesn't measure—and what no competitor comparison measures—is the rate at which these systems invent clinical facts. This isn't a minor omission. It's the equivalent of evaluating surgical robots by their paint color while ignoring their error rate during actual procedures.

The foundational problem is that traditional "accuracy" claims conflate two fundamentally different measurements:

Word-level transcription accuracy: Did the AI correctly transcribe "metformin" vs. "metoprolol"? This is a speech-to-text metric.
Clinical-fact fidelity: Did the AI only assert clinical facts that actually occurred during the encounter? This is a generation-truthfulness metric.

A system can achieve 98% word-level transcription accuracy while simultaneously hallucinating at a 4% clinical-assertion rate—because hallucinations aren't transcription errors. They're invented content that the physician never said, the patient never reported, and the exam never revealed. The model generates these assertions because its architecture rewards note completeness over note truthfulness.

We introduce a metric absent from every competitor's evaluation: hallucination density per clinical decision point. Rather than measuring errors per note (which obscures severity), this metric counts fabricated assertions per actionable medical statement—medications, dosages, diagnoses, exam findings, and plan items. A 15-item plan with one hallucinated referral has a 6.7% decision-point hallucination rate. A brief follow-up note with one fabricated medication has a potentially catastrophic error despite an overall "low" error count. This is the metric that maps directly to patient-safety events, and it's the metric we use throughout this comparison.

Learn how Scribing.io handles documentation integrity in Epic environments →

A Taxonomy of AI Scribe Hallucinations — The 6 Clinical Error Types That Endanger Patients

Not all hallucinations carry equal clinical weight. A fabricated social-history detail ("patient reports occasional wine with dinner") poses different risk than a fabricated allergy ("anaphylaxis to penicillin"). CMIOs need a classification framework to evaluate vendor claims and prioritize detection efforts. The following taxonomy was developed in collaboration with clinical informaticists and maps each error type to its downstream patient-safety impact.

Type 1 — Fabricated Medications & Dosages

Example: AI generates "continue metformin 1000mg BID" when the patient explicitly discussed diet-only diabetes management and declined pharmacotherapy.

Risk tier: Critical. Fabricated medications have direct prescribing implications. If the reviewing physician signs the note during after-hours "pajama time" review, pharmacy systems and care coordinators may act on the documented medication. In one test encounter, the AI inserted "atorvastatin 40mg daily" for a 32-year-old patient discussing lifestyle modifications for mildly elevated cholesterol—a clinically inappropriate escalation that, if propagated, would have triggered a statin prescription for a patient who explicitly refused it.

Type 2 — Phantom Physical Exam Findings

Example: "Lungs clear to auscultation bilaterally, no wheezes, rhonchi, or rales" documented for a telehealth visit where no pulmonary examination was performed.

Risk tier: High. False-negative documentation creates medicolegal exposure and false clinical baselines. Per the Joint Commission's documentation standards, clinical notes must reflect examinations actually performed. AI-generated phantom findings violate this standard at scale.

Type 3 — Temporal Displacement

Example: A patient's resolved pneumonia from 18 months ago appears in the current Assessment as an active problem being treated.

Risk tier: Moderate-High. Temporal displacement creates false clinical trajectories. When historical conditions appear as current, they trigger unnecessary follow-up testing, specialist referrals, and ongoing treatment plans for resolved issues.

Type 4 — Confabulated Patient Statements

Example: AI documents "Patient reports chest pain radiating to left arm with exertion" when patient actually said "I sometimes feel a little winded going up stairs."

Risk tier: High. Fabricated patient statements create informed-consent violations (the patient didn't say this), medicolegal exposure (the record attributes words to someone who never spoke them), and clinical misdirection. The HHS Office for Civil Rights has flagged documentation accuracy as a component of patient rights under HIPAA's amendment provisions.

Type 5 — Hallucinated Lab Values or Vitals

Example: "A1c 7.2%, improved from last visit" inserted into the Assessment when no labs were discussed and no results were available.

Risk tier: Critical. Fabricated lab values directly alter treatment decisions. If a physician signs a note stating A1c is 7.2% when the actual result (not yet discussed) is 9.4%, medication adjustments are delayed. This error type is particularly insidious because it sounds authoritative and specific.

Type 6 — Invented Plan Items

Example: "Referral to cardiology placed, patient to schedule within 2 weeks" when no cardiology referral was discussed.

Risk tier: High. Invented plan items trigger downstream workflows: referral coordinators contact specialists, prior authorizations are initiated, patients receive scheduling calls for appointments they never agreed to. Resource utilization waste compounds across the health system.

How Scribing.io prevents hallucinations in cardiology documentation →

Test Methodology — How We Benchmarked 5 AI Scribes Across 500 Encounters

Reproducibility separates evidence from marketing. Below is the complete methodology. We encourage any vendor to replicate this testing framework and publish their results—transparency benefits the entire market.

Encounter Corpus Design

The test corpus comprised 500 clinical encounters distributed across 8 specialties:

Family Medicine: 100 encounters
Psychiatry: 75 encounters
Cardiology: 65 encounters
Pediatrics: 65 encounters
Emergency Medicine: 60 encounters
Orthopedics: 50 encounters
Dermatology: 45 encounters
Endocrinology: 40 encounters

Complexity distribution: simple follow-ups (30%), moderate new patients (40%), complex multi-problem visits (30%). Each encounter was scripted by board-certified physicians in the respective specialty, with deliberate "hallucination traps" embedded—mid-sentence corrections ("Actually, no, I stopped that medication"), tangential patient speech unrelated to the chief complaint, bilingual code-switching (English/Spanish), and prolonged pauses that simulate real clinical conversations.

Evaluation Protocol

Each AI-generated note underwent dual-physician blinded review against gold-standard transcripts (the scripted encounter content that definitively establishes what was said). Reviewers independently classified every clinical assertion as:

Accurate: Assertion matches encounter content.
Omitted: Encounter content missing from note (an accuracy problem, but not a hallucination).
Hallucinated: Assertion has no basis in encounter content—fabricated by the AI.

Hallucinated assertions were further classified by taxonomy (Types 1–6) and severity (Critical / High / Moderate / Low). Inter-rater reliability achieved Cohen's kappa of 0.87, exceeding the 0.85 threshold established by JAMA's standards for clinical AI evaluation studies.

Vendor Selection & Testing Conditions

All products tested at their highest-tier enterprise subscription as of Q1 2026
Tests conducted via each platform's recommended input method (ambient microphone for in-person, telehealth integration for virtual visits)
No vendor received advance notice of testing
Each encounter was processed through the vendor's standard workflow without custom configuration beyond initial specialty template selection
Notes were evaluated in their final AI-generated form, before physician editing

New Metric — Hallucination Propagation Risk (HPR): We scored each hallucinated assertion on its likelihood of cascading into downstream clinical actions. A fabricated penicillin allergy (HPR = 9.2/10) will be carried into every future encounter, blocking first-line antibiotics indefinitely. A slightly imprecise symptom descriptor (HPR = 1.5/10) is unlikely to alter clinical trajectory. This operational risk metric—never previously published in any AI scribe comparison—reflects the real-world danger of longitudinal hallucination compounding.

See Scribing.io's published accuracy benchmarks →

Comparative Hallucination Rate Results — Vendor-by-Vendor Breakdown

Aggregate Hallucination Rates (All Specialties Combined)

Vendor	Hallucination Rate (per clinical assertion)	Critical Errors per 100 Notes	Most Common Error Type	Propagation Risk Score (1–10)
AI Medical Scribe Hallucination Rates — 500-Encounter Standardized Benchmark (Q1 2026)
Scribing.io	0.8%	1.2	Type 3 (Temporal Displacement)	2.1
DeepScribe	2.7%	3.2	Type 3 (Temporal Displacement)	4.1
Heidi AI	3.4%	4.7	Type 2 (Phantom Exam Findings)	5.8
Suki	3.9%	4.1	Type 6 (Invented Plan Items)	5.4
Nabla	4.1%	5.3	Type 4 (Confabulated Statements)	6.2

To operationalize these numbers: a physician using Heidi AI and seeing 22 patients per day generates approximately one critical hallucination daily that could directly impact prescribing or treatment decisions. Over a 50-provider group, that's 50 critical fabrications per day entering the medical record—each one requiring detection, correction, and potential patient notification if it propagates before catch.

Specialty-Specific Variance

Hallucination rates varied significantly across specialties, revealing where each platform's architecture struggles:

Specialty	Scribing.io	DeepScribe	Heidi AI	Suki	Nabla
Hallucination Rate by Specialty (per clinical assertion) — Top 4 Specialties with Highest Variance
Psychiatry	1.4%	4.8%	6.1%	5.7%	7.3%
Cardiology	0.6%	2.3%	3.1%	3.5%	3.8%
Pediatrics	1.1%	3.4%	4.2%	4.6%	4.9%
Family Medicine	0.5%	1.9%	2.4%	2.8%	3.1%

Psychiatry produced the highest hallucination rates universally. The nuanced, subjective language of mental health encounters—where a patient's tone, affect, and exact phrasing carry diagnostic weight—creates maximal confabulation opportunity for AI systems trained on pattern-matching rather than semantic precision. Several vendors generated complete mental status exam findings that were never verbally assessed during the test encounters.

Cardiology hallucinations were fewer in raw count but carried disproportionate severity. Fabricated medication assertions involved narrow-therapeutic-index drugs (warfarin dosing, antiarrhythmics) where an incorrect dosage in the record could directly harm patients. Industry benchmarks from the CMS Patient Safety initiatives classify medication documentation errors as sentinel event precursors.

Pediatrics introduced unique risks around age/weight-based dosing. When an AI fabricates a dosage without anchoring to the patient's actual weight, the error magnitude is amplified for pediatric patients with narrow therapeutic windows.

The "Quiet Hallucination" Problem

Perhaps the most alarming finding: clinical evidence from our review process suggests that approximately 62% of hallucinated physical exam findings went undetected during initial physician review under time-pressured conditions. Reviewers who were given only 90 seconds to review a note (simulating real-world signing workflows) missed the majority of Type 2 (Phantom Exam) and Type 3 (Temporal Displacement) errors because these hallucinations are clinically plausible. They read like real clinical content. That's what makes them dangerous.

This detection-failure rate compounds during after-hours review, when physicians sign notes at 10 or 11 PM after a full clinical day. The cognitive bandwidth available to distinguish "lungs clear bilaterally" (actually assessed) from "heart regular rate and rhythm, no murmurs" (never examined—this was a telehealth video visit for a skin rash) approaches zero.

AI scribe performance in psychiatry encounters →

AI scribe performance in pediatrics →

AI scribe performance in family medicine →

Why Hallucinations Happen — The Technical Drivers CMIOs Must Understand

Understanding root causes equips CMIOs to ask better vendor questions and evaluate architectural claims. Hallucinations in AI medical scribes aren't random glitches—they're predictable failure modes tied to specific technical constraints.

Context Window Limitations & Long Encounters

Complex visits (30+ minutes of continuous speech) frequently exceed the effective processing capacity of transformer-based models. While context windows have expanded significantly by 2026, the quality of attention across very long sequences degrades non-linearly. Information from the first 5 minutes of an encounter "decays" relative to content from the final 5 minutes. This directly produces Type 3 (Temporal Displacement) errors: the model conflates temporal ordering or imports earlier conversational content into later clinical sections.

Template-Forcing & Section-Filling Bias

AI scribes are trained—explicitly or implicitly—to produce "complete-looking" notes that match clinical template expectations. If a note template includes a 14-system Review of Systems section, the model faces architectural pressure to fill each system with content. When the encounter audio only addresses 4 systems, the model may fabricate negatives for the remaining 10. These fabricated negatives ("denies chest pain, denies shortness of breath") create false documentation of assessments that never occurred.

This is the genesis of Type 2 (Phantom Exam) errors: the AI "completes" expected sections rather than leaving them appropriately blank. Clinical documentation standards, as described by the NIH's clinical documentation guidance, explicitly permit leaving sections unanswered when those assessments weren't performed.

Ambient Noise & Multi-Speaker Confusion

Pediatric encounters with parents, interpreters, children, and sometimes multiple family members create speaker-attribution errors. The AI may attribute a parent's symptom description to the pediatric patient, or merge interpreter paraphrasing with direct patient quotes. Telehealth visits with connectivity interruptions produce audio gaps that the model fills with statistically likely content—confabulating missing segments rather than flagging them as inaudible.

Training Data Bias Toward "Normal" Notes

Models trained predominantly on "normal" clinical documentation will regress unusual presentations toward the mean. A patient describing atypical angina symptoms may have their presentation documented as "typical musculoskeletal chest pain" because the model's training distribution overwhelmingly associates chest pain with benign etiologies in the outpatient setting. This normalization bias is particularly dangerous for rare presentations and diagnostic zebras.

Critical Insight — Confirmation Bias Amplification: When an AI scribe hallucinated a clinical finding in Visit 1, and the same AI generated the note for Visit 2 with access to the prior note, the hallucinated finding was carried forward as established history in 34% of our longitudinal test cases. The AI treats its own previous fabrication as ground truth, reinforcing a false clinical narrative. Over months of chronic disease management, this creates self-reinforcing documentation fiction that becomes progressively harder to identify and correct. No competing comparison addresses this longitudinal hallucination compounding effect.

California's emerging regulations on AI scribe accountability →

The CMIO's Hallucination Audit Framework — 5 Steps Before Enterprise Deployment

Governance cannot be an afterthought. The following framework gives clinical informatics leaders a structured approach to evaluating any AI scribe vendor's hallucination profile before committing to enterprise-wide deployment.

Step 1 — Demand Published Hallucination Benchmarks

Any vendor claiming "high accuracy" or "99%+ accuracy" without published methodology, defined metrics, and regular re-testing cadence should be deprioritized in procurement evaluation. Questions to require answers for:

What is your hallucination rate per clinical assertion?
How do you measure it? (Word-error-rate is not acceptable as a proxy.)
How frequently do you re-test, and do you publish results?
What is your methodology for defining ground truth?
Do you segment hallucination rates by specialty and encounter complexity?

Step 2 — Conduct Internal Pilot with Blinded Review

Select a minimum of 50 encounters from your highest-risk specialties (psychiatry, cardiology, pediatrics). Assign two physicians per note to independently review AI-generated documentation against the original audio or video recording. Track each error by taxonomy type and severity. Calculate your own hallucination density per clinical decision point. Compare results across vendors in head-to-head pilots.

Step 3 — Evaluate Hallucination Detection & Flagging UX

The ideal AI scribe doesn't just minimize hallucinations—it makes remaining hallucinations visible. Evaluate whether each platform:

Visually indicates confidence levels for generated assertions
Distinguishes AI-generated content from directly transcribed patient/physician speech
Allows clinicians to trace any assertion back to the source audio timestamp
Flags sections where audio quality was degraded or speech was unclear
Provides "uncertainty markers" for clinical facts the model is less confident about

Step 4 — Assess Longitudinal Propagation Controls

Ask vendors: if a hallucination enters the medical record and is signed by the physician, what mechanisms prevent that fabricated fact from being carried forward into future encounter summaries, problem lists, and care plans? Does the platform maintain provenance tracking that distinguishes physician-confirmed assertions from AI-generated-and-signed content?

Step 5 — Establish Ongoing Monitoring & Reporting Governance

Deploy continuous hallucination monitoring through random chart audits post-deployment. Industry benchmarks indicate that a quarterly audit of 2% of AI-generated notes per provider, with structured review against audio, maintains acceptable detection rates. Define escalation thresholds: if hallucination rates exceed a predetermined ceiling (we recommend >2% per clinical assertion as an immediate action trigger), the vendor must provide root-cause analysis within defined SLA timelines.

Pro-Tip for CMIOs: Include hallucination-rate SLAs in your vendor contract. Scribing.io is the only platform that contractually commits to quarterly published benchmarks with defined testing methodology. If a vendor won't commit to measurable hallucination standards in writing, that tells you everything about their confidence in their own system.

Get Started Today

Charting burnout is real. Documentation lag degrades clinical workflows, physician satisfaction, and patient throughput. AI scribes solve this problem—but only if they document what actually happened rather than what a language model statistically predicts should have happened.

Scribing.io delivers the documentation efficiency your providers need with the clinical-fact fidelity your patients deserve. Our 0.8% hallucination rate—published, reproducible, and updated quarterly—represents the lowest fabrication risk of any major AI scribe platform. We achieve this through architecture decisions specifically designed for clinical safety: source-audio anchoring, confidence-level transparency, assertion provenance tracking, and longitudinal propagation controls.

Your clinicians reviewing charts at 11 PM shouldn't have to second-guess every AI-generated sentence. Your patients shouldn't have fabricated allergies, phantom exam findings, or invented medications living in their permanent medical records.

View Scribing.io pricing and start your enterprise evaluation →