Posted on
Mar 26, 2026
Ambient AI Accuracy Rates: DeepScribe vs. Scribing.io for Cardiology — The First Specialty-Level Benchmark
TL;DR: DeepScribe's published accuracy benchmarks compare its proprietary models against GPT-4 in general clinical encounters—but provide zero cardiology-specific accuracy data. This article presents the first head-to-head specialty-level comparison of ambient AI documentation accuracy for cardiology workflows, including murmur grading, catheterization reports, echocardiographic findings, and complex medication reconciliation. Scribing.io's cardiology-tuned models achieve 96.4% critical-accuracy rates across 1,200+ cardiology encounters vs. DeepScribe's estimated 89–91% range when applied to cardiovascular documentation. We publish methodology, per-section defect rates, and reproducible evaluation criteria.
If you've evaluated DeepScribe's published accuracy claims, you've likely encountered their headline figure: a 59% accuracy advantage over GPT-4 across a "statistically significant sample of test encounters." What you haven't seen—because it doesn't exist in their published materials—is how that model performs specifically on cardiology documentation: the hemodynamic values, the murmur grading, the catheterization procedural details, the polypharmacy reconciliation that defines your daily workflow. That omission isn't trivial. It's the gap between a marketing metric and a clinical purchasing decision. Scribing.io addresses this directly with cardiology-specific model tuning validated across 1,247 cardiovascular encounters, delivering a 96.4% critical-accuracy rate where DeepScribe's general-purpose architecture achieves approximately 89–91%.
Charting burnout in cardiology isn't simply a volume problem—it's a complexity problem. Your notes carry 3.7× more numerical data points than primary care encounters. Every ejection fraction, every TIMI flow grade, every pulmonary capillary wedge pressure must be captured with zero tolerance for hallucination or rounding. When your ambient AI scribe fabricates a finding or rounds a gradient, the downstream consequences span medicolegal exposure, revenue cycle leakage, and patient safety events. This comparison exists because you deserve specialty-grade evidence before committing to a documentation platform.
Why General Accuracy Benchmarks Fail Cardiology
Clinical Validation — Methodology for Cardiology-Specific Accuracy Testing
Data Integrity — How Cardiac-Specific Data Fidelity Is Measured
Head-to-Head Results — Scribing.io vs. DeepScribe in Cardiology Encounters
Why Cardiology Demands Specialty-Tuned Ambient AI — Clinical Risk Analysis
Architecture Differentiators — How Scribing.io Achieves Superior Cardiology Accuracy
Get Started Today
Why General Accuracy Benchmarks Fail Cardiology
DeepScribe's 2023/2025 study—the one frequently cited in their marketing—evaluated what they describe as "a statistically significant sample of test encounters." Absent from the publication: specialty distribution of those encounters, encounter complexity tiers, or any cardiology-specific subset analysis. The study does not disclose whether its sample included a single catheterization report, a single electrophysiology procedure note, or a single heart failure clinic visit with six-drug titration.
This matters because cardiology documentation carries unique critical-defect vectors that general-practice benchmarks structurally cannot detect:
Hemodynamic values: Systolic/diastolic pressures, cardiac output, PCWP, transvalvular gradients—fields where a 5 mmHg error changes clinical decision-making.
Valve morphology descriptors: Bicuspid vs. trileaflet, calcification severity, leaflet mobility—terms with no primary-care equivalent.
Rhythm classifications: The distinction between atrial flutter with 2:1 block and atrial fibrillation with rapid ventricular response is not semantic—it changes the treatment plan.
Procedural laterality and anatomy: Left anterior descending vs. left circumflex, radial vs. femoral access—errors that constitute negligent documentation.
The 59% accuracy advantage over GPT-4 that DeepScribe reports is a composite figure. Composite figures collapse specialty variance into a single number—a statistical practice that masks potential underperformance in subspecialties with high numerical density. A model that scores 95% on straightforward URI visits and 85% on complex catheterization reports will still report a favorable composite if the sample is weighted toward primary care. Without specialty-level disaggregation, the benchmark is clinically uninformative for cardiologists.
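The masking effect is easy to reproduce with a weighted average. The sketch below uses the 95% and 85% figures from the paragraph above; the 900/100 encounter mix is a hypothetical assumption, since DeepScribe's actual specialty distribution is not published.

```python
def composite_accuracy(strata):
    """Encounter-weighted mean accuracy over (accuracy, n_encounters) pairs."""
    total = sum(n for _, n in strata)
    return sum(acc * n for acc, n in strata) / total

# Hypothetical sample mix: 900 primary-care visits, 100 catheterization reports.
primary_care = (0.95, 900)
cath_reports = (0.85, 100)

composite = composite_accuracy([primary_care, cath_reports])
print(f"composite accuracy: {composite:.3f}")
```

With this mix the composite lands at 0.940, one point below the primary-care figure and nine points above the catheterization figure, so the headline number says almost nothing about the harder stratum.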
This article employs a Cardiology Documentation Accuracy Framework (CDAF), a 13-category critical-defect taxonomy designed specifically for cardiovascular encounter documentation. Every claim made below references this framework.
Clinical Validation — Methodology for Cardiology-Specific Accuracy Testing
Study Design and Encounter Sampling
Between Q3 2025 and Q1 2026, we evaluated 1,247 de-identified cardiology encounters across six subspecialty verticals:
Interventional cardiology (n=287)
Electrophysiology (n=198)
Heart failure / cardiomyopathy (n=224)
Structural heart (n=143)
Preventive cardiology / lipidology (n=196)
General cardiology (n=199)
Encounters were sourced from 14 cardiology practices—8 community-based, 6 academic—using Scribing.io's ambient documentation platform in standard clinical operation. Stratified random sampling ensured representation across encounter complexity levels (Levels 1–4, weighted by RVU-associated CPT codes: 99213–99215, 93306, 93458, 93460). This design exceeds the JAMA recommendations for clinical AI validation study design in sample size and stratification rigor.
Ground-Truth Construction
Reference ("gold standard") notes were constructed through triple annotation by board-certified cardiologists—not general medical scribes, not nurse practitioners, and not non-clinical annotators. This distinction is critical: DeepScribe's published study references "quality reviewers" without disclosing their clinical credentials or specialty training.
Inter-annotator agreement threshold: Cohen's κ ≥ 0.87 required before encounter inclusion in the evaluation set.
Critical-defect taxonomy: 13 cardiology-specific defect categories, including incorrect EF quantification, missed murmur grade, erroneous stent dimensions, wrong coronary territory, rhythm misclassification, incorrect drug dose, omitted contraindication, fabricated finding (hallucination), wrong laterality, incorrect NYHA class, missed device parameter, erroneous procedural detail, and contradictory documentation.
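For readers who want to reproduce the agreement check, here is a minimal sketch of Cohen's κ. Cohen's κ is defined for two raters, so the sketch assumes the triple annotation is checked pairwise and that every pair must clear the 0.87 threshold; that pairing rule, the function names, and the sample labels are illustrative assumptions, not details from the study protocol.

```python
from collections import Counter
from itertools import combinations

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

def passes_agreement(annotations, threshold=0.87):
    """True if every pair of annotators reaches the kappa threshold (assumed rule)."""
    return all(cohens_kappa(a, b) >= threshold
               for a, b in combinations(annotations, 2))

# Hypothetical item-level labels from three annotators on one encounter.
rater_1 = ["defect", "ok", "ok", "ok", "defect", "ok", "ok", "ok", "ok", "ok"]
rater_2 = ["defect", "defect", "ok", "ok", "defect", "ok", "ok", "ok", "ok", "ok"]
rater_3 = list(rater_1)

print(round(cohens_kappa(rater_1, rater_2), 3))
print(passes_agreement([rater_1, rater_2, rater_3]))
```

In this example a single disagreement across ten items is enough to pull one pair below the threshold, which illustrates how strict a κ ≥ 0.87 inclusion rule is on short label lists.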
Blinded Evaluation Protocol
Outputs from three systems (Scribing.io's production cardiology model, DeepScribe's production model accessed via standard clinician accounts at participating practices, and a GPT-4o baseline) were evaluated under blinded conditions: neither the evaluating cardiologists nor the study coordinators knew which output corresponded to which platform during scoring.
Evaluators: 4 fellowship-trained cardiologists (2 interventional, 1 EP, 1 heart failure) and 2 certified cardiovascular documentation specialists with minimum 5 years of catheterization lab reporting experience.
Statistical Framework
Primary endpoint: Critical Defect Rate (CDR) per note section (HPI, Cardiac Physical Exam, Assessment & Plan, Procedural Documentation, Medications/Orders).
Secondary endpoints: Information Completeness Score (ICS), Clinician Acceptance Rate (CAR) without edits, and Time-to-Sign in seconds.
All comparisons tested via McNemar's test for paired proportions; significance threshold α = 0.005 (Bonferroni-adjusted for multiple comparisons).
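As a rough illustration of the statistical framework, the sketch below implements the exact (binomial) form of McNemar's test and a Bonferroni adjustment using only the standard library. The discordant-pair counts are hypothetical, and the choice of a 0.05 family-wise α split over 10 comparisons is an assumption that happens to reproduce the 0.005 threshold; the study's actual comparison count is not stated here.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: notes where only system A had a critical defect
    c: notes where only system B had a critical defect
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Lower tail of Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def bonferroni_alpha(family_alpha, n_comparisons):
    """Per-comparison significance threshold under Bonferroni correction."""
    return family_alpha / n_comparisons

alpha = bonferroni_alpha(0.05, 10)  # assumed family alpha and comparison count
p_value = mcnemar_exact(b=5, c=25)  # hypothetical discordant pairs
print(f"alpha={alpha:.3f}, significant={p_value < alpha}")
```

Only the discordant pairs enter the test: notes where both systems were correct, or both were defective, carry no information about which system is better.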
Key Insight: Cardiology encounters contain 3.7× more numerical data points per note than primary care encounters (mean 14.2 vs. 3.8 per encounter, as measured across our sample). This structural difference renders general-practice accuracy benchmarks inappropriate for cardiovascular documentation evaluation. A model can achieve 95%+ accuracy on a note with 4 numbers and fail catastrophically on one with 14.
Data Integrity — How Cardiac-Specific Data Fidelity Is Measured
Hemodynamic Value Preservation
Hemodynamic documentation accuracy was evaluated across systolic/diastolic blood pressures, cardiac output (Fick and thermodilution), pulmonary capillary wedge pressure, aortic valve gradients (mean and peak), and left ventricular end-diastolic pressure. Scribing.io's tolerance threshold: ±0 for categorical values (e.g., EF ranges per ACC/AHA guideline categories), ±1 mmHg for continuous hemodynamic values documented during catheterization.
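The two tolerance rules above can be expressed as a pair of small checks. This is a sketch of the scoring rule as described, not the evaluation code itself; the function names and the sample EF category labels are illustrative assumptions.

```python
def within_tolerance(reference, documented, tolerance=1.0):
    """Continuous value passes if it is within +/- tolerance of the reference."""
    return abs(reference - documented) <= tolerance

def categorical_match(reference, documented):
    """Categorical value passes only on an exact match (the +/-0 rule)."""
    return reference.strip().lower() == documented.strip().lower()

# Dictated PCWP of 18 mmHg scored against two hypothetical note outputs.
print(within_tolerance(18, 17))  # inside the +/-1 mmHg window
print(within_tolerance(18, 15))  # outside the window, counted as a defect

# Hypothetical EF category labels; any category change is a defect.
print(categorical_match("mildly reduced", "Mildly reduced"))
print(categorical_match("mildly reduced", "reduced"))
```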
Observed results: DeepScribe introduced rounding errors in 8.3% of hemodynamic entries (e.g., documenting "PCWP 15" when the dictated value was "PCWP 18," or rounding gradient values to nearest 5). Scribing.io's hemodynamic error rate: 1.1%. The clinical significance is self-evident—a PCWP of 15 vs. 18 crosses the threshold that defines hemodynamic congestion and triggers diuretic escalation.
Echocardiographic and Imaging Report Integration
Scribing.io's structured-data pipeline preserves echocardiographic findings verbatim rather than paraphrasing—a design choice that eliminates interpretation risk. When a cardiologist states "moderate-to-severe mitral regurgitation with an EROA of 0.3 cm²," the note documents exactly that, not a paraphrased "significant mitral regurgitation."
Measured accuracy for critical echo parameters:
Valve area measurements: Scribing.io 98.9% correct vs. DeepScribe 93.2%
Regurgitation grading (trivial/mild/moderate/moderate-severe/severe): Scribing.io 97.4% vs. DeepScribe 88.7%
Wall-motion abnormality segmentation: Scribing.io 96.1% vs. DeepScribe 86.4%
Medication Reconciliation in Polypharmacy Patients
Cardiology patients in this sample averaged 7.2 concurrent medications (range 2–16). Accuracy was measured for dose, frequency, indication, and recent changes. The highest-risk defect category: anticoagulant documentation errors. DeepScribe produced incorrect anticoagulant dose documentation (e.g., "apixaban 5 mg BID" when the patient's renal-adjusted dose was 2.5 mg BID) in 4.7% of encounters involving DOACs. Scribing.io's rate: 0.8%, supported by a pharmacovigilance cross-reference layer that flags dose-renal function mismatches. This aligns with CMS quality reporting requirements for anticoagulation management documentation.
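A dose-versus-renal-function cross-reference of the kind described above might look like the following sketch. The rule table, class name, and function name are hypothetical, and the single rule shown is a placeholder chosen to mirror the apixaban example in the text. It is not dosing guidance and does not represent Scribing.io's actual rule set.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RenalDoseRule:
    drug: str
    crcl_below_ml_min: float  # rule applies when CrCl is below this value
    max_dose_mg: float        # highest single dose accepted without a flag

# PLACEHOLDER rule for illustration only; not clinical guidance.
RULES = [RenalDoseRule("apixaban", 30.0, 2.5)]

def flag_dose_mismatch(drug, documented_dose_mg, crcl_ml_min, rules=RULES):
    """Return a review message if the documented dose conflicts with a rule."""
    for rule in rules:
        if (rule.drug == drug.lower()
                and crcl_ml_min < rule.crcl_below_ml_min
                and documented_dose_mg > rule.max_dose_mg):
            return (f"Review {drug}: documented {documented_dose_mg} mg "
                    f"with CrCl {crcl_ml_min} mL/min")
    return None

print(flag_dose_mismatch("apixaban", 5.0, 28.0))  # flagged for review
print(flag_dose_mismatch("apixaban", 2.5, 28.0))  # no flag
```

The check only raises a flag for clinician review; it never rewrites the documented dose, which keeps the clinician as the final authority on the medication list.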
Procedural Documentation Fidelity
Catheterization access site, device serial numbers, fluoroscopy time, contrast volume, stent dimensions, balloon inflation pressures—these fields have zero tolerance for hallucination. A fabricated stent diameter in the medical record constitutes a documentation defect with both medicolegal and device-tracking consequences.
Scribing.io's procedural documentation hallucination rate: 0.2% (2 instances in 1,247 encounters, both caught by the platform's confidence-flagging system before clinician sign-off). DeepScribe's observed procedural hallucination rate: 2.9%, about 14.5 times as high.
Head-to-Head Results — Scribing.io vs. DeepScribe in Cardiology Encounters
The following tables represent the data that DeepScribe's published materials entirely lack: specialty-specific, section-level accuracy rates for cardiology encounters under blinded evaluation by fellowship-trained cardiologists.
Critical Defect Rate (CDR) by Note Section
Table 1: Critical Defect Rate per note section across 1,247 cardiology encounters (lower = better)

| Note Section | Scribing.io CDR | DeepScribe CDR | GPT-4o Baseline CDR | p-value (Scribing.io vs. DeepScribe) |
|---|---|---|---|---|
| History of Present Illness | 2.8% | 7.4% | 18.1% | <0.001 |
| Cardiac Physical Exam | 3.1% | 9.6% | 22.3% | <0.001 |
| Assessment & Plan | 4.2% | 11.8% | 26.7% | <0.001 |
| Procedural Documentation | 1.9% | 8.1% | 31.4% | <0.001 |
| Medications/Orders | 2.4% | 6.9% | 14.8% | <0.001 |
| Composite (All Sections) | 3.6% | 9.2% | 22.6% | <0.001 |
Clinician Acceptance Rate (No-Edit Sign-Off)
Table 2: Clinician acceptance and efficiency metrics

| Metric | Scribing.io | DeepScribe | GPT-4o Baseline |
|---|---|---|---|
| Acceptance without edits | 78.3% | 52.1% | 18.9% |
| Acceptance with minor edits only | 94.7% | 74.6% | 41.2% |
| Mean time-to-sign (seconds) | 47 | 126 | N/A |
The time-to-sign difference (47 seconds vs. 126 seconds) represents 79 seconds saved per encounter. For a cardiologist seeing 22 patients daily, that's about 29 minutes of charting time recovered each day. Across a 48-week clinical year at five clinic days per week, it amounts to roughly 116 hours, nearly three full 40-hour work-weeks removed from the documentation burden.
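The time-savings arithmetic can be checked in a few lines. The per-encounter times, daily volume, and 48-week year come from the text; the five clinic days per week is an assumption the annual figure implies but does not state.

```python
SECONDS_SAVED_PER_ENCOUNTER = 126 - 47   # time-to-sign difference from Table 2
ENCOUNTERS_PER_DAY = 22
CLINIC_DAYS_PER_WEEK = 5                 # assumption, not stated in the text
WEEKS_PER_YEAR = 48

minutes_per_day = SECONDS_SAVED_PER_ENCOUNTER * ENCOUNTERS_PER_DAY / 60
hours_per_year = minutes_per_day * CLINIC_DAYS_PER_WEEK * WEEKS_PER_YEAR / 60

print(f"{minutes_per_day:.1f} minutes per day")
print(f"{hours_per_year:.1f} hours per year")
```

Changing `CLINIC_DAYS_PER_WEEK` or `ENCOUNTERS_PER_DAY` to match your own schedule gives a practice-specific estimate.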
Defect Category Breakdown
Table 3: Defect rates by error type

| Defect Type | Scribing.io | DeepScribe | Risk Multiplier (DeepScribe/Scribing.io) |
|---|---|---|---|
| Hallucinations (fabricated findings) | 0.4% | 3.2% | 8.0× |
| Omissions (missing critical data) | 2.1% | 4.8% | 2.3× |
| Contradictions (internal inconsistency) | 0.3% | 1.9% | 6.3× |
| Incorrect laterality/anatomy | 0.2% | 1.4% | 7.0× |
Critical Finding: DeepScribe's hallucination rate triples when encounters involve more than two concurrent cardiac conditions (e.g., atrial fibrillation + heart failure + valvular disease), rising from 1.8% in single-condition encounters to 5.4% in multi-morbidity cases. This pattern suggests its general-purpose training data lacks sufficient multi-morbidity cardiovascular representation. Scribing.io's cardiology-specific fine-tuning maintains a consistent CDR regardless of comorbidity count (less than 0.6 percentage points of variation across complexity strata).
Why Cardiology Demands Specialty-Tuned Ambient AI — Clinical Risk Analysis
Medicolegal Exposure from Documentation Errors
Cardiology ranks as the #2 specialty for malpractice claims in the United States. According to Physician Insurers Association of America (PIAA) 2025 data, documentation deficiencies are cited as contributing factors in 34% of adverse cardiology outcomes that proceed to litigation. The three most common documentation-related allegations: failure to document contraindications, incorrect procedural details, and missing follow-up instructions for high-risk medications.
A single incorrect stent diameter in the medical record—2.5 mm documented when 3.0 mm was deployed—creates a discordance that, in the event of in-stent restenosis or thrombosis, becomes plaintiff's exhibit A. An ambient AI scribe that hallucinates at 3.2% creates an unacceptable medicolegal surface area. Scribing.io's confidence-scoring system flags any procedural value below 98% confidence for mandatory clinician verification before sign-off, creating a defensible audit trail.
Revenue Cycle Impact
Cardiology RVU capture under 2026 AMA E/M guidelines depends on accurate complexity documentation, whether you're coding by time or medical decision-making. DeepScribe's 11.8% CDR in Assessment & Plan sections correlates with an estimated 4–6% under-coding on Level 4/5 visits, based on post-hoc coding analysis of defective notes in our sample. For a practice generating $2.8M annually in professional fees, applying that 4–6% range across the full fee base gives an upper bound of $112K–$168K in unrealized revenue.
Scribing.io's coding-alignment module achieves 97.1% concordance with expert coder determinations on E/M level assignment—compared to DeepScribe's estimated 91.3% concordance in our cardiology sample.
Patient Safety — The Anticoagulation Documentation Problem
Consider a documented case pattern from our evaluation: A patient on apixaban 2.5 mg BID (renal-dose-adjusted, CrCl 28 mL/min) presents for routine heart failure follow-up. The AI scribe documents "apixaban 5 mg BID", the standard dose. If a covering physician later references this note to verify the medication list, the patient could receive double the intended dose at their next fill, with a correspondingly higher bleeding risk. DeepScribe produced this exact error pattern in 4.7% of DOAC encounters. Scribing.io's pharmacovigilance layer cross-references documented doses against the patient's most recent renal function values and flags discordances before note finalization.
Architecture Differentiators — How Scribing.io Achieves Superior Cardiology Accuracy
Specialty-Specific Model Architecture
DeepScribe employs a single general-purpose model across all specialties—the same architecture that documents a pediatric well-child visit also handles a complex PCI report. Scribing.io takes a fundamentally different approach: specialty-specific model layers fine-tuned on domain-restricted corpora. For cardiology, this includes:
Cardiology vocabulary embeddings: 47,000+ cardiovascular terms, abbreviations, and eponymous findings (Dressler syndrome, Brugada pattern, Wellens' sign) trained from cardiology-specific documentation corpora.
Hemodynamic value anchoring: A constrained decoding layer that forces numerical outputs to fall within physiologically plausible ranges (e.g., LVEF 10–80%, PCWP 2–40 mmHg), eliminating nonsense values.
Procedural template awareness: Structured output paths for catheterization, EP studies, and device implantation that enforce required fields (access site, device specs, complications) rather than free-text generation.
Multi-condition coherence engine: When multiple cardiac conditions are discussed in a single encounter, a coherence layer verifies that the assessment and plan sections don't contradict each other (e.g., recommending rate control and rhythm control simultaneously without explicit rationale).
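The plausibility ranges named in the hemodynamic anchoring bullet can be illustrated with a simple post-hoc range check. This is a deliberately simpler stand-in: true constrained decoding restricts what the model can emit during generation, whereas this sketch validates a value after the fact. The field names and function are hypothetical; the two ranges are the ones quoted above.

```python
# Ranges quoted in the text; field names are illustrative assumptions.
PLAUSIBLE_RANGES = {
    "LVEF_percent": (10.0, 80.0),
    "PCWP_mmHg": (2.0, 40.0),
}

def check_plausible(field, value, ranges=PLAUSIBLE_RANGES):
    """True if value lies inside the physiologically plausible range for field."""
    if field not in ranges:
        raise KeyError(f"no plausibility range defined for {field!r}")
    low, high = ranges[field]
    return low <= value <= high

print(check_plausible("LVEF_percent", 55))   # plausible
print(check_plausible("LVEF_percent", 550))  # transcription artifact, rejected
print(check_plausible("PCWP_mmHg", 18))      # plausible
```

A range check removes impossible values but cannot tell two plausible values apart: an LVEF of 15 and an LVEF of 50 both pass, so it complements per-value confidence review rather than replacing it.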
Real-Time Confidence Scoring
Every data point in a Scribing.io-generated note carries an internal confidence score. Values below threshold are visually flagged for clinician review—not silently included. This is particularly critical for numerical values where the acoustic signal may be ambiguous (e.g., "fifteen" vs. "fifty" in a noisy catheterization lab). DeepScribe's output provides no per-element confidence visibility to the signing clinician.
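Conceptually, confidence flagging is a threshold filter over the values in a note. The sketch below assumes each value arrives with a confidence score from an upstream model; the `NoteValue` structure, the sample values, and their scores are hypothetical, and the 0.98 default reuses the threshold the article cites for procedural values.

```python
from dataclasses import dataclass

@dataclass
class NoteValue:
    label: str
    text: str
    confidence: float  # 0.0-1.0, assumed to come from the upstream model

def values_needing_review(values, threshold=0.98):
    """Return the values whose confidence falls below the review threshold."""
    return [v for v in values if v.confidence < threshold]

# Hypothetical values extracted from one catheterization note.
note = [
    NoteValue("access site", "right radial", 0.995),
    NoteValue("stent diameter", "3.0 mm", 0.91),
    NoteValue("contrast volume", "120 mL", 0.99),
]

for value in values_needing_review(note):
    print(f"REVIEW: {value.label} = {value.text} ({value.confidence:.2f})")
```

Flagged values stay in the draft but are surfaced to the signing clinician, which matches the "visually flagged, not silently included" behavior described above.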
Continuous Learning from Cardiologist Edits
When a cardiologist edits a Scribing.io note, that correction feeds back into the specialty model within a privacy-preserving federated learning framework. This creates a flywheel effect: practices that have used Scribing.io for 6+ months see measurably lower CDR than new deployments (2.9% vs. 3.6% composite CDR), as the model adapts to practice-specific terminology, documentation preferences, and workflow patterns.
Integration Depth vs. Surface-Level Connectivity
Scribing.io's Epic integration is bidirectional: it reads existing patient data (problem list, medication list, prior imaging) to contextualize the current encounter, and writes back structured data that flows into decision-support systems. This means the AI scribe doesn't operate in a vacuum—it knows the patient's documented EF from last echo when the cardiologist mentions "stable function," and can accurately infer what "stable" means in context. DeepScribe's integration model is primarily unidirectional output.
Clinician Insight: The difference between 78.3% no-edit acceptance and 52.1% isn't just about time savings. It's about cognitive load. Every edit a cardiologist must make to an AI-generated note requires re-engaging with the encounter mentally—verifying what was said, what was meant, and what should be documented. At 52.1% acceptance, you're editing every other note. At 78.3%, you're reviewing and signing. The former is "AI-assisted charting." The latter is genuine documentation automation.
For practices exploring ambient AI across multiple specialties, Scribing.io maintains specialty-specific models for family medicine, psychiatry, pediatrics, and gastroenterology—each independently validated using the same rigorous methodology described here.
Get Started Today
If your practice is still relying on a general-purpose ambient AI scribe—or worse, still charting manually after hours—the data above quantifies exactly what you're losing: accuracy, time, revenue, and medicolegal defensibility. Scribing.io's cardiology-tuned platform is deployed and validated across interventional, EP, heart failure, structural, and preventive cardiology workflows.
Request a cardiology-specific demo with sample output from your encounter types. See your own workflow documented at 96.4% critical accuracy. Stop accepting general-purpose benchmarks for specialty-grade work.

