Diagnostic AI That Knows What It Doesn't Know
Uncertainty quantification in clinical AI systems — why calibrated confidence matters more than raw accuracy.
A diagnostic AI system that is 95% accurate sounds impressive until you consider what happens in the remaining 5%. If the system expresses high confidence on cases it gets wrong, clinicians who rely on it will be misled precisely when they most need accurate guidance. If it expresses uniform uncertainty across all cases, it provides no useful signal about when to trust its output. The problem is not accuracy. The problem is calibration — the alignment between a system's expressed confidence and its actual likelihood of being correct.
At Webbeon, we have built uncertainty quantification into the foundation of our clinical AI systems, not as a post-hoc addition but as a first-class design objective. The result is a system that does something most AI diagnostics cannot: it tells clinicians, with quantified precision, when it does not know.
The Overconfidence Problem
The machine learning community has documented the overconfidence problem extensively. Standard neural networks trained with cross-entropy loss produce output distributions that are systematically miscalibrated — they assign high probability to their predictions even when those predictions are wrong. This is a known property of the training objective and architecture, not a bug in any specific implementation.
In low-stakes applications, miscalibration is an inconvenience. In clinical medicine, it is dangerous. A radiologist reviewing an AI-flagged chest X-ray makes different decisions based on whether the system reports 60% versus 95% confidence in a finding. A primary care physician triaging patients based on AI-assisted risk scores allocates resources differently depending on the system's expressed certainty. If those confidence values are unreliable, the AI is not assisting clinical judgment — it is corrupting it.
The standard response to this problem — temperature scaling and related post-hoc calibration methods — is insufficient for clinical deployment. These methods adjust the overall distribution of confidence values to match aggregate accuracy statistics, but they do not provide reliable per-instance uncertainty estimates. A system can be well-calibrated in aggregate (among all cases where it says 90%, approximately 90% are correct) while being catastrophically miscalibrated for specific subpopulations, rare conditions, or atypical presentations — exactly the cases where clinicians need uncertainty information most.
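The limitation is easy to see in a minimal sketch (plain NumPy; the logits are illustrative). Temperature scaling divides every logit vector by one global scalar T, so it softens all predictions uniformly: it can repair aggregate calibration, but it carries no case-specific information and never changes which class the model picks.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def temperature_scale(logits, T):
    """Post-hoc calibration: rescale logits by a single scalar T (T > 1 softens)."""
    return softmax(logits / T)

logits = np.array([[4.0, 1.0, 0.5]])
p_raw = softmax(logits)
p_cal = temperature_scale(logits, T=2.0)

# Softening lowers the top confidence but leaves the predicted class unchanged:
# one global scalar cannot supply per-instance uncertainty.
assert p_cal[0].argmax() == p_raw[0].argmax()
assert p_cal[0].max() < p_raw[0].max()
```

Because T is fit once on a held-out set, every subpopulation gets the same correction, which is exactly why aggregate calibration can coexist with severe miscalibration on rare presentations.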
Our Approach: Deep Ensembles with Evidential Reasoning
Our clinical AI systems use a multi-layered uncertainty quantification architecture. The foundation is deep ensembles — we train multiple models with different initializations and data orderings, and treat disagreement between ensemble members as a signal of epistemic uncertainty. This captures the uncertainty that arises from insufficient training data: when the ensemble members disagree, it indicates that the data does not uniquely determine the correct answer.
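The disagreement signal can be sketched with the standard decomposition of predictive entropy into an expected per-member (aleatoric) part and a mutual-information (epistemic) part. The toy member outputs below are illustrative, not drawn from our models:

```python
import numpy as np

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def ensemble_uncertainty(member_probs):
    """member_probs: (M, C) class probabilities from M ensemble members.
    Returns (total predictive entropy, epistemic part = mutual information)."""
    mean_p = member_probs.mean(axis=0)
    total = entropy(mean_p)                    # entropy of the averaged prediction
    aleatoric = entropy(member_probs).mean()   # average entropy of each member
    return total, total - aleatoric            # disagreement between members

# Members agree -> the epistemic term is near zero.
agree = np.array([[0.90, 0.10], [0.88, 0.12], [0.92, 0.08]])
# Members disagree -> the epistemic term is large, even though the
# averaged prediction alone looks merely "uncertain".
disagree = np.array([[0.95, 0.05], [0.50, 0.50], [0.05, 0.95]])

_, mi_agree = ensemble_uncertainty(agree)
_, mi_disagree = ensemble_uncertainty(disagree)
assert mi_disagree > mi_agree
```

The key property is that the mutual-information term isolates disagreement: an ensemble whose members all say "50/50" has high total uncertainty but low epistemic uncertainty, whereas members confidently contradicting one another signal that the data underdetermines the answer.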
On top of the ensemble, we apply evidential deep learning, which parameterizes the model's output not as a point prediction with a softmax distribution, but as a distribution over distributions — a Dirichlet distribution whose parameters encode both the predicted class probabilities and the model's uncertainty about those probabilities. This provides a principled separation between aleatoric uncertainty (irreducible noise in the data, such as genuinely ambiguous imaging findings) and epistemic uncertainty (uncertainty due to the model's limited knowledge, which additional data or expertise could resolve).
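A minimal illustration of the Dirichlet parameterization, in the subjective-logic convention where the total evidence is S = Σ αₖ and the vacuity u = K/S serves as an epistemic score; the α values here are illustrative, not outputs of our system:

```python
import numpy as np

def dirichlet_summary(alpha):
    """alpha: Dirichlet concentration parameters over K classes.
    Returns the expected class probabilities (the Dirichlet mean)
    and a vacuity-style epistemic score u = K / sum(alpha)."""
    alpha = np.asarray(alpha, dtype=float)
    S = alpha.sum()
    return alpha / S, len(alpha) / S

# Abundant evidence for one class: sharp mean, low epistemic uncertainty.
p_conf, u_conf = dirichlet_summary([30.0, 1.0, 1.0])
# Almost no evidence: the mean is nearly uniform AND u is close to its
# maximum of 1 -- the model is flagging ignorance, not just ambiguity.
p_vac, u_vac = dirichlet_summary([1.1, 1.0, 1.0])

assert u_conf < u_vac
assert abs(p_conf.sum() - 1.0) < 1e-9
```

The point of the two-level output is visible here: a softmax would collapse both cases to a single probability vector, while the Dirichlet's total evidence S keeps "confidently ambiguous" and "has not seen enough data" distinguishable.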
The distinction matters clinically. High aleatoric uncertainty on an imaging study might indicate that the finding is genuinely ambiguous and requires additional views or modalities. High epistemic uncertainty might indicate that the case falls outside the model's training distribution — a patient population, imaging protocol, or pathology the model has not seen enough of — and should be routed to specialist review without AI input.
"I Don't Know" as a Clinical Feature
We have designed our system's interface to present uncertainty as actionable clinical information, not as a disclaimer. When the system's epistemic uncertainty exceeds a validated threshold, it generates an explicit "insufficient confidence" output that includes a characterization of why the uncertainty is elevated: distributional shift (the case is unlike the training data), feature ambiguity (the relevant findings are equivocal), or conflicting evidence (different aspects of the case point to different conclusions).
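A schematic of that routing logic might look like the following; the function name, thresholds, and score inputs are hypothetical placeholders for illustration, not our production interface:

```python
# Hypothetical triage sketch: thresholds and score names are illustrative.
def triage(epistemic, aleatoric, ood_score,
           eps_thr=0.5, alea_thr=0.5, ood_thr=0.8):
    """Map uncertainty estimates to a clinical routing decision with a
    stated reason when confidence is insufficient."""
    if epistemic <= eps_thr and aleatoric <= alea_thr:
        return "report_prediction"
    if ood_score > ood_thr:
        return "insufficient_confidence: distributional shift"
    if aleatoric > alea_thr:
        return "insufficient_confidence: feature ambiguity"
    return "insufficient_confidence: conflicting evidence"

# Low uncertainty on both axes: the prediction is surfaced normally.
assert triage(0.1, 0.1, 0.0) == "report_prediction"
# High epistemic uncertainty driven by an out-of-distribution case.
assert triage(0.9, 0.1, 0.95).endswith("distributional shift")
# High aleatoric uncertainty: the finding itself is equivocal.
assert triage(0.2, 0.9, 0.1).endswith("feature ambiguity")
```

The design choice worth noting is that the reason string, not just the refusal, is the clinical payload: "distributional shift" suggests specialist review without AI input, while "feature ambiguity" suggests additional views or modalities.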
In our clinical validation studies, conducted across four hospital systems with over 120,000 diagnostic encounters, we measured the impact of calibrated uncertainty on clinical decision-making. The key finding: when clinicians had access to calibrated uncertainty estimates, their diagnostic accuracy on difficult cases improved by 12% compared to the same AI system without uncertainty information. More importantly, the rate of high-confidence errors — cases where the clinician was confident but wrong — decreased by 28%.
The cases where the system said "I don't know" were disproportionately the cases that mattered most. They were the rare diagnoses, the atypical presentations, the patients whose conditions did not fit standard patterns. By flagging its own limitations, the system redirected clinical attention to precisely the cases that needed it.
The Path Forward
Uncertainty quantification is not a solved problem. Our current methods add computational cost — running an ensemble is more expensive than running a single model — and our evidential learning framework requires careful hyperparameter tuning to avoid pathological behavior in low-data regimes. We are actively researching more efficient approximations, including single-model uncertainty methods that approach ensemble quality at a fraction of the computational cost.
But the fundamental commitment is non-negotiable. A clinical AI system that cannot quantify its own uncertainty is incomplete in a way that no amount of accuracy improvement can remedy. Medicine operates under uncertainty. Its AI tools must do the same — honestly, quantifiably, and with the humility to say when they do not know.