Glossary
Dual Inference Modes
An AI system architecture that supports both fast, low-latency responses for routine queries and slower, more compute-intensive deliberative reasoning for complex tasks — allocating resources based on task demands.
Dual inference modes refers to an AI system's ability to operate in two qualitatively different computational regimes: a fast mode optimized for low latency and routine tasks, and a slow mode that dedicates more compute to deliberation, multi-step reasoning, and uncertainty resolution.
The intuition comes from cognitive science: human cognition is often described as operating via fast, automatic processing and slow, deliberate reasoning. These are not simply faster or slower versions of the same process — they involve different computational strategies.
Why dual modes matter for AI
A single inference mode forces a trade-off: a system tuned to be fast enough for interactive applications cannot deliberate deeply, while one tuned for deliberation burns unnecessary compute on simple requests. Dual modes allow a system to be both responsive and capable of deep reasoning, allocating resources dynamically.
This has practical implications:
- Latency: Fast mode can return results in milliseconds; slow mode may take seconds or minutes for complex multi-step tasks
- Energy efficiency: Most queries are routine — serving them in fast mode dramatically reduces energy consumption compared to always running at maximum compute
- Capability ceiling: Slow mode can use chain-of-thought reasoning, multi-pass verification, and test-time search strategies unavailable in fast mode
- Safety: Slow mode enables pre-response verification steps that check outputs against behavioral specifications before delivery
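The energy point above can be made concrete with a back-of-envelope calculation. The 90% routine share and 10x slow-mode cost below are illustrative assumptions, not measured figures from any real deployment:

```python
# Back-of-envelope: if most queries are routine, adaptive dispatch saves
# energy versus always running at maximum compute.
def mean_cost(routine_share, slow_cost, fast_cost=1.0):
    """Average per-query cost when routine queries take the fast path."""
    return routine_share * fast_cost + (1 - routine_share) * slow_cost

# Assumed workload: 90% routine queries, slow mode 10x the energy of fast.
adaptive = mean_cost(routine_share=0.9, slow_cost=10.0)    # 1.9 units/query
always_slow = mean_cost(routine_share=0.0, slow_cost=10.0)  # 10.0 units/query
savings = always_slow / adaptive                            # roughly 5x
```

Under these assumptions, adaptive dispatch cuts average energy per query by a factor of about five; the exact figure depends entirely on the routine share and the per-mode costs.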
How Webbeon implements Dual Inference Modes
Odyssey supports dual inference modes through architecture-level differentiation rather than just inference-time hyperparameters:
Fast mode runs a streamlined forward pass optimized for throughput, using cached key-value states and reduced attention spans. It serves conversational queries, classification, and structured data extraction where speed matters more than extended deliberation.
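The key-value caching mentioned above can be sketched in miniature. This is a toy single-query attention loop, not Odyssey's actual forward pass: each decoding step appends the new token's key and value to a cache, so attention costs grow with cache length rather than requiring the full prefix to be recomputed:

```python
# Toy illustration of key-value caching in incremental decoding.
import math

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]

class KVCache:
    """Accumulates per-token keys/values so each step only appends, never recomputes."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, query, key, value):
        # Cache this token's key/value, then attend over the whole cache.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)

cache = KVCache()
out1 = cache.step([1.0, 0.0], [1.0, 0.0], [2.0, 0.0])  # one cached entry
out2 = cache.step([1.0, 0.0], [0.0, 1.0], [0.0, 2.0])  # two cached entries
```

With a single cached entry the output is exactly that entry's value; subsequent steps blend cached values by attention weight without touching earlier computation.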
Slow mode activates extended computation: multi-step chain-of-thought, search over reasoning trees, and verification passes that check intermediate conclusions before proceeding. It handles tasks requiring long-horizon planning, mathematical reasoning, and scientific analysis.
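The verification-pass idea can be sketched as a propose-then-check loop. The generator and checker below are hypothetical stand-ins (a toy square-root task), not Odyssey's actual reasoning machinery:

```python
# Sketch of a verify-before-proceeding loop: generate candidate answers,
# keep only one that a cheap independent check accepts.
def propose_candidates(x):
    # Stand-in generator: candidate square roots of x, most of them wrong.
    return [x // 2, round(x ** 0.5), x - 1]

def verify(x, candidate):
    # Cheap independent check: does the candidate square back to x?
    return candidate * candidate == x

def solve_with_verification(x):
    for candidate in propose_candidates(x):
        if verify(x, candidate):
            return candidate
    return None  # no candidate survived: escalate or search further
```

The design point is that verification is cheaper than generation, so checking each intermediate conclusion before proceeding costs little relative to the deliberation it gates.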
The routing between modes is itself learned — Odyssey develops representations of task complexity that predict when slow-mode deliberation will improve outcomes.
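A learned router of this kind can be sketched as a logistic score over task features. The features, weights, and threshold below are illustrative assumptions standing in for what training would produce:

```python
# Hypothetical learned router: a logistic score over simple query features
# decides whether a request goes to fast or slow mode.
import math

WEIGHTS = {"length": 0.01, "has_math": 2.0, "multi_step": 1.5}  # stand-in weights
BIAS = -2.0

def features(query):
    return {
        "length": len(query),
        "has_math": float(any(c in query for c in "+-*/=")),
        "multi_step": float("then" in query or "step" in query),
    }

def route(query, threshold=0.5):
    f = features(query)
    z = BIAS + sum(WEIGHTS[k] * f[k] for k in WEIGHTS)
    p_slow = 1 / (1 + math.exp(-z))  # predicted benefit of deliberation
    return "slow" if p_slow >= threshold else "fast"
```

A short greeting scores low and takes the fast path; a multi-step math query crosses the threshold and is routed to slow mode. In a real system the predictor would be trained on observed accuracy gains, and the threshold tuned against the energy cost of slow mode.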
Key facts
- Fast mode latency target: under 200 ms time to first token
- Slow mode used for tasks where extended reasoning demonstrably improves accuracy
- Mode selection is adaptive and can be overridden by application-layer routing based on task type
- Energy cost of slow mode is roughly 8-12x that of fast mode at comparable context length; it is used only when the accuracy gain justifies the cost
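The cost-justification rule in the last point can be sketched as a simple expected-value check. The value-per-correct-answer and accuracy-gain figures below are illustrative assumptions; only the ~10x cost multiple echoes the range stated above:

```python
# Invoke slow mode only when the expected value of extra correct answers
# exceeds the extra energy cost.
def worth_slow_mode(accuracy_gain, value_per_correct,
                    slow_cost_multiple=10.0, fast_cost=1.0):
    """True if expected gain from deliberation outweighs the added cost."""
    extra_cost = fast_cost * (slow_cost_multiple - 1)
    return accuracy_gain * value_per_correct > extra_cost

high_stakes = worth_slow_mode(0.30, 100.0)  # large gain, valuable task
marginal = worth_slow_mode(0.02, 100.0)     # gain too small to justify 10x cost
```

For a task worth 100 units per correct answer, a 30-point accuracy gain clears the ~9-unit extra cost easily, while a 2-point gain does not; the break-even point shifts with both the stakes and the cost multiple.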