Designing Inference Hardware from First Principles
Why we're building our own chips — and what it means for the future of AI inference.
The modern AI stack runs on hardware designed for a different era. GPUs were built to rasterize triangles and shade pixels — massively parallel floating-point engines optimized for the graphics pipeline. They turned out to be remarkably useful for training neural networks, and that historical accident shaped the entire industry. But inference is not training. The computational patterns, precision requirements, memory access profiles, and latency constraints of serving a frontier model to millions of users diverge sharply from those of training one. At Webbeon, we decided to stop adapting our workloads to available hardware and start building hardware adapted to our workloads. This article describes why, and the architectural principles guiding our custom silicon program.
Why General-Purpose GPUs Are Suboptimal for Inference
The mismatch between GPUs and inference shows up in three places.

First, precision: training requires high numerical precision (FP32 or BF16) to maintain gradient stability across billions of parameter updates. Inference does not. We have demonstrated that our models run at full quality in INT8 for attention computation and INT4 for feedforward layers, with selective FP16 for numerically sensitive operations. A GPU's die area is dominated by FP32 and FP16 arithmetic units that inference does not need: silicon we are paying for in power and cost but not using.

Second, memory bandwidth: inference on large autoregressive models is memory-bandwidth-bound, not compute-bound. Each token generation requires reading the model weights from memory, and at 400 billion parameters in INT8, that is 400 GB of data moved per forward pass. The arithmetic intensity, the ratio of computation to memory access, is low: often below 1 FLOP per byte. GPUs are designed for high arithmetic intensity (hundreds of FLOPs per byte), so their compute units sit idle waiting for data during inference.

Third, latency: training tolerates latency because it is throughput-oriented. Inference is latency-sensitive; a user waiting for a response perceives every millisecond. GPU architectures optimize for throughput via deep pipelines and large batch sizes, both of which work against low-latency single-request serving.
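A back-of-the-envelope roofline calculation makes the bandwidth bound concrete. The sketch below uses the figures from this section (400B parameters in INT8, streamed once per generated token); the function name is ours, and real systems add KV-cache and activation traffic on top of this idealized bound.

```python
# Rough roofline estimate for memory-bandwidth-bound autoregressive decoding
# at batch size 1, where every generated token must stream the full weight
# set from HBM. This is an upper bound, not a performance prediction.

def decode_tokens_per_second(params_billion: float,
                             bytes_per_param: float,
                             hbm_bandwidth_tb_s: float) -> float:
    """Upper bound on tokens/s when weight reads saturate memory bandwidth."""
    bytes_per_token = params_billion * 1e9 * bytes_per_param  # weights read per token
    bandwidth_bytes_s = hbm_bandwidth_tb_s * 1e12             # TB/s -> bytes/s
    return bandwidth_bytes_s / bytes_per_token

# Example: 400B parameters at INT8 (1 byte/param), i.e. 400 GB moved per
# forward pass, against 4.8 TB/s of HBM bandwidth:
print(round(decode_tokens_per_second(400, 1.0, 4.8)))  # ≈ 12 tokens/s
```

Batching amortizes the weight reads across requests, which is exactly why throughput-oriented GPU serving leans on large batches, and why batch-one latency suffers.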
Architectural Choices
Our inference accelerator, internally designated W1, is organized around three principles: dataflow execution, memory proximity, and precision flexibility.

The chip uses a spatial dataflow architecture rather than the SIMT (Single Instruction, Multiple Threads) model of GPUs. Computation is mapped spatially across an array of processing elements (PEs), with data flowing directly between them via on-chip interconnects rather than being staged through a shared register file. This eliminates the instruction fetch and decode overhead of SIMT and reduces data movement energy by keeping intermediate activations local to the PEs that produce and consume them.

The memory system is designed for the bandwidth-bound regime of inference. W1 integrates 96 GB of HBM3E providing 4.8 TB/s of bandwidth, paired with 256 MB of on-chip SRAM distributed across the PE array. The SRAM serves as an activation cache: during autoregressive decoding, the KV-cache for active sequences is held on-chip, eliminating repeated HBM reads for the attention mechanism's key and value tensors. For a 400B-parameter model serving sequences of 8,192 tokens, this reduces HBM traffic during decoding by 73%.

Precision is configurable per layer: the PE array natively supports INT4, INT8, FP8 (E4M3 and E5M2), FP16, and BF16 arithmetic, with per-tensor scaling handled in hardware. The compiler analyzes each layer's sensitivity and assigns the minimum precision that maintains output quality, maximizing throughput without manual quantization tuning.
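To illustrate the per-tensor scaling that W1 handles in hardware, here is a minimal software sketch of symmetric INT8 quantization: one scale per tensor, values clamped to the int8 range, and a dequantize step that recovers the originals to within half a quantization step. The function names and example values are ours, chosen for illustration; W1's actual hardware path and the compiler's sensitivity analysis are not shown.

```python
# Minimal symmetric per-tensor INT8 quantization sketch. W1 performs this
# scaling in hardware; this pure-Python version only illustrates the math.

def quantize_int8(values):
    """Map floats to int8 codes using a single per-tensor scale (symmetric)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-128, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize_int8(codes, scale):
    """Recover approximate float values from int8 codes and the scale."""
    return [c * scale for c in codes]

weights = [0.02, -1.27, 0.64, 0.0]          # hypothetical layer weights
codes, scale = quantize_int8(weights)        # codes: [2, -127, 64, 0]
restored = dequantize_int8(codes, scale)     # each within scale/2 of original
```

The same pattern extends to INT4 (replace 127/-128 with 7/-8), which is why a PE array with hardware scaling can switch precision per layer without software-side requantization passes.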
Target Performance and What It Enables
W1's target performance on our internal benchmark — serving ArcOne at INT8 with a batch size of one — is a time-to-first-token of 85 milliseconds and a decode throughput of 210 tokens per second. For comparison, the same model on a current-generation data center GPU achieves approximately 320 ms to first token and 74 tokens per second under the same conditions. The improvement is not marginal; it changes what is possible. At 210 tokens per second, real-time voice conversation with a frontier model becomes viable without speculative decoding hacks. Multi-turn agentic workflows where the model reasons over tool outputs in a tight loop become fast enough for interactive use. And the power envelope — W1 targets 350 watts TDP versus 700 watts for a comparable GPU — means that the same data center power budget serves twice the inference capacity. We are not building hardware to win benchmark competitions. We are building it because the capabilities we want to ship to users — real-time embodied intelligence for Object, sub-100ms reasoning for ArcOne, continuous perception for Oracle — require performance and efficiency characteristics that commodity hardware cannot provide at any price.
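To see what these targets mean for a user-visible response, a simple latency model (a rough sketch using the numbers above; it ignores network, batching, and scheduling overhead, and the function name is ours) is time-to-first-token plus the remaining tokens divided by decode throughput:

```python
# Rough end-to-end latency model: TTFT covers the first token, decode
# throughput covers the rest. Ignores network and scheduling overhead.

def response_time_s(ttft_ms: float, tokens_per_s: float, n_tokens: int) -> float:
    decode_tokens = max(n_tokens - 1, 0)  # first token is covered by TTFT
    return ttft_ms / 1000.0 + decode_tokens / tokens_per_s

# Targets from this article, for a 500-token response:
w1  = response_time_s(85, 210, 500)    # W1 target
gpu = response_time_s(320, 74, 500)    # current-generation GPU baseline
print(f"W1: {w1:.2f}s  GPU: {gpu:.2f}s")  # W1: 2.46s  GPU: 7.06s
```

A 500-token answer drops from roughly seven seconds to under two and a half, which is the difference between a turn-based exchange and something that feels conversational.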
The Road to Production
Custom silicon is a multi-year commitment with significant risk. W1 is currently in the RTL verification phase, with tapeout planned for late 2026 on a 3nm process. We have validated the architecture extensively through cycle-accurate simulation and FPGA prototyping. The software stack — compiler, runtime, and profiling tools — is under active development and already supports ahead-of-time compilation of our full model suite. Our approach to risk management is straightforward: every capability we plan to ship on W1 must also run, at lower performance, on commodity GPUs. Custom silicon accelerates our roadmap; it does not gate it. But we believe that as AI models grow and inference demand scales, the organizations that control their inference hardware will have a structural advantage — in cost, latency, and energy efficiency — that compounds over time. Designing from first principles is how we intend to build that advantage.