Glossary
Near-Memory Computing
A hardware paradigm that places compute logic physically close to or within memory arrays — reducing the energy and latency cost of moving data between memory and processors.
The goal is to reduce, and ideally eliminate, the cost of moving data between memory and compute. In memory-intensive workloads such as neural network inference, that movement is the dominant energy expenditure and the primary performance bottleneck.
The spectrum of near-memory approaches ranges from processors placed near conventional DRAM packages, to compute logic integrated within HBM stacks, to Processing-in-Memory (PIM) where simple arithmetic operations execute inside the DRAM array itself.
The energy argument
Moving a 32-bit value from off-chip DRAM to a processor register consumes approximately 200 picojoules. Performing a multiply-accumulate on that value consumes approximately 1-2 picojoules. The ratio is striking: in a memory-bound workload, fetching an operand costs roughly 100-200x more energy than computing with it.
Near-memory computing changes this ratio by shortening the path data must travel. At the extreme of processing-in-memory, data moves across micrometers within the memory array rather than millimeters or centimeters between packages. Even a 10x reduction in data movement distance translates to substantial energy savings at scale.
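The arithmetic behind this argument can be sketched in a few lines. The per-operation figures below are the approximate values from the text; the 10x path-length reduction is the illustrative scenario described above.

```python
# Back-of-envelope energy model for a memory-bound workload.
# Figures are the approximate values from the text (illustrative only).
DRAM_FETCH_PJ = 200.0   # move one 32-bit value from off-chip DRAM (pJ)
MAC_PJ = 1.5            # one multiply-accumulate on that value (pJ)

def energy_per_op_pj(fetch_pj: float) -> float:
    """Total energy per operation when every operand must be fetched."""
    return fetch_pj + MAC_PJ

baseline = energy_per_op_pj(DRAM_FETCH_PJ)
near_mem = energy_per_op_pj(DRAM_FETCH_PJ / 10)  # 10x shorter data path

print(f"data movement vs. compute: {DRAM_FETCH_PJ / MAC_PJ:.0f}x")
print(f"energy saved per operation: {1 - near_mem / baseline:.0%}")
```

Because the fetch term dwarfs the compute term, shrinking the fetch energy by 10x cuts total energy per operation by almost 90% even though the compute cost is unchanged.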
Near-memory in the AI context
Large model inference has specific near-memory requirements:
- Weight loading: model weights must be loaded from memory for every forward pass — moving them closer to compute is the primary optimization target
- Activation streaming: attention key-value caches grow with context length and must be accessed repeatedly
- Precision variability: near-memory units can apply quantization or dequantization at the memory boundary, reducing the bandwidth needed between memory and compute
How Webbeon approaches Near-Memory Computing
Oracle Class W1 incorporates near-memory design principles throughout:
- 256 MB distributed SRAM within the compute tile array — weights in SRAM require no off-chip access
- HBM3E controller logic colocated with memory stack interfaces to minimize transfer overhead
- Pipeline stages designed to consume data as it arrives from memory, rather than buffering and then processing
Future Oracle Class generations target tighter near-memory integration, including compute logic within HBM stacks for specific operations.
Key facts
- On-chip SRAM access costs approximately 5 pJ per 32-bit word; on-package DRAM (HBM-class) approximately 25 pJ; off-chip DRAM approximately 200 pJ or more
- Near-memory computing is complementary to bandwidth improvements like HBM3E — both reduce the effective cost of memory access
- Processing-in-memory for AI inference is an active research area; commercial deployment at scale remains limited
- Webbeon's 40% energy reduction vs. commodity hardware stems in part from the near-memory design principles applied throughout W1
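The first key fact can be turned into a concrete comparison for W1's 256 MB of on-tile SRAM. This sketch assumes, for illustration, that a forward pass streams all 256 MB of weights exactly once, and uses the approximate per-32-bit-access figures cited above.

```python
# Energy to read 256 MB of weights once per forward pass, from on-chip
# SRAM vs. off-chip DRAM. Per-access figures are approximate; the
# "one full read per pass" access pattern is an illustrative assumption.
SRAM_PJ, DRAM_PJ = 5, 200              # pJ per 32-bit access (approx.)
weight_words = 256 * 2**20 // 4        # 256 MB of weights as 32-bit words

sram_mj = weight_words * SRAM_PJ * 1e-9   # pJ -> mJ
dram_mj = weight_words * DRAM_PJ * 1e-9

print(f"on-chip SRAM: {sram_mj:.1f} mJ per pass")
print(f"off-chip DRAM: {dram_mj:.1f} mJ per pass")
```

Under these assumptions, keeping weights resident in SRAM cuts weight-read energy by roughly 40x per forward pass, which is the motivation behind the first bullet in the W1 list above.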