Glossary
Near-Memory Computing
A hardware paradigm that places compute logic physically close to or within memory arrays — reducing the energy and latency cost of moving data between memory and processors.
The goal is to reduce, and ideally eliminate, the cost of moving data between memory and compute. In memory-intensive workloads such as neural network inference, that movement is the dominant energy expenditure and the primary performance bottleneck.
The spectrum of near-memory approaches ranges from processors placed near conventional DRAM packages, to compute logic integrated within HBM stacks, to Processing-in-Memory (PIM) where simple arithmetic operations execute inside the DRAM array itself.
The energy argument
Moving a 32-bit value from off-chip DRAM to a processor register consumes approximately 200 picojoules. Performing a multiply-accumulate on that value consumes approximately 1-2 picojoules. The ratio is striking: in a memory-bound workload, fetching an operand costs roughly 100-200x more energy than computing with it.
Near-memory computing changes this ratio by shortening the path data must travel. At the extreme of processing-in-memory, data moves across micrometers within the memory array rather than millimeters or centimeters between packages. Even a 10x reduction in data movement distance translates to substantial energy savings at scale.
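The arithmetic behind this argument can be sketched in a few lines. The per-operation figures below are the approximate values from the text; the 10x path-length reduction is the illustrative scenario described above.

```python
# Back-of-envelope energy model for a memory-bound workload.
# Figures are the approximate values from the text (illustrative only).
DRAM_FETCH_PJ = 200.0   # move one 32-bit value from off-chip DRAM (pJ)
MAC_PJ = 1.5            # one multiply-accumulate on that value (pJ)

def energy_per_op_pj(fetch_pj: float) -> float:
    """Total energy per operation when every operand must be fetched."""
    return fetch_pj + MAC_PJ

baseline = energy_per_op_pj(DRAM_FETCH_PJ)
near_mem = energy_per_op_pj(DRAM_FETCH_PJ / 10)  # 10x shorter data path

print(f"data movement vs. compute: {DRAM_FETCH_PJ / MAC_PJ:.0f}x")
print(f"energy saved per operation: {1 - near_mem / baseline:.0%}")
```

Because the fetch term dwarfs the compute term, shrinking the fetch energy by 10x cuts total energy per operation by almost 90% even though the compute cost is unchanged.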
Near-memory in the AI context
Large model inference has specific near-memory requirements:
- Weight loading: model weights must be loaded from memory for every forward pass — moving them closer to compute is the primary optimization target
- Activation streaming: attention key-value caches grow with context length and must be accessed repeatedly
- Precision variability: near-memory units can apply quantization or dequantization at the memory boundary, reducing the bandwidth needed between memory and compute
How Webbeon approaches Near-Memory Computing
Oracle Class W1 incorporates near-memory design principles throughout:
- 256 MB distributed SRAM within the compute tile array — weights in SRAM require no off-chip access
- HBM3E controller logic colocated with memory stack interfaces to minimize transfer overhead
- Pipeline stages designed to consume data as it arrives from memory, rather than buffering and then processing
Future Oracle Class generations target tighter near-memory integration, including compute logic within HBM stacks for specific operations.
Key facts
- On-chip SRAM access costs approximately 5 pJ per 32-bit word; on-package DRAM (HBM-class) approximately 25 pJ; off-chip DRAM approximately 200 pJ or more
- Near-memory computing is complementary to bandwidth improvements like HBM3E — both reduce the effective cost of memory access
- Processing-in-memory for AI inference is an active research area; commercial deployment at scale remains limited
- Webbeon's 40% energy reduction vs. commodity hardware stems in part from the near-memory design principles applied throughout W1
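The first key fact can be turned into a concrete comparison for W1's 256 MB of on-tile SRAM. This sketch assumes, for illustration, that a forward pass streams all 256 MB of weights exactly once, and uses the approximate per-32-bit-access figures cited above.

```python
# Energy to read 256 MB of weights once per forward pass, from on-chip
# SRAM vs. off-chip DRAM. Per-access figures are approximate; the
# "one full read per pass" access pattern is an illustrative assumption.
SRAM_PJ, DRAM_PJ = 5, 200              # pJ per 32-bit access (approx.)
weight_words = 256 * 2**20 // 4        # 256 MB of weights as 32-bit words

sram_mj = weight_words * SRAM_PJ * 1e-9   # pJ -> mJ
dram_mj = weight_words * DRAM_PJ * 1e-9

print(f"on-chip SRAM: {sram_mj:.1f} mJ per pass")
print(f"off-chip DRAM: {dram_mj:.1f} mJ per pass")
```

Under these assumptions, keeping weights resident in SRAM cuts weight-read energy by roughly 40x per forward pass, which is the motivation behind the first bullet in the W1 list above.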