While investors’ eyes are currently focused on increasing microprocessor performance and speed, a bottleneck is nevertheless emerging in AI training. It is called the memory wall: the growing speed gap between microprocessors and computer memory, which leaves processing units idle until the requested data becomes available for computation.
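To make this concrete, a back-of-the-envelope, roofline-style comparison (all hardware figures below are illustrative assumptions, not vendor specifications) shows how a memory-bound operation leaves the compute units mostly idle:

```python
# Roofline-style sketch: for a memory-bound operation, the processor finishes its
# arithmetic long before the data can be delivered. All figures are illustrative.
peak_compute_flops = 300e12       # assumed accelerator throughput: 300 TFLOP/s
memory_bandwidth_bytes = 2e12     # assumed off-chip memory bandwidth: 2 TB/s

# A large matrix-vector product (common in LLM inference) reads each weight once
# and performs ~2 FLOPs per weight, so it is bandwidth-bound.
weight_bytes = 20e9               # 20 GB of fp16 weights streamed from memory
flops = 2 * (weight_bytes / 2)    # ~2 FLOPs per weight, 2 bytes per fp16 weight

time_compute = flops / peak_compute_flops             # time if compute were the limit
time_memory = weight_bytes / memory_bandwidth_bytes   # time to stream the weights

print(f"Compute time: {time_compute * 1e3:.2f} ms, memory time: {time_memory * 1e3:.2f} ms")
print(f"Compute units are busy only ~{100 * time_compute / time_memory:.1f}% of the time")
```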
The recent explosion of large language models (e.g., ChatGPT) is exacerbating a problem that has existed for decades: the size of this particular type of deep learning (DL) model is increasing at a rate of 240x every 2 years, while processing power is growing at 3.1x/2yrs and memory (including interconnect bandwidth) is scaling at only 1.4x/2yrs.
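Taken at face value, those growth rates compound into an ever-wider gap. A quick calculation using the figures above:

```python
# Back-of-the-envelope projection of the widening gap, using the growth rates
# cited above (each expressed per 2-year period).
model_growth = 240    # model size: 240x every 2 years
compute_growth = 3.1  # processing power: 3.1x every 2 years
memory_growth = 1.4   # memory / interconnect bandwidth: 1.4x every 2 years

for years in (2, 4, 6):
    periods = years / 2
    gap_vs_compute = (model_growth / compute_growth) ** periods
    gap_vs_memory = (model_growth / memory_growth) ** periods
    print(f"After {years} years: model size has outgrown compute by ~{gap_vs_compute:,.0f}x "
          f"and memory bandwidth by ~{gap_vs_memory:,.0f}x")
```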
As data transfers are the main bottleneck in training large AI models, the industry is pursuing several approaches to address the issue. The first is to reduce the mathematical precision of the models’ parameters by “tweaking” the optimization algorithms. Lowering the number of bits used to encode the models’ weights obviously reduces the amount of data to be stored and moved around. It also improves training speed while maintaining the models’ robustness and accuracy. This algorithmic/data-type solution offers particularly high potential for the inference phase, where the parameters’ precision can be reduced further than during training.
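As a rough illustration of the precision-reduction idea (a generic sketch, not any particular vendor’s recipe), the snippet below quantizes a float32 weight tensor to 8-bit integers, cutting its memory footprint by 4x at the cost of a small reconstruction error:

```python
import numpy as np

# Illustrative post-training quantization: map float32 weights to int8
# with a single, symmetric linear scale factor.
weights_fp32 = np.random.randn(1024, 1024).astype(np.float32)

scale = np.abs(weights_fp32).max() / 127.0                  # quantization scale
weights_int8 = np.round(weights_fp32 / scale).astype(np.int8)
weights_dequant = weights_int8.astype(np.float32) * scale   # approximate reconstruction

print(f"fp32 size: {weights_fp32.nbytes / 1e6:.1f} MB")     # ~4.2 MB
print(f"int8 size: {weights_int8.nbytes / 1e6:.1f} MB")     # ~1.0 MB (4x smaller)
print(f"mean absolute error: {np.abs(weights_fp32 - weights_dequant).mean():.5f}")
```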
The second approach to attenuate the memory bottleneck is to put memory closer to the processing units (CPUs, GPUs, and FPGAs) in order to lower latency and reduce data movement. Integrating memory directly into the processing units (this type of memory is called cache) is obviously the best way to reduce latency. Over the years, the size and sophistication of cache memory have kept increasing, a trend that will accelerate with the adoption of the chiplet architecture on which future chips will be based. However, this solution is also hitting its limits: the memory needed to store DL models’ parameters far exceeds the available on-chip capacity, as cache memory’s silicon footprint inside CPUs/GPUs is limited.
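A simple comparison shows why on-chip cache cannot absorb today’s models (the parameter count and cache size below are illustrative assumptions):

```python
# Rough comparison of a large model's parameter footprint with on-chip memory.
# Both figures below are illustrative assumptions, not measured values.
params = 70e9                  # e.g., a ~70-billion-parameter LLM
bytes_per_param = 2            # fp16/bf16 storage
model_footprint_gb = params * bytes_per_param / 1e9

on_chip_sram_mb = 50           # order of magnitude for a large GPU's on-chip SRAM/cache

print(f"Model parameters alone: ~{model_footprint_gb:.0f} GB")
print(f"On-chip cache/SRAM:     ~{on_chip_sram_mb} MB")
print(f"The parameters are ~{model_footprint_gb * 1e3 / on_chip_sram_mb:,.0f}x larger than the cache")
```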
Semiconductor designers are therefore thinking in reverse: embed (part of) the processing unit inside a memory chip. This emerging technology, called in-memory processing, likewise eliminates costly data movements (in terms of time and energy) and offers significantly higher storage capacity, since DRAM/SRAM takes over most of the chip area from the logic. This solution could be particularly interesting for accelerating the crucial data pre-processing phase.
Even though the two aforementioned approaches significantly lower the height of the memory wall, they will still not be enough to train and run upcoming AI models requiring trillions of optimized parameters/weights.
A third approach is then to use a distributed and unified memory architecture in which every processing unit can share and, most importantly, directly access all the data banks. The industry-adopted CXL (Compute Express Link) standard was developed for that purpose. Distributed architectures are simply unavoidable given the huge amount of data that needs to be processed and stored, but they also introduce communication bottlenecks between components and across all levels (servers, server racks, chip-to-chip…). Technological advances in silicon photonics, signal processing (e.g., PAM4 modulation), and memory interfaces (such as HBM, High Bandwidth Memory) are mitigating these latencies, with bi-directional transfer speeds now exceeding 128 GB/s.
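Even at those speeds, simply moving model state around remains expensive. Using the ~128 GB/s figure above (the state size below is an illustrative assumption):

```python
# Time needed just to move model state over a ~128 GB/s link.
# The state size is an illustrative assumption, not a figure from the text.
link_bandwidth_gb_s = 128      # bi-directional transfer speed cited above
model_state_gb = 1_000         # e.g., weights plus optimizer state of a very large model

transfer_time_s = model_state_gb / link_bandwidth_gb_s
print(f"Moving {model_state_gb} GB over a {link_bandwidth_gb_s} GB/s link "
      f"takes ~{transfer_time_s:.1f} seconds per full pass")
```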
Finally, the fourth and last approach consists of improving the memory chips themselves. 3D semiconductor manufacturing techniques (vertically stacking memory tiles) are helping to increase the capacity and efficiency of memory modules, while new and promising technologies based on magnetic fields, holography, or ferroelectricity, to name just a few, could well represent the future of computer memory.
Apart from the obvious computing-performance angle, the memory wall also has a negative financial impact. Having hundreds or thousands of very expensive server-class GPUs/CPUs operating at only a fraction of their full potential is a waste not only of CapEx but also of OpEx, due to elevated electricity bills. For example, Microsoft estimates that the use of CXL and memory pooling will cut data center costs by 4% to 5%.
For all these reasons, it is clear to us that memory, a commoditized and hence boring subsegment of the semiconductor industry, is becoming “trendy” again. After the inventory correction that plagued the whole memory business in 2022, orders for high-bandwidth memory triggered by the “AI gold rush” are expected to jump by almost 60% in the second half of the year, with a further expansion of 30% in 2024, a boon for the largest HBM manufacturers SK Hynix, Samsung Electronics, and Micron.