A startup called Cerebras Systems has shown a prototype chip that goes far beyond the two-digit performance gains offered by the well-known GPU manufacturers. Its latest chip, the Cerebras Wafer Scale Engine (Cerebras WSE), packs roughly 5,600% more transistors than the largest graphics chip on the market, the Nvidia V100: 1.2 trillion transistors against the 21.1 billion of the Nvidia chip.
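As a quick sanity check on the headline comparison, the short Python sketch below takes the WSE at 1.2 trillion transistors and the V100 at 21.1 billion and works out the ratio. The helper name `percent_more` is purely illustrative.

```python
# Transistor counts used for the comparison.
WSE_TRANSISTORS = 1.2e12    # Cerebras WSE: 1.2 trillion
V100_TRANSISTORS = 21.1e9   # Nvidia V100: 21.1 billion


def percent_more(new: float, baseline: float) -> float:
    """Return how many percent larger `new` is than `baseline`."""
    return (new / baseline - 1.0) * 100.0


ratio = WSE_TRANSISTORS / V100_TRANSISTORS
increase = percent_more(WSE_TRANSISTORS, V100_TRANSISTORS)

print(f"ratio: {ratio:.1f}x, increase: {increase:.0f}%")
```

The ratio comes out at roughly 57 to 1, which is where the figure of about 5,600% more transistors comes from.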
To make this a reality, the startup had to solve key technical challenges that no one else had managed to crack, and in doing so it built the world's first wafer-scale processor.
The Cerebras WSE is the world's first wafer-scale processor. The logical question is why no one else has done something so obvious, and the reason is that nobody else had overcome the key technical challenge of communicating across scribe lines.
Current lithography equipment is designed to pattern a multitude of small processors across a wafer; it cannot expose one complete processor spanning the whole wafer. This means that scribe lines will exist in one form or another and that the individual blocks must somehow be able to communicate across them. This is what Cerebras has solved in order to claim the title of the world's first processor with a trillion transistors.
The Cerebras WSE occupies an area of 46,225 mm² and houses 1.2 trillion transistors. All cores are optimized for artificial intelligence workloads, and the chip draws 15 kW of power. Since all that power must also be removed as heat, the cooling system will need to be as revolutionary as the power delivery system.
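To put the 15 kW figure in perspective, the following Python sketch divides it by the quoted die area and by the 400,000 cores the chip is said to contain. It is back-of-the-envelope arithmetic on the article's numbers only, and it assumes the power is spread evenly, which a real chip would not do.

```python
# Figures quoted in the article.
POWER_W = 15_000       # 15 kW total power draw
AREA_MM2 = 46_225      # die area in mm^2
CORES = 400_000        # number of cores

# Average power density, converted from mm^2 to cm^2 (100 mm^2 per cm^2).
area_cm2 = AREA_MM2 / 100.0
density_w_per_cm2 = POWER_W / area_cm2

# Average power budget per core, in milliwatts (uniform split is an assumption).
power_per_core_mw = POWER_W / CORES * 1000.0

print(f"{density_w_per_cm2:.1f} W/cm^2, {power_per_core_mw:.1f} mW per core")
```

On these assumptions the average works out to a little over 32 W per cm² and under 40 mW per core, so the difficulty lies less in any single hot spot than in delivering and removing 15 kW across one continuous piece of silicon.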
For cooling, the company could turn to an immersion system with a Freon refrigerant in a fast-moving loop, or to some more revolutionary method. The power delivery system would also need to be incredibly robust. According to Cerebras, the chip is approximately 1,000 times faster than traditional systems simply because communication can take place across the scribe lines instead of jumping through hoops (interconnects, DIMMs, etc.).
The WSE contains 400,000 Sparse Linear Algebra (SLA) cores. Each core is flexible, programmable and optimized for the calculations that underpin most neural networks. This programmability ensures that the cores can run all the algorithms in the constantly changing field of machine learning.
The 400,000 cores in the WSE are connected through the Swarm communication fabric in a 2D mesh with a bandwidth of 100 Pb/s. Swarm is a massive on-chip communication fabric that delivers breakthrough bandwidth and low latency at a fraction of the power draw of the traditional techniques used to cluster graphics processing units. It is fully configurable: the software configures all WSE cores to support the precise communication required to train the user-specified model. For each neural network, Swarm provides a unique and optimized communication path.
The WSE has 18 GB of on-chip memory, all accessible in a single clock cycle, and provides a memory bandwidth of 9 PB/s. This is 3,000 times more capacity and 10,000 times more bandwidth than the leading competitor. More cores and more local memory allow fast, flexible computation with lower latency and less energy.
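Dividing these totals by the 400,000 cores gives a feel for what each core has to work with. The Python sketch below does that arithmetic; it assumes decimal prefixes (1 GB = 10^9 bytes) and an even split across cores, neither of which is confirmed by the company's figures.

```python
# Figures quoted in the article; decimal prefixes (1 GB = 1e9 bytes) are an assumption.
CORES = 400_000
ON_CHIP_MEMORY_BYTES = 18e9             # 18 GB of on-chip memory
MEMORY_BANDWIDTH_BYTES_PER_S = 9e15     # 9 PB/s memory bandwidth
FABRIC_BANDWIDTH_BITS_PER_S = 100e15    # 100 Pb/s fabric bandwidth

# Per-core shares, assuming the totals are split evenly across all cores.
memory_per_core_kb = ON_CHIP_MEMORY_BYTES / CORES / 1e3
memory_bw_per_core_gb_s = MEMORY_BANDWIDTH_BYTES_PER_S / CORES / 1e9
fabric_bw_per_core_gbit_s = FABRIC_BANDWIDTH_BITS_PER_S / CORES / 1e9

print(f"memory per core:        {memory_per_core_kb:.1f} kB")
print(f"memory bandwidth/core:  {memory_bw_per_core_gb_s:.1f} GB/s")
print(f"fabric bandwidth/core:  {fabric_bw_per_core_gbit_s:.1f} Gb/s")
```

Under these assumptions each core sees only a few tens of kilobytes of memory, but with tens of gigabytes per second of bandwidth to it, which is consistent with the claim of single-cycle access to local memory.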
This would allow massive acceleration of AI applications and cut training times from months to just a couple of hours. This is truly revolutionary, no doubt about it, assuming the company can keep its promise and start delivering the chip to customers shortly. The Cerebras WSE is manufactured on a 300 mm TSMC wafer using the foundry's 16 nm process, which means it uses cutting-edge technology that is just one process node behind giants like Nvidia. Of course, with 84 interconnected blocks housing more than 400,000 cores, the process on which it is manufactured simply doesn't matter.
The yield and binning (clock frequencies) of the Cerebras WSE will be very interesting. If you use the entire wafer as a single die, you get 100% yield if the design can absorb defects, or 0% if it cannot. Clearly, since prototypes have been made, the design is able to absorb defects. In fact, the CEO stated that the design expects around 1% to 1.5% of its functional surface area to be defective, but this is not a problem, since the microarchitecture simply reconfigures itself around the available cores. In addition, redundant cores are placed throughout the chip to minimize any loss of performance. There is no information about binning at this time, but it goes without saying that this is the most binnable design in the world.
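The all-or-nothing nature of wafer-scale yield can be illustrated with a simple Poisson defect model. The Python sketch below is not Cerebras's method: the defect density is a hypothetical value rather than a TSMC figure, the spare-core budget of 1% is an assumption loosely based on the CEO's statement, and the model conservatively treats every defect as disabling one distinct core.

```python
import math


def poisson_cdf(k: int, lam: float) -> float:
    """P(X <= k) for X ~ Poisson(lam), with each term computed in log space."""
    total = 0.0
    for i in range(k + 1):
        total += math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
    return min(total, 1.0)  # guard against rounding slightly above 1


AREA_CM2 = 46_225 / 100.0     # die area from the article, in cm^2
DEFECTS_PER_CM2 = 0.1         # hypothetical defect density, not a published figure
TOTAL_CORES = 400_000         # core count from the article
SPARE_FRACTION = 0.01         # hypothetical share of cores held in reserve

expected_defects = AREA_CM2 * DEFECTS_PER_CM2
spare_cores = round(TOTAL_CORES * SPARE_FRACTION)

# Without redundancy the die only works if it has zero defects.
yield_no_redundancy = poisson_cdf(0, expected_defects)
# With spares the die works as long as defects do not exceed the spare budget.
yield_with_spares = poisson_cdf(spare_cores, expected_defects)

print(f"expected defects per wafer: {expected_defects:.1f}")
print(f"yield without redundancy:   {yield_no_redundancy:.3e}")
print(f"yield with spare cores:     {yield_with_spares:.6f}")
```

With these illustrative numbers a defect-free wafer is essentially impossible, while a design that can route around a few thousand bad cores yields almost every time, which is the 0% versus 100% contrast described above.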
We are also told that the company had to develop its own manufacturing and packaging techniques, since there are currently no tools designed to handle a wafer-scale processor. Not only that, the software had to be rewritten to handle more than 1 trillion transistors in a single processor.