Do we really need so many xPUs?

Update: November 17, 2021

In recent years, announcements of new processor architectures have appeared almost daily, each with its own three-letter acronym – TPU, IPU, NPU. But what really distinguishes them? Are there really that many unique processor architectures, or is something else going on?

In 2018, John L. Hennessy and David A. Patterson delivered their Turing Lecture, “A New Golden Age for Computer Architecture.” It concentrated on the CPU and its evolution, but the CPU is only a small part of the equation. “From a CPU point of view, most of these xPUs are not real processors,” said Michael Frank, researcher and system architect at Arteris IP. “They are more like a GPU, an accelerator for a special workload, and there is a lot of variety within them. Machine learning is a class of processing; you can call them machine learning accelerators collectively, but the part of the processing they accelerate varies.”

The essence of a processor can be boiled down to three things. “Ultimately, it does come back to the instruction set architecture (ISA),” said Manuel Uhm, director of chip marketing at Xilinx. “That defines what you are trying to do. Then you have the I/O and memory, which support the ISA and the tasks it is trying to accomplish. The time ahead is going to be very interesting, because we will see more innovation and change than in the past two or three years.”

Many of the new architectures are not single processors. “What we are seeing is a combination of different types of processors or programmable engines that exist in the same SoC or in the same system,” said Pierre-Xavier Thomas, group director of technology and strategic marketing at Cadence. “Software tasks are dispatched to different hardware or flexible programmable engines. All the processors may share a common API, but the execution domains will differ. That is where you will indeed see different types of processing with different types of characteristics.”

The reality is that most of the names are marketing.

“The point is that people are using these names and acronyms for two different purposes,” said Simon Davidmann, CEO of Imperas Software. “One is to explain the architecture of the processor, such as SIMD (single instruction, multiple data). The other defines the application segment it is addressing. So it can define the processor architecture, or it can be a brand name, like tensor processing unit (TPU). They are naming their heterogeneous or homogeneous architecture, not an individual processor.”

A bit of history

Forty years ago, things were much simpler. There was the central processing unit (CPU), and while it came in many variants, they were basically all Turing-complete, von Neumann processors. Each had a different instruction set that made it more efficient for certain tasks, and there was extensive debate about the relative merits of complex instruction set (CISC) and reduced instruction set (RISC) designs.

The emergence of RISC-V has brought a lot of attention to the ISA. “People want to understand the ISA because the ISA defines how optimized the processor is for a defined task,” said Xilinx’s Uhm. “They can look at the ISA and start counting cycles. If one ISA has a native instruction for a function and runs at 1 GHz, I can compare it against another ISA where the same function takes two instructions, but the processor runs at 1.5 GHz. Which one gets me further ahead? They do the math for the functions that matter.”
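
To make that comparison concrete, here is a toy calculation of the trade-off Uhm describes. It assumes one instruction per cycle, and the instruction counts and clock rates are made up purely for illustration:

```python
# Toy ISA comparison (numbers are invented; assumes one instruction per cycle).
# Execution time = instruction_count / clock_rate; lower is better.

def execution_time(instructions: int, clock_hz: float) -> float:
    """Seconds needed to retire `instructions` at one instruction per cycle."""
    return instructions / clock_hz

# ISA A: one native instruction for the function, core runs at 1 GHz.
time_a = execution_time(1, 1.0e9)   # 1.00 ns

# ISA B: two instructions for the same function, core runs at 1.5 GHz.
time_b = execution_time(2, 1.5e9)   # ~1.33 ns

print(f"ISA A: {time_a * 1e9:.2f} ns, ISA B: {time_b * 1e9:.2f} ns")
# Despite the higher clock, ISA B is slower here -- instruction count matters.
```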

CPUs come in many packaging options. Sometimes I/O or memory is placed in the same package; these are called microcontroller units (MCUs).

When modems became popular, digital signal processors (DSPs) emerged. They were different because they used a Harvard architecture, which separates the instruction bus from the data bus. Some of them also implemented SIMD architectures to make data processing more efficient.
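
As a rough software analogy of the SIMD idea, the sketch below contrasts a scalar loop with a single vectorized operation. NumPy stands in for the hardware vector unit, and the gain example is invented:

```python
import numpy as np

# Scalar processing: one multiply per loop iteration (one datum per "instruction").
def scale_scalar(samples, gain):
    out = []
    for s in samples:
        out.append(s * gain)
    return out

# SIMD-style processing: one vector operation applies the same multiply to many
# samples at once (NumPy is only a software stand-in for the vector hardware).
def scale_simd(samples, gain):
    return np.asarray(samples) * gain

samples = [0.1, 0.5, -0.3, 0.8]
assert np.allclose(scale_scalar(samples, 2.0), scale_simd(samples, 2.0))
```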

The separation of instructions and data was done to improve throughput, although it restricts some corner-case programming, such as self-modifying programs. “Usually, the boundary conditions are not the computation,” Uhm said. “It is increasingly I/O or memory. The industry is shifting from boosting compute to making sure there is enough data to keep the compute fed and to sustain performance.”

When a single processor could no longer be made faster, multiple processors were connected together. These processors typically share memory and keep both each processor and the whole cluster Turing-complete. It does not matter which core executes which part of the program, because the result is the same.
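
A minimal sketch of that property, using Python threads as stand-ins for cores sharing memory (the chunk sizes and workload are invented, and Python threads illustrate the scheduling point rather than real speedup):

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    return sum(chunk)

data = list(range(1_000_000))
chunks = [data[i:i + 250_000] for i in range(0, len(data), 250_000)]

# The scheduler may place each chunk on any core; because the workers share
# memory, the combined result is the same no matter where each part executes.
with ThreadPoolExecutor() as pool:
    total = sum(pool.map(partial_sum, chunks))

assert total == sum(data)
```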

The next major development was the graphics processing unit (GPU), which broke the mold because each processing element, or pipeline, has its own memory that cannot be addressed from outside the processor. Because that memory is limited, it can only take on tasks that fit into the memory space provided, rather than arbitrary processing tasks.

“GPUs are very powerful processors for certain classes of functions, but their pipelines are extremely long,” Uhm points out. “Those pipelines keep the GPU units churning through data, but at some point, if you have to flush the pipeline, it is a huge hit. A lot of latency and non-determinism is built into the system.”

Although many other accelerators have been defined, GPUs – and later general-purpose GPUs (GPGPUs) – defined a programming paradigm and software stack that made them much easier to use than earlier accelerators. “For many years, certain jobs have been specialized,” said Davidmann of Imperas. “There was a CPU for sequential programs. There was a graphics processor, which focused on processing data for the screen and took us into a highly parallel world that uses many small processing elements to perform the task. Now there are machine learning tasks.”

What other construction rules could explain all the new architectures? In the past, processor arrays were usually connected through memory or through fixed network topologies such as meshes or rings. What has emerged more recently is the network on chip (NoC), which lets distributed, heterogeneous processors communicate in a more flexible way. In the future, they may also communicate without going through memory.

“Right now, NoCs only carry data,” said Frank of Arteris. “In the future, the NoC can extend into areas where the communication between accelerators goes beyond data. It can send commands, send notifications, and so on. The communication requirements of an accelerator array may differ from those of CPUs or a standard SoC, but the network on chip does not restrict you to a subset. You can optimize and improve performance by supporting the special communication needs of the accelerators.”

Implementation architecture

One way processors are differentiated is by optimizing for a particular operating environment. For example, software may run in the cloud, but the same software could also be executed on a tiny IoT device. The implementation architectures will be very different, hitting different operating points for performance, power, cost, or the ability to operate under extreme conditions.

“Some applications were built for cloud computing, and now we are bringing them closer to the edge,” said Cadence’s Thomas. “That may be because of latency requirements, or energy or power dissipation, and it calls for a different type of architecture. You may want exactly the same software stack to be able to run in both locations. The cloud needs to provide flexibility, because it will receive different types of applications and must be able to aggregate large numbers of users. That requires the server hardware to have application-specific capability, but one size does not fit all.”

Machine learning has added requirements of its own. “When you build intelligent systems using neural networks and machine learning, you need software frameworks and a common software stack to program the new network and map it onto the hardware,” Thomas added. “Then you can adapt the software application to the right hardware from a PPA point of view. That drives the need for different types of processing and processors to meet these needs at the hardware level.”

These requirements are defined by the application. “One company has created a processor for graph operations,” Frank said. “They optimize and accelerate how graphs are traversed, and they perform operations such as reordering graphs. There are others that brute-force machine learning, which is essentially matrix multiplication. Memory access is a particular problem for every architecture, because when you build an accelerator, the most important goal is to keep it busy. You have to move as much data as you can to the ALUs, so that it can be consumed and produced.”
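
The data-reuse point can be illustrated with a simple blocked matrix multiply. This is only a plain-Python sketch with an arbitrary tile size; the tiles play the role of data staged in an accelerator's local memory so the ALUs stay fed:

```python
def blocked_matmul(A, B, tile=2):
    """Naive blocked matrix multiply on nested lists (illustration only).

    Each (tile x tile) block of A and B is loaded once and reused across
    tile*tile*tile multiply-accumulates -- the same reuse an accelerator
    exploits by staging tiles in fast local memory to keep its ALUs busy.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # The three tiles touched below would live in local memory.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + tile, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(blocked_matmul(A, B))   # [[19.0, 22.0], [43.0, 50.0]]
```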

Many of these applications have a lot in common. “They all have some local memory, they have a network on chip to communicate, and each processor executing a software algorithm works on a small chunk of data,” Davidmann said. “These jobs are scheduled by operating systems running on more traditional CPUs.”

The tricky part for hardware designers is predicting which tasks they will be asked to run. “Although you will perform similar types of operations in some layers, people are looking at differentiation across the layers,” Thomas said. “To process a neural network, several types of processing capability are required. That means you need to process one part of the network in a certain way, and then you may need another type of operation to process another layer. The data movement and the amount of data also change layer by layer.”

This differentiation can go beyond data movement. “For genome sequencing, you need to do certain processing,” Frank said. “But you cannot accelerate everything with a single type of accelerator. You have to build a whole set of different accelerators for the different parts of the pipeline. The CPU becomes the guardian that manages the execution flow. It does the setup, executes the DMA, and provides the decision-making in between. Understanding and analyzing the algorithms, and defining how you want to optimize their processing, is a complete architectural task.”
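
A hedged sketch of that orchestration pattern is shown below. None of these names correspond to a real driver API; FakeAccelerator, dma_in, and the two stages are placeholders for the kind of setup, DMA, and decision-making Frank describes:

```python
# Hypothetical host-side orchestration of a two-stage accelerator pipeline.
class FakeAccelerator:
    """Stands in for an offload engine with its own local memory."""
    def __init__(self, name, fn):
        self.name, self.fn, self.local_mem = name, fn, None

    def dma_in(self, data):          # CPU-initiated transfer into local memory
        self.local_mem = list(data)

    def run(self):                   # accelerator does its specialized work
        return [self.fn(x) for x in self.local_mem]

aligner = FakeAccelerator("align", lambda read: read.upper())
caller  = FakeAccelerator("call",  lambda read: read.count("A"))

results = []
for batch in [["acgt", "ttaa"], ["aacc"]]:
    aligner.dma_in(batch)            # CPU sets up DMA for stage 1
    aligned = aligner.run()
    caller.dma_in(aligned)           # CPU moves data between accelerators
    counts = caller.run()
    results.extend(counts)           # CPU decides what happens next
print(results)                       # [1, 2, 2]
```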

Part of the process requires partitioning. “There is no single type of processor that can be optimized for every task – the FPGA is not, the CPU is not, the GPU is not, and neither is the DSP,” Uhm said. “We created a family of devices that contains all of them, but the difficult part on the customer side is that they have to supply the intelligence to determine which parts of the overall system are targeted at the processors, the programmable logic, or the AI engines. Everyone wants that automatic, magical tool – a tool that can instantly decide to put this on the CPU, put that on the FPGA, and put that on the GPU. That tool does not exist today.”

Nevertheless, the CPU will always have a role. “The CPU is needed to execute the irregular parts of a program,” Frank said. “The general programmability of the CPU has its advantages, but if you have specialized data structures or mathematical operations, it will not perform well. The CPU is a general-purpose processor. It is not optimized for anything, and there is nothing it is especially good at.”

Changing abstractions

In the past, the hardware/software boundary was defined by the ISA, and memory was continuously addressable. When there were multiple processors, they were generally kept memory coherent.

“Coherency is a contract,” Frank said. “That kind of coherency is very important, and it will not go away. But you can imagine that in a dataflow engine, coherency is less essential, because you ship the data that is moving on the edge directly from one accelerator to the next. If you partition the data set, coherency gets in the way, because it costs you extra cycles. You have to look up information. You have to provide update information.”
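
As a software-only analogy, the sketch below contrasts the two styles: a lock stands in for the coherency contract on shared state, while a queue stands in for shipping data directly from producer to consumer. The names and workload are invented:

```python
from queue import Queue
from threading import Lock, Thread

# Shared-memory style: every update pays for keeping the shared state coherent.
shared, lock = {"latest": None}, Lock()
def produce_shared(items):
    for x in items:
        with lock:                      # extra work spent on the coherency contract
            shared["latest"] = x

# Dataflow style: the producer hands each item straight to the consumer;
# there is no shared state to keep coherent.
stream = Queue()
def produce_stream(items):
    for x in items:
        stream.put(x)
    stream.put(None)                    # end-of-stream marker

def consume_stream(out):
    while (x := stream.get()) is not None:
        out.append(x * 2)               # process data as it streams past

out = []
t = Thread(target=consume_stream, args=(out,))
t.start()
produce_stream(range(3))
produce_shared(range(3))
t.join()
print(out)                              # [0, 2, 4]
```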

That calls for a different memory architecture. “You have to think about the memory structure, because you only have so much tightly coupled memory,” Uhm said. “You can access adjacent memory, but you quickly run out of adjacent memory that can be reached in time. That has to be understood in the design. As the tools mature, they will start to take more of this on. Today it is done by human intelligence, by being able to understand the architecture and apply it.”

A higher level of abstraction is also needed. “Some frameworks can map or compile known networks onto target hardware,” Thomas said. “You have a set of low-level kernels, or APIs, that are used in the software stack and eventually used by the mapper of the neural network. Underneath, you may have different types of hardware, depending on what you want to achieve and on your product details. It delivers the same functionality, but not with the same hardware or the same PPA trade-offs.”
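
A minimal sketch of that layering, with invented names: one high-level operation, two backend kernels, and a mapper that picks a kernel per target while delivering the same result:

```python
import numpy as np

def conv1d_reference(x, w):
    """Portable fallback kernel (plain Python)."""
    n, k = len(x), len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(n - k + 1)]

def conv1d_vectorized(x, w):
    """Kernel standing in for a tuned implementation on a wider vector unit."""
    return np.convolve(x, w[::-1], mode="valid").tolist()

# The "mapper": same API, different kernel chosen per target device.
KERNELS = {
    "tiny_mcu":    conv1d_reference,
    "vector_unit": conv1d_vectorized,
}

def conv1d(x, w, target="tiny_mcu"):
    """Same functionality on every target, different hardware underneath."""
    return KERNELS[target](x, w)

x, w = [1.0, 2.0, 3.0, 4.0], [1.0, 0.5]
assert conv1d(x, w, "tiny_mcu") == conv1d(x, w, "vector_unit")
```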

That puts a lot of pressure on the compilers. “The big question is how you program the accelerator in the future,” Frank said. “Do you implement a hardwired engine like the first generations of GPUs? Or do you build small programmable engines with their own instruction sets? Now you have to program each of these things individually, and each of them is connected to the data flow to perform its task. One processor has a certain subset of the full instruction set, another processor has a different subset, and they will all share some overlapping part for control flow. You may have products with slightly different acceleration capabilities. The compiler, or the libraries that understand it, will map accordingly.”

Conclusion

Processor architectures have not changed. They still follow the same choices that have existed for the past 40 years. What has changed is the way chips are constructed. They now contain large numbers of heterogeneous processors, with memory and communication optimized for a subset of application tasks. Each chip makes different choices about the processor functionality, what it is optimized for, the data throughput it requires, and the data flows it typically sees.

Every hardware vendor wants to set its chip apart from the others, and it is far easier to promote a brand name than to talk about internal technical details. So they give it a name, call it the first, the fastest, or the biggest, and tie it to a particular class of application problems. Those three-letter acronyms have become application task names, but they do not define the hardware architecture.
