Revenge of Cell, the DPU Rises *spawn*

onQ

Veteran
I don't know if that is the route next-gen console SoCs will take or not, but I would look at next year's high-end GPUs, their render back-ends / ROPs, whether they are different and how many there are. Currently, the Fury X has 64 and the 980 Ti has 96. The current families of GPU have color compression as well, saving some bandwidth. But neither Fury X nor GM200 is well suited to handling 4K at 60fps with graphics options in current games maxed out, as far as single-GPU/card systems go. That may change with AMD Greenland and Nvidia GP100; I'm guessing they'll have 128 ROPs, ones that are improved in whatever ways made sense at the time they were engineered. Then the question becomes: can a high-end GPU that was designed during 2012-15 and released in H2 2016 become a high-end APU in 2017, and can that be semi-custom'ed into a console APU for a $400 consumer machine by 2019?

I thought moving some parts of the GPU into its own DPU would make it faster to get a custom SoC to market.

If you think about what Sony is doing now with reprojection using compute, it would make sense for them to have a DSP for that in their next SoC.
 
I thought moving some parts of the GPU into its own DPU would make it faster to get a custom SoC to market.

If you think about what Sony is doing now with reprojection using compute, it would make sense for them to have a DSP for that in their next SoC.

Why, why does it make sense?
 
I thought moving some parts of the GPU into its own DPU would make it faster to get a custom SoC to market.

If you think about what Sony is doing now with reprojection using compute, it would make sense for them to have a DSP for that in their next SoC.

Not really the correct use of DSP, but I get what you mean. Not sure splitting a GPU in half is a good idea. It would probably be better to go with a discrete GPU than to split some of the GPU functions into a different chip. Whether a discrete GPU or having the GPU and CPU in the same package is the way to go, I do not know.
 
Why, why does it make sense?

Because next gen it won't be reprojecting 1080p from 60fps to 120fps; they might have to reproject 4K from 60fps to 240fps or whatever, so it would be smart for them to take it off the compute pipeline & let the algorithm run on a DPU.
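
For anyone who hasn't looked at what reprojection actually does per pixel, here's a rough, simplified CPU-side sketch of the kind of warp a compute shader (or a hypothetical DPU) would run. The `Frame` struct, the buffer layout and the pure-rotation (yaw-only, no depth) assumption are all mine for illustration, not Sony's actual algorithm:

```cpp
// Minimal rotational reprojection sketch: warp the last rendered frame to a
// new head yaw. No depth, no translation, so disocclusions are ignored.
#include <cmath>
#include <cstdint>
#include <vector>

struct Frame {
    int width, height;
    std::vector<uint32_t> pixels;          // row-major RGBA
    uint32_t at(int x, int y) const { return pixels[y * width + x]; }
};

void reprojectYaw(const Frame& src, Frame& dst, float deltaYawRadians, float fovYRadians)
{
    const float tanHalfFov = std::tan(fovYRadians * 0.5f);
    const float aspect = float(src.width) / float(src.height);
    const float cosY = std::cos(deltaYawRadians);
    const float sinY = std::sin(deltaYawRadians);

    for (int y = 0; y < dst.height; ++y) {
        for (int x = 0; x < dst.width; ++x) {
            // Destination pixel -> view-space ray (looking down -z).
            float rx = (2.0f * (x + 0.5f) / dst.width - 1.0f) * tanHalfFov * aspect;
            float ry = (1.0f - 2.0f * (y + 0.5f) / dst.height) * tanHalfFov;
            float rz = -1.0f;

            // Rotate the ray back into the old frame's orientation (inverse yaw).
            float ox =  cosY * rx + sinY * rz;
            float oz = -sinY * rx + cosY * rz;

            // Project back into the old frame and sample (nearest neighbour).
            float sx = (ox / (-oz * tanHalfFov * aspect) + 1.0f) * 0.5f * src.width;
            float sy = (1.0f - ry / (-oz * tanHalfFov)) * 0.5f * src.height;

            int ix = int(sx), iy = int(sy);
            dst.pixels[y * dst.width + x] =
                (ix >= 0 && ix < src.width && iy >= 0 && iy < src.height)
                    ? src.at(ix, iy)
                    : 0xFF000000;          // black where the old frame has no data
        }
    }
}
```

That inner loop is pure per-pixel math over a framebuffer, which is exactly the kind of thing the debate below is about: does it belong on the CUs or on a dedicated block?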


Not really the correct use of DSP, but I get what you mean. Not sure splitting a GPU in half is a good idea. It would probably be better to go with a discrete GPU than to split some of the GPU functions into a different chip. Whether a discrete GPU or having the GPU and CPU in the same package is the way to go, I do not know.

Why use a whole GPU when they can use a DPU for the specialized tasks?

[Attached image: dpu-new.jpg]


 
Because next gen it won't be reprojecting 1080p from 60fps to 120fps; they might have to reproject 4K from 60fps to 240fps or whatever, so it would be smart for them to take it off the compute pipeline & let the algorithm run on a DPU.

Why use a whole GPU when they can use a DPU for the specialized tasks?

No. It won't. It makes absolutely no business sense to spend the money, time, and engineering resources to create an entirely brand new DSP to do what a GPU can already do.
 
No. It won't. It makes absolutely no business sense to spend the money, time, and engineering resources to create an entirely brand new DSP to do what a GPU can already do.

With that type of thinking we wouldn't have TrueAudio, UVD, or VCE, because all of that can be done with the CPU or GPU. In fact, why did Sony waste money on a secondary chip when all that could have been done with the CPU?
 
If reprojection is an excellent fit for compute (it is), then the silicon you'd spend on a DPU would be better spent on more CUs and let devs do whatever they want with them. That's kinda the problem with custom silicon these days - compute is so fast and fits in with the existing standard system architecture that custom silicon is mostly redundant. It's only worth it for jobs that can be handled more efficiently, and generally that's not image based processing.
 
If reprojection is an excellent fit for compute (it is), then the silicon you'd spend on a DPU would be better spent on more CUs and let devs do whatever they want with them. That's kinda the problem with custom silicon these days - compute is so fast and fits in with the existing standard system architecture that custom silicon is mostly redundant. It's only worth it for jobs that can be handled more efficiently, and generally that's not image based processing.

Actually it is: custom logic is more efficient at ray tracing, image processing & things like that.

 
Actually it is: custom logic is more efficient at ray tracing, image processing & things like that.
It is, but unfortunately, like Shifty wrote, it's not flexible enough, and modern engines and their algorithms are constantly evolving, with a more pressing need for flexibility.

Developers even bypass GPU upscaler chips nowadays in order to have better results with their own software algorithms, even if it costs them more GPU time. And soon, even ROPs will be bypassed! I know it's only one game but it clearly points towards what game development will need in the future: more hardware flexibility.

I am not sure an FXAA chip included in current consoles (and conceived 5 years ago with antique FXAA code) would still be used in many games nowadays... :nope:
 
Actually it is: custom logic is more efficient at ray tracing, image processing & things like that.
Not enough to be worth the added complexity.
Developers even bypass GPU upscaler chips nowadays in order to have better results with their own software algorithms, even if it costs them more GPU time. And soon, even ROPs will be bypassed! I know it's only one game but it clearly points towards what game development will need in the future: more hardware flexibility.
It's the very principle of software rendering that onQ has been interested in! One hardware solution for all computation problems. Don't bother putting in lots of custom silicon when a single unified processor can do the work.
 
It is, but unfortunately, like Shifty wrote, it's not flexible enough, and modern engines and their algorithms are constantly evolving, with a more pressing need for flexibility.

Developers even bypass GPU upscaler chips nowadays in order to have better results with their own software algorithms, even if it costs them more GPU time. And soon, even ROPs will be bypassed! I know it's only one game but it clearly points towards what game development will need in the future: more hardware flexibility.

I am not sure an FXAA chip included in current consoles (and conceived 5 years ago with antique FXAA code) would still be used in many games nowadays... :nope:


DPUs are also programmable.

http://ip.cadence.com/knowledgecenter/know-ten/dataplane-design

Handling the Difficult Tasks in the SoC Dataplane
Chris Rowen discusses the benefits of dataplane processing with SemIsrael.
Designers have long understood how to use a single processor for the control functions in an SoC design. However, there are a lot of data-intensive functions that control processors cannot handle. That's why designers design RTL blocks for these functions. However, RTL blocks take a long time to design and verify, and are not programmable to handle multiple standards or changes.

Designers often want to use programmable functions in the dataplane, and only Cadence offers the core technology that overcomes the top four objections to using processors in the dataplane:

  1. Data throughput—All other processor cores use bus interfaces to transfer data. Cadence® Tensilica® cores allow designers to bypass the main bus entirely, directly flowing data into and out of the execution units of the processor using a FIFO-like process, just like a block of RTL.
  2. Fit into hardware design flow—We are the only processor core company that provides glueless pin-level co-simulation of the instruction set simulator (ISS) with Verilog simulators from Cadence, Synopsys, and Mentor. Using existing tools, designers can simulate the processor in the context of the entire chip. And we offer a better verification infrastructure than RTL, with pre-verified state machines.
  3. Processing speed—Our patented automated tools help the designer customize the processor for the application, such as video, audio, or communications. This lets designers use Tensilica DPUs to get 10 to 100 times the processing speed of traditional processors and DSP cores.
  4. Customization challenges—Most designers are not processor experts, and are hesitant to customize a processor architecture for their needs. With our automated processor generator, designers can quickly and safely get the customized processor core for their exact configuration.
The Best of CPUs and DSPs with Better Performance
Tensilica processors combine the best of CPUs and DSP cores with much better performance and fit for each application. Where our processors really shine is in the dataplane - doing the hard work, handling complex algorithms, and offloading the host processor. Our processors and DSPs deliver programmability, low power, optimized performance, and small core size.
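
As a purely software analogy (nothing to do with Cadence's actual tools), the "FIFO-like, bypass-the-bus" idea from that page boils down to a fixed processing stage that streams data from an input queue straight to an output queue, RTL-style, instead of a control processor fetching it over a shared bus. The saturating-gain kernel and types here are made up for illustration:

```cpp
// Toy software model of one dataplane stage: samples stream in through a
// FIFO, pass through a fixed kernel, and stream out the other side.
#include <algorithm>
#include <cstdint>
#include <queue>

using Fifo = std::queue<int16_t>;

// Pop a sample, apply a fixed operation, push the result.
void gainStage(Fifo& in, Fifo& out, int gainNumerator, int gainDenominator)
{
    while (!in.empty()) {
        int32_t sample = in.front();
        in.pop();
        int32_t scaled = sample * gainNumerator / gainDenominator;
        out.push(static_cast<int16_t>(std::clamp(scaled, -32768, 32767))); // saturate
    }
}

int main()
{
    Fifo input, output;
    for (int i = 0; i < 8; ++i)
        input.push(static_cast<int16_t>(i * 1000)); // pretend audio samples

    gainStage(input, output, 3, 2);                 // x1.5 gain, saturated
    return 0;
}
```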
 
Not enough to be worth the added complexity.
It's the very principle of software rendering that onQ has been interested in! One hardware solution for all computation problems. Don't bother putting in lots of custom silicon when a single unified processor can do the work.

Sony, AMD, NVIDIA & others would be fools not to learn from software-based rendering & use what they learn to improve the hardware for the next generation.

Also, this isn't stopping software-based rendering.
 
I would approach this from another direction, asking how much do we think isn't already like this in a GPU?
More so than the CPU component, there are a lot of internal microprocessors that are abstracted into what is called a single GPU.
The illusion that there is a single GPU starts to break down sooner, once low-level optimizations come into play, than it does for a CPU.

There are a lot of behaviors in the fixed-function pipelines that would be handled by specialized execution loops. The way texturing operations get broken down into multiple memory accesses and filtered back to a single result involves a form of internal sequencing, and the management of ROP cache tiles is another loop that at least for GCN is not readily apparent to the compute portion.

Whether or not these are DSPs per se, I guess some elements share more properties than others.
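
To make the texturing example concrete: a single bilinear fetch already hides a small internal loop of four memory reads plus a weighted blend, roughly like the rough C++ below (the single-channel layout, clamping and types are just illustrative, not how any particular GPU implements it):

```cpp
// What one "texture sample" expands into inside the texture unit:
// four texel reads at the right addresses, then a filter back down to one result.
#include <algorithm>
#include <cmath>
#include <vector>

struct Texture {
    int width, height;
    std::vector<float> texels;                       // single channel, row-major
    float at(int x, int y) const {
        x = std::min(std::max(x, 0), width - 1);     // clamp addressing mode
        y = std::min(std::max(y, 0), height - 1);
        return texels[y * width + x];
    }
};

// Bilinear filter at normalized coordinates (u, v) in [0, 1].
float sampleBilinear(const Texture& tex, float u, float v)
{
    float fx = u * tex.width  - 0.5f;                // texel-space position
    float fy = v * tex.height - 0.5f;
    int   x0 = int(std::floor(fx)), y0 = int(std::floor(fy));
    float wx = fx - x0, wy = fy - y0;                // filter weights

    // Four memory accesses...
    float t00 = tex.at(x0,     y0);
    float t10 = tex.at(x0 + 1, y0);
    float t01 = tex.at(x0,     y0 + 1);
    float t11 = tex.at(x0 + 1, y0 + 1);

    // ...filtered back to the single value returned to the shader.
    float top    = t00 + (t10 - t00) * wx;
    float bottom = t01 + (t11 - t01) * wx;
    return top + (bottom - top) * wy;
}
```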
 
Shifty Geezer said:
Split out to spawn a more directed conversation about DPUs, a big and upcoming thing.

I would approach this from another direction, asking how much do we think isn't already like this in a GPU?
More so than the CPU component, there are a lot of internal microprocessors that are abstracted into what is called a single GPU.
The illusion that there is a single GPU starts to break down sooner, once low-level optimizations come into play, than it does for a CPU.

There are a lot of behaviors in the fixed-function pipelines that would be handled by specialized execution loops. The way texturing operations get broken down into multiple memory accesses and filtered back to a single result involves a form of internal sequencing, and the management of ROP cache tiles is another loop that at least for GCN is not readily apparent to the compute portion.

Whether or not these are DSPs per se, I guess some elements share more properties than others.

The DPU is already being used in AMD's APU/SoC/GPU.
 
The TrueAudio and VCE engines use DSPs or customized cores that are explicitly exposed as such. For the cores that handle things like audio, there is a latency-sensitive aspect that the processing loops internal to the GPU are not as subject to. There still are small execution engines in various parts of the GPU, whether or not a higher-level program can see them.
 
Sony, AMD, NVIDIA & others would be fools not to learn from software-based rendering & use what they learn to improve the hardware for the next generation.
In what way? Software rendering solutions are enabled by hardware flexibility. You don't then want to fix the hardware on those solutions to run them faster if that fixation limits flexibility. The progression of modern GPU design came from fixed hardware solutions highly optimised to run the workloads devs were using. Had we stuck with that thinking, we wouldn't be seeing compute and software solutions now.

A DPU as a programmable image processor only makes sense if the GPU can't do the job efficiently enough, and image processing (drawing and moving vertex and pixel data) is exactly what GPUs are best at. Moving, warping and scaling the framebuffer is an ideal solution for GPU using the game's data.

Also, this isn't stopping software-based rendering.
The point with software rendering is you don't have lots of different hardwares to write for. You have one completely flexible processor and it does whatever you tell it to with no hardware limits. So putting a Myriad 2 in a console, for example, would go against the idea of a software solution because it's a dedicated processor whose performance can't be repurposed for something else. And it's this desire for processing power not to go to waste that saw the development of GPGPU and then compute.

edit: as an example of why this is a preferred solution, look at the nVidia Shield tablet. It has a 'pressure' (size)-sensitive stylus, enabled via compute. This isn't done with a hardware processor, so it is less power efficient than it could be, but that means that when the pen isn't being used, the whole processing power can be turned to something else.

That said, processors are made up of functional blocks and it can be argued that the inclusion of a particular hardware block transparently accessed via the ISA makes sense if it sees enough utilisation.
 
It's simple, really. With consoles, hardware doesn't change; software does. Limiting software with hardware is probably the last thing you want to do.
 
In what way? Software rendering solutions are enabled by hardware flexibility. You don't then want to fix the hardware on those solutions to run them faster if that fixation limits flexibility. The progression of modern GPU design came from fixed hardware solutions highly optimised to run the workloads devs were using. Had we stuck with that thinking, we wouldn't be seeing compute and software solutions now.

A DPU as a programmable image processor only makes sense if the GPU can't do the job efficiently enough, and image processing (drawing and moving vertex and pixel data) is exactly what GPUs are best at. Moving, warping and scaling the framebuffer is an ideal solution for GPU using the game's data.

The point with software rendering is you don't have lots of different hardwares to write for. You have one completely flexible processor and it does whatever you tell it to with no hardware limits. So putting a Myriad 2 in a console, for example, would go against the idea of a software solution because it's a dedicated processor whose performance can't be repurposed for something else. And it's this desire for processing power not to go to waste that saw the development of GPGPU and then compute.

edit: as an example of why this is a preferred solution, look at the nVidia Shield tablet. It has a 'pressure' (size)-sensitive stylus, enabled via compute. This isn't done with a hardware processor, so it is less power efficient than it could be, but that means that when the pen isn't being used, the whole processing power can be turned to something else.

That said, processors are made up of functional blocks and it can be argued that the inclusion of a particular hardware block transparently accessed via the ISA makes sense if it sees enough utilisation.


Things are not the same as they used to be; all the chips will be able to work together & share workloads. If someone has a ray tracing co-processor that can handle lighting better than the CPU & GPU, why not add that to the SoC to help with the lighting?
 
Things are not the same as they used to be; all the chips will be able to work together & share workloads. If someone has a ray tracing co-processor that can handle lighting better than the CPU & GPU, why not add that to the SoC to help with the lighting?

Because, when you are not raytracing, the dedicated raytracing hardware would sit idle, while that die space could instead accommodate more general-purpose compute resources that are always going to be useful. The practicality of dedicated-purpose functional blocks is dictated by several factors. Some off the top of my head: how frequently the type of compute work they are most efficient at is performed, how much more efficiently they perform that work than a more general-purpose compute resource, how much die space and power they require, and how much complexity they add to the programming model.
 