Revenge of Cell, the DPU Rises *spawn*

Status
Not open for further replies.
A DPU is a product composed of a block of DSP and interconnect logic paired with the tools and services for customizing the block and interface. The base concept behind the TrueAudio block and the similar blocks in the consoles could fall under that service or one like it, and perhaps it actually is one, depending on how AMD implemented it. There are certain constraints, such as absolute power consumption for encode/decode or latency for audio, that can make the CPU or GPU less than ideal.

When it's dealing with the graphics pipeline, there's a name for a region of simple processors tied together with custom memory, interconnects, and IP: GPU. AMD has shown it would happily customize that portion, given its expertise and ownership of that region, versus the DSP blocks where it would not. Why does this custom block of cores that hooks into the graphics fixed and programmable paths change things versus all the other custom cores already there?

I never said that it had to be Cadence/Tensilica. It could be AMD who makes a chip for offloading some rendering tasks, but I think it would make more sense to use a chip that's already being made and will continue to be made and sold to other customers.
 
Like a GPU?
 
I actually think that, whether or not it ends up being called a DPU, the use of dedicated units will happen more and more, and the interfacing between those units is only going to get better. ARM sort of illustrated such usage in its presentation of the Mali-T880.
onQ, I share your POV: we could see (existing) dedicated units take on more relevance in gaming technology. I don't expect "new" units; it has to be units that are widely used, and I mean widely. The best contender I see is the "image processor", or whatever they are called. I'm speaking of those dedicated units that manufacturers go through the pain of developing to process those nice but heavy pictures, or VPUs, looking at the width of the thing Qualcomm came up with in their latest SoC. If they do it, it is because those devices are significantly more efficient at the job than a GPU. Along with compression and decompression, I could see those devices put to great use for reworking textures and post-processing.
I wonder how those "things" could fare at procedural texture generation.
 


AMD's TrueAudio and Unified Video Decoder are ASICs based on Xtensa.

diagram-dataplane-xtensa.png






http://ip.cadence.com/ipportfolio/tensilica-ip/xtensa-customizable

Xtensa Customizable Processors

Make a Dataplane Processor Uniquely Your Own

What are the WOW factors you need in your SoC design? For next-generation mobile devices and home entertainment products, you need efficient, high-performance functional blocks that are programmable to keep up with the latest standards. Use our proven, automated processor generator to customize a Cadence® Tensilica® Xtensa® processor core, and create more competitive and differentiated features with the lowest possible power.

  • Create a single product for multiple markets.
  • Reduce development time and cost by using pre-verified processors instead of custom logic.
  • Extend product life cycles. Change the software to add new functions without a re-spin.
 
Could someone point me to an article on what a DPU is?

Also, isn't temporal reprojection a perfect fit for the hardware already in a GPU?

edit - reprojection/temporal reprojection if there's a difference.
 
Given the large number of variations possible between fixed-function and fully programmable, my questions are:

Specifically when and how can the functions a DPU is capable of be changed or added on to from an initial implementation? On the fly by a running program? As part of a firmware update?

Ok, so I think I found the answer to this one in some of the marketing materials onQ posted. They mention firmware upgrades to alter the DPU programming. The thing is, though, that once a DPU is configured a certain way and released in a product, the degree to which you can alter the programming is going to be limited by the need to maintain compatibility with existing software. It may not technically be fixed-function hardware, but practically it will be.

@onQ You've been using marketing materials designed to highlight all of the positive points of these devices to support your point of view. You see how this alone might make some question whether you have a balanced viewpoint on the subject, right? The AMD slides had some balance, but you seem to have only considered the "pro" slide and ignored the "con" one.
 
I'm not just talking about a DPU. I was talking about offloading some tasks to specialized processors, but since AMD is already using a DPU for speech recognition, video decoding and so on, I used it as an example. It's not an FPGA: you're not changing the hardware after it's on the market, but you can update the code that runs on the chip.
 
The 360 had those functions running on the EDRAM, which allowed some operations to be done with minimal overhead.

Do you see this as new items, or just the GPU's current operations split further so that, at console-level development, you can orchestrate the flow of data through it at an even finer granularity?
 
Could someone point me to an article on what a DPU is?

Also, isn't temporal reprojection a perfect fit for the hardware already in a GPU?

edit - reprojection/temporal reprojection if there's a difference.
Except that is referring to cloud.

Here's a better definition from Tensilica themselves: http://www.edn.com/electronics-blogs/other/4307524/Multiprocessing-5-Dataplane-Processor-Units
In short, a DPU is Cell. :runaway:

Actually, SPEs lacked some of the DSP features, despite Cell's detractors loving to refer to SPEs as DSPs.


Remember when I asked the crazy question: can Sony add fixed function pipelines to the PS4 after release? It was because I was reading about something like this after Cerny said some crazy things in his interview.

This is what I was talking about when I asked that question.

Reduce Verification Time and Effort in the Dataplane

You can significantly reduce verification time and effort using an Xtensa processor to map the control FSM to software on the processor instead of RTL for new blocks. An Xtensa processor delivers automatic RTL generation with fine-grained clock gating, saving you from months of design effort in RTL. And DPUs can be reprogrammed to adapt to upgrades and bugs in algorithms—no hardware change required. You can also create datapaths similar to hardwired using multi-cycle, complex functional units, and build custom, high-bandwidth data/control connections to other blocks with predictable latencies.

http://ip.cadence.com/ipportfolio/tensilica-ip/xtensa-customizable


Processors as RTL Alternatives

Processors can be used as alternatives to hand-coded RTL blocks by adding the same datapath elements as implemented in RTL accelerator blocks. These datapath elements include deep pipelines, parallel execution units, task-specific state registers, and wide data buses to local and global memories. This allows Tensilica processors to sustain the same high computation throughput and to support the same data interfaces as RTL hardware accelerators.

However, control of processor datapaths is very different from their RTL counterparts. Cycle-by-cycle control of a processor’s datapaths is not frozen in the hardware FSM’s state transitions. Instead, the FSM is implemented in firmware, which greatly reduces the effort needed to fix an algorithm bug or add new features. In a firmware-controlled FSM, control-flow decisions occur in branches, load and store operations implement memory accesses, and computations become explicit sequences of general-purpose and application-specific instructions.
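
To make the "FSM in firmware" idea a bit more concrete, here is a rough sketch in plain C of what a firmware-controlled state machine can look like. Everything in it is an assumption for illustration: the run-length decoder, the state names and the `rle_decode` function are made up here and are not taken from Cadence/Tensilica material or any Xtensa API. The point is only that state transitions become branches and memory accesses become ordinary loads and stores, so fixing the algorithm means shipping new code rather than new hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical states for a tiny run-length decoder (illustration only). */
typedef enum { ST_READ_COUNT, ST_READ_VALUE, ST_DONE } fsm_state;

/* Decodes (count, value) byte pairs from src into dst.
   Returns the number of bytes written to dst. */
size_t rle_decode(const uint8_t *src, size_t src_len,
                  uint8_t *dst, size_t dst_cap)
{
    fsm_state state = ST_READ_COUNT;
    size_t in = 0, out = 0;
    uint8_t count = 0;

    while (state != ST_DONE) {
        switch (state) {                  /* a state transition is a branch */
        case ST_READ_COUNT:
            if (in >= src_len) { state = ST_DONE; break; }
            count = src[in++];            /* a memory access is a load */
            state = ST_READ_VALUE;
            break;
        case ST_READ_VALUE:
            if (in >= src_len) { state = ST_DONE; break; }
            {
                uint8_t value = src[in++];
                while (count > 0 && out < dst_cap) {
                    dst[out++] = value;   /* ...or a store */
                    count--;
                }
            }
            /* Stop when the output buffer is full, otherwise read the next pair. */
            state = (out < dst_cap) ? ST_READ_COUNT : ST_DONE;
            break;
        case ST_DONE:
            break;
        }
    }
    return out;
}
```

Feeding it the four bytes 3, 'a', 2, 'b' with an 8-byte output buffer writes "aaabb" and returns 5. In an RTL block the same three states would be frozen into the hardware; here, adding a state or changing a transition is just a code change.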
 
Wouldn't this also make sense of how Microsoft was able to update the Xbox One for 10-bit HEVC?
 
From the Tensilica engineer quote, I think it means more 'data domain' rather than 'control domain'. The idea is processors that can deal with data processing as well as hardware while being programmable in a way beyond DSPs. I think. TBH it's basically a marketing term at this point if no-one else is using it. ;)
 
Except that is referring to cloud.

Here's a better definition from Tensilica themselves: http://www.edn.com/electronics-blogs/other/4307524/Multiprocessing-5-Dataplane-Processor-Units
In short, a DPU is Cell. :runaway:

Actually, SPEs lacked some of the DSP features, despite Cell's detractors loving to refer to SPEs as DSPs.
Once all is said and done, it seems to me that the Cell suffered from an imbalance between the CPU and the SPUs, and both suffered from being designed at a time when speed-demon designs were fashionable. The Cell also embarked quite a few units (8), as if many cores were a must-have feature rather than a necessity. Then the Cell shipped without a proper programming model, going as far as exposing the underlying hardware. I would think Cell's demise came more from the CPU side than the SPU side, as the PPE was bad, and after the bad start nobody within STI dared to iterate on the design.
As time passes, the thing that stands out most about the Cell design is not that it was a heterogeneous/ASM design, the memory model for the SPUs, the imbalance between CPU and SPU resources, the software model (or lack thereof), or the bad CPU. What really stands out in the cold, now that time has passed, is the width of the SPUs. I mean, 4-wide?
SPUs were indeed not DSPs, but they failed at being VPUs: they enforced the SIMD programming model while being about as narrow as it gets, and multi-core programming too, since they chose multiple narrow units instead of a single bigger one or a couple of them. Looking at the design again, it smells as if LIW or VLIW would have been a better match for the SPU architecture.

Anyway, long story short, I don't see how the Cell by itself can be used as an argument against more heterogeneous designs, as its shortcomings might not be related to its heterogeneous nature but to other design choices. There are many successful heterogeneous designs, like mobile SoCs, and it seems like things are headed toward more heterogeneity.

As we are on that topic, these details about Qualcomm's Hexagon may interest people around here.
http://anandtech.com/show/9552/qual...680-dsp-in-snapdragon-820-accelerated-imaging
 
From the Tensilica engineer quote, I think it means more 'data domain' rather than 'control domain'. The idea is processors that can deal with data processing as well as hardware while being programmable in a way beyond DSPs. I think. TBH it's basically a marketing term at this point if no-one else is using it. ;)

But what's the control domain in an SoC, CPU, or GPU? This one is going over my head.
 