Xenos - invention of the BackBuffer Processing Unit?

Shifty Geezer

uber-Troll!
Moderator
Legend
There's a lot of confusion about Xenos, eDRAM and 256 GB/s, especially amongst the less technical of us (me at least!). As best I can tell it's because the structure of the graphics system hasn't been well explained. I present here my understanding so that the technical bods can dissect it and agree/disagree, and fundamentally to consider the idea that a separate processing entity - the BackBuffer Processing Unit - has been introduced into the graphics system.

-----

The Xenos system consists of two processing parts. One has the unified shaders and performs the usual graphics work of assembling polygons, texturing and shading (rasterization). The other performs frequent, memory-intensive tasks like Z/stencil rendering, alpha-blended polygon rendering, overdraw, etc.

There is a bandwidth of 22.4 GB/s between the GPU's shader unit and system RAM.

There is a bandwidth of 32 GB/s write, 16 GB/s read between the Xenos shader processor and the BackBuffer Processor.

The BackBuffer Processor has 10 megabytes of fast local storage.

The logic on the BBPU can access this data directly, just as any processor can access its local storage, at sufficient bandwidth that it never has to wait. MS have given a figure of 256 GB/s, but to all intents and purposes it can be considered limitless bandwidth. The logic will never wait.
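As an aside, the 256 GB/s figure can be reconstructed arithmetically. The derivation below is my own reading, and the per-clock sample count, clock speed and per-sample formats are assumptions on my part, not confirmed specs:

```python
# Possible derivation of the quoted 256 GB/s (my assumption, not official):
# eDRAM-side logic at 500 MHz doing a read-modify-write on 32 AA samples
# per clock, each sample carrying 32-bit colour and 32-bit Z/stencil.
CLOCK_HZ = 500e6   # assumed eDRAM die clock
SAMPLES  = 32      # assumed samples processed per clock (8 pixels x 4xAA)
BYTES    = 4 + 4   # 32-bit colour + 32-bit Z/stencil per sample
RMW      = 2       # blending/Z-testing reads each sample, then writes it

bandwidth = CLOCK_HZ * SAMPLES * BYTES * RMW
print(f"{bandwidth / 1e9:.0f} GB/s")   # 256 GB/s
```

If that reconstruction is right, the number describes what the on-die logic can chew through, not a link you could ever move data across.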

Code:
                  System RAM

                      DDR
                       |
                       |
                    22.4 GB/s
                       |
                       |
           +-----------+-----------+--------+---------------------+
           |           |           |        |                     |
           |     SHADER UNIT       |        |   BACKBUFFER        |
           |           |           |        |   PROCESSOR         |
           |   +-------+-------+   | 32GB/s | +-----------------+ |
           |   |               | ------------>|                 | |
           |   |    Unified    |<------------ |   Processing    | |
           |   |    shaders    |   |16 GB/s | |   Logic + 10 MB | |
           |   |               |   |        | |   local storage | |
           |   +---------------+   |        | +-----------------+ |
           |                       |        |                     |
           +-----------------------+--------+---------------------+
If this is right, it raises the question of why the 256 GB/s bandwidth is talked about at all (apart from blind marketing speak). Bandwidth between logic and its local storage never gets mentioned from what I know. No-one's listing the bandwidth between their level-one cache and logic on their CPUs! I guess it might be because eDRAM is (or has been) slower than normal cache memory (SRAM?) so its speed was a limiting factor. By saying the eDRAM's bandwidth is that fast, MS are indicating that the eDRAM is effectively as fast as SRAM: 10 MB of what amounts to Level 1 cache. However, this figure results in a confusing situation where it gets mixed with the conventional understanding of bandwidth as data transfer between separate processing units and storage, so it doesn't appear to be a fair figure when applied in that way.

Furthermore, although this bandwidth isn't bandwidth between separate processing units, it does have a beneficial effect in terms of not interfering with RAM -> GPU bandwidth. If the BBPU were not present, its functions (Z/stencil rendering, alpha-blended polygon rendering, overdraw, etc.) would have to be performed in the only available storage large enough, which would be main RAM, and this would cut into the 22.4 GB/s bandwidth.

Therefore the benefits of the BBPU should not be listed as 'bandwidth' but as 'bandwidth saved' or 'bandwidth freed'. By not having to work on system RAM, the intensive tasks performed on the BBPU free the system bandwidth for other tasks like textures, models, etc.
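To put a rough number on 'bandwidth freed', here's a back-of-envelope sketch. Every figure in it (resolution, formats, overdraw, frame rate) is my own illustrative assumption, not an official spec:

```python
# Back-of-envelope: backbuffer traffic a frame could impose on main RAM
# if there were no BBPU/eDRAM to absorb it. All numbers are assumptions.
WIDTH, HEIGHT = 1280, 720     # assumed render target
MSAA          = 4             # assumed 4x AA samples per pixel
BYTES_COLOUR  = 4             # 32-bit colour
BYTES_Z       = 4             # 32-bit Z/stencil
OVERDRAW      = 3.0           # assumed average overdraw factor
FPS           = 60

samples = WIDTH * HEIGHT * MSAA
# each sample touched: Z read + write, colour read + write
bytes_per_sample = 2 * BYTES_Z + 2 * BYTES_COLOUR
per_second = samples * bytes_per_sample * OVERDRAW * FPS

print(f"~{per_second / 1e9:.1f} GB/s kept off the 22.4 GB/s system bus")
```

With these invented numbers the backbuffer alone would eat roughly half of the 22.4 GB/s bus, which is the scale of saving being argued for.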

-----

Does this sound right? Does this paint an accurate picture of where these numbers come from and what they actually mean?
 
It's like comparing a 128-bit bus and a 256-bit bus. If Sony had implemented a 256-bit bus for GDDR3, wouldn't you be counting the doubled bandwidth against Xenos?

When all GPUs have a free backbuffer that will be the time when we can stop using it as a comparison point between GPU architectures.

Until then it's entirely valid to count the backbuffer's free bandwidth as part of the GPU's overall bandwidth capability.

Even if one naively excludes the EDRAM's internal bandwidth, Xenos still has 22.4 + 32 + 16 = 70.4 GB/s compared against RSX which has 22.4 + 15 + 20 = 57.4 GB/s (assuming that FlexIO supports both channels simultaneously, which seems a reasonable assumption to make).
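Trivial, but for anyone following along, those totals check out (figures exactly as quoted in this thread, FlexIO concurrency assumed as above):

```python
# Aggregate link bandwidths in GB/s, using the per-link figures quoted above
xenos = 22.4 + 32 + 16   # system RAM + eDRAM write link + eDRAM read link
rsx   = 22.4 + 15 + 20   # GDDR3 + FlexIO write + FlexIO read (assumed concurrent)
print(round(xenos, 1), round(rsx, 1))   # 70.4 57.4
```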

I don't know why you're so keen to dismiss the EDRAM's bandwidth.

Jawed
 
It's this bandwidth they're talking about because that is in some way "comparable" to other architectures, where the bandwidth figures are given for the link between (not exclusively) the ROPs and the framebuffer. But that leaves efficiency out of the picture. Those 256GB/s are used in a very inefficient way, because it's "cheap" bandwidth. When limited by external bandwidth, you have to try much harder, but the result is also much higher bandwidth efficiency (but there are other costs).
 
The important distinction is that framebuffer bandwidth is often one of the largest consumers of bandwidth, especially so when FSAA is enabled; the EDRAM takes a very large bandwidth burden off that 22.4 GB/s system/UMA RAM bandwidth. Although the final frame will exist in system memory, this will only include the finished frame data, so all reading, writing, blending and AA operations are taken care of in the EDRAM/ROP processing chip, which leaves only a fraction of the bandwidth consumed on the 22.4 GB/s of UMA RAM.
 
Jawed said:
Even if one naively excludes the EDRAM's internal bandwidth, Xenos still has 22.4 + 32 + 16 = 70.4GB/s compared against RSX which has 22.4 + 15 + 20 = 57.4GB/s (assuming that FlexIO supports both channels simultaneously, which seems a reasonable assumption to make).

You should add 10.8 GB/s for Xenos too.
 
Dave .. can you comment on one of the other comments where it was said that although the Xbox2 can use 5 ALUs at once .. it can only do 5 vertex instructions or 5 pixel instructions at once .. but can't do, say, 3 pixel and 2 texture at once or vice versa.

This would surely create a bottleneck?

Just want your comment. ;)

Maybe would be a good question for them Xbox chaps .. if you don't know?

US
 
I think the entire point of unified shaders is so you can allocate the resources needed (either vertex or pixel ops). So you can do both at the same time, and you'll never have hardware sitting there waiting to be used.
 
Well that's what I would've thought .. except

http://www.beyond3d.com/forum/viewtopic.php?t=23126&postdays=0&postorder=asc&start=40

Jawed said:
rwolf said:
Here is a nasty limitation.

All 48 of the ALUs are able to perform operations on either pixel or vertex data. All 48 have to be doing the same thing during the same clock cycle (pixel or vertex operations), but this can alternate from clock to clock. One cycle, all 48 ALUs can be crunching vertex data, the next, they can all be doing pixel ops, but they cannot be split in the same clock cycle.


Yep, it sounds shit to me. It makes me wonder if dynamic branching is ever going to bring improved performance. Seems unlikely.

Jawed

Well after reading that .. it seems to me that it can't do pixel and vertex operations at the same time. Would be an interesting question for the team?
 
Jawed said:
I don't know why you're so keen to dismiss the EDRAM's bandwidth.
I'm not trying to dismiss it. I'm trying to understand it. Please tell me: when has the data transfer rate between the logic circuits of a processor and its local storage ever been called bandwidth?!

So far no-one's confirmed or denied the concept that there is in essence a separate processor involved. It DOES benefit bandwidth, which I said. Basically it frees the main bandwidth of all the backbuffer data flow, which would be a lot. I accept that. I think it's a good idea! I think it has a big impact on bandwidth available to the GPU!

The KEY point is that this data-movement saving is not achieved by a huge bandwidth between graphics processor and backbuffer, but by the creation of a separate set of logic - a separate processor. If this logic remained on the GPU, the bandwidth to the backbuffer would be 32+16 GB/s to the eDRAM. By moving the logic onto this other chip, that bandwidth becomes irrelevant.
 
I'm with Shifty on this.

While it isn't fair to just count the 256 GB/s bandwidth as system bandwidth, it wouldn't be fair to entirely dismiss it either. In another thread Jaws tried to normalise it, but I don't think it can be done the way he did it.
All in all I think the best way to understand what's going on and what the benefits of the eDRAM are, is to count it as something like 'bandwidth saved'.

Btw, Hi all :D
 
It's a moot point directly comparing on-chip bandwidth to external bandwidth; by the same logic I could add Cell's internal bandwidth to "own" other CPUs.

Cell:
1 SPU has 3×128-bit reads and 1×128-bit write per cycle = 143 GB/s read, 47 GB/s write to Local Storage.
7 SPUs = 1 TB/s read, ~334 GB/s write.
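Npl's per-SPU figures line up if you assume a 3.2 GHz clock and binary gigabytes; both of those are my assumptions, but here's the check:

```python
# One 128-bit port moves 16 bytes per cycle. The quoted 143 / 47 GB/s
# appear to be binary GB at an assumed 3.2 GHz SPU clock.
CLOCK = 3.2e9
PORT  = 128 // 8                       # bytes per cycle through one port

read_bw  = 3 * PORT * CLOCK / 2**30    # 3 read ports per SPU
write_bw = 1 * PORT * CLOCK / 2**30    # 1 write port per SPU
print(f"per SPU: {read_bw:.1f} GB/s read, {write_bw:.1f} GB/s write")
print(f"7 SPUs:  {7 * read_bw:.0f} GB/s read (~1 TB/s)")
```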
 
JAD said:
All in all I think the best way to understand what's going on and what the benefits of the eDRAM are, is to count it as something like 'bandwidth saved'.

Btw, Hi all :D

Welcome to the forums!

I agree, that sounds like that might be a good way to show the benefits of the eDram. However, does the PS3 save any bandwidth too?

Tommy McClain
 
Well, that's the question. But I don't think they can really save bandwidth the way it's saved with the eDRAM + on-die logic. I think the PS3 will save bandwidth with traditional compression techniques and maybe even some new ones.

Correct me if I'm wrong though.
 
Npl said:
Cell:
1 SPU has 3×128-bit reads and 1×128-bit write per cycle = 143 GB/s read, 47 GB/s write to Local Storage.
7 SPUs = 1 TB/s read, ~334 GB/s write.
I wouldn't count register file bandwidth... but let's do it anyway :)
The SPE's register file has 8 ports (6 read, 2 write) so you must double those numbers:
2.15 TB/s read and 716 GB/s write
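nAo's doubled totals drop straight out if one 128-bit port at an assumed 3.2 GHz moves 51.2 GB/s (decimal bytes; the clock and the unit choice are my assumptions):

```python
# Doubling the 3-read/1-write local store figures up to the register
# file's 6 read / 2 write ports, across 7 SPUs (3.2 GHz clock assumed)
port_bw = 16 * 3.2e9            # bytes/s through one 128-bit port = 51.2 GB/s

read_bw  = 2 * 7 * 3 * port_bw  # 6 read ports x 7 SPUs
write_bw = 2 * 7 * 1 * port_bw  # 2 write ports x 7 SPUs
print(f"{read_bw / 1e12:.2f} TB/s read, {write_bw / 1e9:.1f} GB/s write")
```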
 
JAD said:
I'm with Shifty on this.
Well I'm glad someone is! :D

All in all I think the best way to understand what's going on and what the benefits of the eDRAM are, is to count it as something like 'bandwidth saved'.
I'd like to extend my argument a little further in this respect. The eDRAM solution removes an amount of bandwidth demand from system RAM. MS and others are counting this as system bandwidth.

Cache on the CPU also saves bandwidth demand on system RAM. Often-used data items are stored in cache. If this cache wasn't present, the CPU would fetch and store everything to system RAM, and the demands would be very great. By using fast local RAM, a huge amount of bandwidth demand is negated.

If the speed of local storage is to be counted as a saving on main system bandwidth, then all local storage should be counted; and if those savings are calculated as the data rate of the local storage itself, all local storage speeds should be included, as Npl's Cell example shows.

In essence, existing solutions for limiting bandwidth demand on CPUs are taken for granted, and this approach for GPUs is something new, so I can understand total bandwidth savings by use of all local storage on all processors not being counted. But in describing the role of the eDRAM, it can only fairly be classed as 'bandwidth saving compared to existing rendering pipelines', unless you really DO want to include every bandwidth-saving device in a system architecture as contributing to its total bandwidth :oops:

That'd result in many terabyte numbers, and the teraflops nonsense is already more than I can cope with! :D
 
While it isn't fair to just count the 256 GB/s bandwidth as system bandwidth, it wouldn't be fair also to entirely dismiss it.
That's what generally everyone did when comparing PS2 to XBox, so why should this be any different?

At any rate, not even the most rabid fanboys had the audacity to make idiotic charts claiming ~50 GB/s+ (PS2) (or ~30 GB/s+ (GC)) vs the paltry 6.4 GB/s of Xbox.

But yeah, as mentioned above, counting SPE local memory bandwidth, PS3 is around 800 GB/s I think? ...
 
Next thing you know they'll be benchmarking the 2 platforms on how responsive the controllers are.

Now that I think about it, I'm kinda interested in that too, seeing how Bluetooth devices aren't exactly the fastest or most reliable, from my personal experience.
 
Well here's an example of "hidden" bandwidth being discussed:

http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2419&p=2

In the case of the Pentium-D, the caches talk to each other (to keep cache consistency) via a shared 800 MHz bus, just like two single core SMP Xeons. Not only is 800 MHz relatively slow compared to the CPU (3200 MHz), but exchanging information via a bus also increases latency and lowers bandwidth. Latency is increased as the bus may not always be free - one of the CPUs might be using it to transfer data to or from the memory. This half duplex bus can only transmit signals of one device (CPU 1, CPU 2, chipset) at a given moment. Bandwidth is decreased as the cache coherency exchanges need a small amount of time on the bus too.

The Pentium D's cache architecture is a real kludge.

Your point that R500 is actually a dual-processor GPU is salient. The EDRAM chip truly is a second processor.

Jawed
 