ATI Technologies Interview: The Power of the Xbox 360 GPU

http://interviews.teamxbox.com/xbox/1458/The-Power-of-the-Xbox-360-GPU/p1/

What were the goals and challenges that ATI faced in developing the Xbox 360 GPU?

Bob Feldstein: The challenges included creating, on schedule, a platform that can live for five years without enhancement. Microsoft’s aggressive performance specifications for the system forced ATI to once again think outside the box -- in this case, the PC market. After making the breakthrough that we needed by thinking of this product as a console product only, the innovations -- Intelligent Memory, Unified Shader, Modeling Engine -- came more easily. Then the architecture team had to come through in record time to stay ahead of an aggressive implementation team.


Did Microsoft know exactly what it wanted for its GPU or did they just set the goals and you proposed the architecture and technology?

Bob Feldstein: Microsoft set broad goals for the GPU. They were especially concerned with memory bandwidth and overall system performance. They wanted a machine that could take advantage of CPU multi-processing through multi-threading, plus a machine that would be conceptually simple to program while providing headroom for developers to stay competitive over the console’s lifetime. Microsoft and ATI did the GPU architectural design, with MS determining the overall performance targets and ATI turning those targets into reality. The Unified Shaders and Intelligent Memory, for example, are direct results of our remarkable collaboration.




Before we continue, we never had the chance to clarify the correct name of the Xbox 360 GPU. Some call it Xenos, others C1. Sometimes it was known as R500. But the rumor was that ATI wanted to avoid that codename because it could make the Xbox 360 GPU look less powerful than ATI’s R520 PC part. So, what is the Xbox 360 GPU codename?

Bob Feldstein: The Xbox GPU had nothing whatsoever to do with the PC products. R500 was never an internal name for the Xbox. Internally we called the GPU, interchangeably, C1 and Xenos. C1 was a code name defined before we had the contract, Xenos was the project name after the contract was won – but C1 stuck in everyone’s minds. Once we started calling it C1, it was hard to change.


wow this article-interview is 4 pages long - I'm just diving into it myself.
 
People love numbers. How many transistors does the Xenos have? Can you break that down into the parent and daughter die? Explain some of the numbers that have been mentioned, such as the 2-terabit figure and the 32GB/sec and 22.4GB/sec bandwidths.

Bob Feldstein: 235 million transistors parent die, 90 million transistors daughter die. Bandwidth for Intelligent Memory is derived from the following events occurring every cycle:

2 quads of pixels (8 pixels)/cycle * 4 samples/pixel * (4 bytes color + 4 bytes Z) * 2 (read and write) * 500 MHz = 256 GB/sec (that is, 2 terabits/sec).

The 22.4GB/sec link is the connection to main memory (note, incidentally, that all 512MB of Xbox 360 system memory is in one place, which makes accessing it easier from a developer perspective). The GPU is also directly connected to the L2 cache of the CPUs -- this is a 24GB/sec link. Memory bandwidth is extremely important, which is why we spent so much time on it. Fortunately, designing the system from the ground up gave us the freedom to build incredible bandwidth into the box.
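The arithmetic behind those headline figures is easy to check. Here is a minimal sketch in Python that reproduces the quoted numbers; the 8-pixels-per-cycle ROP rate and the 128-bit/1400 MT/s effective GDDR3 configuration are assumptions drawn from publicly reported Xbox 360 specs, not stated in the interview:

```python
# Sanity-check the bandwidth figures quoted in the interview.
# Assumed hardware parameters (publicly reported Xbox 360 specs):
GPU_CLOCK_HZ = 500e6      # Xenos core clock
PIXELS_PER_CYCLE = 8      # 2 quads of pixels per cycle
SAMPLES_PER_PIXEL = 4     # 4x multisample anti-aliasing
BYTES_PER_SAMPLE = 4 + 4  # 4 bytes color + 4 bytes Z
RW_FACTOR = 2             # each sample is both read and written

edram_bw = (PIXELS_PER_CYCLE * SAMPLES_PER_PIXEL * BYTES_PER_SAMPLE
            * RW_FACTOR * GPU_CLOCK_HZ)
print(f"eDRAM internal: {edram_bw / 1e9:.0f} GB/s "
      f"= {edram_bw * 8 / 1e12:.0f} Tbit/s")       # 256 GB/s, ~2 Tbit/s

# Main memory: 128-bit GDDR3 at 1400 MT/s effective (assumed)
main_bw = (128 / 8) * 1400e6
print(f"Main memory:    {main_bw / 1e9:.1f} GB/s")  # 22.4 GB/s
```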
...

The interface to the system’s memory is 128-bit. Isn’t this a bottleneck considering the bandwidth-intensive tasks performed in the GPU? Why was a 128-bit bus selected when PC parts already implement 256-bit buses in their high-end editions?

Bob Feldstein: Excellent question because it gets to the heart of what is right in the system design. We have a great deal of internal memory in the daughter die referred to above. We actually use this memory as our back buffer. In addition, all anti-aliasing resolves, Z-Buffering and Alpha Blending occur within this internal memory. This means our highest bandwidth clients (Z, Alpha and FSAA) occur internally to the chip and don’t need to access main memory. This makes the 128 bit interface to system memory, and the ensuing bandwidth, more than enough for our needs because we are offloading the bandwidth hogs to internal memory.
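To see why no plausible external bus would have been enough, compare the worst-case back-buffer traffic against external bus options. This is a rough sketch under the same assumptions as above; the 256-bit configuration is a hypothetical PC-style comparison point, not anything ATI quoted:

```python
# Rough comparison: worst-case back-buffer traffic vs. external buses.
# The 256 GB/s peak comes from the eDRAM calculation above (assumed).
rop_peak_demand = 256e9                # bytes/s if Z/alpha/AA hit DRAM
buses = {
    "128-bit GDDR3 (Xbox 360)": (128 / 8) * 1400e6,  # 22.4 GB/s
    "256-bit GDDR3 (PC-style)": (256 / 8) * 1400e6,  # 44.8 GB/s
}
for name, bw in buses.items():
    print(f"{name}: {bw / 1e9:.1f} GB/s -> "
          f"{bw / rop_peak_demand:.1%} of peak ROP traffic")
```

Even a doubled 256-bit bus would cover well under a fifth of the peak ROP demand, which is the argument for moving that traffic on-die rather than widening the external interface.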

....

The RSX has a 550 MHz clock speed. Does this 10% clock speed lead over the Xenos GPU necessarily mean that the PlayStation 3 GPU is more powerful than the Xbox 360 GPU? We won’t believe it until we see it, but if true, how is it possible that the PlayStation 3 can output two 1080p video streams simultaneously? That makes the RSX sound more powerful than the Xenos…

Bob Feldstein: No! These are inconsequential numbers that don’t reflect any reality concerning the system performance. The 1080p streams have no bearing on understanding system performance, and the clock speed means little.

Realize that the memory bandwidth is the bottleneck of graphics systems. ATI’s Intelligent Memory provides an almost infinite bandwidth path to memory – meaning that the Unified Shaders will never be stifled in getting work to do. The Sony processor is going to come up against memory bandwidth limitations constantly, negating any small clocking differential.

The Sony 1080p dual outputs are not an indication of performance – at best 1080p is an indication that Sony considered this resolution the sweet spot of the market. The use case of dual 1080p just shows that the RSX has a PC pedigree, and has been cobbled together with the console.

....

Besides developing hardware, ATI always helps developers by releasing tools, source code samples, etc. For example, we heard about a displaced subdivision surfaces algorithm developed by Vineet Goel of ATI Research Orlando. Are you helping Xbox 360 developers leverage the power of the Xenos GPU or is that a Microsoft responsibility?

Bob Feldstein: We have teamed with Microsoft to enable developers. We have had some members of the developer relations and tools teams work directly at developer events and assist in training Microsoft employees. We are ready to help at any time, but the truth is that we have been quite impressed by how Microsoft is handling developer relationships.

We will push ideas like Vineet’s (and he has others) through Microsoft. When we say the Xbox has headroom, it is through the ability to enable algorithms on the GPU, including interesting lighting and surface algorithms like subdivision surfaces. It would be impossible to implement these algorithms in real time without our unique features such as shader export.

Not a bad interview; mostly fluff, but the timing is interesting
 
I thought the daughter die was touted at 100M?

Too bad he shied away from any comparisons (or statements of power at all) to PC/RSX, but there is some interesting insight into their design choices, even if we've pretty much heard it before.

One thing they did right was build a mystique for this thing; no one seems to know just how capable the thing is (comparatively and in its own right), yet everyone assumes it's golden...

Oh, and it's interesting re: the daughter die being removed to increase usable yields (of the main die). Could this be the cause of the shortages: a lack of daughter dies?
 
bloodbob said:
Good thing it has infinite bandwidth; now no one can ever say it's bandwidth limited.

It's effectively infinite bandwidth because the ROPs will hit their fillrate limit before the eDRAM does.

:devilish: ;)
 
aaaaa00 said:
It's effectively infinite bandwidth because the ROPs will hit their fillrate limit before the eDRAM does.

:devilish: ;)

Ahh, good thing the eDRAM core has infinite storage capabilities.
 
Somewhat direct confirmation from ATI there that they are still working on (and are contracted for) die shrinks of Xenos.
 
Dave Baumann said:
Somewhat direct confirmation from ATI there that they are still working on (and are contracted for) die shrinks of Xenos.
So, this means it will be a Xenos on 80nm with lower power consumption that's cheaper to produce? Sounds promising. Maybe we'll see a refreshed X360 with a higher-clocked Xenos... just hoping.
 
satein said:
So, this means it will be a Xenos on 80nm with lower power consumption that's cheaper to produce?

While there might be an 80nm parent die, that wouldn't reduce power consumption...it would just be a bit cheaper to produce.

The first major reduction would be to 65nm for both the parent and the daughter. That should indeed reduce power consumption considerably. It's also possible that the plan is to combine the parent and daughter at 65nm, but I'm not sure if that'll be feasible considering the nature of eDRAM fabrication...it might still be too complex for that large of a chip.

Maybe we'll see a refreshed X360 with a higher-clocked Xenos... just hoping.

That wouldn't make sense.
 
Joe DeFuria said:
While there might be an 80nm parent die, that wouldn't reduce power consumption...it would just be a bit cheaper to produce.

Why rule out a possible drop in power consumption?
 
I wonder if increased yields and smaller processes will increase the chances of dropping one of the arrays (4 implemented, 1 dropped for redundancy - seemingly, anyway) so that the die consists of just 3 arrays with no redundancy.

Jawed
 
Jawed said:
I wonder if increased yields and smaller processes will increase the chances of dropping one of the arrays (4 implemented, 1 dropped for redundancy - seemingly, anyway) so that the die consists of just 3 arrays with no redundancy.

Jawed

Good point, Jawed. One of these process gens might produce a little 'extra' cut in the chip's costs simply by coupling the shrink with a drop of the redundant array.
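The trade-off can be sketched with a standard Poisson defect-yield model. All of the numbers below (defect density, per-array area) are made-up illustrative values, not real TSMC or ATI data; the point is just the shape of the argument:

```python
import math

# Toy Poisson yield model for the shader-array redundancy trade-off.
# All numbers are illustrative assumptions, not real process data.
D0 = 0.5           # defects per cm^2 (falls as the process matures)
ARRAY_AREA = 0.15  # cm^2 per shader array (hypothetical)
REST_AREA = 0.80   # cm^2 for the rest of the die (hypothetical)

def defect_free(area_cm2, d0=D0):
    """Probability that a region of the given area has no defects."""
    return math.exp(-d0 * area_cm2)

# 4 arrays with 1 spare: the rest of the die must be good, and at
# least 3 of the 4 arrays must be good.
p = defect_free(ARRAY_AREA)
p_3_of_4 = p**4 + 4 * p**3 * (1 - p)
y_spare = defect_free(REST_AREA) * p_3_of_4

# 3 arrays, no spare: everything must be good, but the die is smaller.
y_no_spare = defect_free(REST_AREA) * p**3

print(f"4 arrays, 1 spare: {y_spare:.1%} yield")
print(f"3 arrays, 0 spare: {y_no_spare:.1%} yield, smaller die")
```

As the defect density falls with process maturity, the yield gap between the two layouts shrinks while the area saving from dropping the spare stays, which is exactly when cutting the redundant array starts to pay.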
 
Does anything happen to the layout during a process shrink? If they were to remove the redundant array, would they rearrange everything?

Edit: does someone have the die shot on-hand?
 
Alstrong said:
Does anything happen to the layout during a process shrink? If they were to remove the redundant array, would they rearrange everything?

Edit: does someone have the die shot on-hand?

It would require a re-spin and core revision, but one would imagine at some point the benefits of doing so would outweigh the costs. That's something that will depend solely on volumes, though.
 
xbdestroya said:
Why rule out a possible drop in power consumption?

Because IIRC the only real physical difference between 90 and 80 nm is the distance "between" transistors. The actual transistors themselves are not smaller. The "only" benefit of going from 90 to 80 nm is the size of the final product....cost.

65 nm is the next true process node; lower power consumption per transistor is essentially a given.
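The half-node vs. full-node distinction can be put in rough numbers with the ideal-shrink approximation, where die area scales with the square of the linear dimension. Real half-nodes like 80nm only approximate this, so treat these as ballpark figures:

```python
# Ideal area scaling between process nodes: under a straight linear
# shrink, area scales as (new_node / old_node)^2.
for new_node in (80, 65):
    area_scale = (new_node / 90) ** 2
    print(f"90nm -> {new_node}nm: ~{area_scale:.0%} of original die area")
# 90 -> 80nm: ~79% (mainly a cost saving, per the optical-shrink argument)
# 90 -> 65nm: ~52% (a full node: smaller, faster, lower-power transistors)
```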
 
Jawed said:
I wonder if increased yields and smaller processes will increase the chances of dropping one of the arrays (4 implemented, 1 dropped for redundancy - seemingly, anyway) so that the die consists of just 3 arrays with no redundancy.

Jawed

Do we even know for a fact that there is one fully redundant shader array? I thought that was still questionable.
 
Joe DeFuria said:
Do we even know for a fact that there is one fully redundant shader array? I thought that was still questionable.


There was a die shot going around the forums a while back... can't seem to find it. It showed four rectangles in a column, IIRC. Presumably, each is a shader array.
 
Joe DeFuria said:
Because IIRC the only real physical difference between 90 and 80 nm is the distance "between" transistors. The actual transistors themselves are not smaller. The "only" benefit of going from 90 to 80 nm is the size of the final product....cost.

65 nm is the next true process node; lower power consumption per transistor is essentially a given.

Well, if that's the case that's the case. Certainly though we've seen power savings on the other 'half-nodes,' such as 110nm most recently. Or is there a difference between the two here? I haven't read up too much on TSMC's 80nm process, but are you saying then that it does not really qualify for 'half-node' status either?

Anyway, if not, I suppose the power situation wouldn't improve; in fact, higher transistor density might lead to a worse heat/power profile. Still, if yields are good to great at that point, maybe a core revision along with the shrink would allow them a drop in core voltage, giving a power benefit yet.
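The core-voltage lever is the big one here: dynamic power scales roughly as C*V^2*f, so even a modest Vcore drop pays off quadratically. A quick sketch with purely hypothetical numbers:

```python
# Dynamic power scales roughly as P ~ C * V^2 * f.
# Hypothetical: same clock, capacitance reduced by a full-node shrink,
# with and without a drop in core voltage. Illustrative numbers only.
V_OLD, V_NEW = 1.2, 1.1  # hypothetical core voltages (volts)
C_SCALE = 0.75           # assumed switched-capacitance reduction

p_shrink_only = C_SCALE                           # relative to original
p_shrink_and_v = C_SCALE * (V_NEW / V_OLD) ** 2
print(f"Shrink only:         {p_shrink_only:.0%} of original power")
print(f"Shrink + Vcore drop: {p_shrink_and_v:.0%} of original power")
```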
 
hmm...

RSX is cobbled together in the system...

RSX will hit bandwidth limitations all the time (even with 48GB/s I suppose)...

I thought the link between Xenos and Xenon was 10.8GB/s read/write, not 24GB/s... what about the CPU having access to system memory...

The PPU is trying to take the same market as GPUs... I haven't seen GPUs doing physics in games yet... well... I could wait for a CPU that can accelerate physics just as much on top of all its other tasks... or a GPU that can blast through visuals and physics at the same level at the same time...

Smells like PR to me.
--------------------------------

Still there seems to be something interesting in what was said...

I didn't know that shaders could use any instruction from the ISA with Xenos. I asked before on this forum and was told X360 devs would still have to use vertex and pixel shaders... but if those shaders can use any instruction, then aren't shaders unified at the software level? Or is it more like... a pixel op can now be used with vertex data and vice versa, and there are still distinct shader types? The latter sounds more powerful to me but maybe isn't as flexible as the former. Any thoughts on this? Seems like an interesting topic to me. I wonder what the possibilities are.
 