ATI Technologies Interview: The Power of the Xbox 360 GPU

scificube said:
I see two arrows between Xenos and the CPU. Why?

Between Xenos and main memory I see one arrow. Why?

[Image: bandwidths.gif]


Double arrow is bi-directional, single arrow is unidirectional data flow.

The above diagram is misleading, as it doesn't show any blocking on bus accesses. This was discussed in the thread below. The conclusion was that peak access to GDDR at 22.4 GB/s limits the CPU-to-parent-GPU link to 10.8 GB/s (not 10.8 + 10.8).

i.e. 10.8 GB/s from the L2 cache + 22.4 GB/s from GDDR is the 'peak' (33.2 GB/s aggregate). Just look at the leaked diagram from 2004 (bottom-right corner, '7'), and the more detailed data-flow diagram below.

http://www.beyond3d.com/forum/showthread.php?t=23775

[Image: 012l.jpg]


[Image: xbox2.gif]
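The blocking arithmetic described above can be checked in a couple of lines (all figures come from the leaked diagrams, so treat them as assumptions rather than confirmed specs):

```python
# Peak link figures from the leaked diagrams (GB/s) - assumptions,
# not official specs.
l2_read = 10.8   # CPU L2 cache -> GPU read link
gddr3 = 22.4     # GPU <-> GDDR3 main memory

# When GDDR3 is saturated at 22.4 GB/s, the CPU link is limited to
# 10.8 GB/s, so the aggregate peak is the simple sum of the two links,
# not 10.8 + 10.8 + 22.4.
aggregate = l2_read + gddr3
print(round(aggregate, 1))  # 33.2
```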
 
Dave Baumann said:
Somewhat direct confirmation from ATI there that they are still working on (and are contracted to) die shrinks of Xenos.
Well, duuuuuur.

The quicker they get it done, the quicker Ballmer gets closer to his "turn a profit in '07" goal. Half of the "engineers" that were working on Xenos were probably MS accountants...
 
scooby_dooby said:
That jumped out at me too, but I didn't mention it for fear of looking dumb!

"The 22.4GB/sec link is the connection to main memory ... The GPU is also directly connected to the L2 cache of the CPUs – this is a 24GB/sec link. Memory bandwidth is extremely important."

This is news to me...
Xenos isn't a traditional GPU, not only because it uses USA, but also because it's a hybrid GPU/northbridge.

So you have to abstract Xenon --> Xenos as Xenon to the GPU plus Xenon to the NB. Does it still seem so weird that the NB has direct access to the L2 cache?
 
xbdestroya said:
It would require a re-spin and core revision, but one would imagine at some point the benefits of doing so would outweigh the costs. That's something that will depend solely on volumes, though.

There are no more optical shrinks. Haven't been for quite a while. Any shrink requires new layout and the full backend tool flow.

Aaron Spink
speaking for myself inc.
 
Joe DeFuria said:
Because IIRC the only real physical difference between 90 and 80 nm is the distance "between" transistors. The actual transistors themselves are not smaller. The "only" benefit of going from 90 to 80 nm is the size of the final product... cost.

65 nm is the next true process node, lower power consumption per transistor is essentially a given.

In partial process shrinks, the transistors and/or metal stack are changed to produce a smaller die size for a given design. This will generally result in lower power at the same voltages and frequencies.

Aaron Spink
speaking for myself inc.
 
I've read at X86secrets that the Athlon 64 has 6.8 GB/s to main RAM.
Is that up-and-down, or read only, so it should be 13.6 GB/s read+write?
Can we consider Xenon bottlenecked in regard to this figure?
 
liolio said:
I've read at X86secrets that the Athlon 64 has 6.8 GB/s to main RAM.

I can't remember, but if it has DDR2 at 400 MHz (800 MHz effective) on a 64-bit bus, then

64bit/8 x 0.8 Ghz ~ 6.4 GB/sec
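That back-of-the-envelope calculation generalizes; a minimal sketch (the 6.4 GB/s result is the figure derived above):

```python
def peak_bandwidth_gb_s(bus_bits, effective_mt_s):
    """Peak transfer rate: bus width in bytes times effective transfer rate.

    effective_mt_s is the effective (DDR-doubled) rate in mega-transfers
    per second.
    """
    return (bus_bits / 8) * effective_mt_s / 1000.0

# 64-bit bus at 800 MT/s effective -> 6.4 GB/s, as above
print(peak_bandwidth_gb_s(64, 800))  # 6.4
```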

liolio said:
Is that up-and-down, or read only, so it should be 13.6 GB/s read+write?

That sounds like an aggregate figure, i.e. bidirectional at 6.4 GB/sec.

liolio said:
Can we consider Xenon bottlenecked in regard to this figure?

Which figure?
 
The A64 (in most configurations) will be running 200MHz DDR (400 effective) on a 128-bit bus, just an FYI.
 
Has Dave (or 'you', Dave, if you're reading :) ) ever gotten official confirmation of any of these bandwidth numbers from ATI, or are we going on 'leaked' and 'leaked leaked' documents only?
 
Jaws said:
I can't remember, but if it has DDR2 at 400 MHz (800 MHz effective) on a 64-bit bus, then

64bit/8 x 0.8 Ghz ~ 6.4 GB/sec



That sounds aggregate, bidirectional at 6.4 GB/sec



Which figure?

In regard to the bandwidth available to a current PC CPU (like the Athlon 64 / K8):
Can we consider the bandwidth available to Xenon from RAM a bottleneck?
I hope that's clearer...
 
Besides developing hardware, ATI always helps developers by releasing tools, source code samples, etc. For example, we heard about a displaced subdivision surfaces algorithm developed by Vineet Goel of ATI Research Orlando. Are you helping Xbox 360 developers leverage the power of the Xenos GPU or is that a Microsoft responsibility?

Bob Feldstein: We have teamed with Microsoft to enable developers. We have had some members of the developer relations and tools teams work directly at developer events and assist in training Microsoft employees. We are ready to help at any time, but the truth is that we have been quite impressed by how Microsoft is handling developer relationships.

We will push ideas like Vineet's (and he has others) through Microsoft. When we say the Xbox has headroom, it is through the ability to enable algorithms on the GPU, including interesting lighting and surface algorithms like subdivision surfaces. It would be impossible to implement these algorithms in real time without our unique features such as shader export.

Can someone please explain the benefits of subdivision surfaces and why they've been so difficult to implement in real time? How will memexport help in this situation?
 
scificube said:
I've always wondered how it is that IBM whipped up a solution that was nearly as fast as FlexIO without putting in the time and resources that Rambus did.

I don't think they actually accomplished that.
Actually it looks quite a bit like the elastic bus on the 970 family of processors. In its fastest incarnation on an Apple machine, it is 1.35 GHz at 32-bits width in either direction, or 5.4GB/s unidirectional (x2 for PC comparison purposes since there is another one like it going in the other direction).
The approach has the inherent drawback that it can't adapt to data-flow direction, but on the other hand it doesn't have to deal with bus turnaround, for instance, and it gains advantages in frequency. Overall, it's quite nice for talking to a northbridge, arguably superior to what goes into the P4s or the upcoming Merom/Conroe. (Unarguably so at these bandwidths, even if real-life performance doesn't match the theoretical.)

Can anyone confirm or deny my hypothesis regarding the connection? I'm not very interested in that CPU, so I haven't made an effort to stay on top of the information surrounding it. If it is a close relative to the 970 design, there are some pretty good descriptions available.
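For reference, the 970-style numbers quoted above work out as follows (a sketch; the 32-bit width and 1.35 GHz clock are the figures from the post):

```python
# 970-family elastic bus: 32 bits wide per direction at 1.35 GHz,
# with an identical link running the opposite way.
bits_per_direction = 32
clock_ghz = 1.35

one_way = bits_per_direction / 8 * clock_ghz   # GB/s in one direction
both_ways = one_way * 2                        # aggregate, for PC-style comparisons
print(round(one_way, 1), round(both_ways, 1))  # 5.4 10.8
```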
 
I think it's more a matter of production cost than bandwidth for these bus technologies. FlexIO is geared toward mass production for consumer electronics, while the IBM bus technology looks more heavy-duty and server-oriented.
 
liolio said:
In regard to the bandwidth available to a current PC CPU (like the Athlon 64 / K8):
Can we consider the bandwidth available to Xenon from RAM a bottleneck?
I hope that's clearer...

Peak data flow to/from XeCPU is,

~ 22.4 (GDDR3) + 10.8 (Read L2 Cache) ~ 33.2 GB/sec

Peak data flow to/from A64 is,

~ 6.4 (DDR) + 8 (PCI-e x16) ~ 14.4 GB/sec

So looking at the data flow to/from both these CPUs, it's clear that XeCPU has more breathing space than the A64. Also, given that the backbuffer resides in eDRAM on X360, this should add to the breathing space. IMO, the X360's natural bottleneck would be the 1 MB cache being shared across 3 cores. Getting 6 threads to share that cache is asking for cache thrashing. I know you can lock the caches, but that still leaves very little per core/thread...

EDIT: Just wanted to add that texture-data bandwidth would be shared with / in contention with the XeCPU's GDDR3 traffic, so that would be a potential bottleneck too...
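The cache-sharing worry above is easy to quantify (a rough, illustrative even split; real partitioning would depend on the workload and any cache locking):

```python
# 1 MB of L2 shared by 3 cores / 6 hardware threads on the XeCPU
l2_bytes = 1 * 1024 * 1024

per_core_kb = l2_bytes // 3 // 1024    # even split per core
per_thread_kb = l2_bytes // 6 // 1024  # even split per hardware thread
print(per_core_kb, per_thread_kb)  # 341 170
```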
 
Jaws said:
IMO, I think the X360's natural bottleneck would be around the 1 MB cache being shared across 3 cores.
Data flow in the CPU can bypass the L2 cache (to prevent cache thrashing); I read that in an X360 patent.
 
Lysander said:
Data flow in the CPU can bypass the L2 cache (to prevent cache thrashing); I read that in an X360 patent.

The L2 cache is still involved in intra-processor data flow. The point being, 1 MB sounds too little to prevent this...
 
Hardknock said:
Can someone please explain the benefits of subdivision surfaces and why they've been so difficult to implement in real time? How will memexport help in this situation?

Anybody know anything about this? Should I make a new topic? :p
 
Subdivision surfaces are a type of HOS that's quite widely used in the CGI industry. They have very few restrictions: arbitrary topology is allowed, so you can easily model T-junctions and thus complex organic surfaces; they're also easy to UV map, and the modeling toolset is quite evolved by now.

I can only guess about the programming issues. One probable glitch is that if you can't do adaptive tessellation, the polygon count will quadruple with each additional level of subdivision, and that's quite a lot if you want to switch between different levels of detail. But adaptive tessellation is pretty hard with them; the only really good way I know of is PRMan's micropolygons.
With that said, several Nvidia tech demos used subdivs, but used the CPU for the tessellation. Xenos has hardware support for a different kind of HOS, so for subdivision surfaces it would have to rely on the XCPU, and I bet that's still quite slow.

Another thing to consider is that while the topology is arbitrary, it's still better to keep it rather organized to get good results. Which usually means more polygons than absolutely necessary... And all it does is give a smoother silhouette and shading, but no extra detail - you'd need displacement mapping for that. So it isn't a feature with great rewards in itself. In this upcoming generation, displacement on characters is quite unlikely, so I believe most developers will choose low-poly models with normal mapping instead. Or maybe do something with the tessellator in Xenos; King Kong is supposed to use it on the T-Rex, but I have yet to see it.
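The per-level quadrupling mentioned above adds up quickly; a quick illustration with a hypothetical 2,000-face base cage:

```python
base_faces = 2000  # hypothetical base control cage

for level in range(4):
    # each uniform (Catmull-Clark-style, quad-mesh) subdivision step
    # quadruples the face count
    print(level, base_faces * 4 ** level)
# without adaptive tessellation, level 3 already needs 128000 faces
```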
 
Subdivision surfaces (which I think is simply the creation of extra vertices by linear interpolation between existing vertices - dunno though!) is a separate process from what memexport supports. Subdivision surfaces are supported directly by a piece of fixed function hardware in Xenos, as far as I can tell.

Memexport is, in this instance, about writing new vertex data (i.e. coordinates, colour, orientation etc.) which a shader will have derived from a list of vertices (or objects). The shader could use a subdivision algorithm, but it's my understanding it's more likely to use higher order surface type algorithms, or to work out level-of-detail for tessellation. Additionally you'll get animation and skinning shaders implemented on the GPU, which works out the appearance of models and writes out (memexport) a final version of all the vertex data itself - something normally done by the CPU. Xenos then reads that data back in for actual rendering - perhaps multiple times, e.g. for shadowing algorithms, etc.

Sadly it's hard to get any devs who know about geometry/vertex shading under Xenos (or DX10) to comment in depth about this.

I'm sure if you search for memexport under my name you'll get a pile of stuff - but I'm out of my depth in general... Search for DX10 and geometry or "stream out" (or "streamout"), too.

Jawed
 