New article on Xbox360 GPU

Jawed · Sep 20, 2005

Shifty Geezer said:
I was saying the CPU isn't connected to anything other than the BIU, not that the BIU connects to nothing other than the CPU.

Fair dos, Daphne.

And I'm not really arguing any point about architecture other than that the diagram isn't clear. E.g. the CPU has 21.4, 21.6 or whatever GB/s to the BIU, but the data BW from the BIU to elsewhere is unclear. And it shows 16 GB/s each for textures and vertices, which is 32 GB/s, more than the RAM BW. So obviously these are peak figures.

Which is the normal way to read a system diagram.
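As a rough sketch of the "peak figures" point, the per-path numbers on the diagram can be summed and compared against the shared RAM bandwidth. The function and variable names here are illustrative only, and the figures are the ones quoted in the post above.

```python
RAM_BW = 22.4  # GB/s, unified RAM bandwidth shown on the diagram

# Per-path peak figures quoted from the diagram (GB/s).
peaks = {"textures": 16.0, "vertices": 16.0}

def oversubscription(peak_paths, shared_bw):
    """Ratio of the summed per-path peaks to the shared bandwidth behind them."""
    return sum(peak_paths.values()) / shared_bw

ratio = oversubscription(peaks, RAM_BW)
print(f"sum of peaks = {sum(peaks.values())} GB/s, {ratio:.2f}x the RAM bandwidth")
```

A ratio above 1 means the paths cannot all run at their peak at the same time if they are fed only from RAM.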

But dependencies aren't obvious. E.g. does vertex data from the CPU eat into the 22.4 GB/s RAM BW, or does it pass through a different line (XPS in the diagram) and thus not eat into the RAM BW?

It's my assertion that CPU-GPU data passes entirely through the XPS Line Buffer, and therefore does not eat into the 22.4GB/s RAM bandwidth.

There's nothing on the diagram that indicates the contrary. The XBox2 leak diagram indicates that the GPU can accept 33.2GB/s and nothing on this diagram contradicts that.

Jawed

Jawed · Sep 20, 2005

Jaws said:
Errm...I've just stated on several occasions that there are no 'crossed' arrows that represent bandwidth to the memory controller that's greater than 22.4 GB/s and I've also stated the 'text' from the 'xenon leak' from last year. I'm not pulling this out of thin air...

XPS Line Buffer is not on the "FSB". Neither is the IOC.

That's your mistake as far as I can tell. CPU accesses to/from RAM obviously interfere with GPU accesses to RAM. But CPU writes directly to the GPU have an independent access path to the GPU which is via the XPS Line Buffer.

Jawed

Jawed · Sep 20, 2005

Shifty Geezer said:
If so, that'll be an effective BW of...

10.6 RAM to CPU (code and data)
22.4-10.6 = 11.8 RAM to GPU (texture, vertex etc.)
10.6 CPU to GPU via XPS (procedural data)

for 33.4 GB/s, or something like that.

Yes, that's one scenario. Similar to the one I posted earlier. And by the way, it really is 10.8, not 10.6.
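With the corrected 10.8 GB/s figure, the scenario can be tallied as a quick sketch. The names are illustrative, and the split assumes the CPU reads from RAM at its full 10.8 GB/s, which is the worst case for the GPU rather than a measured workload.

```python
# Peak figures quoted in the thread (GB/s); the split itself is an assumption.
RAM_BW = 22.4        # unified RAM bandwidth shared by CPU and GPU
CPU_LINK_BW = 10.8   # CPU link bandwidth in each direction

def scenario(cpu_ram_read=CPU_LINK_BW):
    """Return (RAM->CPU, RAM->GPU, CPU->GPU via XPS, total), rounded to 0.1 GB/s."""
    ram_to_cpu = cpu_ram_read
    ram_to_gpu = RAM_BW - ram_to_cpu   # RAM bandwidth left over for the GPU
    cpu_to_gpu = CPU_LINK_BW           # procedural data that bypasses RAM
    total = ram_to_cpu + ram_to_gpu + cpu_to_gpu
    return tuple(round(x, 1) for x in (ram_to_cpu, ram_to_gpu, cpu_to_gpu, total))

print(scenario())  # (10.8, 11.6, 10.8, 33.2)
```

The total matches the 33.2 GB/s that the leak diagram gives as the GPU's input bandwidth.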

But if XB360 has this direct data feed from CPU to GPU without going through RAM, why haven't MS said something of the sort?

Are you kidding? What do you think XPS is all about? What do you think the locked L2 with GPU-direct access is all about?

We know PS3 has a dedicated CPU<>GPU data flow, and MS have been quick to identify other data bandwidths they have such as eDRAM data. Why not this dedicated line that effectively frees up 10 GB/s if really present?

It's been there all along. Not a surprise to anyone who's been paying attention.

Jawed

j^aws · Sep 20, 2005

Jawed said:
XPS Line Buffer is not on the "FSB". Neither is the IOC.

That's your mistake as far as I can tell. CPU accesses to/from RAM obviously interfere with GPU accesses to RAM. But CPU writes directly to the GPU have an independent access path to the GPU which is via the XPS Line Buffer.

Jawed

I'm not even disagreeing with this. It's the aggregate 22.4 GB/s b/w *only* being stated to the memory hub. This is what's suggesting that 22.4 GB/s is being shared by all of the above as I stated in the last page...

Shifty Geezer · Sep 20, 2005

Jawed said:
Are you kidding? What do you think XPS is all about? What do you think the locked L2 with GPU-direct access is all about?

Locked cache is to use it as an efficient buffer instead of cache thrashing. It's optional and doesn't need to be confined to GPU writes. Locking cache doesn't inherently mean direct data flows to GPU.

It's been there all along. Not a surprise to anyone who's been paying attention.

Well, an awful lot of people haven't been paying attention then. I certainly don't remember diagrams or charts showing 256 GB/s eDRAM BW + 32/16 GB/s GPU<>eDRAM + 22.4 GB/s GPU <> RAM <> RAM + 10 GB/s CPU <> GPU. I don't remember even Dave's article picking up on direct data feeds from the CPU either.

pipo · Sep 20, 2005

london-boy said:
Am i the only one to have noticed the BIG sign saying "Microsoft Confidential" at the bottom of that picture? Not that i really care, but it's not very confidential now is it...

Yeah. Bit strange when you think of it. You can't post a link to a scan because of copyright issues, but this is fine...

Whatever.

Oh, and that CPU<>GPU connection thingy is really old Shifty.

Dave's article said:
As the CPU is going to be using Xenos to handle all its memory transfers, the connection between the two has 10.8GB/s of bandwidth both upstream and downstream simultaneously. Additionally the Xenos graphics processor is able to directly lock the cache of the CPU in order to retrieve data directly from it without it having to go to system memory beforehand. The purpose of this is that one (or more, if wanted) of the three CPU cores could be generating very high levels of geometry that the developer doesn't want to, or can't, preserve in the memory footprints available on the system when in use. High-resolution dynamic geometry such as grass, leaves, hair, particles, water droplets and explosion effects are all examples of one type of scenario that the cache locking may be used in.

Jawed · Sep 20, 2005

If the XPS Line Buffer wasn't in that diagram, I'd have some sympathy with the idea that there's a bottleneck along the 22.4GB/s datapath caused by CPU-GPU data.

As it is, UMA creates a distinct bottleneck for all GPU<>RAM and CPU<>RAM accesses.

The way I read it, CPU-RAM accesses will be low bandwidth during general rendering - mostly data (textures) streaming in off DVD and world updates (physics, AI).

Dynamic graphics data will travel from RAM-CPU, be "processed" (e.g. animated) and go from CPU-GPU via the XPS Line Buffer.

Separately, the GPU will be taking the incoming vertex data and tessellating it - so writing it to RAM. And then re-reading it (directly from RAM, not via the CPU) to perform the final phase of tessellation and rasterisation/pixel shading.

Jawed

j^aws · Sep 20, 2005

Jawed said:
If the XPS Line Buffer wasn't in that diagram, I'd have some sympathy with the idea that there's a bottleneck along the 22.4GB/s datapath caused by CPU-GPU data.

As it is, UMA creates a distinct bottleneck for all GPU<>RAM and CPU<>RAM accesses.

The way I read it, CPU-RAM accesses will be low bandwidth during general rendering - mostly data (textures) streaming in off DVD and world updates (physics, AI).

Dynamic graphics data will travel from RAM-CPU, be "processed" (e.g. animated) and go from CPU-GPU via the XPS Line Buffer.

Separately, the GPU will be taking the incoming vertex data and tessellating it - so writing it to RAM. And then re-reading it (directly from RAM, not via the CPU) to perform the final phase of tessellation and rasterisation/pixel shading.

Jawed

The other thing to note is that the diagram doesn't show any latencies, so even if the 22.4 GB/s is shared by all, there should be different latencies involved from cache reads and GDDR reads etc...

Jawed · Sep 20, 2005

Shifty Geezer said:
Locked cache is to use it as an efficient buffer instead of cache thrashing. It's optional and doesn't need to be confined to GPU writes. Locking cache doesn't inherently mean direct data flows to GPU.

But the GPU is a principal client of the locked-cache algorithm.

Well, an awful lot of people haven't been paying attention then. I certainly don't remember diagrams or charts showing 256 GB/s eDRAM BW + 32/16 GB/s GPU<>eDRAM + 22.4 GB/s GPU <> RAM <> RAM + 10 GB/s CPU <> GPU.

You've just described the leak diagram, well done:

At least you're paying attention now!

I don't remember even Dave's article picking up on direct data feeds from the CPU either.

This is as close as Dave gets, it seems:

Additionally the Xenos graphics processor is able to directly lock the cache of the CPU in order to retrieve data directly from it without it having to go to system memory beforehand. The purpose of this is that one (or more, if wanted) of the three CPU cores could be generating very high levels of geometry that the developer doesn't want to, or can't, preserve in the memory footprints available on the system when in use. High-resolution dynamic geometry such as grass, leaves, hair, particles, water droplets and explosion effects are all examples of one type of scenario that the cache locking may be used in.

http://www.beyond3d.com/articles/xenos/index.php?p=03

Jawed

one · Sep 20, 2005

Goto and Nishikawa write nothing about XPS and Goto's diagram just ignores XPS as you see.

blakjedi · Sep 20, 2005

Why can the 3D core read more than any other device can transfer to it (33.2 GB/s) ? 22.4 GB/s= max read from Memory + 10.8GB/s = max read from CPU = 33.2 GB/s.

So the GPU can read directly from CPU and RAM simultaneously at 33.2 GB/s... no bottleneck there, but then the B/W to RAM is completely saturated... new bottleneck. If the RAM had a read B/W of 44.8GB/s then there would be no bottlenecks in the system the way it's currently designed...

Shifty Geezer · Sep 20, 2005

Jawed said:
You've just described the leak diagram, well done:

At least you're paying attention now!

Okay, my ignorance. From the earlier

10.8GB/s of bandwidth both upstream and downstream simultaneously. Additionally the Xenos graphics processor is able to directly lock the cache of the CPU in order to retrieve data directly from it without it having to go to system memory beforehand

I wrongly understood that to be across RAM IO. The leaked specs definitely show the direct CPU<>GPU connection, though the CEDEC diagram seems to show only one-way XPS, contrary to "10.8GB/s of bandwidth both upstream and downstream simultaneously".

Jawed · Sep 20, 2005

blakjedi said:
Why can the 3D core read more than any other device can transfer to it (33.2 GB/s) ? 22.4 GB/s= max read from Memory + 10.8GB/s = max read from CPU = 33.2 GB/s.

The GPU is the primary user of bandwidth in any modern gaming system (console or PC).

In a PC you'll find that the CPU has access to only 4-6GB/s of real world bandwidth. Whereas the GPU has 30GB/s+.

So the GPU can read directly from CPU and RAM simultaneously at 33.2 GB/s... no bottleneck there, but then the B/W to RAM is completely saturated... new bottleneck. If the RAM had a read B/W of 44.8GB/s then there would be no bottlenecks in the system the way it's currently designed...

I think you're forgetting the 32GB/s of pixel bandwidth to the EDRAM.

The GPU only needs RAM write bandwidth for:

phase 1 tessellation data (vertex data)
other GPU generated data (e.g. physics calcs if the GPU is used as a physics co-processor)
finally rendered frame - when the frame is finished in the EDRAM unit, it has to go into RAM before it can be shown on screen

Jawed

Jawed · Sep 20, 2005

Shifty Geezer said:
Okay, my ignorance. From the earlier
I wrongly understood that to be across RAM IO. The leaked specs definitely show the direct CPU<>GPU connection, though the CEDEC diagram seems to show only one-way XPS, contrary to "10.8GB/s of bandwidth both upstream and downstream simultaneously".

Careful, the leak shows CPU<>NB at 10.8GB/s each way.

The leak does not explicitly indicate the XPS Line Buffer. The only hint is the 33.2GB/s bandwidth from NB to GPU.

Jawed

Rockster · Sep 20, 2005

blakjedi said:
Why can the 3D core read more than any other device can transfer to it?

I think what you're missing is that the CPU is generating data procedurally. XPS (XBox procedural synthesis). You aren't just moving data around from place to place; but are instead using a small amount of input data, from which the CPU generates a large amount of output data.

Here's more on XPS that might help: http://arstechnica.com/articles/paedia/cpu/xbox360-1.ars/2

blakjedi · Sep 20, 2005

Not completely OT but different... In PS3 the dot product calculations (51+ billion) include both the Cell and RSX work. We know that XeCPU can generate 9 billion dot products with dedicated hardware... is there any way to determine how many DP the Xenos can generate so we get a better comparable number? Great thread BTW. This thread is exemplary of the original reason I started reading B3D. And the Jaws/Jawed thing never gets old.

Chisholm · Sep 20, 2005

Gerry said:
Am I the only person who, whenever I come across a technical discussion between Jaws and Jawed (and there are a few), imagines a schizophrenic hunched over their keyboard, holding a bizarre conversation amongst themselves in two completely different accents?

Maybe it's just me...

No, it's also me. It may have to do as well with the fact that their post count is almost identical.

Actually, they might just be twins. If that was the case, it would only be a matter of finding out who's the evil one. ;-)

PeterT · Sep 20, 2005

Jawed said:
The GPU is the primary user of bandwidth in any modern gaming system (console or PC).

In a PC you'll find that the CPU has access to only 4-6GB/s of real world bandwidth. Whereas the GPU has 30GB/s+.

Very true, but current PCs don't have 3 cores with VMX units chewing through vector data.

I believe that -- if the developers manage to use the "6 thread" CPU to its full potential -- the unified memory, or rather its bandwidth, will be one of the more bothersome bottlenecks on X360.

Rather OT: My personal favourite would still be a PS3-like (Cell) architecture with Xenos, though more for the reason that Xenos is more interesting and flexible for GPGPU tasks than any belief that it will be faster than RSX in average gaming situations.

Jawed · Sep 20, 2005

You need to read that ArsTechnica article linked above.

The bottleneck you allude to would be arrived at simply by entirely ignoring the procedural synthesis feature that links Xenon and Xenos - it's there to save vast amounts (10s of GB) of bandwidth. 10.8GB/s from CPU-GPU is the equivalent of 21.6GB/s CPU-RAM-GPU.
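The 10.8 to 21.6 equivalence can be sketched as follows, assuming a simple model in which data handed over through RAM is written once by the CPU and read once by the GPU. The function name is hypothetical, and the model ignores caching and burst effects.

```python
def ram_traffic_for_handoff(payload_gbps, via_ram):
    """RAM bandwidth (GB/s) consumed to hand `payload_gbps` from CPU to GPU.

    Via RAM, each byte crosses the memory bus twice (CPU write, then GPU read).
    Via the direct CPU->GPU path, it never touches RAM.
    """
    return 2 * payload_gbps if via_ram else 0.0

direct = ram_traffic_for_handoff(10.8, via_ram=False)
through_ram = ram_traffic_for_handoff(10.8, via_ram=True)
print(f"direct: {direct} GB/s of RAM traffic, via RAM: {through_ram:.1f} GB/s")
```

Under this model, routing the same 10.8 GB/s through RAM would consume nearly all of the 22.4 GB/s unified bandwidth.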

I fully expect Cell/RSX to work the same way, for what it's worth. Though I'm not aware of confirmation, as yet, that vertex/texture data from Cell can be sent to RSX without going via RAM (either XDR or GDDR3). But the 20GB/s bandwidth from Cell to RSX sounds ideal.

Jawed

j^aws · Sep 20, 2005

blakjedi said:
Not completely OT but different... In PS3 the dot product calculations (51+ billion) include both the Cell and RSX work. We know that XeCPU can generate 9 billion dot products with dedicated hardware... is there any way to determine how many DP the Xenos can generate so we get a better comparable number?...

Yes it's off-topic!

Xenos has 48 vec4 units, each capable of a dot product per cycle.

Xenos ~ 48*0.5 GHz ~ 24 GigaDots/sec

X360 ~ 9+24 ~ 33 GigaDots/sec

http://www.beyond3d.com/forum/showthread.php?t=20094
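The same estimate, written out as a small sketch. It uses only the figures from the post (48 vec4 units, one dot product per unit per cycle, 0.5 GHz, and 9 billion dot products per second for the CPU); the names are illustrative, and these are peak rates rather than measured throughput.

```python
XENOS_VEC4_UNITS = 48      # vec4 ALUs, one dot product per cycle each
XENOS_CLOCK_GHZ = 0.5      # 500 MHz core clock
XECPU_GIGADOTS = 9.0       # CPU figure quoted in the thread

def peak_gigadots(units, clock_ghz, dots_per_unit_per_cycle=1):
    """Peak dot products per second, in billions."""
    return units * clock_ghz * dots_per_unit_per_cycle

xenos = peak_gigadots(XENOS_VEC4_UNITS, XENOS_CLOCK_GHZ)
system = XECPU_GIGADOTS + xenos
print(f"Xenos ~ {xenos} GigaDots/sec, X360 ~ {system} GigaDots/sec")
```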
