Xenos - invention of the BackBuffer Processing Unit?

Jawed said:
nAo said:
Many GPUs architecture can be seen as a (non symmetric) multiprocessors architecture:
Nvidia calls NV40 a 3 processors architecture: vertex shaders + pixel shaders + ROPs.
ROPs even run at memory clock not GPU core clock, and it's a given ROPs have their local store/cache that hold some pixel tiles.

You can go further and describe each quad-pipeline as a processor core. ;)

What's interesting is that in R300 and up, the quad-pipelines are each able to run a different shader.

In NV40 it seems that the same shader is running on all pipelines.

So R420 is, for example, 4-way MIMD, whereas NV40 is 16-way SIMD.

I think the vertex pipes in both architectures are purely independent and parallel.

Hope I've got that all correct - wouldn't want to start a second off-forum flame-war in only two posts!

Jawed
Where did you get the info about NV40 only processing 1 shader per clock over its 4 quads? I had understood that each quad processed its own discrete set of pixels, but that the quads were assigned a group of pixels on a case-by-case basis from a triangle, unlike R3/4xx, which sets up its pixel shaders with tiles.
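To make the SIMD-vs-MIMD distinction being argued here concrete, here's a toy sketch. It is purely illustrative; the function names and structure are mine, not anything from NVIDIA or ATI documentation. It just contrasts one shader program shared by all quads with each quad getting its own assignment.

Code:
QUADS = 4  # both NV40 and R420 organise their 16 pixel pipes as 4 quads of 4

def run(quad, shader, pixels):
    # stand-in for the hardware executing one shader program on one quad
    print("quad %d: %s on %d pixels" % (quad, shader, len(pixels)))

def dispatch_simd(pixels, shader):
    # the "NV40 is 16-way SIMD" claim: every quad runs the same shader
    for q in range(QUADS):
        run(q, shader, pixels[q::QUADS])

def dispatch_mimd(per_quad_work):
    # the "R420 is 4-way MIMD" claim: each quad may run a different shader
    for q, (shader, quad_pixels) in enumerate(per_quad_work):
        run(q, shader, quad_pixels)

dispatch_simd(list(range(32)), "shader_A")
dispatch_mimd([("shader_A", range(8)), ("shader_B", range(8)),
               ("shader_A", range(8)), ("shader_C", range(8))])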
 
Jawed said:
Nitehawk - I think the IHVs use simulators for this kind of stuff.

Jawed

Do you think they use simulators to actually simulate the chip itself, or build statistical performance models? I'm curious, because if they actually simulated the chip itself I'd be worried that the underlying system would color the performance of the simulated part.

Nite_Hawk
 
This page:

http://www.beyond3d.com/previews/nvidia/nv40/index.php?p=7

implies that all quads will render a single triangle.

What's not clear is how soon a new triangle can be rendered by spare quads. It would make sense that each quad has separate instruction decode and that spare quads can start the new triangle immediately (though it turns out I'm wrong about this).

But I'm not sure. It would be interesting to find out for sure.

One of the curiosities of NV40 is that dynamic branching isn't at the quad level (as far as experimenters can tell)...

Jawed
 
NV40 pixel pipelines work on batches of ~1000 pixels. I doubt all the quads are working on the same triangle; performance on small triangles would be quite poor
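Some quick arithmetic on why that would matter, taking the ~1000-pixel batch figure above at face value (the batch size and triangle sizes here are purely illustrative):

Code:
BATCH_PIXELS = 1000  # approximate batch size quoted above

def batch_occupancy(triangle_pixels):
    # fraction of a batch doing useful work if a whole batch were tied to one triangle
    return min(triangle_pixels, BATCH_PIXELS) / float(BATCH_PIXELS)

print(batch_occupancy(10))    # a 10-pixel triangle leaves ~99% of the batch idle
print(batch_occupancy(2000))  # large triangles fill the batch completely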
 
Nite_Hawk said:
Do you think they use simulators to actually simulate the chip itself, or build statistical performance models? I'm curious, because if they actually simulated the chip itself I'd be worried that the underlying system would color the performance of the simulated part.

Nite_Hawk

http://3dcenter.org/artikel/2005/03-31_a_english.php

At several stages during product development, performance is measured on both software and hardware simulators, to make sure we are on the right track. As soon as we have chips, we start measuring the real performance of the part. Often, at this point, we are not certain of the final clocks, so we explore performance results across a variety of clock configurations.

Jawed
 
nAo said:
NV40 pixel pipelines work on batches of ~1000 pixels. I doubt all the quads are working on the same triangle; performance on small triangles would be quite poor
Do you know how a batch is constructed?

Jawed
 
When counting bandwidth, is it reasonable to separate the bandwidth into functional units?

For example

Logic to Logic : CPU -> GPU is an example

Logic to RAM: CPU -> Main Ram is an example

These bandwidth assessments represent "bandwidth hops" between functional unit types (memory and logic).

The assessment of counting the internal memory of the SPUs doesn't work here because it is CPU work that normally would not require the use of external bandwidth under any circumstances. For example, no one counts the bandwidth between the X360 CPU cores and their L1/L2 caches.

However, under the structure of Xenos, work that would normally consume external bandwidth (a "hop") is internalized and made faster. The hard part is that now some of the logic is nested... do you count the bandwidth based on hops all the way through?

CPU -> [GPU] is one "bandwidth hop" but then there are additional bandwidth hops represented as [GPU (Shader -> [eDRAM -> Logic])]

Somehow or other we can justify the internal bandwidth hop from the shaders to the eDRAM (48 GB/s) but not the additional nested bandwidth hop (eDRAM memory -> eDRAM logic)?

Think of it like this: if there was a small 10MB cache separate from GPU, CPU, and Main RAM used to do this work, you WOULD count the bandwidth it took to access it... because it encapsulates "a hop" in the traditional sense.

Just because it's unconventional doesn't mean you don't count it.
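To put some numbers on the hop-counting argument (the 48 GB/s and 256 GB/s figures come up later in this thread; the main-memory figure is my own assumption for the GDDR3 interface and may be off):

Code:
# Figures in GB/s. The main-memory number is assumed; the other two are from this thread.
hops = {
    "GPU <-> main RAM":           22.4,   # assumed GDDR3 interface figure
    "shader core -> eDRAM die":   48.0,   # 32 read + 16 write across the die-to-die bus
    "eDRAM array <-> ROP logic":  256.0,  # internal to the daughter die
}

external_only = hops["GPU <-> main RAM"]
every_hop = sum(hops.values())

print(external_only)  # ~22.4 GB/s if you only count traffic that leaves the GPU
print(every_hop)      # ~326 GB/s if every hop, nested or not, gets added up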
 
blackjedi said:
For example
So what you're saying is that all existing examples of what you call "internalized bandwidth" aren't valid to count... because... well, because... we just don't count them...
But in R500 we must make an exception to the above rule because... well... because you or someone else (Microsoft?) says so...

Ok just wanted to clear this up.

Anyway, even by this logic (i.e. CPUs don't count or whatever), no one has explained to me why GS page refill bandwidth doesn't count - it's pretty much the same thing as the R500 high number.
 
blakjedi said:
The assessment of counting the internal memory of the SPUs doesn't work here because it is CPU work that normally would not require the use of external bandwidth under any circumstances.
It would require external bandwidth if it didn't have local storage. The introduction of local storage means we don't count the bandwidth the processors need, even though they are consuming data at however many bytes/s. Why should that be different for GPUs?

When counting bandwidth, is it reasonable to separate the bandwidth into functional units?
I would agree. In such cases [eDRAM -> Logic], why do you count eDRAM and Logic as two separate functional units?

If you count hops, why does the backbuffer processing on local storage count as a hop? If that is a hop, why isn't it a hop from the level 1 cache to the CPU logic?
 
Jawed said:
Nite_Hawk said:
Do you think they use simulators to actually simulate the chip itself, or build statistical performance models? I'm curious, because if they actually simulated the chip itself I'd be worried that the underlying system would color the performance of the simulated part.

Nite_Hawk

http://3dcenter.org/artikel/2005/03-31_a_english.php

At several stages during product development, performance is measured on both software and hardware simulators, to make sure we are on the right track. As soon as we have chips, we start measuring the real performance of the part. Often, at this point, we are not certain of the final clocks, so we explore performance results across a variety of clock configurations.

Jawed

Thanks for the link, that was interesting reading. :)

I'd be really interested in talking to Greg more about their testing methodology. I wonder how hard he is to track down.

Nite_Hawk
 
I'm slowly starting to view R500 a bit as a hybrid IMR/TBDR of some kind. Sure, it doesn't have units to do Z layout of one tile while shading another. And it can't remove opaque overdraw completely, or do order-independent transparency. But it can do a very fast Z first pass and then remove most of the opaque overdraw (limited by hierZ granularity), do blending and AA without apparent cost, and it renders into a large "on-chip" tile. Though, we don't know how it handles the "binning" yet, i.e. how it handles triangles that pass tile borders, whether they're going through VS again or not, etc.
 
Shifty Geezer said:
If you count hops, why does the backbuffer processing on local storage count as a hop? If that is a hop, why isn't it a hop from the level 1 cache to the CPU logic?

blakjedi said:
Think of it like this: if there was a small 10MB cache separate from GPU, CPU, and Main RAM used to do this work, you WOULD count the bandwidth it took to access it... because it encapsulates "a hop" in the traditional sense.

I think it is normal to do that backbuffer work in main RAM (as the RSX does). Under that circumstance you count the bandwidth used to access the RAM. Just because you use specialized RAM with high-speed bandwidth and a fairly wide interconnect away from main RAM, why shouldn't it be counted? It's a functional work unit too. I'm not necessarily invested in the outcome or the answer, I just like posing the question :D

It's quite the opposite with the local SPUs, because no one counts the bandwidth between CPU logic and local memory stores such as L1 and L2, and it's normal not to do that...

Somehow or other this makes sense to me, but it's not getting across very well.
 
Some thoughts...

People seem to try to dismiss the advantages of Xenos's EDRAM just because an MS PR person made the wrong choice of counting it into the system bandwidth. Yeah, he was wrong - so does that suddenly remove this as an advantage?

Summing up bandwidth in the way presented here is obviously not the ideal approach, so why beat a dead horse from both sides? Stop it and move on to more interesting topics :)

I'm sure the console devs here have ideas about average bandwidth requirements for actual in-game graphics. Like, we probably have such an amount of opaque overdraw, such an amount of transparent overdraw, and so on. We choose a resolution like 720p with 2x AA to make the field as even as possible, and calculate the probable backbuffer/framebuffer bandwidth utilization for the two systems. We'll then see how much the RSX has to spend from its bandwidth, how much is left, and how that actually compares to the amount of traffic Xenos needs to copy the backbuffer out to main memory.
Then we can repeat for other resolutions like 1080i/p, move up to 4x AA and so on... this would be a lot more profitable to the forum than the flamewar you seem to get into about opinions :)
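A rough uncompressed estimate along those lines (resolution and 2x AA from the post above; the overdraw factor, frame rate and per-sample sizes are my assumptions, and texturing, blending of transparencies and the tile resolve are ignored):

Code:
WIDTH, HEIGHT   = 1280, 720
AA_SAMPLES      = 2
COLOR_BYTES     = 4      # assumed 8-bit RGBA per sample
Z_BYTES         = 4      # assumed 24-bit Z + 8-bit stencil per sample
OPAQUE_OVERDRAW = 2.5    # assumed average
FPS             = 60     # assumed target

samples = WIDTH * HEIGHT * AA_SAMPLES
# each shaded sample does roughly a Z read, a Z write and a colour write
bytes_per_frame = samples * OPAQUE_OVERDRAW * (2 * Z_BYTES + COLOR_BYTES)
print(bytes_per_frame * FPS / 1e9)  # ~3.3 GB/s with these assumptions, before compression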
 
Laa-Yosh said:
Some thoughts...

People seem to try to dismiss the advantages of Xenos's EDRAM just because an MS PR person made the wrong choice of counting it into the system bandwidth. Yeah, he was wrong - so does that suddenly remove this as an advantage?

Summing up bandwidth in the way presented here is obviously not the ideal approach, so why beat a dead horse from both sides? Stop it and move on to more interesting topics :)

I'm sure the console devs here have ideas about average bandwidth requirements for actual in-game graphics. Like, we probably have such an amount of opaque overdraw, such an amount of transparent overdraw, and so on. We choose a resolution like 720p with 2x AA to make the field as even as possible, and calculate the probable backbuffer/framebuffer bandwidth utilization for the two systems. We'll then see how much the RSX has to spend from its bandwidth, how much is left, and how that actually compares to the amount of traffic Xenos needs to copy the backbuffer out to main memory.
Then we can repeat for other resolutions like 1080i/p, move up to 4x AA and so on... this would be a lot more profitable to the forum than the flamewar you seem to get into about opinions :)

As there is no real need for the Xbox 360 to do any kind of compression between the on-die logic and the eDRAM, since it has effectively unlimited bandwidth between them, you can bet that RSX will use as much compression as it can to save bandwidth. Given that, can we calculate the backbuffer bandwidth utilization? I figure that with a more complex scene the compression would be less effective.

Can anyone comment on this?
Maybe we can calculate bandwidth utilization without compression and then sort of make an educated guess at how effective compression can be?
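One way to fold compression into an estimate like the uncompressed one above, purely as a guess (the ratios here are illustrative, not measured RSX behaviour):

Code:
def effective_bandwidth(uncompressed_gbps, z_ratio=2.0, color_ratio=1.5):
    # crude split: assume half the traffic is Z, half is colour
    return uncompressed_gbps * 0.5 / z_ratio + uncompressed_gbps * 0.5 / color_ratio

print(effective_bandwidth(3.3))            # with the guessed ratios above
print(effective_bandwidth(3.3, 1.2, 1.1))  # a complex scene where compression works poorly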
 
Unknown Soldier said:
I've asked Dave to find out whether 3Dc or higher is used in the R500. I've also asked for clarification on whether Fast14 was used in the Xbox2 design process.

An excellent question, can't wait to hear back on that one!
 
Xmas said:
I'm slowly starting to view R500 a bit as a hybrid IMR/TBDR of some kind. Sure, it doesn't have units to do Z layout of one tile while shading another. And it can't remove opaque overdraw completely, or do order-independent transparency. But it can do a very fast Z first pass and then remove most of the opaque overdraw (limited by hierZ granularity), do blending and AA without apparent cost, and it renders into a large "on-chip" tile. Though, we don't know how it handles the "binning" yet, i.e. how it handles triangles that pass tile borders, whether they're going through VS again or not, etc.
How does PowerVR handle triangles that cross tile boundaries? I would imagine they are shaded once for each tile and clipped.
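For what it's worth, a generic binning scheme (not necessarily what PowerVR or Xenos actually does) would simply put a triangle into the bin of every tile its bounding box overlaps, and each tile then rasterises it clipped to the tile. A minimal sketch, with the tile size assumed:

Code:
TILE = 32  # tile size in pixels (assumed)

def bin_triangle(tri, bins):
    # add the triangle to every tile its bounding box overlaps
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    for ty in range(int(min(ys)) // TILE, int(max(ys)) // TILE + 1):
        for tx in range(int(min(xs)) // TILE, int(max(xs)) // TILE + 1):
            bins.setdefault((tx, ty), []).append(tri)

bins = {}
bin_triangle([(10, 10), (70, 12), (40, 90)], bins)
print(sorted(bins))  # this triangle lands in several tiles' bins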
 
Where are we getting the eDRAM-to-Xenos bandwidth from? I know the eDRAM to the rest of the eDRAM chip logic is 256GB/s. But where did the figure for the eDRAM to Xenos come from?
 
jvd said:
Where are we getting the eDRAM-to-Xenos bandwidth from? I know the eDRAM to the rest of the eDRAM chip logic is 256GB/s. But where did the figure for the eDRAM to Xenos come from?

Presumably from the Tech report diagram.

http://techreport.com/etc/2005q2/xbox360-gpu/block.gif

Pretty impressive, but somewhat unfortunate that the MS marketing types use that number to pad their aggregate bandwidth figure for the whole system, when architecturally it's an internal bus between two elements of the GPU rather than a link to the rest of the system (the CPU or memory). But that's what sells hype, I guess. That's been discussed elsewhere, though.
 
jvd said:
Where are we getting the eDRAM-to-Xenos bandwidth from? I know the eDRAM to the rest of the eDRAM chip logic is 256GB/s. But where did the figure for the eDRAM to Xenos come from?

The bandwidth between the parent die (i.e. the shader logic) and the daughter die (i.e. the eDRAM with backbuffer logic for Z/alpha/stencil) is said to be 32GB/s read and 16GB/s write; those figures are derived from the Xbox block diagram leak from last year.

I have not seen that information in any of the new material posted from MS. I guess Dave will be able to confirm what the bandwidth is when he gets his snazzy report done for us :D
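If those figures are right, the die-to-die link works out to 32GB/s read + 16GB/s write = 48GB/s in total, which is a separate thing from the 256GB/s number mentioned earlier in the thread: that one is the eDRAM array talking to the Z/stencil/blend logic on the daughter die itself.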
 