Xenos - invention of the BackBuffer Processing Unit?

Many GPU architectures can be seen as (non-symmetric) multiprocessor architectures:
Nvidia calls NV40 a 3-processor architecture: vertex shaders + pixel shaders + ROPs.
The ROPs even run at the memory clock rather than the GPU core clock, and it's a given that ROPs have their own local store/cache that holds some pixel tiles.
 
Jawed said:
Your point that R500 is actually a dual-processor GPU is salient. The EDRAM chip truly is a second processor.

Statements like this should be qualified - there are many eyes on this board looking purely for fanboy console war fodder on other boards. It's its own processor, but it's very limited to a specific task (framebuffer tasks), not a second equivalent to the parent GPU. Such logic is on pretty much every GPU; from what I know of the eDRAM in R500, they've just moved that logic out onto its own chip, closer to the eDRAM, to save bandwidth.
 
Npl said:
It's a moot point directly comparing on-chip bandwidth to external bandwidth; by the same logic I could add Cell's internal bandwidth to "own" other CPUs.

Cell:
1 SPU has 3*128-bit reads and 1*128-bit write per cycle = 143 GB/s read, 47 GB/s write to Local Storage.
7 SPUs = 1 TB/s read, ~334 GB/s write.

This kind of evaluation is entirely valid in my opinion. Cell is an ultra-high performance digital signal processor. Very small blocks of data being processed incredibly quickly to produce a mind-boggling aggregate.

Jawed
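
For reference, a minimal sketch of the arithmetic behind those local-store figures, assuming a 3.2 GHz SPU clock and binary (GiB-based) units; it is just the per-cycle port widths multiplied out, not an official spec.

```python
# Back-of-the-envelope Cell local-store bandwidth: assumes a 3.2 GHz SPU
# clock and binary (2^30-byte) units -- just the arithmetic behind the
# figures quoted above, not official numbers.
CLOCK_HZ = 3.2e9
GIB = 2**30

read_bytes_per_cycle  = 3 * 128 // 8   # three 128-bit read ports
write_bytes_per_cycle = 1 * 128 // 8   # one 128-bit write port

read_per_spu  = read_bytes_per_cycle  * CLOCK_HZ / GIB   # ~143 GiB/s
write_per_spu = write_bytes_per_cycle * CLOCK_HZ / GIB   # ~48 GiB/s

spus = 7
print(f"per SPU : {read_per_spu:.0f} GiB/s read, {write_per_spu:.0f} GiB/s write")
print(f"{spus} SPUs: {read_per_spu * spus / 1024:.2f} TiB/s read, "
      f"{write_per_spu * spus:.0f} GiB/s write")
```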
 
nAo said:
Many GPU architectures can be seen as (non-symmetric) multiprocessor architectures:
Nvidia calls NV40 a 3-processor architecture: vertex shaders + pixel shaders + ROPs.
The ROPs even run at the memory clock rather than the GPU core clock, and it's a given that ROPs have their own local store/cache that holds some pixel tiles.

You can go further and describe each quad-pipeline as a processor core. ;)

What's interesting is that in R300 and up, the quad-pipelines are each able to run a different shader.

In NV40 it seems that the same shader is running on all pipelines.

So R420 is, for example, 4-way MIMD, whereas NV40 is 16-way SIMD.

I think the vertex pipes in both architectures are purely independent and parallel.

Hope I've got that all correct - wouldn't want to start a second off-forum flame-war in only two posts!

Jawed
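
To make the SIMD/MIMD distinction concrete, here is a toy scheduling sketch (a hypothetical model, not vendor code): an R420-style part can hold a different shader per quad, while an NV40-style part issues one shader across all sixteen pixel pipes.

```python
# Toy illustration of the MIMD-vs-SIMD distinction discussed above.
# Hypothetical model, not vendor code: in the MIMD case each quad can be
# running its own shader; in the SIMD case every pipe runs the same one.

def issue_mimd(quads, shaders):
    """R420-style: each quad-pipeline is free to run a different shader."""
    return {f"quad{i}": shaders[i % len(shaders)] for i in range(quads)}

def issue_simd(pipes, shaders):
    """NV40-style (as described above): all pipes share one shader."""
    current = shaders[0]
    return {f"pipe{i}": current for i in range(pipes)}

shaders = ["skin", "water", "terrain", "particles"]
print(issue_mimd(4, shaders))    # 4-way MIMD: four different shaders in flight
print(issue_simd(16, shaders))   # 16-way SIMD: one shader for everything
```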
 
Titanio said:
Jawed said:
Your point that R500 is actually a dual-processor GPU is salient. The EDRAM chip truly is a second processor.

Statements like this should be qualified - there are many eyes on this board looking purely for <bleep> console war fodder on other boards. It's its own processor, but it's very limited to a specific task (framebuffer tasks), not a second equivalent to the parent GPU. Such logic is on pretty much every GPU; from what I know of the eDRAM in R500, they've just moved that logic out onto its own chip, closer to the eDRAM, to save bandwidth.
Yes, but we are making progress! :D It's agreed that the eDRAM isn't local storage for the GPU per se, but a subset of the GPU. It's connected to the GPU via a 48 GB/s bus, which is fast, but not the same as actually being embedded on the GPU. Presumably this decision was made to improve yields, as embedding the backbuffer logic and 10 MB of eDRAM alongside the unified shaders would have been difficult.

@ nAo : In the case of other GPUs being considered three processors, how do those processors share data? Do they have an actual bus system with restricted data flow?
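
As a rough illustration of the 10 MB point above, here is a quick framebuffer-sizing sketch; the resolution and per-pixel sizes are assumptions for illustration (1280x720, 32-bit colour plus 32-bit depth/stencil), not confirmed Xenos parameters.

```python
# Rough framebuffer sizing: illustrative assumptions only (1280x720,
# 32-bit colour + 32-bit depth/stencil), to show how quickly a render
# target approaches or exceeds 10 MB of eDRAM as multisampling rises.
def backbuffer_mib(width, height, bytes_per_pixel=4 + 4, samples=1):
    return width * height * bytes_per_pixel * samples / 2**20

for samples in (1, 2, 4):
    size = backbuffer_mib(1280, 720, samples=samples)
    verdict = "fits in" if size <= 10 else "exceeds"
    print(f"{samples}x AA: {size:5.1f} MiB -> {verdict} 10 MiB of eDRAM")
```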
 
Jawed said:
This kind of evaluation is entirely valid in my opinion. Cell is an ultra-high performance digital signal processor. Very small blocks of data being processed incredibly quickly to produce a mind-boggling aggregate.
The problem with this is twofold. One, how do you count this into "system" bandwidth figures? Simply adding the numbers together like those Xenon charts did is complete nonsense.
Second, if you count these, the question immediately comes up of why we don't count bandwidth from on-demand-loaded caches in the equation as well.

Or to drive my GS example one step further - "hidden" bandwidth is there too.
48 GB/s is the bandwidth between the page buffer and the ROPs. The eDRAM -> page buffer bandwidth is actually much higher - 150 GB/s.
Surely we should count that figure instead? ;)
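
That winking question gets at the heart of it: summing peak link figures into one number, the way those charts did, mixes internal and external traffic that measure different things. A toy sketch of the distinction (all numbers are hypothetical placeholders, not measured console figures):

```python
# Toy illustration of the point above: peak figures for on-chip links and
# the external bus measure different things, so adding them into a single
# "system bandwidth" number is meaningless.  All figures are hypothetical
# placeholders, not measured console numbers.
links = {
    "eDRAM -> page buffer (on-chip)": 150.0,
    "page buffer -> ROPs (on-chip)":   48.0,
    "GPU <-> main memory (external)":  22.4,
}

chart_total = sum(links.values())
print(f"chart-style grand total: {chart_total:.1f} GB/s (not a real capacity)")
for name, gb_s in links.items():
    print(f"  {name}: {gb_s:.1f} GB/s")
```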
 
Fafalada said:
While it isn't fair to just count the 256 GB/s bandwidth as system bandwidth, it also wouldn't be fair to dismiss it entirely.
That's what generally everyone did when comparing PS2 to XBox, so why should this be any different?

At any rate, not even the most rabid fanboys had the audacity to make idiotic charts claiming ~50 GB/s+ (PS2) (or ~30 GB/s+ (GC)) vs the paltry 6.4 GB/s of Xbox.

Yeah, but making the same mistake again this time would be really stupid, right, since there are real benefits. I didn't contribute to the PS2 vs Xbox discussions, so I really don't know what was said and what wasn't.

I think that dismissing this bandwidth saving feature here and now because everyone did it 5 years ago is something we shouldn't do around here.
 
JAD said:
I think that dismissing this bandwidth saving feature here and now because everyone did it 5 years ago is something we shouldn't do around here.

I don't think anyone is dismissing it; it's just the direct comparisons between the bandwidth of a 10 MB "cache" and that of the whole memory that are futile.
 
Fafalada - I think the problem with these bandwidths is trying to identify them and work out which are useful to count up.

When AMD doubles the cache on an A64, it bumps up CPU performance by around 10%. That's a good example of increased performance (increased effective bandwidth) accruing from a non-obvious design change. A Newcastle and a Clawhammer core at the same clock speed perform differently because of the difference in cache size.

Damn, doesn't that muck up our ability to "count" system bandwidth? Yep!

Jawed
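
A tiny model of why cache makes bandwidth so hard to "count": the effective bandwidth the core sees depends on hit rate, which no single GB/s figure captures. The hit rates and link speeds below are made-up illustrative values, not Athlon 64 measurements.

```python
# Effective bandwidth under a cache, using a simple weighted model:
# hits are served at cache speed, misses at memory speed.  Hit rates
# and link speeds are made-up illustrative values, not A64 data.
def effective_bandwidth(hit_rate, cache_gb_s, memory_gb_s):
    # Time per byte is the hit-weighted blend of the two links.
    return 1.0 / (hit_rate / cache_gb_s + (1.0 - hit_rate) / memory_gb_s)

cache_gb_s, memory_gb_s = 100.0, 6.4
for hit_rate in (0.90, 0.95):   # e.g. before and after doubling the cache
    eff = effective_bandwidth(hit_rate, cache_gb_s, memory_gb_s)
    print(f"hit rate {hit_rate:.0%}: ~{eff:.1f} GB/s effective")
```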
 
DaveBaumann said:
AzBat said:
I agree, that sounds like it might be a good way to show the benefits of the eDRAM.

Hey Tommy, remember what I said about this particular figure? ;)

Yeah, _now_ I know what you mean. I always like bigger numbers, but only if they're used correctly. Though I can understand these numbers being kinda sticky. ;)

Tommy McClain
 
I've asked Dave to find out whether 3Dc or higher is used in the R500. I've also asked for clarification on whether Fast14 was used in the design of the Xbox2.

Do you think MS might have used any of the above in the Xbox2, since they are bandwidth-saving features?

btw.. welcome JAD

US
 
AzBat said:
DaveBaumann said:
AzBat said:
I agree, that sounds like it might be a good way to show the benefits of the eDRAM.

Hey Tommy, remember what I said about this particular figure? ;)

Yeah, _now_ I know what you mean. I always like bigger numbers, but only if they're used correctly. Though I can understand these numbers being kinda sticky. ;)

Tommy McClain

Heya AzBat,

Remember the project that we were thinking about a while back to figure out the contribution different hardware made to the speed at which the system could render a scene? I kind of wish I still had time to work on it. It's interesting when looking at this kind of stuff. :)

Nite_Hawk
 
Nite_Hawk said:
Heya AzBat,

Remember the project that we were thinking about a while back to figure out the contribution different hardware made to the speed at which the system could render a scene? I kind of wish I still had time to work on it. It's interesting when looking at this kind of stuff. :)

Nite_Hawk

Yeah, I remember. That would be a helluva project. Not sure it would work in a closed console system, but it still could be useful for the PC market. Not as useful as 2 or 3 years ago, though. It seems right now things are pretty even with the only 2 players out there. Basically you make decisions based on price and brand preference. ;)

Tommy McClain
 
I agree that the eDram is important, but that you can't just add the bandwidth to that of main memory.

A good place to start is to figure out how the RSX does the equivalent functions to the eDRAM logic, and how much bandwidth that would typically consume, then subtract it from the PS3 bandwidth figure for comparison. Or, if the PS3 would not do some of those functions, then consider that a point in the X360's favour.
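
Something like the following back-of-the-envelope is what that comparison amounts to. Every number here is a placeholder assumption (resolution, overdraw, AA, compression, total bandwidth), not an RSX or Xenos spec.

```python
# Sketch of the comparison suggested above: estimate the main-memory
# traffic that framebuffer work (colour + Z read/write) would generate on
# a GPU *without* eDRAM, then see what is left for textures and geometry.
# Every figure below is a placeholder assumption, not an RSX/Xenos spec.
width, height, fps = 1280, 720, 60
overdraw    = 4.0    # average times each pixel is touched
aa_samples  = 4
bytes_rw    = 16     # colour read+write and Z read+write, 4 bytes each
compression = 0.5    # assume colour/Z compression halves the traffic

pixels_per_s = width * height * fps * overdraw
fb_traffic   = pixels_per_s * bytes_rw * aa_samples * compression / 1e9

total_bw = 48.0      # placeholder total main-memory bandwidth, GB/s
print(f"estimated framebuffer traffic: {fb_traffic:.1f} GB/s")
print(f"left for textures/geometry   : {total_bw - fb_traffic:.1f} GB/s")
```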
 
Unknown Soldier said:
Well that's what I would've thought .. except

http://www.beyond3d.com/forum/viewtopic.php?t=23126&postdays=0&postorder=asc&start=40

Jawed said:
rwolf said:
Here is a nasty limitation.

All 48 of the ALUs are able to perform operations on either pixel or vertex data. All 48 have to be doing the same thing during the same clock cycle (pixel or vertex operations), but this can alternate from clock to clock. One cycle, all 48 ALUs can be crunching vertex data, the next, they can all be doing pixel ops, but they cannot be split in the same clock cycle.


Yep, it sounds shit to me. It makes me wonder if dynamic branching is ever going to bring improved performance. Seems unlikely.

Jawed

Well, after reading that, it seems to me that it can't do pixel and vertex operations at the same time. That would be an interesting question for the team.

[guessing] I would suspect that the granularity at which this becomes a problem is unlikely to be hit in practice. The design seems well thought out, and I would assume it very unlikely that such a limitation would have been overlooked. GPUs are all about hiding latency. The arbiter's job is to keep utilisation of the ALUs high, so it would most likely be responsible for ensuring that such (pathological?) cases have minimal effect. [/guessing]
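
As a toy model of the constraint quoted above, here is a hypothetical arbiter that issues whole-GPU batches of either vertex or pixel work each clock; the scheduling policy is made up for illustration and is not ATI's actual arbiter.

```python
# Toy arbiter under the quoted constraint: every clock, all 48 ALUs do
# either vertex or pixel work, never a mix, but the choice can flip from
# clock to clock.  Hypothetical scheduling policy, not ATI's arbiter.
from collections import deque

def run(vertex_work, pixel_work, alus=48, clocks=8):
    vq, pq = deque(vertex_work), deque(pixel_work)
    for clock in range(clocks):
        # Naive policy: issue from whichever queue is deeper.
        queue, kind = (vq, "vertex") if len(vq) >= len(pq) else (pq, "pixel")
        issued = [queue.popleft() for _ in range(min(alus, len(queue)))]
        print(f"clock {clock}: {kind:6s} x{len(issued):2d} "
              f"(idle ALUs: {alus - len(issued)})")

run(vertex_work=list(range(100)), pixel_work=list(range(300)))
```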
 
AzBat said:
Nite_Hawk said:
Heya AzBat,

Remember the project that we were thinking about a while back to figure out the contribution different hardware made to the speed at which the system could render a scene? I kind of wish I still had time to work on it. It's interesting when looking at this kind of stuff. :)

Nite_Hawk

Yeah, I remember. That would be a helluva project. Not sure it would work in a closed console system, but it still could be useful for the PC market. Not as useful as 2 or 3 years ago, though. It seems right now things are pretty even with the only 2 players out there. Basically you make decisions based on price and brand preference. ;)

Tommy McClain

I actually think for a closed system it would be really interesting if ATI/MS or nVidia/Sony did something like this internally to figure out where their bottlenecks are coming from. In all of these threads we keep talking about whether or not the PS3 is going to be bandwidth-starved with respect to the framebuffer throughput required, and how effective the eDRAM on the ATI part is going to be. All of these things are going to be immensely dependent on how effective their compression is and how much overdraw there is. ATI, for instance, solved the problem with their ROP/eDRAM solution. We don't know what nVidia's solution is (if they even have one), but we also don't really know if they need one.

I think it would be immensely interesting to see tests done on something like the PS3 where the speeds of the SPUs, the number of SPUs, the GPU speed, and various bus speeds can all be manipulated to see how changing them changes performance. Does the PS3 even need an eDRAM solution like ATI's? Does ATI really need the eDRAM solution, or could they have gotten away with less (like upping the main memory throughput)? I think these are the kinds of questions that such a system could help us answer. I wonder if something like this is already in place?

Nite_Hawk
 
nelg - I agree with your guess, which is why I'm dubious that all 48 ALUs are running the same instruction.

On the other hand, if all 48 ALUs are running different instructions, that necessitates a huge amount of instruction decode logic across the whole GPU, whereas in previous designs this logic occurred relatively few times (once per quad in the pixel shaders, and once per vertex pipeline (?)). It also increases the amount of program counter logic (every ALU requires a private counter) and means that the shader state memory has to be partitioned per ALU, too, since register accesses become incoherent.

In other words, 48 ALUs with no grouping into quads is really messy.

Perhaps it's 16 ALUs per group, so there are three different threads concurrently executing on 16 ALUs each. Sounds like a reasonable compromise to me, but still with a high branch misprediction cost.

So this matter of SIMD versus n-way MIMD is intriguing...

Jawed
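
For the grouping trade-off being mused about here, a small sketch of how branch incoherence cost scales with SIMD group size; the branch probability is a made-up illustrative value and the model is generic, not R500 behaviour.

```python
# Generic SIMD-width trade-off: wider groups need less decode and
# program-counter logic, but when any lane in a group takes a rare
# branch the whole group executes both paths.  The 5% branch
# probability is made up for illustration, not R500 behaviour.
def divergence_overhead(group_size, p_taken=0.05):
    """Expected work inflation from one rare branch per shader."""
    p_coherent = p_taken**group_size + (1 - p_taken)**group_size
    return 1 + (1 - p_coherent)   # diverged groups run both paths

for group in (4, 16, 48):
    print(f"group of {group:2d} ALUs: ~{divergence_overhead(group):.2f}x branch work")
```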
 