Xenos - invention of the BackBuffer Processing Unit?

Many GPU architectures can be seen as (non-symmetric) multiprocessor architectures:
Nvidia calls NV40 a 3-processor architecture: vertex shaders + pixel shaders + ROPs.
The ROPs even run at the memory clock rather than the GPU core clock, and it's a given that ROPs have their own local store/cache that holds some pixel tiles.
 
Jawed said:
Your point that R500 is actually a dual-processor GPU is salient. The EDRAM chip truly is a second processor.

Statements like this should be qualified - there are many eyes on this board looking purely for fanboy console war fodder on other boards. It's its own processor, but it's very limited to a specific task (framebuffer tasks), not a second equivalent to the parent GPU. Such logic is on pretty much every GPU; from what I know of the eDRAM in R500, they've just moved that logic out onto its own chip, closer to the eDRAM, to save bandwidth.
 
Npl said:
It's a moot point directly comparing on-chip bandwidth to external bandwidth; by the same logic I could add Cell's internal bandwidth to "own" other CPUs.

Cell:
1 SPU has 3*128-bit reads and 1*128-bit write per cycle = 143 GB/s read, 47 GB/s write to Local Storage.
7 SPUs = 1 TB/s read, ~334 GB/s write.

This kind of evaluation is entirely valid in my opinion. Cell is an ultra-high performance digital signal processor. Very small blocks of data being processed incredibly quickly to produce a mind-boggling aggregate.

Jawed
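
For reference, a minimal sketch of the arithmetic behind those local-store figures, assuming a 3.2 GHz SPU clock and binary (GiB-based) units; it is just the per-cycle port widths multiplied out, not an official spec.

```python
# Back-of-the-envelope Cell local-store bandwidth: assumes a 3.2 GHz SPU
# clock and binary (2^30-byte) units -- just the arithmetic behind the
# figures quoted above, not official numbers.
CLOCK_HZ = 3.2e9
GIB = 2**30

read_bytes_per_cycle  = 3 * 128 // 8   # three 128-bit read ports
write_bytes_per_cycle = 1 * 128 // 8   # one 128-bit write port

read_per_spu  = read_bytes_per_cycle  * CLOCK_HZ / GIB   # ~143 GiB/s
write_per_spu = write_bytes_per_cycle * CLOCK_HZ / GIB   # ~48 GiB/s

spus = 7
print(f"per SPU : {read_per_spu:.0f} GiB/s read, {write_per_spu:.0f} GiB/s write")
print(f"{spus} SPUs: {read_per_spu * spus / 1024:.2f} TiB/s read, "
      f"{write_per_spu * spus:.0f} GiB/s write")
```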
 
nAo said:
Many GPU architectures can be seen as (non-symmetric) multiprocessor architectures:
Nvidia calls NV40 a 3-processor architecture: vertex shaders + pixel shaders + ROPs.
The ROPs even run at the memory clock rather than the GPU core clock, and it's a given that ROPs have their own local store/cache that holds some pixel tiles.

You can go further and describe each quad-pipeline as a processor core. ;)

What's interesting is that in R300 and up, the quad-pipelines are each able to run a different shader.

In NV40 it seems that the same shader is running on all pipelines.

So R420 is, for example, 4-way MIMD, whereas NV40 is 16-way SIMD.

I think the vertex pipes in both architectures are purely independent and parallel.

Hope I've got that all correct - wouldn't want to start a second off-forum flame-war in only two posts!

Jawed
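
To make the SIMD/MIMD distinction concrete, here is a toy scheduling sketch (a hypothetical model, not vendor code): an R420-style part can hold a different shader per quad, while an NV40-style part issues one shader across all sixteen pixel pipes.

```python
# Toy illustration of the MIMD-vs-SIMD distinction discussed above.
# Hypothetical model, not vendor code: in the MIMD case each quad can be
# running its own shader; in the SIMD case every pipe runs the same one.

def issue_mimd(quads, shaders):
    """R420-style: each quad-pipeline is free to run a different shader."""
    return {f"quad{i}": shaders[i % len(shaders)] for i in range(quads)}

def issue_simd(pipes, shaders):
    """NV40-style (as described above): all pipes share one shader."""
    current = shaders[0]
    return {f"pipe{i}": current for i in range(pipes)}

shaders = ["skin", "water", "terrain", "particles"]
print(issue_mimd(4, shaders))    # 4-way MIMD: four different shaders in flight
print(issue_simd(16, shaders))   # 16-way SIMD: one shader for everything
```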
 
Titanio said:
Jawed said:
Your point that R500 is actually a dual-processor GPU is salient. The EDRAM chip truly is a second processor.

Statements like this should be qualified - there are many eyes on this board looking purely for <bleep> console war fodder on other boards. It's its own processor, but it's very limited to a specific task (framebuffer tasks), not a second equivalent to the parent GPU. Such logic is on pretty much every GPU; from what I know of the eDRAM in R500, they've just moved that logic out onto its own chip, closer to the eDRAM, to save bandwidth.
Yes, but we are making progress! :D It's agreed that the eDRAM isn't local storage for the GPU per se, but a subset of the GPU. It's connected to the GPU via a 48 GB/s bus, which is fast, but not the same as actually being embedded on the GPU. Presumably this decision was made to improve yields, as embedding the backbuffer logic and 10 MB of eDRAM alongside the unified shaders would have been difficult.

@ nAo : In the case of other GPUs being considered three processors, how do those processors share data? Do they have an actual bus system with restricted data flow?
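
As a rough illustration of the 10 MB point above, here is a quick framebuffer-sizing sketch; the resolution and per-pixel sizes are assumptions for illustration (1280x720, 32-bit colour plus 32-bit depth/stencil), not confirmed Xenos parameters.

```python
# Rough framebuffer sizing: illustrative assumptions only (1280x720,
# 32-bit colour + 32-bit depth/stencil), to show how quickly a render
# target approaches or exceeds 10 MB of eDRAM as multisampling rises.
def backbuffer_mib(width, height, bytes_per_pixel=4 + 4, samples=1):
    return width * height * bytes_per_pixel * samples / 2**20

for samples in (1, 2, 4):
    size = backbuffer_mib(1280, 720, samples=samples)
    verdict = "fits in" if size <= 10 else "exceeds"
    print(f"{samples}x AA: {size:5.1f} MiB -> {verdict} 10 MiB of eDRAM")
```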
 
Jawed said:
This kind of evaluation is entirely valid in my opinion. Cell is an ultra-high performance digital signal processor. Very small blocks of data being processed incredibly quickly to produce a mind-boggling aggregate.
The problem with this is twofold. One, how do you count this into "system" bandwidth figures? Simply adding the numbers together like those Xenon charts did is complete nonsense.
Second, if you count these, the question immediately comes up of why we don't count bandwidth from on-demand-loaded caches in the equation as well.

Or to drive my GS example one step further - "hidden" bandwidth is there too.
48 GB/s is the bandwidth between the page buffer and the ROPs. The eDRAM -> page buffer bandwidth is actually much higher - 150 GB/s.
Surely we should count that figure instead? ;)
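
That winking question gets at the heart of it: summing peak link figures into one number, the way those charts did, mixes internal and external traffic that measure different things. A toy sketch of the distinction (all numbers are hypothetical placeholders, not measured console figures):

```python
# Toy illustration of the point above: peak figures for on-chip links and
# the external bus measure different things, so adding them into a single
# "system bandwidth" number is meaningless.  All figures are hypothetical
# placeholders, not measured console numbers.
links = {
    "eDRAM -> page buffer (on-chip)": 150.0,
    "page buffer -> ROPs (on-chip)":   48.0,
    "GPU <-> main memory (external)":  22.4,
}

chart_total = sum(links.values())
print(f"chart-style grand total: {chart_total:.1f} GB/s (not a real capacity)")
for name, gb_s in links.items():
    print(f"  {name}: {gb_s:.1f} GB/s")
```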
 
Fafalada said:
While it isn't fair to just count the 256 GB/s bandwidth as system bandwidth, it also wouldn't be fair to dismiss it entirely.
That's what generally everyone did when comparing PS2 to XBox, so why should this be any different?

At any rate, not even the most rabid fanboys had the audacity to make idiotic charts claiming ~50 GB/s+ (PS2) (or ~30 GB/s+ (GC)) vs the paltry 6.4 GB/s of Xbox.

Yeah, but making the same mistake again this time would be really stupid, right, since there are real benefits. I didn't contribute to the PS2 vs Xbox discussions, so I really don't know what was said and what wasn't.

I think that dismissing this bandwidth saving feature here and now because everyone did it 5 years ago is something we shouldn't do around here.
 
JAD said:
I think that dismissing this bandwidth saving feature here and now because everyone did it 5 years ago is something we shouldn't do around here.

I don't think anyone is dismissing it; it's just the direct comparisons between the bandwidth of a 10 MB "cache" and that of the whole memory that are futile.
 
Fafalada - I think the problem with these bandwidths is trying to identify them and work out which are useful to count up.

When AMD doubles the cache on an A64, it bumps up CPU performance by around 10%. That's a good example of increased performance (increased effective bandwidth) accruing from a non-obvious design change. A Newcastle and a Clawhammer core at the same clock speed perform differently because of the difference in cache size.

Damn, doesn't that muck up our ability to "count" system bandwidth? Yep!

Jawed
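
A tiny model of why cache makes bandwidth so hard to "count": the effective bandwidth the core sees depends on hit rate, which no single GB/s figure captures. The hit rates and link speeds below are made-up illustrative values, not Athlon 64 measurements.

```python
# Effective bandwidth under a cache, using a simple weighted model:
# hits are served at cache speed, misses at memory speed.  Hit rates
# and link speeds are made-up illustrative values, not A64 data.
def effective_bandwidth(hit_rate, cache_gb_s, memory_gb_s):
    # Time per byte is the hit-weighted blend of the two links.
    return 1.0 / (hit_rate / cache_gb_s + (1.0 - hit_rate) / memory_gb_s)

cache_gb_s, memory_gb_s = 100.0, 6.4
for hit_rate in (0.90, 0.95):   # e.g. before and after doubling the cache
    eff = effective_bandwidth(hit_rate, cache_gb_s, memory_gb_s)
    print(f"hit rate {hit_rate:.0%}: ~{eff:.1f} GB/s effective")
```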
 
DaveBaumann said:
AzBat said:
I agree, that sounds like it might be a good way to show the benefits of the eDRAM.

Hey Tommy, remember what I said about this particular figure? ;)

Yeah, _now_ I know what you mean. I always like bigger numbers, but only if they're used correctly. Though I can understand these numbers being kinda sticky. ;)

Tommy McClain
 
I've asked Dave to find out whether 3Dc or higher is used in the R500. I've also asked for clarification on whether Fast14 was used in the design of the Xbox2.

Do you think MS might have used any of the above in the Xbox2, since they are bandwidth-saving features?

btw.. welcome JAD

US
 
AzBat said:
DaveBaumann said:
AzBat said:
I agree, that sounds like it might be a good way to show the benefits of the eDRAM.

Hey Tommy, remember what I said about this particular figure? ;)

Yeah, _now_ I know what you mean. I always like bigger numbers, but only if they're used correctly. Though I can understand these numbers being kinda sticky. ;)

Tommy McClain

Heya AzBat,

Remember the project that we were thinking about a while back to figure out the contribution different hardware made to the speed at which the system could render a scene? I kind of wish I still had time to work on it. It's interesting when looking at this kind of stuff. :)

Nite_Hawk
 
Nite_Hawk said:
Heya AzBat,

Remember the project that we were thinking about a while back to figure out the contribution different hardware made to the speed at which the system could render a scene? I kind of wish I still had time to work on it. It's interesting when looking at this kind of stuff. :)

Nite_Hawk

Yeah, I remember. That would be a helluva project. Not sure it would work in a closed console system, but it still could be useful for the PC market. Not as useful as 2 or 3 years ago, though. It seems right now things are pretty even with the only 2 players out there. Basically you make decisions based on price and brand preference. ;)

Tommy McClain
 
I agree that the eDram is important, but that you can't just add the bandwidth to that of main memory.

A good place to start is to figure out how the RSX does the equivalent functions to the eDRAM logic, and how much bandwidth that would typically consume, then subtract it from the PS3 bandwidth figure for comparison. Or, if the PS3 would not do some of those functions, then consider that a point in the X360's favour.
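
Something like the following back-of-the-envelope is what that comparison amounts to. Every number here is a placeholder assumption (resolution, overdraw, AA, compression, total bandwidth), not an RSX or Xenos spec.

```python
# Sketch of the comparison suggested above: estimate the main-memory
# traffic that framebuffer work (colour + Z read/write) would generate on
# a GPU *without* eDRAM, then see what is left for textures and geometry.
# Every figure below is a placeholder assumption, not an RSX/Xenos spec.
width, height, fps = 1280, 720, 60
overdraw    = 4.0    # average times each pixel is touched
aa_samples  = 4
bytes_rw    = 16     # colour read+write and Z read+write, 4 bytes each
compression = 0.5    # assume colour/Z compression halves the traffic

pixels_per_s = width * height * fps * overdraw
fb_traffic   = pixels_per_s * bytes_rw * aa_samples * compression / 1e9

total_bw = 48.0      # placeholder total main-memory bandwidth, GB/s
print(f"estimated framebuffer traffic: {fb_traffic:.1f} GB/s")
print(f"left for textures/geometry   : {total_bw - fb_traffic:.1f} GB/s")
```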
 
Unknown Soldier said:
Well that's what I would've thought .. except

http://www.beyond3d.com/forum/viewtopic.php?t=23126&postdays=0&postorder=asc&start=40

Jawed said:
rwolf said:
Here is a nasty limitation.

All 48 of the ALUs are able to perform operations on either pixel or vertex data. All 48 have to be doing the same thing during the same clock cycle (pixel or vertex operations), but this can alternate from clock to clock. One cycle, all 48 ALUs can be crunching vertex data, the next, they can all be doing pixel ops, but they cannot be split in the same clock cycle.


Yep, it sounds shit to me. It makes me wonder if dynamic branching is ever going to bring improved performance. Seems unlikely.

Jawed

Well, after reading that, it seems to me that it can't do pixel and vertex operations at the same time. That would be an interesting question for the team.

[guessing] I would suspect that the granularity at which this becomes a problem is unlikely to be hit in practice. The design seems well thought out, and I would assume it very unlikely that such a limitation would have been overlooked. GPUs are all about hiding latency. The arbiter's job is to keep utilisation of the ALUs high, so it would most likely be responsible for ensuring that such (pathological?) cases have minimal effect. [/guessing]
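
As a toy model of the constraint quoted above, here is a hypothetical arbiter that issues whole-GPU batches of either vertex or pixel work each clock; the scheduling policy is made up for illustration and is not ATI's actual arbiter.

```python
# Toy arbiter under the quoted constraint: every clock, all 48 ALUs do
# either vertex or pixel work, never a mix, but the choice can flip from
# clock to clock.  Hypothetical scheduling policy, not ATI's arbiter.
from collections import deque

def run(vertex_work, pixel_work, alus=48, clocks=8):
    vq, pq = deque(vertex_work), deque(pixel_work)
    for clock in range(clocks):
        # Naive policy: issue from whichever queue is deeper.
        queue, kind = (vq, "vertex") if len(vq) >= len(pq) else (pq, "pixel")
        issued = [queue.popleft() for _ in range(min(alus, len(queue)))]
        print(f"clock {clock}: {kind:6s} x{len(issued):2d} "
              f"(idle ALUs: {alus - len(issued)})")

run(vertex_work=list(range(100)), pixel_work=list(range(300)))
```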
 
AzBat said:
Nite_Hawk said:
Heya AzBat,

Remember the project that we were thinking about a while back to figure out the contribution different hardware made to the speed at which the system could render a scene? I kind of wish I still had time to work on it. It's interesting when looking at this kind of stuff. :)

Nite_Hawk

Yeah, I remember. That would be a helluva project. Not sure it would work in a closed console system, but it still could be useful for the PC market. Not as useful as 2 or 3 years ago, though. It seems right now things are pretty even with the only 2 players out there. Basically you make decisions based on price and brand preference. ;)

Tommy McClain

I actually think for a closed system it would be really interesting if ATI/MS or nVidia/Sony did something like this internally to figure out where their bottlenecks are coming from. In all of these threads we keep talking about whether or not the PS3 is going to be bandwidth-starved with respect to the framebuffer throughput required, and how effective the eDRAM on the ATI part is going to be. All of these things are going to be immensely dependent on how effective their compression is and how much overdraw there is. ATI, for instance, solved the problem with their ROP/eDRAM solution. We don't know what nVidia's solution is (if they even have one), but we also don't really know if they need one.

I think it would be immensely interesting to see tests done on something like the PS3 where the speeds of the SPUs, the number of SPUs, the GPU speed, and various bus speeds can all be manipulated to see how changing them changes performance. Does the PS3 even need an eDRAM solution like ATI's? Does ATI really need the eDRAM solution, or could they have gotten away with less (like upping the main memory throughput)? I think these are the kinds of questions that such a system could help us answer. I wonder if something like this is already in place?

Nite_Hawk
 
nelg - I agree with your guess, which is why I'm dubious that all 48 ALUs are running the same instruction.

On the other hand, if all 48 ALUs are running different instructions, that necessitates a huge amount of instruction decode logic across the whole GPU, whereas in previous designs this logic occurred relatively few times (once per quad in the pixel shaders, and once per vertex pipeline (?)). It also increases the amount of program counter logic (every ALU requires a private counter) and means that the shader state memory has to be partitioned per ALU, too, since register accesses become incoherent.

In other words, 48 ALUs with no grouping into quads is really messy.

Perhaps it's 16 ALUs per group, so there are three different threads concurrently executing on 16 ALUs each. Sounds like a reasonable compromise to me, but still with a high branch misprediction cost.

So this matter of SIMD versus n-way MIMD is intriguing...

Jawed
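
For the grouping trade-off being mused about here, a small sketch of how branch incoherence cost scales with SIMD group size; the branch probability is a made-up illustrative value and the model is generic, not R500 behaviour.

```python
# Generic SIMD-width trade-off: wider groups need less decode and
# program-counter logic, but when any lane in a group takes a rare
# branch the whole group executes both paths.  The 5% branch
# probability is made up for illustration, not R500 behaviour.
def divergence_overhead(group_size, p_taken=0.05):
    """Expected work inflation from one rare branch per shader."""
    p_coherent = p_taken**group_size + (1 - p_taken)**group_size
    return 1 + (1 - p_coherent)   # diverged groups run both paths

for group in (4, 16, 48):
    print(f"group of {group:2d} ALUs: ~{divergence_overhead(group):.2f}x branch work")
```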
 