Where did you get the info about NV40 only processing 1 shader per clock over its 4 quads? I had understood that each quad processed its own discrete set of pixels, but that the quads were assigned a group of pixels on a case-by-case basis from a triangle, unlike R3/4xx, which sets up its pixel shaders with tiles.

Jawed said:
nAo said:
Many GPU architectures can be seen as (non-symmetric) multiprocessor architectures:
Nvidia calls NV40 a 3-processor architecture: vertex shaders + pixel shaders + ROPs.
The ROPs even run at memory clock, not GPU core clock, and it's a given that the ROPs have a local store/cache that holds some pixel tiles.
You can go further and describe each quad-pipeline as a processor core.
What's interesting is that in R300 and up, the quad-pipelines are each able to run a different shader.
In NV40 it seems that the same shader is running on all pipelines.
So R420 is, for example, 4-way MIMD, whereas NV40 is 16-way SIMD.
I think the vertex pipes in both architectures are purely independent and parallel.
Hope I've got that all correct - wouldn't want to start a second off-forum flame-war in only two posts!
Jawed
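The MIMD-vs-SIMD distinction Jawed draws can be sketched with a toy scheduler. This is purely illustrative (the shader names and quad counts are made up, and real hardware scheduling is far more involved): a SIMD array must run the same shader across all its quad-pipelines in a given cycle, while MIMD quad-pipelines can each run a different shader, so mixed workloads pack more tightly.

```python
# Toy sketch (not real hardware behaviour) of SIMD vs MIMD quad scheduling.
# jobs maps a shader name to the number of quads of work it has pending.

def simd_cycles(jobs, num_quads):
    """All quad-pipelines must run the same shader per cycle, so work
    for different shaders cannot share a cycle."""
    cycles = 0
    for shader, pixel_quads in jobs.items():
        # ceil-divide this shader's quads over the whole array
        cycles += -(-pixel_quads // num_quads)
    return cycles

def mimd_cycles(jobs, num_quads):
    """Each quad-pipeline can run its own shader, so work from
    different shaders packs together freely."""
    total_quads = sum(jobs.values())
    return -(-total_quads // num_quads)

jobs = {"skin": 6, "water": 3, "sky": 1}   # quads of work per shader
print(simd_cycles(jobs, 4))  # 2 + 1 + 1 = 4 cycles
print(mimd_cycles(jobs, 4))  # ceil(10 / 4) = 3 cycles
```

The gap grows as the number of distinct shaders in flight goes up, which is the intuition behind calling the R420 arrangement more flexible.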
Jawed said:
Nitehawk - I think the IHVs use simulators for this kind of stuff.
Jawed
Nite_Hawk said:
Do you think they use simulators to actually simulate the chip itself, or build statistical performance models? I'm curious, because if they actually simulated the chip itself I'd be worried that the underlying system would color the performance of the simulated part.
Nite_Hawk
Do you know how a batch is constructed?

nAo said:
NV40 pixel pipelines work on batches of ~1000 pixels. I doubt all the quads are working on the same triangle; performance on small triangles would be quite poor.
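Some back-of-the-envelope arithmetic shows why nAo doubts the quads all work on one triangle. The ~1000-pixel batch size is from his post; the 16 pipelines (4 quads of 4 pixels) are the commonly cited NV40 configuration, and the 1024/20-pixel figures are assumptions for illustration.

```python
# Illustrative arithmetic only; batch size from the quote above,
# pipeline count is the commonly cited NV40 figure.
pixels_per_batch = 1024   # "~1000 pixels", rounded for clean division
pipes = 16                # 4 quads x 4 pixels per quad

clocks = pixels_per_batch // pipes
print(clocks)             # 64 clocks for the pipes to consume one batch

# If a batch could only hold pixels from one triangle, a small
# 20-pixel triangle would leave nearly the whole batch empty:
small_tri = 20
waste = 1 - small_tri / pixels_per_batch
print(f"{waste:.0%}")     # ~98% of the batch's pixel slots unused
```

Hence the expectation that batches are filled from multiple triangles, or small-triangle performance would collapse.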
So what you're saying is that all existing examples of what you call "internalized bandwidth" aren't valid to count... well... because... well, because... we just don't count them...

blakjedi said:
For example
It would require external bandwidth if it didn't have local storage. The introduction of local storage means we don't count the bandwidth the processors need, though they are consuming data at however many bytes/s. Why should that be different for GPUs?

blakjedi said:
The assessment of counting the internal memory of the SPUs doesn't work here because it is CPU work that normally would not require the use of external bandwidth under any circumstances.
I would agree. In such cases [eDRAM -> Logic], why do you count eDRAM and Logic as two separate functional units?

When counting bandwidth, is it reasonable to separate bandwidth into functional units?
Jawed said:
Nite_Hawk said:
Do you think they use simulators to actually simulate the chip itself, or build statistical performance models? I'm curious, because if they actually simulated the chip itself I'd be worried that the underlying system would color the performance of the simulated part.
Nite_Hawk
http://3dcenter.org/artikel/2005/03-31_a_english.php
At several stages during product development, performance is measured on both software and hardware simulators, to make sure we are on the right track. As soon as we have chips, we start measuring the real performance of the part. Often, at this point, we are not certain of the final clocks, so we explore performance results across a variety of clock configurations.
Jawed
Shifty Geezer said:
If you count hops, why does the backbuffer processing on local storage count as a hop? If that is a hop, why isn't it a hop from level 1 cache to CPU logic?
blakjedi said:
Think of it like this: if there was a small 10MB cache separate from GPU, CPU, and Main RAM used to do this work, you WOULD count the bandwidth it took to access it... because it encapsulates "a hop" in the traditional sense.
Laa-Yosh said:
Some thoughts...

People seem to try to dismiss the advantages of Xenos's eDRAM just because an MS PR person made the wrong choice of counting it into the system bandwidth. Yeah, he was wrong - so does that suddenly remove this as an advantage?

Summing up bandwidth in the way presented here is obviously not the ideal approach, so why beat a dead horse from both sides? Stop it and move on to more interesting topics.

I'm sure the console devs here have ideas about average bandwidth requirements for actual in-game graphics. Like, we probably have such an amount of opaque overdraw, and such an amount of transparent overdraw, and so on. We choose a resolution like 720p 2xAA to make the field as even as possible, and calculate the probable backbuffer/framebuffer bandwidth utilization for the two systems. We'll then see how much the RSX has to spend from its bandwidth, and how much is left; and how that actually compares to the amount of traffic that Xenos needs to copy the backbuffer out to the main memory.

Then we can repeat for other resolutions like 1080i/p, move up to 4xAA and so on... this would be a lot more profitable for the forum than the flame war that you seem to get into about opinions.
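The exercise Laa-Yosh proposes can be sketched as a simple calculation. Every workload figure below (overdraw factors, bytes per sample, frame rate, the crude one-read-plus-one-write-per-layer cost model) is an assumed placeholder, not a measurement; the point is the shape of the estimate, which can then be rerun for 1080i/p, 4xAA, and so on by changing the inputs.

```python
# Minimal sketch of a backbuffer-bandwidth estimate for 720p 2xAA.
# All workload figures are assumptions chosen for illustration.
width, height  = 1280, 720
aa_samples     = 2        # 2xAA
bytes_per_samp = 8        # 4B colour + 4B Z per sample (assumption)
opaque_od      = 2.5      # avg opaque overdraw (assumption)
transp_od      = 1.5      # avg transparent overdraw (assumption)
fps            = 60

samples = width * height * aa_samples

# Crude model: charge one read + one write of colour+Z per overdraw
# layer, opaque and transparent alike.
bytes_per_frame = samples * bytes_per_samp * 2 * (opaque_od + transp_od)
gb_per_s = bytes_per_frame * fps / 1e9

print(f"{gb_per_s:.1f} GB/s")  # ~7.1 GB/s under these assumptions
```

With numbers like these one could then compare the slice taken out of RSX's external bandwidth against the cost of Xenos copying the finished backbuffer out of eDRAM to main memory.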
Unknown Soldier said:
I've asked Dave to ask if 3Dc or higher is used in the R500. I've also asked for clarification on whether Fast14 was used in the process for the Xbox2.
How does PowerVR handle triangles that cross tile boundaries? I would imagine they are shaded once for each tile and clipped.

Xmas said:
I'm slowly starting to view R500 a bit as a hybrid IMR/TBDR of some kind. Sure, it doesn't have units to do Z layout of one tile while shading another. And it can't remove opaque overdraw completely, or do order-independent transparency. But it can do a very fast Z-first pass and then remove most of the opaque overdraw (limited by hierarchical-Z granularity), do blending and AA without apparent cost, and it renders into a large "on-chip" tile. Though, we don't know how it handles the "binning" yet, i.e. how it handles triangles that cross tile borders, whether they're going through the VS again or not, etc.
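The "binning" step Xmas mentions can be sketched in a few lines. This is a generic bounding-box binner, not a description of any specific chip (tile size and triangles are made up): a triangle whose bounds span several screen tiles is simply listed in each of them, so a border-crossing triangle gets visited once per tile it touches, matching the "shaded once per tile" intuition in the question.

```python
# Toy bounding-box binning: each triangle is appended to the bin of
# every screen tile its bounding box overlaps. Tile size is an
# assumption for illustration.
TILE = 32  # tile edge in pixels

def bin_triangle(tri, bins):
    """tri is a tuple of three (x, y) vertices; bins maps a tile
    coordinate (tx, ty) to the list of triangles touching it."""
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    tx0, tx1 = min(xs) // TILE, max(xs) // TILE
    ty0, ty1 = min(ys) // TILE, max(ys) // TILE
    for ty in range(ty0, ty1 + 1):
        for tx in range(tx0, tx1 + 1):
            bins.setdefault((tx, ty), []).append(tri)

bins = {}
bin_triangle(((10, 10), (20, 12), (15, 25)), bins)  # fits in one tile
bin_triangle(((30, 30), (70, 35), (40, 60)), bins)  # crosses borders
print(len(bins))  # 6 distinct tiles are touched in total
```

A real binner would also test the triangle's edges against each tile (a bounding box over-bins thin diagonal triangles), but the per-tile revisit cost it implies is the same.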
jvd said:
Where are we getting the eDRAM-to-Xenos bandwidth from? I know the eDRAM to the rest of the eDRAM chip's logic is 256 GB/s. But where did the figure for eDRAM to Xenos come from?