NV40 vs R420 Extreme Pipelines

Rockster

Have you ever read a review of some new graphics technology and wondered how it compared to the previous generation, only to go back to the original review and find the topic wasn't covered?

Driving home I was thinking about Dave's NV4x preview, and how he went out of his way to explain how the pipelines work on quads and the inherent inefficiency that creates around polygon edges. Perhaps Dave included this point for comparison with other architectures. As polygons get smaller and counts go up, the number of unused pipelines goes up.

What if the R420 had pipelines that truly worked on a per-pixel level? What if the dispatcher could send three pixels from one polygon to three pipelines, two pixels from another to two other pipelines, and 11 pixels from a larger polygon to the remaining pipelines? It certainly seems like a good way to boost performance. Perhaps all the pipelines could share a single cache pool and/or pool of TMUs.

How much performance could you expect to gain? What are the potential drawbacks? Am I right? Doubt we could even get anyone who knows to post a smiley face.
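A rough back-of-envelope way to see the effect Rockster is describing: the toy Python sketch below (my own model, not any vendor's actual dispatch logic; the triangle sizes and counts are made up for illustration) rasterizes random triangles and compares the pixels actually covered against the four lanes charged for every 2x2 quad the triangle touches.

```python
import random

def edge(ax, ay, bx, by, px, py):
    # Signed area test: >= 0 means (px, py) lies on the left of edge a->b.
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def covered_pixels(tri, size=64):
    (ax, ay), (bx, by), (cx, cy) = tri
    area = edge(ax, ay, bx, by, cx, cy)
    if area == 0:
        return set()
    if area < 0:                          # force counter-clockwise winding
        (bx, by), (cx, cy) = (cx, cy), (bx, by)
    pixels = set()
    for y in range(size):
        for x in range(size):
            px, py = x + 0.5, y + 0.5     # test at the pixel center
            if (edge(ax, ay, bx, by, px, py) >= 0 and
                    edge(bx, by, cx, cy, px, py) >= 0 and
                    edge(cx, cy, ax, ay, px, py) >= 0):
                pixels.add((x, y))
    return pixels

def quad_lane_utilization(max_extent, trials=200):
    random.seed(0)
    shaded = charged = 0
    for _ in range(trials):
        ox = random.uniform(0, 60 - max_extent)
        oy = random.uniform(0, 60 - max_extent)
        tri = [(ox + random.uniform(0, max_extent),
                oy + random.uniform(0, max_extent)) for _ in range(3)]
        pix = covered_pixels(tri)
        quads = {(x // 2, y // 2) for x, y in pix}   # 2x2 screen-aligned quads touched
        shaded += len(pix)
        charged += 4 * len(quads)                    # every touched quad occupies all 4 lanes
    return shaded / charged if charged else 0.0

for extent in (32, 8, 2):
    print(f"triangles up to ~{extent}px across: "
          f"quad lane utilization {quad_lane_utilization(extent):.0%}")
```

As the triangles shrink toward quad size, the fraction of quad lanes doing useful work drops off; that gap is the headroom a true per-pixel dispatcher would be chasing.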
 
Rockster said:
What if the R420 had pipelines that truly worked on a per-pixel level? What if the dispatcher could send three pixels from one polygon to three pipelines, two pixels from another to two other pipelines, and 11 pixels from a larger polygon to the remaining pipelines? It certainly seems like a good way to boost performance. Perhaps all the pipelines could share a single cache pool and/or pool of TMUs.

How much performance could you expect to gain? What are the potential drawbacks? Am I right? Doubt we could even get anyone who knows to post a smiley face.
:D
 
I don't think that would help with performance, personally. I can see a number of benefits to processing in quads:

1. Texturing efficiency: by ensuring the same texture is being operated on by four pixels, texture cache becomes easier to handle.
2. Memory bandwidth efficiency: more pixels rendered in the same spatial location ensures more memory accesses in the same memory location. By accessing data in chunks, it becomes easier to fill the memory bus.
3. Cache efficiency: with only one set of pixel shader instructions required for four separate pixels, you don't need as much cache for optimal performance.

I'm not sure the increase in active pipelines you'd get from not having to operate on quads would help.
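To put a rough number on point 1, here is a small sketch (my own toy example, assuming bilinear filtering at roughly one texel per pixel; the coordinates are arbitrary) comparing the unique texels touched when four pixels are shaded as an aligned 2x2 quad versus as four unrelated pixels.

```python
import math

def bilinear_footprint(u, v):
    # A bilinear fetch reads the 2x2 block of texels surrounding the sample point.
    iu, iv = math.floor(u - 0.5), math.floor(v - 0.5)
    return {(iu, iv), (iu + 1, iv), (iu, iv + 1), (iu + 1, iv + 1)}

quad      = [(10.5, 10.5), (11.5, 10.5), (10.5, 11.5), (11.5, 11.5)]  # one screen-aligned quad
scattered = [(10.5, 10.5), (40.5, 17.5), (73.5, 52.5), (90.5, 88.5)]  # four unrelated pixels

for name, samples in (("quad", quad), ("scattered", scattered)):
    texels = set()
    for u, v in samples:
        texels |= bilinear_footprint(u, v)
    print(f"{name}: {len(texels)} unique texels fetched for 4 pixels")
```

The quad's neighbouring lanes share most of their texel fetches, which is what makes a small per-quad texture cache effective.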
 
It makes sense once the majority of triangles are on the order of a quad or a few quads in size. If you're drawing big polygons, only a small fraction of the pixels are on an edge, and therefore the efficiency loss is small. If you are drawing triangles on the order of 1 pixel, it's a big win.

You could probably get some statistics on this by taking current games, and measuring the percentage of edge pixels vs the total number of pixels modulo the overdraw.
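One hedged sketch of how such a measurement could work, assuming you can already extract the set of covered pixels per triangle from a rasterizer or capture tool (the helper below is hypothetical, and counting per triangle means overdraw is included, as noted above):

```python
from collections import Counter

def edge_pixel_share(triangles_pixels):
    # triangles_pixels: iterable of sets of (x, y) pixels covered by each triangle.
    # An "edge pixel" here is any pixel in a 2x2 quad that its triangle only partially fills.
    edge = total = 0
    for pixels in triangles_pixels:
        per_quad = Counter((x // 2, y // 2) for x, y in pixels)
        for count in per_quad.values():
            total += count
            if count < 4:
                edge += count
    return edge / total if total else 0.0

# e.g. one 1-pixel triangle plus one triangle filling a 4x4 block exactly:
tiny = {(5, 5)}
block = {(x, y) for x in range(4) for y in range(4)}
print(f"edge pixel share: {edge_pixel_share([tiny, block]):.0%}")
```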
 
If you are drawing triangles on the order of 1 pixel, it's a big win.

Once you get to that point consistently, wouldn't you just drop the pixel shader altogether?
 
>> 1. Texturing efficiency: by ensuring the same texture is being operated on by four pixels, texture cache becomes easier to handle.

What if the same texture cache is shared by all the pipelines?

>> 2. Memory bandwidth efficiency: more pixels rendered in the same spatial location ensures more memory accesses in the same memory location. By accessing data in chunks, it becomes easier to fill the memory bus.

I don't think I understand why this would be any different than the current case. Pixel and memory locations wouldn't be any different, only the number in flight. Are you saying it's better to leave the pipes idle?

>> 3. Cache efficiency: with only one set of pixel shader instructions required for four separate pixels, you don't need as much cache for optimal performance.

Why wouldn't this be the same as well? Rather than always having to send the same instruction to four pixels, you could send the same instructions to any number of pixels. Unless you're worried about needing to cache more than four instructions, in which case I agree that more cache is better.

Isn't the reason Intel and AMD spend all those transistors on branch prediction, out-of-order execution, large caches, etc. an effort to keep all the execution units busy? Surely there must be tangible benefits to be had there. I think Matrox estimated edge fragments composed 7-10% of screen space in the titles they evaluated back in 2002. Could newer games push 15-20%? Even a 10% clock-for-clock advantage would be noticeable.
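For what it's worth, a quick sanity check of that last estimate (my own simplification: assume the quoted percentages describe quads that straddle an edge, and that such quads average two live pixels out of four):

```python
def quad_overhead(edge_fraction, live_pixels=2):
    # Lane-work issued per useful pixel with quads, relative to perfect per-pixel packing.
    lanes = 4
    useful = (1 - edge_fraction) * lanes + edge_fraction * live_pixels
    return lanes / useful          # > 1.0 means quads issue extra (idle) lane-work

for edge_fraction in (0.10, 0.20):
    print(f"{edge_fraction:.0%} edge quads -> "
          f"~{quad_overhead(edge_fraction) - 1:.0%} extra lane-work")
```

Under those assumptions the reclaimable work is around 5% at 10% edge quads and around 11% at 20%, which is in the same ballpark as the clock-for-clock figure above.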
 
Rockster said:
>> 1. Texturing efficiency: by ensuring the same texture is being operated on by four pixels, texture cache becomes easier to handle.

What if the same texture cache is shared by all the pipelines?
Won't help if different pipelines are operating on different textures.

>> 2. Memory bandwidth efficiency: more pixels rendered in the same spatial location ensures more memory accesses in the same memory location. By accessing data in chunks, it becomes easier to fill the memory bus.

I don't think I understand why this would be any different than the current case. Pixel and memory locations wouldn't be any different, only the number in flight. Are you saying it's better to leave the pipes idle?
This is the current case. And yes, that's exactly what I'm saying. Now, as DemoCoder pointed out, this may not be the case for very long, but I claim that today, most of the screen is covered by relatively large triangles, and so the performance hit from having some pipelines inactive is smaller than the performance hit from losing memory bandwidth efficiency.
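A toy way to picture that bandwidth argument (my own illustrative numbers: 32-byte DRAM bursts, 4 bytes per pixel, and a framebuffer assumed to be laid out so one burst holds a 4x2-pixel tile):

```python
BURST_BYTES, PIXEL_BYTES = 32, 4   # assumed: one burst holds a 4x2-pixel tile

def bursts_touched(pixels):
    # Map each written pixel to the burst holding its 4x2 tile.
    return {(x // 4, y // 2) for x, y in pixels}

quad      = [(8, 8), (9, 8), (8, 9), (9, 9)]        # one spatially contiguous quad
scattered = [(8, 8), (40, 3), (17, 60), (90, 25)]   # four unrelated pixels

for name, pix in (("quad", quad), ("scattered", scattered)):
    n = len(bursts_touched(pix))
    useful = len(pix) * PIXEL_BYTES / (n * BURST_BYTES)
    print(f"{name}: {n} burst(s) touched, {useful:.0%} of the transferred bytes are useful")
```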

And don't forget that moving into the future, operating on quads allows architectures to approximate partial derivatives of various values. This is a required component of pixel shader 3.0, so I don't see quads going anywhere for a while.
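A minimal sketch of why quads make that cheap (a generic finite-difference model, not any particular chip's implementation): the per-pixel derivatives fall out of differencing the values computed by neighbouring lanes of the same 2x2 quad.

```python
def quad_derivatives(values):
    # values holds some shaded quantity (e.g. a texture coordinate) for the four
    # pixels of one quad, laid out as [[top-left, top-right], [bottom-left, bottom-right]].
    (tl, tr), (bl, br) = values
    ddx = tr - tl   # horizontal difference, shared by the pixels in that row
    ddy = bl - tl   # vertical difference, shared by the pixels in that column
    return ddx, ddy

# e.g. a texture coordinate u across one quad; these derivatives drive mip selection
u = [[0.250, 0.265],
     [0.252, 0.267]]
print(quad_derivatives(u))
```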
 
Won't help if different pipelines are operating on different textures.
I didn't explain that very well. There was some earlier discussion on grouping TMUs by quads to achieve single-cycle trilinear. I'm asking if it's possible to arbitrarily group TMUs, storing their samples in a unified cache from which all the pipelines could read. Normally the groupings would be quads, but they'd be capable of smaller groups, down to a single pixel. Sounds challenging, and like it increases the cache requirements, but that's what I was suggesting. Even the NV4x has two different levels of cache: an L2 shared by all pipelines, and an L1 per quad. My scenario would change the requirement from a separate L1 cache per quad to one per pipe.
And don't forget that moving into the future, operating on quads allows architectures to approximate partial derivatives of various values.
True and an excellent point.

What do you infer from the "extreme" pipes reference? Is the 'X' in X800 for extreme?
 
1)You can't have different textures in a single batch, let alone inside a single triangle. Batches are (or rather should be) relatively large. Texture cache efficiency shouldn't be much of a problem. Eg adjacent tris coming from a triangle strip have sufficient overlap in their texel fetch requirements, even though there's technically an edge between them. Same for framebuffer locality.

2)If lots of single-pixel triangles are rendered, you can rightfully call that an extreme load situation, and it's just to be expected that performance will have to take a dive. The potential performance loss is well bounded, with the worst case being 75%. That's not great, but it may just be something you can live with.
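A worked check of that bound: a triangle covering a single pixel still occupies a whole quad, so only one of its four lanes does useful work.

```python
covered, lanes = 1, 4   # one covered pixel in a 2x2 quad
print(f"worst-case lane waste: {(lanes - covered) / lanes:.0%}")   # three of four lanes sit idle
```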
 
I would equate the X, as in "Xtreme pipelines", with shader processing capabilities, or extreme shader processing per pipeline. 8)
 
I doubt that we will get games any time soon where every damn polygon is (less than) one pixel on screen. And with shrinking polygons we will probably see higher resolutions used, or even a return to supersampling. ;)
 
zeckensack said:
1)You can't have different textures in a single batch, let alone inside a single triangle. Batches are (or rather should be) relatively large. Texture cache efficiency shouldn't be much of a problem. Eg adjacent tris coming from a triangle strip have sufficient overlap in their texel fetch requirements, even though there's technically an edge between them. Same for framebuffer locality.

That's not actually correct. Ignoring the fact that dynamic flow control can now allow completely different texturing "paths" on a per-pixel basis, the use of texture pages means that within a single batch a large range of unique textures can be addressed without changing the "base" texture.

John.
 
I would equate the X, as in "Xtreme pipelines", with shader processing capabilities, or extreme shader processing per pipeline.
Are you suggesting 16x2, or more than 2 ALUs per pipe? Wouldn't there be more to gain by increasing the processing width rather than the depth, especially considering the dominance of short shaders today? Or are you referring to features? Which makes sense.
 
Of course what you need is a pipeline that doesn't need to worry about spatial alignment when writing a pixel out to memory, because it has an effective write cache efficiency of 100% in that department. That way the only thing that not rendering with pixel quads will affect is texture cache efficiency. But that could be worked around by precaching all the texels required in the same spatial location as said triangle(s). Say, precache all the texels required in a 32x16 block around the current triangle, for example :rolleyes:
 
JohnH said:
That's not actually correct. Ignoring the fact that dynamic flow control can now allow completely different texturing "paths" on a per-pixel basis, the use of texture pages means that within a single batch a large range of unique textures can be addressed without changing the "base" texture.

John.
Why would a mesh rendered with packed textures use texture cache differently to its properly rendered version? The visual goal is the same, after all. Packing won't allow you to switch subtextures in the middle of a triangle, only at the edges.
PTs will cost you a lot more vertex processing to boot, so whatever efficiency loss there is due to a couple of mismatching edge texels will likely be masked away by the increased vertex load.
Frankly, packed textures are so full of issues that I don't think anyone should even consider using them.

Re dynamic branching, it doesn't appear to perform all that well anyway. 9 clock cycles for a minimal data-dependent branch, if I read that correctly. I'd be surprised if branching gets fast enough to be a viable optimization technique ...
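A rough break-even model for that 9-cycle figure (my own simplification; real costs depend on batch size and coherence): branching around work only wins if the cycles skipped, weighted by how often the early-out is actually taken, exceed the branch overhead.

```python
def branch_wins(skipped_cycles, taken_fraction, overhead=9):
    # Expected cycles saved by branching around work vs. the fixed branch cost.
    return taken_fraction * skipped_cycles > overhead

for skipped in (8, 20, 50):
    for taken in (0.3, 0.9):
        verdict = "worth it" if branch_wins(skipped, taken) else "not worth it"
        print(f"skip {skipped:2d} cycles, taken {taken:.0%} of the time: {verdict}")
```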
 
Concerning this test, we don't know why it didn't go that well. It could be:
- 6800 architecture
- drivers (as stated in the article)
- program
- a combination of the above.
 
I have a suspicion (due to rumors on R42x stating it will have 6 vertex pipelines - the same number NV40 features BTW) that the new part of R420 is likely to be vertex shader 3.0 support (whether full or not I'm not sure). This probably explains the "extreme pipeline" talk.
 
DemoCoder said:
If you are drawing triangles on the order of 1 pixel, it's a big win.
Not really, because triangle setup becomes the bottleneck, except on long shaders. Based on its peak performance spec, R300 can only process one triangle per clock, so for pixel-sized triangles it's likely those pixel/quad pipelines are idle much of the time regardless of how they're organized.
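The arithmetic behind that point, with illustrative figures (the one-triangle-per-clock setup rate is from the post above; the 16-pipe count is just an assumed example):

```python
pixel_pipes    = 16   # assumed pipe count for illustration
tris_per_clock = 1    # setup rate quoted above
pixels_per_tri = 1    # pixel-sized triangles
fed = tris_per_clock * pixels_per_tri
print(f"best-case pipe utilization: {fed / pixel_pipes:.0%}")   # setup-limited, however the pipes are grouped
```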
 
radar1200gs said:
the new part of R420 is likely to be vertex shader 3.0 support (whether full or not I'm not sure). This probably explains the "extreme pipeline" talk.
Around the same time the first "R420 won't have PS 3.0 support!" rumors started circulating, there was a lot of talk that ATi would have partial PS 3.0 support, but wouldn't have branching.

From my limited understanding, ATi is going to be supporting the most important features of PS 3.0 that will be used in games first, but they won't be supporting the full standard. (I just never wanted to write that publicly, since it sounds SOOOO much like what the nVidia enthusiasts were saying last round to justify FP16 that it even sounds a bit hypocritical to hear myself saying it, but that's what I think.)

I didn't know if you were aware or not, so please don't think I'm flaming you, as I'm not; I'm only trying to pass along some info. :)
 