Clock for Clock

Geo

So, it's been roughly five months since the introduction of X800 and six months since 6800. Do we understand yet, at a detailed level, why NV gets better performance than ATI clock-for-clock this generation?
 
Here is a recap of NV40's architectural strengths relative to R420 with regard to pixel processing/shading abilities.

I believe NV40 also sports a shader scheduler which dynamically arranges instructions within the hardware itself, allowing it to acquire a level of independence from the instruction compiler. Here is an excerpt taken from the Extremetech NV40 preview:
One of the biggest issues confronting the designers of a super-wide architecture like the one found in GeForce 6800 Ultra is to keep all of those vertex and pixel shader units busy. That job falls to the 6800's Shader Instruction Dispatcher, which looks at the incoming compiled shader code and allocates registers and processor units to best handle the execution of that code.

When asked about this, Tony Tamasi noted that the Shader Instruction Dispatcher:

…"manages scheduling of the shader at a hardware level. This unit opportunistically schedules work on available execution units, in concert with, and in fact beyond, what the compiler is capable of doing. It performs instruction decoding, scheduling, and dispatch at the hardware level. It can opportunistically book all available resources, including handling branch processing and register utilization."
The scheduler, in conjunction with the superscalar/parallel abilities of its pipelines, should give a good indication of why it is efficient per clock. I am not aware of dual-issue-capable pipelines or a dynamic instruction scheduler in R420, since no such capabilities have been mentioned.
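As a concrete illustration of what opportunistic, hardware-level dual issue might look like, here is a minimal sketch of a greedy two-slot scheduler. The instruction format, register names, and pairing rule are all invented for illustration; this is not a description of NV40's actual dispatcher.

```python
# Minimal sketch of opportunistic dual-issue scheduling (illustrative only).
# Each cycle issues one instruction in order, then tries to co-issue one later
# instruction that has no true dependence on anything still ahead of it.

def depends_on(instr, earlier):
    """True if instr reads the register that 'earlier' writes (true dependence only)."""
    return earlier["dst"] in instr["src"]

def schedule(instrs):
    """Greedily pack independent instructions into two issue slots per cycle."""
    cycles = []
    pending = list(instrs)
    while pending:
        first = pending.pop(0)
        slot = [first]
        for i, cand in enumerate(pending):
            blockers = pending[:i] + [first]
            if not any(depends_on(cand, b) for b in blockers):
                slot.append(pending.pop(i))   # co-issue this independent op
                break
        cycles.append(slot)
    return cycles

# Tiny example shader: a dependent MUL/ADD/MAD chain plus one independent RCP.
shader = [
    {"op": "MUL", "dst": "r0", "src": ["v0", "c0"]},
    {"op": "ADD", "dst": "r1", "src": ["r0", "c1"]},        # depends on r0
    {"op": "RCP", "dst": "r2", "src": ["v1"]},              # independent
    {"op": "MAD", "dst": "r3", "src": ["r1", "r2", "c2"]},
]

for n, cycle in enumerate(schedule(shader)):
    print(f"cycle {n}: " + " + ".join(i["op"] for i in cycle))
# -> cycle 0: MUL + RCP, cycle 1: ADD, cycle 2: MAD (3 cycles instead of 4)
```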
 
Interesting. Thanks. I don't have too much trouble believing that when you get up to 16 pipes/6 VS, better scheduling, particularly at the hardware level, could play a significant part. Though you'd think ATI would have had a leg up there, as they had more pipes last generation, and hence more time, opportunity, and incentive to shine in that area.
 
It's kind of obvious ATi more or less passed on new features this gen in favor of maximizing profits/minimizing engineering overhead by riding the R300 wave, and honestly they could get away with it. But I expect to see at least a (better) hardware scheduler with their true next gen (R5x0), the chips they've apparently been focusing their R&D on.
 
geo said:
Interesting. Thanks. I don't have too much trouble believing that when you get up to 16 pipes/6 VS, better scheduling, particularly at the hardware level, could play a significant part.
No, I think it has more to do with the number of functional units per pipeline than with the number of pipelines. It's pretty easy to parallelize pixels, but it's not so easy to keep all the units in a single pipeline active at the same time.
 
I'd say that part of the reasoning is the split. In raw instructions per pipeline there probably isn't much difference between the two; however, the instruction distribution is different. R300 has one primary ALU with the full instruction set and one small ALU with a very limited instruction set, whereas NV40 seems to have taken NV3x's primary ALU and distributed the instructions between its two units (with one(?) duplicated), which gives more opportunities to execute two instructions per cycle (even before we get to co-issued scalar ops, free FP16 instructions, etc.).
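As a toy illustration of why the distribution matters, here is a cycle-count comparison between a "full ALU + limited mini ALU" arrangement and two ALUs with the instruction set split between them. The per-ALU instruction sets and the sample shader are assumptions, and data dependencies are ignored for simplicity.

```python
# Toy model of the ALU-split argument (per-ALU instruction sets are invented;
# real R3xx/NV4x capabilities differ in detail, and dependencies are ignored).

def cycles(instrs, alu0_ops, alu1_ops):
    """Greedy in-order pairing: each cycle issues at most one op per ALU."""
    count, i = 0, 0
    while i < len(instrs):
        a = instrs[i]
        b = instrs[i + 1] if i + 1 < len(instrs) else None
        if b and ((a in alu0_ops and b in alu1_ops) or
                  (a in alu1_ops and b in alu0_ops)):
            i += 2          # co-issued pair
        else:
            i += 1          # single issue
        count += 1
    return count

shader = ["DP3", "MAD", "MUL", "DP3", "ADD", "MAD"]

# "Full + mini": one ALU with everything, one that only handles ADD/MUL-type ops.
full_plus_mini = cycles(shader, {"MAD", "MUL", "DP3", "ADD"}, {"ADD", "MUL"})

# "Split": the instruction set distributed across both units, with some overlap.
split_across_two = cycles(shader, {"MAD", "ADD", "MUL"}, {"MUL", "DP3", "ADD"})

print("full+mini:", full_plus_mini, "cycles   split:", split_across_two, "cycles")
# -> full+mini: 4 cycles   split: 3 cycles
```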
 
IIRC, it has been shown that ATI R3xx chips subdivide the screen into 16x16-pixel tiles, with each quad of pixel pipelines able to write only to a fixed set of tiles; if this is still the case in R420, then there is a potential source of inefficiency if the load balancing between the tile sets isn't 100% perfect. Also IIRC, Nvidia has stated that the GeForce 6 chips do not have such a limitation - they just hand pixel quads from scan conversion to the pipelines as the pipelines become available to receive them, without binding specific screen locations to specific pipelines.

If this is in fact true, then the Nvidia chip should gain a bit of efficiency over the ATI chip, perhaps 3-10% or so.
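Here is a rough sketch of the fixed tile-to-quad binding being described, assuming 16x16 tiles and four quad pipelines; the ownership pattern and the test rectangle are made up purely to show how a screen-space object can land unevenly on the four quads.

```python
# Hypothetical fixed tile-to-quad assignment (pattern assumed for illustration).
TILE = 16           # assumed 16x16-pixel tiles
NUM_QUADS = 4       # four quad pipelines, as on R420/NV40-class parts

def owning_quad(x, y):
    """Static mapping from a pixel position to the quad that owns its tile."""
    tx, ty = x // TILE, y // TILE
    return (tx + 2 * ty) % NUM_QUADS   # simple repeating pattern (assumed)

# Count how many pixels of a 100x37 rectangle at (5, 5) land on each quad.
per_quad = [0] * NUM_QUADS
for y in range(5, 5 + 37):
    for x in range(5, 5 + 100):
        per_quad[owning_quad(x, y)] += 1

print(per_quad)   # an uneven split; the busiest quad bounds throughput
```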
 
The tiles assigned to each quad are going to be close to one another (i.e. next to each other), so I doubt that the load required for them is going to be different enough to cause an issue - you may have some variation by the end of the frame, but I doubt it would be anything near 10%.
 
While the total number of pixels rendered per quad isn't going to differ a great deal (~1-2%) over the frame as a whole, there will be large local variations along e.g. the edge of an object, where one quad may suddenly have to deal with 100s of pixels more from the object than the neighboring quads. Unless there is buffering in the system to allow the 4 quads to operate deeply out of sync (hundreds of pixels, dozens of polygons, or perhaps even more) relative to each other, these local variations will have a much greater impact on effective performance than the total number of pixels for a frame as a whole might indicate.
 
I'm not sure whether there is anything tying the quads to one another - it's not like they have shared texture or instruction caches, as these are individual to the quad pipelines (they aren't using an L1/L2 texture cache, for instance). The only thing that I can think of that may stall quads working on different areas would be the quantity of cache between the setup and dispatch where a triangle is spanning multiple tiles.
 
DaveBaumann said:
I'm not sure whether there is anything tying the quads to one another - it's not like they have shared texture or instruction caches, as these are individual to the quad pipelines (they aren't using an L1/L2 texture cache, for instance).
Then you will end up fetching texels near tile borders multiple times, once for each quad, consuming extra bandwidth.
The only thing that I can think of that may stall quads working on different areas would be the quantity of cache between the setup and dispatch where a triangle is spanning multiple tiles.
Unless the average triangle is very small, most triangles will span multiple tiles, with different pixel counts in different tiles. For good load balancing, you may need to buffer quite a few such triangles, which gets expensive after a while.
 
arjan de lumens said:
Then you will end up fetching texels near tile borders multiple times, once for each quad, consuming extra bandwidth.

I chatted with Sireric about that a while back; overall, they felt there weren't any gains/losses between the two systems.

Unless the average triangle is very small, most triangles will span multiple tiles, with different pixel counts in different tiles. For good load balancing, you may need to buffer quite a few such triangles, which gets expensive after a while.

Yes, and given the frequency with which triangles will span multiple tiles, I should imagine that this is accounted for in some manner.
 
arjan de lumens said:
DaveBaumann said:
I'm not sure whether there is anything tying the quads to one another - it's not like they have shared texture or instruction caches, as these are individual to the quad pipelines (they aren't using an L1/L2 texture cache, for instance).
Then you will end up fetching texels near tile borders multiple times, once for each quad, consuming extra bandwidth.
I'm assuming we're still talking about 16x16 pixel tiles? In this case, border pixels are a relatively small percentage of the total pixels for the quad (64 border pixels vs. 256 total pixels in the quad, assuming the triangle covers an entire tile, and even then you probably won't have to re-fetch all of the texels).
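For reference, the same border-pixel arithmetic for a few tile sizes, counting roughly 4·N edge pixels per NxN tile (corners double-counted, as in the 64-vs-256 figure above):

```python
# Border pixels as a fraction of an NxN tile, using the rough 4*N edge count.
for n in (8, 16, 32):
    border = 4 * n
    print(f"{n}x{n} tile: {border}/{n * n} = {border / (n * n):.1%} border pixels")
# -> 50.0% at 8x8, 25.0% at 16x16, 12.5% at 32x32
```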

Now, I really don't know if ATI does this, but there is a potentially large benefit from dividing the screen up into tiles. That is, if an architecture not only divided triangles up into tiles, but actually sent a few subsequent triangles through the pipeline into this cache before rendering each tile, there could be a large savings in performance for small (or thin) triangles. I expect architectures will want to start doing this sort of thing as triangle sizes get small, as rendering purely by triangle will require much larger amounts of cache for optimal memory accesses than if multiple triangles for one tile are rendered before moving on to the next tile.
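A hypothetical sketch of that per-tile batching idea: bin a handful of triangles into the tiles their bounding boxes touch, then shade each tile's batch together so texture and framebuffer accesses stay local. The binning granularity and data layout are assumptions, not a claim about any shipping chip.

```python
# Bounding-box binning of triangles into 16x16 tiles (no real rasterizer here).
from collections import defaultdict

TILE = 16

def bin_triangles(triangles):
    """Map tile coordinates -> triangles whose bounding box overlaps that tile."""
    bins = defaultdict(list)
    for tri in triangles:
        xs = [v[0] for v in tri]
        ys = [v[1] for v in tri]
        for ty in range(min(ys) // TILE, max(ys) // TILE + 1):
            for tx in range(min(xs) // TILE, max(xs) // TILE + 1):
                bins[(tx, ty)].append(tri)
    return bins

tris = [((2, 2), (60, 5), (30, 40)),      # spans several tiles
        ((70, 70), (74, 71), (72, 75))]   # small triangle, single tile

for tile, batch in sorted(bin_triangles(tris).items()):
    print(tile, len(batch), "triangle(s)")   # shade each tile's batch together
```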

Unless the average triangle is very small, most triangles will span multiple tiles, with different pixel counts in different tiles. For good load balancing, you may need to buffer quite a few such triangles, which gets expensive after a while.
Depends on the type of geometry we're talking about. With world geometry or shadow volumes, I can believe this. But with skinned meshes it's probably not true.
 
The R3xx and the R4xx have a rather interesting way of tiling things. Our setup unit sorts primitives into tiles based on their area coverage. Some primitives fall into one tile, some into a few, some cover lots. Each of our backend pixel pipes is given tiles of work. The tiles themselves are programmable in size (well, powers of 2), but, surprisingly, we haven't found that changing their size changes performance that much (within reason). Most likely that's due to the fact that with high-res displays, most primitives are large. There is a sweet spot in performance at 16, and that hasn't changed in some time. Even the current X800 uses 16, though I think we need to revisit that at some point in the future. Possibly, on a per-application basis, different tile sizes would benefit things. On our long list of things to investigate.

Anyway, each pipe has huge load-balancing FIFOs on its inputs that match up to the tiles it owns. Each pipe is fully MIMD and can operate on different polygons; in fact, one can be hundreds of polygons off from the others. The downside of that is the memory coherence of the different pipes. Increasing tile size would improve this, but would also require larger load-balancing FIFOs. Our current setup seems reasonably optimal, but reviewing that, performance-wise, is on the list of things to do at some point. We've artificially lowered the size of our load-balancing FIFOs and never noticed a performance difference, so we feel that, for current apps at least, we are well over-designed.

In general, we have no issues keeping all our units busy, given current polygons. I could imagine that if you did single-pixel triangles in one tile over and over, performance could drop due to the tiling, but memory efficiency would shoot up, so it's unclear that overall performance would be hurt. The distribution of load across all these tiles is pretty much ideal for all the cases we've tested. Super-tiling is built on top of this to distribute work across multiple chips.

As well, just like other vendors, we have advanced sequencers that distribute ALU workload to our units, allocate registers, and sequence all the operations that need to be done, in a dynamic way. That's really a basic requirement of doing shader processing. This is rarely the issue for performance.

Performance is still very much texture-fetch bound (cache efficiency, memory efficiency, filter types) in modern apps, as well as partially ALU/register-allocation bound. There are huge performance differences possible depending on how you deal with texturing and texture fetches. Even Shadermark, if I recall correctly, ends up being texture bound in many of its cases, and it's very hard to make any assumptions about ALU performance from it. I know we've spent plenty of time in our compiler generating various forms of a shader, only to discover that ALU and register counts don't matter as much as texture organization. There are no clear generalizable solutions. Work goes on.
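To tie the tiling and FIFO description together, here is a toy model of that dispatch scheme, assuming a static tile-ownership pattern and simple per-pipe queues; the work-unit granularity and ownership function are illustrative guesses, not R3xx/R4xx internals.

```python
# Toy model: setup bins primitive coverage by tile, and each pixel pipe owns a
# fixed set of tiles, draining its own load-balancing FIFO independently (MIMD).
from collections import deque

TILE, NUM_PIPES = 16, 4

def owner(tx, ty):
    return (tx + ty) % NUM_PIPES                # assumed ownership pattern

fifos = [deque() for _ in range(NUM_PIPES)]

# Setup: push (primitive id, tile) work items into the owning pipe's FIFO.
coverage = [(0, (0, 0)), (0, (1, 0)), (0, (2, 0)),   # prim 0 spans three tiles
            (1, (5, 3)), (2, (5, 3)), (3, (9, 9))]   # prims 1-3
for prim, (tx, ty) in coverage:
    fifos[owner(tx, ty)].append((prim, (tx, ty)))

# Each pipe drains its FIFO independently; one pipe may be many polygons ahead
# of another, bounded only by FIFO depth.
for p, fifo in enumerate(fifos):
    print(f"pipe {p}:", list(fifo))
```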
 
arjan de lumens said:
While the total number of pixels rendered per quad isn't going to differ a great deal (~1-2%) over the frame as a whole, there will be large local variations along e.g. the edge of an object, where one quad may suddenly have to deal with 100s of pixels more from the object than the neighboring quads. Unless there is buffering in the system to allow the 4 quads to operate deeply out of sync (hundreds of pixels, dozens of polygons, or perhaps even more) relative to each other, these local variations will have a much greater impact on effective performance than the total number of pixels for a frame as a whole might indicate.

Beyond3d loves its synthetic benchmarks. Perhaps someone needs to write a benchmark capable of putting the described scenario to the test (shouldn't be too difficult - just create a scene that has varying geometry and texture loads in different, adjustable areas of the screen).
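One possible shape for such a benchmark, sketched below: tessellate adjustable screen regions at different triangle densities so per-tile load can be varied (the rendering/timing harness and the texture-load variation are left out; everything here is an assumption about how one might build it).

```python
# Generate screen-space triangles with adjustable density per region.
def region_grid(x0, y0, width, height, step):
    """Tessellate a screen-space rectangle into right triangles of size 'step'."""
    tris = []
    for y in range(y0, y0 + height, step):
        for x in range(x0, x0 + width, step):
            tris.append(((x, y), (x + step, y), (x, y + step)))
            tris.append(((x + step, y), (x + step, y + step), (x, y + step)))
    return tris

# Left half: dense 4-pixel steps; right half: sparse 64-pixel steps.
scene = region_grid(0, 0, 512, 512, 4) + region_grid(512, 0, 512, 512, 64)
print(len(scene), "triangles")   # heavy geometry load on one side of the screen
```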
 
sireric said:
The R3xx and the R4xx have a rather interesting way of tiling things. Our setup unit sorts primitives into tiles based on their area coverage. Some primitives fall into one tile, some into a few, some cover lots. Each of our backend pixel pipes is given tiles of work. The tiles themselves are programmable in size (well, powers of 2), but, surprisingly, we haven't found that changing their size changes performance that much (within reason). Most likely that's due to the fact that with high-res displays, most primitives are large. There is a sweet spot in performance at 16, and that hasn't changed in some time. Even the current X800 uses 16, though I think we need to revisit that at some point in the future. Possibly, on a per-application basis, different tile sizes would benefit things. On our long list of things to investigate.

Anyway, each pipe has huge load-balancing FIFOs on its inputs that match up to the tiles it owns. Each pipe is fully MIMD and can operate on different polygons; in fact, one can be hundreds of polygons off from the others. The downside of that is the memory coherence of the different pipes. Increasing tile size would improve this, but would also require larger load-balancing FIFOs. Our current setup seems reasonably optimal, but reviewing that, performance-wise, is on the list of things to do at some point. We've artificially lowered the size of our load-balancing FIFOs and never noticed a performance difference, so we feel that, for current apps at least, we are well over-designed.

In general, we have no issues keeping all our units busy, given current polygons. I could imagine that if you did single-pixel triangles in one tile over and over, performance could drop due to the tiling, but memory efficiency would shoot up, so it's unclear that overall performance would be hurt. The distribution of load across all these tiles is pretty much ideal for all the cases we've tested. Super-tiling is built on top of this to distribute work across multiple chips.

As well, just like other vendors, we have advanced sequencers that distribute ALU workload to our units, allocate registers, and sequence all the operations that need to be done, in a dynamic way. That's really a basic requirement of doing shader processing. This is rarely the issue for performance.

Interesting stuff. I might be wrong, but the system R3xx and R4xx are using sounds pretty similar to what some other vendor planned (and even designed, but never got into full-scale production) one generation earlier. ;)

The funny thing is, this is not the first time I've discovered similarities between R3xx and the chip that sits on my bookshelf. Still, I think it's just coincidence that these two share so many of the same ideas. (There are differences too, like the whole memory subsystem being completely different, etc.)
 