The scheduler, in conjunction with the superscalar/parallel abilities of its pipelines, should give a good indication as to why it is efficient per clock. I am not aware of dual-issue capable pipelines or a dynamic instruction scheduler in R420, since such capabilities have not been mentioned.

One of the biggest issues confronting the designers of a super-wide architecture like the one found in GeForce 6800 Ultra is keeping all of those vertex and pixel shader units busy. That job falls to the 6800's Shader Instruction Dispatcher, which looks at the incoming compiled shader code and allocates registers and processor units to best handle the execution of that code.
When asked about this, Tony Tamasi noted that the Shader Instruction Dispatcher:
…"manages scheduling of the shader at a hardware level. This unit opportunistically schedules work on available execution units, in concert with, and in fact beyond, what the compiler is capable of doing. It performs instruction decoding, scheduling, and dispatch at the hardware level. It can opportunistically book all available resources, including handling branch processing and register utilization."
geo said: Interesting. Thanks. I don't have too much trouble believing that when you get up to 16 pipes/6 VS, better scheduling, particularly at the hardware level, could play a significant part.

No, I think it has more to do with the number of functional units per pipeline than with the number of pipelines. It's pretty easy to parallelize pixels, but it's not so easy to keep all units in a single pipeline active at the same time.
DaveBaumann said: I'm not sure whether there is anything tying the quads to one another - it's not like they have shared texture or instruction caches, as these are individual to the quad pipelines (they aren't using an L1/L2 texture cache, for instance).

Then you will end up fetching texels near tile borders multiple times, once for each quad, consuming extra bandwidth.
DaveBaumann said: The only thing that I can think of that may stall quads working on different areas would be the quantity of cache between the setup and dispatch where a triangle is spanning multiple tiles.

Unless the average triangle is very small, most triangles will span multiple tiles, with different pixel counts in different tiles. For good load balancing, you may need to buffer quite a few such triangles, which gets expensive after a while.
arjan de lumens said: Then you will end up fetching texels near tile borders multiple times, once for each quad, consuming extra bandwidth. Unless the average triangle is very small, most triangles will span multiple tiles, with different pixel counts in different tiles. For good load balancing, you may need to buffer quite a few such triangles, which gets expensive after a while.
I'm assuming we're still talking about 16x16 pixel tiles? In that case, border pixels are a relatively small percentage of the total pixels for the quad (60 border pixels vs. 256 total pixels in the tile, assuming the triangle covers the entire tile, and even then you probably won't have to re-fetch all of the texels).
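For what it's worth, the back-of-envelope arithmetic on that outer ring (60 rather than 64, since the four corner pixels are shared between edges) scales like this:

```python
def border_fraction(tile):
    # outer one-pixel ring of a tile x tile block, where bilinear footprints
    # can spill into a neighboring quad's tile and force duplicate fetches
    total = tile * tile
    border = total - (tile - 2) ** 2
    return border, total

for t in (8, 16, 32):
    b, n = border_fraction(t)
    print(f"{t}x{t} tile: {b}/{n} border pixels ({100 * b / n:.0f}%)")
# 8x8 -> 44%, 16x16 -> 23%, 32x32 -> 12%
```

So larger tiles shrink the duplicated-fetch overhead, at the cost of coarser load balancing between quads.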
arjan de lumens said: Unless the average triangle is very small, most triangles will span multiple tiles, with different pixel counts in different tiles. For good load balancing, you may need to buffer quite a few such triangles, which gets expensive after a while.

Depends on the type of geometry we're talking about. With world geometry or shadow volumes, I can believe this. But with skinned meshes it's probably not true.
arjan de lumens said: While the total number of pixels rendered per quad isn't going to differ a great deal (~1-2%) over the frame as a whole, there will be large local variations along e.g. the edge of an object, where one quad may suddenly have to deal with hundreds more pixels from the object than the neighboring quads. Unless there is buffering in the system to allow the 4 quads to operate deeply out of sync (hundreds of pixels, dozens of polygons, or perhaps even more) relative to each other, these local variations will have a much greater impact on effective performance than the total number of pixels for a frame as a whole might indicate.
sireric said: The R3xx and the R4xx have a rather interesting way of tiling things. Our setup unit sorts primitives into tiles, based on their area coverage. Some primitives fall into 1 tile, some into a few, some cover lots. Each of our backend pixel pipes is given tiles of work. The tiles themselves are programmable in size (well, powers of 2), but, surprisingly, we haven't found that changing their size changes performance that much (within reason). Most likely that's due to the fact that with high-res displays, most primitives are large. There is a sweet spot in the performance at 16, and that hasn't changed in some time. Even the current X800 uses 16, though I think we need to revisit that at some point in the future. Possibly, on a per-application basis, different tile sizes would benefit things. It's on our long list of things to investigate.
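A minimal sketch of the kind of tile-walking sireric describes, assuming a simple repeating ownership pattern (the actual R3xx/R4xx tile-to-pipe mapping isn't public, so the `owner` function here is purely hypothetical):

```python
TILE = 16          # the 16-pixel sweet spot mentioned above
NUM_PIPES = 4      # one tile owner per quad pipeline

def owner(tx, ty):
    # hypothetical repeating ownership pattern, not ATI's actual mapping
    return (tx + 2 * ty) % NUM_PIPES

def tiles_covered(xmin, ymin, xmax, ymax):
    """Walk a primitive's screen bounding box and yield (tile, owning pipe)."""
    for ty in range(ymin // TILE, ymax // TILE + 1):
        for tx in range(xmin // TILE, xmax // TILE + 1):
            yield (tx, ty), owner(tx, ty)

# A triangle whose bounding box spans 10..40 in x and y touches 9 tiles,
# spread across all four pipes:
for tile, pipe in tiles_covered(10, 10, 40, 40):
    print(f"tile {tile} -> pipe {pipe}")
```

With a pattern like this, a small primitive lands in one pipe's tiles while a large one fans out across all of them, which is how the per-pipe workloads even out over a frame.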
Anyway, each pipe has huge load-balancing FIFOs on its inputs that match up to the tiles it owns. Each pipe is fully MIMD and can operate on different polygons, and, in fact, can be hundreds of polygons off from the others. The downside of that is memory coherence between the different pipes. Increasing tile size would improve this, but it also requires larger load-balancing FIFOs. Our current setup seems reasonably optimal, but reviewing that, performance-wise, is on the list of things to do at some point. We've artificially lowered the size of our load-balancing FIFOs and never noticed a performance difference, so we feel that, for current apps at least, we are well over-designed.
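A toy model of those FIFOs, with invented depths, shows why deep per-pipe queues let the pipes drift so far apart: setup only stalls when one owning pipe's FIFO fills up, which answers arjan's buffering concern above:

```python
from collections import deque

FIFO_DEPTH = 64                      # invented depth
fifos = [deque() for _ in range(4)]  # one load-balancing FIFO per pixel pipe

def push_or_stall(pipe, tile_work):
    """Setup side: stall only when the owning pipe's FIFO is full."""
    if len(fifos[pipe]) >= FIFO_DEPTH:
        return False                 # setup stalls; this pipe is the bottleneck
    fifos[pipe].append(tile_work)
    return True

def drain(pipe, budget):
    """Pipe side: each pipe consumes at its own rate (MIMD), so two pipes
    can end up many polygons apart before anything stalls."""
    done = 0
    while fifos[pipe] and done < budget:
        fifos[pipe].popleft()
        done += 1
    return done
```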
In general, we have no issues keeping all our units busy, given current polygons. I could imagine that if you did single-pixel triangles in one tile over and over, performance could drop due to tiling, but memory efficiency would shoot up, so it's unclear that overall performance would be hurt. The distribution of load across all these tiles is pretty much ideal for all the cases we've tested. Super-tiling is built on top of this, to distribute work across multiple chips.
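Super-tiling on top of the per-pipe scheme could be as simple as a coarse checkerboard between chips; the size and mapping here are invented for illustration:

```python
SUPER = 64  # invented super-tile size; coarser than the per-pipe tiles

def chip_for(x, y, num_chips=2):
    # checkerboard of super-tiles between chips; each chip then runs its
    # own per-pipe tiling (as above) inside the super-tiles it owns
    return ((x // SUPER) + (y // SUPER)) % num_chips
```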
As well, just like other vendors, we have advanced sequencers that distribute ALU workload to our units, allocate registers, and sequence all the operations that need to be done, in a dynamic way. That's really a basic requirement of doing shader processing. This is rarely the issue for performance.
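The register-allocation side of that has a well-known consequence for any latency-hiding design: the more temporaries a compiled shader needs, the fewer pixels the sequencer can keep in flight. A generic sketch with invented numbers, not ATI's actual figures:

```python
REGISTER_FILE = 256   # invented: temporary registers available per pipe

def threads_in_flight(regs_per_thread):
    """More temps per shader -> fewer pixels in flight to hide latency."""
    return REGISTER_FILE // regs_per_thread

for regs in (2, 4, 8):
    print(f"{regs} temps/thread -> {threads_in_flight(regs)} pixels in flight")
```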
Nappe1 said: I discover similarities between R3xx and the chip that sits on my bookshelf...

Well, cough it up!
Luminescent said:
  Nappe1 said: I discover similarities between R3xx and the chip that sits on my bookshelf...
  Well, cough it up!

That would be Nappe's 'axe'...