ATI's idea on transistor budgets

Jawed said:
Six threads, yes, but it's always just six. There's 1024 fragments in each thread.
For all intents and purposes, 1 fragment == 1 thread and 1 vertex == 1 thread. This is how people code their shaders. Let's not mix up terminology here.

Jawed said:
While each of G70's quads has its own shader state, the scan conversion assigns fragment-quads to shader-quads in a round-robin fashion. So two adjacent fragment-quads on a single triangle will be shaded by two different shader-quads:

11223344
11223344
556611
556611
2233
2233
44
44
If you think about it for a second, you'll see that this pattern is not very efficient: All L1 texture caches will contain roughly the same data. Instead of having the effect of adding up L1 texture cache sizes, you'll just be taking the max texture cache size as the only cache size of the system.

More practically, if you have 4 quad pipes and each one has a 1 MB L1 texture cache (for the sake of argument), you want your access pattern to turn those 4 L1 caches into effectively a large 4 MB cache. This is much more effective than making all 4 1 MB caches act like just a single 1 MB cache (thus wasting ~50 million transistors).
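As a toy sketch of the two policies being contrasted here (the pipe count and the tile hash are invented for illustration, not how any real rasteriser works):

// Toy illustration only: round-robin vs screen-tile assignment of
// fragment-quads to shader-quads. All constants are made up.
constexpr int kQuadPipes = 4;

// Round-robin: the Nth fragment-quad of a triangle goes to pipe N mod 4,
// so adjacent quads land in different pipes and every L1 ends up caching
// roughly the same texels.
int assignRoundRobin(int quadIndexInTriangle) {
    return quadIndexInTriangle % kQuadPipes;
}

// Tile-based: each shader-quad owns whole screen tiles (say 8x8 quads),
// so adjacent quads usually hit the same L1 and the four caches hold
// mostly disjoint data -- behaving more like one 4x larger cache.
int assignByTile(int quadX, int quadY) {
    constexpr int kTileSizeInQuads = 8;
    int tileX = quadX / kTileSizeInQuads;
    int tileY = quadY / kTileSizeInQuads;
    return (tileX + 2 * tileY) % kQuadPipes;   // arbitrary tile-to-pipe hash
}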
 
Bob said:
For all intents and purposes, 1 fragment == 1 thread and 1 vertex == 1 thread. This is how people code their shaders. Let's not mix up terminology here.
In deference to RoOoBo, I write "thread" these days instead of "batch" :D XB360/Xenos terminology is "vector", as a matter of interest - but that's hella confusing, I think. "Thread" is common in discussion of R5xx, too.

If you think about it for a second, you'll see that this pattern is not very efficient: All L1 texture caches will contain roughly the same data. Instead of having the effect of adding up L1 texture cache sizes, you'll just be taking the max texture cache size as the only cache size of the system.
I know this isn't the real pattern being used - but the trouble is, what is the real pattern? I was hoping you'd say... I've seen a fancy "snake-walk" somewhere but I can't put my finger on it.

More practically, if you have 4 quad pipes and each one has a 1 MB L1 texture cache (for the sake of argument), you want your access pattern to turn those 4 L1 caches into effectively a large 4 MB cache. This is much more effective than making all 4 1 MB caches act like just a single 1 MB cache (thus wasting ~50 million transistors).
Hmm, I think there's a consensus round here that texture caches are in the KB, not MB. I saw a figure of 8KB for L2 in NV40 the other day (on another webby) - I can imagine that L1 is something like 1KB...

The suggestion was also that L2 contains compressed texture data, while L1 holds the data in un-compressed form. Additionally, how is L1 organised? Is L1 actually distinct per fragment-texturing pipe, or is L1 shared for all fragments in the quad?

Any good info on cache size and organisation?

Jawed
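
One way to picture the organisation being asked about, purely as a speculative sketch of the compressed-L2 / uncompressed-L1 idea suggested above (every size here is invented, not a known figure):

#include <array>
#include <cstdint>

// Speculative sketch of a two-level texture cache: L2 holds compressed
// blocks straight from memory, each per-quad L1 holds the same data
// decompressed. Sizes are placeholders.

struct Dxt1Block {                      // 4x4 texels compressed into 8 bytes
    std::array<uint8_t, 8> bits;
};

struct TexelTile {                      // the same 4x4 texels as RGBA8: 64 bytes
    std::array<uint32_t, 16> rgba;
};

struct TextureL2 {                      // shared between the quad pipes
    std::array<Dxt1Block, 1024> lines;  // 8 KB, matching the rumoured NV40 figure
};

struct TextureL1 {                      // one per quad pipe (or per texture unit?)
    std::array<TexelTile, 4> lines;     // 256 bytes of decompressed texels
};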
 
Jawed, he's just throwing out a number (1MB) for the sake of discussion. I wouldn't view it as anything other than that.
 
JF_Aidan_Pryde said:
I meant to confirm: given a thread, how many shader cores are working on it at once. For the R580, it should be 12 shader cores (3 quads). For the G70 it's 4 shader cores (1 quad).

So the R580 has four threads active at any time, with a maximum of 512 in flight.
The G70 has six threads active at a time, with a maximum of 'hundreds' (according to NV).

Both are SIMD architectures; for a given clock, all active threads are executing the same shader program.

Am I interpreting the two architectures correctly?
I've heard Nvidia has up to 1024 pixels in a batch, but where did the 6 threads come from?
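
For what it's worth, taking the figures quoted earlier in the thread at face value, the arithmetic is simply:

#include <cstdio>

int main() {
    // Figures as quoted in this thread -- illustrative, not official numbers.
    const int activeThreads      = 6;     // "it's always just six"
    const int fragmentsPerThread = 1024;  // "1024 fragments in each thread"

    printf("fragments in flight: %d\n", activeThreads * fragmentsPerThread);  // 6144
    return 0;
}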
 
:cool:
Yes, with R2VB. Actually, depending on how many components you need for your vertices, you can work in parallel with more. If you're working on a single-channel attribute you can work on 192 vertices in parallel (well, actually even 768 if we consider MRTs).
 
I'm not sure you can really say that, Humus. Going beyond 96 parallel ops requires that you can pack your operations into 4-vectors, and you can't do that with every operation. Going beyond 192 would require that once you have calculated one attribute, calculating another is a simple manipulation, and even then it's not really 768 completely in parallel.
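
A quick tally of where the 192/768 figures would come from, assuming R580's 48 pixel shader processors (that count isn't stated in the thread):

#include <cstdio>

int main() {
    const int pixelProcessors = 48;  // assumed R580 figure, not stated above
    const int channels        = 4;   // pack a scalar attribute into x, y, z, w
    const int mrts            = 4;   // ...and write one such vec4 per render target

    printf("scalar vertex attributes per pass, packed:        %d\n",
           pixelProcessors * channels);            // 192
    printf("scalar vertex attributes per pass, packed + MRTs: %d\n",
           pixelProcessors * channels * mrts);     // 768
    return 0;
}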
 
Jawed said:
I know this isn't the real pattern being used - but the trouble is, what is the real pattern? I was hoping you'd say... I've seen a fancy "snake-walk" somewhere but I can't put my finger on it.

Morton and Peano-Hilbert orders are pretty nice for texture cache hitratios and such.
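
For reference, Morton (Z-order) addressing is just bit-interleaving of the x/y block coordinates; a minimal sketch:

#include <cstdint>

// Morton (Z-order) index: interleave the bits of x and y so that small 2D
// neighbourhoods map to short runs of addresses, which is what helps texture
// cache hit ratios and DRAM page locality.
uint32_t mortonIndex(uint16_t x, uint16_t y) {
    auto spread = [](uint32_t v) {          // insert a zero bit between each bit
        v = (v | (v << 8)) & 0x00FF00FFu;
        v = (v | (v << 4)) & 0x0F0F0F0Fu;
        v = (v | (v << 2)) & 0x33333333u;
        v = (v | (v << 1)) & 0x55555555u;
        return v;
    };
    return spread(x) | (spread(y) << 1);
}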
 
3dcgi said:
Jawed, he's just throwing out a number (1MB) for the sake of discussion. I wouldn't view it as anything other than that.
Indeed. It doesn't matter what the actual numbers are. I was just trying to illustrate a point: you'd see pretty much the same result no matter how large (or small) your caches were.
 
Do they really use Peano-Hilbert? At first glance it looks a bit better than Morton. But when thinking some more about it, I just see more complex addressing without any practical benefit.
 
Jawed said:
Hmm, I think there's a consensus round here that texture caches are in the KB, not MB. I saw a figure of 8KB for L2 in NV40 the other day (on another webby) - I can imagine that L1 is something like 1KB...

The suggestion was also that L2 contains compressed texture data, while L1 holds the data in un-compressed form. Additionally, how is L1 organised? Is L1 actually distinct per fragment-texturing pipe, or is L1 shared for all fragments in the quad?

Any good info on cache size and organisation?

Jawed
L1 is likely 256 bytes per quad.
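
If that figure is in the right ballpark, it's only a few decompressed tiles' worth of texels:

// Back-of-envelope, taking the 256-byte figure above at face value.
constexpr int kL1Bytes       = 256;
constexpr int kBytesPerTexel = 4;                        // RGBA8, uncompressed
constexpr int kTexelsInL1    = kL1Bytes / kBytesPerTexel;
static_assert(kTexelsInL1 == 64, "roughly one 8x8 texel footprint per quad");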
 
As far as I can see, Peano-Hilbert is 'better' in that diagonally opposite blocks are not stored next to each other in memory, thus avoiding the most severe locality problem of Morton. This may have a small beneficial effect if the texture map is not aligned to DRAM or virtual-memory pages (fewer page breaks/page misses) or if the size of the individual texel is not a power of 2, but other than that, the effect should be zero.
 
Yes, that's what I meant. At first it seems as if the better coherency would help. But thinking more about it, it seems it would only help in cases that never happen. Are there any non-power-of-two texel formats? Will a gfx card ever do reads, writes or memory allocation for textures that aren't well aligned with power-of-two blocks? I doubt it, even for virtual memory.

The only exception I can think of is non-power-of-two sized textures. But address swizzling doesn't work well with those anyway.
 
Hilbert order? There was a paper at Graphics Hardware 2001 from McCool and others about implementing a recursive rasterization method that made use of Hilbert order. But it didn't go into the benefits of Hilbert order as a pattern for texture or framebuffer access. The key reason for Hilbert order was that the recursive algorithm could be implemented without a large stack, using an automaton that walks the whole framebuffer along a Hilbert space-filling curve.

For the simulator I also use recursive rasterization, but as I gave up on implementing anything that realistic (researching rasterization wasn't that interesting), the algorithm and the recursion live only on the emulation side; the simulator just requests n tiles of fragment quads per cycle and expects, on average, to get them. I think the different levels of tiles (down to the quad) are generated in Morton order now.

After lazily implementing round-robin for 'better' shader workload distribution, I moved to a workload distribution based on tiles (as reported for ATI), and later to a Morton distribution of those tiles in memory and between the quad pipelines. That significantly reduced the extra texture bandwidth consumption and the imbalance in memory accesses.

The textures were always implemented with multiple levels of tiling and stored using Morton order at each level (down to the texel level).
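
My reading of that tile distribution, as a toy sketch (not the simulator's actual code): enumerate the screen tiles in Morton order and deal them out across the quad pipelines, so each pipeline gets spatially coherent but balanced work.

#include <algorithm>
#include <cstdint>
#include <vector>

struct Tile { int x, y; };

static uint32_t spreadBits(uint32_t v) {   // insert a zero bit between each bit
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

std::vector<std::vector<Tile>> distributeTiles(int tilesX, int tilesY, int quadPipes) {
    std::vector<Tile> tiles;
    for (int y = 0; y < tilesY; ++y)
        for (int x = 0; x < tilesX; ++x)
            tiles.push_back({x, y});
    // Order the tiles along the Morton curve...
    std::sort(tiles.begin(), tiles.end(), [](const Tile& a, const Tile& b) {
        uint32_t ma = spreadBits(a.x) | (spreadBits(a.y) << 1);
        uint32_t mb = spreadBits(b.x) | (spreadBits(b.y) << 1);
        return ma < mb;
    });
    // ...then interleave them between the quad pipelines. All quads inside a
    // tile stay on one pipeline, which is what keeps each L1 coherent.
    std::vector<std::vector<Tile>> perPipe(quadPipes);
    for (size_t i = 0; i < tiles.size(); ++i)
        perPipe[i % quadPipes].push_back(tiles[i]);
    return perPipe;
}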

As for using the name 'threads' or whatever, it gets really confusing. When discussing with the other people working on the simulator I end up talking about shader inputs (as we also have vertices, and potentially any kind of input going into the shader units), fragments (as it's hard to remember that it isn't just fragments), threads, quads and groups. I tried to avoid 'batches' because the other people working here weren't reading Beyond3D or similar sources that use 'batch' for 'fragments in flight in the shader', and the use of 'batch' for primitives, and for the unit of work sent to the GPU with no internal state changes, was already established.

Now I tend to use 'quad' as the minimum work unit for fragment processing, with the shader 'thread' being a multiple of a quad. I then use 'thread', 'thread group' or just 'group' (tending towards the latter to avoid confusion) for the group of quads that are scheduled together on a shader and take n cycles to complete. I would say that 'thread', for me, is more the hardware concept of having a PC, a position in a schedule window and other related state that is shared by all the quads in a group. A group is just the workload assigned to a thread, which determines how the register storage is reserved. The available threads define the 'thread window' from which work is scheduled. You could have an architecture with the same number of threads but a different group size (R520 vs R580). Another parameter would then be how many of the quads in a group are processed in parallel in a shader (the number of ALUs).
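
Putting that vocabulary into a struct, purely for illustration (all fields and sizes are invented):

#include <cstdint>
#include <vector>

struct Quad {                     // minimum unit of fragment work: 2x2 fragments
    uint32_t fragmentIds[4];
};

struct Group {                    // the quads scheduled together on one thread
    std::vector<Quad> quads;      // group size is an architectural parameter
                                  // (e.g. R520 vs R580 differ here)
};

struct ShaderThread {             // per-thread hardware state
    uint32_t pc;                  // program counter shared by the whole group
    uint32_t registerBase;        // where the group's registers were reserved
    int      scheduleSlot;        // position in the thread/schedule window
    Group    work;                // the workload bound to this thread
};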

For the very large 'batches' of quads that are scheduled together (what old ATI GPUs, and NVIDIA GPUs still, seem to use) I don't like the terms 'thread' or 'group', because there isn't any real scheduling or true grouping. So I prefer to talk about queues (of quads), even if an implementation may be scheduling between a couple of those large queues.

But those are just my personal preferences.
 