R520 die size and transistor count redux

Jawed said:
Presuming you do really mean Xenos, not R520, then I imagine it's much like R520, in fact. The texture pipes can send data directly to the register array.

In order for this to be valid, the batch that wants that texture data needs to be in context, so the scheduler would have to time the batch so that it is ready, too. So there might be a buffer on the texture pipes' output to soak up the unknowable latency of the texture operation.
I think they'd be in context automatically. You don't need to worry about the latency, either. All you need is to know how deeply-pipelined the texture units are, and have significantly more register space than that available, so that the pixel pipelines can be fed while the texture pipelines are full or stalled.
 
Jawed, I should have gone back to Dave's Xenos writeup first.

Here's what I was getting at though - you can see that in Xenos
(http://www.beyond3d.com/articles/xenos/index.php?p=06)
they have 16 texture samplers, apparently grouped into 4 blocks of 4,
servicing 3 shader arrays.

The thing is on Xenos, it looks like any one of the texture "quads" can
talk to any one of the shader arrays (texture data crossbar). So to correct
my earlier post, a texture quad on Xenos can probably deliver 4*32 (4*64?)
bits of results per cycle and the texture data crossbar isn't as huge as I initially
thought. That makes it even stranger that r520 doesn't have something similar...
perhaps the r520 is intended to scale to many more quads (6,8,...) and the
die area for a 6x6 or 8x8 crossbar would start to become much too large at
that point?

Cheers,
Serge
 
R520's quads seem very tightly bound, as in R420.

I'm guessing that the register array, texture pipes and shader pipes are all grouped into a self-contained unit. Indeed the scheduler might be part of that unit too.

One of the design concepts in R300 that's carried through into R420 is the screen-space tiling per quad. Each pixel shader quad "owns" a 16x16 portion of the screen. This is to get texture cache locality and at the same time increase the granularity of shader execution to provide load-balancing across the screen. It also provides register file locality, which basically means that less logic delay (and actual circuitry!) is incurred in switching context between one quad of pixels and another (each quad unit has to process the 64 screen-space quads in a 16x16 tile in order to complete a batch - i.e. there are 64 phases of execution for each instruction).
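To put numbers on that, here's a rough Python sketch of the tile arithmetic. The 16x16 tile and the 2x2 pixel quad are as above; the `tile_owner` mapping (a simple 2x2 interleave of tiles across four quad units) is purely my assumption for illustration, not something ATI has documented.

```python
TILE_W = TILE_H = 16   # tile size discussed in the post
QUAD_W = QUAD_H = 2    # a quad is a 2x2 block of pixels

def quads_per_tile(tile_w=TILE_W, tile_h=TILE_H):
    """Number of 2x2 pixel quads in one tile (= phases per instruction)."""
    return (tile_w // QUAD_W) * (tile_h // QUAD_H)

def tile_owner(x, y):
    """Hypothetical mapping from a pixel to one of four shader quad units.

    Assumes tiles are interleaved 2x2 across the units; the real
    hardware mapping is not known here.
    """
    tx, ty = x // TILE_W, y // TILE_H
    return (tx % 2) + 2 * (ty % 2)

print(quads_per_tile())      # 64 quads, hence 64 phases per instruction
print(tile_owner(17, 17))    # pixel (17, 17) sits in tile (1, 1)
```

With that interleave, neighbouring tiles land on different quad units, which is the load-balancing part of the idea.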

R520 carries on this "tradition" it seems. So it seems to me that there's no need for a "crossbar" between the texture pipes and the register array. And the register file locality becomes even more useful because now it's not just quads that are being switched in and out of context, but also entire batches. A batch switch is required whenever the current batch hits a stall due to texture-dependency. A new batch is required to keep execution running...
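As a toy illustration of why batch switching hides texture latency, here's a minimal Python sketch. Everything in it is an assumption for the sake of the example: the one-op-per-cycle issue model, the program string ('T' = texture fetch, 'A' = ALU op) and the latency figure are made up, and it says nothing about how the real scheduler picks batches.

```python
def simulate(num_batches, program, tex_latency):
    """Issue one op per cycle from the first batch that isn't stalled.

    program: string of 'A' (ALU op) and 'T' (texture fetch).
    A batch that issues 'T' can't issue again for tex_latency cycles.
    Returns (total_cycles, cycles_in_which_an_op_was_issued).
    """
    pc = [0] * num_batches        # next op index for each batch
    ready_at = [0] * num_batches  # cycle at which each batch may issue again
    cycle = 0
    issued = 0
    remaining = num_batches
    while remaining:
        # Batch switch: take the first unfinished batch that is not stalled.
        pick = next((b for b in range(num_batches)
                     if pc[b] < len(program) and ready_at[b] <= cycle), None)
        if pick is not None:
            op = program[pc[pick]]
            pc[pick] += 1
            issued += 1
            if op == 'T':
                ready_at[pick] = cycle + 1 + tex_latency
            if pc[pick] == len(program):
                remaining -= 1
        cycle += 1
    return cycle, issued

for n in (1, 2, 5):
    total, busy = simulate(n, "TA", tex_latency=4)
    print(n, "batches:", busy, "of", total, "cycles used")
```

With one batch in this toy model the pipe sits idle for most of the fetch; with five batches there's always something ready to issue, so every cycle is used.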

This organisation would support localised scheduling (rather than centralised scheduling) because the rasteriser, when it generates triangle-tile work units, effectively creates entirely independent work that can be multi-threaded with no regard to the work of other shader-quads. So each shader-quad might have a queue of, say, 128 batches that it can schedule. That queue is solely its own. A big triangle that covers, say, 1024 screen pixels will cover at least 4 tiles of the screen (more in practice, unless it happens to be tile-aligned), so will generate one batch per tile it appears in.
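A quick Python sketch of the "one batch per tile" counting. To keep it short I've used an axis-aligned rectangle as a stand-in for the triangle's coverage, which is an assumption; a real rasteriser would test the triangle's edges against each tile.

```python
def tiles_touched(x0, y0, x1, y1, tile=16):
    """Tiles overlapped by the pixel rectangle [x0, x1) x [y0, y1).

    Returns a list of (tile_x, tile_y) pairs; under the scheme described
    above, each entry would become one batch.
    """
    tx0, ty0 = x0 // tile, y0 // tile
    tx1, ty1 = (x1 - 1) // tile, (y1 - 1) // tile
    return [(tx, ty)
            for ty in range(ty0, ty1 + 1)
            for tx in range(tx0, tx1 + 1)]

# 32x32 = 1024 pixels, aligned to the tile grid
print(len(tiles_touched(0, 0, 32, 32)))
# the same 1024 pixels shifted by 8 in x and y
print(len(tiles_touched(8, 8, 40, 40)))
```

The aligned case touches 4 tiles and the shifted one touches 9, so the same pixel area can generate more than twice as many batches, most of them only partly filled.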

In R580, it would seem that each quad of texture pipes will be shared by 3 quads of shader pipes (assuming that RV530 is scaled-up). So a mini-crossbar is required per texture pipe quad - i.e. there'll be a total of four independent crossbars in R580.

It's worth pointing out that Xenos's crossbar can cope with the same number of quads as we'll be seeing in R580. So I don't think the number of quads is particularly the issue with respect to crossbar size/complexity. This all seems to be about other things - of which I'm not too sure.

It may partly be a hangover from R300-style tiled-organisation of the screen. Or it may indicate that Xenos is a bit of an anomaly compared with how R600 will operate.

There are some scheduling complexities that result from the R520/RV530 screen-space tiling. They seem quite different from Xenos scheduling. I might start a thread about it at some point...

Jawed
 