Fundamental 3D Microarchitecture

gking said:
RoBoBo,

Those extra fragments are all in the shader pipeline's FIFOs (first in, first out), which absorb things like memory latency. It takes a while for a read request to DRAM to get a response (page swap, charge, fetch, etc.). If the graphics chip did nothing while waiting for DRAM, performance would be abysmal (thousands of pixels per second instead of billions). So the fragments go into a FIFO to wait for the texture read to complete, and the graphics chip processes other fragments in the meantime.
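The latency-hiding argument can be put in rough numbers with a toy throughput model. Everything here is made up for illustration (no real GPU's figures):

```python
# Toy latency-hiding model; all numbers are made up for illustration,
# not taken from any real GPU.
MEM_LATENCY = 200   # cycles from DRAM read request to data (assumed)
FRAGMENTS = 1000

# No FIFO: the pipe stalls on every texture fetch, paying the full
# memory latency once per fragment.
stalled_cycles = FRAGMENTS * MEM_LATENCY

# Deep FIFO: fetches overlap, so after the first latency is paid,
# one fragment retires per cycle.
fifo_cycles = MEM_LATENCY + FRAGMENTS

print(stalled_cycles, fifo_cycles)  # 200000 1200
```

With these assumed numbers the FIFO version is over 160x faster, which is the whole point of keeping fragments in flight instead of stalling.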

You mean that when the pixel shader program for a fragment fails to read a texture (it must go to video RAM or even to AGP), it stops with that fragment and starts a new one until the first one has the information available? If that were the case, the overhead of storing the architectural state of the shader (temporary registers, output values) would be almost as big as the overhead of the interpolators, if not more.

And I think that what avoids those large stalls from memory latency is the texture, Z, and color caches. If you are processing a single triangle at a time, you can almost guarantee that all the pixels are going to stall if the first ones stall, because they will hit more or less the same memory region.

Perhaps with fixed texturing and color that could be different.
 
You mean that when the pixel shader program for a fragment fails to read a texture (it must go to video RAM or even to AGP)

Not just when the cache misses -- whenever a texture is fetched, the fragment gets placed in the FIFO. This is because the execution units in the shader hardware are at a premium. Every cycle they spend stalled waiting for texture data is a cycle that could have been used for improving shader effects.

And I think that what avoids those large stalls because of memory latency are the texture, z and color caches

Those save bandwidth, and optimally (if a texel is in the cache) the latency would be 0. However, since you can't guarantee texels are in cache (imagine a multi-pass algorithm where the final pass just copies a screen-sized texture to the screen -- there is no benefit to having a cache), you need to design for near-worst case scenarios in order to ensure that performance doesn't fluctuate wildly.
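The "design for near-worst case" point can be made with a tiny effective-latency model. The hit rate and cycle counts here are made-up illustrative numbers:

```python
def effective_latency(hit_rate, hit_cycles, miss_cycles):
    # Average cycles per texel fetch for a given cache hit rate.
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles

# A pass with lots of texel reuse vs. a screen-sized copy pass that
# touches every texel exactly once (hit rate ~0 for new texels).
good_case = effective_latency(0.95, 1, 200)   # ~11 cycles on average
worst_case = effective_latency(0.0, 1, 200)   # the full DRAM latency
```

The cache changes the average dramatically, but the copy-to-screen pass still sees the full DRAM latency on every fetch, which is why the pipeline has to tolerate it.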

If that were the case, the overhead of storing the architectural state of the shader (temporary registers, output values) would be almost as big as the overhead of the interpolators, if not more.

Every bit counts. A 128-bit datapath is a significant chunk of chip real estate, and you want to save wherever possible when you design a chip (for cost, complexity, and performance). There's no way around saving the temporary registers, but there is a way around saving interpolated values.
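The "way around saving interpolated values" presumably amounts to carrying only a fragment's barycentric weights (or screen position) through the FIFO and recomputing attributes from per-triangle vertex data on the way out. A minimal sketch of that idea (names hypothetical):

```python
def recompute_attr(v0, v1, v2, b0, b1, b2):
    # Recompute an interpolated attribute from per-triangle vertex
    # values and a fragment's barycentric weights, instead of storing
    # every interpolated value per fragment in the FIFO.
    return v0 * b0 + v1 * b1 + v2 * b2

# e.g. a color channel with vertex values 10, 20, 30:
recompute_attr(10.0, 20.0, 30.0, 0.5, 0.25, 0.25)  # 17.5
```

Per fragment you then store only the weights instead of every interpolated attribute, trading FIFO storage for a few multiply-adds at FIFO exit.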
 
gking said:
Not just when the cache misses -- whenever a texture is fetched, the fragment gets placed in the FIFO. This is because the execution units in the shader hardware are at a premium. Every cycle they spend stalled waiting for texture data is a cycle that could have been used for improving shader effects.
...

Every bit counts. A 128-bit datapath is a significant chunk of chip real estate, and you want to save wherever possible when you design a chip (for cost, complexity, and performance). There's no way around saving the temporary registers, but there is a way around saving interpolated values.

So having (doing your same maths) 32x128x150 = 600 Kbit+ of storage (counting only temporary registers) is OK for the FIFO? And moving 32 128-bit registers in a clock is really hard to implement (you could use something like a register window, but even then register access from the pixel shaders would need a complex implementation). And the derivatives are just per triangle, not per pixel like the pixel shader state. I'm not saying that current hardware uses barycentric coords or derivatives (I don't know what they use), but those FIFOs you're talking about seem way too large for what I know about hardware.
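Spelling out that arithmetic (the 150-deep FIFO is the assumption from the discussion above):

```python
regs_per_fragment = 32     # temporary registers per fragment
bits_per_reg = 128
fifo_depth = 150           # fragments in flight (assumed)

total_bits = regs_per_fragment * bits_per_reg * fifo_depth
total_kbytes = total_bits / 8 / 1024

print(total_bits)    # 614400 bits, i.e. 600 Kbit+
print(total_kbytes)  # 75.0 KB
```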

Is there any source where I can check that pixel shaders are multithreaded (even multi-triangle) in current or future hardware?
 
gking said:
Take a look at the course slides from http://graphics.stanford.edu/courses/cs448a-01-fall/

It may not have the stamp of approval from hardware companies (who don't want to reveal any trade secrets), but it's reasonably accurate.

I already know about that course (btw, my eternal thanks to Stanford and its teachers for putting their slides online ;)) but I don't remember anything about pixel shaders being multithreaded.

In the part about rasterization they talk about barycentric coordinates, but also about homogeneous recursive descent (whatever that is, but it seems related to the Olano & Greer triangle setup in 2DH paper) as the possible approach to rasterization used by NVidia. Hmm, maybe they are not incompatible, as the barycentric coordinates for a fragment could be calculated as parameters in the Olano & Greer algorithm.

In any case it would have been really interesting to attend that course (had the space-time coordinates worked out).
 
I've been trying to visualise the advantages/disadvantages of using barycentric coords to generate the texture/shading parameters per-pixel VS a more "traditional" direct computation (i.e. direct hyperbolic interpolation) and AFAICS...
  • Barycentric coords give you slightly cheaper per-polygon set-up VS a "traditional" method
  • The storage costs per tri would seem to be nearly identical for both. In fact barycentric would be far more expensive if you assume all per-vertex values are FP.
  • The per-pixel calcs would seem to require about the same number of muls+adds for both, but the barycentric method would have more latency since you have to do the interpolation of the Bary' coord's per pixel first before computing the 'real' vals.
  • The actual multiply operations (when computing a texture coord) might also be more expensive in the barycentric method. In the direct computation method, apart from the final scale by 1/(1/w), the X and Y operands are 'small'.
  • In the direct method, there is always the possibility of doing things incrementally if you are stepping Horizontally or Vertically.
Unless I've missed something, I'm not convinced that the barycentric method is all that great but I'd be intrigued if someone could convince me otherwise.
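As a sanity check on the comparison above, both schemes compute the same perspective-correct value, so the trade-off really is about setup cost, storage, and per-pixel latency rather than the answer. A quick numerical sketch with arbitrary made-up vertex data:

```python
def direct_hyperbolic(attrs, ws, lin):
    # Linearly interpolate attr/w and 1/w in screen space, then
    # divide by the interpolated 1/w at the pixel.
    a_over_w = sum(l * a / w for l, a, w in zip(lin, attrs, ws))
    one_over_w = sum(l / w for l, w in zip(lin, ws))
    return a_over_w / one_over_w

def via_barycentric(attrs, ws, lin):
    # Build perspective-correct barycentric weights first, then take
    # a weighted sum of the vertex attributes.
    b = [l / w for l, w in zip(lin, ws)]
    s = sum(b)
    return sum((bi / s) * a for bi, a in zip(b, attrs))

s_tex = (0.0, 1.0, 0.0)    # texture coordinate at the three vertices
w = (1.0, 2.0, 4.0)        # clip-space w per vertex
screen = (0.2, 0.3, 0.5)   # screen-space linear weights at one pixel
```

Note how the barycentric path has the extra serial step (normalize the weights before the attribute sum), which matches the latency point above.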
...about homogeneous recursive descent (whatever that is, but it seems related to the Olano & Greer triangle setup in 2DH paper) as the possible approach to rasterization used by NVidia. ...

I was interested that Nvidia's system doesn't need projection/clipping and I'm curious to know what the 'recursive' means in this context. The PowerVR PCX1/2 used to be able to switch to homogeneous rendering if a triangle was poking through the Front Clipping plane, thus avoiding quite a bit of work, but I can't say there was anything recursive about it.
 
Simon F said:
I was interested that Nvidia's system doesn't need projection/clipping and I'm curious to know what the 'recursive' means in this context. The PowerVR PCX1/2 used to be able to switch to homogeneous rendering if a triangle was poking through the Front Clipping plane, thus avoiding quite a bit of work, but I can't say there was anything recursive about it.

I don't understand either what they mean by recursive descent. In the Olano paper there isn't any reference to that. If we could find someone who was in the course, we could ask about it. Clipping is supported with interpolated functions: the clipping value is calculated for each vertex by doing a dot product against the clip plane equation and then interpolated inside the triangle.
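The interpolated-clip-function idea sketched in code (plane and vertex values are made-up examples): evaluate a signed distance per vertex with one dot product, interpolate it across the triangle like any other parameter, and discard fragments where it goes negative.

```python
def clip_distance(vertex, plane):
    # Signed distance of a homogeneous vertex (x, y, z, w) from a
    # clip plane (a, b, c, d): one dot product per vertex.
    return sum(v * p for v, p in zip(vertex, plane))

# Plane z >= 0 written in homogeneous form:
near = (0.0, 0.0, 1.0, 0.0)
d0 = clip_distance((0.0, 0.0, 2.0, 1.0), near)   # in front: +2
d1 = clip_distance((0.0, 0.0, -1.0, 1.0), near)  # behind: -1
# These per-vertex values get interpolated across the triangle; a
# fragment is killed when its interpolated value is negative.
```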

From what I have read, ATI doesn't seem to use this approach (but there seems to be far less information about ATI hardware than about NVidia, so who knows), so even with T&L they seem to be using real geometry clipping. I wonder how geometric clipping can be done fast in modern GPUs. In the old SGI machines that stage seems really expensive.
 
RoOoBo said:
I don't understand either what they mean by recursive descent. In the Olano paper there isn't any reference to that. If we could find someone who was in the course, we could ask about it. Clipping is supported with interpolated functions: the clipping value is calculated for each vertex by doing a dot product against the clip plane equation and then interpolated inside the triangle.
I assume that's probably where the "texkill" command in DX came from.

From what I have read, ATI doesn't seem to use this approach (but there seems to be far less information about ATI hardware than about NVidia, so who knows), so even with T&L they seem to be using real geometry clipping. I wonder how geometric clipping can be done fast in modern GPUs. In the old SGI machines that stage seems really expensive.
"Just" throw in the dedicated hardware and you're done. I don't think it's any worse than any of the other bits of functionality in today's chips.
 
gking said:
but I don't remember anything about pixel shaders being multithreaded.

Check the bit about texturing -- you'll find references to long FIFOs in the texture/shader pipe.

Sure, but I don't see how that could be implemented with pixel shaders. It also mentions prefetching, which I think is a far better approach for pixel shaders than storing hundreds of bytes of architectural state per fragment. And although it is more speculation than anything else, the FIFO in the second prefetch implementation seems smaller to me and able to stall.

About the FIFOs in a triangle setup and rasterizer processor, I was reading this paper today:

http://www.research.compaq.com/wrl/techreports/abstracts/98.1.html

However, it doesn't say how many fragments are stored in those FIFOs. Surely more than tens of fragments, but I can't tell whether there could be a hundred or more.

A bit old by today's standards, as it is pre-T&L (but it has a 256-bit data bus :)), but it still seems valid for the triangle setup and rasterization stages.

It seems to use the Pixel Planes/Pineda approach of half-plane edge functions for rasterization and calculates r, g, b, a, 1/w, s, t and r parameters. It also calculates other parameters for other cases, but I don't remember now which ones.
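The half-plane edge-function approach mentioned above reduces the inside test to three sign checks per pixel; a minimal sketch (made-up triangle, and real hardware evaluates these incrementally rather than from scratch):

```python
def edge(ax, ay, bx, by, px, py):
    # Pineda-style edge function: its sign tells which side of the
    # edge a->b the point (px, py) lies on.
    return (px - ax) * (by - ay) - (py - ay) * (bx - ax)

def inside(tri, px, py):
    # A point is inside when all three edge functions agree in sign.
    (x0, y0), (x1, y1), (x2, y2) = tri
    return (edge(x0, y0, x1, y1, px, py) >= 0 and
            edge(x1, y1, x2, y2, px, py) >= 0 and
            edge(x2, y2, x0, y0, px, py) >= 0)

# Vertices ordered so interior points make all three functions >= 0:
tri = ((0, 0), (0, 4), (4, 0))
```

Because each edge function is linear in x and y, stepping one pixel right or down only costs an add per edge, which is what makes the scheme attractive for hardware.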
 
For simplicity, let's assume a single-pipe design. gking, are you saying that the pixel shader unit has n copies of the register file and other state?

so instruction issue is something like (i = instruction, p = pixel, c = cycle)

i1 p1 c1
i1 p2 c2
...
i1 pn cn
i2 p1 cn+1
...
i2 pn c2n
...

Where n is large?

Say you allow 64 threads in flight. To hold temp values alone you would need 32KB... the logic/cache ratio is starting to look very CPU-like.
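That 32KB figure checks out:

```python
threads = 64           # threads in flight
regs_per_thread = 32   # temporary registers per thread
bits_per_reg = 128

total_bytes = threads * regs_per_thread * bits_per_reg // 8
print(total_bytes)  # 32768 bytes = 32KB
```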

I believe NVidia already does this in their vertex shader - OTOH each pipe is working on 6 vertices simultaneously.... but 64 sounds like a lot.

How many transistors is 32KB of cache anyway?
 
psurge said:
How many transistors is 32KB of cache anyway?

Depends. For a single-port SRAM cache, you can normally assume 6 transistors per bit + a few additional transistors for address decode and sense amplifiers, adding perhaps 1/2 to 1 transistor per bit. For a dual-port cache, add another 2.5-3 transistors per bit.

So a dual-port 32KB cache should amount to roughly 2.5 million transistors.
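Plugging in the per-bit figures above (taking mid-range values for the dual-port and overhead estimates):

```python
bits = 32 * 1024 * 8      # 32KB cache
cell = 6                  # single-port SRAM cell transistors/bit
dual_port_extra = 2.75    # ~2.5-3 extra transistors/bit for 2nd port
overhead = 0.75           # decode + sense amps, ~0.5-1 per bit

total = bits * (cell + dual_port_extra + overhead)
print(total / 1e6)  # ~2.49 million transistors
```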
 
You would need to be able to do one 128-bit write per cycle and three 128-bit reads... I was never sure what ports actually correspond to, but I guess this is 1 write port and 3 read ports?
 
1 write port + 3 read ports = 4 ports total - yup, that's it. Each 'port' on an SRAM or a register file corresponds to the capability to read or write 1 data element per clock cycle.

While the transistor count for an SRAM does not increase that much per port (~2.5 transistors per port per bit), after 2 ports the area tends to increase roughly with the square of the number of ports (due to interconnect routing density issues), making a 4-port SRAM about 4x as big as a 2-port SRAM. So it might actually be better in this case to replace the 4-port SRAM with three (!) dual-port SRAMs, with 1 write port and 1 read port each.
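The area argument in rough numbers, using the pure quadratic scaling model above (units arbitrary):

```python
def rel_area(ports):
    # Beyond ~2 ports, SRAM area grows roughly with the square of
    # the port count (routing density), per the estimate above.
    return ports ** 2

four_port = rel_area(4)        # 16 units: one 4-port array
three_dual = 3 * rel_area(2)   # 12 units: three 1R/1W copies
print(four_port, three_dual)
```

So replicating the data into three dual-port arrays comes out smaller despite storing everything three times, at the cost of writing each datum to all three copies.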
 
darkblu said:
ok, i can see the general storage concern, yet i believe the voodoo line used considerably more than 64 bits per fragment for interpolators. behold, this is from their interpolator setup:

Code:
Change in Red with respect to X (12.12 format)
Change in Green with respect to X (12.12 format)
Change in Blue with respect to X (12.12 format)
Change in Alpha with respect to X (12.12 format)
Change in Z with respect to X (20.12 format)
Change in S/W with respect to X (14.18 format)
Change in T/W with respect to X (14.18 format)
Change in 1/W with respect to X (2.30 format)

although it's not clear what the internally-maintained bitness for the above was, it still seems rather a lot. now, what space would a typical-precision barycentric vector take?

12+12 + 12+12 + 12+12 + 12+12 + 20+12 + 14+18 + 14+18 + 2+30
= 24 + 24 + 24 + 24 + 32 + 32 + 32 + 32
= 4×24 + 4×32
= 4 × (24 + 32)
= 224 bits.
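That total can be checked mechanically against the delta-format list above:

```python
# (integer, fraction) bit widths of the per-X deltas listed above
formats = [(12, 12)] * 4 + [(20, 12), (14, 18), (14, 18), (2, 30)]

total_bits = sum(i + f for i, f in formats)
print(total_bits)  # 224
```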

Lovely. I doubt the internal precision is much higher, mostly cos Voodoo Graphics through Voodoo3 only supported 16-bit targets (24-bit internal is already very good for 16-bit-only output)... but IIRC Napalm has 40-bit internal precision in colours. I think.
 
Tagrineth said:
Lovely. I doubt the internal precision is much higher, mostly cos Voodoo Graphics through Voodoo3 only supported 16-bit targets (24-bit internal is already very good for 16-bit-only output)... but IIRC Napalm has 40-bit internal precision in colours. I think.

It probably does reach that level of precision in certain parts of the pipeline. Modern graphics cards with support for anisotropic filtering probably have higher-precision color at some stages (excluding the R300, which definitely has higher precision...). But all the way through? Highly doubtful.
 
Chalnoth said:
Tagrineth said:
Lovely. I doubt the internal precision is much higher, mostly cos Voodoo Graphics through Voodoo3 only supported 16-bit targets (24-bit internal is already very good for 16-bit-only output)... but IIRC Napalm has 40-bit internal precision in colours. I think.

It probably does reach that level of precision in certain parts of the pipeline. Modern graphics cards with support for anisotropic filtering probably have higher-precision color at some stages (excluding the R300, which definitely has higher precision...). But all the way through? Highly doubtful.
If your comment was referring to 40-bit precision, I'd like to point out that Parhelia has 40-bit precision through the pipeline, and I expect that the P10 does as well.
 