Fundamental 3D Microarchitecture

Discussion in 'Rendering Technology and APIs' started by Vince, Jul 22, 2002.

  1. RoOoBo

    Regular

    Joined:
    Jun 12, 2002
    Messages:
    308
    Likes Received:
    31
    You mean that when the pixel shader program for a fragment misses on a texture read (so it must go to video RAM, or even out to AGP memory) it suspends that fragment and starts a new one until the first has its data available? If that were the case, the overhead of storing the shader's architectural state (temporary registers, output values) would be almost as big as the overhead of the interpolators, if not more.

    And I think that what avoids those large stalls from memory latency is the texture, Z and color caches. If you are processing a single triangle at a time, you can almost guarantee that if the first pixels stall, all of them will, because they hit more or less the same memory region.

    Perhaps with fixed texturing and color that could be different.
     
  2. gking

    Newcomer

    Joined:
    Feb 9, 2002
    Messages:
    130
    Likes Received:
    0
    Not just when the cache misses -- whenever a texture is fetched, the fragment gets placed in the FIFO. This is because the execution units in the shader hardware are at a premium. Every cycle they spend stalled waiting for texture data is a cycle that could have been spent improving shader effects.

    Those save bandwidth, and optimally (if a texel is in the cache) the latency would be 0. However, since you can't guarantee texels are in cache (imagine a multi-pass algorithm where the final pass just copies a screen-sized texture to the screen -- there is no benefit to having a cache), you need to design for near-worst case scenarios in order to ensure that performance doesn't fluctuate wildly.

    Every bit counts. A 128-bit datapath is a significant chunk of chip real estate, and you want to save wherever possible when you design a chip (for cost, complexity, and performance). There's no way around saving the temporary registers, but there is a way around saving interpolated values.
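    The latency-hiding scheme described above can be sketched with a toy model (purely illustrative; the FIFO depth, the one-fetch-per-fragment shape, and the 150-cycle figure are assumptions for the sketch, not any vendor's actual design):

    ```python
    from collections import deque

    MEM_LATENCY = 150  # assumed worst-case cycles for a texture fetch

    def run(fragments):
        # Each fragment issues one texture fetch and retires when the data
        # arrives. Instead of stalling the ALU, a fetching fragment is
        # parked in a FIFO and the shader moves on to the next fragment.
        fifo = deque()              # entries: (fragment_id, arrival_cycle)
        finished = []
        pending = list(fragments)
        cycle = 0
        while pending or fifo:
            if fifo and fifo[0][1] <= cycle:
                finished.append(fifo.popleft()[0])   # texel arrived: retire
            if pending:
                fifo.append((pending.pop(0), cycle + MEM_LATENCY))
            cycle += 1
        return cycle, finished

    cycles, done = run(range(1000))
    # 1000 fragments complete in 1150 cycles instead of ~150,000:
    # the 150-cycle latency is paid once, not once per fragment.
    ```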
     
  3. RoOoBo

    So having (doing your same maths) 32x128x150 = 600+ Kbit of storage for the FIFO is OK (counting only temporary registers)? And moving 32 128-bit registers in a clock is something really hard to implement (you could use something like a register window, but even then register access from the pixel shaders would need a complex implementation). And the derivatives are just per triangle, not per pixel like the pixel shader state. I'm not saying that current hardware uses barycentric coords or derivatives (I don't know what they use), but the FIFOs you're talking about seem way too large for what I know about hardware.

    Is there any source where I can check that pixel shaders are multithreaded (even multi-triangle) in current or future hardware?
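    For reference, the storage figure above checks out (using the thread's numbers: 32 temporaries of 128 bits each, with roughly 150 fragments in flight):

    ```python
    regs_per_fragment = 32      # temporary registers
    bits_per_reg = 128
    fragments_in_flight = 150   # assumed FIFO depth from the discussion

    total_bits = regs_per_fragment * bits_per_reg * fragments_in_flight
    print(total_bits)              # 614400 bits
    print(total_bits // 1024)      # 600 Kbit -- the "600Kb+" figure
    ```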
     
  4. gking

  5. RoOoBo

    I already know about that course (btw, my eternal thanks to Stanford and its teachers for putting their slides online ;)) but I don't remember anything about pixel shaders being multithreaded.

    In the part about rasterization they talk about barycentric coordinates, but also about homogeneous recursive descent (whatever that is, though it seems related to the Olano & Greer triangle-setup-in-2DH paper) as the possible rasterization approach used by NVidia. Hmm, maybe they are not incompatible, as the barycentric coordinates for a fragment could be calculated as parameters in the Olano & Greer algorithm.

    In any case it would have been really interesting to attend that course (had my space-time coordinates been the right ones).
     
  6. Simon F

    Simon F Tea maker
    Moderator Veteran

    Joined:
    Feb 8, 2002
    Messages:
    4,563
    Likes Received:
    171
    Location:
    In the Island of Sodor, where the steam trains lie
    I've been trying to visualise the advantages/disadvantages of using barycentric coords to generate the texture/shading parameters per-pixel VS a more "traditional" direct computation (i.e. direct hyperbolic interpolation) and AFAICS...
    • Barycentric coords give you slightly cheaper per-polygon set-up VS a "traditional" method
    • The storage costs per tri would seem to be nearly identical for both. In fact barycentric would be far more expensive if you assume all per-vertex values are FP.
    • The per-pixel calcs would seem to require about the same number of muls+adds for both, but the barycentric method would have more latency since you have to do the interpolation of the Bary' coord's per pixel first before computing the 'real' vals.
    • The actual multiply operations (when computing a texture coord) might also be more expensive in the barycentric method. In the direct computation method, apart from the final scale by 1/(1/w), the X and Y operands are 'small'.
    • In the direct method, there is always the possibility of doing things incrementally if you are stepping Horizontally or Vertically.
    Unless I've missed something, I'm not convinced that the barycentric method is all that great but I'd be intrigued if someone could convince me otherwise.
    I was interested that Nvidia's system doesn't need projection/clipping and I'm curious to know what the 'recursive' means in this context. The PowerVR PCX1/2 used to be able to switch to homogeneous rendering if a triangle was poking through the Front Clipping plane, thus avoiding quite a bit of work, but I can't say there was anything recursive about it.
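    A minimal numeric sketch of the two schemes Simon F compares (made-up triangle and attribute values): both the "direct" 1/w path and the barycentric path land on the same perspective-correct result, which is why the argument comes down to cost and latency rather than correctness. Note the barycentric path's extra dependent normalization step before any attribute can be blended, matching the latency point above.

    ```python
    # made-up triangle: screen position (x, y), clip-space w, and one
    # texture coordinate s per vertex
    V = [(0.0, 0.0, 1.0, 0.0),
         (10.0, 0.0, 2.0, 1.0),
         (0.0, 10.0, 4.0, 0.5)]

    def edge(a, b, px, py):
        # signed-area edge function
        return (b[0] - a[0]) * (py - a[1]) - (b[1] - a[1]) * (px - a[0])

    def screen_lerp_weights(px, py):
        # plain screen-space (affine) interpolation weights
        area = edge(V[0], V[1], V[2][0], V[2][1])
        return (edge(V[1], V[2], px, py) / area,
                edge(V[2], V[0], px, py) / area,
                edge(V[0], V[1], px, py) / area)

    def direct_interp(px, py):
        # "traditional" path: linearly interpolate s/w and 1/w across
        # the screen, then one divide per pixel by the interpolated 1/w
        l = screen_lerp_weights(px, py)
        s_over_w = sum(li * v[3] / v[2] for li, v in zip(l, V))
        inv_w = sum(li / v[2] for li, v in zip(l, V))
        return s_over_w / inv_w

    def barycentric_interp(px, py):
        # barycentric path: make the weights perspective-correct first
        # (extra dependent step), then blend the raw per-vertex attribute
        l = screen_lerp_weights(px, py)
        lw = [li / v[2] for li, v in zip(l, V)]
        norm = sum(lw)
        return sum((w / norm) * v[3] for w, v in zip(lw, V))

    s1, s2 = direct_interp(3.0, 3.0), barycentric_interp(3.0, 3.0)
    # both land on the same perspective-correct value (0.3 here)
    ```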
     
  7. RoOoBo

    I don't understand either what they mean by recursive descent. In the Olano paper there isn't any reference to it. If we could find someone who attended the course we could ask. Clipping is supported with interpolated functions: a clip value is calculated for each vertex as a dot product against the clip-plane equation and then interpolated across the triangle.

    From what I have read, ATI doesn't seem to use this approach (though there seems to be far less information about ATI hardware than about NVidia's, so who knows), so even with T&L they seem to be doing real geometric clipping. I wonder how geometric clipping can be done fast in modern GPUs; in the old SGI machines that stage seemed really expensive.
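    The interpolated-clip-distance idea above is simple to sketch (hypothetical plane and triangle values): the distance is computed per vertex with one dot product, then treated as just another interpolant, and fragments on the negative side are discarded.

    ```python
    def clip_distance(vertex, plane):
        # per-vertex: one dot product against the plane equation
        return sum(v * p for v, p in zip(vertex, plane))

    # hypothetical homogeneous vertices (x, y, z, w), clip plane (a, b, c, d)
    plane = (0.0, 0.0, 1.0, 0.0)                    # keeps the z >= 0 half-space
    tri = [(0, 0, -1, 1), (1, 0, 1, 1), (0, 1, 1, 1)]
    dist = [clip_distance(v, plane) for v in tri]   # [-1.0, 1.0, 1.0]

    def fragment_visible(l0, l1, l2):
        # the rasterizer interpolates the per-vertex distances like any
        # other parameter; a fragment on the negative side is discarded
        return l0 * dist[0] + l1 * dist[1] + l2 * dist[2] >= 0.0
    ```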
     
  8. Simon F

    I assume that's probably where the "texkill" command in DX came from.

    "Just" throw in the dedicated hardware and you're done. I don't think it's any worse than any of the other bits of functionality in today's chips.
     
  9. gking

    Check the bit about texturing -- you'll find references to long FIFOs in the texture/shader pipe.
     
  10. RoOoBo

    Sure, but I don't see how that could be implemented with pixel shaders. It also mentions prefetching, which I think is a far better approach for pixel shaders than storing hundreds of bits of architectural state per fragment. And although it is more speculation than anything else, the FIFO in the second prefetch implementation seems smaller to me, and able to stall.

    About the FIFOs in a triangle setup and rasterizer processor I was reading today this paper:

    http://www.research.compaq.com/wrl/techreports/abstracts/98.1.html

    However, it doesn't say how many fragments are stored in those FIFOs. Surely tens of fragments, but I can't tell whether there could be a hundred or more.

    It's a bit old by today's standards, as it is pre-T&L (but it has a 256-bit data bus :)), yet for the triangle setup and rasterization stages it still seems valid.

    It seems to use the Pixel-Planes/Pineda approach of half-plane edge functions for rasterization, and calculates the r, g, b, a, 1/w, s and t parameters. It also calculates other parameters for other cases, but I don't remember now which ones.
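    The Pixel-Planes/Pineda half-plane edge-function approach mentioned above can be sketched as a naive bounding-box scan (real hardware evaluates the three edge functions incrementally and in parallel, not pixel by pixel like this):

    ```python
    def edge(ax, ay, bx, by, px, py):
        # Pineda edge function: positive on the left of the directed
        # edge A->B, so "inside" a CCW triangle is where all three are >= 0
        return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

    def rasterize(v0, v1, v2, width, height):
        covered = []
        for y in range(height):
            for x in range(width):
                px, py = x + 0.5, y + 0.5   # sample at pixel centers
                if (edge(*v0, *v1, px, py) >= 0 and
                    edge(*v1, *v2, px, py) >= 0 and
                    edge(*v2, *v0, px, py) >= 0):
                    covered.append((x, y))
        return covered

    # counter-clockwise triangle covering the lower-left half of an 8x8 tile
    pixels = rasterize((0.0, 0.0), (8.0, 0.0), (0.0, 8.0), 8, 8)
    ```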
     
  11. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    955
    Likes Received:
    52
    Location:
    LA, California
    For simplicity, let's assume a single-pipe design. gking, you are saying that the pixel shader unit has n copies of the register file and other state?

    so instruction issue is something like (i = instruction, p = pixel, c = cycle)

    i1 p1 c1
    i1 p2 c2
    ...
    i1 pn cn
    i2 p1 cn+1
    ...
    i2 pn c2n
    ...

    Where n is large?

    Say you allow 64 threads in flight. To hold temp values alone you would need 32KB... the logic/cache ratio is starting to look very CPU-like.

    I believe NVidia already does this in their vertex shaders - OTOH each pipe there is only working on 6 vertices simultaneously... 64 sounds like a lot.

    How many transistors is 32KB of cache anyway?
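    Checking the figure above: 64 threads of 32 temporaries at 128 bits works out exactly as stated.

    ```python
    threads, regs, bits = 64, 32, 128
    total_bytes = threads * regs * bits // 8
    print(total_bytes)      # 32768 bytes, i.e. 32 KB just for temporaries
    ```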
     
  12. arjan de lumens

    Veteran

    Joined:
    Feb 10, 2002
    Messages:
    1,274
    Likes Received:
    50
    Location:
    gjethus, Norway
    Depends. For a single-port SRAM cache, you can normally assume 6 transistors per bit + a few additional transistors for address decode and sense amplifiers, adding perhaps 1/2 to 1 transistor per bit. For a dual-port cache, add another 2.5-3 transistors per bit.

    So a dual-port 32KB cache should amount to roughly 2.5 million transistors.
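    The estimate follows directly from the per-bit figures above (taking ~2.75 extra transistors/bit for the second port and ~0.75/bit of decode/sense-amp overhead, the midpoints of the quoted ranges):

    ```python
    bits = 32 * 1024 * 8         # 32 KB cache
    per_bit = 6 + 2.75 + 0.75    # 6T cell + dual-port extra + decode/sense
    total = bits * per_bit
    print(round(total / 1e6, 2))   # ~2.49 million transistors
    ```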
     
  13. psurge

    You would need to be able to do one 128-bit write per cycle and three 128-bit reads... I was never sure what ports actually corresponded to, but I guess this is 1 write port and 3 read ports?
     
  14. arjan de lumens

    1 write port + 3 read ports = 4 ports total - yup, that's it. Each 'port' on an SRAM or a register file corresponds to the capability to read or write 1 data element per clock cycle.

    While the transistor count for an SRAM does not increase that much per port (~2.5 transistors per port per bit), beyond 2 ports the area tends to grow roughly with the square of the number of ports (due to interconnect routing density issues), making a 4-port SRAM about 4x as big as a 2-port SRAM. So it might actually be better in this case to replace the 4-port SRAM with three (!) dual-port SRAMs, each with 1 write port and 1 read port.
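    The replacement trick above -- three 1-write/1-read banks holding identical copies instead of one 4-port array -- can be sketched as a behavioral model (illustrative only):

    ```python
    class ReplicatedRegFile:
        # three 1R/1W banks holding identical copies; every write is
        # broadcast to all banks, and each read port owns a private bank,
        # so one write plus three reads can happen in the same cycle
        def __init__(self, nregs):
            self.banks = [[0] * nregs for _ in range(3)]

        def write(self, reg, value):
            for bank in self.banks:      # broadcast keeps copies coherent
                bank[reg] = value

        def read(self, port, reg):       # port is 0, 1 or 2
            return self.banks[port][reg]

    rf = ReplicatedRegFile(32)
    rf.write(5, 0xABCD)
    values = [rf.read(p, 5) for p in range(3)]   # all ports see the write
    ```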
     
  15. Tagrineth

    Tagrineth murr
    Veteran

    Joined:
    Feb 14, 2002
    Messages:
    2,537
    Likes Received:
    25
    Location:
    Sunny (boring) Florida
    12+12 + 12+12 + 12+12 + 12+12 + 20+12 + 14+18 + 14+18 + 2+30

    24 + 24 + 24 + 24 + 32 + 32 + 32 + 32

    4 * (24) + 4 * (32)

    4 * (24 + 32) = 224 bits.

    Lovely. I doubt the internal precision is much higher, mostly 'cos Voodoo Graphics through Voodoo3 only supported 16-bit render targets (24-bit internal is already very good for 16-bit-only output)... but IIRC Napalm has 40-bit internal colour precision. I think.
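    For what it's worth, the bit-count arithmetic in the post adds up (pairs taken exactly as written above):

    ```python
    pairs = [(12, 12), (12, 12), (12, 12), (12, 12),
             (20, 12), (14, 18), (14, 18), (2, 30)]
    sums = [a + b for a, b in pairs]    # [24, 24, 24, 24, 32, 32, 32, 32]
    total = sum(sums)
    print(total)                        # 4*24 + 4*32 = 224 bits
    ```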
     
  16. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    It probably does get to that high a precision in certain parts of the pipeline. Modern graphics cards with support for anisotropic filtering probably have higher-precision color at some stages (excluding the R300, which definitely has higher precision...). But all the way through? Highly doubtful.
     
  17. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,493
    Likes Received:
    474
    If your comment was referring to 40-bit precision, I'd like to point out that Parhelia has 40-bit precision through the pipeline, and I expect that the P10 does as well.
     