Modern and Future Geometry Rasterizer layout? *spawn*

Discussion in 'Architecture and Products' started by Digidi, Aug 19, 2020.

  1. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,847
    Likes Received:
    1,044
    Location:
    New York
    Rasterization, tessellation, culling and triangle setup are all distributed on RDNA in each shader array. What does the central “geometry processor” actually do?
     
  2. CarstenS

    Legend Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,156
    Likes Received:
    2,663
    Location:
    Germany
    Probably, because shading in quads nets more efficiency for the general use case than you lose with micropolygons as a corner case.
     
    w0lfram, Lightman and techuse like this.
  3. Bondrewd

    Veteran Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    1,042
    Likes Received:
    441
    We don't know.
    Seems like scheduling or workload distribution.
     
  4. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,847
    Likes Received:
    1,044
    Location:
    New York
    The micropolygon problem is mitigated somewhat by higher resolutions with their finer pixel grids. That probably doesn't help much, though, if your triangles are pixel-sized at 1080p.
     
  5. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,469
    Likes Received:
    4,398
    Location:
    Well within 3d
    The efficiency gap in AMD's front-end has been a topic of debate for generations. I think the first questions about scaling came up in the VLIW era when the first "dual rasterizer" models were released and AMD didn't seem to benefit all that much from it.
    Fast-forward through years of product releases and an increase to 4 rasterizers, and AMD kept falling further short of peak.
    The most recent GPUs did seem to catch up in a number of targeted benchmarks to the competition, however.

    I think there were some posts by some with more inside knowledge about why this was, but I don't recall a definitive answer.
    With two or four geometry blocks, there would have been a problem of deciding how to partition a stream of primitives between them, and how to pass geometry that covered more than one screen tile between them.
    There are code references to potential heuristics, such as moving from the first geometry engine to the second after a certain saturation on the first, round-robin selection, or maybe just feeding one engine at a time.
    References to limitations in how a geometry engine can then pass shared geometry to other front ends show up in a few places and also in AMD patents.
    It does seem like there are challenges in how much overhead is incurred in feeding geometry to one or more front ends, where different scenarios might result in performance degradation for a given choice. The process for passing data between front ends and synchronizing them is also a potential bottleneck, as it seems these paths are finicky in terms of synchronization and latency, and there is presumably some heavy crossbar hardware that is difficult to scale.
    What Nvidia did to stay ahead of AMD for so long, or what AMD did that left it behind for so long isn't spelled out, to my knowledge.

    I think AMD has proposed schemes for moving beyond input assemblers and rasterizers feeding each other through a crossbar network.
    However, the rough outline of having up to 4 rasterizers responsible for a checkerboard pattern of tiles in screen space continues even into the purported leak for the big RDNA2 architecture.
    In theory, some kind of distributed form of primitive shader might allow for the architecture to drop the limited scaling of the crossbar, but no such scaling is in evidence. The centralized geometry engine seems to regress from some of these proposals, which attempted to make it possible to scale the front end out. Perhaps load-balancing between four peer-level geometry front ends proved more problematic than having a stage in the process that makes some of the distribution decisions ahead of the primitive pipelines.


    Triangle packing has been visited as a topic on various occasions, but it seems like in most cases the overheads are too extreme on SIMD hardware. One brief exception is a mention, in some AMD presentations, of possibly packing triangles in the case of certain instanced primitives.

    It may play a part in deciding which shader engines/arrays get cycles allocated to processing geometry that straddles their screen tiles, and perhaps in hoisting some early culling that would otherwise be performed redundantly when a triangle is passed to multiple engines. Some references to primitive shader culling in Vega do rely on calculating a bounding box, with the values of certain bits in the x and y dimensions indicating whether 1, 2, or 4 front ends are involved. A rough sketch of that kind of test follows.
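
    To make that concrete, here's a minimal CPU-side sketch of a checkerboard ownership test, assuming 32-pixel tiles and a fixed 2x2 tile-to-front-end mapping; both the tile size and the bit layout are my assumptions for illustration, not documented RDNA values.

    ```cpp
    #include <cstdint>
    #include <cstdio>

    constexpr int kTileShift = 5; // assume 32x32-pixel screen tiles

    struct BBox { int minX, minY, maxX, maxY; }; // inclusive, in pixels

    // Returns a 4-bit mask of the front ends whose tiles the bounding box
    // overlaps, with each front end owning one cell of a 2x2 checkerboard.
    uint32_t frontEndMask(const BBox& b) {
        uint32_t mask = 0;
        for (int ty = b.minY >> kTileShift; ty <= (b.maxY >> kTileShift); ++ty)
            for (int tx = b.minX >> kTileShift; tx <= (b.maxX >> kTileShift); ++tx) {
                mask |= 1u << (((ty & 1) << 1) | (tx & 1)); // checkerboard owner
                if (mask == 0xF) return mask;               // all 4 involved
            }
        return mask;
    }

    int main() {
        BBox tiny{100, 100, 103, 103}; // pixel-scale triangle: one front end
        BBox wide{20, 100, 200, 110};  // spans tile columns: two front ends
        printf("%x %x\n", frontEndMask(tiny), frontEndMask(wide));
    }
    ```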


    Quads come in partly because there are built-in assumptions about gradients and interpolation that make 2x2 blocks desirable at the shader level, as sketched below. It's a common case for graphics, and a crossbar between 4 clients appears to be a worthwhile hardware investment in general, as various compute, shift, and cross-lane operations also have shuffles or permutations between lanes in blocks of 4 as an option or as intermediate steps.
    Just removing quad functionality doesn't mean the SIMD hardware, cache structure, or DRAM architecture wouldn't still be much wider than necessary.
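
    For anyone wondering what the 2x2 arrangement buys: hardware derivative instructions are, in effect, differences between neighboring lanes of a quad, and the texture unit derives the mip level from that footprint. A minimal sketch of the idea; the lane layout and the single-channel LOD estimate are simplifications for illustration, not a description of any particular GPU.

    ```cpp
    #include <cmath>
    #include <cstdio>

    // A quad packs pixels (x,y),(x+1,y),(x,y+1),(x+1,y+1) into 4 lanes, so
    // screen-space derivatives reduce to lane-to-lane subtractions.
    struct Quad { float v[4]; }; // [0]=(x,y) [1]=(x+1,y) [2]=(x,y+1) [3]=(x+1,y+1)

    float ddx(const Quad& q) { return q.v[1] - q.v[0]; } // horizontal difference
    float ddy(const Quad& q) { return q.v[2] - q.v[0]; } // vertical difference

    int main() {
        // An interpolated texture coordinate u evaluated at the 4 quad pixels.
        Quad u = {{0.500f, 0.504f, 0.502f, 0.506f}};
        // Simplified LOD estimate for a 1024-texel-wide texture: log2 of the
        // larger screen-space footprint measured in texels.
        float texels = std::fmax(std::fabs(ddx(u)), std::fabs(ddy(u))) * 1024.0f;
        printf("du/dx=%f du/dy=%f lod=%f\n", ddx(u), ddy(u), std::log2(texels));
    }
    ```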

    One thing I noticed across many compute-based triangle culling solutions is that a large number of them avoided doing the small-triangle and between-sample-point culling on the programmable hardware. Tests like frustum or backface culling tended to be handled in a small number of instructions, and it seems like primitive shaders or CS sieves needed to be mindful of the overhead the culling work would add, since there would be a serial component and duplicated work for any non-culled triangles. A rough sketch of that cost asymmetry is below.
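
    Here's a minimal CPU-side sketch of such a sieve, under my own assumptions about what a typical compute culling pass tests; the cheap backface test is a single signed-area evaluation, while the small-triangle test needs the bounding box snapped against the sample grid, which is the part many implementations left to the fixed-function rasterizer.

    ```cpp
    #include <cmath>
    #include <cstdio>

    struct Vec2 { float x, y; }; // post-projection, in pixel coordinates

    // Backface culling: one signed-area (cross product) test per triangle.
    bool backfacing(Vec2 a, Vec2 b, Vec2 c) {
        return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x) <= 0.0f;
    }

    // Small-triangle culling: if the bounding box encloses no pixel center
    // (centers sit at integer + 0.5), the triangle cannot produce coverage.
    // Conservative: a box containing a sample may still produce no coverage.
    bool missesAllSamples(Vec2 a, Vec2 b, Vec2 c) {
        float minX = std::fmin(a.x, std::fmin(b.x, c.x));
        float maxX = std::fmax(a.x, std::fmax(b.x, c.x));
        float minY = std::fmin(a.y, std::fmin(b.y, c.y));
        float maxY = std::fmax(a.y, std::fmax(b.y, c.y));
        return std::floor(maxX - 0.5f) < std::ceil(minX - 0.5f) ||
               std::floor(maxY - 0.5f) < std::ceil(minY - 0.5f);
    }

    int main() {
        Vec2 a{10.1f, 10.1f}, b{10.4f, 10.1f}, c{10.2f, 10.4f}; // sub-pixel tri
        printf("backface=%d missesSamples=%d\n",
               backfacing(a, b, c), missesAllSamples(a, b, c));
    }
    ```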

    However, even if the pain point for the rasterizers were somehow handled, it's not so much the fixed-function block as the whole SIMD architecture behind it. SIMDs are 16-32 lanes wide (wavefronts/warps potentially wider), and without efficient packing, a rasterizer that handles small triangles efficiently would still be generating mostly empty or very divergent thread groups.
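
    To put rough numbers on that: if every triangle is pixel-sized, each one still occupies a full 2x2 quad with roughly one covered pixel, so even a perfectly packed wave64 holding 16 such quads runs at about 25% lane utilization before any divergence is counted.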
     
    Qesa, tinokun, pharma and 5 others like this.
  6. CarstenS

    Legend Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,156
    Likes Received:
    2,663
    Location:
    Germany
    Interestingly, the new geometry engine had its own slide at the Next Horizon Tech Day, where AMD introduced both Zen 2 and RDNA/RX 5700. In Mike Mantor's presentation from last year's Hot Chips, I could not find any mention of it.
     


    w0lfram, Kej, Dictator and 8 others like this.
  7. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,847
    Likes Received:
    1,044
    Location:
    New York
    I don't know if it's formally documented, but given Nvidia's rasterizer throughput is 16 pixels per clock, I assumed they were doing some sort of packing to fill 32-wide SIMDs. Unless internally their pixel warps are 16 wide and not 32.
     
  8. Digidi

    Regular Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    309
    Likes Received:
    152
    You are right.

    But I found "Multi-Lane Primitives". This is a new term for me.

    https://www.planet3dnow.de/cms/50413-praesentation-amd-hot-chips-31-navi-und-rdna-whitepaper/


    This is how AMD rasterizes polygons:
    https://www.amd.com/system/files/documents/rdna-whitepaper.pdf
     
    #28 Digidi, Aug 21, 2020
    Last edited: Aug 21, 2020
    Lightman likes this.
  9. CarstenS

    Legend Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,156
    Likes Received:
    2,663
    Location:
    Germany
    Pixels are composed of the four RGBA channels, each occupying a SIMD-lane at some point.
     
  10. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,345
    Likes Received:
    174
    Location:
    San Francisco
    AFAIK no modern GPU works this way. Each pixel runs on a SIMD/T lane.
     
    Silent_Buddha, 3dcgi and CarstenS like this.
  11. CarstenS

    Legend Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,156
    Likes Received:
    2,663
    Location:
    Germany
    Yes, most probably. But they nevertheless occupy a lane at some point (in space and time). And the processor groups in the SM (like FP32, INT32) are 16-wide AFAIK, taking two clocks to process their 32-wide warps.
     
  12. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,469
    Likes Received:
    4,398
    Location:
    Well within 3d
    I think the warps are still 32-wide, and the exact details of Nvidia's solution might not be disclosed. However, if it's similar to AMD's situation, the solution is to take multiple rasterizer clocks to supply the coverage information for a warp/wavefront. AMD's rasterizers have maxed out at 16 pixels/clock despite having 64/32-wide waves. Pixel shader waves aren't required to launch every cycle.
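
    (For scale: at 16 pixels per clock, fully covering a wave32 takes two rasterizer clocks and a wave64 takes four, assuming full quad coverage.)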
     
  13. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,446
    Likes Received:
    302
    It seems this thread has gone off topic so I'll continue the trend. :)

    On AMD hardware multiple primitives can contribute to a PS wave. Up to 16 in fact.
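
    (That number lines up with a 64-wide wave holding sixteen 2x2 quads, i.e. one primitive per quad at minimum granularity, assuming quad-granular packing.)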

    The API needs to change to allow developers to specify texture LOD.

    Some of what it does is fetch indices, form primitives, and distribute primitives. Some of the geometry processing, like the tasks you mentioned, is distributed.
     
    w0lfram, pharma, Lightman and 6 others like this.
  14. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    417
    Likes Received:
    416
    Does it make sense at all?
    Wouldn't it be easier to use RT for the visibility pass? With RT, it's possible to draw billions of triangles as long as the BVH fits in video memory. This presumes heavy instancing, but it's not as if UE5 doesn't use it.
     
  15. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    417
    Likes Received:
    416
    RT also allows for much more customizable sampling patterns, so techniques such as TAA or DLSS might benefit a lot. RT will be much faster on micro triangles right out of the box, and then much more can be achieved on top of that via more robust reconstruction techniques. Looks like a way forward to me.
     
  16. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    411
    Likes Received:
    457
    How? When I looked at the compiled shaders, I saw non-interpolated data dependencies on the provoking vertex handled with scalar instructions. Everything I've seen so far indicated that VS waves can never span instance boundaries (during instanced rendering), and I was under the impression that FS waves can't span geometry primitives either.

    Are there any special preconditions which must be met to lift these restrictions?
     
  17. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,446
    Likes Received:
    302
    I can't explain what you saw in the shader. RDNA can have multiple draw instances in the same VS wave. This was sometimes true for previous hardware. For example, Xbox One and PS4 could have multiple instances in a VS wave.
     
    Krteq, Lightman and BRiT like this.
  18. milk

    milk Like Verified
    Veteran Regular

    Joined:
    Jun 6, 2012
    Messages:
    3,545
    Likes Received:
    3,548
    Layman here, but is that really the only way to derive mip level? I always understood that mips use pixel quads to decide the mip level as a clever exploitation of the fact that pixels were already computed in groups for other reasons anyway, not because that was the only possible way to derive the texel density at any given pixel. As some decade-old assumptions become obsolete, they have this domino effect of knocking down other optimizations that depended on them, but such is progress.
     
  19. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    411
    Likes Received:
    457
    For determining the mip level alone, it wouldn't be required. If your UV mapping made certain guarantees about uniform texture resolution, and you know the projection, then normal, tangent and fragment depth are sufficient to decide on the correct mip level, as well as the level of anisotropic filtering required. And the APIs already allow you to provide the derivatives if you can calculate them yourself (see the sketch below).

    And switching from quads to single-pixel packing, if the fragment shader program requires no derivatives, would not require any API changes at all; it's a pure implementation detail.

    Without these preconditions met, there isn't any real alternative to quads, though.
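
    A minimal sketch of that route, assuming analytically derived UV gradients (the values below are placeholders) and the standard log2-of-footprint LOD estimate; explicit-gradient sampling entry points in the graphics APIs accept exactly this kind of input.

    ```cpp
    #include <cmath>
    #include <cstdio>

    struct Vec2 { float x, y; };

    // Texture LOD without quad derivatives: given per-pixel UV gradients
    // derived analytically (e.g. from projection, normal and tangent for a
    // planar, uniformly mapped surface), take log2 of the larger
    // screen-space footprint measured in texels.
    float lodFromGradients(Vec2 dUVdx, Vec2 dUVdy, float texW, float texH) {
        float fx = std::hypot(dUVdx.x * texW, dUVdx.y * texH);
        float fy = std::hypot(dUVdy.x * texW, dUVdy.y * texH);
        return std::log2(std::fmax(1.0f, std::fmax(fx, fy)));
    }

    int main() {
        // Placeholder gradients for a surface tilted away from the camera;
        // the fx/fy imbalance is also what drives the anisotropy ratio.
        Vec2 dUVdx{0.002f, 0.0f}, dUVdy{0.0f, 0.008f};
        printf("lod = %f\n", lodFromGradients(dUVdx, dUVdy, 1024.0f, 1024.0f));
    }
    ```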
     
    Frenetic Pony and milk like this.
  20. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,446
    Likes Received:
    302
    I'm not an expert in this area, but my understanding is developer input is necessary to get rid of quads and perform well. It's been a discussion topic between some ISVs and IHVs for years, but seemingly hasn't been a big enough pain point to solve. There was a quad fragment merging paper years ago, but I think it was a lossy approach. I'm not aware of anyone implementing the technique.
     
    milk and Lightman like this.