GT200 (Tesla 2.0) : Does it have two Rasterisers ?

Discussion in 'Architecture and Products' started by agent_x007, Mar 25, 2019.

  1. agent_x007

    Newcomer

    Joined:
    Dec 2, 2014
    Messages:
    19
    Likes Received:
    0
    Hello

    Basically, it's based on this die shot: LINK (warning: large image).

    An anonymous person (probably an engineer?) mapped the GT200 die like this:
    [IMG: annotated GT200 die shot]

    Question: Could GT200 have two rasterisers, as seen above (they're not equal in size, but there are two)?

    Thank you for any feedback.
     
  2. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,004
    Likes Received:
    109
    I don't think so. The geometry throughput of all the pre-Fermi unified-shader NVIDIA chips (from G80 to GT2xx) is limited to one tri/clock max (achieving something like 70-80% of that in practice; GT200 may be a bit higher there than older chips). I can't see why you'd have multiple rasterizers for that result.
    In fact a GTX 280 loses in that category by quite a bit to even an HD 3870 (simply because the GeForces of that era had a high shader clock, but the core clock was quite a bit lower than that of the Radeons, which could also achieve 1 tri/clock and IIRC might even have achieved results closer to the theoretical max). Fermi, of course, was a massive improvement there...
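    To put rough numbers on it, here's a back-of-the-envelope sketch (the reference clocks are from memory and the efficiency figures are just the rough percentages mentioned above, so treat it as illustrative only):

    [code]
    # Rough triangle-setup comparison, assuming 1 tri/clock for both chips.
    # Clocks are reference figures from memory; efficiencies are rough guesses.
    chips = {
        "GTX 280 (GT200)": {"core_mhz": 602, "efficiency": 0.75},  # ~70-80% of peak in practice
        "HD 3870 (RV670)": {"core_mhz": 775, "efficiency": 0.95},  # IIRC closer to theoretical max
    }

    for name, c in chips.items():
        peak_mtris = c["core_mhz"]                    # 1 tri per core clock -> Mtris/s
        effective = peak_mtris * c["efficiency"]
        print(f"{name}: peak {peak_mtris} Mtri/s, ~{effective:.0f} Mtri/s in practice")
    [/code]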
     
  3. iamw

    Newcomer

    Joined:
    Jul 20, 2010
    Messages:
    20
    Likes Received:
    44
    You remember right, GT200 only does one tri/cycle. I couldn't find a test for GT200, but from tests of other graphics cards, once triangles are larger than 16 pixels the pixel generation rate stops scaling past 16 pixels per cycle. So maybe GT200 uses two rasterizers to maintain pixel throughput for its 32 ROPs?
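    As a quick sanity check of that idea (the 16 px/clk scan-conversion width is an assumption carried over from tests of other chips, not something measured on GT200 itself):

    [code]
    # Can one 16 px/clk rasteriser feed 32 ROPs?
    # The 16 px/clk figure is an assumption based on tests of other chips.
    rop_count = 32              # GT200 has 32 ROPs (1 colour pixel/clock each)
    raster_px_per_clk = 16      # assumed scan-conversion width of one rasteriser

    print("Rasterisers needed to saturate the ROPs:", rop_count // raster_px_per_clk)
    # -> 2, which would line up with two rasterisers on the die shot
    [/code]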
     
  4. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    299
    Location:
    UK
    G80 could reach its 24 pixels/clock for INT8 colour fillrate with fullscreen triangles (just checked my old test results from ~2007, wee!) so I don't think the rasteriser was limited to 16 pixels per cycle? Of course, they could have gone to 2 smaller/cheaper rasterisers in GT200 rather than a single bigger one, or wanted to increase some other rate (e.g. depth test)... But I think it's equally likely it's a minor implementation detail; e.g. maybe it's a single "rasterizer" (+triangle setup maybe) spread over 2 layout blocks for practical reasons?

    Either way I don't think it's that big a deal: it's significantly more complicated to have multiple geometry pipelines than multiple rasterisers and that part definitely only happened in Fermi.

    BTW it's intriguing to see someone interested in these old architectures! Is there anything in particular that's made you curious about them now? Because if you want a mystery to solve, feel free to solve the G80's "Missing MUL" and how it evolved but still didn't work quite right in G86 and GT200... I still haven't figured out that one :p

    Also this made me find my old G80 diagram from way back again, it was so beautiful and readable, fun times... ;) Definitely not all correct (and wildly overoptimistic with things like "scalable to handhelds") but not too bad given how little info we had from NVIDIA and how much we had to figure out ourselves on a pretty tight schedule!

    https://www.beyond3d.com/content/reviews/1/14
     
    agent_x007, Lightman and AlBran like this.
  5. agent_x007

    Newcomer

    Joined:
    Dec 2, 2014
    Messages:
    19
    Likes Received:
    0
    Old tech is a good place to start if you want to figure out how modern stuff works.
    I was asked by the TPU database maintainer to answer some questions; however, my die-shot reading skills aren't good enough :(
    That's why I asked here, so more experienced people can take a look at it.

    As for G80...
    [IMG: annotated G80 die shot]
    ;)
    I thought "missing MUL" was decided to be there, but only can be used with super specific type of case/program.
    Which led to general "if you can't use it in 99% of the time - it's not in there" - type of thinking.

    I find it fascinating that going over 16 pixels/cycle "kills" efficiency. Is that the reason we don't see 128-ROP GPUs (the Titan V "CEO Edition" being the only one)?
     
    #5 agent_x007, Mar 26, 2019
    Last edited: Mar 26, 2019
  6. iamw

    Newcomer

    Joined:
    Jul 20, 2010
    Messages:
    20
    Likes Received:
    44
    In fact, NVIDIA has a lot of GPUs whose rasterizers cannot feed all of their ROPs.
    For example, the GTX 1060 has 48 ROPs but only two rasterizers, so in theoretical tests it only reaches half the pixel throughput of a GTX 1080: https://www.hardware.fr/articles/948-9/performances-theoriques-pixels.html
    I think the extra ROPs are just there for memory bandwidth utilisation and anti-aliasing performance.
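    To illustrate, with the commonly assumed 16 px/clk per rasteriser (GPC) on Pascal - the per-GPC rate is an assumption, not a measured value:

    [code]
    # Rasteriser-limited vs ROP-limited pixel rates on two Pascal parts.
    # 16 px/clk per GPC is the commonly assumed scan-out width, not a measured figure.
    gpus = {
        "GTX 1060 (GP106)": {"gpcs": 2, "rops": 48},
        "GTX 1080 (GP104)": {"gpcs": 4, "rops": 64},
    }
    px_per_gpc = 16

    for name, g in gpus.items():
        raster_limit = g["gpcs"] * px_per_gpc
        limit = min(raster_limit, g["rops"])
        print(f"{name}: raster {raster_limit} px/clk, ROPs {g['rops']} px/clk -> limited to {limit} px/clk")
    [/code]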
     
    agent_x007 likes this.
  7. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,452
    Likes Received:
    336
    Location:
    Varna, Bulgaria
    Old NVIDIA GPUs (the CUDA-capable ones) were limited to one rasterized primitive every other clock cycle (half-rate), while 100% occluded geometry was processed at full rate (1 prim/cycle). This put them at a slight disadvantage against the contemporary GPUs from ATi/AMD, which could rasterize geometry at full rate, though stalls in the pixel pipeline could reduce the effective throughput in many situations.
    With Fermi, the upgraded parallel rasterization was still kept at half-rate for rasterized primitives, but AFAIK this was an artificial limitation on the consumer SKUs. With tessellation the throughput was supposed to be full rate (4 prims/cycle), but I don't have conclusive test numbers.
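    Putting rough numbers on the half-rate vs. full-rate distinction (the core clock here is just the GTX 280 reference figure, used purely as an illustration):

    [code]
    # Half-rate vs full-rate primitive throughput in absolute terms,
    # using the GTX 280 reference core clock purely as an example.
    core_clock_mhz = 602

    rasterised_mprims = core_clock_mhz * 0.5   # 1 primitive every other clock
    culled_mprims     = core_clock_mhz * 1.0   # fully occluded/culled geometry at 1 prim/clock

    print(f"Rasterised: ~{rasterised_mprims:.0f} Mprims/s, culled: ~{culled_mprims:.0f} Mprims/s")
    [/code]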
     
    CarstenS likes this.
  8. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    9,544
    Likes Received:
    4,203
    Isn't this the case with the PS4 Pro with its 64 ROPs? Or did they double the setup units too?
     
  9. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,452
    Likes Received:
    336
    Location:
    Varna, Bulgaria
    The setup pipeline was doubled as well.
     
  10. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    299
    Location:
    UK
    Yep, some NVIDIA GPUs definitely can't hit their peak ROP rate (without MSAA) due to rasteriser performance. I'm sure there are many reasons for that: e.g. halving the ROP rate would make it too slow, MSAA performance matters, and blending is half-rate on some chips so more ROPs might still help blending if there's enough bandwidth, etc...

    I just don't think G80 was limited by the rasteriser though. Looking back at my old ROP data from 2007 some of it is a bit weird, but INT8 gets very close to peak (24*575=13800 => 99.7%):

    For comparison's sake, I just ran the same test on my current RTX 2060:
    Kinda interesting INT16 is so slow - makes me think NVIDIA is literally emulating it in the shader core with the pixel shader modifying the depth manually! The FP32_S0 result is abnormally slow as well, might be an outlier or a bug in my code.

    However this is an OpenGL test, not Vulkan or DirectX, so there might be some weirdness from that as well which has nothing to do with the HW. I definitely wouldn't trust these results as much as anything from a modern Vulkan/DX microbenchmark framework, or even any DX10-based numbers for G80, unfortunately none of them exist with this kind of data AFAIK...

    And yes, G80 could only do 1 tri/clk even when the triangle was culled, and it could also only do 1 index/clk. Again, here's my old set of tests from 2006/2007:
    That test didn't age very well, as it becomes limited by other parts of the GPU especially for the version where the depth test succeeds (I can't remember the size of the triangles off the top of my head so please don't read too much into it!) but the results on RTX 2060 are still quite interesting:
    Based on that and an apparent boost clock of 1965MHz on my RTX 2060 with 30 SMs, each SM can process 1 index in ~4 clocks on TU106 (for backface culled triangles). The performance is clearly a *lot* lower for triangles that aren't culled, but I wouldn't trust my numbers too much as the triangles probably aren't as small as they should be nowadays for maximum performance.
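    For reference, here's the arithmetic behind those two numbers (the TU106 boost clock is just what my card reports and the ~4 clocks/index figure is derived from my own measurement, so both are approximate):

    [code]
    # G80 peak INT8 fillrate vs the measured result from the old test:
    g80_rops, g80_core_mhz = 24, 575
    g80_peak = g80_rops * g80_core_mhz                 # 13800 Mpixels/s
    print(f"G80 peak: {g80_peak} Mpix/s, measured ~{0.997 * g80_peak:.0f} Mpix/s (99.7%)")

    # Implied aggregate index rate on TU106 if each SM takes ~4 clocks per index
    # (boost clock is what my card reports, so this is approximate):
    tu106_sms, tu106_boost_mhz, clocks_per_index = 30, 1965, 4
    index_rate_midx = tu106_sms * tu106_boost_mhz / clocks_per_index
    print(f"TU106 culled-index rate: ~{index_rate_midx / 1e3:.1f} Gidx/s")
    [/code]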
     
    agent_x007 and pharma like this.
  11. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    299
    Location:
    UK
    Maybe I should just open source these tests, and if someone still has both a G80 and a GT200 lying around in their basement they could give it a go ;) But then I'm a bit scared people would actually use them and take the results seriously on modern HW due to lack of alternatives, which wouldn't be a great idea...

    agent_x007, where do these annotated die shots come from?! I've never seen that one from G80! You don't have an annotated die shot of NV30 by any chance? :p

    ---

    And yes, the Missing MUL was effectively not there (except for multiplying by 1/W for interpolation), but the really bizarre thing is how it evolved through both driver versions and hardware revisions. I swear there was one driver revision on G80 where some of my old (now lost) microbenchmarks actually used the MUL slightly more effectively, but then it reverted in later drivers to being practically never used. I really wish I had written down that driver revision back then - maybe it was just buggy...
     
    Lightman likes this.
  12. agent_x007

    Newcomer

    Joined:
    Dec 2, 2014
    Messages:
    19
    Likes Received:
    0
    Source is classified ;)
    No NV30 sadly.
    I use the free 3DMark Vantage to test fillrates (I know it's VERY limited, but since I run it anyway, I can use its numbers):
    [IMG: 3DMark Vantage fillrate results]

    And like you said, @Arun, there are no good fillrate tests out there :(
    Same can be said about triangle throughput.
    Maybe 3DMark should think about doing something about it ?
    I know the GTX 780's pixel fillrate situation is quite complicated:
    Each of its rasterisers does 8 pixels/cycle (assuming they're unchanged from GK104), and the GPU may have four or five of them (depending on luck).
    On top of that, its 48 ROPs guarantee a few will always be left hangin' (except when blending, like you mentioned), while each SMX can only do 4 pixels/cycle (IIRC).
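    A rough sketch of how those per-clock limits stack up (the 8 px/clk per rasteriser and 4 px/clk per SMX figures are the assumptions carried over from GK104 above):

    [code]
    # GTX 780 per-clock pixel-rate limits, using the per-unit rates assumed above.
    rops          = 48
    smx_count     = 12
    px_per_smx    = 4    # assumed SMX pixel export rate
    px_per_raster = 8    # assumed rasteriser width, carried over from GK104

    for gpcs in (4, 5):  # GTX 780 can ship with either, depending on harvesting
        limits = {
            "rasterisers": gpcs * px_per_raster,
            "SMX export":  smx_count * px_per_smx,
            "ROPs":        rops,
        }
        bottleneck = min(limits, key=limits.get)
        print(f"{gpcs} GPCs: {limits} -> bottleneck: {bottleneck} ({limits[bottleneck]} px/clk)")
    [/code]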
     
    #12 agent_x007, Mar 26, 2019
    Last edited: Mar 26, 2019
  13. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,004
    Likes Received:
    109
    Ah yes, you're right, it could have multiple rasterizers despite only being 1 prim/clock. Honestly I have no idea what the max pixel throughput is there, or how you'd actually distinguish one rasterizer with some throughput from two with half the throughput each (well, the HD 5870 had two rasterizers, which could easily be seen in the block diagrams...).
     
  14. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,452
    Likes Received:
    336
    Location:
    Varna, Bulgaria
    GT200's fragment rate is 32 per clock into the pixel pipes, which matches its ROP throughput.
    G80 has the same scan output of 32, but it's limited by the 24 ROPs. On top of that, pixel blending is half-rate.
    Those were two scan converters, not complete setup pipes. The diagram was a bit misleading.
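    In per-clock terms that works out roughly like this (just restating the figures above, with half-rate blending applied to G80):

    [code]
    # Per-clock colour-write limits as described above.
    def pixel_rate(scan_out, rops, blending=False):
        rate = min(scan_out, rops)
        return rate / 2 if blending else rate   # blending is half-rate on G80

    print("G80   plain:", pixel_rate(32, 24), "px/clk, blended:", pixel_rate(32, 24, True), "px/clk")
    print("GT200 plain:", pixel_rate(32, 32), "px/clk")
    [/code]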
     
    agent_x007 likes this.
  15. iamw

    Newcomer

    Joined:
    Jul 20, 2010
    Messages:
    20
    Likes Received:
    44
    The setup/rasterizer of the PS4 Pro was doubled.
    @rSkip made a website for this: https://misdake.github.io/ChipAnnotationViewer/
     
  16. T4CFantasy

    Joined:
    Mar 25, 2019
    Messages:
    3
    Likes Received:
    4
    I made these based off the die shot:
    [attachments: image-original (1).jpg, image-original (2).jpg]
     
  17. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    299
    Location:
    UK
    Interesting! Out of curiosity where was this announced/discussed? And how can I zoom in "properly" - I can't find any instructions, or is this just Firefox misbehaving and I should use another browser?

    BTW - the layout blocks between TU106 and NVIDIA's Xavier GPU are surprisingly different! That might seem completely off-topic, but I think it's also a really good reminder that layout blocks don't necessarily match what you'd expect from the "logical" top-level architecture, and they're sometimes more of an implementation detail based on layout restrictions etc... you might split 1 unit into 2 layout blocks if necessary, or merge 2 units into a single layout block sometimes.

    Compare the annotated TU106/TU116 to Xavier: https://en.wikichip.org/w/images/d/da/nvidia_xavier_die_shot_(annotated).png

    As far as I can tell... on Xavier/Volta, each SM is split into 2 layout blocks (32xFMA per layout block) with a single 128KiB L1 layout block. On Turing, each SM is a single layout block (64xFMA per layout block) but 2 SMs share a single L1 layout block (192KiB total where each SM's L1 is 96KiB, but can only be split 32KiB L1+64KiB shared or 64KiB L1+32KiB shared, unlike Volta which is very flexible in how you split L1/shared).

    NVIDIA also reduced L1 bandwidth from 128 bytes/clk on Volta to 64 bytes/clk on Turing, which I guess makes sense for a more graphics-oriented GPU... but it definitely seems like the low-level microarchitecture of Turing is more different from Volta than I thought!
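    As a very rough illustration of the per-SM bandwidth difference (the clock values here are placeholders I picked, not measured figures):

    [code]
    # Per-SM L1 bandwidth implied by the bytes/clk figures above.
    # The clock values are illustrative placeholders, not measured boost clocks.
    def l1_bw_gb_s(bytes_per_clk, clock_mhz):
        return bytes_per_clk * clock_mhz * 1e6 / 1e9

    print("Volta  (128 B/clk @ ~1.4 GHz):", round(l1_bw_gb_s(128, 1400)), "GB/s per SM")
    print("Turing ( 64 B/clk @ ~1.9 GHz):", round(l1_bw_gb_s(64, 1900)), "GB/s per SM")
    [/code]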
     
  18. swaaye

    swaaye Entirely Suboptimal
    Legend

    Joined:
    Mar 15, 2003
    Messages:
    8,428
    Likes Received:
    548
    Location:
    WI, USA
    nevermind :)
     
  19. T4CFantasy

    Joined:
    Mar 25, 2019
    Messages:
    3
    Likes Received:
    4
    https://misdake.github.io/ChipAnnotationViewer/view.html?map=GT200&commentId=475917799
     
    CaptainGinger likes this.