GT200 (Tesla 2.0): Does it have two Rasterisers?

agent_x007

Hello

Basically, based on this die shot: LINK (warning: heavy picture).

An anonymous person (probably an engineer?) mapped the GT200 core like this:
[annotated GT200 die shot: nRKlIne.jpg]


Question: Can GT200 have two rasterisers, as seen above (they are not equal, but there are two)?

Thank you for any feedback.
 
I don't think so. The geometry throughput of all the pre-Fermi unified-shader NVIDIA chips (from G80 to GT2xx) is limited to one tri/clock max (achieving something like 70-80% of that in practice; GT200 may be a bit higher there than older chips). I can't see why you'd have multiple rasterizers for that result.
In fact a GTX 280 loses in that category by quite a bit to even an HD 3870 (simply because the GeForces of that era had a high shader clock, but the core clock was quite a bit lower than that of the Radeons, which could also achieve 1 tri/clock and IIRC might even have achieved results closer to the theoretical max). Fermi, of course, was a massive improvement there...
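For a rough sense of that gap, here's a quick Python sketch (the ~602 MHz and ~775 MHz core clocks are assumed reference values for those cards, not something measured here):

Code:
# Rough peak triangle-setup numbers, assuming 1 tri/clock at core clock.
# Clocks below are assumed reference values (MHz), not measurements.
cards = {
    "GTX 280 (GT200)": 602,
    "HD 3870 (RV670)": 775,
}
for name, clock_mhz in cards.items():
    peak_mtris = clock_mhz * 1.0  # 1 tri/clock -> Mtriangles/s
    low, high = 0.7 * peak_mtris, 0.8 * peak_mtris
    print(f"{name}: peak {peak_mtris:.0f} Mtri/s, ~{low:.0f}-{high:.0f} Mtri/s at 70-80%")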
 
You remember right, GT200 does have only one tri/cycle. I didn't find a test for GT200, but from tests of other graphics cards, once the triangle size goes above 16 pixels the per-cycle generation rate starts to drop. I think maybe GT200 uses two rasterizers to maintain pixel throughput for its 32 ROPs?
 
G80 could reach its 24 pixels/clock for INT8 colour fillrate with fullscreen triangles (just checked my old test results from ~2007, wee!) so I don't think the rasteriser was limited to 16 pixels per cycle? Of course, they could have gone to 2 smaller/cheaper rasterisers in GT200 rather than a single bigger one, or wanted to increase some other rate (e.g. depth test)... But I think it's equally likely it's a minor implementation detail; e.g. maybe it's a single "rasterizer" (+triangle setup maybe) spread over 2 layout blocks for practical reasons?

Either way I don't think it's that big a deal: it's significantly more complicated to have multiple geometry pipelines than multiple rasterisers and that part definitely only happened in Fermi.

BTW it's intriguing to see someone interested in these old architectures! Is there anything in particular that's made you curious about them now? Because if you want a mystery to solve, feel free to solve the G80's "Missing MUL" and how it evolved but still didn't work quite right in G86 and GT200... I still haven't figured that one out :p

Also this made me find my old G80 diagram from way back again, it was so beautiful and readable, fun times... ;) Definitely not all correct (and wildly overoptimistic with things like "scalable to handhelds") but not too bad given how little info we had from NVIDIA and how much we had to figure out ourselves on a pretty tight schedule!

https://www.beyond3d.com/content/reviews/1/14
 
Old tech is a good place to start if you want to figure out how modern stuff works.
I was asked by the TPU database maintainer to answer some questions; however, my knowledge of die shot reading isn't good enough :(
That's why I asked here, so that more experienced people can take a look at this.

As for G80...
[annotated G80 die shot: SYVaQOz.jpg]

;)
I thought "missing MUL" was decided to be there, but only can be used with super specific type of case/program.
Which led to general "if you can't use it in 99% of the time - it's not in there" - type of thinking.

I find it fascinating that going over 16 pix/cycle "kills" efficiency. Is that the reason why we don't see 128-ROP GPUs (the Titan V CEO Edition being the only one)?
 

In fact, NVIDIA has a lot of GPUs whose rasterizers cannot feed all of their ROPs,
such as the GTX 1060: 48 ROPs, but only two rasterizers, so in theoretical tests it can only reach half of the pixel throughput of a GTX 1080: https://www.hardware.fr/articles/948-9/performances-theoriques-pixels.html
I think the extra ROPs are just for better memory bandwidth utilisation and anti-aliasing performance.
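A minimal Python sketch of that bound (the 16 pixels/clock per Pascal raster engine and the common clock used for both cards are assumptions, not figures from this thread):

Code:
# Pixel throughput is bounded by the slower of the rasterizers and the ROPs.
# 16 px/clock per raster engine is an assumption for Pascal; the clock is a
# placeholder so both cards are compared at the same speed.
def pixel_rate_gpix(raster_engines, rops, clock_mhz, px_per_raster=16):
    px_per_clock = min(raster_engines * px_per_raster, rops)
    return px_per_clock * clock_mhz / 1000.0

print("GTX 1060:", pixel_rate_gpix(2, 48, 1700), "Gpix/s")  # raster-limited to 32 px/clk
print("GTX 1080:", pixel_rate_gpix(4, 64, 1700), "Gpix/s")  # 64 px/clk, exactly double

With those numbers the 1060 tops out at exactly half of the 1080's rate despite its 48 ROPs, which is the behaviour the hardware.fr link shows.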
 
Old NVIDIA GPUs (CUDA-compatible) were limited to one rasterized primitive every other clock cycle (half-rate), while 100% occluded geometry was processed at full rate (1 prim/cycle). This was a slight disadvantage versus the contemporary GPUs from ATi/AMD, which could rasterize geometry at full rate, though stalls in the pixel pipeline could reduce the effective throughput in many situations.
With Fermi, the upgraded parallel rasterization was still kept at half-rate for rasterized primitives, but AFAIK this was an artificial limitation for the consumer SKUs. With tessellation the throughput was supposed to be full-rate (4 prims/cycle), but I don't have conclusive test numbers.
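A toy Python model of what that costs (just the half-rate vs. full-rate assumption above, ignoring every other bottleneck):

Code:
# Toy model: rasterized (visible) primitives cost 2 clocks, fully culled or
# occluded ones cost 1 clock; nothing else is modelled.
def prims_per_clock(visible_fraction):
    avg_clocks = visible_fraction * 2.0 + (1.0 - visible_fraction) * 1.0
    return 1.0 / avg_clocks

for frac in (0.0, 0.25, 0.5, 1.0):
    print(f"{frac:.0%} of primitives visible -> {prims_per_clock(frac):.2f} prims/clock")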
 
Yep some NVIDIA GPUs definitely can't hit their peak ROP rate (without MSAA) due to rasteriser performance. I'm sure there are many reasons for that, e.g. halving the rate would make it too slow, MSAA performance, and blending is half-rate on some chips so more ROPs might still help blending if there's enough bandwidth, etc...

I just don't think G80 was limited by the rasteriser though. Looking back at my old ROP data from 2007 some of it is a bit weird, but INT8 gets very close to peak (24*575=13800 => 99.7%):

8800 GTX said:
INT16_S0 Depth-Pass: 106997.55MPixels/s
INT24_S0 Depth-Pass: 106997.55MPixels/s
INT24_S8 Depth-Pass: 106997.55MPixels/s
FP32_S0 Depth-Pass: 108100.62MPixels/s
FP32_S8 Depth-Pass: 54899.27MPixels/s

INT16_S0 Depth-Reject: 145635.55MPixels/s
INT24_S0 Depth-Reject: 145635.55MPixels/s
INT24_S8 Depth-Reject: 145635.55MPixels/s
FP32_S0 Depth-Reject: 110376.42MPixels/s
FP32_S8 Depth-Reject: 54899.27MPixels/s

INT5x3 Colour: 11676.79MPixels/s
INT8x4 Colour: 13760.84MPixels/s
INT16x1 Colour: N/A
INT16x4 Colour: 6889.46MPixels/s
INT32x1 Colour: N/A
INT32x4 Colour: N/A
FP10x3 Colour: 6786.90MPixels/s
FP16x1 Colour: 11663.80MPixels/s
FP16x4 Colour: 6889.46MPixels/s
FP32x1 Colour: 11484.95MPixels/s
FP32x4 Colour: 3443.60MPixels/s

INT5x3 Blend: 6405.47MPixels/s
INT8x4 Blend: 6889.46MPixels/s
INT16x1 Blend: N/A
INT16x4 Blend: N/A
INT32x1 Blend: N/A
INT32x4 Blend: N/A
FP10x3 Blend: 6773.75MPixels/s
FP16x1 Blend: 6351.16MPixels/s
FP16x4 Blend: 6054.13MPixels/s
FP32x1 Blend: 3443.60MPixels/s
FP32x4 Blend: 861.75MPixels/s

For comparison's sake, I just ran the same test on my current RTX 2060:
INT16_S0 Depth-Pass: 45392.90MPixels/s
INT24_S0 Depth-Pass: 227951.30MPixels/s
INT24_S8 Depth-Pass: 233016.89MPixels/s
FP32_S0 Depth-Pass: 213995.10MPixels/s
FP32_S8 Depth-Pass: 255750.24MPixels/s

INT16_S0 Depth-Reject: 1165084.43MPixels/s
INT24_S0 Depth-Reject: 1165084.43MPixels/s
INT24_S8 Depth-Reject: 1165084.43MPixels/s
FP32_S0 Depth-Reject: 243854.88MPixels/s
FP32_S8 Depth-Reject: 953250.89MPixels/s

INT5x3 Colour: 81284.96MPixels/s
INT8x4 Colour: 80043.97MPixels/s
INT16x1 Colour: N/A
INT16x4 Colour: 40960.00MPixels/s
INT32x1 Colour: N/A
INT32x4 Colour: N/A
FP10x3 Colour: 80659.69MPixels/s
FP16x1 Colour: 80043.97MPixels/s
FP16x4 Colour: 40800.62MPixels/s
FP32x1 Colour: 81284.96MPixels/s
FP32x4 Colour: 19030.42MPixels/s

INT5x3 Blend: 77101.18MPixels/s
INT8x4 Blend: 77672.30MPixels/s
INT16x1 Blend: N/A
INT16x4 Blend: 39568.91MPixels/s
INT32x1 Blend: N/A
INT32x4 Blend: N/A
FP10x3 Blend: 37583.37MPixels/s
FP16x1 Blend: 41445.69MPixels/s
FP16x4 Blend: 40800.62MPixels/s
FP32x1 Blend: 21098.11MPixels/s
FP32x4 Blend: 5180.71MPixels/s
Kinda interesting INT16 is so slow - makes me think NVIDIA is literally emulating it in the shader core with the pixel shader modifying the depth manually! The FP32_S0 result is abnormally slow as well, might be an outlier or a bug in my code.

However this is an OpenGL test, not Vulkan or DirectX, so there might be some weirdness from that as well which has nothing to do with the HW. I definitely wouldn't trust these results as much as anything from a modern Vulkan/DX microbenchmark framework, or even any DX10-based numbers for G80, unfortunately none of them exist with this kind of data AFAIK...
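To put numbers on the ROP efficiency, here's the same arithmetic as the 24*575 calculation above in a few lines of Python (the RTX 2060 figures of 48 ROPs and ~1965 MHz boost are my assumptions, and the card may not have been running at that clock during the fillrate test):

Code:
# Measured INT8x4 colour fillrate vs. the theoretical ROP peak (ROPs * clock).
# RTX 2060: 48 ROPs and ~1965 MHz boost are assumptions; 8800 GTX: 24 ROPs at 575 MHz.
def rop_efficiency(measured_mpix, rops, clock_mhz):
    peak_mpix = rops * clock_mhz
    return 100.0 * measured_mpix / peak_mpix

print(f"8800 GTX: {rop_efficiency(13760.84, 24, 575):.1f}% of peak")   # ~99.7%
print(f"RTX 2060: {rop_efficiency(80043.97, 48, 1965):.1f}% of peak")  # ~84.9%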

And yes, G80 could only do 1 tri/clk when the triangle was culled, and it could also only do 1 index/clk. Again, here's my old set of tests from 2006/2007:
8800 GTX said:
CULL: EVERYTHING - DEPTH: ALWAYS PASS
Triangle Setup 3V/Tri CST: 190.476190 Mtriangles/s
Triangle Setup 2V/Tri AVG: 285.714286 Mtriangles/s
Triangle Setup 2V/Tri CST: 286.567164 Mtriangles/s
Triangle Setup 1V/Tri CST: 564.705882 Mtriangles/s
Triangle Setup 1V/T Strip: 561.403509 Mtriangles/s

CULL: NOTHING - DEPTH: NEVER PASS (LESS)
Triangle Setup 3V/Tri CST: 191.235060 Mtriangles/s
Triangle Setup 2V/Tri AVG: 229.665072 Mtriangles/s
Triangle Setup 2V/Tri CST: 283.185841 Mtriangles/s
Triangle Setup 1V/Tri CST: 282.352941 Mtriangles/s
Triangle Setup 1V/T Strip: 282.352941 Mtriangles/s

CULL: NOTHING - DEPTH: ALWAYS PASS
Triangle Setup 3V/Tri CST: 191.235060 Mtriangles/s
Triangle Setup 2V/Tri AVG: 228.571429 Mtriangles/s
Triangle Setup 2V/Tri CST: 282.352941 Mtriangles/s
Triangle Setup 1V/Tri CST: 282.352941 Mtriangles/s
Triangle Setup 1V/T Strip: 282.352941 Mtriangles/s

That test didn't age very well, as it becomes limited by other parts of the GPU especially for the version where the depth test succeeds (I can't remember the size of the triangles off the top of my head so please don't read too much into it!) but the results on RTX 2060 are still quite interesting:
RTX 2060 said:
===================
CULL: EVERYTHING - DEPTH: ALWAYS PASS
===================
Triangle Setup 3V/Tri CST: 3128.666504 Mtriangles/s (61.368000 ms)
Triangle Setup 2V/Tri AVG: 6869.164063 Mtriangles/s (27.951000 ms)
Triangle Setup 2V/Tri CST: 6636.938477 Mtriangles/s (28.929001 ms)
Triangle Setup 1V/Tri CST: 13073.675781 Mtriangles/s (14.686000 ms)
Triangle Setup 1V/Tri Strip: 12531.001953 Mtriangles/s (15.322000 ms)

===================
CULL: NOTHING - DEPTH: NEVER PASS (LESS)
===================
Triangle Setup 3V/Tri CST: 2155.269287 Mtriangles/s (89.084000 ms)
Triangle Setup 2V/Tri AVG: 2232.843750 Mtriangles/s (85.988998 ms)
Triangle Setup 2V/Tri CST: 2529.210938 Mtriangles/s (75.913002 ms)
Triangle Setup 1V/Tri CST: 2552.987793 Mtriangles/s (75.206001 ms)
Triangle Setup 1V/Tri Strip: 2542.339111 Mtriangles/s (75.521004 ms)

===================
CULL: NOTHING - DEPTH: ALWAYS PASS
===================
Triangle Setup 3V/Tri CST: 990.824524 Mtriangles/s (193.778000 ms)
Triangle Setup 2V/Tri AVG: 997.371521 Mtriangles/s (192.505997 ms)
Triangle Setup 2V/Tri CST: 1000.109375 Mtriangles/s (191.979004 ms)
Triangle Setup 1V/Tri CST: 1001.100159 Mtriangles/s (191.789001 ms)
Triangle Setup 1V/Tri Strip: 994.375549 Mtriangles/s (193.085999 ms)
Based on that and an apparent boost clock of 1965MHz on my RTX 2060 with 30 SMs, each SM can process 1 index in ~4 clocks on TU106 (for backface culled triangles). The performance is clearly a *lot* lower for triangles that aren't culled, but I wouldn't trust my numbers too much as the triangles probably aren't as small as they should be nowadays for maximum performance.
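Roughly how that per-SM figure falls out of the numbers, in Python (same assumptions as above: 1965 MHz apparent boost and 30 SMs on TU106):

Code:
# Working backwards from the fastest culled-triangle result above.
mtris_per_s = 13073.675781   # 1V/Tri CST, cull everything
clock_mhz   = 1965.0         # apparent boost clock (assumption)
num_sms     = 30             # TU106 in the RTX 2060

indices_per_clock = mtris_per_s / clock_mhz             # chip-wide, ~6.65
clocks_per_index_per_sm = num_sms / indices_per_clock   # ~4.5
print(f"{indices_per_clock:.2f} indices/clock chip-wide, "
      f"~{clocks_per_index_per_sm:.1f} clocks per index per SM")

That comes out at ~4.5 clocks per index per SM with these exact numbers, i.e. in the same ballpark as the ~4-clock estimate.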
 
Maybe I should just open source these tests, and if someone still has both a G80 and a GT200 lying around in their basement they could give it a go ;) But then I'm a bit scared people would actually use them and take the results seriously on modern HW due to lack of alternatives, which wouldn't be a great idea...

agent_x007, where do these annotated die shots come from?! I've never seen that one from G80! You don't have an annotated die shot of NV30 by any chance? :p

---

And yes, the Missing MUL was effectively not there (except for multiplying by 1/W for interpolation), but the really bizarre thing is how it evolved through both driver versions and hardware revisions. I swear there was one driver revision on G80 where some of my old (lost to time) microbenchmarks actually used the MUL slightly more effectively, but then it reverted in later drivers to being practically never used. I really wish I had written down that driver revision back then - maybe it was just buggy...
 
Source is classified ;)
No NV30, sadly.
I use the free 3DMark Vantage to test fillrates (I know it's VERY limited, but since I run it anyway, I can use its numbers):
[3DMark Vantage fillrate results: 56qsnk8.png]


And like you said @Arun, there are no good fillrate tests out there :(
Same can be said about triangle throughput.
Maybe 3DMark should think about doing something about it?
I know the GTX 780 is quite complicated in pixel fillrate:
Each of its rasterisers does 8 pixels/cycle (assuming they aren't changed from GK104), and the GPU itself may have four or five of them (depending on luck).
On top of that, the 48 ROPs guarantee a few will always be left hanging (though not in blending, like you mentioned), all while each SMX can do 4 pixels/cycle (IIRC).
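A quick Python sketch of the resulting per-clock bound (the 8 px/clk per rasteriser and 4 px/clk per SMX rates come from the post itself; the SMX count of 12 for a GTX 780 is my assumption):

Code:
# Per-clock pixel bound for a GTX 780 from the rates quoted above.
# SMX count of 12 is an assumption; 8 px/clk per rasterizer and 4 px/clk
# per SMX are the figures from the post.
def px_per_clock(rasterizers, smx_count, rops, px_per_raster=8, px_per_smx=4):
    return min(rasterizers * px_per_raster, smx_count * px_per_smx, rops)

for rasterizers in (4, 5):   # "four or five, depending on luck"
    bound = px_per_clock(rasterizers, 12, 48)
    print(f"{rasterizers} rasterizers: {bound} px/clock (out of 48 ROPs)")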
 
You remember right, GT200 does have only one tri/cycle. I didn't find a test for GT200, but from tests of other graphics cards, once the triangle size goes above 16 pixels the per-cycle generation rate starts to drop. I think maybe GT200 uses two rasterizers to maintain pixel throughput for its 32 ROPs?
Ah yes you're right it could have multiple rasterizers despite only being 1 prim/clock. Honestly I have no idea what the max pixel throughput there is, and how you'd actually distinguish one rasterizer with some throughput versus two with half the throughput each (well the hd 5870 had two rasterizers, which could be easily seen in the block diagrams...).
 
GT200 fragment rate is 32 per clock to the pixel pipes, which matches the ROP throughput.
G80 has the same scan output of 32, but it's limited by the 24 ROPs. On top of that, pixel blending is half-rate.
(well the hd 5870 had two rasterizers, which could be easily seen in the block diagrams...).
Those were two scan converters, not complete setup pipes. The diagram was a bit misleading.
 
Isn't this the case with the PS4 Pro with its 64 ROPs? Or did they double the setup units too?
The setup/rasterizer of PS4 Pro has doubled.
agent_x007, where do these annotated die shots come from?! I've never seen that one from G80! You don't have an annotated die shot of NV30 by any chance? :p
@rSkip made a website for this: https://misdake.github.io/ChipAnnotationViewer/
 
Interesting! Out of curiosity where was this announced/discussed? And how can I zoom in "properly" - I can't find any instructions, or is this just Firefox misbehaving and I should use another browser?

BTW - the layout blocks between TU106 and NVIDIA's Xavier GPU are surprisingly different! That might seem completely off-topic, but I think it's also a really good reminder that layout blocks don't necessarily match what you'd expect from the "logical" top-level architecture, and they're sometimes more of an implementation detail based on layout restrictions etc... you might split 1 unit into 2 layout blocks if necessary, or merge 2 units into a single layout block sometimes.

Compare the annotated TU106/TU116 to Xavier: https://en.wikichip.org/w/images/d/da/nvidia_xavier_die_shot_(annotated).png

As far as I can tell... on Xavier/Volta, each SM is split into 2 layout blocks (32xFMA per layout block) with a single 128KiB L1 layout block. On Turing, each SM is a single layout block (64xFMA per layout block) but 2 SMs share a single L1 layout block (192KiB total where each SM's L1 is 96KiB, but can only be split 32KiB L1+64KiB shared or 64KiB L1+32KiB shared, unlike Volta which is very flexible in how you split L1/shared).

NVIDIA also reduced L1 bandwidth from 128 bytes/clk on Volta to 64 bytes/clk on Turing, which I guess makes sense for a more graphics-oriented GPU... but it definitely seems like the low-level microarchitecture of Turing is more different from Volta than I thought!
 

https://misdake.github.io/ChipAnnotationViewer/view.html?map=GT200&commentId=475917799
 