agent_x007
"You remember right, GT200 does have only one tri/cycle. I didn't find a test for GT200, but from tests of other graphics cards, the pixel generation rate decreases once triangles are larger than 16 pixels per cycle. I think maybe GT200 uses two rasterizers to maintain pixel throughput for its 32 ROPs?"
I don't think so. The geometry throughput of all the pre-Fermi unified-shader NVIDIA chips (from G80 to GT2xx) is limited to one tri/clock max (achieving something like 70-80% of that in practice; GT200 may be a bit higher there than older chips). I can't see why you'd have multiple rasterizers for this result.
In fact, a GTX 280 loses in that category by quite a bit to even an HD 3870 (simply because the GeForces of that era had a high shader clock, but the core clock was quite a bit lower than that of the Radeons, which could also achieve 1 tri/clock and, IIRC, might even have come closer to the theoretical max). Fermi, of course, was a massive improvement there...
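To put rough numbers on that, here's a quick sanity-check sketch; the core clocks are from memory and should be treated as assumptions:

```python
# Rough sanity check of the "1 tri/clock, ~70-80% achieved" claim.
# Core clocks are from memory (assumptions), in Hz.
cards = {
    "GeForce GTX 280": 602e6,  # core clock, not the much higher shader clock
    "Radeon HD 3870": 775e6,
}

TRIS_PER_CLOCK = 1  # setup limit for G80..GT2xx and R600-era Radeons

for name, clock in cards.items():
    peak = clock * TRIS_PER_CLOCK / 1e6  # Mtriangles/s
    # pre-Fermi GeForces reportedly reach ~70-80% of peak in practice
    print(f"{name}: {peak:.0f} Mtris/s peak, ~{0.7*peak:.0f}-{0.8*peak:.0f} achieved")
```

Despite the GTX 280's far higher shader clock, the lower core clock is what bounds setup, which is why the Radeon can come out ahead here.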
Old tech is a good place to start if you want to figure out how modern stuff works.
I was asked by the TPU database maintainer to answer some questions; however, my knowledge of die-shot reading isn't good enough.
That's why I asked here, so that more experienced people can take a look at this.
As for G80...
I thought the "missing MUL" was determined to be there, but only usable in very specific cases/programs.
Which led to the general "if you can't use it 99% of the time, it's not really there" type of thinking.
I find it fascinating that going over 16 pix/cycle "kills" efficiency. Is that the reason why we don't see 128-ROP GPUs (the Titan V "CEO Edition" being the only one)?
In fact, NVIDIA has a lot of GPUs whose rasterizers cannot feed all their ROPs. The GTX 1060, for example, has 48 ROPs but only two rasterizers, so in synthetic tests it can only reach half the pixel throughput of the GTX 1080. https://www.hardware.fr/articles/948-9/performances-theoriques-pixels.html
I think the extra ROPs are just for better memory bandwidth and anti-aliasing performance.
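A minimal sketch of that imbalance, assuming 2 GPCs on the GTX 1060 and 4 on the GTX 1080, each with one rasterizer scanning out 16 pixels/clock (figures from public Pascal material, not from the linked test):

```python
# Pixel output per clock is capped by min(rasterizer rate, ROP rate).
PIX_PER_RASTERIZER = 16  # assumed pixels/clock per GPC rasterizer (Pascal)

def pixel_limit(rasterizers: int, rops: int) -> int:
    """Pixels/clock the chip can actually sustain."""
    return min(rasterizers * PIX_PER_RASTERIZER, rops)

for name, (gpcs, rops) in {"GTX 1060": (2, 48), "GTX 1080": (4, 64)}.items():
    print(f"{name}: raster {gpcs * PIX_PER_RASTERIZER} pix/clk, "
          f"{rops} ROPs -> {pixel_limit(gpcs, rops)} pix/clk")
# GTX 1060 is rasterizer-bound at 32 pix/clk: half of the GTX 1080,
# with 16 of its 48 ROPs effectively unfeedable in pure fill tests.
```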
"Isn't this the case with the PS4 Pro with its 64 ROPs? Or did they double the setup units too?"
The setup pipeline was doubled as well.
8800 GTX said:
INT16_S0 Depth-Pass: 106997.55MPixels/s
INT24_S0 Depth-Pass: 106997.55MPixels/s
INT24_S8 Depth-Pass: 106997.55MPixels/s
FP32_S0 Depth-Pass: 108100.62MPixels/s
FP32_S8 Depth-Pass: 54899.27MPixels/s
INT16_S0 Depth-Reject: 145635.55MPixels/s
INT24_S0 Depth-Reject: 145635.55MPixels/s
INT24_S8 Depth-Reject: 145635.55MPixels/s
FP32_S0 Depth-Reject: 110376.42MPixels/s
FP32_S8 Depth-Reject: 54899.27MPixels/s
INT5x3 Colour: 11676.79MPixels/s
INT8x4 Colour: 13760.84MPixels/s
INT16x1 Colour: N/A
INT16x4 Colour: 6889.46MPixels/s
INT32x1 Colour: N/A
INT32x4 Colour: N/A
FP10x3 Colour: 6786.90MPixels/s
FP16x1 Colour: 11663.80MPixels/s
FP16x4 Colour: 6889.46MPixels/s
FP32x1 Colour: 11484.95MPixels/s
FP32x4 Colour: 3443.60MPixels/s
INT5x3 Blend: 6405.47MPixels/s
INT8x4 Blend: 6889.46MPixels/s
INT16x1 Blend: N/A
INT16x4 Blend: N/A
INT32x1 Blend: N/A
INT32x4 Blend: N/A
FP10x3 Blend: 6773.75MPixels/s
FP16x1 Blend: 6351.16MPixels/s
FP16x4 Blend: 6054.13MPixels/s
FP32x1 Blend: 3443.60MPixels/s
FP32x4 Blend: 861.75MPixels/s
Kinda interesting that INT16 is so slow; it makes me think NVIDIA is literally emulating it in the shader core, with the pixel shader modifying the depth manually! The FP32_S0 result is abnormally slow as well; it might be an outlier or a bug in my code.

INT16_S0 Depth-Pass: 45392.90MPixels/s
INT24_S0 Depth-Pass: 227951.30MPixels/s
INT24_S8 Depth-Pass: 233016.89MPixels/s
FP32_S0 Depth-Pass: 213995.10MPixels/s
FP32_S8 Depth-Pass: 255750.24MPixels/s
INT16_S0 Depth-Reject: 1165084.43MPixels/s
INT24_S0 Depth-Reject: 1165084.43MPixels/s
INT24_S8 Depth-Reject: 1165084.43MPixels/s
FP32_S0 Depth-Reject: 243854.88MPixels/s
FP32_S8 Depth-Reject: 953250.89MPixels/s
INT5x3 Colour: 81284.96MPixels/s
INT8x4 Colour: 80043.97MPixels/s
INT16x1 Colour: N/A
INT16x4 Colour: 40960.00MPixels/s
INT32x1 Colour: N/A
INT32x4 Colour: N/A
FP10x3 Colour: 80659.69MPixels/s
FP16x1 Colour: 80043.97MPixels/s
FP16x4 Colour: 40800.62MPixels/s
FP32x1 Colour: 81284.96MPixels/s
FP32x4 Colour: 19030.42MPixels/s
INT5x3 Blend: 77101.18MPixels/s
INT8x4 Blend: 77672.30MPixels/s
INT16x1 Blend: N/A
INT16x4 Blend: 39568.91MPixels/s
INT32x1 Blend: N/A
INT32x4 Blend: N/A
FP10x3 Blend: 37583.37MPixels/s
FP16x1 Blend: 41445.69MPixels/s
FP16x4 Blend: 40800.62MPixels/s
FP32x1 Blend: 21098.11MPixels/s
FP32x4 Blend: 5180.71MPixels/s
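A quick cross-check of the INT8x4 colour numbers against the simple ROP limit (ROPs x core clock): the 8800 GTX figures are known specs, while treating the second result set as coming from the RTX 2060 mentioned further down, at its ~1965MHz boost clock, is my assumption:

```python
# ROP-limited colour fill: ROPs x clock (MHz) = MPixels/s at 1 pixel/ROP/clock.
checks = [
    # (card, ROPs, assumed clock in MHz, measured INT8x4 Colour in MPix/s)
    ("8800 GTX", 24, 575, 13760.84),
    ("RTX 2060", 48, 1965, 80043.97),  # assumes the 2nd result set is the 2060
]

for name, rops, mhz, measured in checks:
    peak = rops * mhz
    print(f"{name}: ROP peak {peak} MPix/s, measured {measured} ({measured / peak:.0%})")
# The 8800 GTX lands at ~100% of the ROP limit; the newer card falls short,
# plausibly memory-bandwidth limited rather than ROP limited.
```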
8800 GTX said:
CULL: EVERYTHING - DEPTH: ALWAYS PASS
Triangle Setup 3V/Tri CST: 190.476190 Mtriangles/s
Triangle Setup 2V/Tri AVG: 285.714286 Mtriangles/s
Triangle Setup 2V/Tri CST: 286.567164 Mtriangles/s
Triangle Setup 1V/Tri CST: 564.705882 Mtriangles/s
Triangle Setup 1V/Tri Strip: 561.403509 Mtriangles/s
CULL: NOTHING - DEPTH: NEVER PASS (LESS)
Triangle Setup 3V/Tri CST: 191.235060 Mtriangles/s
Triangle Setup 2V/Tri AVG: 229.665072 Mtriangles/s
Triangle Setup 2V/Tri CST: 283.185841 Mtriangles/s
Triangle Setup 1V/Tri CST: 282.352941 Mtriangles/s
Triangle Setup 1V/Tri Strip: 282.352941 Mtriangles/s
CULL: NOTHING - DEPTH: ALWAYS PASS
Triangle Setup 3V/Tri CST: 191.235060 Mtriangles/s
Triangle Setup 2V/Tri AVG: 228.571429 Mtriangles/s
Triangle Setup 2V/Tri CST: 282.352941 Mtriangles/s
Triangle Setup 1V/Tri CST: 282.352941 Mtriangles/s
Triangle Setup 1V/Tri Strip: 282.352941 Mtriangles/s
Based on that and an apparent boost clock of 1965MHz on my RTX 2060 with 30 SMs, each SM can process 1 index in ~4 clocks on TU106 (for backface-culled triangles). The performance is clearly a *lot* lower for triangles that aren't culled, but I wouldn't trust my numbers too much, as the triangles probably aren't as small as they should be nowadays for maximum performance.
RTX 2060 said:
===================
CULL: EVERYTHING - DEPTH: ALWAYS PASS
===================
Triangle Setup 3V/Tri CST: 3128.666504 Mtriangles/s (61.368000 ms)
Triangle Setup 2V/Tri AVG: 6869.164063 Mtriangles/s (27.951000 ms)
Triangle Setup 2V/Tri CST: 6636.938477 Mtriangles/s (28.929001 ms)
Triangle Setup 1V/Tri CST: 13073.675781 Mtriangles/s (14.686000 ms)
Triangle Setup 1V/Tri Strip: 12531.001953 Mtriangles/s (15.322000 ms)
===================
CULL: NOTHING - DEPTH: NEVER PASS (LESS)
===================
Triangle Setup 3V/Tri CST: 2155.269287 Mtriangles/s (89.084000 ms)
Triangle Setup 2V/Tri AVG: 2232.843750 Mtriangles/s (85.988998 ms)
Triangle Setup 2V/Tri CST: 2529.210938 Mtriangles/s (75.913002 ms)
Triangle Setup 1V/Tri CST: 2552.987793 Mtriangles/s (75.206001 ms)
Triangle Setup 1V/Tri Strip: 2542.339111 Mtriangles/s (75.521004 ms)
===================
CULL: NOTHING - DEPTH: ALWAYS PASS
===================
Triangle Setup 3V/Tri CST: 990.824524 Mtriangles/s (193.778000 ms)
Triangle Setup 2V/Tri AVG: 997.371521 Mtriangles/s (192.505997 ms)
Triangle Setup 2V/Tri CST: 1000.109375 Mtriangles/s (191.979004 ms)
Triangle Setup 1V/Tri CST: 1001.100159 Mtriangles/s (191.789001 ms)
Triangle Setup 1V/Tri Strip: 994.375549 Mtriangles/s (193.085999 ms)
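To make the "~4 clocks per index" arithmetic explicit, using only the SM count and boost clock quoted above plus the 1V/Tri CST cull-everything result:

```python
# clocks per index per SM = SMs x clock / index rate
sms = 30
clock_mhz = 1965.0
measured_mtris = 13073.675781  # 1V/Tri CST, cull-everything (1 index per tri)

clocks_per_index = sms * clock_mhz / measured_mtris
print(f"~{clocks_per_index:.2f} clocks per index per SM")  # ~4.5
```

Strictly it comes out closer to ~4.5 clocks than ~4, but that's within the error bars of a rough estimate like this.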
"You remember right, GT200 does have only one tri/cycle. I didn't find a test for GT200, but from tests of other graphics cards, the pixel generation rate decreases once triangles are larger than 16 pixels per cycle. I think maybe GT200 uses two rasterizers to maintain pixel throughput for its 32 ROPs?"
Ah yes, you're right, it could have multiple rasterizers despite only being 1 prim/clock. Honestly I have no idea what the max pixel throughput there is, and how you'd actually distinguish one rasterizer with some throughput from two with half the throughput each (well, the HD 5870 had two rasterizers, which could be easily seen in the block diagrams...).
"(well, the HD 5870 had two rasterizers, which could be easily seen in the block diagrams...)"
Those were two scan converters, not complete setup pipes. The diagram was a bit misleading.
"Isn't this the case with the PS4 Pro with its 64 ROPs? Or did they double the setup units too?"
The setup/rasterizer of the PS4 Pro was doubled as well.
@rSkip made a website for this: https://misdake.github.io/ChipAnnotationViewer/

Maybe I should just open source these tests, and if someone still has both a G80 and a GT200 lying around in their basement they could give it a go. But then I'm a bit scared people would actually use them and take the results seriously on modern HW due to lack of alternatives, which wouldn't be a great idea...
agent_x007, where do these annotated die shots come from?! I've never seen that one from G80! You don't have an annotated die shot of NV30 by any chance?
---
And yes, the missing MUL was effectively not there (except for multiplying by 1/W for interpolation), but the really bizarre thing is how it evolved through both driver versions and hardware revisions. I swear there was one driver revision on G80 where some of my old microbenchmarks actually used the MUL slightly more effectively, but then it reverted in later drivers to being practically never used. I really wish I had written down that driver revision, but it's lost to time now; maybe it was just buggy...
"@rSkip made a website for this: https://misdake.github.io/ChipAnnotationViewer/"
Interesting! Out of curiosity, where was this announced/discussed? And how can I zoom in "properly"? I can't find any instructions, or is this just Firefox misbehaving and I should use another browser?
BTW - the layout blocks between TU106 and NVIDIA's Xavier GPU are surprisingly different! That might seem completely off-topic, but I think it's also a really good reminder that layout blocks don't necessarily match what you'd expect from the "logical" top-level architecture, and they're sometimes more of an implementation detail based on layout restrictions etc... you might split 1 unit into 2 layout blocks if necessary, or merge 2 units into a single layout block sometimes.
Compare the annotated TU106/TU116 to Xavier: https://en.wikichip.org/w/images/d/da/nvidia_xavier_die_shot_(annotated).png
As far as I can tell... on Xavier/Volta, each SM is split into 2 layout blocks (32xFMA per layout block) with a single 128KiB L1 layout block. On Turing, each SM is a single layout block (64xFMA per layout block) but 2 SMs share a single L1 layout block (192KiB total where each SM's L1 is 96KiB, but can only be split 32KiB L1+64KiB shared or 64KiB L1+32KiB shared, unlike Volta which is very flexible in how you split L1/shared).
NVIDIA also reduced L1 bandwidth from 128 bytes/clk on Volta to 64 bytes/clk on Turing, which I guess makes sense for a more graphics-oriented GPU... but it definitely seems like the low-level microarchitecture of Turing is more different from Volta than I thought!
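To make that bandwidth gap concrete, a tiny sketch; the bytes/clock figures are the ones quoted above, while the SM count and clock are purely hypothetical:

```python
# Aggregate L1 bandwidth = bytes/clock/SM x clock x SM count.
L1_BYTES_PER_CLK = {"Volta-style SM": 128, "Turing-style SM": 64}

sms = 30         # hypothetical SM count
clock_ghz = 1.5  # hypothetical clock

for name, bpc in L1_BYTES_PER_CLK.items():
    print(f"{name}: {bpc} B/clk/SM -> {bpc * clock_ghz * sms:.0f} GB/s total")
# Halving per-SM L1 bandwidth halves the aggregate at equal SM count and clock.
```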