Beyond3D's GT200 GPU and Architecture Analysis

BTW, do you know of any good die shots of G92/G92b that let us see areas like the RV770 and GT200 shots do?
http://www.anandtech.com/weblog/showpost.aspx?i=453 but I'm honestly not sure whether that's G80 or G92... (as for the 1.2B transistor count, I have some reasons to believe that's likely GT200b)
This is why I doubt we'll see 256-bit in the console space next gen, and EDRAM is here to stay. Sony and MS will want their chips to be <100 mm² by the end of the generation.
I agree, 128-bit GDDR5 + embedded memory is a much better approach for the PS4/XBox720 generation. The only thing I'm worried about is the burst size for the CPU (I'm skeptical separate DDR3 would make sense), but heh. Of course, if you used Rambus memory then you might not get pad limited fast enough to justify using embedded memory...
I think the next step for GPU makers is to increase setup speed, particularly with cascaded and omnidirectional shadow maps. BW is hitting a wall and limiting ROP speed, and math/texturing can only do so much. There are geometry reduction techniques, but it'll take a while before they really chop down polygon count.
Yeah, I've said it many times already but I'll repeat it yet again: being able to do triangle setup in the shader core for small triangles would be a huge deal. During the shadow pass, the shader core is just idle most of the time; this is absurd and highly suboptimal. Yes, you do need to beef up the rasterization hardware in consequence (needs to be hierarchical) but that's honestly not such a big deal AFAIK. I fully stand by my comment in the GT200 article about triangle setup & texture filtering/pixel blending, just haven't had the time to properly defend it yet - guess it'd probably be worth a new thread too! :)
Wouldn't log2 and exp2 still be quite useful?
Hmmm, very good point; I guess it wouldn't be impossible to do in fixed-function hardware at that level of precision. Whether it would truly be desirable and worth the silicon cost is another question, of course.
 
Of course, if you used Rambus memory then you might not get pad limited fast enough to justify using embedded memory...
Are you sure Rambus uses fewer pins? They do differential signalling. 16-bit data busses on their chips use the same 32 lines as a GDDR1/2/3/4/5/6 chip's data bus. Unless Rambus can achieve greater than twice the speed of GDDR5, I don't see where the advantage is.

Yeah, I've said it many times already but I'll repeat it yet again: being able to do triangle setup in the shader core for small triangles would be a huge deal. During the shadow pass, the shader core is just idle most of the time; this is absurd and highly suboptimal. Yes, you do need to beef up the rasterization hardware in consequence (needs to be hierarchical) but that's honestly not such a big deal AFAIK. I fully stand by my comment in the GT200 article about triangle setup & texture filtering/pixel blending, just haven't had the time to properly defend it yet - guess it'd probably be worth a new thread too! :)
Not sure if you've read my posts on this matter, but your suggestion is simply not a good idea.

Setup is very much about scan conversion and interpolator initialization. The math required for culling/clipping/perspective division isn't expensive at all, and it would be dwarfed by the routing and extra scheduling logic needed to move setup calculations into the shader core. This is the last part of the graphics pipeline that has not been parallelized yet, and there could be a reason for that.

The same is true for texture filtering. Blending is a dependency issue, and you want to make your start-to-finish pipeline as small as possible to limit the complexities there.
 
Are you sure Rambus uses fewer pins? They do differential signalling. 16-bit data busses on their chips use the same 32 lines as a GDDR1/2/3/4/5/6 chip's data bus. Unless Rambus can achieve greater than twice the speed of GDDR5, I don't see where the advantage is.
I seem to remember reading some comparisons of bandwidth/pin where Rambus held a very substantial advantage with their next-generation technology (that I wouldn't expect to be eliminated completely via GDDR5). However, I could be remembering wrong or have miscalculated; definitely worth a recheck on my part!

Not sure if you've read my posts on this matter, but your suggestion is simply not a good idea.
I did, and I'll have to humbly disagree. A couple of points:
- Triangle Setup can already be done in the shader core in the ATTILA and the 'shader' they use for it is public in one of their papers; it's nothing extraordinary.
- Passing several position-only triangles around is very cheap, and quite a bit of logic from other subsystems can be reused for this purpose.
- It's not like the entire shader core has to be able to handle triangle setup; even just one or two clusters (or one SM per cluster) would already be very fast.
- As for texture filtering, it's not really about making things faster as much as it is about improving programmability. It could even be half-speed for a given data size and that'd be more than good enough.
- Same for ROP blending, but of course I agree that making it programmable is more complex there (although the difficulty shouldn't be overestimated; some NVIDIA patents handle it in a rather efficient way IMO).
This is the last part of the graphics pipeline that has not been parallelized yet, and there could be a reason for that.
Yes, but the reason isn't fundamentally all that different from why geometry shaders cannot be infinitely parallelized. Simply extending GS logic a bit might be able to do the trick... (although I'd have to think more about it to be certain)
 
I seem to remember reading some comparisons of bandwidth/pin where Rambus held a very substantial advantage with their next-generation technology (that I wouldn't expect to be eliminated completely via GDDR5).
They seem to mislead people a lot with BW per differential pair being confused for BW per pin. XDR2 is only 8GHz, and again needs twice the wires for the same bus width so the advantage is nil.
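To make the "BW per pair vs. BW per pin" confusion concrete, here's a quick back-of-the-envelope sketch (the per-lane rates are illustrative round numbers, not datasheet values):

```python
# Back-of-the-envelope bandwidth-per-pin comparison (illustrative numbers).
# GDDR5 is single-ended: one data pin per bit lane.
# XDR2 is differential: two pins per bit lane.

def bw_per_pin(gbps_per_lane, pins_per_lane):
    """Effective bandwidth per physical pin, in Gbps."""
    return gbps_per_lane / pins_per_lane

gddr5 = bw_per_pin(4.0, 1)   # ~4 Gbps per lane, single-ended
xdr2  = bw_per_pin(8.0, 2)   # ~8 Gbps per lane, but over a differential pair

print(gddr5, xdr2)  # 4.0 4.0 -> per pin, the advantage is nil
```

With these assumed rates the per-pin numbers come out identical, which is exactly the point being made: quoting bandwidth per differential pair doubles the apparent figure.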

I did, and I'll have to humbly disagree. A couple of points:
- Triangle Setup can already be done in the shader core in the ATTILA and the 'shader' they use for it is public in one of their papers; it's nothing extraordinary.
ATTILA is a GPU simulator, not a hardware design. They don't have to route data around and worry about various HW limitations.
- It's not like the entire shader core has to be able to handle triangle setup; even just one or two clusters (or one SM per cluster) would already be very fast.
It doesn't matter, because the data flow doesn't fit the framework of the shaders.
- As for texture filtering, it's not really about making things faster as much as it is about improving programmability. It could even be half-speed for a given data size and that'd be more than good enough.
You already have programmability with point sampling and/or fetch4. Going beyond that is rather pointless.

Arun, you need to get a handle on the size of arithmetic logic. Remember that R420, at 160M transistors, has the same setup rate, HiZ rejection rate, and scanline rasterization rate as RV770. You don't need full FP32 arithmetic for most of the operations. You don't need access to a shader program, or to pick values from 64K of register space.

The tough part is not the computational resources. It's managing the data flow. You have a list of primitives and they index into a post-transform cache. If you want to do multiple triangles per clock, you need multiple ports for this cache. You need multiple ports in the HiZ RAM, and have to worry about updating some tiles twice per clock. Handling 16 different 4x4 tiles per clock is much easier than 32 tiles that aren't necessarily unique. You have to worry about the first triangle's Z/stencil affecting the second. There are probably a ton of other issues that I can't think of.

Die cost isn't the issue here. It's just a matter of tiptoeing through the minefield of parallelizing a part of the graphics pipeline that has always been serial. There are so many corner cases that cause incorrect output if you aren't perfect.
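For what it's worth, the "cheap arithmetic" side of this can be sketched in a few lines; this toy version (screen-space edge equations plus a signed-area backface cull, ignoring clipping, perspective division, and fixed-point snapping) shows how few multiply-adds per triangle are actually involved — the data flow around it is the real problem:

```python
# Toy per-triangle setup: signed-area cull + edge-equation coefficients.
# Each edge gets E(x, y) = A*x + B*y + C with E >= 0 inside the triangle
# (for counter-clockwise winding). A handful of mul/adds per triangle.

def setup_triangle(v0, v1, v2):
    """v* are (x, y) screen-space vertices; returns a list of (A, B, C)
    per edge, or None if the triangle is backfacing/degenerate."""
    (x0, y0), (x1, y1), (x2, y2) = v0, v1, v2
    # Twice the signed area; <= 0 means backface or degenerate -> cull.
    area2 = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
    if area2 <= 0:
        return None
    edges = []
    for (xa, ya), (xb, yb) in ((v0, v1), (v1, v2), (v2, v0)):
        a = ya - yb
        b = xb - xa
        c = xa * yb - xb * ya
        edges.append((a, b, c))
    return edges

edges = setup_triangle((0, 0), (4, 0), (0, 4))
# An interior point like (1, 1) evaluates >= 0 against all three edges.
assert all(a * 1 + b * 1 + c >= 0 for a, b, c in edges)
```

A backfacing triangle (same vertices in clockwise order) returns None from the same function, i.e. the cull falls out of the area computation for free.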
 
ATTILA is a GPU simulator, not a hardware design. They don't have to route data around and worry about various HW limitations.
Duh. However, ATTILA is AFAIK what influenced Intel to implement triangle setup in their shader core for their current SM3/SM4 IGP architecture. The only reason I mentioned it is that it's the only source I've ever seen for an arithmetic, shader-like 'triangle setup' program.

It doesn't matter, because the data flow doesn't fit the framework of the shaders.
I am rather skeptical that it doesn't loosely fit the framework of the geometry shader. Of course, the geometry shader's peak rate is rather... unimpressive right now, but you can't get away from the fact that in the future it will have to get progressively faster. That same hardware could be partly shared for the data flow & synchronization of triangle setup (in fact, the two could possibly be merged into the same program).

You already have programmability with point sampling and/or fetch4. Going beyond that is rather pointless.
Getting full programmability for free would be absurd, yes. However, point sampling and/or fetch4 is so far from optimal it's pretty funny. Okay, let's look at this another way: what do you need to do texture filtering in the shader core?
- The texel colors and the weights for each bilinear operation, for every channel.
- The number of bilinear operations to perform and the n-1 weights between them.

For an INT8 texture, the colors & weights are likely all being transmitted in 16-bit format (and converted by a 'free' converter somewhere). Note that I am assuming this (rather than FP32 for everything) in order to be pessimistic for my own estimates. Therefore, it should be possible with proper packing (ugh, I know) to transmit the colors in 2 cycles and the weights in 1 cycle. So 3 cycles total for bilinear, 6 cycles for trilinear. This isn't fundamentally different from how the same paths are reused for a 4xINT8 texture or a 2xFP16 texture IMO...
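A minimal sketch of what the shader core would actually compute once it has the four texels and the two weights (the function name and plain float math are mine, purely illustrative):

```python
# Shader-side bilinear, given the four point samples plus the weights
# the TU had to compute anyway: two horizontal lerps, one vertical lerp,
# applied per channel.

def bilinear_in_shader(t00, t10, t01, t11, wx, wy):
    """t** are single-channel texel values; wx, wy are the fractional
    bilinear weights in [0, 1]."""
    top = t00 + wx * (t10 - t00)  # lerp along the top row
    bot = t01 + wx * (t11 - t01)  # lerp along the bottom row
    return top + wy * (bot - top)  # lerp between the two rows

# Halfway between four texels -> plain average of the row lerps.
print(bilinear_in_shader(0.0, 1.0, 0.0, 1.0, 0.5, 0.5))  # 0.5
```

Three MADs per channel; the cost is dominated by getting the operands there, which is exactly why the transmission cycle count above matters.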

I'm not saying no new scheduling or routing logic would be required. Duh, of course it would. However, I think you are overestimating its size, and underestimating the logic it could save by always running certain modes in the shader core. And much more importantly, I think history will prove you wrong... ;)

Arun, you need to get a handle on the size of arithmetic logic. Remember that R420, at 160M transistors, has the same setup rate, HiZ rejection rate, and scanline rasterization rate as RV770. You don't need full FP32 arithmetic for most of the operations. You don't need access to a shader program, or to pick values from 64K of register space.
I think I have a much better handle on that than you think I do... :) Furthermore, you strangely put HiZ and scanline rasterization into the equation; yes, those are things that are also hard to parallelize, but unlike triangle setup, their computational requirements are massively different from the shader core's. It's all about a large number of very small operations, and obviously doing that in 'software' would be pure madness: there's nothing to gain there.

On the other hand, while triangle setup is indeed hard to parallelize, the operations are much higher precision (in fact, the best argument I've heard so far for not doing it in the shader core is that FP32 just isn't enough! Although I do wonder if that doesn't depend on the algorithm, given that Intel seems to be doing it just fine...) and it looks much more like a classic shader program. So yes, it's hard to do, but the reward may very well be worth the initial R&D cost and the slight die size cost.

My point is really this: it will be necessary in the future to be able to parallelize it; it's just not fast enough otherwise for a huge 500mm² 40nm chip, you can't get around that. And once you do find an acceptable way to parallelize it, simply adding more fixed-function hardware for the job would be very expensive and inefficient, because (unlike HiZ or rasterization) these aren't just small and cheap operations at all. So like it or not, I don't see how you can get around having to do this within the next 2 years...

Die cost isn't the issue here. It's just a matter of tiptoeing through the minefield of parallelizing a part of the graphics pipeline that has always been serial. There are so many corner cases that cause incorrect output if you aren't perfect.
Where did I say it was easy? However, creating a highly efficient unified shader core isn't easy either, nor is a good redundancy mechanism. That doesn't mean you should give yourself the luxury of avoiding them until you're really, really forced to; otherwise you'll just go the way of 3dfx.

I would be very surprised if neither IHV considered it very seriously for the DX11 generation anyway, given this: http://www.gamedev.net/columns/events/gdc2006/article.asp?id=233

I could definitely still be wrong here, and if I ever change my opinion I'll gladly admit so, but I certainly don't think the arguments are anywhere near as clear-cut as you make them out to be; of course, I'll also gladly admit that they're not anywhere near as clear-cut as *I* originally made them out to be!
 
Getting full programmability for free would be absurd, yes. However, point sampling and/or fetch4 is so far from optimal it's pretty funny. Okay, let's look at this another way: what do you need to do texture filtering in the shader core?
- The texel colors and the weights for each bilinear operation, for every channel.
- The number of bilinear operations to perform and the n-1 weights between them.

For an INT8 texture, the colors & weights are likely all being transmitted in 16-bit format (and converted by a 'free' converter somewhere). Note that I am assuming this (rather than FP32 for everything) in order to be pessimistic for my own estimates. Therefore, it should be possible with proper packing (ugh, I know) to transmit the colors in 2 cycles and the weights in 1 cycle. So 3 cycles total for bilinear, 6 cycles for trilinear. This isn't fundamentally different from how the same paths are reused for a 4xINT8 texture or a 2xFP16 texture IMO...
What exactly are you vying for? Improved point sampling ability (i.e. 4 neighboring point samples in the time of one) or removing the current filtering logic also?

If it's the former, fine, I can agree with you there, as it's mostly just a bigger bus from the TU to the shader. If it's the latter then I totally disagree. That is low precision math (one operand is ~8 bits, allowing huge savings) inserted into a very specific point in the data pipeline, and costs a tiny fraction of the general ALU power needed to replace it.

I think I have a much better handle on that than you think I do... :) Furthermore, you strangely put HiZ and scanline rasterization into the equation; yes, those are things that are also hard to parallelize, but unlike triangle setup, their computational requirements are massively different from the shader core's. It's all about a large number of very small operations, and obviously doing that in 'software' would be pure madness: there's nothing to gain there.
Well then that's where the confusion is, as I call that stuff "setup". I want end-to-end doubling in polygon throughput per clock.

I suppose doubling what you call setup would be useful in that culled/clipped polygons get removed faster, and that should net the IHVs a real-world polygon throughput increase of near 50%. Maybe that's a good first step.

Where did I say it was easy? However, creating a highly efficient unified shader core isn't easy either, nor is a good redundancy mechanism. That doesn't mean you should give yourself the luxury of avoiding them until you're really, really forced to; otherwise you'll just go the way of 3dfx.
Well, with unified shaders you gained a lot of functionality in vertex programs, especially in terms of usable texturing. For setup, though, there's nothing to enhance, and compared to pixel shading the load upon it has increased far more slowly. Not slowly enough to warrant the same setup speed for 6 years, but definitely not fast enough to warrant scaling setup speed with shader count, particularly when scanline rasterization etc. are very hard to make that fast.

This is one place where sticking with fixed function hardware makes a lot of sense. Just like texture filtering. :p
 
Pardon the interruption, WRT FF texturing - isn't it widely accepted by now that FF tex hw is going to go the way of the dodo, and probably within the next several product cycles (perhaps DX11/compute shader timeframe)?
 
As far as I know, that's the one used for the GT200 press material; no idea what's up with 1.2B vs. 1.4B elsewhere, though.
Having had access to the press materials, I never saw this shot outside of the financial presentation.

Counting the "trenches" between each block of supposed ALUs, there are 16. Each relating to eight ALUs gives us 128, which IMO strongly indicates that we're seeing a G92/G80 die here.
 
It definitely looks like an 8-cluster chip. So assuming the 1.2 billion transistors isn't a cock-up, maybe GT200 on 55nm is 8 clusters (192 shaders), with higher clocks to compensate?
 
What exactly are you vying for? Improved point sampling ability (i.e. 4 neighboring point samples in the time of one) or removing the current filtering logic also?

If it's the former, fine, I can agree with you there, as it's mostly just a bigger bus from the TU to the shader. If it's the latter then I totally disagree. That is low precision math (one operand is ~8 bits, allowing huge savings) inserted into a very specific point in the data pipeline, and costs a tiny fraction of the general ALU power needed to replace it.
I'm vying for a slightly faster bus from the TU to the shader and the capability to transmit not just the 4 neighboring point samples in it but also the weights which had to be calculated anyway. Yes, it would take time/bandwidth to transmit that, so I'm not expecting it to go full-speed, at least not initially.

The goal isn't to replace the texture filtering hardware completely, but to complement it (improving software flexibility) and be able to eventually streamline it. If every texture was DXT1, the latter wouldn't make any sense, but that's obviously not the case and the TMU hardware does need to support a variety of formats quite fast indeed. Down the road, I'd rather see the TMU focus on the most common formats and let the shader core handle the rest; however, whether that makes sense is a complex performance discussion that is hard to have objectively without more raw data.

My expectation is that it's likely undesirable unless the batch size is substantially smaller for the pixel shader, which obviously has problems of its own. What might be interesting is allowing the pixel shader to do pure bilinear or trilinear filtering during 'idle cycles' where the texture filtering unit is busy doing advanced anisotropic filtering; but I would actually be very surprised if that was worth the effort or scheduling overhead.

Well then that's where the confusion is, as I call that stuff "setup". I want end-to-end doubling in polygon throughput per clock.

I suppose doubling what you call setup would be useful in that culled/clipped polygons get removed faster, and that should net the IHVs a real-world polygon throughput increase of near 50%. Maybe that's a good first step.
Oh, I want end-to-end doubling too; my point was just that it wouldn't make any real sense to do HiZ/rasterization/compression in SW just to be able to increase the peak throughput. I'd rather 'just' have that fixed-function hardware be capable of, I don't know, 4 triangles/clock on an ultra-high-end chip. Yes, it's hard to make it work, but as I said, you just can't get away from that complexity in the long run.

However, I do agree that there are much worse first steps than what you just proposed there!

Well, with unified shaders you gained a lot of functionality in vertex programs, especially in terms of usable texturing. For setup, though, there's nothing to enhance, and compared to pixel shading the load upon it has increased far more slowly. Not slowly enough to warrant the same setup speed for 6 years, but definitely not fast enough to warrant scaling setup speed with shader count, particularly when scanline rasterization etc. are very hard to make that fast.

This is one place where sticking with fixed function hardware makes a lot of sense. Just like texture filtering. :p
The point isn't that triangle setup should be able to go as fast as the shader core would let it; the point is that triangle setup should be able to go faster than it does today, and once you've got the necessary complex infrastructure to parallelize it, it seems rather absurd to just implement four times as many fixed-function triangle setup units. It would seem much more sensible to just offload it by then...

As for texture filtering, I'm not sure. My theory here is that you're just picking on everything Larrabee does in software in order to later claim I said Larrabee would be a disaster! :D You seem to forget you can employ a trick similar to MMX: full-speed INT32, double-speed INT16, quadruple-speed INT8. Yes, it does require you to have 32x32 multipliers instead of 27x27 multipliers, but it seems like RV770 is handling that horrible inefficiency just fine! *grins* I am more worried about the data flow, but if Larrabee's engineers can't even get that critical part of the design right, then they really are hopeless.

Realistically though, one problem with this theory is that you can't filter an INT8 texture with INT8 hardware; that's just not legal, you need a lot more precision than that per the DX10/DX11 spec. I still think double-speed INT16 would be worth the trouble though...
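The lane-splitting trick being referred to can be illustrated in a few lines; this SWAR-style sketch fakes two independent 16-bit adds inside one 32-bit word by masking off the cross-lane carry (real hardware would simply break the carry chain instead of doing two masked adds):

```python
# SWAR sketch of the MMX-style multi-rate trick: one 32-bit datapath
# performing two independent 16-bit adds, with no carry leaking from
# the low lane into the high lane.

MASK_LO = 0x0000FFFF
MASK_HI = 0xFFFF0000

def add2x16(a, b):
    """Two packed unsigned 16-bit adds; each lane wraps independently."""
    low = (a & MASK_LO) + (b & MASK_LO)    # low lane, carry discarded below
    high = (a & MASK_HI) + (b & MASK_HI)   # high lane, computed separately
    return (high & MASK_HI) | (low & MASK_LO)

# 0xFFFF + 1 wraps the low lane to 0x0000 without touching the high lane.
packed = add2x16(0x0001FFFF, 0x00000001)
assert packed == 0x00010000
```

The same idea extends to four 8-bit lanes, which is where the "quadruple-speed INT8" figure comes from.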
 
I'm vying for a slightly faster bus from the TU to the shader and the capability to transmit not just the 4 neighboring point samples in it but also the weights which had to be calculated anyway. Yes, it would take time/bandwidth to transmit that, so I'm not expecting it to go full-speed, at least not initially.
Well how useful are the weights, really, when you want to do custom filtering yourself? There's bilinear PCF, but we already have a fast path there.

The only use of the weights is when you need to do a non-linear operation on each texel first and then bilinearly filter the results. Not only is this rare, but doing that operation on all the texels probably takes enough time that current GPU point-sampling speed is good enough not to be a significant limitation, and getting the weights in the same cycle isn't needed either.

I say just keep the point sampling rates the same and add a weight instruction (that doesn't have to be executed in the same clock). What do you know, that's what we've had for some time, though DX doesn't expose the weight instruction.
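For reference, the weights in question are just the fractional parts of the scaled texture coordinates, i.e. values the filtering unit computes anyway; a sketch (assuming texel-centered addressing, function name mine):

```python
import math

# Bilinear weights from normalized (u, v) coordinates, assuming the
# usual texel-centered convention (sample points at texel centers).

def bilinear_weights(u, v, width, height):
    """Returns (wx, wy): the fractional blend weights a 'weight
    instruction' would expose for a lookup at (u, v)."""
    x = u * width - 0.5
    y = v * height - 0.5
    return x - math.floor(x), y - math.floor(y)

wx, wy = bilinear_weights(0.5, 0.5, 4, 4)
print(wx, wy)  # 0.5 0.5 -> dead center between four texels
```

Which is why exposing them is cheap: it's a read port on state that already exists, not new arithmetic.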

The goal isn't to replace the texture filtering hardware completely, but to complement it (improving software flexibility) and be able to eventually streamline it. If every texture was DXT1, the latter wouldn't make any sense, but that's obviously not the case and the TMU hardware does need to support a variety of formats quite fast indeed.
I don't think that's really an issue. There's DXT and for other formats it's just a matter of swizzling and dealing with both FP and INT in the filtering unit. As you commented yourself, that's not very hard.

Oh, I want end-to-end doubling too, my point just was that it wouldn't make any real sense to do HiZ/Rasterization/Compression in SW just to be able to increase the peak throughput.
I wasn't implying that. I was just confused by your suggestion of shader-based setup, because I considered this part of setup too.

The point isn't that triangle setup should be able to go as fast as the shader core would let it; the point is that triangle setup should be able to go faster than it does today, and once you've got the necessary complex infrastructure to parallelize it, it seems rather absurd to just implement four times as many fixed-function triangle setup units.
Not to me. The amount of fixed-function hardware saved is too minimal, IMO. All you save is the arithmetic logic in the setup block, and I'm thinking that's well under 1% of today's dies. Quadruple it and it still isn't relevant.

You seem to forget you can employ a trick similar to MMX: full-speed INT32, double-speed INT16, quadruple-speed INT8. Yes, it does require you to have 32x32 multipliers instead of 27x27 multipliers, but it seems like RV770 is handling that horrible inefficiency just fine!
I'm not forgetting at all; on the contrary, I also noted that the filter weights are only 9 bits or so. If you were only doing fixed-point filtering like the 360 does, all you need are four 8x9 multipliers for the same characteristics you mentioned. Similar simplifications can be made to include FP, except you need a few more adders/shifters.

This is exactly why we should keep all the fixed function filtering, because it's much cheaper than the ALU logic.
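To illustrate the kind of narrow fixed-point datapath being described, here's a toy bilinear over 8-bit texels with 9-bit weights in [0, 256] (the rounding choice is mine); each lerp is one pair of 8x9-bit multiplies, nothing like a general FP32 MAD:

```python
# Fixed-point bilinear: 8-bit texels, 9-bit weights (256 == 1.0).

def lerp_8x9(a, b, w):
    """a, b: 8-bit texel values; w: 9-bit weight in [0, 256].
    Two 8x9-bit multiplies, an add, and a round-to-nearest shift."""
    return (a * (256 - w) + b * w + 128) >> 8

def bilinear_8x9(t00, t10, t01, t11, wx, wy):
    top = lerp_8x9(t00, t10, wx)   # top row
    bot = lerp_8x9(t01, t11, wx)   # bottom row
    return lerp_8x9(top, bot, wy)  # blend the rows

print(bilinear_8x9(0, 255, 0, 255, 128, 128))  # 128
```

The operands never exceed 8 and 9 bits respectively, which is the whole argument: dedicated filtering logic at this width is far smaller than the general-purpose ALUs that would replace it.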
 