Beyond3D's GT200 GPU and Architecture Analysis

Discussion in 'Architecture and Products' started by Arun, Jun 16, 2008.

  1. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    http://www.anandtech.com/weblog/showpost.aspx?i=453 but I'm honestly not sure whether that's G80 or G92... (as for the 1.2B transistor count, I have some reasons to believe that's likely GT200b)
    I agree, 128-bit GDDR5 + embedded memory is a much better approach in the PS4/XBox720 generation. The only thing I'm worried is the burst size for the CPU (I'm skeptical separate DDR3 would make sense) but heh. Of course, if you used Rambus memory then you might not get pad limited fast enough to justify using embedded memory...
    Yeah, I've said it many times already but I'll repeat it yet again: being able to do triangle setup in the shader core for small triangles would be a huge deal. During the shadow pass, the shader core is just idle most of the time; this is absurd and highly suboptimal. Yes, you do need to beef up the rasterization hardware in consequence (needs to be hierarchical) but that's honestly not such a big deal AFAIK. I fully stand by my comment in the GT200 article about triangle setup & texture filtering/pixel blending, just haven't had the time to properly defend it yet - guess it'd probably be worth a new thread too! :)
Hmmm, very good point; I guess it wouldn't be impossible to do in fixed-function hardware with that level of precision. Whether it would truly be desirable and worth the silicon cost is another question, though, of course.
     
  2. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
Are you sure Rambus uses fewer pins? They use differential signalling: 16-bit data buses on their chips use the same 32 lines as a GDDR1/2/3/4/5 chip's data bus. Unless Rambus can achieve greater than twice the speed of GDDR5, I don't see where the advantage is.

    Not sure if you've read my posts on this matter, but your suggestion is simply not a good idea.

    Setup is very much about scan conversion and interpolator initialization. The math required for culling/clipping/perspective division isn't expensive at all, and it would be dwarfed by the routing and extra scheduling logic needed to move setup calculations into the shader core. This is the last part of the graphics pipeline which has not even been parallelized yet, and there could be a reason for that.
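To put the "not expensive at all" claim in perspective, the arithmetic half of setup is essentially computing edge equations and a signed area; a rough Python sketch (the function names and the integer screen-coordinate simplification are mine, just for illustration):

```python
def setup_edge(x0, y0, x1, y1):
    """Edge function E(x, y) = A*x + B*y + C for the directed edge
    (v0 -> v1); for a counter-clockwise triangle, E >= 0 holds for
    points on the interior side of the edge."""
    A = y0 - y1
    B = x1 - x0
    C = x0 * y1 - x1 * y0
    return A, B, C

def setup_triangle(v0, v1, v2):
    """Per-triangle setup: three edge equations plus twice the signed
    area (which also gives you the back-face cull test for free)."""
    e01 = setup_edge(*v0, *v1)
    e12 = setup_edge(*v1, *v2)
    e20 = setup_edge(*v2, *v0)
    # Twice the signed area = E01 evaluated at the third vertex.
    area2 = e01[0] * v2[0] + e01[1] * v2[1] + e01[2]
    return e01, e12, e20, area2
```

A handful of multiply-adds per triangle; the scan conversion and interpolator machinery that consumes these coefficients is where the real hardware goes.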

    The same is true for texture filtering. Blending is a dependency issue, and you want to make your start-to-finish pipeline as small as possible to limit the complexities there.
     
  3. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    I seem to remember reading some comparisons of bandwidth/pin where Rambus held a very substantial advantage with their next-generation technology (that I wouldn't expect to be eliminated completely via GDDR5). However, I could be remembering wrong or have miscalculated; definitely worth a recheck on my part!

    I did, and I'll have to humbly disagree. A couple of points:
- Triangle setup can already be done in the shader core in ATTILA, and the 'shader' they use for it is public in one of their papers; it's nothing extraordinary.
- Passing several position-only triangles around is very cheap, and quite a bit of logic from other subsystems can be reused for this purpose.
- It's not like the entire shader core would have to be able to handle triangle setup; even just one or two clusters (or one SM per cluster) would already be very fast.
- As for texture filtering, it's not really about making things faster as much as it is about improving programmability. It could even be half-speed for a given data size and that'd be more than good enough.
    - Same for ROP blending, but of course I agree that making it programmable is more complex there (although the difficulty shouldn't be overestimated; some NVIDIA patents handle it in a rather efficient way IMO).
    Yes, but the reason isn't fundamentally all that different from why geometry shaders cannot be infinitely parallelized. Simply extending GS logic a bit might be able to do the trick... (although I'd have to think more about it to be certain)
     
  4. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,245
    Likes Received:
    4,465
    Location:
    Finland
As far as I know, that's the one used for the GT200 press material; no idea what's up with 1.2 billion vs 1.4 elsewhere, though.
     
  5. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
They seem to mislead people a lot, with BW per differential pair being confused with BW per pin. XDR2 is only 8GHz, and again needs twice the wires for the same bus width, so the advantage is nil.
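To make the per-wire arithmetic concrete (the data rates below are ballpark illustration figures, not vendor specs):

```python
def bw_per_wire(gbps_per_lane: float, wires_per_lane: int) -> float:
    """Effective bandwidth per physical wire in Gbps: a differential
    interface spends two wires per data lane, single-ended spends one."""
    return gbps_per_lane / wires_per_lane

# XDR2-style differential: ~8 Gbps per lane, but 2 wires per lane.
xdr2 = bw_per_wire(8.0, 2)
# GDDR5-style single-ended: ~4 Gbps per lane, 1 wire per lane.
gddr5 = bw_per_wire(4.0, 1)
```

With those numbers, both come out at the same bandwidth per wire, which is the point being argued: the differential interface needs twice the signalling rate just to break even on pins.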

    ATTILA is a GPU simulator, not a hardware design. They don't have to route data around and worry about various HW limitations.
    It doesn't matter, because the data flow doesn't fit the framework of the shaders.
    You already have programmability with point sampling and/or fetch4. Going beyond that is rather pointless.

Arun, you need to get a handle on the size of arithmetic logic. Remember that R420, at 160M transistors, has the same setup rate, HiZ rejection rate, and scanline rasterization rate as RV770. You don't need full FP32 arithmetic for most of the operations. You don't need access to a shader program or the ability to pick values from 64K of register space.

The tough part is not the computational resources. It's managing the data flow. You have a list of primitives, and they index into a post-transform cache. If you want to do multiple triangles per clock, you need multiple ports for this cache. You need multiple ports into the HiZ RAM, and have to worry about updating some tiles twice per clock. Handling 16 different 4x4 tiles per clock is much easier than 32 tiles that aren't necessarily unique. You have to worry about the first triangle's Z/stencil affecting the second. There are probably a ton of other issues that I can't think of.

Die cost isn't the issue here. It's just a matter of tiptoeing through the minefield of parallelizing a part of the graphics pipeline that has always been serial. There are so many edge cases that cause incorrect output if you aren't perfect.
     
  6. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    Duh. However, ATTILA is afaik what influenced Intel to implement triangle setup in their shader core for their current SM3/SM4 IGP architecture. The only reason why I was mentioning it, anyway, is that it's the only source I've ever seen for an arithmetic shader-like 'triangle setup' program.

I am rather skeptical that it doesn't loosely fit the framework of the geometry shader. Of course, the geometry shader's peak rate is rather... unimpressive right now, but you can't get away from the fact that it will have to get progressively faster in the future. That same hardware could be partly shared for the data flow & synchronization of triangle setup (in fact, the two could possibly be merged into the same program).

    Getting full programmability for free would be absurd, yes. However, point sampling and/or fetch4 is so far from optimal it's pretty funny. Okay, let's look at this another way: what do you need to do texture filtering in the shader core?
    - The texture colors and the weights for each bilinear operation, for every colour.
    - The number of bilinear operations to perform and the n-1 weights between them.

    For an INT8 texture, the colors & weights are likely all being transmitted in 16-bit format (and converted by a 'free' converter somewhere). Note that I am assuming this (rather than FP32 for everything) in order to be pessimistic for my own estimates. Therefore, it should be possible with proper packing (ugh, I know) to transmit the colors in 2 cycles and the weights in 1 cycle. So 3 cycles total for bilinear, 6 cycles for trilinear. This isn't fundamentally different from how the same paths are reused for a 4xINT8 texture or a 2xFP16 texture IMO...
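A rough sketch of what the shader-side filtering being described might look like, assuming the texels and weights arrive as 16-bit fixed-point values (the Q1.15 weight format is my own choice for illustration):

```python
SCALE = 1 << 15  # weights transmitted as 16-bit fixed point (Q1.15)

def lerp(a, b, w):
    """Linear interpolation with a fixed-point weight w in [0, SCALE]."""
    return a + (((b - a) * w) >> 15)

def bilinear(c00, c01, c10, c11, wx, wy):
    """Bilinear: three lerps over four texels and two weights."""
    top = lerp(c00, c01, wx)
    bot = lerp(c10, c11, wx)
    return lerp(top, bot, wy)

def trilinear(mip0, mip1, wx0, wy0, wx1, wy1, wmip):
    """Trilinear: bilinear on two mip levels plus one lerp between them,
    matching the 2x transmission cost argued above."""
    return lerp(bilinear(*mip0, wx0, wy0),
                bilinear(*mip1, wx1, wy1), wmip)
```

The point of the sketch is just that once the four texels and the weights are on the shader side, the filter itself is ordinary MAD work; the transmission cycles are the real cost.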

I'm not saying no new scheduling or routing logic would be required. Duh, of course it would. However, I think you are overestimating its size, and underestimating the logic it could save by always running certain modes in the shader core. And much more importantly, I think history will prove you wrong... ;)

I think I have a much better handle on that than you think I do... :) Furthermore, you strangely put HiZ and scanline rasterization into the equation; yes, those are things that are also hard to parallelize, but unlike triangle setup their computational requirements are massively different from the shader core's. It's all about a large number of very small operations, and obviously doing that in 'software' would be pure madness: there's nothing to gain there.

On the other hand, while triangle setup is indeed hard to parallelize, the operations are much higher precision (in fact, the best argument I've heard so far for not doing it in the shader core is that FP32 just isn't enough! Although I do wonder whether that depends on the algorithm, given that Intel seems to be doing it just fine...) and it looks much more like a classic shader program. So yes, it's hard to do, but the reward may very well be worth the initial R&D cost and the slight die size cost.

My point is really this: it will be necessary in the future to be able to parallelize it; it's just not fast enough otherwise for a huge 500mm² 40nm chip, and you can't get around that. And once you've found an acceptable way to parallelize it, simply adding more fixed-function hardware for the job would be very expensive and inefficient because (unlike HiZ or rasterization) those aren't just small and cheap operations at all. So like it or not, I don't see how you can get around having to do this within the next 2 years...

Where did I say it was easy? However, creating a highly efficient unified shader core isn't easy either, nor is a good redundancy mechanism. That doesn't mean you can give yourself the luxury of avoiding them until you're really, really forced to; otherwise you'll just go the way of 3dfx.

    I would be very surprised if neither IHV considered it very seriously for the DX11 generation anyway given this: http://www.gamedev.net/columns/events/gdc2006/article.asp?id=233

I could definitely still be wrong here, and if I ever change my opinion I'll gladly admit so, but I certainly don't think the arguments are anywhere near as clear-cut as you make them out to be; of course, I'll also gladly admit that they're not anywhere near as clear-cut as *I* originally made them out to be!
     
  7. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    What exactly are you vying for? Improved point sampling ability (i.e. 4 neighboring point samples in the time of one) or removing the current filtering logic also?

    If it's the former, fine, I can agree with you there, as it's mostly just a bigger bus from the TU to the shader. If it's the latter then I totally disagree. That is low precision math (one operand is ~8 bits, allowing huge savings) inserted into a very specific point in the data pipeline, and costs a tiny fraction of the general ALU power needed to replace it.

    Well then that's where the confusion is, as I call that stuff "setup". I want end-to-end doubling in polygon throughput per clock.

    I suppose doubling what you call setup would be useful in that culled/clipped polygons get removed faster, and that should net the IHVs a real-world polygon throughput increase of near 50%. Maybe that's a good first step.

Well, with unified shaders you gained a lot of functionality in vertex programs, especially in terms of usable texturing. For setup, though, there's nothing to enhance, and compared to pixel shading the load upon it has increased far more slowly. Not slowly enough to warrant the same setup speed for 6 years, but definitely not fast enough to warrant scaling setup speed with shader count, particularly when scanline rasterization etc. are very hard to make that fast.

    This is one place where sticking with fixed function hardware makes a lot of sense. Just like texture filtering. :razz:
     
  8. ShaidarHaran

    ShaidarHaran hardware monkey
    Veteran

    Joined:
    Mar 31, 2007
    Messages:
    4,027
    Likes Received:
    90
    Pardon the interruption, WRT FF texturing - isn't it widely accepted by now that FF tex hw is going to go the way of the dodo, and probably within the next several product cycles (perhaps DX11/compute shader timeframe)?
     
  9. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Having had access to the press materials, I never saw this shot outside of the financial presentation.

Counting the "trenches" between the blocks of supposed ALUs, there are 16. Relating each to eight ALUs gives us 128, which IMO strongly indicates that we're seeing a G92/G80 die here.
     
  10. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    955
    Likes Received:
    52
    Location:
    LA, California
    It definitely looks like an 8 cluster chip. So assuming the 1.2 billion transistors isn't a cock-up, maybe GT200 on 55nm is 8 clusters (192 shaders), with higher clocks to compensate?
     
  11. igg

    igg
    Newcomer

    Joined:
    May 16, 2008
    Messages:
    63
    Likes Received:
    0
    Is there any information on the GT200b available yet?
     
  12. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    I'm vying for a slightly faster bus from the TU to the shader and the capability to transmit not just the 4 neighboring point samples in it but also the weights which had to be calculated anyway. Yes, it would take time/bandwidth to transmit that, so I'm not expecting it to go full-speed, at least not initially.

    The goal isn't to replace the texture filtering hardware completely, but to complement it (improving software flexibility) and be able to eventually streamline it. If every texture was DXT1, the latter wouldn't make any sense, but that's obviously not the case and the TMU hardware does need to support a variety of formats quite fast indeed. Down the road, I'd rather see the TMU focus on the most common formats and let the shader core handle the rest; however, whether that makes sense is a complex performance discussion that is hard to have objectively without more raw data.

    My expectation is that it's likely undesirable unless the batch size is substantially smaller for the pixel shader, which obviously has problems of its own. What might be interesting is allowing the pixel shader to do pure bilinear or trilinear filtering during 'idle cycles' where the texture filtering unit is busy doing advanced anisotropic filtering; but I would actually be very surprised if that was worth the effort or scheduling overhead.

Oh, I want end-to-end doubling too; my point was just that it wouldn't make any real sense to do HiZ/rasterization/compression in SW just to be able to increase the peak throughput. I'd rather 'just' have that fixed-function hardware be capable of, I don't know, 4 triangles/clock on an ultra-high-end chip. Yes, it's hard to make it work, but as I said, you just can't get away from that complexity in the long run.

    However, I do agree that there are much worse first steps than what you just proposed there!

    The point isn't that triangle setup should be able to go as fast as the shader core would let it; the point is that triangle setup should be able to go faster than it does today, and once you've got the necessary complex infrastructure to parallelize it, it seems rather absurd to just implement four times as many fixed-function triangle setup units. It would seem much more sensible to just offload it by then...

As for texture filtering, I'm not sure. My theory here is that you're just picking on everything Larrabee does in software in order to later claim I said Larrabee would be a disaster! :D You seem to forget you can employ a similar trick to MMX: full-speed INT32, double-speed INT16, quadruple-speed INT8. Yes, it does require you to have 32x32 multipliers instead of 27x27 multipliers, but it seems like RV770 is handling that horrible inefficiency just fine! *grins* I am more worried about the data flow, but if Larrabee's engineers can't even get that critical part of the design right, then they really are hopeless.

Realistically though, one problem with this theory is that you can't filter an INT8 texture with INT8 hardware; that's just not legal, as you need a lot more precision than that per the DX10/DX11 spec. I still think double-speed INT16 would be worth the trouble though...
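The "wider multiplier doing multiple narrower multiplies" idea can be illustrated with a toy SWAR example (this version shares one multiplicand between the two lanes, which is a simplification on my part; real hardware partitions the multiplier array more generally):

```python
MASK32 = (1 << 32) - 1

def dual_mul16(a, b, c):
    """Compute two independent 16x16 products, a*c and b*c, with a
    single 64-bit multiply by placing a and b in disjoint fields.
    This works because a 16x16 product never exceeds 32 bits, so the
    low product can't carry into the high field."""
    packed = (a << 32) | b
    prod = packed * c
    return (prod >> 32) & MASK32, prod & MASK32
```

The same field-packing argument is what lets one wide multiplier array run at double rate on INT16 or quadruple rate on INT8 operands.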
     
  13. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Well how useful are the weights, really, when you want to do custom filtering yourself? There's bilinear PCF, but we already have a fast path there.

    The only use of the weights is when you need to do a non-linear operation on each texel first and then bilinearly filter it. Not only is this rare, but doing that operation on all the texels probably takes enough time that current GPU speed for point sampling is good enough not to be a significant limitation, and getting weights in the same cycle isn't needed either.

    I say just keep the point sampling rates the same and add a weight instruction (that doesn't have to be executed in the same clock). What do you know, that's what we've had for some time, though DX doesn't expose the weight instruction.

    I don't think that's really an issue. There's DXT and for other formats it's just a matter of swizzling and dealing with both FP and INT in the filtering unit. As you commented yourself, that's not very hard.

    I wasn't implying that. I was just confused with your suggestion of shader-based setup because I considered this part of setup too.

Not to me. The amount of fixed-function hardware saved is too minimal, IMO. All you save is the arithmetic logic in the setup block, and I'm thinking that's well under 1% of today's dies. Quadruple it and it still isn't relevant.

    I'm not forgetting at all; on the contrary, I also noted that the filter weights are only 9 bits or so. If you were only doing fixed point filtering like 360 does, all you need are four 8x9 multipliers for the same characteristics you mentioned. Similar simplifications can be made to include FP, except you need a few more adders/shifters.
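The four-multiplier fixed-point arrangement described above could be sketched like this (8-bit texels, 9-bit weights summing to 256; the +128 round-to-nearest constant is my own choice for illustration):

```python
def bilerp_8x9(texels, weights):
    """Fixed-function-style bilinear: four 8-bit texels and four 9-bit
    weights (0..256) that sum to 256. Each texel*weight product needs
    only an 8x9 multiplier; the >> 8 renormalizes the result back to
    8 bits, with +128 rounding to nearest."""
    assert len(texels) == len(weights) == 4
    assert sum(weights) == 256
    acc = sum(t * w for t, w in zip(texels, weights))
    return (acc + 128) >> 8
```

Compared against full FP32 MACs, four 8x9 multipliers plus an adder tree really is a tiny amount of logic, which is the cost argument being made.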

    This is exactly why we should keep all the fixed function filtering, because it's much cheaper than the ALU logic.
     
  14. ShaidarHaran

    ShaidarHaran hardware monkey
    Veteran

    Joined:
    Mar 31, 2007
    Messages:
    4,027
    Likes Received:
    90
    raises hand

    my turn?
     