Beyond3D's GT200 GPU and Architecture Analysis

Nice article, as usual.

Couple of things:
We truly wonder when IHVs will get a clue and move triangle setup to the shader core (to improve performance) and make texture filtering/ROP blending programmable (even if it hurts performance when running custom code).
Moving these things to the shader core requires gobs of operand bandwidth. For custom filtering, you can already do point samples and whatever you want from there.

One thing you should realize is that pure math logic isn't very expensive at all. It's the routing and temporary storage of data that uses most of the transistors. Filtering alone needs only a fraction of the logic of a shader core, and triangle setup needs to be done ahead of rasterization. I'm not too sure why triangle setup hasn't been improved beyond once per clock, but I think there may be difficulties in parallelizing it while preserving the order of the triangles and their quads throughout the pipeline. I don't see anything that can't be overcome, though.
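To illustrate the "point samples and whatever you want from there" route, here's a minimal CUDA sketch of doing bilinear filtering by hand in the shader core. All the names and the single-channel texel layout are just assumptions for illustration, not anyone's actual filtering path; the point is that the blend itself is a handful of MADs, and it's the four operand fetches per filtered sample that are the expensive part.

```cuda
// Sketch only: custom bilinear filtering built from point samples.
// Hypothetical example code, not any vendor's actual filtering hardware/path.
#include <cuda_runtime.h>

// Fetch a single unfiltered texel with clamp-to-edge addressing.
__device__ float fetch_texel(const float* tex, int w, int h, int x, int y)
{
    x = min(max(x, 0), w - 1);
    y = min(max(y, 0), h - 1);
    return tex[y * w + x];
}

// Bilinear filter from four point samples plus a handful of MADs.
__device__ float bilinear(const float* tex, int w, int h, float u, float v)
{
    float x = u * w - 0.5f, y = v * h - 0.5f;
    int   x0 = __float2int_rd(x), y0 = __float2int_rd(y);
    float fx = x - x0, fy = y - y0;

    float t00 = fetch_texel(tex, w, h, x0,     y0);
    float t10 = fetch_texel(tex, w, h, x0 + 1, y0);
    float t01 = fetch_texel(tex, w, h, x0,     y0 + 1);
    float t11 = fetch_texel(tex, w, h, x0 + 1, y0 + 1);

    float top    = t00 + fx * (t10 - t00);   // lerp along x, top row
    float bottom = t01 + fx * (t11 - t01);   // lerp along x, bottom row
    return top + fy * (bottom - top);        // lerp along y
}

// One thread per output pixel; resamples a w*h texture to out_w*out_h.
__global__ void filter_kernel(const float* tex, int w, int h,
                              float* out, int out_w, int out_h)
{
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= out_w || py >= out_h) return;

    float u = (px + 0.5f) / out_w;
    float v = (py + 0.5f) / out_h;
    out[py * out_w + px] = bilinear(tex, w, h, u, v);
}
```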

Also, on the last page, don't you mean 1/10th of a terazixel instead of petazixel? I have a tough time believing a 1500 fold increase over G80. ;)
 
No, no, it's 1/8th speed DP. Is there any place where that's not clear? What runs at full speed is denormals (still trying to get a bit more detail about potential catches there, if any). As I said, our access to the chip was... very limited though, so we haven't had the opportunity to test any of that, at least not yet. As for that analogy, it's not like Rys and Scott never talked... ;)

Is it 1/8th or 1/12th? The DP unit can't issue the MUL along with the MAD?
 
The reason why the FP64 unit cannot be used at the same time is that there is no scheduling hardware dedicated to it (you can't issue one extra instruction per cycle for it), it shares the register file (so part of the FP32 ALUs' register file banks/ports needs to be reserved for it while it's in use), and so forth. As we say in the article, it was claimed that it could be used along with FP32 up to a certain extent, so we speculate that it can feed from the same ports as the SFU/MUL, which would then have to idle in the meantime.

As for memory support, NVIDIA has been claiming they could use GDDR4 for the last three centuries or so, but it never happened. And it's never going to happen either. There are both political and technical reasons for that; in the G8x/G9x case, the architectures were fully optimized around GDDR3's burst length and could not have supported GDDR4 at equally high efficiency. In GT200's case, I'm not completely sure about that part given that GT2xx will use GDDR5 eventually, but I would be very surprised if either the MCs or the PHYs supported GDDR4.

As for Direct3D 10.1... Well, not sure what we can or cannot say, so uhhh we'll let you know eventually if we can!

INKster said:
So, it seems Anand has let the cat out of the bag and introduced the missing SKUs for the Tesla family
Heh, the NDA has been lifted, I just haven't had the chance to finish my article yet... :)
Mintmaster said:
Also, on the last page, don't you mean 1/10th of a terazixel instead of petazixel? I have a tough time believing a 1500 fold increase over G80.
Bah, how much of this thread needs to be dedicated to Rys' jokes? ;)
Mintmaster said:
Is it 1/8th or 1/12th? The DP unit can't issue the MUL along with the MAD?
The unit itself is 1/8th the width of the main ALUs, so counting the MUL, the theoretical peak FLOP rating is 1/12th that of SP, correct. Similarly, in AMD's case with RV770/Firestream 9150, it seems to be 1/5th of the SP rate (for a much smaller die size), although presumably still without denormal or rounding errors. It should probably be fine for most apps anyway.
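To make the 1/8-vs-1/12 distinction concrete, here's the arithmetic, counting a MAD as 2 flops and the co-issued MUL as 1, and (purely for illustration) assuming the commonly quoted GTX 280 figures of 240 FP32 ALUs, one FP64 unit per SM (30 in total) and a 1296MHz shader clock:

$$
\frac{\text{DP peak}}{\text{SP peak}} = \frac{\tfrac{1}{8}\times 2}{2+1} = \frac{1}{12},
\qquad
240 \times 3 \times 1.296\,\text{GHz} \approx 933\ \text{GFLOPS (SP)},\quad
30 \times 2 \times 1.296\,\text{GHz} \approx 78\ \text{GFLOPS (DP)}.
$$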
 
Also, on the last page, don't you mean 1/10th of a terazixel instead of petazixel? I have a tough time believing a 1500 fold increase over G80. ;)
Heh, yeah :p

I see Arun threw in his rant about shader core triangle setup too, without really asking :devilish:
 
It took Nvidia too much space to add double precision hardware. It's a shame it's not usable for gaming. Are ATI's double precision units usable for gaming when not used for GPGPU?
I would say that Larrabee has a lot to do with the design of this new chip, and that it has pushed Nvidia to stop just looking in the rear-view mirror at ATI.
 
I wonder how much work and/or how many extra transistors it would take to expose the FP64 unit alongside the others. Would it be reasonable to assume a future derivative might do this?

*Edit* Perhaps even have it work in tandem with the others, replacing some more of the FP32 units.
 
In the GT200's case, I'm not completely sure about that part given that GT2xx will use GDDR5 eventually, but I would be very supported if either the MCs or the PHYs supported GDDR4.

I'm positive the GT200 will never support using you as its main memory. :LOL::LOL:
 
Swwwweeet, thanks for the article guys. I love it when the first thing I can read about new hardware hotness is the low-level tech details :) Then I can happily move on to the benches and see how it pans out in reality.

So awesome job as always guys and keep up the great work!

And lol @ "petazixel" :D
 
Does the PureVideo VP2 unit in GTX 200 GPUs now support full VC-1 and MPEG-2 decoding on the GPU like the GeForce 8200 mGPU does, or still only partial on-GPU decoding a la G92?
 
So does anyone think NVidia is going to be a bit worried this time around? The GT200 seems to really hurt their performance per mm2, whereas all indications are that ATI really improved it with RV770. Even their HPC line may be in trouble with Firestream's 200 GFlops DP performance.

Half of the problem with GT200's cost effectiveness is the lower shader clock, and half is the extras that bloated the transistor count. I'd expect both to go away with value models. We may see a situation like the GeForce 6xxx series where NV43 provided much more perf/$ than the high-end parts.

Nonetheless, after RV670 achieved near parity with NVidia (again, in terms of the cost effectiveness of an architecture), ATI looks like they'll be even better this time around.
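For a rough sense of the gap being described, here's the back-of-the-envelope using the figures being bandied about (roughly 576 mm² for GT200 versus ~256 mm² for RV770, and theoretical peaks of ~933 GFLOPS for GTX 280 versus ~1.2 TFLOPS for RV770). Peak flops is obviously a crude proxy for game performance, so treat this as illustrative only:

$$
\frac{933\ \text{GFLOPS}}{576\ \text{mm}^2} \approx 1.6\ \text{GFLOPS/mm}^2
\qquad\text{vs.}\qquad
\frac{1200\ \text{GFLOPS}}{256\ \text{mm}^2} \approx 4.7\ \text{GFLOPS/mm}^2 .
$$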
 
Does the PureVideo VP2 unit in GTX 200 GPUs now support full VC-1 and MPEG-2 decoding on the GPU like the GeForce 8200 mGPU does, or still only partial on-GPU decoding a la G92?

I think GT200's VP2 does H.264/MPEG-2 only, just like G84/G86/G94/G92, etc.
And, if we think about it, it does make sense not to spend more transistors on it in a product such as the GTX 2xx.

No one in their right mind would pair a top-end card like this with a Celeron or even a Pentium Dual-Core, and as such (seeing as VC-1 is not as compute-intensive as H.264), it's only logical to provide VP3 capabilities on platforms already potentially limited by their low-end CPUs, such as the ones using IGPs (MCP78/7A) and low-end GPUs (G98, etc.) from Nvidia.


edit
Scratch all that above. Apparently it does have full VC-1 decode capabilities, therefore matching G98's VP3 and AMD's UVD engines.
 
I loved the article, guys - it's why I came here in the first place - technical information in an informal tone.

Although I had to google Kim Kardas.. Kar.. whatever her name was. :D
 
remember that full-speed FP16 also means that 32-bit floating point pixels made up of three FP10 channels for colour and 2 bits for alpha also go faster for free
Shouldn't that be 32-bit integer pixels made up of three int10 channels and 2 bits for alpha, or does an FP10 float format really exist?

Also, I've got to wonder... 8800GTX/G80 didn't really seem to suffer from a lack of blend rate, and the bandwidth per ROP didn't increase either, so does increasing the blend rate per ROP really help there?
 
So, it seems Anand has let the cat out of the bag and introduced the missing SKUs for the Tesla family:

http://www.anandtech.com/video/showdoc.aspx?i=3334&p=21

1 TFLOP at last (and a healthy 4GB per GPU).
I'd really want to see a pic of the card (without the cooler...). Does someone really sell 2Gbit GDDR3 RAM chips, or are they using two 1Gbit chips in parallel, requiring an even more complex PCB (and hence the quite a bit lower SDRAM clock)?
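For what it's worth, the back-of-the-envelope on the 4GB configuration, assuming the Tesla board keeps GT200's 512-bit interface and standard 32-bit-wide GDDR3 devices:

$$
\frac{512\ \text{bits}}{32\ \text{bits/device}} = 16\ \text{devices}
\;\Rightarrow\;
\frac{4\ \text{GB}}{16} = 256\ \text{MB} = 2\ \text{Gbit per device},
$$

or, failing 2Gbit parts, 32 × 1Gbit devices at two per channel, which is exactly the more-complex-PCB-and-lower-clock scenario above.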
 
Shouldn't that be 32-bit integer pixels made up of three int10 channels and 2 bits for alpha, or does an FP10 float format really exist?

Also, I've got to wonder... 8800GTX/G80 didn't really seem to suffer from a lack of blend rate, and the bandwidth per ROP didn't increase either, so does increasing the blend rate per ROP really help there?
It's an FP format, s6e3 if I remember rightly (could be wrong there though, I'll check). And it looks like Tridam couldn't find full-speed FP10 or FP16...
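For anyone wondering what such a format would even mean, here's a quick generic unpack routine for a small float, parameterised on the exponent/mantissa widths, since the exact layout (s6e3, something else, or nothing at all in hardware) is precisely what's unconfirmed above. A sketch for playing with the idea, not a description of what G80/GT200 actually store.

```cuda
// Sketch: unpack a "small float" (e.g. a hypothetical 10-bit format) to FP32.
// The layout parameters are assumptions for illustration, not a hardware spec.
#include <cuda_runtime.h>
#include <math.h>
#include <stdio.h>

__host__ __device__ float unpack_small_float(unsigned bits, int exp_bits,
                                              int man_bits, int has_sign)
{
    unsigned man_mask = (1u << man_bits) - 1u;
    unsigned exp_mask = (1u << exp_bits) - 1u;

    unsigned man  = bits & man_mask;
    unsigned exp  = (bits >> man_bits) & exp_mask;
    unsigned sign = has_sign ? ((bits >> (man_bits + exp_bits)) & 1u) : 0u;

    int   bias = (1 << (exp_bits - 1)) - 1;        // IEEE-style exponent bias
    float m    = (float)man / (float)(1u << man_bits);

    // exp == 0 is the denormal range; a real format might also reserve the
    // top exponent for Inf/NaN, but small render-target formats often don't.
    float v = (exp == 0) ? ldexpf(m, 1 - bias)
                         : ldexpf(1.0f + m, (int)exp - bias);
    return sign ? -v : v;
}

int main(void)
{
    // Decode a few raw 10-bit values under the "s6e3" reading mentioned above
    // (1 sign, 3 exponent, 6 mantissa bits) -- an assumption, not a spec.
    for (unsigned bits = 0; bits < 1024; bits += 200)
        printf("0x%03x -> %g\n", bits, unpack_small_float(bits, 3, 6, 1));
    return 0;
}
```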
 