According to a moderator on the well-known PCinlife forum, who also has die shots and other information in his hands, GTX 280 will offer two times 8800 Ultra performance.
And 3 times its floating-point performance? So maybe the missing MUL is still missing...
We chatted with a lot of knowledgeable people in the industry and we've learned some quite interesting truths about the GT200 chip. Our developer friends called it a brute-force chip, without much in the way of brains.
Putting 240 Shader units in a chip that is basically reminiscent of the G80 and G92 design will naturally make things faster. More Shader units at faster clocks will always make your card faster, especially at higher resolutions.
The 65nm G92 has 128 Shaders while GT200 has 240, or almost twice as many. The die size of GT200 is much bigger than G92's, and that is how you get the fastest chip around.
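The "brute force" argument above is just multiplication: peak shader throughput scales with SP count times shader clock. A minimal sketch, with the caveat that the clocks used here are placeholder assumptions (GT200's shader clock was not public at the time); only the SP counts come from the article:

```python
# Rough sketch of the shader-count scaling argument. Clocks are assumed
# for illustration only; the SP counts (128 vs 240) are from the article.
def mad_gflops(num_sps, shader_clock_ghz, flops_per_sp_per_clock=2):
    """Peak MAD throughput: each SP retires one multiply-add (2 flops) per clock."""
    return num_sps * shader_clock_ghz * flops_per_sp_per_clock

g92_gflops = mad_gflops(128, 1.625)    # assumed G92-class shader clock
gt200_gflops = mad_gflops(240, 1.625)  # same clock assumed; only SP count grows

print(f"SP ratio: {240 / 128:.3f}x")   # 1.875x, i.e. "almost twice as many"
print(f"Throughput ratio: {gt200_gflops / g92_gflops:.3f}x")
```

At equal clocks the throughput ratio is exactly the SP ratio, 1.875x, which is why "more units, bigger die" is enough to take the performance crown.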
Our developer friends added that Nvidia's last real innovation was G80, and that G92 is simply a die shrink of the same idea. You can look at GT200 as a G92 with 240 Shaders.
This results in GT200 running quite hot, but it will compensate with sheer power, so let's just hope that Nvidia's yields on such a huge chip (rumoured to be bigger than 550mm²) will be acceptable.
Just like any other American company, Nvidia plans to continue ripping the Europeans off, and we believe that Europeans have actually got used to it. While Geforce GTX 260, the slower of the two GT200-based cards, will end up selling for about $450 in the USA, we heard that in European countries loyal to the Euro it should end up costing between €400 and €450.
Today €400 converts to $620 while €450 converts to $697.50, which is a huge difference from the suggested US prices.
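The article's conversions imply an exchange rate of about $1.55 per euro (mid-2008 levels). That rate is an assumption inferred from the article's own numbers, but it reproduces both quoted figures:

```python
# Checking the quoted EUR->USD conversions. The rate is an assumption
# inferred from the article's own numbers ($620 / €400 = 1.55).
USD_PER_EUR = 1.55

def eur_to_usd(eur, rate=USD_PER_EUR):
    return eur * rate

print(f"€400 -> ${eur_to_usd(400):.2f}")  # $620.00, as quoted
print(f"€450 -> ${eur_to_usd(450):.2f}")  # $697.50, as quoted
```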
Geforce GTX 260 will be clocked lower on both memory and GPU, but it should still end up as the runner-up to the fastest thing around.
Geforce GTX 280 will end up with a similar price difference, and we expect a price of around $600 in the USA and about €550 to €600 in Europe.
...You heard that right: the successor to the GT200 chip has already taped out, and it too will be nothing special. Documents seen by the INQ indicate that this one is called, wait for it, the GT200b. It is nothing more than a 55nm shrink of the GT200. Don't expect miracles, but do expect the name to change.
...The GT200 is about six months late, blew out its die size estimates and missed clock targets by a lot. ATI didn't. This means that a GT260 board will cost about 50 per cent more than an R770 for equivalent performance. The GT280 will be about 25 per cent faster but cost more than twice as much. A month or so after the 770 comes the 700, basically two 770s on a slab. This will crush the GT280 in just about every conceivable benchmark and likely cost less.
...The GT200b will be out in late summer or early fall, instantly obsoleting the GT200. Anyone buying the 65nm version will end up with a lemon, a slow, hot and expensive lemon.
What are they going to do? Emails seen by the INQ indicate they are going to play the usual PR games to take advantage of sites that don't bother checking up on the 'facts' fed to them. They plan to release the GT200 in absurdly limited quantities, and only four AIBs are going to initially get parts.
The documents talk about "Improved Dual Issue"... so make of it what you will.... Also mentioned are "2x Registers" and "3x ROP blending performance".
Thanks. Is today die-shot day?
OK, I will contribute a G80 die shot that even babies will understand.
Jawed, it's a little easier to discern the blocks with this picture, no?
One thing that I'm a bit dubious about is the way that a cluster is split into multiprocessor and TMU. There are parts of a cluster that aren't either, as far as I can tell, relating to scheduling/instruction issue.
I'm curious about the batch size (number of elements in a hardware thread) on GT200. G80 has an underlying batch size of 16, but I have the impression that it'll be 32 in GT200. I wonder if this leads to some simplification of the multiprocessors, or at least in the scoreboarding/scheduling/instruction-issuing.

And I almost forgot to say that I saw shared memory per multiprocessor gets doubled in GT200 vs G80 (32KB vs 16KB). And remember there are 30 multiprocessors vs 16 in G80.
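Taken together, those rumoured figures imply a big jump in on-chip resources. A quick tally, using only the numbers quoted in the post (30 SMs at 32KB each vs 16 SMs at 16KB; the 8-SPs-per-SM layout is the known G80 arrangement, assumed carried over):

```python
# Totting up the rumoured per-multiprocessor resources quoted in the post.
# 8 SPs per multiprocessor is the G80 arrangement, assumed to carry over.
g80 =   {"multiprocessors": 16, "shared_mem_kb": 16, "sps_per_mp": 8}
gt200 = {"multiprocessors": 30, "shared_mem_kb": 32, "sps_per_mp": 8}

def totals(chip):
    return {
        "total_sps": chip["multiprocessors"] * chip["sps_per_mp"],
        "total_shared_kb": chip["multiprocessors"] * chip["shared_mem_kb"],
    }

print(totals(g80))    # 128 SPs, 256 KB of shared memory chip-wide
print(totals(gt200))  # 240 SPs, 960 KB -- nearly 4x the shared memory
```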
I'm sure there'll be a lot of CUDA-using people jumping for joy over the register file increase :smile:
No, that's part of the SMs. There are indeed things which are only present at the cluster level (I think constant cache is one of them, but I can't remember right now) and there's definitely some basic scheduling there too. However, it's probably fair to say that a significant majority of it is related to texturing or fetches in general.
Of course the missing MUL was a serious hit to NVidia's claims for the efficiency of their ALUs. If it routinely achieves 2/3 of the "headline" GFLOPs rating, then it's better to just pretend it doesn't exist, which is why we have G80 as 346 GFLOPs, not 518 GFLOPs, etc.

Sounds like they just fixed the MUL issue by adding register space and made a marketing issue out of it. I guess it could be viewed as discovered since it wasn't really available for general use, even though they counted it towards performance.
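Where those two G80 figures come from: the 518 GFLOPs headline counts a dual-issued MUL alongside the MAD (3 flops per SP per clock), while the 346 GFLOPs figure counts the MAD alone (2 flops per SP per clock), at the 8800 GTX's 1.35 GHz shader clock:

```python
# The "headline" vs "realistic" G80 GFLOPs figures discussed above.
SPS = 128
SHADER_CLOCK_GHZ = 1.35  # 8800 GTX shader clock

mad_only     = SPS * SHADER_CLOCK_GHZ * 2  # MAD alone -> the "346 GFLOPs" figure
mad_plus_mul = SPS * SHADER_CLOCK_GHZ * 3  # MAD + MUL -> the "518 GFLOPs" figure

print(f"MAD only: {mad_only:.1f} GFLOPs, with MUL: {mad_plus_mul:.1f} GFLOPs")
print(f"Ratio: {mad_only / mad_plus_mul:.3f}")  # the 2/3 mentioned above
```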
OK. One thing I'm still unclear about is whether NVidia's architecture has dedicated point-samplers in addition to the "TMU"s, or whether in fact the modular configuration of texture fetching/filtering (e.g. addressing unit, fetching, filtering) allows them to run all fetches through the same samplers. It seems likely to be the latter.
What's the current blending rate?
Further, it can natively blend pixels in integer and floating-point formats, including FP32, at rates that diminish somewhat with the bandwidth available through the ROPs: INT8 and FP16 run at full speed (measured) and FP32 at half speed. From empirical testing, each pair of ROPs shares a single blender, so 12 blends per cycle.
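A back-of-envelope throughput from those figures, assuming G80's 24 ROPs and its 575 MHz core clock for the ROP domain (both assumptions on my part; the one-blender-per-ROP-pair figure is from the post):

```python
# Back-of-envelope blend throughput. ROP count and clock are assumptions
# (G80-class: 24 ROPs at the 575 MHz core clock); the one blender per
# ROP pair (12 blends/cycle) is the empirically measured figure above.
ROPS = 24
BLENDERS = ROPS // 2      # one blender per ROP pair -> 12 blends per cycle
ROP_CLOCK_GHZ = 0.575

int8_fp16_rate = BLENDERS * ROP_CLOCK_GHZ  # Gblends/s at full speed
fp32_rate = int8_fp16_rate / 2             # FP32 blends at half speed

print(f"INT8/FP16: {int8_fp16_rate:.2f} Gblends/s, FP32: {fp32_rate:.2f} Gblends/s")
```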
I dare to propose that the batch size is untouched from the G80, i.e. an additional SIMD array of eight SPs per cluster just increases the threading parallelism.
So now GT200 has a kind of triple instruction issue, so to speak.
I wonder how this would impact the interpolation rate compared to G80/G92.