NVIDIA GT200 Rumours & Speculation Thread

How closely are TA and TF interconnected in G80-style texture units? If it's at all possible, maybe they've moved the TA into the shader-clock domain and gone back to a 1:2 ratio. That wouldn't hurt bilinear performance much, if at all. Plus, you'd get data into the ALUs pretty fast in CUDA.

Again, I have no idea what I'm talking about. ;)
 
I've said this again and again: my understanding is that the TA and TF units are actually one and the same, and the exact same ALUs are used for both tasks. Right now they've got enough units for full-speed trilinear but not full-speed bilinear (which needs just as many units for TF but more for TA), which explains their lower bilinear rates; in FP32 mode, other units are used for TF (units that are shut down most of the time to save power), so there are enough ALUs left for TA.
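
To make that ratio argument concrete, here's a toy C sketch (purely my own illustration, not how the hardware is actually wired): one trilinear result costs one set of texture coordinates but two bilinear filter operations, so a unit pool sized for full-rate trilinear is automatically short on addressing once you ask it for pure bilinear at twice the output rate.

```c
/* Toy single-channel sketch, NOT NVIDIA's implementation: the point is only
 * that one trilinear sample = ONE address calculation but TWO bilinear
 * filter ops, which is why hardware balanced for full-rate trilinear ends
 * up TA-limited (with filter ALUs idling) under pure bilinear. */
#include <math.h>

typedef struct { const float *texels; int w, h; } mip;   /* one mip level */

static float lerp(float a, float b, float t) { return a + t * (b - a); }

/* one "TF op": a bilinear-filtered lookup into a single mip level */
static float bilinear(const mip *m, float u, float v)
{
    float x = u * m->w - 0.5f, y = v * m->h - 0.5f;
    int x0 = (int)floorf(x), y0 = (int)floorf(y);
    float fx = x - x0, fy = y - y0;
    int x1 = x0 + 1, y1 = y0 + 1;
    /* clamp-to-edge addressing, kept dumb on purpose */
    if (x0 < 0) x0 = 0; if (y0 < 0) y0 = 0;
    if (x1 >= m->w) x1 = m->w - 1; if (y1 >= m->h) y1 = m->h - 1;
    float t00 = m->texels[y0 * m->w + x0], t10 = m->texels[y0 * m->w + x1];
    float t01 = m->texels[y1 * m->w + x0], t11 = m->texels[y1 * m->w + x1];
    return lerp(lerp(t00, t10, fx), lerp(t01, t11, fx), fy);
}

/* one "TA op" worth of coordinates feeds TWO bilinear TF ops plus a lerp */
float trilinear(const mip *m0, const mip *m1, float u, float v, float lod_frac)
{
    return lerp(bilinear(m0, u, v), bilinear(m1, u, v), lod_frac);
}
```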

Sooner or later, FP32 TF will move to the shader core to get rid of that mostly idling silicon. Also, it may be desirable to increase the number of units in the shared TA-TF unit so that bilinear is full rate but some are idling under trilinear; after all, this is still more efficient than a traditional design with discrete TA and TF units, and is desirable both in the ultra-high-end with very high resolutions (->more bilinear) and in the low-end where there is no AF and texture settings are lower. In the mid-range, it might be less desirable, but whether you want to bother tuning it from chip to chip is very debatable.

EDIT: For all we know, maybe the majority of the shared TA/TF units are already double-pumped (but not in the shader clock domain; i.e. it'd be 2x600MHz on an 8800GT, not 1500MHz). This would obviously save die space. Actually, maybe they didn't have the time to do that on G9x, which would help explain the transistor count increases... and also why they can scale G98/MCP78/etc. down to a 4-wide TMU, while G86 was stuck with an 8-wide TMU even for the 8300GS. However, this is obviously VERY speculative.
 
A cut-down GT200? Well, it can't have half the units, since its dual setup would be slower than GT200. It could have 3/4 of the units plus a 384-bit bus, maybe? But then it would perhaps make more sense to use faulty GT200s and disable 1/4 of the units and two memory channels. Still, I think the power requirements of such a hypothetical card would make its existence impossible.

I was thinking of something with 3/4th the units and 4 ROP partitions.

It's much easier to make a dual-chip card if the memory bus is only 256 bits wide, which isn't the case for anything that could potentially be faster than GT200 while still being based on the same, GDDR5-unfriendly architecture.

Even if GT200 doesn't support GDDR5, it theoretically wouldn't take any significant resources to lay out a future chip for GDDR5. For such a theoretical part as described above, even 1.8GHz GDDR5 on a 4*64-bit MC sounds sufficient to me.
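
Back-of-the-envelope only, and how you read that "1.8GHz" figure changes the answer, so this little C snippet just prints both readings; either way the result comfortably beats the ~58GB/s an 8800 GT gets from 900MHz GDDR3 on the same 256-bit width.

```c
/* Rough bandwidth math for a hypothetical 4 x 64-bit GDDR5 setup -- purely
 * illustrative, and the per-pin rate depends on how "1.8GHz" is meant. */
#include <stdio.h>

static double gbytes_per_s(int bus_width_bits, double gtransfers_per_pin)
{
    return bus_width_bits / 8.0 * gtransfers_per_pin;   /* GB/s */
}

int main(void)
{
    int bus = 4 * 64;   /* four 64-bit memory controllers = 256-bit total */

    /* reading 1: "1.8GHz" is already the doubled, GDDR3-style figure,
       i.e. 3.6GT/s effective per pin */
    printf("3.6GT/s per pin: %.1f GB/s\n", gbytes_per_s(bus, 3.6));

    /* reading 2: "1.8GHz" is the base clock and GDDR5 moves 4 bits per
       clock per pin, i.e. 7.2GT/s effective */
    printf("7.2GT/s per pin: %.1f GB/s\n", gbytes_per_s(bus, 7.2));
    return 0;
}
```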

Still, with G92, nVidia ended up with an awkward sandwich that is expensive to make, and its cooling system is noisy and difficult to replace.

Well all I'm doing here is using the same absurd reasoning that made the 9800GX2 possible.

I think the next-generation high-end won't be here until at least a year from now. Basically the same as with G80.

Personally, I have nothing to object to either. With the G80, I felt that my investment was just as worthwhile as it was back with the 9700PRO.
 
Is that a return to G80 style TMUs? 1 TA and 2 TFs?
Uhuh.
I was thinking of something with 3/4th the units and 4 ROP partitions.
But even G92 is limited by memory bandwidth. On the GX2, it's especially noticeable in high resolutions (1920x1200 and above) with AA. In 2560x1600 with AA enabled, GX2 is even slower than a single G80 GTX/Ultra.
Even if GT200 doesn't support GDDR5, it theoretically wouldn't take any significant resources to lay out a future chip for GDDR5. For such a theoretical part as described above, even 1.8GHz GDDR5 on a 4*64-bit MC sounds sufficient to me.
In that case, it would be technically possible. (Although I still believe that GT200 will be superseded by a completely new architecture, not a GX2.)
 
Well it seems we're back to G80 style "ditch stuff over the side, the die's too big" mode...
I thought that theory was debunked?
And Lukfi, uhhh, I'm skeptical that information is correct; as I said, 1TA/2TF makes basically no sense to me. Oh well, I'm not under NDA yet, will hopefully know soon enough.
EDIT: On the other hand, 2TA/3TF would make a little bit of sense. And actually that'd make the iGT209 codename more logical. Hmmm. Still incredibly unlikely.
 
But even G92 is limited by memory bandwidth. On the GX2, it's especially noticeable in high resolutions (1920x1200 and above) with AA. In 2560x1600 with AA enabled, GX2 is even slower than a single G80 GTX/Ultra.

Not just memory bandwidth, if you don't mind. Have you sat down and analyzed what the memory consumption in case X at 2560*1600 w/ AA actually is? If it's anywhere above 512MB (which isn't all that unlikely if it's a recent game), the 256MB of extra RAM on the 8800GTX/Ultra will most certainly benefit the latter too.
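
For what it's worth, here's the kind of naive back-of-the-envelope I mean (my own toy numbers: RGBA8 colour and D24S8 depth assumed at 4 bytes per sample, no compression, no extra render targets, no textures or driver overhead counted), showing how quickly 2560*1600 with 4xAA eats into 512MB before a single texture is loaded:

```c
/* Naive render-target footprint estimate for 2560x1600 with 4xAA.
 * Assumes RGBA8 colour and D24S8 depth (4 bytes per sample each) and
 * ignores compression, extra render targets, textures and driver overhead. */
#include <stdio.h>

int main(void)
{
    const double MB = 1024.0 * 1024.0;
    const int w = 2560, h = 1600, aa_samples = 4, bytes_per_sample = 4;

    double msaa_color = (double)w * h * aa_samples * bytes_per_sample / MB;
    double msaa_depth = (double)w * h * aa_samples * bytes_per_sample / MB;
    double resolved   = (double)w * h * 2 * bytes_per_sample / MB; /* front+back */

    printf("MSAA colour buffer : %6.1f MB\n", msaa_color);
    printf("MSAA depth/stencil : %6.1f MB\n", msaa_depth);
    printf("resolved buffers   : %6.1f MB\n", resolved);
    printf("total              : %6.1f MB before any texture is counted\n",
           msaa_color + msaa_depth + resolved);
    return 0;
}
```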

I wouldn't be in the least surprised if GT200, besides the 1GB framebuffer, comes with some compression tweaks for AA at ultra-high resolutions to widen its performance lead over G8x/G9x even further. And, of course, the predicted much higher memory bandwidth.
 
Did the additional TAs provide any advantage over G86/G84?

It seems that they have a significant transistor-count cost (754M for only 8 clusters/4 ROP partitions), and in bilinear multitexturing tests every G8x/G9x with TA=TF was far from its theoretical maximum, so I believe these transistors could be invested in something better.
 
I've said this again and again: my understanding is that the TA and TF units are actually one and the same, and the exact same ALUs are used for both tasks. Right now they've got enough units for full-speed trilinear but not full-speed bilinear (which needs just as many units for TF but more for TA), which explains their lower bilinear rates; in FP32 mode, other units are used for TF (units that are shut down most of the time to save power), so there are enough ALUs left for TA.

Sooner or later, FP32 TF will move to the shader core to get rid of that mostly idling silicon. Also, it may be desirable to increase the number of units in the shared TA-TF unit so that bilinear is full rate but some are idling under trilinear; after all, this is still more efficient than a traditional design with discrete TA and TF units, and is desirable both in the ultra-high-end with very high resolutions (->more bilinear) and in the low-end where there is no AF and texture settings are lower. In the mid-range, it might be less desirable, but whether you want to bother tuning it from chip to chip is very debatable.

EDIT: For all we know, maybe the majority of the shared TA/TF units are already double-pumped (but not in the shader clock domain; i.e. it'd be 2x600MHz on an 8800GT, not 1500MHz). This would obviously save die space. Actually, maybe they didn't have the time to do that on G9x, which would help explain the transistor count increases... and also why they can scale G98/MCP78/etc. down to a 4-wide TMU, while G86 was stuck with an 8-wide TMU even for the 8300GS. However, this is obviously VERY speculative.
Very interesting read, thanks Arun!
But I was under the impression that, especially for TF, you'd need some highly specialized circuitry to run it (bilinear) single-cycle once the necessary data has been fetched. So I cannot see both units sharing a majority of transistors - unless someone ;) might take the time to explain further.
 
Bilinear filtering can be implemented in just 4 DP4s if you get the data from the TAs in the right form. It's nothing magical. Similarly, TA is also full of ADDs and MULs; it does require some other operations that are not sharable with TF, but those are presumably noticeably cheaper.
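
A minimal sketch of what I mean (my own toy code, with the texels and fractions assumed to arrive from the TA stage in exactly the right layout): one RGBA bilinear sample really is four 4-wide dot products, one per channel.

```c
/* Toy "bilinear = 4 DP4s" illustration. Assumes the TA stage already
 * delivered the four neighbouring texels and the sub-texel fractions. */
typedef struct { float x, y, z, w; } vec4;

static float dp4(vec4 a, vec4 b)   /* one DP4 */
{
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

/* t00..t11: the four fetched texels; fx, fy: fractional texel coordinates */
vec4 bilinear_rgba(vec4 t00, vec4 t10, vec4 t01, vec4 t11,
                   float fx, float fy)
{
    vec4 wts = { (1 - fx) * (1 - fy), fx * (1 - fy),
                 (1 - fx) * fy,       fx * fy };

    vec4 out = {
        dp4(wts, (vec4){ t00.x, t10.x, t01.x, t11.x }),   /* R: DP4 #1 */
        dp4(wts, (vec4){ t00.y, t10.y, t01.y, t11.y }),   /* G: DP4 #2 */
        dp4(wts, (vec4){ t00.z, t10.z, t01.z, t11.z }),   /* B: DP4 #3 */
        dp4(wts, (vec4){ t00.w, t10.w, t01.w, t11.w })    /* A: DP4 #4 */
    };
    return out;
}
```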

Jawed: Because the die size is smaller, yes; but there are multiple reasons why you'd want to do the I/O separately on a very large die AFAICT... The process variant you need with I/O is also different from the one you need without it; I wouldn't be surprised if that affected cost somehow, but I'm not completely sure.

As for filtering being a fixed-function block, those are 2004 patents. Duh, of course that was the case in that timeframe - there was no evidence whatsoever of the contrary. What I'm saying is that as of G84/G86, there's now a substantial amount of sharing between TA and TF.
 
Before G80 was launched, I heard some rumours about game devs emulating DX10 on R580 shaders. I suppose it's possible since the ALUs are highly versatile, but the performance sucks, so it can't be used for real. Current nVidia chips can also do almost anything through CUDA, but not everything might be really usable. Besides, DX10.1 isn't about new technologies but about speeding up the current ones, so emulation wouldn't make any sense. nVidia says they don't care about DX10.1 because the game devs have enough problems already, but in my opinion that's just a smoke-and-mirrors maneuver covering for G80 being built primarily for DX9 and not for all those new DX10.1 features. The R600 was clearly designed with these in mind, though the architectural flexibility cost ATi more transistors, forcing them to use the 80nm process... and you know the rest.

Two years? No, I don't think so. GT200 is a slightly or heavily modified G80, but its principles won't last another two years. Three years on the market is long enough for even the best architectures to grow obsolete. R300 was launched in August 2002, R520 came in September 2005 (with a three-month delay or so), and it was just about time the old architecture was replaced. So, just as Megadrive1988 says, Q4'09 could be the right time to release a new, DX11-based product.

That is, forgive me, total bullshit. Every manufacturing process, even from the same company, has its specifics, and chips must be designed from scratch in order to be able to use it. In the past, AMD transitioned from a traditional bulk process to SOI with no trouble. In the past, nVidia fabbed some of its chips (notably the famous NV30) at IBM's fabs before settling on TSMC for good.

Something tells me you don't mean knowing as in Arun's (was it Arun?) definition. Four GPUs on a card, if we're talking about RV670 or RV770, is - sorry - also total bullshit. The purpose of CrossFireX is to allow X2 cards to work together, or with a single card. Just as you can't put two chips of the G80/R600/GT200 calibre on one card, you can't put four RV670/RV770s on a card.
By the way, Quad CrossFire scaling sucks so badly ATi would be only shooting itself in the foot by marketing it as a usable graphics solution.

First of all, GT200 is the replacement for G80, and nVidia expects to use it for two years. That is normal for a new architecture. No one seriously expects DX11 before 2010, after Vista 7 launches. Link me otherwise, please.

Yes, 4 GPUs on a card - http://www.visiontek.com/products/cards/retail/3870x4.html

[sorry =O, however .. Derek Wilson posted this:]

http://forums.anandtech.com/messageview.aspx?catid=31&threadid=2172588&enterthread=y&STARTPAGE=1
... heh ...

I'm serious when I say I'll be testing something like that very soon (a week or two).

no I'm not kidding, even though it is april 1st ...
So, two 3870 X2s on a single PCB? I dunno... cut down a lot.

And we know for sure there are 3 GPUs - an X3 on a single PCB:

http://www.bit-tech.net/news/2008/03/28/asus_shows_off_its_hd_3850_x3_trinity/1
Asus dropped down to bit-tech offices yesterday afternoon to show off its new HD 3850 X3 graphics card - yes, that's right THREE 3850 GPUs on a single PCB. How does it achieve this? Using MXM modules and some clever use of heatpipes and watercooling - the cores all face towards the board and the memory on the back is heatsinked.
Of course it's an ES, but you know AMD is thinking about it.

As for SMIC... yes, it is certainly possible. Read my latest post from today and you will get more brand-new links than you bargained for:

http://forum.beyond3d.com/showthread.php?p=1163638#post1163638
 
Jawed: Because the die size is smaller, yes; but there are multiple reasons why you'd want to do the I/O separately on a very large die AFAICT... The process variant you need with I/O is also different from the one you need without it; I wouldn't be surprised if that affected cost somehow, but I'm not completely sure.
If G92 was supposed to be a summer 2007 GPU then surely it would have benefitted from NVIO too? If GT200 was following ~4 months later then time to market for such logic on 65nm shouldn't have been an issue if it wasn't an issue for G92. So, why pair GT200 with NVIO but not G92?

Now, if you're suggesting that GT200 and G92 are on different processes (assuming they are both 65nm) then that's territory I don't know. If GT200 is NVidia's first 55nm GPU?...

Still, we come back to GT200 being rumoured as at least as big or considerably larger than G80. Size seems like the primary driver to me.

As for filtering being a fixed-function block, those are 2004 patents. Duh, of course that was the case in that timeframe - there was no evidence whatsoever of the contrary.
That document relates specifically to G8x architecture, not to anything prior to it, as far as I can tell.

What I'm saying is that as of G84/G86, there's now a substantial amount of sharing between TA and TF.
If there's some documentation that relates to this then it would be nice to see. I certainly won't deny the possibility - after all the whole lot can be done programmably - and idling sub-units are bad for overall utilisation.

Jawed
 
First, NVIO: as I said, part of the point is to save a process step, I think. I don't know all the details, but I'm sure someone with more of an engineering background could explain it better than I could, and also point out some of the other possible reasons... (including reducing the average number of required respins because of analogue integration)

---

Ohh, the first patent, right, I thought it was another one sorry. In fact, that is clearly NOT used in G8x - it is a generic patent, possibly aimed at NV's DX11 arch or possibly not. As the summary clearly says, it is a way to use the shader pipeline to do texture addressing & texture filtering work to reduce bottlenecks; it doesn't replace the TA/TF pipeline, but complements it; i.e. you've got the benefits of both fixed-function units and being able to allocate all of your resources to it (like Larrabee would, just without as much fixed-function stuff).

Because it's a generic patent, they likely didn't bother making the drawings or anything else really look like how the real-world implementation would work. Their patent isn't any less valuable because of that and so forth - so why would they?

---

There isn't any document that claims that, but it's an incredibly simple and efficient way to explain the bilinear rates for INT8 and FP32 respectively on G84/G86/G9x: the latter is actually higher than the former when used in scalar form, while the peak bilinear rates for INT8/INT16/FP10/FP16/etc. are impossible to achieve.
 