Nvidia GT300 core: Speculation

  • the techniques for calculating all kinds of transcendentals have common hardware structures, so splitting these structures into distinct units is simply a waste

True, that wouldn't make sense. I'm trying to make sense of the four blue bits. Maybe those are all of the transcendentals :shrug:

  • the general trend should be for less acceleration of transcendentals, not more - in general computation transcendentals are much less commonly used (about 5% if I remember right) than the hardware provides for (~25%)
Jawed

Except RCP is in the SFU, and that seems wrong to me -- from the same usefulness perspective.

-Dave
 
Anyone have a guess at the GF100 die area? If transistor density increases by 1.75x (as it did for ATI going 55nm -> 40nm), then NV's 3.0 billion 40nm transistors would come out at ~575mm^2 and 3.2 billion at ~615mm^2.

Extremely hard to do right now and wouldn't be accurate due to having nothing other than G200 to base it on.

Would be much easier if we knew some more details on the GT2xx derivatives: specs, die size and transistor count. Even if we knew that info it would still be pretty inaccurate, since G300 is a new architecture.

Also, ATi's transistor density is much better than Nvidia's, so basing any estimates on how much ATi increased their transistor density isn't going to be very accurate.

Purely going on linear shrinks, I get about the same number as you, 3.2b, but with a smaller die size of ~580mm^2. So I would estimate somewhere around 3.1-3.3b transistors for a die size of around 560-600mm^2, which covers the most recent rumours of a G200-sized die.
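
For what it's worth, the back-of-the-envelope arithmetic behind both estimates looks something like this - the GT200b baseline (1.4b transistors in ~470mm^2 on 55nm) is my own assumption, not a confirmed figure:

```python
# Back-of-the-envelope GF100 die-size estimates. All inputs are assumptions.
GT200B_TRANSISTORS = 1.4e9   # assumed 55nm GT200b transistor count
GT200B_AREA_MM2 = 470.0      # assumed 55nm GT200b die area

density_55 = GT200B_TRANSISTORS / GT200B_AREA_MM2   # ~3.0M transistors per mm^2

# Method 1: apply the ~1.75x density gain ATI got going 55nm -> 40nm.
density_40 = density_55 * 1.75
for count in (3.0e9, 3.2e9):
    print("%.1fb transistors @ 1.75x density -> ~%.0f mm^2" % (count / 1e9, count / density_40))

# Method 2: pure linear shrink - area per transistor scales with (40/55)^2.
density_40_linear = density_55 / (40.0 / 55.0) ** 2
print("3.2b transistors @ linear shrink -> ~%.0f mm^2" % (3.2e9 / density_40_linear))
```

That reproduces the ~575/615mm^2 figures above and lands the linear-shrink case in the same ~570-580mm^2 ballpark.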
 
We can certainly hope for something like this. But RV670->RV770 was accomplished by eliminating mostly obvious mistakes in the R600 design
There's wanton excess in NVidia too - just for some reason a lot of people lose critical faculties whenever the subject is raised. I'm certainly not going to rake through my long standing arguments about the terrible inefficiencies there.

and by some magic which allowed them to pack 2.5 times more ALUs in almost the same complexity (transistors and die size).
Magic, hmm...

So while we can hope for GF100 to somewhat repeat that success, I'd say that counting on it as a "minimum" is highly unrealistic.
There's no doubt, GT21x has been a bracing failure so far, not exactly shining a positive light on NVidia's current abilities to deliver.

Jawed
 
There's wanton excess in NVidia too - just for some reason a lot of people lose critical faculties whenever the subject is raised. I'm certainly not going to rake through my long standing arguments about the terrible inefficiencies there.
What inefficiencies are there in GT200 besides the MSAA 8x performance? Even separate DP units can't really be called an obvious design mistake; it was just a choice they made back then, which wasn't necessarily bad considering the results.

Magic, hmm...
I still haven't seen any information on how they did it beyond the pointless marketing buzz.

GT21x has been a bracing failure so far, not exactly shining a positive light on NVidia's current abilities to deliver
Was it NVIDIA GT21x specifically or TSMC 40G in general though?
 
The arbitrary R/W in the LDS is important too.
You mean write anywhere? That's a feature of R700 LDS too.

That's not to say that R800 LDS doesn't work better than R700 LDS.

I'm not convinced there are any big architectural leaps left to make,
You mean on this side of GPU design, as opposed to the Larrabee-like future?

DWF seems like something which can be handled in software ...
But only for clause lengths > X, many times greater than a hardware implementation? Also, is DWF able to stand up to the strain of nested branching?

the only important leap left to make IMO is to fold the pixel cache into L2 (making it read/write, with coherency being guaranteed by relatively simple fences ... doesn't give the low latency cross core coherency of Larrabee, but I don't think that's really necessary).
This is one of my big questions about D3D11, as it seems to declare open day for out of order pixel shader memory-accesses.

R800, by the sound of it, has beefed-up buffers as a step in this direction. Additionally, the ability of the TUs to read render targets sounds like there's a data path from the RBE cache to L2 (which serves the TUs), in order to provide monster pixel-data bandwidth into the ALUs. (That's a guess.)

But I'd still like to know more about what's happening there.

After that I don't really see how it will be much more difficult to program than, say, Larrabee. If you want to use the option of using the LDS, with their comparatively huge gather bandwidths, it will be harder to program ... but it's good to have options.
L2 in Larrabee with 32 cores at 1.5GHz provides about 3TB/s of bandwidth. We're looking at 1TB/s L1/LDS (guessing LDS bandwidth) in RV870 and 435GB/s L2->L1. GT200's shared memory bandwidth is about 1.4TB/s; it would be reasonable to expect roughly a doubling in GF100.
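
For anyone wondering where numbers like those come from, they all boil down to units x bytes-per-clock x clock. The per-unit widths below are my guesses for illustration, not published figures:

```python
# Aggregate on-chip bandwidth = number of units * bytes per unit per clock * clock.
# All per-unit widths and clocks below are assumptions.

def aggregate_bw(units, bytes_per_clock, clock_ghz):
    """Return aggregate bandwidth in GB/s."""
    return units * bytes_per_clock * clock_ghz

# Larrabee L2: 32 cores, assumed 64 bytes/clock per core at 1.5GHz -> ~3 TB/s
print("Larrabee L2:  %.0f GB/s" % aggregate_bw(32, 64, 1.5))

# RV870 L1/LDS: 20 SIMDs, assumed 64 bytes/clock per SIMD at 0.85GHz -> ~1 TB/s
print("RV870 L1/LDS: %.0f GB/s" % aggregate_bw(20, 64, 0.85))

# GT200 shared memory: 30 SMs, 16 banks * 4 bytes serviced over two hot clocks
# (so ~32 bytes/clock per SM) at ~1.476GHz -> ~1.4 TB/s
print("GT200 smem:   %.0f GB/s" % aggregate_bw(30, 32, 1.476))
```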

I still think shared memory is a short-term fix that'll hobble programming these things later on.

Oh and Ct is getting closer:

http://makebettercode.com/ct_tech/survey.php

even if Intel appears to believe that it's an interim thing.

Jawed
 
Except RCP is in the SFU, and that seems wrong to me -- from the same usefulness perspective.
I don't really understand what you mean by "usefulness" - you're referring to some absolute capability? You're saying that it should be at 50%, or higher, throughput compared with MUL?

Jawed
 
What inefficiencies are there in GT200 besides the MSAA 8x performance?
The whole damn thing.

Even separate DP units can't really be called an obvious mistake of design, it was just a choice they made back then which wasn't necessarily bad considering the results.
Compared with x86 it's pretty appalling.

I still haven't seen any information on how they did it beyond the pointless marketing buzz.
Well journalists have their chances to find out more.

Was it NVIDIA GT21x specifically or TSMC 40G in general though?
GT218 is a case in point. We've seen the power/performance comparisons with RV710. RV740 doesn't need any such excuses. What's NVidia doing?

Jawed
 
Which gets decimated (literally) with relatively random gathers.
Shared memory in GT200 is only slightly less decimated by the same kinds of patterns - and that's provided you play ball within a murderously tight budget of allocated memory per strand. Of course GF100 could be way better. Gotta wait and see.
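
To put a rough number on "decimated", here's a toy model of a GT200-style shared memory (16 banks, 4-byte words - the parameters are assumed and the result is only illustrative):

```python
import random

BANKS = 16          # assumed GT200-style shared memory: 16 banks, 4-byte words
HALF_WARP = 16      # one word per bank can be serviced per cycle for a half-warp

def cycles_for(word_addresses):
    """A half-warp's gather is serialized by the most heavily hit bank."""
    hits = [0] * BANKS
    for addr in word_addresses:
        hits[addr % BANKS] += 1
    return max(hits)

# Unit-stride access: every lane lands in its own bank -> 1 cycle.
linear = cycles_for(range(HALF_WARP))

# Random gather over a 4KB (1024-word) allocation: several lanes usually
# collide in the same bank, so the access takes several cycles.
random.seed(0)
trials = [cycles_for([random.randrange(1024) for _ in range(HALF_WARP)])
          for _ in range(10000)]
print("linear gather:", linear, "cycle")
print("random gather: %.1f cycles on average (worst seen: %d)"
      % (sum(trials) / len(trials), max(trials)))
```

So even before you hit the per-strand allocation limits, a random gather pattern costs roughly 3x the cycles of a nicely strided one in this toy model.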

Jawed
 
How big is the market for that? I know about DirectCompute, but I'm wondering how much Microsoft is willing to invest in it to get software vendors to adopt it. I'd rather they had adopted OpenCL instead of pushing their proprietary and incompatible version, though.

Well,
NVIDIA Collaborates with Microsoft on High Performance GPU Computing

I believe that Nvidia is betting 'the house' that GPU computing will become as important as CPU processing. That is what their GTC is about; their future.

I am going to check it out. I was at Nvision08, and I am packing right now and heading for San Jose tonight to report on GTC for my site.

.. and no worries, I will ask Jensen about Fermi there (or someone else will) .. but my own sources have already confirmed it
 
Well this is a speculation thread, and since I'm new here I hope you guys don't mind me posting my speculation. :p This is what I am guessing:

GF100 (Saw-zall GTX)
40nm DX11 Cuda3
~590mm^2 ~3.2 billion transistors
24.5 x 24.5mm2 die
1536mb .4ns samsung/hynix 5gbps
~233gb/sec bandwidth (quick sanity check after the list)
700c / 1750s / 1250m
512 MIMD / 128 tmu / 64 rop
195watt TDP
launch nov 25th, major retail christmas/jan.
$549 - $599
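
Quick sanity check on that bandwidth figure - the 384-bit bus is my assumption, inferred from the 1536mb frame buffer rather than stated anywhere:

```python
# GDDR5 bandwidth = per-pin data rate * bus width / 8.
# The 384-bit bus is an assumption inferred from the 1536mb frame buffer.
data_rate_gbps = 5.0
bus_width_bits = 384
print("Peak bandwidth: %.0f GB/s" % (data_rate_gbps * bus_width_bits / 8))
# -> 240 GB/s at a full 5gbps; ~233gb/sec would imply a slightly lower
#    effective data rate (~4.85gbps).
```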
 
I don't really understand what you mean by "usefulness" - you're referring to some absolute capability? You're saying that it should be at 50%, or higher, throughput compared with MUL?

Jawed

I'm saying that, comparatively speaking, that 5% number for transcendentals wouldn't hold for RCP. Or, it's more obvious (to me, a non-gfx, non-sci-analysis programmer) why MUL and RCP would be 1:1 than it is for, say, ADD and MUL.

-Dave
 
But only for clause lengths > X, many times greater than a hardware implementation?
If I had done the work to prove my opinion I wouldn't have said "I think" :)
Also, is DWF able to stand up to the strain of nested branching?
That's just a question of heuristics ... maybe profile-guided branch probability hints could help? In the end nothing beats MIMD, but the assumption is that up to a point the trade-off in area remains worth it.
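
As a toy illustration of what DWF regrouping does (and why deep nesting hurts it) - the warp width, branch targets and strand counts below are all made up:

```python
from collections import defaultdict

WARP = 32  # assumed warp width

def reform_warps(strands):
    """Toy dynamic warp formation: regroup strands by the PC they branched to,
    so each reformed warp runs a single path at (hopefully) full width."""
    by_pc = defaultdict(list)
    for strand_id, pc in strands:
        by_pc[pc].append(strand_id)
    warps = []
    for pc, group in by_pc.items():
        for i in range(0, len(group), WARP):
            warps.append((pc, group[i:i + WARP]))
    return warps

# 64 strands after one 50/50 branch: two live targets, reformed warps stay full.
one_branch = [(i, 0x100 + 0x10 * (i % 2)) for i in range(64)]
# The same strands after three nested 50/50 branches: eight live targets,
# so each reformed warp only finds 8 strands to fill its 32 slots.
nested = [(i, 0x100 + 0x10 * (i % 8)) for i in range(64)]

for name, strands in (("1 branch level ", one_branch), ("3 nested levels", nested)):
    warps = reform_warps(strands)
    occupancy = sum(len(g) for _, g in warps) / float(len(warps) * WARP)
    print("%s: %d warps, %.0f%% average occupancy" % (name, len(warps), 100 * occupancy))
```

Regrouping keeps warps full while only a couple of paths are live; once nesting multiplies the live targets, there simply aren't enough strands per target to fill the reformed warps.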

I wonder if Intel has any automated tools for dynamic strand formation yet.
I still think shared memory is a short-term fix that'll hobble programming these things later on.
Coincidentally that's what I think providing snooping cache coherency does for Larrabee ... just teaches bad habits with something which is convenient but scales like shit.
 
I'm not saying local stores are necessary, I'm saying that removing the need to think carefully about data communication by just throwing lots of snooping bandwidth at it and allowing each and every cache to contain a copy of a memory location is a bit too extreme.
 
I'm not saying local stores are necessary, I'm saying that removing the need to think carefully about data communication by just throwing lots of snooping bandwidth at it and allowing each and every cache to contain a copy of a memory location is a bit too extreme.
(HW) implementation details aside, programmers who really care about performance will simply try to keep snooping traffic low. Which sounds simpler than managing 27 different (all partially incoherent) memory types.
Perhaps tomorrow someone will find a way to make it easy for software developers and simple from a hardware implementation standpoint, although I doubt it will ever happen :)
 
The whole damn thing.
That's a bit of an overstatement.

Well journalists have their chances to find out more.
So how come nobody did?

GT218 is a case in point. We've seen the power/performance comparisons with RV710. RV740 doesn't need any such excuses. What's NVidia doing?
GT218 is a 60mm^2 GPU. I don't think that you can compare it to the 140mm^2 RV740. And you surely can't compare it to a GPU made on another process.
In other words, we need more information before any conclusion on GT21x being a failure can be made. One review of GT218 isn't enough for such a conclusion.
 
(HW) implementation details aside, programmers that really care about performance will simply try to keep snooping traffic low.
The software will be written for the hardware of the day ... and the next generation of hardware will be designed partly for the software written for the hardware of yesterday.

Once they go down this road it will be hard to make a turn.
Which sounds simpler than managing 27 different-all partially incoherent-memory types.
There are many ways to guarantee coherency.
 