Nvidia GT300 core: Speculation

The other important question is, what the fuck is "pull model interpolation" in D3D11?

Previously the interpolator provided the interpolated values to the shader. In SM 5.0 the shader can ask for interpolated values by itself. There are some functions for it: EvaluateAttributeAtCentroid(), EvaluateAttributeAtSample() and EvaluateAttributeSnapped().
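To make that concrete, here is a minimal SM 5.0 pixel shader sketch (names and offsets are purely illustrative) where the shader pulls the value itself instead of taking whatever the fixed-function interpolator delivers:

struct PSInput
{
    float4 pos : SV_Position;
    float2 uv  : TEXCOORD0;   // per-vertex attribute, evaluated on demand below
};

float4 main(PSInput input) : SV_Target
{
    // evaluate the attribute at the pixel centroid (stays inside the covered area)
    float2 uvCentroid = EvaluateAttributeAtCentroid(input.uv);

    // evaluate it at a specific MSAA sample position (sample 0 here)
    float2 uvSample = EvaluateAttributeAtSample(input.uv, 0);

    // evaluate it at a fixed offset from the pixel centre, in 1/16-pixel units
    float2 uvSnapped = EvaluateAttributeSnapped(input.uv, int2(4, -4));

    return float4(uvCentroid + uvSample + uvSnapped, 0.0f, 1.0f);
}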

I'm saying that, comparatively speaking, that 5% number for transcendentals wouldn't hold for RCP.

5% would be a pretty reasonable estimate IMO. Might even be lower for optimized shaders. RSQ is probably used more, might be in the 10% range due to normalize() being used fairly frequently.
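For what it's worth, the reason normalize() drags RSQ in: on most hardware it boils down to something like the following (illustrative HLSL, not any vendor's actual implementation), i.e. one RSQ plus a dot product and a multiply.

float3 MyNormalize(float3 v)
{
    // reciprocal square root of the squared length, then scale
    return v * rsqrt(dot(v, v));
}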
 
R800, by the sound of it, has beefed-up buffers as a step in this direction. Additionally, the ability of the TUs to read render targets sounds like there's a data path from the RBE cache to the L2 (which serves the TUs), in order to provide monster pixel-data bandwidth into the ALUs. (That's a guess.)
Any such data path puts RV870 one step closer to fully closing the write/read loop in the manner CPU caches do.
It would probably still be less flexible and have higher latency, but at least there's an on-chip path.

L2 in Larrabee with 32 cores at 1.5GHz provides about 3TB/s of bandwidth. We're looking at 1TB/s L1/LDS (guessing LDS bandwidth) in RV870 and 435GB/s L2->L1. GT200's shared memory bandwidth is about 1.4TB/s; it would be reasonable to expect a ~doubling in GF100.
The LDS bandwidth makes sense, assuming the 64-byte data path from RV770 carries over unchanged.
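Rough back-of-the-envelope for those figures, with the per-clock widths being my assumptions rather than anything confirmed:
RV870 L1/LDS: 20 SIMDs x 64 bytes/clock x 850MHz ≈ 1.09TB/s
RV870 L2->L1: ~512 bytes/clock aggregate x 850MHz ≈ 435GB/s
GT200 shared memory: 30 SMs x 32 bytes/hot clock x 1476MHz ≈ 1.4TB/s
Larrabee L2: 32 cores x 64 bytes/clock x 1.5GHz ≈ 3.1TB/s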

As a side note, I'm curious about the additional non-texture L1 that was added alongside the regular texture cache, as mentioned in the Anandtech article. What it brings to the table at that size, compared to the larger texture cache and LDS, I'm not sure. It would help with thrashing if graphics and compute shaders hit the same SIMD, I suppose.
In a GPGPU situation, what would it offer over using the larger L1?
 
With both NVIDIA and ATI now interpolating in the shader cores, divisions are used even more often :)
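For anyone wondering where the extra divisions come from: perspective-correct interpolation of an attribute a from the screen-space barycentrics b0..b2 is the standard

a(x,y) = (b0*a0/w0 + b1*a1/w1 + b2*a2/w2) / (b0/w0 + b1/w1 + b2/w2)

and once the shader core does this itself, that denominator costs at least one reciprocal per pixel (the per-vertex 1/w terms can be precomputed at setup, the per-pixel one can't).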
 
Why is GF100 late? What are you basing that judgement on? I reported back in January that Nvidia's next-generation chip would come in Q4/2009. So where do you see a delay?
You also 'reported' that GT300 taped out in Jan-Mar (massive lol) and that it was 512b; I'm waiting for you to backtrack on that bit and conveniently jump on the 384b bandwagon thanks to the hint from Rys. FYI, you also 'reported' that Cypress was 1200 SPs and whatnot... :rolleyes:

Anyone care to reason out why the change from 512b to 384b (assuming it was 512b to start with)? Does that improve yields? And if it does, is that significant? It also brings an awkward proposition: irregular memory configurations (Arun?). A 384b GF100 would mean 1.5GB (50% more memory than Cypress's 1GB), which would be an additional cost, no?
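The memory maths, assuming the usual 1Gbit (128MB) GDDR5 parts of the day: a 384b bus is 12 x 32-bit channels, so 12 chips x 128MB = 1536MB; a 512b bus is 16 chips = 2048MB; Cypress's 256b bus is 8 chips = 1024MB. Hence the 1.5GB figure and the ~50% extra memory cost.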

FYI, if I'm reading CJ correctly G300/Fermi/GF100 recently returned from a spin.
 
5% would be a pretty reasonable estimate IMO. Might even be lower for optimized shaders. RSQ is probably used more, might be in the 10% range due to normalize() being used fairly frequently.

Thanks for the data re RSQ in shaders. Interesting. I figured 5% was about right for divisions in my own code, but then the percentage for transcendentals in my own code would be roughly 0% :) I also figured the number of adds vs. multiplies would put muls at around 10%, but then a lot of adds are address related and loop related. I'd be hard-pressed to take an educated guess at the ratio absent those two items, but my stab in the dark would say that there are more adds per mul than there are muls per div. :shrug:

At any rate, from my limited experience, if I wanted to make my ALUs more generic, DIV would have to be close to the top of my list...

-Dave
 
Well this is a speculation thread, and since I'm new here I hope you guys don't mind me posting my speculation. :p This is what I am guessing:

GF100 (Saw-zall GTX)
40nm, DX11, CUDA 3
~590mm^2, ~3.2 billion transistors
24.5 x 24.5mm die
1536MB 0.4ns Samsung/Hynix 5Gbps GDDR5
~233GB/s bandwidth
700MHz core / 1750MHz shader / 1250MHz memory
512 MIMD / 128 TMU / 64 ROP
195W TDP
Launch Nov 25th, major retail Christmas/Jan.
$549 - $599

Seems like you are going with 384-bit, but your ROPs don't match it. It should be either 24 or 48 ROPs; 48 ROPs sounds a bit more likely.
Not sure how you are getting 233GB/s: 384-bit with 5Gbps GDDR5 is 240GB/s. The clocks will probably be slightly lower though, to save on power consumption, unless they are heavily bottlenecked by bandwidth.
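For reference, the arithmetic: bandwidth = bus width x data rate / 8, so 384 bit x 5.0Gbps / 8 = 240GB/s; landing at ~233GB/s would need roughly 4.85Gbps effective, i.e. about a 1213MHz base clock.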
 
Let me try. ^^

40nm DX11
~450mm^2
~2.8 billion transistors
2048MB GDDR5
320GB/s bandwidth
512-bit memory interface
700MHz core / 1600MHz shader / 1250MHz memory
512 MIMD / 160 TMU / 32 ROPs at much higher frequency
170W TDP
Launch Nov, availability December
$450 for top model
 
Why is GF100 late? What are you basing that judgement on? I reported back in January that Nvidia's next-generation chip would come in Q4/2009. So where do you see a delay?

You cannot say that GF100 is delayed just because AMD got its first DirectX 11 chips out a few weeks earlier.
But I won't deny that Nvidia's chip ran into some problems over the summer and could otherwise already be on the market.

As some others already pointed out, you don't have the best track record for reliable reports either ;)
 
Well this is a speculation thread, and since I'm new here I hope you guys don't mind me posting my speculation. :p This is what I am guessing:

GF100 (Saw-zall GTX)
40nm, DX11, CUDA 3
~590mm^2, ~3.2 billion transistors
24.5 x 24.5mm die
1536MB 0.4ns Samsung/Hynix 5Gbps GDDR5
~233GB/s bandwidth
700MHz core / 1750MHz shader / 1250MHz memory
512 MIMD / 128 TMU / 64 ROP
195W TDP
Launch Nov 25th, major retail Christmas/Jan.
$549 - $599

or 27th ;)
 
HD5890 numbers are huge, but it doesn't seem to be any better than its predecessor when it comes down to using all that raw power; it's actually worse, given that it's more bandwidth constrained.
(attached benchmark chart)


You sure? ---> FiringSquad
 
Hardly conclusive tests. Increasing the memory clock could simply generate more data transfer errors. You can't tweak a couple of knobs and expect a complex architecture to show linear behaviour.
 
Hardly conclusive tests. Increasing the memory clock could simply generate more data transfer errors. You can't tweak a couple of knobs and expect a complex architecture to show linear behaviour.
Except that increasing the engine clock by 9% alone was enough to gain 5%. Increasing the memory clock by 9% as well couldn't get you more than an additional 4%, so the engine clock has more impact than the memory clock.

-FUDie
 
Increasing the memory clock by 9% as well couldn't get you more than an additional 4%
You clearly haven't read what I wrote about data transfer errors. We are dealing with GDDR5: it won't fail outright, it will just scale badly or even hurt performance, since its error detection means bad transfers simply get retried and eat into the extra bandwidth. Moreover, no app is entirely ALU limited or bandwidth limited; bottlenecks are dynamic and constantly change while rendering a single frame.
BTW, I'm not just talking about the memory modules 'failing'; the GDDR5 interface can fail as well.
 