Nvidia GT300 core: Speculation

Humus · Sep 29, 2009

Jawed said:
The other important question is, what the fuck is "pull model interpolation" in D3D11?

Previously the interpolator provided the interpolated values to the shader. In SM 5.0 the shader can ask for interpolated values by itself. There are some functions for it: EvaluateAttributeAtCentroid(), EvaluateAttributeAtSample() and EvaluateAttributeSnapped().
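To make the "pull" idea concrete, here is a minimal sketch in Python rather than HLSL. It assumes the attribute is a simple screen-space plane equation and ignores perspective correction; `AttributePlane` and `evaluate_at_sample` are made-up names for illustration, not D3D11 API.

```python
from dataclasses import dataclass


@dataclass
class AttributePlane:
    """Hypothetical plane equation for one attribute: a(x, y) = a0 + dadx*x + dady*y."""
    a0: float
    dadx: float
    dady: float

    def evaluate(self, x: float, y: float) -> float:
        return self.a0 + self.dadx * x + self.dady * y


def evaluate_at_sample(plane, pixel_x, pixel_y, sample_offset):
    """Pull model: the shader chooses the position and asks for the value there.

    sample_offset is relative to the pixel center (assumed convention).
    """
    ox, oy = sample_offset
    return plane.evaluate(pixel_x + 0.5 + ox, pixel_y + 0.5 + oy)


plane = AttributePlane(a0=0.0, dadx=0.25, dady=0.5)
center = evaluate_at_sample(plane, 4, 2, (0.0, 0.0))     # value at the pixel center
shifted = evaluate_at_sample(plane, 4, 2, (0.25, -0.25))  # value at an offset sample
```

In the old push model the shader would only ever see one pre-computed value per attribute; here the same shader invocation can request several positions.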

dnavas said:
I'm saying that, comparatively speaking, that 5% number for transcendentals wouldn't hold for RCP.

5% would be a pretty reasonable estimate IMO. Might even be lower for optimized shaders. RSQ is probably used more, might be in the 10% range due to normalize() being used fairly frequently.
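For reference, this is why normalize() drags RSQ along with it. A rough Python sketch of how a normalize typically expands (the helper names are mine, and real compilers may schedule it differently):

```python
import math


def rsq(x: float) -> float:
    """Stand-in for the hardware reciprocal square root instruction."""
    return 1.0 / math.sqrt(x)


def normalize(v):
    """normalize(v) = v * rsq(dot(v, v)): one RSQ per call, the rest is MUL/MAD."""
    length_squared = sum(c * c for c in v)  # dot(v, v)
    inv_length = rsq(length_squared)        # the single transcendental
    return tuple(c * inv_length for c in v)


n = normalize((3.0, 0.0, 4.0))
```

So every normalize() in a shader costs exactly one RSQ, no matter how the surrounding multiplies get folded.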

3dilettante · Sep 29, 2009

Jawed said:
R800, by the sound of it, has beefed-up buffers as a step in this direction. Additionally the ability of TUs to read render targets sounds like there's a connection of data from RBE cache to L2 (which is for TU), in order to provide a monster pixel data bandwidth into the ALUs. (That's a guess).

Any such data path puts RV870 one step closer to fully closing the write/read loop in the manner CPU caches do.
It would probably still be less flexible and have higher latency, but at least there's an on-chip path.

L2 in Larrabee with 32 cores at 1.5GHz provides about 3TB/s of bandwidth. We're looking at 1TB/s L1/LDS (guessing LDS bandwidth) in RV870 and 435GB/s L2->L1. GT200's shared memory bandwidth is about 1.4TB/s, it would be reasonable to expect ~doubling in GF100.

The LDS bandwidth makes sense, assuming the 64-byte data path in RV770 remains without further elaboration.

As a side note, I'm curious about the additional non-texture L1 that was added alongside the regular texture cache, as mentioned in the Anandtech article. What this brings to the table at that size compared to the larger texture and LDS, I'm not sure. It would help with problems with thrashing, if graphics and compute shaders hit the same SIMD, I suppose.
In a GPGPU situation, what would it offer over using the larger L1?

nAo · Sep 29, 2009

Now that both NVIDIA and ATI interpolate in the shader cores, divisions are used even more often.

MfA · Sep 30, 2009

nAo said:
Sure. Any favourite model of yours?

Directory based with no replication at all of writeable data.

nAo · Sep 30, 2009

MfA said:
Directory based with no replication at all of writeable data.

Well, in the short term NVIDIA and/or Intel have a chance to make you happy, although I get the feeling you are quite skeptical.

trinibwoy · Sep 30, 2009

Jawed said:
GT200's shared memory bandwidth is about 1.4TB/s, it would be reasonable to expect ~doubling in GF100.

How do you figure that?

I get 30 * 16 banks * 4 bytes * 600MHz =~ 1.15TB/s
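Spelled out as a quick Python check, using only the numbers above (30 multiprocessors, 16 banks, 4 bytes per bank per clock, 600MHz). The function name is made up, and the second line just back-solves what clock the 1.4TB/s figure would imply under the same assumptions:

```python
def shared_mem_bandwidth_tbs(num_sms, banks, bytes_per_bank, clock_mhz):
    """Aggregate shared memory bandwidth in TB/s (decimal units)."""
    bytes_per_clock = num_sms * banks * bytes_per_bank
    return bytes_per_clock * clock_mhz * 1e6 / 1e12


gt200_tbs = shared_mem_bandwidth_tbs(30, 16, 4, 600)

# Clock (MHz) that would be needed to reach 1.4TB/s with the same widths.
implied_clock_mhz = 1.4e12 / (30 * 16 * 4) / 1e6
```

That gives about 1.15TB/s at 600MHz, and roughly 729MHz would be needed to reach 1.4TB/s with the same bank count and width.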

Arty · Sep 30, 2009

KonKort said:
Why is GF100 late? How can you judge in this direction? I reported in January that Nvidia's next generation chip will come in Q4/2009. So where do you see a delay?

You also 'reported' that GT300 taped out in Jan-Mar (massive lol) and that it was 512b, I'm waiting for you to backtrack on that bit and conveniently jump on the 384b bandwagon thanks to the hint from Rys. FYI, you also 'reported' that Cypress was 1200SPs and what not ..

Anyone care to reason why the change from 512b to 384b? (Assuming it was 512b to start with.) Does that improve yields? And if it does, is that significant? It also brings an awkward proposition: irregular memory configurations (Arun?). A 384b GF100 would mean 1.5GB (50% more memory wrt Cypress), which would be an additional cost, no?

FYI, if I'm reading CJ correctly G300/Fermi/GF100 recently returned from a spin.

dnavas · Sep 30, 2009

Humus said:
5% would be a pretty reasonable estimate IMO. Might even be lower for optimized shaders. RSQ is probably used more, might be in the 10% range due to normalize() being used fairly frequently.

Thanks for the data re RSQ in shaders. Interesting. I figured 5% was about right for divisions in my own code, but then the percentage for transcendentals in my own code would be roughly 0%.

I also figured the number of adds vs. multiplies would put muls at around 10%, but then a lot of adds are address related and loop related. I'd be hard-pressed to take an educated guess at the ratio absent those two items, but my stab in the dark would say that there are more adds per mul than there are more muls per div. :shrug:

At any rate, from my limited experience, if I wanted to make my ALUs more generic, DIV would have to be close to the top of my list....

-Dave

LordEC911 · Sep 30, 2009

jaredpace said:
Well this is a speculation thread, and since I'm new here I hope you guys don't mind me posting my speculation. This is what I am guessing:

GF100 (Saw-zall GTX)
40nm DX11 Cuda3
~590mm^2 ~3.2 billion transistors
24.5 x 24.5mm2 die
1536mb .4ns samsung/hynix 5gbps
~233gb/sec bandwidth
700c / 1750s / 1250m
512 MIMD / 128 tmu / 64 rop
195watt TDP
launch nov 25th, major retail christmas/jan.
$549 - $599

Seems like you are going with the 384bit, but your ROPs don't match it. Should be either 24 or 48ROPs, 48ROPs sounds a bit more likely.
Not sure how you are getting 233GBps, 384b w/ 5ghz GDDR5 is 240GBps, probably going to have slightly lower clocks though to save on power consumption unless they are heavily bottlenecked by bandwidth.
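The arithmetic, as a small Python sketch (the function name is made up; GB/s here is decimal, and "5ghz" is treated as a 5Gbps per-pin data rate):

```python
def memory_bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Peak memory bandwidth in GB/s: bytes per transfer times per-pin data rate."""
    return bus_width_bits / 8 * data_rate_gbps


bw_384 = memory_bandwidth_gbs(384, 5.0)  # 384-bit bus at 5Gbps
bw_512 = memory_bandwidth_gbs(512, 5.0)  # 512-bit bus at the same rate, for comparison

# Per-pin data rate (Gbps) that a 384-bit bus would need for 233GB/s.
implied_rate_gbps = 233 * 8 / 384
```

So 233GB/s on a 384-bit bus would correspond to roughly 4.85Gbps memory rather than the full 5Gbps.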

mapel110 · Sep 30, 2009

Let me try. ^^

40nm DX11
~450mm^2
~2.8 billion transistors
2048 MB GDDR5
320gb/sec bandwidth
512bit memory interface
700c / 1600s / 1250m
512 MIMD / 160 tmu / 32 rops at much higher frequency
170watt TDP
launch nov, availability december
$450 for top model

Kaotik · Sep 30, 2009

KonKort said:
Why is GF100 late? How can you judge in this direction? I reported in January that Nvidia's next generation chip will come in Q4/2009. So where do you see a delay?

You cannot say that GF100 has been delayed just because AMD got its first DirectX 11 chips out a few weeks earlier.
But I will not deny that Nvidia's chip had some problems in the summer and could already have been on the market.

As some others already pointed out, you don't have the best track record for reliable reports either.

CJ · Sep 30, 2009

Arty said:
FYI, if I'm reading CJ correctly G300/Fermi/GF100 recently returned from a spin.

Actually, that's not what I said.

I said it needs a spin and not that it returned from a spin.

Unknown Soldier · Sep 30, 2009

jaredpace said:
Well this is a speculation thread, and since I'm new here I hope you guys don't mind me posting my speculation. This is what I am guessing:

GF100 (Saw-zall GTX)
40nm DX11 Cuda3
~590mm^2 ~3.2 billion transistors
24.5 x 24.5mm2 die
1536mb .4ns samsung/hynix 5gbps
~233gb/sec bandwidth
700c / 1750s / 1250m
512 MIMD / 128 tmu / 64 rop
195watt TDP
launch nov 25th, major retail christmas/jan.
$549 - $599

or 27th

Arty · Sep 30, 2009

Unknown Soldier said:
or 27th

Which was posted last week in this same thread, unless you like regurgitated news at that link.

itaru · Sep 30, 2009

NVIDIA - GPU Features?

http://www.techarp.com/showarticle.aspx?artno=88&pgno=5
http://www.techarp.com/showarticle.aspx?artno=88&pgno=6
http://www.techarp.com/showarticle.aspx?artno=88&pgno=7

ninelven · Sep 30, 2009

Clearly wrong... Quick game of find the inconsistencies.

Wirmish · Sep 30, 2009

nAo said:
HD5890 numbers are huge, but it doesn't seem to be any better than its predecessor when it comes to using all that raw power; it's actually worse, given that it is more bw constrained.

You sure ? ---> FiringSquad

nAo · Sep 30, 2009

Hardly conclusive tests. Increasing the memory clock could simply generate more data transfer errors. You can't tweak a couple of knobs and expect a complex architecture to show a linear behaviour.

FUDie · Sep 30, 2009

nAo said:
Hardly conclusive tests. Increasing the memory clock could simply generate more data transfer errors. You can't tweak a couple of knobs and expect a complex architecture to show a linear behaviour.

Except that increasing engine clock by 9% alone was enough to gain 5%. Increasing memory clocks by 9% as well couldn't get you more than an additional 4%, so engine clock has more impact than memory clock.

-FUDie

nAo · Sep 30, 2009

FUDie said:
Increasing memory clocks by 9% as well couldn't get you more than additional 4%

You clearly haven't read what I wrote about data transfer errors. We are dealing with GDDR5: it won't fail outright, it will scale badly or even hurt perf. Moreover, no app is entirely ALU limited or bw limited; bottlenecks are dynamic and constantly change while rendering a single frame.
BTW, I'm not just talking about the memory modules 'failing'; the GDDR5 interface can fail as well.
