Nvidia GT300 core: Speculation

Ailuros · Sep 30, 2009

nAo said:
You clearly haven't read what I wrote about data transfer errors. We are dealing with GDDR5, it won't fail, it will scale badly or even impact perf. Moreover no app is entirely ALU limited or bw limited, bottlenecks are dynamic and constantly change while rendering a single frame.
BTW..not just talking about the memory modules 'failing', the GDDR5 interface can fail as well.

I don't know anything yet, but assuming the 384bit bus for GF100 is true, what guarantees that we won't see something similar here too?

rpg.314 · Sep 30, 2009

I, on the other hand, believe that CPU-style caches don't scale. LRB's rendering pipeline is ample proof of that. We'll need scratchpad memories, just like Cell and the GPUs of today. However, the one thing that I'd change over Cell is to allow vector scatter/gather from global memory as well, and not just async DMAs.

Cell programmers might be banging their heads against walls, stones etc. But gpu programmers have got on pretty fine in the last 2.5 years on CUDA.

nAo · Sep 30, 2009

rpg.314 said:
. But gpu programmers have got on pretty fine in the last 2.5 years on CUDA.

If you believe that, you haven't read enough CUDA-based research papers.

edit: sooner or later nvidia & ati will add proper coherent r/w caches to their architectures, it's just a matter of time.

FUDie · Sep 30, 2009

nAo said:
You clearly haven't read what I wrote about data transfer errors. We are dealing with GDDR5, it won't fail, it will scale badly or even impact perf. Moreover no app is entirely ALU limited or bw limited, bottlenecks are dynamic and constantly change while rendering a single frame.

Yes, I did read what you wrote and I do understand it. And nothing you say contradicts the fact that Crysis scaled better with engine clock. It doesn't matter if the memory wasn't scaling as well due to errors: 9% engine clock gave 5% performance boost. If both engine and memory were increased by 9% the maximum gain we'd expect would be 9%. So 9% memory clock increase could give at most 4% more performance.

Engine clock is having a larger impact here. Note that engine speed regulates more than just ALU speed, it also controls ROP performance, vertex rates, etc.
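The bound argued above can be checked with a few lines of arithmetic. This is only a sketch: the 9% and 5% figures come from the post, while the function name `max_remaining_gain` and the choice between subtracting percentages and dividing speedup ratios are illustrative assumptions, not anything measured in this thread.

```python
def max_remaining_gain(combined_clock_gain, observed_gain, multiplicative=True):
    """Upper bound on the gain still available from the other clock domain.

    combined_clock_gain: fractional increase applied to both clocks (0.09 for 9%).
    observed_gain: fractional speedup measured from raising one clock alone.
    """
    if multiplicative:
        # Speedups compose as ratios: (1 + total) / (1 + observed) - 1.
        return (1.0 + combined_clock_gain) / (1.0 + observed_gain) - 1.0
    # Plain subtraction of percentages, as done in the post.
    return combined_clock_gain - observed_gain


engine_only = 0.05  # 5% gain from a 9% engine clock bump (figure from the post)
both_clocks = 0.09  # best case if engine and memory both rise by 9%

additive_bound = max_remaining_gain(both_clocks, engine_only, multiplicative=False)
ratio_bound = max_remaining_gain(both_clocks, engine_only)

print(f"additive bound: {additive_bound:.1%}")
print(f"ratio bound:    {ratio_bound:.1%}")
```

Either way of counting leaves roughly 4% at most for the memory clock, so the conclusion that engine clock matters more in this test does not depend on which form is used.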

-FUDie

rpg.314 · Sep 30, 2009

nAo said:
If you believe that, you haven't read enough CUDA-based research papers.

Maybe. But to be convinced otherwise, I'd like to see someone using r/w coherency of caches with high performance on, say, an O(50) core chip.

edit: sooner or later nvidia & ati will add proper coherent r/w caches to their architectures, it's just a matter of time.

I am in the software-managed caches camp for now. r/w coherent caches hurt more than they help in the O(50) cores regime, as your compute increases as O(p) but your communication increases as O(p^2).
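To put rough numbers on the O(p) versus O(p^2) claim, here is a minimal sketch. It assumes the most naive model, where every core's cache may have to exchange coherence traffic with every other core's cache; the function names are hypothetical and real coherence protocols need not behave this way.

```python
def compute_units(p):
    # Useful work is assumed to grow linearly with the core count p.
    return p


def pairwise_links(p):
    # Naive all-to-all model: one potential communication path per pair of cores.
    return p * (p - 1) // 2


for p in (4, 16, 50):
    links = pairwise_links(p)
    # The last column is communication paths per unit of compute.
    print(p, compute_units(p), links, links / compute_units(p))
```

Under this model the paths-per-core ratio grows with p, which is the scaling worry stated above. How much of it survives in a smarter hardware implementation is exactly what is being argued here.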

nAo · Sep 30, 2009

rpg.314 said:
your communication increases by O(p^2).

With naive/simple hw implementations.

rpg.314 · Sep 30, 2009

Maybe it is possible to reduce the O(p^2) to something lower, but I am still waiting for something that uses the r/w coherency of caches on an O(50) core chip with high performance.

Rys · Sep 30, 2009

Anteru said:
Seriously, if Rys is already doing diagrams of GF100 (while those for HD5k are still not out yet?), I'll definitely wait for the GF100 before deciding where to sink my money.

Those for HD 5870 are done, and were done before I started work on GF100 (thanks Alex!). We'll publish on it soon.

Davros · Sep 30, 2009

Ailuros said:
the 384bit bus for GF100 is true,

GF100 ? where did this come from I know about G300, but Gf100 ???

edit: and Gt212 what the bloody hell is that ?

DuckThor Evil · Sep 30, 2009

Davros said:
GF100 ? where did this come from I know about G300, but Gf100 ???

Go back to post nr. 2548 and read forward.

AnarchX · Sep 30, 2009

GPU specifications
This is the meaty part you always want to read first. So, here is how it goes:
* 3.0 billion transistors
* 40nm TSMC
* 384-bit memory interface
* 512 shader cores [renamed into CUDA Cores]
* 32 CUDA cores per Shader Cluster
* 1MB L1 cache memory [divided into 16KB Cache - Shared Memory]
* 768KB L2 unified cache memory
* Up to 6GB GDDR5 memory
* Half Speed IEEE 754 Double Precision

BSN

trinibwoy · Sep 30, 2009

What makes a single unit important is the fact that it can execute an integer or a floating point instruction per clock per thread.

MfA · Sep 30, 2009

built-in ECC features inside the GDDR5 SDRAM memory

Is he just being disingenuous here, or does he still not get that it will generally only correct transfer errors?

Ailuros · Sep 30, 2009

Davros said:
GF100 ? where did this come from I know about G300, but Gf100 ???

edit: and Gt212 what the bloody hell is that ?

GT212 was IMHO a 40nm/D3D10.1 project which would have been a pretty dumb release considering that it also had a 384bit bus and 32SPs/cluster. It wouldn't have come close to GF100, but most likely to a future performance iteration of it. I'd say that if they had any common sense when they cancelled that project, they moved its human resources into a GF10x performance GPU project.

Since you're asking questions, I hope some can now understand the reason for the intentionally false information in supposed roadmaps. They just "named" the D12U something like GTX280 1.5GB.

rpg.314 · Sep 30, 2009

Isn't there supposed to be 32 KB of shared mem per block in DX11?

DegustatoR · Sep 30, 2009

rpg.314 said:
Isn't there supposed to be 32 kb shared mem per block in dx11?

48>32?
Ah, I see, it's Theo again.
He's talking about L1 cache there. Considering there is 1 MB of memory total and 16 KB L1 per SM and 16 SMs (512/32=16) how do you get to 1 MB from 16x16KB?
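The mismatch being pointed out is plain arithmetic on the quoted spec list. The sketch below only re-derives those numbers; the variable names are made up for illustration, and none of this says anything about how the actual chip is organized.

```python
KB_PER_MB = 1024

cuda_cores = 512                 # "512 shader cores" from the quoted list
cores_per_sm = 32                # "32 CUDA cores per Shader Cluster"
quoted_total_kb = 1 * KB_PER_MB  # "1MB L1 cache memory"
quoted_l1_per_sm_kb = 16         # "16KB Cache"

# Number of SMs implied by the core counts.
sm_count = cuda_cores // cores_per_sm

# Total if each SM really had only 16 KB.
total_if_16kb_each = sm_count * quoted_l1_per_sm_kb

# Per-SM share if the 1 MB total is taken at face value.
per_sm_share_kb = quoted_total_kb // sm_count

# What is left per SM after setting aside 16 KB of it.
leftover_per_sm_kb = per_sm_share_kb - quoted_l1_per_sm_kb

print(sm_count, total_if_16kb_each, per_sm_share_kb, leftover_per_sm_kb)
```

So 16 SMs at 16 KB each only account for 256 KB. Taking the 1 MB total at face value instead gives 64 KB per SM, and setting 16 KB aside leaves 48 KB, which is one way to read the "48>32" remark.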

trinibwoy · Sep 30, 2009

3dilettante said:
Maybe it could happen, though like Charlie I would question whether it would be wise to try to out-Larrabee Larrabee.

Looks like that's exactly what they're trying to do. Strange that there's no mention of any graphics specific bits so far. Not saying there aren't any but the focus seems to have veered sharply away from graphics.

A clean-sheet design that would basically abandon a huge chunk of the G80-GT200 framework would take time and resources to bring about. Given the time cycles for something like that, the roughly four years since the completion of G80 (assuming GT200's somewhat underwhelming improvements meant it was a secondary effort) would be a frighteningly tight timeline to architect a general purpose VLSI architecture.

That's true, but the same could be said for G71->G80 which was an even bigger change. Though they are trying to do more stuff now which could have put a strain on resources.

A big problem I see, as was noted in the discussion concerning the latency of Nvidia's atomic ops, was how the read-write-read process for GPUs with their read-only caches was so very long. As far as general computation is concerned, the rearchitecting of how caches interact would be something Nvidia would be interested in looking at...

It's probably safe to assume that if they're serious about computing, performance of atomics would have been high on their to-do list. Side question: are the existing caches on GPUs generally useful for non-texture data (not referring to the specialized caches like PTVC)?

trinibwoy · Sep 30, 2009

DegustatoR said:
48>32?

Heh, where did you see 48? Theo didn't mention it

Ah, I see what you did thar! 1024/16-16=48

Rys · Sep 30, 2009

DegustatoR said:
Considering there is 1 MB of memory total and 16 KB L1 per SM and 16 SMs (512/32=16) how do you get to 1 MB from 16x16KB?

There isn't 16KB of L1 per SM.

DegustatoR · Sep 30, 2009

Rys said:
There isn't 16KB of L1 per SM.

There might be -)
