Nvidia GT300 core: Speculation

For those of us who don't read tech-German (I suspect most can figure it out more or less ;)) here's a short overview:

Unnamed Nvidia insiders / sources close to Nvidia, blah blah: GT300 will be a generational jump like NV40 and G80 were, and the true heir to the G80 throne, whereas GT200 is basically just considered G80+. It'll be on 40nm, have DX11 support, and they're shooting for the 4th quarter of 2009 (this info seems to coincide with the CUDA 3.0 roadmap).

Some more buzzwords in there; apparently they're moving from SIMD towards a kind of Multiple Instruction Multiple Data, the clusters should be arranged more dynamically, and there should be more use of crossbar technology and additional buffers. Big changes are coming in power and memory management.

First silicon is supposed to be running, but still at low clock speeds.
 
You shouldn't believe anything this guy KonKort is posting; he's just trying to promote his own website, and he's spreading that bullshit everywhere on German forums.
 
Or we could entertain a little insanity? So is MIMD total BS, or technically possible for GT3xx, and if so what type of MIMD?

(1.) Are we talking SIMD hardware which automatically regroups divergent branches into common instruction SIMD vectors for computation?

(2.) Or hardware which keeps the same data vector but allows scalar lanes to have different instructions?

(3.) Or hardware which is fully scalar?
 
Going forwards, "work groups" (OpenCL) are going to grow considerably as time goes by. Hundreds and thousands of work items per work group - simply to enable a reasonably large radius for inter-work-item data sharing.

Apart from synching amongst themselves in order to be well behaved in data sharing, the work items can all act independently of each other.
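To make that concrete, here's a minimal CUDA sketch (the kernel name and block size are my own, purely illustrative) of the same idea in today's terms: the threads of a block share data through on-chip memory and only have to agree with each other at the barrier; everywhere else they run independently.

```cuda
// Illustrative only: a thread block standing in for an OpenCL work group.
// The work items share data via on-chip memory and meet only at the barrier.
__global__ void neighbour_blend(float *data)
{
    __shared__ float tile[256];                 // assumes 256 threads per block

    int lid = threadIdx.x;                      // local id within the work group
    int gid = blockIdx.x * blockDim.x + lid;

    tile[lid] = data[gid];                      // every work item contributes a value
    __syncthreads();                            // barrier: the only ordering point

    // After the barrier each work item is free again; here it just blends
    // its own value with its left neighbour's.
    float left = tile[(lid + blockDim.x - 1) % blockDim.x];
    data[gid] = 0.5f * (tile[lid] + left);
}
```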

In graphics there's still the texture gradient binding 4 work items together, so texturing is always going to be a bit of a bottleneck.

But apart from that, when throwing control flow into the mix, which in effect creates sub-work-group synch points, the work items are all free to do what they want.

When a work group is scheduled it's broken down into warps on NVidia hardware. In my view there's nothing about a warp (apart from texture-quads for gradient purposes) that keeps the work items in a warp together.

If a work group is 32x32 work items, in general it doesn't matter if (2,2)(2,4)(4,8)(4,12)(16,8)(16,9)(16,10)(16,11) find themselves in the SIMD at the same time on a series of clocks (instructions).

In G80, NVidia has built a fairly complicated instruction/operand windowing unit, allowing both dimensions, warp and instruction, to be executed out of order. It seems to me that by extending this logic further it's possible to dynamically construct warps by going into the third dimension: work-items.

By tagging each work-item with a synch-point, the windower can identify valid combinations of data and instructions to be issued. The synch-points are determined either by explicit data-share synch statements (barrier) in the instruction stream, or by control flow.

With barriers windowed, the SIMD can execute an arbitrarily constructed warp consisting of a stream of instructions that are valid and available within the window.

So in a work-group of 1024 work-items, if only 10 of them want to execute a loop, then the SIMD wastage is restricted to loop count * (clock-length of a warp (4) * SIMD-width (8) - 10), i.e. loop-count*22. It doesn't matter how randomly these 10 work-items are spread throughout the work-group of 1024. The windower will collect them all together within a single warp, for each instruction and every iteration of the loop.
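For reference, this is roughly what that scenario looks like from the CUDA side (a hypothetical kernel of my own, just to illustrate the divergence being discussed): only a scattered handful of the 1024 threads in the block have any iterations to do, yet today every 32-wide warp containing even one of them replays the whole loop body.

```cuda
// Hypothetical kernel, only to illustrate the divergence scenario above:
// iters[] is zero for almost every thread, non-zero for ~10 scattered ones.
__global__ void sparse_loop(float *data, const int *iters)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = data[tid];

    // Divergent loop: today, any warp containing one looping thread keeps
    // all of its lanes occupied for every iteration; a windower that packs
    // the ~10 looping threads into one dynamically formed warp would waste
    // only 32 - 10 = 22 lanes per iteration in total.
    for (int i = 0; i < iters[tid]; ++i)
        acc = acc * 0.5f + 1.0f;

    data[tid] = acc;
}
```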

I'm figuring that the "expensive", fine-grained scheduling hardware that NVidia's put together in G80 et al. could turn into "very cheap" scheduling hardware in GT300, once it starts to execute any code with intricate control flow (including barriers and atomic accesses to memory).
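And here's a toy, host-side sketch of the windowing/regrouping idea itself (entirely my own construction, not anything NVidia has described): tag each work item with its current synch-point, group items with the same tag, and carve each group into 32-wide warps. The 10 scattered looping items end up sharing a single warp, no matter where they sit in the work group.

```cuda
// Toy host-side model of the "windower" idea (my own construction, not an
// NVidia design): pack work items into warps by their current synch-point tag.
#include <cstdio>
#include <map>
#include <vector>

int main()
{
    const int WARP = 32, ITEMS = 1024;

    // Tag 0 = waiting at the next barrier, tag 1 = inside the loop.
    // Scatter 10 looping work items pseudo-randomly through the work group.
    std::vector<int> tag(ITEMS, 0);
    for (int i = 1; i <= 10; ++i) tag[(i * 97) % ITEMS] = 1;

    // Group work-item ids by tag, then carve each group into 32-wide warps.
    std::map<int, std::vector<int>> groups;
    for (int i = 0; i < ITEMS; ++i) groups[tag[i]].push_back(i);

    for (const auto &g : groups) {
        int warps = (int)(g.second.size() + WARP - 1) / WARP;
        std::printf("tag %d: %zu work items -> %d warp(s)\n",
                    g.first, g.second.size(), warps);
    }
    // Expected: the 1014 waiting items fill 32 warps, and all 10 looping
    // items land in a single dynamically formed warp.
    return 0;
}
```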

So, it's sort of MWMD - multiple warp multiple data :D

Jawed
 
For those of us who don't read tech-German (I suspect most can figure it out more or less ;)) here's a short overview:

Unnamed Nvidia insiders / sources close to Nvidia, blah blah: GT300 will be a generational jump like NV40 and G80 were, and the true heir to the G80 throne, whereas GT200 is basically just considered G80+. It'll be on 40nm, have DX11 support, and they're shooting for the 4th quarter of 2009 (this info seems to coincide with the CUDA 3.0 roadmap).

Some more buzzwords in there; apparently they're moving from SIMD towards a kind of Multiple Instruction Multiple Data, the clusters should be arranged more dynamically, and there should be more use of crossbar technology and additional buffers. Big changes are coming in power and memory management.

First silicon is supposed to be running, but still at low clock speeds.
If they have silicon in house and are shooting for Q4 they're not being very aggressive.
 
If they have silicon in house and are shooting for Q4 they're not being very aggressive.

If memory serves, it wasn't any different with G80 before its release. If GT3x0 is really as "revolutionary" as some "leaks" want to indicate, then they'd better be mighty careful that the result has close to zero drawbacks or hiccups; even more so since it's a new technological generation.

Besides, it really depends on what anyone means by "samples" or "in-house silicon". I have severe doubts that GT3x0 is 'ready' in the sense NV means when they say it. "They're working on it" would be more accurate, IMHLO.

As for the rumour mongering, it's funny since it's easy to recognize where each tidbit comes from. Too bad none of them is able to give a decent explanation of each buzzword thrown into the air, LOL ;)
 
So in a work-group of 1024 work-items, if only 10 of them want to execute a loop, then the SIMD wastage is restricted to loop count * (clock-length of a warp (4) * SIMD-width (8) - 10), i.e. loop-count*22. It doesn't matter how randomly these 10 work-items are spread throughout the work-group of 1024. The windower will collect them all together within a single warp, for each instruction and every iteration of the loop.

If Nvidia does something like this, while technically impressive, I don't really see it being a boon for games. There are some branchier algorithms out there, but I don't think we're at a point where this sort of efficiency beats just throwing more units at the problem in terms of bang per transistor. Of course, the GPGPU/CUDA guys would be in GPU heaven.
 
You shouldn't believe anything this guy KonKort is posting; he's just trying to promote his own website, and he's spreading that bullshit everywhere on German forums.

And you should keep quiet until you can show us better info. Just because he has a website doesn't mean his info is wrong.
 
If Nvidia does something like this, while technically impressive, I don't really see it being a boon for games. There are some branchier algorithms out there, but I don't think we're at a point where this sort of efficiency beats just throwing more units at the problem in terms of bang per transistor. Of course, the GPGPU/CUDA guys would be in GPU heaven.
I'll be keeping an eye out for NVidia presentations that encourage the use of dynamic branching.

But note I think this can also be used to improve the efficiency of barriers and gather/scatter operations. All of these operations create incoherency. I'm thinking that allowing the GPU to window-and-randomly-issue-from a larger population of work-items will enable it to improve the coherency of all these actions - not just dynamic branching.

D3D11 features that involve read-write access to resources seem like a big target for improved coherency. Since all gathers/scatters have variable latency (even if cached), finding a way to re-sort them to match the latency experienced, rather than just doing a dumb latency-hide (as texturing does), should prove beneficial.

Dumb latency hiding works for textures because the access patterns are almost always really neat and efficient - the latency doesn't explode in your face. Except dependent texturing, of course.
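As a trivial example of the kind of access that defeats the dumb approach (again, just an illustrative kernel of my own, not anything from the presentations): every lane fetches through an index it computed, so neighbouring lanes can land on wildly different cache lines and see very different latencies.

```cuda
// Illustrative only: a data-dependent gather. Unlike a plain texture fetch,
// the address pattern is unknown until runtime, so per-lane latency varies
// and simple round-robin latency hiding leaves the SIMD waiting on the
// slowest lane of each warp.
__global__ void indirect_gather(float *out, const float *table,
                                const int *idx, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = table[idx[tid]];   // gather through a computed index
}
```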

Jawed
 
Sorry KonKort, but how can anyone take you seriously when you claim GT300 samples are already back from the fab and functional? Maybe you believe your info, but here's a hint: it can only be wrong. I'd suggest not taking whichever of your sources told you that very seriously in the future...

Anyway a discussion on MIMD is always interesting so that shouldn't prevent us from continuing down that path of conversation as Timothy said! ;)
 
Sorry KonKort, but how can anyone take you seriously when you claim GT300 samples are already back from the fab and functional? Maybe you believe your info, but here's a hint: it can only be wrong. I'd suggest not taking whichever of your sources told you that very seriously in the future...
For what it's worth some of the D3D11-Compute Shader presentations seem to imply available D3D11 hardware:

http://s08.idav.ucdavis.edu/boyd-dx11-compute-shader.pdf

Slide 24, though I admit it's possible to interpret this differently. Specifically, HD4870 is way, way faster at FFT than anything else due to changes in the ALUs, and this graph might merely be referring to the program running on HD4870, even though it's shown after the D3D11 data point as the graph "progresses into the future". It may be that HD4870, in this case, is functioning identically to a D3D11 GPU?

Jawed
 
One way to explain that would be that this graph is simply a simulation... heh. Anyway, one possibility is that NV and/or AMD have a DX11 shader core prototype, i.e. using TSMC's shuttle service (and no custom logic etc.) to reduce costs, probably on 65nm. That's a pretty far cry from a full-blown GT300 or RV870 sample though! And honestly I'm not sure how that kind of thing would leak out either.
 
Do you know what I want to see on the GT300?
Wait for it.....
Triplehead - why not? It shouldn't be too expensive to add.
 
Sorry KonKort, but how can anyone take you seriously when you claim GT300 samples are already back from the fab and functional?

Hello Arun,

The GT300 is already running at Nvidia. That is a fact!
I did not say that it already has all of its features working. I strongly distance myself from that claim, and I emphasize that the clock frequency is still not at the level the retail versions will be at.
 
KonKort,

if NVIDIA already has a running GT300, then why is GT212 in its plans? GT212 is scheduled (according to what you have said) for May/June. If GT300 is already running, then NVIDIA should have its final silicon around the same time (June '09), or one or two months later in the worst case.

Maybe GT212 is really cancelled then? IMO, even if NVIDIA could release GT300 earlier than Q4/09, releasing GT212 would be the right move, because I think there will be no performance/mainstream GPU based on this chip this year, so GT212 would be positioned at the same level as G71 was when G80 was released over two years ago.


PS. KonKort, do you have any new info about GT212 or other 40nm GT2xx NVIDIA GPUs? ;)
 
I have written that GT21x is not dead. And yes, GT212 will be launched around the summer.

One more note on GT300: G80 was running at Nvidia in Q1/2006, and the card launched in Q4/2006. ;)
 
I have written that GT21x is not dead. And yes, GT212 will be launched around the summer.

One more note on GT300: G80 was running at Nvidia in Q1/2006, and the card launched in Q4/2006.

If you know so much from NV directly, and since you are posting such stuff on public forums, I wonder if NV-legal has given you a call yet? Or are you allowed to do that?
 