Nvidia GT300 core: Speculation

TimothyFarrar · May 1, 2009

Jawed said:
If collisions can't occur due to addressing, then there's no need to use atomic operations - the obvious example is per thread registers which are entirely private.

I'd like to assume that shared-memory atomics are implemented with dedicated hardware instructions. If that is the case, would seem as if you are right in that nothing special happens at the shared memory level. Otherwise things would get rather messy, you'd have to serialize groups of instructions on address "collisions". Not sure about the complexity trade off in hardware between these two options.

The paper you linked seems to use either 64 or 256 bins per patch, each of which is writable by any number of threads (source bins), with one source bin per source patch first identified by an atomicMin. So the number of collisions here is variable and in fact the family of atomic variables is huge.

Yeah that paper presents nearly the worst case I can see for atomic operations, huge number of global atomic operations. Each global atomic should be 64-bytes of global memory traffic (GT200, 32-byte minimum transfer size: load, atomic op, store) for a single 32-bit integer.

The more I look at this, the more I think that fast atomic operations are in-fact more important than any type of dynamic warp formation (DWF), so I'm changing my prediction about DWF. DWF for bank conflict avoidance no longer seems worth it when you consider that you can just load data into shared memory at a bank offset based on thread index (can completely avoid bank conflicts). So about the only thing DWF gets you is better branch performance, but divergent branching messes up everything required for data locality, and tightly ordered synchronization. Which leads me to wonder about just what that "cGPU" buzz word actually means. Perhaps it is just Multiple Kernel - SIMD (MK-SIMD), better cross core load balancing, combined with some better more shared caching for atomics/ROP?

Jawed · May 1, 2009

TimothyFarrar said:
The more I look at this, the more I think that fast atomic operations are in-fact more important than any type of dynamic warp formation (DWF), so I'm changing my prediction about DWF.

Ah, the curse of the histogram, particularly one with an unknown bin count

I think the most grievous issue is simply that a kernel's domain (grid) cannot entirely fit on a GPU at any one time, i.e. a grid by definition runs out of global memory.

DWF for bank conflict avoidance no longer seems worth it when you consider that you can just load data into shared memory at a bank offset based on thread index (can completely avoid bank conflicts). So about the only thing DWF gets you is better branch performance, but divergent branching messes up everything required for data locality, and tightly ordered synchronization.

My hope for DWF is that it solves all waterfall scenarios, whenever any kind of random fetching/thread divergence crops up.

In the end, maybe dynamic branching, like atomic operations, comes under the heading of "use sparingly, if at all". Still too-early days to tell.

It seems that data-parallel-specific techniques such as scan can be used to work around these gotchas - but these techniques are, themselves, pretty expensive. The paper you linked seems to be an exemplar - the concept of the oracle to parse the input and build a tree is something that I expect we'll be seeing much more of as people grapple with "non-square" parallelism.

It's interesting that ATI has a variety of speed-ups for some of these kinds of things buried in the architecture, like transposing reads from shared memory and lane-shared registers. Maybe this is where NVidia will be putting in a lot of effort, to make data-parallel primitives function with less off-die/divergence/serialisation.

Which leads me to wonder about just what that "cGPU" buzz word actually means. Perhaps it is just Multiple Kernel - SIMD (MK-SIMD), better cross core load balancing, combined with some better more shared caching for atomics/ROP?

Well, NVidia's had a few years to think about this, so it could be as "radical" as shared memory was. Certainly something that's more fluid, less "square", has a naive attractiveness about it. Hard to know how long the shine would last, though. GPUs make awful ray tracers when you consider the raw computational throughput that's left on the table.

The definition of cGPU appears to be Larrabee.

Jawed

MfA · May 1, 2009

TimothyFarrar said:
divergent branching messes up everything required for data locality, and tightly ordered synchronization.

Still better than using 32 SIMD units for 1 ray in raytracing.

KonKort · May 5, 2009

Again in cooperation with BSON I can present more GT300 details.

The chip would have 512 SPs by a 512 Bit memory-interface.

Hardware-Infos

BSON

trinibwoy · May 5, 2009

And what do they plan to do with all that bandwidth if true? 512 GT200 class SP's won't be enough.

pjbliverpool · May 5, 2009

Its more than twice the shader power at less than twice the bandwidth so it seemsless wastefull in terms of bandwidth than GT200.

Besides, if Xenos can use than much bandwidth i'm sure a GT300 could

Scali · May 5, 2009

trinibwoy said:
And what do they plan to do with all that bandwidth if true? 512 GT200 class SP's won't be enough.

Depends on how fast they can clock those SPs... but yea, I somehow find the 512 bit AND GDDR5 to be a bit too much aswell.

bowman · May 5, 2009

Hey, it's about time. This thing should be a true 8800GTX successor.

On the other hand 'bright side of news' with Theo Valich and that other site isn't exactly the most reliable source of info in the world.

CarstenS · May 5, 2009

If 512 Bit and especially GDDR5 are true at all, then I wonder if and how Nvidias turn at making a compelling idle mode for this card will work out.

[edit: Yippie - 1k postings completed. Advance to next level[/i]

Scali · May 5, 2009

CarstenS said:
If 512 Bit and especially GDDR5 are true at all, then I wonder if and how Nvidias turn at making a compelling idle mode for this card will work out.

Well, one thing I could think of... why not disable half the memory altogether in 2d mode?
I mean, 1 GB or more memory is nice for 3d, but for a standard OS desktop it's way overkill. In fact, I guess even just 128 mb would be more than enough.
Did any manufacturer ever try something like that?

DegustatoR · May 5, 2009

pjbliverpool said:
Its more than twice the shader power at less than twice the bandwidth so it seemsless wastefull in terms of bandwidth than GT200.

Why less? What GDDR5 speeds are we expecting to be available at the end of the year?
And SP number probably doesn't mean anything besides a hint that they'll remain to be serial scalars from the number itself.

Edit:
GT300 delayed till 2010. So, what kind of GDDR5 memory are we expecting in the beginning of 2010? 8)

Jawed · May 5, 2009

Scali said:
Well, one thing I could think of... why not disable half the memory altogether in 2d mode?
I mean, 1 GB or more memory is nice for 3d, but for a standard OS desktop it's way overkill. In fact, I guess even just 128 mb would be more than enough.
Did any manufacturer ever try something like that?

Maybe NVidia's doing it already?

But Aero is a 3D app, so you'd also be turning off ROPs on current hardware to make this work, I guess.

ATI's IGPs seem to recognise a frozen image in the framebuffer and use that as a signal to turn off stuff. Something like that, IGP stuff is so interesting.

Jawed

Scali · May 5, 2009

Jawed said:
Maybe NVidia's doing it already?

But Aero is a 3D app, so you'd also be turning off ROPs on current hardware to make this work, I guess.

I'm not sure if there's any difference at all between 2d and 3d, as far as hardware is concerned.
That is, why would there be any specific 2d hardware when 3d texturing/shading hardware can perform the same operations?
So I don't think Aero or 'classic' Windows interface makes a difference.

Jawed · May 5, 2009

Yes, you're right.

I think it's just a question of how flexible the tiling of screen space would end up being, then. i.e. can you run aero solely on one MC? By definition the architecture can do this - the question is really about the dynamic switching...

Jawed

Scali · May 5, 2009

Yea, if I were developing a GPU, I'd want to look into that.
Aero itself is really light, even on my Intel X3100 it runs very well. So you can castrate a modern GPU down to X3100-levels (well let's see, I have 667 MHz dualchannel DDR2, 8 'stream processors' and a maximum of 384 mb), and people probably won't even notice. But the power saving would be huge. What does an X3100 use anyway? Less than 10W I suppose. A videocard idling at < 10W, that would be something

trinibwoy · May 5, 2009

pjbliverpool said:
Its more than twice the shader power at less than twice the bandwidth so it seemsless wastefull in terms of bandwidth than GT200.

Perhaps but don't we all expect required math:bandwidth ratio to increase rapidly going forward? Or will DX11 apps remain texture/bandwidth bound?

CarstenS · May 5, 2009

Scali said:
Less than 10W I suppose. A videocard idling at < 10W, that would be something

You should be getting yourself an HD 4670 then

Seriously, Aero is a no-go on my GMA500. I mean, it works, but veeeery slooooowly. So, sweet spot would be somewhere in between yours (X3100) and mine (GMA500).

Scali · May 5, 2009

CarstenS said:
You should be getting yourself an HD 4670 then

No, I want something that's efficient when idle, but is actually fast when running at maximum speed.
Heck, would an HD4670 even be faster than the 8800GTS I have currently? It would not support PhysX, and it won't have the same level of texture filtering anyway.

neliz · May 5, 2009

Scali said:
No, I want something that's efficient when idle, but is actually fast when running at maximum speed.
Heck, would an HD4670 even be faster than the 8800GTS I have currently? It would not support PhysX, and it won't have the same level of texture filtering anyway.

Do your PhysX reduce idle power? angle independancy AF has a 16X power savings mode? Oh wait, nV marketing does not condone the use of the Power of 3 in a discussion about power savings.

what about switching graphics just like on the notebook side? you'll pay $20 more for your videocard but it would (very very very very very very eventually) earn that back by switching GPU's under load.

GT300, 512 bit, super duper memory controller. delay after delay. NV30 and R600 finally can have a threesome.

Sxotty · May 5, 2009

neliz said:
Do your PhysX reduce idle power? angle independancy AF has a 16X power savings mode? Oh wait, nV marketing does not condone the use of the Power of 3 in a discussion about power savings.

what about switching graphics just like on the notebook side? you'll pay $20 more for your videocard but it would (very very very very very very eventually) earn that back by switching GPU's under load.

GT300, 512 bit, super duper memory controller. delay after delay. NV30 and R600 finally can have a threesome.

That post was full of crazy incoherent language.

Still we had that it was called hybrid Sli for power savings, only available on AMD chips (with Nv board) and died an early death. Maybe in the future it will rear its head again, preferably with OS support so that it could deal with AMD<-->Intel<-->Nvidia without it mattering.

Nvidia GT300 core: Speculation

TimothyFarrar

Jawed

MfA

KonKort

trinibwoy

Meh

pjbliverpool

B3D Scallywag

Scali

bowman

CarstenS

Moderator

Scali

DegustatoR

Jawed

Scali

Jawed

Scali

trinibwoy

Meh

CarstenS

Moderator

Scali

neliz

GIGABYTE Man

Sxotty

Similar threads