ROPs would be very latency-intolerant if the atomic ALU operations were done in the PS shader kernel (like Larrabee?). With separate ROP units, perhaps the latency intolerance has to do with needing to maintain correct fragment ordering.
The wrinkle here is trying to discern whether that's talking about ROPs purely for graphics operations being latency-intolerant, or whether it includes the handling of atomics from shaders. I think it's purely graphics, for what it's worth.
In that context, basic scoreboarding of fragment issue and queuing of completed fragments for correctly ordered updates is all that's required, as far as I can tell.
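Just to be concrete about the ordering part, here's a toy sketch (Python, purely illustrative, nothing like how the hardware actually implements it) of per-pixel scoreboarding at issue plus a reorder queue at completion:

```python
from collections import defaultdict

class OrderedROP:
    """Toy model: fragments get per-pixel sequence numbers at issue time,
    and completed fragments are only committed in that order, even if the
    shader results come back out of order."""

    def __init__(self):
        self.framebuffer = {}              # pixel -> colour
        self.issue_seq = defaultdict(int)  # pixel -> next seq to hand out
        self.commit_seq = defaultdict(int) # pixel -> next seq allowed to retire
        self.pending = defaultdict(dict)   # pixel -> {seq: colour}

    def issue(self, pixel):
        """Scoreboarding at fragment issue: record API order per pixel."""
        seq = self.issue_seq[pixel]
        self.issue_seq[pixel] += 1
        return seq

    def complete(self, pixel, seq, colour):
        """Shader finished (possibly out of order): queue, then drain in order."""
        self.pending[pixel][seq] = colour
        while self.commit_seq[pixel] in self.pending[pixel]:
            s = self.commit_seq[pixel]
            self.framebuffer[pixel] = self.pending[pixel].pop(s)
            self.commit_seq[pixel] += 1

rop = OrderedROP()
first = rop.issue((3, 5))                 # older fragment
second = rop.issue((3, 5))                # newer fragment, same pixel
rop.complete((3, 5), second, "green")     # finishes first, waits in the queue
rop.complete((3, 5), first, "red")        # unblocks the queue
assert rop.framebuffer[(3, 5)] == "green" # newest write wins, as API order demands
```

The point being that nothing in there needs to tolerate memory latency as such; it just needs enough buffering to cover out-of-order shader completion.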
Larrabee-style ROP latency is effectively bounded by L2 latency, 10 cycles in absolute terms, which is a trivial amount to hide across 4 threads.
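Back of the envelope (assuming simple round-robin issue across the 4 hardware threads, which is my assumption rather than anything from the Larrabee papers):

```python
# Crude arithmetic, assuming round-robin issue across Larrabee's 4 hardware
# threads (my assumption): a 10-cycle L2 access only stalls the core if the
# other 3 threads can't supply enough independent work to cover it.
l2_latency = 10   # cycles
threads = 4
cover_each = l2_latency / (threads - 1)
print(f"each of the other {threads - 1} threads needs ~{cover_each:.1f} cycles "
      f"of work to fully hide a {l2_latency}-cycle L2 access")
# => ~3.3 cycles each, which a read-modify-write ROP loop should easily provide
```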
ATI has colour, Z and stencil buffer caches. What we don't know is the typical lifetime of pixels in them, or their associativity.
It's interesting that the newest programming guide (for the Linux driver-writing community) suggests that for short shaders late-Z should be used, not early-Z. I guess this means there's no point bottlenecking the setup/interpolator units (i.e. there's latency in early-Z processing between setup and interpolation), and it's better to just let the RBE work it all out. If that's the case, it would imply a fairly meaty bit of caching for these pixel caches. But maybe I'm missing something...
Maybe it just means that the early-Z unit creates too much extra work for the RBE for this to be worthwhile, so it's better to take a single-shot approach to Z rather than dual-shot. That would imply either nothing in particular about the caches, or that it's the caches that would be under strain.
The kind of stuff you'd need to simulate, I guess.
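Something like this toy cost model is roughly what I have in mind, with every number in it completely made up rather than anything ATI has published:

```python
# Toy early-Z vs late-Z cost model. Every per-stage cost here is a made-up
# knob, not an ATI number; the point is only the shape of the trade-off.

def per_pixel_cost(shader_cycles, kill_rate, early_z,
                   z_test=1.0, early_z_overhead=2.0):
    if early_z:
        # Z tested between setup and interpolation; only survivors get shaded.
        return early_z_overhead + z_test + (1.0 - kill_rate) * shader_cycles
    # Late-Z: every pixel is shaded, Z is resolved by the RBE afterwards.
    return shader_cycles + z_test

for shader_cycles in (4, 40):
    e = per_pixel_cost(shader_cycles, kill_rate=0.3, early_z=True)
    l = per_pixel_cost(shader_cycles, kill_rate=0.3, early_z=False)
    print(f"{shader_cycles:>2}-cycle shader: early-Z {e:.1f} vs late-Z {l:.1f}")

# Short shader: early-Z's fixed overhead outweighs the shading it saves.
# Long shader: the saved shading dominates and early-Z comes out ahead.
```

With made-up numbers it obviously proves nothing, but it shows why the crossover could sit wherever the early-Z overhead and the kill rate put it.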
Could get very interesting. I wonder if it would ever be a possibility for NVidia to go beyond a one-to-one pairing of compute chip and memory/ROP hub chip. For instance, pairing two smaller, higher-yield compute chips with a single MEM/ROP hub.
I see compute chips sharing a hub like that as relatively unlikely, simply because bandwidth is still pretty important (thinking of small-scale configurations here).
But it would be interesting if a mesh of compute chips and hubs were formed, creating a shared/distributed memory space of some type. In a sense it's only a strongly-linked version of SLI. But I suppose there'd be options to make the routing more intricate, e.g. with hubs shared by pairs of chips in a ring, and all that other good stuff that Dally dreams about at night.
Jawed