Jawed
As far as OpenCL and CUDA are concerned, toolsets are definitely immature, though CUDA's environment is much better than it was with debugging and metering/profiling.
IEEE compliant DP erases a number of sins. As does a much more robust ecosystem of tools when it comes to debugging and in-silicon instrumentation, although the aged basis Larrabee works from may limit this in comparison to more modern x86.
Larrabee, as disclosed, will derive much of its FP performance from an enhanced vector set and will not have the massive OOE scheduler overhead or a pipeline designed to cater to high clock speeds.
I think HPC programmers/users want to write their code in FORTRAN/C-ish rather than assembly, or they want to use libraries. Obviously libraries should already be robust on x86 and be making excellent use of SSE. OpenCL/CUDA are going to take a long time to catch up in the breadth and depth of library support.
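For what it's worth, here's a toy sketch of that "just lean on the libraries" path in Python - NumPy standing in for whatever tuned BLAS an HPC shop already links against (an illustration of the point, not a benchmark):
[code]
# The matrix multiply below is dispatched by NumPy to whatever optimised BLAS it was
# built against, which on x86 will typically already be using SSE - the programmer
# writes FORTRAN/C-ish array code and never touches intrinsics or assembly.
import numpy as np

n = 2048
a = np.random.rand(n, n)
b = np.random.rand(n, n)
c = a @ b   # dgemm under the hood, roughly 2*n**3 floating-point operations
print(c.shape)
[/code]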
And yes, there's a minefield between you and maximum GPU performance, but I don't think it's GPU functionality, per se, that's getting in the way now.
In HPC systems TDP will be observed more strictly, I expect. Lower clocks are an easy fix for specific configurations - e.g. 1U versus 4U configurations. It's not rocket science.
There's the one.
We know a more diverse software base will reveal other examples of code that hits utilization high enough to exceed TDP in the case of CPUs.
Let's find masses of GPUs being killed by GPGPU workloads before accusing them of being unable to sustain HPC workloads...
In AMD's case, a significant portion of the time the GPU is CPU-limited, or so I interpret from the number of times I've seen CAL having the finger pointed at it for non-ideal work units.
I think it's just a matter of the application, i.e. arithmetic intensity - there's no reason for the end-game to consist of being bogged down by external factors. You're no longer forced into lock-waiting the CPU while the GPU works, for example - that was just an aberration of early implementation.
We have one anecdote saying it won't happen for GPGPUs.
I suppose both AMD and Nvidia can just point out the fundamental weakness of their slave cards always being at the mercy of the host processor, the expansion bus, and their software layer.
The initial instantiation of Larrabee should have the same problem, unless Intel shifts in its position with regards to Larrabee in HPC. Lucky for the GPGPU crew.
F@H on ATI is clearly hobbled by the Brook environment's history and by the pure streaming approach it originally enforced - OK, so Brook enabled it in the first place, but it's now a drag. Brook+ just seems doomed to me - it's perhaps a nice teaching language for throughput/stream computing but I just can't see people using it in anger for any longer than they have to.
GPGPU is past its version 1.0-itis, at least NVidia's.
I think you're being too literal - NVidia doesn't have to specify Tesla to be within an inch of the relevant TDP. Against CPUs it's home and dry. Against Larrabee DP, well, a 10% clock drop isn't going to be the issue there.
Nvidia's Tesla TDP is 160 Watts, versus 236 for the related card not running CUDA.
So we can attribute close to 1/3 of the total heat output to the ROP, raster, and texturing special hardware.
40nm would allow for a blind doubling of everything. The improvement in power terms was modest and most definitely not a halving of power consumption. If rumors turn out to be true, the power improvement may be smaller.
A CUDA-only load would be awfully close to 300 W, and would be over if power savings are close to 0 in some worst-case scenario. That is assuming no changes in the ratio of ALU to special-purpose hardware, though all the speculation seems to be upping the ALU load, not reducing it.
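Putting rough numbers on that chain of reasoning - the wattages above taken at face value, the "power saving" factor is purely my assumption:
[code]
# Back-of-envelope using the figures quoted above (approximations, not measurements).
tesla_tdp = 160.0    # W - Tesla board running CUDA only
gtx280_tdp = 236.0   # W - related consumer card with ROP/raster/texture active

share = (gtx280_tdp - tesla_tdp) / gtx280_tdp
print(f"ROP/raster/texture share of board power: {share:.0%}")   # ~32%, i.e. close to 1/3

# A "blind doubling" at 40nm of the ALU-dominated load, for a few assumed power savings:
for saving in (0.0, 0.1, 0.2):
    print(f"{saving:.0%} saving -> {2 * tesla_tdp * (1 - saving):.0f} W")   # 320, 288, 256 W
[/code]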
Yes - but their FLOPs/mm2 are so high (2x NVidia's SP, way more DP) they can afford to downclock for margin.
AMD's slack with an architecture with an even higher ALU:TEX ratio and smaller number of ROPs is much less.
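If that ~2x per-mm2 advantage is roughly right, a toy calculation shows how much clock can be traded away for margin before the lead evaporates (the 2x figure is simply lifted from the claim above):
[code]
# FLOPS/mm^2 scales linearly with clock for a fixed design, so a downclock for
# power/yield margin erodes an assumed 2x area-efficiency advantage like this:
advantage = 2.0
for downclock in (0.0, 0.1, 0.2, 0.3):
    print(f"{downclock:.0%} downclock -> {advantage * (1 - downclock):.2f}x per mm^2")
[/code]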
Will Larrabee have taken 4 years by the time we can buy one? I don't know, close I think - and it's functionally a lot simpler than a GPU, massively so. My main point is that Intel's going to be in an environment where real new products up the ante every 6 months.
G80's purported design cycle took 4 years, if we believe Anandtech.
What number do we give Larrabee, and how much more would that be?
I also think that Intel only has a couple of years, maximum, of pain with Larrabee before they're in the lead for good.
Fingers-crossed the D3D11 generation is here on time - I want to build a PC based on one...
CPU design cycles are in the range of 3-5 years. GPUs seem to be roughly 3 years, then going by recent history we add two or three quarters of delay on top of that.
That's why I was quite specific about consumer products.
Nvidia's GPUs are not as big as the biggest CPUs Intel has produced.
Both RV770 and GT200 have on-board automation/optimisation of power usage that's BIOS/driver controllable. Have you seen the idle consumption of GTX285 and HD4670? 9W for the latter, FFS.
Intel's advances in power regulation appear to exceed anything GPUs manage to do when they aren't forced by the driver to throttle.
I think that's moot once you take into account the monster register files, the hierarchy of automatic scheduling (cluster->batch->instruction/clause->ALU/TMU) and the tight integration of memory controller and cache threading. Intel had to find a way to spend transistors - once they built a fast "single-core" ALU their only options were cache and varieties of OoOE and superscalar issue.
Intel's version of hardware scheduling is much more complex than what either Nvidia or AMD does, and that is part of the reason they can afford the ALU density they have.
God I wish Transputer had taken hold back in the day.
I'm outta my depth on what NVidia's doing there.
Also, I forgot to note that exceptions on the CPUs can be precise.
Yes, there's a hell of a lot of cache on there:
Intel's Penryn is 410M per 110 mm2.
It does lag 55nm, as it is ~3.7M transistors per mm2 versus ~4M, yes.
The density is somewhat disingenuous, since so much is cache and the logic itself is significantly harder to shrink for a high-speed design with a lot of complexity.
http://www.intel.com/pressroom/archive/releases/20070328fact.htm
I don't know what degree of "custom" AMD has achieved in RV770. I'm also a bit suspicious of the 956M transistor count for RV770, because supposedly they don't actually count the transistors they implement. For all we know that count is nothing more than the area of the chip * the standard cell density.
There's also a big fudge factor at the edges of the GPU with all that physical I/O stuff.
Taken at face value, TSMC's "45nm" is >700M per 100mm2 - that's ~ double Penryn. TSMC's 55nm referenced to 65nm is actually 0.9 scaling (as opposed to "theoretical" 0.72), so 55nm at TSMC isn't much of a clue to the gap between Intel and TSMC. Obviously staggered timescales play their part too...
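Working those densities through (die areas as quoted, except RV770's which is my assumption):
[code]
# Millions of transistors per mm^2, from the figures in this thread.
penryn = 410 / 110          # Intel 45nm Penryn: ~3.7
rv770 = 956 / 256           # TSMC 55nm RV770, assuming a ~256 mm^2 die: ~3.7 (the "~4"
                            # earlier presumably uses a smaller area or a nominal TSMC figure)
tsmc_45_claim = 700 / 100   # TSMC's ">700M per 100mm2" claim: ~7, roughly double Penryn

print(f"Penryn {penryn:.1f}, RV770 {rv770:.1f}, TSMC 45nm claim {tsmc_45_claim:.1f} M/mm^2")

# Half-node scaling: a "theoretical" 65nm -> 55nm linear shrink gives (55/65)^2 ~= 0.72
# of the area, versus the ~0.9 TSMC reportedly delivered in practice.
print(f"Theoretical 65nm -> 55nm area scaling: {(55/65)**2:.2f}")
[/code]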
In terms of density/clocks/yields I dare say ATI GPUs are "fully adapted" to TSMC, as Intel's CPUs are fully adapted to their own fabs. NVidia's current GPUs don't seem to be that well adapted - it looks like their desire to run the ALUs at 2x clocks is not a good match for what TSMC offers - for the kind of architectural balance they've chosen with batch size and instruction-issue they appear to have no choice about the clocking ratio.
(NVidia's considerably smaller batch size is still a radically untested comparison point in these architectures - per mm2, nested branching may be a total disaster zone on ATI but relatively painless on NVidia.)
Yeah it's hard to disentangle what has been alluded to as strategic delays for 40/45nm by TSMC versus technical issues (which also seem to have led to the curtailment of 32nm options).
We don't rightfully know what Intel could manage if it tuned the design to target density with relaxed timing and power requirements.
It is also dependent on whether TSMC's 40nm quality is closer in success to 55nm or what R600 encountered on 80nm...
Yeah, Intel is constrained by not being able to sell Larrabee in large quantities for much more than $100 a go. Sure sales to the HPC crowd will rake in profit per unit, but...
Larrabee's cores are not as complex and there are serious near-term concerns about transistor variation over large chips and leakage without a metal gate stack.
Depends if Intel does to 45nm Larrabee what it did to the 45nm Nehalem dual cores.
AMD has nothing to gain by showing something it isn't ready to launch - unless it's R600 all over again.
TSMC's customers should have been at 40nm last year in theory, but maybe we'll see such products Q2 this year instead.
Intel's demoed running Windows on Westmere.
Where's the RV870 running Crysis?
I'm not sure if any RV8xx GPUs are actually up and running - apparently sampling in Q2 if CJ's to be believed. Seems likely there'll be something running if that's true.
Does Intel vary the pipeline organisation for the slower clocked mobile parts?
FP has more stages than the INT pipeline, but the latter has tracked clock speeds more closely for CPUs.
Anyway, if Larrabee is ~2B transistors, can they fit 32 cores, 8MB of L2 and 32 TMUs?
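Crude budget for that, with the 6T cell count and the neglect of tags/ECC/uncore entirely my assumptions:
[code]
# Rough transistor budget for a ~2B-transistor Larrabee (assumptions, not disclosures).
total = 2_000_000_000
l2_bytes = 8 * 1024 * 1024               # 8MB of L2
l2_transistors = l2_bytes * 8 * 6        # 6T SRAM cells, ignoring tags/ECC/redundancy

remaining = total - l2_transistors
cores = 32
per_slice = remaining // cores           # what's left per core plus its TMU, ring stop,
                                         # and share of memory controllers and I/O
print(f"L2 arrays ~{l2_transistors/1e6:.0f}M, leaving ~{per_slice/1e6:.0f}M per core+TMU slice")
[/code]
~400M transistors just for the L2 arrays and around 50M per core+TMU slice doesn't look obviously impossible, but it's tight.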
Jawed