22 nm Larrabee

aaronspink said:
I would say that this is at best speculation. Using the same viewpoint, I could argue that CPUs maintain millions of active thread contexts.

And you would be flat out incorrect. I think you should read up on the rudiments of GPU architecture before lecturing others.
 
And that "cheaper" solution will take 2-40x the programming effort and deliver 1/10th to 1/5th the effective performance if current trends hold.

The current trend is that CPUs are slow and the vast majority of the burden for GPU programming falls to the IHVs. ISVs don't program GPUs; they write applications that call DirectX and OpenGL functions.

Or to put it simpler: CPU Flops >> GPU Flops.

If that's the case, why are CPUs still so slow at rendering?

I would say that this is at best speculation. Using the same viewpoint, I could argue that CPUs maintain millions of active thread contexts.

Depends on your definition of active, I guess. If thread state doesn't need to be restored before execution can proceed, a hardware thread can be considered active.

Doesn't matter much though. Nick's argument is that somehow GPUs need the same level of ILP extraction as CPUs in order to be efficient at complex workloads. That's patently false.
 
And you would be flat out incorrect. I think you should read up on the rudiments of GPU architecture before lecturing others.

I'm well aware of current GPU architecture and don't see anything which contradicts my statement. The number of actual hardware contexts supported is much less than advertised. There are a lot of semantic games being played with the marketing materials.
 
The current trend is that CPUs are slow and the vast majority of the burden for GPU programming falls to the IHVs. ISVs don't program GPUs; they write applications that call DirectX and OpenGL functions.

For computational tasks this isn't at all true. The current data points to GPUs squandering their raw flops advantage via programming model/architecture complexities and then falling well behind on time to solution due to the increased effort required to write any non-trivial program for them. PRACE has a lot of good data.



If that's the case, why are CPUs still so slow at rendering?

They aren't. They are slower than fixed-function hardware at simple rasterization and texture filtering. For other categories of rendering they are as fast if not faster, and a hell of a lot more flexible. For computation, they also hold their own despite having an order of magnitude less raw flops.


Doesn't matter much though. Nick's argument is that somehow GPUs need the same level of ILP extraction as CPUs in order to be efficient at complex workloads. That's patently false.

If you are calling graphics complex workloads, then I think Nick wins this one ;)

The current data from people actually running computationally intense workloads suggests that GPUs squander a large portion of their flops due to quirks and limitations of the designs, resulting in performance that is not much different from current CPUs on real-world problems. In that regard I would say Nick is mostly right.

GPUs still tend to excel at relatively simple problems where mostly fixed-function hardware can be thrown at the problem: rasterization and texture filtering.
 
Not really. A large part of it is due to programming model interactions/issues. GPUs are very much inflexible and un-dynamic.

I think it would help if you were a little less terse.

Their mem hierarchy is less flexible, but that is about it. Other than that, GCN integrated with an x86 core at the MMU level should do just fine. Granted, I'd prefer to make the scalar unit in GCN a bit more useful. But the way I see it, the programming model that AMD is proposing is 80%-90% there.
 
The current data from people actually running computationally intense workloads suggests that GPUs squander a large portion of their flops due to quirks and limitations of the designs, resulting in performance that is not much different from current CPUs on real-world problems. In that regard I would say Nick is mostly right.

I'd wager many of those problems arise from being stuck on the wrong side of the PCIe bus. As such, I would expect many of them to go away once GPUs are properly integrated.
 
I'd wager many of those problems arise from being stuck on the wrong side of the PCIe bus. As such, I would expect many of them to go away once GPUs are properly integrated.
If by that you mean bringing them into the CPU socket, then that socket would need several times higher memory bandwidth as well. A big part of why GPUs are that much better at some tasks than CPUs comes down to their raw memory bandwidth.
 
I think some people here forget that every GPU needs a CPU. So why would someone try to make a CPU from a GPU if you already have one? (For AMD it's an even bigger no-brainer.)

Also, on the compute side: it's much cheaper to buy a quad-GPU setup for compute than an overpriced quad-socket mobo with 4 overpriced Xeon CPUs. ;)
 
No, they're not. Lots of workloads look something like this:
Code:
sequential code

repeat N times
{
    independent iterations
}

sequential code

repeat M times
{
    dependent iterations
}

...
The only thing for which a GPU's thread scheduling can potentially approach out-of-order execution is the loop with independent iterations. But even then the data is accessed 'horizontally' instead of 'vertically', making the caches less effective. When the loop is short, out-of-order execution will start executing multiple iterations simultaneously, but only for as many instructions as necessary to keep the execution units busy. By keeping the data accesses as local as possible, the benefit of the precious small caches is maximized.

No amount of OoO can work around true dependencies. Whatever independent ALU ops OoO can issue, the compiler can schedule just as well. Whatever memory latencies OoO can hide, the compiler/programmer can prefetch for just as well.

And unless N is ginormous, vectors and cores aren't going to do squat.
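
To make that concrete, here's a minimal C sketch of the two loop shapes (made-up code, purely illustrative): in the first, the iterations are independent, so the compiler can vectorize/unroll them just as readily as OoO hardware can overlap them; in the second, the recurrence on acc is a true dependency that neither can break.
Code:
/* Independent iterations: compiler can vectorize/unroll this just as
   easily as OoO hardware can overlap the iterations. */
void independent_loop(float *out, const float *in, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = in[i] * 2.0f + 1.0f;
}

/* Dependent iterations: each step needs the previous result, so the
   chain stays serial no matter how wide the machine is. */
float dependent_loop(const float *in, int m)
{
    float acc = 1.0f;
    for (int i = 0; i < m; i++)
        acc = acc * in[i] + 1.0f;
    return acc;
}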
 
If by that you mean bringing them into the CPU socket, then that socket would need several times higher memory bandwidth as well. A big part of why GPUs are that much better at some tasks than CPUs comes down to their raw memory bandwidth.

a) More RAM.

b) Closer to IO.

c) Ability to gang together multiple chips in a cache-coherent manner.

d) Close integration with the CPU.
 
Also, on the compute side: it's much cheaper to buy a quad-GPU setup for compute than an overpriced quad-socket mobo with 4 overpriced Xeon CPUs. ;)

I think this is only true if you compare consumer-level graphics boards with enterprise-level CPUs. If you go down the FireStream/Tesla route, GPUs aren't that cheap anymore (and they still need a decent host system).
 
OK, let's entertain the idea of a super-CPU. An 8-core AVX2 chip @ 4 GHz gives you a single precision teraflop. Sounds impressive until you look at perf/flop or perf/$. That CPU on 22nm will be $400+ unless Intel becomes a charity. The equivalent 28nm GPU will be $150, earlier to market and also offer higher performance. Not so impressive after all - a $120 CPU and $150 GPU will be cheaper and faster than your homogeneous setup.
An 8-core Haswell CPU doesn't have to cost $400+ at all. Sandy Bridge is a tiny chip compared to AMD's 6-core Thuban, and the latter currently sells for $160-180 (and has no IGP). So producing a cheap 8-core Haswell CPU will hardly be a challenge.

The actual pricing really just depends on the competition. If AMD has even the slightest success with its APUs, Intel can drop its prices at will and still make a bigger profit. And if NVIDIA threatens their position in the HPC market, cheap and highly efficient teraflop CPUs will fix that right away.
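
(For reference, I assume the teraflop figure in the quote comes from two 256-bit FMA ports per core:)
Code:
8 cores x 4 GHz x 2 FMA ports x 8 SP lanes x 2 flops per FMA = 1024 GFLOPS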
And as mentioned before, you have to spec for peak, not average. Texturing workloads aren't spread evenly across a frame, and no amount of CPU magic will change that. If you spec your CPU for average texture throughput, it will be severely bottlenecked during texture-heavy portions of the frame.
And as I've said before, a CPU with AVX2 support would have higher peak texturing performance than the IGP. And it really does magically balance any workload since it's all just different bits of code executing on the same cores. So really all that matters is the sum of the average required performance for each task. There are no bottlenecks in software rendering, only hotspots.
GPUs have a lot of FF silicon but much of it is active in parallel and not bottlenecking anything.
When the TEX:ALU ratio is too high, GPUs easily get bottlenecked by the texture samplers. Also when the shaders are too short, they get ROP limited.
CPUs are more flexible but also slow - it's a simple tradeoff.
As I've pointed out before, it took only ~5% of extra die space to double the throughput between AMD's Brisbane and Barcelona architectures. AVX and FMA each double the peak throughput again at a minor increase in size. So just because current CPUs are primarily optimized for ILP doesn't mean they can't cheaply achieve high throughput as well. It's not a simple tradeoff in which you have to completely sacrifice one quality to obtain another.
You say that CPUs don't need to burn a lot of flops emulating FF hardware, so where are the software renderers that prove that out?
SwiftShader is faster than it 'should be' when you add up all of the computing power it would take to emulate all of the GPU's fixed-function hardware at peak throughput. It currently merely uses 128-bit SIMD, destructive instructions, no FMA, fetching texels takes 3 uops each before filtering, it doesn't have lookup tables to compute transcendental functions, it has to compute gradients and pixel coverage for each polygon, explicitly read / test / write depth values, interpolate inputs, blend and convert output colors, schedule tasks, process API calls, etc. And yet despite that seemingly massive amount of work it achieves 20 FPS for Crysis on a 100 GFLOP CPU which also has to run the game itself and some processes in the background.

So don't underestimate what a software renderer could do with four times the GFLOPS per core, and gather.
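
To illustrate the texel-fetch point, here's a rough sketch (made-up code; real texel fetches also involve addressing and format conversion) of eight indexed fetches without and with AVX2's gather:
Code:
#include <immintrin.h>

/* Pre-AVX2: eight indexed fetches assembled lane by lane. */
static __m256 fetch8_emulated(const float *texels, const int idx[8])
{
    return _mm256_set_ps(texels[idx[7]], texels[idx[6]],
                         texels[idx[5]], texels[idx[4]],
                         texels[idx[3]], texels[idx[2]],
                         texels[idx[1]], texels[idx[0]]);
}

/* AVX2: the same eight fetches as a single gather instruction. */
static __m256 fetch8_gather(const float *texels, __m256i idx)
{
    return _mm256_i32gather_ps(texels, idx, 4);
}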
Sorry, it's just not going to happen :)
Never say never, especially when you're just handwaving. Exactly what unique advantage would an APU have left over a homogeneous CPU with AVX2 and AVX-1024? Peak performance/area and performance/Watt would be highly competitive. And I've already debunked the importance of fixed-function components. Add to this the effect of out-of-order execution on cache effectiveness and I can't see what substantial advantage heterogeneous architectures could possibly have.
 
SwiftShader is faster than it 'should be' when you add up all of the computing power it would take to emulate all of the GPU's fixed-function hardware at peak throughput. It currently merely uses 128-bit SIMD, destructive instructions, no FMA, fetching texels takes 3 uops each before filtering, it doesn't have lookup tables to compute transcendental functions, it has to compute gradients and pixel coverage for each polygon, explicitly read / test / write depth values, interpolate inputs, blend and convert output colors, schedule tasks, process API calls, etc. And yet despite that seemingly massive amount of work it achieves 20 FPS for Crysis on a 100 GFLOP CPU which also has to run the game itself and some processes in the background.
SwiftShader is DX9. When SwiftShader is DX11 and can beat Llano or its equivalent, let us know.
 
SwiftShader is DX9. When SwiftShader is DX11 and can beat Llano or its equivalent, let us know.
What exactly would be the difference? More flexible shading? CPUs already have that; SwiftShader just isn't ported over to it (yet). It's not like DX11 magically needs a whole lot more FLOPS or can use GPUs far more efficiently to render images than DX9 can.
 
What exactly would be the difference? More flexible shading? CPUs already have that; SwiftShader just isn't ported over to it (yet). It's not like DX11 magically needs a whole lot more FLOPS or can use GPUs far more efficiently to render images than DX9 can.
Implementing a 7-year-old pipeline in software is different from implementing a modern API with a very different data flow.

Hell, even Intel couldn't implement DX10 efficiently enough on a chip that had MANY more cores and wider vectors. Suffice it to say that I am not underestimating the limits of pure software rendering. :)
 
They aren't. They are slower than fixed-function hardware at simple rasterization and texture filtering. For other categories of rendering they are as fast if not faster, and a hell of a lot more flexible. For computation, they also hold their own despite having an order of magnitude less raw flops.

GPUs still tend to excel at relatively simple problems where mostly fixed-function hardware can be thrown at the problem: rasterization and texture filtering.
So... if the only problem of CPUs (regarding 3D/gaming graphics) is the lack of an FF rasterizer and FF TFUs, would a 250mm² CPU core equipped with these FF units be faster than current ~250mm² GPU cores? I doubt it... LRB was equipped with FF texturing units (only rasterization was emulated) and it was still nowhere near contemporary GPUs. I think FF hardware is only one of the reasons for GPUs' rendering efficiency.
 
A GPU core currently maintains thousands of active thread contexts. A CPU core has at most two. Why would GPUs need to reduce their thread counts to CPU levels in order to support deeper stacks or more registers per thread? They are at a three-orders-of-magnitude advantage!
First of all, let's please compare apples to apples. I don't mind if you call a work item / strand / SIMD lane a thread if it's clear from the context what you mean, but with all due respect this is just ridiculous.

A quad-core CPU with Hyper-Threading concurrently runs 8 fully independent thread contexts. That's a kernel in GPU lingo, and it currently takes a high-end GPU to be able to run 8 kernels concurrently. Next, today's CPUs are equipped with dual 256-bit vector units. That's 64 scalar strands executing simultaneously. And you can statically schedule as many strands in a thread as you like (but you probably want to stop before the register pressure gets too high). AVX-1024 would allow a fourfold increase in strand count without increasing register pressure.

So a GPU has absolutely no advantage when it comes to strand count. On the contrary. You need to carefully balance it to achieve high throughput. But the amount of wiggle room depends on the workload; you can't just vary it at will. And you definitely can't expect developers to write a dozen implementations to get good performance on everything ranging from an IGP to a multi-GPU setup and across vendors!
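
(Spelling those numbers out for a quad-core part, assuming both 256-bit units can be kept busy:)
Code:
threads : 4 cores x 2-way Hyper-Threading              =   8 independent contexts (kernels)
strands : 4 cores x 2 units x (256-bit / 32-bit) lanes =  64 strands per cycle
AVX-1024: 4 cores x 2 units x (1024-bit / 32-bit)      = 256 strands, same register pressure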
Even a small reduction in thread counts would be of huge benefit and still not require the immense burden of ILP extraction that CPUs bear.
Even when the number of strands is reduced, the throughput of a GPU still has to come from jumping from one warp / wavefront to the next, and this gives you horrible cache hit rates for data reuse within a strand. With a CPU the out-of-order execution allows you to keep a high data locality. You're not forced to a single 'direction' though. You can use the 'vertical' access pattern within a strand, or the 'horizontal' access pattern across strands, or a mix of both.

Out-of-order execution really isn't such an "immense burden". It has been implemented in x86 processors since 1995. Nowadays only a minor fraction of the core is responsible for achieving out-of-order instruction execution. Using a physical register file made it even smaller, and AVX2 will further reduce the ratio of control logic versus data paths.
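
A rough C sketch of those two access 'directions' (made-up layout, purely illustrative): walking one strand's data keeps its cache lines hot, while advancing all strands one step per pass makes the working set scale with the strand count.
Code:
#define STRANDS 64
#define STEPS   256

float data[STRANDS][STEPS];

/* 'Vertical': one strand walks its own contiguous data,
   so the same few cache lines stay hot. */
float vertical(int s)
{
    float acc = 0.0f;
    for (int t = 0; t < STEPS; t++)
        acc += data[s][t];
    return acc;
}

/* 'Horizontal': all strands advance one step per pass, so the
   working set is STRANDS cache lines wide at every step. */
void horizontal(float acc[STRANDS])
{
    for (int t = 0; t < STEPS; t++)
        for (int s = 0; s < STRANDS; s++)
            acc[s] += data[s][t];
}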
For OoO execution, nVidia is already most of the way there. I wouldn't be surprised to learn that arithmetic instructions can issue out of order in a narrow window. The scoreboarding patent didn't put any restrictions on instruction ordering.
First you call ILP extraction on a CPU a burden, and now you seem to think flexible scoreboarding on a GPU is a blessing? This is starting to sound like technological racism.
 
So a GPU has absolutely no advantage when it comes to strand count. On the contrary. You need to carefully balance it to achieve high throughput. But the amount of wiggle room depends on the workload; you can't just vary it at will. And you definitely can't expect developers to write a dozen implementations to get good performance on everything ranging from an IGP to a multi-GPU setup and across vendors!
Since the registers are tied not to the whole GPU but to the individual SIMDs, it shouldn't matter whether you use one (or more) high-end cards or a single entry-level card, as long as they're from the same generation. The vendor issue still applies of course.
 
Even when the number of strands is reduced, the throughput of a GPU still has to come from jumping from one warp / wavefront to the next, and this gives you horrible cache hit rates for data reuse within a strand. With a CPU the out-of-order execution allows you to keep a high data locality. You're not forced to a single 'direction' though. You can use the 'vertical' access pattern within a strand, or the 'horizontal' access pattern across strands, or a mix of both.
That is a problem. But for that you don't need OoO. The latencies involved are too large to be covered by a normal-sized ROB. What you need is more flexible on-chip storage, which will let you do effective vertical reuse with just a few threads.
Out-of-order execution really isn't such an "immense burden". It has been implemented in x86 processors since 1995. Nowadays only a minor fraction of the core is responsible for achieving out-of-order instruction execution. Using a physical register file made it even smaller, and AVX2 will further reduce the ratio of control logic versus data paths.
That is scalar OoO. Vector OoO is a different kettle of fish. As is fetching multiple registers per clock.


First you call ILP extraction on a CPU a burden, and now you seem to think flexible scoreboarding on a GPU is a blessing? This is starting to sound like technological racism.
GCN is a blessing. No scoreboarding at all. :)
 
You don't need a scoreboard for that. Just a better compiler will do.
Compiler technology is about as good as it will get. Current progress is often globally measured in fractions of a percent. Also keep in mind that there's limited time to compile and optimize the code. So an algorithm which produces superior code may not be usable in practice.

Compilers have to deal with a lot of unknowns. Memory access latency can vary by two orders of magnitude, and they can't predict the control flow, nor what other code will be executing concurrently. So they have to settle for a compromise and hope for the best. I've encountered plenty of cases where less optimized code ran faster though.

No amount of static optimization can replace dynamic scheduling.
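
A simple example of code where static scheduling has nothing to work with (made-up, but typical):
Code:
/* Pointer chasing: the next address is the result of the current load,
   and each step's latency (L1 hit or DRAM miss) is unknown statically.
   A compiler has to guess a schedule up front; OoO hardware reacts to
   whatever latencies actually occur at run time. */
struct node { struct node *next; int value; };

int sum_list(const struct node *n)
{
    int sum = 0;
    while (n) {
        sum += n->value;
        n = n->next;
    }
    return sum;
}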
 