22 nm Larrabee

So, you're comparing a GPU architecture from 1H2008, which was basically Nvidia's first useful attempt at GPU computing, against a hypothetical CPU architecture for a timeframe of, what, 1H2014?
No, the transistor budget would just grow proportionally for both. So you may as well factor that out and see what that graph would look like if you compared an equally sized APU against a homogeneous CPU with AVX3.

The CPU catches up in compute density and becomes a lot more efficient with gather support, while the APU has less room for its GPU cores. The APU can increase its efficiency a bit with dynamic scheduling and caches, but both further reduce the compute density.

So GPGPU has no future.
 
No, the transistor budget would just grow proportionally for both. So you may as well factor that out and see what that graph would look like if you compared an equally sized APU against a homogeneous CPU with AVX3.

The CPU catches up in compute density and becomes a lot more efficient with gather support, while the APU has less room for its GPU cores. The APU can increase its efficiency a bit with dynamic scheduling and caches, but both further reduce the compute density.

So GPGPU has no future.

I disagree. Using an analogy, I'd rather compare the 2008 GPU to the first vector extensions for CPUs. And despite using more transistors for it, their capabilities have increased more than proportionally, as is the case with GPUs. From GTX 285 to GTX 580, transistor count has increased by 2.14x, whereas MADD/FMA GFLOPS went up 2.11x. And I'd wager that the usability of those GFLOPS is better as well.
 
No amount of OoO can work around true dependencies.
CPUs still win in those cases since the absolute latencies are lower.

And any practical workload has a high amount of ILP which can't be extracted using in-order execution. Out-of-order execution helps achieve a high issue rate with fewer threads, lowering the cache contention and thus increasing effective performance.
Whatever independent ALU ops OoO can issue, the compiler can schedule as well. Whatever mem latencies OoO can hide, compiler/programmer can prefetch as well.
No it can't. The compiler cannot know that a certain memory access hit the cache and thus finished sooner than expected, allowing dependent instructions lower in the instruction stream to be issued ahead of instructions that are still waiting on a miss to resolve.
 
CPUs still win in those cases since the absolute latencies are lower.

And any practical workload has a high amount of ILP which can't be extracted using in-order execution. Out-of-order execution helps achieve a high issue rate with fewer threads, lowering the cache contention and thus increasing effective performance.

Again, cost of vector OoO >> expected speedup even when few threads are in flight.

No it can't. The compiler cannot know that a certain memory access hit the cache and thus finished sooner than expected, allowing dependent instructions lower in the instruction stream to be issued ahead of instructions that are still waiting on a miss to resolve.
The compiler issues loads ahead of time. It doesn't track completion of loads; that's the job of the scoreboard / memory fences.
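To illustrate the software-prefetch side of this argument, here is a minimal sketch (my own illustration, not from the thread; the function name and the PREFETCH_DISTANCE constant are made up) of a programmer/compiler issuing prefetches far enough ahead that the load latency is covered without dynamic scheduling:

```c
#include <xmmintrin.h>   /* _mm_prefetch */

#define PREFETCH_DISTANCE 16   /* tuning parameter, workload dependent */

float sum_indexed(const float *data, const int *index, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        /* request the line we'll need PREFETCH_DISTANCE iterations from now */
        if (i + PREFETCH_DISTANCE < n)
            _mm_prefetch((const char *)&data[index[i + PREFETCH_DISTANCE]],
                         _MM_HINT_T0);
        sum += data[index[i]];   /* the load the prefetch tries to cover */
    }
    return sum;
}
```

What this cannot do, as the quoted reply above points out, is adapt at run time to which of those accesses actually hit or miss the cache.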
 
AVX-1024 would allow a fourfold increase in strand count without increasing register pressure.
This does assume native 1024-bit registers. Otherwise, there is pressure, just not pressure that is obvious to software.

Out-of-order execution really isn't such an "immense burden". It has been implemented in x86 processors since 1995. Nowadays only a minor fraction of the core is responsible for achieving out-of-order instruction execution. Using a physical register file made it even smaller, and AVX2 will further reduce the ratio of control logic versus data paths.
Lumping together the FPU scheduler, the rename stages, and the 100+ rename registers, that's about 1/4 of the FPU.
In the L/S section, the two units take up somewhat less than half of the non-cache area, with LS2 responsible for holding results until the instruction is retired in the ROB, and this is a much less interesting memory pipeline than BD's or any current Intel desktop chip's.
Pretty much everything not an ALU or AGU in the integer section is part of the OoO engine or designed to work with it, so maybe 3/4.
I think half of the non-cache area of those parts would be heavily influenced by the OoO engine.

The front end is relatively free, since it is in-order at that point, with 1/2 of the non-cache area going to wide x86 issue.
 
Implementing a 7 year old pipeline in sw is different from implementing a modern API with very different data flow.
The front-end is quite different, but what exactly changed in a GPU at the hardware level to enable DX10/11 that you think would pose a big problem for a software renderer? None of the DX9 fixed-function hardware caused any trouble, so I'm curious why you think DX11 is any different.
 
I'm wondering if Nick sees any use in doing software rendering (by emulating a hardware API) on CPUs that have built-in GPUs.
All CPUs now have built-in GPUs, except for ultra-high-end or server CPUs, where IGPs would be a waste of silicon as they are paired with discrete GPUs or not used for rendering at all.
 
The front-end is quite different, but what exactly changed in a GPU at the hardware level to enable DX10/11 that you think would pose a big problem for a software renderer? None of the DX9 fixed-function hardware caused any trouble, so I'm curious why you think DX11 is any different.

a) If it's all peachy, then Intel would have done it with lots of cores, vectors and scatter/gather.

b) The data flow is modified by geometry shaders and tessellation.

c) If emulating them is not a problem, then why is SwiftShader stuck at dx9? Why can't/doesn't it do dx11?
 
For computational tasks this isn't at all true. The current data points to GPUs squandering their raw FLOPS advantage via programming-model/architecture complexities and then falling well behind in time to solution due to the increased effort required to program any non-trivial application on them. PRACE has a lot of good data.

Agreed, however GPU programming models and architecture limitations aren't the only reasons for that inefficiency. CPU programming and its various vector extensions are a well known art. Don't underestimate the benefit of developer experience to the high efficiency of CPU based code. GPUs are continuously closing that gap as well.

They aren't. They are slower than fixed-function hardware at simple rasterization and texture filtering. For other categories of rendering they are as fast if not faster, and a hell of a lot more flexible. For computation they also hold their own despite having an order of magnitude less raw FLOPS.

Ok, that's fair. Now does that mean CPUs are just very slow at emulating texture filtering and rasterization? If that's the case, why does Sandy Bridge have a full-blown IGP and not just a few texture units slapped onto its x86 cores?

If you are calling graphics complex workloads, then I think Nick wins this one ;)

Yes, graphics. That's what GPUs do after all and what Nick is claiming CPUs can do as well.

GPUs still tend to excel at relatively simple problems where mostly fixed-function hardware can be thrown at the problem: rasterization and texture filtering.

If it was that simple then CPUs would have incorporated those fixed function units and everything would be peachy. The facts just don't support these claims. The reality is that CPUs are slow as balls at these "relatively simple problems".
 
The actual pricing really just depends on the competition. If AMD has even the slightest success with its APUs, Intel can drop its prices at will and still make a bigger profit. And if NVIDIA threatens their position in the HPC market, cheap and highly efficient teraflop CPUs will fix that right away.

That's a lot of assumptions. The past few years say otherwise - fast 8-core processors in 2013 won't be cheap. The workloads to drive those configurations are still far too specialized for Intel to start giving cores away. Even AMD with its many-core strategy will only have 2 FP pipelines on consumer level chips.

And as I've said before, a CPU with AVX2 support would have higher peak texturing performance than the IGP. And it really does magically balance any workload since it's all just different bits of code executing on the same cores. So really all that matters is the sum of the average required performance for each task. There are no bottlenecks in software rendering, only hotspots.

Ah, sorry didn't realize we were talking about software texture filtering. So no specific hardware bottlenecks as it's all just code but there's still the question of absolute performance. Texture filtering in software is very GPU friendly yet GPUs still have dedicated units. Obviously the perf/mm/watt advantage of FF hardware is still significant.

As I've pointed out before, it took only ~5% of extra die space to double the throughput between AMD's Brisbane and Barcelona architectures. AVX and FMA each double the peak throughput again at a minor increase in size. So just because current CPUs are primarily optimized for ILP doesn't mean they can't cheaply achieve high throughput as well. It's not a simple tradeoff in which you have to completely sacrifice one quality to obtain another.

AVX used existing data paths and FMA adds a single operand. Once you have FMA how do you scale from there? Does each core get ever wider vector units which is just more silicon that's idling for most workloads? Or do you add even more cores which won't be fully utilized by lightly threaded serial code?

SwiftShader is faster than it 'should be' when you add up all of the computing power it would take to emulate all of the GPU's fixed-function hardware at peak throughput. It currently merely uses 128-bit SIMD, destructive instructions, no FMA, fetching texels takes 3 uops each before filtering, it doesn't have lookup tables to compute transcendental functions, it has to compute gradients and pixel coverage for each polygon, explicitly read / test / write depth values, interpolate inputs, blend and convert output colors, schedule tasks, process API calls, etc. And yet despite that seemingly massive amount of work it achieves 20 FPS for Crysis on a 100 GFLOP CPU which also has to run the game itself and some processes in the background.

I would love to play around with SwiftShader if possible and see it in action first hand. I have quite a few DX9 games to throw at it. Is it stable and freely available?

Never say never, especially when you're just handwaving. Exactly what unique advantage would an APU have left over a homogeneous CPU with AVX2 and AVX-1024? Peak performance/area and performance/Watt would be highly competitive. And I've already debunked the importance of fixed-function components. Add to this the effect of out-of-order execution on cache effectiveness and I can't see what substantial advantage heterogeneous architectures could possibly have.

Well hand-waving is often the only available response to hand-waving by the other side. You're asking us to take many things on faith with no practical implementations to back them up. I actually don't think you've debunked anything with respect to GPU performance or their advantages over CPUs :)
 
I disagree. Using an analogy, I'd rather compare the 2008 GPU to the first vector extensions for CPUs. And despite using more transistors for it, their capabilities have increased more than proportionally, as is the case with GPUs. From GTX 285 to GTX 580, transistor count has increased by 2.14x, whereas MADD/FMA GFLOPS went up 2.11x. And I'd wager that the usability of those GFLOPS is better as well.
GT200b: 55 nm, 1400 Mtr, 470 mm², 1063 GFLOPS
GF110: 40 nm, 3000 Mtr, 520 mm², 1581 GFLOPS

Factoring out the process shrink, that's a drop in computing density to roughly 0.63x. And yes, it got more efficient, but I think that barely compensates for the loss.

i7-99X: 32 nm, 239 mm², 166 GFLOPS
i7-2600: 32 nm, 175 mm² (w/o IGP), 218 GFLOPS

That's a factor 1.8 increase in computing density. FMA should also nearly double the computing density. Of course FMA doesn't actually double the effective performance, but note that the peak performance/mm² would be quite close to that of the GPU.
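(For reference, the 1.8x figure follows directly from the numbers quoted above; a trivial check using only the figures in this post:)

```c
#include <stdio.h>

int main(void)
{
    /* GFLOPS / die area, figures as quoted above (i7-2600 area excl. IGP) */
    double d_i7_99x  = 166.0 / 239.0;   /* ~0.69 GFLOPS/mm^2 */
    double d_i7_2600 = 218.0 / 175.0;   /* ~1.25 GFLOPS/mm^2 */
    printf("density ratio: %.2f\n", d_i7_2600 / d_i7_99x);   /* ~1.79 */
    return 0;
}
```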

So you're probably asking why some benchmarks indicate the gap to be much wider? The answer is gather support, and AVX2 is going to fix that too.

Last but not least, the GPU is worthless by itself so you have to take the area of its accompanying CPU into account as well. So again, it's a very simple conclusion: GPGPU has no future.
 
GT200b: 55 nm, 1400 Mtr, 470 mm², 1063 GFLOPS
GF110: 40 nm, 3000 Mtr, 520 mm², 1581 GFLOPS
AFAIK, you are counting the MUL in GT200 as well, which you shouldn't include.

Last but not least, the GPU is worthless by itself so you have to take the area of its accompanying CPU into account as well. So again, it's a very simple conclusion: GPGPU has no future.
And a SB die is worthless if its IGP doesn't work.
 
If it's all peachy, then Intel would have done it with lots of cores, vectors and scatter/gather.
Larrabee stands or falls by 20% performance/dollar. To be worth the heavy investments, they needed the prospect of selling it to a significant percentage of consumers from day 1. That's no small task when the expectation is to match or exceed the graphics performance of competitors who have been creating hardware highly dedicated to running Direct3D / OpenGL for over a decade. Things clearly weren't entirely up to snuff for legacy graphics to take the plunge. And their own research papers observed that even for GPGPU applications the CPU's performance wasn't lagging far behind...

However, for replacing the IGP with software rendering on a homogeneous CPU, the economics are very different. You get the unification advantage of more die space for the fraction of money consumers are willing to spend on graphics, and the expected performance level is lower. Consumers in this market segment just want things to run adequately. Unlimited forward compatibility for graphics and other high throughput applications is also a bonus, and you get a faster multi-core CPU to boot. So unlike for a discrete card, a 20% performance deficit at graphics doesn't mean the end of it. Also, in comparison it's a tiny and spreadable investment to simply extend the vector processing capabilities of the CPU. It's guaranteed to pay off in the HPC market quite quickly, especially since they're releasing the instruction set specification years before the hardware. With CPUs available to the masses Intel doesn't have to write the majority of the software itself and there's plenty of time to make the transition.

So you can't conclude anything at all about the viability of software rendering as an IGP replacement, from the failure of Larrabee to launch as a discrete graphics card.
The data flow is modified by geometry shaders and tessellation.
You didn't answer my question. What changed in GPUs at the hardware level to enable this new data flow required by geometry shaders and tessellation, which would be a serious problem for CPUs to support? If DX9 texture samplers, setup engines, ROPs and rasterizers are not an insurmountable issue, why would DX10/11 features be worse? As far as I can tell, GPUs became more CPU-like when they outgrew DX9.
If emulating them is not a problem, then why is SwiftShader stuck at dx9? Why can't/doesn't it do dx11?
If hypothetically it's not a problem from a hardware perspective, does that mean DX11 software renderers will / should appear overnight? Does this question answer your question?
 
Ok, that's fair. Now does that mean CPUs are just very slow at emulating texture filtering and rasterization?
Yes. They lack AVX2 support. Like I said before, it takes 3 uops to fetch a single texel when you have to emulate a gather operation. And AVX doesn't support 256-bit wide integer operations for filtering and rasterization. AVX2 fixes both.
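As a concrete illustration of that cost difference, here is a minimal sketch (my own, not SwiftShader code; the function names are made up) of fetching eight texels: the emulated path needs a per-lane index extraction and scalar load, while AVX2 collapses the whole thing into a single gather instruction:

```c
#include <immintrin.h>

/* Emulated gather (pre-AVX2): extract each index, form the address and do a
   scalar load per lane -- roughly the per-texel cost described above. */
static __m256 fetch_texels_emulated(const float *texels, __m256i offsets)
{
    int   idx[8];
    float out[8];
    _mm256_storeu_si256((__m256i *)idx, offsets);
    for (int i = 0; i < 8; ++i)
        out[i] = texels[idx[i]];
    return _mm256_loadu_ps(out);
}

/* AVX2: one hardware gather instruction does the same job. */
static __m256 fetch_texels_avx2(const float *texels, __m256i offsets)
{
    return _mm256_i32gather_ps(texels, offsets, 4);   /* scale = sizeof(float) */
}
```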
If that's the case, why does Sandy Bridge have a full-blown IGP and not just a few texture units slapped onto its x86 cores?
Because Sandy Bridge does not have FMA / gather / AVX-1024 yet nor sufficient cache bandwidth to sustain the peak throughput. All four are required to efficiently execute shaders and perform other graphics tasks. It's just a matter of time though.

And I'm not convinced that dedicated texture samplers are required in the long run. The average TEX:ALU ratio continues to go down, but peak rate remains high, and there's a huge variation in sampling functionality (no filtering at all to anisotropic or even custom filtering, 4 bpp to 128 bpp, 2D / cube / 3D, fetch4 / PCF, etc.). So dedicated samplers are way bigger than what you'd need for the average utilization.

I wouldn't be the least surprised if future GPUs ditched the texture samplers in favor of more shading units. The most critical thing to support them in software is high performance gather operations.
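To make the "filtering in shader code" idea concrete, here is a hedged sketch (my own illustration, not from the thread; the helper name and the fixed-point weight convention are assumptions) of one vertical lerp step between two rows of 8-bit texels using the 256-bit integer operations AVX2 adds:

```c
#include <immintrin.h>

/* Lerp two rows of 8-bit texels with a fixed-point weight w in [0, 256],
   entirely in 256-bit integer SIMD. Texels are widened to 16-bit lanes so
   the products fit. */
static __m256i lerp_rows_u8(__m256i row0, __m256i row1, int w)
{
    __m256i zero = _mm256_setzero_si256();
    __m256i a  = _mm256_unpacklo_epi8(row0, zero);   /* 8-bit -> 16-bit lanes */
    __m256i b  = _mm256_unpacklo_epi8(row1, zero);
    __m256i wb = _mm256_set1_epi16((short)w);
    __m256i wa = _mm256_set1_epi16((short)(256 - w));
    /* (a*(256-w) + b*w) >> 8; max value 255*256 fits in an unsigned 16-bit lane */
    __m256i acc = _mm256_add_epi16(_mm256_mullo_epi16(a, wa),
                                   _mm256_mullo_epi16(b, wb));
    return _mm256_srli_epi16(acc, 8);
}
```

A full bilinear filter would also do the horizontal lerp, process the high unpacked halves and repack to 8 bits, but the per-texel cost stays a handful of cheap integer ops once the texels have been gathered.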
 
And I'm not convinced that dedicated texture samplers are required in the long run. The average TEX:ALU ratio continues to go down, but peak rate remains high, and there's a huge variation in sampling functionality (no filtering at all to anisotropic or even custom filtering, 4 bpp to 128 bpp, 2D / cube / 3D, fetch4 / PCF, etc.). So dedicated samplers are way bigger than what you'd need for the average utilization.

I wouldn't be the least surprised if future GPUs ditched the texture samplers in favor of more shading units. The most critical thing to support them in software is high performance gather operations.
It seems to me that the most likely evolution for GPUs is a removal of texture filtering units while still keeping dedicated texture addressing. Filtering is indeed becoming very varied and there is demand for even much more flexibility, but addressing is still the same old thing and the computational cost per operation varies relatively less than for filtering.

Also I'm curious: how expensive is it to decode DXTC textures on a CPU? I assume it's not cheap but you can use 8-bit vector operations for that and it would benefit from AVX2 as well?

---

Nick, even if you're right (and I clearly think you're not), the fact that Intel could theoretically get away without an IGP does NOT mean that's what they will actually do. That's a very important distinction - there's plenty of internal politics that would nearly certainly delay something like that by several years even if it was possible. Also keep in mind that Intel reuses the same chips for desktops and notebooks, and has been sacrificing area efficiency for the sake of power efficiency for a long time - CPUs need to be competitive in terms of power and not just cost.
 
Larrabee stands or falls by 20% performance/dollar. To be worth the heavy investments, they needed the prospect of selling it to a significant percentage of consumers from day 1. That's no small task when the expectation is to match or exceed the graphics performance of competitors who have been creating hardware highly dedicated to running Direct3D / OpenGL for over a decade. Things clearly weren't entirely up to snuff for legacy graphics to take the plunge. And their own research papers observed that even for GPGPU applications the CPU's performance wasn't lagging far behind...
Intel has the deep pockets necessary to sell LRB cheap enough if the perf/$ gap is only 20%. IIRC, that's what they were saying before reality struck as well.

However, for replacing the IGP with software rendering on a homogeneous CPU, the economics are very different. You get the unification advantage of more die space for the fraction of money consumers are willing to spend on graphics, and the expected performance level is lower. Consumers in this market segment just want things to run adequately. Unlimited forward compatibility for graphics and other high throughput applications is also a bonus, and you get a faster multi-core CPU to boot. So unlike for a discrete card, a 20% performance deficit at graphics doesn't mean the end of it. Also, in comparison it's a tiny and spreadable investment to simply extend the vector processing capabilities of the CPU. It's guaranteed to pay off in the HPC market quite quickly, especially since they're releasing the instruction set specification years before the hardware. With CPUs available to the masses Intel doesn't have to write the majority of the software itself and there's plenty of time to make the transition.
If you replace Intel IGP by some swiftshader clone, you haven't gained or won anything. You have replaced crap by even more stinky crap.

So you can't conclude anything at all about the viability of software rendering as an IGP replacement, from the failure of Larrabee to launch as a discrete graphics card.
Considering how many resources were invested in LRB, I'd say no amount of sw optimization would add more than 10-20% at best.

You didn't answer my question. What changed in GPUs at the hardware level to enable this new data flow required by geometry shaders and tessellation, which would be a serious problem for CPUs to support? If DX9 texture samplers, setup engines, ROPs and rasterizers are not an insurmountable issue, why would DX10/11 features be worse? As far as I can tell, GPUs became more CPU-like when they outgrew DX9.

If hypothetically it's not a problem from a hardware perspective, does that mean DX11 software renderers will / should appear overnight? Does this question answer your question?
5 years since dx10, and all we get to see is mediocre performance out of a dx9 renderer. Something tells me it's not gonna happen for quite a while.
 
It seems to me that the most likely evolution for GPUs is a removal of texture filtering units while still keeping dedicated texture addressing. Filtering is indeed becoming very varied and there is demand for even much more flexibility, but addressing is still the same old thing and the computational cost per operation varies relatively less than for filtering.

Also I'm curious: how expensive is it to decode DXTC textures on a CPU? I assume it's not cheap but you can use 8-bit vector operations for that and it would benefit from AVX2 as well?

---

Nick, even if you're right (and I clearly think you're not), the fact that Intel could theoretically get away without an IGP does NOT mean that's what they will actually do. That's a very important distinction - there's plenty of internal politics that would nearly certainly delay something like that by several years even if it was possible. Also keep in mind that Intel reuses the same chips for desktops and notebooks, and has been sacrificing area efficiency for the sake of power efficiency for a long time - CPUs need to be competitive in terms of power and not just cost.

I think 8 bit filtering will still stick around for a while, even if fp16 filtering is removed, either as hw or as a dedicated instruction.
 
If you replace Intel IGP by some swiftshader clone, you haven't gained or won anything. You have replaced crap by even more stinky crap.
Errr, as frustrating as some of these arguments can be, let's not make it personal... I hope you meant this strictly because of insufficient CPU performance/power efficiency rather than anything else, because SwiftShader is still a very impressive piece of work.
Considering how many resources were invested in LRB, I'd say no amount of sw optimization would add more than 10-20% at best.
I do not believe Larrabee could have been competitive anyway but FWIW, I heard from a reliable source that software was at least as big a problem as hardware for Larrabee.
rpg.314 said:
I think 8 bit filtering will still stick around for a while, even if fp16 filtering is removed, either as hw or as a dedicated instruction.
Maybe as a stopgap to save a little bit of power consumption, yeah... But the problem is that (as Nick points out) the peak filtering rate is already much higher than the average.
 
Errr, as frustrating as some of these arguments can be, let's not make it personal... I hope you meant this strictly because of insufficient CPU performance/power efficiency rather than anything else, because SwiftShader is still a very impressive piece of work.
If we are considering technical beauty, then yes, without a shadow of doubt, SwiftShader is one of the most amazing pieces of work out there.

But if I was offered a product that did only dx9, took 4x more area clocked 3x higher to beat an Intel IGP of all things, and sucked battery life like <....> just to render my desktop, I would absolutely not buy it (and recommend the same to my friends), no matter how shiny its internals are.
 
That's a lot of assumptions. The past few years say otherwise - fast 8-core processors in 2013 won't be cheap. The workloads to drive those configurations are still far too specialized for Intel to start giving cores away. Even AMD with its many-core strategy will only have 2 FP pipelines on consumer level chips.
You think the massive range of workloads a homogeneous high-throughput CPU can run is too specialized, but APUs should get ever more GPU cores for the narrow range of applications which can benefit from OpenCL? Don't forget, homogeneous CPUs are also capable of running OpenCL, and every other high-throughput API / language out there...

Given the sizes and prices of Thuban, Llano and Bulldozer, it's not going to take long for Intel to drop its quad-core prices and let an affordable 6-core CPU take the top spot, which then paves the way for an 8-core in 2013. Also note that they'll move to 16 nm just one year later. But again, how much these parts will cost all depends on the competition's moves. What matters to the discussion is that Intel will have no trouble keeping up with APUs, and once software takes full advantage of homogeneous computing AMD will have no choice but to go homogeneous as well.
AVX used existing data paths and FMA adds a single operand. Once you have FMA how do you scale from there? Does each core get ever wider vector units which is just more silicon that's idling for most workloads? Or do you add even more cores which won't be fully utilized by lightly threaded serial code?
Choices, choices...

Indeed lightly threaded scalar code wouldn't use wider vector units nor extra cores, but neither would it use an APU's GPU cores so that's a useless argument.

Personally I think they should first implement AVX-1024 using 256-bit vector units, given the power and latency advantages. Where they go from there will largely depend on semiconductor scaling trends and future applications. It's pointless to try and predict that far ahead. What matters is that they have plenty of options. There's no reason to be any more concerned about homogeneous scaling than about heterogeneous scaling.
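A purely conceptual sketch of that idea (my own model, not an ISA definition): a 1024-bit register treated as four 256-bit slices, with one instruction's work sequenced over four cycles on the existing 256-bit datapath, so fetch/decode/scheduling work is amortized over four times as much data:

```c
/* Conceptual model only: 32 fp32 lanes = 1024 bits, processed as four
   256-bit (8-lane) slices, one slice per cycle on a 256-bit FMA unit. */
typedef struct { float lane[32]; } reg1024;

static void fma1024(reg1024 *d, const reg1024 *a,
                    const reg1024 *b, const reg1024 *c)
{
    for (int slice = 0; slice < 4; ++slice)   /* one 256-bit slice per cycle */
        for (int i = 0; i < 8; ++i) {
            int l = slice * 8 + i;
            d->lane[l] = a->lane[l] * b->lane[l] + c->lane[l];
        }
}
```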
I would love to play around with SwiftShader if possible and see it in action first hand. I have quite a few DX9 games to throw at it. Is it stable and freely available?
The DX9 evaluation demo is freely available: SwiftShader.
 