22 nm Larrabee

AFAIK, you are counting the MUL in GT200 as well, which you shouldn't.
I used the Wikipedia numbers. Feel free to correct them.
And an SB die is worthless if its IGP doesn't work.
I'm looking at how things would compare after AVX-1024 has been implemented and the IGP has been ditched (i.e. homogeneous versus heterogeneous). That could take many years but keep in mind that even on heterogeneous CPUs the concepts can be proven. So it's fine to subtract the area of the IGP in comparisons.
 
I'm looking at how things would compare after AVX-1024 has been implemented and the IGP has been ditched (i.e. homogeneous versus heterogeneous). That could take many years but keep in mind that even on heterogeneous CPUs the concepts can be proven. So it's fine to subtract the area of the IGP in comparisons.
It would still trash your battery in just painting the desktop.
 
GPUs can be downclocked to save energy as well, you know. Also, you seem to be vastly overestimating how much computational power it takes to run a modern OS UI.
 
GT200b: 55 nm, 1400 Mtr, 470 mm², 1063 GFLOPS
GF110: 40 nm, 3000 Mtr, 520 mm², 1581 GFLOPS

Factoring out the process shrink, that's a factor 0.63 reduction in computing density. And yes, it got more efficient, but I think that barely compensates for the loss.

i7-99X: 32 nm, 239 mm², 166 GFLOPS
i7-2600: 32 nm, 175 mm² (w/o IGP), 218 GFLOPS

That's a factor 1.8 increase in computing density. FMA2 should also nearly double the computing density. Of course FMA doesn't actually double the effective performance, but note that the peak performance/mm² would be quite close to that of the GPU.
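For the record, here's the back-of-the-envelope arithmetic behind those factors as a plain C sketch of my own, using only the numbers quoted above. Exactly how you factor out the 55 nm to 40 nm shrink is debatable; the transistor-density normalization below lands somewhere around 0.7 on the GPU side rather than 0.63, but the CPU factor comes out at 1.8 either way.

```c
/* Back-of-the-envelope computing-density arithmetic from the figures above. */
#include <stdio.h>

int main(void)
{
    /* GPU side: GT200b (55 nm) vs GF110 (40 nm), peak GFLOPS per mm^2. */
    double gt200b = 1063.0 / 470.0;               /* ~2.26 */
    double gf110  = 1581.0 / 520.0;               /* ~3.04 */
    /* One way to factor out the shrink: divide by the achieved gain in
     * transistor density (3000/520 vs 1400/470 Mtr/mm^2, roughly 1.94x). */
    double shrink = (3000.0 / 520.0) / (1400.0 / 470.0);
    printf("GPU density factor, shrink-corrected: %.2f\n",
           (gf110 / gt200b) / shrink);            /* ~0.7, i.e. a net loss */

    /* CPU side: both dies are 32 nm, so no correction is needed. */
    double i7_99x  = 166.0 / 239.0;               /* ~0.69 */
    double i7_2600 = 218.0 / 175.0;               /* ~1.25, IGP excluded */
    printf("CPU density factor: %.1f\n", i7_2600 / i7_99x);  /* ~1.8 */
    return 0;
}
```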
Please don't quote the bogus FLOPS figure for GT200. I was under the impression we were talking about MADD/FMA(C) FLOPS; at least I was.

With real, usable FLOPS we arrive at 3.04 versus 2.85 GFLOPS per mm², with GF110 getting the longer end of the stick. And that's including additional functionality and programmability.

Also, I cannot follow your reasoning for why you must subtract SB's IGP from the die area. By the same logic, I could argue for removing the ROPs, texture filtering and even fixed-function rasterization from the GPUs' die area.

So you're probably asking why some benchmarks indicate the gap to be much wider? The answer is gather support, and FMA2 is going to fix that too.
"The answer" costs die area, as it will on future CPUs too. Less, maybe, but the free lunch is over.

Last but not least, the GPU is worthless by itself so you have to take the area of its accompanying CPU into account as well. So again, it's a very simple conclusion: GPGPU has no future.
A CPU is equally worthless without some kind of display device somewhere... And of course, a tiny little chip somewhere on a remote terminal would suffice, but equally, a GPU won't need a massive modern multicore OoO CPU in order to run its operating system. A small Atom- or ARM-like device would do.

Plus: CPUs have had the benefit of many more hardware iterations with which to perfect programming models and compilers. GPUs, on the other hand, have only had five or so years, and even fewer generations, to evolve both their cores' potential and their programming models. I definitely see more headroom for improvement here than there is on the software side for CPUs.

I am not blind; I don't think GPUs are the be-all and end-all of processing, but neither are CPUs. CPUs buy their strength at general workloads with larger die area per core. GPUs buy their strength at more specialized workloads with their limited flexibility. CPUs adding more and wider vectors lose efficiency at their historical roots; GPUs adding more and more flexibility lose their immense speed at specialized tasks. So while I won't make a sweeping statement like "CPU HPC is dead", I'd rather say that compute is going to converge at some point where the optimum balance of flexibility and compute density is found. CPU manufacturers approach it from the high-speed serial side of things, GPU manufacturers from the massively parallel side of things.
 
I think 8-bit filtering will still stick around for a while, even if fp16 filtering is removed, either as hardware or as a dedicated instruction.
FWIW, the amount of hardware needed to do bilinear filtering (at least with 8-bit) is tiny.
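To put "tiny" in perspective, one 8-bit bilinear sample per channel is just three fixed-point lerps; a minimal scalar sketch, purely illustrative and not any particular GPU's datapath:

```c
/* Fixed-point bilinear blend of four 8-bit texel values for one channel.
 * fx, fy are the fractional coordinates scaled to 0..255. */
static inline unsigned char bilerp8(unsigned char t00, unsigned char t10,
                                    unsigned char t01, unsigned char t11,
                                    unsigned int fx, unsigned int fy)
{
    unsigned int top = t00 * (256 - fx) + t10 * fx;   /* horizontal lerp, row 0 */
    unsigned int bot = t01 * (256 - fx) + t11 * fx;   /* horizontal lerp, row 1 */
    return (unsigned char)((top * (256 - fy) + bot * fy) >> 16);  /* vertical lerp */
}
```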
 
It seems to me that the most likely evolution for GPUs is a removal of texture filtering units while still keeping dedicated texture addressing. Filtering is indeed becoming very varied and there is demand for even much more flexibility, but addressing is still the same old thing and the computational cost per operation varies relatively less than for filtering.
Why not perform addressing in the shader ALUs, defining some new instructions if necessary?
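For the sake of argument, here's roughly what the addressing work amounts to in the simplest case: a wrapped 2D fetch at a single mip level, ignoring tiling/swizzling and LOD selection. This is a hypothetical sketch; the names and memory layout are just illustrative.

```c
/* Turn normalized (u, v) into a linear byte offset for one texel.
 * pitch is the row pitch in bytes, bpp the bytes per texel.
 * A GPU would do this per lane, possibly with a few dedicated
 * instructions for the wrap/clamp and pitch math. */
static inline unsigned int texel_offset(float u, float v,
                                        unsigned int width, unsigned int height,
                                        unsigned int pitch, unsigned int bpp)
{
    int x = (int)(u * (float)width);
    int y = (int)(v * (float)height);
    x &= (int)(width  - 1);   /* wrap, assuming power-of-two dimensions */
    y &= (int)(height - 1);
    return (unsigned int)y * pitch + (unsigned int)x * bpp;
}
```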
Also, I'm curious: how expensive is it to decode DXTC textures on a CPU? I assume it's not cheap, but you can use 8-bit vector operations for it, and it would benefit from AVX2 as well?
Why would it not be cheap?
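For reference, this is roughly what BC1/DXT1 color decoding boils down to per 4x4 block. It's a scalar sketch, but the shifts, adds and 2-bit selects map readily onto 8/16-bit SSE/AVX2 operations:

```c
#include <stdint.h>

/* Decode one 64-bit BC1/DXT1 block into 16 pixels packed as R,G,B,A bytes
 * (little-endian). Per block: two RGB565 endpoint expansions, two or three
 * small interpolations, and a 2-bit select per texel. */
static void decode_bc1_block(const uint8_t *block, uint32_t out[16])
{
    uint16_t c0 = (uint16_t)(block[0] | (block[1] << 8));
    uint16_t c1 = (uint16_t)(block[2] | (block[3] << 8));
    uint32_t indices = (uint32_t)block[4] | ((uint32_t)block[5] << 8) |
                       ((uint32_t)block[6] << 16) | ((uint32_t)block[7] << 24);

    /* Expand the RGB565 endpoints to 8 bits per channel. */
    uint32_t r[4], g[4], b[4];
    r[0] = ((c0 >> 11) & 31) * 255 / 31;  g[0] = ((c0 >> 5) & 63) * 255 / 63;  b[0] = (c0 & 31) * 255 / 31;
    r[1] = ((c1 >> 11) & 31) * 255 / 31;  g[1] = ((c1 >> 5) & 63) * 255 / 63;  b[1] = (c1 & 31) * 255 / 31;

    if (c0 > c1) {   /* four-color mode: 1/3 and 2/3 interpolants */
        r[2] = (2 * r[0] + r[1]) / 3;  g[2] = (2 * g[0] + g[1]) / 3;  b[2] = (2 * b[0] + b[1]) / 3;
        r[3] = (r[0] + 2 * r[1]) / 3;  g[3] = (g[0] + 2 * g[1]) / 3;  b[3] = (b[0] + 2 * b[1]) / 3;
    } else {         /* three-color mode: midpoint plus transparent black */
        r[2] = (r[0] + r[1]) / 2;  g[2] = (g[0] + g[1]) / 2;  b[2] = (b[0] + b[1]) / 2;
        r[3] = g[3] = b[3] = 0;
    }

    for (int i = 0; i < 16; i++) {
        uint32_t sel = (indices >> (2 * i)) & 3;
        uint32_t alpha = (sel == 3 && c0 <= c1) ? 0 : 255;
        out[i] = (alpha << 24) | (b[sel] << 16) | (g[sel] << 8) | r[sel];
    }
}
```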
Nick, even if you're right (and I clearly think you're not), the fact that Intel could theoretically get away without an IGP does NOT mean that's what they will actually do. That's a very important distinction - there's plenty of internal politics that would nearly certainly delay something like that by several years even if it were possible.
So? Intel is in the best position to go homogeneous in several years. If they don't, even though it makes sense, it's their loss.
Also keep in mind that Intel reuses the same chips for desktops and notebooks, and has been sacrificing area efficiency for the sake of power efficiency for a long time - CPUs need to be competitive in terms of power and not just cost.
Which is why I'm proposing to execute AVX-1024 instructions on 256-bit units and take advantage of the many clock gating opportunities this creates. It's very close to how GPUs work, so why wouldn't the power efficiency be competitive?

And while AVX-1024 seems essential to me, I also think power efficiency is slightly overrated. How often does one really want to play hardcore games for hours on an unplugged laptop? There are plenty of laptops with quite power hungry GPUs, and there are many non-gaming applications which will drain your battery fast too. So I see little reason why a CPU wouldn't be allowed to consume up to its TDP while rendering heavy graphics. Consumers hardly ever look at the TDP at all and the advertised hours are for browsing the web or so.
 
Yes. They lack AVX2 support. Like I said before, it takes 3 uops to fetch a single texel when you have to emulate a gather operation.
Three uops?
Are you certain that the gather described in the AVX2 document can beat that?
I think it could be microcoded, and would likely split into many more than 3 in a more general multi-line gather. In the single-line case, it may split into multiple uops during execution, possibly more than 3.
It might save a few instructions in the cache, though.
 
It would still trash your battery in just painting the desktop.
Nonsense. Even a GMA 950 can render Aero smoothly, and they use lots of optimizations to ensure the graphics load is as low as possible. Heck, the windows are pixel-aligned most of the time, so with a software renderer you can just copy the texture data directly.
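In other words, composing an unscaled, pixel-aligned window degenerates into a row-by-row copy, something like this (an illustrative sketch; pitches are in pixels):

```c
#include <stdint.h>
#include <string.h>

/* Blit a pixel-aligned, unscaled window: one memcpy per row. */
static void blit_rows(uint32_t *dst, size_t dst_pitch,
                      const uint32_t *src, size_t src_pitch,
                      size_t width, size_t height)
{
    for (size_t y = 0; y < height; y++)
        memcpy(dst + y * dst_pitch, src + y * src_pitch, width * sizeof(uint32_t));
}
```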
 
Five years since DX10, and all we get to see is mediocre performance out of a DX9 renderer. Something tells me it's not gonna happen for quite a while.
Since AVX2 and AVX-1024 are prerequisites to replace the IGP with software rendering, why the rush? You think there's a market for DX11 software rendering right now?
 
Nonsense. Even a GMA 950 can render Aero smoothly, and they use lots of optimizations to ensure the graphics load is as low as possible. Heck, the windows are pixel-aligned most of the time, so with a software renderer you can just copy the texture data directly.

And what about rendering the texture in the first place? Since painting the desktop, and now web browsing, are the most-used graphics applications, it is critically important that they be as power-efficient as possible. Even 0.5 W extra is too much for these workloads.
 
Since AVX2 and AVX-1024 are prerequisites to replace the IGP with software rendering, why the rush? You think there's a market for DX11 software rendering right now?

Well, AVX2 will land in 2013, right next to fully fused APUs. We'll see just how much sense software rendering on CPUs makes then.
 
Intel has been moving toward more dedicated hardware, I would say:
a built-in and modernised IGP, cryptographic acceleration, and Quick Sync. My guess is that they will not remove them in two or three years just for the bragging rights.
 
One for the extract, one for the load, and one for the insert. So 24 uops for an 8-way gather.
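In intrinsics form, the emulation looks roughly like this SSE4.1 sketch; an AVX 8-way version repeats it for both 128-bit halves, and the exact uop count depends on the compiler and on where the indices start out.

```c
#include <smmintrin.h>
#include <stdint.h>

/* Pre-AVX2 emulation of a 4-wide gather of 32-bit elements: per lane one
 * extract, one scalar load, one insert -- roughly three instructions per
 * texel, so an 8-wide AVX version ends up around 24. */
static __m128i gather4_emulated(const uint32_t *base, __m128i vindex)
{
    __m128i r = _mm_setzero_si128();
    r = _mm_insert_epi32(r, (int)base[(uint32_t)_mm_extract_epi32(vindex, 0)], 0);
    r = _mm_insert_epi32(r, (int)base[(uint32_t)_mm_extract_epi32(vindex, 1)], 1);
    r = _mm_insert_epi32(r, (int)base[(uint32_t)_mm_extract_epi32(vindex, 2)], 2);
    r = _mm_insert_epi32(r, (int)base[(uint32_t)_mm_extract_epi32(vindex, 3)], 3);
    return r;
}
```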

The only sensible implementation is one that takes a single uop for an entire gather.

I can see savings versus an 8-element gather costing 24 uops in total in cases where the accesses fall within a small number of cache lines, which lets them be combined within a single gather instruction.

I dunno about it going down to 1 uop. There's some complex and interruptible activity described for the instruction.
Is there currently an instruction with that level of complexity that resolves to a single uop?
 
You think the massive range of workloads a homogeneous high-throughput CPU can run is too specialized, but APUs should get ever more GPU cores for the narrow range of applications which can benefit from OpenCL? Don't forget, homogeneous CPUs are also capable of running OpenCL, and every other high-throughput API/language out there...

Yes, the applications that can take advantage of 8 (or even 4) CPU cores aren't used by the average consumer or gamer. I'm not sure what range of applications you're referring to, but why are you ignoring the #1 application, i.e. games?

Given the sizes and prices of Thuban, Llano and Bulldozer, it's not going to take long for Intel to drop its quad-core prices and let an affordable 6-core CPU take the top spot, which then paves the way for an 8-core in 2013. Also note that they'll move to 16 nm just one year later. But again, how much these parts will cost all depends on the competition's moves. What matters to the discussion is that Intel will have no trouble keeping up with APUs, and once software takes full advantage of homogeneous computing, AMD will have no choice but to go homogeneous as well.

I'll believe it when I see it. What do you imagine an 8-core CPU will be doing on a home desktop in 2013 that a dual-core can't? I can think of a few things, none of which involve graphics rendering.

Indeed, lightly threaded scalar code wouldn't use wider vector units or extra cores, but neither would it use an APU's GPU cores, so that's a useless argument.

APUs aren't trying to be CPUs or run sequential code on the GPU cores so I'm not sure what point you're making. You're claiming that CPUs can venture outside their comfort zone and compete with GPUs and APUs at 3D rendering. Where's the beef?

The DX9 evaluation demo is freely available: SwiftShader.

Thanks!
 
The only sensible implementation is one that takes a single uop for an entire gather.
What happens when it causes 8 TLB misses, costing 10^3-10^4 cycles for one instruction? The system would appear to hang to the user, ultimately creating a lot of consumer and OEM heartburn. You would quite likely lose a few interrupts, potentially requiring lots of IO syscalls to be repeated and a lot of extra TCP/IP traffic. And what happens to the coherency protocol while a core is stalled on a single memory uop?
 
With real, usable FLOPS we arrive at 3.04 versus 2.85 GFLOPS per mm², with GF110 getting the longer end of the stick. And that's including additional functionality and programmability.
That's still a lower computing density when corrected for the process shrink.
Also, I cannot follow your reasoning for why you must subtract SB's IGP from the die area. By the same logic, I could argue for removing the ROPs, texture filtering and even fixed-function rasterization from the GPUs' die area.
That would be fine if it did software rendering and ran every other application plus the operating system...

This thread evolved into a discussion about homogeneous versus heterogeneous architectures. That means a CPU without an IGP, versus an APU or CPU + discrete GPU.
"The answer" costs die area, as it will on future CPUs too. Less, maybe, but the free lunch is over.
What's the area for 8 logarithmic right shift units for 64 bytes at byte granularity?
 
No amount of OoO can work around true dependencies. Whatever independent ALU ops OoO can issue, the compiler can schedule just as well. Whatever memory latencies OoO can hide, the compiler/programmer can prefetch around as well.
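To illustrate the kind of explicit prefetching meant here, a trivial sketch; picking the right prefetch distance statically is itself non-trivial, which feeds the counter-argument below:

```c
#include <xmmintrin.h>
#include <stddef.h>

/* Sum an array while issuing a software prefetch a fixed distance ahead,
 * so the data is (hopefully) in cache by the time it is needed. The best
 * distance depends on memory latency and per-iteration cost. */
static float sum_with_prefetch(const float *a, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i + 64 < n)
            _mm_prefetch((const char *)&a[i + 64], _MM_HINT_T0);
        sum += a[i];
    }
    return sum;
}
```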

People have made your argument for decades and it has been false for decades.

First, there is so much that a compiler cannot know that it is basically impossible to do what you are suggesting.

Second, a compiler cannot schedule dynamically.

Third, if what you said were true, VLIW would have taken over the world by now instead of being an abject failure.

Fourth, "In theory there is no difference between theory and practice, but in practice there is."
 