22 nm Larrabee

CPU cores: 217.6 GFLOPS / 4 x 3.15 mm x 5.46 mm = 3.16 GFLOPS/mm²
IGP: 129.6 GFLOPS / 4.71 mm x 8.70 mm = 3.16 GFLOPS/mm²

That's for a Core i7-2600 as-is, at baseline frequencies. Haswell adds FMA to the mix...
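For reference, a minimal sketch of where those two densities presumably come from. The clocks and per-cycle FLOP counts are my assumptions, not stated above: the i7-2600's 3.4 GHz base clock with 8-wide AVX add and multiply ports (16 SP FLOPs/cycle per core), and the HD Graphics 3000's 12 EUs at their 1350 MHz maximum graphics clock (8 SP FLOPs/cycle per EU).

Code:
#include <stdio.h>

/* Reconstructing the GFLOPS/mm² figures above (assumed clocks/widths). */
int main(void)
{
    double cpu_gflops = 4 * 3.4 * 16;     /* 4 cores x 3.4 GHz x 16 SP FLOPs/cycle = 217.6 */
    double cpu_area   = 4 * 3.15 * 5.46;  /* four cores of 3.15 mm x 5.46 mm               */

    double igp_gflops = 12 * 1.35 * 8;    /* 12 EUs x 1.35 GHz x 8 SP FLOPs/cycle = 129.6  */
    double igp_area   = 4.71 * 8.70;      /* IGP block area in mm²                         */

    printf("CPU cores: %.1f GFLOPS -> %.2f GFLOPS/mm2\n", cpu_gflops, cpu_gflops / cpu_area);
    printf("IGP      : %.1f GFLOPS -> %.2f GFLOPS/mm2\n", igp_gflops, igp_gflops / igp_area);
    return 0;
}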
Just for comparison, I made up some numbers based on the ancient RV770 (built on TSMC's 55 nm process):

1200 GFLOPS in 255 mm² (including the 256-bit memory controller, which could be shared with a CPU) => 4.7 GFLOPS/mm²

The shader core alone (which still includes L1 caches and texturing hardware) measures ~105 mm² => 11.4 GFLOPS/mm²

Just the SIMD engines without TMUs (but of course with 2.5 MB of register files) are roughly 74 mm² => 16.2 GFLOPS/mm²

Perfect length scaling (without taking any frequency changes into account) to 32 nm would get you 33.7 GFLOPS/mm² for the whole shader core (~48 GFLOPS/mm² for the SIMD engines alone). That is an order-of-magnitude difference.
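To make the scaling arithmetic explicit, here is my own calculation, using only the ideal area shrink stated above (small differences from the quoted 33.7 and 48 are just rounding):

Code:
#include <stdio.h>

/* RV770 density figures and the ideal 55 nm -> 32 nm area scaling. */
int main(void)
{
    double gflops = 1200.0;                           /* peak SP GFLOPS            */
    double die = 255.0, core = 105.0, simd = 74.0;    /* areas in mm²              */
    double shrink = (55.0 / 32.0) * (55.0 / 32.0);    /* ideal area shrink, ~2.95x */

    printf("whole die   : %.1f GFLOPS/mm2\n", gflops / die);
    printf("shader core : %.1f -> %.1f GFLOPS/mm2 at 32 nm\n",
           gflops / core, gflops / core * shrink);
    printf("SIMD engines: %.1f -> %.1f GFLOPS/mm2 at 32 nm\n",
           gflops / simd, gflops / simd * shrink);
    return 0;
}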
 
Clock gating would be a win if you can find 2 (and more like 4) idle cycles. I don't think that will happen realistically.
I think Intel is pretty happy with their design if it only has 2-4 idle cycles. Maybe you're talking about a specific situation I didn't follow, but I bet there is usually far more idle time during the majority of a program's execution.
 
My point was that the opportunity for clock gating will be fleeting and well dispersed, especially in fp intensive code that predicts well and has fewer branches anyway.
 
Any idea how big a part of the chip is taken up by fixed-function stuff that isn't counted in the FLOPS, like texture filtering, which has to be emulated in software on a CPU?
No, but I do know the Intel HD Graphics 3000 has only 2 samplers. Also, sampling would take roughly 40 instructions for 16 filtered samples using AVX2. So with an IPC of ~2, the equivalent of one CPU core would be needed to provide the same peak texel rate.
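As a rough sanity check (the clock figures here are my assumptions, not established numbers):

40 instructions / 16 texels = 2.5 instructions per filtered texel
IPC ~2 at ~3.4 GHz => ~6.8 G instructions/s => ~2.7 Gtexels/s for one core
2 samplers at ~1.35 GHz => ~2.7 Gtexels/s peak bilinear rate for the HD 3000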

If that sounds not terrible but not great either, keep in mind that texture units are largely underutilized. You hardly ever use the peak sampling rate (but the CPU does offer higher peak performance should you ever need it). Furthermore, texture sampling in software allows eliminating any operations you don't need. One extreme (but not rare) example occurs when performing full-screen post-processing effects: when the texels are aligned to pixels, you don't need any filtering at all. Perspective correction isn't needed either. So the whole texture sampling operation can nearly be reduced to a plain load instruction!
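To illustrate that last point, a hypothetical sketch (the texture struct, the names, and the missing border handling are all mine) of how a software sampler can specialize the pixel-aligned case down to a plain load:

Code:
#include <math.h>

/* Hypothetical single-channel float texture, for illustration only. */
typedef struct { const float *texels; int width, height; } Texture;

/* General path: a bilinear sample needs address math, four loads and
   the filtering arithmetic (border handling omitted for brevity). */
static float sample_bilinear(const Texture *t, float u, float v)
{
    float x = u * t->width  - 0.5f;
    float y = v * t->height - 0.5f;
    int   x0 = (int)floorf(x), y0 = (int)floorf(y);
    float fx = x - x0, fy = y - y0;
    const float *p = t->texels + y0 * t->width + x0;
    float top = p[0]        + fx * (p[1]            - p[0]);
    float bot = p[t->width] + fx * (p[t->width + 1] - p[t->width]);
    return top + fy * (bot - top);
}

/* Pixel-aligned post-processing path: no filtering and no perspective
   correction, so the whole "sample" collapses to a single load. */
static float sample_aligned(const Texture *t, int x, int y)
{
    return t->texels[y * t->width + x];
}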

Also note that when the IGP is ditched, it actually creates space for two more cores. The time that's not used for texture sampling can be used to increase computing performance...

So seriously, performance really isn't an issue. The only thing left in favor of the IGP is power consumption. But while it's a challenge, there are definitely ways to reduce it. Also keep in mind that for the IGP to become any more flexible, it has to sacrifice some power efficiency. So any way you look at it, eventually a homogeneous architecture becomes the superior solution.
 
You advocate processing in AVX-like units extended to 1024 bits, at least logically. That is 32 floats, exactly the logical vector size NVIDIA uses in its GPUs. AMD uses a logical vector length of 2048 bits. Both use units that physically process 16 floats per clock (512 bits). I don't see much of a difference, or any need at all, to "assemble the data into a batch suited for processing by a specialized heterogeneous component"; it will work just the same.
1024 or 2048 bits is hardly a batch of data worth sending to a heterogeneous component. You said so yourself: "You won't transfer just some hundred bytes from registers for a few additions, of course." There's just too much overhead.
If you first needed to "assemble a batch" to make it suitable for processing by wide vector units, it would probably be better to resort to plain old, maybe even scalar, SSE :devilish:
Batching is not about making data suitable for processing by wide vector units; it's about amortizing latency and synchronization overhead. It's a whole other scale.

Basically with a heterogeneous architecture you would constantly be exchanging data between the L2 and L3 caches. With a homogeneous architecture most data flow takes place between the L1 and L2 caches. The latter has far better latency and bandwidth.
Because you can't eat the cake and have it too? Hardware built for a special purpose will always be more efficient for that purpose. There is some barrier to extending a CPU (and the software support) in such a way that throughput-oriented tasks get offloaded to a wide vector engine, but ultimately it is more efficient for sure.
You have to look beyond the efficiency at a single task. Workloads are getting more diverse, so even GPUs are undergoing major changes to become more efficient at a variety of tasks. The next logical step (which may still take many more hardware generations) is to fully unify the CPU and GPU.

The biggest hurdle is lowering the power consumption of out-of-order execution. But they're well on their way to tackling that problem.
 
At GPU compute densities L1s are filled very very quickly ... I dunno if you could have enough L1 to not stream transformed vertices through higher level caches at those densities, even with tiling (with sort middle direct mode rendering you can forget about it). In ye old diminishing return desktop processor cores with huge caches, sure ... but that leaves a lot of performance on the table.
 
1024 or 2048 bits is hardly a batch of data worth sending to a heterogeneous component. You said so yourself: "You won't transfer just some hundred bytes from registers for a few additions, of course." There's just too much overhead.
But if there is not much to do, you don't need 1024-bit wide vector units to start with. If it is worth doing on such units, you won't be sending anything over (because it won't be in the normal CPU's registers anyway); you will just process the whole thing in the appropriate place. If they share caches and memory space (and they will), you just load the data into the right units in the first place.
You have to look beyond the efficiency at a single task. Workloads are getting more diverse, so even GPUs are undergoing major changes to become more efficient at a variety of tasks. The next logical step (which may still take many more hardware generations) is to fully unify the CPU and GPU.

The biggest hurdle is lowering the power consumption of out-of-order execution. But they're well on their way to tackling that problem.
Cutting a long story short, of course it is the stated goal of Fusion (you cited it) to unify CPU and GPU. But why the hell do you think the result will be completely homogeneous? Why can't you combine fast, latency-optimized OOO integer and FP units with wide, throughput-optimized in-order units in a single entity? Then add a few specialized circuits for video de-/encoding (Intel's QuickSync got quite some praise lately), rasterization, texturing, and maybe ROPs, and you have something that doesn't look very homogeneous anymore.
 
My point was that the opportunity for clock gating will be fleeting and well dispersed, especially in fp intensive code that predicts well and has fewer branches anyway.
Why would it be fleeting? Again, since the 1024-bit instructions would require 4 cycles, you get the same scheduling ability with four times fewer uops in the reservation station. So you can clock gate the dispatch logic (and anything closely related) for a very long time.

One way to implement this would be to have a counter for the total number of cycles the uops in the reservation station take. Most instructions take 1 cycle, but 1024-bit instructions take 4 cycles and division takes ~14 cycles. If the counter goes above a certain threshold, dispatch and related logic can be clock gated. Once the counter dips below another threshold, the clock can be re-enabled so the queues start getting topped up again.
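A minimal sketch of that counter scheme, with placeholder thresholds; this is only meant to show the hysteresis, not any actual implementation:

Code:
#include <stdbool.h>

/* Hysteresis clock gating of the logic that tops up the uop queues.
   All thresholds are placeholders, not real design parameters. */
#define GATE_ON_THRESHOLD   24   /* plenty of queued work: stop the clock */
#define GATE_OFF_THRESHOLD   8   /* running low: wake dispatch back up    */

static int  pending_cycles = 0;     /* total execution cycles queued in the RS */
static bool dispatch_gated = false;

/* A uop enters the reservation station: 1 cycle for most uops,
   4 for a 1024-bit instruction, ~14 for a division. */
void on_uop_queued(int cycle_cost)
{
    pending_cycles += cycle_cost;
    if (pending_cycles > GATE_ON_THRESHOLD)
        dispatch_gated = true;       /* clock gate dispatch and related logic  */
}

/* One cycle of queued work gets executed. */
void on_execution_cycle(void)
{
    if (pending_cycles > 0)
        pending_cycles--;
    if (pending_cycles < GATE_OFF_THRESHOLD)
        dispatch_gated = false;      /* re-enable the clock, top the queues up */
}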

Note that stalls due to other events can also result in the queues being full enough to clock gate dispatch. So for scalar code with an average IPC of 2 and a dispatch rate of 4, it can be clock gated half the time. For the 1024-bit instructions the IPC would go down to 0.5 (without reducing the actual throughput), so dispatch could be clock gated up to 87% of the time (1 - 0.5/4 = 87.5%)!

If this sounds a bit like pie in the sky to you, keep in mind that fine-grained clock gating is a very active research field. So something that may increase the clock gating opportunity by a factor of 4 is not something they'll pass up. Heck, I wouldn't be at all surprised if they already took advantage of dispatch clock gating. 1024-bit instructions would instantly make it more effective.
 
Just for comparison, I made up some numbers based on the ancient RV770 (built on TSMC's 55 nm process):

1200 GFLOPS in 255 mm² (including the 256-bit memory controller, which could be shared with a CPU) => 4.7 GFLOPS/mm²
That's great. Now do the same for double-precision. Also please compare against competing NVIDIA chips, just to make sure AMD didn't create a chip that can't sustain its peak GFLOPS. And what about TDP? If Sandy Bridge was allowed to consume 150 Watt wouldn't it reach higher GFLOPS? And did the move to VLIW4 increase or decrease computing density? Care to take a guess at what effect more dynamic scheduling will have on compute density? How about full IEEE-754 support, exception handling, a unified address space, etc. Also, what's the error rate of a GPU versus a CPU?

Don't get blinded by some raw numbers.
 
Again, since the 1024-bit instructions would require 4 cycles, you get the same scheduling ability with four times fewer uops in the reservation station. So you can clock gate the dispatch logic (and anything closely related) for a very long time.
I would vote for running the scheduler at 1/4th the clock rate to start with, think nvidian! ;)

That's great. Now do the same for double-precision. Also please compare against competing NVIDIA chips, just to make sure AMD didn't create a chip that can't sustain its peak GFLOPS. And what about TDP? If Sandy Bridge was allowed to consume 150 Watt wouldn't it reach higher GFLOPS? And did the move to VLIW4 increase or decrease computing density? Care to take a guess at what effect more dynamic scheduling will have on compute density? How about full IEEE-754 support, exception handling, a unified address space, etc. Also, what's the error rate of a GPU versus a CPU?

Don't get blinded by some raw numbers.
Not by an order of magnitude.
You keep linking to AMD's new "graphics core next" architecture. Be assured I'm very well aware of it. Maybe you should view it as an answer to your questions and needs. ;)

Why do you want to make a GPU shader core out of the AVX units?
 
AMD's next architecture has exposed additional parts of the pipeline and accepted some additional complexity in scheduling. The CU is not significantly more dynamic than what was present in the VLIW5 SIMDs, however.
Instead, relative to RV770, it cut per-wavefront issue capability by a factor of 5 in exchange for 5 times as many simultaneous wave issues with some new restrictions.
 
AMD's next architecture has exposed additional parts of the pipeline and accepted some additional complexity in scheduling. The CU is not significantly more dynamic than what was present in the VLIW5 SIMDs, however.
Instead, relative to RV770, it cut per-wavefront issue capability by a factor of 5 in exchange for 5 times as many simultaneous wave issues with some new restrictions.
It's the wrong thread here, but the number of wave instruction issues can rise by more than just a factor of 5. GCN can issue up to 5 instructions from 5 different wavefronts each cycle (the 4 instruction buffers for the 4 SIMDs are served round robin; the sustained maximum with reasonable code is probably ~3/cycle: 1 vector, 1 scalar, an LDS instruction issue every 2 cycles, plus branches and whatever the memory pipeline delivers). Cayman issues just a single VLIW instruction every 4 cycles. So it can be a lot higher, depending on the code: from a minimum of a factor of 4 to an (unrealistic) maximum of 20. Especially for what used to be short ALU clauses with a lot of control flow, it will really shine thanks to the scalar ALU, in comparison to Cayman. The scalar ALU as well as the branch unit can accept a new instruction each cycle, not every 4 cycles like the SIMDs.
But you are completely right on the first point: the beefed-up issue capabilities don't change the fact that there isn't much dynamic behavior going on. The instructions are simply issued in order to a given, predetermined unit with no fancy stuff going on.
 
It can issue one instruction each from 5 waves, as there is a one-instruction-per-wave restriction.
The aggregate issue is the same, although the slides did say graphics-only instructions were not included in that description.

The issue hardware is more flexible, which will add some unspecified amount of complexity, though other parts seem to be reusing what Cayman already had.
 
It can issue one instruction each from 5 waves, as there is a one-instruction-per-wave restriction.
Yes, 5 instructions of different types, which have to come from different waves, each cycle. That is definitely a bit more than what was traded away for it: the 4 or 5 operations (max) per issue every 4 cycles.

To sum it up, the maximum aggregate issue rate is up quite a bit. It won't help ALU-throughput-limited stuff much (just code with low ILP), but for everything else (LDS, branches) it will help a lot. In principle you can get 100% utilization of your SIMDs with code that is more than 50% scalar instructions and branches.
 
I would vote for running the scheduler at 1/4th the clock rate to start with, think nvidian! ;)
When a 1024-bit instruction has been issued you can just clock gate the port's scheduler for 3 cycles. You do still need the full clock rate for other instructions.
You keep linking to AMD's new "graphics core next" architecture. Be assured I'm very well aware of it. Maybe you should view it as an answer to your questions and needs. ;)
It's just another intermediate step toward full homogeneous computing.
Why do you want to make a GPU shader core out of the AVX units?
Because it would be so much more than a shader core.
 
I guess we should stop here, don't you think?
Absolutely not. GCN merely catches up with Fermi. NVIDIA's next architecture will no doubt be a lot more attractive to develop for. And Intel converged AVX2 toward LRBni for a reason...

For instance support for recursion on a GPU is still merely a gimmick. The stack space per thread is tiny, so you can also forget about deep function calls. Fixing this requires lowering the number of threads, which can only be achieved with things like data prefetching, branch prediction, and out-of-order execution. Light-weight versions of such techniques can dramatically increase effective performance at complex workloads while keeping power consumption in check. Just ask yourself: Why does AMD require more dynamic scheduling? Will GCN make those reasons disappear for all eternity?

And do you think PCIe based memory coherence for discrete GPUs is going to work well? Or are we expected to buy an APU plus a discrete GPU? Developers don't like programming two devices, so they'll downright hate programming three. There's really no other choice but to make the CPU fully homogeneous. And it's well within reach, so I'm sure Intel is looking into it right now.
 
For instance support for recursion on a GPU is still merely a gimmick. The stack space per thread is tiny, so you can also forget about deep function calls. Fixing this requires lowering the number of threads, which can only be achieved with things like data prefetching, branch prediction, and out-of-order execution.
That is extremely unlikely. And branch prediction on a GPU is downright absurd. It makes a lot more sense in extreme cases to allow spilling the stack to memory (i.e. hopefully L2).

And do you think PCIe based memory coherence for discrete GPUs is going to work well? Or are we expected to buy an APU plus a discrete GPU?
I don't know about AMD, but in NVIDIA's case their solution is clear: integrate at least one Project Denver CPU core on every GPU going forward. It doesn't have to be the main system CPU to be usable via CUDA! ;)
 
Absolutely not. GCN merely catches up with Fermi. NVIDIA's next architecture will no doubt be a lot more attractive to develop for.
If you say so; you already know what it will look like? :rolleyes:
For instance support for recursion on a GPU is still merely a gimmick. The stack space per thread is tiny,
Tiny in comparison to the amount of data in the regs, yes. Try swapping out hundreds of kilobytes of registers on a CPU. Wait, a CPU doesn't have that many registers. I guess performance would tank on a CPU if you completely trashed your caches frequently. Hmm.
Just ask yourself: Why does AMD require more dynamic scheduling?
Does it?
Will GCN make those reasons disappear for all eternity?
Will GCN be the end of all things?
And do you think PCIe based memory coherence for discrete GPUs is going to work well?
I thought we were talking about stuff integrated on the same die, sharing the last-level cache and the memory controller anyway?
Or are we expected to buy an APU plus a discrete GPU? Developers don't like programming two devices, so they'll downright hate programming three.
What about a software stack that allows writing code that will run on any device or combination of devices? The performance will only improve if you reduce communication delays by integrating everything into one device.
There's really no other choice but to make the CPU fully homogeneous. And it's well within reach, so I'm sure Intel is looking into it right now.
As you started referring to the AMD Fusion Developer Summit, maybe you should look for the talk by the ARM representative there. Or what Intel said about the same problem not very long ago. ;)

And again, I really think we should agree to disagree and stop here.
 