22 nm Larrabee

Nick · Jun 15, 2011

rpg.314 said:
Not necessarily. As each cycle lapses, there would be more instructions in the buffer, causing the scheduler to look for all possible multiple issue opportunities. It does nothing about the sub-scheduler responsible for serial instructions.

Besides, clock gating for >~3 cycles would be next to impossible in the sanctum sanctorum of the core for any reasonable workload, saving almost no power.

I guess some of it depends on whether the scheduler is unified or not. When executing 1024-bit instructions certain buffers would become full so things like register rename can be clock gated for a while. Note that Sandy Bridge's uop cache can cause fetch and decode stages to be clock gated. If I recall correctly it takes a couple cycles to start them up again, but it's well worth the power savings.

3dilettante · Jun 15, 2011

Nick said:
I think a non-pipelined instruction is different because the entire register is fed into the ALU on the first cycle. Since 1024-bit physical registers and data paths are not realistic I imagine four 256-bit registers are needed instead, accessed sequentially. But this complicates register renaming and scheduling since a single instruction could involve 12 physical registers (16 for FMA4).

This would be most immediately implemented by breaking the instruction into multiple uops. The allocation logic would lump together adjacent register locations, and the renamer would only care about the initial location when doing the renaming.
The uops would be dispatched in sequence, with a completion flag that would not be flipped until all have completed.

Nick · Jun 15, 2011

rpg.314 said:
Data will go via L3 - which is inclusive on Intel anyway - with any cachelines marked as M hopefully kicked downstairs. Even with an exclusive L3 like on AMD, you would want the data crossing cores to be out of L1 and L2 anyway, to save excessive coherency induced stalls. Latency will be ~25 cycles, hardly anything to worry about for large kernels.

It's not just the L3 latency. You need to send a large batch of data over, issue commands, wait for it to finish processing, and pull the results back in. Also, if multiple cores are doing this dance then you'll quickly get bandwidth limited. A lot of power and cycles go to waste doing nothing useful.

There's a OoM difference in perf/W and growing. It will be higher for well designed systems.

I doubt it's growing. Multi-core, wide vectors and FMA each enable the CPU to catch up with the GPU. For many years the GPU was able to increase performance/Watt by dedicating every more die space to shading units, but now they're out of space.

That's because developers don't have much experience.

They're only slowly starting to get used to multi-core programming. You really think they'll ever become "experienced" at programming a wide variety of heterogeneous chips? And I'm not saying the average developer wouldn't be capable of writing such software; it's just not worth all the effort. Something that looks good on paper will be an utter failure if developers are not eager to invest their time and money into writing software for it.

That way the entire core is powered on all the time, killing your perf/W.

Not the entire core. Intel has started to use fine-grained aggressive clock gating to reduce the power consumption of logic blocks when they're idle. Agner Fog discovered for instance that Sandy Bridge clock gates floating-point components, and might even power gate the upper 128-bit of the AVX units.

With realistic hw, there's an upper limit of 3-4 instructions per clock and we are already there. Serial perf is a dead end.

I'm not talking about increasing serial performance. I'm talking about keeping it while also turning the CPU into a homogeneous high throughput architecture.

Nick · Jun 15, 2011

3dilettante said:
This would be most immediately implemented by breaking the instruction into multiple uops. The allocation logic would lump together adjacent register locations, and the renamer would only care about the initial location when doing the renaming.
The uops would be dispatched in sequence, with a completion flag that would not be flipped until all have completed.

Would there really be a need to explicitly break it into multiple uops to achieve this?

hoho · Jun 15, 2011

Nick said:
I doubt it's growing. Multi-core, wide vectors and FMA each enable the CPU to catch up with the GPU. For many years the GPU was able to increase performance/Watt by dedicating every more die space to shading units, but now they're out of space.

yeah but in GPUs you have a TON of really dumb computing units that are relatively inefficient at executing spaghetti while normal CPUs have tons of die space dedicated for OoO execution and massaging that spaghetti into something that can be executed effectively. Die area dedicated to cache sizes is in a somewhat similar situation.

Yes, wide SIMD units with FMA and other fancy things definitely help but as long as cores still drag their "legacy" single-thread acceleration stuff with them GPUs can always fit more FLOPS inside similar TDP/transistor count.

Gipsel · Jun 15, 2011

Nick said:
It's not just the L3 latency. You need to send a large batch of data over, issue commands, wait for it to finish processing, and pull the results back in.

A "large batch of data" resides in the caches (or the memory) anyway. You won't transfer just some hundred bytes from registers for a few additions, of course.

3dilettante · Jun 16, 2011

Nick said:
Would there really be a need to explicitly break it into multiple uops to achieve this?

It would be the most expedient way to implement it, by leveraging the logic that is already in place.
The register allocation logic already has the knowledge of the register ID and the logic needed to increment it for each successive uop.

The logic downstream currently doesn't generate new register IDs, because that is determined when the uop was created.

It seems possible that a scheme could be created where the scheduler or execution unit has additional stateful behavior, and it knows to loop on a uop and increment the register ID sent to the register file or bypass network for each cycle. It's a more complex solution and would require more work and time to implement.
This may add some latency to the process. When the register ID is static, the hardware can send the value out in parallel to various places in the same clock, like the register file access logic and bypass checks. In a stateful case, the ID value needs to run through a small extra round trip with an adder and cycle counter before it can be routed to those places.

There are probably other ways of doing it, but that is what comes to mind.

rpg.314 · Jun 16, 2011

Nick said:
I guess some of it depends on whether the scheduler is unified or not. When executing 1024-bit instructions certain buffers would become full so things like register rename can be clock gated for a while. Note that Sandy Bridge's uop cache can cause fetch and decode stages to be clock gated. If I recall correctly it takes a couple cycles to start them up again, but it's well worth the power savings.

Clock gating would be a win if you can find 2 (and more like 4 cycles) idle. I don't think that will happen realistically.

rpg.314 · Jun 16, 2011

Nick said:
It's not just the L3 latency. You need to send a large batch of data over, issue commands, wait for it to finish processing, and pull the results back in. Also, if multiple cores are doing this dance then you'll quickly get bandwidth limited. A lot of power and cycles go to waste doing nothing useful.

In a well designed heterogeneous system, this overhead will be the same as the overhead in a homogeneous system.

I'm not talking about increasing serial performance. I'm talking about keeping it while also turning the CPU into a homogeneous high throughput architecture.

It will be a total waste of effort. Developers will soon be pushing anything might need high throughput to on die GPUs. Intel will go down this road nonetheless and drag AMD with it.

Nick · Jun 16, 2011

hoho said:
yeah but in GPUs you have a TON of really dumb computing units that are relatively inefficient at executing spaghetti while normal CPUs have tons of die space dedicated for OoO execution and massaging that spaghetti into something that can be executed effectively.

The amount of dumb FLOPS isn't all that high. The Intel HD Graphics 3000 can only do 130 GFLOPS, while the Sandy Bridge cores can do 220 GFLOPS. Add FMA support and you're looking at 440 GFLOPS (at only a minor increase in area). Sure, power consumption for the CPU cores is higher due to clock frequency, but is anyone really going to opt for pushing his calculations to a 3.5 times weaker IGP, which probably has trouble getting anywhere near its theoretical performance? GPGPU on the same die is a dead end.

Die area dedicated to cache sizes is in a somewhat similar situation.

I think you're underestimating the amount of die space dedicated to various forms of storage on a GPU. Keeping thousands of threads in flight requires lots of register and stack space, to save on RAM bandwidth ever larger data caches are needed, and running long kernels concurrently requires larger instruction caches. Heck, Sandy Bridge's IGP even has access to the entire L3 cache, and it's still bandwidth limited. Some suggest adding a slab of eDRAM. Great, but then stop pointing at the amount of die space a CPU invests in storage as a bad thing.

Besides, 8T-SRAM is quite power efficient, so as long as it increases performance/Watt over other uses, it's a good investment of the transistor budget.

Yes, wide SIMD units with FMA and other fancy things definitely help but as long as cores still drag their "legacy" single-thread acceleration stuff with them GPUs can always fit more FLOPS inside similar TDP/transistor count.

Sure, but you've got the solution right there. Manage the ratio of power consumed by control and computing logic. Ever since the Prescott architecture, Intel has been hard at work implementing ways to improve performance/Watt. Haswell will have a whopping 8-fold increase in FLOPS per core over Prescott, and there's talk of significantly reducing TDP, especially for mobile parts.

I believe eventually brute-force out-of-order execution will be replaced with alternatives that offer a better compromise between power consumption and serial performance. Intel's "Continual Flow Pipelines" research is one possible route, but I'm sure there are others worth evaluating too.

Suffice to say that there are plenty of parameters to tweak to continue scaling performance/Watt for a homogeneous architecture.

hoho · Jun 16, 2011

Nick said:
The amount of dumb FLOPS isn't all that high. The Intel HD Graphics 3000 can only do 130 GFLOPS, while the Sandy Bridge cores can do 220 GFLOPS. Add FMA support and you're looking at 440 GFLOPS (at only a minor increase in area).

Now tell me about GFLOPS per transistor for the two architectures

Nick · Jun 16, 2011

Gipsel said:
A "large batch of data" resides in the caches (or the memory) anyway.

No it doesn't. You still have to assemble the data into a batch suited for processing by a specialized heterogeneous component. A homogeneous architecture on the other hand can process smaller amounts of data much more efficiently, as the data likely never has to leave L1/L2 cache till it's done processing.

You won't transfer just some hundred bytes from registers for a few additions, of course.

Exactly! And there are tons of cases like that.

So instead of moving workloads "to the particular area of the chip capable of executing that workload the most efficiently", why not make sure the place to be is the same core, using a suitable set of instructions?

Yes, the power consumption of out-of-order execution poses a challenge, but can be managed from one generation to the next. In contrast, memory bandwidth and latency don't scale well. So things have been evolving in favor of homogeneous processing for quite a while. Heck, the earliest graphics cards processed only one or a few polygons at a time. Nowadays we have to go to great lengths to ensure that the massively parallel GPU can limp behind a couple frames! There's no way to scale that any further, especially since GPGPU applications want to read back results sooner, not later. So the GPU has to get closer to the CPU, both physically and architecturally...

The future is homogeneous, and Intel is taking a head start.

Nick · Jun 16, 2011

3dilettante said:
Another bit of work that is needed if we try a gather that is little more than N standard loads chained together is a way to evaluate the index values for the gather to see if they are neighbors on the same cache line, based on their values and the value of the base register loaded from the scalar pipeline.
If we don't rely on a stream of microcoded instructions to do this, we need a logic block that can spit out a list of cache line neighbors in order to elide duplicate loads. There is currently no guarantee that Haswell will attempt this, but it would help reduce the latency and power cost of using the instruction.
The comparison would be an N-way check of the arithmetic difference of each index value + the lowest significant bits of the base register needed to pick a cache line. The resulting values would be checked for where they lie on a cache line.
This could be done with a stream of instructions, though it would be faster if specialised hardware helped.

You don't have to compute the full arithmetic difference of each index value. You just have to compare the upper bits for equality, and check whether the sum of the lower 7 bits of the base and index cause an overflow into the next cache line. Since the results aren't needed till a later clock cycle, it seems trivial to provide small and power efficient dedicated hardware for it.

I'm slightly more concerned about the hardware needed to extract multiple elements simultaneously. Considering that cache line sizes of 64 bytes have been common for a decade it might not be that big a deal to have multiple of these shift units though. Larrabee is even supposed to be capable of gathering up to 16 elements per cycle, and has smaller cores, so I'm optimistic. Latency might be affected, but there's actually an elegant solution for that; see below.

The next question would be the number and width of Haswell's memory ports, which could change some details as well.

Two read ports, one write port, all 256-bit. It's pointless to feature two 256-bit FMA units if the caches can't provide sufficient bandwidth. x86's limited number of registers, and the fact that non-destructive AVX instructions help increase throughput, make this an absolute necessity.

Indeed this affects the gather implementation. I imagine either each load unit has a lightweight gather implementation, they cooperate in some way, or only the second one has an advanced gather implementation. The latter would be an interesting option since it allows for the second unit to have a higher latency. If most other load operations use the first low latency port, it would hardly affect legacy workloads.

If we look at Bulldozer with 2 128-bit FMAs and a vector permute block, we see an FPU that is something shy of half the size of the 2 integer cores in a module.
Within the FPU, the XBAR block is between the two register file halves, and would be about 20-25% of the area of the rest of the FPU (and it is only 128 bits). There are other ops in that block of course (horizontal ops, etc), which Haswell will have as well, so growth is not on the permute block alone.
An FMA is somewhat smaller than having 1 FMUL and 1 FADD. It goes to follow two of them is somewhat smaller than doubling the ALU pipes in the FPU of Haswell.

Bulldozer's FlexFP unit can execute up to four instructions each cycle. Sandy Bridge on the other hand borrows some of the integer data paths for the 256-bit operations. So it seems cheaper to me to equip Haswell with two 256-bit FMA units, than to extend the FlexFP unit to sustain two 256-bit operations.

Haswell is also a new core design, so the integer side would be growing as well.

According to some rumors, they might extend macro-op fusion to pairs of mov and ALU instructions, turning them into a single non-destructive operation (if applicable of course). This would merely affect the decoders. The ironic part is that current compilers actually avoid emitting such code, so legacy code may not observe much of a benefit. Recompiled code (which is still perfectly compatible with older x86 chips) could run a bit faster though, and just like test and branch macro-op fusion this technique would slightly improve performance/Watt.

3dilettante · Jun 16, 2011

Nick said:
You don't have to compute the full arithmetic difference of each index value. You just have to compare the upper bits for equality, and check whether the sum of the lower 7 bits of the base and index cause an overflow into the next cache line.

That makes more sense, I wasn't thinking it through. Equivalence checks for the high-order bits could be done in parallel with the small adds. Does this cover the case where the add from an index overflow to a value that matches the cache line for another index, or will an additional check after the adds be needed?

I'm slightly more concerned about the hardware needed to extract multiple elements simultaneously. Considering that cache line sizes of 64 bytes have been common for a decade it might not be that big a deal to have multiple of these shift units though. Larrabee is even supposed to be capable of gathering up to 16 elements per cycle, and has smaller cores, so I'm optimistic. Latency might be affected, but there's actually an elegant solution for that; see below.

This could be done using a permute instruction. If the operations are fully pipelined, a string of gathers can reach a throughput of 16 elements per cycle.

It's potentially more complex outside of that case. Do you see if there is a part of the description for the permute instruction that allows it to selectively permute values to a destination without writing over any other elements, or would the chip need to run a permute and then a blend to get the effect?

Indeed this affects the gather implementation. I imagine either each load unit has a lightweight gather implementation, they cooperate in some way, or only the second one has an advanced gather implementation. The latter would be an interesting option since it allows for the second unit to have a higher latency. If most other load operations use the first low latency port, it would hardly affect legacy workloads.

The gather could be microcoded, so the load units wouldn't need to change at all.

Bulldozer's FlexFP unit can execute up to four instructions each cycle. Sandy Bridge on the other hand borrows some of the integer data paths for the 256-bit operations. So it seems cheaper to me to equip Haswell with two 256-bit FMA units, than to extend the FlexFP unit to sustain two 256-bit operations.

It is cheaper in terms of shared data paths in a design that did not promote the INT SIMD paths to 256 bits. It hasn't been discussed how this will be arranged when they are promoted in Haswell.
Are we assuming Intel is not providing additional 256-bit data paths for the FMAs? Intel would run out of data paths to borrow in this case.
Intel is adding a permute capability that probably contributed to a very significant portion of the bulk in BD, aside from the other changes.

Nick · Jun 16, 2011

rpg.314 said:
Clock gating would be a win if you can find 2 (and more like 4 cycles) idle. I don't think that will happen realistically.

When executing many 1024-bit instructions (that take four cycles), the reservation stations for these ports will quickly fill up. The reorder buffer also becomes full, so after a short while no new instructions for other (scalar) ports can be dispatched and the reservation stations for these other ports run dry. At that point, the dispatch and register rename logic as well as the reservation station and ALUs of the scalar port, can all take a multi-cycle nap. Only once a few reorder buffer entries become available again, it's worth waking them up again.

So clock gating with a delay of 4 cycles sounds absolutely fine to me!

rpg.314 · Jun 16, 2011

I meant you'll get a window of 2-3 cycles in all to clock gate, realistically.

Nick · Jun 16, 2011

hoho said:
Now tell me about GFLOPS per transistor for the two architectures

CPU cores: 217.6 GFLOPS / 4 x 3.15 mm x 5.46 mm = 3.16 GFLOPS/mm²
IGP: 129.6 GFLOPS / 4.71 mm x 8.70 mm = 3.16 GFLOPS/mm²

That's for a Core i7-2600 as-is, at baseline frequencies. Haswell adds FMA to the mix...

hoho · Jun 16, 2011

Nick said:
IGP: 129.6 GFLOPS / 4.71 mm x 8.70 mm = 3.16 GFLOPS/mm²

Any idea how big part of the chip is taken by fixed-function stuff that isn't counted in the FLOPS like texture filtering that has to be emulated in software on CPU?

Nick · Jun 16, 2011

rpg.314 said:
I meant you'll get a window of 2-3 cycles in all to clock gate, realistically.

Why? As long as 1/4 of an SIMD reservation station consists of 1024-bit instructions, there's plenty of work and no need to top it up. That's a huge number of cycles during which much of the other power hungry logic can be clock gated.

Gipsel · Jun 17, 2011

Nick said:
No it doesn't. You still have to assemble the data into a batch suited for processing by a specialized heterogeneous component. A homogeneous architecture on the other hand can process smaller amounts of data much more efficiently, as the data likely never has to leave L1/L2 cache till it's done processing.

You advocate processing in AVX like units extended to 1024 bits at least logically. That are 32 floats, exactly the logical vector size nvidia uses in its GPUs. AMD uses a vector length of 2048 bits logically. Both are using units processing 16 floats/clock (512bits) physically. I don't see much of a difference or a need at all to "assemble the data into a batch suited for processing by a specialized heterogeneous component", it will work just the same.

Nick said:
Exactly! And there are tons of cases like that.

If you would first need to "assemble a batch" to be suited for processing by wide vector units, it is probably better to resorting to plain old and maybe even scalar SSE

Nick said:
So instead of moving workloads "to the particular area of the chip capable of executing that workload the most efficiently", why not make sure the place to be is the same core, using a suitable set of instructions?

Because you can't eat the cake and have it too? Hardware build for a special purpose will always be more efficient for this purpose. There is some barrier for extending a CPU (and the software support) in a way, that throughput oriented tasks get offloaded to a wide vector engine, but ultimately it is more efficient for sure.

22 nm Larrabee

Similar threads