22 nm Larrabee

Does anyone know how the FPU in Knights Corner is arranged? Most CPUs have two FPUs, one for Multiply and one for Add. Does Larrabee/Knights Corner only have one?

The reason I ask is that some internet sources say a 32-core, 2 GHz Larrabee would achieve 2 TFlops in SP with 512-bit vectors and FMA.

32 cores × 2 GHz × 16 SP lanes × 2 flops (FMA) × 2 FPUs = 4 TFlops.
 
It is a 512-bit wide (16 floats) FMA pipe, so you can do either a MUL or an ADD or an FMA, but not one MUL and an independent ADD (as you could with separate MUL and ADD pipes).

So a 1.2 GHz MIC (afaik it's not running at 2 GHz) with 64 cores can reach
64 cores × 16-wide vector × 2 flops (FMA) × 1.2 GHz = 2.46 TFlop/s
as peak throughput in single precision, far less than what you proposed for a 32-core version (with two vector units per core, but Larrabee or MIC just has a single one).
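To make the arithmetic concrete, here is a minimal C sketch of the same peak-throughput calculation. The core count, vector width and clock are the assumed values from this thread, not confirmed specifications:

```c
/* Peak single-precision throughput, using the assumed MIC figures
 * from this thread (not confirmed specifications). */
#include <stdio.h>

int main(void)
{
    int    cores      = 64;   /* assumed core count               */
    int    simd_lanes = 16;   /* 512-bit vector / 32-bit floats   */
    int    fma_flops  = 2;    /* one FMA counts as a MUL + an ADD */
    double clock_ghz  = 1.2;  /* assumed clock                    */

    double gflops = cores * simd_lanes * fma_flops * clock_ghz;
    printf("peak SP: %.1f GFLOP/s (%.2f TFLOP/s)\n",
           gflops, gflops / 1000.0);
    return 0;
}
```

Swapping in the 32 cores, 2 GHz and two FPUs from the earlier post reproduces the 4 TFlops figure.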
 
That would indeed seriously suck, but I don't think it makes sense for them to do that. Quad-core is pretty much mainstream now, and if they want single applications to make use of an increasing number of cores/threads, then HTM becomes vital. I'd even argue that it's less important for the server market, since servers mostly run multiple independent single-threaded applications.
A number of discussions concerning HTM bring up the cost of lock propagation across sockets. In the case of the desktop, propagation of state changes on-die is pretty fast (at least if you're Intel), and there are optimizations that can be put in place to speed up on-die synchronization. There is no strong expectation of client systems returning to multi-socket setups any time soon; they may not return to multi-die setups either.

In the case of HPC and server systems that cross multiple sockets, the propagation time is longer. In the case where hundreds of nanoseconds to microseconds are involved, the overhead of rolling back failed transactions can be masked by general latencies, and the performance upside is higher.

A single desktop CPU might be able to retry locks in much less time with less coherence overhead since it all falls back to a single shared LLC, and typically most latencies are simply shorter by default, so transactions have less idle time to hide in. Given the general burstiness of desktop loads and sensitivity to latency, the upside may be less and Intel might want a product segmentation checkbox.


I did review AMD's ASF a bit more, and while there's no data that indicates BD was intended to have it, there are elements of the design that at the very least do not put roadblocks to its implementation.

ASF promises at a minimum that 4 64-byte cache lines can be designated as speculative locations. This matches the associativity of both the L1 and WCC.
The sizing of the L1 and its associativity also avoid the virtual-index aliasing problem the L1 Icache has, which can matter for ASF, since a false-synonym eviction could cause ASF to fail to meet its minimum capacity guarantee.

ASF, at least initially, was planned to work in the cache miss buffers, and perhaps HTM primitives could as well.
One other possibility is that the L1 can serve as a location for ASF, since it is not purely write-through and is snooped along with the rest of the module's hierarchy.
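To illustrate the pattern, here is a hypothetical C sketch of an ASF-style speculative region with retry. The asf_* functions are invented wrapper names standing in for the SPECULATE/COMMIT primitives of AMD's published ASF proposal; real ASF is exposed as x86 instructions, with LOCK-prefixed loads and stores marking the protected cache lines:

```c
/* A hypothetical C rendering of the ASF retry pattern. The asf_*
 * functions are invented names wrapping the SPECULATE / COMMIT
 * primitives from AMD's ASF proposal. */

extern int  asf_speculate(void);   /* 0 on entry; abort code on rollback */
extern void asf_commit(void);      /* commit the speculative region      */
extern void fallback_lock(void);   /* conventional lock as last resort   */

void update(long *lines[4], long v)
{
    for (int tries = 0; tries < 8; tries++) {
        int status = asf_speculate();
        if (status == 0) {               /* speculative execution begins */
            for (int i = 0; i < 4; i++)  /* touch the protected lines    */
                *lines[i] += v;          /* (LOCK MOVs in real ASF)      */
            asf_commit();                /* atomically publish updates   */
            return;
        }
        /* status reports *why* it aborted (contention, capacity, ...)   */
    }
    fallback_lock();                     /* too much contention: give up */
}
```

The tie-in to the capacity point above: the region only works if the protected lines stay resident, which is why the four-line guarantee maps onto the L1/WCC associativity.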
 
I haven't looked at HTM in detail, but is it anything more than glorified LL/SC over "large" segments of memory?
 
ASF as described involves a partial rollback in the case of failure, and provides feedback to software as to why it failed. It seems that more arbitrary actions can be done in a speculative section, and potentially a section can be written such that it's an LL/SC where the LL is not to the same location as the final SC.
 
I meant something like this: lock a couple of cache lines, let a few arbitrary operations take place on them; if any conflicting transactions occur, report them and roll everything back to the locked state, otherwise store it back to memory.

The details might be somewhat different, but isn't the idea broadly the same as LL/SC?
 
The overarching idea is the same: a transaction does not commit if an area of memory that is part of its footprint is accessed before writeback. Most descriptions of LL/SC seem to treat it as a primitive and restricted form of TM.
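For comparison, this is the single-word pattern LL/SC (or compare-and-swap, its x86 stand-in) gives you, as a minimal C11 sketch; a transaction generalizes the same conflict check from one word to a multi-line footprint:

```c
#include <stdatomic.h>

/* Single-location atomic update: the only state guarded is the one
 * word between the initial load and the conditional store. */
void add_llsc_style(atomic_long *p, long v)
{
    long old = atomic_load(p);                      /* "load-linked"    */
    while (!atomic_compare_exchange_weak(p, &old, old + v))
        ;  /* "store-conditional" failed: *p changed underneath, retry  */
}
```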
 
It is a 512-bit wide (16 floats) FMA pipe, so you can do either a MUL or an ADD or an FMA, but not one MUL and an independent ADD (as you could with separate MUL and ADD pipes).

So a 1.2 GHz MIC (afaik it's not running at 2 GHz) with 64 cores can reach
64 cores × 16-wide vector × 2 flops (FMA) × 1.2 GHz = 2.46 TFlop/s
as peak throughput in single precision, far less than what you proposed for a 32-core version (with two vector units per core, but Larrabee or MIC just has a single one).
What about the scalar unit?
 
It is a small in-order x86 core (claimed to be Pentium Classic based). It's more for managing stuff and does not contribute significantly to the arithmetic computing power (almost the same purpose as the scalar unit in GCN, but more versatile). Anything specific you want to know?
 
Are you sure the kind that Fermi has can be called "wide SIMD" compared to AVX?
You should compare against SSE instead of AVX. AMD proposed FMA support for SSE5, to compensate for Bulldozer essentially having just one 128-bit FP SSE unit per core (two shared ones per module to be exact). AVX support was bolted on later by having the decoders split the instructions into two 128-bit operations. Each core is limited to one 256-bit FP AVX instruction every two cycles.

To really focus on wider SIMD execution they'd have to extend the execution units to 256-bit each.

Fermi essentially has two 512-bit units per core, executing 1024-bit instructions. I don't expect CPUs to become that wide any time soon but support for AVX-512/1024 instructions executed on 256-bit units would be a welcome addition that seems quite feasible for the Skylake timeframe.
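As a concrete illustration of the widths being discussed, a single 256-bit FMA3 intrinsic covers eight SP lanes (this assumes FMA3 hardware and, on GCC/Clang, the -mfma flag):

```c
#include <immintrin.h>

/* One 256-bit FMA: d[i] = a[i] * b[i] + c[i] for 8 float lanes.
 * On 128-bit-wide FP pipes like Bulldozer's, a 256-bit op is cracked
 * into two halves; a native 256-bit unit retires it in one. */
__m256 fma8(__m256 a, __m256 b, __m256 c)
{
    return _mm256_fmadd_ps(a, b, c);   /* vfmadd...ps (FMA3) */
}
```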
 
A single desktop CPU might be able to retry locks in much less time with less coherence overhead since it all falls back to a single shared LLC, and typically most latencies are simply shorter by default, so transactions have less idle time to hide in. Given the general burstiness of desktop loads and sensitivity to latency, the upside may be less and Intel might want a product segmentation checkbox.
Locks become problematic beyond quad-core even for a single die. Hardware transactional memory not only lowers the overhead but also guards against priority inversion, convoying, and deadlock.

So HTM is a vital feature to keep multi-threaded software development worthwhile both from a cost and gain perspective. It's in Intel's best interest not to segment support for it.
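For what such support could look like to software, here is a sketch modeled on Intel's RTM intrinsics (_xbegin/_xend in <immintrin.h>, compiled with -mrtm); whether and when desktop parts get exactly this is the open question here:

```c
#include <immintrin.h>   /* RTM intrinsics: _xbegin / _xend        */
#include <pthread.h>

static pthread_mutex_t fallback = PTHREAD_MUTEX_INITIALIZER;

/* Try the update as a hardware transaction a few times, then fall
 * back to a conventional lock under persistent contention. */
void deposit(long *balance, long amount)
{
    for (int tries = 0; tries < 3; tries++) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            *balance += amount;      /* tracked speculatively        */
            _xend();                 /* commit: no conflict detected */
            return;
        }
        /* 'status' encodes the abort cause (conflict, capacity...)  */
    }
    pthread_mutex_lock(&fallback);   /* give up on speculation       */
    *balance += amount;
    pthread_mutex_unlock(&fallback);
}
```

A production version would also have the transaction read the fallback lock's state so the speculative and locked paths can't race; this just shows the shape of the retry/fallback pattern.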
I did review AMD's ASF a bit more, and while there's no data that indicates BD was intended to have it, there are elements of the design that at the very least do not put roadblocks to its implementation.
It seems obvious they didn't put roadblocks in place for their own proposal, but I suspect that they got word from Intel about something more ambitious...
 
Locks become problematic beyond quad-core even for a single die.
Then that explains why it's probably not a desktop priority for Intel for Haswell.

It seems obvious they didn't put roadblocks in place for their own proposal, but I suspect that they got word from Intel about something more ambitious...
The entire SSE5/AVX/XOP/FMA4/FMA3 debacle happened because this kind of communication didn't happen. It's open to question whether Intel would have disclosed much about HTM, when it is a comparatively more profound change than AVX was, and AMD didn't get good information until its FPU was well and truly screwed.
 
Then that explains why it's probably not a desktop priority for Intel for Haswell.
I disagree. HTM is essential for multi-core scaling for the foreseeable future, including the desktop market. Since the core count is likely to increase again after Haswell, developers need access to affordable hardware to create tools and build scalable applications sooner rather than later. It's in Intel's best interest to make HTM as broadly available as possible.

Considering the validation involved, the massive AVX2 selling point, and zero mention of HTM in the Haswell new instruction announcement, I don't really expect it to appear with Haswell, but if there's any truth to the rumors then I see no good reason why Intel would hold it back for any Broadwell part.
The entire SSE5/AVX/XOP/FMA4/FMA3 debacle happened because this kind of communication didn't happen. It's open to question whether Intel would have disclosed much about HTM, when it is a comparatively more profound change than AVX was, and AMD didn't get good information until its FPU was well and truly screwed.
Unlike SSE extensions, AVX was announced long before hardware supporting it hit the streets. Intel probably released the spec as early as they possibly could. Moving FMA support to AVX2 and then changing it from FMA4 to FMA3 could actually be regarded as proof that the spec was released a tad too soon. So I don't think you can blame Intel for what happened to SSE5; AMD had an inferior extension and it was their decision to push SSE5's remnants into XOP and not go back to FMA3 or drop FMA support for Bulldozer. In other words I doubt the fragmentation would have been prevented by more open communication.
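For reference, the FMA4-to-FMA3 change mentioned above is purely an operand-encoding difference: FMA4 names a separate destination register, while FMA3 overwrites one of its sources. A sketch with the respective intrinsics (FMA4 needs AMD hardware and -mfma4, FMA3 needs -mfma; both are exposed via <x86intrin.h> on GCC/Clang):

```c
#include <x86intrin.h>

__m128 fma4_style(__m128 a, __m128 b, __m128 c)
{
    return _mm_macc_ps(a, b, c);   /* FMA4: vfmaddps dst, a, b, c      */
}                                  /* four registers, non-destructive  */

__m128 fma3_style(__m128 a, __m128 b, __m128 c)
{
    return _mm_fmadd_ps(a, b, c);  /* FMA3: vfmadd213ps, three regs;   */
}                                  /* destination overwrites a source  */
```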

As for HTM: Intel, IBM and Sun founded a "Drafting Group" to communicate about it. So it still seems likely to me that after AMD laid eyes on IBM's design they realized their ASF proposal was inferior (I haven't compared them yet).
 
Unlike SSE extensions, AVX was announced long before hardware supporting it hit the streets.
SSE5 was announced far ahead of any supporting product.
Previous extensions by both parties were not disclosed with similar lead times.

Intel probably released the spec as early as they possibly could. Moving FMA support to AVX2 and then changing it from FMA4 to FMA3 could actually be regarded as proof that the spec was released a tad too soon. So I don't think you can blame Intel for what happened to SSE5; AMD had an inferior extension and it was their decision to push SSE5's remnants into XOP and not go back to FMA3 or drop FMA support for Bulldozer. In other words I doubt the fragmentation would have been prevented by more open communication.
AMD's extension was inferior in part because it was more conservative in the change it brought to the ISA; one of the key differentiators is how little SSE5 changed the bit encoding for new instructions.
Intel devoted more bits and made other changes to clean up the encoding, as well as adding forward-looking things such as the revised register save functions.

If this more significant change had been out in the open earlier, things would have worked out differently.

As for HTM: Intel, IBM and Sun founded a "Drafting Group" to communicate about it. So it still seems likely to me that after AMD laid eyes on IBM's PowerPC A2 design they realized their ASF proposal was inferior.
Let's hope AMD got an advance look at IBM's work, since the announcement happened after BD's specifications were released and the chip had been going through the fabs for multiple quarters.
The advance notice might need to have come even earlier than that, since BD's design would have been firmed up and its features disclosed to interested parties well before the announcement.
 