> I suppose since neither Larrabee nor at-cost Xeons and multi-socket boards are commercially procurable (unless you're building a top500 supercomputer), we can call this a draw.

I'm not saying that at all. I'm saying that cost to produce is more relevant for comparing the efficiency of a processor than cost to the consumer, which includes profit margins.
> 1/4 of the threads can execute a pair of instructions a clock, that is 64 instructions per clock on dual-issue cores. For the usage we are debating, this is more achievable since there is so little peak issue width to waste. For 8 i7 cores, some combination of threads can at most execute 32 in aggregate, maybe 40 if we get optimal macro-op fusion. This is less reachable and less relevant for a highly thread-parallel workload.

The different hardware threads are just there to cover various latencies, etc. They cannot all execute an instruction in the same clock. See the Larrabee architecture paper.
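The issue-rate figures being debated can be sketched as a back-of-envelope calculation. The 32-core Larrabee and 8-core (dual-socket quad) Nehalem configurations are assumptions for illustration, not disclosed specs:

```python
# Peak issue rates implied by the numbers above (core counts are
# illustrative assumptions: 32 Larrabee cores, 8 Nehalem cores).

larrabee_cores = 32        # assumed chip configuration
threads_per_core = 4       # 4 hardware threads per Larrabee core
issue_width = 2            # dual-issue in-order pipeline

# Only one thread per core issues in a given clock; with 4 threads
# per core that is 1/4 of all threads issuing at once.
threads_issuing = larrabee_cores * threads_per_core // 4
larrabee_peak = threads_issuing * issue_width      # 64 instructions/clock

nehalem_cores = 8          # dual-socket quad-core i7
nehalem_width = 4          # 4-wide issue per core
nehalem_peak = nehalem_cores * nehalem_width       # 32 instructions/clock
nehalem_fused = nehalem_cores * 5                  # ~40 with macro-op fusion

print(larrabee_peak, nehalem_peak, nehalem_fused)  # 64 32 40
```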
> Larrabee's not sufficiently threaded to cover any significant amount of memory latency, and neither is Nehalem.

To weigh the benefit of these "HW thread" implementations, then, you need to consider the memory architecture, which is very different between the two. Sure, Larrabee theoretically has twice as many hardware threads with which to hide latencies, but GPU memory latencies are typically far more than 2x longer than CPU latencies.
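A rough way to see why thread count alone doesn't settle this: if a thread does W clocks of useful work between misses of latency L, keeping the pipeline busy takes about 1 + L/W threads. The latencies below are illustrative guesses, not measured figures:

```python
# Threads needed to hide a miss, given work between misses.
# Latency/work values are assumptions chosen only to show the shape
# of the trade-off, not disclosed Larrabee or Nehalem numbers.

def threads_to_hide(latency_clocks, work_clocks):
    # 1 thread doing work, plus enough others to cover the stall
    return 1 + latency_clocks / work_clocks

print(threads_to_hide(400, 50))  # GPU-class latency: ~9 threads needed
print(threads_to_hide(150, 50))  # CPU-class latency: ~4 threads needed
```

Under these assumptions, Larrabee's 4 threads fall well short of GPU-class latencies, while Nehalem's 2 fall short even of CPU-class ones, which is the point of the quoted remark.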
Larrabee's hardware prefetch capabilities were never disclosed and may not have been fully developed. Given the penalty prefetching can impose on bandwidth-heavy workloads, I wouldn't expect it to be as aggressive as Nehalem's.
In terms of memory accesses, Nehalem has 48 entries in its load buffer, which across 8 cores is an aggregate 384 outstanding loads. Larrabee is in-order and would not attempt that level of memory speculation. At a bare minimum, I'd expect at least one outstanding load per thread, for at least 128.
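Little's law connects those outstanding-load counts to sustainable bandwidth: bytes in flight divided by the time each stays in flight. Note the caveat that load-buffer entries overstate true outstanding misses (miss-handling registers are the real limit), and the latencies below are assumptions:

```python
# Little's law sketch: bandwidth sustainable from outstanding misses.
# Latencies are illustrative assumptions; load-buffer entries are an
# upper bound on actual outstanding misses, not a measured figure.

line_bytes = 64  # cache line size

def sustained_gbps(outstanding, latency_ns):
    # bytes in flight / time each byte stays in flight; bytes/ns == GB/s
    return outstanding * line_bytes / latency_ns

nehalem_outstanding = 8 * 48     # 8 cores x 48-entry load buffer = 384
larrabee_outstanding = 128 * 1   # 128 threads x 1 outstanding load each

print(sustained_gbps(nehalem_outstanding, 60))    # assumed ~60 ns DDR3 latency
print(sustained_gbps(larrabee_outstanding, 200))  # assumed longer GDDR latency
```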
On the other hand, if Larrabee has a memory bus roughly equivalent to other GPUs', it would take between 2 and 4 Nehalems to match its bandwidth.
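The "2 to 4 Nehalems" figure falls out of typical numbers for the era; both values below are assumptions, not Larrabee specs:

```python
# Rough bandwidth comparison behind the "2-4 Nehalems" estimate.
# Both figures are era-typical assumptions, not disclosed specs.

nehalem_bw = 3 * 1333e6 * 8 / 1e9  # triple-channel DDR3-1333: ~32 GB/s
gpu_bw = 115.0                     # GDDR5 board in the ~100-150 GB/s class

print(gpu_bw / nehalem_bw)         # ~3.6 Nehalem sockets' worth
```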
> I've already made note of how horrid Larrabee would be on x86 code without vector instructions, in part because doing so leaves very little else it can run.

Again, I'm not really arguing with you per se, just pointing out that the real power of throughput architectures is the SIMD. They make trade-offs that leave them much less impressive when running (even multi-threaded) scalar C code.
However, even inefficient usage of the VPU, like say predicating 3/4 of the lanes off, would still give it some interesting strengths.
edit:
Even with 15/16 lanes off, if only because I don't know what other FP math support it would have.
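The predication point can be made concrete: each vector op still retires one op per active lane per clock, so even a mostly-masked VPU keeps scalar-FP-level throughput. The 16-lane width is Larrabee's; the mask fractions are the ones discussed above:

```python
# Useful FP throughput per clock under predication, for a 16-lane VPU
# issuing one vector op per clock. FMA is counted as 2 flops per lane.

lanes = 16

def useful_flops_per_clock(active_lanes, fma=False):
    # only unmasked (active) lanes do useful work each clock
    return active_lanes * (2 if fma else 1)

print(useful_flops_per_clock(lanes // 4))       # 3/4 of lanes off -> 4
print(useful_flops_per_clock(1))                # 15/16 off -> 1, scalar-FP rate
print(useful_flops_per_clock(lanes, fma=True))  # fully occupied FMA -> 32
```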
Nehalem could do an FP MUL, ADD, and store per clock.
Larrabee can do one VPU op and a vector store per clock.
If it's an FMADD, then each Larrabee core has similar throughput, although in more restricted circumstances.
If there is no dependent add, then it's either an FMUL and store, or an FADD and store.
That's 2/3 the issue capability of a Nehalem core, and in the case of a comparison against a dual-socket i7, that's with 4 times as many cores.
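Tallying the per-core, per-clock issue slots from the comparison above (the slot counts are from the argument in this post, and the 32-vs-8 core counts are the same illustrative assumption as before):

```python
# Per-core, per-clock FP-relevant issue slots as argued above.
# Slot counts follow the post's reasoning; core counts are assumed
# configurations (32-core Larrabee vs dual-socket quad-core i7).

nehalem_slots = 3        # FP MUL + FP ADD + store per clock
larrabee_fma_slots = 3   # FMADD (counts as MUL+ADD) + store
larrabee_nofma_slots = 2 # FMUL-or-FADD + store

print(round(larrabee_nofma_slots / nehalem_slots, 2))  # 0.67, i.e. 2/3

larrabee_cores, i7_cores = 32, 8
print(larrabee_cores / i7_cores)                       # 4.0x the core count
```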