Some thoughts:
Incoherent reads are up to 16 times slower than coherent ones on Nvidia (a half-warp's 16 accesses serialise into separate transactions).
(Coherent reads can be slower on Nvidia too if the data isn't partitioned properly across the shared-memory banks, though that seems highly unlikely; why would anyone use shared memory that way? Nvidia supports one broadcast per cycle, so many threads reading the same address are serviced in a single transaction.)
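To make that concrete, here's a minimal CUDA sketch of broadcast vs. bank-conflicted shared-memory reads; the 16-bank assumption matches G80/GT200-era hardware, and the kernel itself is just my illustration:

```
// Broadcast vs. bank-conflicted shared-memory reads (assuming 16
// banks, as on G80/GT200). Kernel and sizes are illustrative only.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void broadcast_vs_conflict(const float *in, float *out)
{
    __shared__ float tile[256];
    int tid = threadIdx.x;
    tile[tid] = in[tid];      // coalesced load into shared memory
    __syncthreads();

    // Broadcast: every thread reads the SAME address; the hardware
    // services the whole half-warp in one transaction.
    float a = tile[0];

    // Conflict: a stride equal to the bank count lands every thread
    // in a half-warp on the same bank, so the 16 accesses serialise,
    // i.e. up to 16x slower.
    float b = tile[(tid * 16) % 256];

    out[tid] = a + b;
}

int main()
{
    const int n = 256;
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = float(i);

    broadcast_vs_conflict<<<1, n>>>(in, out);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```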
Larrabee can hide the latency of incoherent reads because it uses the L1 to accumulate the data before the thread resumes. Nvidia takes incoherent reads on the chin.
Intel's price appears to be fully manual context switching, though, with VPU cycles lost to the switch itself, so Larrabee can't actually hide all of that latency.
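To pin down what that prefetch-then-switch scheme might look like, here's a rough host-side sketch in plain C++; the Fibre struct, run_fibre and the use of an x86 prefetch hint are all my stand-ins, since Larrabee's actual fibre scheduling isn't public:

```
#include <xmmintrin.h>   // _mm_prefetch
#include <cstdio>

struct Fibre {
    const float *addrs[16];   // 16 incoherent gather addresses
    float        acc = 0.0f;  // running result for this fibre
};

// Hypothetical "resume" routine: do the VPU work for a fibre, whose
// gather data should by now be sitting in L1.
static void run_fibre(Fibre &f)
{
    for (int lane = 0; lane < 16; ++lane)
        f.acc += *f.addrs[lane];
}

static void schedule(Fibre *fibres, int n)
{
    for (int i = 0; i < n; ++i) {
        // Touch every line of fibre i's gather so the L1 starts
        // accumulating data while another fibre executes.
        for (int lane = 0; lane < 16; ++lane)
            _mm_prefetch((const char *)fibres[i].addrs[lane], _MM_HINT_T0);

        // The manual context switch: the instructions spent getting
        // here and back are the lost VPU cycles.
        if (i > 0)
            run_fibre(fibres[i - 1]);  // its gather began last iteration
    }
    run_fibre(fibres[n - 1]);          // drain the last fibre
}

int main()
{
    static float data[4096];
    for (int i = 0; i < 4096; ++i) data[i] = 1.0f;

    Fibre fibres[4];
    for (int f = 0; f < 4; ++f)
        for (int lane = 0; lane < 16; ++lane)
            fibres[f].addrs[lane] = &data[(f * 997 + lane * 61) % 4096];

    schedule(fibres, 4);
    printf("fibre 0 acc = %f\n", fibres[0].acc);  // 16.0
    return 0;
}
```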
So, overall, it looks like we'll just have to benchmark it.
Is it 33% of logic or (logic+L1+L2)?
It seems to me that a core is defined as logic + L1 + L2. Each L2 slice is used only by its own core. Cores can only access foreign L2s via the cache-coherency protocol, which is effectively a request to fetch the data and make a local copy.
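A toy model of that, just to make the idea concrete (my own sketch, not Intel's actual protocol): a read that misses the local L2 becomes a coherency request, and the line ends up copied into the local slice:

```
// Toy model of "foreign L2 access is really a fetch-and-copy".
// Everything here is hypothetical; real coherency is far richer.
#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

constexpr size_t LINE = 64;
struct CacheLine { uint8_t bytes[LINE]; };

struct Core {
    std::map<uintptr_t, CacheLine> l2;  // this core's private L2 slice
};

const CacheLine *coherent_read(Core &self, std::vector<Core> &cores,
                               uintptr_t line_addr)
{
    auto hit = self.l2.find(line_addr);
    if (hit != self.l2.end())
        return &hit->second;            // local L2 hit

    for (Core &other : cores) {
        if (&other == &self)
            continue;
        auto remote = other.l2.find(line_addr);
        if (remote != other.l2.end()) {
            // The "access" to a foreign L2 is a request to copy the
            // line into the local slice; all later reads hit locally.
            self.l2[line_addr] = remote->second;
            return &self.l2[line_addr];
        }
    }
    return nullptr;  // would fall through to a memory controller
}

int main()
{
    std::vector<Core> cores(4);
    cores[2].l2[0x1000] = CacheLine{};       // core 2 owns the line

    coherent_read(cores[0], cores, 0x1000);  // core 0 misses, copies
    printf("core 0 has a local copy: %zu\n", cores[0].l2.count(0x1000));
    return 0;
}
```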
Ring bus:
Only 128 GB/sec at 2 GHz??
That's lower than the memory bandwidth of current cards (~142 GB/s on a GTX 280).
What about scaling beyond 24-32 cores? Can they increase bus width in the future?
R600 had a 512+512-bit ring bus too...
If you take into account that a ring bus normally supports multiple packets per direction per clock (between non-overlapping start/end segments), you get more bandwidth. Also, the average trip length per direction is rather less than half the circumference.
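Back-of-envelope for those two effects; the 512 bits per direction and 2 GHz come from the article, while the average-trip fraction of 1/4 is my assumption (packets take the short way round, so trips range from zero to half the circumference):

```
#include <cstdio>

int main()
{
    const double width_bytes   = 512.0 / 8.0;  // 64 B per direction per clock
    const double clock_hz      = 2e9;
    const double per_direction = width_bytes * clock_hz;        // bytes/s

    printf("per direction: %.0f GB/s\n", per_direction / 1e9);  // 128

    // If segments are reused, each direction can carry roughly
    // 1 / trip_fraction packets concurrently.
    const double trip_fraction = 0.25;   // assumed average trip length
    const double directions    = 2.0;
    const double aggregate     = directions * per_direction / trip_fraction;

    printf("aggregate with segment reuse: %.0f GB/s\n", aggregate / 1e9);  // 1024
    return 0;
}
```

So segment reuse could in principle push the aggregate towards 1 TB/s, even though any single link only carries 128 GB/s.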
Interestingly, with the huge amount of bandwidth that Larrabee saves on render-target operations, texturing will take up a far larger proportion of each frame's overall bandwidth than we see on fixed-function GPUs.
So the TUs, while they're likely to be evenly distributed around the ring, will also incur the highest average ring-bus trip lengths. They'll be fetching texels from all MCs and returning results to all cores. That's my impression, anyway.
The TUs have their own cache. I dare say I'm assuming this cache is distributed per TU, though they're likely to be able to share texels amongst themselves.
So, the TUs will be using the ring bus pretty heavily and lowering the effective bandwidth somewhat because of the relatively long trips they'll incur. Per texture result that is (a rough byte tally is sketched after the list):
- a request packet from the core
- a TU cache coherency request and response, if the TUs share texels
- a TU fetch command to multiple MCs (fetch + pre-fetch)
- texels fetched from memory by multiple MCs
- texture results returned to requesting core
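Tallying those up with guessed packet sizes (every byte count below is my own assumption, nothing disclosed by Intel):

```
#include <cstdio>

int main()
{
    // Assumed packet sizes, in bytes, one per list item above:
    const int request      = 32;       // core -> TU request packet
    const int tu_coherency = 32 + 64;  // TU<->TU snoop + possible texel line
    const int mc_command   = 2 * 16;   // fetch + prefetch commands to MCs
    const int texel_lines  = 4 * 64;   // e.g. 4 cache lines for a bilinear quad
    const int result       = 64;       // filtered results back to the core

    const int total = request + tu_coherency + mc_command + texel_lines + result;
    printf("ring bytes per texture result (guess): %d\n", total);  // 480
    return 0;
}
```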
Jawed