A few comments and thoughts - first of all your terminology is very very confused, largely due to the fact that all the different GPU guys have bizarre and non-standard terminology. Trying to straighten that out is a pretty heroic effort.
I'm mostly interested in the vector portion of each core, to make a broad comparison of implementation cost.
I'm basing the terminology on Intel's, used in describing Larrabee, the only sane terminology out there.
A core does instruction decoding; a thread has a program counter; and each thread consists of strands that populate SIMD-ALU lanes with work, taking multiple issue cycles to run basic instructions (e.g. single-precision ADD is 1 issue cycle on Larrabee, 2 on GF100 and 4 on R800).
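To make that arithmetic concrete, here's a quick sketch (plain C, entirely mine) of how the strand counts fall out of SIMD width times issue cycles - which also gives you the familiar warp and wavefront sizes:

#include <stdio.h>

/* strands per thread = SIMD lanes x issue cycles per basic instruction */
int main(void)
{
    struct { const char *chip; int lanes, issue_cycles; } chips[] = {
        { "Larrabee", 16, 1 },   /* 16 strands/thread                    */
        { "GF100",    16, 2 },   /* 32 strands/thread (the "warp")       */
        { "R800",     16, 4 },   /* 64 strands/thread (the "wavefront")  */
    };
    for (int i = 0; i < 3; i++)
        printf("%-8s: %2d lanes x %d issue cycles = %2d strands per thread\n",
               chips[i].chip, chips[i].lanes, chips[i].issue_cycles,
               chips[i].lanes * chips[i].issue_cycles);
    return 0;
}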
A kernel is an entire, independent set of instructions. There's a terminological problem relating to scope of execution here, though: Larrabee seemingly supports 128 kernels (32 cores x 4 kernels per core), NVidia supports 16 kernels (1 per core), and ATI probably supports 8 kernels, with any subset of those 8 (1 to 8) runnable per core.
There's also a doubt in my mind over whether ATI truly supports multiple compute kernels per core. R600 supports 8 render states per GPU, where each render state can have a distinct VS, GS, PS etc. It'd be logical that up to 8 compute kernels could be scheduled on a core in R800, but there's only a very vague statement of multiple-kernel support so far.
ATI
- ATI's cores are not the VLIWs, they are the SIMDs.
- It's not really clear how the calls work
- ATI has 128KB of cache per memory controller
- It's a wavefront, not a thread.
- The shared memory is ~the same as NV's; the difference is shared memory per vector lane.
- Instruction issue in ATI is VLIW (a variable-length VLIW, at that). The SIMD is 16 lanes wide - for some reason I forgot to mention that detail. Though it turns out all 3 architectures are 16 lanes wide, so that in itself isn't a point of distinction.
- Calls are clearly described in the ISA document - they're static and support recursion through the Sequencer's stack.
- The 128KB cache per controller is outside of the cores and, since it isn't read/write, doesn't really have meaningful functionality from the point of view of the core.
- I'm not interested in "wavefront"; it's just a stupid name for thread.
- Shared memory is smaller than in GF100 (up to 48KB per 32 threads/1024 strands - see the rough per-strand numbers just below). There are coding implications in the way LDS operates in comparison with shared memory in NVidia, but I didn't want to go into those.
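As a rough feel for the "shared memory per strand" point: the 48KB/1024-strand figure for GF100 comes from above, while the 32KB LDS-per-core figure for R800 and the matching 1024-strand residency are my assumptions, used only to make the division comparable.

#include <stdio.h>

/* bytes of shared memory per strand = shared bytes per core / resident strands */
int main(void)
{
    /* GF100: up to 48KB serving 32 threads x 32 strands = 1024 strands (from the text). */
    printf("GF100: %d bytes/strand\n", 48 * 1024 / 1024);                    /* 48 */

    /* R800: 32KB LDS per core is my assumption, as is an equal 1024-strand
       residency - the real figure depends on how many wavefronts you keep
       resident per core. */
    printf("R800 : %d bytes/strand (assumed figures)\n", 32 * 1024 / 1024);  /* 32 */
    return 0;
}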
Intel
- There are 16 vector lanes, not threads
- The L1D is shared memory
- The L1D can be accessed every cycle by the vector pipe
- Intel threads != NV or ATI threads
- The SIMD is 16 wide and there are 16 strands per thread, so one issue cycle per instruction.
- Lines can only be locked in L2, so L1 isn't truly shared memory. GF100 is providing locking in L1 (though granularity of locking is not very exciting).
- This is similar to how shared memory works.
- Rather meaningless statement. In terms of the vector unit, they are the same in all meaningful senses; Intel merely has only 4 of them. In order to improve latency hiding, programmers are forced to use fibres (a purely software construct, so arbitrary in number) to share a thread's execution allocation - a rough sketch of the idea follows below.
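For the curious, a minimal sketch of the fibre idea, assuming nothing beyond what's said above (the names and the count of 4 fibres are mine): one hardware thread strip-mines its work across several software fibres and round-robins between them, so the gather latency of one fibre is covered by ALU work on the others.

#include <stddef.h>

#define NUM_FIBRES 4   /* purely a software choice - pick whatever hides your latency */

typedef struct {
    size_t cursor;   /* where this fibre is in its slice of the data */
    float  acc;      /* per-fibre partial result                     */
} fibre_state;

void process(const float *data, size_t n, float *out)
{
    fibre_state f[NUM_FIBRES] = { 0 };
    for (int i = 0; i < NUM_FIBRES; i++)
        f[i].cursor = i;                  /* interleave the fibres' slices */

    int active = NUM_FIBRES;
    while (active > 0) {
        active = 0;
        /* Round-robin: each pass touches every fibre once, so the load issued
           for one fibre overlaps with the arithmetic done for the others.    */
        for (int i = 0; i < NUM_FIBRES; i++) {
            if (f[i].cursor < n) {
                f[i].acc += data[f[i].cursor];   /* stand-in for real work */
                f[i].cursor += NUM_FIBRES;
                active++;
            }
        }
    }

    for (int i = 0; i < NUM_FIBRES; i++)
        out[i] = f[i].acc;
}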
I could also have more explicitly compared the various forms of gather and scatter.
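To be clear about what I mean by those terms, in scalar C terms (function names are mine): gather is 16 lanes each loading from its own address in one instruction, scatter is 16 lanes each storing to its own address.

void gather16(float *dst, const float *src, const int idx[16])
{
    for (int lane = 0; lane < 16; lane++)
        dst[lane] = src[idx[lane]];      /* 16 independent loads  */
}

void scatter16(float *dst, const float *src, const int idx[16])
{
    for (int lane = 0; lane < 16; lane++)
        dst[idx[lane]] = src[lane];      /* 16 independent stores */
}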
Intel has two gather paths (texture units and direct, though the texturing path effectively overloads the cache system/ring-bus - it's unclear if the TUs are useful for non-texturing fetches), while the gather path in ATI and NVidia appears to be shared with each core's dedicated texture unit (i.e. fetches without filtering). All rates are 16 scalars (32-bit) per clock - though we're waiting to see the LSU clock speed in GF100.
Scatter appears to be 16 per clock in both Intel and NVidia, while in ATI it appears to be 64 across the entire GPU (i.e. only 2 cores can scatter at any time, at a rate of 32 scalars per clock per core). ATI caching appears to be only cursory here (only at the MCs, for coalescing - though there's a question mark over how global atomics are implemented and over the read-back of global read/write resources in general), whereas Intel and NVidia have dedicated caching per core. Clearly there's not enough off-die bandwidth for all Intel and NVidia cores to scatter into memory simultaneously.
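Putting rough chip-wide numbers on that, in 32-bit scalars per clock: the core counts of 32 for Larrabee and 16 for GF100 are from above, 20 cores for R800 is my assumption, and the clock domains differ, so treat this as a shape comparison rather than bandwidth. The per-core scatter rates for Intel and NVidia also can't actually be sustained chip-wide, per the bandwidth point above.

#include <stdio.h>

int main(void)
{
    struct {
        const char *chip;
        int cores;
        int gather_per_core;   /* 32-bit scalars per clock, per core   */
        int scatter_chipwide;  /* 32-bit scalars per clock, whole GPU  */
    } gpu[] = {
        { "Larrabee", 32, 16, 32 * 16 },  /* scatter assumed to scale per core */
        { "GF100",    16, 16, 16 * 16 },  /* ditto                             */
        { "R800",     20, 16, 64      },  /* capped at 64/clk across the chip  */
    };

    for (int i = 0; i < 3; i++)
        printf("%-8s gather %3d scalars/clk, scatter %3d scalars/clk (chip-wide)\n",
               gpu[i].chip, gpu[i].cores * gpu[i].gather_per_core,
               gpu[i].scatter_chipwide);
    return 0;
}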
I could have included constant and instruction caches in my comparison, but I decided they're probably too small and aren't key points when comparing implementation cost.
Jawed