What's the actual impact of extra read/write ports (but keeping the same total bus width) on size and performance?
I'm probably not qualified to give a really good answer on that, but perhaps the professor would have a good answer. If gather weren't so expensive, I'm sure you would have seen it in SSE by now...
With 16-wide SIMD units, I expect it can transpose a 4x4 matrix in one instruction. This would make AoS-SoA conversion very fast, and indeed make scatter/gather less of a necessity for everything except texture sampling.
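To make that concrete, here's a plain-C sketch of what that one-instruction 4x4 transpose accomplishes: four xyzw vertices laid out AoS become four SoA lanes of x, y, z, and w. This is just the scalar equivalent, not actual Larrabee code.

```c
#include <assert.h>

/* AoS -> SoA for four xyzw points: the 4x4 transpose that a 16-wide
   register permute could do in one instruction.  Scalar sketch only. */
void aos_to_soa4(const float aos[16],
                 float x[4], float y[4], float z[4], float w[4])
{
    for (int i = 0; i < 4; ++i) {
        x[i] = aos[4 * i + 0];
        y[i] = aos[4 * i + 1];
        z[i] = aos[4 * i + 2];
        w[i] = aos[4 * i + 3];
    }
}
```

After the transpose, one SIMD operation can process the same component of all four vertices at once, which is the whole point of converting to SoA.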
Perhaps Larrabee would have a fully generalized vector permute, but you would need 4 bits * 16 lanes = 64 bits of control for that, so my guess is that vector permute will require an integer register (and not work from an immediate), which would be more expensive. Still, most AoS code will not be able to average above 50% utilization of the 16-wide vectors (meaning actual ALU work done vs. permutation/moving stuff around) even with scatter/gather, so AoS is really not a good path to think about at all.
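Here's what such a generalized permute would look like in scalar C (hypothetical semantics, since nothing about Larrabee's actual permute is confirmed): each 4-bit nibble of a 64-bit control word selects one of the 16 source lanes, which is exactly why the control can't fit in a typical 8-bit immediate.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical generalized 16-lane permute: nibble i of `ctrl` names the
   source lane for destination lane i.  4 bits * 16 lanes = 64 bits of
   control -- too wide for an immediate, hence an integer register operand. */
void vpermute16(const float src[16], float dst[16], uint64_t ctrl)
{
    for (int lane = 0; lane < 16; ++lane)
        dst[lane] = src[(ctrl >> (4 * lane)) & 0xF];
}
```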
I'm not sure I follow here. Render targets are never read directly after being written. The first read as a texture could be in a very different location from the last write. So pixel order doesn't matter at this level.
Perhaps not immediately, but you are going to need texture access to render targets for obvious stuff (like any deferred shading path or image-space post-processing path), so there is no sense in doing another pass just to re-order pixels into the 2D locality required by the texture fetch units.
Not necessarily. The software is not restricted to a single rendering pipeline implementation, so you can specialize for the type of primitive. Generic scatter/gather would be handy, though.
I don't think Larrabee will have single-cycle scatter; instead it will likely require 16 cycles (one lane per cycle), so single pixels will still be slow, especially if blending. But perhaps at least it won't be as bad as the multiple extra instructions needed to emulate scatter on SSE.
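In effect the scatter would behave like this scalar loop, one lane per cycle (a sketch of the assumed microcoded behavior, not documented Larrabee semantics):

```c
#include <assert.h>

/* Scatter emulated one lane at a time -- roughly what a 16-cycle
   microcoded scatter would do.  idx[] holds per-lane element offsets. */
void scatter16(float *base, const int idx[16], const float val[16])
{
    for (int lane = 0; lane < 16; ++lane)  /* one lane per "cycle" */
        base[idx[lane]] = val[lane];
}
```

Note that on SSE the same operation also costs extract/shuffle instructions per lane on top of the stores, which is the comparison being made above.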
And I think it can have an advantage in terms of efficiency: no fixed-function unit ever becomes the bottleneck, no bubbles from register pressure, no need to mask away unused components, overdraw can be reduced to zero, it can specialize for specific scenarios, etc.
Let's not get too optimistic here... for one thing, current GPUs already cache textures about as optimally as you could expect Larrabee to if it has a separate texture cache (if not, texture caching will probably be suboptimal on Larrabee), so Larrabee won't have any advantage in texture fetch performance, and will suffer from the exact same texture bottleneck that current GPUs have. Masking is also a given for shaders that have any divergent branches, and as I described before, the branch granularity will probably need to be large (32, 48, or 64) to have enough instructions after a texture load to hide the latency of going to L2 (and beyond) with only 4 in-order threads per core. Overdraw is application specific, and is only going to increase as GPU performance goes up, as artists and programmers can get away with more transparent effects (adding a fully programmable ROP will accelerate this).
Don't get me wrong, I still think Larrabee will be a very interesting platform to program on.
Attribute interpolation is at most two multiply-add operations; I see no issues there. And I don't think there's any part of the pipeline we haven't discussed yet that isn't obviously fairly efficient to implement on a CPU.
At least to be perspective correct you need more than two MACs, right?
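Right -- the standard trick is to interpolate a/w and 1/w linearly in screen space (one MAC each), then divide per pixel. So it's the two MACs plus a reciprocal and a multiply. A one-dimensional sketch (the 2D case just uses two barycentric MACs per quantity):

```c
#include <assert.h>

/* Perspective-correct interpolation of attribute `a` between two vertices.
   Inputs are the pre-divided a/w and 1/w at each endpoint; t is the
   screen-space interpolation factor. */
float interp_persp(float a_over_w0, float a_over_w1,
                   float inv_w0, float inv_w1, float t)
{
    float a_over_w = a_over_w0 + t * (a_over_w1 - a_over_w0); /* MAC */
    float inv_w    = inv_w0    + t * (inv_w1    - inv_w0);    /* MAC */
    return a_over_w / inv_w;                                  /* rcp + mul */
}
```

The 1/w interpolation and the divide amortize across all attributes of a pixel, which is why the per-attribute cost still rounds to "about two MACs" in practice.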
One thing to remember here is that a software GPU is going to need to spend ALU and memory bandwidth on all the work and routing that a normal GPU does in hardware. Even in this simple case of just shader (after interpolation) and ROP, you probably have the following extra work (very rough estimate):
- 2x load from RT (get dest)
- 4x unpack
- subtract (1-alpha)
- 3x mul (1-alpha)*dest
- 3x mac (alpha)*src+dest ... finish blending for RGB
- 2x pack
- 2x store to RT
perhaps 17 instructions (averaging a little over 1 op/pixel across a 16-wide vector) to do the ROP alone for an RGBA FP16 render target when doing simple alpha blending. Now in the case of a simple particle engine (a common blending case), which involves one texture lookup and perhaps a single multiply by an interpolated color value (from the vertex shader), the majority of the ALU work would be for the ROP and attribute interpolation.
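The arithmetic core of that instruction sequence looks like this per pixel (the FP16 pack/unpack steps are elided here -- dest is assumed already converted to float, and the destination-alpha policy is left alone since it's app-specific):

```c
#include <assert.h>

/* Simple alpha blend matching the sequence above:
   dst.rgb = (1 - src.a) * dst.rgb + src.a * src.rgb */
void rop_blend(float dst_rgba[4], const float src_rgba[4])
{
    float alpha     = src_rgba[3];
    float one_minus = 1.0f - alpha;                 /* subtract */
    for (int c = 0; c < 3; ++c)                     /* RGB only */
        dst_rgba[c] = one_minus * dst_rgba[c]       /* mul      */
                    + alpha * src_rgba[c];          /* mac      */
}
```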
The point is that the extra software GPU work can easily exceed the amount of work actually done in the shader itself. Of course, TEX:ALU ratio issues might help Larrabee hide some of the extra ALU work done in software, but not all of it (in this simple particle engine case the particle texture might mostly be in L1 all the time). Even if Larrabee's peak Gflops is the same as a GPU's, many clock cycles will get eaten by non-shader work.
[edit] Of course, this is just a simple example, which doesn't cover the other important overheads like early-Z, stencil, and everything prior.