I seriously doubt that Larrabee is going to have generalized scatter/gather without a performance hit (just as with SSE); wouldn't it require too many read/write ports on the cache?
What's the actual impact of extra read/write ports (but keeping the same total bus width) on size and performance?
High-performance scatter/gather makes most sense for L1 cache. If all of the elements are located there you want it to be a low latency, high throughput operation (unlike SSE4's extract/insert instructions). When some of the elements are in L2 cache latency will be higher anyway, so it doesn't matter much if it takes a few extra cycles. The next few accesses should hit L1 cache again...
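As a rough sketch of why that matters (the function and names here are mine, nothing Larrabee-specific), this is what a 16-wide gather degenerates into without dedicated hardware:

```c
/* Hypothetical scalar fallback for a 16-wide gather: without hardware
 * support every lane becomes an independent load, so the slowest lane
 * (e.g. one L2 miss among fifteen L1 hits) sets the latency of the
 * whole operation. */
void gather16(float dst[16], const float *base, const int idx[16])
{
    for (int i = 0; i < 16; ++i)
        dst[i] = base[idx[i]];   /* 16 separate cache accesses */
}
```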
But it seems certain to have gather support in the texture units, and I can see some very good reasons to have a separate texture cache (L1).
It would be very interesting to know a bit more about Larrabee's texture samplers. Are they just 'co-processors' to the mini-cores that execute the texture sampling instructions? Are they glorified scatter/gather units and the filtering happens in the cores? How does cache coherency work with the rest of the chip? Etc.
I don't think it would be wise to count on scatter or gather for performance, but rather stick to standard SoA-style coding (and fully aligned, coalesced memory access patterns).
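To be clear about the layout I mean, a minimal illustration (struct names and sizes made up):

```c
/* AoS: fields interleaved per element; loading all x's into a SIMD
 * register needs a gather or a transpose. */
struct vertex_aos { float x, y, z, w; };

/* SoA: one contiguous, aligned array per field; a wide load of x's is a
 * single aligned, cache-line-friendly access. */
struct vertices_soa {
    float x[1024], y[1024], z[1024], w[1024];
};
```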
It seems that if you had access to interleaving and format-conversion instructions, the ROP could easily be built in software. With the speculative mental model I have of Larrabee's design, working on fully aligned 4x4 pixel blocks in the fragment shaders would make sense...
With 16-wide SIMD units, I expect it can transpose a 4x4 matrix in one instruction. This would make AoS-SoA conversion very fast, and indeed make scatter/gather less of a necessity for everything except texture sampling.
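For reference, here's the same idea with today's SSE as a stand-in (Larrabee's actual instruction for this is pure speculation on my part); four xyzw vertices go in, four component vectors come out:

```c
#include <xmmintrin.h>

/* AoS -> SoA for a block of four vertices via a 4x4 transpose. */
void aos_to_soa4(const float *v,   /* 16 floats: x0 y0 z0 w0 x1 y1 z1 w1 ... */
                 __m128 *xs, __m128 *ys, __m128 *zs, __m128 *ws)
{
    __m128 r0 = _mm_loadu_ps(v + 0);
    __m128 r1 = _mm_loadu_ps(v + 4);
    __m128 r2 = _mm_loadu_ps(v + 8);
    __m128 r3 = _mm_loadu_ps(v + 12);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);   /* standard SSE transpose macro */
    *xs = r0; *ys = r1; *zs = r2; *ws = r3;
}
```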
Used as anything other than a GPU, Larrabee would still benefit greatly from scatter/gather instructions though. Think of ray tracing again, or any operation that requires a lookup table. Fast scatter/gather out of the L1 cache would greatly improve average latency. But maybe the texture samplers deal with all of this.
With this ordering, render target (RT) writes would land with good 2D locality for possible future texture reads from those RTs.
I'm not sure I follow here. Render targets are never read directly after being written. The first read as a texture could be in a very different location than the last write. So pixel order doesn't matter, at this level.
ROP blending would probably just be aligned reads/writes directly from the RT. Non-FP32 formats would be messier (extra performance cost) depending on the ISA. FP32 read/write would obviously be native (512-bit aligned R/W), but what about FP16, INT8, INT16? Possibly 256-bit aligned reads, with hardware pack/unpack + format conversion for the 16-bit types (perhaps done with separate instructions).
That seems likely. MMX/SSE has efficient pack/unpack instructions, so I'd expect Larrabee to have something similar.
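For example, something along these lines with today's SSE2 intrinsics as a stand-in (the blend is just an average, purely illustrative of the unpack/blend/pack pattern):

```c
#include <emmintrin.h>

/* Blend 16 INT8 channels from an 8-bit render target: widen to 16 bits,
 * do the arithmetic, pack back down with saturation. */
__m128i blend_avg_int8(__m128i dst_px, __m128i src_px)
{
    __m128i zero = _mm_setzero_si128();
    __m128i d_lo = _mm_unpacklo_epi8(dst_px, zero);             /* 8 x u16 */
    __m128i s_lo = _mm_unpacklo_epi8(src_px, zero);
    __m128i d_hi = _mm_unpackhi_epi8(dst_px, zero);
    __m128i s_hi = _mm_unpackhi_epi8(src_px, zero);
    __m128i lo = _mm_srli_epi16(_mm_add_epi16(d_lo, s_lo), 1);  /* (d+s)/2 */
    __m128i hi = _mm_srli_epi16(_mm_add_epi16(d_hi, s_hi), 1);
    return _mm_packus_epi16(lo, hi);                            /* 16 x u8 */
}
```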
In any case, rendering single points (or lines) is probably going to suffer the same poor performance relative to rendering triangles (most of the parallel shader computations doing nothing) that it does on NVidia and AMD/ATI GPUs.
Not necessarily. The software is not restricted to one rendering pipeline implementation, so you can specialize for the type of primitive. Generic scatter/gather would be handy though.
I think that at least for the fragment shader and ROP parts of a software pipeline, Larrabee could do quite well, setting aside my primary concern about it: hiding texture fetch latency (especially with dependent fetches).
Speculative prefetching FTW!
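Something like this is what I have in mind; purely a sketch, and whether Larrabee leans on software prefetch, hardware threads, or its texture units for this is anyone's guess:

```c
#include <xmmintrin.h>

/* Issue prefetches a few pixels ahead of the one being shaded, so the
 * texel fetch latency overlaps with useful work. */
void shade_batch(const float *texels, const int *addr, int n)
{
    for (int i = 0; i < n; ++i) {
        if (i + 4 < n)
            _mm_prefetch((const char *)&texels[addr[i + 4]], _MM_HINT_T0);
        /* ... shade pixel i using texels[addr[i]] ... */
    }
}
```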
Obviously, peak ALU performance could be around 16 SIMD lanes * 24 cores * 2.5 GHz = 960 GFLOPS, so Larrabee won't be a slouch there (at least compared to 2008 GPUs).
Add dual-issue or multiply-add to the equation and it can play with the big boys.
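Back-of-the-envelope, assuming a single-cycle multiply-add per lane (pure speculation on the issue rate): 16 lanes * 2 flops (MAD) * 24 cores * 2.5 GHz = 1920 GFLOPS.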
And I think it can have an efficiency advantage. A software pipeline never hits a fixed bottleneck in any one stage, doesn't stall with bubbles caused by register pressure, doesn't have to mask away unused components, can reduce overdraw to zero, can specialize for specific scenarios, etc.
I'm most excited about seeing the software evolve. The first version will likely be sucky, the second version could be so-so, the third implementation could be interesting, and the fourth implementation could blow us away. Then again, GPUs might choose a very similar software-oriented route...
This leaves attribute interpolation and all prior parts of the pipeline to talk about...
Attribute interpolation is at most two multiply-add operations per attribute, so I see no issues there. And I don't think there's any part of the pipeline we haven't discussed yet that isn't fairly straightforward to implement efficiently on a CPU.
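Concretely, in plane-equation form (names illustrative):

```c
/* Interpolate one attribute at pixel (x, y): exactly two multiply-adds. */
float interp(float a0, float dadx, float dady, float x, float y)
{
    return a0 + x * dadx + y * dady;
}
```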