Programmable shaders existed even before Pixomatic:
Real Virtuality, a far ancestor of SwiftShader, had them. The key to high-performance software rendering is dynamic code generation. Instead of branching everywhere the pipeline can be configured differently, you generate exactly the code needed for a given pipeline configuration. The rest of the challenge is using the CPU's advanced instructions as efficiently as possible.
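To make the specialization idea concrete, here's a minimal C++ sketch. A real renderer JITs machine code at runtime; template instantiation here is just a stand-in for that, and the two-feature pipeline state and shading math are made up purely for illustration:

```cpp
#include <cstdint>

// Hypothetical two-feature pipeline state (names are mine).
struct State { bool textured; bool blended; };

// Generic path: re-tests the configuration for every pixel.
uint32_t shade_generic(const State& s, uint32_t src,
                       uint32_t tex, uint32_t dst) {
    uint32_t c = src;
    if (s.textured) c = tex;                    // texel replaces color
    if (s.blended)  c = (c >> 1) + (dst >> 1);  // crude 50/50 blend
    return c;
}

// Specialized path: one routine per configuration. The tests on
// template parameters fold away, leaving branch-free inner-loop code.
template <bool Textured, bool Blended>
uint32_t shade_fixed(uint32_t src, uint32_t tex, uint32_t dst) {
    uint32_t c = src;
    if (Textured) c = tex;
    if (Blended)  c = (c >> 1) + (dst >> 1);
    return c;
}

using ShadeFn = uint32_t (*)(uint32_t, uint32_t, uint32_t);

// Select the exact routine once per state change, not once per pixel.
ShadeFn select_shader(const State& s) {
    if (s.textured)
        return s.blended ? shade_fixed<true, true> : shade_fixed<true, false>;
    return s.blended ? shade_fixed<false, true> : shade_fixed<false, false>;
}
```

The routine that actually runs per pixel contains no configuration tests at all; the selection cost is paid once per state change.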
Neat: dynamic generation of instructions on 24 Harvard-architecture cores.
Would it involve self-modifying code or some kind of code page selection?
I don't think any recent multithreaded high-end CPUs (including Intel's) have used round-robin multithreading. They use either SMT, which dynamically steers individual instructions to ALUs, perhaps even from different threads in the same cycle, or "switch-on-event" multithreading (CMT), in which one thread runs until it stalls and the core then switches to another thread. I would expect Larrabee to use one of these approaches.
High performance or high throughput, CPU or GPU?
Sun's T1 and T2 chips use a modified variant of fine-grained multithreading: instructions are issued round-robin from a group of active threads, and a thread is demoted from the active group on long-latency events.
On the GPU front, R600 uses two-way FMT within its SIMD units.
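For concreteness, here's a toy C++ sketch of that round-robin-with-demotion policy (class and field names are mine; the four-thread group size matches a T1 core):

```cpp
#include <array>
#include <cstddef>

// Toy model of the T1-style policy described above.
struct HwThread {
    bool active = true;  // still in the round-robin issue group
    bool ready  = true;  // has an instruction ready this cycle
};

class ThreadPicker {
    std::array<HwThread, 4> threads_;
    std::size_t last_ = 0;

public:
    // Long-latency event (e.g. cache miss): demote from the group.
    void demote(std::size_t t)  { threads_[t].active = false; }
    // Event resolved: readmit the thread.
    void promote(std::size_t t) { threads_[t].active = true; }

    // Issue round-robin among active, ready threads; -1 if all stalled.
    int pick() {
        for (std::size_t i = 1; i <= threads_.size(); ++i) {
            std::size_t t = (last_ + i) % threads_.size();
            if (threads_[t].active && threads_[t].ready) {
                last_ = t;
                return static_cast<int>(t);
            }
        }
        return -1;
    }
};
```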
I'm not sure bypassing the cache helps. DRAMs are designed for consecutive bursts. The GDDR4 datasheet referenced earlier in this thread requires bursts of length 8 of 4 bytes each; that is, the minimum you can read out of this DRAM is 32 bytes. A scatter/gather that needs 32 bits from all over will have similar inefficiencies just because of the way DRAM works. Once you already need to grab bursts of data, why not do block-based cache coherence?
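Putting rough numbers on that, with the burst parameters from the datasheet and a made-up 16-element worst-case gather:

```cpp
#include <cstdio>

int main() {
    // From the datasheet: burst length 8, 4 bytes per transfer,
    // so the minimum read is 8 * 4 = 32 bytes.
    const int burst_bytes = 8 * 4;

    // Made-up worst case: a 16-element gather of 32-bit values,
    // each element landing in a different burst.
    const int elems = 16, elem_bytes = 4;
    const int fetched = elems * burst_bytes;  // 512 bytes off the pins
    const int used    = elems * elem_bytes;   //  64 bytes actually wanted
    std::printf("bandwidth efficiency: %d/%d = %.1f%%\n",
                used, fetched, 100.0 * used / fetched);  // 12.5%
}
```

So the worst case uses an eighth of the bandwidth it consumes, cache or no cache.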
My concern is cache pollution in the L1. It's already shared by 4 threads, and now one thread is pulling in a lot of cache lines for just one operation.
If it is known that the cache lines will be reused prior to being evicted, then storing in cache would make sense.
If a lot of the data is discarded, it might make sense to have some kind of buffer that can bypass the cache.
If a gather is a microcoded instruction, it could either be a string of loads or a string of prefetches to a prefetch buffer.
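For what it's worth, the closest thing current x86 offers to that buffer is the non-temporal prefetch hint. This sketch is not how Larrabee would microcode a gather, just the general shape:

```cpp
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_NTA
#include <cstddef>
#include <cstdint>

// Software gather with non-temporal prefetch hints. The NTA hint asks
// the core to stage the line while minimizing pollution of the outer
// cache levels; it's not a true side buffer, but it has the same shape
// as the prefetch-buffer idea above.
void gather_nta(const float* base, const int32_t* idx,
                float* out, std::size_t n) {
    // Pass 1: issue all the prefetches so the misses overlap.
    for (std::size_t i = 0; i < n; ++i)
        _mm_prefetch(reinterpret_cast<const char*>(base + idx[i]),
                     _MM_HINT_NTA);
    // Pass 2: the actual loads, hopefully hitting the staged lines.
    for (std::size_t i = 0; i < n; ++i)
        out[i] = base[idx[i]];
}
```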
I'm not sure I quite follow you. Certainly if you mis-speculate, you might take a hiccup. The same is true for branch prediction, yet if you're mostly right, things work well.
With branch mispredictions, the scope of side effects is somewhat more contained.
Usually, the processor has to stall if the branch reaches the end of the pipeline, the bottom of the instruction window reaches the branch, or the CPU runs out of load/store entries. This might reach high double digits of cycles, worst-case.
If the limit is now the capacity of a multi-kilobyte cache, the stall can potentially be much longer.
For reads, it's a hiccup, but like you said, it's not something that doesn't already happen.
If Larrabee does not use a non-coherent buffer for its speculation, the following might be problems:
Writes under ordinary branch speculation aren't allowed to commit until the branch has been resolved. Speculatively writing to cache, on the other hand, has a wider reach, and implications for the supposed uniqueness of the Modified state.
In a MESI protocol, it would be possible for a speculating thread to invalidate all shared copies of a cache line, then discard the line when it finds the speculation failed. I think it might be safe to keep a copy of the shared lines, but the fact that they are not unique might make this unsafe.
In a MESIF protocol, there is a possible way out, so long as the processor tracks the original value of any Forwarding lines it writes to. On roll-back it could keep the Forwarding lines to seed return values to the various cores whose shared entries were invalidated.
This saves a trip to memory, but it also makes one core a potential hot-spot on the ring bus.
The simplest case would be to just invalidate everything and start from memory.
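A toy two-core model of the hazard, just to pin it down (the enum and function names are mine, not any real protocol code):

```cpp
#include <cstdio>

// Minimal MESI sketch of the problem described above.
enum class Mesi { Modified, Exclusive, Shared, Invalid };

struct Line { Mesi state; int data; };

// The speculating core writes: its copy becomes the unique Modified
// copy, and the peer's Shared copy must be invalidated.
void speculative_write(Line& mine, Line& peer, int value) {
    peer.state = Mesi::Invalid;   // invalidate broadcast
    mine.state = Mesi::Modified;  // now the only copy anywhere
    mine.data  = value;           // speculative, never committed
}

// Speculation fails: the Modified line holds a value that never
// architecturally happened, so the simple fallback is to drop it.
void rollback(Line& mine) { mine.state = Mesi::Invalid; }

int main() {
    Line a{Mesi::Shared, 42}, b{Mesi::Shared, 42};
    speculative_write(a, b, 99);
    rollback(a);
    // No cache holds the line any more: the next reader, on either
    // core, has to go all the way out to memory.
    std::printf("a Invalid? %d  b Invalid? %d\n",
                a.state == Mesi::Invalid, b.state == Mesi::Invalid);
}
```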
The funny case would be a Three Stooges situation, where multiple threads try to speculate on the same lock.
If they follow similar code paths, it's only a matter of time until each thread picks up on the coherency traffic of another thread and all threads involved roll back.
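The standard software answer to that livelock is randomized backoff on abort. A sketch, with attempt as a hypothetical stand-in for whatever commit/abort primitive the hardware would expose:

```cpp
#include <chrono>
#include <functional>
#include <random>
#include <thread>

// After a conflict abort, retry after a randomized, growing delay so
// the threads stop tripping over each other's coherency traffic.
// 'attempt' returns true on commit, false on a conflict abort.
void speculate_with_backoff(const std::function<bool()>& attempt) {
    std::mt19937 rng{std::random_device{}()};
    for (int tries = 0; !attempt(); ++tries) {
        const int cap = 1 << (tries < 10 ? tries : 10);  // capped growth
        std::uniform_int_distribution<int> delay_us(0, cap);
        std::this_thread::sleep_for(
            std::chrono::microseconds(delay_us(rng)));
    }
}
```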
A single speculating thread now has the ability to affect the execution of any number of the 128 threads Larrabee is running.
I'm a believer in good thread citizenship, which is why I like the idea of a buffer space, or some other way to keep the side effects from spilling out through cache coherency until the locking has been confirmed.