What does that scheduling consist of though? The SIMD can swap threads out every four cycles. I may have been inaccurate in saying sequencers, as the description given was that there are dual arbiter/scheduler pairs per SIMD.
With 16-wide, the work needed in setting up a thread is uniform for all 16 clusters. One and only one instruction group needs to be fetched.
AFAICS you need 8 flags per thread for the texture lookups (the tex clause limit) and 1 flag for the instruction lookup (let's ignore for a moment that control flow and other instructions have separate caches; that would probably disappear), maintained for the maximum number of thread contexts which can be in flight at once (128 seems a nice number). On an incoming event (lookup completed) you flip a flag, and if all flags are set you put the thread context (i.e. program counter, clause type, register window offset for the GPRs and a simple index for the PV/SV register set) on the "to be executed" FIFO. Once a thread is descheduled because of an instruction cache miss or a texture clause, you pop a fresh one off the FIFO. This is not a lot of hardware compared to an 8 KB instruction cache (8 KB is small, especially since the VLIW instructions are not compact, but presumably you would have a shared L2 too).
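To make the flag-and-FIFO idea concrete, here is a rough software model of it. This is only a sketch of the scheme as I described it, not how the actual hardware is built; the names (`Arbiter`, `park`, `complete`, `next_thread`) and the bit layout (bits 0-7 for texture lookups, bit 8 for the instruction fetch) are made up for illustration.

```python
from collections import deque
from dataclasses import dataclass

TEX_FLAGS = 8                           # assumed: one flag per texture lookup in a clause
INSTR_FLAG = 1 << TEX_FLAGS             # assumed: bit 8 marks "instruction fetch done"
ALL_SET = (1 << (TEX_FLAGS + 1)) - 1    # all 9 flags set means the thread can run
MAX_CONTEXTS = 128                      # the "nice number" of contexts in flight


@dataclass
class ThreadContext:
    pc: int             # program counter
    clause_type: str    # e.g. "alu" or "tex"
    gpr_window: int     # register window offset for the GPRs
    pv_index: int       # index into the PV/SV register set
    flags: int = 0      # readiness flags, one bit per outstanding lookup


class Arbiter:
    """Hypothetical model: per-thread flags plus a 'to be executed' FIFO."""

    def __init__(self):
        self.contexts = {}
        self.ready = deque()

    def park(self, tid, ctx, pending_mask):
        """Deschedule a thread; pending_mask has a 1 for every lookup it waits on."""
        if tid not in self.contexts and len(self.contexts) >= MAX_CONTEXTS:
            raise RuntimeError("no free thread context")
        ctx.flags = ALL_SET & ~pending_mask
        self.contexts[tid] = ctx
        if ctx.flags == ALL_SET:        # nothing outstanding, queue it right away
            self.ready.append(tid)

    def complete(self, tid, bit):
        """Incoming event: a lookup finished, so flip its flag."""
        ctx = self.contexts[tid]
        if ctx.flags == ALL_SET:        # already queued, ignore stray events
            return
        ctx.flags |= 1 << bit
        if ctx.flags == ALL_SET:        # last flag arrived, thread becomes runnable
            self.ready.append(tid)

    def next_thread(self):
        """Pop a fresh thread off the FIFO, or None if nothing is runnable."""
        return self.ready.popleft() if self.ready else None
```

In use, a thread waiting on two texture lookups would be parked with `pending_mask=0b11` and only show up in `next_thread()` after both `complete(tid, 0)` and `complete(tid, 1)` have arrived. The point is that the whole thing is a small table of bits and a queue, nothing that needs associative lookup or dependency tracking.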
The actual decoding of the instructions isn't necessarily a lot of work either ... the beauty of VLIW with dumb hardware: just pass the instruction word through the pipeline, at each stage do some simple boolean logic on the instruction word bits, connect straight to the muxes and you're done. There are almost no interdependencies to worry about. The hardware needs to know the program counter, which registers to use and what type of clause it's in; for the rest it really doesn't care what came before or what will come after.
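As a toy illustration of "pass the instruction word through the pipeline and let each stage look at its own bits", something like the following. The field layout, the opcodes and the helper names are all invented for the example and have nothing to do with the real encoding; it only shows that each stage needs a shift and a mask, not knowledge of neighbouring instructions.

```python
# Hypothetical slot layout (not the real ISA):
#   bits 0-6   src0 register, bits 7-13 src1 register,
#   bits 14-17 opcode,        bits 18-24 destination register
SRC0 = (0, 0x7F)
SRC1 = (7, 0x7F)
OP = (14, 0xF)
DST = (18, 0x7F)

# Made-up opcodes standing in for the ALU's function select mux
ALU_OPS = {
    0: lambda a, b: a + b,
    1: lambda a, b: a * b,
}


def field(word, spec):
    """Extract one bit field: this is all the 'decoding' a stage does."""
    shift, mask = spec
    return (word >> shift) & mask


def encode(op, dst, src0, src1):
    """Pack a slot using the assumed layout above."""
    return src0 | (src1 << 7) | (op << 14) | (dst << 18)


def fetch_stage(word, gprs):
    # Operand fetch only looks at the two source fields
    return gprs[field(word, SRC0)], gprs[field(word, SRC1)]


def execute_stage(word, a, b):
    # Execute only looks at the opcode field
    return ALU_OPS[field(word, OP)](a, b)


def writeback_stage(word, gprs, result):
    # Writeback only looks at the destination field
    gprs[field(word, DST)] = result


def run_slot(word, gprs):
    """Push one instruction word through all three stages in order."""
    a, b = fetch_stage(word, gprs)
    writeback_stage(word, gprs, execute_stage(word, a, b))
```

With `gprs[1] = 3` and `gprs[2] = 4`, running `encode(1, 5, 1, 2)` leaves 12 in `gprs[5]`. No stage ever checks another instruction for hazards, which is the simplification the compiler buys you.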
That's the kind of simplicity you just can't get with a portable ISA and a processor which has to worry about hazards.