HD 5000 series: New architecture or more a major refresh?

HD5k: New archi or a major refresh?


Total voters: 49 (poll closed)
The SIMD can swap threads out every four cycles. I may have been inaccurate in saying sequencers; the terms used were that there are dual arbiter/scheduler pairs per SIMD.
With 16-wide, the work needed in setting up a thread is uniform for all 16 clusters. One and only one instruction group needs to be fetched.
What does that scheduling consist of though?

AFAICS you need 8 flags per thread for the texture lookups (the tex clause limit) and 1 flag for the instruction lookup (let's ignore for a moment that control flow and other instructions have separate caches; that would probably disappear), maintained for the max number of thread contexts which can be in flight at once (128 seems a nice number). On incoming events (lookup completed) you flip a flag; if all flags are set, put the thread context (i.e. program counter, clause type, register window offset for the GPRs and a simple index for the PV/SV register set) on the "to be executed" FIFO. Once a thread is descheduled because of an instruction cache miss or a texture clause, pop a fresh one off the FIFO. This is not a lot of hardware compared to an 8 KB instruction cache (8 KB is small, especially since the VLIW instructions are not compact, but you would presumably have a shared L2 too).
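To make the flag-and-FIFO idea concrete, here is a minimal software sketch of it. This is only an illustration of the scheme described in this post, not how the actual hardware works; the names `ThreadContext`, `ReadyQueue`, `lookup_completed` and `deschedule`, and the bit layout of the flags, are all made up for the example.

```python
from collections import deque
from dataclasses import dataclass

TEX_FLAGS = 8                            # one flag per texture lookup (tex clause limit)
INSTR_FLAG = 1 << TEX_FLAGS              # one extra flag for the instruction lookup
ALL_READY = (1 << (TEX_FLAGS + 1)) - 1   # all 9 flags set
MAX_CONTEXTS = 128                       # "128 seems a nice number"


@dataclass
class ThreadContext:
    # The state named in the post; the field names are hypothetical.
    pc: int
    clause_type: str
    gpr_window: int
    pv_index: int
    flags: int = ALL_READY               # assumed encoding: bit set = lookup complete


class ReadyQueue:
    def __init__(self):
        self.contexts = {}               # thread id -> ThreadContext
        self.fifo = deque()              # the "to be executed" FIFO

    def add(self, tid, ctx, pending_mask=0):
        if len(self.contexts) >= MAX_CONTEXTS:
            raise RuntimeError("no free thread context")
        ctx.flags = ALL_READY & ~pending_mask
        self.contexts[tid] = ctx
        if ctx.flags == ALL_READY:
            self.fifo.append(tid)

    def lookup_completed(self, tid, bit):
        # Incoming event: flip the flag, enqueue once every flag is set.
        ctx = self.contexts[tid]
        was_ready = ctx.flags == ALL_READY
        ctx.flags |= bit
        if not was_ready and ctx.flags == ALL_READY:
            self.fifo.append(tid)

    def deschedule(self, tid, pending_mask):
        # Running thread hit a cache miss or texture clause:
        # mark what it waits on and pop a fresh thread off the FIFO.
        self.contexts[tid].flags = ALL_READY & ~pending_mask
        return self.next_thread()

    def next_thread(self):
        return self.fifo.popleft() if self.fifo else None
```

The per-thread state is a 9-bit mask plus a handful of small indices, and the only decision logic is a compare against `ALL_READY`, which is the point being made about how little hardware this needs.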

The actual decoding of the instructions isn't necessarily a lot of work either ... the beauty of VLIW with dumb hardware: just pass the instruction word through the pipeline, at each stage do some simple boolean logic on the instruction word bits, connect straight to the muxes and you're done. There are almost no interdependencies to worry about. The hardware needs to know the program counter, which registers to use and what type of clause it's in, and for the rest it really doesn't care what came before or what will come after.

That's the kind of simplicity you just can't get with a portable ISA and a processor which has to worry about hazards.
 
What does that scheduling consist of though?

AFAICS you need 8 flags per thread for the texture lookups (the tex clause limit) and 1 flag for the instruction lookup (let's ignore for a moment that control flow and other instructions have separate caches; that would probably disappear), maintained for the max number of thread contexts which can be in flight at once (128 seems a nice number). On incoming events (lookup completed) you flip a flag; if all flags are set, put the thread context (i.e. program counter, clause type, register window offset for the GPRs and a simple index for the PV/SV register set) on the "to be executed" FIFO.
As described, it is not a FIFO. The arbiter and schedulers cooperate to set priorities, and the scheduling hardware has programmable load-balancing. It can pick based on whatever criteria AMD has coded in as relevant, using whatever statistical or resource data is tracked. That would need to be duplicated. I've assumed, perhaps wrongly, that the scheduling hardware is the little box just to the right of the TMU block in the RV770 die shot (there is still no die shot for Cypress), so a decent portion of that would be x16.
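For contrast with a plain FIFO, here is a rough sketch of priority-based selection with a pluggable scoring function. The actual criteria and tracked statistics are not given here, so `PriorityArbiter`, the `score` callback and the `stats` dictionary are purely hypothetical stand-ins.

```python
import heapq
import itertools


class PriorityArbiter:
    def __init__(self, score):
        self._score = score              # hypothetical "programmable" criterion
        self._heap = []
        self._seq = itertools.count()    # tie-breaker: oldest ready thread first

    def make_ready(self, tid, stats):
        # heapq is a min-heap, so negate the score to pop the highest first.
        heapq.heappush(self._heap, (-self._score(stats), next(self._seq), tid))

    def pick(self):
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

Even in this toy form, each ready thread carries a score that has to be computed and compared, which is the extra state and logic that would have to be replicated if scheduling were done per cluster.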
 
Meh ... you could always classify kernels into a couple of classes and maintain one FIFO per class at the SC level, so you can tell an entire SIMD to suspend current threads and switch to a different class without having to micromanage each one.
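A minimal sketch of what I mean, assuming threads are tagged with a kernel class when they become ready. `ClassScheduler`, `switch_class` and the class names used with it are invented for illustration only.

```python
from collections import deque


class ClassScheduler:
    def __init__(self, classes):
        # One ready FIFO per kernel class.
        self.queues = {c: deque() for c in classes}
        self.active = classes[0]

    def enqueue(self, kernel_class, tid):
        self.queues[kernel_class].append(tid)

    def switch_class(self, kernel_class):
        # One coarse decision for the whole SIMD: threads of the old
        # class simply stay parked in their own FIFO.
        if kernel_class not in self.queues:
            raise KeyError(kernel_class)
        self.active = kernel_class

    def next_thread(self):
        q = self.queues[self.active]
        return q.popleft() if q else None
```

The per-thread logic stays a dumb FIFO pop; the only "smart" decision is which class is active, and that can be made once, centrally.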

Anyway, as I said, the decoupled execution would be an option. If strictly managed kernel scheduling would, say, make more efficient use of the buffers between the vertex shaders and pixel shaders, then go ahead and just schedule them en bloc.
 