I was just reading up on a number of posts from a few days ago, and I got a bit confused. I hope someone can take the time to explain a few things to me.
From what I've read, R600 (and derivatives) looks a bit like this:
R600 has 4 SIMD arrays of 16 5-wide ALU blocks. Each SIMD unit runs a batch, which would be one thread of instructions on 64 unique data-objects (64 pixels/vertices/primitives). There are only 16 units in the SIMD, so it takes 4 loops to complete one instruction. After that, either the next instruction or another thread/batch is scheduled, for instance when the thread has to wait for a texture lookup. Am I correct so far? Please correct me where I'm wrong, I'm definitely not at home in architecture land.
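To check my own arithmetic, here is a small Python sketch of how I picture it. The names are my own and purely hypothetical, and the numbers are just the ones from my description above, not from any official documentation.

```python
# Hypothetical model of my description above; not taken from any vendor documentation.
SIMD_ARRAYS = 4        # SIMD arrays in R600, as I understand it
LANES_PER_SIMD = 16    # 5-wide ALU blocks per SIMD array
BATCH_SIZE = 64        # pixels/vertices/primitives per batch


def passes_per_instruction(batch_size: int, lanes: int) -> int:
    """Passes ("loops") one SIMD needs to apply one instruction to a whole batch."""
    return -(-batch_size // lanes)  # ceiling division


def objects_in_flight(simd_arrays: int, batch_size: int) -> int:
    """Data objects covered if every SIMD array is running exactly one batch."""
    return simd_arrays * batch_size


print(passes_per_instruction(BATCH_SIZE, LANES_PER_SIMD))  # 4
print(objects_in_flight(SIMD_ARRAYS, BATCH_SIZE))          # 256
```

If this picture is right, a 16-wide batch would need only one pass per instruction.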
If the above story is correct, it's got me a bit confused. Why have batches of 64? Wouldn't it be more efficient to use 16-wide batches, thus giving better branching granularity?
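What I mean by branching granularity can be sketched like this. It is a simplified model under my own assumptions: a batch whose members disagree on a branch runs both sides with the unneeded lanes masked off, and everything else (scheduling cost, number of batches to track) is ignored. The function name and the example pattern are made up for illustration.

```python
def wasted_slots(taken, batch_size):
    """Count data objects that sit through a branch side they do not need.

    Assumption: a batch with mixed branch outcomes executes both sides,
    so every member of that batch wastes one side.
    """
    wasted = 0
    for start in range(0, len(taken), batch_size):
        batch = taken[start:start + batch_size]
        if any(batch) and not all(batch):  # mixed outcome: batch diverges
            wasted += len(batch)
    return wasted


# Hypothetical example: 256 pixels, only the first 40 take the branch.
taken = [i < 40 for i in range(256)]
print(wasted_slots(taken, 64))  # 64
print(wasted_slots(taken, 16))  # 16
```

In this toy case the 16-wide batches waste a quarter of what the 64-wide batches do, which is why I would naively expect smaller batches to be better.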
Also, I read this post by Jawed:
"I've been talking about the way the TU is constructed, hypothesising that it's a monolithic unit in RV670, with each TEX instruction running for 4 clocks. If RV770 is the same, then this enforces a batch size of 128 on the SIMDs (since a TU batch is assumed to be 32 wide * 4 clocks). So the basic design choices restrict the options for SIMD width. Only 5 SIMDs each 32 wide fits."
This got me a bit confused. If the TEX instruction takes 4 clocks, does that mean no other TEX instruction can be executed for 4 clocks? If so, wouldn't that mean each batch (on the ALUs) would have to wait after each TEX instruction? That wouldn't fit with what I thought above about each batch running the same operation for four cycles. Or is a TMU pipelined somehow, accepting a new command each clock, but taking 4 clocks before it's done?
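To make the two readings in my question concrete, here is a rough Python sketch. Both models are hypothetical and only illustrate the difference I'm asking about; I'm not claiming either is how the real hardware works. The first function just reproduces the arithmetic from Jawed's post (32 wide * 4 clocks).

```python
TU_WIDTH = 32     # assumed TU batch width, from the quoted post
TEX_CLOCKS = 4    # assumed clocks per TEX instruction, from the quoted post


def implied_batch_size(tu_width: int, clocks_per_tex: int) -> int:
    """Batch size implied by a TU of this width running one TEX for this many clocks."""
    return tu_width * clocks_per_tex


def clocks_unpipelined(n_tex: int, clocks_per_tex: int) -> int:
    """Reading 1: the unit is busy for the whole TEX before it accepts the next one."""
    return n_tex * clocks_per_tex


def clocks_pipelined(n_tex: int, clocks_per_tex: int) -> int:
    """Reading 2: a new TEX is accepted every clock; each still takes clocks_per_tex."""
    if n_tex == 0:
        return 0
    return clocks_per_tex + (n_tex - 1)


print(implied_batch_size(TU_WIDTH, TEX_CLOCKS))  # 128
print(clocks_unpipelined(8, TEX_CLOCKS))         # 32
print(clocks_pipelined(8, TEX_CLOCKS))           # 11
```

For 8 back-to-back TEX instructions the two readings give 32 versus 11 clocks, so which one applies matters a lot for how often the ALUs would have to wait.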
If you don't mind me asking some more questions, how do memory latencies fit into this story? Is the TMU not directly working on the textures in memory? Does it assume the right data is already in some cache, waiting to be used? Otherwise I cannot see how each operation could always take exactly four cycles.