Register files have a burst length? I thought it was simply 1 32-bit read per clock per bank?
Each H clock, MAD needs 3 scalar operands for each of 8 lanes, which is 24 operands. Each T clock, which is ~2 H clocks, MAD needs 48 operands.
Similarly, each H clock MI needs 1 operand for each of 2 lanes (for transcendentals), which is 2 operands per H clock, or 4 per T clock.
If MI is doing MUL, then it's 2 operands for each of 2 lanes, i.e. 8 operands per T clock.
So the worst case is 48 operands per T clock for MAD plus 8 operands per T clock for MUL = 56 operands.
So each T clock the register file needs to produce at least 56 operands. With 16 banks each producing a burst of 4 scalars per T clock, it delivers 64, which covers all these cases.
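As a sanity check on that arithmetic, a minimal sketch (assuming the lane counts and 2:1 clock ratio described above):

Code:
# Operand-rate arithmetic for one cluster (assumptions: MAD has 8 lanes,
# MI has 2 lanes, and one T clock spans 2 H clocks, as described above).
H_PER_T = 2

mad_ops = 3 * 8 * H_PER_T             # 3 operands x 8 lanes x 2 H = 48 per T
mi_transcendental = 1 * 2 * H_PER_T   # 1 operand x 2 lanes x 2 H = 4 per T
mi_mul = 2 * 2 * H_PER_T              # 2 operands x 2 lanes x 2 H = 8 per T

worst_case = mad_ops + mi_mul         # 48 + 8 = 56 operands per T clock

banks, burst = 16, 4
rf_ops_per_t = banks * burst          # 16 banks x burst 4 = 64 operands per T

assert worst_case <= rf_ops_per_t     # 56 <= 64: the register file keeps up
print(mad_ops, mi_transcendental, mi_mul, worst_case, rf_ops_per_t)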
From where I'm sitting nothing has changed in the context of that patent. There are still two pipelines and there is still a 2:1 clock ratio.
Your earlier question about removal of MI essentially affects the need for a convoy, as the convoy is constructed specifically to interleave a pair of batches across MAD and MI.
I've also taken this as an optimisation for operand fetching, as by pairing up two batches in a convoy you can use a burst to read "half" of each batch's operands. This means that when a batch is reading registers from all over the register file, the addressing rate (1 per T per bank) and the burst length are less likely to produce surplus operands.
The ideal case (in G80), with a burst length of 4, is four batches interleaved in the register file. This allows:
MAD r0, r1, r5, r9
RCP r13, r19
where r1, r5, r9 and r19 are fetched on four consecutive Ts. This produces 16 banks * burst length 4 * T count 4 = 256 operands. That's enough for 4 batches of 16, where each batch wants 52 operands (4 * 52 = 208, comfortably within the 256).
The resulting data will feed a pair of convoys over 4 consecutive Ts:
Code:
     MAD            MI
     (r1, r5, r9)   r19
========================
T0   A              B
T1   B              A
T2   C              D
T3   D              C
Obviously this requires that MAD and RCP are fully independent instructions (as they are in this example).
So alternating MAD+MI and interleaving batches in register-file bursts synergistically maximises the bandwidth utilisation of the register file and operand collection.
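A minimal sketch of that accounting, taking the schedule from the table above (the per-T MI slice of 4 elements is my reading of it: 2 lanes x 2 H clocks):

Code:
# Convoy schedule over 4 consecutive T clocks (from the table above).
# Each T, MAD runs a full 16-wide batch (3 operands per element) and MI
# runs a 4-element slice of the paired batch (1 operand each for RCP).
schedule = [("A", "B"), ("B", "A"), ("C", "D"), ("D", "C")]  # (MAD, MI) per T

MAD_OPS_PER_T = 16 * 3   # 48: 16 elements x 3 operands
MI_OPS_PER_T = 4 * 1     # 4: 2 lanes x 2 H clocks x 1 operand

demand = sum(MAD_OPS_PER_T + MI_OPS_PER_T for _ in schedule)  # 4 x 52 = 208
supply = 16 * 4 * 4      # 16 banks x burst 4 x 4 Ts = 256

print(f"demand {demand}, supply {supply} -> headroom {supply - demand}")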
Of course this all starts to fall apart with branch divergence...
What's interesting is that a pair of batches joined to make a convoy makes a de facto batch. If branch divergence affects one of the batches, the entire convoy is affected.
Even more simply, if the batches in a convoy diverge, e.g. batch A takes the THEN clause (MAD r0, r1, r2, r3) while batch B takes the ELSE clause (MAD r5, r6, r7, r8), then the operand collector is effectively trying to fetch operands for independent instructions on the same T clocks, and so it will prolly run out of bandwidth.
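A rough sketch of why it runs out, counting distinct registers against the addressing rate above (the interleaved layout where one burst serves 4 batches is an assumption carried over from the ideal case):

Code:
# Why a diverged convoy runs out of addressing bandwidth (a sketch; the
# 1-address-per-bank-per-T rate and burst of 4 come from the figures above).
ADDRESS_SLOTS = 4        # 4 Ts in the window x 1 address per bank per T

# Converged: one MAD + one RCP -> 4 distinct registers (r1, r5, r9, r19).
# Each register costs one address per bank; burst 4 covers all four batches.
converged_registers = 4  # fits: 4 <= 4

# Diverged: A's MAD reads (r1, r2, r3), B's reads (r6, r7, r8) -> at least
# 6 distinct registers, but still only 4 address slots per bank.
diverged_registers = 6   # doesn't fit: 6 > 4, so fetching stalls

print(converged_registers <= ADDRESS_SLOTS)  # True: fits the window
print(diverged_registers <= ADDRESS_SLOTS)   # False: not enough addresses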
This makes me think that GT200 is merely an enforced-convoy architecture with a batch size of 16 - but for branch divergence and memory operations it counts as having a batch size of 32. Now it could be that GT200 actually has a baseline batch size of 32, making a convoy 64 elements. That would make for an effective batch size of 64. Dunno.
Whereas I think RV770 has an effective batch size of 128.
The definition of convoy and supergroup are still the same. Even if GT200 issues everything as a supergroup the definition doesn't change as a result and there's no bearing on the behavior of future iterations that follow the G8x model.
I think you might have a point here.
I'm still trying to reconcile the scaling introduced by "supergroup" (i.e. when would it be used) with the fact that GT200 supposedly issues a single batch instruction for 4 H clocks, not 2 as in G80.
My interpretation is that GT200 has 32-wide batches that are convoyed to make a super-batch of 64.
An alternative interpretation is that a 32-wide batch is formed from a convoy, indivisibly. Some kind of internal change has been made that enforces this scheduling. G80 didn't enforce convoys, it seems (vertex shader batches seem to be truly 16 in size).
Yeah but that's based on the stride between data elements within a single half warp. Still don't get the relationship to the block size.....
Think of it as a 3-dimensional block of memory, where you're allowed to cut any plane you like as long as it only uses 2 dimensions to address. The interleaving in figure 5.6 shows how the banks work independently. If you replace the bank dimension with time (T) or registers fetched in a burst, then other useful interleavings present themselves. e.g. as I explained earlier, by interleaving batches, a burst can produce operands for 4 batches and produce zero wastage.
Banks are the "best" dimension, since they're truly granular (each bank is independent of the others). Operand collection has limited time in which to produce the operands for a batch, so you can't go crazy with distinct reads. And burst length is fixed (i.e. it forces multiple reads over multiple Ts if the data isn't in the burst).
So it all boils down to register allocation.
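To make that concrete, a hypothetical allocation (the mapping is my assumption, chosen to reproduce the zero-wastage interleaving described above):

Code:
# Hypothetical register allocation: four 16-wide batches interleaved so that
# one burst-of-4 from each bank yields the same register for all 4 batches.
BANKS, BURST = 16, 4

def location(batch, lane, reg):
    """Map (batch 0-3, lane 0-15, register index) to (bank, row, burst slot)."""
    bank = lane % BANKS    # each lane's scalar sits in its own bank
    row = reg              # one burst-aligned row per register index
    slot = batch % BURST   # the 4 batches share a burst's 4 slots
    return bank, row, slot

# One address per bank fetches register r1 for all 16 lanes of all 4 batches:
# 16 banks x burst 4 = 64 scalars, zero wastage when all 4 batches run the
# same instruction.
assert {location(b, l, 1)[2] for b in range(4) for l in range(16)} == {0, 1, 2, 3}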
What I haven't mentioned so far is that the TMUs also have to fetch operands (each T clock).
Those conditions don't seem to have any dependency on block size. Unless the stride is related to block size somehow, which the docs don't mention anything about (don't see why it would be, either).
So it makes sense to use blocks of 64, since that's the number of elements that can have one operand fetched from the register file in one T (16 banks * burst of 4 = 64 scalars).
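A worked check of that, for a few typical block sizes (the candidate sizes are just illustrative):

Code:
# Ts needed to fetch one operand for every element of a block, per the
# 64-scalars-per-T figure above.
SCALARS_PER_T = 16 * 4                   # 16 banks x burst 4

for block in (32, 64, 128, 256):
    ts = -(-block // SCALARS_PER_T)      # ceiling division
    print(f"block of {block:3d}: {ts} T clock(s) to fetch one operand each")
# 64 is the largest block whose elements all get one operand in a single T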
Jawed