ELSA hints GT206 and GT212

Jawed said:

Cool, thx. Guess I need to read more carefully.

Jawed said:
Do you mean the software threading model and cache misses? I certainly have some qualms there, but current GPUs have large functional blocks that are idling very often, and I think that wastage is far worse than the instantaneous hardware-thread switching overhead that Larrabee will suffer.

Not really. For some reason I thought Larrabee's caches would be multiported in an attempt to mimic the broadcast of shared memory. But it looks like they'll be using multithreading to hide cache latencies as well. So I guess there'll be some work there to pack data into single cache lines, as there won't be a way to explicitly set up bank-aligned data sets like in CUDA.

Vertex attribute access doesn't match the behaviour of register access, nor are the units writing the data the same.

When you have batches of 32 processed by 8-SIMD units, no register location needs to be read/written more than once every 4th clock, because each register is only used for one pixel, and there are savings to be had in designing the register file with this constraint in mind. Attribute data, on the other hand, is used for possibly all pixels in a batch. The other difference is that the attribute data is written to each cache from the setup engine, whereas register data has no need for an external connection except to pass the final pixel info to the ROPs.
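
To picture the once-every-4-clocks point, a sketch of the timing (mine, not from any document):

Code:
                  clock 0      clock 1      clock 2       clock 3
MAD r2,r0,r1,r2   pixels 0-7   pixels 8-15  pixels 16-23  pixels 24-31

Pixel 5's r0/r1/r2 are touched only on clock 0, so the earliest the next instruction can touch them again is clock 4.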

Right, that makes complete sense. I think Jawed is right though. Nvidia will probably clean up their memory structure in the next round and implement some sort of cluster-level L1 (distinct from the texture cache) similar to LRB. But how would that change operand fetch? Would they now have to support multiple in-flight fetches from cache to the register file and treat those fetches as yet another latency to be hidden?
 
I see. It seems that I wasn't paying proper attention. In that case, the MI really isn't saving much space at all - maybe 15% of total SF+INT space. There are a lot of multipliers in those interpolation-only sections.
Re-reading the paper:

Code:
Logic Block                    Area (full-adders)
==================================================
17b squarer                            90
CS to radix-4 SD converter             45
lookup table ROM                     1380
function overhead total              1515
--------------------------------------------------
2 optimized 17x24 mults               945
8 5x24 mults                         2040
3 24b right-shifters                  280
3 24b two’s complementers             110
4 45b right-shifters                  840
4 CSAtree                             730
4 45b CPA                             640
4 normalizers                         930
planar interpolation total           6515
==================================================
multifunction total                  8030
==================================================
I think it's fair to say the design sees transcendental functions as a small overhead on the considerably larger interpolators, which makes it much harder to justify dropping transcendentals altogether.

I have a feeling that GT200 eliminates those side branches, or at the very least GT300 will if they decide to stick with the distributed vertex caches.
Another alternative is to serialise the four branches of interpolation:

Code:
==================================================
17b squarer                            90
CS to radix-4 SD converter             45
lookup table ROM                     1380
function overhead total              1515
--------------------------------------------------
2 optimized 17x24 mults               945
2 5x24 mults                          510
3 24b right-shifters                  280
3 24b two’s complementers             110
1 45b right-shifter                   210
1 CSAtree                             183
1 45b CPA                             160
1 normalizer                          233
planar interpolation total           2631
==================================================
multifunction total                  4146
==================================================
which roughly halves the unit. GT300 with a 4:1 or higher ALU:TEX ratio should be happy.

But now transcendental is about 37% of MI area, whereas it was 19% in the layout described originally.

Only with detailed testing could we find out the actual attribute throughput of these GPUs - I don't know of any documented tests.

You're right - as shown, the interpolation rate is definitely overkill. Also, is this really where the second MUL happens? It just seems really silly to try and wedge it in there, because the data paths are all wrong.
Yeah, somewhere in there - the patent application paragraph I quoted earlier, [0066] :

In some embodiments of the present invention, Execution Unit A 765 is configured to perform interpolation, reciprocal, square root, multiplication, logarithm, sine, cosine, and power function operations.
In the other thread I linked, Bob said:

http://forum.beyond3d.com/showpost.php?p=1008712&postcount=24

I think you should try writing a MUL (or MAD)-only shader that interpolates different attributes at each instruction. The SFU really will do a MUL at the rate of 1 per clock per thread, largely independent of what happens in the MAD pipe.
Implying that our ideas of MUL utility are incorrect, that there's more throughput available there.

Jawed
 
Cool, thx. Guess I need to read more carefully.
Can't blame you when the source document is patentese. And that paragraph is merely a possibility, after all.

Not really. For some reason I thought Larrabee's caches would be multiported in an attempt to mimic the broadcast of shared memory. But it looks like they'll be using multithreading to hide cache latencies as well.
Yeah, software multi-threading.

So I guess there'll be some work there to pack data into single cache lines, as there won't be a way to explicitly set up bank-aligned data sets like in CUDA.
Obviously a lot of data will naturally fall into cache lines. They'll also have some swizzle functionality in the ALUs, so that might help. And of course for memory operations there's gather functionality, explicitly designed to make use of coalesced memory accesses and produce efficient cache lines.
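
For comparison, the kind of access pattern coalescing rewards in CUDA - a quick sketch with made-up kernel names:

Code:
// Illustrative sketch only. Coalesced: thread k of a half-warp reads
// word k of a contiguous, aligned segment, so the 16 reads collapse
// into a single memory transaction.
__global__ void copy_coalesced(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Strided: consecutive threads read addresses many words apart, so the
// half-warp's reads can't be combined and each costs a transaction.
__global__ void copy_strided(const float *in, float *out, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];   // e.g. stride = 16
}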

Right, that makes complete sense. I think Jawed is right though. Nvidia will probably clean up their memory structure in the next round and implement some sort of cluster-level L1 (distinct from the texture cache) similar to LRB. But how would that change operand fetch? Would they now have to support multiple in-flight fetches from cache to the register file and treat those fetches as yet another latency to be hidden?
One sticky question I haven't resolved is how NVidia handles virtualisation of the register file.

D3D10 requires that each element can access 4096 vec4 fp32s. The naive interpretation is a gargantuan register file - 4096 x 4 x 4 bytes = 64KB per element! So in reality some kind of paging mechanism is required.

Related to this is the question of indexed register accesses, e.g. r(r3.x) (the register at the address stored in r3.x), which is a feature of D3D11 (ATI GPUs already do this - it's a recipe for waterfalling :cry: ).
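
A rough CUDA analogue (my fragment, not anything from NVidia's docs) shows why that's painful - hand the compiler a runtime index and it can't keep the array in registers at all:

Code:
// Illustrative sketch only.
__global__ void indexed_regs(const int *idx, float *out)
{
    // A per-thread array indexed by a runtime value: the compiler can't
    // map the elements onto fixed registers, so it typically spills the
    // whole array to slow, uncached local memory on these parts.
    float r[16];
    for (int i = 0; i < 16; ++i)
        r[i] = (float)i;
    out[threadIdx.x] = r[idx[threadIdx.x] & 15];
}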

As far as I can tell Larrabee virtualises the entire "D3D register file". The vector unit's registers are completely dynamic. There is absolutely no static allocation of pixels to slots in a register file like we see in GPUs (for the duration of a shader). Simply because the vector register file is small, much like an SSE register file is small (though, ahem, not quite that tiny I guess). So if the vector unit is a SIMD-16 there might be 16x 16-wide registers, where each register is 16x32 bits wide, so a total of 16x 64-byte registers - 1KB in total.

The vector unit only needs enough registers to cover pipeline latency for all the operands it can fetch. And since a single operand can be read from L1 per clock, that will further reduce the need for a large register file attached to the vector ALU.

So the D3D register file (all 4096 vec4 fp32s per element!) is actually implemented in memory. It is merely cached through L2/L1 for use by whichever hardware thread is in context. As far as Larrabee is concerned, register names are merely an abstraction of memory addresses.

Now, it may be that NVidia goes in this direction, too. It's a hell of a big change. It would mean that there's no point in having the parallel data cache as a dedicated block for shared memory between threads.

Jawed
 
Another alternative is to serialise the four branches of interpolation
That's what I meant. Drop those branches so that you only do one pixel per clock, whether for SF or interpolation. It makes even more sense if NVidia is going to up the ALU:TEX ratio.

Implying that our ideas of MUL utility are incorrect, that there's more throughput available there.
The thing is that MUL needs a second operand, unlike all the other SF functions. It's a new path towards the register file. You also can't use 5x24 multipliers, which are possible for interpolation because they're only dealing with pixel offsets from the quad center.

I dunno. Maybe these documents only display enough functionality to get the patent, and the real thing is quite different.
 
That's what I meant. Drop those branches so that you only do one pixel per clock, whether for SF or interpolation. It makes even more sense if NVidia is going to up the ALU:TEX ratio.
It'd make great sense if SF were all that frequent - which it isn't, AFAICT. I could be horribly wrong, but I still prefer the possibility of having one SF/MI unit per multiprocessor instead of two in the GT21x generation. As for GT3xx, who knows, given we have zero idea what the shader core looks like... :)

The thing is that MUL needs a second operand, unlike all the other SF functions. It's a new path towards the register file.
Yup, but this is obviously resolved pretty easily if it's half-speed... ;) (well not really since SF is quarter-speed and would be 1/8th speed then, but the relative cost is still lower)

You also can't use 5x24 multipliers, which are possible for interpolation because they're only dealing with pixel offsets from the quad center.
Oh, the MUL isn't anywhere in that diagram for a very simple reason: it's not for the interpolation per se, it's for the multiplication by 1/w (the original RCP for that is done on demand, BTW, and can therefore be avoided by the driver if no interpolation is ever required - I remember testing that way back in the day...) - so it definitely has to be FP32 anyway; there's very little waste here except for the RF part of the equation.
 
In case anyone hasn't heard already (well done nvidia, fantastic idea)
GeForce 9 rebranding may take effect next month - Nvidia will re-introduce the GeForce 9800 GTX+ as the GeForce GTS 250 and the GeForce 9800 GT as the GeForce GTS 240
 
I wonder whether there's any connection between the renaming and new 40nm GPUs - could they be delayed? Or is it that only cheaper cards won't be renamed (at least for retail) and will be superseded by new GPUs in March or April, while G92 won't have a replacement until Q3?
 
It'd make great sense if SF were all that frequent - which it isn't, AFAICT. I could be horribly wrong, but I still prefer the possibility of having one SF/MI unit per multiprocessor instead of two in the GT21x generation. As for GT3xx, who knows, given we have zero idea what the shader core looks like...

I figure if they drop a MI we can say goodbye to MUL co-issue. Have you guys considered the implications for warp size? Without the need or ability to issue to the MI every other core clock, will we see superwarps go away and everything run with a warp size of 16?
 
I figure if they drop a MI we can say goodbye to MUL co-issue. Have you guys considered the implications for warp size? Without the need or ability to issue to the MI every other core clock, will we see superwarps go away and everything run with a warp size of 16?
They could do that if it simplified their scheduler in any way, but I suspect they'd still expose the MUL exactly as in GT200: as I said previously, for Graphics, GT200 can only expose half a MUL per clock cycle for RF/scheduler reasons (but it can use it entirely in practice if you use it for a MI half the time and a MUL the other half). If you removed the half MUL that can't be exposed anyway, it might still make sense to expose the other.

As for warp size, I guess theoretically 24 would be possible, but I doubt NV wants to go away from multiples of 16 for backwards compatibility reasons (i.e. CUDA programs that optimized everything with 16 in mind, for example). [EDIT: Actually now that I think about it, it might really not be that easy to implement anyway...]
 
I figure if they drop a MI we can say goodbye to MUL co-issue.
It isn't co-issue. It's "issue at some arbitrary rate", where the rate is determined by the availability of the ALU, and with a timing offset from the MAD ALU.

"Issue" is a funny concept in NVidia's design, as an instruction is issued only every "x" cycles (4, 8 etc.) but operands/resultants seem to be flowing continuously.

Have you guys considered the implications for warp size? Without the need or ability to issue to the MI every other core clock, will we see superwarps go away and everything run with a warp size of 16?
Does "superwarp" refer to a pair of 16-wide batches? As far as I can tell (rusty memory alert) this was only used in G80, for pixel shading, while VS (GS too?) used 16-wide batches ("half-warps" was one name for them, though officially they are warps in the strict sense). The later GPUs have 32-wide batches. Regardless of size, NVidia seems to use a pair of batches/warps (a convoy) as it helps with register file banking.

In general I'm not sure that slowing down MI would directly impact warp size.

Jawed
 
In case anyone hasn't heard already (well done nvidia, fantastic idea)

I dunno, that kinda makes sense to me. Those G92s look very similar to what a cut-down GT2xx would look like anyway, and I doubt there is any performance detriment in comparison.

So for me this makes the product line a lot tidier compared to what it was previously. The 9800GTX+ was a horrible name anyway!

I assume the standard 9800GTX has been dropped from the product line altogether.
 
I dunno, that kinda makes sense to me. Those G92s look very similar to what a cut-down GT2xx would look like anyway, and I doubt there is any performance detriment in comparison.
It lacks GT200's TMU efficiency gains and it has the "wrong" ALU:TMU ratio. I think G92's also lacking full VC-1 decode (introduced with G98, I think).

Jawed
 
It isn't co-issue. It's "issue at some arbitrary rate", where the rate is determined by the availability of the ALU, and with a timing offset from the MAD ALU.

Oh yeah, totally agree. But the setup is such that they could alternate issue to the MAD and the MI every other core clock if necessary. This was explicitly laid out in one of the patents as one of the reasons for 32-wide pixel batches. And to be honest I can't think of another good reason for them.

Does "superwarp" refer to a pair of 16-wide batches? As far as I can tell (rusty memory alert) this was only used in G80, for pixel shading, while VS (GS too?) used 16-wide batches ("half-warps" was one name for them, though officially they are warps in the strict sense). The later GPUs have 32-wide batches. Regardless of size, NVidia seems to use a pair of batches/warps (a convoy) as it helps with register file banking.

Yeah it seems that the unit of work is a half-warp and two of those are ganged together to reduce pressure on instruction issue. Not sure why you say super-warps help with register file accesses though. Based on the patents and CUDA documentation I've read, the coalescing rules for global memory accesses all happen within the scope of a half-warp. Shared memory is 16-way banked as well.

Also, since there's no caching of global memory and any unneeded bytes in a given memory transaction seem to be discarded anyway, the other half-warp doesn't appear to reap any benefit from the first half-warp's memory requests.

In terms of the register file there's no evidence that larger warps help there. Although Nvidia recommends multiples of 64 for block sizes for whatever reason:

CUDA doc said:
The compiler and thread scheduler schedule the instructions as optimally as possible to avoid register memory bank conflicts. They achieve best results when the number of threads per block is a multiple of 64. Other than following this rule, an application has no direct control over these bank conflicts. In particular, there is no need to pack data into float4 or int4 types.
 
Yeah it seems that the unit of work is a half-warp and two of those are ganged together to reduce pressure on instruction issue. Not sure why you say super-warps help with register file accesses though.
I just don't remember any reference to "super-warp", that's why I asked.

Based on the patents and CUDA documentation I've read, the coalescing rules for global memory accesses all happen within the scope of a half-warp. Shared memory is 16-way banked as well.
But ALUs run ~twice as fast as registers and memory, so from the point of view of memory, the ALUs are 16-wide.

Also, since there's no caching of global memory and any unneeded bytes in a given memory transaction seem to be discarded anyway, the other half-warp doesn't appear to reap any benefit from the first half-warp's memory requests.
Depends on interleaving factors and "wrap-around". Look at the examples relating to bank conflicts.

In terms of the register file there's no evidence that larger warps help there. Although Nvidia recommends multiples of 64 for block sizes for whatever reason:
The reason being that a pair of warps (each being 32 wide) run in lock step in a "convoy".

Jawed
 
I just don't remember any reference to "super-warp", that's why I asked

My bad. It was actually supergroup and the term pops up in a few patents.

[0083]In another alternative embodiment, SIMD groups containing more than P threads ("supergroups") can be defined. A supergroup is defined by associating the group index values of two (or more) of the SIMD groups (e.g., GID1 and GID2) with each other. When issue logic 424 selects a supergroup, it issues the same instruction twice on two successive cycles: on one cycle, the instruction is issued for GID1, and on the next cycle, the same instruction is issued for GID2. Thus, the supergroup is in effect a SIMD group. Supergroups can be used to reduce the number of distinct program counters, state definitions, and other per-group parameters that need to be maintained without reducing the number of concurrent threads.

But ALUs run ~twice as fast as registers and memory, so from the point of view of memory, the ALUs are 16-wide.

Yep exactly, which is why the unit of memory access is 16-wide as well.

Depends on interleaving factors and "wrap-around". Look at the examples relating to bank conflicts.

Ok, but why would "wrap-around" matter at anything above a multiple of 16? Why is 64 the magic number? Sorry if it's something really simple that I'm missing.....

The reason being that a pair of warps (each being 32 wide) run in lock step in a "convoy".

I don't think that's the definition of a convoy though.

Each slot in the instruction buffer can hold up to two instructions from a convoy (a group of 32) of threads.

By using different clock rates and providing multiple execution pipelines, a large amount of threads can be grouped together into a convoy of threads according to the formula: convoy_size = (number of execution pipelines) * (number of data paths in each execution pipeline) * (ratio of the clock rate of the data processing side to the clock rate of the instruction processing side).

Therefore convoy_size = 2 * 8 * 2 = 32. So if you remove the MI it becomes convoy_size = 1 * 8 * 2 = 16.

This excerpt below is what I was referring to in saying that 32 wide warps enable issuing to both the MAD and MI pipelines.

http://appft1.uspto.gov/netacgi/nph...and+spec/convoy&RS=(AN/nvidia+AND+SPEC/convoy)

[0033] In the preferred embodiment, the issue logic 320, when issuing instructions out of the instruction buffer 310, alternates between instructions of the MAD type and instructions of the SFU type. In this manner, both of the execution pipelines 222, 224 can be kept completely busy. Successive issuances of MAD type instructions or SFU type instructions may be permitted if the instruction buffer 310 contains only single type of instructions. However, a convoy of 32 threads requires 2 T clocks or 4H clocks to execute, and so, successive issuances of same-type instructions (e.g. MAD-MAD or SFU-SFU) can occur at most every other T clock. Issuing different-type instructions alternately to the two pipelines, on the other hand, permits an instruction to be issued at every T clock and provides for higher performance. The compiler can help with the scheduling of the instructions so as to ensure that different-type instructions are stored in the instruction buffer 310. Allowing different convoys to be slightly apart in the program may also improve performance.
 
The patent documentation is so old it relates to G80, where the hardware has a batch defined as 16.

Since then NVidia GPUs have changed so that the smallest batch is 32. NVidia doubled the clock count per issued instruction, essentially.

I think supergroup is referring to an instruction being issued for a single batch over successive cycles, i.e. a supergroup in GT200 consists of 2T clocks ("2 thread clocks", yay, the true meaning of thread in this architecture) or 4H clocks (hot, i.e. ALU clocks) for a single batch. So a convoy has to be re-defined to be:

SIMD-width * number-of-data-paths * ALU-throughput-multiplier * supergroup-size

In G80 the supergroup was 1, in GT200 it's 2. That's my interpretation, anyway. Of course there's a bit of a complication in GT200, because the number of data paths is really 3 if you count the double-precision ALU :p
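
Plugging numbers into that redefinition (my arithmetic, ignoring the DP path):

Code:
       SIMD-width   data-paths   ALU-mult   supergroup   convoy
G80        8      x      2     x     2    x      1     =    32
GT200      8      x      2     x     2    x      2     =    64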

CUDA documentation is misleading about G80 because it implies that a batch is 32-wide, but it's really "half-warp" sized. But since the GPU always runs convoys it doesn't matter - whereas in graphics G80 supposedly runs vertex shader batches un-convoyed.

---

As to wrap-around, see figure 5-6 in the CUDA 2.0 Programming Guide, to see how the banks wrap around. Additionally, by placing a convoy's data in an interleaved pattern in memory (i.e. mixing A and B registers for AAAABBBB batch pattern) you can use the burst length to fetch data for both batches in a convoy and also utilise all the banks evenly, instead of chucking away half the burst.

Obviously there are access patterns that will always cause grief, which is why the CUDA documentation goes to great lengths to explain this stuff.

Jawed
 
What about CUDA capabilities?
Yep, there's a difference there too. Double-precision, some atomicity improvements, predicate evaluation for whole warps and an increase in capacity for in-flight warps/registers. See appendix A of the CUDA 2.0 Programming Guide.

Jawed
 
The patent documentation is so old it relates to G80, where the hardware has a batch defined as 16.

Since then NVidia GPUs have changed so that the smallest batch is 32. NVidia doubled the clock count per issued instruction, essentially.

From where I'm sitting nothing has changed in the context of that patent. There are still two pipelines and there is still a 2:1 clock ratio. The definition of convoy and supergroup are still the same. Even if GT200 issues everything as a supergroup the definition doesn't change as a result and there's no bearing on the behavior of future iterations that follow the G8x model.

As to wrap-around, see figure 5-6 in the CUDA 2.0 Programming Guide, to see how the banks wrap around.
Yeah but that's based on the stride between data elements within a single half warp. Still don't get the relationship to the block size.....

A common case is for each thread to access a 32-bit word from an array indexed by the thread ID tid and with some stride s:

Code:
__shared__ float shared[32];
float data = shared[BaseIndex + s * tid];

In this case, the threads tid and tid+n access the same bank whenever s*n is a multiple of the number of banks m or, equivalently, whenever n is a multiple of m/d where d is the greatest common divisor of m and s. As a consequence, there will be no bank conflict only if half the warp size is less than or equal to m/d. For devices of compute capability 1.x, this translates to no bank conflict only if d is equal to 1, or in other words, only if s is odd since m is a power of two.
Those conditions don't seem to have any dependency on block size. Unless the stride is related to block size somehow, which the docs don't mention anything about (don't see why it would be, either).
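
As an aside, the textbook fallout of that odd-stride condition is the shared-memory padding trick - a sketch (mine, not from the guide):

Code:
// Illustrative sketch; launched as one 16x16 thread block per tile.
__global__ void transpose16(const float *in, float *out)
{
    // 16 banks on compute 1.x: a row stride of 16 words puts an entire
    // column in one bank (s = 16, d = 16 -> 16-way conflict). Padding
    // the row to 17 makes the stride odd (d = 1), so a column read
    // hits all 16 banks.
    __shared__ float tile[16][17];
    int x = threadIdx.x, y = threadIdx.y;
    tile[y][x] = in[y * 16 + x];    // row-wise write, stride 1
    __syncthreads();
    out[y * 16 + x] = tile[x][y];   // column-wise read, stride 17
}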

Additionally, by placing a convoy's data in an interleaved pattern in memory (i.e. mixing A and B registers for AAAABBBB batch pattern) you can use the burst length to fetch data for both batches in a convoy and also utilise all the banks evenly, instead of chucking away half the burst.
Register files have a burst length? I thought it was simply 1 32-bit read per clock per bank?
 
Register files have a burst length? I thought it was simply 1 32-bit read per clock per bank?
Each H clock, MAD needs 3 scalar operands for each of 8 lanes, which is 24 operands. Each T clock, which is ~2 H clocks, MAD needs 48 operands.

Similarly each H clock, MI needs 1 operand for each of 2 lanes (for transcendental), so that's 2 operands - so each T clock that's 4 operands.

If MI is doing MUL, then it's 8 operands per T clock.

So the worst case is 48 operands per T clock for MAD and 8 operands per T clock for MUL = 56 operands.

So each T clock the register file needs to produce 64 operands to cover all these cases. So each of the 16 banks in the register file produces a burst of 4 scalars per T clock.
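
Tabulated:

Code:
MAD        3 ops x 8 lanes x 2 H per T   = 48 operands/T
MI (SF)    1 op  x 2 lanes x 2 H per T   =  4 operands/T
MI (MUL)                                    8 operands/T
---------------------------------------------------------
worst case, MAD + MUL                    = 56 operands/T
RF supply, 16 banks x burst of 4         = 64 operands/T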

From where I'm sitting nothing has changed in the context of that patent. There are still two pipelines and there is still a 2:1 clock ratio.
Your earlier question about removal of MI essentially affects the need to use a convoy, as the convoy is constructed specifically to twiddle batches across MAD and MI.

I've also taken this as an optimisation for operand fetching, as by pairing up two batches in a convoy you can use a burst to read "half" of each batch's operands. This means that when a batch is reading registers from all over the register file, the addressing rate (1 per T per bank) and the burst length are less likely to produce surplus operands.

The ideal case (in G80), with a burst length of 4, is four batches interleaved in the register file. This allows:

Code:
MAD r0, r1, r5, r9
RCP r13, r19

where r1, r5, r9 and r19 are fetched on four consecutive Ts. This produces 16 banks * burst length 4 * T count 4 = 256 operands. That's enough operands for 4 batches of 16, where each batch wants 52 operands.
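
Per bank - element i of each batch lives in bank i - the layout I'm assuming looks like:

Code:
bank i, T0:  burst = [ A(i).r1  | B(i).r1  | C(i).r1  | D(i).r1  ]
bank i, T1:  burst = [ A(i).r5  | B(i).r5  | C(i).r5  | D(i).r5  ]
bank i, T2:  burst = [ A(i).r9  | B(i).r9  | C(i).r9  | D(i).r9  ]
bank i, T3:  burst = [ A(i).r19 | B(i).r19 | C(i).r19 | D(i).r19 ]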

The resulting data will feed a pair of convoys over 4 consecutive Ts:

Code:
        MAD           MI
   (r1, r5, r9)      r19
=============================
T0       A            B
T1       B            A
T2       C            D
T3       D            C

Obviously this requires that MAD and RCP are fully independent instructions (as they are in this example).

So alternating MAD+MI and interleaving of batches in register-file bursts synergistically maximises the bandwidth utilisation of register-file and operand collection.

Of course this all starts to fall apart with branch divergence...

What's interesting is that a pair of batches joined to make a convoy make a de-facto batch. If branch divergence affects one of the batches the entire convoy is affected.

Even more simply, if the batches in a convoy diverge e.g. batch A takes the THEN clause (MAD r0, r1, r2, r3) while batch B takes the ELSE clause (MAD r5, r6, r7, r8), then the operand collector is effectively trying to fetch operands for independent instructions on the same T clocks and so it will prolly run out of bandwidth.

This makes me think that GT200 is merely an enforced-convoy architecture, with a batch size of 16 - but for branch divergence and memory operations it counts as having a batch size of 32. Now it could be that GT200 actually has a baseline batch size of 32, making a convoy 64 elements. That would make for effective batch size of 64. Dunno.

Whereas I think RV770 has an effective batch size of 128.

The definition of convoy and supergroup are still the same. Even if GT200 issues everything as a supergroup the definition doesn't change as a result and there's no bearing on the behavior of future iterations that follow the G8x model.
I think you might have a point here.

I'm still trying to reconcile the scaling introduced by "supergroup" (i.e. when would it be used) with the fact that GT200 supposedly issues a single batch instruction for 4 H clocks, not 2 as in G80.

My interpretation is that GT200 has 32-wide batches that are convoyed to make a super-batch of 64.

An alternative interpretation is that a 32-wide batch is formed from a convoy, indivisibly. Some kind of internal change has been made that enforces this scheduling. G80 didn't enforce convoys, it seems (vertex shader batches seem to be truly 16 in size).

Yeah but that's based on the stride between data elements within a single half warp. Still don't get the relationship to the block size.....
Think of it as a 3-dimensional block of memory, where you're allowed to cut any plane you like as long as it only uses 2 dimensions to address. The interleaving in figure 5-6 shows how the banks work independently. If you replace the bank dimension with time (T) or registers fetched in a burst, then other useful interleavings present themselves. e.g. as I explained earlier, by interleaving batches, a burst can produce operands for 4 batches and produce zero wastage.

Banks are the "best" dimension, since they're truly granular (each bank is independent of the others). Operand collection has limited time in which to produce the operands for a batch, so you can't go crazy with distinct reads. And burst length is fixed (i.e. forces multiple reads over multiple T if the data isn't in the burst).

So it all boils down to register allocation.

What I haven't mentioned so far is that the TMUs also have to fetch operands (each T clock).

Those conditions don't seem to have any dependency on block size. Unless the stride is related to block size somehow, which the docs don't mention anything about (don't see why it would be, either).
So it makes sense to use blocks of 64, since 16 banks x a burst of 4 = 64 elements that can each have one operand fetched from the register file in one T.

Jawed
 