On the register allocation topic, the CUDA docs talk about using a multiple of 64 threads per block to avoid register read-after-write dependencies and register bank conflicts. I wonder if the fat vs. thin allocation might have something to do with using registers for texture fetch parameter storage?
You mean because TMUs access registers much more slowly? Presumably the 2 quad-TMUs in a GT200 cluster work on the same warp's texturing in lock-step. But maybe not?
8 texture coordinates is 16 scalars, which is one fetch from all 16 banks, say. But if the minimum burst size is 4, a single fetch would actually deliver 64 scalars, i.e. coordinates for 32 texture results - one warp's worth. So 2 burst fetches are required for a pair of warps that are textured together.
Been thinking a little more about the dynamic warp formation idea, and it seems as if this would break the implicit warp-level synchronization that you have with CUDA.
Between synchronisation points, yes - but that's exactly what you want. Once the warp has re-synchronised at the end of the clause that produced divergence, the warp continues execution as normal.
It would also be a problem with the warp vote instruction.
When DWF is operational you have a problem of maintaining what are now "scattered predicates", e.g. threads A1, A4, A17, A31 (accompanied by B..., C... etc. threads) are in a loop while the rest of the warp is sleeping, and on this iteration A17 and C24 go to sleep. So, during divergence __any is always true and __all is always false (otherwise there'd be no divergence).
I suppose what you're alluding to are predicates that aren't linked to overall clause control flow, e.g. breaking out of a loop when a value falls below a threshold for all threads.
Not sure if simple nesting of predicates takes care of this. Masking for the currently active threads may do the job. I can't remember how Fung's proposal for DWF handles this, to be honest.
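To make that break-out-of-a-loop case concrete, here's a rough sketch (relax(), THRESHOLD and the kernel name are all mine, and __all needs compute 1.2): the warp only leaves the loop once every thread agrees, which is exactly the kind of whole-warp vote that DWF's re-grouping would muddle.

[code]
#define THRESHOLD 1.0e-3f                 // made-up convergence bound

__device__ float relax(float x)           // stand-in for the real per-thread work
{
    return 0.5f * x;
}

__global__ void relaxKernel(float *data)  // assumes gridDim.x * blockDim.x elements
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = data[i];

    // Warp vote: loop until *every* thread in the warp is under the
    // threshold, so the whole warp breaks out together.
    while (!__all(x < THRESHOLD))
    {
        if (x >= THRESHOLD)
            x = relax(x);                 // only the unconverged threads do work
    }

    data[i] = x;
}
[/code]

If DWF has quietly re-packed the unconverged threads into a different hardware warp, that __all is no longer voting over the 32 threads the programmer thinks it is.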
Jawed, do you have any ideas on how atomic operations are implemented? Seems to me that shared-memory atomic operations enforce that the required instructions are scheduled together as a packet. Really wondering how global atomics are implemented...
Hmm, I've not really spent any time on these, so this is my first impression.
It looks to me as if something like a loop with a rotating (e.g. shift-left) predicate mask across all the warps in a block is used to enforce serialisation of the instructions that comprise the atomic "macro". I guess there must be something in the instruction issuer that forces the entire macro to issue in sequence. There could be a cache in the issuer (operand collector) to hold the result of each atomic operation, in order to avoid the latency of literally writing to memory.
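The nearest software analogy I can think of is the tag trick from the SDK histogram sample (written for hardware without shared-memory atomics), where the serialisation loop is explicit. Not claiming the hardware macro looks like this - it's reproduced roughly from memory, names approximate - just that the shape is similar:

[code]
// Each thread stamps its write with a per-lane tag and retries until its
// own write is the one that survived - intra-warp serialisation in software.
__device__ void addByte(volatile unsigned int *s_WarpHist,
                        unsigned int bin, unsigned int threadTag)
{
    unsigned int count;
    do
    {
        count = s_WarpHist[bin] & 0x07FFFFFFU;   // strip the previous tag
        count = threadTag | (count + 1);         // increment and re-tag
        s_WarpHist[bin] = count;
    } while (s_WarpHist[bin] != count);          // lost the race? go round again
}

// threadTag would be something like (threadIdx.x & 31) << 27
[/code]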
Presumably for global atomics the issuers have a communication network dedicated to sequencing amongst themselves. But of course there's no ordering, so the sequencing should be fairly trivial, i.e. entirely co-operative rather than pre-emptive.
Presumably, with a bit of luck, something like the code on page 109 of the 2.2 Guide (I have the beta version), which only does atomicInc for thread == 0 of a block, ends up with the two predicates ANDed to produce just one loop iteration (single packet issue).
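From memory the pattern on that page is roughly the one below (names and the reduction details are mine) - only thread 0 of each block touches the counter, so the thread == 0 predicate and the atomic's own serialisation could indeed collapse into a single packet per block:

[code]
__device__ unsigned int blocksDone = 0;      // global counter, one tick per block

__global__ void finishReduction(float *partialSums, float *result)
{
    __shared__ bool amLast;

    // (this block's partial sum is assumed to already be in partialSums[blockIdx.x])

    if (threadIdx.x == 0)
    {
        __threadfence();                     // make our partial sum visible first
        unsigned int ticket = atomicInc(&blocksDone, gridDim.x);
        amLast = (ticket == gridDim.x - 1);  // last block to arrive finishes the job
    }
    __syncthreads();

    if (amLast && threadIdx.x == 0)
    {
        float total = 0.0f;
        for (unsigned int i = 0; i < gridDim.x; ++i)
            total += partialSums[i];
        *result = total;
        blocksDone = 0;                      // reset for the next launch
    }
}
[/code]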
Is there any analysis of atomic instruction performance out there?
Jawed