Nvidia GT300 core: Speculation

Intriguing, I think GS writes to memory too though Jawed (for AMD but not NVIDIA, that is). I'll admit I didn't realize it read from it too though! Nearly enough to make me reconsider whether 512-bit could *theoretically* have made any sense on R600... :) (but certainly still not enough, sorry! hehe)
 
I just looked at my first GS assembly code on ATI.

The ISA document does talk about this stuff too - it's just that I ignored/forgot it. GS output is performed by MEM_RING_WRITE, which writes in multiples of 32 bits to the "ring buffer".

Sections 2.1.2 and 2.1.3 detail the steps involved.


Without a GS present, 2.1.2:
  • vertex data comes from "Vertex memory"
  • vertex data goes to "parameter cache and position buffer"
  • pixel data comes from "Positions cache, parameter cache, and vertex geometry translator (VGT)"
  • pixel data goes to "Local or system memory"
(Note the inconsistency and also note that VGT is sometimes referred to as "vertex grouper and tessellator".)


With a GS present, 2.1.3:
  • vertex data comes from "Vertex memory"
  • vertex data goes to "VS ring buffer"
  • primitive data comes from "VS ring buffer"
  • primitive output goes to "GS ring buffer"
  • "any data", using what's called a DMA Copy shader, comes from "GS ring buffer"
  • "any data" using DC, goes to "Parameter cache or position buffer"
  • pixel data comes from "Positions cache, parameter cache, and vertex geometry translator (VGT)"
  • pixel data goes to "Local or system memory"
Bits from the geometry shader:

Code:
00 VTX: ADDR(368) CNT(3) 
      0  VFETCH R8, R0.w, fc159  MEGA(16) 
         FETCH_TYPE(NO_INDEX_OFFSET) 
      1  VFETCH R9, R0.x, fc159  MEGA(16) 
         FETCH_TYPE(NO_INDEX_OFFSET) 
      2  VFETCH R10, R0.y, fc159  MEGA(16) 
         FETCH_TYPE(NO_INDEX_OFFSET) 
[...]
19 MEM_RING_WRITE: DWORD_PTR[24].x___, R0, ARRAY_SIZE(0,4) ELEM_SIZE(3) 
20 MEM_RING_WRITE: DWORD_PTR[24]._y__, R0, ARRAY_SIZE(0,4) ELEM_SIZE(3)   NO_BARRIER 
21 MEM_RING_WRITE: DWORD_PTR[24].__z_, R1, ARRAY_SIZE(0,4) ELEM_SIZE(3)   NO_BARRIER 
22 MEM_RING_WRITE: DWORD_PTR[24].___w, R5, ARRAY_SIZE(0,4) ELEM_SIZE(3)   NO_BARRIER 
23 MEM_RING_WRITE: DWORD_PTR[28].xy__, R15, ARRAY_SIZE(0,4) ELEM_SIZE(3)   NO_BARRIER 
24 EMIT 
25 CUT 
26 MEM_RING_WRITE: DWORD_PTR[32], R8, ARRAY_SIZE(0,4) ELEM_SIZE(3) 
27 MEM_RING_WRITE: DWORD_PTR[36].xy__, R15, ARRAY_SIZE(0,4) ELEM_SIZE(3)   NO_BARRIER 
28 EMIT 
[...]

The DC is very simple:

Code:
; -------------- Copy Shader --------------------
; --------  Disassembly --------------------
00 VTX: ADDR(16) CNT(2) 
      0  VFETCH R1, R0.x, fc159  MEGA(32) 
         FETCH_TYPE(NO_INDEX_OFFSET) 
      1  VFETCH R2, R0.x, fc159  MINI(16) OFFSET(16) 
         FETCH_TYPE(NO_INDEX_OFFSET) 
01 EXP_DONE: POS0, R1
02 EXP_DONE: PARAM0, R2
END_OF_PROGRAM

So both VS and GS ring buffers are in memory. All the instructions that deal with memory are running in parallel with ALU instructions, so it's a question of having enough ALU work to hide these latencies.

Jawed
 
You said it could get efficiently filled warps with a divergent job ... what exactly did you mean by that, then?

How about a simple example of what I am referring to as a divergent job: a kernel is processing a buffer of nodes. Some random fraction of the nodes can be processed in this single pass; the rest need to be streamed through another kernel. Then some random fraction of the second pass needs yet another kernel, and so on. In this example, node = thread.

Current options are: (a.) pack all the kernels into one kernel and do a divergent branch, in which case all threads of the warp run the worst-case path; (b.) DX10/GL: use stream out and run multiple passes.

With what I was suggesting, (c.) the kernel writes to a queue, the hardware sees data on the queue, groups a warp (or block), and starts the kernel associated with the queue to process the warp (or block).

I don't see this sort of thing being too much different than what currently happens in FF: grouping of verts which miss post transform cache, grouping of pixel quads, etc. Intermediate data doesn't hit main memory.
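To make the contrast between (a.) and (c.) concrete, here's a minimal CUDA sketch of option (a.); the Node type, the predicate and the two work functions are made up purely for illustration:

Code:
// Option (a.): everything packed into one kernel behind a divergent branch.
// 'Node', needs_second_pass(), process_cheap() and process_expensive() are
// hypothetical stand-ins for the real per-node work.
struct Node { float value; int flags; };

__device__ bool needs_second_pass(const Node &n) { return (n.flags & 1) != 0; }
__device__ Node process_cheap(Node n)            { n.value *= 2.0f; return n; }
__device__ Node process_expensive(Node n)
{
    for (int k = 0; k < 64; ++k) n.value = sqrtf(n.value + 1.0f);
    return n;
}

__global__ void packed_kernel(Node *nodes, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count) return;

    Node n = nodes[i];
    // Divergent branch: a warp containing even one "expensive" node
    // serialises both sides, so every thread pays the worst-case cost.
    if (needs_second_pass(n))
        nodes[i] = process_expensive(n);
    else
        nodes[i] = process_cheap(n);
}

With (c.) the expensive nodes would instead land on a queue and get regrouped into fresh, fully-occupied warps for a second kernel.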
 
Jawed,

So with ATI, excluding the GS path, no intermediate data between VS and PS ever hits main memory. Just the GS path is messy and hits memory, because of VS output sharing between GS invocations.

For DX11, can (IA, VS, HS) be grouped into one kernel? Meaning the kernel manually fetches vertex inputs, and HS outputs are stored in the post transform cache instead of VS outputs?
 
So with ATI, excluding the GS path, no intermediate data between VS and PS ever hits main memory.
Yep, it seems the position buffer and parameter cache are on-die and load-balancing twixt VS and PS can interact directly with these buffers, i.e. percentage-full and perhaps rate-of-change.

Just the GS path is messy and hits memory, because of VS output sharing between GS invocations.
Yes, two ring buffers are used to balance the workload and provide queues that can independently adjust. When you have triangles in the VS input buffers producing random numbers of triangles in the PS input buffers (i.e. position buffer and parameter cache), you have a fairly fundamental ordering problem, not to mention quantity of data problem.

One thing that I've not seen is how R600 etc. deal with the volume of data produced by tessellation amplification, which is up to 15x. Maybe that hard limit means it stays on-die?

The interesting thing about the DC shader is that it can effectively be load balanced against PS consumption of the position buffer and parameter cache - i.e. DC acts as a surrogate VS from the point of view of keeping pixel shading fed. If the GS is a culling rather than amplifying process it seems there's a strong risk of running out of work for PS to do - but that risk seems fundamental to any architecture.

The obvious problem with this double-ring-buffer approach is that there are 5 memory operations. Obviously there's less than that, on average, per vertex, if amplification is the norm, but in the extreme best case there's still a minimum of 2 memory operations per vertex (hmm, should be able to work it out with the 1024-scalars per invocation limit). The bandwidth cost is obviously troubling, too. ALU:TEX is supposed to be going up, though.

On the other hand, this is scalable. You can see why AMD was miffed at the 1024-scalars per GS-invocation limit that was written into D3D10, when they'd built a dataflow that doesn't care.

Whether it's due to ordering or volume of data, it seems to me a GPU must be able to spill inter-kernel buffers to memory. With something like Larrabee it's all through the cache, so there it's a question of deciding how many cache lines to use...

Append buffers are the general case for this. It would seem reasonable to assume these buffers can be cached, much like global atomics are in NVidia. So then it's really a question of sizing and load-balancing...

For DX11, can (IA, VS, HS) be grouped into one kernel? Meaning the kernel manually fetches vertex inputs, and HS outputs are stored in the post transform cache instead of VS outputs?
My head hurts every time I look at HS->TS->DS.

It seems that HS amplifies a patch's control points that are shaded in the VS, similar to the way in which GS amplifies vertices shaded by VS, using the "constant function":

[Diagram: ArticleDiagramDetailedD3D11Pipeline.png - the detailed D3D11 pipeline]


Such a nice diagram, hope Jack doesn't mind the linkage, from:

http://www.gamedev.net/community/forums/mod/journal/journal.asp?jn=316777

which also has his experimentation using a CS to generate weightings ahead of terrain tessellation.

The GDC09 presentation is also useful:

http://developer.amd.com/gpu_assets/GDC09_D3D11Tessellation.pps

(can't find this on NVidia's site - there's a PDF of this available on the GDC09 site) showing the usage of 10 control points per triangle more explicitly. I'm unclear on whether 10 is the limit here. Just don't know enough about HS/TS/DS.

Jawed
 
How about a simple example of what I am referring to as a divergent job: a kernel is processing a buffer of nodes. Some random fraction of the nodes can be processed in this single pass; the rest need to be streamed through another kernel. Then some random fraction of the second pass needs yet another kernel, and so on. In this example, node = thread.

Current options are: (a.) pack all the kernels into one kernel and do a divergent branch, in which case all threads of the warp run the worst-case path; (b.) DX10/GL: use stream out and run multiple passes.
Couldn't you just run the first pass of the program, do a reduction to find the number of elements which need to go to the second pass, and create write indices to write them to a circular 2*warp-sized array in local shared memory? You could then use a non-divergent, data-dependent branch to trigger the second pass when the number of elements is greater than a single warp.
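A rough sketch of that idea, assuming one warp per block to keep it short and a global atomic (compute 1.1+) for the final output; pass1()/survives()/pass2() are hypothetical stand-ins for the real per-element work:

Code:
// Warp-local compaction sketch, assuming one warp per block
// (blockDim.x == 32). Output order is not preserved and
// *out_count is assumed zeroed by the host.
#define WARP 32

__device__ float pass1(float x)    { return x * 0.5f - 1.0f; }
__device__ bool  survives(float x) { return x > 0.0f; }
__device__ float pass2(float x)    { return x * x; }

__global__ void compacting_kernel(const float *in, int count,
                                  float *out, int *out_count)
{
    __shared__ float queue[2 * WARP];  // circular buffer of survivors
    __shared__ int   scan[WARP];       // prefix-sum workspace
    __shared__ int   qhead, qtail;     // monotonic write/read positions

    int lane = threadIdx.x;
    if (lane == 0) { qhead = 0; qtail = 0; }
    __syncthreads();

    for (int base = blockIdx.x * WARP; base < count; base += gridDim.x * WARP)
    {
        int   i    = base + lane;
        float v    = (i < count) ? pass1(in[i]) : 0.0f;
        int   keep = (i < count && survives(v)) ? 1 : 0;

        // Inclusive prefix sum of 'keep' across the warp -> write indices.
        scan[lane] = keep;
        __syncthreads();
        for (int off = 1; off < WARP; off <<= 1) {
            int t = (lane >= off) ? scan[lane - off] : 0;
            __syncthreads();
            scan[lane] += t;
            __syncthreads();
        }
        int total = scan[WARP - 1];
        int idx   = scan[lane] - keep;          // exclusive position

        if (keep) queue[(qhead + idx) % (2 * WARP)] = v;
        __syncthreads();
        if (lane == 0) qhead += total;
        __syncthreads();

        // Non-divergent, data-dependent branch: drain a whole warp at a time.
        while (qhead - qtail >= WARP) {
            float r = pass2(queue[(qtail + lane) % (2 * WARP)]);
            out[atomicAdd(out_count, 1)] = r;
            __syncthreads();
            if (lane == 0) qtail += WARP;
            __syncthreads();
        }
    }
    // Any leftover (< WARP) survivors are simply dropped in this sketch.
}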
 
Couldn't you just run the first pass of the program, do a reduction to find the number of elements which need to go to the second pass, and create write indices to write them to a circular 2*warp-sized array in local shared memory? You could then use a non-divergent, data-dependent branch to trigger the second pass when the number of elements is greater than a single warp.

Yes, that would also work in many cases, but would likely have diminishing returns depending on the data distribution? This sort of thing seems to be a really common CUDA pattern: needing to manually do either a local (as in your example) or a global stream compaction; either way you are manually doing "dynamic warp formation".

BTW, for those who are interested, the GRAMPS paper describes a model very similar to what I've been suggesting. They also add stateful thread stages.
 
Yes, that would also work in many cases, but would likely have diminishing returns depending on the data distribution? This sort of thing seems to be a really common CUDA pattern: needing to manually do either a local (as in your example) or a global stream compaction; either way you are manually doing "dynamic warp formation".
Which is why the idea of doing it automatically is so compelling :p If you've built a complicated operand collector and instruction issuer, how much extra does this cost?...

BTW, for those who are interested, the GRAMPS paper describes a model very similar to what I've been suggesting. They also add stateful thread stages.
Yeah, their stuff is nice. Recommend his page:

http://www-graphics.stanford.edu/~yoel/notes/

Jawed
 
So perhaps general purpose queuing isn't for this generation, too early.

Append buffers are the general case for this. It would seem reasonable to assume these buffers can be cached, much like global atomics are in NVidia. So then it's really a question of sizing and load-balancing...

BTW, thanks Jawed for the ATI info above (the GS ring path is very interesting). Somehow I still see append/consume buffers always going to DRAM due to data volume and the shader/shader boundary (even if hardware provides a free on-chip stream compaction, which would be nice). Seems as if HS/TS/DS would be all on chip. Perhaps DS gets early HS outputs forwarded into shared memory while TS pipes out groups of verts for DS shading (I think lots of DS invocations share the same HS output).

Still leaves the question of just what NVidia has that's special for this generation? Since they have a CUDA 3.0, I'd still bet something is changing (vs. just getting an optimization like DWF). Any chance they add hardware support for CUDA's missing branch/call to register? Seems useful for the DX11 linking, or virtual function calls.
 
So perhaps general purpose queuing isn't for this generation, too early.
If queuing is doomed never to be possible "on-die" unless it happens to fit into a cache (i.e. Larrabee style) then I don't see why NVidia wouldn't implement it now.

BTW, thanks Jawed for the ATI info above (the GS ring path is very interesting). Somehow I still see append/consume buffers always going to DRAM due to data volume and the shader/shader boundary (even if hardware provides a free on-chip stream compaction, which would be nice).
I don't think there's any choice about writing the data to memory. But the "sentry", the singleton in GRAMPS terminology (like the primitives reorder thread twixt VS and PS), is running on-die, so that is itself fast.

Seems as if HS/TS/DS would be all on chip. Perhaps DS gets early HS outputs forwarded into shared memory while TS pipes out groups of verts for DS shading (I think lots of DS invocations share the same HS output).
I have to say I'm dubious about this now. One of the issues here is the latency of the two paths from HS->DS.

The TS path is fed solely by the constant function portion of the HS and obviously then has its own latency, whatever that is.

The DS path is fed by both the constant function and an iteration/generation of the control points for the patch (still don't really understand the scope of this).

With the large amount of data coming out of TS (and we don't know what kind of rate) + the variable count of control points, the DS looks like it's feeding off two inputs that behave asynchronously.

Still leaves the question of just what NVidia has that's special for this generation?
It seems to me, now, that ATI hardware already supports append/consume (the ring buffers in the GS path) and NVidia has to add them. I'm guessing that Stream Out in NVidia currently only runs on one multiprocessor/cluster and so appending is "trivial" and doesn't require general append/consume. Since general-purpose append/consume requires the whole chip to be singleton threaded for a single buffer, this would marry up with the slide I linked: "Persistent thread blocks reading and writing work queues" and "Thread block or warp as a (parallel) task".

So I would argue that NVidia is going a step beyond what ATI has implemented so far. And, arguably, the relatively high cost of this functionality is something NVidia deferred from the G80 design, with it roadmapped for a time when transistor budget allows.

So I'm still looking forward to NVidia having arbitrary shaders. OK, so NVidia might only allow a single "append/consume" singleton to run in parallel with a normal CS (or CUDA kernel).
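For what it's worth, the consume side of "persistent thread blocks reading and writing work queues" can already be approximated today with a global cursor and atomics; a minimal sketch, where Task, the queue layout and the per-task work are all hypothetical:

Code:
// Persistent-blocks sketch: launch roughly as many blocks as the chip can
// keep resident and let each block pull tasks off a pre-filled global
// queue until it runs dry. *read_cursor is assumed zeroed by the host.
struct Task { int first; int count; };

__global__ void persistent_worker(const Task *queue, int queue_len,
                                  int *read_cursor, float *data)
{
    __shared__ int my_task;
    for (;;) {
        if (threadIdx.x == 0)                     // block leader grabs work
            my_task = atomicAdd(read_cursor, 1);
        __syncthreads();
        if (my_task >= queue_len) break;          // queue drained, all exit

        Task t = queue[my_task];
        for (int i = t.first + threadIdx.x; i < t.first + t.count; i += blockDim.x)
            data[i] = sqrtf(data[i]);             // stand-in for real work
        __syncthreads();                          // before my_task is reused
    }
}

Appending new work back onto a queue from inside the kernel, and having warps formed from it automatically, is the part that needs the general append/consume support being discussed here.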

Since they have a CUDA 3.0, I'd still bet something is changing (vs. just getting an optimization like DWF). Any chance they add hardware support for CUDA's missing branch/call to register? Seems useful for the DX11 linking, or virtual function calls.
That's a subject I really don't understand. Still can't get my head round the implications for register allocation of dynamically linked functions, unless they're purely static by the time they arrive on the GPU, in which case there's no variable function addressing.

Of course NVidia may be doing something radical with register allocation that assuages my basic concern. Thinking about this, when you have VS+HS+DS+GS+PS all simultaneously running on a unified architecture, just how do you allocate registers? It seems you're forced either to stripe shader type across multiprocessors/clusters (another aspect of the "sometimes peculiar" performance of GS on NVidia?) or to have multiple, variably-sized, ways in the register file. So if you're doing the latter then maybe the register allocation problem associated with dynamically linked functions on the GPU disappears...

If you start having all these simultaneous ways to allocate registers then it sounds closer to being able to do DWF (which requires all sorts of funny operand access patterns from RF). Ah, nice dream...

Jawed
 
(can't find this on NVidia's site - there's a PDF of this available on the GDC09 site) showing the usage of 10 control points per triangle more explicitly. I'm unclear on whether 10 is the limit here. Just don't know enough about HS/TS/DS.
DX11 has a limit of 32 control points per patch.
 
If queuing is doomed never to be possible "on-die" unless it happens to fit into a cache (i.e. Larrabee style) then I don't see why NVidia wouldn't implement it now.

I was under the impression that queuing is already "on-die" for fixed function only (queues threads to fill warps for VS and PS shader paths and has the fixed function work distribution logic as well). All I was suggesting here is to make modifications necessary to open this up in a general way. But perhaps not likely to happen.

I have to say I'm dubious about this now. One of the issues here is the latency of the two paths from HS->DS.

From my limited understanding, HS generates a group of control point data and tessellation factors per primitive (I'm going to refer to this as the HS output packet). Tessellation factors get forwarded to TS. DS runs per TS output point with uv input and access to the HS output packet. So in theory HS output would be limited by TS output rate, and post TS the DS would just churn out points all using the same "shared memory" HS output packet.

It seems to me, now, that ATI hardware already supports append/consume (the ring buffers in the GS path) and NVidia has to add them.

Or if the append/consume path is to global memory anyway, they could emulate the queues using shared memory and atomic operations. However, if one can append to and consume from the same queue in a given thread group (an example might be a pool of nodes for dynamic allocation/free), this type of thing gets tough to do with the atomic API in PTX/CUDA (the queue write and the queue head pointer update need to be atomic together if a consume could possibly overlap).
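For the append-only half this is straightforward; a sketch of what the shared-memory emulation might look like (compute 1.2+ for shared-memory atomics, predicate and sizes purely illustrative). The hard part, as noted, is when a consumer can overlap the appends, because the data write and the counter update can't be made one atomic operation:

Code:
// Append-only emulation of an append buffer in shared memory, using a
// shared counter and atomicAdd. Safe only because nothing consumes until
// after the barrier; an overlapping consumer would need the data write
// and the counter bump to be atomic together, which this cannot express.
#define QUEUE_CAPACITY 256

__global__ void append_example(const float *in, int count,
                               float *out, int *out_count)  // *out_count zeroed by host
{
    __shared__ float s_queue[QUEUE_CAPACITY];
    __shared__ int   s_count;

    if (threadIdx.x == 0) s_count = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count && in[i] > 0.0f) {              // hypothetical "append" predicate
        int slot = atomicAdd(&s_count, 1);        // reserve a slot
        if (slot < QUEUE_CAPACITY)
            s_queue[slot] = in[i];
    }
    __syncthreads();                              // all appends visible from here on

    // Only now is it safe to "consume": flush the block's queue to memory.
    int n = min(s_count, QUEUE_CAPACITY);
    for (int j = threadIdx.x; j < n; j += blockDim.x)
        out[atomicAdd(out_count, 1)] = s_queue[j];
}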

That's a subject I really don't understand. Still can't get my head round the implications for register allocation of dynamically linked functions, unless they're purely static by the time they arrive on the GPU, in which case there's no variable function addressing.

If the set of possible dynamic register-supplied addresses is known, then possibly the register allocation is the union. Seems likely wasteful, but could be very useful to handle the shader explosion problem with traditional game engine material systems. The idea here being that you don't get direct access to branch by register, but the compiler/linker can feed addresses into (invisible to the shader) shared memory, which can then be fetched into a register and branched to. I still need to go through all the DX11 linking stuff to get an idea of what is required for DX11 (which very well might just be a more advanced version of the shader patching that is already done now).
 
Yes, that would also work in many cases, but would likely have diminishing returns depending on the data distribution?
Dunno about diminishing returns. The downsides are that the accumulation is local to the warp and the threads have to iteratively fetch data themselves (you lose the choice of loading input data via an initial texture clause). The former is only a problem if the number of elements operated on by a single warp in the first phase is not much larger than the warp size, or if the later phases take a disproportionate amount of runtime. How much impact the latter has I don't know; the hardware is probably still optimized for initial texture clauses before the shader proper runs ... meh.
 
BTW, got the GTX 275 in yesterday and had a chance to profile CUDA global atomic operations. This post is mostly a continuation of a conversation which was in the R800 thread, but I think it has some relevance to the topic of GT300/DX11 append/consume possibly being done in "software".

The test I ran was to simulate the atomic operation required if each half-warp of a warp updated a queue head pointer, so only the first thread of each half-warp was predicated to run the atomic operation. Atomic operations were done in a forced non-unrolled loop clocked with the CUDA clock() operation, and the overall min/max of all half-warp clock() start/end times was used to derive clock cycles per atomic operation. So the results are rough; I didn't have time to do anything more, and did my best in the limited time to ensure the compiler didn't optimize out any work...

(1.) The atomicAdd() latency between an 8600 GTS (compute 1.1) and a GTX 275 (compute 1.3) was nearly the same in cases where all threads accessed the same address. It was around 330-340 cycles of latency on average, which requires around 84 instructions (at 4 clk/instruction) to hide.

(2.) For a range of colliding to non-colliding address cases I ran something like this with various settings for overlap (16,8,4,2,1),

unsigned int offset = blockIdx.x * overlap;
...
value = atomicAdd(buffer + offset, value);
...

In these cases atomicAdd performance was many times better (>3x) on the Compute 1.3 cards because of the coalescing. Sustained best case in my tests was around 29 cycles of latency on average for an atomicAdd (keep in mind only the first thread of each half-warp was doing the global atomicAdd), which is about 7 to 8 instructions (at 4 cycles per instruction/warp) to hide that latency.
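For reference, the measurement was along these lines (a reconstruction of the description above, not the exact code; iteration count and output buffers are illustrative):

Code:
// Rough reconstruction of the timing idea: only the first thread of each
// half-warp issues the atomicAdd, the loop trip count comes from a kernel
// argument so the compiler can't unroll or remove it, and clock() samples
// bracket the loop. Per-thread start/end times are written out and the
// overall min/max is taken on the host.
__global__ void atomic_latency(int *buffer, int overlap, int iterations,
                               unsigned int *t_start, unsigned int *t_end)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int start = (unsigned int)clock();

    if ((threadIdx.x & 15) == 0) {                // first thread of each half-warp
        int offset = blockIdx.x * overlap;
        int value  = tid;
        for (int k = 0; k < iterations; ++k)      // forced non-unrolled loop
            value = atomicAdd(buffer + offset, value);
        if (value == 0x7fffffff) buffer[0] = 0;   // keep the result "live"
    }

    unsigned int end = (unsigned int)clock();
    t_start[tid] = start;
    t_end[tid]   = end;
}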
 
I was under the impression that queuing is already "on-die" for fixed function only (queues threads to fill warps for VS and PS shader paths and has the fixed function work distribution logic as well). All I was suggesting here is to make modifications necessary to open this up in a general way. But perhaps not likely to happen.
I meant for general purpose queuing. The existing on-die queues work because of the incredibly tightly-controlled scheduling and routing of work, e.g. the post transform vertex cache has a capacity measured in 10s of vertices and it's not considered a problem to re-shade vertices if they drop out of the cache slightly too early - though people do optimise vertex ordering in order to maximise vertex throughput.

http://castano.ludicon.com/blog/2009/02/02/optimal-grid-rendering/#more-860

http://developer.amd.com/GPU/TOOTLE/Pages/default.aspx

From my limited understanding, HS generates a group of control point data and tessellation factors per primitive (I'm going to refer to this as the HS output packet). Tessellation factors get forwarded to TS. DS runs per TS output point with uv input and access to the HS output packet. So in theory HS output would be limited by TS output rate, and post TS the DS would just churn out points all using the same "shared memory" HS output packet.
So the work associated with the HS output packet is associated with TS-generated vertices. 32 control points with a <=64x multiplier on the number of vertices? So 64 warps' worth of work?

I suppose the real issue here is the "lumpiness" of the data-flows. Like 3 buses all arriving together.

Is TS single-threaded?...

Or if the append/consume path is to global memory anyway, they could emulate the queues using shared memory and atomic operations.
I suspect this is how geometry amplification is handled currently by GS. Though bandwidth is still troubling. But with 1024 scalars per vertex, a batch of 16 vertices will fill 16KB of shared memory, which sounds like what was chosen for D3D10...

However, if one can append to and consume from the same queue in a given thread group (an example might be a pool of nodes for dynamic allocation/free), this type of thing gets tough to do with the atomic API in PTX/CUDA (the queue write and the queue head pointer update need to be atomic together if a consume could possibly overlap).
Yeah, let the hardware take the strain. If you're on Larrabee hope someone's written the library function.

If the set of possible dynamic register-supplied addresses is known, then possibly the register allocation is the union. Seems likely wasteful, but could be very useful to handle the shader explosion problem with traditional game engine material systems. The idea here being that you don't get direct access to branch by register, but the compiler/linker can feed addresses into (invisible to the shader) shared memory, which can then be fetched into a register and branched to. I still need to go through all the DX11 linking stuff to get an idea of what is required for DX11 (which very well might just be a more advanced version of the shader patching that is already done now).
Maybe it's all static, the dynamic stuff happens in the driver, using its own cache (on disk if necessary) of shader sub-functions that have been compiled already with resource descriptors (e.g. number of registers, list of side-effect registers)...

Jawed
 
(1.) The atomicAdd() latency between an 8600 GTS (compute 1.1) and a GTX 275 (compute 1.3) was nearly the same in cases where all threads accessed the same address. It was around 330-340 cycles of latency on average, which requires around 84 instructions (at 4 clk/instruction) to hide.
That's pretty brutal considering it should be on-die.

(2.) For a range of colliding to non-colliding address cases I ran something like this with various settings for overlap (16,8,4,2,1),

unsigned int offset = blockIdx.x * overlap;
...
value = atomicAdd(buffer + offset, value);
...

In these cases atomicAdd performance was many times better (>3x) on the Compute 1.3 cards because of the coalescing. Sustained best case in my tests was around 29 cycles of latency on average for an atomicAdd (keep in mind only the first thread of each half-warp was doing the global atomicAdd), which is about 7 to 8 instructions (at 4 cycles per instruction/warp) to hide that latency.
What are the grid dimensions? How many warps per block?

Is it definitely coalescing that's increasing performance? Not merely the number of warps concurrently able to do atomics (either or both of number of clusters and number of MCs)?

Jawed
 
(1.) The atomicAdd() latency between an 8600 GTS (compute 1.1) and a GTX 275 (compute 1.3) was nearly the same in cases where all threads accessed the same address. It was around 330-340 cycles of latency on average, which requires around 84 instructions (at 4 clk/instruction) to hide.

(2.) For a range of colliding to non-colliding address cases I ran something like this with various settings for overlap (16,8,4,2,1),

unsigned int offset = blockIdx.x * overlap;
...
value = atomicAdd(buffer + offset, value);
...

In these cases atomicAdd performance was many times better (>3x) on the Compute 1.3 cards because of the coalescing. Sustained best case in my tests was around 29 cycles of latency on average for an atomicAdd (keep in mind only the first thread of each half-warp was doing the global atomicAdd), which is about 7 to 8 instructions (at 4 cycles per instruction/warp) to hide that latency.

Hmm, interesting. It looks like Nvidia's implementation of atomic add works in such a way that repeatedly accessing the same address takes longer than accessing different addresses and the latency is roughly equal to their memory latency.

The performance here is actually the opposite of what you would expect in a CPU-based system. It looks like Nvidia is fully serializing the accesses to the same address, but in the case of different addresses it is issuing them and then switching warps, which effectively hides the latency.
 