Nvidia GT300 core: Speculation

Ailuros · May 7, 2009

16 is a trendier number than 32; it matches his purse's colour better in any case

Jawed · May 7, 2009

Kirkian Distortion Field Disabled

http://www.vrvis.at/presse/pressemitteilungen-pdf/vct-2009-talk-luebke

NVidia seems pretty determined that fixed function hardware isn't going anywhere soon. Then ruins it all by saying "need more flexible access, softer falloffs", slide 70.

http://www.ece.neu.edu/groups/nucar/GPGPU/GPGPU-2/Luebke.pdf

Slide 60 "Persistent thread blocks reading and writing work queues" and "Thread block or warp as a (parallel) task".

Jawed

Humus · May 7, 2009

Ailuros said:
16 is a trendier number than 32; it matches his purse's colour better in any case

But 16 is not an important number.

David Kirk said:
I believe that 16, for example, is not an important number.

TimothyFarrar · May 7, 2009

Jawed said:
Slide 60 "Persistent thread blocks reading and writing work queues" and "Thread block or warp as a (parallel) task".

Actually this runs right inline with a theory I've been thinking about for GT300,

What if GT300 and CUDA 3.0 was all about general purpose hardware accelerated on chip queuing. And still no coherent writable caches (because they are not efficient by design for massively parallel programming!, sorry couldn't help slipping that in there

). DX11 doubles the number of stages, many of which have variable amounts of output. Makes sense to not have dedicated piping for these.

Part of what I'm saying is that DX11's Append and Consume buffers would be hardware supported (instead of manual shared memory/register scan/scatter with an atomic gather for queue head update). And also might not necessarily require a round trip to memory (in fact with what I'm suggesting in the CUDA 3.0 case below, it would never hit memory).

DX11 adds indirect dispatch (so can fetch dimensions at runtime). So in theory one could fill a queue and then dispatch a new job based on the amount of the queue filled. Or even better, I suggesting CUDA 3.0 will add ability to associate a kernel with a queue, so that when the queue fills up enough for a thread block to be started, the hardware routes and starts the job for you on an available core. Remember this is all on-chip.

This kind of general purpose hardware thread queuing could be huge, really huge. Think of all the CUDA time spend on stream compaction/expansion. This would now be done in hardware (append/consume). When a job goes too divergent, you could add to a queue based on the divergence, and the hardware would regroup threads into efficient filled warps for free. Things like requeuing threads for better data locality would be possible. The opportunities are tremendous.

This is what I think NVidia had in mind when saying they are moving the GPU to a general purpose massively parallel processor. It would be an order of magnitude more efficient at solving the tough parts of parallel problems than attempting to emulate the functionality using a wide vector CPU with a coherent cache...

Also it isn't a giant leap from what already seems to be happening in the fixed function geometry units attached to the multiprocessors, and it is perfect for DX11 feature set. So it indeed might be possible for GT300.

Jawed · May 8, 2009

TimothyFarrar said:
Actually this runs right inline with a theory I've been thinking about for GT300,

What if GT300 and CUDA 3.0 was all about general purpose hardware accelerated on chip queuing.

We know pre-emptively scheduled arbitrary concurrent kernels are coming to D3D at some point, so it seems reasonable to assume that NVidia's leading D3D by one generation, much like it did with the pure-computation+shared-memory model introduced with G80.

Of course the difference is that Larrabee is already where NVidia's going. They just don't like admitting it.

And still no coherent writable caches (because they are not efficient by design for massively parallel programming!, sorry couldn't help slipping that in there ).

Bearing in mind how much CUDA struggles with weather forecasting:

http://forum.beyond3d.com/showthread.php?t=49266

which is the mother of all massively parallel workloads...

DX11 doubles the number of stages, many of which have variable amounts of output. Makes sense to not have dedicated piping for these.

Part of what I'm saying is that DX11's Append and Consume buffers would be hardware supported (instead of manual shared memory/register scan/scatter with an atomic gather for queue head update). And also might not necessarily require a round trip to memory (in fact with what I'm suggesting in the CUDA 3.0 case below, it would never hit memory).

It'd be pretty shocking if it did do a round trip through memory! Though I'm not sure how many of these buffers can be active simultaneously. Is it 8 or something like 128 or arbitrary?

DX11 adds indirect dispatch (so can fetch dimensions at runtime). So in theory one could fill a queue and then dispatch a new job based on the amount of the queue filled.

That appears to be providing a dynamically sizable execution domain, i.e. runtime defined count of thread groups, rather than statically sized. The execution domain is 3-dimensional and nested within that, each thread group is also 3-dimensional. I don't understand what dependency you're alluding to :???:

Or even better, I suggesting CUDA 3.0 will add ability to associate a kernel with a queue, so that when the queue fills up enough for a thread block to be started, the hardware routes and starts the job for you on an available core. Remember this is all on-chip.

Logically the GPUs are doing this already. The difference we want is that the count and type of kernels that can concurrently execute is arbitrary. This obviously has an impact on both load-balancing and thread-group issue policy. The scoreboards have to become generalised in the limit. Or, of course, you implement software scoreboarding.

This kind of general purpose hardware thread queuing could be huge, really huge. Think of all the CUDA time spend on stream compaction/expansion. This would now be done in hardware (append/consume). When a job goes too divergent, you could add to a queue based on the divergence, and the hardware would regroup threads into efficient filled warps for free. Things like requeuing threads for better data locality would be possible. The opportunities are tremendous.

So back-door dynamic warp formation? Of course on-die cache provides a nice fabric to handle the spillage while the nascent warps are being formed and to support data-routing and optionally, of course, migration of warps across the die for true work-stealing goodness.

What's interesting about append/consume is it's really just a short-cut past the major bottleneck of having to perform your own scan to set addresses and all that palaver.

This is what I think NVidia had in mind when saying they are moving the GPU to a general purpose massively parallel processor. It would be an order of magnitude more efficient at solving the tough parts of parallel problems than attempting to emulate the functionality using a wide vector CPU with a coherent cache...

Since task-parallelism is where Larrabee evolved from...

Also it isn't a giant leap from what already seems to be happening in the fixed function geometry units attached to the multiprocessors, and it is perfect for DX11 feature set. So it indeed might be possible for GT300.

I'm looking forward to this kind of flexibility (just as a geek observer...). I cringe at the thought of GPUs that can't execute multiple arbitrary kernels - I was shocked to discover that's how all of them have been so far, it's so bleeding obvious as a requirement.

Of course the GPUs make it hard for themselves by insisting on scoreboarding tens/hundreds of contexts (hardware-threads) per core. As ALU:TEX/fetch/atomic continues to climb as time goes by, it doesn't make sense, in my view, to continue to expand the size of the hardware-scoreboarding to cope with the increased number of threads in flight.

Jawed

Ailuros · May 8, 2009

Humus said:
But 16 is not an important number.

ROTFLMAO

you made my day....

Jawed said:
Of course the difference is that Larrabee is already where NVidia's going. They just don't like admitting it.

I wouldn't say that PowerVR's SGX is heading in the LRB direction per se, so I guess there's always a way to find a middle ground that works for most if not all scenarios. And that's of course not exclusive to NVIDIA either.

In fact I'd be very surprised (and yes it's pretty bold claim on my behalf) if LRB can stand all that proud against ATI's D3D11 solution.

TimothyFarrar · May 8, 2009

Jawed said:
We know pre-emptively scheduled arbitrary concurrent kernels are coming to D3D at some point, so it seems reasonable to assume that NVidia's leading D3D by one generation, much like it did with the pure-computation+shared-memory model introduced with G80.

Of course the difference is that Larrabee is already where NVidia's going. They just don't like admitting it.

Are they? Or is something better than preemptive scheduling going to be the NVidia GPU way?

Preemptive scheduling suffers from non-deterministic behavior, cold memory and tlb caches, expensive job start kernel calls. Also with the GPU you'd be swapping in and out large amounts of GPU state (texture registers etc).

With what I'm suggesting, general purpose hardware job routing and queuing, GPU state now includes a new first class member, the queue. Queues are associated with kernels. Queue/kernel association is setup in command buffer just like any other draw call. The hardware proactively manages jobs by reading from the queues, setting up thread blocks, routing data to an available core, and then starting execution of the thread block using the kernel associated with the queue. Queues are in hardware. No cold cache, data is always hot and ready.

This seems like a much better way to me. You don't preempt thread groups, you simply don't schedule them until the data is ready.

It'd be pretty shocking if it did do a round trip through memory! Though I'm not sure how many of these buffers can be active simultaneously. Is it 8 or something like 128 or arbitrary?

I haven't seen anything on DX11 Append buffer limits yet.

I don't understand what dependency you're alluding to

I'm not sure I understand what you are asking about here?

So back-door dynamic warp formation? Of course on-die cache provides a nice fabric to handle the spillage while the nascent warps are being formed and to support data-routing and optionally, of course, migration of warps across the die for true work-stealing goodness.

Yeah it is somewhat like back-door dynamic warp formation.

What's interesting about append/consume is it's really just a short-cut past the major bottleneck of having to perform your own scan to set addresses and all that palaver.

Exactly.

Jawed · May 8, 2009

TimothyFarrar said:
Are they? Or is something better than preemptive scheduling going to be the NVidia GPU way?

I guess I need to describe what I understand by the pre-emption that D3D is destined to gain.

The OS or, in compute terms, the application, has the ability to force running kernels (however many contexts) to defer to a named context (pipeline or kernel(s)). A subset of this functionality is that the GPU nominally supports multiple contexts, e.g. two entirely independent 3D rendering pipelines could be loaded/running concurrently, each with their own VS/PS/etc. kernels. State changes (e.g. shader code, Z-compare mode) are totally separate and concurrent.

Pre-emption is there to hoist one context/pipeline (or kernel) above the others. The idea from the OS point of view, is to guarantee responsiveness for advanced user-interface and OS functionality techniques, as I understand it.

In the more general case, the GPU has to be able to manage multiple competing contexts consisting of arbitrary independent pipelines of independent kernels. Applications can then deploy multiple competing contexts/kernels of their own. I don't know what form of scheduling-policy will be exposed and whether applications can steer this. I don't know if user applications can raise these "GPU interrupts" - they're not interrupts as such, but they sort of have an effect like an interrupt.

Preemptive scheduling suffers from non-deterministic behavior,

So does the lifetime of any individual warp, as the GPU pre-emptively schedules individual warps.

cold memory and tlb caches, expensive job start kernel calls. Also with the GPU you'd be swapping in and out large amounts of GPU state (texture registers etc).

Not in the general case when the competing kernels are already running. The competing states would already be co-existent. I guess in the extreme case a cold context could intrude on the GPU.

With what I'm suggesting, general purpose hardware job routing and queuing, GPU state now includes a new first class member, the queue. Queues are associated with kernels.

GPU state per kernel already includes queue condition (i.e. load-balancing is queue-aware). You can't fire off a new warp of fragments without a rasterised triangle's state ready to roll.

The requirement is to generalise kernel state so that we aren't stuck with only being able to run a single CS kernel at any one time, or have a single time-sliced 3D pipeline shared by competing applications, each of which has to flush the pipe in order to load its kernels and get results.

It's sort of like on-die, not so abstract, concurrent-BSGP:

http://www.kunzhou.net/2008/BSGP.pdf

I dunno why that hasn't made more of a splash.

TimothyFarrar said:
I'm not sure I understand what you are asking about here?

TimothyFarrar said:

So in theory one could fill a queue and then dispatch a new job based on the amount of the queue filled.

Click to expand...

You appear to be saying that the length of the queue affects the despatch of a kernel (the time when the kernel starts). But the kernel is despatched by defining it's thread-group dimensions and saying "go". Thereafter it runs freely.

D3D11 is allowing you to build multiple buffers of this type. You could have 10 scan kernels by declaring kernel-buffers, ScanA, ScanB etc., and then fill/fire them off in sequence with DispatchIndirect at whatever time interval or conditional suits your needs. You might decide to fire off only ScanC and ScanF after the results from other kernels are computed. The Scans might even be serially dependent and the condition is based upon some threshold

At least, that's how I understand D3D11's DispatchIndirect.

Maybe you're saying that a kernel is only despatched each time the amount of data fills the thread-group's allocated queue. In which case the kernel runs intermittently, and potentially persistently.

That's an attractive model from the point of view of task-level parallelism, but I don't think that's what D3D11 is actually allowing you to do. Need to find a D3D11-guru...

http://msdn.microsoft.com/en-us/library/dd445764.aspx

Separately, of course, GT300 might support this kind of generalised persistent-kernel-as-task-parallelism model.

GPUs already do this in a limited sense, with task parallelism that is heavily constrained: there's one VS and one PS, each sat there waiting for work to arrive. Fully general would be seriously cool and clearly keep CUDA ahead of D3D11-CS.

I'm unclear on the context-scheduling/task-parallelism model of OpenCL.

Jawed

TimothyFarrar · May 8, 2009

Sounds like what you are saying is more along the lines of associating a priority to kernels rather than preemption (stopping them after they have started execution, paging out all the context and starting a new kernel).

You can't fire off a new warp of fragments without a rasterised triangle's state ready to roll.

This is part of what I'm suggesting to change.

D3D11 is allowing you to build multiple buffers of this type. You could have 10 scan kernels by declaring kernel-buffers, ScanA, ScanB etc., and then fill/fire them off in sequence with DispatchIndirect at whatever time interval or conditional suits your needs. You might decide to fire off only ScanC and ScanF after the results from other kernels are computed. The Scans might even be serially dependent and the condition is based upon some threshold

The DX11 DispatchIndirect case likely requires that queue output hits main memory, kernel to kernel transitions are sync points (if the kernels aren't in separate independent "streams"). So while it is useful, it isn't the best solution.

Maybe you're saying that a kernel is only despatched each time the amount of data fills the thread-group's allocated queue. In which case the kernel runs intermittently, and potentially persistently.

That is exactly what I am saying.

This avoids having a "worker" kernel just sitting around holding registers and shared memory. It avoids the need to buffer queues to global memory between kernel transitions.

That's an attractive model from the point of view of task-level parallelism, but I don't think that's what D3D11 is actually allowing you to do.

Yeah the model isn't directly supported by DX11, but the DX11 pipeline could easily be build on this model. Especially the variable amount of data expansion which can happen because of tessellation.

Separately, of course, GT300 might support this kind of generalised persistent-kernel-as-task-parallelism model.

This is why I think something like CUDA has great importance. CUDA supported functionality years ahead of it getting in DX and even now, DX11 CS4 supports only a small fraction of what CUDA does on DX10 level NVidia hardware (starting post G80, compute 1.1). So NVidia has the option of moving way ahead here in a way that the GP people (like me) can immediately use via CUDA 3.0, or alternatively if in a console, would be awesome for graphics beyond what is possible with DX11 alone.

Jawed · May 8, 2009

TimothyFarrar said:
Sounds like what you are saying is more along the lines of associating a priority to kernels rather than preemption (stopping them after they have started execution, paging out all the context and starting a new kernel).

In some future D3D, multiple contexts are running concurrently, e.g.:

A might be a D3D rendering pipeline for a game, i.e. VS/HS/DS/GS/PS kernels
B could be motion-estimation CS kernel running as part of an h.264-encoding application
C is Aero, i.e. VS/PS

Any time the user does something in the desktop UI, Aero can pre-empt other work in order to guarantee responsiveness.

Pre-emption is possible in this competing contexts model. Without pre-emption you'd have no guaranteed responsiveness.

In Aero, currently, responsiveness is effected solely through cooperative applications all being forced to share the same 3D pipeline (that's my understanding). But that can go astray, e.g. you can completely lock the GPU with an errant CUDA kernel and be forced to reboot the PC...

The DX11 DispatchIndirect case likely requires that queue output hits main memory, kernel to kernel transitions are sync points (if the kernels aren't in separate independent "streams"). So while it is useful, it isn't the best solution.

Well, I'm assuming that D3D11 has a "one CS kernel at any time" rule, so there's no choice but to hit memory. RV770 has the interesting property that registers/LDS can retain data between kernels, so maybe D3D11 will work this way where possible. Obviously that only relates to fairly small amounts of data.

That is exactly what I am saying.

This avoids having a "worker" kernel just sitting around holding registers and shared memory. It avoids the need to buffer queues to global memory between kernel transitions.

Only if the queues are small enough to stay on-die and the GPU supports multiple contexts. Stream Out is the obvious example of off-die buffering currently, and append/consume are the general case in D3D11.

Yeah the model isn't directly supported by DX11, but the DX11 pipeline could easily be build on this model. Especially the variable amount of data expansion which can happen because of tessellation.

Who knows, maybe D3D11.1 does this and that's what ATI will be delivering this year.

This is why I think something like CUDA has great importance. CUDA supported functionality years ahead of it getting in DX and even now, DX11 CS4 supports only a small fraction of what CUDA does on DX10 level NVidia hardware (starting post G80, compute 1.1). So NVidia has the option of moving way ahead here in a way that the GP people (like me) can immediately use via CUDA 3.0, or alternatively if in a console, would be awesome for graphics beyond what is possible with DX11 alone.

Playing catchup with Larrabee.

Jawed

MfA · May 9, 2009

TimothyFarrar said:
This would now be done in hardware (append/consume). When a job goes too divergent, you could add to a queue based on the divergence, and the hardware would regroup threads into efficient filled warps for free.

This would take an awful lot of data storage (and movement if the the thread contexts aren't all kept local). It also completely messes up the way shared memory is used AFAICS. If you go that far I think it would be easier to just support per thread branching.

Megadrive1988 · May 9, 2009

shyam335 said:
http://www.brightsideofnews.com/new...to-continue-with-complicated-controllers.aspx

With GT300 set to possibly be a 3 TFLOPs beast, and with things like ROPS and texture units not likely to decrease, I'm not surprised it also has a 512-bit memory bus interface.

I wasn't expecting Nvidia to go from 512-bit bus to something more narrow, like AMD did when going from R600 in early 2007 to RV770 in mid 2008.

almighty · May 9, 2009

GT300 could be a beast from this thread but Nvidia have to tackle the biggest issue.

PRICE, get it wrong and all that tech will go unused as consumers will go for somthing cheaper

Ailuros · May 9, 2009

Megadrive1988 said:
With GT300 set to possibly be a 3 TFLOPs beast, and with things like ROPS and texture units not likely to decrease, I'm not surprised it also has a 512-bit memory bus interface.

I wasn't expecting Nvidia to go from 512-bit bus to something more narrow, like AMD did when going from R600 in early 2007 to RV770 in mid 2008.

Assuming the 512SPs are correct and it's still 3FLOPs/SP (MADD+MUL), then it would take a 2GHz frequency to reach those hypothetical 3TFLOPs. Chip complexity scales with each new generation but by far more than core frequency; quite often frequencies even remain in the same ballpark as former GPUs.

As for the GDDR5@512bits my sniffing nose tells me that there's another primary reason for the chip needing insane bandwidth rates far apart from arithmetic throughput and texel fillrates....

Jawed · May 9, 2009

Megadrive1988 said:
With GT300 set to possibly be a 3 TFLOPs beast, and with things like ROPS and texture units not likely to decrease, I'm not surprised it also has a 512-bit memory bus interface.

With the ~doubling of bandwidth per bit that GDDR5 gives, you could argue that 64 ROPs are likely. TMUs prolly won't increase anything like as fast, assuming that ALU:TMU increases. And since it's likely that ALU complexity (CUDA overhead) will increase, the die area ALU:TMU will tend to lower the increase in TMUs too.

Jawed

TimothyFarrar · May 9, 2009

MfA said:
This would take an awful lot of data storage (and movement if the the thread contexts aren't all kept local). It also completely messes up the way shared memory is used AFAICS. If you go that far I think it would be easier to just support per thread branching.

I was thinking that the thread contexts wouldn't get saved or moved. The new thread group would have to start like a fresh thread block starts (with nothing in shared memory). What I'm suggesting wouldn't be useful for fine granularity dynamic branching...

Jawed · May 9, 2009

Purely in terms of inter-kernel buffering, there's a decent chance that the quantity of data is reasonable enough. D3D11 requires 32 vec4 attributes per vertex, for example. So for a warp that's 32 * 32 * 4 * 4 = 16KB of data that's hanging around until it's consumed.

But there's still an issue of moving this data to the multiprocessor that will consume it. Load-balancing means this is inevitable.

I've just discovered that in ATI the data output by VS into GS actually gets written to memory.

Jawed

nAo · May 9, 2009

Jawed said:
I've just discovered that in ATI the data output by VS into GS actually gets written to memory.

That's the only thing a sane mind would do given the huge amplification possible with geometry shaders.

MfA · May 9, 2009

TimothyFarrar said:
What I'm suggesting wouldn't be useful for fine granularity dynamic branching...

You said it could get efficiently filled warps with a divergent job ... what exactly did you mean with that then?

Jawed · May 9, 2009

nAo said:
That's the only thing a sane mind would do given the huge amplification possible with geometry shaders.

No, this is the output from VS (renamed ES in this scenario). In the new document from AMD:

http://forum.beyond3d.com/showpost.php?p=1291797&postcount=9

Supports 4 thread types:

ES (Export Shader) – typically a vertex shader before a geometry shader (it only outputs to
memory).

GS (Geometry shader) – optional.

VS (Vertex Shader) – null when GS is active, a normal vertex shader when GS is off.

PS (Pixel Shader).

the ES is so called because it writes to memory. GS-amplified data going via memory isn't stated anywhere that I can discern.

I'm guessing the reason ES writes to memory is that ordering of vertex consumption by GS is non-linear.

GS optionally writes to memory with Stream Out. I don't know what the throughput of a GS that amplifies tops out at. Since the amplified data itself costs time to both calculate and export from the kernel (since new vertices are being generated and some vertices can be culled), it is presumably possible to load-balance this throughput against pixel-shading consumption.

Jawed

Nvidia GT300 core: Speculation

Ailuros

Epsilon plus three

Jawed

Humus

Crazy coder

TimothyFarrar

Jawed

Ailuros

Epsilon plus three

TimothyFarrar

Jawed

TimothyFarrar

Jawed

MfA

Megadrive1988

almighty

Ailuros

Epsilon plus three

Jawed

TimothyFarrar

Jawed

nAo

Nutella Nutellae

MfA

Jawed

Similar threads