16 is a trendier number than 32; it matches his purse's colour better in any case
David Kirk said: I believe that 16, for example, is not an important number.
Slide 60 "Persistent thread blocks reading and writing work queues" and "Thread block or warp as a (parallel) task".
We know pre-emptively scheduled arbitrary concurrent kernels are coming to D3D at some point, so it seems reasonable to assume that NVidia's leading D3D by one generation, much like it did with the pure-computation+shared-memory model introduced with G80.

Actually this runs right in line with a theory I've been thinking about for GT300: what if GT300 and CUDA 3.0 were all about general purpose, hardware-accelerated on-chip queuing?
Bearing in mind how much CUDA struggles with weather forecasting:

And still no coherent writable caches (because they are not efficient by design for massively parallel programming! Sorry, couldn't help slipping that in there).
It'd be pretty shocking if it did do a round trip through memory! Though I'm not sure how many of these buffers can be active simultaneously. Is it 8 or something like 128 or arbitrary?

DX11 doubles the number of stages, many of which have variable amounts of output. Makes sense to not have dedicated piping for these.
Part of what I'm saying is that DX11's Append and Consume buffers would be hardware-supported (instead of a manual shared-memory/register scan/scatter with an atomic gather for the queue-head update). They also might not necessarily require a round trip to memory (in fact, with what I'm suggesting in the CUDA 3.0 case below, it would never hit memory).
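For reference, the manual pattern being replaced looks roughly like this in CUDA today (just a sketch: the kernel and buffer names are made up, and a real implementation would use a work-efficient scan rather than this naive one):

    // A thread block compacts its outputs with a shared-memory scan, then one
    // thread reserves space in the global queue with a single atomicAdd (the
    // "queue head update"), and everyone scatters into the reserved range.
    __global__ void appendSurvivors(const float *in, int n,
                                    float *g_queue, unsigned int *g_queueCount)
    {
        extern __shared__ unsigned int s_scan[];   // one slot per thread
        __shared__ unsigned int s_base;

        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        int tid = threadIdx.x;

        // Predicate: does this thread have an element to append?
        bool keep = (gid < n) && (in[gid] > 0.0f);
        s_scan[tid] = keep ? 1u : 0u;
        __syncthreads();

        // Naive inclusive scan (Hillis-Steele) -- fine for illustration.
        for (unsigned int offset = 1; offset < blockDim.x; offset <<= 1) {
            unsigned int v = (tid >= offset) ? s_scan[tid - offset] : 0u;
            __syncthreads();
            s_scan[tid] += v;
            __syncthreads();
        }

        // One atomic per block reserves a contiguous range of the queue.
        if (tid == blockDim.x - 1)
            s_base = atomicAdd(g_queueCount, s_scan[tid]);
        __syncthreads();

        // Scatter the surviving elements into the reserved range.
        if (keep)
            g_queue[s_base + s_scan[tid] - 1] = in[gid];
    }

Launched as appendSurvivors<<<blocks, threads, threads * sizeof(unsigned int)>>>(...). With a D3D11 Append buffer, all of that collapses into a single Append() call and the counter maintenance becomes the hardware's/runtime's problem.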
That appears to be providing a dynamically sizable execution domain, i.e. a runtime-defined count of thread groups rather than a statically sized one. The execution domain is 3-dimensional and, nested within that, each thread group is also 3-dimensional. I don't understand what dependency you're alluding to.

DX11 adds indirect dispatch (so it can fetch the dimensions at runtime). So in theory one could fill a queue and then dispatch a new job based on the amount of the queue filled.
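Without indirect dispatch, the CUDA-today version of "fill a queue, then size the next job by how much got filled" needs a readback. Something like this sketch, with all names purely illustrative:

    #include <cuda_runtime.h>

    // First pass: every thread that has something to output appends it to a
    // global queue with an atomicAdd (a bare-bones "fill a queue").
    __global__ void fillQueue(const float *in, int n, float *queue, unsigned int *count)
    {
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        if (gid < n && in[gid] > 0.0f)
            queue[atomicAdd(count, 1u)] = in[gid];
    }

    // Second pass: sized by however much the first pass produced.
    __global__ void processQueue(float *queue, unsigned int count)
    {
        unsigned int gid = blockIdx.x * blockDim.x + threadIdx.x;
        if (gid < count)
            queue[gid] *= 2.0f;   // stand-in for real work
    }

    // Host side: the count has to make a round trip to the CPU before the
    // second launch can be sized.
    void runTwoPass(const float *d_in, int n, float *d_queue, unsigned int *d_count)
    {
        cudaMemset(d_count, 0, sizeof(unsigned int));

        int threads = 256;
        fillQueue<<<(n + threads - 1) / threads, threads>>>(d_in, n, d_queue, d_count);

        unsigned int count = 0;
        cudaMemcpy(&count, d_count, sizeof(unsigned int), cudaMemcpyDeviceToHost);

        if (count > 0)
            processQueue<<<(count + threads - 1) / threads, threads>>>(d_queue, count);
    }

DispatchIndirect instead reads the group counts from a buffer the GPU itself wrote, so the CPU never has to see the count.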
Logically the GPUs are doing this already. The difference we want is that the count and type of kernels that can concurrently execute is arbitrary. This obviously has an impact on both load-balancing and thread-group issue policy. The scoreboards have to become generalised in the limit. Or, of course, you implement software scoreboarding.

Or even better, I'm suggesting CUDA 3.0 will add the ability to associate a kernel with a queue, so that when the queue fills up enough for a thread block to be started, the hardware routes and starts the job for you on an available core. Remember, this is all on-chip.
So back-door dynamic warp formation? Of course on-die cache provides a nice fabric to handle the spillage while the nascent warps are being formed and to support data-routing and, optionally, migration of warps across the die for true work-stealing goodness.

This kind of general purpose hardware thread queuing could be huge, really huge. Think of all the CUDA time spent on stream compaction/expansion. This would now be done in hardware (append/consume). When a job goes too divergent, you could add to a queue based on the divergence, and the hardware would regroup threads into efficiently filled warps for free. Things like requeuing threads for better data locality would be possible. The opportunities are tremendous.
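The software approximation of that requeue-on-divergence idea might look something like this in current CUDA (a sketch with made-up names; the point of the proposal is that hardware queuing would do the regrouping without the extra pass or the global-memory traffic):

    // When a thread hits the expensive/divergent path, it doesn't take it
    // in-line -- it appends the work item to a global queue, and a second
    // kernel later processes that queue with fully packed warps.
    __global__ void mainPass(const float *in, float *out, int n,
                             int *g_divergentIdx, unsigned int *g_divergentCount)
    {
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        if (gid >= n) return;

        float x = in[gid];
        if (x < 10.0f) {
            out[gid] = 2.0f * x;                 // cheap, coherent path taken in-line
        } else {
            // Defer the expensive path: queue the item instead.
            g_divergentIdx[atomicAdd(g_divergentCount, 1u)] = gid;
        }
    }

    __global__ void divergentPass(const float *in, float *out,
                                  const int *g_divergentIdx, unsigned int count)
    {
        unsigned int qid = blockIdx.x * blockDim.x + threadIdx.x;
        if (qid >= count) return;

        int gid = g_divergentIdx[qid];           // warps here are densely packed
        float x = in[gid];
        for (int i = 0; i < 100; ++i)            // stand-in for the expensive path
            x = sqrtf(x) + 1.0f;
        out[gid] = x;
    }

Today the second pass still has to be launched from the host with the count read back; the suggestion is that a hardware queue bound to the second kernel would form and launch those packed warps itself, on-chip.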
Since task-parallelism is where Larrabee evolved from...

This is what I think NVidia had in mind when saying they are moving the GPU to a general purpose massively parallel processor. It would be an order of magnitude more efficient at solving the tough parts of parallel problems than attempting to emulate the functionality using a wide vector CPU with a coherent cache...
I'm looking forward to this kind of flexibility (just as a geek observer...). I cringe at the thought of GPUs that can't execute multiple arbitrary kernels - I was shocked to discover that's how all of them have been so far; it's so bleeding obvious as a requirement.

Also, it isn't a giant leap from what already seems to be happening in the fixed function geometry units attached to the multiprocessors, and it is perfect for the DX11 feature set. So it indeed might be possible for GT300.
But 16 is not an important number.
Of course the difference is that Larrabee is already where NVidia's going. They just don't like admitting it.
We know pre-emptively scheduled arbitrary concurrent kernels are coming to D3D at some point, so it seems reasonable to assume that NVidia's leading D3D by one generation, much like it did with the pure-computation+shared-memory model introduced with G80.
It'd be pretty shocking if it did do a round trip through memory! Though I'm not sure how many of these buffers can be active simultaneously. Is it 8 or something like 128 or arbitrary?
I don't understand what dependency you're alluding to
So back-door dynamic warp formation? Of course on-die cache provides a nice fabric to handle the spillage while the nascent warps are being formed and to support data-routing and optionally, of course, migration of warps across the die for true work-stealing goodness.
What's interesting about append/consume is it's really just a short-cut past the major bottleneck of having to perform your own scan to set addresses and all that palaver.
I guess I need to describe what I understand by the pre-emption that D3D is destined to gain.

Are they? Or is something better than preemptive scheduling going to be the NVidia GPU way?
So does the lifetime of any individual warp, as the GPU pre-emptively schedules individual warps.

Preemptive scheduling suffers from non-deterministic behavior,

Not in the general case when the competing kernels are already running. The competing states would already be co-existent. I guess in the extreme case a cold context could intrude on the GPU.

cold memory and TLB caches, expensive job-start kernel calls. Also with the GPU you'd be swapping in and out large amounts of GPU state (texture registers etc.).
GPU state per kernel already includes queue condition (i.e. load-balancing is queue-aware). You can't fire off a new warp of fragments without a rasterised triangle's state ready to roll.

With what I'm suggesting, general purpose hardware job routing and queuing, GPU state now includes a new first-class member: the queue. Queues are associated with kernels.
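Purely as a thought experiment, the per-kernel queue state being described might look something like this descriptor (nothing here exists in CUDA or any driver; every field name is made up, only to make the idea concrete):

    // Hypothetical -- no such structure exists anywhere; illustration only.
    struct HwQueueDesc {
        void        *storage;          // on-chip backing store for queued work items
        unsigned int elementBytes;     // size of one work item
        unsigned int capacity;         // how many items can be held on-chip
        unsigned int launchThreshold;  // items needed before a block is launched
        const void  *boundKernel;      // the kernel the hardware starts to drain the queue
    };

The load-balancer's scoreboard would then track entries like this alongside the usual grid state, instead of only fixed thread-group dimensions.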
You appear to be saying that the length of the queue affects the despatch of a kernel (the time when the kernel starts). But the kernel is despatched by defining its thread-group dimensions and saying "go". Thereafter it runs freely.

I'm not sure I understand what you are asking about here?
So in theory one could fill a queue and then dispatch a new job based on the amount of the queue filled.
You can't fire off a new warp of fragments without a rasterised triangle's state ready to roll.
D3D11 is allowing you to build multiple buffers of this type. You could have 10 scan kernels by declaring kernel-buffers ScanA, ScanB etc., and then fill/fire them off in sequence with DispatchIndirect at whatever time interval or conditional suits your needs. You might decide to fire off only ScanC and ScanF after the results from other kernels are computed. The Scans might even be serially dependent, with the condition based upon some threshold.
Maybe you're saying that a kernel is only despatched each time the amount of data fills the thread-group's allocated queue. In which case the kernel runs intermittently, and potentially persistently.
That's an attractive model from the point of view of task-level parallelism, but I don't think that's what D3D11 is actually allowing you to do.
Separately, of course, GT300 might support this kind of generalised persistent-kernel-as-task-parallelism model.
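For what it's worth, the software shape of that model (and of slide 60's "persistent thread blocks reading and writing work queues") is roughly this, with the queue pre-filled before launch for simplicity and all names made up:

    // Each resident block loops, claiming work items off a global queue with
    // an atomic, until the queue is drained. The block never exits while there
    // is work, so it behaves as a persistent task consumer.
    __global__ void persistentWorker(const int *items, unsigned int numItems,
                                     unsigned int *g_head, float *out)
    {
        __shared__ unsigned int s_slot;

        for (;;) {
            // One thread per block claims the next work item from the queue.
            if (threadIdx.x == 0)
                s_slot = atomicAdd(g_head, 1u);
            __syncthreads();

            if (s_slot >= numItems)
                return;                          // queue drained: the block retires

            int item = items[s_slot];
            if (threadIdx.x == 0)
                out[item] = 2.0f * item;         // stand-in for real per-item work
            __syncthreads();                     // finish before s_slot is reused
        }
    }

You'd launch only as many blocks as the chip can keep resident (something like persistentWorker<<<numSMs * blocksPerSM, 256>>>(...)), which is exactly the "worker kernel sitting around holding registers and shared memory" trade-off mentioned below.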
Sounds like what you are saying is more along the lines of associating a priority with kernels rather than preemption (stopping them after they have started execution, paging out all the context and starting a new kernel).
Well, I'm assuming that D3D11 has a "one CS kernel at any time" rule, so there's no choice but to hit memory. RV770 has the interesting property that registers/LDS can retain data between kernels, so maybe D3D11 will work this way where possible. Obviously that only relates to fairly small amounts of data.

The DX11 DispatchIndirect case likely requires that queue output hits main memory, and kernel-to-kernel transitions are sync points (if the kernels aren't in separate independent "streams"). So while it is useful, it isn't the best solution.
Only if the queues are small enough to stay on-die and the GPU supports multiple contexts. Stream Out is the obvious example of off-die buffering currently, and append/consume are the general case in D3D11.

That is exactly what I am saying.
This avoids having a "worker" kernel just sitting around holding registers and shared memory. It avoids the need to buffer queues to global memory between kernel transitions.
Who knows, maybe D3D11.1 does this and that's what ATI will be delivering this year.

Yeah, the model isn't directly supported by DX11, but the DX11 pipeline could easily be built on this model, especially the variable amount of data expansion which can happen because of tessellation.
Playing catchup with Larrabee.

This is why I think something like CUDA has great importance. CUDA supported functionality years ahead of it getting into DX, and even now DX11 CS4 supports only a small fraction of what CUDA does on DX10-level NVidia hardware (starting post-G80, compute 1.1). So NVidia has the option of moving way ahead here in a way that the GP people (like me) can immediately use via CUDA 3.0, or alternatively, if in a console, it would be awesome for graphics beyond what is possible with DX11 alone.
This would take an awful lot of data storage (and movement if the thread contexts aren't all kept local). It also completely messes up the way shared memory is used AFAICS. If you go that far, I think it would be easier to just support per-thread branching.

This would now be done in hardware (append/consume). When a job goes too divergent, you could add to a queue based on the divergence, and the hardware would regroup threads into efficiently filled warps for free.
With GT300 set to possibly be a 3 TFLOPs beast, and with things like ROPS and texture units not likely to decrease, I'm not surprised it also has a 512-bit memory bus interface.
I wasn't expecting Nvidia to go from a 512-bit bus to something narrower, like AMD did when going from R600 in early 2007 to RV770 in mid-2008.
With the ~doubling of bandwidth per bit that GDDR5 gives, you could argue that 64 ROPs are likely. TMUs prolly won't increase anything like as fast, assuming that ALU:TMU increases. And since it's likely that ALU complexity (CUDA overhead) will increase, the die-area ALU:TMU ratio will tend to lower the increase in TMUs too.

With GT300 set to possibly be a 3 TFLOPs beast, and with things like ROPS and texture units not likely to decrease, I'm not surprised it also has a 512-bit memory bus interface.
This would take an awful lot of data storage (and movement if the thread contexts aren't all kept local). It also completely messes up the way shared memory is used AFAICS. If you go that far, I think it would be easier to just support per-thread branching.
That's the only thing a sane mind would do given the huge amplification possible with geometry shaders.

I've just discovered that in ATI the data output by VS into GS actually gets written to memory.
You said it could get efficiently filled warps with a divergent job ... what exactly did you mean with that then?

What I'm suggesting wouldn't be useful for fine granularity dynamic branching...
That's the only thing a sane mind would do given the huge amplification possible with geometry shaders.

No, this is the output from VS (renamed ES in this scenario). In the new document from AMD:

Supports 4 thread types:
- ES (Export Shader) – typically a vertex shader before a geometry shader (it only outputs to memory).
- GS (Geometry shader) – optional.
- VS (Vertex Shader) – null when GS is active, a normal vertex shader when GS is off.
- PS (Pixel Shader).

The ES is so called because it writes to memory. GS-amplified data going via memory isn't stated anywhere that I can discern.