Nvidia GT300 core: Speculation

Before a tape-out there are internal tests. These tests are necessary to check functionality and to optimize. I would call them pre-revisions.

Top-level simulations of all chips usually happen just months after the start of a new project. If it takes two to three years to develop a new GPU, then you could say that something like G80 was already 'running' sometime in early 2004, if not earlier.

By the same standard, it's probably correct to say that GT400 is already 'running'.

The amount of useful information in such a statement? Zero... (But I guess it helps attract even more clueless visitors to your website, so feel free to go ahead and post it as breaking news.)
 
If I understand correctly, a tape-out is when you get the first full wafer of your chips, while from the CyberShuttle you only get a few chips, because chips from several companies share one wafer. So either you have a CyberShuttle prototype, or a taped-out prototype, or you have nothing. What does nVidia have now?

What you're talking about is what is called 'first silicon', but there's no distinction between shuttle silicon or a dedicated run. Shuttles are rarely used to tape out fully functional chips (I haven't done one in at least 10 years) because the number of chips you get back is often not enough to do large-scale qualification. Shuttle runs are still done for test chips, usually only with some analog blocks and some process-qualification logic, but even then they are quite rare for large companies. Shuttles are (way) more popular with universities and startups or companies that are short on cash, where time-to-market isn't as critical (e.g. a company that makes silicon for, say, large-scale routers is not as schedule-sensitive as one that produces consumer electronics.) Also, shuttles get the lowest priority in the fab, so they're easily a month or so later than a super-hot lot.

Tape-out is not when you get the first full wafer. (That's how The Inq has used the term in the past, but it's incorrect.) Tape-out quite literally refers to the old practice of sending a tape with GDS2 data (basically a huge file with just rectangles in it) to the fab to start mask making. These days, the data is transmitted over the Internet, of course, but FTP-out just doesn't have the same ring to it, so the term tape-out has stuck...
 
Just a thought here: some time ago I read about nVidia using a cluster computer to simulate their GPUs without having to manufacture actual samples. So why would they need a physical G80 sample in Q1 2006, or a GT300 sample now?
These hardware emulators are still an order of magnitude slower than the actual chip. They're helpful for the driver team and to validate the hardware more quickly, but hardware emulators are not a replacement for the real thing.
 
Tape-out is not when you get the first full wafer. (That's how The Inq has used the term in the past, but it's incorrect.) Tape-out quite literally refers to the old practice of sending a tape with GDS2 data (basically a huge file with just rectangles in it) to the fab to start mask making. These days, the data is transmitted over the Internet, of course, but FTP-out just doesn't have the same ring to it, so the term tape-out has stuck...
So, when I read somewhere that a new chip has been taped out, does it mean that the foundry now has all the data it needs to start manufacturing and actual samples are weeks away? Somehow I think when people say "tape-out", they usually mean the incorrect thing (first silicon).
 
So, when I read somewhere that a new chip has been taped out, does it mean that the foundry now has all the data it needs to start manufacturing and actual samples are weeks away?
Not necessarily. Tape-outs can be staggered. As a single element, the base layer (silicon) takes the most time to produce, so there can be some delay between the silicon tape-out and the metal tape-out, and even the metal tape-outs may be staggered if there are lots of layers.
 
So, when I read somewhere that a new chip has been taped out, does it mean that the foundry now has all the data it needs to start manufacturing and actual samples are weeks away?

In practice: yes.

Dave is correct that tape-outs are sometimes staggered, but companies will almost always try to keep the delay between the base tape-out and the metal tape-out short enough that the lot doesn't stall in the fab.
 
...

By tagging each work-item with a synch-point, the windower can identify valid combinations of data and instructions to be issued. The synch-points are determined either by explicit data-share synch statements (barrier) in the instruction stream, or by control flow.
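For reference, this is what such a barrier synch-point looks like from the CUDA side (a minimal sketch; the reduction is only there as a familiar use of the barrier, and it assumes a block size of 256):

Code:
// No work-item in the work-group passes __syncthreads() until all of them
// have reached it, so at that point the scheduler knows they are all at the
// same place in the instruction stream. Block size assumed to be 256.
__global__ void block_sum(const float *in, float *out)
{
    __shared__ float s[256];
    int tid = threadIdx.x;
    s[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                      // barrier = synch-point

    for (int offset = blockDim.x / 2; offset > 0; offset >>= 1) {
        if (tid < offset)
            s[tid] += s[tid + offset];
        __syncthreads();                  // re-synchronise every iteration
    }
    if (tid == 0)
        out[blockIdx.x] = s[0];
}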

With barriers windowed, the SIMD can execute an arbitrarily constructed warp consisting of a stream of instructions that are valid and available within the window.

So in a work-group of 1024 work-items, if only 10 of them want to execute a loop, then the SIMD wastage is restricted to loop count * (clock-length of a warp (4) * SIMD-width (8) - 10), i.e. loop-count*22. It doesn't matter how randomly these 10 work-items are spread throughout the work-group of 1024. The windower will collect them all together within a single warp, for each instruction and every iteration of the loop.
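A contrived CUDA kernel showing that scenario (mask, data and the loop body are all just placeholders): with conventional per-warp issue, every warp containing one of the ~10 active work-items replays the whole loop for all 32 lanes, whereas with the regrouping described above they'd be packed into one warp, wasting 32 - 10 = 22 lane-slots per iteration.

Code:
// Only the work-items with mask[tid] != 0 (say, 10 of 1024) take the loop.
__global__ void sparse_loop(const int *mask, float *data, int iters)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float v = data[tid];
    if (mask[tid]) {
        for (int i = 0; i < iters; ++i)
            v = v * 1.0001f + 0.5f;   // stand-in for real per-iteration work
    }
    data[tid] = v;
}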

I'm figuring that the "expensive", fine-grained scheduling hardware that NVidia's put together in G80 et al, could turn into "very cheap" scheduling hardware in GT300, once it starts to execute any code with intricate control flow (including barriers and atomic accesses to memory).

So, it's sort of MWMD - multiple warp multiple data :D

Jawed

Preamble, I'm just a software guy, so I'm making lots of assumptions here...

Clearly global memory sans cache will have limits based on the GDDR* interface. So a minimum of bus width transfers with expensive accesses when spanning different row addresses. Given this, without an intermediate cache, is there really any more that NVidia can do beyond GT2xx in terms of global memory access performance? GT2xx already handles sorting work-item accesses for a warp into a minimum number of bus transactions.
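In CUDA terms the two cases look something like this (buffer names are arbitrary; roughly speaking, on GT2xx the strided version decomposes into as many memory transactions as there are distinct segments touched by the warp):

Code:
// Coalesced: consecutive work-items in a warp read consecutive 32-bit words,
// so the warp is serviced with a handful of wide bus transactions.
__global__ void copy_coalesced(const float *in, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[tid];
}

// Strided: work-items in the same warp hit addresses far apart, so the warp
// falls apart into many separate transactions.
__global__ void copy_strided(const float *in, float *out, int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[tid * stride];
}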

Even if a cache is added, wouldn't you effectively have similar limitations on access granularity for performance reasons? Memory granularity would be something larger than 32 bits, and likely upwards of 256 bits regardless of cache, right? Or am I wrong here in this assumption? Could a cache be built with enough access ports for efficient independent 32-bit accesses from a vector of work-items? Perhaps with high latency, servicing requests out of order to reduce bank conflicts.

What I'm getting at is that fully scalar MIMD might not make sense, in that the minimum bus transaction for independent work-items would be something like a float[8]. In this case, work-item regrouping doesn't make sense from the perspective of memory accesses. With the current warp-based design, one is sure after a divergent branch that computation returns to an efficient grouping.

Then what about register bank conflicts? If a work-item is started in vector lane 0, then that work-item's registers are likely also tied to that lane (or bank). So chances are you couldn't easily move that work-item into vector lane 1...
 
Preamble, I'm just a software guy, so I'm making lots of assumptions here...
As far as GPUs are concerned I'm not even a software guy, just a wooden chair pundit :smile:

Clearly global memory sans cache will have limits based on the GDDR* interface. So a minimum of bus width transfers with expensive accesses when spanning different row addresses. Given this, without an intermediate cache, is there really any more that NVidia can do beyond GT2xx in terms of global memory access performance? GT2xx already handles sorting work-item accesses for a warp into a minimum number of bus transactions.
In terms of the memory interface, per se, I'm not proposing a change. Instead I'm proposing that the coalesced data accesses against memory are mirrored by allowing the same (or similar) re-sorting to be performed on work items, regardless of "warp", for the purposes of execution. By dynamically constructing warps that end up matching the accesses against memory, the GPU needs fewer work items in flight to alleviate the latency increases caused by incoherent memory accesses - which would allow the GPU to allocate more registers per work item.

With coalesced memory fetches, say, but un-coalesced use of that data, the data has to wait around longer, on die, before it is used. This then reduces the overall effective capacity for data fetched from memory, in comparison with a GPU that reduces the life-time of that data by coalescing the use of data.

Even if a cache is added, wouldn't you effectively have similar limitations on access granularity for performance reasons? Memory granularity would be something larger than 32 bits, and likely upwards of 256 bits regardless of cache, right? Or am I wrong here in this assumption? Could a cache be built with enough access ports for efficient independent 32-bit accesses from a vector of work-items? Perhaps with high latency, servicing requests out of order to reduce bank conflicts.
In G80/GT200 the windower currently has a private block of memory out of which it feeds the ALUs. It seems this doesn't have the capacity to prevent waterfalling, e.g. on random fetches from constants/registers. I'm proposing an extension to this: first so that it can mitigate waterfalls, and secondly (due to increased capacity) so that it can perform inter-work-group coalescing of operands, regardless of where the operands are sourced: register, constant or memory.
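For the constant case, "waterfalling" is easy to provoke from CUDA (table and idx are made up): a uniform index is a single broadcast, while a divergent index serialises into one read per distinct address in the warp.

Code:
__constant__ float table[256];

// If idx[tid] is the same across the warp: one broadcast read.
// If it diverges: up to 32 serialised constant-cache reads for one instruction.
__global__ void const_waterfall(const int *idx, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = table[idx[tid] & 255];
}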

What I'm getting at is that fully scalar MIMD might not make sense, in that the minimum bus transaction for independent work-items would be something like a float[8]. In this case, work-item regrouping doesn't make sense from the perspective of memory accesses. With the current warp-based design, one is sure after a divergent branch that computation returns to an efficient grouping.
But what about nested branching and loops, with the worst case being a nesting of loops?

Then what about register bank conflicts? If a work-item is started in vector lane 0, then that work-item's registers are likely also tied to that lane (or bank). So chances are you couldn't easily move that work-item into vector lane 1...
I'm looking to the windower's operand memory as a method to insulate the ALUs from register bank conflicts (waterfalling).

Currently the windower does 16-wide fetches from registers, though it only feeds 8-wide to the ALUs. This allows it "time" to re-sort operands for a warp, thus hiding the banking latency when fetching r0 and r13, say, for an instruction. It also covers for the greed of the ALUs, since they'll happily suck in 4 operands per clock (MAD+MI : 3 operands MAD, 1 operand transcendental/interpolator).

It seems that two 16-wide fetches from registers are actually required to keep up with the MAD+MI units, since they can consume 32 operands per clock (8 lanes for each, 3 operands MAD, 1 operand MI). I presume NVidia abstracts this - e.g. the hardware is actually doing one 32-wide fetch but it's doing it for a pair of warps. It presents this to the programmer as a 16-wide fetch per warp, but of course a pair of warps is time-sliced "per instruction".
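Spelling that arithmetic out (plain host-side code, just a back-of-envelope check of the numbers above, not a claim about the real register-file design):

Code:
#include <stdio.h>

int main(void)
{
    const int lanes        = 8;  // SIMD width of the multiprocessor
    const int mad_operands = 3;  // a*b+c
    const int mi_operands  = 1;  // transcendental / interpolator input

    // 8*(3+1) = 32 operands per clock; one 16-wide register fetch supplies 16,
    // so two of them (or one 32-wide fetch shared by a pair of warps) are
    // needed to keep both units fed.
    printf("operands per clock: %d\n", lanes * (mad_operands + mi_operands));
    return 0;
}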

To give the windower increased capacity, as I'm suggesting, would necessarily increase the width of fetches or increase the number of parallel fetches (ports). Since NVidia uses the windower and "wider-than-the-ALU" fetches to "simulate" multi-porting, I figure they'd elect to widen the fetches.

It may be that a 32-wide fetch is as wide as practicable. In which case, ahem, the ALU bandwidth would have to be reduced - half-clocked or only 4 lanes wide. Any which way, clearly the cost of the windower/instruction-issuer increases in my proposal. No idea if this cost is actually worth paying, though - need a simulator for that :LOL:

Jawed
 
Prototype algorithm or prototype hardware?

Direct3D 11 Computer Shader More Generality for Advanced Techniques.pptx said:
Code:
Complex 1024x1024 2D FFT:
Software          42ms      6 GFlops
Direct3D9         15ms     17 GFlops    3x
Prototype DX11     6ms     42 GFlops    6x
Latest chips       3ms    100 GFlops
 
Prototype algorithm or prototype hardware?
That's the problem, can't tell :cry: The first line, being "Software" might hint that the other 3 lines are "hardware", but that's as close as we'll get.

But, the last line says "Shared register space and random access writes enable ~2x speedups". This refers to the Prototype DX11 line, it seems (3x -> 6x). As far as I can tell this is the nub, these are the performance improvements that derive solely from D3D11-specific features. I think it's possible to do both these things with HD4870 (LDS+GDS for shared register space and "memexport"), so theoretically they could have come up with this performance comparison simply with DX9 code and IL code (for D3D11-CS) running on HD4870.

Jawed
 
It seems that two 16-wide fetches from registers are actually required to keep up with the MAD+MI units, since they can consume 32 operands per clock (8 lanes for each, 3 operands MAD, 1 operand MI). I presume NVidia abstracts this - e.g. the hardware is actually doing one 32-wide fetch but it's doing it for a pair of warps. It presents this to the programmer as a 16-wide fetch per warp, but of course a pair of warps is time-sliced "per instruction".

I think that is right. Vasily Volkov et al., in their BLAS paper, did extensive (I mean it) benchmarking and found that 64 threads (per block I think, but it could have been per SM as well; I'll need to have another look at it) are needed at minimum to keep the compute pipeline saturated.
 
In terms of the memory interface, per se, I'm not proposing a change. Instead I'm proposing that the coalesced data accesses against memory are mirrored by allowing the same (or similar) re-sorting to be performed on work items, regardless of "warp", for the purposes of execution. By dynamically constructing warps that end up matching the accesses against memory, the GPU needs fewer work items in flight to alleviate the latency increases caused by incoherent memory accesses - which would allow the GPU to allocate more registers per work item.

So if GT3xx goes MIMD in this way, then the advantages to a CUDA or OpenCL programmer would be a reduction in cost of constant or shared memory bank conflicts (waterfalling), as well as higher ALU utilization per register for both GPGPU and rendering.

My interpretation of the 1024x1024 2D FFT numbers was:

Direct3D9 - 15ms - 17 GFlops - 3x : GPGPU using texture fetch only

Prototype DX11 - 6ms - 42 GFlops - 6x : compute shader, using shared memory

Latest chips - 3ms - 100 GFlops : due to GT2xx global memory access improvements
 
If a work group is 32x32 work items, in general it doesn't matter if (2,2)(2,4)(4,8)(4,12)(16,8)(16,9)(16,10)(16,11) find themselves in the SIMD at the same time on a series of clocks (instructions).
To the software entities, it might not.
To the hardware and any caches, buffers, IO, and anything else, it probably would.
Unless all items are in on-chip storage, the position could be orders of magnitude more important.
Even if they are all on-chip, the level of support needed for arbitrary access would require hardware orders of magnitude more complex.
Is it a win to improve on SIMD hardware showing average utilization of maybe half to two thirds peak with hardware that takes up 10 times as much power and heat?

Is this a hybrid software/hardware model?

With barriers windowed, the SIMD can execute an arbitrarily constructed warp consisting of a stream of instructions that are valid and available within the window.
1024*(Arbitrary + silicon) = me being unsure about this

So in a work-group of 1024 work-items, if only 10 of them want to execute a loop, then the SIMD wastage is restricted to loop count * (clock-length of a warp (4) * SIMD-width (8) - 10), i.e. loop-count*22. It doesn't matter how randomly these 10 work-items are spread throughout the work-group of 1024. The windower will collect them all together within a single warp, for each instruction and every iteration of the loop.
Given the numbers involved, this can amount to a pretty significant amount of data going back and forth across the chip.
If only referencing instructions, it's a 10-bit identifier per unit and some op-code of maybe a byte or more, depending on implementation.
That's about 1 KiB of generated data per windower cycle, prior to any analysis of things such as looping, and prior to sending the linked registers and data to the SIMDs.
An 8-bit register identifier for a MADD will mean 24*1024 bits for passing register addresses to SIMDs.
32-bit operands would be 96*1024 bits if passing data to the ALUs.
Since this is a sorting scheme for the entire group, this is a chip-wide affair.

For arbitrary assignment, the SIMDs on such a device would require a way to interface 1024 regions of register file with maybe 30 or more SIMDs.

A windower that performs such an analysis on a whole group like this is a potential serialization bottleneck.
A fully parallel implementation in hardware would involve multiple distributed copies of the job table, potentially per SIMD.
A serial one would constrain bit bloat, but then it's almost like a triangle-setup limitation.

So, it's sort of MWMD - multiple warp multiple data :D
Warps nicely enforce locality on things that would otherwise cross the entire chip. Are the wins worth it?
 
So if GT3xx goes MIMD in this way, then the advantages to a CUDA or OpenCL programmer would be a reduction in cost of constant or shared memory bank conflicts (waterfalling), as well as higher ALU utilization per register for both GPGPU and rendering.
Yeah.

My interpretation of the 1024x1024 2D FFT numbers was:

Direct3D9 - 15ms - 17 GFlops - 3x : GPGPU using texture fetch only

Prototype DX11 - 6ms - 42 GFlops - 6x : compute shader, using shared memory

Latest chips - 3ms - 100 GFlops : due to GT2xx global memory access improvements
You're reading "~2x speedups" as referring to each of "shared register space" and "random access writes" enabling a ~2x speed-up, i.e. ~4x cumulatively?

For what it's worth I certainly wouldn't rule out G80/G92 as the vehicle for prototyping as described here, as shared memory (PDC) is effectively shared register space.

Jawed
 
You're reading "~2x speedups" as referring to each of "shared register space" and "random access writes" enabling a ~2x speed-up, i.e. ~4x cumulatively?

For what it's worth I certainly wouldn't rule out G80/G92 as the vehicle for prototyping as described here, as shared memory (PDC) is effectively shared register space.

I'm saying that using shared registers (or CUDA shared memory) and random-access reads/writes to global memory on G80/G92, versus using texture fetch only, might be the first 2x speed-up (i.e. the advantages of compute shader features). You still have to deal with bank conflicts in shared register space.

I'm guessing the second 2x speed-up is from the increased global read/write efficiency of the GT2xx hardware (i.e. current hardware).

BTW, I wonder if there is anything to gather from the fact that GT2xx gained register space but not shared register/memory space compared to G80/G92, given that shared-register bank conflicts cause efficiency problems when mapping GPGPU algorithms to the GPU.
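As an aside, the classic way those shared-memory bank conflicts show up (and get dodged) in CUDA is the padded transpose tile. A sketch, assuming a square matrix whose side is a multiple of 16, 16x16 thread blocks, and made-up buffer names:

Code:
// Shared memory on G80/GT200 has 16 banks of 32-bit words. A 16x16 tile read
// column-wise puts all 16 work-items of a half-warp into the same bank; the
// extra padding column skews each row into a different bank, so the
// column-wise reads below are conflict-free.
__global__ void transpose_tile(const float *in, float *out, int width)
{
    __shared__ float tile[16][17];   // 17, not 16: the padding column

    int x = blockIdx.x * 16 + threadIdx.x;
    int y = blockIdx.y * 16 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    int tx = blockIdx.y * 16 + threadIdx.x;
    int ty = blockIdx.x * 16 + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];
}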
 
To the software entities, it might not.
To the hardware and any caches, buffers, IO, and anything else, it probably would.
Unless all items are in on-chip storage, the position could be orders of magnitude more important.
Of course they're on-chip - otherwise the windower can't do anything with them - it only issues operands to the ALUs when they become available. Otherwise it stalls work items until all operands are ready. It's simply being blind to the "warp" that originally defines a work item.

When a gather instruction is issued, that raises a dependency in the windower. If the gather instruction requires incoherent fetches, the memory controller will re-order for best burst efficiency against DDR. When the results return, they'll come in over some variable period.

If 16 warps issue a single gather instruction there might be a span of 1000 ALU cycles between the return of the first and the last results. If a single issuing-warp's fetches span 1000 cycles, in GT200 that warp is forced to wait 1000 cycles.

If, in this scenario, the first GT200 warp completes after 750 cycles, then all the data fetched in the first 749 cycles sits around on die waiting to be used.

I'm simply proposing that GT300 can take a 32-wide cut from the fetched data, once a 32-wide cut has arrived.

Even if they are all on-chip, the level of support needed for arbitrary access would require hardware orders of magnitude more complex.
Is it a win to improve on SIMD hardware showing average utilization of maybe half to two thirds peak with hardware that takes up 10 times as much power and heat?
The peak you're referring to is on code that is embarrassingly simple: no nesting, rare use of waterfalling, the odd gather (once in an entire application). These features generally aren't used because they're performance black holes.

Before HD4870 turned up it seemed no-one had any idea how expensive NVidia's scheduling was - G80/G92 ALU density seemed "fine" (well, I had different ideas about that, but still...). In truth we still don't know because we can't separate it out.

Clearly what I'm proposing is expensive. But the windower is already scoreboarding per-warp, per-instruction and per-operand.

There's a scoreboard for instruction dependencies:

Tracking register usage during multithreaded processing using a scoreboard having separate memory regions and storing sequential register size indicators

and a scoreboard for operand readiness:

OPERAND COLLECTOR ARCHITECTURE

and stack-based tracking of per work-item predicates:

System and method for managing divergent threads in a SIMD architecture

and prioritisation of instruction issue:

Prioritized issuing of operation dedicated execution unit tagged instructions from multiple different type threads performing different set of operations

I'm suggesting a scoreboard for a single "barrier" per work item (thread). It's a bit field, 1 meaning pending barrier, 0 all clear - this amounts to 128 bytes per multiprocessor. If the windower knows that any barriers are outstanding it can scan across work items for warp-wide sets (32-wide, 4 phases of 8). If some work items happen to constitute default warp allocations, then cool. Otherwise, coalesce work items to make temp-warps.

These temp-warps may have some operands ready and waiting in the operand collector, for at least one instruction (e.g. the instruction that raised the barrier). Once the supply of ready operands is exhausted for the temp warp, its work items will have to be set in the barrier scoreboard. Clearly a temp-warp, being fragmented, is going to generate "waterfalled" reads against the register file or other places (constants, shared memory, memory). In theory these new waterfalls will add stress - but they might also occupy units that would otherwise be idle.
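To put a toy model on it (plain host-side code, purely to illustrate the bookkeeping, not any real hardware): 1024 work-items at one bit each is the 128 bytes mentioned above, and forming a temp-warp is just a scan for up to 32 clear bits.

Code:
#include <stdint.h>
#include <stdio.h>

#define WORK_ITEMS 1024
#define WARP_SIZE  32

static uint32_t barrier_pending[WORK_ITEMS / 32];   // 1024 bits = 128 bytes

// Gather up to 32 work-items whose barrier bit is clear into a temp-warp.
static int build_temp_warp(int lanes[WARP_SIZE])
{
    int n = 0;
    for (int i = 0; i < WORK_ITEMS && n < WARP_SIZE; ++i)
        if (!((barrier_pending[i / 32] >> (i % 32)) & 1u))
            lanes[n++] = i;
    return n;   // < 32 means idle lanes for this issue
}

int main(void)
{
    // Example: mark everything pending except 10 scattered work-items.
    for (int i = 0; i < WORK_ITEMS / 32; ++i)
        barrier_pending[i] = 0xFFFFFFFFu;
    const int ready[10] = { 3, 97, 130, 255, 256, 511, 700, 701, 900, 1023 };
    for (int i = 0; i < 10; ++i)
        barrier_pending[ready[i] / 32] &= ~(1u << (ready[i] % 32));

    int lanes[WARP_SIZE];
    printf("temp-warp occupancy: %d of %d\n", build_temp_warp(lanes), WARP_SIZE);
    return 0;
}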

Separately, there's clearly a cost involved in implementing a crossbar from the windower out to the ALU lanes, as there's no associativity between a work-item and the ALU lane it'll occupy.

Is this a hybrid software/hardware model?
I'm thinking purely hardware-based - as it currently appears to be in G80 etc.

Given the numbers involved, this can amount to a pretty significant amount of data going back and forth across the chip.
If only referencing instructions, it's a 10-bit identifier per unit and some op-code of maybe a byte or more, depending on implementation.
I'm not sure why you include the whole chip.

That's about 1 KiB of generated data per windower cycle, prior to any analysis of things such as looping, and prior to sending the linked registers and data to the SIMDs.
An 8-bit register identifier for a MADD will mean 24*1024 bits for passing register addresses to SIMDs.
32-bit operands would be 96*1024 bits if passing data to the ALUs.
The scoreboard doesn't need to score barriers per operand - merely per work-item.

Since this is a sorting scheme for the entire group, this is a chip-wide affair.
A group is localised to a multiprocessor. Clearly gather/scatter and constant-cache fetches go outside of the multiprocessor.

Jawed
 
Clearly what I'm proposing is expensive. But the windower is already scoreboarding per-warp, per-instruction and per-operand.

Thanks for the links Jawed.

I'm not sure that a 750 cycle latency for a single ready warp out of 16 is a practical scenario though. If that's going to be a common occurrence then SIMD divergence is the least of your worries no? That would imply a very large variation in operand fetch latency within a single warp - how would that possibly happen?

As complex as the scoreboarding and operand fetch are already, they're still benefiting a LOT from the coherency of the warp grouping. Once you break the warp construct I would imagine you'll eventually end up with a mess of divergent threads and then you're going to do work now to pick and choose 16 or 32 threads at a time that happen to be at the same instruction just so you can feed the SIMD? Eventually operand fetches will be all over the place and bank conflicts and cache misses will be the order of the day.

Wouldn't it be much cheaper and simpler to go true MIMD? There's already an operand collector per execution unit and per operand anyway, so all it would take is a bigger scoreboard (per thread instead of per warp) and individual instruction-issue logic.

Maybe having a vast pool of ready threads available and constructing groups of work-items for issue each clock is doable but it sounds like a bunch of voodoo to me :)
 
I'm not sure that a 750 cycle latency for a single ready warp out of 16 is a practical scenario though. If that's going to be a common occurrence then SIMD divergence is the least of your worries no? That would imply a very large variation in operand fetch latency within a single warp - how would that possibly happen?
That scenario is gather from video memory. If the latency for a single random read from memory is 200 cycles, then a warp doing a gather that requires 4 or more reads from memory just to execute one instruction is going to introduce a lot of latency.

NVidia's designed the memory controller to aggregate these "random" reads and optimise the ordering. That cuts some of the pain.

A major feature of D3D11 is reading/writing irregular data structures in memory.

As complex as the scoreboarding and operand fetch are already, they're still benefiting a LOT from the coherency of the warp grouping. Once you break the warp construct I would imagine you'll eventually end up with a mess of divergent threads and then you're going to do work now to pick and choose 16 or 32 threads at a time that happen to be at the same instruction just so you can feed the SIMD? Eventually operand fetches will be all over the place and bank conflicts and cache misses will be the order of the day.
What I'm proposing is similar (in terms of tracking and decision making) to the handling of predication (3rd patent document linked). The windower is checking each warp to see if it "goes all one way" (either all 32 work items in a warp execute a branch or none do). It looks at the predicate bits for the 32 work items - if they're all the same then it knows there's no divergence. Depending on the condition being evaluated, this either means a series of instructions are executed, or they're skipped. Skipping is the key thing as it prevents the wastage of ALU time when predication says that no results will be generated by the instructions.
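(For what it's worth, the "goes all one way" test has a direct software analogue in CUDA's warp-vote intrinsics - written here with the newer __all_sync spelling - though the hardware performs the equivalent check on the predicate bits by itself; cond and data are placeholders.)

Code:
__global__ void all_one_way(const int *cond, float *data)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int take = cond[tid];

    if (__all_sync(0xffffffffu, take)) {
        data[tid] *= 2.0f;               // whole warp takes the branch: no masking
    } else if (__all_sync(0xffffffffu, !take)) {
        /* whole warp skips: the body costs nothing */
    } else {
        if (take)                        // divergent: lanes run under predication
            data[tid] *= 2.0f;
    }
}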

So this new scoreboard is like a barrier-predicate for all work items in a work group. In the same way that there's a predicate stack for nested control flow for each warp, which is evaluated dynamically to determine "all one way" scenarios, there'd be a barrier-predicate that's evaluated dynamically to identify waterfalling/divergence. A separate prioritiser takes the identity of "ready" work items and works out an order for issuing them.

The nature of operand fetching (whether from registers, constant cache, shared memory or video memory) is "self-righting" - these fetches are all coherent. So as long as there's enough windower memory for these in-flight operands, a mess of temp-warps and their resulting random fetches will quickly settle back to coherent fetches once the instruction(s) that caused the incoherent fetches have passed.

Also, it's worth bearing in mind that the idea is to get stuff working that would otherwise be idling - either ALUs are stalled waiting for operands or lanes are empty. When the ALUs are stalled or lane operands and resultants are masked-out (don't read and then write to register file) register file bandwidth is going spare.

Wouldn't it be much cheaper and simpler to go true MIMD? There's already an operand collector per execution unit and per operand anyway so all it would take is a bigger scorecard (per thread instead of per warp) and individual instruction issue logic.
:oops: woah, that would be insanely expensive.

Maybe having a vast pool of ready threads available and constructing groups of work-items for issue each clock is doable but it sounds like a bunch of voodoo to me :)
Wilson Fung interned at NVidia this past summer, and is co-author of:

http://www.microarch.org/micro40/talks/7-3.ppt

:D

This is his thesis:

https://circle.ubc.ca/bitstream/2429/2268/1/ubc_2008_fall_fung_wilson_wai_lun.pdf

which looks like quite a manageable read and covers a lot of relevant ground, including the simulation parameters (based on NVidia's architecture) that went into making his own cycle-accurate GPU simulator, GPGPU-Sim.

Jawed
 
Wilson Fung interned at NVidia this past summer, and is co-author of:

http://www.microarch.org/micro40/talks/7-3.ppt

Haha, touche. Were you just keeping that in your back pocket to whip out at the right moment? :D

In his approach he's still keeping warps together though. So it's not so much about building warps on the fly from a pool of threads, it's more like building an issue warp from a pool of ready warps. Maybe I missed it but I didn't get the impression that there was any per thread scoreboarding (in terms of operand availability) going on. It's still the same per-warp scoreboard and it's the predication stack logic that's been extended.
 