Nvidia GT300 core: Speculation

Haha, touché. Were you just keeping that in your back pocket to whip out at the right moment? :D
No, I stumbled upon it while looking for a link for Kapasi's Conditional Routing.

In his approach he's still keeping warps together though. So it's not so much about building warps on the fly from a pool of threads, it's more like building an issue warp from a pool of ready warps. Maybe I missed it but I didn't get the impression that there was any per thread scoreboarding (in terms of operand availability) going on. It's still the same per-warp scoreboard and it's the predication stack logic that's been extended.
:oops: I haven't actually read this stuff closely to find out what he's doing.

I linked his stuff not to support my proposal, per se, merely to indicate that there are solutions that look worth doing.

The final page of the PPT refers to shared memory access/bank-conflict, indicating that further fleshing-out is required. This presentation pre-dates his thesis's publication though, so until I've spent more time on it...

Overall he seems to be only attacking control flow divergence - I think that all the hiccups in ALU-lane utilisation and operand readiness should be tackled. But, as I say, I need to read his stuff more closely.

Jawed
 
If in this scenario the first GT200 warp is completed after 750 cycles, then all the data that's fetched in the first 749 cycles sits around on-die waiting to be used.

Then the problem here is on-chip storage pressure, where long-lived data leads to starvation in terms of buffers or registers, not the idling SIMDs themselves.

Buffers probably won't allow incomplete warps to take an entry, and registers may be intractable for most of the duration.

Just how complex a deallocation procedure is present for the partially completed warps?
We could clear holes of data out of the source register sections devoted to a work group, leading to a fragmented space that might not be desirable until the warp is nearly finished anyway.

Is the scheme able to make multiple windows, or does it stall if it runs into another gather op?
The ability to continue deep is more desirable for getting work done, but the amount of persistent and dynamically generated meta data can make the cost insane.

I'm suggesting a scoreboard for a single "barrier" per work item (thread). It's a bit field, 1 meaning pending barrier, 0 all clear - this amounts to 128 bytes per multiprocessor.

If the windower knows that any barriers are outstanding it can scan across work items for warp-wide sets (32-wide, 4 phases of 8). If some work items happen to constitute default warp allocations, then cool. Otherwise, coalesce work items to make temp-warps.
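In rough C, the bookkeeping I'm picturing is just this - a minimal sketch, with the sizes and names being my own illustration:

Code:
#include <stdint.h>

/* One barrier bit per work item: 1024 work items per multiprocessor
   = 1024 bits = 128 bytes. 1 = barrier pending, 0 = all clear. */
uint32_t barrier_pending[32];     /* 32 x 32-bit words */

/* A default warp allocation w covers work items w*32 .. w*32+31,
   i.e. exactly one word here - so "cool" is a single compare. */
int default_warp_ready(int w)
{
    return barrier_pending[w] == 0;
}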

What would "then cool" entail? We've already built up our scheduler to handle uncool situations, so how much would we save if we still kept standard scheduling logic around?
What probability is there that one work item in a set of 32 is not ready, thus necessitating the more involved method?

How frequently are we going to scan this 1K-entry table for resolved barriers?
At a size of 1024, there is going to be a fair amount of time during which we would expect at least one entry to change from cycle to cycle.

The method for coalescing is a degenerate form of a sort, where we try to move the 0 flagged units to a space that becomes the issue list for the temp warps to draw from.
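In C it might look like this - again only a sketch, assuming the 128-byte bit field above and one temp-warp built per pass:

Code:
#include <stdint.h>

/* Degenerate sort: gather the indices of 0-flagged (ready) work
   items into an issue list; order within the list doesn't matter.
   Returns the count - fewer than 32 means a partial temp-warp. */
int build_temp_warp(const uint32_t pending[32], uint16_t issue[32])
{
    int n = 0;
    for (int i = 0; i < 1024 && n < 32; i++)
        if (!(pending[i >> 5] & (1u << (i & 31))))
            issue[n++] = (uint16_t)i;   /* 10-bit work unit number */
    return n;
}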

Does this method permit going down any deeper than one window? Is encountering another barrier and setting the work-unit's flag back to 1 ambiguous?

We will be rescanning every entry of this 1K-entry table every cycle, and doing the degenerate sorting up to 32 times in the lifespan of a barrier.
The worst-case example would be 32 ready units spread out evenly throughout all 1024 entries, and this happening 32 times.
(This means the work group has been beating around the issue logic for 1024 cycles, hopefully with no other useful work needing to be done on other workgroups.)
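(Spelling out the arithmetic behind that: 1024 work items / 32 per temp-warp = 32 coalescing passes, and if the scanner chews through 32 entries per cycle - my assumption, not a given - each full-table pass costs 32 cycles, so 32 passes come to 32 * 32 = 1024 cycles.)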

This is the scheduling cost, and it would be prior to the other costs that I suspect might pop up from breaking the association between the SIMD lane and work units.

Separately, there's clearly a cost involved in implementing a crossbar from the windower out to the ALU lanes, as there's no associativity between a work-item and the ALU lane it'll occupy.
That would be one of the large costs I am concerned about.
The windower hardware is trying to map from a set of 1024 to a warp of 32.
The lack of associativity also means register identifier translation to hardware location is done without the aid of reference values that can be derived from hardware lane position and the scheduler's cycle counter, since neither value can be assumed consistent across the units of the SIMD at the point of issue.
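To make the contrast concrete, here's a toy formulation of the two translations - the register file layout is my own assumption, not anything from real hardware:

Code:
enum { SIMD_WIDTH = 8, WARP_WIDTH = 32, GROUP_SIZE = 1024 };

/* With the association kept: the operand address is derivable from
   values every unit already has locally (toy layout: register r of
   the warp's thread t lives at warp_base + r*WARP_WIDTH + t). */
unsigned addr_associative(unsigned warp_base, unsigned reg_id,
                          unsigned cycle, unsigned lane_id)
{
    return warp_base + reg_id * WARP_WIDTH + cycle * SIMD_WIDTH + lane_id;
}

/* Without it: the 10-bit work unit number has to be looked up in
   the windower's table and shipped alongside the instruction. */
unsigned addr_coalesced(unsigned group_base, unsigned reg_id,
                        unsigned work_unit_id)
{
    return group_base + reg_id * GROUP_SIZE + work_unit_id;
}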

I'm not sure why you include the whole chip.
I misinterpreted the original description so that it appeared to relax the SIMD relationship to the point it sounded like any SIMD, regardless of multiprocessor, would do.

The scoreboard doesn't need to score barriers per operand - merely per work-item.

The math was concerning the amount of data that would have to be generated by the scheduler and sent out for instruction issue, depending on what level of the hierarchy is made aware of the coalesced warps.

With associativity kept, the SIMD scheme can exploit the placement of the lanes. It might be as little as the plain register identifiers and a 2-bit cycle counter for a 4-cycle issue, with this signal shared by the whole SIMD and everything past that point able to derive the needed value from SIMD lane and cycle counter.
For a MADD, that's 26 bits of operand identifiers.

edit: add 8 for the destination, was only thinking reads
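(That's 3 source identifiers * 8 bits + the 2-bit counter = 26 bits, or 34 with the destination - and it's broadcast once for the whole SIMD.)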

Breaking the association, the next step up is sending the work unit identifier along with the register identifier. This would be calculated, in the worst case, SIMD-width times.
That's 8*(10+24) bits for a MADD on a warp that has been coalesced.

edit: 8*(10+32) with destination computed
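(i.e. 8 * (10 + 24) = 272 bits per issue, or 8 * (10 + 32) = 336 with the destination - roughly ten times the associative case's 34.)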


The register data total was something I was pondering if we just cut the post-issue logic down to the point that the units just got shoveled data directly when an instruction issued.
 
That would be one of the large costs I am concerned about. The windower hardware is trying to map from a set of 1024 to a warp of 32.
The lack of associativity also means register identifier translation to hardware location is done without the aid of reference values that can be derived from hardware lane position and the scheduler's cycle counter, since neither value can be assumed consistent across the units of the SIMD at the point of issue.

The thesis that Jawed linked proposes a solution for this. There will be independent decoders doing the translation for each SIMD lane, since each lane may be occupied by a thread from a different warp. The offset within the register file is then calculated based on the thread ID or some similar logic.

And the windower logic isn't completely search based. The guy's approach was based on indexing into the pool based on PC. So incoming threads are inserted into free slots in warps with the same PC and a new warp is created for any threads that can't find a slot.
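Very roughly, in C - this is my paraphrase of the insertion step, and the details are surely off somewhere:

Code:
#include <stdint.h>

enum { WARP_WIDTH = 32 };

typedef struct { uint32_t pc, occupied; uint16_t tid[WARP_WIDTH]; } warp_t;

/* Dynamic warp formation as I read it: the pool is indexed by PC,
   and a thread joins a warp at the same PC if its home lane is
   free; otherwise a fresh warp entry is created for it. */
void insert_thread(warp_t pool[], int *nwarps, uint32_t pc, uint16_t tid)
{
    int lane = tid % WARP_WIDTH;          /* keep lane affinity */
    warp_t *w = 0;
    for (int i = 0; i < *nwarps; i++)     /* look up by PC      */
        if (pool[i].pc == pc && !(pool[i].occupied & (1u << lane))) {
            w = &pool[i];
            break;
        }
    if (!w) {                             /* no free slot: new warp */
        w = &pool[(*nwarps)++];
        w->pc = pc;
        w->occupied = 0;
    }
    w->occupied |= 1u << lane;
    w->tid[lane] = tid;
}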

It all sounds doable, although the warp prioritization and scheduling part of it still seems to have some kinks. If Nvidia really is crazy about CUDA they may do something like this, but unfortunately plain old games probably won't benefit at all...
 
The thesis that Jawed linked proposes a solution for this. There will be independent decoders doing the translation for each SIMD lane, since each lane may be occupied by a thread from a different warp. The offset within the register file is then calculated based on the thread ID or some similar logic.

Wow, that sounds totally familiar to me, as if I read or wrote something very similar to it two+ years ago.

Wasn't that what I was talking about here: http://forum.beyond3d.com/showthread.php?p=862555#post862555


It is also very similar to this: http://forum.beyond3d.com/showthread.php?p=862983#post862983
except that, instead of reg_num2/vec_2, we have a set of batch indices w/ a PC constraint.

I fear that the actual thing I remember typing was the same thing that got lost as referenced here:
http://forum.beyond3d.com/showthread.php?p=862623#post862623

:(

I remember being unhappy about the predication thing and working out several ways to ameliorate it.

In the intervening time, I think we've learned a lot more about dispatch, to the point where I thought we had decided that register fetching wasn't necessarily as straightforward as we had originally thought, which would make it easier to juggle threads around...?

-Dave
 
The thesis that Jawed linked proposes a solution for this. There will be independent decoders doing the translation for each SIMD lane, since each lane may be occupied by a thread from a different warp.
The decoder scheme seems to be simplest when looking at the SIMD position in a different warp, not just any thread on any warp in the work group.

A decoder in this case would also be suspect, because the presentation looks to be coalescing warps that would otherwise be running anyway, just diverging.

A simple decoder running on the 1/0 barrier table would be inadequate, since the decoders, if they did actually issue a work unit out of warp order, would not have the state to know not to do it again the next cycle.

The 1024 work unit group size is also four times larger.

The theoretical scheme in that presentation has a given size increase that may be roughly accurate, though I didn't know CACTI modeled non-cache units.
Modeling the effects of adding a lookup table and warp pool to the instruction issue phase on a physical implementation is something that I'd be concerned about.


The offset within the register file is then calculated based on the thread ID or some similar logic.
That was what I mentioned when I said the windower would be broadcasting the work unit number, which is much the same as how my mind approached it.

The difference is that, in the absence of a tie to SIMD lane and cycle count (the pictures in the presentation seemed to make the effort to keep the lane consistent), this calculation has to be done by taking the 10-bit work unit number and computing the modulus values by SIMD width and warp width, on the fly, from the results of the table lookup.
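In other words, something like this per issued work unit (my formulation):

Code:
/* Each decoder has to derive these on the fly from the looked-up
   10-bit work unit number, rather than having them implicit in the
   hardware lane position and the scheduler's cycle counter: */
void derive_position(unsigned work_unit_id, unsigned *lane, unsigned *phase)
{
    *lane  = work_unit_id % 8;        /* which of the 8 ALU lanes    */
    *phase = (work_unit_id / 8) % 4;  /* which of the 4 issue cycles */
}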

And the windower logic isn't completely search based. The guy's approach was based on indexing into the pool based on PC. So incoming threads are inserted into free slots in warps with the same PC and a new warp is created for any threads that can't find a slot.
It would follow that the incoming threads are inserted into a warp with the same PC. I'm not sure what it would mean to assign a thread to a SIMD that was executing a different instruction entirely.
 
I don't want to be a bore before heading off to bed, but the patents I linked need a closer read. e.g. you'll find the operand collector has latency (i.e. it takes more than one clock for an operand to cross from collector to ALUs) and a crossbar for feeding the ALUs. Any operand can go to any one or more ALU lanes (well, we knew this from CUDA documentation).

Also, as it turns out, register allocation within the register file is a whole kettle of fish. There's potential here for interaction with registers/blocks of registers that the compiler/driver sees as being the target of incoherent accesses (e.g. registers used within a loop).

Additionally there is not a single scoreboard per multiprocessor, but the instruction dependency (per-warp and per-instruction) and operand collection (per-warp, per-instruction and per-operand) scoreboards run independently. Available instructions from the instruction-dependency scoreboard are submitted to the operand collector (i.e. op-code + operand addresses + resultant address to handle the resultant's return). So scoreboarding is, effectively, an hierarchical process and it seems the operand collector can reject requests for instruction execution based on a pile of rules. Also of note is the lifetime of rows in the scoreboards - in the operand collector, lifetime is very short generally. Warps are effectively permanently allocated only in the instruction dependency scoreboard, as far as I can tell.

So something like a barrier scoreboard would be orthogonal, again. Warps that get picked by the barrier scorer will then be tested against the instruction-dependency scoreboard, and then the operand collector will perform final arbitration for the "random" fetching required to populate ALU lanes, whereby it takes a stream of warps' bit-masks of valid work items - instead of just a single warp-instruction, as would be the case when barrier handling is not required.

This stream of valid work items for a group of warps is a bit like a stream of gather/scatter requests handled by the MCs. The MCs queue and sort the request stream for bank/burst optimisation. The operand collector already does most of this - just the final piece of the jigsaw, being "blind" to warp when assembling operands for ALU lanes is needed, I believe.
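As a toy model of that hierarchy - my reading of the patents, not their wording:

Code:
#include <stdint.h>
#include <stdbool.h>

/* Two independent scoreboards: warps live permanently in the
   per-warp/per-instruction dependency scoreboard, while rows in
   the operand collector (per-warp, per-instruction, per-operand)
   are short-lived, existing only while operands are gathered. */
typedef struct { uint8_t opcode, src[3], dst; } request_t;
typedef struct { bool valid; bool ready[3]; request_t req; } row_t;

/* The collector can reject a request, in which case the dependency
   scoreboard simply re-offers the instruction on a later cycle. */
bool collector_accept(row_t rows[], int nrows, request_t req)
{
    for (int i = 0; i < nrows; i++)
        if (!rows[i].valid) {
            rows[i] = (row_t){ .valid = true, .req = req };
            return true;
        }
    return false;   /* no free row: arbitration lost this cycle */
}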

Tomorrow, I'll go through Fung's stuff...

Jawed
 
Crysis Maximum Edition with GT300 support?
Content:
CRYSIS + WARHEAD + WARS


Added content:
* Added global illumination effects
* Added DX11 replacement textures and caustics effects
* Added ray-tracing effects
* Official DX11 rendering support, alongside the Q2 2009 release of Nvidia's GT300 DX11 chip
* Support for CUDA 3.0 game physics calculations, taking physics visuals to an unprecedented level
http://bbs.chiphell.com/viewthread.php?tid=38436&extra=page=1
 
Wow, that's a nice-looking wish list!

Leaving the smell of wild speculation aside, could that mean some ES hardware is heading for production and distribution to key dev partners?
 
How long is a piece of tape?


I wonder whether the Q2 for GT300 above is just a typo for Q4, or whether the desperate times Nvidia is going through, plus the probable large amount of 40nm capacity TSMC has, mean it could be right?

Two quarters is a big jump forward though; how much can you theoretically squeeze a timeframe out of need and necessity?
 
Or did they scrap GT212 altogether? Seeing as demand for it won't be so hot, due to it performing at around the GTX295 level (theoretically, based on leaked specs) and, more importantly, the economy being bad (this is probably an understatement :LOL:).

So instead of another GT200 rehash, they fill up the high end with GT300 and the mid range with GT214/5. Sounds plausible, but probably too good to be true.
 
Two quarters is a big jump forward though; how much can you theoretically squeeze a timeframe out of need and necessity?
65/55nm woes may not have had any impact on 40nm timing?...

If, with 40nm, NVidia has re-thought some of its approaches because of the difficulties encountered with 65/55nm, then maybe 40nm is being kinder to them?

Jawed
 
65/55nm woes may not have had any impact on 40nm timing?...

If, with 40nm, NVidia has re-thought some of its approaches because of the difficulties encountered with 65/55nm, then maybe 40nm is being kinder to them?

Jawed


Certainly a ponderable. I was also thinking that with TSMC not exactly awash with orders, they were perhaps giving more assistance and resources to both AMD and Nvidia to get them over more quickly?
 
Or did they scrap GT212 altogether? Seeing as demand for it won't be so hot, due to it performing at around the GTX295 level (theoretically, based on leaked specs) and, more importantly, the economy being bad (this is probably an understatement :LOL:).
Yeah, it'd be wise for them to scrap one cheap new GPU because they have a dual-chip, dual-512-bit-bus monster card with approximately the same performance right now.

So instead of another GT200 rehash, they fill up the high end with GT300 and the mid range with GT214/5. Sounds plausible, but probably too good to be true.
Problem is that GT212 should be what G71 was between the G80 and G84/G92 introductions. The performance difference between G300 and GT214/215/216 (whichever one is the fastest of them) would probably be the same as it was between G80 and G73.
So I'm quite hesitant to trust those rumours about GT212 cancellation -- GT212 seems to be the chip that simply can't be cancelled, because in that case NV will have a hole of impossible size in their 2009-10 line-up.
The only way GT212 can be cancelled is if G300 is essentially the same in performance as GT212 -- not a monster chip (or maybe GT212 is a monster chip?..) but a middle-end-class GPU, with a two-chip AFR card strategy for the top end.
 
It reminds me of NV47. When everybody believed the chip was cancelled, nVidia launched it as a new-generation architecture - G70...
 
But back then there weren't any upcoming Direct3D versions, so nVidia could just stick with the NV4x architecture. Now the G8x/G9x/G2xx line is getting a bit long in the tooth and D3D11 should be out by Q4. They need something new, one way or another...
 

Hmm, the fact that it's just some random dude on a forum posting it makes an already very dubious-looking leak almost impossible to believe.

Still, it would be damn nice. I haven't got round to buying Warhead yet; I think I'll hold off for a little longer just in case this is real. I have a stack of top games I need to finish first anyway, so nothing is lost if it turns out to be fake.
 
What makes it even less believable is the fact that it is a March release.

Oh I see. So the game itself is real but this guy is just making up a load of bull about it.

Warhead and Crysis Wars come together anyway though, don't they? So this is basically just Crysis + Warhead, and I already have Crysis, so I may as well just pick up Warhead on the cheap.
 