|
|
#276 |
|
Member
|
But back then, there weren't any upcoming Direct3D versions, so nVidia could just stick with the NV4x architecture. Now, the G8x/G9x/G2xx is getting a bit long in the tooth and D3D11 should be out by Q4. They need something new, one way or another...
|
|
|
|
|
#277 |
|
Regular
|
I found Fung et al.'s paper, presented at MICRO 2007:
http://www.ece.ubc.ca/~aamodt/papers....micro2007.pdf which should make things easier to understand. Jawed |
|
|
|
|
#278 | |
|
B3D Scallywag
|
Quote:
Still, it would be damn nice. I didn't get round to buying Warhead yet. I think I'll hold off for a little longer just in case this is real. I have a stack of top games I need to finish first anyway, so nothing lost if it turns out to be fake.
__________________
PowerVR PCX1 4MB --> Voodoo Banshee 16MB --> GeForce2 MX200 32MB --> GeForce2 Ti 64MB --> GeForce4 Ti 4200 128MB --> 9800Pro 128MB --> 8800GTS 640MB --> Radeon HD 4890 1GB --> GeForce GTX 670 DirectCU II TOP 2GB |
|
|
|
|
|
#279 | |
|
Member
Join Date: Mar 2005
Posts: 568
|
Quote:
|
|
|
|
|
|
#280 | |
|
B3D Scallywag
|
Quote:
Warhead and Crysis Wars come together anyway though don't they? So this is basically just Crysis + Warhead. And I already have Crysis so I may as well just pick up Warhead on the cheap.
__________________
PowerVR PCX1 4MB --> Voodoo Banshee 16MB --> GeForce2 MX200 32MB --> GeForce2 Ti 64MB --> GeForce4 Ti 4200 128MB --> 9800Pro 128MB --> 8800GTS 640MB --> Radeon HD 4890 1GB --> GeForce GTX 670 DirectCU II TOP 2GB |
|
|
|
|
|
#281 |
|
Meh
Join Date: Mar 2004
Location: New York
Posts: 9,809
|
Pre-order at the egg for $46. No thanks....
__________________
What the deuce!? |
|
|
|
|
#282 | |||
|
Meh
Join Date: Mar 2004
Location: New York
Posts: 9,809
|
Quote:
Quote:
Quote:
__________________
What the deuce!? |
|||
|
|
|
|
#283 | |||||||||
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,071
|
Quote:
Quote:
Quote:
The instruction dependency scoreboard might not care too much about the coalesced warps. The operand collection scoreboard's job is measurably more complex the more arbitrary work unit assignment to a lane is. Quote:
In the barrier scheme, it might get by with operand and result addresses coupled with a work-unit identifier or at least the bits indicating which lane and which warp. Quote:
Quote:
Quote:
It's a functional idea. It's words like random, arbitration, and stream that look fine on paper or in software that worry me. As far as a physical implementation goes, I have concerns. Quote:
Quote:
They didn't like breaking lane association either. The bank conflicts from trying to move things around led to some pretty significant performance drops in simulation. The problem for branch divergence is in many ways more tidy than the one for coalescing work units based on barrier resolution. Branches are encountered and divergence is determined within a fixed and determinate period of time. The update to the needed structures only needs to happen a few times, and the accesses for that data generation are of a fixed number. The barrier table persists for the lifetime of a work group waiting on an event such as a gather operation, and it needs to be updated repeatedly. The scheme actually could be significantly simplified, I think, if we enforced some kind of lane restriction instead of trying to cover every part of the problem space.
__________________
Dreaming of a .065 micron etch-a-sketch. |
|||||||||
|
|
|
|
#284 | |||
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,071
|
Quote:
Quote:
What happens when one set of units makes it to the end of the window? We don't want the issue logic to try to increment the PC and then coalesce those work units with other units still stuck on a different instruction. A single bit is not sufficient to catch that state. We can have a side pointer that tells the rest of the scheduler to stop trying to issue these units, but that's additional state. We can spawn new tables and stack them, at the cost of more actively updated state. The description of this table's function indicates we're sending a stream of masks, which hints that it is being very thorough. EDIT: And it was described as such: "32-wide, 4 phases of 8" Quote:
There's no timetable for when other gather operands will become available, and to which work units.
__________________
Dreaming of a .065 micron etch-a-sketch. Last edited by 3dilettante; 22-Jan-2009 at 17:39. |
|||
|
|
|
|
#285 | |||||||||||
|
Regular
|
Quote:
Quote:
Quote:
Quote:
A key point in linking the patent documents is that a lot of this complexity already exists. For example, one thing Fung doesn't realise is that any operand can go to any lane - there's a crossbar after the operands have been retrieved. This makes some of his modelling flawed - I'm at section 4.2 by the way, reading carefully... Quote:
Quote:
Which of course raises the question of the actual cost of the operand collector as it now stands. Dunno... Quote:
If the code has a long-ish loop with a gather inside it, then things explode. One of my concerns with my idea is that barrier and predication masks overlap in meaning and intertwine in scope. So a barrier stack seems like an obvious solution, but then... Predication must be retained because it's impossible to construct non-divergent warps and also it might be prudent to apply thresholding to DWF (varies by type?), e.g. only perform DWF if the multiprocessor is due to run out of available warps within X cycles, that kind of thing - since DWF has a start-up latency due to an increase in operand collection latency. Need to build a simulator to really figure this stuff out. Quote:
Though they can't be that minor or we'd have seen them by now. But well, better is the enemy of good enough. Quote:
Quote:
For example, Keenan Crane's Julia set shader, which is rather rich in control flow: http://graphics.cs.uiuc.edu/svn/kcra...ia_source.html has the following "average" execution statistics: HD4870 - 18382 cycles; HD4670 - 14617 cycles. Now I admit that's just a headline-grabbing, fairly meaningless statistic (those are numbers from GPUSA - benchtesting would be much better). Quote:
Put another way, if NVidia builds DWF for control-flow divergence, what's the incremental cost to implement comprehensive barrier-based DWF (bearing in mind that NVidia's scheduling is already very costly)? I haven't got a simulator lying around, so I don't even know if it would be worth doing... Jawed |
|||||||||||
|
|
|
|
#286 | |
|
Regular
|
Quote:
Jawed |
|
|
|
|
|
#287 | ||||
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,071
|
Quote:
Quote:
For gather operations, perhaps an optimization would be to keep operands fed by gathers as spread evenly as possible between banks, in order to minimize the number of times bank conflicts arise. Quote:
What is the throughput of the crossbar, and with what restrictions? The diagrams I saw seemed to indicate each SIMD pipe had a direct connection with its local subset, and then there were connections between the sets of unspecified capacity. Quote:
Having both modes seems akin to having both a manual and automatic transmission stuck on a car.
__________________
Dreaming of a .065 micron etch-a-sketch. |
||||
|
|
|
|
#288 | ||||||
|
Regular
|
Quote:
One thing I've realised is that the CUDA documentation doesn't steer the developer around anything the compiler might do to re-structure register allocation (e.g. make r0-r3 fat, r4 and r5 thin, r6 warp-phased). This implies to me that the register allocation rules that run when a shader (kernel) is loaded into a multiprocessor are prolly constrained when running CUDA apps. But for graphics the rules have free rein. Dunno. Obviously register allocation completely breaks down in the face of truly random register fetches. Quote:
Quote:
Quote:
Maybe you mean that MAD and MI have distinct operand collectors? Quote:
It might be a question of trading latency, e.g. 2 cycles of extra windowing latency means that for the same performance an extra 4 warps' capacity has to be added and/or the register file has to grow by 10-20%. Quote:
Jawed |
||||||
|
|
|
|
#289 | |
|
Regular
|
So, overall I'm happy with the outline concepts in Fung's thesis. In section 9.3.3 he mentions how DWF could be used to ameliorate general bank conflicts:
Quote:
Jawed |
|
|
|
|
|
#290 |
|
Unknown.
Join Date: Aug 2002
Location: UK
Posts: 4,877
|
Does it strike anyone else that these tricks should get even more effective with smaller batch sizes? And does it strike anyone else that smaller batch sizes become more cost-efficient when the register file becomes bigger (ala GT200 doubling it) or the units become more powerful (ala RV770: INT32 MULs & shifts for every ALU)?
I also think that what is at least as important to improve as branch coherency is the quad-related overhead of small triangles. While we're never going to be able to escape those for texture operations unless we do quite expensive tricks in the rasterizer to combine triangles etc., it seems to me we can still reduce their ALU overhead. Consider what a MIMD architecture like PowerVR's SGX can do: it only has to do ALU operations for anything that determines the texture coordinates of any future texture operation. It can completely bypass everything else, it seems to me. I would suspect that a 'loose MIMD' approach could do the same if implemented properly. Another possible approach, of course, is strict MIMD ala SGX. While Jawed thinks that is very expensive, I am far from convinced that it is: it requires you to be a lot smarter about your control logic, but it's not impossible. SGX's die size is proof enough of that, and it seems the overhead should be even smaller if each ALU was 32x32 MUL/57x57 ADD instead of 24x24 MUL&ADD... Of course, going down that route prevents you from double pumping the ALUs and not the control logic! Unless your control logic can dual-issue, which it seems to me would only make sense if each issue was for a separate register file.
__________________
Focusing on non-graphics projects in 2013 (but I still love triangles) "[...]; the kind of variation which ensues depending in most cases in a far higher degree on the nature or constitution of the being, than on the nature of the changed conditions." |
|
|
|
|
#291 | ||
|
Senior Member
|
Quote:
Quote:
Thanks |
||
|
|
|
|
#292 | |
|
Senior Member
Join Date: Jul 2004
Location: NY, NY
Posts: 2,680
|
Quote:
|
|
|
|
|
#293 | |
|
Regular
|
Quote:
Separately, in OpenCL there's a work group and in D3D11-CS a thread group. These are the equivalent of a block in CUDA. While a block in CUDA can be any size, I think that D3D11-CS defines a fixed size for a block (1024 threads). This size will change (increase) with succeeding versions of D3D. Clearly a GPU can break up a work group/thread group/block into as many batches/warps/wavefronts as it needs to. The latter is a hardware constraint. Jawed |
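The point about a group being split into hardware batches can be illustrated with a minimal Python sketch. The helper name and the sizes used here (32-wide warps, 64-wide wavefronts, a 1024-thread group) are illustrative assumptions and are not taken from any specification.

```python
def batches_per_group(group_threads: int, batch_size: int) -> int:
    """Number of warps/wavefronts needed to cover one block/work group.

    Hypothetical helper: rounds up, because a partially filled batch
    still occupies a whole batch slot in the hardware.
    """
    if group_threads <= 0 or batch_size <= 0:
        raise ValueError("sizes must be positive")
    return -(-group_threads // batch_size)  # ceiling division


# Assumed example sizes: one 1024-thread group on 32-wide and 64-wide hardware.
print(batches_per_group(1024, 32))
print(batches_per_group(1024, 64))
print(batches_per_group(100, 32))  # the last warp is only partly filled
```

The group size is a property of the API, while the batch size is a property of the hardware, so the same group maps to a different number of batches on different GPUs.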
|
|
|
|
|
#294 | |
|
Regular
|
Quote:
If you're going to invest in the control overhead for small warps (either narrower SIMDs or fewer phases per warp-instruction) then you have to weigh the cost of the extra control overhead versus dynamic warp formation. DWF is effectively providing smaller warps, on average, during incoherent interludes ("taking several small warps and combining them into one"). So below a certain warp size DWF becomes pointless. The granularity of DWF becomes an issue, too. e.g. do you want to be able to combine any 32 warps to make up a temp warp? Or is it better to limit this to any 16, or any 8? etc. Jawed |
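To make "taking several small warps and combining them into one" concrete, here is a toy Python sketch of lane-restricted regrouping after a branch. It is only a sketch under stated assumptions: the function name `form_warps`, the rule that a thread's home lane is `thread_id % warp_size`, and the greedy packing are all hypothetical choices for illustration, and it is not Fung's hardware algorithm or anything NVidia has described.

```python
from collections import defaultdict


def form_warps(threads, warp_size):
    """Regroup threads into temp warps by their next PC.

    threads: list of (thread_id, next_pc) pairs gathered from a pool of warps.
    Assumption: each thread must stay in its home lane (thread_id % warp_size),
    so two threads sharing a lane can never sit in the same temp warp.
    Returns a list of (pc, lanes), where lanes holds a thread id or None
    for an idle lane.
    """
    by_pc = defaultdict(lambda: defaultdict(list))
    for tid, pc in threads:
        by_pc[pc][tid % warp_size].append(tid)

    warps = []
    for pc in sorted(by_pc):
        lanes = by_pc[pc]
        # The most crowded lane sets how many temp warps this PC needs.
        depth = max(len(queue) for queue in lanes.values())
        for i in range(depth):
            warp = [
                lanes[lane][i] if lane in lanes and i < len(lanes[lane]) else None
                for lane in range(warp_size)
            ]
            warps.append((pc, warp))
    return warps


# Two 4-wide warps (threads 0-3 and 4-7) that both diverge at a branch.
pool = [(0, 10), (1, 20), (2, 20), (3, 10),
        (4, 20), (5, 10), (6, 10), (7, 20)]
for pc, lanes in form_warps(pool, warp_size=4):
    print(pc, lanes)
```

In this hand-picked example the two divergent warps regroup into two full temp warps, so two issues replace four. When threads heading to the same PC collide on a lane, the sketch falls back to partly empty warps, which is where the granularity question (how many warps to search) starts to matter.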
|
|
|
|
|
#295 | |
|
Unknown.
Join Date: Aug 2002
Location: UK
Posts: 4,877
|
Quote:
To take an extreme case, imagine an if-else case where each branch has a truly random 50% chance of being taken. In a case without DWF and a warp size of 2, you have four possible cases, each with a 25% probability: T0=True/T1=False - T0=False/T1=True - T0=True/T1=True - T0=False/T1=False. Clearly efficiency is 50% in cases 1/2, and 100% in cases 3/4, resulting in average efficiency of 75%. The greater the warp size, the nearer to 50% you get; but you can never get above 75% or below 50%... With DWF, your efficiency with an infinitely large number of warps being checked is 100%. However, even with a very small number of warps being checked, it should easily be very near 100%; but this is not the case with much larger warps, where you need to check many more warps to achieve near-100% efficiency. So I should have made my comments more precise: DWF becomes more effective for smaller batches *in the case of realistic numbers of warps being checked*. It seems crazy to me to check 32 warps per clock cycle, so I didn't even consider that possibility; I guess that's where our misunderstanding comes from...
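The no-DWF arithmetic above can be checked with a short Python sketch. It assumes the same idealised model as the post: every thread takes each side of the if-else with independent 50% probability, a divergent warp runs at 50% efficiency, a uniform warp at 100%, and the figure quoted is the plain average over all outcomes. The function names are hypothetical.

```python
from itertools import product


def mean_efficiency_no_dwf(warp_size: int) -> float:
    """Average per-warp efficiency by brute-force enumeration of outcomes."""
    total = 0.0
    for outcome in product((True, False), repeat=warp_size):
        taken = sum(outcome)
        divergent = 0 < taken < warp_size
        # A divergent warp executes both sides; a uniform warp only one.
        total += 0.5 if divergent else 1.0
    return total / 2 ** warp_size


def closed_form(warp_size: int) -> float:
    """Same quantity: only 2 of the 2**W outcomes are uniform."""
    return 0.5 + 2.0 ** -warp_size


for w in (2, 4, 8):
    print(w, mean_efficiency_no_dwf(w), closed_form(w))
```

For a warp size of 2 this gives the 75% figure from the post, and the value falls towards 50% as the warp size grows. It is an average of per-warp efficiencies; weighting by issue slots instead would give a different number.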
__________________
Focusing on non-graphics projects in 2013 (but I still love triangles) "[...]; the kind of variation which ensues depending in most cases in a far higher degree on the nature or constitution of the being, than on the nature of the changed conditions." |
|
|
|
|
|
#296 |
|
Regular
|
In other words, the possible gain from DWF gets smaller and smaller as warp size decreases. And with decreasing warp size the increase in control overhead associated with smaller warps utterly swamps the DWF control overhead. So, yes, in some abstract sense the area overhead of DWF diminishes in comparison with control overhead - but so does the performance gained with DWF. You're ignoring the total cost of the GPU with these smaller warp sizes. I will agree that there has to be a sensible limit on the number of warps. Don't forget that currently a warp only issues once every 4 clocks, so that makes the construction of a temp-warp less onerous than it first appears. This latency is also used to ease operand collection, and with DWF a different latency versus SIMD-width trade-off may arise. For a variety of reasons Fung's simulator is further away from G80 than ideal. It's very much qualitative rather than quantitative. --- It's notable that the harmonic mean of IPC is, in the ideal MIMD case, about 33/34. Some attributes of the simulator cause an unexpected variation in MIMD IPC. It's interesting to see how DWF actually makes HMMer faster in the 8 and 16 warp-sized cases than MIMD. Apart from that, averaging these apps, this GPU is, according to the simulator, only able to achieve ~1/4 of theoretical ALU utilisation in the MIMD case. I could take this as a cue to indicate that more widespread use of DWF (i.e. to solve all incoherency) is really needed to make it truly compelling. Jawed |
|
|
|
|
#297 | |
|
Member
Join Date: Nov 2007
Location: Santa Clara, CA
Posts: 427
|
Quote:
BTW, any speculation on cache for GT3xx?
__________________
Timothy Farrar :: blog |
|
|
|
|
|
#298 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,071
|
A warp is a single instruction applied to a group of work items in lockstep.
In what fashion would a warp be applicable to a MIMD setup, where forming a warp would be forcing them into lockstep?
__________________
Dreaming of a .065 micron etch-a-sketch. |
|
|
|
|
#299 |
|
Regular
|
I think this is part of Arun's argument in favour of MIMD - it's possible with a MIMD machine to still operate on blocks of data, thus use bank-bursts to your advantage to gain maximum bandwidth efficiency, rather than taking a truly scatter-gun approach to each kind of memory operation (register file, constant cache, shared memory, memory), by fetching single operands.
So a burst fetch from a register file might produce 16 cycles worth of data in the normal case. This would allow the ALU to run at a much higher clock than the register file and also allow single-porting, etc. It's analogous to cache lines providing data for more than one cycle in a regular CPU. Well, that's my interpretation. Jawed |
|
|
|
|
#300 |
|
Meh
Join Date: Mar 2004
Location: New York
Posts: 9,809
|
Heh, it might be worth noting how incredibly wrong everybody was about G80. It's pretty entertaining reading those old speculation threads....
If we're anywhere near as wrong now as we were then, this upcoming generation could be very interesting. Anybody think we'll get better texture filtering this time around? Maybe a control panel option to force bicubic filtering in the shaders or something. It would be slow but probably more useful than the unnecessarily super high MSAA modes they offer now.
__________________
What the deuce!? |
|
|
| Tags |
| nvidia, speculation |
|
|