NVIDIA Fermi: Architecture discussion

Believe me, I'm aware of all these issues and how well the various solutions map to the architecture.
Cool, I guess I just don't get the point you're trying to make then.

FWIW I'm not saying that the current solution of CUDA threads not acting like traditional threads is necessarily a bad model going forward, but I am arguing that given the limitations it's more natural to think of the programming model in terms of SIMD, *not* in terms of independent threads. It's convenient to write it in a scalar fashion - no doubt - but it's fundamentally important to understand the underlying hardware to write anything nontrivial and useful.

Thus I will continue to assert that the choice of the term "thread" is a bad one compared to Khronos' more general "work item".
 
Sorry if I'm being thick, but I don't understand. Let's outlaw the use of __sync for a moment, and just reimplement barriers by having each participating work item atomically decrement a common counter value and then spin on the counter value till it becomes <= 0. In what way does the hardware (rather than buggy software) prevent forward progress?
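For concreteness, here's roughly what I have in mind as a CUDA-style sketch (the names and the counter initialization are mine, purely illustrative):

Code:
__device__ int barrier_count;   // assumed to be preset to the number of participating work items

__device__ void software_barrier(void)
{
    // Each participant checks in once...
    atomicSub(&barrier_count, 1);
    // ...then spins until the count reaches zero (atomicAdd of 0 is just an atomic read).
    while (atomicAdd(&barrier_count, 0) > 0)
        ;
}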

Work item C in warp 1 needs to get a locked copy of Mem_Alpha. Work item B in warp 2 has a lock on Mem_Alpha. Work item A in warp 2 needs to get a lock on Mem_Beta. Work item D in warp 1 has a lock on Mem_Beta.

Slightly simplified, but basically: items C and A are in a spin loop waiting on B and D, but B and D can't release because they are in the same instruction streams as C and A respectively.

In a system that has real threads this actually works. In something trying, but failing, to emulate threads via vector lanes it does not.
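To make it concrete, even a single-lock version of this hazard shows the problem. A rough CUDA-flavoured sketch (my own illustration, not real code from anywhere):

Code:
__device__ int lock = 0;            // hypothetical global lock, 0 = free

__global__ void naive_spinlock(int *out)
{
    // One lane per warp wins the CAS, everyone else spins.
    while (atomicCAS(&lock, 0, 1) != 0)
        ;                           // losing lanes loop here
    *out += 1;                      // critical section (illustrative)
    atomicExch(&lock, 0);           // release -- but with a single per-warp instruction
                                    // stream the divergence hardware can keep replaying
                                    // the spin branch, so the winner may never get here
}

With real threads the lock holder always makes forward progress and eventually releases; with lanes sharing one instruction stream, whether it ever does is up to which side of the divergence the hardware chooses to run.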
 
I think we have established by now that thread is a wide enough concept with enough differing interpretations that restricting it to hardware contexts for the SIMD engine without qualification is not conducive to proper communication ... otherwise this wouldn't have generated this much text.
 
A company has to make some tough decisions. NVIDIA's decision to restrict supply of underperforming GPUs makes some sense, as the alternative makes little business sense for NVIDIA and its partners.

I don't disagree with what you say, but then why did they just release the new, very low-performing cards (gt210/gt220)?
 
I think we have established by now that thread is a wide enough concept with enough differing interpretations that restricting it to hardware contexts for the SIMD engine without qualification is not conducive to proper communication ... otherwise this wouldn't have generated this much text.
Well I think what's clear is that using it "without qualification" in the manner that CUDA uses it is similarly not conducive to proper communication. This is why the OpenCL approach of renaming the SW concepts so that they are decoupled from how they are mapped to the hardware is a generally good idea IMHO. Regardless of the confusion, there are fundamental differences between CUDA-like "threads" and typical pthreads, which motivates differentiating the terminology pretty strongly.
 
aaronspink, yup I see your point. Your example and Andrew's helped clarify things for me - I was thinking that the warp scheduler would have to advance lane IPs to the next actual instruction for each branch rather than predicate, and then round-robin over ready lanes, rather than processing the instruction stream serially. You'd still issue a single instruction for a warp, so hopefully the scheduler wouldn't be that much more complex (but I'm perfectly willing to take a HW guy's word that it would, and I may well be overlooking other issues). The problem with my idea, as Andrew pointed out, is that once lanes diverge, they will quite likely never reconverge without further action. That's the part I wasn't thinking through properly, and I can't think of an easy way to do it.
 
I don't disagree with what you say, but then why did they just release the new, very low-performing cards (gt210/gt220)?

Who knows. Maybe because these cards are super cheap to manufacture (unlike GT200-based cards)? Maybe they're NVIDIA's testbed for the 40nm process node? I don't even know if these cards are available in large quantities right now.
 
To avoid this, you basically have to know to run that control block independently, and not predicated/SIMD in this case. You can't know this statically in the general case, so you need the hardware (or SW if targeting a SIMD ISA) to dynamically evaluate arbitrarily different control flow paths generated by predication across the warp/SIMD lanes. In full generality, this involves packing/unpacking masked warp lanes on the fly.
That's a bad example. Consider, for example, a simple OS that schedules context switches by inserting them into the instruction stream as opposed to using timed interrupts. On a CPU, your situation deadlocks as well. The problem is not the SIMD lanes nor the use of predication. It's the way that the scheduling is done. All you'd need is a limit on consecutive cycles spent on each branch, just like a real OS imposes a limit on contiguous time spent on a thread by a CPU.

Hell, for all we know maybe GPUs already have such limits.
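In software terms, the analogue of capping consecutive cycles per branch is putting the critical section inside the loop body, so the divergent paths reconverge every iteration and the lock holder gets a turn. A rough sketch, assuming the usual reconverge-at-the-immediate-post-dominator behaviour (my illustration, not vendor guidance):

Code:
__device__ int lock = 0;

__global__ void bounded_spin(int *out)
{
    bool done = false;
    while (!done) {
        if (atomicCAS(&lock, 0, 1) == 0) {  // one lane at a time gets the lock
            *out += 1;                      // critical section (illustrative)
            atomicExch(&lock, 0);           // release before the reconvergence point
            done = true;
        }
        // Both sides of the if reconverge here each iteration, so the spinning
        // lanes can't monopolize the warp's instruction stream indefinitely.
    }
}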

Simply always splitting on predication/control flow into separate warps would solve the deadlock problem
Too fancy! Is there any problem with the method I mentioned above? I'm sure there are benefits to dynamic warp formation, but this example falls short, IMO.

EDIT: Hmm, I suppose that from another point of view I am suggesting warp splitting, with the new warps being effectively compressed into one. I see how re-syncing becomes a problem because you could just wind up deadlocking yourself again.
 
All you'd need is a limit on consecutive cycles spent on each branch, just like a real OS imposes a limit on contiguous time spent on a thread by a CPU.
Interesting proposal, although I'm not sure this is as simple as it sounds either. As you mention in your edit, I think this still breaks down to being able to split up the "threads" in the lanes after some timeout, rather than just always doing it at control flow. In either case it does involve tracking additional program counters and potentially registers (unless you have some indirection map from the new warps to the old register lanes they came from, I guess) when the split happens.

I dunno, maybe there's an easier way to do this, but it seems like even the simplest schemes are complex enough to consider going to fully dynamic warp formation, which has many additional benefits for the performance of divergent control flow and completes the abstraction.

Who knows though, it all seems pretty expensive compared to the current setups. I wonder how much raw compute power you'd have to give up in terms of both size and power to get something like this...
 
Have any of the rendering pipeline gurus given any thought to the practical uses of a unified memory space or caches for 3D rendering? One benefit of Larrabee's approach is that the tile stays on-chip from start to finish. It seems that Fermi's L1/L2 won't be sufficient to handle both render targets and textures for something like this. Even 32x32 tiles (1024 work-items) will get heavy with 4 FP16 surfaces, and that's before AA. And then there's the primitive buffering and draw call dependency tracking that has to happen before all that.

So if tiling isn't feasible or practical, is there anything else that could potentially benefit? What's the path for MSAA readback (DX10.1) on current hardware? (oh depth is just mapped as another texture, nvm)
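For what it's worth, a rough back-of-the-envelope, assuming RGBA16F at 8 bytes per pixel and no compression (my numbers, nothing official):

32 x 32 pixels x 4 surfaces x 8 bytes = 32 KB per tile, colour only, no AA
at 4x MSAA: 32 KB x 4 = 128 KB per tile, still ignoring depth/stencil

Taking the announced 64 KB of L1/shared per SM and 768 KB of L2 at face value, a single tile already spills past L1 once AA is on, and a handful of in-flight tiles would chew through most of the L2.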
 
Fermi and Folding
Some additional explanations:

I was forced to use a USB monitor as the GPUs don't have any video output (these engineering samples of Fermi are Tesla-like, but they have 1.5GB of memory each, like GT300 will).
Because of the new MIMD architecture (they have 32 clusters of 16 shaders) I was not able to load them at 100% in any other way but to launch 1 F@H client per cluster, per card. Every client is the GPU3 core beta (OpenMM library). I suppose it is much more efficient than the previous GPU2. In addition they need very little memory to run. Having 16GB of DDR3 and using Windows 7 Enterprise, I've managed to run 200 instances of F@H GPU and 4 CPU (i7 processor, HT off).
Power considerations:
He has two 1500W PSUs, drawing 2400 watts.

So let's say his i7 CPU uses 150 watts...

2400 watts - 150 = 2250 GPU watts total.

2250 / 7 Fermi = ~321 watts per GPU...
 
Do you consider the original thread over there to be genuine?
One post (and that's the only one) in seven days from this guy, and a website offered as proof that doesn't exist? Call me sceptical at best.
 
I'm fairly disappointed. After several pages of counting threads, I would have expected that some would have started counting beads by now. Instead all I get is some pitiful link from some fool, which was screamingly obvious even to me as a fake.

I think I need to get digi's assistance to throw around some confetti and glitter to add some colour to this thread (are you sure we shouldn't call them warps in fora now?) *runs for his life* :LOL:
 
Have any of the rendering pipeline gurus given any thought to the practical uses of a unified memory space or caches for 3D rendering?
The ultimate practical uses: simplify developers' lives and/or improve performance.
One benefit of Larrabee's approach is that the tile stays on-chip from start to finish. It seems that Fermi's L1/L2 won't be sufficient to handle both render targets and textures for something like this. Even 32x32 tiles (1024 work-items) will get heavy with 4 FP16 surfaces, and that's before AA. And then there's the primitive buffering and draw call dependency tracking that has to happen before all that.
Ignore Fermi's L2 size for a second, and don't think of a cache in terms of a local store; otherwise you'd be better off using a local store (and not waste area and power on something you don't need). Just because your entire data set doesn't fit in your cache doesn't automatically mean the cache can't help you. Texture caches are a perfect example of that.
 