NVIDIA Fermi: Architecture discussion

Believe me, I'm aware of all these issues and how well the various solutions map to the architecture.
Cool, I guess I just don't get the point you're trying to make then.

FWIW I'm not saying that the current solution of CUDA threads not acting like traditional threads is necessarily a bad model going forward, but I am arguing that given the limitations it's more natural to think of the programming model in terms of SIMD, *not* in terms of independent threads. It's convenient to write it in a scalar fashion - no doubt - but it's fundamentally important to understand the underlying hardware to write anything nontrivial and useful.

Thus I will continue to assert that the choice of the term "thread" is a bad one compared to Khronos' more general "work item".
 
Sorry if I'm being thick, but I don't understand. Let's outlaw the use of __sync for a moment, and just reimplement barriers by having each participating work item atomically decrement a common counter value and then spin on the counter value till it becomes <= 0. In what way does the hardware (rather than buggy software) prevent forward progress?
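For concreteness, here's roughly what I have in mind as a CUDA-style sketch (the names and the counter initialization are mine, purely illustrative):

Code:
__device__ int barrier_count;   // assumed to be preset to the number of participating work items

__device__ void software_barrier(void)
{
    // Each participant checks in once...
    atomicSub(&barrier_count, 1);
    // ...then spins until the count reaches zero (atomicAdd of 0 is just an atomic read).
    while (atomicAdd(&barrier_count, 0) > 0)
        ;
}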

Work item C in warp 1 needs to get a locked copy of Mem_Alpha. Work item B in warp 2 has a lock on Mem_Alpha. Work item A in warp 2 needs to get a lock on Mem_Beta. Work item D in warp 1 has a lock on Mem_Beta.

Slightly simplified, but basically: items C and A are in a spin loop waiting on B and D, but B and D can't release because they are in the same instruction streams as C and A respectively.

In a system that has real threads this actually works. In something trying, but failing, to emulate threads via vector lanes it does not.
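To make it concrete, even a single-lock version of this hazard shows the problem. A rough CUDA-flavoured sketch (my own illustration, not real code from anywhere):

Code:
__device__ int lock = 0;            // hypothetical global lock, 0 = free

__global__ void naive_spinlock(int *out)
{
    // One lane per warp wins the CAS, everyone else spins.
    while (atomicCAS(&lock, 0, 1) != 0)
        ;                           // losing lanes loop here
    *out += 1;                      // critical section (illustrative)
    atomicExch(&lock, 0);           // release -- but with a single per-warp instruction
                                    // stream the divergence hardware can keep replaying
                                    // the spin branch, so the winner may never get here
}

With real threads the lock holder always makes forward progress and eventually releases; with lanes sharing one instruction stream, whether it ever does is up to which side of the divergence the hardware chooses to run.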
 
I think we have established by now that thread is a wide enough concept with enough differing interpretations that restricting it to hardware contexts for the SIMD engine without qualification is not conducive to proper communication ... otherwise this wouldn't have generated this much text.
 
A company has to make some tough decisions. NVIDIA's decision to restrict supply of underperforming GPUs makes some sense, as the alternative makes little business sense for NVIDIA and its partners.

I don't disagree with what you say, but then why did they just release the new, very low-performing cards (gt210/gt220)?
 
I think we have established by now that thread is a wide enough concept with enough differing interpretations that restricting it to hardware contexts for the SIMD engine without qualification is not conducive to proper communication ... otherwise this wouldn't have generated this much text.
Well I think what's clear is that using it "without qualification" in the manner that CUDA uses it is similarly not conducive to proper communication. This is why the OpenCL approach of renaming the SW concepts so that they are decoupled from how they are mapped to the hardware is a generally good idea IMHO. Regardless of the confusion, there are fundamental differences between CUDA-like "threads" and typical pthreads, which motivates differentiating the terminology pretty strongly.
 
aaronspink, yup I see your point. Your example and Andrew's helped clarify things for me - I was thinking that the warp scheduler would have to advance lane IPs to the next actual instruction for each branch rather than predicate, and then round-robin over ready lanes, rather than processing the instruction stream serially. You'd still issue a single instruction for a warp, so hopefully the scheduler wouldn't be that much more complex (but I'm perfectly willing to take a HW guy's word that it would, and I may well be overlooking other issues). The problem with my idea, as Andrew pointed out, is that once lanes diverge, they will quite likely never reconverge without further action. That's the part I wasn't thinking through properly, and I can't think of an easy way to do it.
 
I don't disagree with what you say, but then why did they just release the new, very low-performing cards (gt210/gt220)?

Who knows. Maybe because these cards are super cheap to manufacture (unlike GT200-based cards)? Maybe they're NVIDIA's testbed for the 40nm process node? I don't even know if these cards are available in large quantities right now.
 
To avoid this, you basically have to know to run that control block independently, and not predicated/SIMD in this case. You can't know this statically in the general case, so you need the hardware (or SW if targeting a SIMD ISA) to dynamically evaluate arbitrarily different control flow paths generated by predication across the warp/SIMD lanes. In full generality, this involves packing/unpacking masked warp lanes on the fly.
That's a bad example. Consider, for example, a simple OS that schedules context switches by inserting them into the instruction stream as opposed to using timed interrupts. On a CPU, your situation deadlocks as well. The problem is not the SIMD lanes nor the use of predication. It's the way that the scheduling is done. All you'd need is a limit on consecutive cycles spent on each branch, just like a real OS imposes a limit on contiguous time spent on a thread by a CPU.

Hell, for all we know maybe GPUs already have such limits.
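In software terms, the analogue of capping consecutive cycles per branch is putting the critical section inside the loop body, so the divergent paths reconverge every iteration and the lock holder gets a turn. A rough sketch, assuming the usual reconverge-at-the-immediate-post-dominator behaviour (my illustration, not vendor guidance):

Code:
__device__ int lock = 0;

__global__ void bounded_spin(int *out)
{
    bool done = false;
    while (!done) {
        if (atomicCAS(&lock, 0, 1) == 0) {  // one lane at a time gets the lock
            *out += 1;                      // critical section (illustrative)
            atomicExch(&lock, 0);           // release before the reconvergence point
            done = true;
        }
        // Both sides of the if reconverge here each iteration, so the spinning
        // lanes can't monopolize the warp's instruction stream indefinitely.
    }
}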

Simply always splitting on predication/control flow into separate warps would solve the deadlock problem
Too fancy! Is there any problem with the method I mentioned above? I'm sure there are benefits to dynamic warp formation, but this example falls short, IMO.

EDIT: Hmm, I suppose that from another point of view I am suggesting warp splitting, with the new warps being effectively compressed into one. I see how re-syncing becomes a problem because you could just wind up deadlocking yourself again.
 
All you'd need is a limit on consecutive cycles spent on each branch, just like a real OS imposes a limit on contiguous time spent on a thread by a CPU.
Interesting proposal, although I'm not sure this is as simple as it sounds either. As you mention in your edit, I think this still breaks down to being able to split up the "threads" in the lanes after some timeout, rather than just always doing it at control flow. In either case it does involve tracking additional program counters and potentially registers (unless you have some indirection map from the new warps to the old register lanes they came from, I guess) when the split happens.

I dunno, maybe there's an easier way to do this, but it seems like even the simplest schemes are complex enough to consider going to fully dynamic warp formation, which has many additional benefits for the performance of divergent control flow and completes the abstraction.

Who knows though, it all seems pretty expensive compared to the current setups. I wonder how much raw compute power you'd have to give up in terms of both size and power to get something like this...
 
Have any of the rendering pipeline gurus given any thought to the practical uses of a unified memory space or caches for 3D rendering? One benefit of Larrabee's approach is that the tile stays on-chip from start to finish. It seems that Fermi's L1/L2 won't be sufficient to handle both render targets and textures for something like this. Even 32x32 tiles (1024 work-items) will get heavy with 4 FP16 surfaces, and that's before AA. And then there's the primitive buffering and draw call dependency tracking that has to happen before all that.

So if tiling isn't feasible or practical, is there anything else that could potentially benefit? What's the path for MSAA readback (DX10.1) on current hardware? (oh depth is just mapped as another texture, nvm)
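For what it's worth, a rough back-of-the-envelope, assuming RGBA16F at 8 bytes per pixel and no compression (my numbers, nothing official):

32 x 32 pixels x 4 surfaces x 8 bytes = 32 KB per tile, colour only, no AA
at 4x MSAA: 32 KB x 4 = 128 KB per tile, still ignoring depth/stencil

Taking the announced 64 KB of L1/shared per SM and 768 KB of L2 at face value, a single tile already spills past L1 once AA is on, and a handful of in-flight tiles would chew through most of the L2.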
 
Fermi and Folding
Some additional explanations:

I was forced to use a USB monitor as the GPUs don't have any video output (these engineering samples of Fermi are Tesla-like, but they have 1.5GB of memory each, like GT300 will).
Because of the new MIMD architecture (they have 32 clusters of 16 shaders) I was not able to load them at 100% in any other way but to launch 1 F@H client per cluster, per card. Every client is the GPU3 core beta (OpenMM library). I suppose it is much more efficient than the previous GPU2. In addition they need very little memory to run. Having 16GB of DDR3 and using Windows 7 Enterprise, I've managed to run 200 instances of F@H GPU and 4 CPU (i7 processor, HT off).
Power considerations:
He has two 1500W PSUs, drawing 2400 watts.

So let's say his i7 CPU uses 150 watts...

2400 watts - 150 = 2250 GPU watts total.

2250 / 7 Fermi = ~321 watts per GPU...
 
Do you consider the original thread over there to be genuine?
One post (and that's the only one) in seven days from this guy, and a website offered as proof that doesn't exist? Call me sceptical at best.
 
I'm fairly disappointed. After several pages of counting threads, I would have expected that some would have started counting beads by now. Instead all I get is some pitiful link from some fool, which was screamingly obvious even to me as a fake.

I think I need to get digi's assistance to throw around some confetti and glitter to add some colour to this thread (are you sure we shouldn't call them warps in fora now?) *runs for his life* :LOL:
 
Have any of the rendering pipeline gurus given any thought to the practical uses of a unified memory space or caches for 3D rendering?
The ultimate practical uses: simplify developers' lives and/or improve performance.
One benefit of Larrabee's approach is that the tile stays on-chip from start to finish. It seems that Fermi's L1/L2 won't be sufficient to handle both render targets and textures for something like this. Even 32x32 tiles (1024 work-items) will get heavy with 4 FP16 surfaces, and that's before AA. And then there's the primitive buffering and draw call dependency tracking that has to happen before all that.
Ignore Fermi's L2 size for a second, and don't think of a cache in terms of a local store; otherwise you'd be better off using a local store (and not waste area and power on something you don't need). Just because your entire data set doesn't fit in your cache doesn't automatically mean the cache can't help you. Texture caches are a perfect example of that.
 