NVIDIA Fermi: Architecture discussion

Ah ok - that makes sense, thanks! In order for things to work, the scheduler would have to make sure that no one thread or group of threads in a divergent warp monopolizes the execution HW. It's not clear to me that this is difficult to do (after all, you have all the relevant IPs hanging around), but I'm no HW expert.
 
The scheduler is tasked with prioritizing instructions in the queues based on various factors, and I believe age has been discussed as being one of them.
I'm not sure if it keeps an age vector that tracks which ones have gotten attention most recently, or if this status is tracked per-warp or per lane.
The coarser tracking would leave open a possible problem for synchronization within a warp.

Finer, per-lane tracking might be a way to get around one lane hogging execution.
Other prioritization factors (disproportionate memory traffic, register use) might also conspire to de-emphasize the lane or warp that has monopolized execution, at least in theory.
 
Sorry if I'm being thick, but I don't understand. Let's outlaw the use of __sync for a moment, and just reimplement barriers by having each participating work item atomically decrement a common counter value and then spin on the counter value until it becomes <= 0. In what way does the hardware (rather than buggy software) prevent forward progress?
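
Concretely, something like this minimal CUDA-style sketch is what I have in mind (names are made up, and the counter is assumed to be pre-initialized to the number of participating threads; whether it ever makes forward progress is exactly the question):
Code:
__device__ void homemade_barrier(volatile int* counter)
{
    atomicSub((int*)counter, 1);   // announce arrival
    while (*counter > 0) { }       // spin until everyone else has arrived
}

__global__ void uses_homemade_barrier(volatile int* counter)
{
    // ... some work ...
    homemade_barrier(counter);     // stands in for __syncthreads()
    // ... work that assumes every participating thread got here ...
}
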
[Edit] Yes, what dnavas said :)

The other "threads" in the group may not get to run *at all* until the next barrier sync - I don't believe there's any guarantee of this. Thus you can possibly/easily hang indefinitely just spinning on the first warp of threads. There may be a mechanism in place to prevent this but I do not believe that it is guaranteed in the execution model. Furthermore even a clever scheduler wouldn't solve this problem when it is internal to the SIMD lanes of one warp... which is incidentaly why putting a sync inside divergent control flow is disallowed to start with.

Obviously in trivial examples you can just add a "sync" in your spin loop, but as mentioned that does not interact with divergent control flow at all (it's disallowed there), which breaks the "independent thread" abstraction.
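
For instance, something like this CUDA-flavoured sketch (names made up; note that the entire loop has to be executed uniformly by every thread in the block, which is exactly the restriction in question):
Code:
__global__ void wait_for_flag(volatile int* flag)
{
    __shared__ int done;
    while (true) {
        if (threadIdx.x == 0)
            done = *flag;      // one thread polls global memory
        __syncthreads();       // publish 'done' to the whole block
        if (done)
            break;             // uniform exit: every thread sees the same value
        __syncthreads();       // keep thread 0 from overwriting 'done' early
    }
    // ... continue once the flag has been set (by the host or another block) ...
}
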
 
Just a quick note before I go to bed: the trick is obviously that NVIDIA has the respin-ready wafers parked at TSMC. They won't need 6 weeks to get hot lots given that; and they've presumably got enough wafers parked for mass production/initial availability, as Charlie himself previously reported, not just for hot lots. Of course, if they don't tape out very soon, even that won't be enough to get anything out this year...

Depends on where they parked the wafers. If they are parked at a point past where the changes called for by the new spin go in, they have a new name: scrap. If not, then you save time. I shaved two weeks off the respin time in my estimates to account for that.

Given how deep they are going down to find the bugs, or in this case not find them, I would lean more towards starting fresh. Then again, they could surprise.

The flip side to all of this is that if they did park said wafers really early in the process, then you shave less time off the total. Either way, they don't save much time, hence the two weeks.

-Charlie
 
The other "threads" in the group may not get to run *at all* until the next barrier sync - I don't believe there's any guarantee of this. Thus you can possibly/easily hang indefinitely just spinning on the first warp of threads. There may be a mechanism in place to prevent this but I do not believe that it is guaranteed in the execution model. Furthermore even a clever scheduler wouldn't solve this problem when it is internal to the SIMD lanes of one warp...

Even in the intra-warp case, it seems like the scheduler would have status information on the readiness of each lane; it would have to do something like round-robin over ready lanes when scheduling a particular warp to avoid the problems you and dnavas pointed out. That's in addition to making sure that no warp in a batch is starved (and if/when multi-kernel support gets introduced, that no batch is starved).
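
As a purely illustrative toy model of what I mean (no claim that the actual hardware does anything like this), "round-robin over ready lanes" would be something along these lines:
Code:
// Pick the next lane to serve from a 32-bit mask of ready lanes, starting
// just after the lane that was served last time, so no ready lane is starved.
unsigned next_lane_round_robin(unsigned ready_mask, unsigned last_issued)
{
    for (unsigned i = 1; i <= 32; ++i) {
        unsigned lane = (last_issued + i) % 32;
        if (ready_mask & (1u << lane))
            return lane;        // first ready lane after the last one served
    }
    return last_issued;         // nothing ready; the warp stalls
}
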
 
Even in the intra-warp case, it seems like the scheduler would have status information on the readiness of each lane; it would have to do something like round-robin over ready lanes when scheduling a particular warp to avoid the problems you and dnavas pointed out. That's in addition to making sure that no warp in a batch is starved (and if/when multi-kernel support gets introduced, that no batch is starved).
Yes obviously all of the information is there to treat it as a "real thread" if required (hell, they could just screw SIMD entirely and run each "thread" in one lane which would properly motivate the term thread!), but my guess is doing this properly is on the same order of difficulty as (and not unrelated to) dynamic warp formation.

If they take the high road and make this work properly in the future, that'll be great. My point here is just that even in the purely software realm, calling each SIMD lane a "thread" is misleading at best, since they *cannot* and *do not* run independently. This isn't just semantics, it affects the programming model and runtime (things like deadlocks) fundamentally.
 
Depends on where they parked the wafers. If they are parked at a point past where the changes called for by the new spin go in, they have a new name: scrap. If not, then you save time. I shaved two weeks off the respin time in my estimates to account for that.

The real question is how lucky they feel. Did they go all out and stockpile lots of wafers for a respectable launch, and risk taking a bath if they have to dump them? Or did they play it safe with just enough to trickle out over Q1, like you keep predicting?

It means they're not going to lower prices to compete. They'll just live with having reduced sales volume this quarter.

Which makes a lot more sense than the previously offered suggestions of them pulling out of the market. If Fermi delivers, then it'll set the stage for a nice reboot. If not, well, I guess it can't get much worse than it is already.
 
Andrew Lauritzen said:
Yes obviously all of the information is there to treat it as a "real thread" if required (hell, they could just screw SIMD entirely and run each "thread" in one lane which would properly motivate the term thread!), but my guess is doing this properly is on the same order of difficulty as (and not unrelated to) dynamic warp formation.
It's not that hard...
 
It's not that hard...
So why don't they do it? Why doesn't the programming model allow arbitrary syncs then? To a large extent, why do we even need syncs then vs. standard locks? I think this would go a long way to actually backing up the notion of them being independent "threads".
 
Nah.. not yet...

It's just doom and gloom. Nvidia had inventory issues with some launch cards because they over-ordered. They probably didn't order as many to begin with to avoid that with Fermi's coming launch.

I look at typical European stores such as overclockers.co.uk, and stores like Newegg, and still see plenty in stock.

This is more or less accurate. The thing that some "journalists" (and I use that term very loosely) like Charlie (et al.) don't understand is that, anytime a company is getting ready to come out with a brand new product stack that has far superior performance/features at similar price points compared to prior generation products, the company has to figure out a way to phase out prior generation products as it brings in new and more competitive products.

Having excess inventory of GT200-based GPUs a few months from now when the new Geforce products are available for purchase would be disastrous. How in the world will NVIDIA's partners get rid of these prior gen cards when the new cards from both ATI and NVIDIA far surpass them in terms of performance/features per dollar? The only way to get rid of these cards would be to drastically cut prices to the point where NVIDIA's partners would have to practically give away the cards at a loss or for zero profit. That makes little business sense.

Phasing out old and in some cases underperforming models (relative to the competition) to make way for brand new cutting edge models with superior performance/features is not as easy as it sounds. It can be a delicate balancing act. There is never a "good" or "easy" or "ideal" way to do it. A company has to make some tough decisions. NVIDIA's decision to restrict supply of underperforming GPUs makes some sense, as the alternative makes little business sense for NVIDIA and its partners.
 
Because resources are finite?
That's the same thing as being "hard". Everyone knows that it's possible to make a chip with 512 real scalar cores, as you're already doing 32 16-SIMD "scalar" cores. What's another factor of eight besides die size? But resources are finite and it's hard - well, impossible - to get the same computation density that way.
 
There is more to "resources" than just die area...
You're not adding anything to the conversation... either the eventual goal is to spend these finite "resources" on making these into *real* threads, or they shouldn't be called "threads". That's what this conversation evolved from, with the original point being that "thread" was a good name because they are indistinguishable from traditional threads from a SW perspective. I think I've more than shown that that isn't currently the case.

Thus either we're part of the way on the road to them being fully featured, or we're converging towards some other programming model in which the Khronos names are much more appropriate.

That's the entire conversation as I see it... if you have some light to shed on what you perceive to be the endgame for NVIDIA then I'd be interested. Otherwise I'm not sure what point you're trying to make.
 
I can't quite wrap my head around why this feature would be that expensive to implement (or perhaps Bob is referring to engineering/testing resources), and am very curious to know if Fermi will change any of this... As for syncthreads(), my feeling is that lock performance would be abysmal in comparison because locks must go through the memory subsystem and it seems to me that syncthreads() can be handled at the scheduler level. If performance of locks were as bad as I am thinking, then they may not be a useful feature to present at all...
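
To make concrete what I mean by locks having to go through the memory subsystem, here is a rough CUDA sketch (names made up; a sketch of the two paths, not a benchmark): __syncthreads() is a single instruction the SM's scheduler can resolve on-chip, while the lock below round-trips through global memory via atomics.
Code:
__device__ void acquire(int* lock)
{
    while (atomicCAS(lock, 0, 1) != 0) { }  // spin until we swap 0 -> 1
    __threadfence();                        // see the previous owner's writes
}

__device__ void release(int* lock)
{
    __threadfence();                        // publish our writes first
    atomicExch(lock, 0);                    // hand the lock back
}

__global__ void compare_costs(int* lock, int* counter)
{
    // Cheap path: a block-wide barrier handled at the scheduler level.
    __syncthreads();

    // Expensive path: one representative thread per block takes a global lock.
    // (Letting every lane of a warp contend on the same lock can hang on
    // current hardware, which is rather the point of this whole thread.)
    if (threadIdx.x == 0) {
        acquire(lock);
        *counter += 1;                      // critical section
        release(lock);
    }
}
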

Edit: I do agree that without this, it is not reasonable to call a work-item a thread.
 
You're letting Perfect be the enemy of Good. Things will get better over time, yes. Obviously, there's no timeline for anything.

If it turns out no one really cares about the limitations of diverging threads, then that problem won't be addressed for a while. If diverging thread behavior is more important than (for example) double-precision support, then the issues will be addressed sooner.

In the real world, there are real constraints.
 
Just a quick note before I go to bed: the trick is obviously that NVIDIA has the respin-ready wafers parked at TSMC. They won't need 6 weeks to get hot lots given that; and they've presumably got enough wafers parked for mass production/initial availability, as Charlie himself previously reported, not just for hot lots. Of course, if they don't tape out very soon, even that won't be enough to get anything out this year...

That would only make sense if A1 had only minor problems, but if you look at the rumored yields and frequency problems, it seems to have many big problems.

So if they do as you suggest, the chances are high that those wafers will become scrap.
 
Well, maybe it wouldn't be hard to implement, but it's certainly not trivial unless I'm missing something. Furthermore, if it were easy, I doubt they would have had the restriction in the first place.

Consider a kernel something like:
Code:
groupshared float x = 0;        // shared flag, initially zero

{
  if (threadID == 0) {          // Consumer: spin until the producer writes x
    while (x == 0) {}
  } else if (threadID == 1) {   // Producer: release the consumer
    x = 1;
  }
}

Now as mapped to SIMD lanes, this doesn't necessarily work properly. If the vectorized code decides to predicate the threadID branch and evaluate the consumer block first, it will never get out of the resulting while(), and hence never get into the producer block... deadlock - ouch.

To avoid this, you basically have to know to run that control block independently, and not predicated/SIMD in this case. You can't know this statically in the general case, so you need the hardware (or SW if targeting a SIMD ISA) to dynamically be evaluating arbitrarily different control flow paths generated by predication across the warp/SIMD lanes. Fully generally, this involves packing/unpacking masked warp lanes on the fly.

Simply always splitting on predication/control flow into separate warps would solve the deadlock problem, but would be pretty inefficient as it would imply that whenever a single lane diverges, it will never converge again even for simple, reducible control flow. Thus for a "proper" implementation, you also need to be able to detect re-convergence and pack separate warps back into a single one.

Note also that the sync() operations become ill-defined in this context... there's no longer any guarantee that a given "thread" will hit a given sync() (or any sync()) so it's unclear what the barrier should mean... all threads that take that control flow path? You end up with more utility from more traditional shared memory and atomic threading constructs.

That's why I say that this problem is fairly equivalent to dynamic warp formation, which appears to be a fairly difficult one to do efficiently: although it has had obviously huge benefits going back to the first time someone ran a ray tracer on a GPU, it has yet to show up in any hardware that I know of, and I'm pretty sure that if Fermi had it, they'd be making more noise about it.
 
Believe me, I'm aware of all these issues and how well the various solutions map to the architecture.
 