NVIDIA Fermi: Architecture discussion

Discussion in 'Architecture and Products' started by Rys, Sep 30, 2009.

  1. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,741
    Likes Received:
    105
    Location:
    Taiwan
    Well, for example, NVIDIA's hardware can't handle an irreducible flow graph if the "threads" belong to the same warp. This is a functional restriction, because it's actually a SIMD machine.

    Even if we consider only "CUDA/OpenCL programs," their behavior is clearly more similar to that of a SIMD machine than to that of a multithreaded machine. For example, inside a warp, these "threads" can only execute one branch path at a time. You don't find that restriction on normal threaded processors.
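
    As a hedged illustration (the kernel and all names here are my own sketch, not from NVIDIA's documentation), lanes of one warp that disagree on a branch get the two paths run in sequence, with the non-participating lanes masked off:

    Code:
    // Minimal CUDA sketch of intra-warp divergence (illustrative only).
    __global__ void divergent(int *out)
    {
        int lane = threadIdx.x % 32;        // position within the warp

        // Lanes of one warp disagree on the condition, so the hardware
        // executes BOTH paths in sequence, predicating off the lanes
        // that did not take the current path.
        if (lane < 16)
            out[threadIdx.x] = lane * 2;    // first pass: lanes 0..15
        else
            out[threadIdx.x] = lane + 100;  // second pass: lanes 16..31
    }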

    An advantage of using well-known nomenclature is that everyone who is familiar with common computer science terms can understand the behavior and characteristics immediately. People get a clear idea of G8X/GT200's behavior if I tell them "it's a SIMD machine with predication and gather/scatter." (Note that older HPC SIMD machines do have predication and gather/scatter; it's the contemporary MMX/SSE that lacks them.)

    Consider if someone proved a new geometric theorem but used "zine" instead of "cosine" and "zane" instead of "sine." Wouldn't you find that very confusing?
     
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,120
    Likes Received:
    2,867
    Location:
    Well within 3d
    The weakest term for "thread" I've seen is a "loci of execution", but that is probably the floor of how imprecisely the term can be taken.
    A component of execution within a SPMD context could potentially align with that.
    My one reservation is that the definition was qualified such that a locus can run on any processor, which may not hold for many or all GPUs.

    With the intervention of the setup hardware and schedulers, each lane has ownership of a program counter and private registers. With Fermi, a stack or something like it would be emulated by the hardware as well.
    The OS is not aware of these threads, however, though not all threads necessarily need to be exposed at that level, either.

    If the hardware didn't also handle a certain amount of separation of lane-specific context, it would be just a predicated SIMD lane.
    However, the hardware does maintain this state and handles it without the software intervening or handling this state explicitly.

    I agree that SIMT seems like an unnecessary addition; it seems to describe something little different from SPMD.

    Those bits wouldn't be capable of addressing their own state or driving program execution independently without explicit software management.

    Perhaps things have gotten that ridiculous.
    The way the hardware brings about a certain end result, whether by separate pipelines, multiple hardware threads, or a SIMD unit with internal state juggling, may not matter as long as it is software-transparent.
     
  3. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Sure, we can obviously make the term as general as we want, but I think the balance of usage and understanding is with respect to traditional OSes and POSIX-like threads.

    Fermi definitely brings things closer (theoretically), but there are still missing capabilities like irreducible control flow, the ability to spawn/fork "threads" on the fly, etc. that are expected out of the term "thread" and the traditional understanding of it. If they can do an efficient pthreads implementation in terms of Fermi, I'll be ecstatic, but I don't think they're there yet...

    Now I'm not arguing that pthreads are the most efficient way to write massively data-parallel code going forward, but this is the typical understanding of the term "thread" and I see no reason to confound that.

    Sure but neither are CUDA "threads" capable of spawning additional work or rescheduling SIMD lanes on the fly between kernels. All of that requires "explicit software management" and in this case, that management only happens from the CPU!

    Don't underestimate the importance of this... right now with any amount of "braided" parallelism, you're faced with the ridiculous choice of dumping all "shared memory" state to VRAM, or simply idling some large number of "threads" and wasting processor resources. Either that, or you have to implement work stealing in software, at which point you begin to write "threads" as 1:1 with the *real* hardware threads, further demonstrating the misuse of the term in the first place. Even then, in practice you still end up having to dump all your state out to VRAM when switching to another kernel, even if that kernel inputs the exact same data set...

    To some extent that's definitely true, but not only are we still in a different programming model than typical pthreads (both in terms of implementation and functionality), but to write efficient code on these machines you have to have a good understanding of how it maps to the hardware SIMD units. OpenCL nicely separates these concepts by abstracting the concept of a "work item" from that of a traditional CPU thread, which is the right thing to do IMO. You could argue that hyper-threading is a similar concept in that it isn't giving you two 100% independent, 2x-throughput threads, but with hyper-threading you still write code as if you had two completely separate execution resources. With SIMD you don't: the reality of the execution model greatly affects everything from control flow to data structures. HPC people at least are used to dealing with SIMD, though, so even if you don't expose it in the typical programming models, it's worth being clear and up-front about what is happening so that people can apply their experience properly to the given hardware. Terminology is a big part of this, and reusing the term "thread" is just confusing.
     
  4. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Is this going to pose problems in trying to organize work items into groups that branch and fetch as coherently as possible?

    For example, if you have a million work items for raytracing that you enumerate left to right, then top to bottom, how is OpenCL going to know that 8x8 blocks will branch and access data much more coherently than 64x1?

    Or do we just have to program it so that items with nearby IDs should be as coherent as possible?
     
  5. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    Try "Independent loci of execution". The Independent part is the important part...

    Not really. They might SAY it does. Doesn't mean it does.


    What? You mean it has vector CONSTANTS? OMG. Call the patent office.

    So, outside of a vector register that has data on a per-lane basis... um, wait. That's like every SIMD ever designed.

    So you've seen the internal code that runs on the G80?
     
  6. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,120
    Likes Received:
    2,867
    Location:
    Well within 3d
    I'd hate to appeal to authority, but at least as far as Patterson and Hennessy are concerned, the term "thread" sans additional clarification has devolved to a shorthand.

    A CPU that handles 4 "threads" uses a very basic definition of the term, as it doesn't discriminate as to whether it's a kernel thread or a pthread, or some other specific variety of thread. In this case, a thread is an atom of program execution state. There is some amount of essential "threadness" for which other distinctions are a collection of clutter and special-cases.

    As far as SIMT goes as a categorization, it is focused at the level of the hardware execution model.

    I don't know if Fermi's control flow handling has changed with regards to irreducibility.
    The idea that an implementation needs to be efficient runs into a basic question of whether there is some arbitrary cut-off where a thread isn't a thread if it doesn't have a certain amount of performance.
    From a practical standpoint, there are usability concerns.
    From an algorithmic standpoint, how quickly such cases are handled is beside the point if the hardware somehow can handle it internally.

    As far as standard threads go, is there really any way to fork or spawn new threads without some procedure call or explicit instruction in software?

    What would be missing would be a procedure call that is capable of setting up an additional compute kernel or component thereof.
    Whether it was particularly fast or efficient wouldn't impact the existence or non-existence of this missing piece.

    Can you elaborate on the point about rescheduling SIMD lanes on the fly between kernels?

    If we are debating what the essential nature of a thread is, this is an implementation detail of significant practical importance, but little theoretical relevance.
    The thread isn't going to know how many other execution units or warp slots went idle, or how much was flushed to memory.
    Algorithmically, it won't impact the result.

    There may be other deficiencies, like irreducible control flow, but efficiency doesn't lend itself to non-arbitrary distinctions.
    Maybe I can live with an IPC of .00000001 for certain parts of thread behavior, whilst you wouldn't.

    The program stream treats each thread as it always has treated it: a sequential series of operations. Throughput involves a time dimension, for which there is no hard limit of what is acceptable. Other than algorithmic steps or a timer function, the instruction streams are essentially timeless.

    CPUs are considered embodiments of the Von Neumann machine, but they suck at self-modifying code, with multiple orders of magnitude drops in performance.
    Is there a non-arbitrary distinction drawn in this instance other than "this way is not fast enough"?

    As far as a physical implementation of an execution engine goes, we've probably long passed this point.
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Potentially. I'm sure there are CUDA programmers who get a queasy feeling from losing warp-based indexing - and some of the Scan related stuff NVidia does is based on warp size and its interaction with shared memory.

    (Indeed that Scan patent uses a 32-bit predicate - so it's very convenient for warps to be 32-wide.)
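
    As an aside, here is a sketch of the kind of warp-wide predicate trick that convenience enables (this uses the Fermi-era __ballot intrinsic; the helper itself is my own illustration, not the patent's code):

    Code:
    // Count how many lanes below this one have pred set: the core of a
    // warp-level scan over a 32-bit predicate mask (CUDA, sm_20+).
    __device__ int warp_prefix_count(bool pred)
    {
        unsigned lane = threadIdx.x % 32;          // lane index in warp
        unsigned mask = __ballot(pred);            // one bit per lane
        return __popc(mask & ((1u << lane) - 1));  // set bits below us
    }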

    Fermi might deal a deathblow to that distinction... (Doesn't CUDA already provide efficient linear access to memory? Can't remember.)

    From the little playing I've done, Z-order optimisation is a pain in the neck. prunedtree got his efficient matrix multiplication only by tackling that head-on. AMD didn't get there and they're the ones who built the damn thing.

    There are attributes of the OpenCL device you can query:

    http://forum.beyond3d.com/showpost.php?p=1320595&postcount=117

    and you can write self-hardware-profiling applications that tune parameters of the kernel to the device in question. That is, of course, if you don't decide merely to tune for the few processors you have to hand and leave it at that.

    How many different processor architectures is a given OpenCL application ever going to appear on? AMD is helping things along nicely by cutting off everything earlier than RV770.

    Or just have hardware-DWF?...

    Oh, you could make the work group match the dimensions/layout that suits the hardware device, e.g. 8x8 on ATI, or 32 work items for NVidia (4x8? 1x32? etc.).

    Jawed
     
  8. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,120
    Likes Received:
    2,867
    Location:
    Well within 3d
    Independent to what extent or in what way?

    They are capable upon hitting divergent branch outcomes of reaching a state in each lane consistent with what would happen if they were running separately.
    With Fermi's more fully realized pointer support, the set of points each lane can go to in instruction memory independently is even wider.

    As long as this is consistent from the POV of one instruction to the next, it doesn't matter if they had to take turns or one lane was predicated off for half the run.

    You're right. I do have to take them on their word at some level.
    They say the hardware handles this.
    It could be an elaborate ruse where there are magical gnomes who push little levers whenever a status register is flipped too.
    There might be unicorns running in my Pentium 4, as well, but I won't hold it against Intel if they don't suffer from confined conditions.

    The registers defined for a particular thread context are not accessible by other lanes, barring explicit sharing through some other method. Unless, that is, you think it would be valid, while running a shader, for a GPR access from every pixel to pull from the exact same physical register.

    Even then, if that value is determined to be the same value and the results are valid, it wouldn't matter algorithmically.

    A hardware-managed set of data or one requiring dedicated instructions and algorithmic modification to manage it?
    Should I start bitching about how that stupid P4 keeps putting all my neatly ordered instructions out of sequence, but then sneakily puts them all back?
    I hate lying silicon.

    I guess everything ever stated about the architecture could be a lie.
    I'm not sure I can live with that level of extreme distrust.
    I'm not even sure now if what you've just written is English, or merely a foreign language with this post just happening to look the same as English, and in actuality you are posting a recipe for a nice beef stew.

    I have to start from somewhere.
     
  9. Karoshi

    Newcomer

    Joined:
    Aug 31, 2005
    Messages:
    181
    Likes Received:
    0
    Location:
    Mars
    Pfffft, I laugh at Fermi with its laughable 30k threads.
    Linux was supporting HUNDREDS OF THOUSANDS of threads on x86 years ago.
    Yes, some people tested that.
     
  10. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    All other threads. Nvidia fails this one by the way...

    So is any other SIMD, but only with SEVERE restrictions on the code flow. If, however, I want to start doing divergent code flows PER datum without re-convergence, it won't work, because it isn't actually doing anything but predicating results and running the same code on every datum. I.e., independent implies independence, which requires that each can do something entirely different with no convergence. In other words, I should be able to have one thread sitting in a polling wait loop without affecting the forward execution of any other thread. This is clearly something that current Nvidia hardware isn't capable of.
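
    A hypothetical sketch of that failure mode (the names are mine, and whether a given chip actually hangs depends on which divergent path its scheduler picks first): on a warp that serializes its divergent paths, a lane spinning on a flag that a sibling lane is supposed to set can wedge the whole warp, where truly independent threads would make progress:

    Code:
    // Sketch: an intra-warp polling loop that independent threads would
    // survive, but a warp serializing its divergent paths may not.
    __device__ volatile int flag = 0;

    __global__ void poll_within_warp()
    {
        if (threadIdx.x == 0) {
            while (flag == 0)   // lane 0 spins waiting on lane 1...
                ;               // ...which is masked off on this path
        } else if (threadIdx.x == 1) {
            flag = 1;           // may never run: the warp can be stuck
                                // executing lane 0's branch forever
        }
    }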

    Data pointers != instruction pointers. And once again, I highly doubt that different lanes will be executing different instructions.


    no, you don't, not really.


    You mean a reg[X][LANE_INDEX_Y] isn't accessible by LANE[LANE_INDEX_Z]? Once again, call the patent office. Considering their compiler is explicitly generating the code, unless there is a need for it, I would doubt they'd allow per-lane constants to be used out of lane. You seem to be under the impression this is hard.

    Well, it is from Nvidia marketing, and we do know that a large percentage of what Nvidia originally said about the G80 architecture WAS A LIE! It's not like people are going out of their way to second-guess Nvidia, since they started it all with their wild obfuscation at the G80 launch.

    Then start here: who has gone out of their way to not only flat out LIE about their architecture but to also do their best to confuse established terminology to paint their architecture in a more favorable light?

    The viewpoint that we cannot trust what Nvidia actually says about their hardware was only created because Nvidia itself taught us that we could not trust what they say about their hardware.

    You are trying to portray it as if I'm asking you to doubt the truthfulness of Mother Teresa where the reality is I'm saying you need to be careful about trusting campaign promises.
     
  11. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
    It didn't like my fork bomb though.
     
  12. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    While that may be true, this confounding of the terms is responsible in no small part...

    Sure, although even without that, can Fermi run a full pthreads implementation at all? Can I spawn/fork a new thread in a kernel? Real function calls get us into the space of being able to nicely implement coroutines, but what about real HW "threads" that get scheduled and preempted by the hardware? This can definitely involve some software support, but I'm not sure it's possible at all without dumping all local memories and dispatching a completely separate kernel, which hardly fits the typical definition of fork/join.
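
    For reference, the kind of dynamic spawning the term traditionally implies is a one-liner with pthreads (a plain host-side example, nothing GPU-specific):

    Code:
    #include <pthread.h>
    #include <stdio.h>

    void *worker(void *arg)       // body of the dynamically spawned thread
    {
        printf("child %ld running\n", (long)arg);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, worker, (void *)1L);  // spawn mid-execution
        pthread_join(t, NULL);                         // fork/join semantics
        return 0;
    }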

    Sure - one often needs to do some data-parallel work across an N element array, and then do some work across M elements, and then back to N, etc. etc. When N != M running this "efficiently" on the current "inside-the-inner-loop" threading/programming models involves either dumping all state and re-dispatching a new kernel (from the CPU) with a different number of "threads", or writing code like:

    Code:
    if (threadIdx.x < N) { ... }   // only N "threads" do useful work here
    __syncthreads();               // barrier; every thread must reach it
    if (threadIdx.x < M) { ... }   // now only M participate
    __syncthreads();
    ...
    This is both ugly and fairly inefficient, particularly when it starts to combine with other control flow, which cannot contain __syncthreads() barriers within it for obvious reasons.

    Now obviously a lot of this is a programming model problem, but that's actually entirely the point if thread is intended to be redefined in terms of the interface to the programmer. Sure the above code operates as expected, but it's a hell of a lot less efficient than writing an actual 1:1 mapping of kernel execution to hardware "core"/thread/cluster. You can kind of do this within the above programming model with some cleverness, but then the things that you're writing end up being called "groups" in the API terminology and your "threads" become abstract entities that get repartitioned as your parallelism braids... this is clearly not the same thing as writing typical CPU code which would equate threads to those groups and SIMD lanes to those threads. This is confusing.

    That would be true if it were hidden by the programming abstraction but it is not. If you indeed do want to dump all of your local data and re-dispatch, you have to write all that code yourself, and it is not easily wrapped up in any sort of library, as it is kernel-specific.

    Ah, but see the way you have to write the code above; it's very clear from the code that max(N,M) - N (respectively, - M) resources are "going idle" for at least the duration of evaluating the branch condition and any scheduling overhead. In fact, given the "threading" model, there is no way to avoid this.
     
  13. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Yes, and incidentally this is equivalent to the aforementioned restriction that you can't have a __syncthreads() inside (potentially) divergent control flow.
     
  14. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    I think "lie" is a strong word. Historically, marchitecture in the industry has always been somewhat loopy and loose with terminology. I've even noticed people in the forums being fuzzy and making the same mistake ("I need to run X 'threads' to cover the latency of the texture fetch") Basically, companies seek to enter markets by selling them using familar terms people already heard about. There have been huge arguments over RISC/CISC marketing terminology for example, and the dishonesty in the relational database market is legendary.

    It's regrettable, but I think unavoidable. For example, the average person has a vague notion of what "cores" are in their CPU, so GPU vendors are going to try to analogize GPU ALUs to cores, rather than explain the difference to the average joe.
     
  15. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,120
    Likes Received:
    2,867
    Location:
    Well within 3d
    If the outcome from running code over every datum with divergent lanes predicated off is not different from a single step progression of as many independent threads, that alone wouldn't be enough to discount it.

    However, I would accept that these are not threads if there are certain operations defined for independent threads that fail in the SIMD case.
    The mentioned sync restrictions would be a failure of the hardware to hide the implementation details from the nominally independent instruction streams.

    To work around this, the implementation possibly could do something like a round-robin switch between each divergent path to allow forward progress on each, though it doesn't seem that Nvidia has done this.

    Fermi has support for indirect branches. Depending on the sequence, we could see different lanes wandering off far afield.
    That they can't exist in the same EX stage at the same instant is an implementation detail that is not necessarily software-visible.

    I'm debating the version of the hardware Nvidia put forward to the public.
    If it turns out that they lied, I will admit that the debate has little practical relevance, but the points I am making would still hold for a design that actually follows Nvidia's claims.

    I don't see why I'd need to. The concept of a thread is public domain.
    There is an architecturally defined, software-visible state. Part of it is that each "thread" has certain private contexts other threads cannot access.
    GPR1 for thread0 should not be subject to interference from thread1.

    I don't need it to be patent-worthy to state that if the operand collectors and schedulers maintain this independence even if these contexts are stored in physical proximity, they are independent contexts.

    SMT OOE processors that use RATs to point to some unique architectural register in a sea of physical registers do the same thing. As long as the external view is consistent, I do not see a problem.

    The compiler generates an instruction with an op code and register addresses that gets broadcast to all the lanes in a SIMD.
    If every lane is told to access register 5, they don't all access the exact same physical register; the succession of units from the scheduler to the operand collectors translates this to the correct physical location.
    It has to. Going by Realworldtech's description of the instruction queues in Fermi, they are likely not large enough to contain separate register addresses per lane: 3 operands * 16 lanes * 7 bits (assuming 128 software-addressable registers per lane context) would be 336 bits per issue.
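
    As a toy model of that translation (my own illustration, not Fermi's actual allocation scheme), a broadcast architectural register number can be resolved to a distinct physical location per lane with nothing more than base-plus-offset arithmetic:

    Code:
    // Toy register-file mapping: every lane is told "register 5", but
    // each resolves it to its own slice of the physical register file.
    __device__ int physical_index(int warp_base, int lane,
                                  int arch_reg, int regs_per_lane)
    {
        return warp_base + lane * regs_per_lane + arch_reg;
    }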

    Can I have a list of the elements of the warp issue and instruction behaviors Nvidia lied about?
    The general scheme has been described rather consistently by Nvidia and those who have analyzed the architecture.

    The marketing BS with cores I am aware of, but what parts of their description of the actual implementation of the execution engine are lies?

    It's a waste of time for me to debate anything about an architecture if everyone has to step back and wonder what the definition of "is" is.

    I've drawn the line where I am taking the generally accepted idea of a SIMD hardware unit residing within a complex scheduler and register access scheme that tries to make implicit the handling of divergent execution through hardware.
    You are right that I have to take Nvidia's word on this, including the architectural papers, patents, and the tech press which admittedly varies in quality but has a number of places I do trust to do a quality job of analysis.


    The primary problem I see, after having time to sleep on it, is that there have been two separate conventions for the use of the word thread. One from the software and OS point of view, and one from the hardware point of view.
    A kernel thread or userspace thread runs with little distinction from the point of view of the execution engine. A lot of the major distinctions do not reside in hardware but involve conventions or structures maintained at a higher level.

    Given that there are processors out there that run in very different or embedded environments that may not be able to run a full Unix-style thread, it would seem incongruous to describe them as being 0-thread architectures.

    I'm not sure that it can. A significant amount of the framework is there.
    Is preemption necessary, though? There are probably simple cores that don't do this, but we don't relegate them to a category of 0-threads. The Cell SPEs are allowed to run until they finish their one applet/program/shader.

    Given how the register files are likely allocated, it may need to happen this way in at least some cases. There's no guarantee that there will be any location not already partitioned for another thread in the kernel.

    Maybe if instead there were an internal convention for the scheduler where a fork sent a thread's context to a buffer that was structured so that the global scheduler could generate a new warp, but I don't think that has been mooted for Fermi.

    Thanks.
    I was coming at it from the starting point of "what if you had N threads and hit a phase where you only need M?" which would have required some additional wrangling in each situation.

    The ugly part I can accept as not being a problem so long as the individual threads see a consistent behavior. The fact there are problems with synchronization would be a fault in the hardware's charade.

    The thing is that Fermi's exception handling does back up context. If the hardware is able to track this, can't this be leveraged?
    If not in this yet-to-be product, it might be possible as a next step.
     
  16. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Efficient access is easy, but caching is another story. If you have 16 warps forming a 1024x1 scanline, your data is going to be much less cache-friendly than if it formed a 32x32 block. Similarly, a 64x1 scanline will cross more structures than an 8x8 block, resulting in less coherent branches. But I suppose if NVidia and ATI agree on a numbering method for thread IDs in warps, then we could write a program that has a tiling method for converting IDs into ray origins (or matrix indices, etc.).

    Looking at NVidia's document, I think that's how it's going to be done:
    http://www.nvidia.com/content/cudazone/download/OpenCL/NVIDIA_OpenCL_ProgrammingOverview.pdf (page 18)
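
    For what it's worth, a minimal sketch of the kind of ID-to-tile remapping I mean (the 8x8 tile size and the row-major enumeration of tiles are my assumptions, nothing mandated by either vendor):

    Code:
    // Remap a linear work-item ID into 8x8 tiles across a width-W image,
    // so nearby IDs touch nearby pixels (better branch/fetch coherence).
    __device__ void id_to_tiled_xy(int id, int W, int *x, int *y)
    {
        const int T = 8;              // tile is T x T work items
        int tilesPerRow = W / T;      // assume W is a multiple of T
        int tile   = id / (T * T);    // which tile this ID lands in
        int within = id % (T * T);    // position inside that tile
        *x = (tile % tilesPerRow) * T + within % T;
        *y = (tile / tilesPerRow) * T + within / T;
    }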

    Sorry, I don't know what DWF is, aside from the Autocad format ;)

    I thought about that, but if you use local memory then such a method is likely to be very inefficient. I may have something in local memory that I want thousands of other threads to use (read and write). If my code results in 32 work groups in flight, then your method means I can only use 1/32 of the local memory and need to initialize it (if necessary) 32 times.
     
  17. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,120
    Likes Received:
    2,867
    Location:
    Well within 3d
    Dynamic Warp Formation is what I think DWF stands for.

    It would significantly reduce the SIMD throughput drops due to branch divergence.
    If a warp or warps get split up too much, reform the warps to match.
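
    A toy, host-side model of the idea (entirely my own illustration; the real proposal puts this in the scheduler hardware): bucket the runnable lanes by their current PC, then emit full-width warps from each bucket:

    Code:
    #include <algorithm>
    #include <map>
    #include <vector>

    // Toy model of dynamic warp formation: lanes that diverged to
    // different PCs are regrouped so each new warp shares a single PC.
    struct Lane { int id; int pc; };

    std::vector<std::vector<Lane>> form_warps(const std::vector<Lane> &runnable)
    {
        std::map<int, std::vector<Lane>> byPC;   // bucket lanes by PC
        for (const Lane &l : runnable)
            byPC[l.pc].push_back(l);

        std::vector<std::vector<Lane>> warps;    // emit warps of up to 32
        for (auto &entry : byPC) {
            auto &lanes = entry.second;
            for (size_t i = 0; i < lanes.size(); i += 32) {
                size_t end = std::min(i + 32, lanes.size());
                warps.emplace_back(lanes.begin() + i, lanes.begin() + end);
            }
        }
        return warps;
    }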
     
  18. dnavas

    Regular

    Joined:
    Apr 12, 2004
    Messages:
    375
    Likes Received:
    7
    Although, now thinking about it in the context of this whole "real threads" .. umm .. discussion, I wonder if DWF isn't breathing life into the quaint and aging assumption that everyone and their brother is always running the same instruction. Seems like the cost of DWF and MIMD might be similar (difference between finding 32 runnable "threads" and finding 32 runnable "threads" all at the same PC is ..?), and we'd get better advantages from MIMD, even if we do need larger instruction caches?

    -Dave
     
  19. RoOoBo

    Regular

    Joined:
    Jun 12, 2002
    Messages:
    308
    Likes Received:
    31
    In my opinion, the middle ground is that in the limited programming environments used to write programs for GPUs (be it OpenGL, Direct3D, or even CUDA or OpenCL), the processing elements can be interpreted as running in different threads/contexts. There is nothing in the various shader languages that defines a dependence or aggregation between the different elements executing the same shader/kernel/program (even if it's required, for example, for normal implementations of mipmapped texture sampling).

    The current programming environment would allow for a hardware implementation where each element is effectively processed as a thread.

    However, that's not how the actual existing hardware works (well, maybe except for whatever 'magic' SGX uses for ultra-low-end graphics, where it makes a lot of sense to work on a single pixel... though I wonder what they do to compute the texture LOD). All GPUs pack those processing elements into vectors of a given length (16, 32, 64, ...) and process them as a single monolithic group. GPU threads process vectors of elements, not single elements, and that's a fact.

    From a software point of view, in the programming environments we are working with, a pixel can be understood as a thread, but on the actual hardware where those programs are executed it isn't processed as such.
     
  20. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,418
    Likes Received:
    411
    Location:
    New York
    Well sure, but we have folks dismissing the software perspective as irrelevant. Also, it's not clear that there is a strict independence among "threads". For example, a sort based on the scan primitive outlined in that patent Jawed linked a while back is certainly predicated on co-operation between SIMD lanes.
     