So is any other SIMD, but only with SEVERE restrictions on the code flow. If, however, I want to start doing divergent code flows PER datum without re-convergence, it won't work, because it isn't actually doing anything but predicating results and running the same code on every datum. I.e., independent implies independence, which requires that each thread can do something entirely different with no convergence.
If the outcome of running code over every datum with divergent lanes predicated off is no different from a single-step progression of as many independent threads, that alone wouldn't be enough to discount it.
In other words, I should be able to have one thread sitting in a polling wait loop without affecting the forward execution of any other thread. This is clearly something that current Nvidia hardware isn't capable of.
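To make that concrete, here's a hypothetical CUDA-style sketch (the kernel and flag location are invented for illustration). On lockstep SIMD hardware, whichever side of a divergent branch the scheduler runs first can spin forever, while the other side, which would release it, never issues:

Code:
__global__ void pollingWait(volatile int *flag)
{
    if (threadIdx.x == 0) {
        // Spin until lane 1 sets the flag. On truly independent
        // threads this always terminates. On a warp that runs one
        // divergent path to completion before starting the other,
        // this can hang forever if this path is scheduled first.
        while (*flag == 0) { }
    } else if (threadIdx.x == 1) {
        *flag = 1;
    }
}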
However, I would accept that these are not threads if there are certain operations defined for independent threads that fail in the SIMD case.
The mentioned sync restrictions would be a failure of the hardware to hide the implementation details from the nominally independent instruction streams.
To work around this, the implementation could plausibly do something like a round-robin switch between each divergent path to allow forward progress on each, though it doesn't seem that Nvidia has done this.
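As a sketch of what I mean, here's a toy host-side model (names and structure are mine; nothing is claimed about Fermi's actual scheduler). The warp sequencer alternates issue slots between the two divergent paths, so a spin loop on one path cannot starve the other:

Code:
#include <cstdio>

// One divergent path of a warp: the lanes following it and their
// shared program counter.
struct Path { unsigned laneMask; int pc; };

// Alternate issuing one instruction from each path per step.
void stepWarp(Path paths[2], int &turn)
{
    Path &p = paths[turn];
    printf("issue pc=%d mask=%08x\n", p.pc++, p.laneMask);
    turn ^= 1;  // the other path gets the next issue slot
}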
Data pointers != instruction pointers. And once again, I highly doubt that different lanes will be executing different instructions.
Fermi has support for indirect branches. Depending on the sequence, we could see different lanes wandering off far afield.
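For instance, a hypothetical CUDA sketch, assuming the device function pointers Fermi is documented to support (the function names are mine): each lane picks its own branch target from per-lane data, so the lanes of one warp can head to entirely different code.

Code:
typedef float (*UnaryFn)(float);

__device__ float addOne(float x) { return x + 1.0f; }
__device__ float twice (float x) { return 2.0f * x; }

__global__ void dispatch(float *data)
{
    // Per-lane indirect branch: the target depends on this lane's data.
    UnaryFn table[2] = { addOne, twice };
    int tid = threadIdx.x;
    data[tid] = table[tid & 1](data[tid]);
}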
That they can't exist in the same EX stage at the same instant is an implementation detail that is not necessarily software-visible.
no, you don't, not really.
I'm debating the version of the hardware Nvidia put forward to the public.
If it turns out that they lied, I will admit that the debate has little practical relevance, but I maintain that a design actually following Nvidia's claims could do what I am discussing.
You mean a reg[X][LANE_INDEX_Y] isn't accessible by LANE[LANE_INDEX_Z]? Once again, call the patent office.
I don't see why I'd need to. The concept of a thread is public domain.
There is an architecturally defined, software-visible state. Part of it is that each "thread" has certain private contexts other threads cannot access.
GPR1 for thread0 should not be subject to interference from thread1.
I don't need it to be patent-worthy to state that if the operand collectors and schedulers maintain this independence, then even if these contexts are stored in physical proximity, they are independent contexts.
SMT out-of-order processors that use RATs to point to some unique architectural register in a sea of physical registers do the same thing. As long as the external view is consistent, I do not see a problem.
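A toy version of that mapping, with invented sizes and names: each thread's architectural register number is just an index into a per-thread table pointing somewhere in a shared physical file, so "GPR1 for thread0" and "GPR1 for thread1" never collide.

Code:
const int THREADS = 2, ARCH_REGS = 32, PHYS_REGS = 128;

int rat[THREADS][ARCH_REGS];   // per-thread register alias table
long long prf[PHYS_REGS];      // shared sea of physical registers

long long readReg(int thread, int archReg)
{
    // Same architectural name, different physical home per thread.
    return prf[rat[thread][archReg]];
}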
Considering their compiler is explicitly generating the code, I would doubt they'd allow per-lane constants to be used out of lane unless there were a need for it. You seem to be under the impression this is hard.
The compiler generates an instruction with an opcode and register addresses that gets broadcast to all the lanes in a SIMD.
If every lane is told to access register 5, they don't all access the exact same physical register; the succession of units from the scheduler to the operand collectors translates this to the correct physical location.
It has to. Going by Realworldtech's description of the instruction queues in Fermi, they are likely not large enough to contain 3*16 separate 7-bit register addresses per issue (assuming 128 software-addressable registers per lane context).
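A minimal sketch of the kind of translation I mean, assuming a per-lane-banked register file (the layout and constants are invented, not Nvidia's documented scheme): the broadcast instruction carries one 7-bit register number, and each lane's slot falls out of simple addressing.

Code:
const int LANES = 16;            // lanes per SIMD unit in this sketch
const int REGS_PER_LANE = 128;   // one 7-bit architectural register field

// Map one broadcast architectural register number (archReg in
// [0, REGS_PER_LANE)) to a distinct physical slot per lane.
int physIndex(int warpBase, int archReg, int lane)
{
    return warpBase + archReg * LANES + lane;
}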
Well, it is from Nvidia marketing, and we do know that a large percentage of what Nvidia originally said about the G80 architecture WAS A LIE!
Can I have a list of the elements of the warp issue and instruction behaviors Nvidia lied about?
The general scheme has been described rather consistently by Nvidia and those who have analyzed the architecture.
The marketing BS with cores I am aware of, but what parts of their description of the actual implementation of the execution engine are lies?
You are trying to portray it as if I'm asking you to doubt the truthfulness of Mother Teresa where the reality is I'm saying you need to be careful about trusting campaign promises.
It's a waste of time for me to debate anything about an architecture if everyone has to step back and wonder what the definition of "is" is.
I've drawn the line at the generally accepted idea of a SIMD hardware unit residing within a complex scheduler and register access scheme that tries to make the handling of divergent execution implicit in hardware.
You are right that I have to take Nvidia's word on this, including the architectural papers, patents, and the tech press which admittedly varies in quality but has a number of places I do trust to do a quality job of analysis.
While that may be true, this confounding of the terms is responsible in no small part...
The primary problem I see, after having time to sleep on it, is that there have been two separate conventions for the use of the word thread. One from the software and OS point of view, and one from the hardware point of view.
A kernel thread or userspace thread runs with little distinction from the point of view of the execution engine. A lot of the major distinctions do not reside in hardware but involve conventions or structures maintained at a higher level.
Given that there are processors out there that run in very different or embedded environments that may not be able to run a full Unix-style thread, it would seem incongruous to describe them as being 0-thread architectures.
Sure, although even without that, can Fermi run a full pthreads implementation at all? Can I spawn/fork a new thread in a kernel? Real function calls get us into the space of being able to nicely implement coroutines, but what about real HW "threads" that get scheduled and preempted by the hardware?
I'm not sure that it can. A significant amount of the framework is there.
Is preemption necessary, though? There are probably simple cores that don't do this, but we don't relegate them to a category of 0-threads. The Cell SPEs are allowed to run until they finish their one applet/program/shader.
This can definitely involve some software support, but I'm not sure it's possible at all without dumping all local memories and dispatching a completely separate kernel, which hardly fits the typical definition of fork/join.
Given how the register files are likely allocated, it may need to happen this way in at least some cases. There's no guarantee that there will be any location not already partitioned for another thread in the kernel.
Maybe if instead there were an internal convention for the scheduler where a fork sent a thread's context to a buffer that was structured so that the global scheduler could generate a new warp, but I don't think that has been mooted for Fermi.
Sure - one often needs to do some data-parallel work across an N-element array, then some work across M elements, then back to N, etc. When N != M, running this "efficiently" on the current "inside-the-inner-loop" threading/programming models involves either dumping all state and re-dispatching a new kernel (from the CPU) with a different number of "threads", or writing code like the snippet below:
Thanks.
I was coming at it from the starting point of "what if you had N threads and hit a phase where you only need M?" which would have required some additional wrangling in each situation.
Code:
if (threadIdx.x < N) { /* phase over N elements */ }
__syncthreads();
if (threadIdx.x < M) { /* phase over M elements */ }
__syncthreads();
...
This is both ugly and fairly inefficient, particularly when it starts to combine with other control flow, which cannot contain __syncthreads() within it for obvious reasons.
The ugly part I can accept as not being a problem, so long as the individual threads see consistent behavior. The fact that there are problems with synchronization would be a fault in the hardware's charade.
That would be true if it were hidden by the programming abstraction but it is not. If you indeed do want to dump all of your local data and re-dispatch, you have to write all that code yourself, and it is not easily wrapped up in any sort of library, as it is kernel-specific.
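For the record, the shape of that hand-written code, as a hypothetical host-side CUDA sketch (phaseN/phaseM and the staging through d_state are invented for illustration):

Code:
__global__ void phaseN(float *state) { /* N-thread phase */ }
__global__ void phaseM(float *state) { /* M-thread phase */ }

void run(float *d_state, int N, int M)
{
    // All per-thread state must be parked in d_state between phases,
    // because the second launch gets a fresh set of "threads".
    phaseN<<<1, N>>>(d_state);
    cudaDeviceSynchronize();
    phaseM<<<1, M>>>(d_state);
    cudaDeviceSynchronize();
}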
The thing is that Fermi's exception handling does back up context. If the hardware is able to track this, can't this be leveraged?
If not in this yet-to-be product, it might be possible as a next step.