NVIDIA Fermi: Architecture discussion

Discussion in 'Architecture and Products' started by Rys, Sep 30, 2009.

  1. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,741
    Likes Received:
    105
    Location:
    Taiwan
    Well, for example, NVIDIA's hardware can't handle an irreducible flow graph if the "threads" belong to the same warp. This is a functional restriction, because it's actually a SIMD machine.

    Even if we consider only "CUDA/OpenCL programs," their behavior is clearly more similar to that of a SIMD machine than to that of a multithreaded machine. For example, inside a warp, these "threads" can only execute one branch path at a time. You don't find that restriction on normal threaded processors.
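
    As a hedged illustration (the kernel and all names here are my own sketch, not from NVIDIA's documentation), lanes of one warp that disagree on a branch get the two paths run in sequence, with the non-participating lanes masked off:

    Code:
    // Minimal CUDA sketch of intra-warp divergence (illustrative only).
    __global__ void divergent(int *out)
    {
        int lane = threadIdx.x % 32;        // position within the warp

        // Lanes of one warp disagree on the condition, so the hardware
        // executes BOTH paths in sequence, predicating off the lanes
        // that did not take the current path.
        if (lane < 16)
            out[threadIdx.x] = lane * 2;    // first pass: lanes 0..15
        else
            out[threadIdx.x] = lane + 100;  // second pass: lanes 16..31
    }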

    An advantage of using well-known nomenclature is that everyone who is familiar with common computer science terms can understand the behavior and characteristics immediately. People get a clear idea of G8X/GT200's behavior if I tell them "it's a SIMD machine with predication and gather/scatter." (Note that older HPC SIMD machines do have predication and gather/scatter; it's the contemporary MMX/SSE that lacks them.)

    Consider if someone proved a new geometric theorem but used "zine" instead of "cosine" and "zane" instead of "sine." Wouldn't you find that very confusing?
     
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,120
    Likes Received:
    2,867
    Location:
    Well within 3d
    The weakest term for "thread" I've seen is a "loci of execution", but that is probably the floor of how imprecisely the term can be taken.
    A component of execution within a SPMD context could potentially align with that.
    My one reservation is that the definition was qualified such that a locus can run on any processor, which may not hold for many or all GPUs.

    With the intervention of the setup hardware and schedulers, each lane has ownership of a program counter and private registers. With Fermi, a stack or something like it would be emulated by the hardware as well.
    The OS is not aware of these threads, however, though not all threads necessarily need to be exposed at that level, either.

    If the hardware didn't also handle a certain amount of separation of lane-specific context, it would be just a predicated SIMD lane.
    However, the hardware does maintain this state and handles it without the software intervening or handling this state explicitly.

    I agree that SIMT seems like an unnecessary addition; it seems to describe something little different from SPMD.

    Those bits wouldn't be capable of addressing their own state or driving program execution independently without explicit software management.

    Perhaps things have gotten that ridiculous.
    The way the hardware brings about a certain end result, whether by separate pipelines, multiple hardware threads, or a SIMD unit with internal state juggling, may not matter as long as it is software-transparent.
     
  3. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Sure, we can obviously make the term as general as we want, but I think the balance of usage and understanding is with respect to traditional OSes and POSIX-like threads.

    Fermi definitely brings things closer (theoretically), but there are still missing capabilities like irreducible control flow, the ability to spawn/fork "threads" on the fly, etc. that are expected out of the term "thread" and the traditional understanding of it. If they can do an efficient pthreads implementation in terms of Fermi, I'll be ecstatic, but I don't think they're there yet...

    Now I'm not arguing that pthreads are the most efficient way to write massively data-parallel code going forward, but this is the typical understanding of the term "thread" and I see no reason to confound that.

    Sure but neither are CUDA "threads" capable of spawning additional work or rescheduling SIMD lanes on the fly between kernels. All of that requires "explicit software management" and in this case, that management only happens from the CPU!

    Don't underestimate the importance of this... right now with any amount of "braided" parallelism, you're faced with the ridiculous choice of dumping all "shared memory" state to VRAM, or simply idling some large number of "threads" and wasting processor resources. Either that, or you have to implement work stealing in software, at which point you begin to write "threads" as 1:1 with the *real* hardware threads, further demonstrating the misuse of the term in the first place. Even then, in practice you still end up having to dump all your state out to VRAM when switching to another kernel, even if that kernel inputs the exact same data set...

    To some extent that's definitely true, but not only are we still in a different programming model than typical pthreads (both in terms of implementation and functionality), but to write efficient code on these machines you have to have a good understanding of how it maps to the hardware SIMD units. OpenCL nicely separates these concepts by abstracting the concept of a "work item" from that of a traditional CPU thread, which is the right thing to do IMO. You could argue that hyper-threading is a similar concept in that it isn't giving you two 100% independent, 2x-throughput threads, but with hyper-threading you still write code as if you had two completely separate execution resources. With SIMD you don't: the reality of the execution model greatly affects everything from control flow to data structures. HPC people at least are used to dealing with SIMD, though, so even if you don't expose it in the typical programming models, it's worth being clear and up-front about what is happening so that people can apply their experience properly to the given hardware. Terminology is a big part of this, and reusing the term "thread" is just confusing.
     
  4. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Is this going to pose problems in trying to organize work items into groups that branch and fetch as coherently as possible?

    For example, if you have a million work items for raytracing that you enumerate left to right, then top to bottom, how is OpenCL going to know that 8x8 blocks will branch and access data much more coherently than 64x1?

    Or do we just have to program it so that items with nearby IDs should be as coherent as possible?
     
  5. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    Try "Independent loci of execution". The Independent part is the important part...

    Not really. They might SAY it does. Doesn't mean it does.


    What? You mean it has vector CONSTANTS? OMG. Call the patent office.

    So, outside of a vector register that has data on a per-lane basis... um, wait. That's like every SIMD ever designed.

    So you've seen the internal code that runs on the G80?
     
  6. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,120
    Likes Received:
    2,867
    Location:
    Well within 3d
    I'd hate to appeal to authority, but at least as far as Patterson and Hennessy are concerned, the term "thread" sans additional clarification has devolved to a shorthand.

    A CPU that handles 4 "threads" uses a very basic definition of the term, as it doesn't discriminate as to whether it's a kernel thread or a pthread, or some other specific variety of thread. In this case, a thread is an atom of program execution state. There is some amount of essential "threadness" for which other distinctions are a collection of clutter and special-cases.

    As far as SIMT goes as a categorization, it is focused at the level of the hardware execution model.

    I don't know if Fermi's control flow handling has changed with regards to irreducibility.
    The idea that an implementation needs to be efficient runs into a basic question of whether there is some arbitrary cut-off where a thread isn't a thread if it doesn't have a certain amount of performance.
    From a practical standpoint, there are usability concerns.
    From an algorithmic standpoint, how quickly such cases are handled is beside the point if the hardware somehow can handle it internally.

    As far as standard threads go, is there really any way to fork or spawn new threads without some procedure call or explicit instruction in software?

    What would be missing would be a procedure call that is capable of setting up an additional compute kernel or component thereof.
    Whether it was particularly fast or efficient wouldn't impact the existence or non-existence of this missing piece.

    Can you elaborate on the point about rescheduling SIMD lanes on the fly between kernels?

    If we are debating what the essential nature of a thread is, this is an implementation detail of significant practical importance, but little theoretical relevance.
    The thread isn't going to know how many other execution units or warp slots went idle, or how much was flushed to memory.
    Algorithmically, it won't impact the result.

    There may be other deficiencies, like irreducible control flow, but efficiency doesn't lend itself to non-arbitrary distinctions.
    Maybe I can live with an IPC of .00000001 for certain parts of thread behavior, whilst you wouldn't.

    The program stream treats each thread as it always has treated it: a sequential series of operations. Throughput involves a time dimension, for which there is no hard limit of what is acceptable. Other than algorithmic steps or a timer function, the instruction streams are essentially timeless.

    CPUs are considered embodiments of the Von Neumann machine, but they suck at self-modifying code, with multiple orders of magnitude drops in performance.
    Is there a non-arbitrary distinction drawn in this instance other than "this way is not fast enough"?

    As far as a physical implementation of an execution engine goes, we've probably long passed this point.
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Potentially. I'm sure there are CUDA programmers who get a queasy feeling from losing warp-based indexing - and some of the Scan related stuff NVidia does is based on warp size and its interaction with shared memory.

    (Indeed that Scan patent uses a 32-bit predicate - so it's very convenient for warps to be 32-wide.)
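
    As an aside, here is a sketch of the kind of warp-wide predicate trick that convenience enables (this uses the Fermi-era __ballot intrinsic; the helper itself is my own illustration, not the patent's code):

    Code:
    // Count how many lanes below this one have pred set: the core of a
    // warp-level scan over a 32-bit predicate mask (CUDA, sm_20+).
    __device__ int warp_prefix_count(bool pred)
    {
        unsigned lane = threadIdx.x % 32;          // lane index in warp
        unsigned mask = __ballot(pred);            // one bit per lane
        return __popc(mask & ((1u << lane) - 1));  // set bits below us
    }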

    Fermi might deal a deathblow to that distinction... (Doesn't CUDA already provide efficient linear access to memory? Can't remember.)

    From the little playing I've done, Z-order optimisation is a pain in the neck. prunedtree got his efficient matrix multiplication only by tackling that head-on. AMD didn't get there and they're the ones who built the damn thing.

    There are attributes of the OpenCL device you can query:

    http://forum.beyond3d.com/showpost.php?p=1320595&postcount=117

    and you can write self-hardware-profiling applications that tune parameters of the kernel to the device in question. That is, of course, if you don't decide merely to tune for the few processors you have to hand and leave it at that.

    How many different processor architectures is a given OpenCL application ever going to appear on? AMD is helping things along nicely by cutting off everything earlier than RV770.

    Or just have hardware-DWF?...

    Oh, you could make the work group match the dimensions/layout that suits the hardware device, e.g. 8x8 on ATI, or 32 work items for NVidia (4x8? 1x32? etc.).

    Jawed
     
  8. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,120
    Likes Received:
    2,867
    Location:
    Well within 3d
    Independent to what extent or in what way?

    They are capable upon hitting divergent branch outcomes of reaching a state in each lane consistent with what would happen if they were running separately.
    With Fermi's more fully realized pointer support, the set of points each lane can go to in instruction memory independently is even wider.

    As long as this is consistent from the POV of one instruction to the next, it doesn't matter if they had to take turns or one lane was predicated off for half the run.

    You're right. I do have to take them on their word at some level.
    They say the hardware handles this.
    It could be an elaborate ruse where there are magical gnomes who push little levers whenever a status register is flipped too.
    There might be unicorns running in my Pentium 4, as well, but I won't hold it against Intel if they don't suffer from confined conditions.

    The registers defined for a particular thread context are not accessible by other lanes, barring explicit sharing through some other method. Unless, that is, you think it would be valid, while running a shader, for a GPR access from every pixel to pull from the exact same physical register.

    Even then, if that value is determined to be the same value and the results are valid, it wouldn't matter algorithmically.

    A hardware-managed set of data or one requiring dedicated instructions and algorithmic modification to manage it?
    Should I start bitching about how that stupid P4 keeps putting all my neatly ordered instructions out of sequence, but then sneakily puts them all back?
    I hate lying silicon.

    I guess everything ever stated about the architecture could be a lie.
    I'm not sure I can live with that level of extreme distrust.
    I'm not even sure now if what you've just written is English, or merely a foreign language with this post just happening to look the same as English, and in actuality you are posting a recipe for a nice beef stew.

    I have to start from somewhere.
     
  9. Karoshi

    Newcomer

    Joined:
    Aug 31, 2005
    Messages:
    181
    Likes Received:
    0
    Location:
    Mars
    Pfffft, I laugh at Fermi with its laughable 30k threads.
    Linux was supporting HUNDREDS OF THOUSANDS of threads on x86 years ago.
    Yes, some people tested that.
     
  10. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    All other threads. Nvidia fails this one by the way...

    So is any other SIMD, but only with SEVERE restrictions on the code flow. If, however, I want to start doing divergent code flows PER datum without re-convergence, it won't work, because it isn't actually doing anything but predicating results and running the same code on every datum. I.e., independent implies independence, which requires that each can do something entirely different with no convergence. In other words, I should be able to have one thread sitting in a polling wait loop without affecting the forward execution of any other thread. This is clearly something that current Nvidia hardware isn't capable of.
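
    A hypothetical sketch of that failure mode (the names are mine, and whether a given chip actually hangs depends on which divergent path its scheduler picks first): on a warp that serializes its divergent paths, a lane spinning on a flag that a sibling lane is supposed to set can wedge the whole warp, where truly independent threads would make progress:

    Code:
    // Sketch: an intra-warp polling loop that independent threads would
    // survive, but a warp serializing its divergent paths may not.
    __device__ volatile int flag = 0;

    __global__ void poll_within_warp()
    {
        if (threadIdx.x == 0) {
            while (flag == 0)   // lane 0 spins waiting on lane 1...
                ;               // ...which is masked off on this path
        } else if (threadIdx.x == 1) {
            flag = 1;           // may never run: the warp can be stuck
                                // executing lane 0's branch forever
        }
    }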

    Data pointers != instruction pointers. And once again, I highly doubt that different lanes will be executing different instructions.


    no, you don't, not really.


    You mean a reg[X][LANE_INDEX_Y] isn't accessible by LANE[LANE_INDEX_Z]? Once again, call the patent office. Considering their compiler is explicitly generating the code, unless there is a need for it, I would doubt they'd allow per-lane constants to be used out of lane. You seem to be under the impression this is hard.

    Well, it is from Nvidia marketing, and we do know that a large percentage of what Nvidia originally said about the G80 architecture WAS A LIE! It's not like people are going out of their way to second-guess Nvidia, since they started it all with their wild obfuscation at the G80 launch.

    Then start here: who has gone out of their way to not only flat out LIE about their architecture but to also do their best to confuse established terminology to paint their architecture in a more favorable light?

    The viewpoint that we cannot trust what Nvidia actually says about their hardware was only created because Nvidia itself taught us that we could not trust what they say about their hardware.

    You are trying to portray it as if I'm asking you to doubt the truthfulness of Mother Teresa where the reality is I'm saying you need to be careful about trusting campaign promises.
     
  11. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
    It didn't like my fork bomb though.
     
  12. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    While that may be true, this confounding of the terms is responsible in no small part...

    Sure, although even without that, can Fermi run a full pthreads implementation at all? Can I spawn/fork a new thread in a kernel? Real function calls get us into the space of being able to nicely implement coroutines, but what about real HW "threads" that get scheduled and preempted by the hardware? This can definitely involve some software support, but I'm not sure it's possible at all without dumping all local memories and dispatching a completely separate kernel, which hardly fits the typical definition of fork/join.
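
    For reference, the kind of dynamic spawning the term traditionally implies is a one-liner with pthreads (a plain host-side example, nothing GPU-specific):

    Code:
    #include <pthread.h>
    #include <stdio.h>

    void *worker(void *arg)       // body of the dynamically spawned thread
    {
        printf("child %ld running\n", (long)arg);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, worker, (void *)1L);  // spawn mid-execution
        pthread_join(t, NULL);                         // fork/join semantics
        return 0;
    }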

    Sure - one often needs to do some data-parallel work across an N element array, and then do some work across M elements, and then back to N, etc. etc. When N != M running this "efficiently" on the current "inside-the-inner-loop" threading/programming models involves either dumping all state and re-dispatching a new kernel (from the CPU) with a different number of "threads", or writing code like:

    Code:
    if (threadIdx.x < N) { ... }   // only N "threads" do useful work here
    __syncthreads();               // barrier; every thread must reach it
    if (threadIdx.x < M) { ... }   // now only M participate
    __syncthreads();
    ...
    This is both ugly and fairly inefficient, particularly when it starts to combine with other control flow, which cannot contain __syncthreads() barriers within it for obvious reasons.

    Now obviously a lot of this is a programming model problem, but that's actually entirely the point if thread is intended to be redefined in terms of the interface to the programmer. Sure the above code operates as expected, but it's a hell of a lot less efficient than writing an actual 1:1 mapping of kernel execution to hardware "core"/thread/cluster. You can kind of do this within the above programming model with some cleverness, but then the things that you're writing end up being called "groups" in the API terminology and your "threads" become abstract entities that get repartitioned as your parallelism braids... this is clearly not the same thing as writing typical CPU code which would equate threads to those groups and SIMD lanes to those threads. This is confusing.

    That would be true if it were hidden by the programming abstraction but it is not. If you indeed do want to dump all of your local data and re-dispatch, you have to write all that code yourself, and it is not easily wrapped up in any sort of library, as it is kernel-specific.

    Ah, but see the way you have to write the code above; it's very clear from the code that max(N,M) - N (respectively, - M) resources are "going idle" for at least the duration of evaluating the branch condition and any scheduling overhead. In fact, given the "threading" model, there is no way to avoid this.
     
  13. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Yes, and incidentally this is equivalent to the aforementioned restriction that you can't have a __syncthreads() inside (potentially) divergent control flow.
     
  14. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    I think "lie" is a strong word. Historically, marchitecture in the industry has always been somewhat loopy and loose with terminology. I've even noticed people in the forums being fuzzy and making the same mistake ("I need to run X 'threads' to cover the latency of the texture fetch") Basically, companies seek to enter markets by selling them using familar terms people already heard about. There have been huge arguments over RISC/CISC marketing terminology for example, and the dishonesty in the relational database market is legendary.

    It's regrettable, but I think unavoidable. For example, the average person has a vague notion of what "cores" are in their CPU, so GPU vendors are going to try to analogize GPU ALUs to cores, rather than explain the difference to the average joe.
     
  15. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,120
    Likes Received:
    2,867
    Location:
    Well within 3d
    If the outcome from running code over every datum with divergent lanes predicated off is not different from a single step progression of as many independent threads, that alone wouldn't be enough to discount it.

    However, I would accept that these are not threads if there are certain operations defined for independent threads that fail in the SIMD case.
    The mentioned sync restrictions would be a failure of the hardware to hide the implementation details from the nominally independent instruction streams.

    To work around this, the implementation possibly could do something like a round-robin switch between each divergent path to allow forward progress on each, though it doesn't seem that Nvidia has done this.

    Fermi has support for indirect branches. Depending on the sequence, we could see different lanes wandering off far afield.
    That they can't exist in the same EX stage at the same instant is an implementation detail that is not necessarily software-visible.

    I'm debating the version of the hardware Nvidia put forward to the public.
    If it turns out that they lied, I will admit that the debate has little practical relevance, but the points I am making would still hold for a design that actually follows Nvidia's claims.

    I don't see why I'd need to. The concept of a thread is public domain.
    There is an architecturally defined, software-visible state. Part of it is that each "thread" has certain private contexts other threads cannot access.
    GPR1 for thread0 should not be subject to interference from thread1.

    I don't need it to be patent-worthy to state that if the operand collectors and schedulers maintain this independence even if these contexts are stored in physical proximity, they are independent contexts.

    SMT OOE processors that use RATs to point to some unique architectural register in a sea of physical registers do the same thing. As long as the external view is consistent, I do not see a problem.

    The compiler generates an instruction with an op code and register addresses that gets broadcast to all the lanes in a SIMD.
    If every lane is told to access register 5, they don't all access the exact same physical register; the succession of units from the scheduler to the operand collectors translates this to the correct physical location.
    It has to. Going by Realworldtech's description of the instruction queues in Fermi, they are likely not large enough to contain separate register addresses per lane: 3 operands * 16 lanes * 7 bits (assuming 128 software-addressable registers per lane context) would be 336 bits per issue.
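
    As a toy model of that translation (my own illustration, not Fermi's actual allocation scheme), a broadcast architectural register number can be resolved to a distinct physical location per lane with nothing more than base-plus-offset arithmetic:

    Code:
    // Toy register-file mapping: every lane is told "register 5", but
    // each resolves it to its own slice of the physical register file.
    __device__ int physical_index(int warp_base, int lane,
                                  int arch_reg, int regs_per_lane)
    {
        return warp_base + lane * regs_per_lane + arch_reg;
    }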

    Can I have a list of the elements of the warp issue and instruction behaviors Nvidia lied about?
    The general scheme has been described rather consistently by Nvidia and those who have analyzed the architecture.

    The marketing BS with cores I am aware of, but what parts of their description of the actual implementation of the execution engine are lies?

    It's a waste of time for me to debate anything about an architecture if everyone has to step back and wonder what the definition of "is" is.

    I've drawn the line where I am taking the generally accepted idea of a SIMD hardware unit residing within a complex scheduler and register access scheme that tries to make implicit the handling of divergent execution through hardware.
    You are right that I have to take Nvidia's word on this, including the architectural papers, patents, and the tech press which admittedly varies in quality but has a number of places I do trust to do a quality job of analysis.


    The primary problem I see, after having time to sleep on it, is that there have been two separate conventions for the use of the word thread. One from the software and OS point of view, and one from the hardware point of view.
    A kernel thread or userspace thread runs with little distinction from the point of view of the execution engine. A lot of the major distinctions do not reside in hardware but involve conventions or structures maintained at a higher level.

    Given that there are processors out there that run in very different or embedded environments that may not be able to run a full Unix-style thread, it would seem incongruous to describe them as being 0-thread architectures.

    I'm not sure that it can. A significant amount of the framework is there.
    Is preemption necessary, though? There are probably simple cores that don't do this, but we don't relegate them to a category of 0-threads. The Cell SPEs are allowed to run until they finish their one applet/program/shader.

    Given how the register files are likely allocated, it may need to happen this way in at least some cases. There's no guarantee that there will be any location not already partitioned for another thread in the kernel.

    Maybe if instead there were an internal convention for the scheduler where a fork sent a thread's context to a buffer that was structured so that the global scheduler could generate a new warp, but I don't think that has been mooted for Fermi.

    Thanks.
    I was coming at it from the starting point of "what if you had N threads and hit a phase where you only need M?" which would have required some additional wrangling in each situation.

    The ugly part I can accept as not being a problem so long as the individual threads see a consistent behavior. The fact there are problems with synchronization would be a fault in the hardware's charade.

    The thing is that Fermi's exception handling does back up context. If the hardware is able to track this, can't this be leveraged?
    If not in this yet-to-be product, it might be possible as a next step.
     
  16. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Efficient access is easy, but caching is another story. If you have 16 warps forming a 1024x1 scanline, your data is going to be much less cache-friendly than if it formed a 32x32 block. Similarly, a 64x1 scanline will cross more structures than an 8x8 block, resulting in less coherent branches. But I suppose if NVidia and ATI agree on a numbering method for thread IDs in warps, then we could write a program that has a tiling method for converting IDs into ray origins (or matrix indices, etc.).

    Looking at NVidia's document, I think that's how it's going to be done:
    http://www.nvidia.com/content/cudazone/download/OpenCL/NVIDIA_OpenCL_ProgrammingOverview.pdf (page 18)
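
    For what it's worth, a minimal sketch of the kind of ID-to-tile remapping I mean (the 8x8 tile size and the row-major enumeration of tiles are my assumptions, nothing mandated by either vendor):

    Code:
    // Remap a linear work-item ID into 8x8 tiles across a width-W image,
    // so nearby IDs touch nearby pixels (better branch/fetch coherence).
    __device__ void id_to_tiled_xy(int id, int W, int *x, int *y)
    {
        const int T = 8;              // tile is T x T work items
        int tilesPerRow = W / T;      // assume W is a multiple of T
        int tile   = id / (T * T);    // which tile this ID lands in
        int within = id % (T * T);    // position inside that tile
        *x = (tile % tilesPerRow) * T + within % T;
        *y = (tile / tilesPerRow) * T + within / T;
    }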

    Sorry, I don't know what DWF is, aside from the Autocad format ;)

    I thought about that, but if you use local memory then such a method is likely to be very inefficient. I may have something in local memory that I want thousands of other threads to use (read and write). If my code results in 32 work groups in flight, then your method means I can only use 1/32 of the local memory and need to initialize it (if necessary) 32 times.
     
  17. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,120
    Likes Received:
    2,867
    Location:
    Well within 3d
    Dynamic Warp Formation is what I think DWF stands for.

    It would significantly reduce the SIMD throughput drops due to branch divergence.
    If a warp or warps get split up too much, reform the warps to match.
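
    A toy, host-side model of the idea (entirely my own illustration; the real proposal puts this in the scheduler hardware): bucket the runnable lanes by their current PC, then emit full-width warps from each bucket:

    Code:
    #include <algorithm>
    #include <map>
    #include <vector>

    // Toy model of dynamic warp formation: lanes that diverged to
    // different PCs are regrouped so each new warp shares a single PC.
    struct Lane { int id; int pc; };

    std::vector<std::vector<Lane>> form_warps(const std::vector<Lane> &runnable)
    {
        std::map<int, std::vector<Lane>> byPC;   // bucket lanes by PC
        for (const Lane &l : runnable)
            byPC[l.pc].push_back(l);

        std::vector<std::vector<Lane>> warps;    // emit warps of up to 32
        for (auto &entry : byPC) {
            auto &lanes = entry.second;
            for (size_t i = 0; i < lanes.size(); i += 32) {
                size_t end = std::min(i + 32, lanes.size());
                warps.emplace_back(lanes.begin() + i, lanes.begin() + end);
            }
        }
        return warps;
    }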
     
  18. dnavas

    Regular

    Joined:
    Apr 12, 2004
    Messages:
    375
    Likes Received:
    7
    Although, now thinking about it in the context of this whole "real threads" .. umm .. discussion, I wonder if DWF isn't breathing life into the quaint and aging assumption that everyone and their brother is always running the same instruction. Seems like the cost of DWF and MIMD might be similar (difference between finding 32 runnable "threads" and finding 32 runnable "threads" all at the same PC is ..?), and we'd get better advantages from MIMD, even if we do need larger instruction caches?

    -Dave
     
  19. RoOoBo

    Regular

    Joined:
    Jun 12, 2002
    Messages:
    308
    Likes Received:
    31
    In my opinion, the middle ground is that in the limited programming environments used to write programs for GPUs (be it OpenGL, Direct3D, or even CUDA or OpenCL), the processing elements can be interpreted as running in different threads/contexts. There is nothing in the various shader languages that defines a dependence or aggregation between the different elements executing the same shader/kernel/program (even if it's required, for example, for normal implementations of mipmapped texture sampling).

    The current programming environment would allow for a hardware implementation where each element is effectively processed as a thread.

    However, that's not how the actual existing hardware works (well, maybe except for whatever 'magic' SGX uses for ultra-low-end graphics, where it makes a lot of sense to work on a single pixel... though I wonder what they do to compute the texture LOD). All GPUs pack those processing elements into vectors of a given length (16, 32, 64, ...) and process them as a single monolithic group. GPU threads process vectors of elements, not single elements, and that's a fact.

    From a software point of view, in the programming environments we are working with, a pixel can be understood as a thread, but on the actual hardware where those programs are executed it isn't processed as such.
     
  20. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,418
    Likes Received:
    411
    Location:
    New York
    Well sure, but we have folks dismissing the software perspective as irrelevant. Also, it's not clear that there is a strict independence among "threads". For example, a sort based on the scan primitive outlined in that patent Jawed linked a while back is certainly predicated on co-operation between SIMD lanes.
     